Adaptive Experiments and a Rigorous Framework for Type I Error Verification and Computational Experiment Design

05/19/2022
by Michael Sklar, et al.

This PhD thesis covers breakthroughs in several areas of adaptive experiment design: (i) (Chapter 2) Novel clinical trial designs and statistical methods in the era of precision medicine. (ii) (Chapter 3) Multi-armed bandit theory, with applications to learning healthcare systems and clinical trials. (iii) (Chapter 4) Bandit and covariate processes, with finite and non-denumerable sets of arms. (iv) (Chapter 5) A rigorous framework for simulation-based verification of adaptive design properties.


2.1 Introduction

The first topic of this chapter is “Adaptive Subgroup Selection in Confirmatory Clinical Trials,” addressed in Section 2.2.2, where we describe master protocols for precision-guided drug development and efficacy/safety testing. The second topic is “Group Sequential and Adaptive Designs of Confirmatory Trials of New Treatments,” addressed in Section 2.2, while Section 2.3 considers related advances in the statistical analysis of trials for regulatory submission. Section 2.2.1 provides the literature review and background on efficient adaptive designs, and in the remainder of this section we discuss the background of master protocols.

2.1.1 Targeted Therapies in Oncology and FDA’s Drug Guidance

Precision medicine considers the “individual variability in genes, environment, and lifestyle” of a patient to better prevent or treat illness [NIHprecision, garrido2018proposal]. In his State of the Union Address in January 2015, President Barack Obama launched a precision medicine initiative that was to focus first on the improvement of cancer therapies. Experts agreed that oncology was “the clear choice,” owing to recent advances in diagnostic technology, computational capability, and scientific understanding of cancers, which remain a leading cause of morbidity and mortality worldwide [collins2015new]. Targeted therapies and immuno-oncology (IO) agents were among the forerunners of transformative new medicines, and they have heavily utilized innovative statistical methods to meet the clinical development challenges inherent to personalized medicines [de2010translating, snyder2014genetic].

Targeted therapies have established their benefit over conventional cytotoxic therapy across multiple tumors [hodi2010improved, borghaei2015nivolumab, postow2015immune]. However, a large unmet medical need remains for most malignancies, and patients are urgently seeking better options. The comprehensive evaluation of new investigational targeted therapies in oncology, in a timely and resource-efficient manner, is infeasible with conventional large randomized trials [ersek2018implementing, ersek2019critical]. To match the right therapy with the right patients, the number of scientific questions that need to be answered during clinical development has increased substantially. Traditionally, oncology drug development comprises a series of clinical trials in which each study’s objective is to establish the safety and efficacy of a single investigational therapy over the current standard of care (SOC) in a broad study population [redman2015master, berry2015brave]. A targeted therapy’s safety and benefit over the SOC need to be established for a long list of considerations specific to the biomarker-defined subpopulation and pathology, including safety, therapy sequence, drug combinations, combination dosing, and the contribution of individual drug components. The reality of developing “precision medicines” is that there are fewer subjects, who are harder to find, which may jeopardize study completion and extend timelines. The costs of trials have increased with more extensive tissue sample collection, biomarker assessment and tumor imaging, more expensive comparator drugs, and the generally rising cost of medical care. Recent advances in tumor sequencing and genomics afford a more detailed understanding of the underlying biology and pathology [lima2019recent]. Although precision medicine focuses on molecularly defined subpopulations, this focus actually expands the reach of targeted therapies across tumors and lines of therapy that can be matched by specific gene signatures or biomarkers, such as high microsatellite instability (MSI-H) or PD-1/PD-L1 expression.

In 2019, 3,876 immuno-therapy compounds were in clinical development, 87% of which were oncology agents. This marks a 91% increase over the 2,030 compounds in development in 2017 [xin2019immuno]. With additional information emerging at an increasing pace, it is expected that today’s clinical protocols will require revisions tomorrow, and may need to accommodate a potential change in SOC, emerging information on the safety and efficacy of similar compounds, and a better understanding of the fundamental tumor biology [hirsch2013characteristics, xin2019immuno]. Therefore, clinical study protocols are required that learn faster from fewer study subjects, expedite the evaluation of novel therapies, use resources judiciously, enable robust hypothesis evaluation, are operationalizable across most clinics, and afford sufficient flexibility to answer multiple research questions and respond to emerging information. Master protocols have emerged to address this challenge. The term “master protocol” refers to a single overarching design that evaluates multiple hypotheses, with the objective of improving efficiency and uniformity through standardization of procedures in the development and evaluation of different interventions [renfro2018definitions]. The first clinical trial to use a “master protocol” was the study B2225, an Imatinib Targeted Exploration, in 2001 [mcarthur2005molecular, park2019systematic]. However, the uptake of master protocols was slow. In 2005, STAMPEDE became only the second study to employ a master protocol design, and by 2010 there were still fewer than 10 master protocol-guided studies in the public domain. The subsequent decade from 2010 to 2019 saw rapid growth, resulting in a 10-fold increase in the use of master protocols in clinical studies [park2019systematic]. A recent catalyst was the validation of the “master protocol” approach by the regulators. In 2018, the FDA issued a draft guidance for industry that advises on the use of master protocols in support of clinical development, titled “Master Protocols: Efficient Clinical Trial Design Strategies to Expedite Development of Oncology Drugs and Biologics” [lal]. In 2019, 83 clinical trials that utilized master protocols were in the public domain.

2.1.2 Umbrella, Platform, and Basket Trials

The aforementioned rapid growth was also catalyzed by the successful immuno-oncology therapies targeting CTLA-4 in 2011 (ipilimumab) and PD-1 in 2014 (pembrolizumab and nivolumab); see oiseth2017cancer. Moreover, clinical development teams were faced with the need to explore a broader set of malignancies quickly and efficiently. Basket trials (59%) accounted for the largest portion of master protocols in 2019, followed by umbrella trials (22%) and platform trials (19%). The growth rate of platform trials outpaced umbrella and basket trials in the late 2010s. The majority of master protocol studies (92%) focus on oncology, and 83% enrolled adult populations [park2019systematic]. Recently, CDER communicated that the “FDA modernizes clinical trials with master protocols,” citing good practice considerations, which is expected to further encourage industry to utilize master protocols to rapidly deliver their drug pipelines; see food2018master. Basket trials, umbrella trials, and platform trials are implementation structures of clinical studies, and their designs are defined within a master protocol. Each trial variant provides specific flexibility in the clinical development process, and has its advantages and disadvantages [renfro2018definitions, cecchini2019challenges]. The study design and statistical considerations will need to be weighed on a case-by-case basis so that the clinical hypotheses can be answered as directly as possible. Key design choices required of clinical development teams are whether to study multiple investigational drugs in one protocol, whether to include a control arm, whether to open multiple cohorts to test for multiple biomarkers, and whether to add or stop treatment arms during the course of the trial. Statistical analysis choices include whether to use Bayesian or frequentist methods to evaluate efficacy, how best to randomize subjects, the selection of appropriate futility and early success criteria, and which covariates to control for [renfro2018definitions, food2018master, renfro2017statistical, mandrekar2015improving].

Basket trials investigate a single drug, or a single combination therapy, across multiple populations based on the presence of specific histology, genetic markers, prior therapies, or other demographic characteristics [food2018master]. They may include expansion cohorts and are especially well-suited for “signal-finding.” They frequently comprise single-arm, open-label Phase I or II studies, enroll 20-50 subjects per sub-study, and use two- or multi-stage decision gates to rapidly screen multiple populations for detecting large efficacy signals (by combining multiple tumor types in one protocol) with acceptable safety profiles [park2019systematic, renfro2018definitions]. Unlike umbrella and platform trials, in which the Recommended Phase II Dose (RP2D) has been pre-established, a basket trial may enroll first-in-human cohorts for whom the RP2D may be established alongside any safety and efficacy signals [cecchini2019challenges]. While often exploratory, basket trials can have registrational intent. An example is Keynote-158, which studied pembrolizumab in solid tumors with high microsatellite instability (MSI-H). The simplicity of the basket protocol and its relatively small size need to be weighed against the design’s lack of control groups and limited information for sub-populations based on pooled sample analyses. Basket trial protocols may be amended to include additional tumor types and study populations [renfro2017statistical], ineffective cohorts can be excluded in a response-adaptive approach, and new cohorts can be added, but such changes often require a protocol amendment and subsequent patient reconsenting plus retraining of study personnel. cecchini2019challenges give a comprehensive review of the challenges from the perspectives of the study sponsor, regulator, investigator, and institutional review boards, and discuss the increased operational complexity and increased cost that accompany the reduction in development time. Despite these limitations, basket protocols have become the most widely used master protocols, as they offer the smallest and fastest option, with a median study size of 205 subjects and a 22.3-month study duration [park2019systematic]. Statistical methods often used to analyze these trials include frequentist sequential [leblanc2009multiple, park2019systematic] and hierarchical Bayesian [berry2006bayesian, thall2003hierarchical] methods, as well as recent approaches that control the family-wise error rate in multi-arm studies [chen2016statistical], response-adaptive randomization [ventz2017bayesian, lin2017comparison], calibrated Bayesian hierarchical testing and subgroup designs (Chu and Yuan, 2018a,b), robust exchangeability [neuenschwander2016robust], modifications of Simon’s two-stage design to improve efficiency [cunanan2017efficient], and combinations of frequentist and Bayesian approaches [lin2017comparison].

Umbrella and platform trials are master protocols with exploratory or registrational intent that match biomarker-selected subgroups with subgroup-specific investigational treatments, and may include the current standard of care for the disease setting as a shared control group. They aim at identifying population subgroups that derive the most clinically meaningful benefit from an investigational therapy, and may enable a smaller, faster, and more cost-effective confirmatory phase III study. Umbrella trials are often phase II or phase II/III, have an established RP2D for each investigational therapy, and frequently include biomarker-enriched cohorts [renfro2018definitions]. The totality of umbrella trial data enables inference on the predictive and prognostic potential of the studied biomarkers within the given disease setting [renfro2018definitions]. While the study of specific biomarker subsets is a key focus, the inclusion of rare populations can lead to accelerated regulatory approval to fill an unmet need but may result in long accrual and trial durations. It is possible to add or remove investigational treatments and subgroups, but the required protocol amendments can cause considerable logistic challenges for sponsors and investigators [cecchini2019challenges]. The pre-planned and algorithmic addition or exclusion of treatments during trial conduct is what distinguishes platform trials from umbrella trials [angus2019adaptive]. Platform trials frequently include futility criteria and interim analyses, which provide guidance on whether to expand or discontinue a given investigational therapy. Platform trial cohorts often have an established RP2D for each investigational therapy, and may be expanded directly to a registrational Phase III trial while retaining the flexibility to keep other populations in the study [renfro2018definitions]. Recommendations to continue or discontinue treatments are often derived by using Bayesian and Bayesian hierarchical methods [saville2016efficiencies, hobbs2018controlled]. Some protocols leverage response-adaptive randomization to increase the probability that subjects are assigned to the likely superior treatment for their biomarker type, which may provide ethical and cost advantages over conventional randomization [berry2006bayesian, wen2017response]. Umbrella and platform trials are 2- to 5-fold larger and longer than the average basket trial [park2019systematic], and it is important to weigh the benefits of a smaller Phase I basket trial, which may be amended to provide sufficient data for accelerated approval of a novel therapy as demonstrated by Keynote-001 [kang2017pembrolizumab], against the longer and more comprehensive evaluation of multiple investigational agents and subgroups. Another disadvantage that accompanies the larger size, duration, and cost of umbrella and platform trials is the potential change in the treatment landscape and SOC, which may necessitate subsequent modifications to bridge between the control and therapy arms [lai2015adaptive, cecchini2019challenges, renfro2018definitions].

2.2 Group Sequential and Adaptive Designs of Confirmatory Trials of New Treatments

As pointed out by bartroff2013sequential, in standard designs of clinical trials comparing a new treatment with a control (which is a standard treatment or placebo), the sample size is determined by the power at a given alternative, but it is often difficult to specify a realistic alternative in practice because of the lack of information on the magnitude of the treatment effect difference before actual clinical trial data are collected. On the other hand, many trials have Data and Safety Monitoring Committees (DSMCs) that conduct periodic reviews of the trial, particularly with respect to the incidence of treatment-related adverse events, hence one can use the trial data at interim analyses to estimate the effect size. This is the idea underlying the group sequential trials introduced in the late 1970s; one such trial was the Beta-Blocker Heart Attack Trial (BHAT), which was terminated in October 1981, prior to its prescheduled end in June 1982; see bartroff2013sequential. BHAT, a multicenter, double-blind, randomized placebo-controlled trial to test the efficacy of long-term therapy with propranolol given to survivors of an acute myocardial infarction (MI), drew immediate attention to the benefits of sequential methods not because it reduced the number of patients but because it shortened a 4-year study by 8 months, with positive results for a long-awaited treatment for MI patients. The success story of BHAT paved the way for major advances in the development of group sequential methods in clinical trials and for the widespread adoption of group sequential designs. Sections 3.5 and 4.2 of bartroff2013sequential describe the theory developed by Lai and Shih (2004) for nearly optimal group sequential tests in exponential families, which provides a definitive method amidst the plethora of group sequential stopping boundaries proposed in the two decades after BHAT.

Lai and Shih’s theory is based on (a) asymptotic lower bounds for the sample sizes of group sequential tests that satisfy prescribed type I and type II error probability bounds, and (b) group sequential generalized likelihood ratio (GLR) tests with modified Haybittle-Peto boundaries that can be shown to attain these bounds. Noting that the efficiency of a group sequential test depends not only on the choice of the stopping rule but also on the test statistics, Lai and Shih use GLR statistics, which have been shown to have asymptotically optimal properties for sequential testing in one-parameter exponential families and can be readily extended to multiparameter exponential families, for which the type I and type II errors are evaluated at $u(\theta)=u_0$ and $u(\theta)=u_1$, respectively, where $u$ is a continuously differentiable function on the natural parameter space such that the Kullback-Leibler information number is increasing in $|u(\lambda)-u(\theta)|$ for every $\theta$; see Bartroff et al. (2013, Sections 3.7 and 4.2.4). An important consideration in this approach is the choice of the alternative ($\theta_1$ in the one-parameter case, or $u_1$ in multiparameter exponential families). To test $H_0:\theta\le\theta_0$, suppose the significance level is $\alpha$ and no more than $M$ observations are to be taken because of funding and administrative constraints on the trial. The FSS (fixed sample size) test that rejects $H_0$ if the test statistic based on all $M$ observations exceeds its level-$\alpha$ critical value has maximal power at any alternative $\theta>\theta_0$. Although funding and administrative considerations often play an important role in the choice of $M$, justification of this choice in clinical trial protocols is typically based on some prescribed power $1-\tilde\beta$ at an alternative “implied” by $M$. The implied alternative $\theta(M)$ is defined by the requirement that the FSS test with $M$ observations has power $1-\tilde\beta$ at $\theta(M)$, and can therefore be derived from the prescribed power. It is used to construct the futility boundary in the modified Haybittle-Peto group sequential test (Bartroff et al., 2013, pp. 81-85).
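
To make the notion of an implied alternative concrete, the following minimal sketch computes $\theta(M)$ for a one-sided $z$-test with i.i.d. $N(\theta,1)$ observations; the function name and numerical constants are our own illustrative choices rather than the thesis's notation.

```python
# Implied alternative for a one-sided z-test with i.i.d. N(theta, 1) observations.
# Illustrative sketch; names and constants are ours, not the thesis notation.
from scipy.stats import norm

def implied_alternative(M: int, alpha: float = 0.025, power: float = 0.9) -> float:
    """Alternative theta(M) at which the fixed-sample-size z-test with M
    observations and one-sided level alpha attains the prescribed power."""
    z_alpha = norm.ppf(1 - alpha)
    z_beta = norm.ppf(power)
    return (z_alpha + z_beta) / M ** 0.5

# A trial capped at M = 400 observations, alpha = 0.025, 90% power
print(round(implied_alternative(400), 4))  # ~0.1621 (standardized effect size)
```

The resulting standardized effect (about 0.16 when $M=400$) is then the alternative at which the futility boundary of the modified Haybittle-Peto test is anchored.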

2.2.1 Efficient Adaptive Designs

Using Lai and Shih’s theory of modified Haybittle-Peto group sequential tests, Bartroff and Lai (2008a,b) developed a new approach to adaptive design of clinical trials. In standard clinical trial designs, the sample size is determined by the power at a given alternative, but in practice, it is often difficult for investigators to specify a realistic alternative at which sample size determination can be based. Although a standard method to address this difficulty is to carry out a preliminary pilot study, the results from a small pilot study may be difficult to interpret and apply, as pointed out by wittes1990role, who proposed to treat the first stage of a two-stage clinical trial as an internal pilot from which the overall sample size can be re-estimated. The specific problem they considered actually dated back to Stein’s (1945) two-stage procedure for testing the hypothesis $H_0:\mu_X=\mu_Y$ versus the two-sided alternative $H_1:\mu_X\ne\mu_Y$ for the means of two independent normal distributions with common, unknown variance $\sigma^2$. In its first stage, Stein’s procedure samples $n_0$ observations from each of the two normal distributions and computes the usual unbiased estimate $s^2$ of $\sigma^2$. The second stage samples $n_1-n_0$ observations from each population, where
$$n_1=\max\left\{n_0,\ \left\lfloor \frac{2 s^2\,(t_{2n_0-2,\,\alpha/2}+t_{2n_0-2,\,\beta})^2}{\delta^2}\right\rfloor+1\right\},$$
$\lfloor\cdot\rfloor$ denotes the greatest integer function, $\alpha$ is the prescribed type I error probability, $t_{2n_0-2,\,\alpha/2}$ is the upper $\alpha/2$-quantile of the $t$-distribution with $2n_0-2$ degrees of freedom, and $1-\beta$ is the prescribed power at the alternatives satisfying $|\mu_X-\mu_Y|=\delta$. The null hypothesis $H_0$ is then rejected if
$$\frac{|\bar X_{n_1}-\bar Y_{n_1}|}{s\sqrt{2/n_1}}>t_{2n_0-2,\,\alpha/2}.$$
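
A minimal computational sketch of this re-estimation step, assuming the two-sided formulation above (the function name and the numerical example are ours):

```python
# Stein-type two-stage sample size re-estimation for comparing two normal means
# with common unknown variance. Illustrative sketch; names and example values are ours.
import math
from scipy.stats import t as t_dist

def stein_total_n(s2: float, n0: int, delta: float,
                  alpha: float = 0.05, power: float = 0.9) -> int:
    """Total per-arm sample size after estimating the variance (s2) from an
    internal pilot of n0 subjects per arm, targeting the given power at
    |mu_X - mu_Y| = delta with a two-sided level-alpha t-test."""
    df = 2 * n0 - 2
    t_a = t_dist.ppf(1 - alpha / 2, df)   # upper alpha/2 quantile of t_{2n0-2}
    t_b = t_dist.ppf(power, df)           # upper (1 - power) quantile of t_{2n0-2}
    n1 = math.floor(2 * s2 * (t_a + t_b) ** 2 / delta ** 2) + 1
    return max(n0, n1)                    # never fewer than the pilot itself

# Pilot of 20 per arm with pooled variance 4.0; clinically relevant difference 1.5
print(stein_total_n(s2=4.0, n0=20, delta=1.5))  # -> 40 per arm in this example
```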

Modifications of the two-stage procedure were provided by wittes1990role, lawrence1992sample, and herson1993use, which represent the “first generation” of adaptive designs. The second generation of adaptive designs adopts a more aggressive method to re-estimate the sample size from the estimate of the treatment effect $\theta$ (instead of the nuisance parameter $\sigma^2$) based on the first-stage data. In particular, fisher1998self considers the case of normally distributed outcome variables with known common variance $\sigma^2$. Letting $n$ be the planned sample size for each treatment and $\theta=\mu_X-\mu_Y$, he notes that $S_{n_1}=\sum_{i=1}^{n_1}(X_i-Y_i)\sim N(n_1\theta,\,2n_1\sigma^2)$ after $n_1$ pairs of observations $(X_i,Y_i)$, where $n_1<n$. Let $n_2\ (>n_1)$ be the new total sample size for each treatment. Under $H_0:\theta=0$,
$$\frac{S_{n_1}}{\sigma\sqrt{2n}}+\sqrt{\frac{n-n_1}{n}}\;\frac{S_{n_2}-S_{n_1}}{\sigma\sqrt{2(n_2-n_1)}}\ \sim\ N(0,1),$$
hence the test statistic is standard normal. Whereas Fisher uses a “variance spending” approach, as $(n-n_1)/n$ is the remaining part of the total variance that has not been spent in the first stage, proschan1995designed use a conditional type I error function $A(z_1)$ with range $[0,1]$ to define a two-stage procedure that rejects $H_0$ in favor of $H_1:\theta>0$ if the second-stage $z$-value exceeds $\Phi^{-1}(1-A(z_1))$, where $z_1$ is the first-stage $z$-value. The type I error of the two-stage test can be kept at $\alpha$ if $\int_{-\infty}^{\infty}A(z)\,\phi(z)\,dz=\alpha$, where $\phi$ and $\Phi$ are the density function and distribution function, respectively, of the standard normal distribution.
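
The following short Monte Carlo check (our own construction, not taken from the thesis) illustrates why the variance-spending weight keeps the type I error at $\alpha$ even though $n_2$ depends on the first-stage data:

```python
# Monte Carlo check that a variance-spending statistic keeps the type I error at
# alpha when the second-stage sample size n2 is chosen from first-stage data.
# Under H0 the paired differences X_i - Y_i are N(0, 2*sigma^2).
import numpy as np
rng = np.random.default_rng(0)

sigma, n, n1, alpha = 1.0, 100, 50, 0.025
z_alpha = 1.959964                      # upper alpha-quantile of N(0, 1)
reps, rejections = 50_000, 0
for _ in range(reps):
    d1 = rng.normal(0.0, sigma * np.sqrt(2), n1)     # first-stage differences
    S1 = d1.sum()
    # any data-driven rule for n2 > n1 is allowed; this one is arbitrary
    n2 = int(np.clip(n1 + abs(S1), n1 + 10, 4 * n))
    d2 = rng.normal(0.0, sigma * np.sqrt(2), n2 - n1)
    # pre-committed weight sqrt((n - n1)/n) "spends" the remaining variance
    Z = S1 / (sigma * np.sqrt(2 * n)) \
        + np.sqrt((n - n1) / n) * d2.sum() / (sigma * np.sqrt(2 * (n2 - n1)))
    rejections += Z > z_alpha
print(rejections / reps)                # should be close to alpha = 0.025
```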

Assuming normally distributed outcomes with known variances, Jennison and Turnbull (2006a,b) introduced adaptive group sequential tests that choose the $j$th group size and stopping boundary on the basis of the cumulative sample size and the sample sum over the first $j-1$ groups, and that are optimal in the sense of minimizing a weighted average of the expected sample sizes over a collection of parameter values, subject to prescribed error probabilities at the null and a given alternative hypothesis. They showed how the corresponding optimization problem can be solved numerically by using backward induction algorithms, and that standard (non-adaptive) group sequential tests with the first stage chosen appropriately are nearly as efficient as their optimal adaptive tests. They also showed that the adaptive tests proposed in the preceding paragraph performed poorly in terms of expected sample size and power in comparison with the group sequential tests. tsiatis2003inefficiency attributed this inefficiency to the use of the non-sufficient “weighted” statistic. Bartroff and Lai’s (2008a,b) approach to adaptive designs, developed in the general framework of multiparameter exponential families, uses efficient generalized likelihood ratio statistics in this framework and adds a third stage to adjust for the sampling variability of the first-stage parameter estimates that determine the second-stage sample size. The possibility of adding a third stage to improve two-stage designs dates back to lorden1983asymptotic, who used crude upper bounds for the type I error probability that are too conservative for practical applications. Bartroff and Lai overcame this difficulty by using new methods to compute the type I error probability, and also extended the three-stage test to multiparameter and multi-arm settings, thus greatly broadening the scope of these efficient adaptive designs. Details are summarized in Chapter 8, in particular Sections 8.2 and 8.3, of Bartroff et al. (2013), where Section 8.4 gives another modification of group sequential GLR tests for adaptive choice between the superiority and non-inferiority objectives of a new treatment during interim analyses of a clinical trial to test the treatment’s efficacy, as in the case of an antimicrobial drug developed by the company of one of the coauthors of lai2006modified.

2.2.2 Adaptive Subgroup Selection in Confirmatory Trials

Choice of the patient subgroup in which to compare the new and control treatments is a natural compromise between ignoring patient heterogeneity and using stringent inclusion-exclusion criteria in the trial design and analysis. lai2014adaptive introduce a new adaptive design to address this problem. They first consider trials with fixed sample size, in which patients are randomized to the new and control treatments and the responses are normally distributed, with mean $\mu_j$ for the new treatment and $\mu_j^{(c)}$ for the control treatment if the patient falls in a pre-defined subgroup $\Pi_j$, $j=1,\dots,J$, and with common known variance $\sigma^2$. Let $\Pi$ denote the entire patient population for a traditional randomized controlled trial (RCT) comparing the two treatments, with corresponding means $\mu$ and $\mu^{(c)}$, and let $\Pi_1,\dots,\Pi_J$ be the prespecified subgroups. Since there is typically little information from previous studies about the subgroup effect sizes $\mu_j-\mu_j^{(c)}$, Lai et al. (2014) begin with a standard RCT to compare the new treatment with the control over the entire population, but allow adaptive choice of a patient subgroup $\hat I$, in the event that $H_\Pi:\mu\le\mu^{(c)}$ is not rejected, to continue testing with $H_{\hat I}:\mu_{\hat I}\le\mu_{\hat I}^{(c)}$, so that the new treatment can be claimed to be better than control for the patient subgroup $\hat I$ if $H_{\hat I}$ is rejected. Letting $H_\Pi$ and $H_j$ ($j=1,\dots,J$) denote these one-sided null hypotheses, the probability of a false claim is the type I error

$$\alpha = P\big\{H_I\ \text{is rejected for some}\ I\in\{\Pi,\Pi_1,\dots,\Pi_J\}\ \text{with}\ \mu_I\le\mu_I^{(c)}\big\} \qquad (2.1)$$

for the true configuration of subgroup effects. Subject to the constraint that (2.1) does not exceed a prescribed level, they prove the asymptotic efficiency of the procedure that randomly assigns patients to the experimental treatment and the control, rejects $H_\Pi$ if the generalized likelihood ratio (GLR) statistic $\ell_\Pi$ exceeds a threshold $c$, and otherwise chooses the patient subgroup $\hat I$ with the largest value of the GLR statistic $\ell_i=n_i(\bar x_i-\bar y_i)_+^2/(4\sigma^2)$ among all subgroups and rejects $H_{\hat I}$ if $\ell_{\hat I}\ge c$, where $\bar x_i$ ($\bar y_i$) is the mean response of patients in $\Pi_i$ from the treatment (control) arm and $n_i$ is the corresponding sample size. After establishing the asymptotic efficiency of the procedure in the fixed sample size case, they proceed to extend it to a 3-stage sequential design by making use of the theory of Bartroff and Lai reviewed in the preceding subsection. They then extend the theory from the normal setting to asymptotically normal test statistics, such as the Wilcoxon rank-sum statistic. These designs, which allow mid-course enrichment using the data collected, were motivated by the design of the DEFUSE 3 clinical trial at the Stanford Stroke Center to evaluate a new method for augmenting usual medical care with endovascular removal of the clot after a stroke, resulting in reperfusion of the area of the brain under threat, in order to salvage the damaged tissue and improve outcomes over standard medical care with intravenous tissue plasminogen activator (tPA) alone. The clinical endpoints for stroke patients are the Rankin scores, and Wilcoxon rank-sum statistics are used to test for differences in Rankin scores between the new and control treatments. The DEFUSE 3 (Diffusion and Perfusion Imaging Evaluation for Understanding Stroke Evolution) trial design involves a nested sequence of subsets of patients, defined by a combination of elapsed time from stroke to start of tPA and an imaging-based estimate of the size of the unsalvageable core region of the lesion. The sequence was defined by cumulating the cells in a two-way (3 volumes $\times$ 2 times) cross-tabulation as described by Lai et al. (2014, p. 195). In the upper left cell, which consisted of the patients with a shorter time to treatment and smallest core volume, the investigators were most confident of a positive effect, while in the lower right cell, with the longer time and largest core area, there was less confidence in the effect. The six cumulated groups give rise to six corresponding one-sided null hypotheses for the treatment effects in the cumulated groups.
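
As a rough illustration of why the rejection threshold must be calibrated against the type I error (2.1), the following simulation (our own simplified construction, not the calibrated procedure of lai2014adaptive) estimates the familywise error of the rule “test the overall population; if not rejected, test the subgroup with the largest GLR” under the global null, with nested subgroups, known variance 1, and an arbitrary threshold:

```python
# Monte Carlo estimate of the familywise type I error of the subgroup-selection rule
# under the global null. The GLR form m*(xbar - ybar)_+^2 / 4 (variance 1) and the
# threshold c are simplifying assumptions; the same membership draw is used in both arms.
import numpy as np
rng = np.random.default_rng(1)

props = [0.2, 0.5, 1.0]            # nested subgroup proportions; the last is the full population
n, c, reps = 400, 2.0, 20_000      # patients per arm, GLR threshold, simulation size
rejections = 0
for _ in range(reps):
    x = rng.normal(0.0, 1.0, n)    # treatment responses under the global null
    y = rng.normal(0.0, 1.0, n)    # control responses
    membership = rng.random(n)     # uniform "biomarker" defining nested membership
    glr = []
    for p in props:
        idx = membership < p
        m = idx.sum()
        diff = x[idx].mean() - y[idx].mean()
        glr.append(m * max(diff, 0.0) ** 2 / 4.0)
    overall, best_subgroup = glr[-1], max(glr[:-1])
    rejections += (overall >= c) or (best_subgroup >= c)
print(rejections / reps)           # familywise type I error at threshold c
```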

Shortly before the final reviews of the protocol for funding were completed, four RCTs of endovascular reperfusion therapy administered to stroke patients within 6 hours after symptom onset demonstrated decisive clinical benefits. Consequently, the equipoise of the investigators shifted, making it necessary to adjust the intake criteria to exclude patients for whom the new therapy had been proven to work better than the standard treatment. The subset selection strategy became even more central to the design, since the primary question was no longer whether the treatment was effective at all, but for which patients it should be adopted as the new standard of care. Besides the adaptation of the intake criteria to the new findings, another constraint was imposed by the NIH sponsor, which effectively limited the total randomization to 476 patients. The first interim analysis was scheduled after the first 200 patients, and the second interim analysis after an additional 140 patients. DEFUSE 3 had a Data Coordinating Unit and an independent Data and Safety Monitoring Board (DSMB). The DSMB examined the unblinded efficacy results prepared by a designated statistician at the Data Coordinating Unit, which also provided periodic summaries on enrollment, baseline characteristics of enrolled patients, protocol violations, timeliness and completeness of data entry by clinical centers, and safety data. During interim analyses, the DSMB would also consider the unblinded safety data, comparing the safety of endovascular therapy plus IV tPA to that of IV tPA alone, in terms of deaths, serious adverse events, and incidence of symptomatic intracranial hemorrhage.

In June 2017, positive results of another trial, DWI or CTP Assessment with Clinical Mismatch in the Triage of Wake-Up and Late Presenting Strokes Undergoing Neuro-intervention with Trevo (DAWN), which involved patients and treatments similar to those of DEFUSE 3, were announced. Enrollment in the DEFUSE 3 trial was placed on hold, and an early interim analysis of the 182 patients enrolled to date was requested by the sponsor (NIH); see albers2018thrombectomy, who say: “As a result of that interim analysis, the trial was halted because the prespecified efficacy boundary had been exceeded.” As reported by the aforementioned authors, DEFUSE 3 “was conducted at 38 US centers and terminated early for efficacy after 182 patients had undergone randomization (92 to the endovascular therapy group and 90 to the medical-therapy group).” For the primary and secondary efficacy endpoints, the results show significant superiority of endovascular plus medical therapies. The DAWN trial “was a multicenter randomized trial with a Bayesian adaptive-enrichment design” and was “conducted by a steering committee, which was composed of independent academic investigators and statisticians, in collaboration with the sponsor, Stryker Neurovascular” [nogueira2018thrombectomy]. The early termination of DEFUSE 3 provides a concrete example of the importance of a flexible group sequential design that can adapt not only to endogenous information from the trial but also to exogenous information from advances in precision medicine and related concurrent trials.

We conclude this section with recent regulatory developments in enrichment strategies for clinical trials and in adaptive designs of confirmatory trials of new treatments. In March 2019, the FDA released its Guidance for Industry on Enrichment Strategies for Clinical Trials to Support Determination of Effectiveness of Human Drugs and Biological Products. In November 2019, the FDA’s CDER and CBER released their Guidance for Industry on Adaptive Designs for Clinical Trials of Drugs and Biologics, which was an update of CDER’s 2010 Guidance for Industry on Adaptive Designs.

2.3 Analysis of Novel Confirmatory Trials

This section describes some advances in statistical methods for the analysis of the novel confirmatory trial designs in Section 2.2. It begins with hybrid resampling for inference on primary and secondary endpoints, and Section 2.3.1 then considers statistical inference from multi-arm trials for developing and testing biomarker-guided personalized therapies.

Hybrid Resampling for Primary and Secondary Endpoints

tsiatis1984exact developed exact confidence intervals for the mean $\theta$ of a normal distribution with known variance following a group sequential test. Subsequently, chuang1998resampling, chuang2000hybrid noted that even though $\sqrt{n}(\bar X_n-\theta)/\sigma$ is a pivot in the case of a fixed sample size $n$, $\sqrt{T}(\bar X_T-\theta)/\sigma$ is highly non-pivotal for a group sequential stopping time $T$, hence the need for the exact method of tsiatis1984exact, which they generalized as follows. If the distribution $F_\theta$ of the data $X$ is indexed by a real-valued parameter $\theta$, an exact equal-tailed confidence region can always be found by using the well-known duality between hypothesis tests and confidence regions. Suppose one would like to test the null hypothesis that $\theta$ is equal to $\theta_0$. Let $R(X,\theta_0)$ be some real-valued test statistic. Let $u_\alpha(\theta_0)$ be the $\alpha$-quantile of the distribution of $R(X,\theta_0)$ under the distribution $F_{\theta_0}$. The null hypothesis is accepted if $u_{\alpha/2}(\theta_0)<R(X,\theta_0)<u_{1-\alpha/2}(\theta_0)$. An exact equal-tailed confidence region with coverage probability $1-\alpha$ consists of all $\theta_0$ not rejected by the test and is therefore given by $\{\theta:\,u_{\alpha/2}(\theta)<R(X,\theta)<u_{1-\alpha/2}(\theta)\}$. The exact method, however, applies only when there are no nuisance parameters, and this assumption is rarely satisfied in practice. To address this difficulty, Chuang and Lai (1998, 2000) introduced a hybrid resampling method that “hybridizes” the exact method with Efron’s (1987) bootstrap method to construct confidence intervals. The bootstrap method replaces the quantiles $u_{\alpha/2}(\theta)$ and $u_{1-\alpha/2}(\theta)$ by the approximate quantiles $u^*_{\alpha/2}$ and $u^*_{1-\alpha/2}$ obtained in the following manner. Based on $X$, construct an estimate $\hat F$ of the underlying distribution. The quantile $u^*_\alpha$ is defined to be the $\alpha$-quantile of the distribution of $R(X^*,\hat\theta)$ with $X^*$ generated from $\hat F$, yielding the confidence region $\{\theta:\,u^*_{\alpha/2}<R(X,\theta)<u^*_{1-\alpha/2}\}$ with approximate coverage probability $1-\alpha$. For group sequential designs, the bootstrap method breaks down because of the absence of an approximate pivot, as shown by chuang1998resampling. The hybrid confidence region is based on reducing the family of distributions $\{F_\theta\}$ to another family $\{\hat F_\theta\}$, which is used as the “resampling family” and in which $\theta$ is the unknown parameter of interest. Let $\hat u_\alpha(\theta)$ be the $\alpha$-quantile of the sampling distribution of $R(X^*,\theta)$ under the assumption that $X^*$ has distribution $\hat F_\theta$. The hybrid confidence region results from applying the exact method to $\{\hat F_\theta\}$ and is given by

$$\{\theta:\,\hat u_{\alpha/2}(\theta)<R(X,\theta)<\hat u_{1-\alpha/2}(\theta)\}. \qquad (2.2)$$

The construction of (2.2) typically involves simulations to compute the quantiles, as in the bootstrap method.
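
The sketch below illustrates the construction (2.2) for the mean after a group sequential test, with the standard deviation playing the role of the nuisance parameter that is frozen at its estimate in the resampling family; the design constants, the grid, and the root $R(X,\theta)=\bar X_T-\theta$ are illustrative choices of ours, not the designs analyzed in the thesis.

```python
# Hybrid resampling CI after a group sequential test (illustrative sketch).
# The resampling family is N(theta, sd_obs^2): theta varies over a grid while
# the nuisance standard deviation is fixed at its estimate from the observed trial.
import numpy as np
rng = np.random.default_rng(2)

group_sizes = [25, 25, 25, 25]           # four equal-sized groups
boundary = 2.5                           # flat two-sided efficacy boundary on the z-scale

def run_trial(theta, sigma, rng):
    """Simulate the group sequential trial; return (stage, mean, sd) at stopping."""
    data = np.empty(0)
    for k, g in enumerate(group_sizes, start=1):
        data = np.concatenate([data, rng.normal(theta, sigma, g)])
        z = data.mean() / (data.std(ddof=1) / np.sqrt(len(data)))
        if abs(z) >= boundary or k == len(group_sizes):
            return k, data.mean(), data.std(ddof=1)

# "Observed" trial, generated here only so that the example is self-contained
T_obs, mean_obs, sd_obs = run_trial(theta=0.3, sigma=1.0, rng=rng)

def hybrid_ci(mean_obs, sd_obs, level=0.95, grid=np.linspace(-0.5, 1.0, 31), B=1000):
    """Invert the exact test, applied to the resampling family N(theta, sd_obs^2)."""
    alpha, accepted = 1 - level, []
    for theta in grid:
        roots = np.array([run_trial(theta, sd_obs, rng)[1] - theta for _ in range(B)])
        q_lo, q_hi = np.quantile(roots, [alpha / 2, 1 - alpha / 2])
        if q_lo <= mean_obs - theta <= q_hi:
            accepted.append(theta)
    return (min(accepted), max(accepted)) if accepted else None

print(T_obs, round(mean_obs, 3), hybrid_ci(mean_obs, sd_obs))
```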

Since an exact method for constructing confidence regions is based on inverting a test, such a method is implicitly or explicitly linked to an ordering of the sample space of the test statistic used. The ordering defines the $p$-value of the test as the probability (under the null hypothesis) of more extreme values (under the ordering) of the test statistic than that observed in the sample. Under a total ordering $\preceq$ of the sample space of $(T,S_T)$, Lai and Li (2006) call $x$ a $p$th quantile if $P\{(T,S_T)\preceq x\}\ge p$ and $P\{x\preceq(T,S_T)\}\ge 1-p$, which generalizes Rosner and Tsiatis’ exact method for randomly stopped sums $S_T=X_1+\dots+X_T$ of independent normal random variables with unknown mean $\theta$. For the general setting where a stochastic process $\{X(t),\,t\in I\}$, in which $I$ denotes either discrete or continuous time, is observed up to a stopping time $\tau$, Lai and Li (2006) define $x$ to be a $p$th quantile if

$$P\{(X(t\wedge\tau),\,t\in I)\preceq x\}\ge p \quad\text{and}\quad P\{x\preceq(X(t\wedge\tau),\,t\in I)\}\ge 1-p \qquad (2.3)$$

under a total ordering $\preceq$ for the sample space of $(X(t\wedge\tau),\,t\in I)$. For applications to confidence intervals of a real parameter $\theta$, the choice of the total ordering should be targeted toward the objective of interval estimation. Let $W_t$ be real-valued statistics based on the observed process $\{X(s),\,s\le t\}$. For example, let $W_t$ be an estimate of $\theta$ based on $\{X(s),\,s\le t\}$. A total ordering on the sample space of $(X(t\wedge\tau),\,t\in I)$ can be defined via $W_\tau$ as follows:

$$(X(t\wedge\tau),\,t\in I)\preceq(\tilde X(t\wedge\tilde\tau),\,t\in I)\quad\text{if and only if}\quad W_\tau\le\tilde W_{\tilde\tau}, \qquad (2.4)$$

in which $\tilde W$ is defined from $\tilde X$ in the same way as $W$ is defined from $X$, and which has the attractive feature that the probability mechanism generating $X$ needs only to be specified up to the stopping time $\tau$ in order to define the quantile. Bartroff et al. (2013, p. 164) remark that for randomly stopped sums with $W_\tau$ the stopped sample mean, the Lai-Li ordering is equivalent to Siegmund’s ordering and also to the Rosner-Tsiatis ordering, but “the original Rosner-Tsiatis ordering requires [the observations beyond the stopping time] (or the stochastic mechanism generating them) to be completely specified” and has difficulties “described in the last paragraph of Sect. 7.1.3 if this is not the case.”

Bartroff et al. (2013, Sections 7.4 and 7.5) describe how this ordering can be applied to implement hybrid resampling for secondary endpoints, together with applications to time-sequential trials, which involve interim analyses at calendar times $t_1<\dots<t_k$, with $t_k\le t^*$ (the prescribed duration of the trial), and which have time to failure as the primary endpoint; Lai et al. (2009) have also extended this approach to inference on secondary endpoints in adaptive or time-sequential trials.

2.3.1 Statistical Inference from Multi-Arm Trials for Developing and Testing Biomarker-Guided Personalized Therapies

lai2013group first elucidate the objectives underlying the design and analysis of these multi-arm trials, which attempt to select the best of $K$ treatments for each biomarker-classified subgroup of cancer patients in Phase II studies, with objectives that include (a) treating accrued patients with the best (yet unknown) available treatment, (b) developing a biomarker-guided treatment strategy for future patients, and (c) demonstrating that the strategy developed indeed has a statistically significantly better treatment effect than some predetermined threshold. The group sequential design therefore uses an outcome-adaptive randomization rule, which updates the randomization probabilities at interim analyses and uses GLR statistics and modified Haybittle-Peto rules to include early elimination of inferior treatments from a biomarker class. It is shown by lai2013group to provide substantial improvements, besides being much easier to implement, over the Bayesian outcome-adaptive randomization design used in the BATTLE (Biomarker-integrated Approaches of Targeted Therapy for Lung Cancer Elimination) trial of personalized therapies for non-small cell lung cancer. An April 2010 editorial in Nature Reviews in Medicine points out that the BATTLE design, which “allows researchers to avoid being locked into a single, static protocol of the trial” that requires large sample sizes for multiple comparisons of several treatments across different biomarker classes, can “yield breakthroughs, but must be handled with care” to ensure that “the risk of reaching a false positive conclusion” is not inflated. As pointed out by Lai et al. (2013, pp. 651-653, 662), targeted therapies that target the cancer cells (while leaving healthy cells unharmed) and the “right” patient population (that has the genetic or other markers for sensitivity to the treatment) hold great promise for cancer treatment but also pose challenges in designing clinical trials for drug development and regulatory approval. One challenge is to identify the biomarkers that are predictive of response, and another is to develop a biomarker classifier that can identify patients who are sensitive to the treatments. We can address these challenges by using recent advances in contextual multi-arm bandit theory, which we summarize below.

The $K$-arm bandit problem, introduced by robbins1952some for the case $K=2$, is prototypical in the area of stochastic adaptive control that addresses the dilemma between “exploration” (to generate information about the unknown system parameters needed for efficient system control) and “exploitation” (to set the system inputs that attempt to maximize the expected rewards from the outputs). Robbins considered the problem of which of $K$ populations to sample from sequentially in order to maximize the expected value of the sum $S_n=y_1+\dots+y_n$. Let $\mathcal F_{t-1}$ be the history (or more formally, the $\sigma$-algebra of events) up to time $t-1$. An allocation rule $\phi=(\phi_1,\phi_2,\dots)$ is said to be “adaptive” if $\{\phi_t=j\}\in\mathcal F_{t-1}$ for $j=1,\dots,K$. Suppose $y_t$ has density function $f(\cdot;\theta_j)$ when $\phi_t=j$, and let $\theta=(\theta_1,\dots,\theta_K)$. Let $\mu(\theta_j)$ be the mean of the $j$th population, which is assumed to be finite. Then

$$E_\theta(S_n)=\sum_{j=1}^K \mu(\theta_j)\,E_\theta T_n(j), \qquad (2.5)$$

where $T_n(j)=\sum_{t=1}^n \mathbf 1\{\phi_t=j\}$ is the total sample size from population $j$. If the population with the largest mean were known, then obviously one should sample from it to receive expected reward $n\mu^*(\theta)$, where $\mu^*(\theta)=\max_{1\le j\le K}\mu(\theta_j)$. Hence maximizing the expected sum is equivalent to minimizing the regret, or shortfall from $n\mu^*(\theta)$:

$$R_n(\theta)=n\mu^*(\theta)-E_\theta(S_n)=\sum_{j:\,\mu(\theta_j)<\mu^*(\theta)}\big(\mu^*(\theta)-\mu(\theta_j)\big)\,E_\theta T_n(j), \qquad (2.6)$$

in which the second equality follows from (2.5) and shows that the regret is a weighted sum of expected sample sizes from inferior populations. Making use of this representation in terms of expected sample sizes, lai1985asymptotically derive an asymptotic lower bound, as $n\to\infty$, for the regret of uniformly good adaptive allocation rules:

$$\liminf_{n\to\infty}\frac{R_n(\theta)}{\log n}\ \ge\ \sum_{j:\,\mu(\theta_j)<\mu^*(\theta)}\frac{\mu^*(\theta)-\mu(\theta_j)}{I(\theta_j,\theta^*)}, \qquad (2.7)$$

where $\theta^*$ is the parameter of an arm attaining $\mu^*(\theta)$ and $I(\lambda,\lambda')=E_\lambda\log\{f(y;\lambda)/f(y;\lambda')\}$ is the Kullback-Leibler information number; an adaptive allocation rule is called “uniformly good” if $R_n(\theta)=o(n^a)$ for all $\theta$ and every $a>0$. They show that the asymptotic lower bound (2.7) can be attained by the “upper confidence bound” (UCB) rule that samples from the population (arm) with the largest upper confidence bound, which incorporates uncertainty in the sample mean through the number of observations sampled from the arm (i.e., the width of a one-sided confidence interval).
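
A minimal simulation of a UCB-type allocation rule for normal rewards with unit variance (using the familiar UCB1-style index for concreteness, rather than the exact Lai-Robbins construction; the arm means and horizon are arbitrary illustrative choices):

```python
# UCB allocation for a 3-armed Gaussian bandit (illustrative sketch).
import numpy as np
rng = np.random.default_rng(3)

means = np.array([0.0, 0.2, 0.5])     # unknown arm means; arm 2 is best
k, horizon = len(means), 10_000
counts, sums = np.zeros(k), np.zeros(k)

for t in range(1, horizon + 1):
    if t <= k:
        arm = t - 1                                   # play each arm once to initialize
    else:
        ucb = sums / counts + np.sqrt(2 * np.log(t) / counts)
        arm = int(np.argmax(ucb))                     # largest upper confidence bound
    reward = rng.normal(means[arm], 1.0)
    counts[arm] += 1
    sums[arm] += reward

regret = horizon * means.max() - (counts * means).sum()
print(counts.astype(int), round(regret, 1))           # most pulls go to the best arm
```

In line with (2.6)-(2.7), the regret here is just the weighted count of pulls of the inferior arms, and it grows roughly logarithmically in the horizon.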

New applications and advances in information technology and biomedicine in the new millennium have led to the development of contextual multi-arm bandits, also called bandits with side information or covariates, while the classical multi-arm bandits reviewed above are often referred to as “context-free” bandits. Personalized marketing (e.g., Amazon) uses web sites to track a customer’s purchasing records and thereby market products that are individualized for the customer. Recommender systems select items such as movies (e.g., Netflix) and news (e.g., Yahoo) for users based on the users’ and items’ features (covariates). Whereas the classical $K$-arm bandits reviewed above aim at choosing $\phi_t$ sequentially so that $E_\theta(S_n)$ is as close as possible to $n\mu^*(\theta)$, contextual bandits basically replace $\mu^*(\theta)$ by $\mu^*(\theta;x_t)$, where $x_t$ is the covariate of the $t$th subject, noting that analogous to (2.5),

$$E_\theta(S_n)=\sum_{j=1}^K E_\theta\Big(\sum_{t=1}^n \mu(\theta_j;x_t)\,\mathbf 1\{\phi_t=j\}\Big). \qquad (2.8)$$

Assuming the $x_t$ to be i.i.d. with distribution $\nu$, we can define $\mu^*(\theta;x)=\max_{1\le j\le K}\mu(\theta_j;x)$, and the regret

$$R_n(\theta)=n\int\mu^*(\theta;x)\,d\nu(x)-E_\theta(S_n)=\sum_{j=1}^K\int\big(\mu^*(\theta;x)-\mu(\theta_j;x)\big)\,g_{n,j}(x)\,d\nu(x), \qquad (2.9)$$

where, for Borel subsets $B$ of the support of $\nu$, $\mu_{n,j}(B)=E_\theta\sum_{t=1}^n\mathbf 1\{\phi_t=j,\,x_t\in B\}$, noting that the measure $\mu_{n,j}$ is absolutely continuous with respect to $\nu$, hence $g_{n,j}$ in (2.9) is its Radon-Nikodym derivative with respect to $\nu$. For contextual bandits, an arm that is inferior at $x$ may be the best at $x'$. Therefore the uncertainty in the sample mean reward at $x$ does not need to be immediately accounted for, and adaptive randomization (rather than a UCB rule) can yield an asymptotically optimal policy.

To achieve the objectives (a), (b) and (c) in the first paragraph of this subsection, Lai et al. [lai2013group, pp. 654-655] use contextual bandit theory, which we illustrate below with $J$ groups of patients and $K$ treatments, assuming normally distributed responses with mean $\mu_{jk}$ and known variance 1 for patients in group $j$ receiving treatment $k$. Using Bartroff and Lai’s adaptive design (2008a,b) reviewed in Section 2.2.1, let $n_i$ denote the total sample size up to the time of the $i$th interim analysis, $n_{i,j}$ denote the total sample size from group $j$ among those patients, and let $n_{i,jk}$ be the total sample size from biomarker class $j$ receiving treatment $k$ up to the $i$th interim analysis. Because it is unlikely for patients to consent to being assigned to a seemingly inferior treatment, randomization in a double-blind setting (in which neither the patient nor the physician knows whether treatment or control is assigned) is needed for informed consent. Contextual bandit theory suggests assigning the highest randomization probability between interim analyses $i$ and $i+1$ to the apparently best treatment $\hat k_{i,j}=\arg\max_k\hat\mu_{i,jk}$ (where $\hat\mu_{i,jk}$ is the MLE of $\mu_{jk}$) and eliminating treatment $k$ from the set $\mathcal K_{i,j}$ of surviving treatments at the $i$th interim analysis if the GLR statistic comparing $k$ with the apparently best treatment in the class exceeds 5, with a randomization scheme in which

$$\pi_{i,j}(k)=\begin{cases}1-\varepsilon+\varepsilon/|\mathcal K_{i,j}| & \text{if } k=\hat k_{i,j},\\[2pt] \varepsilon/|\mathcal K_{i,j}| & \text{if } k\in\mathcal K_{i,j}\setminus\{\hat k_{i,j}\},\end{cases} \qquad (2.10)$$

in which $|\cdot|$ denotes the cardinality of a finite set and $0<\varepsilon<1$. Equal randomization (with randomization probability $1/K$) over the $K$ treatments is used up to the first interim analysis. In context-free multi-arm bandit theory, this corresponds to the $\varepsilon$-greedy algorithm, which has been shown by auer2002finite to provide an alternative to the UCB rule for attaining the asymptotic lower bound for the regret. lai2013group introduce a subset selection method for selecting a subset of treatments at the end of the trial to be used for future patients, with a prescribed overall probability guarantee of containing the best treatment for each biomarker class, and such that the expected size of the selected subset is as small as possible in some sense. They also develop a group sequential GLR test with prescribed type I error to demonstrate that the developed treatment strategy improves the mean treatment effect of the SOC by a given margin.
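
A small sketch of the randomization step (2.10) between interim analyses, for $J$ biomarker groups and $K$ treatments; the data structures, the initialization, and the way ties are broken are our own illustrative choices rather than the exact specification in lai2013group:

```python
# Epsilon-greedy randomization over surviving treatments within each biomarker group.
import numpy as np
rng = np.random.default_rng(4)

J, K, eps = 3, 4, 0.2
sums = np.zeros((J, K))                    # running reward sums per (group, treatment)
counts = np.ones((J, K))                   # one observation per cell from the equal-randomization stage
surviving = [set(range(K)) for _ in range(J)]

def randomization_probs(j):
    """Probabilities over surviving treatments for group j, as in (2.10)."""
    arms = sorted(surviving[j])
    mle = sums[j, arms] / counts[j, arms]              # MLE of mu_{jk} for surviving k
    best = arms[int(np.argmax(mle))]
    probs = {a: eps / len(arms) for a in arms}         # epsilon spread over surviving arms
    probs[best] += 1 - eps                             # apparent best gets the extra mass
    return probs

def assign(j):
    probs = randomization_probs(j)
    arms, p = zip(*probs.items())
    return int(rng.choice(arms, p=p))

print([assign(0) for _ in range(10)])      # treatment assignments for 10 patients in group 0
```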

2.3.2 Precision-Guided Drug Development and Basket Protocols

Janet Woodcock, director of FDA’s Center for Drug Evaluation and Research (CDER) and current Acting Commissioner of the FDA, published in 2017 a seminal paper on master protocols of “mechanism-based precision medicine trials,” affordable in cost, time, and sample size, to study multiple therapies, multiple diseases, or both; see woodcock2017master. Table 2 of the paper lists six such trials to illustrate the concept: (i) B2225, a Phase II basket trial, (ii) BRAF V600, an early Phase II basket trial, (iii) NCI-MATCH, a Phase I followed by Phase II umbrella trial, (iv) BATTLE-1, a Phase II umbrella trial, (v) I-SPY 2, a Phase II platform trial, and (vi) Lung-MAP, a Phase II-III trial with a master protocol to study 4 molecular targets for NSCLC initially, to be trimmed to 3 targets for the Phase III confirmatory trial. We have discussed the BATTLE (respectively, I-SPY) trials for therapies to treat NSCLC (respectively, breast cancer) in Section 3.2. For NCI-MATCH, a treatment is given across multiple tumors sharing a common biomarker; see conley2014molecular and do2015overview. hyman2015vemurafenib describe the BRAF V600 basket trial, after noting that (a) BRAF V600 mutations occur in almost 50% of cutaneous melanomas and result in constitutive activation of downstream signaling through the MAPK (mitogen-activated protein kinase) pathway, based on previous studies reported by davies2002mutations and curtin2005distinct; (b) Vemurafenib, a selective oral inhibitor of BRAF V600 kinase produced by Roche-Genentech, has been shown to improve survival of patients with BRAF V600E mutation-positive metastatic melanoma according to chapman2011improved; and (c) efforts by the Cancer Genome Atlas and other initiatives have identified BRAF V600 mutations in non-melanoma cancers [de2010effects, van2011cetuximab, weinstein2013cancer, kris2014using]. They point out that “the large number of tumor types, low frequency of BRAF V600 mutations, and the variety of some of the (non-melanoma) cancers make disease specific studies difficult (unaffordable) to conduct.” hyman2015vemurafenib therefore use six “baskets” (NSCLC, ovarian, colorectal, and breast cancers, multiple myeloma, and cholangiocarcinoma) plus a seventh (“all-others”) basket which “permitted enrollment of patients with any other BRAF V600 mutation-positive cancer” in their Phase II basket trial of Vemurafenib. The Phase II trial uses Simon’s two-stage design “for all tumor-specific cohorts in order to minimize the number of patients treated if vemurafenib was deemed ineffective for a specific tumor type.” The primary efficacy endpoint was the response rate at week 8. “Kaplan-Meier methods were used to estimate progression-free and overall survival. No adjustments were made for multiple hypothesis testing that could result in positive findings.”
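
The operating characteristics of such a two-stage rule reduce to binomial calculations. The sketch below (with arbitrary illustrative cutoffs, not the actual BRAF V600 protocol parameters) returns the probability of stopping early for futility, the expected sample size, and the probability of declaring the drug effective at a given true response rate.

```python
# Operating characteristics of a Simon-type two-stage single-arm design (illustrative).
from scipy.stats import binom

def two_stage_oc(p, n1, r1, n, r):
    """At true response rate p: probability of early futility stop after n1 patients
    (<= r1 responses), expected sample size, and probability of declaring efficacy
    (> r total responses among n patients)."""
    pet = binom.cdf(r1, n1, p)                              # early termination probability
    expected_n = n1 + (1 - pet) * (n - n1)
    declare = sum(binom.pmf(x, n1, p) * binom.sf(r - x, n - n1, p)
                  for x in range(r1 + 1, n1 + 1))
    return pet, expected_n, declare

print(two_stage_oc(p=0.10, n1=13, r1=1, n=43, r=6))         # an uninteresting response rate
print(two_stage_oc(p=0.30, n1=13, r1=1, n=43, r=6))         # a promising response rate
```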

In the BRAF V600 trial, 122 adults received at least one dose of Vemurafenib (20 for NSCLC, 37 for colorectal cancer, 5 for multiple myeloma, 8 for cholangiocarcinoma, 18 for ECD or LCH, 34 for breast, ovarian, and “other” cancers), and 89% of these patients had at least one previous line of therapy. Vemurafenib showed (a) “efficacy in BRAF V600 mutation-positive NSCLC” compared to standard second-line docetaxel in molecularly unselected patients, and (b) for ECD or LCH, “which are closely related orphan diseases with no approved therapies,” a response rate of 43%, and none of the patients had disease progression while receiving therapy, despite a median treatment duration of 5.9 months. hyman2015vemurafenib point out that “one challenge in interpreting the results of basket studies is drawing inferences from small numbers of patients.” Following up on this point, berry2015brave discusses other challenges for inference from basket trials. In particular, he points out that even though patients have the same biomarker, different tumor sites and tumor types may have different response rates, and simply pooling trial results across tumor types may mislead interpretation. On the other hand, different tumors may have similar response rates, and hierarchical Bayesian modeling can help borrow information across these types to compensate for the small sample sizes.

We include here another basket trial led by our former Stanford colleague, Dr. Shivaani Kummar. She collaborated with investigators at Loxo Oncology in South San Francisco, and other investigators at UCLA, USC, Harvard, Cornell, Vanderbilt, MD Anderson, and Sloan Kettering, to design and conduct a basket trial involving seven specified cancer types and an eighth basket (“other cancers”) to evaluate the efficacy and safety of larotrectinib, a highly selective TRK inhibitor produced by Loxo Oncology, for adults and children who had TRK fusion-positive cancers. A total of 55 patients were enrolled into one of three protocols and treated with larotrectinib: a Phase I study involving adults, a Phase I-II study involving adults and children, and a Phase II study involving adolescents and adults with TRK fusion-positive tumors. The Phase II study uses the recommended dose of the drug twice daily. The dose-escalation Phase I study and the Phase I portion of the Phase I-II study do not require the subjects to have TRK fusions, although the combined analysis only includes “patients with prospectively identified TRK fusions.” The primary endpoint for the combined analysis was the overall response assessed by an independent radiology committee. Secondary endpoints include duration of response, progression-free survival, and safety. At the data-cutoff date of 7/17/2017, the overall response rate was 75%, with 7 of the patients having a complete response and 34 a partial response; see drilon2018efficacy. In the accompanying editorial of that issue of NEJM, andre2018developing says that “this study is an illustration of what is likely to be the future of drug development in rare genomic entities” and that, according to the Magnitude of Clinical Benefit Scale for single-arm trials recently developed by the European Society of Medical Oncology, “studies that show rates of objective response of more than 60% and a median progression-free survival of more than 6 months, as the study conducted by Drilon et al. does, are considered to have the highest magnitude of clinical benefit,” in line with the pathway for single-arm trials of treatments of rare diseases with well-established natural histories to receive approval from regulatory agencies. andre2018developing also mentions that the study by Drilon et al. “did not find any difference in efficacy among the 12 tumor histotypes (including those in the all-other basket),” demonstrating a successful “trans-tumor approach” in the case of TRK fusions with larotrectinib, but that “some basket trials have not shown evidence of trans-tumor efficacy of targeted therapies, notably BRAF inhibitors.” He points out the importance of developing “statistical tools to support a claim that a drug works across tumor types” and of providing “a more in-depth understanding of the failure of some targets in a trans-tumor approach.”

BioPharma Dive, a company in Washington, D.C. that provides news and analysis of clinical trials, drug discovery and development, and FDA regulations and approvals for biotech and biopharmaceutical corporations, published a 2019 article sponsored by Parexel, a global provider of biopharmaceutical services headquartered in Waltham, MA, highlighting that “in the past five years, we’ve seen a sharp increase in the number of trials designed with a precision medicine approach,” and that “in 2018 about one of every four trials approved by the FDA was a precision medicine therapy”; see [BPDthatworks]. Moreover, “developing these medicines requires changes to traditional clinical trial designs, as well as the use of innovative testing procedures that result in new types of data,” and “the FDA has taken proactive steps to modernize the regulatory framework” that “prioritizes novel clinical trials and real-world data solutions to provide robust evidence of safety and efficacy at early stages.” The February 12, 2020, news item of BioPharma Dive is about Merck’s positive results for its cancer drug Keytruda, when combined with chemotherapy, in breast cancer patients in whom a certain amount of tumor and immune cells express a protein that makes Keytruda truly effective for this difficult-to-treat form of breast cancer called “triple negative.” The news item the following day is that the FDA granted priority review to BMS’s CAR-T treatment (called liso-cel) for a type of lymphoma, setting up a decision by August 17, 2020; see BPDBMS, BPDMerck. Liso-cel was originally developed by the biotech company Juno Therapeutics before its acquisition by Celgene in 2018. In January 2019, BMS announced its $74 billion acquisition of Celgene and completed the acquisition in November of that year, after regulatory approval by all the government agencies required by the merger agreement.

Since 2016, Stanford University has held an annual drug discovery symposium focusing on precision-guided drug discovery and development. We briefly describe here the work of Brian Kobilka, one of the founding conference organizers and the director of the Kobilka Institute of Innovative Drug Discovery (KIDD) at The Chinese University of Hong Kong, Shenzhen, and of his former mentor and Nobel Prize co-winner Robert Lefkowitz. In a series of seminal papers from 1981 to 1984 published by Lefkowitz and his postdoctoral fellows at the Howard Hughes Medical Institute and the Departments of Medicine and Biochemistry at Duke University, the subtypes of the pharmacologically important β-adrenergic receptor (βAR) were purified to homogeneity and demonstrated to retain binding activity. Dixon, Sigal, and Strader of Merck Research Laboratories subsequently collaborated with Lefkowitz, Kobilka, and others on their team at Duke to derive an amino-acid sequence of peptides which indicated significant amino-acid homology with bovine rhodopsin, and they were able to find a genomic intronless clone in 1986. In his December 2012 Nobel Lecture [lefkowitz2013brief], Lefkowitz highlights the importance of the discovery, saying: “Today we know that GPCRs (G protein coupled receptors), also known as seven transmembrane receptors, represent by far the largest, most versatile and most ubiquitous of the several families of plasma membrane receptors. Moreover, these receptors are the targets for drugs accounting for more than half of all prescription drug sales in the world [pierce2002seven].” Kobilka highlights in his Nobel lecture [kobilka2013structural] his efforts to understand the structural basis of βAR signaling, using advances in X-ray crystallography and later in electron microscopy to study the crystal structure of the βAR. He concludes his Nobel lecture by saying: “While the stories outlined in this lecture have advanced the field, much work remains to be done before we can fully understand and pharmacologically control signaling by these fascinating membrane proteins.” This work is continued at the Kobilka Institute of Innovative Drug Discovery and by his and other groups at Stanford, by Lefkowitz’s group at Duke, and by other groups in academia and industry in North America, Asia, and Europe.

2.3.3 Discussion and New Opportunities for Statistical Science

woodcock2017master point out new opportunities for statistical science in the design and analysis of master protocols: “With multiple questions to address under a single protocol, usually in an area of unmet need, and an extensive infrastructure in place to handle data flow, master protocols are a natural environment for considering innovative trial designs. The flexibility to allow promising new therapies to enter and poor-performing therapies to discontinue usually requires some form of adaptive design, but the level of complexity of those adaptations can vary according to the objectives of the master protocol.” They also point out that “two types of innovation are hallmarks of master protocols: the use of a trial network with infrastructure in place to streamline trial logistics, improve data quality, and facilitate data collection and sharing; and the use of a common protocol that incorporates innovative statistical approaches to study design and data analysis, enabling a broader set of objectives to be met more effectively than would be possible in independent trials.” Recent advances in hidden Markov models and MCMC schemes that we are developing for cryo-EM analysis at Stanford are another example of new opportunities for statistical science in drug discovery. This will be coupled with innovative designs for regulatory submission. It is an exciting interdisciplinary team effort, merging statistical science with other sciences and engineering.

3.1 Introduction and Background

In this section we review multi-armed bandit theory with covariate information, also called "contextual multi-armed bandits," to pave the way for it to have a major impact on the future of clinical research, as the medical community grapples with the challenge of generating and applying knowledge at the point of care in fulfillment of the concept of the "learning healthcare system" (LHS) [chamberlayne1998creating]. "A learning healthcare system is one that is designed to generate and apply the best evidence for the collaborative healthcare choices of each patient and provider; to drive the process of discovery as a natural outgrowth of patient care; and to ensure innovation, quality, safety, and value in health care" [olsen2007institute]. The first branch of Tze Lai's work discussed below deals with methods for bringing the strength of true experiments to efforts to explore the comparative effects of different treatments, while exploiting what is learned to improve outcomes for patients.

3.1.1 The Multi-Armed Bandit Problem

The name "multi-armed bandit" (MAB) suggests a row of slot machines, which in the 1930s were nicknamed "one-armed bandits." (Presumably the name is inspired by their pull-to-play levers and the often large house edge.) For a gambler in an unfamiliar casino, the "multi-armed bandit problem" would refer to a particular challenge: to maximize the expected winnings over a total of $n$ plays, moving between machines as desired. The distribution of payouts from pulling each arm may be unknown and different for each machine. How should the gambler play? Research into the MAB problem and its variants has led to foundational insights for problems in sequential sampling, sequential decision-making, and reinforcement learning.

Mathematical analysis of the MAB problem has been motivated by medical applications since thompson1933likelihood, with different medical treatments playing the role of bandit machines. Subsequent theory has found wide application across disciplines including finance, recommender systems, and telecommunications [bouneffouf2019survey]. According to whittle1979discussion, the bandit problem was considered by Allied scientists in World War II, but it “so sapped [their minds] that the suggestion was made that the problem be dropped over Germany, as the ultimate instrument of intellectual sabotage." It was lai1985asymptotically who gave the first tractable asymptotically efficient solution.

Given a set of arms $j = 1, \dots, K$, Lai and Robbins frame the question: how should we sample $y_1, y_2, \dots$ sequentially from the arms in order to achieve the greatest possible expected value of the sum $\sum_{t=1}^{n} y_t$ as $n \to \infty$? They model each sample from arm $j$ as an independent draw from a population with density $f(y; \theta_j)$ belonging to a family of densities indexed by the parameter $\theta$. Then, they formalize the space of (possibly random) strategies, defining $\phi = \{\phi_t\}$ to be an adaptive allocation rule if it is a collection of random variables that makes the arm selection at each timestep $t$. Thus, each $\phi_t$ is a random variable on $\{1, \dots, K\}$, where the event $\{\phi_t = j\}$ ("arm $j$ is chosen at time $t$") belongs to the $\sigma$-field generated by the prior decisions and observations $\phi_1, y_1, \dots, \phi_{t-1}, y_{t-1}$. In this framework, lai1985asymptotically define the cumulative regret of an adaptive allocation rule, which measures the strategy's expected performance against the best arm, equivalent to
$$ R_n(\phi) \;=\; n\mu^* - \mathbb{E}\Big[\sum_{t=1}^{n} y_t\Big] \;=\; \sum_{j:\,\mu_j < \mu^*} (\mu^* - \mu_j)\, \mathbb{E}\big[T_n(j)\big], $$
where $\mu_j$ is the expected value of arm $j$, $\mu^* = \max_j \mu_j$, and $T_n(j)$ is the number of times arm $j$ has been sampled through time $n$. lai1985asymptotically give a strategy that achieves an expected cumulative regret of order $\log n$, and provide a matching lower bound to show it is nearly optimal. This strategy creates an upper confidence bound (UCB) for each arm, where the estimated return is given a bonus for uncertainty. A simple example of a UCB is the UCB1 rule of auer2002finite, which at round $t$ picks the arm maximizing
$$ \bar{y}_j(t) + \sqrt{\frac{2\ln t}{n_j(t)}}, $$
where the rewards are in $[0,1]$, $\bar{y}_j(t)$ is the average of the observed rewards from arm $j$, and $n_j(t)$ is the number of samples observed from arm $j$. Typically, UCBs are designed so that inferior arm(s) are discarded with minimal investment, and the best arm(s) are guaranteed to remain in play; a key contribution of lai1985asymptotically was to show how such statements can be quantified using Chernoff bounds (or other concentration-inequality arguments), and then converted into an upper bound on the cumulative regret. Their approach has been generalized and extended to yield algorithms and regret guarantees across a variety of applications, with UCBs acting as a guiding design principle.
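As a concrete illustration of this index rule, the following minimal Python sketch implements a UCB1-style allocation for Bernoulli arms; the function name, the simulated reward probabilities, and the horizon are our own illustrative choices, not part of the cited work.

import numpy as np

def ucb1_choose(counts, sums, t):
    """Pick the arm maximizing average reward plus a UCB1-style exploration bonus."""
    for j, n in enumerate(counts):      # sample each arm once before applying the index
        if n == 0:
            return j
    means = sums / counts
    bonus = np.sqrt(2.0 * np.log(t) / counts)
    return int(np.argmax(means + bonus))

# Toy simulation with three Bernoulli arms (assumed reward probabilities).
rng = np.random.default_rng(0)
probs = np.array([0.3, 0.5, 0.6])
counts, sums = np.zeros(3), np.zeros(3)
for t in range(1, 2001):
    j = ucb1_choose(counts, sums, t)
    reward = rng.binomial(1, probs[j])
    counts[j] += 1
    sums[j] += reward
print(counts)  # most pulls should concentrate on the best arm over time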

The richness of the bandit problem has generated a multitude of other approaches. By adding to the above model a prior distribution on the arm parameters $(\theta_1, \dots, \theta_K)$, the bandit problem can be framed as a Bayesian optimization over allocation strategies: find the strategy $\phi$ that minimizes the expected regret, with the expectation taken over the prior as well as the data. This optimization can, in principle, be solved with dynamic programming (as in cheng2007optimal); however, dynamic programming does not scale well to large or complicated experiments, because the number of possible states explodes. Using results from whittle1980multi, villar2015multi show how the computation can be reduced considerably by framing the optimal solution as an index policy.
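To make the dynamic-programming formulation concrete, the following minimal Python sketch computes the Bayes-optimal value of a finite-horizon two-armed Bernoulli bandit by backward induction over Beta-posterior states; the horizon, the uniform Beta(1,1) priors, and the function name are illustrative assumptions, and the rapid growth of the memoized state space with the horizon is exactly the scaling problem noted above.

from functools import lru_cache

HORIZON = 20  # assumed small horizon; the state space grows quickly as this increases

@lru_cache(maxsize=None)
def value(t, s1, f1, s2, f2):
    """Expected total remaining reward under optimal play from time t,
    given (success, failure) counts for each arm and Beta(1, 1) priors."""
    if t == HORIZON:
        return 0.0
    best = 0.0
    for arm in (1, 2):
        s, f = (s1, f1) if arm == 1 else (s2, f2)
        p = (s + 1) / (s + f + 2)  # posterior predictive probability of a success
        if arm == 1:
            win, lose = value(t + 1, s1 + 1, f1, s2, f2), value(t + 1, s1, f1 + 1, s2, f2)
        else:
            win, lose = value(t + 1, s1, f1, s2 + 1, f2), value(t + 1, s1, f1, s2, f2 + 1)
        best = max(best, p * (1 + win) + (1 - p) * lose)
    return best

print(value(0, 0, 0, 0, 0))  # Bayes-optimal expected number of successes over the horizon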

When solving for the optimal strategy is not feasible, the heuristic solution of Thompson sampling is a popular choice, with good practical and theoretical performance [chapelle2011empirical, kaufmann2012thompson, russo2016information]. The decision rule proposed by thompson1933likelihood is an adaptive allocation rule in which $\phi_t$, given all data observed prior to time $t$, is nondeterministic and chooses each arm with probability equal to its posterior chance of being the best arm. That is, $\phi_t = j$ with probability $\Pi_{t-1}(j^* = j)$, where $\Pi_{t-1}$ is the posterior probability distribution given the data observed before time $t$, and $j^*$ is the index of the best arm (which is a random variable under the posterior). If the best arm is not unique, the tie should be broken to ensure the uniqueness of $j^*$. In fact, a Thompson allocation can be performed with just one sample from the posterior $\Pi_{t-1}$, as shown in the following workflow:

1 Assume a likelihood model parametrized by $\theta$, such that $\theta$ determines the vector of arm means $\mu(\theta) = (\mu_1(\theta), \dots, \mu_K(\theta))$;
2 Assume a prior $\Pi_0$ on $\theta$;
3 for each sample $t = 1, 2, \dots$ do
4     Draw from the posterior $\Pi_{t-1}$ a sample $\tilde\theta$ and form the corresponding vector of arm means $\mu(\tilde\theta)$; allocate to the arm $\phi_t$ corresponding to the largest entry of $\mu(\tilde\theta)$ (breaking ties at random);
5     Receive from arm $\phi_t$ the next payoff $y_t$;
6     Given the new observation, update the posterior to $\Pi_t$;
7 end for
Algorithm 1: Bayesian Workflow with Thompson Sampling
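A minimal Python sketch of this workflow for Bernoulli arms with conjugate Beta(1, 1) priors is given below, so that the posterior draw in step 4 is exact; the arm probabilities, horizon, and variable names are our own illustrative assumptions.

import numpy as np

rng = np.random.default_rng(1)
probs = np.array([0.3, 0.5, 0.6])   # assumed true Bernoulli arm means
alpha = np.ones(3)                  # Beta(1, 1) prior "successes + 1" for each arm
beta = np.ones(3)                   # Beta(1, 1) prior "failures + 1" for each arm

for t in range(2000):
    theta = rng.beta(alpha, beta)       # one posterior draw of the vector of arm means
    j = int(np.argmax(theta))           # allocate to the arm with the largest sampled mean
    reward = rng.binomial(1, probs[j])  # observe the payoff
    alpha[j] += reward                  # conjugate Beta-Bernoulli posterior update
    beta[j] += 1 - reward

print(alpha + beta - 2)  # number of pulls per arm; the best arm should dominate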

Exact sampling from the posterior is not always tractable. A popular technique for sampling the posterior approximately is the Markov chain Monte Carlo (MCMC) method. The convergence properties of MCMC to the posterior distribution, and in particular the number of steps that must be run to achieve accurate sampling, are well understood only in special cases [diaconis2009markov, dwivedi2018log]. Where theory falls short, practitioners may appeal to a variety of diagnostic tools to provide evidence of convergence to the posterior [roy2020convergence].

There are many other approaches to the bandit problem, including epsilon-greedy [sutton1998introduction], knowledge gradient [ryzhov2012knowledge], and information-directed sampling (russo2014learning).

3.1.2 Contextual MABs and Personalized Medicine

For an LHS that continuously seeks to improve and personalize treatment, the important question is not which treatment is best, but for whom each treatment is best. To address this question, one must augment the bandit model with information about each patient. Calling this side information "covariates" or "contexts," one arrives at the contextual MAB (CMAB) problem.

CMABs have found great success in the internet domain for problems such as serving ads, presenting search results, and testing website features. In contrast, applications in medicine have lagged (with the prominent exception of mobile health [greenewald2017action, xia2018price]). The design of trials in an LHS brings new challenges to the CMAB framework, such as ethical requirements, small sample sizes (orders of magnitude fewer patients than the clicks available in internet applications), requirements for medical professionals to inspect and understand the procedures, long feedback times, and demand for generalizable conclusions. In Section 3.4 we return to this topic. Section 3.2 considers adaptive randomization in an LHS. Section 3.3 discusses inference for MABs in an LHS.

3.2 Adaptive Randomization in an LHS

In an LHS, the arms of an MAB are treatments and the rewards are patient outcomes. Thus, minimizing the cumulative regret corresponds to maximizing patients' measured quality of care, a primary function of the LHS. However, there is typically a secondary goal of learning from the trial: useful takeaways may include confidence intervals for the treatment effects, a treatment guide, or recommendations for non-participating patients made in parallel with the trial.

The goals of regret minimization and knowledge generation, often framed as "exploitation vs. exploration," are indeed in fundamental conflict: bubeck2011pure formalized a notion of exploration-based experiments, in which recommendations are made outside the trial. They define the simple regret after round $n$ to be
$$ r_n = \mu^* - \mu_{J_n}, $$
where $\mu_{J_n}$ is the expectation of the arm $J_n$ recommended after round $n$, and $\mu^*$ is the expectation of the best arm. Bubeck, Munos, and Stoltz show that upper bounds on the cumulative regret lead to lower bounds on the simple regret, and vice versa. In this sense, algorithms that minimize the cumulative regret occupy an extreme point of a design space: they maximize the welfare of trial patients, but sacrifice knowledge about inferior treatments. At the other extreme point of the design space, an ideal trial for knowledge generation, with two arms of equal variance, will split the sample sizes equally, consigning half of the patients to the inferior treatment.

Most practical implementations of adaptive randomization in clinical trials use modified bandit algorithms. A common prescription is to lead with a first phase of equal randomization. Or, allocation probabilities may be shrunk toward equal randomization in some fashion. wathen2017simulation discuss the design options of restricting allocation probabilities to [.1, .9], leading with a period of equal randomization to prevent the algorithm from "getting stuck" on a worse arm, and altering the Thompson sampling to allocate with probability proportional to the posterior probability of being the best arm raised to a power $c$, for a tuning parameter $c \in (0, 1]$. villar2015multi consider forced sampling of the control arm at fixed patient intervals. kasy2019adaptive modify the Thompson sampling to tamp down selection of the best arm(s), asymptotically leading to equal randomization between the best candidates. lai2013group give a design that maintains a preferred set of arms, randomizes equally among them, and adaptively drops arms from this set at interim analyses. These various design choices and algorithmic tweaks are typically investigated and tuned by simulation. Even without explicit modification to the standard bandit approach, most medical applications will have a delay between the treatment assignment and the observation of an outcome; the resulting reduction in available information leads to more exploration for most algorithms.
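As a concrete two-arm illustration of these tuning devices, the following minimal Python sketch tempers and clips a Thompson allocation probability; the function name, the default exponent, and the clipping bounds are our own illustrative choices rather than a prescription from the cited works.

def two_arm_allocation(p_best, c=0.5, lo=0.1, hi=0.9):
    """Tempered, clipped Thompson allocation probability for arm 1 in a two-arm trial.

    p_best: posterior probability that arm 1 is the better arm.
    c: tempering exponent; c < 1 pulls the allocation toward equal randomization.
    lo, hi: bounds restricting the allocation probability, e.g. to [0.1, 0.9].
    """
    p = p_best ** c / (p_best ** c + (1 - p_best) ** c)  # temper toward 1/2
    return min(max(p, lo), hi)                           # clip to the allowed range

# Example: a posterior strongly favoring arm 1 still leaves arm 2 a 10% allocation chance.
print(two_arm_allocation(0.99))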

There are many benefits to using nearer-to-equal randomization probabilities. First, balancing sample sizes between a pair of arms serves inference goals such as increased power of hypothesis tests, shorter confidence intervals, and more accurate future recommendations. Second, closer-to-equal randomization may improve the information available for interim decisions such as early stopping and sample size re-estimation. Third, without tuning, there may be an unacceptably high chance of sending a majority of patients to the wrong arm [thall2015statistical]. Fourth, more equal randomization can help detect violated assumptions, such as time trends or model misspecification. Fifth, the possibility of violated assumptions suggests treating the data as slightly less informative, which again points toward more balanced sampling. Finally, probabilities nearer to equal are helpful for inverse-probability weighting and randomization tests.

On the other hand, when a treatment is strongly disfavored for a patient, ethical health care requires setting its randomization chance to zero. This may be achieved by thresholding allocation probabilities according to some rule, or suspending or dropping treatment arms at interim analyses. Furthermore, more equal randomization comes at an opportunity cost to the welfare of trial participants. Practical trial design in an LHS must seek a balance between these competing objectives of knowledge generation and participant welfare.

3.2.1 Inference for MABs in an LHS

The LHS may desire several forms of knowledge from an adaptive randomization trial, including confidence intervals for the outcomes of arms (and their differences), guarantees about selecting arms correctly, and recommendations for treatments in non-participating patients.

Frequentist inference under adaptive randomization designs can be challenging. Owing to the adaptive sampling, the distribution of standard estimates of an arm's mean is typically non-Gaussian, and not pivotal with respect to the treatment effect. Concentration techniques for UCBs, such as Chernoff bounds, can be applied to obtain confidence bounds that hold uniformly over possible stopping times [jamieson2014best, zhao2016adaptive, karnin2013almost]. The concentration approach has been extended to FDR control within the always-valid p-values framework [johari2015always, yang2017framework]. Furthermore, self-normalization techniques from de la pena2008self permit extensions to large classes of distributions. However, confidence intervals from concentration bounds may be conservative, with widths slack by a constant or logarithmic factor.

In confirmatory trial design, adaptivity may be managed by dividing the trial into segments, each having constant randomization probabilities so that Gaussian theory can be used (with numerical integration for stopping boundaries to compute the type-I error and power at fixed alternatives). lai2013group and shih2013sequential show how to do this for their MAB-inspired designs. Alternatively, korn2011outcome suggest block-randomization and block-stratified analysis. Compared to the constantly changing allocation strategies of the standard bandit algorithms, discretization of strategy can come at a moderate or minimal cost, depending on the design and goals.

For analyzing MAB designs with a constantly updating allocation strategy, a key idea for constructing valid frequentist p-values is the randomization test. The randomization test assumes the sharp null hypothesis that the treatment has exactly zero effect, and relies on the probabilistic randomization in the allocation algorithm to generate the reference distribution. In exchange, and under otherwise minimal assumptions, it grants valid p-values, even in the presence of time trends and other confounders in the patient population [simon2011using]. To form confidence intervals, a sharp additive model for the treatment effect may be considered; confidence bounds then follow by inverting the randomization test, as in ernst2004permutation.
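To make the re-randomization scheme concrete, here is a minimal Monte Carlo sketch under the sharp null for a two-arm, response-adaptive trial; the allocator interface, the difference-in-means statistic, and all names are our own illustrative assumptions rather than the procedure of any specific cited paper.

import numpy as np

def randomization_test_pvalue(allocator, outcomes, arms, n_sim=2000, seed=0):
    """Monte Carlo randomization test of the sharp null of exactly zero treatment effect.

    outcomes: observed outcomes in patient order (fixed under the sharp null).
    arms: observed 0/1 arm assignments in patient order.
    allocator(rng, past_outcomes, past_arms): re-runs the (possibly adaptive)
    allocation algorithm to assign the next patient, using fresh randomness.
    """
    outcomes = np.asarray(outcomes, dtype=float)

    def statistic(a):
        a = np.asarray(a)
        if a.sum() in (0, len(a)):       # degenerate split carries no information
            return 0.0
        return outcomes[a == 1].mean() - outcomes[a == 0].mean()

    rng = np.random.default_rng(seed)
    observed = statistic(arms)
    exceed = 0
    for _ in range(n_sim):
        sim_arms = []
        for i in range(len(outcomes)):   # under the sharp null, outcome i is unchanged by arm
            sim_arms.append(allocator(rng, outcomes[:i], sim_arms))
        if abs(statistic(sim_arms)) >= abs(observed):
            exceed += 1
    return (exceed + 1) / (n_sim + 1)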

Another tool for constructing confidence intervals is the hybrid resampling method of lai2006confidence. This procedure considers families of shifts and scales of the observed data, and uses resampling-based simulation to infer which of these distributions are consistent with the observed treatment effects. Lai and Li show that, for group sequential trials, confidence intervals from hybrid resampling can have more accurate coverage than those based on standard normal approximations.

hadad2019confidence suggest a doubly robust estimation approach. In addition to using an augmented inverse-probability weighting (AIPW) estimator, they propose further adaptively re-weighting the data to force the treatment effect estimate into an asymptotically Gaussian distribution. Doubly robust estimation may help to correct for time trends or other confounding. However, the data re-weighting comes at a cost to efficiency, as pointed out by tsiatis2003inefficiency.

Finally, if one assumes a prior and enters the Bayesian framework, posterior inference is a highly flexible approach to analysis. Because Bayes' rule decouples the experimenter's allocation decisions from the rest of the likelihood, the standard Bayesian workflow can be applied to the data without concern for the adaptivity of the design [berger1988likelihood]. Subject to the usual caveats about prior selection and accurate posterior sampling, posterior inference can yield Bayes factors for testing, credible intervals for treatment effects, and decision analysis for treatment recommendations.

3.2.2 Linear and More General Models for the Reward in Personalized Treatments in an LHS

We now return to the contextual MABs (CMABs) introduced in Section 3.1.2 for modeling the reward in personalized treatments in an LHS. First, we focus on a correctly specified linear model in Section 3.4.1. This assumption derives some justification from the features of an LHS: assuming that covariates are continuous and low dimensional, the patient population of greatest interest is expected to occupy a small region of the covariate domain, owing to the systematic filtering imposed by equipoise requirements and the further shrinking of the population under experimental focus as "exploiting" increases. Additionally, the conditional expectation of the response is typically a smooth function of the covariates. Therefore, assuming both smoothness of the conditional expectation and locality of the studied population, Taylor's theorem implies approximate correctness of the linear model. Similar arguments can be applied to logistic models and other smooth model classes.
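In symbols (our notation, not the thesis's), the local-linearity argument is just a first-order Taylor expansion around a reference covariate value $x_0$ in the small region occupied by the study population:

% First-order Taylor expansion of a smooth arm-specific response surface f_j around x_0;
% theta_j and c_j below are the implied slope and intercept (notation ours).
$$ \mathbb{E}\left[ y \mid x, \text{arm } j \right] = f_j(x) \;\approx\; f_j(x_0) + \nabla f_j(x_0)^{\top}(x - x_0) \;=\; c_j + x^{\top}\theta_j , \qquad \theta_j = \nabla f_j(x_0),\; c_j = f_j(x_0) - \nabla f_j(x_0)^{\top}x_0 , $$

so that, for $x$ in a small neighborhood of $x_0$, a linear (or, after a link transformation, logistic) working model is approximately correct.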

Linear Models for the Reward

If at step $t$ we observe a context vector $x_t$ of length $d$, sample from arm $\phi_t = j$, and receive reward $y_t$, we may consider the following simple linear model for the expected reward:
$$ \mathbb{E}[y_t \mid x_t, \phi_t = j] = x_t^{\top}\theta_j , \qquad (3.2.2) $$
where $\theta_j$ is an unknown parameter vector of length $d$. The LinUCB algorithm of li2010contextual brings the UCB approach of lai1985asymptotically to this linear model. Assuming the linear model parameters are not shared between arms and that contexts do not depend on the arm chosen (see li2010contextual for the general case), they suggest estimating $\theta_j$ for each arm using ridge regression. That is, if $D_j$ is a design matrix whose rows are the contexts of the individuals assigned to arm $j$ before time $t$ and $r_j$ is the vector of their rewards, the ridge estimator with tuning parameter $\lambda$ is
$$ \hat\theta_j = (D_j^{\top}D_j + \lambda I)^{-1} D_j^{\top} r_j . $$
Next, li2010contextual construct a UCB for the expected reward around the ridge regression prediction, suggesting the confidence interval
$$ x_t^{\top}\hat\theta_j \pm \alpha \sqrt{x_t^{\top}(D_j^{\top}D_j + I)^{-1} x_t} , $$
where $\lambda$ is set to one and $\alpha$ is a tuning parameter. This confidence interval implicitly assumes a correctly specified linear model and independence of the observed rewards for arm $j$ given their contexts, an assumption which is typically broken by the allocation mechanism unless the allocation $\phi_t$ is independent and identically distributed (i.i.d.) for all $t$. Nevertheless, analogously to the basic UCB algorithm, they propose the LinUCB algorithm, which chooses the arm with the highest UCB,
$$ \phi_t = \arg\max_j \Big\{ x_t^{\top}\hat\theta_j + \alpha \sqrt{x_t^{\top}(D_j^{\top}D_j + I)^{-1} x_t} \Big\} . $$
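The following minimal Python sketch implements the disjoint-models version of this rule; the class and function names and the default value of alpha are our own illustrative choices.

import numpy as np

class LinUCBArm:
    """Per-arm ridge-regression state for a LinUCB-style rule (disjoint linear models)."""

    def __init__(self, d, lam=1.0):
        self.A = lam * np.eye(d)   # accumulates D_j^T D_j + lambda * I
        self.b = np.zeros(d)       # accumulates D_j^T r_j

    def ucb(self, x, alpha):
        A_inv = np.linalg.inv(self.A)
        theta_hat = A_inv @ self.b                              # ridge estimate of theta_j
        return x @ theta_hat + alpha * np.sqrt(x @ A_inv @ x)   # prediction plus uncertainty bonus

    def update(self, x, reward):
        self.A += np.outer(x, x)
        self.b += reward * x

def linucb_choose(arms, x, alpha=1.0):
    """Choose the arm with the largest upper confidence bound at context x."""
    return int(np.argmax([arm.ucb(x, alpha) for arm in arms]))

# Usage: arms = [LinUCBArm(d=5) for _ in range(3)]; j = linucb_choose(arms, x_t);
# then arms[j].update(x_t, observed_reward).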

LinUCB is easy to implement and has proven popular in applications, inspiring further improvements and competitors. chu2011contextual analyze a theoretical fix to LinUCB and give a regret analysis for a modified algorithm of order $\sqrt{Td}$ up to logarithmic factors, where $T$ is the horizon and $d$ the context dimension. They also give a nearly matching general lower bound for the problem of order $\sqrt{Td}$.

Alternatively, Abbasi-Yadkori2011, working within a more general framework called "linear bandits" or "linear stochastic bandits," construct self-normalized confidence sets for the arm parameters. In the linear bandit, rather than choosing among a discrete set of arms, one chooses the context $x_t$ from a set $\mathcal{D} \subset \mathbb{R}^d$, and the rewards are modeled as $y_t = x_t^{\top}\theta_* + \eta_t$. Note that model (3.2.2) can be embedded within the linear bandit by sufficiently increasing the dimensions of $x$ and $\theta$ and taking $\mathcal{D}$ as an appropriate finite set of vectors. Abbasi-Yadkori2011 assume that, conditioned on the data prior to time $t$, the noise $\eta_t$ is mean-zero and $\sigma$-sub-Gaussian for some $\sigma > 0$. Further, it is assumed that $\|\theta_*\|_2 \le S$, for some $S > 0$. Then, defining $D_t$ as the matrix whose rows consist of the contexts $x_s$, for $s \le t$, defining the reward vector $r_t$ as the vector of length $t$ of the corresponding rewards $y_s$, for $s \le t$, and denoting $\bar V_t = \lambda I + D_t^{\top} D_t$, one may write the ridge estimator as
$$ \hat\theta_t = \bar V_t^{-1} D_t^{\top} r_t . $$
Abbasi-Yadkori2011 then derive a confidence set of the form
$$ C_t = \big\{\theta : \|\hat\theta_t - \theta\|_{\bar V_t} \le \beta_t(\delta)\big\}, $$
where $\|z\|_{\bar V_t} = (z^{\top}\bar V_t z)^{1/2}$ is a matrix-weighted 2-norm and $\beta_t(\delta)$ is an explicit radius depending on $\sigma$, $S$, $\lambda$, and $\delta$. The collection of these sets, $\{C_t\}_{t \ge 1}$, provides uniform confidence that $\theta_* \in C_t$ for all $t$ simultaneously with probability at least $1 - \delta$, regardless of an adaptive mechanism for the context choice. Abbasi-Yadkori2011 leverage this confidence approach into a strategy that generalizes the UCB. They follow the underlying principle of "optimism in the face of uncertainty" to select the context
$$ x_{t+1} = \arg\max_{x \in \mathcal{D}} \; \max_{\theta \in C_t} \; x^{\top}\theta , $$
and prove regret guarantees for the linear bandit with this algorithm. For a $K$-arm trial designer, a key takeaway is that uniform confidence sets offer an approach to model inference (noting that practical use requires strong modeling assumptions, a choice of $\lambda$, and bounds for the unknown parameters $\sigma$ and $S$).
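For reference, one commonly cited form of the radius, stated here under the assumptions above and up to the exact constants in Abbasi-Yadkori2011 ($\sigma$ is the sub-Gaussian scale and $S$ the norm bound), is:

% Self-normalized confidence radius for the ridge estimator in the linear bandit
% (constants should be checked against the cited work; sigma and S are as assumed above).
$$ \beta_t(\delta) \;=\; \sigma \sqrt{\,2\log\!\left( \frac{\det(\bar V_t)^{1/2}\,\det(\lambda I)^{-1/2}}{\delta}\right)} \;+\; \lambda^{1/2} S , \qquad \mathbb{P}\big(\theta_* \in C_t \ \text{for all } t \ge 1\big) \;\ge\; 1-\delta . $$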

A different approach to the CMAB problem is to generalize the $\epsilon$-greedy algorithm: periodic forced exploration can be used to estimate a model, and to verify that estimates based on adaptively collected data are not far off. Under the simple linear model (3.2.2), goldenshluger2013linear propose maintaining two sets of linear model estimates: one estimated on a small amount of equally randomized data, and one based on all of the (adaptively allocated) data. If the estimated rewards from the equally randomized data are well separated at the current context, the arm with the larger such estimate is chosen; otherwise, the arm with the larger estimate based on all the data is chosen. Under strong assumptions, including two arms, i.i.d. samples, and a margin condition ensuring that the decision boundary between the arms is sharp (roughly, that the probability of a covariate falling within distance $\kappa$ of the decision boundary is at most a constant multiple of $\kappa$), they derive a cumulative regret bounded by a multiple of $\log n$. bastani2019online improve these bounds and extend this approach to high-dimensional sparse linear models using penalization. bastani2020mostly also show that under certain conditions, a pure greedy approach can yield rate-optimal regret.

3.2.3 More General Models for the Reward

The Bayesian workflow for the MAB naturally extends to linear models and beyond. russo2014post show that for several classes of well-specified Bayesian problems with contexts, Thompson sampling achieves near-optimal performance and behaves like a problem-adaptive UCB. A variety of competitive risk bounds have been proven for Thompson sampling [agrawal2012analysis, agrawal2013thompson, kaufmann2012thompson, korda2013thompson]. In empirical studies, Thompson sampling often outperforms competitors by a small margin [scott2010modern, chapelle2011empirical, dimakopoulou2017estimation].

An alternative for the non-Bayesian is what we call "pseudo-Thompson bootstrapping." Given a black-box algorithm that models the outcomes under each arm, the idea is to bootstrap-resample the data to generate variation in the model's estimates. Pretending that this resampling distribution is a posterior, one can drop the estimated "probabilities" of arm superiority into the Thompson rule and hope to recover its performance advantages. While this technique approximates Thompson sampling in some known cases [eckles2014thompson], its general theoretical properties remain unclear. The main appeal of the approach is that it offers a wrapper for popular estimation algorithms for large data sets, including regression trees, random forests, and neural networks [elmachtoub2017practical, osband2016deep].
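A minimal Python sketch of this wrapper idea, assuming numpy-array inputs, at least one prior observation per arm, and a user-supplied black-box fit_predict routine (all names are ours), is:

import numpy as np

def pseudo_thompson_choose(X_by_arm, y_by_arm, x_new, fit_predict, rng=None):
    """Allocate by a single bootstrap 'posterior draw' per arm.

    X_by_arm[j], y_by_arm[j]: covariates and rewards previously observed under arm j
    (each arm is assumed to have at least one observation).
    fit_predict(X, y, x_new): any black-box estimator returning a predicted reward at x_new.
    """
    rng = rng or np.random.default_rng()
    preds = []
    for X, y in zip(X_by_arm, y_by_arm):
        idx = rng.integers(0, len(y), size=len(y))        # one bootstrap resample of this arm's data
        preds.append(fit_predict(X[idx], y[idx], x_new))  # refit the black box and predict
    return int(np.argmax(preds))  # mimics allocating by a single Thompson posterior draw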

vaswani2019old propose the RandUCB algorithm, which gives LinUCB nondeterministic allocation probabilities by perturbing the confidence bound randomly in a way that somewhat resembles bootstrapping. For the linear model, RandUCB can be viewed as a generalization of Thompson sampling under a Gaussian model. Vaswani et al. also prove competitive regret guarantees for RandUCB.

Finally, there are nonparametric methods that leverage the smoothness of the expected response. rigollet2010nonparametric discretize the covariate space into buckets and run MABs on each bucket independently. lu2010contextual give a contextual bandit that clusters data adaptively and provides guarantees under Lipschitz assumptions. lai2020 perform a local linear regression and pair it with $\epsilon$-greedy randomization and arm elimination, meeting minimax lower bounds on regret under certain regularity conditions; these ideas will be discussed further in the next chapter, where we provide new advances in CMABs together with further discussion and references.

4.1 Introduction

In this chapter we describe the work of KimLaiXu in greater detail and discuss several major ideas there, which we extend to provide far-reaching generalizations of the CMAB problem and to obtain definitive solutions that are remarkably simple and nonparametric. Our investigation was inspired by a seminal paper of Larry Shepp and his coauthors in 1996, who considered a non-denumerable set of arms for the bandit process; see berry1997. Earlier, yakowitz1991nonparametric, yakowtiz1992theory, and yakowitz1995nonparametric also considered nonparametric CMABs in the setting of a non-denumerable set of Markov decision processes. Section 4.3 not only unifies these approaches but also provides a definitive asymptotic theory. Key to this theory is Section 4.2 on the "transformational" insights of the aforementioned work of KimLaiXu.

4.2 From Index Policies in K-Armed Bandits to Arm Randomization and Elimination Rules for CMABs

Kim, Lai, and Xu KimLaiXu have recently developed a definitive nonparametric $K$-armed contextual bandit theory for discrete time. We now extend this theory to a general framework in which time may be discrete or continuous: $y_t$ denotes the bandit process, $\phi_t$ the indicator of the arm selected to generate $y_t$, and $x_t$ the covariate process, with $\phi_t$ and $x_t$ measurable with respect to the information available at the time of selection; the covariate process is assumed to be càdlàg in the continuous-time case. There are three key ingredients in this nonparametric $K$-armed contextual bandit theory, which we consider in the next three subsections.

4.2.1 Lower Bound of the Regret over a Covariate Set

As in KimLaiXu, the covariate vectors are assumed to be stationary with a common distribution, so that the regret of an adaptive allocation rule over a covariate set can be expressed as

(4.1)

where the weight appearing in the integrand is the Radon–Nikodym derivative of the covariate measure with respect to the reference measure. An adaptive allocation rule is called "uniformly good" over the covariate set if

(4.2)

in analogy with the classical (context-free) multi-armed bandit theory reviewed above. Under certain regularity conditions on the nonparametric family generating the data, it is shown in Supplement S1 of KimLaiXu that there is a least favorable parametric subfamily (a cubic spline with evenly spaced knots of a prescribed order for univariate covariates, and tensor products of these univariate splines for multivariate covariates) such that the regret over a covariate set containing leading-arm transitions has a lower bound of the corresponding order.

4.2.2 Epsilon-Greedy Randomization in lieu of UCB or Index Policy

The UCB rule in lai1987 is based on the upper confidence bound

(4.3)

in which the Kullback–Leibler information number and the maximum likelihood estimate of the arm parameter up to the current stage appear; it was introduced to approximate the index policy of gittins1979bandit and whittle1980multi in classical (context-free) parametric multi-armed bandits, and it essentially samples from an inferior arm until the sample size from that arm reaches a threshold defined by (4.3) through the Kullback–Leibler information number. For contextual bandits, an arm that is inferior at one covariate value may be best at another. Hence the index policy, which samples at each stage from the arm with the largest upper confidence bound (modifying the sample mean reward to incorporate its sampling variability), can be improved by deferring the sampling of such an arm to a future time at which it becomes the leading arm (based on the sample mean reward up to that time). Instead of the UCB rule, Kim, Lai, and Xu KimLaiXu use the $\epsilon$-greedy algorithm of reinforcement learning sutton2018reinforcement for nonparametric contextual bandits as follows. Let the set of surviving arms to be sampled from be given, and define the set of apparent leading arms by

(4.4)

where the regression estimate of each arm's mean reward is based on the observations up to the current time, and a tolerance is used to lump treatments with effect sizes close to that of the apparent leader into a single leading set. At each time, arms are chosen randomly, with a small probability assigned to each surviving arm outside the leading set and the remaining probability divided equally among the leading arms (division being by the cardinality of the respective finite sets). The set of surviving arms is related to the arm elimination scheme described in the next subsection. The regression estimate uses local linear regression with a bandwidth of the order shown by Fan fan1993 to attain the minimax risk rate for univariate covariates, and by Ruppert and Wand ruppert1994 for multivariate covariates.

4.2.3 Arm Elimination via Welch’s Test

First note that (4.4) lumps treatments whose effect sizes are close to that of the apparent leader into a single set of leading arms. Such lumping is particularly important when the covariates are near leading-arm transitions, at which a leading arm can become inferior (or vice versa) as the covariate values change. The tolerance chosen in KimLaiXu is especially effective in the vicinity of leading-arm transitions, as will be explained in the next paragraph. Hence such a transition does not change an arm's status as a member of the set of leading arms, so the $\epsilon$-greedy randomization algorithm still chooses it with the leading-arm probability.

We next describe the arm elimination criterion of KimLaiXu. Elimination is carried out on a prespecified schedule of times determined by an integer parameter. At each such time, a surviving arm is eliminated if

(4.5)

where the threshold involves the quantity given in (4.3) and the test statistic is the square of the Welch $t$-statistic comparing the arm with the apparent leader; that is,

(4.6)

where the quantities entering (4.6) are defined through local linear regression, the relevant estimate being the one obtained under the null hypothesis that the arm's mean reward is not significantly below