Abstract
Purpose: Two-stage single-arm trial designs are commonly used in phase II oncology to infer treatment effects for a binary primary outcome (e.g., tumour response). It is imperative that such studies be designed, analysed, and reported effectively. However, there is little available evidence on whether this is the case, particularly for key statistical considerations. We therefore comprehensively review such trials, examining in particular quality of reporting.
Methods: Published oncology trials that utilised “Simon’s two-stage design” over a 5-year period were identified and reviewed. Articles were evaluated on whether they reported sufficient design details, such as the required sample size, and analysis details, such as a confidence interval (CI). The articles that did not adjust their inference for the incorporation of an interim analysis were reanalysed to evaluate the impact on their reported point estimate and CI.
Results: Four hundred and twenty-five articles that reported the results of a single treatment arm were included. Of these, 47.5% provided the five components that ensure design reproducibility. Only 1.2% and 2.1% reported an adjusted point estimate or CI, respectively. Just 55.3% of trials provided the final stage rejection bound, indicating many trials did not test the hypothesis the design is constructed to assess. Reanalysis of the trials suggests that reported point estimates underestimated treatment effects and that reported CIs were too narrow.
Conclusion: Key design details of two-stage single-arm trials are often unreported. Whilst inference is regularly performed, it is rarely done in a way that removes the bias introduced by the interim analysis. In order to maximise their value, future studies must improve the way that they are analysed and reported.
Keywords: Adaptive design; Cancer; Estimation; Oncology; Phase II; Simon.
Introduction
Whilst randomised trial designs are becoming more commonplace in phase II oncology^{1}, recent analyses indicate that single-arm trial designs remain the most widely utilised in this setting^{2}. Furthermore, the primary outcome variable is often dichotomous in phase II^{2}; typically chosen as objective response^{1} via RECIST^{3}. Within the available class of single-arm trial designs for a binary primary outcome, what is commonly referred to as “Simon’s two-stage design”^{4} is generally preferred^{5}. The reasons for this preference are many, but principally Simon’s proposal has been seen to provide a constructive means of formally testing for an efficacy signal via a simple design that requires only a small sample size^{1}.
The habitual use of Simon’s two-stage design has prompted much research into its effective utilisation. Recent work in this area includes methodology to account for deviation from the planned design^{6, 7, 8}, explorations of new criteria for simultaneously optimising the design and analysis^{9}, and evaluations of the value of such trials within wider phase II development plans^{10}. Indeed, an extensive list of publications has now addressed how to handle numerous issues that can arise in trials that use this design. Nonetheless, it is not known to what extent the advice provided in these publications has permeated through to practice.
Several authors have evaluated the reporting of phase II oncology trials without differentiating by design. For example, Grellety et al.^{11} reviewed 156 phase II oncology trials published in 2011, assessing quality of reporting using two proposed scores. One of these, the Key Methodological Score (KMS), consisted of 3 items: provision of (i) a clear definition of a criterion of principal judgement, (ii) a clear justification of the number of patients included, and (iii) a clear definition of the population on which the principal/secondary judgement criteria were evaluated. They found that the median KMS was 2/3, whilst only 16.1% of the analysed studies had a KMS of 3/3. Furthermore, Langrand-Escure et al.^{2} reviewed 557 phase II and phase II/III oncology trials published in 2010–15 in three high-impact oncology journals, also appraising quality of reporting using the KMS. They concluded just 26.2% of articles had a KMS of 3/3. They additionally found that a sample size calculation was missing in 66% of the analysed articles.
These findings are concerning, but it is possible that they only scratch the surface of the issues in the use of two-stage single-arm designs in practice. In particular, to date, no paper has sought to ascertain the degree to which specific components of the design of such trials (e.g., the sample size required in each stage) are included in published reports. Moreover, no research has evaluated the frequency with which trialists have heeded the recommendations of the many articles that argue for the need for inference to be adjusted to account for the interim analysis. Finally, the extent to which deviation from the planned design occurs in practice, or the impact of this on type-I and type-II error-rates, is unknown. Given the extent of the phase II evidence base that comes from trials utilising two-stage single-arm designs, it is paramount that such studies be designed, analysed, and reported effectively. Therefore, with so little known about whether this is the case, we sought to systematically review a large number of trials that utilised Simon’s two-stage design to ascertain issues in design, analysis, and reporting.
Methods
Two-stage group-sequential single-arm trial designs for a binary primary outcome
We review phase II trials that used Simon’s two-stage design. We therefore briefly summarise the statistical aspects of such trials.
The design evaluates a binary primary outcome $X_i \in \{0, 1\}$, from patient $i = 1, 2, \ldots$, assumed to be distributed as $X_i \sim \text{Bernoulli}(p)$. Thus, $p$ is the probability of success for the primary outcome. The following hypothesis is tested: $H_0 : p \le p_0$, with the type-I error-rate controlled to $\alpha$ when $p = p_0$. The trial is powered to level $1 - \beta$ when $p = p_1$. Here, $p_0$ and $p_1$ are respectively commonly referred to as the maximal success probability that does not warrant further investigation and the minimal success probability that allows further investigation of the treatment. Often, $p_0$ is based on the historical success probability for the current standard-of-care.

The design includes a single interim analysis for futility (i.e., a no-go decision) and is indexed by four values: $n_1$, $n_2$, $r_1$, and $r$. In stage 1, outcomes for $n_1$ patients are accumulated. If $\sum_{i=1}^{n_1} x_i \le r_1$, the trial terminates for futility with $H_0$ not rejected. Otherwise, outcomes for a further $n_2$ patients are gathered. Then, $H_0$ is rejected if $\sum_{i=1}^{n_1 + n_2} x_i > r$ and not rejected otherwise. The design parameters $n_1$, $n_2$, $r_1$, and $r$ are typically chosen as those that minimise some optimality criteria, amongst the combinations that meet the type-I error and power requirements. Simon^{4} suggested two optimality criteria: (i) null-optimal, to minimise the expected sample size when $p = p_0$, and (ii) minimax, to minimise the maximal sample size $n = n_1 + n_2$. Other optimality criteria have since been proposed^{12, 13, 14}.
Post-trial inference could be performed using methods developed for one-sample proportions. For example, a confidence interval (CI) could be computed using the Clopper–Pearson^{15} approach. Depending on the stage of termination, a point estimate for $p$ could be given as $\sum_{i=1}^{n_1} x_i / n_1$ or $\sum_{i=1}^{n_1 + n_2} x_i / n$. However, it is well known that the inclusion of an interim analysis necessitates that adjusted inference be performed in order to compute p-values that are consistent with the decision on $H_0$, to acquire CIs with the desired coverage, and to reduce the bias in the point estimate^{16}. Many such adjusted methods have been proposed, including that of Jung et al.^{17} for p-values, Jennison and Turnbull^{18} for CIs, and Jung and Kim^{19} for point estimates. A selection of methods for handling deviation from the planned design (i.e., scenarios in which the interim or final analysis is conducted with a sample size different from $n_1$ or $n$ respectively) have also been developed^{6, 7, 8, 20}. We provide details on all methods used in the Supplementary Materials.
Literature review
Further details on the literature review are included in the Supplementary Materials. Key points are given here.
Inclusion criteria
To identify articles for potential inclusion, PubMed was searched on February 21 2018 using the following term: (“2013/01/01”[Date - Publication] : “2017/12/31”[Date - Publication]) AND Clinical Trial[Publication Type] AND (phase II[Title/Abstract] OR phase 2[Title/Abstract]) AND (cancer[All Fields] OR oncology[All Fields]). This returned 5344 articles for review.
The key inclusion criteria were: (i) full-length articles (i.e., no short communications), (ii) primary publications on a trial’s complete results (i.e., no secondary or preliminary analyses), and (iii) reported results for at least one treatment arm that was designed and analysed using “Simon’s two-stage design”.
Five hundred and thirty-four of the 5344 returned articles (10.0%) were randomly selected for evaluation for inclusion by MJG and APM. The authors agreed on inclusion for 520 articles (97.3%). Given the high level of agreement and the low number of reasons for disagreement, the remaining articles were assessed for inclusion by MJG only, with discussion with APM where required.
Data extraction
Data on each of the questions listed in Table 1 was extracted by MJG for each arm, in each article, deemed eligible for inclusion. To establish the reliability of this extraction, data extracted by MJG was compared with that independently extracted on 58 arms by APM. Across 14 questions requiring non-binary value extraction (e.g., ‘Q5. What was the value of $p_0$?’) the duplicate extractions agreed 96.2% of the time. Across a wider set of 26 questions, including those requiring only binary value extraction, the duplicate extractions agreed 94.3% of the time. The final extracted data is reported as percentages or through figures as appropriate.
Question  Statement 

Q1  How many included treatment arms are reported in this article? 
Q2  What type of cancer was the article focused on? 
Q3  Does the article use the phrase “Simon two-stage”, or similar, or cite Simon (1989)^{4}? 
Q4  What was the chosen design optimality criteria? 
Q5  What was the value of $p_0$? 
Q6  Was a justification given for the value of $p_0$? 
Q7  What was the value of $p_1$? 
Q8  What was the value of $\alpha$? 
Q9  What was the value of $\beta$? 
Q10  What was the value of $n_1$? 
Q11  What was the value of $r_1$? 
Q12  What was the value of $n$? 
Q13  What was the value of $r$? 
Q14  Was a target recruitment goal larger than $n$ stated? 
Q15  At what stage did the trial terminate? 
Q16  If ‘2’ to Q15, did they explicitly state the criteria for progression to stage 2 were met? 
Q17  If ‘2’ to Q15, what was the realised sample size in stage 1? 
Q18  If ‘2’ to Q15, what was the number of successes in stage 1? 
Q19  Did the stage of termination have to be inferred from the enrolled/analysed sample size? 
Q20  What was the total enrolled sample size? 
Q21  Did they report a point estimate, p-value, or confidence interval? 
Q22  If ‘Yes’ to Q21, what was the sample size assumed in the analysis? 
Q23  If ‘Yes’ to Q21, what was the total number of successes assumed in the analysis? 
Q24  If ‘Yes’ to Q21, did they report a point estimate? 
Q25  If ‘Yes’ to Q24, did they state they used an adjusted point estimate? 
Q26  If ‘Yes’ to Q24, what was the reported point estimate? 
Q27  If ‘Yes’ to Q21, did they report a p-value? 
Q28  If ‘Yes’ to Q27, did they state they used an adjusted p-value? 
Q29  If ‘Yes’ to Q27, what was the reported p-value? 
Q30  If ‘Yes’ to Q21, did they report a confidence interval? 
Q31  If ‘Yes’ to Q30, did they state they used an adjusted confidence interval? 
Q32  If ‘Yes’ to Q30, what confidence interval level was used? 
Q33  If ‘Yes’ to Q30, what was the reported confidence interval’s lower limit? 
Q34  If ‘Yes’ to Q30, what was the reported confidence interval’s upper limit? 
Trial reanalyses
Given absence of evidence is not evidence of absence, included articles that did not state they reported an adjusted point estimate (Q25) were reanalysed where possible (i.e., subject to reporting the required design components). This enabled evaluation of which of seven possible point estimates (the unadjusted estimate and six adjusted estimates) the reported point estimate (Q26) was consistent with, to the reported number of decimal places.
Equivalent computations were conducted for those articles that did not state they reported an adjusted CI (Q31); the reported CI (Q32–34) was compared for its consistency with four unadjusted and two adjusted CIs.
The reanalyses were limited to those trials (i) adjudged to have terminated in stage 2, as point estimate and CI procedures do not in general adjust when a trial terminates in stage 1 and (ii) that reported the number of successes and sample size assumed in the analysis, as these are required to calculate unadjusted point estimates and CIs. To reanalyse using the adjusted inferential procedures, $n_1$ and $r_1$ were additionally required to have been reported clearly, as these are requisite to adjusted inference.
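The decimal-place consistency check can be sketched as follows (the helper name and example numbers are ours, purely illustrative): a reported estimate is deemed consistent with a candidate estimate if the candidate, rounded to the reported number of decimal places, reproduces the reported value.

```python
def consistent(reported, candidate):
    """True if `candidate`, rounded to the number of decimal places used
    in the reported value, reproduces the reported value. `reported` is a
    string so that its precision (e.g. "0.35" has 2 places) is preserved."""
    decimals = len(reported.split(".")[1]) if "." in reported else 0
    return round(candidate, decimals) == float(reported)

# A reported response rate of "0.35" from 7/20 successes is consistent
# with the unadjusted estimate, but not with a (hypothetical) adjusted
# estimate of 0.342
print(consistent("0.35", 7 / 20))  # True
print(consistent("0.35", 0.342))   # False
```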
Results
Included articles
Five hundred articles were deemed eligible for inclusion. Four hundred and twenty-five of these reported the results of a single eligible treatment arm. The remaining 75 articles reported results for an additional 204 eligible treatment arms (arms per article: median 2, range [2, 15]).
To remove the need to account for skew caused by the quality of the articles reporting multiple included treatment arms, we discuss here the findings for only the 425 included articles that reported the results of a single eligible treatment arm. Findings for the remaining 75 articles are given in the Supplementary Materials.
Table 2 provides descriptors for the 425 articles. At least 15.8% of the articles came from each eligible publication year, with a decline in inclusions by year potentially indicating decreasing use of the considered design type. The included articles were published in 100 distinct journals, with more than ten articles included from nine journals. The type of cancer under consideration varied widely, though lung and lymph cancers together accounted for 26.4% of the articles.
One hundred and ten trials (25.9%) were judged to have terminated in stage 1, in contrast to 298 in stage 2 (70.1%). Amongst the 298 judged to have terminated in stage 2, only 80 (26.8%) stated the criteria for progression to stage 2 had been met, indicating that this judgement often had to be based on the enrolled and analysed sample sizes. For 17 articles (4.0%) it was not possible to ascertain when the trial terminated; this was caused by the final sample size being between $n_1$ and $n$, or because neither of the planned stage-wise sample sizes was reported.
Descriptor  Value  Number (%) 

Publication year  2013  102 (24.0) 
2014  101 (23.8)  
2015  79 (18.6)  
2016  76 (17.9)  
2017  67 (15.8)  
Journal  Cancer Chemother Pharmacol  44 (10.4) 
Ann Oncol  30 (7.1)  
Invest New Drugs  23 (5.4)  
Lung Cancer  17 (4.0)  
Cancer  15 (3.5)  
BMC Cancer  13 (3.1)  
J Clin Oncol  13 (3.1)  
Br J Haematol  11 (2.6)  
Lancet Oncol  11 (2.6)  
Other (91 journals)  248 (58.4)  
Cancer  Lung  59 (13.9) 
Lymph  53 (12.5)  
Colon/rectum  32 (7.5)  
Breast  28 (6.6)  
Stomach  25 (5.9)  
Head and neck  23 (5.4)  
Blood  22 (5.2)  
Kidney  21 (4.9)  
Other  162 (38.1)  
Stage of termination  1  110 (25.9) 
2: Stated the criteria had been met for progression  80 (18.8)  
2: Did not state the criteria had been met for progression  218 (51.3)  
Unclear  17 (4.0) 
Reporting of design characteristics
Extracted data on the reporting of design characteristics is summarised in Table 3. Whilst 380 articles (89.4%) clearly stated $p_0$, only 78 (18.4%) provided a justification for this value. The success probability $p_1$ was often reported (391 articles; 92.0%), as were the desired type-I (372 articles; 87.5%) and type-II error-rates (382 articles; 89.9%). The chosen optimality criteria was stated in only 240 articles (56.5%). This is the principal driver of the fact that only 202 articles (47.5%) reported $p_0$, $p_1$, $\alpha$, $\beta$, and the optimality criteria; the five components that ensure a design can be easily reproduced.
Whilst $n_1$ (349 articles; 82.1%), $r_1$ (371 articles; 87.3%), and $n$ (394 articles; 92.7%) were all regularly reported, $r$ was given in only 235 articles (55.3%). This drives the result that just 221 articles (52.0%) reported the four components $n_1$, $r_1$, $n$, and $r$; those that enable a design’s operating characteristics to be computed for any $p$.
Figure 1 depicts the values of $p_0$ and $p_1$ given in the 373 articles (87.8%) that reported both of these quantities. The median value of $p_1 - p_0$ was 0.2.
Criteria  Number (%) 

Used the phrase “Simon two-stage” (or similar) or cited Simon (1989)^{4}  357 (84.0) 
Clearly stated $p_0$  380 (89.4) 
Gave a justification for $p_0$  78 (18.4) 
Citation given  40 (9.4) 
Justification given but no citation  38 (8.9) 
Clearly stated $p_1$  391 (92.0) 
Clearly stated $\alpha$  372 (87.5) 
$\alpha = 0.05$  231 (54.4)  
$\alpha = 0.10$  103 (24.2)  
Clearly stated $\beta$  382 (89.9) 
$1 - \beta = 0.80$  165 (38.8)  
$1 - \beta = 0.90$  173 (40.7)  
Clearly stated the optimality criteria  240 (56.5) 
Nulloptimal  142 (33.4) 
Minimax  93 (21.9) 
Admissible  4 (0.9) 
Other  1 (0.2) 
Clearly stated $n_1$  349 (82.1) 
Clearly stated $r$  235 (55.3) 
Clearly stated $r_1$  371 (87.3) 
Clearly stated $n$  394 (92.7) 
Indicated the recruitment target was greater than $n$  117 (27.5) 
Clearly stated $p_0$ and $p_1$  373 (87.8) 
Clearly stated $p_0$, $p_1$, $\alpha$, and $\beta$  340 (80.0) 
Clearly stated $n_1$, $r_1$, $n$, and $r$  221 (52.0) 
Clearly stated $p_0$, $p_1$, $\alpha$, $\beta$, and the optimality criteria  202 (47.5) 
Clearly stated $p_0$, $p_1$, $\alpha$, $\beta$, the optimality criteria, $n_1$, $r_1$, $n$, and $r$  109 (25.6) 
Reporting of inferential procedures
Extracted data on the reporting of the inferential procedures performed in the 425 included articles is summarised in Table 4, with additional stratification by stage of termination.
In total, 375 articles (88.2%) reported either a point estimate, p-value, or CI for their primary outcome. This figure was larger for the trials adjudged to have terminated in stage 2 (287/298 articles; 96.3%).
Whilst point estimates were often reported (372 articles; 87.8%), only 5 articles (1.2%) stated that they had reported an adjusted point estimate. In contrast, p-values were rarely reported (4 articles; 0.9%). For CIs, just 233 articles (54.8%) reported any type of CI, with only 9 (2.1%) indicating that they reported an adjusted CI.
To evaluate whether the articles that reported a point estimate or CI but did not indicate it was adjusted were consistent (to their reported number of decimal places) with unadjusted or adjusted analyses, the trials were reanalysed (Table 5). Two hundred and seventy (96.1%) of the reanalysed articles reported point estimates consistent with an unadjusted estimate. However, 133/228 articles (58.3%) for which adjusted point estimates could be calculated were also consistent with at least one adjusted estimate. For the CIs, 116/178 articles (65.2%) that were reanalysed were consistent with at least one unadjusted interval. Far fewer articles (3/140; 2.1%) for which adjusted intervals could be computed were consistent with at least one adjusted interval.
To visualise the impact of not utilising adjusted inferential procedures, Figure 2A displays the unadjusted estimate against the uniform minimum variance unbiased estimate^{19} (UMVUE) for the 233 trials that terminated in stage 2 where the UMVUE could be computed. The difference between the unadjusted estimate and the UMVUE is presented as a percentage of $p_1 - p_0$ in Figure 2B. Together, these plots indicate that whilst the difference between the unadjusted and adjusted estimates may often be small, there are instances in which it is large; in 25 cases it exceeded 25% of the difference $p_1 - p_0$.

Similar visualisations are provided in Figure 3. Figure 3A displays the length of the reported unadjusted CI against the length of the corresponding adjusted CI proposed by Jennison and Turnbull^{18} for the 140 trials for which this adjusted CI could be computed. Figure 3B compares the respective coverage of these unadjusted and adjusted CIs for the 131 trials in which the target coverage was 0.95. In general, the length of the unadjusted CI is shorter than that of the corresponding adjusted CI, which is reflected in the coverage of the unadjusted procedure falling below the desired level in several instances.
Criteria  Stage 1  Stage 2  All 

Number (%)  Number (%)  Number (%)  
Reported a point estimate, p-value, or confidence interval for the primary outcome  72 (65.5)  287 (96.3)  375 (88.2) 
Reported a point estimate  70 (63.6)  287 (96.3)  372 (87.8) 
Stated the point estimate had been adjusted for the two-stage design  0 (0)  5 (1.7)  5 (1.2) 
Reported a p-value  0 (0)  4 (1.3)  4 (0.9) 
Stated the p-value had been adjusted for the two-stage design  0 (0)  3 (1.0)  3 (0.7) 
Reported a confidence interval  40 (36.4)  187 (62.8)  233 (54.8) 
Stated the confidence interval had been adjusted for the two-stage design  0 (0)  9 (3.0)  9 (2.1) 
Analysis performed assuming a sample size equal to that given in the design  27/70 (38.6)  72/278 (25.9)  99/348 (28.4) 
Criteria  Number (%) 

Reported a point estimate not stated as adjusted and clearly reported the number of successes and sample size assumed in the analysis  281/298 (94.2) 
Reported point estimate consistent with an unadjusted estimate  270/281 (96.1) 
Reported point estimate consistent with at least one adjusted estimate  133/228 (58.3) 
Reported a confidence interval not stated as adjusted and clearly reported its level, the number of successes, and sample size assumed in the analysis  178/298 (59.7) 
Reported confidence interval consistent with at least one unadjusted interval  116/178 (65.2) 
Reported confidence interval consistent with at least one adjusted interval  3/140 (2.1) 
Finally, note that 348 trials that were judged to have ended in stage 1 or stage 2 reported a point estimate, p-value, or CI, as well as the sample size required by their design. Amongst these, just 99 (28.4%) performed their analysis using the planned sample size. Differences between the planned and analysed sample sizes are shown in Figure 4.
Discussion
A large proportion of all phase II evidence comes from trials conducted using Simon’s two-stage design, as exemplified by the number of articles found to be eligible in this review. This necessitates that such studies be designed, analysed, and reported effectively. We evaluated the degree to which this is true through a comprehensive review of trials conducted over a 5-year period.
It is easy to argue that the reporting of design components was poor. In particular, the reproducibility of determined designs is limited by infrequent reporting of $p_0$, $p_1$, $\alpha$, $\beta$, and the optimality criteria in unison. It is also alarming that only 18.4% of trials provided a justification for $p_0$, considering the interpretation of study results is highly dependent on this value. Furthermore, it may be considered disappointing that so many trials chose standard target error-rates (e.g., $\alpha = 0.05$), as it has been highlighted that small concessions in this regard can lead to notable efficiency gains^{21}. Similar statements are true of the chosen optimality criteria, with previous work noting that the often-used minimax and null-optimal designs may routinely not be the most advisable choice^{13, 22}.
Few articles stated that they utilised adjusted inferential procedures. Given there is no additional cost to using these methods, this is a disappointing finding. Figures 2 and 3 together indicate that the result of this in practice may be that phase II trials utilising Simon’s two-stage design are conservative in their reported point estimate, but anti-conservative in the width of their CI. It is also surprising that only 54.8% of articles included a CI of any kind. The size of single-arm trials makes the uncertainty around a point estimate important to quantify; we encourage future studies to include such details.
Many final analyses were performed with a sample size different from that specified in the design (71.6%); this is not surprising given many trials indicated they planned to over-recruit to allow for attrition (Table 3). This highlights the need for trialists to plan for such design deviation, and echoes previous findings from Koyama and Chen^{23}. We initially hoped to extract data on how trials handled design deviation when interpreting their results. This was ultimately judged to be too subjective an endeavour to complete, as many studies interpreted their findings through informal comparison of their point estimate or CI bounds to $p_0$ and/or $p_1$.
The difficulties in practice of attaining the planned sample size may be reflected in the fact that only 55.3% of trials reported $r$. This lack of reporting of $r$ also indicates that many trials that utilise Simon’s two-stage design do not formally test the hypothesis they claim to. We note that methodology to comprehensively handle over- and under-running is available. Its use is depicted in Figure 5, which provides the error-rates for 45 trials when the methodology of Englert and Kieser^{7} is implemented. Using this methodology, trials are assured to conform to their desired type-I error-rate, and it appears sample sizes that enable power to reach close to the desired level may have been achieved in practice. As a contrast, the error-rates if $r$ was retained at the final analysis are also shown, while other possible methods of interpreting trial results are given in the Supplementary Materials. We note that without utilising established methodology to specifically account for design deviation, many trials may be interpreting their findings in a manner associated with a high probability of erroneous decision making.
We acknowledge a number of limitations to our work. Firstly, only a 10% duplicate extraction was performed. Given the strength of our findings, though, we note that it is unlikely our conclusions would be altered by additional duplicate extractions. It is also impossible to be certain that those trials that did not state they utilised an adjusted inferential procedure had used an unadjusted method. However, our reanalyses (Table 5) do provide evidence that this may be the case.
Overall, in light of work that has assessed whether reported randomised trials conform to the recommendations of the CONSORT guidance^{24}, our findings should perhaps not be surprising. Nonetheless, it may have been hoped that the simplicity of Simon’s design would lead to effective reporting. Our results indicate that a CONSORT extension specific to the requirements of single-arm oncology trials may be warranted.
References
 1 Grayling M, Dimairo M, Mander A, Jaki T. A review of perspectives on the use of randomization in phase II oncology trials. JNCI  J Natl Cancer Inst 2019;111:1255–62.
 2 Langrand-Escure J, Rivoirard R, Oriol M, et al. Quality of reporting in oncology phase II trials: A 5-year assessment through systematic review. PLoS One 2017;12:e0185536.
 3 Eisenhauer E, Therasse P, Bogaerts J, et al. New response evaluation criteria in solid tumours: Revised RECIST guideline (version 1.1). Eur J Cancer 2009;45:228–47.
 4 Simon R. Optimal two-stage designs for phase II clinical trials. Control Clin Trials 1989;10:1–10.
 5 Ivanova A, Paul B, Marchenko O, Song G, Patel N, Moschos S. Nineyear change in statistical design, profile, and success rates of phase II oncology trials. J Biopharm Stat 2016;26:141–9.
 6 Belin L, Broet P, De Rycke Y. A rescue strategy for handling unevaluable patients in Simon’s two-stage design. PLoS One 2015;10:e0137586.
 7 Englert S, Kieser M. Methods for proper handling of overrunning and underrunning in phase II designs for oncology trials. Stat Med 2015;34:2128–37.
 8 Zhao J, Yu M, Feng X. Statistical inference for extended or shortened phase II studies based on Simon’s two-stage designs. BMC Med Res Methodol 2015;15:48.
 9 Bowden J, Wason J. Identifying combined design and analysis procedures in two‐stage trials with a binary end point. Stat Med 2012;31:3874–84.
 10 Grayling M, Mander A. Do single‐arm trials have a role in drug development plans incorporating randomised trials? Pharm Stat 2016;15:143–51.
 11 Grellety T, PetitMoneger A, Diallo A, MathoulinPelissier S, Italiano A. Quality of reporting of phase II trials: A focus on highly ranked oncology journals. Ann Oncol 2014;25:536–41.
 12 Hanfelt J, Slack R, Gehan E. A modification of Simon’s optimal design for phase II trials when the criterion is median sample size. Control Clin Trials 1999;20:555–66.
 13 Jung S, Lee T, Kim K, George S. Admissible two‐stage designs for phase II cancer clinical trials. Stat Med 2004;23:561–9.
 14 Mander A, Thompson S. Two-stage designs optimal under the alternative hypothesis for phase II cancer clinical trials. Contemp Clin Trials 2010;31:572–8.
 15 Clopper C, Pearson E. The use of confidence or fiducial limits illustrated in the case of the binomial. Biometrika 1934;26:404–13.
 16 Porcher R, Desseaux K. What inference for twostage phase II trials? BMC Med Res Methodol 2012;12:117.
 17 Jung S, Owzar K, George S, Lee T. p-value calculation for multistage phase II cancer clinical trials. J Biopharm Stat 2006;16:765–75.
 18 Jennison C, Turnbull B. Confidence intervals for a binomial parameter following a multistage test with application to MIL-STD 105D and medical trials. Technometrics 1983;25:49–58.
 19 Jung S, Kim K. On the estimation of the binomial probability in multistage clinical trials. Stat Med 2004;23:881–96.
 20 Wu Y, Shih W. Approaches to handling data when a phase II trial deviates from the prespecified Simon’s two-stage design. Stat Med 2008;27:6190–208.
 21 Khan I, Sarker SJ, Hackshaw A. Smaller sample sizes for phase II trials based on exact tests with actual error rates by tradingoff their nominal levels of significance and power. Br J Cancer 2012;107:1801–9.
 22 Mander A, Wason J, Sweeting M, Thompson S. Admissible two-stage designs for phase II cancer clinical trials that incorporate the expected sample size under the alternative hypothesis. Pharm Stat 2012;11:91–6.
 23 Koyama T, Chen H. Proper inference from Simon’s two-stage designs. Stat Med 2008;27:3145–54.
 24 Turner L, Shamseer L, Altman D, Schulz K, Moher D. Does use of the CONSORT statement impact the completeness of reporting of randomised controlled trials published in medical journals? A Cochrane review. Syst Rev 2012;1:60.
 25 Guo H, Liu A. A simple and efficient biasreduced estimator of response probability following a group sequential phase II trial. J Biopharm Stat 2005;15:773–81.
 26 Chang M, Wieand H, Chang V. The bias of the sample proportion following a group sequential phase II clinical trial. Stat Med 1989;8:563–70.
 27 Pepe M, Feng Z, Longton G, Koopmeiners J. Conditional estimation of sensitivity and specificity from a phase 2 biomarker study allowing early termination for futility. Stat Med 2009;28:762–79.
 28 Tsai W, Chi Y, Chen C. Interval estimation of binomial proportion in clinical trials with a two-stage design. Stat Med 2008;27:15–35.
 29 Bryant J, Day R. Incorporating toxicity considerations into the design of twostage phase II clinical trials. Biometrics 1995;51:1372–83.
 30 Fleming T. Onesample multiple testing procedure for phase II clinical trials. Biometrics 1982;38:143–51.
Supplementary materials to ‘Two-stage single-arm trials are rarely reported adequately’
Supplementary methods
Two-stage group-sequential single-arm trial designs for a binary primary outcome
In the main manuscript, we outlined the design of “Simon two-stage” trials. Here, we elaborate on the statistical details of performing adjusted inference and handling design deviation in such studies.
First, let $f(s; m, p)$ and $F(s; m, p)$ be the probability mass and cumulative distribution functions of a $\text{Bin}(m, p)$ random variable. Next, let $S_j$ be the number of successes seen for the primary outcome in the first $N_j$ patients. Finally, set $N_1 = n_1$ and $N_2 = n_1 + n_2$. Then, define $P(s, j \mid p)$ to be the probability that $s$ successes are observed by the end of stage $j$, with the trial terminating at stage $j$. We then have

$$P(s, 1 \mid p) = f(s; n_1, p), \quad s \in \{0, \ldots, r_1\},$$

$$P(s, 2 \mid p) = \sum_{s_1 = r_1 + 1}^{\min(n_1, s)} f(s_1; n_1, p) f(s - s_1; n_2, p), \quad s \in \{r_1 + 1, \ldots, n_1 + n_2\}.$$

Any point-estimation procedure, $\hat{p}$, must specify values for all possible numbers of successes that could be seen on trial termination at the end of each stage. We denote these values by $\hat{p}(s, j)$; specifically, we require them for $(s, j) \in \{(0, 1), \ldots, (r_1, 1), (r_1 + 1, 2), \ldots, (n_1 + n_2, 2)\}$.
As discussed, a ‘naive’ point-estimation procedure, $\hat{p}_{\text{naive}}$, that does not account for the incorporated interim analysis, sets $\hat{p}_{\text{naive}}(s, m) = s/m$. In our work, we also consider six adjusted estimation procedures that have been proposed in the literature. For this, some final notation is useful: for any point-estimation procedure $\hat{p}$, we can compute its expected value and bias through
$$E(\hat{p} \mid p) = \sum_{(s, m) \in \mathcal{S}} \hat{p}(s, m)\, P(s, m \mid p), \qquad \text{bias}(\hat{p} \mid p) = E(\hat{p} \mid p) - p.$$
The considered adjusted point estimates are then:

The bias-subtracted estimate^{25}
$$\hat{p}_{\text{BS}}(s, m) = \hat{p}_{\text{naive}}(s, m) - \text{bias}\{\hat{p}_{\text{naive}} \mid p = \hat{p}_{\text{naive}}(s, m)\}.$$

The bias-adjusted estimate^{26}, which is given by the numerical solution to
$$\hat{p}_{\text{BA}}(s, m) = \hat{p}_{\text{naive}}(s, m) - \text{bias}\{\hat{p}_{\text{naive}} \mid p = \hat{p}_{\text{BA}}(s, m)\}.$$

The uniform minimum variance unbiased estimate^{19}
$$\hat{p}_{\text{UMVUE}}(s, m) = \begin{cases} s/n_1, & m = n_1,\\[4pt] \dfrac{\sum_{j = r_1 + 1}^{\min(s, n_1)} \binom{n_1 - 1}{j - 1} \binom{n_2}{s - j}}{\sum_{j = r_1 + 1}^{\min(s, n_1)} \binom{n_1}{j} \binom{n_2}{s - j}}, & m = n. \end{cases}$$

The uniform minimum variance conditionally unbiased estimate^{27}, which is unbiased, with minimal variance, conditional on the trial proceeding to stage two; see Pepe et al.^{27} for its closed form.

The conditional estimate^{28}, which sets $\hat{p}_{\text{C}}(s, m) = s/m$ for $m = n_1$. When $m = n$, $\hat{p}_{\text{C}}(s, n)$ is the numerical maximiser over $p$ of the conditional likelihood
$$\frac{P(s, n \mid p)}{1 - F(r_1; n_1, p)}.$$

The median unbiased estimate^{23}, which sets $\hat{p}_{\text{MUE}}(s, m)$ as the numerical solution for $p$ to
$$\Pr\{(S, M) \succeq (s, m) \mid p\} = \frac{1}{2},$$
where $(S, M)$ denotes the random termination point of the trial and $\succeq$ the stagewise ordering of the sample space.
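To make the naive-versus-adjusted contrast concrete, the sketch below implements the naive estimate alongside the UMVUE, using the standard construction for two-stage designs (function names are our own; the other adjusted estimates could be added analogously):

```python
from math import comb

def naive_est(s, m):
    """Naive estimate: the sample proportion, ignoring the interim analysis."""
    return s / m

def umvue(s, m, n1, n2, r1):
    """UMVUE of the success probability for a two-stage design that
    continues past n1 patients only when s_{n1} > r1."""
    if m == n1:  # stopped at stage one: the sample proportion is unbiased
        return s / n1
    # terms with s - j > n2 vanish, as comb(n2, s - j) is then zero
    js = range(r1 + 1, min(s, n1) + 1)
    num = sum(comb(n1 - 1, j - 1) * comb(n2, s - j) for j in js)
    den = sum(comb(n1, j) * comb(n2, s - j) for j in js)
    return num / den
```

For example, for a (hypothetical) design with n1 = 4, n2 = 6, r1 = 1, a trial that only just continues, ending with 2 successes in 10 patients, has naive estimate 0.2 but UMVUE 0.5, illustrating how strongly accounting for the interim analysis can matter.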
We also consider two adjusted confidence interval (CI) procedures and several unadjusted CI procedures. As for the point-estimation procedures above, any CI procedure, $(\hat{p}_L, \hat{p}_U)$, must specify $\{\hat{p}_L(s, m), \hat{p}_U(s, m)\}$ for all possible combinations $(s, m)$ on trial termination. The first considered adjusted CI procedure^{18} sets $\hat{p}_L(s, m)$ and $\hat{p}_U(s, m)$ for a $100(1 - \alpha)$% CI as the numerical solutions to
$$\Pr\{(S, M) \succeq (s, m) \mid p = \hat{p}_L(s, m)\} = \frac{\alpha}{2}, \qquad \Pr\{(S, M) \preceq (s, m) \mid p = \hat{p}_U(s, m)\} = \frac{\alpha}{2},$$
for $(S, M)$ the random termination point of the trial and $\succeq$ the stagewise ordering of the sample space. The second procedure^{16}, a mid-$p$ approach, sets $\hat{p}_L(s, m)$ as the numerical solution to
$$\Pr\{(S, M) \succ (s, m) \mid p = \hat{p}_L(s, m)\} + \frac{1}{2} P\{s, m \mid \hat{p}_L(s, m)\} = \frac{\alpha}{2},$$
and similarly for $\hat{p}_U(s, m)$.
We compare these to a number of standard unadjusted procedures, in particular Clopper–Pearson^{15}, which for $(s, m)$ sets
$$\hat{p}_L(s, m) = \frac{s}{s + (m - s + 1) F_{1 - \alpha/2;\, 2(m - s + 1),\, 2s}}, \qquad \hat{p}_U(s, m) = \frac{(s + 1) F_{1 - \alpha/2;\, 2(s + 1),\, 2(m - s)}}{m - s + (s + 1) F_{1 - \alpha/2;\, 2(s + 1),\, 2(m - s)}},$$
where $F_{q;\, d_1, d_2}$ is the $100q$% quantile of an $F$ distribution with $d_1$ and $d_2$ degrees of freedom. Supplementing this, $\hat{p}_L(0, m) = 0$ and $\hat{p}_U(m, m) = 1$. For further information on the above, see Porcher and Desseaux^{16}.
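The Clopper–Pearson limits are often computed from the $F$ (or beta) quantiles as above; equivalently, one can invert the binomial tail probabilities numerically. A self-contained sketch of the latter (function names are our own; bisection is used in place of $F$-distribution routines):

```python
from math import comb

def binom_cdf(k, m, p):
    """P(X <= k) for X ~ Bin(m, p)."""
    return sum(comb(m, j) * p ** j * (1 - p) ** (m - j) for j in range(k + 1))

def clopper_pearson(s, m, alpha=0.05):
    """Exact 100(1 - alpha)% Clopper-Pearson CI for a binomial proportion."""
    def invert(g, target):
        # bisection for p in [0, 1]; g must be increasing in p
        lo, hi = 0.0, 1.0
        for _ in range(200):
            mid = (lo + hi) / 2
            if g(mid) < target:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2
    # lower limit solves P(X >= s | p) = alpha / 2; it is 0 when s = 0
    lower = 0.0 if s == 0 else invert(lambda p: 1 - binom_cdf(s - 1, m, p), alpha / 2)
    # upper limit solves P(X <= s | p) = alpha / 2; it is 1 when s = m
    upper = 1.0 if s == m else invert(lambda p: -binom_cdf(s, m, p), -alpha / 2)
    return lower, upper
```

The defining tail equations can be checked directly on the returned limits, which makes this form convenient for verifying any $F$-quantile implementation.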
Finally, we describe how a determination can be made on $H_0$ in the presence of design deviation. Englert and Kieser^{7} provide comprehensive methodology for this by describing the decision rules for any design in terms of a discrete conditional error function. Whilst they allow for the interim analysis being conducted after any number of patients (i.e., a different value from that specified in the design), here we describe handling of deviation only in the timing of the stage-two analysis, as just one paper noted that they conducted their interim analysis at an unplanned point. To determine whether to reject $H_0$ at the end of the second stage, if the analysis is conducted based on $\tilde{n}$ patients' data ($\tilde{n}$ not necessarily equal to $n$), a $p$-value based on the second-stage data only is first computed
$$p_2 = 1 - F(s_{\tilde{n}} - s_{n_1} - 1; \tilde{n} - n_1, p_0).$$
Then, $H_0$ is rejected if $p_2 \leq ce(s_{n_1})$, for $ce(\cdot)$ the discrete conditional error function of the original design
$$ce(x_1) = \begin{cases} 0, & x_1 \leq r_1,\\ 1 - F(r - x_1; n_2, p_0), & x_1 > r_1. \end{cases}$$
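Assuming the natural discrete conditional error for a Simon-type design (the probability, given the first-stage successes, that the original design would reject), this rule can be sketched and sanity-checked: with no deviation in the second-stage sample size, it must reduce exactly to the original rejection rule $s_n > r$. All function names below are our own:

```python
from math import comb

def binom_tail(k, m, p):
    """P(X >= k) for X ~ Bin(m, p)."""
    return sum(comb(m, j) * p ** j * (1 - p) ** (m - j)
               for j in range(max(k, 0), m + 1))

def ek_reject(x1, x2, n2_realised, n1, n2, r1, r, p0):
    """Conditional-error test of H0 when the realised second-stage sample
    size differs from the planned n2. x1: first-stage successes; x2:
    successes among the n2_realised second-stage patients."""
    if x1 <= r1:  # trial should have stopped: conditional error is zero
        return False
    p2 = binom_tail(x2, n2_realised, p0)  # second-stage p-value
    ce = binom_tail(r - x1 + 1, n2, p0)   # conditional error of the design
    return p2 <= ce
```

The reduction property provides a useful self-check of any implementation: when n2_realised equals n2, the decision agrees with $x_1 + x_2 > r$ for every possible outcome.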
Literature review
Inclusion criteria
As discussed in the main manuscript, PubMed was searched using the following term: (“2013/01/01”[Date - Publication] : “2017/12/31”[Date - Publication]) AND Clinical Trial[Publication Type] AND (phase II[Title/Abstract] OR phase 2[Title/Abstract]) AND (cancer[All Fields] OR oncology[All Fields]). Thus, we reviewed clinical trial publications over a five-year period, checking all such articles that included the phrase ‘phase II’ or ‘phase 2’ in the title/abstract and the word ‘cancer’ or ‘oncology’ in any field.
We wished to include as many treatment arms as possible, whilst not biasing our findings. Our initial inclusion criteria for arms reported in the returned records were therefore as follows:

Accessible full-length articles. No short communications were allowed, as the quality of reporting would be expected to be lower in such articles. We required access to the complete text for inclusion.

Primary publication on an arm’s complete results. No secondary or preliminary analyses were allowed as the quality of reporting would be expected to be lower in such articles.

Treatment arms designed and analysed using “Simon’s two-stage design”. Thus

Treatment arms that were designed using Simon’s approach, but for which inference was performed as if the primary outcome was a time-to-event variable (e.g., a dichotomised version of PFS was used for design purposes, but inference was based on the exact PFS event-time data), were to be excluded. This was because we planned to reanalyse included arms using appropriate adjusted methods for a binary primary outcome and compare our findings to the reported results; such comparisons would not be logical across analysis frameworks.

Treatment arms using Bayesian methods were to be excluded, as analysis and reporting of such trials would be expected to differ (e.g., computation and reporting of adjusted point estimates would be expected to be less frequent).

Treatment arms with (e.g., where the primary outcome is toxicity) were to be excluded. Though such trials could easily be analysed using adjusted procedures like those outlined above, we anticipated some authors may not be aware of this.

Treatment arms for which the design was intentionally modified were to be excluded. Such a design would then correspond to an adaptive approach and not the considered group-sequential design.

Five hundred and thirty-four articles were randomly selected for evaluation for inclusion by MJG and APM based on the above criteria. The authors agreed on inclusion for 520 articles (97.3%). The authors disagreed on the remaining 14 articles for six reasons, which led to the following additional clarifications on inclusion:

Treatment arms stated to have used “Simon’s twostage design” but for which no additional information was available to confirm this should be excluded (e.g., no reporting of stagewise sample sizes or other discussion of interim analysis). We felt inclusion of such articles would bias downwards the identified quality of reporting and that presentation of what are likely conservative values in some instances would be more appropriate.

Treatment arms for which the results of sequential trials are reported simultaneously should be excluded (e.g., simultaneous reporting of phase I and II results). We again felt inclusion of such articles would potentially bias our results, as substantial focus may not be given to the Simondesigned component. In some cases, it was also unclear as to whether data had been combined across phases and if reanalysis using the considered adjusted inference procedures would then be appropriate.

Arms from articles reporting the results of multiple treatment arms should be included only if each arm was designed and analysed independently (e.g., the overarching design could not be a randomised selection approach, or similar). We anticipated these trials may focus more on arm differences and not on evaluating point estimates or confidence intervals for each arm separately.

Treatment arms for which it was stated formal interim monitoring of toxicity was performed should be excluded. Such trials would be more similar to a Bryant and Day^{29} type approach than Simon’s methodology.

Treatment arms stated to have been designed using “Fleming’s” approach^{30} should be excluded, as we felt such trials may not report interim sample sizes or stopping rules as frequently given the way Fleming designs are identified.

Treatment arms for which it was stated there was more than one primary outcome should be excluded. For one article it was not clear how a second primary outcome may have influenced design, analysis, and reporting. We therefore decided to exclude all subsequent articles with more than one primary outcome.
A decision was then taken that, given there was by and large agreement on inclusion, the remaining articles were to be assessed for inclusion by MJG only, but that discussion with APM would be conducted if required for any individual article.
Data extraction
For each of the 534 articles reviewed by both authors, a pilot duplicate data extraction was performed for the treatment arms they deemed to be eligible for inclusion. In total, data was extracted by both authors for 58 eligible treatment arms; in each case data was extracted for 28 questions. These were Questions 1, 4, 5, 7–13, 15, 18–21, and 23–34 from Table 1 in the main manuscript, along with “What was the total number of efficacy evaluable patients?”. Following the pilot evaluation, this last question was deemed to be an unsuitable method of extracting data that would facilitate trial reanalyses, as some studies did not utilise the number of evaluable patients in their analysis. This was therefore replaced by Question 22, “If ‘Yes’ to Q21, what was the sample size assumed in the analysis?”. Following the pilot data extraction, the remaining questions listed in Table 1 (Questions 2, 3, 6, 14, 16, and 17) were also added to enhance our evaluation. Data for these additional questions was subsequently extracted by MJG for those arms included during the pilot.
We are thus able to evaluate the replicability of extraction across 58 treatment arms and 27 of the final questions used for our results. As discussed in the main manuscript, across 14 questions requiring non-binary value extraction (Questions 4, 5, 7–13, 20, 23, 26, 33, 34), the duplicate extractions agreed 96.2% of the time. Across a wider set of 26 questions (all except Question 1, which we omit as equality for this is implied by performing a comparison), agreement occurred 94.3% of the time. With replicability high, duplicate extraction was not performed for the remaining included arms.
Note that data was extracted from the included articles and their supplementary materials, but not from any linked protocols. This is because we believed all questions amounted to critical information that should be present in the primary publication reporting a trial’s results.
Trial reanalyses
For those trials that reported a point estimate or CI not stated to have been adjusted, the reported results were reanalysed where possible to evaluate consistency with unadjusted and adjusted procedures. We limited this reanalysis to those trials that terminated in stage two since, as can be seen above, point-estimation and CI procedures typically only modify standard inferential methods if a trial proceeds to stage two. To this end, denote the realised number of patients assumed in the analysis by $m$, where $m$ is that extracted in Question 22.
First, the reported point estimate was checked for consistency with $s/m$ to the reported number of decimal places. For example, suppose the data for Questions 22 and 23 indicate 21 patients were assumed in the analysis with 7 successes, and Question 25 indicates that the reported point estimate was 0.33. This would be consistent with $s/m$ to the reported number of decimal places, as $7/21 = 0.33$ to 2 decimal places. So long as $n_1$ and $r_1$ were also reported clearly (Questions 10 and 12), this point estimate was also evaluated for whether it was consistent with each of the six adjusted point estimates described earlier.
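The consistency check just described amounts to a simple rounding comparison. A sketch of it (the function name is our own; the reported estimate is passed as a string so that its reporting precision is preserved):

```python
def consistent(reported, successes, sample_size):
    """True if the reported estimate equals successes/sample_size when
    rounded to the number of decimal places in the reported value."""
    decimals = len(reported.split(".")[1]) if "." in reported else 0
    return round(successes / sample_size, decimals) == float(reported)
```

For the worked example, consistent("0.33", 7, 21) returns True, since 7/21 rounds to 0.33 at 2 decimal places.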
An equivalent approach was performed to evaluate whether any reported CI was consistent with one of four unadjusted (Clopper–Pearson, Wald, Wilson, and Blyth–Still–Casella; the four unadjusted CI methods stated to have been used by any included article across our review) and two adjusted CIs (those given earlier).
For those included treatment arms that reported a 95% CI, we also compared the coverage provided by their CI procedure to the adjusted method of Jennison and Turnbull^{18} when the true success probability is $p$. To this end, note that coverage can be computed for any $(\hat{p}_L, \hat{p}_U)$ and $p$ via
$$\text{Coverage}(p) = \sum_{(s, m)} P(s, m \mid p)\, \mathbb{1}\{\hat{p}_L(s, m) \leq p \leq \hat{p}_U(s, m)\},$$
where the sum is over all possible termination points $(s, m)$ and $\mathbb{1}\{\cdot\}$ is the indicator function.
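This coverage computation can be sketched by combining the termination probabilities with any CI procedure supplied as a callable (all names are our own; notation as earlier in these supplementary methods):

```python
from math import comb

def binom_pmf(k, m, p):
    """P(X = k) for X ~ Bin(m, p); zero outside 0 <= k <= m."""
    return comb(m, k) * p ** k * (1 - p) ** (m - k) if 0 <= k <= m else 0.0

def coverage(ci, p, n1, n2, r1):
    """Probability that the interval ci(s, m) covers p, summed over the
    full termination sample space of the two-stage design."""
    n = n1 + n2

    def term_prob(s, m):
        if m == n1:  # stopped for futility at stage one
            return binom_pmf(s, n1, p)
        return sum(binom_pmf(j, n1, p) * binom_pmf(s - j, n2, p)
                   for j in range(r1 + 1, min(s, n1) + 1))

    cov = 0.0
    for s in range(r1 + 1):  # stage-one terminations
        lo, hi = ci(s, n1)
        cov += term_prob(s, n1) * (lo <= p <= hi)
    for s in range(r1 + 1, n + 1):  # stage-two terminations
        lo, hi = ci(s, n)
        cov += term_prob(s, n) * (lo <= p <= hi)
    return cov
```

Passing an always-covering interval returns one for any p, which provides a simple check of the bookkeeping over the sample space.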
Finally, Figure 5 compares the unconditional operating characteristics for the realised sample sizes, when either $r$ is retained, or the method of Englert and Kieser^{7} described earlier is implemented. For example, we compare the type I error-rates via
$$\sum_{x_1 = r_1 + 1}^{n_1} f(x_1; n_1, p_0) \Pr\{P_2 \leq ce(x_1)\},$$
where $P_2$ is the random variable denoting the distribution of the second-stage $p$-value under the boundary of $H_0$ for a second-stage sample size of $\tilde{n} - n_1$. Power is compared similarly.
Supplementary results
Additional result interpretation approaches
Extending Figure 5, Supplementary figure 1 provides the probability of several quantities that trials were routinely observed to use in the interpretation of their results. These are: , , , and . Note that, e.g., the first quantity is computed as
and similarly for the others.
Observe that the quantities conditioned on do not generally take values close to the target type I error-rate 0.05. Similarly, the quantities conditioned on are not generally close to the target power 0.8.
Included articles that reported the results of more than one eligible treatment arm
Supplementary Tables 1–4 and Supplementary Figures 2–5 display the corresponding results to those given in the main manuscript for the 204 treatment arms that were included across the 75 articles that reported the results of more than one eligible treatment arm.
The results are qualitatively similar to those presented in the main manuscript. In particular, for only 26.0% of the 204 treatment arms were , , , , the optimality criteria, , , , and reported clearly. Adjusted point estimates (4 arms, 2.0%) and adjusted confidence intervals (12 arms, 5.9%) were again rarely reported, with little evidence to suggest that adjusted methods had been utilised without this being stated (Supplementary table 4).
Note that a larger proportion of the 204 treatment arms described here (53.9%) were judged to have ended in stage one than for the 425 treatment arms described in the main manuscript (25.9%). This is a consequence of many multi-arm trials struggling to recruit to treatment arms for rare conditions.
Descriptor  Value  By article  By arm 

Number (%)  Number (%)  
Publication  2013  21 (28.0)  55 (27.0) 
year  2014  13 (17.3)  36 (17.6) 
2015  23 (30.7)  69 (33.8)  
2016  12 (16.0)  32 (15.7)  
2017  6 (8.0)  12 (5.9)  
Journal  J Clin Oncol  8 (10.7)  29 (14.2) 
Ann Oncol  6 (8.0)  21 (10.3)  
Pediatr Blood Cancer  3 (4.0)  19 (9.3)  
Eur J Cancer  5 (6.7)  13 (6.4)  
Lancet Oncol  6 (8.0)  13 (6.4)  
J Neurooncol  4 (5.3)  11 (5.4)  
Other (27 journals)  43 (57.3)  98 (48.0)  
Cancer  Lung  9 (12.0)  31 (15.2) 
Lymph  9 (12.0)  21 (10.3)  
Brain/CNS  7 (9.3)  17 (8.3)  
Breast  7 (9.3)  16 (7.8)  
Blood  5 (6.7)  14 (6.9)  
Other  38 (50.7)  105 (51.5)  
Stage of  1  N/A  110 (53.9) 
termination  2: Stated the criteria had been met for progression  N/A  40 (19.6) 
2: Did not state the criteria had been met for progression  N/A  43 (21.1)  
Unclear  N/A  11 (5.4) 
Criteria  Number (%) 

Used the phrase “Simon two-stage” (or similar) or cited Simon (1989)^{4}  157 (77.0) 
Clearly stated  192 (94.1) 
Gave a justification for  43 (21.1) 
Citation given  32 (15.7) 
Justification given but no citation  11 (5.4) 
Clearly stated  191 (93.6) 
Clearly stated  173 (84.8) 
57 (27.9)  
76 (37.3)  
Clearly stated  173 (84.8) 
95 (46.6)  
40 (19.6)  
Clearly stated the optimality criteria  112 (54.9) 
Null-optimal  69 (33.8) 
Minimax  41 (20.1) 
Admissible  0 (0) 
Other  2 (1.0) 
Clearly stated  165 (80.9) 
Clearly stated  130 (63.7) 
Clearly stated  176 (86.3) 
Clearly stated  163 (79.9) 
Indicated the recruitment target was greater than  32 (15.7) 
Clearly stated and  185 (90.7) 
Clearly stated , , , and  164 (80.4) 
Clearly stated , , , and  130 (63.7) 
Clearly stated , , , , and the optimality criteria  86 (42.2) 
Clearly stated , , , , the optimality criteria, , , , and  53 (26.0) 
Criteria  Stage 1  Stage 2  All 

Number (%)  Number (%)  Number (%)  
Reported a point estimate, p-value, or confidence  72 (65.5)  81 (97.6)  163 (79.9) 
interval for the primary outcome  
Reported a point estimate  72 (65.5)  81 (97.6)  163 (79.9) 
Stated the point estimate had been adjusted for  0 (0)  4 (4.8)  4 (2.0) 
the two-stage design  
Reported a p-value  0 (0)  0 (0)  0 (0) 
Stated the p-value had been adjusted for the  0 (0)  0 (0)  0 (0) 
two-stage design  
Reported a confidence interval  28 (25.5)  57 (68.7)  93 (45.6) 
Stated the confidence interval had been  1 (0.9)  9 (10.8)  12 (5.9) 
adjusted for the two-stage design  
Analysis performed assuming a sample size equal  14/57 (24.6)  13/73 (17.8)  27/130 (20.8) 
to that given in the design 
Criteria  Number (%) 

Reported a point estimate not stated as adjusted and clearly reported the  77/83 (92.8) 
number of successes and sample size assumed in the analysis  
Reported point estimate consistent with an unadjusted estimate  75/77 (97.4) 
Reported point estimate consistent with at least one adjusted estimate  35/66 (53.0) 
Reported a confidence interval not stated as adjusted and clearly reported its  
level, the number of successes, and sample size assumed in the analysis  48/84 (57.1) 
Reported confidence interval consistent with at least one unadjusted interval  37/48 (77.1) 
Reported confidence interval consistent with at least one adjusted interval  2/44 (4.5) 