Targeting the Uniformly Most Powerful Unbiased Test in Sample Size Reassessment Adaptive Clinical Trials with Deep Learning

12/16/2019, by Tianyu Zhan, et al.

In recent pharmaceutical drug development, adaptive clinical trials have become increasingly appealing due to ethical considerations and the ability to accommodate uncertainty while conducting the trial. Several methods have been proposed to optimize a certain study design within a class of candidates, but finding an optimal hypothesis testing strategy for a given design remains challenging, mainly due to the complex likelihood function involved. This problem is of great interest from both patient and sponsor perspectives, because the smallest sample size is required for the optimal hypothesis testing method to achieve a desired level of power. To address these issues, we propose a novel application of the deep neural network to construct the test statistics and the critical value with a controlled type I error rate in a computationally efficient manner. We apply the proposed method to a sample size reassessment confirmatory adaptive study MUSEC (MUltiple Sclerosis and Extract of Cannabis), demonstrating that the proposed method outperforms the existing alternatives. Simulation studies are also performed to demonstrate that our proposed method essentially recovers the underlying uniformly most powerful (UMP) unbiased test in several non-adaptive designs.


1 Introduction

Randomized clinical trials (RCTs) remain the gold standard for understanding the effect of a treatment or other intervention relative to placebo or standard of care (Diamond et al., 2015; Wu et al., 2017; Barnhart et al., 2018). To facilitate the development process and make it more ethical for patients, adaptive designs have become increasingly appealing in the past several decades, as they allow prospectively planned modifications to design aspects based on accumulated unblinded data (Bauer et al., 2016). For instance, the sample size reassessment adaptive approach prospectively plans adjustments to the sample size based on interim results (FDA, 2018). It has been a main focus of adaptive design methodology development, and remains the most frequently proposed adaptive design submitted to regulatory agencies, both the Food and Drug Administration (FDA) (Lin et al., 2016) and the European Medicines Agency (EMA) (Elsäßer et al., 2014).

A major challenge for such adaptive designs to be applied in confirmatory (Phase III) clinical studies is the justification of type I error rate control, which is required by regulatory agencies (FDA, 2018; EMA, 2007). By repeatedly looking at data with the possibility for interim sample size adjustment, one may inflate the type I error rate with usual test statistics (Bretz et al., 2009). There are many statistical methods proposed to maintain the trial integrity, for example by using weighted statistics (Cui et al., 1999; Lehmacher and Wassmer, 1999) or the combination test principle (Bauer and Kohne, 1994; Liu et al., 2002) from a frequentist perspective, and Bayesian methods (Inoue et al., 2002; Ciarleglio et al., 2015; Ciarleglio and Arendt, 2017, 2019) to accommodate randomness in observed data. The type I error control can be justified by either analytic derivation or proper simulation studies. Nevertheless, identifying the uniformly most powerful (UMP) unbiased tests (Lehmann and Romano, 2006) is also of great interest to both patients and sponsors, because the smallest sample size is required to achieve a desired level of power in a given study design. A safe and efficacious drug can be delivered more efficiently to meet unfulfilled medical needs. However, it remains challenging to derive the UMP unbiased tests in adaptive designs, due to the complex likelihood function introduced by interim adjustment rules.

In recent years, the Deep Neural Networks (DNN) approach has made remarkable progress in various domains, especially in image recognition and natural language processing (Vogelstein et al., 2007; Perry et al., 2019; Schulz et al., 2019). It not only provides a powerful functional representation of complex patterns in data, but also fully automates the feature-engineering step required by earlier machine-learning techniques (Chollet and Allaire, 2018).

In this article, we propose a novel hypothesis testing framework that constructs the test statistic with a DNN to approximate the UMP unbiased test in finite samples. The DNN essentially learns the underlying probability that the observed data are drawn from the alternative hypothesis rather than the null hypothesis. A proper cutoff, or critical value, can then be computed by simulation from null data to control the type I error rate at a nominal level. To avoid a time-consuming Monte Carlo step for each observed dataset and to keep the decision rule pre-specified, we further build a second DNN to model the critical value. A similar strategy was adopted by Chen and Zhang (2009), where multivariate adaptive splines for analysis of longitudinal data (MASAL) (Zhang, 1997, 1999) was utilized to estimate the empirical critical value as a function of the number of genotyped markers in genomewide association (GWA) studies. We apply our method to the Phase III adaptive clinical trial MUSEC (MUltiple Sclerosis and Extract of Cannabis) with sample size reassessment (Zajicek et al., 2012), demonstrating that the proposed method has significantly higher power than existing alternatives, and hence requires a smaller sample size. Simulations are also performed to show that our proposed two-fold DNN guided test closely approximates a UMP unbiased level-$\alpha$ test in several cases of simple or composite hypothesis testing with known or unknown nuisance parameters.

The remainder of this article is organized as follows. In Section 2, we introduce our DNN-guided hypothesis testing framework in the context of non-adaptive designs with a simple hypothesis, and provide a short review of Deep Neural Networks (DNN). We further generalize the framework to composite hypothesis testing and adaptive designs in Section 3. Our method is applied to the sample size reassessment adaptive design MUSEC in Section 4. In Section 5, we further perform simulation studies to compare our method with known UMP unbiased tests in several settings. Concluding remarks are provided in Section 6.

2 Simple hypothesis testing

We start with a simple hypothesis testing problem for the mean $\mu$ of a Normal distribution, with the standard deviation $\sigma$ as a nuisance parameter. The null hypothesis $H_0: \mu = \mu_0$ is tested against the alternative hypothesis $H_1: \mu = \mu_1$ with the type I error rate controlled at a nominal level $\alpha$, for example $\alpha = 0.05$. Denote by $f_0$ and $f_1$ the probability density functions corresponding to $H_0$ and $H_1$, and let $\mathbf{X} = (X_1, \ldots, X_n)$ be a sample from $f_i$, for $i = 0, 1$.

2.1 Known nuisance parameter

Further assume that the nuisance parameter $\sigma$ is known at value $\sigma_0$. We define a test function $\phi(\mathbf{x})$ taking the value $1$ if the null hypothesis is rejected, and $0$ otherwise. The rejection region is given by $\{\mathbf{x}: \phi(\mathbf{x}) = 1\}$. Based on the Neyman-Pearson Lemma, a test that satisfies

$$\phi(\mathbf{x}) = \begin{cases} 1, & \text{if } f_1(\mathbf{x}) > k\, f_0(\mathbf{x}), \\ 0, & \text{if } f_1(\mathbf{x}) < k\, f_0(\mathbf{x}), \end{cases} \qquad (1)$$

for some $k \geq 0$, is a uniformly most powerful (UMP) level-$\alpha$ test, and is also a UMP unbiased level-$\alpha$ test (Lehmann and Romano, 2006). To explicitly derive this test, one first expresses the rejection region in (1) as $\{T(\mathbf{x}) > c\}$, where $T(\mathbf{x})$ is a sufficient statistic for $\mu$. The constant $c$ is chosen such that the probability of this region under $H_0$ equals $\alpha$. In this simple setup, the $z$-test is a UMP level-$\alpha$ test. However, when the likelihood function is complicated, it is usually hard to derive the distribution of the sufficient statistic in a finite sample, as is the case for the adaptive design we consider in Section 4.

As an alternative, we formulate the hypothesis testing in the context of a binary classification problem that categorizes whether $\mathbf{X}$ is sampled from the alternative hypothesis $H_1$ or from the null hypothesis $H_0$. We introduce a latent indicator $Z$ indicating where $\mathbf{X}$ is sampled from. Given $Z = i$, the conditional probability density function or probability mass function of $\mathbf{X}$ is equal to $f_i$, for $i = 0, 1$. Therefore, the rejection region in (1) can be expressed as

$$\{\mathbf{x}: g(\mathbf{x}) > C\}, \qquad (2)$$

where $g(\mathbf{x}) = \Pr(Z = 1 \mid \mathbf{X} = \mathbf{x})$ and $C$ is a constant. A larger value of $g(\mathbf{x})$ indicates that $\mathbf{x}$ is more likely to be drawn from $H_1$ as compared to $H_0$. The constant $C$ in (2) is set such that

$$P_{H_0}\{g(\mathbf{X}) > C\} = \alpha \qquad (3)$$

to control the type I error rate at $\alpha$. Based on the sufficiency part of the Neyman-Pearson Lemma, any test that satisfies (2) and (3) is a UMP level-$\alpha$ test. The question then becomes how to model $g(\mathbf{x})$ in (2).
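To see why thresholding the classification probability recovers the likelihood ratio test, note that when the training data contain equal numbers of null and alternative samples, Bayes' rule gives $g(\mathbf{x}) = f_1(\mathbf{x})/\{f_0(\mathbf{x}) + f_1(\mathbf{x})\}$. The short R sketch below, with hypothetical values of $\mu_0$, $\mu_1$ and $\sigma_0$, checks numerically that $g$ is a strictly increasing function of the likelihood ratio, so the rejection regions in (1) and (2) coincide for matching cutoffs.

```r
# Minimal numerical check (assumed setup: equal numbers of null and alternative
# training samples, so g(x) = f1(x) / {f0(x) + f1(x)} by Bayes' rule).
mu0 <- 0; mu1 <- 0.5; sigma0 <- 1            # hypothetical values for illustration
x  <- seq(-3, 3, by = 0.1)
lr <- dnorm(x, mu1, sigma0) / dnorm(x, mu0, sigma0)                       # f1 / f0
g  <- dnorm(x, mu1, sigma0) / (dnorm(x, mu0, sigma0) + dnorm(x, mu1, sigma0))
# g = lr / (1 + lr) is strictly increasing in lr, so thresholding g defines the
# same rejection region as thresholding the likelihood ratio in (1).
all(abs(g - lr / (1 + lr)) < 1e-12)          # TRUE
all(diff(g[order(lr)]) >= 0)                 # TRUE: g is monotone in lr
```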

Next we provide a short review of DNN, which defines a mapping from input data to an output label and learns the parameter values that give the best functional approximation of the output label based on the input data (Goodfellow et al., 2016). The "deep" in DNN refers to successive layers of representations. For example, in a DNN with two layers, the data and parameters are transformed by an activation function in the first layer, and the result is transformed again in the second layer to approximate the output. The last-layer activation function is commonly chosen as the sigmoid function for binary classification and the linear function for approximating a continuous variable (Chollet and Allaire, 2018), while the inner-layer activation function is usually the rectified linear unit (ReLU), defined by $\mathrm{ReLU}(u) = \max(u, 0)$ (Jarrett et al., 2009).

In our proposed hypothesis testing framework, we utilize a DNN to characterize the functional form of $g(\mathbf{x})$, or equivalently the likelihood ratio, in (2), and further determine a rejection region with the type I error controlled at $\alpha$ as in (3). For a sample with index $j$, we define an auxiliary random variable $Z_j$ taking the value either $0$ or $1$, where the event $Z_j = i$ indicates a sample drawn from the distribution under $H_i$, for $i = 0, 1$. In Step 1 of Algorithm 1, we generate training data with $Z_j = 1$ and training data with $Z_j = 0$ for the binary classification problem. We further utilize $D\{T(\mathbf{x}); \boldsymbol{\theta}\}$ constructed by the DNN to approximate $g$, satisfying

$$D\{T(\mathbf{x}); \boldsymbol{\theta}\} = g(\mathbf{x}) + \epsilon, \qquad (4)$$

where $T(\mathbf{x})$ is a sufficient statistic for $\mu$, $\boldsymbol{\theta}$ is the parameter vector of the DNN, $\epsilon$ is a small error term, and the nuisance parameter is fixed at its known value $\sigma_0$. By the Fisher-Neyman factorization theorem of a sufficient statistic, $g(\mathbf{x})$ depends on $\mathbf{x}$ only through $T(\mathbf{x})$, so it suffices to use $T(\mathbf{x})$ as the DNN input. Based on the universal approximation theorem and its extensions, $\epsilon$ in (4) can be made arbitrarily small by using a depth-2 DNN with a sufficiently large number of nodes if $g$ is continuous (Cybenko, 1989), or by using a sufficiently flexible DNN if $g$ is Lebesgue integrable (Lu et al., 2017).

However, the relatively large number of parameters in a DNN can easily make it overfit the training data and lose generalizability to new data. In Step 2, a proper DNN structure is selected from several candidates by cross-validation, with part of the simulated data used for training and the remainder held out for validation (Goodfellow et al., 2016). In choosing the candidate pool of DNN structures, one usually starts with a simple skeleton with small numbers of nodes and layers. As the DNN is made wider and deeper, the validation error typically first decreases and then increases again. One can further propose several structures around this suboptimal solution for the candidate pool. The final structure is the one with the smallest validation error. As shown in Sections 4 and 5, the performance of our method is consistent across different DNN structures provided that their validation errors are relatively small.
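The structure-selection loop can be written compactly with the keras R package; the sketch below is schematic, where x_train/y_train and x_val/y_val denote the simulated training and hold-out splits from Step 1, and the candidate sizes, number of epochs and batch size are placeholders rather than the values used in the paper.

```r
# Schematic cross-validation over candidate DNN structures (a sketch; data splits
# and all tuning values below are illustrative assumptions).
library(keras)
candidates <- expand.grid(n_layers = c(2, 4), n_nodes = c(10, 40))
val_loss <- apply(candidates, 1, function(s) {
  model <- keras_model_sequential() %>%
    layer_dense(units = s["n_nodes"], activation = "relu",
                input_shape = ncol(x_train))
  for (i in seq_len(s["n_layers"] - 1))            # remaining hidden layers
    model %>% layer_dense(units = s["n_nodes"], activation = "relu")
  model %>%
    layer_dense(units = 1, activation = "sigmoid") %>%
    compile(optimizer = "adam", loss = "binary_crossentropy")
  fit(model, x_train, y_train, epochs = 20, batch_size = 200, verbose = 0)
  evaluate(model, x_val, y_val, verbose = 0)[["loss"]]
})
candidates[which.min(val_loss), ]                  # smallest validation error wins
```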

In Step 3, the DNN seeks a solution $\widehat{\boldsymbol{\theta}}$ that maximizes the log-likelihood function

$$\ell(\boldsymbol{\theta}) = \sum_{j} \Big[ Z_j \log D\{T(\mathbf{x}_j); \boldsymbol{\theta}\} + (1 - Z_j) \log \big( 1 - D\{T(\mathbf{x}_j); \boldsymbol{\theta}\} \big) \Big], \qquad (5)$$

where the sum is over the training samples. Essentially, the DNN obtains $\widehat{\boldsymbol{\theta}}$ as the Maximum Likelihood Estimate (MLE) of $\boldsymbol{\theta}$ in (4), and hence $D\{T(\mathbf{x}); \widehat{\boldsymbol{\theta}}\}$ approximates the underlying $g(\mathbf{x})$ in (2). In Step 4, the cutoff $C$ in (3) is evaluated empirically based on a set of validation data simulated from $H_0$. We reject $H_0$ at level $\alpha$ if the test statistic exceeds $C$.

  Step 1: Build training data for DNN. Simulate samples from the null hypothesis $H_0$ and samples from the alternative hypothesis $H_1$. The input is the sufficient statistic $T(\mathbf{x})$ for $\mu$, and the classification label is $1$ if the sample is simulated from $H_1$ and $0$ otherwise.
  Step 2: Utilize cross-validation to select the structure of DNN. Details of the cross-validation are provided in Section 2.1.
  Step 3: Train the selected DNN to obtain the test statistic. Train a DNN with the ReLU as the inner-layer activation function and the sigmoid as the last-layer activation function. The resulting linear predictor defines the test statistic.
  Step 4: Compute the cutoff value in the decision rule. Simulate a set of validation data from $H_0$ and compute their linear predictors. Set the cutoff value $C$ in (3) as their empirical upper $\alpha$ quantile.
Algorithm 1 A DNN-guided UMP level-$\alpha$ test with known nuisance parameters
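To make the four steps concrete, the sketch below implements Algorithm 1 in R with the keras package for a Normal-mean example with known standard deviation. All constants ($\mu_0$, $\mu_1$, $\sigma_0$, $n$, the training size and the network size) are placeholders loosely matching the Section 5.1 setting rather than the values used in the paper, and the cutoff is placed on the sigmoid output, which is equivalent to using the linear predictor because the sigmoid function is monotone.

```r
# A compact sketch of Algorithm 1 for H0: mu = mu0 vs H1: mu = mu1 with known sigma
# (all constants are hypothetical; the structure would come from Step 2 in practice).
library(keras)
set.seed(2019)
mu0 <- 0; mu1 <- 0.233; sigma0 <- 1; n <- 50; M <- 1e5; alpha <- 0.05

# Step 1: sufficient statistic (sample mean) under H0 and H1, with labels z.
xbar <- c(rnorm(M, mu0, sigma0 / sqrt(n)), rnorm(M, mu1, sigma0 / sqrt(n)))
z    <- c(rep(0, M), rep(1, M))

# Step 3: binary classifier approximating g(x) = P(Z = 1 | X = x).
dnn <- keras_model_sequential() %>%
  layer_dense(units = 10, activation = "relu", input_shape = 1) %>%
  layer_dense(units = 10, activation = "relu") %>%
  layer_dense(units = 1, activation = "sigmoid") %>%
  compile(optimizer = "adam", loss = "binary_crossentropy")
fit(dnn, matrix(xbar, ncol = 1), z, epochs = 20, batch_size = 200, verbose = 0)

# Step 4: empirical upper-alpha quantile of the statistic under H0.
xbar_null <- matrix(rnorm(1e5, mu0, sigma0 / sqrt(n)), ncol = 1)
cutoff <- quantile(predict(dnn, xbar_null), 1 - alpha)
# For observed data x_obs, reject H0 if predict(dnn, matrix(mean(x_obs), 1, 1)) > cutoff.
```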

2.2 Unknown nuisance parameters

If the nuisance parameter $\sigma$ is unknown, then we substitute it in (2) by its MLE to obtain the test statistic. We propose to build a more general DNN with varying $\sigma$'s to characterize the underlying $g$, and utilize a second-fold DNN to estimate the critical value. In Step 1 of Algorithm 2, a total of $K$ training datasets are aggregated, where each subset has a fixed standard deviation $\sigma_k$, for $k = 1, \ldots, K$, constructed as in Algorithm 1. The range of the $\sigma_k$'s should be wide enough to cover the potential minimum and maximum of the estimated standard deviation from observed data; for example, in the simulation study of Section 5.2, the grid is chosen to contain the plausible range of the sample standard deviation given the sample size considered. The same range of $\sigma_k$'s is used as the training input of the second-fold DNN that models the critical value in Step 3. Having observed new data, we plug the estimated standard deviation into the first DNN to compute the test statistic, and into the second DNN to obtain the critical value, where the usual unbiased estimator of the nuisance parameter is computed from the observed data.

The proposed two-fold DNN guided hypothesis testing framework not only makes the decision rule pre-specified, but is also computationally efficient compared with re-sampling based approaches, such as the parametric bootstrap method (Efron and Tibshirani, 1994). One does not need to simulate a relatively large number of null datasets for each observed dataset.

A similar automatic procedure was adopted in Chen and Zhang (2009), where multivariate adaptive splines for analysis of longitudinal data (MASAL) (Zhang, 1997, 1999) was utilized to model the empirical critical value, obtained from the generalized extreme value (GEV) distribution, as a function of the number of genotyped markers in genomewide association (GWA) studies. MASAL finds a simple yet accurate piecewise linear approximation to the underlying smooth curve and estimates correlations between observations by an EM-type iterative procedure. In practice, the convergence of the covariance structure estimation needs to be examined by comparing results from several iterations (Zhang, 2004). Our second-fold DNN, on the other hand, does not assume a smooth underlying function of critical values, and can be applied to more complex settings, for example the adaptive design in the next section.

  The first-fold DNN for test statistics
  Step 1: Build aggregated training data for the first DNN. Stack the sets of training data from Step 1 of Algorithm 1, one for each standard deviation $\sigma_k$, to construct an aggregated training dataset. The input vector for each sample contains the sufficient statistic and the corresponding $\sigma_k$.
  Step 2: Train the first DNN using the structure selected by cross-validation to obtain the test statistic. Same as Steps 2 and 3 in Algorithm 1.
  The second-fold DNN for critical values
  Step 3: Build training data for the second DNN. The training input data are the $\sigma_k$'s, for $k = 1, \ldots, K$. The training output label for each $\sigma_k$ is the upper $\alpha$ quantile of the linear predictors of data simulated from the null distribution with standard deviation $\sigma_k$.
  Step 4: Train the second DNN using the structure selected by cross-validation to obtain the critical value. Same as Steps 2 and 3 in Algorithm 1, but use the linear function as the last-layer activation function, with the first DNN's parameters fixed at their estimates.
  Hypothesis testing for observed data. Calculate the input data for the first DNN from the observed sample and its estimated standard deviation, and compute the corresponding test statistic. Use the second DNN evaluated at the estimated standard deviation to obtain the critical value, and reject $H_0$ if the test statistic exceeds the critical value.
Algorithm 2 A DNN-guided UMP level-$\alpha$ test with unknown nuisance parameters
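Step 3 of Algorithm 2 amounts to building (standard deviation, critical value) training pairs by simulation. The sketch below illustrates this in R; dnn denotes a fitted first-fold DNN, make_input() is a hypothetical helper that assembles its input (for example the sample mean and the estimated standard deviation), and the grid and simulation sizes are assumptions for illustration.

```r
# Sketch of Step 3 in Algorithm 2: build (sigma, critical value) training pairs.
# `dnn` is the first-fold DNN; `make_input()` is a hypothetical input-builder.
sigma_grid <- seq(0.5, 2, by = 0.05)      # assumed range covering plausible sigma-hat
alpha <- 0.05; n <- 100; B <- 1e4         # B reduced here for illustration
crit <- sapply(sigma_grid, function(s) {
  stat_null <- replicate(B, {
    x <- rnorm(n, mean = 0, sd = s)       # a null dataset with this sigma
    predict(dnn, make_input(x))           # test statistic from the first-fold DNN
  })
  quantile(stat_null, 1 - alpha)          # empirical upper-alpha quantile
})
train2 <- data.frame(sigma = sigma_grid, cutoff = crit)
# Step 4 would fit the second-fold DNN (linear last-layer activation) to train2
# and evaluate it at the estimated sigma of the observed data.
```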

3 Composite hypothesis testing

The previous section on simple hypothesis testing is mainly of theoretical interest, since problems arising in drug development typically involve a composite hypothesis, for example testing whether the efficacy of a treatment is superior to placebo. In Section 3.1, we introduce a general framework for our proposed deep learning guided hypothesis testing method, and apply it to adaptive designs in Section 3.2.

3.1 A general framework

Suppose that we have two samples of size $n$ per group, one from the treatment group and one from the placebo group, each drawn from a probability density function (or probability mass function) indexed by a parameter of interest and a nuisance parameter. We consider a one-sided composite hypothesis test of the null against the alternative with the type I error rate controlled at $\alpha$, where a larger value of the parameter of interest corresponds to a better clinical outcome. This is equivalent to testing whether the treatment-versus-placebo difference in the parameter of interest is zero or positive.

We generalize Algorithm 2 of Section 2.2 to Algorithm 3 by additionally accommodating varying values of the parameter of interest. The group-specific MLEs of the parameters of interest and of the nuisance parameters are computed from the observed data, and we also define the usual unbiased estimates of these quantities. In Step 1, the input data for the first DNN incorporate these group-specific estimates to construct the test statistic. In Step 3 of training the second DNN to model the critical value, the underlying null parameters are included in the input, where the parameter of interest is common to the two groups under $H_0$. Having observed the data from the two groups, one first computes the DNN-constructed test statistic, then calculates the critical value from the second DNN evaluated at the unbiased estimates of the null parameters, and rejects $H_0$ if the test statistic exceeds the critical value.

  The first-fold DNN for test statistics
  Step 1: Build aggregated training data for the first DNN. Stack sets of data, each consisting of simulated samples from the treatment group and from the placebo group, to construct an aggregated training dataset. The input vector contains the group-specific parameter estimates and the nuisance parameter estimates described above.
  Step 2: Train the first DNN using the structure selected by cross-validation to obtain the test statistic. Same as Steps 2 and 3 in Algorithm 1.
  The second-fold DNN for critical values
  Step 3: Build training data for the second DNN. The training input data consist of the null parameter configurations. The output label for each configuration is the upper $\alpha$ quantile of the linear predictors of data simulated from the corresponding null distribution for the placebo and treatment groups.
  Step 4: Train the second DNN using the structure selected by cross-validation to obtain the critical value. Same as Steps 2 and 3 in Algorithm 1, but use the linear function as the last-layer activation function, with the first DNN's parameters fixed at their estimates.
  Hypothesis testing for observed data. Calculate the input data for the first DNN from the two observed samples, and compute the corresponding test statistic. Use the second-fold DNN evaluated at the unbiased null-parameter estimates to obtain the critical value. Reject $H_0$ if the test statistic exceeds the critical value.
Algorithm 3 A general framework of DNN-guided hypothesis testing
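For two Normal samples, as in the Behrens-Fisher study of Section 5.3, one plausible choice of input summaries for the first-fold DNN in Step 1 is sketched below; the exact components used in the paper are not reproduced here, so the group MLEs, pooled null MLE and unbiased-variance-based standard deviations shown are illustrative assumptions.

```r
# Illustrative input features for the first-fold DNN in Algorithm 3 (two Normal
# samples; the exact choice of summaries is an assumption).
make_input3 <- function(x_t, x_c) {
  c(mean_t = mean(x_t), mean_c = mean(x_c),   # group-specific MLEs of the means
    mean_0 = mean(c(x_t, x_c)),               # pooled MLE of the common mean under H0
    sd_t = sd(x_t), sd_c = sd(x_c))           # unbiased-variance-based SD estimates
}
make_input3(rnorm(100, 0.5, 0.95), rnorm(100, 0, 1.1))   # hypothetical samples
```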

3.2 Adaptive designs with sample size reassessment

Now let us apply Algorithm 3 to a sample size adaptive clinical trial with a binary endpoint that evaluates the efficacy of a treatment group versus placebo under 1:1 randomization. If a larger response rate corresponds to a better outcome, then the null and alternative hypotheses are

$$H_0: p_t = p_c \quad \text{versus} \quad H_1: p_t > p_c, \qquad (6)$$

where $p_c$ and $p_t$ are the response rates in the placebo and the treatment group, respectively. This is equivalent to testing whether the response difference $\delta = p_t - p_c$ is larger than $0$ or equal to $0$. The numbers of responders $X_c$ in the control group and $X_t$ in the treatment group are assumed to follow two independent Binomial distributions.

In a non-adaptive design, there exists a UMP unbiased test of hypothesis (6) by Theorem 4.4.1 of Lehmann and Romano (2006), because the joint distribution of $X_c$ and $X_t$ is in an exponential family. The UMP unbiased test is based on the conditional distribution of $X_t$ given $X_c + X_t$, which is the hypergeometric distribution. An approximate test whose overall level tends to be closer to $\alpha$ is obtained by using the Normal approximation to the hypergeometric distribution without continuity correction,

$$Z = \frac{\hat{p}_t - \hat{p}_c}{\sqrt{2\,\hat{p}(1-\hat{p})/n}}, \qquad (7)$$

where $\hat{p}_t = X_t/n$, $\hat{p}_c = X_c/n$, $\hat{p} = (X_t + X_c)/(2n)$, and $n$ is the sample size per group; $H_0$ is rejected when $Z$ exceeds the upper $\alpha$ quantile of the standard Normal distribution. More details and discussion can be found in Section 4.5 of Lehmann and Romano (2006).
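The statistic in (7) can be coded directly; the sketch below is a small helper for equal group sizes, with hypothetical responder counts used in the example call.

```r
# Pooled two-proportion z-statistic of (7) for equal group sizes n (a sketch).
prop_z <- function(x_t, x_c, n, alpha = 0.05) {
  p_hat  <- (x_t + x_c) / (2 * n)                 # pooled response rate
  z_stat <- (x_t / n - x_c / n) / sqrt(2 * p_hat * (1 - p_hat) / n)
  list(z = z_stat, reject = z_stat > qnorm(1 - alpha))
}
prop_z(x_t = 40, x_c = 25, n = 100)               # hypothetical counts
```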

Now consider a two-stage adaptive design with sample size reassessment at the interim analysis. Denote by $X_{c,1}$ and $X_{t,1}$ the numbers of responders in the placebo and treatment groups, and by $n_1$ the sample size per group, in the first stage; correspondingly, $X_{c,2}$, $X_{t,2}$ and $n_2$ denote the second-stage counterparts. Having observed the interim data, one can adjust $n_2$ based on the conditional power (CP) of the test in (7) to achieve a desired pre-specified power (Mehta and Pocock, 2000). The response rates $p_c$ and $p_t$ entering the CP can be estimated by their empirical counterparts. However, the uncertainty and variability of those estimates need to be considered to protect against misspecification of the hypothesized treatment effect (Ciarleglio and Arendt, 2017). Conditional expected power (CEP), on the other hand, provides a robust alternative by averaging the traditional CP over prior distributions $\pi(p_c)$ and $\pi(p_t)$ (Ciarleglio et al., 2015; Ciarleglio and Arendt, 2019):

$$\mathrm{CEP}(n_2) = \int\!\!\int \mathrm{CP}(n_2; p_c, p_t)\, \pi(p_c)\, \pi(p_t)\, \mathrm{d}p_c\, \mathrm{d}p_t. \qquad (8)$$

Having observed the first-stage data, one adjusts $n_2$ to the minimum integer for which the CEP reaches a targeted level. For practical considerations, $n_2$ is bounded above due to cost and timeline constraints, and bounded below in order to establish the required safety profile.
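The reassessment rule can be sketched as follows. This is a simulation-based stand-in rather than the analytic conditional power formula of Mehta and Pocock (2000): beta_ab() converts a (mean, variance) pair into Beta shape parameters, the pooled z-test on the combined stages is an illustrative final-analysis rule, and the prior variance, interim counts and grid limits are placeholders.

```r
# Sketch of CEP-based sample size reassessment (illustrative assumptions throughout).
beta_ab <- function(m, v) {                      # Beta shapes from mean and variance
  k <- m * (1 - m) / v - 1
  c(a = m * k, b = (1 - m) * k)
}
cep <- function(n2, x1_t, x1_c, n1, v = 0.005, alpha = 0.05, B = 1e4) {
  pr_t <- beta_ab(x1_t / n1, v); pr_c <- beta_ab(x1_c / n1, v)
  p_t <- rbeta(B, pr_t["a"], pr_t["b"]); p_c <- rbeta(B, pr_c["a"], pr_c["b"])
  x2_t <- rbinom(B, n2, p_t); x2_c <- rbinom(B, n2, p_c)
  n <- n1 + n2                                   # total per-group sample size
  p_hat <- (x1_t + x2_t + x1_c + x2_c) / (2 * n) # pooled rate over both stages
  z <- ((x1_t + x2_t) - (x1_c + x2_c)) / (n * sqrt(2 * p_hat * (1 - p_hat) / n))
  mean(z > qnorm(1 - alpha))                     # prior-averaged rejection probability
}
# Smallest n2 within the allowed range reaching the target (e.g. 0.90), as in the text.
n2_grid  <- 50:400
cep_vals <- sapply(n2_grid, cep, x1_t = 40, x1_c = 30, n1 = 100)
n2_star  <- n2_grid[which(cep_vals >= 0.90)[1]]
```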

In contrast to non-adaptive designs, it is challenging to characterize the joint distribution of the numbers of responders under the null hypothesis, because the value of $n_2$ is adjusted based on the accumulated first-stage data. By adopting Algorithm 3, we include in the first DNN training dataset the MLEs of the response rates computed from the two stages separately. In constructing the second-fold DNN to model critical values, the training data include the underlying null response rate. For an observed dataset, the empirical response rate based on the first-stage data is an unbiased estimator of the null response rate under $H_0$, and is hence plugged into the second DNN to compute the critical value.

4 Sample size reassessment in the MUSEC trial

In this section, we apply our proposed method to the Phase III clinical trial MUSEC (MUltiple Sclerosis and Extract of Cannabis, Trial Registration Number NCT00552604, Zajicek et al. (2012)), which implemented an adaptive design with sample size re-estimation to investigate a standardized oral cannabis extract (CE) for the symptomatic relief of muscle stiffness and pain in adult patients with stable multiple sclerosis and ongoing troublesome muscle stiffness.

The primary outcome measure was a binary endpoint of patient-reported "relief of muscle stiffness" from baseline to 12 weeks of treatment, based on an 11-point category rating scale. Sample size calculations were based on assumed response rates in CE and in placebo from previous studies, and the planned number of subjects per group provided the Fisher exact test with adequate power at a two-sided type I error rate (Zajicek et al., 2012). An unblinded interim analysis was planned when a pre-specified proportion of the total subjects had reached their 12-week outcomes. The Independent Data Monitoring Committee (IDMC) recommended reducing the total number of subjects to be enrolled, as the adjusted sample size was sufficient to maintain the targeted conditional power.

To evaluate the performance of our proposed method, we re-design this adaptive trial by treating the primary endpoint as an instant binary response that is observed immediately after a patient's enrollment. The one-sided type I error rate is controlled at the nominal level $\alpha = 0.05$, and the response rate in the treatment group is considered at a fixed value for illustration. With an interim analysis at a pre-specified information fraction, the first-stage sample size per group is set accordingly. We consider the sample size reassessment framework based on the conditional expected power (CEP) in Section 3.2. Beta distributions with means at the observed rates and a common variance are assumed as the priors for $p_c$ and $p_t$ (Ciarleglio and Arendt, 2017). The CEP in (8) is evaluated by Monte Carlo integration. The second-stage sample size per group is adjusted for the CEP to reach the target level, subject to pre-specified lower and upper limits.

In Algorithm 3 of our proposed method, we aggregate training datasets by scanning the treatment effect over a regular grid under the assumed control response rate. The corresponding sample size is computed for the proportion test in (7) to achieve the targeted power, and hence the resulting first- and second-stage sample sizes are aligned with the response-rate assumptions. In the validation step, we also consider scenarios where the observed response rate is lower or higher than the assumption. The DNN structure is selected as the one with the smallest validation error in cross-validation among candidates, which are all combinations of the numbers of layers and nodes per layer in the candidate pool; the number of epochs, the batch size and the dropout rate are fixed in advance. The same structure candidate pool is applied in the rest of this article for the first-fold DNN that constructs the test statistic.

In Step 3 of modeling the second DNN, we simulate training data with the null response rate varying over a regular sequence. This moderate training size is sufficient to achieve a small validation mean squared error for the second DNN and to deliver accurate type I error rate control, as shown below. The structure candidate pool for the second DNN again consists of all combinations of the numbers of layers and nodes per layer, with the dropout rate, batch size and number of epochs fixed in advance. The critical value is computed based on simulated null samples, and we further simulate a separate set of validation data to evaluate the type I error rate and power. The same numbers of iterations for computing the critical value and for simulating validation data are used throughout this article.

The inverse normal combination test approach (Lehmacher and Wassmer, 1999) with equal weights, denoted as “INCTA”, and a Bayesian method with adjusted critical value (Ciarleglio and Arendt, 2017, 2019), denoted as “BM”, are conducted for comparison. We also compute the average sample size (ASN), which is the same for all three methods within a scenario.

The structure of the first DNN is selected by cross-validation, as is that of the second DNN. We first consider a study design with a given target CEP and prior beta distribution variance. The type I error is studied under the null response rates shown in Table 1, which cover a reasonable range around the observed control rate. As shown in Table 1, both our proposed method and INCTA have an accurately controlled type I error rate at the nominal level. The critical value in BM is adjusted to protect the type I error across all evaluated response rates. In the Supplemental Materials Table 1, we further evaluate the type I error protection in cases where the sampling distribution is a mixture of the assumed binomial distribution and a uniform distribution with minimum at zero and maximum at twice the response rate. Our method has a more conservative type I error rate than the two alternatives, and hence is more robust to model misspecification.

Under the training alternative hypothesis, our method has the highest power as compared with INCTA and BM (Table 1). By performing replicates of the validation process, we find that all three methods have a relatively small standard deviation, demonstrating the robustness of our findings. The power advantage of the DNN method persists when the observed response rate is higher or lower than the assumed value. By adjusting the magnitude of the target power, we compute the average sample size (ASN) for each design to achieve the desired power in Table 2. Our method has the smallest sample size requirement per group, which corresponds to a saving relative to INCTA and to BM. In the corresponding heatmap in Figure 1, we plot the conditional probability of rejecting $H_0$ given the numbers of responders in the CE and placebo groups. All three methods have similar decision zones, favoring $H_1$ when CE has more observed responders than placebo.

Three other designs with varying target CEP levels and prior variances are evaluated in Table 3, demonstrating a consistent power gain for our proposed method along with accurate type I error control. In the Supplemental Materials Table 2, we show that the performance of our proposed method is robust across different DNN structures.

Figure 1: Heatmap of conditional power under the null and the alternative hypotheses for the sample size reassessment design considered in Table 1
p_c     δ       Type I error / Power            ASN
                DNN     INCTA    BM
0.17    0.00    5.0%    5.0%     5.0%           405
0.22    0.00    5.0%    5.1%     5.0%           404
0.27    0.00    5.0%    5.0%     5.0%           403
0.32    0.00    5.1%    5.1%     5.0%           402
0.37    0.00    5.1%    5.0%     5.0%           403
0.27    0.12    90.9%   85.9%    88.8%          250
0.27    0.13    94.5%   88.7%    91.7%          227
0.27    0.14    96.7%   90.7%    93.7%          208
Table 1: Type I error rate and power in the sample size reassessment design
p_c     δ       ASN
                DNN     INCTA    BM
0.27    0.12    242     389      284
0.27    0.13    189     272      203
0.27    0.14    152     198      158
Table 2: Average sample size (ASN) required to achieve the target power in the sample size reassessment design
Target CEP   Prior variance   p_c     δ       Type I error / Power
                                              DNN     INCTA    BM
0.8          0.005            0.17    0.00    5.1%    5.1%     5.0%
                              0.22    0.00    5.1%    5.0%     5.0%
                              0.27    0.00    5.0%    5.1%     5.0%
                              0.32    0.00    5.0%    5.0%     5.0%
                              0.37    0.00    5.1%    5.1%     5.1%
                              0.27    0.12    94.8%   89.1%    93.3%
                              0.27    0.13    97.6%   91.3%    95.5%
                              0.27    0.14    98.9%   92.7%    96.3%
0.75         0.001            0.17    0.00    4.9%    5.1%     5.2%
                              0.22    0.00    4.9%    5.0%     5.0%
                              0.27    0.00    5.1%    5.1%     5.2%
                              0.32    0.00    5.0%    5.0%     5.0%
                              0.37    0.00    5.1%    5.1%     5.2%
                              0.27    0.12    88.9%   84.4%    87.1%
                              0.27    0.13    93.0%   87.2%    90.0%
                              0.27    0.14    95.3%   89.3%    92.2%
0.75         0.005            0.17    0.00    5.1%    5.1%     5.1%
                              0.22    0.00    5.0%    5.1%     5.1%
                              0.27    0.00    5.0%    5.0%     5.0%
                              0.32    0.00    5.1%    5.1%     5.1%
                              0.37    0.00    5.1%    5.0%     5.1%
                              0.27    0.12    92.9%   87.2%    90.8%
                              0.27    0.13    96.2%   89.7%    93.4%
                              0.27    0.14    97.9%   91.2%    94.9%
Table 3: Sample size reassessment designs with varying target CEP levels and prior variances

5 Simulation studies

In this section, we evaluate the performance of our DNN-guided hypothesis testing framework under three setups: simple hypothesis testing with known nuisance parameters in Section 5.1, simple hypothesis testing with unknown nuisance parameters in Section 5.2, and composite hypothesis testing in Section 5.3. We essentially show that our method learns the existing UMP unbiased tests in simple hypothesis testing settings, and performs no worse than a popular approximate method in a composite hypothesis testing case.

5.1 Simple hypothesis testing for mean of a Normal distribution with known standard deviation

Consider the setup in Section 2.1, where the mean $\mu$ of a Normal distribution with known standard deviation is tested at level $\alpha = 0.05$ for $H_0: \mu = \mu_0$ versus $H_1: \mu = \mu_1$. In this simple hypothesis testing problem, the $z$-test is a UMP level-$\alpha$ test. We consider several combinations of the sample size and the known standard deviation, and choose $\mu_1$ as the alternative mean at which the $z$-test achieves a 50% or 90% power. In Algorithm 1, the training dataset sizes under the null and the alternative are set equal at Step 1.
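As a quick check of this setup, the alternative mean giving a target power for the one-sided $z$-test has a closed form, $\mu_1 = \mu_0 + (z_{1-\alpha} + z_{\text{power}})\,\sigma/\sqrt{n}$; assuming $\mu_0 = 0$, the short sketch below reproduces the $\mu_1$ values reported in Table 4.

```r
# Alternative mean giving a target power for the one-sided z-test (a quick check;
# mu_0 = 0 is assumed, and (n, sigma) are combinations appearing in Table 4).
alt_mean <- function(n, sigma, power, alpha = 0.05) {
  (qnorm(1 - alpha) + qnorm(power)) * sigma / sqrt(n)
}
alt_mean(n = 50,  sigma = 1, power = 0.50)   # approx. 0.233
alt_mean(n = 50,  sigma = 1, power = 0.90)   # approx. 0.414
alt_mean(n = 150, sigma = 2, power = 0.90)   # approx. 0.478
```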

By cross-validation, the DNN structure with four layers and ten nodes per layer has the smallest validation error, and is utilized for the final model fitting. As can be seen from Table 4, our proposed method accurately controls the type I error rate at the nominal level across all scenarios with varying sample sizes and standard deviations. It also consistently reaches the upper power limit given by the power of the $z$-test. We further conduct a sensitivity analysis of the DNN hyperparameters; results in the Supplemental Materials Table 3 demonstrate that the performance of our proposed method is robust to the training data size, the numbers of nodes and layers, and the dropout rate in the DNN.

n      σ      μ0     μ1       Type I error         Power
                              DNN     z-test       DNN     z-test
50     1      0      0.233    5.0%    5.0%         50.0%   50.0%
50     1      0      0.414    5.0%    5.0%         90.0%   89.9%
150    2      0      0.269    5.0%    5.0%         50.0%   50.0%
150    2      0      0.478    5.0%    5.0%         90.1%   90.0%
Table 4: Type I error rate and power in simple hypothesis testing with known standard deviation

5.2 Simple hypothesis testing for mean of a Normal distribution with unknown standard deviation

Next we consider a scenario with an unknown standard deviation, where the one-sample $t$-test is a UMP unbiased level-$\alpha$ test. The sample size is considered at $n = 100$ and $200$. We set $\mu_0 = 0$ and choose the value of $\mu_1$ for the one-sample $t$-test to reach approximately 90% power. In Step 1 of Algorithm 2, we aggregate training datasets over a grid of standard deviations, and we generate regular sequences of standard deviation values as the input data for training the second-fold DNN that models the critical value. In the validation step, we consider cases with the true standard deviation at 1.0, 1.5 and 2.0, some of which are included in the training grid while one is not.

The structure of the first DNN is selected at ten nodes per layer and four layers. The second DNN has forty nodes and two layers. As shown in Table 5, our DNN-based method reaches the upper power limit given by the one-sample $t$-test, with the type I error rate controlled at the nominal level across all scenarios evaluated.

n      σ      μ0     μ1       Type I error                     Power
                              DNN     One-sample t-test        DNN     One-sample t-test
100    1.0    0      0.293    5.0%    5.0%                     89.7%   89.6%
100    1.5    0      0.439    5.0%    5.0%                     89.7%   89.7%
100    2.0    0      0.585    4.9%    5.0%                     89.5%   89.6%
200    1.0    0      0.207    5.0%    5.0%                     89.8%   89.8%
200    1.5    0      0.310    5.0%    5.0%                     89.8%   89.8%
200    2.0    0      0.414    5.0%    5.0%                     89.7%   89.8%
Table 5: Type I error rate and power in simple hypothesis testing with unknown standard deviation

5.3 Composite hypothesis testing for mean of two Normal distributions with unknown standard deviation

In this section, we consider the so-called Behrens-Fisher problem of testing the equality of the means $\mu_1$ and $\mu_2$ of two Normal distributions with unknown standard deviations $\sigma_1$ and $\sigma_2$, without an equal-variance assumption (Behrens, 1929; Fisher, 1935). The null hypothesis $H_0: \mu_1 = \mu_2$ is tested against the one-sided alternative $H_1: \mu_1 > \mu_2$ at level $\alpha$. Deriving the UMP unbiased test is challenging, because the sufficient statistics are not complete (Lehmann and Romano, 2006). The Welch approximate $t$-test is a popular approximate solution that is satisfactory for practical purposes (Welch, 1951).

In constructing our DNN-guided hypothesis testing strategy by Algorithm 3, at Step 1 we aggregate sets of data covering combinations of $\sigma_1$ and $\sigma_2$, with the mean difference taking values at which Welch's $t$-test achieves the targeted power levels at level $\alpha$. At Step 3 of training the second DNN, we generate null data with a common mean value and with the standard deviations varying over regular sequences. Since the Normal distribution is in a location-scale family, the critical value is invariant to the common mean; it is therefore sufficient to fix the mean at Step 3 and to exclude its unbiased estimator from the input of the second DNN at Step 4. In the validation stage, $\sigma_1$ and $\sigma_2$ are evaluated at 0.95 and 1.1 (Table 6); one of these standard deviations is not included in the training step and is used to test the predictive performance of our method. The mean difference is also evaluated at a value lower than the magnitude used to achieve the targeted power in the training step.
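For reference, the Welch benchmark used throughout this section is available directly in base R; the sketch below uses hypothetical sample sizes, means and standard deviations.

```r
# Welch's approximate t-test (one-sided, unequal variances) in base R; the data
# below are hypothetical and only illustrate the benchmark comparator.
set.seed(1)
x_t <- rnorm(100, mean = 0.5, sd = 0.95)   # treatment-like sample
x_c <- rnorm(100, mean = 0.0, sd = 1.10)   # control-like sample
t.test(x_t, x_c, alternative = "greater", var.equal = FALSE)
```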

After conducting the cross-validation, the final structures of the first and second DNNs are selected from the candidate pool. Across all scenarios in Table 6, our DNN test has accurate type I error protection at the nominal level and power performance consistent with Welch's $t$-test. A marginal power gain of the DNN method is also observed in some cases.

Type I error Power
DNN    Welch's t-test    DNN    Welch's t-test
0.95 0.95 0.5 0.834 5.0% 5.0% 79.7% 79.7%
0.801 - - 72.0% 72.0%
1.1 0.7 0.861 5.0% 5.0% 79.8% 79.7%
0.825 - - 72.2% 72.2%
1.1 0.95 0.5 0.861 5.0% 5.0% 79.8% 79.8%
0.825 - - 72.1% 72.0%
1.1 0.7 0.887 5.0% 5.0% 79.9% 79.8%
0.848 - - 72.3% 72.1%
Table 6: Type I error rate and power in composite hypothesis testing with unknown standard deviation

6 Discussions

In this article, we propose a novel DNN-guided hypothesis testing framework to target the UMP unbiased test in sample size reassessment adaptive clinical trials. Our method has a moderate power gain compared with several alternatives, and has accurate type I error rate control at a nominal level. The pre-specified decision rule not only makes the computation more efficient, but is also more acceptable to regulatory agencies. The well-trained DNNs can be locked in files before the conduct of the trial to ensure study integrity.

The proposed framework can be generally applied to other hypothesis testing problems in which the potential UMP unbiased tests are hard to derive. Future work includes generalizing our method to multiple hypothesis testing problems, for example seamless Phase II/III designs, where the familywise error rate (FWER) needs to be properly controlled under all configurations of true and false null hypotheses.

Acknowledgements

The R code and an R Markdown help file are available on GitHub to replicate the case study and the simulation studies.


References

  • Barnhart et al. (2018) Barnhart, K. T., M. D. Sammel, M. Stephenson, J. Robins, K. R. Hansen, W. A. Youssef, N. Santoro, E. Eisenberg, H. Zhang, et al. (2018). Optimal treatment for women with a persisting pregnancy of unknown location, a randomized controlled trial: The act-or-not trial. Contemporary Clinical Trials 73, 145–151.
  • Bauer et al. (2016) Bauer, P., F. Bretz, V. Dragalin, F. König, and G. Wassmer (2016). Twenty-five years of confirmatory adaptive designs: opportunities and pitfalls. Statistics in Medicine 35(3), 325–347.
  • Bauer and Kohne (1994) Bauer, P. and K. Kohne (1994). Evaluation of experiments with adaptive interim analyses. Biometrics, 1029–1041.
  • Behrens (1929) Behrens, W. (1929). A contribution to error estimation with few observations. Journal of Agriculture Scientific Archives of the Royal Prussian State College-Economy 68, 807–837.
  • Bretz et al. (2009) Bretz, F., F. Koenig, W. Brannath, E. Glimm, and M. Posch (2009). Adaptive designs for confirmatory clinical trials. Statistics in Medicine 28(8), 1181–1217.
  • Chen and Zhang (2009) Chen, X. and H. Zhang (2009). The null distributions of test statistics in genomewide association studies. Statistics in Biosciences 1(2), 214–227.
  • Chollet and Allaire (2018) Chollet, F. and J. J. Allaire (2018). Deep Learning with R (1st ed.). Greenwich, CT, USA: Manning Publications Co.
  • Ciarleglio and Arendt (2017) Ciarleglio, M. M. and C. D. Arendt (2017). Sample size determination for a binary response in a superiority clinical trial using a hybrid classical and Bayesian procedure. Trials 18(1), 83.
  • Ciarleglio and Arendt (2019) Ciarleglio, M. M. and C. D. Arendt (2019). Sample size re-estimation in a superiority clinical trial using a hybrid classical and Bayesian procedure. Statistical Methods in Medical Research 28(6), 1852–1878.
  • Ciarleglio et al. (2015) Ciarleglio, M. M., C. D. Arendt, R. W. Makuch, and P. N. Peduzzi (2015). Selection of the treatment effect for sample size determination in a superiority clinical trial using a hybrid classical and Bayesian procedure. Contemporary Clinical Trials 41, 160–171.
  • Cui et al. (1999) Cui, L., H. J. Hung, and S.-J. Wang (1999). Modification of sample size in group sequential clinical trials. Biometrics 55(3), 853–857.
  • Cybenko (1989) Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems 2(4), 303–314.
  • Diamond et al. (2015) Diamond, M. P., R. S. Legro, C. Coutifaris, R. Alvero, R. D. Robinson, P. Casson, G. M. Christman, J. Ager, H. Huang, K. R. Hansen, et al. (2015). Letrozole, gonadotropin, or clomiphene for unexplained infertility. New England Journal of Medicine 373(13), 1230–1240.
  • Efron and Tibshirani (1994) Efron, B. and R. J. Tibshirani (1994). An introduction to the bootstrap. CRC press.
  • Elsäßer et al. (2014) Elsäßer, A., J. Regnstrom, T. Vetter, F. Koenig, R. J. Hemmings, M. Greco, M. Papaluca-Amati, and M. Posch (2014). Adaptive clinical trial designs for European marketing authorization: a survey of scientific advice letters from the European Medicines Agency. Trials 15(1), 383.
  • EMA (2007) EMA (2007). Reflection paper on methodological issues in confirmatory clinical trials planned with an adaptive design. London: EMEA.
  • FDA (2018) FDA (2018). Adaptive designs for clinical trials of drugs and biologics: guidance for industry. US Department of Health and Human Services, Federal Register. https://www.fda.gov/downloads/drugs/guidances/ucm201790.pdf
  • Fisher (1935) Fisher, R. A. (1935). The fiducial argument in statistical inference. Annals of Eugenics 6(4), 391–398.
  • Goodfellow et al. (2016) Goodfellow, I., Y. Bengio, and A. Courville (2016). Deep learning. MIT press.
  • Inoue et al. (2002) Inoue, L. Y., P. F. Thall, and D. A. Berry (2002). Seamlessly expanding a randomized phase II trial to phase III. Biometrics 58(4), 823–831.
  • Jarrett et al. (2009) Jarrett, K., K. Kavukcuoglu, M. Ranzato, and Y. LeCun (2009). What is the best multi-stage architecture for object recognition? In 2009 IEEE 12th International Conference on Computer Vision, pp. 2146–2153. IEEE.
  • Lehmacher and Wassmer (1999) Lehmacher, W. and G. Wassmer (1999). Adaptive sample size calculations in group sequential trials. Biometrics 55(4), 1286–1290.
  • Lehmann and Romano (2006) Lehmann, E. L. and J. P. Romano (2006). Testing statistical hypotheses. Springer Science & Business Media.
  • Lin et al. (2016) Lin, M., S. Lee, B. Zhen, J. Scott, A. Horne, G. Solomon, and E. Russek-Cohen (2016). CBER's experience with adaptive design clinical trials. Therapeutic Innovation & Regulatory Science 50(2), 195–203.
  • Liu et al. (2002) Liu, Q., M. A. Proschan, and G. W. Pledger (2002). A unified theory of two-stage adaptive designs. Journal of the American Statistical Association 97(460), 1034–1041.
  • Lu et al. (2017) Lu, Z., H. Pu, F. Wang, Z. Hu, and L. Wang (2017). The expressive power of neural networks: A view from the width. In Advances in Neural Information Processing Systems, pp. 6231–6239.
  • Mehta and Pocock (2000) Mehta, C. R. and S. J. Pocock (2000). Adaptive increase in sample size when interim results are promising: A practical guide with examples. Statistics in Medicine, 1–6.
  • Perry et al. (2019) Perry, R., T. M. Tomita, J. Patsolic, B. Falk, and J. T. Vogelstein (2019). Manifold forests: Closing the gap on neural networks. arXiv preprint arXiv:1909.11799.
  • Schulz et al. (2019) Schulz, M.-A., T. Yeo, J. Vogelstein, J. Mourao-Miranada, J. Kather, K. Kording, B. A. Richards, and D. Bzdok (2019). Deep learning for brains?: Different linear and nonlinear scaling in UK biobank brain images vs. machine-learning datasets. bioRxiv, 757054.
  • Vogelstein et al. (2007) Vogelstein, R. J., U. Mallik, J. T. Vogelstein, and G. Cauwenberghs (2007). Dynamically reconfigurable silicon array of spiking neurons with conductance-based synapses. IEEE Transactions on Neural Networks 18(1), 253–265.
  • Welch (1951) Welch, B. L. (1951). On the comparison of several mean values: an alternative approach. Biometrika 38(3-4), 330–336.
  • Wu et al. (2017) Wu, X.-K., E. Stener-Victorin, H.-Y. Kuang, H.-L. Ma, J.-S. Gao, L.-Z. Xie, L.-H. Hou, Z.-X. Hu, X.-G. Shao, J. Ge, et al. (2017). Effect of acupuncture and clomiphene in Chinese women with polycystic ovary syndrome: a randomized clinical trial. Journal of the American Medical Association 317(24), 2502–2514.
  • Zajicek et al. (2012) Zajicek, J. P., J. C. Hobart, A. Slade, D. Barnes, P. G. Mattison, M. R. Group, et al. (2012). Multiple sclerosis and extract of cannabis: results of the MUSEC trial. Journal of Neurology, Neurosurgery & Psychiatry 83(11), 1125–1132.
  • Zhang (1997) Zhang, H. (1997). Multivariate adaptive splines for analysis of longitudinal data. Journal of Computational and Graphical Statistics 6(1), 74–91.
  • Zhang (1999) Zhang, H. (1999). Analysis of infant growth curves using multivariate adaptive splines. Biometrics 55(2), 452–459.
  • Zhang (2004) Zhang, H. (2004). Mixed effects multivariate adaptive splines model for the analysis of longitudinal and growth curve data. Statistical Methods in Medical Research 13(1), 63–82.