Two-stage designs are used for many purposes, including enrichment, sample size re-estimation and to modify randomization probabilities to improve the efficiency and/or efficacy of estimators. All these procedures use accumulated data to change the operation of the experimental design, which induces dependencies between the first and second stage data. Our interest lies in the effects of such dependencies on inference at the end of a pilot study where the first stage sample size is fixed, and the second stage sample size is large.
In two-stage enrichment designs, patients more likely to benefit from the treatment are identified based on data from the first stage, and second-stage trials are conducted in the identified subpopulation [e.g., Simon and Maitournam, Ivanova and Tamura, Rosenblum and van der Laan, Trippa et al., Zang and Guo]. Two-stage sample size re-estimation methods revise the final sample size using parameter estimates from the first stage [e.g., Stein, Proschan, Shih, Schwartz and Denne, Zhong et al., Tarima et al., Broberg and Miller]. In two-stage adaptive optimal designs, information from the first stage is used to estimate optimal treatment assignment probabilities for the second stage [e.g., Haines et al., Lane and Flournoy, Englert and Kieser, Lane et al., Shan et al.].
Lane and Flournoy studied asymptotic distributional properties of the maximum likelihood estimator for nonlinear regression models with independent normal errors. In their study, they used the Fisher information to norm the score function when taking limits, obtaining a limiting distribution for the maximum likelihood estimator that is a random scale mixture of normal random variables. Use of this result requires knowledge of the distribution of the limiting scaling random variable. Lane and Flournoy found this distribution in the special case of an exponential mean function, but their method is not generalizable, so the result is informative but not generally useful in practice.
In their review paper on likelihood theory for stochastic processes, Barndorff-Nielsen and Sørensen describe conditions under which maximum likelihood estimators normed with the Fisher information converge to randomly scaled mixtures of normal distributions, as was the case in Lane and Flournoy. Limiting random mixtures of normal random variables also arise in Ivanova et al., Ivanova and Flournoy, and May and Flournoy. Barndorff-Nielsen and Sørensen describe a solution to this problem: using a random norming in lieu of the Fisher information can lead to a standard normal limit instead.
This paper examines the use of random normings in a practical situation. In particular, we evaluate these alternative random norms in the same context as Lane and Flournoy and Lane et al., and show how to apply them to obtain the more useful standard normal distribution. Then we compare the rates of convergence and efficiencies of the different norming alternatives.
Accordingly, this paper is organized as follows. In Section 2, we present the model to be studied in this paper. In Section 3, we describe stable and mixing convergences, which are needed, and a generalized version of the Cramér-Slutzky theorem. In Section 4, we present the main asymptotic results for maximum likelihood estimators with random normings. We conduct simulation studies to compare the efficiencies obtained with these normings for exponential and logistic models in Section 5.
2 The Model
Let be observations from a two-stage adaptive design, where is the number of observations and is the single dose used for the th stage, . To avoid degenerate cases, we assume , and set . We consider a general regression model with independent normal errors:
where is some (possibly) nonlinear mean function, twice differentiable in ; is given; and for simplicity, is a 1-dimensional parameter. In addition, adaptation is restricted to the choice of , and depends on stage 1 data only through sufficient statistics from stage 1. More specifically, is a random function, where . Define . Then . But . As for , and . But is only conditionally on .
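Many inline symbols in this section were lost in extraction. As a point of reference, a plausible form of the model, written with assumed labels (our reconstruction, not the paper's verbatim notation), is:

```latex
% Two-stage regression model (the symbols y_{ij}, x_i, n_i, \theta, \sigma^2
% are assumed labels, not necessarily the paper's originals):
y_{ij} = \eta(x_i, \theta) + \varepsilon_{ij},
\qquad \varepsilon_{ij} \overset{\text{iid}}{\sim} N(0, \sigma^2),
\qquad j = 1, \ldots, n_i, \quad i = 1, 2,
```

where the stage-2 dose is a measurable function of the stage-1 sufficient statistic, so the stage-2 observations are normal only conditionally on the stage-1 data.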
Let denote maximum likelihood estimators of based on stage data, , and let denote the maximum likelihood estimator of based on all trials. Since maximum likelihood estimators (MLEs) are functions of sufficient statistics, is a function of the first stage mean response , and both and are functions of .
Then the likelihood function is
Letting , and , the score function can be written as
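Under the normal-errors form above, the log-likelihood and score take standard shapes; the following is a hedged sketch using the same assumed symbols as before:

```latex
\ell(\theta) = \text{const} - \frac{1}{2\sigma^2}
  \sum_{i=1}^{2} \sum_{j=1}^{n_i} \bigl( y_{ij} - \eta(x_i,\theta) \bigr)^2,
\qquad
S(\theta) = \frac{\partial \ell}{\partial \theta}
  = \frac{1}{\sigma^2} \sum_{i=1}^{2} n_i \,
    \dot{\eta}(x_i,\theta) \bigl( \bar{y}_i - \eta(x_i,\theta) \bigr),
```

where $\dot{\eta}$ denotes $\partial \eta / \partial \theta$ and $\bar{y}_i$ is the stage-$i$ sample mean, consistent with the score being a function of the stage-wise sufficient statistics.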
3 Stable and Mixing Convergence
3.1 Motivation and Definitions
Let and be real random variables defined on some probability space , and let be a subfield. Given a sequence of random variables , suppose one wants to obtain the limiting distribution of the product of . If converges in probability to a constant , and converges in distribution to , then by the Cramér-Slutzky theorem. However, Lane and Flournoy showed for model (1) that if is small (and provided common regularity conditions with ), then
where for every , where ; and is independent of and . Since Equation (2) holds for all , it holds in the limit as with fixed. That is, as with fixed and independent of . But is a random function of that does not converge to a constant when is held fixed. So one cannot divide both sides of Equation (2) by and apply the classical Cramér-Slutzky theorem to obtain a limit.
To obtain a standard normal limit instead of the normal mixture in Equation (2) requires a generalized version of the Cramér-Slutzky theorem, which is given in Lemma 4.2 below. The generalized Cramér-Slutzky theorem requires the concepts of stable and mixing convergence, which were introduced by Rényi . So before proceeding, we recall these concepts. A thorough description of stable and mixing convergence can be found in Häusler and Luschgy .
Let denote the conditional probability given the event . We say that converges stably to as if
Stable convergence is stronger than convergence in distribution, but not as strong as convergence in probability. If is independent of , then the limit is said to be mixing.
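For completeness, Rényi's definition can be written as follows (a standard statement, with $\mathcal{G}$ denoting the conditioning sub-$\sigma$-field assumed above):

```latex
Z_n \to Z \ \text{(stably)} \iff
P(Z_n \le z \mid B) \to P(Z \le z \mid B)
\quad \text{for every } B \in \mathcal{G} \text{ with } P(B) > 0,
```

at every continuity point $z$ of the limit; the convergence is mixing when $P(Z \le z \mid B) = P(Z \le z)$ does not depend on $B$.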
3.2 Stable Convergence Under Model (1)
If and under model (1), stably with independent of as while is fixed.
4 Standard normal limits with random norming
4.1 Random Norms and Their Limits under Model (1)
Barndorff-Nielsen and Sørensen describe random measures of information that can be used as norms for estimators and test statistics, and that sometimes yield a more useful limit (e.g., standard normal) for MLEs. Following Barndorff-Nielsen and Sørensen, we call them the observed, incremental observed, and incremental expected information measures. In the two-stage setting, it makes sense to define increments in the log-likelihood not only between individual subjects but also between stages, because sufficient statistics are stage-wise data summaries. We examine both.
First we formally define these, together with the expected (Fisher) information, and then we evaluate them under model (1):
The observed information is the negative derivative of the score function:
Barndorff-Nielsen and Sørensen  and others have considered the observed information to be a standard with which the other information measures are compared.
The Fisher information
is the variance of the score function. Assuming the integral and derivatives exist and are interchangeable, it is given by
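The display referred to here was lost in extraction; the standard identity it describes, under the interchangeability assumption just stated, is:

```latex
I(\theta) = \operatorname{Var}_\theta S(\theta)
          = E_\theta\!\left[ S(\theta)^2 \right]
          = -E_\theta\!\left[ \frac{\partial^2 \ell(\theta)}{\partial \theta^2} \right],
```

i.e., the variance of the score equals the expected negative second derivative of the log-likelihood (the expected observed information).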
Efron and Hinkley studied the trade-off between the observed and expected (Fisher) information. They argue for using the observed information for data analysis after a study is completed, and they express a preference for using the expected information to design an experiment. Barndorff-Nielsen and Sørensen state that the difference (between the observed and expected information) is due, essentially, to the high content of ancillary information carried by the observed information. Pierce and Firth showed that the observed information is larger than the Fisher information by an amount .
To define the incremental information in general, suppose a study is conducted in stages with subjects in each stage, . Then the log-likelihood can be written in increments as where is the th subject-wise increment and is the th stage-wise increment with .
The incremental expected information was introduced as the conditional variance by Lévy and Borel in an early version of the martingale central limit theorem. Let denote the history of the experiment up through the trial for subject , ; and let be the trivial σ-field. Then is a filtration of , i.e., . Using subject-wise and stage-wise increments in , we obtain the subject-wise and stage-wise incremental norms:
The incremental expected information is also called the quadratic characteristic of the score martingale.
The incremental observed information is given by
In the terminology of martingale theory, it is called the quadratic variation of the score martingale [e.g., Barndorff-Nielsen and Sørensen] and the squared variation [e.g., Hall and Heyde]. Barndorff-Nielsen and Sørensen show that use of the incremental observed information may improve the robustness of estimators.
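In the standard martingale notation (with $S_j$ the score after $j$ increments and $\mathcal{F}_j$ the filtration defined above), the two quantities just named are:

```latex
\langle S \rangle_n
  = \sum_{j=1}^{n} E\!\left[ (S_j - S_{j-1})^2 \,\middle|\, \mathcal{F}_{j-1} \right]
  \quad \text{(quadratic characteristic; incremental expected information)},
\qquad
[S]_n = \sum_{j=1}^{n} (S_j - S_{j-1})^2
  \quad \text{(quadratic variation; incremental observed information)}.
```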
It is common for the random information measures to converge to the Fisher information. However, there can be substantial differences with small sample sizes. Note that only observed and expected information are defined solely in terms of the likelihood function and its distribution law. The incremental observed and expected information require knowledge of how the log-likelihood function increases from one subject or one stage to the next.
We now evaluate the random information norms that we will use to obtain standard normal limits for . Under model (1), with , the observed information is
The subject-wise and stage-wise incremental observed information are, respectively,
The subject-wise and stage-wise incremental expected information are the same:
Lemma 4.1 provides convergence results for the random normings that are then used to obtain the desired standard normal limit for .
Under model (1), if and , as with fixed,
where and .
The first term of equation (5) tends to when divided by . In the second term, by the weak law of large numbers,
As , and , , and .
The first term of equation (6) goes to when divided by . In the second term, is distributed as for every , so as . And is independent of . Therefore,
where as and .
4.2 The Generalized Cramér-Slutzky theorem and Its Application
Now we introduce the generalized Cramér-Slutzky theorem in order to obtain the main theoretical results in Theorem 4.3, that is, standard normal limits for using random norms. According to Lemma 4.1, the observed information , the stage-wise and subject-wise incremental expected information , , and the subject-wise incremental observed information can be applied to normalize the MLE via the generalized Cramér-Slutzky theorem, while the stage-wise incremental observed information cannot.
The Generalized Cramér-Slutzky Theorem  Suppose that . Let be a continuous function of two variables, and suppose , where is a -measurable random variable. Then
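A standard formulation of this result (as in Häusler and Luschgy, cited above; written here with assumed symbols $Z_n$, $V_n$, $\mathcal{G}$) is:

```latex
\text{If } Z_n \to Z \ (\mathcal{G}\text{-stably}), \quad
V_n \xrightarrow{\,P\,} V \ \text{with } V \ \mathcal{G}\text{-measurable,} \quad
g \text{ continuous,}
\quad \text{then} \quad
g(Z_n, V_n) \to g(Z, V) \ (\mathcal{G}\text{-stably}).
```

This is exactly what the classical Cramér-Slutzky theorem cannot deliver here: the norming variable converges to a random limit rather than a constant, and stability of the convergence of $Z_n$ is what lets the two sequences be combined.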
Under model (1),
as with fixed.
Defining , is a continuous function of two variables when . Let and . Then and . Because , . Now by Lemma 4.2,
Since is independent of ,
5 Adaptive Optimal Design Examples
In this section, we apply Theorem 4.3
to normalize MLEs following an adaptive optimal design under logistic and exponential (location and scale) regression models. Then we compare their tail probabilities and the difference between cumulative distribution functions using random norms and the Fisher information. For all models, the dose in the first stage is fixed at, while the dose for stage 2 is selected from the range based on stage 1 data. The divergence of the MLE of to infinity necessitates restricting the search to some finite interval ; for simplicity throughout this section, we assume . All simulations assume the true parameter and known variance .
The stage-two dose that maximizes the increase in information on the unknown parameter is
The two-stage adaptive optimal design is , where is selected adaptively as given by (9), i.e.,
and . For each model, we evaluate the MLE norms’ performance for several fixed values of , including a locally optimal stage 1 sample size :
where the notation makes Fisher information’s dependence on the design explicit. To provide an ideal benchmark, is evaluated at the true value of for all models. A practical method to approximate the locally optimal stage 1 sample size is discussed by Lane et al. .
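The setup above can be illustrated by a small simulation. The sketch below is our own illustration, not the paper's exact design: it assumes an exponential mean function $\eta(x,\theta)=e^{\theta x}$, known error variance, specific doses and sample sizes, and a simple clipped adaptive rule for the stage-2 dose. It checks that the MLE normed by the (square root of the) incremental expected information evaluated at the MLE behaves approximately like a standard normal when the stage-2 sample size is large.

```python
import math
import random

random.seed(0)

# Illustrative constants (assumptions for this sketch, not the paper's design):
THETA0, SIGMA = 1.0, 1.0   # true parameter and known error s.d.
N1, N2 = 10, 200           # small fixed stage-1 size, large stage-2 size
X1 = 1.0                   # fixed stage-1 dose

def eta(x, th):
    return math.exp(th * x)

def deta(x, th):            # d(eta)/d(theta)
    return x * math.exp(th * x)

def score(th, data):
    """Score of the normal likelihood; data = [(dose, n, ybar), ...]."""
    return sum(n * deta(x, th) * (yb - eta(x, th)) for x, n, yb in data) / SIGMA**2

def mle(data, lo=-3.0, hi=3.0):
    """Bisection on the score over a finite interval, as the text suggests
    restricting the search; adequate near the truth for this sketch."""
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if score(mid, data) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def one_trial():
    ybar1 = sum(eta(X1, THETA0) + random.gauss(0, SIGMA) for _ in range(N1)) / N1
    th1 = mle([(X1, N1, ybar1)])
    # Illustrative adaptive stage-2 dose, clipped to a finite dose range:
    x2 = min(max(1.0 / max(th1, 0.2), 0.5), 2.0)
    ybar2 = sum(eta(x2, THETA0) + random.gauss(0, SIGMA) for _ in range(N2)) / N2
    th = mle([(X1, N1, ybar1), (x2, N2, ybar2)])
    # Incremental expected information, evaluated at the MLE (a random norm):
    info = (N1 * deta(X1, th)**2 + N2 * deta(x2, th)**2) / SIGMA**2
    return math.sqrt(info) * (th - THETA0)

zs = [one_trial() for _ in range(2000)]
cover = sum(abs(z) <= 1.96 for z in zs) / len(zs)
print(f"coverage of nominal 95% interval: {cover:.3f}")  # typically close to 0.95
```

Because the norm is random (it depends on the adaptively chosen stage-2 dose), the classical Cramér-Slutzky theorem does not apply; the near-nominal coverage illustrates the stable-convergence argument of Theorem 4.3.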
5.1 Logistic Regression Models
We explore the sample size needed to obtain normal tail probabilities for the location-parameter and scale-parameter logistic regression models separately.
5.1.1 The Logistic-Location Model
Consider the Logistic-Location Model with independent normal errors:
Maximizing the first-stage likelihood function,
yields the MLE:
Adaptively selecting the second-stage dose to be
the likelihood given data from both stages is
and the MLE based on all data is
where maximizes . The average Fisher information given data from both stages is
where and are the probabilities that falls on the boundaries and , respectively.
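The displays in this subsection were lost in extraction. One common logistic-location specification consistent with the text (an assumption on our part, not the paper's verbatim form) is:

```latex
\eta(x, \theta) = \frac{1}{1 + e^{-(x - \theta)}},
\qquad
\dot{\eta}(x, \theta) = \frac{\partial \eta}{\partial \theta}
  = -\,\eta(x,\theta)\bigl(1 - \eta(x,\theta)\bigr),
```

so that, under this assumed form, the per-observation Fisher information at dose $x$ is $\eta^2 (1-\eta)^2 / \sigma^2$, which is maximized where $\eta = 1/2$, i.e., at $x = \theta$. This is why the adaptive rule pushes the stage-2 dose toward the current estimate of the location parameter, subject to the clipping at the boundaries of the finite dose range, whose probabilities appear in the averaged Fisher information above.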