## 1 Introduction

It is well known that a carefully designed experiment can improve the statistical inference in regression analysis substantially. Optimal design of experiments is the more efficient the more knowledge about the underlying regression model is available and an impressive theory has been developed to construct optimal designs under the assumption of a “given” regression model [see, for example, the monographs of Pukelsheim (2006), Atkinson et al. (2007) and Fedorov and Leonov (2013)]. On the other hand, model selection is an important step in any data analysis and these references also partially discuss the problem of constructing efficient designs to address model uncertainty in the design of experiment. Because of its importance this problem has a long history. Early work dates back to Box and Hill (1967); Stigler (1971); Atkinson and Fedorov (1975) who determined optimal designs for model discrimination by - roughly speaking - maximizing the power of a test between competing regression models [see also Ucinski and Bogacka (2005); López-Fidalgo et al. (2007); Wiens (2009); Dette and Titoff (2009) or Tommasi and López-Fidalgo (2010) for some more recent references]. A different line of research in this context was initiated by Läuter (1974) who proposed a criterion based on a product of the determinants of the information matrices in the various models under consideration, which yields efficient designs for all models under consideration. This criterion has been used successfully by Dette (1990) to determine efficient designs for a class of polynomial regression models and by Biedermann et al. (2006) to construct efficient designs for binary response models, when there is uncertainty about the form of the link function. As these criteria do not reflect model discrimination, Zen and Tsai (2002); Atkinson (2008); Tommasi (2009) considered a mixture of Läuter-type and discrimination criteria to construct efficient designs for model discrimination and parameter estimation. An alternative concept to robust designs with respect to misspecified models consists in the minimization of the maximal mean squared error calculated over a class of misspecified models with respect to the design under consideration [see Wiens (2015) for an overview]. Several authors have worked on this problem and we mention exemplarily Wiens and Xu (2008) who derive robust prediction and extrapolation designs or Konstantinou et al. (2017) who analyze robust designs under local alternatives for survival trials. This list of references is by no means complete and there exist many more papers on this subject. However, a common feature in most of the literature consists in the fact that either (at least implicitly) the designs are constructed under the assumption that model selection is performed by hypotheses testing or designs are determined with “good” properties for a class of competing models.

On the other hand there exists an enormous amount of literature to perform statistical inference under model uncertainty, which - to our best knowledge - has not been discussed in the context of optimal experimental design. One possibility is to select an adequate model from a set of candidate models and numerous model selection criteria have been developed for this purpose [see monographs of Burnham and Anderson (2002), Konishi and Kitagawa (2008) and Claeskens and Hjort (2008) among others]. These procedures are widely used and have the advantage to deliver a single model for the statistical analysis, which make them very attractive for practitoners. However, there exists a well known post-selection problem in this approach because estimators chosen after model selection behave usually like mixtures of many potential estimators. For example, if

is a parameter of interest in a regression model (such as a prediction at a particular point, the area under the curve or a specific quantile of the regression model) it is known that selecting a single model and ignoring the uncertainty resulting from the selection process may give confidence intervals for

with coverage probability smaller than the nominal value, see for example Chapter 7 in

Claeskens and Hjort (2008) for a mathematical treatment or Bornkamp (2015) for a high-level discussion of this phenomenon.As an alternative several authors proposed to smooth estimators for the parameter across several models, rather than choosing a specific model from the class under consideration and performing the estimation in the selected model. This approach takes the additional estimator variability caused by model uncertainty adequately into account and has been discussed intensively in the Bayesian community, where it is known as “Bayesian model averaging” [see the tutorial of Hoeting et al. (1999) among many others]. Hjort and Claeskens (2003)

pointed out several problems with this approach. In particular, they mentioned the difficulties to specify prior probabilities for a class of models and the problem of mixing together many conflicting prior opinions in the statistical analysis. As an alternative these authors proposed a non-Bayesian approach, which they call “frequentist model averaging” and developed asymptotic theory for their method. There exists evidence that model averaging improves the estimation efficiency [see

Breiman (1996) or Raftery and Zheng (2003)], and recently, Schorning et al. (2016) demonstrated the superiority of model averaging about estimation after model selection by an information criterion in the context of dose response models. These results have recently been confirmed by Aoki et al. (2017) and Buatois et al. (2018) in the context of nonlinear mixed effect models.The present paper is devoted to the construction of optimal designs if parameters of interest are estimated under model uncertainty via frequentist model averaging. Section 2 gives a brief review on model averaging and states the asymptotic properties of this approach under local alternatives. The asymptotic properties are used in Section 3 to define new optimality criteria, which directly reflect the goal of model averaging. Roughly speaking, an optimal design for model averaging estimation minimizes the asymptotic mean squared error of the model averaging estimator under local alternatives. In Section 4 we present a numerical study comparing the optimal designs for model averaging estimation with commonly used designs and demonstrate that the new designs yield substantially more precise estimates. Further simulation results which demonstrate that our findings are representative can be found in Section 6.2. Finally, the proofs of the theoretical results are given in Section 6.1.

## 2 Model Averaging under local misspecification

Model averaging is a common technique to estimate a parameter of interest, say , under model uncertainty. Roughly speaking this estimate is a weighted average of the estimates in the competing models under consideration, where different choices for the weights have been proposed in the literature [see for example Wassermann (2000) or Hansen (2007) for Bayesian and non-Bayesian model averaging methods]. In this section we briefly describe this concept and the corresponding asymptotic theory in the present context, such that the results can be used to construct optimal designs for model averaging estimation. The results follow more or less from the statements in Hjort and Claeskens (2003) and Claeskens and Hjort (2008) and - although we use a slightly different notation - any details regarding their derivation are omitted for the sake of brevity.

We assume that different experimental conditions, say , are chosen in the design space , and that at each experimental condition one can observe responses, say . We also assume that for each the responses at experimental condition

are realizations of independent identically (real valued) random variables

with unknown density . Therefore the total sample size is given by and the experimental design problem consists in the choice of (number of different experimental conditions), (the experimental conditions) and the choice of (the numbers of observations taken at each ), such that the model averaging estimate is most efficient.To measure efficiency and to compare different experimental designs we will use asymptotic arguments and consider the case for . As common in optimal design theory we collect this information in the matrix

(2.1) |

Following Claeskens and Hjort (2008) we assume that is contained in a set, say , of parametric candidate densities which is constructed as follows. The first candidate density in is given by a parametric density , where the form of is assumed to be known, and denote the unknown parameters, which vary in a compact parameter space, say . The second candidate density is given by the parametric density , which is obtained by fixing the parameter value to a pre-specified (known) value . Throughout the paper, we will call the wide density (model) and the narrow density (model), respectively. Additional candidate models are obtained by choosing certain submodels between the wide density and the narrow density . More precisely, for a chosen subset of indices with cardinality , we introduce the projection matrices and which map a

-dimensional vector to its components corresponding to indices in

and , respectively. Using the abbreviations and , we define the candidate density by(2.2) |

Consequently, for the density the components of with indices in are fixed to the corresponding components of , while the components with indices in are considered as unknown parameters. Note that , and that in the most general case there are possible candidate densities. As we might not be interested in all possible submodels we assume that the competing models are defined by different sets (for some ). Thus the class of candidate models is given by

(2.3) |

Following Hjort and Claeskens (2003), we consider local deviations throughout the paper and assume that the “true” density is given by

(2.4) |

where the “true” parameter values are given by and . Note that the “true” density is given by the wide density with a varying value of which differs from through the perturbation term . Thus, for tending to infinity, it approximates the narrow density .

###### Example 2.1

Consider the case, where

is a normal density with variance

and mean function(2.5) |

where , and the explanatory variable varies in an interval, say . This model is the well known sigmoid Emax model and has numerous applications in modelling the dependence of biochemical or pharmacological responses on concentration [see Goutelle et al. (2008) for an overview]. The sigmoid Emax model is especially popular for describing dose-response relationships in drug development [see MacDougall (2006) among many others]. The parameters in (2.5) have a concrete interpretation: is used to model a Placebo-effect, denotes the maximum effect of (relative to placebo) and is the value of which produces half of the maximum effect. The so-called Hill parameter characterizes the slope of the mean function . The parameter is included in every candidate model, whereas for the narrow model the components are fixed as . Consequently, the narrow candidate model is a normal density with mean

(2.6) |

and variance . In this case, is the frequently used Michaelis Menten function, which is widely utilized to represent an enzyme kinetics reaction, where enzymes bind substrates and turn them into products [see, for example, Cornish-Bowden (2012)]. The two other candidate models between are obtained by either fixing or and the corresponding densities are normal densities with mean functions

(2.7) |

respectively. The latter model is the well known Emax model which is sometimes also referred to as the hyperbolic Emax model [see Holford and Sheiner (1981) and MacDougall (2006) among others]. Finally, under the local misspecification assumption (2.4) the true density

corresponds to a normal distribution with mean

and variance . Typical functionals of interest are the area under the curve (AUC)

(2.8) |

calculated for a given region or, for a given , the “quantile” defined by

(2.9) |

The value defined in (2.9) is well-known as , that is, the effective dose at which of the maximum effect is achieved [see MacDougall (2006) or Bretz et al. (2008)].

As pointed out at the end of Example 2.1 we are interested in the estimation of a quantity, say , where is a differentiable function of the parameter . For this purpose we fix one model in the set of candidate models defined in (2.3) and use the estimator , where is the maximum-likelihood estimator in model . Under the assumption (2.4) of a local misspecification and the common conditions of regularity [see, for example, Lehmann and Casella (1998), Chapter 6] it can be shown by adapting the arguments in Hjort and Claeskens (2003) and Claeskens and Hjort (2008) to the current situation that the resulting estimator satisfies

(2.10) |

Here denotes weak convergence and is a normal distribution with variance

(2.11) |

where is the gradient of with respect to , that is,

(2.12) |

and the information matrix in the candidate model , that is

(2.13) |

The mean in (2.10) is of the form

where

is the gradient (with respect to the parameters) in the wide model, the matrix is defined by

(2.14) |

the matrices and are given by (2.13) and

(2.15) |

respectively, and denotes the information matrix in the wide model .

The frequentist model averaging estimator is now defined by assigning weights , with , to the different candidate models and defining

(2.16) |

where are the estimators in the different candidate models . The asymptotic behaviour of the model averaging estimator can be derived from Hjort and Claeskens (2003). In particular, it can be shown that under assumption (2.4) and the standard regularity conditions a standardized version of is asymptotically normally distributed, that is

(2.17) |

Here the mean and variance are given by

(2.18) | |||||

(2.19) |

respectively, is the information matrix of the wide model and the vector is given by

(2.20) |

where we used the notation of , and introduced (2.12), (2.13) and (2.15). The optimal design criterion for model averaging, which will be proposed in this paper, is based on an asymptotic representation of the mean squared error of the estimate derived from (2.17) and will be carefully defined in the following section.

## 3 An optimality criterion for model averaging estimation

Following Kiefer (1974) we call the quantity in (2.1) an approximate design on the design space . This means that the support points define the distinct dose levels where observations are to be taken and the weights represent the relative proportion of responses at the corresponding support point (). For an approximate design and given total sample size a rounding procedure is applied to obtain integers ( from the not necessarily integer valued quantities [see, for example Pukelsheim (2006), Chapter 12], which define the number of observations at ().

If the observations are taken according to an approximate design and an appropriate rounding procedure is used such that as , the asymptotic mean squared error of the model averaging estimate of the parameter of interest can be obtained from the discussion in Section 2, that is

(3.1) |

where the variance and the bias are defined in equations (2.18) and (2.19), respectively. Consequently, a “good” design for the model averaging estimate (2.16) should give “small” values of . Therefore, for a given finite set of candidate models of the form (2.2) and weights a design is called locally optimal design for model averaging estimation of the parameter , if it minimizes the function in (3.1) in the class of all approximate designs on . Here the term “locally” refers to the seminal paper of Chernoff (1953) on optimal designs for nonlinear regression models.

Locally model averaging optimal designs address uncertainty only with respect to the model but require prior information for the parameters
and . While such knowledge might be available in some circumstances [see, for example, Dette et al. (2008) or Bretz et al. (2010)]
sophisticated design strategies have been proposed in the literature, which require less precise knowledge about the model parameters,
such as sequential, Bayesian or standardized maximin optimality criteria [see Pronzato and Walter (1985); Chaloner and Verdinelli (1995) and
Dette (1997) among others]. Any of these methodologies can be used to construct efficient
robust designs for model averaging and for the sake of brevity we restrict ourselves to Bayesian optimality criteria.

Here we address the uncertainty with respect to the unknown model parameters by a prior distribution, say , on
and call a design Bayesian optimal for model averaging estimation of the parameter with respect to the prior if it
minimizes the function

(3.2) |

where the function is defined in (3.1) (we assume throughout this paper that the integral exists).

Locally and Bayesian optimal designs for model averaging have to be calculated numerically in all cases of interest, and we present several examples in Section 4. Next, we state necessary conditions for - and - optimality. The proofs are given in the Section 6.1.

###### Theorem 3.1

If the design is a locally optimal design for model averaging estimation of the parameter , then the inequality

(3.3) |

holds for all , where is defined by (2.18) and the functions and are given by

where the vector is defined by (2.20), the vector by

and the information matrix by (2.13), respectively. The design denotes the Dirac measure at the point .

Moreover, there is equality in (3.3) for every point in the support of .

###### Theorem 3.2

The derived conditions of Theorem 3.1 and Theorem 3.2 can be used in the following way: If a numerically calculated design does not satisfy inequality (3.3), it will not be locally optimal for model averaging estimation of the parameter and the search for the optimal design has to be continued. Note that the functions and are not convex and therefore sufficient conditions for optimality are not available.

###### Remark 3.1

Note that Hjort and Claeskens (2003) also consider model averaging using random weights depending on the data in the definition of the estimator in (2.16). Typical examples are smooth AIC-weights

(3.7) |

which are based on the AIC-scores

where denotes the log-likelihood function of model evaluated in the maximum likelihood estimator and is the number of parameters to be estimated in model () [see Claeskens and Hjort (2008), Chapter 2]. Moreover, the estimator of the target which is based on model selection by AIC can also be rewritten in terms of a model averaging estimator by using random weights of the form

(3.8) |

where is the indicator function of the set and denotes the model with the greatest AIC-score among the candidate models. For further choices of model averaging weights see for example Buckland et al. (1997), Hjort and Claeskens (2003) or Hansen (2007). In general, the case of random weights in model averaging estimation is more difficult to handle and in general the asymptotic distribution is not normal [see Claeskens and Hjort (2008), p. 196]. As a consequence an explicit calculation of the asymptotic bias and variance is not available.

From the design perspective it therefore seems to be reasonable to consider the case of fixed weights, for which the asymptotic properties of model averaging estimation under local misspecification are well understood and determine efficient designs for this estimation technique. Moreover, we also demonstrate in Section 4 and in the appendix (see Section 6) that model averaging estimation with fixed weights often shows a better performance than model averaging with smooth AIC-weights and that the optimal designs derived under the assumption of fixed weights also improve the current state of the art for model averaging using random weights.

## 4 Optimal designs for model averaging

In this section, we investigate the performance of optimal designs for model averaging estimation of a parameter . For this purpose, we consider several examples from the literature, and compare the Bayesian optimal designs for model averaging estimation of a parameter with commonly used uniform designs by means of a simulation study. Thoughout this paper we use the cobyla algorithm for the minimization of the criterion defined in (3.2) [see Powell (1994) for details].

### 4.1 Estimation of the ED in the sigmoid Emax model

We consider the situation introduced in Example 2.1, where
the underlying density is a normal distribution with variance and different regression functions
are under consideration for the mean. More precisely, the set contains candidate models
which are defined by the different mean functions (2.5), (2.6) and (2.7), respectively.
The parameter of interest is the defined in (2.9),
which is estimated by an appropriate model averaging estimator.
The design space is given by the interval and we assume that observations can be taken in the experiment.

We determine a Bayesian optimal design for model averaging estimation of the .
As the Emax model is linear in the
parameters and ,
the optimality criterion does not depend on and and no prior information is required for these parameters.
For the parameters we choose
independent uniform priors
and on the sets and , respectively,
and the variance is fixed as (note that one can choose a prior for as well).
Finally, under the local misspecification assumption we set such that .

We first consider equal weights for the model averaging estimator, that is , .
The Bayesian optimal design for model averaging estimation of the is given by

(4.1) |

and satisfies the necessary condition of optimality in Theorem 3.2 [see the left panel of Figure 1]. Note that the design defined by (4.1) would not be optimal if the inequality was not satisfied.

In order to investigate the properties of the different designs for model averaging estimation we have conducted a simulation study, where we compare the Bayesian optimal design (4.1) for model averaging estimation of the with two uniform designs

(4.2) | |||||

(4.3) |

which are quite popular in the presence of model uncertainty
[see Schorning et al. (2016) and Bornkamp et al. (2007)].
Note that the design is a uniform design with the same number of support points as the optimal design in (4.1), whereas the design is a uniform design with more support points.
Moreover, we also
provide a comparison with two estimators commonly used in practice, namely the model averaging estimator based on smooth AIC-weights defined in (3.7) and the estimator in the model chosen by AIC model selection, which
is obtained as a model averaging estimator (2.16) using the weights in (3.8). For these estimators we also used observations taken according to
the designs , and . As the approximate designs cannot be implemented directly for observations
a rounding procedure [see, for example Pukelsheim (2006), Chapter 12] is applied to determine the number of observations taken at such that we have in total observations. For example, the implemented design obtained from the Bayesian optimal design for model averaging estimation of the uses
, , , and observations at the points
, , , and , respectively, and implementable versions of the designs and are obtained similarly.

All results presented here are based on simulations runs generating in each run observations
of the form

(4.4) |

for the different designs, where the are independent standard normal distributed random variables
and different combinations of the “true” parameter
in (4.4) are investigated whereas is fixed.
In the following discussion we will restrict ourselves to presenting results for the parameters , .
Note that this is the parameter combination under local misspecification assumption for ,
and . Further simulation results for other parameter combinations can be found in Section 6.2.1.

estimation method | |||
---|---|---|---|

design | fixed weights | smooth AIC-weights | model selection |

0.355 | 0.508 | 0.596 | |

0.810 | 0.913 | 1.017 | |

0.637 | 0.846 | 0.994 |

In each simulation run, the parameter is estimated by model averaging using the different designs and the mean squared error is calculated from all simulation runs. More precisely, if is the model averaging estimator for the parameter of interest based on the observations from model (4.4) with the design , its mean squared error is given by

where is the in the
“true” sigmoid Emax model (4.4) with parameters . The simulated mean squared error
of the model averaging estimator with fixed weights , for the different designs , and
is shown in the left column of Table 1. The middle column of this table shows the mean squared error of the model averaging
estimator with the smooth
AIC-weights in (3.7), while the right column gives the corresponding results for the weights in (3.8), that is
estimation of the in the model identified by the AIC for the different designs. The numbers printed in boldface in each column correspond to the smallest mean squared error obtained from the three designs.

We observe that model averaging always yields a smaller mean squared error than estimation in the model identified by the AIC.
For example, if the design is used, the mean squared error of the estimator based on model selection is , whereas it
is and for the model averaging estimator using fixed weights and smooth AIC-weights, respectively (see the first row in
Table 1). The situation for the non-optimal uniform designs is similar.
These results (and also further simulation results presented in Section 6.2.1) coincide with the findings of Schorning et al. (2016),
Aoki et al. (2017) and Buatois et al. (2018)
and indicate that model averaging usually yields more precise estimates of the target than estimators based on model selection.
Moreover, model averaging
estimation with fixed weights shows a substantially better performance than the model averaging
estimator with data driven weights. Note that Wagner and Hlouskova (2015) observed a similar effect in the context
of principal components augmented regressions.

estimation method | |||
---|---|---|---|

design | fixed weights | smooth AIC-weights | model selection |

0.476 | 0.502 | 0.582 | |

0.915 | 0.900 | 1.014 | |

0.869 | 0.949 | 1.067 |

Compared to the uniform designs and the optimal design in (4.1)
yields a reduction of the mean squared error by and for model averaging estimation with fixed weights.
Moreover, this design also reduces the mean squared error of model averaging estimation with smooth AIC-weights (by and ) and
for estimation in the model identified by the AIC (by and ).

As a further example we consider the model averaging estimator (2.16) of the parameter
for the four models in Example 2.1 with non-equal weights, that is
, , and . The Bayesian optimal design for model averaging estimation of the
is then given by

(4.5) |

The necessary condition is depicted in the right panel of Figure 1.
A comparison of the designs and in
(4.1) and (4.5) shows that the support points are similar, but that there
appear substantial differences in the weights.

In the simulation study of this model averaging estimator
we consider the same parameters as in the previous example.
The corresponding results can be found in Table 2 and show a similar but less pronounced picture as for the model averaging estimator
with uniform weights.
Model averaging always shows a better performance than estimation in the model selected by the AIC (improvement between and using fixed weights and between and using smooth AIC-weights). Moreover, for the designs and
we observe an improvement when using fixed weights instead of smooth AIC-weights for model averaging, but
for the design there is in fact no improvement.
A comparison of the results in Table 1 and 2
shows that for all designs non-uniform weights for model averaging estimation yield a larger
mean squared error than uniform weights.

The Bayesian optimal design for model averaging estimation of the improves the designs and by and , respectively, if model averaging with fixed (non-uniform weights)
is used, and by for model averaging estimation with smooth AIC-weights and estimation in the model selected by the AIC.

Simulation results for further parameter combinations in the sigmoid Emax model can be found in Table 5 and 6 in Section 6.2.1.
These results show a very similar picture as described in the previous paragraphs.
We observe that in all considered scenarios model averaging estimation yields a smaller simulated mean squared error than estimation in a model identified by the AIC, independently of the design and parameters under consideration.
Bayesian optimal designs for model averaging estimation of the yield a substantially more precise estimation
than the uniform designs in almost all cases.
We refer to Section 6.2.1 for more details.

### 4.2 Estimation of the AUC in the logistic regression model

In this section we consider the logistic regression model

(4.6) |

which is frequently used in dose-response modeling or modeling population growth [see, for example, Zwietering et al. (1990)]. This means we consider a normal distribution with variance and mean (function) given by (4.6). The design space is given by and we are interested in the estimation of the area under the curve (AUC) defined in (2.8), where the region and the design space coincide. In model (4.6) the value is the Placebo-effect, denotes the maximum effect (relative to placebo) of the drug and is the dose which produces half of the maximum effect. The parameter characterizes the slope of the mean function . We assume that the parameter is included in every candidate model, whereas the components of the parameter can be fixed to the corresponding components of , such that there are competing models in the candidate set , that is

(4.7) | ||||

(4.8) | ||||

(4.9) |

and defined by (4.6). As the parameters and appear linear in the model only the prior distributions for and
have to be specified, which are chosen as independent uniform priors supported on the sets and , respectively. The variance is fixed as and is chosen such that .

The Bayesian optimal design for model averaging estimation of the AUC with equal weights
has been calculated numerically and
is given by

(4.10) |

The performance of the different designs is again evaluated by means of a simulation study generating data from the model

(4.11) |

where are standard normal distributed random variables and observations can be taken. We focus on the case , and which corresponds to a local misspecification, where

Comments

There are no comments yet.