1 Introduction
Suppose an experiment is to be performed to estimate a
vector of unknown parameters from vector of responses , with parameter space and sample space . The responses are obtained via design : an matrix, where is the number of design variables. Once has been observed, analysis will be performed by assuming that is a realization from the joint density and has prior density . We refer to and as the fitted model and fitted prior, respectively, and the resulting posterior as the fitted posterior. The fitted model may depend on additional nuisance parameters but these have been integrated out to obtain and .
Decision-theoretic Bayesian design of experiments starts with the specification of a loss function denoted by
where dependence on is through the fitted posterior. Two exemplar loss functions considered throughout are squared error loss where and is the fitted posterior mean of , and the self information loss .
Traditionally, a Bayesian design minimizes the expected loss (which we refer to as the fitted expected loss) over the space of all possible designs (Chaloner and Verdinelli, 1995)
. The expectation is with respect to the joint distribution of
and implied by the fitted model, i.e. where .
Now suppose that we wish to design the experiment by averaging the loss with respect to a joint density for implied by another model: the designer model. Let denote the joint density of under the designer model where is a vector of unknown parameters with designer prior density . Let so that are parameters common to both models and are parameters present only in the fitted model. Similar to the fitted model, the designer model may depend on additional latent variables but these have been integrated out to obtain and . The designer expected loss is defined as
(1) 
Initially the loss is averaged with respect to the fitted posterior of conditional on , before being averaged with respect to all remaining unknowns ( and ) under the designer model. The initial step is necessary since is absent from the designer model.
There are several reasons why a design may be sought under a different model to the fitted model. The designer model may best represent current scientific knowledge. However, to aid interpretation or purely for convenience, a simpler model will be fitted once the responses are observed. Conversely, the fitted model may be more complex than the designer model. This scenario would fit within the iterative learning framework of Box (1980) whereby, in a sequential approach, the model fitted to data at the current stage (the fitted model) is updated in response to criticism of the model fitted to data at the previous stage (the designer model). Etzioni and Kadane (1993) considered the case where the fitted and designer prior distributions were different but the models (the joint distribution for ) were identical. We consider the case where both prior and model can be different.
In general, it will not be possible to evaluate the designer expected loss (1) in closed form. In recent years, new computational methodology has been developed for approximately minimizing the fitted expected loss (see Ryan et al., 2016, for a recent review), and this can also be applied to approximately minimize the designer expected loss. However, in this paper, we aim to gain understanding of designing under an alternative model by a) considering the linear model (see Section 2), where it can be possible to evaluate the designer expected loss in closed form, and b) developing a large sample approximation (see Section 3) to the designer expected loss which is analogous to that developed for the fitted expected loss (Chaloner and Verdinelli, 1995). Proofs for all results are found in the Supplementary Material.
2 Linear models
2.1 Fitted model
In this section, the fitted model is the linear model
where is the model matrix (a function of the design ) and is the identity matrix. In Section 2.2, we consider the case where the designer model is given by the fitted linear model but where the mean has been contaminated by model discrepancy. In Section 2.3, the designer model is the unit treatment model.
2.2 Model discrepancy
In this section we suppose that is known and therefore drop conditioning on . The designer model is a linear model including a zero mean Gaussian process model discrepancy term (Kennedy and O’Hagan, 2001), i.e.
where is an correlation matrix. The th element of is , where is a correlation function, is the th row of and is a vector of unknown parameters controlling correlation. Now ; all parameters of interest are common to both fitted and designer models. We assume a common prior distribution for , i.e. , so that the fitted and designer prior distributions are the same.
Theorem 1
Under the above fitted and designer models, the designer expected squared error and self information losses are
(2)  
(3) 
respectively, where is a constant not depending on design ,
is proportional to the fitted posterior variance of
and with the designer prior mean of .
Compare expressions (2) and (3) to the corresponding expressions for the fitted expected loss
respectively. The sandwich variance term in the designer expected squared error loss is analogous to the quantity which appears when one performs inference under an unknown alternative model (e.g., Davison, 2003, pages 147–148). This idea is investigated further in Section 3.
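The sandwich structure can be illustrated numerically. The following is a minimal sketch and not the expression from Theorem 1: it assumes a flat prior, so that the fitted posterior mean coincides with the least squares estimator, and it takes the covariance matrix of the responses under the alternative model as a given input. The function name `sandwich_variance`, the four-run design and the equicorrelated covariance are hypothetical choices made only for illustration.

```python
import numpy as np

def sandwich_variance(X, V):
    """Variance of the least squares estimator (X'X)^{-1} X'y when Var(y) = V."""
    bread = np.linalg.inv(X.T @ X)   # what the fitted model reports (up to scale)
    meat = X.T @ V @ X               # variability actually present in the responses
    return bread @ meat @ bread

# Hypothetical design: 4 runs, intercept plus one design variable.
x = np.array([-1.0, -0.5, 0.5, 1.0])
X = np.column_stack([np.ones_like(x), x])

# Fitted-model assumption: independent responses with unit variance.
V_indep = np.eye(4)
# Hypothetical alternative: equicorrelated responses with correlation 0.5.
r = 0.5
V_dep = (1.0 - r) * np.eye(4) + r * np.ones((4, 4))

naive = np.linalg.inv(X.T @ X)
sand_indep = sandwich_variance(X, V_indep)
sand_dep = sandwich_variance(X, V_dep)

print("naive trace:", np.trace(naive))
print("sandwich trace, independent responses:", np.trace(sand_indep))
print("sandwich trace, correlated responses:", np.trace(sand_dep))
```

When the responses really are independent, the sandwich collapses to the usual inverse of X'X; under correlation the two differ, which is why a design that is good under the fitted model need not be good under the alternative.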
To demonstrate the difference between designs found under designer and fitted expected loss, consider the following example. Suppose there is design variable and the experiment has runs. For , let be the th design variable and suppose the th row of is . We assume a squared exponential correlation function, , where . For the scalar
, a Gamma distribution designer prior is assumed, i.e.,
, where and are known. This implies that the th element of is . Finally, a non-informative prior is assumed for , i.e. .
Designs are found under both designer expected squared error and self information loss for different values of and . The values of and are chosen so that and varies between 0 and 500. As increases, the correlation between elements of decreases, leading to independent normal random errors, i.e. no systematic model discrepancy. Without loss of generality, the designs found have the following structure . Designs that minimize the fitted expected squared error and self information losses when are referred to as A- and D-optimal, respectively, and both have . Figure 1 shows a plot of against for the designs found by minimizing the designer expected squared error and self information loss. As expected, for both squared error and self information loss, as increases, , the value of for no systematic model discrepancy, i.e., the A- and D-optimal designs, respectively.
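The limiting behaviour described above, where the discrepancy becomes indistinguishable from independent errors, can be checked with a short sketch. The parametrisation exp(-rho * (x_i - x_j)^2) is an assumption: it is one common form of the squared exponential correlation function and is consistent with correlation decreasing as the parameter increases. The names `sq_exp_corr` and `rho`, and the three design points, are hypothetical.

```python
import numpy as np

def sq_exp_corr(x, rho):
    """Squared exponential correlation matrix, assuming the form exp(-rho * d^2)."""
    x = np.asarray(x, dtype=float)
    diff = x[:, None] - x[None, :]   # pairwise differences between design points
    return np.exp(-rho * diff ** 2)

# Hypothetical one-variable design on [-1, 1].
x = np.array([-1.0, 0.0, 1.0])
off_diagonal = ~np.eye(len(x), dtype=bool)

largest_off_diagonal = []
for rho in (0.1, 1.0, 10.0, 500.0):
    K = sq_exp_corr(x, rho)
    largest_off_diagonal.append(K[off_diagonal].max())
    print(f"rho = {rho:6.1f}: largest off-diagonal correlation = {largest_off_diagonal[-1]:.3e}")
```

For large values of the correlation parameter the matrix is numerically the identity, so the discrepancy term behaves like additional independent noise.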
2.3 Unit treatment designer model
For the fitted model, assume an inverse gamma prior distribution for , i.e., . Suppose the designer model is the unit treatment model, where experimental runs with the same design variables, i.e. , have the same mean response. Specifically,
where is a vector of unknown treatment effects and the designer model matrix is a function of . Thus , and there are no parameters of interest common to fitted and designer models.
Lemma 1
Under the above fitted and designer models, the fitted posterior expectation of the squared error and self information losses are
(4)  
(5) 
respectively, where is a constant which does not depend on the design and with .
Theorem 2
Under the above fitted and designer models, the designer expected squared error and self information losses are
(6)  
(7) 
respectively, where and .
The expectation of the term in (6) with respect to the marginal distribution of under the designer model is not available in closed form. In the example that follows, we use a delta method approximation where .
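To give a feel for how a delta method approximation of this kind behaves, the sketch below replaces the expectation of a logarithm by the logarithm of the expectation, which is the first-order form of the delta method. Whether this matches the exact expression used in the paper is an assumption, and the gamma-distributed variable `z` is a hypothetical stand-in for the quantity whose expectation is unavailable in closed form.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical positive random variable standing in for the intractable term.
shape, scale = 20.0, 0.5
z = rng.gamma(shape, scale, size=200_000)

# Monte Carlo estimate of E[log Z], used here as the reference value.
mc_estimate = np.log(z).mean()

# First-order delta method: E[log Z] is approximated by log E[Z].
delta_estimate = np.log(shape * scale)

gap = delta_estimate - mc_estimate
print(f"Monte Carlo E[log Z]   : {mc_estimate:.4f}")
print(f"delta method log E[Z]  : {delta_estimate:.4f}")
print(f"difference             : {gap:.4f}")
```

By Jensen's inequality the first-order approximation overstates the expectation of the logarithm, and the overstatement shrinks as the variable becomes more concentrated around its mean.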
Compare expressions (6) and (7) for the designer expected squared error and self information losses, respectively, to the corresponding expressions for the fitted expected loss
where is the digamma function. The difference lies in the expectation of and in (4) and (5), with respect to the marginal distribution of under the designer and fitted models, respectively. The term summarizes lack of fit (O’Hagan and Forster, 2004, page 319) of the fitted model so it is natural that the expectation of this quantity (or a function thereof) drives the difference between designer and fitted expected losses.
To demonstrate this difference, we consider Example 1 from Gilmour and Trinca (2012) involving an experiment with runs and design variables. The fitted model is a second-order model including an intercept, three first-order terms, three quadratic terms and three pairwise interactions, i.e. . For the fitted model, we assume a non-informative improper prior for , i.e., , , and . For the unit treatment model, we assume that . We do, however, need to choose a positive-definite prior scale matrix for the designer expected loss to exist. We choose the unit information specification (Smith and Spiegelhalter, 1980) which is commonly used to represent prior ignorance but still leads to a proper prior. Under this prior, .
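The second-order fitted model described above has ten terms when there are three design variables. A minimal sketch of how its model matrix could be assembled from a design is given below; the function name `second_order_model_matrix`, the column ordering and the four-run design are hypothetical and chosen only for illustration.

```python
import numpy as np
from itertools import combinations

def second_order_model_matrix(D):
    """Columns: intercept, first-order terms, quadratic terms, pairwise interactions."""
    D = np.asarray(D, dtype=float)
    n, k = D.shape
    cols = [np.ones(n)]                                     # intercept
    cols += [D[:, j] for j in range(k)]                     # first-order terms
    cols += [D[:, j] ** 2 for j in range(k)]                # quadratic terms
    cols += [D[:, i] * D[:, j]
             for i, j in combinations(range(k), 2)]         # pairwise interactions
    return np.column_stack(cols)

# Hypothetical design with 4 runs and 3 design variables.
D = np.array([[-1, -1, -1],
              [ 1,  0, -1],
              [ 0,  1,  1],
              [ 1,  1,  0]])
X = second_order_model_matrix(D)
print(X.shape)
print(X[1])
```

With three design variables the matrix has 1 + 3 + 3 + 3 = 10 columns, matching the terms listed in the text.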
Minimizing the designer expected squared error loss is equivalent (dropping constants that do not depend on design ) to minimizing
(8) 
where and . Similarly, minimizing the delta method approximate designer expected self information loss is equivalent to minimizing
(9) 
Designs, referred to as AD-optimal and DD-optimal, are found under loss functions (8) and (9), respectively. Additionally, A- and D-optimal designs are found, equivalent to minimizing, respectively
Table 1 shows efficiencies for the four designs found. AD- and DD-efficiency of a design are
where and are the AD- and DD-optimal designs, respectively. Similar expressions are used for A- and D-efficiency.
Table 1: Efficiencies (%) of the four designs.
Design  AD  A  DD  D
AD-optimal  100.0  84.1  95.9  81.2
DD-optimal  74.9  63.0  100.0  84.6
A-optimal  10.2  100.0  9.8  97.1
D-optimal  5.8  83.3  6.9  100.0
Clearly, the A- and D-optimal designs are less robust to the unit treatment model. The A-optimal design has 14 support points (unique design points) compared to 10 for the AD-optimal design. The equivalent values for the D- and DD-optimal designs are 16 and 10, respectively. The difference between the number of runs and the number of support points is known as the pure error degrees of freedom.
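The pure error degrees of freedom of a design can be counted directly from its rows. The sketch below assumes the design is stored as an array with one row per run; the function name `pure_error_df` and the six-run design are hypothetical.

```python
import numpy as np

def pure_error_df(D):
    """Number of runs minus number of support points (unique rows) of the design."""
    D = np.asarray(D)
    n_runs = D.shape[0]
    n_support = np.unique(D, axis=0).shape[0]
    return n_runs - n_support

# Hypothetical design: 6 runs at 3 distinct points (2, 1 and 3 replicates).
D_replicated = np.array([[-1, -1],
                         [-1, -1],
                         [ 1, -1],
                         [ 1,  1],
                         [ 1,  1],
                         [ 1,  1]])
# Hypothetical design with no replication.
D_unreplicated = np.array([[-1, -1], [1, -1], [-1, 1], [1, 1]])

print(pure_error_df(D_replicated))
print(pure_error_df(D_unreplicated))
```

A design with every run at a distinct point has no pure error degrees of freedom, whereas replication trades support points for the ability to estimate the error variance without relying on the fitted model.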
Gilmour and Trinca (2012) advocate finding designs that minimize the variance of an estimator of under the fitted model where is estimated under the unit treatment model. Taking this approach favours designs that have larger pure error degrees of freedom than standard A- or D-optimal designs. Here it is demonstrated that this is also a consequence of a Bayesian approach having designed under the unit treatment model.
3 Large sample approximation
As discussed in Section 1, in general, the designer expected loss is not available in closed form and will require approximation to find a design in practice. In this section, a large sample approximation to the designer expected loss is derived which is analogous to approximations to the fitted expected loss (Chaloner and Verdinelli, 1995). The general form for these approximations is the prior expectation of a functional of the Fisher information. The Fisher information arises due to the following large sample approximation to the fitted posterior distribution, i.e.
(10) 
where is the maximum likelihood estimate of under the fitted model (with the containing set) and
is the Fisher information under the fitted model.
The loss can be approximated by replacing dependence on the fitted posterior by dependence on the approximate fitted posterior (10). First define and to be the values of that minimize the Kullback-Leibler divergence between the fitted model and a) the fitted model having integrated out (the parameters absent from the designer model), and b) the designer model, i.e. and minimize
respectively. Furthermore, define
(11)  
(12)  
(13)  
(14) 
The following result can now be proved.
Theorem 3
A large sample approximation to the designer expected loss is
(15) 
In (15), is the density of where
and
The tractability of the normal distribution means that the inner expectation of the approximate loss with respect to (i.e. ) is often available in closed form. The approximation given by (15) then reduces to the prior expectation of functionals of the Fisher information, and the quantities in (11) to (14). However, the prior of is formed of two components: the distribution of conditional on under the fitted prior and then the distribution of under the designer prior.
Consider the case where , i.e. all parameters of interest are present in both models. Under this scenario the following corollary can be proved.
Corollary 1
Large sample approximations to the inner expectation of the squared error and self information loss with respect to conditional on the designer model and are
(16)  
(17) 
where is a constant not depending on the design . Note the sandwich variance term appearing in the large sample approximation to the expectation of the squared error loss (16). This is exact in the case when the fitted posterior distribution is normal; see Section 2.2.
Acknowledgement
The authors would like to thank Prof Dave Woods for initial discussions and feedback.
Supplementary material
Supplementary material available at the end of the document includes proofs for all results in the manuscript.
Supplementary Material for “Bayesian decision-theoretic design of experiments under an alternative model”
This document includes proofs of results in the main manuscript. Equation numbers with no prefix refer to equations in the main manuscript whereas equation numbers with prefix S refer to equations in this document.
Proof of THEOREM 1
The fitted posterior of is where . Under the designer model, integrating out , gives , where . The designer expected squared error and self information losses, conditional on , are
(S1)  
(S2) 
Since is a linear operator, it is straightforward to take expectations of (S1) and (S2) with respect to the designer prior of resulting in (2) and (3), respectively.
Proof of LEMMA 1
The fitted posterior of is
(S3) 
a multivariate t distribution (e.g., Kotz and Nadarajah, 2004, page 1) with mean , scale matrix , degrees of freedom , and negative log density
(S4) 
where is a constant which does not depend on or . The fitted posterior expectation of the squared error loss (4) immediately follows, noting that the fitted posterior variance of is . The fitted posterior expectation of the self information loss (5) follows from
(e.g., Kotz and Nadarajah, 2004, page 23) where and is the digamma function.
Proof of THEOREM 2
Proof of THEOREM 3
Approximate the loss by replacing dependence on the fitted posterior by dependence on the large sample approximation (10) giving the following approximation to the designer expected loss
The posterior distribution, can be approximated by deriving the conditional distribution of given from (10) using the usual properties of the normal distribution. The key point is that this distribution only depends on through , so we can write
where the last line follows from an application of Bayes’ theorem. Reordering the terms and noting that the expectation with respect to
can be written as expectation with respect to gives
Large sample approximations to the distributions , and are
(S5) 
respectively, where the last two distributions follow from inference results for the wrong model (e.g., Davison, 2003, pages 147–148). The expression in (15) follows.
Proof of COROLLARY 1
References
Box, G. (1980) Sampling and Bayes’ inference in scientific modelling and robustness. J. R. Statist. Soc. A 143, 383–430.
Chaloner, K. and Verdinelli, I. (1995) Bayesian experimental design: a review. Statist. Sci. 10, 273–304.
Davison, A. (2003) Statistical Models. Cambridge University Press, Cambridge.
Etzioni, R. and Kadane, J. (1993) Optimal experimental design for another’s analysis. J. Am. Statist. Assoc. 88, 1404–1411.
Gilmour, S. G. and Trinca, L. (2012) Optimum design of experiments for statistical inference (with discussion). J. R. Statist. Soc. C 61, 345–401.
Kennedy, M. and O’Hagan, A. (2001) Bayesian calibration of computer models (with discussion). J. R. Statist. Soc. B 63, 425–464.
Kotz, S. and Nadarajah, S. (2004) Multivariate t Distributions and their Applications. Cambridge University Press, Cambridge.
O’Hagan, A. and Forster, J. (2004) Kendall’s Advanced Theory of Statistics Volume 2B Bayesian Inference. Wiley, 2nd edn.
Ryan, E., Drovandi, C., McGree, J. and Pettitt, A. (2016) A review of modern computational algorithms for Bayesian optimal design. Int. Statist. Rev. 84, 128–154.
Smith, A. and Spiegelhalter, D. (1980) Bayes factors and choice criteria for linear models. J. R. Statist. Soc. B 42, 213–220.