
# Bayesian decision-theoretic design of experiments under an alternative model

Decision-theoretic Bayesian design of experiments is considered when the statistical model used to perform the analysis is different to the model a-priori used to design the experiment. Closed form results and large sample approximations are derived for the special case of normal linear models and for general cases, respectively. These are compared to the case when the fitted and designer models are identical.


## 1 Introduction

Suppose an experiment is to be performed to estimate a $p \times 1$ vector of unknown parameters $\beta$ from an $n \times 1$ vector of responses $y$, with parameter space $\mathcal{B}$ and sample space $\mathcal{Y}$. The responses are obtained via design $\Delta$: an $n \times k$ matrix, where $k$ is the number of design variables. Once $y$ has been observed, analysis will be performed by assuming that $y$ is a realization from the joint density $\pi(y \mid \beta, F)$ and $\beta$ has prior density $\pi(\beta \mid F)$. We refer to $\pi(y \mid \beta, F)$ and $\pi(\beta \mid F)$ as the fitted model and fitted prior, respectively, and the resulting posterior

$$ \pi(\beta \mid y, F) \propto \pi(y \mid \beta, F) \, \pi(\beta \mid F) $$

as the fitted posterior. The fitted model may depend on additional nuisance parameters but these have been integrated out to obtain $\pi(y \mid \beta, F)$ and $\pi(\beta \mid F)$.

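Decision-theoretic Bayesian design of experiments starts with the specification of a loss function denoted by $\lambda(\beta, y)$, where dependence on $y$ is through the fitted posterior. Two exemplar loss functions considered throughout are squared error loss

$$ \lambda_{SE}(\beta, y) = \| \beta - E(\beta \mid y, F) \|^2_{2, I_p}, $$

where $\| u \|^2_{2, A} = u^T A u$ and $E(\beta \mid y, F)$ is the fitted posterior mean of $\beta$, and the self information loss $\lambda_{SI}(\beta, y) = -\log \pi(\beta \mid y, F)$.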

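Traditionally, a Bayesian design minimizes the expected loss (which we refer to as the fitted expected loss) over the space of all possible designs (Chaloner and Verdinelli, 1995). The expectation is with respect to the joint distribution of $\beta$ and $y$ implied by the fitted model, i.e.

$$ L_F(\Delta) = \int_{\mathcal{Y}} \int_{\mathcal{B}} \lambda(\beta, y) \, \pi(\beta, y \mid F) \, d\beta \, dy, $$

where $\pi(\beta, y \mid F) = \pi(y \mid \beta, F) \, \pi(\beta \mid F)$.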

Now suppose that we wish to design the experiment by averaging the loss with respect to a joint density for $y$ implied by another model: the designer model. Let $\pi(y \mid \theta, D)$ denote the joint density of $y$ under the designer model where $\theta \in \Theta$ is a vector of unknown parameters with designer prior density $\pi(\theta \mid D)$. Let $\beta = (\theta^T, \gamma^T)^T$ so that $\theta$ are parameters common to both models and $\gamma \in \Gamma$ are parameters present only in the fitted model. Similar to the fitted model, the designer model may depend on additional latent variables but these have been integrated out to obtain $\pi(y \mid \theta, D)$ and $\pi(\theta \mid D)$. The designer expected loss is defined as

$$ L_D(\Delta) = \int_{\mathcal{Y}} \int_{\Theta} \int_{\Gamma} \lambda(\beta, y) \, \pi(\gamma \mid y, \theta, F) \, \pi(\theta, y \mid D) \, d\gamma \, d\theta \, dy. \tag{1} $$

Initially the loss is averaged with respect to the fitted posterior of $\gamma$ conditional on $\theta$, before being averaged with respect to all remaining unknowns ($\theta$ and $y$) under the designer model. The initial step is necessary since $\gamma$ is absent from the designer model.

There are several reasons why a design may be sought under a model different from the fitted model. The designer model may best represent current scientific knowledge. However, to aid interpretation or for pure convenience, a simpler model will be fitted on observation of the responses. Conversely, the fitted model may be more complex than the designer model. This scenario would fit within the iterative learning framework of Box (1980) whereby, in a sequential approach, the model fitted to data at the current stage (the fitted model) is updated in response to criticism of the model fitted to data at the previous stage (the designer model). Etzioni and Kadane (1993) considered the case where the fitted and designer prior distributions were different but the models (the joint distribution for $y$) were identical. We consider the case where both prior and model can be different.

In general, it will not be possible to evaluate the designer expected loss (1) in closed form. In recent years, new computational methodology has been developed for approximately minimizing the fitted expected loss (see Ryan et al. 2016 for a recent review) and can also be applied to approximately minimize the designer expected loss. However in this paper, we aim to gain understanding of designing under an alternative model by a) considering the linear model (see Section 2) where it can be possible to evaluate the designer expected loss in closed form, and b) developing a large sample approximation (see Section 3) to the designer expected loss which is analogous to that developed for the fitted expected loss (Chaloner and Verdinelli, 1995). Proofs for all results are found in the Supplementary Material.

## 2 Linear models

### 2.1 Fitted model

In this section, the fitted model is the linear model

$$ y \mid \beta, \sigma^2, F \sim N(X\beta, \sigma^2 I_n), $$

where $X$ is the $n \times p$ model matrix (a function of the design $\Delta$) and $I_n$ is the $n \times n$ identity matrix. In Section 2.2, we consider the case where the designer model is given by the fitted linear model but where the mean has been contaminated by model discrepancy. In Section 2.3, the designer model is the unit treatment model.

### 2.2 Model discrepancy

In this section we suppose that $\sigma^2$ is known and therefore drop conditioning on $\sigma^2$. The designer model is a linear model including a zero mean Gaussian process model discrepancy term (Kennedy and O’Hagan, 2001), i.e.

$$ y \mid \beta, \eta, D \sim N(X\beta + \eta, \sigma^2 I_n), \qquad \eta \mid \rho, D \sim N(0, \sigma^2 R), $$

where $R$ is an $n \times n$ correlation matrix. The $ij$th element of $R$ is given by a correlation function evaluated at the $i$th and $j$th rows of $\Delta$, and $\rho$ is a vector of unknown parameters controlling correlation. Now $\beta = \theta$; all parameters of interest are common to both fitted and designer models. We assume a common prior distribution for $\beta$, i.e. a normal prior with variance $\sigma^2 V$, so that the fitted and designer prior distributions are the same.

#### Theorem 1

Under the above fitted and designer models, the designer expected squared error and self information losses are

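$$ L_{D,SE}(\Delta) = \sigma^2 \, \mathrm{tr}\left\{ \hat{V}_F \left( V^{-1} + X^T \tilde{S} X \right) \hat{V}_F \right\}, \tag{2} $$

$$ L_{D,SI}(\Delta) = C + \frac{1}{2} \log |\hat{V}_F| + \frac{1}{2} \mathrm{tr}\left\{ \left( V^{-1} + X^T \tilde{S} X \right) \hat{V}_F \right\}, \tag{3} $$

respectively, where $C$ is a constant not depending on design $\Delta$, $\hat{V}_F = (X^T X + V^{-1})^{-1}$ is proportional to the fitted posterior variance of $\beta$ and $\tilde{S} = I_n + \tilde{R}$ with $\tilde{R}$ the designer prior mean of $R$.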

Compare expressions (2) and (3) to the corresponding expressions for the fitted expected loss

$$ L_{F,SE}(\Delta) = \sigma^2 \, \mathrm{tr}(\hat{V}_F), \qquad L_{F,SI}(\Delta) = C + \frac{1}{2} \log |\hat{V}_F| + \frac{p}{2}, $$

respectively. The sandwich variance term in the designer expected squared error loss is analogous to the quantity which appears when one performs inference under an unknown alternative model (e.g., Davison, 2003, pages 147-148). This idea is investigated further in Section 3.
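The comparison between (2), (3) and the fitted expected losses can be checked numerically. The following is a minimal sketch, assuming the conjugate form $\hat{V}_F = (X^T X + V^{-1})^{-1}$ and dropping the constant $C$; the function names, the toy straight-line model matrix and the choice of $\tilde{S}$ are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def v_hat(X, V_inv):
    # Assumed conjugate form (X^T X + V^{-1})^{-1}, proportional to the
    # fitted posterior variance of beta.
    return np.linalg.inv(X.T @ X + V_inv)

def designer_losses(X, V_inv, S_tilde, sigma2=1.0):
    # Expressions (2) and (3), with the design-independent constant C dropped.
    VF = v_hat(X, V_inv)
    M = V_inv + X.T @ S_tilde @ X
    se = sigma2 * np.trace(VF @ M @ VF)
    _, logdet = np.linalg.slogdet(VF)
    si = 0.5 * logdet + 0.5 * np.trace(M @ VF)
    return se, si

def fitted_losses(X, V_inv, sigma2=1.0):
    # Fitted expected squared error and self information losses (C dropped).
    VF = v_hat(X, V_inv)
    _, logdet = np.linalg.slogdet(VF)
    return sigma2 * np.trace(VF), 0.5 * logdet + 0.5 * X.shape[1]

# Hypothetical toy design: six runs, straight-line model, unit prior precision.
x = np.linspace(-1.0, 1.0, 6)
X = np.column_stack([np.ones_like(x), x])
V_inv = np.eye(2)

fitted = fitted_losses(X, V_inv)
# With S_tilde = I_n the sandwich collapses and the fitted losses are recovered.
no_discrepancy = designer_losses(X, V_inv, np.eye(6))
# S_tilde = 2 I_n is an illustrative stand-in for I_n plus an identity prior mean of R.
with_discrepancy = designer_losses(X, V_inv, 2.0 * np.eye(6))
```

Setting `S_tilde` to the identity makes $V^{-1} + X^T \tilde{S} X$ equal to $\hat{V}_F^{-1}$, which is why the two sets of expressions agree in that case; any additional positive semi-definite contribution to `S_tilde` inflates both losses.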

To demonstrate the difference between designs found under designer and fitted expected loss, consider the following example. Suppose there is a single design variable and the experiment has $n$ runs. For $i = 1, \dots, n$, let $x_i$ be the value of the design variable for the $i$th run, which determines the $i$th row of $X$. We assume a squared exponential correlation function with a scalar correlation parameter $\rho > 0$. For $\rho$, a Gamma distribution designer prior with known parameters is assumed. This implies that the elements of the designer prior mean of $R$ are available in closed form. Finally, a non-informative prior is assumed for $\beta$.

Designs are found under both designer expected squared error and self information loss for different values of and . The values of and are chosen so that and varies between 0 and 500. As increases, the correlation between elements of decreases, leading to independent normal random errors, i.e. no systematic model discrepancy. Without loss of generality, the designs found have the following structure . Designs that minimize the fitted expected squared error and self information losses when are referred to as A- and D-optimal, respectively, and both have . Figure 1 shows a plot of against for the designs found by minimizing the designer expected squared error and self information loss. As expected, for both squared error and self information loss, as increases, , the value of for no systematic model discrepancy, i.e., the A- and D-optimal designs, respectively.
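The closed-form designer prior mean of $R$ used in this kind of example can be sketched as follows. The code assumes a squared exponential correlation of the form $\exp\{-\rho (x_i - x_j)^2\}$ and a Gamma prior for $\rho$ with shape `a` and rate `b`; this parametrisation and the numerical values are assumptions made for illustration, since the exact specification is not recoverable from the text. Under these assumptions the Gamma moment generating function gives $E\{\exp(-\rho d^2)\} = (1 + d^2 / b)^{-a}$.

```python
import numpy as np

def expected_corr(x, a, b):
    # E[exp(-rho * d^2)] for rho ~ Gamma(shape=a, rate=b), for all pairs of x.
    d2 = (x[:, None] - x[None, :]) ** 2
    return (1.0 + d2 / b) ** (-a)

rng = np.random.default_rng(0)
x = np.array([-1.0, 0.0, 1.0])   # hypothetical design points
a, b = 2.0, 4.0                  # hypothetical prior shape and rate

# Monte Carlo check: NumPy's gamma sampler takes a scale, i.e. 1 / rate.
rho = rng.gamma(shape=a, scale=1.0 / b, size=200_000)
d2 = (x[:, None] - x[None, :]) ** 2
R_mc = np.exp(-rho[:, None, None] * d2).mean(axis=0)
R_closed = expected_corr(x, a, b)
```

The matrix `R_closed` plays the role of the prior mean of $R$ that enters $\tilde{S}$ in (2) and (3), so no numerical integration over $\rho$ is needed when evaluating the designer expected loss.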

### 2.3 Unit treatment designer model

For the fitted model, assume an inverse gamma prior distribution for $\sigma^2$, i.e., $\sigma^2 \mid F \sim \mathrm{IG}(a_F / 2, b_F / 2)$. Suppose the designer model is the unit treatment model, where experimental runs with the same design variables, i.e. identical rows of $\Delta$, have the same mean response. Specifically,

$$ y \mid \tau, \sigma^2, D \sim N(Z\tau, \sigma^2 I_n), $$

where $\tau$ is a vector of unknown treatment effects and the designer model matrix $Z$ is a function of $\Delta$. Thus $\beta = \gamma$, and there are no parameters of interest common to fitted and designer models.

#### Lemma 1

Under the above fitted and designer models, the fitted posterior expectation of the squared error and self information losses are

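$$ E(\lambda_{SE}(\beta, y) \mid y, F) = \frac{\hat{b}_F}{a_F + n - 2} \, \mathrm{tr}(\hat{V}_F), \tag{4} $$

$$ E(\lambda_{SI}(\beta, y) \mid y, F) = K + \frac{1}{2} \log |\hat{V}_F| + \frac{p}{2} \log \hat{b}_F, \tag{5} $$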

respectively, where is a constant which does not depend on the design and with .

#### Theorem 2

Under the above fitted and designer models, the designer expected squared error and self information losses are

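$$ L_{D,SE}(\Delta) = \frac{E(\hat{b}_F \mid D)}{a_F + n - 2} \, \mathrm{tr}(\hat{V}_F), \tag{6} $$

$$ L_{D,SI}(\Delta) = K + \frac{1}{2} \log |\hat{V}_F| + \frac{p}{2} E(\log \hat{b}_F \mid D), \tag{7} $$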

respectively, where and .

The expectation of the term $\log \hat{b}_F$ in (7) with respect to the marginal distribution of $y$ under the designer model is not available in closed form. In the example that follows, we use a delta method approximation where $E(\log \hat{b}_F \mid D) \approx \log E(\hat{b}_F \mid D)$.
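To give a feel for the size of the error in a first-order delta method approximation of this kind, the sketch below compares $\log E(\hat{b})$ with a Monte Carlo estimate of $E(\log \hat{b})$ for a positive quantity of the form $\hat{b} = b_F + y^T A y$ with zero-mean normal $y$. The quadratic-form structure, the matrices `A` and `Sigma`, and the value of `b_F` are hypothetical choices for illustration only and are not the paper's definition of $\hat{b}_F$.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 12
A = np.eye(n) - np.full((n, n), 1.0 / n)    # hypothetical projection removing the mean
Sigma = np.eye(n) + 0.5 * np.ones((n, n))   # hypothetical marginal covariance of y
b_F = 1.0                                   # hypothetical prior scale

# Exact mean of the quadratic form for zero-mean y: E(y^T A y) = tr(A Sigma).
mean_b = b_F + np.trace(A @ Sigma)
delta_approx = np.log(mean_b)               # first-order delta method value

# Monte Carlo estimate of E(log b_hat) for comparison.
y = rng.multivariate_normal(np.zeros(n), Sigma, size=100_000)
b_hat = b_F + np.einsum("ij,jk,ik->i", y, A, y)
mc_log = np.mean(np.log(b_hat))
```

Because the logarithm is concave, Jensen's inequality implies that the delta method value is an upper bound on the exact expectation, so `delta_approx` exceeds `mc_log` up to Monte Carlo error; the gap shrinks as the relative variability of $\hat{b}$ decreases.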

Compare expressions (6) and (7) for the designer expected squared error and self information losses, respectively, to the corresponding expressions for the fitted expected loss

$$ L_{F,SE}(\Delta) = \frac{b_F}{a_F - 2} \, \mathrm{tr}(\hat{V}_F), \qquad L_{F,SI}(\Delta) = K + \frac{1}{2} \log |\hat{V}_F| + \frac{p}{2} \left\{ \log b_F + \psi\left( \frac{a_F + n}{2} \right) - \psi\left( \frac{n}{2} \right) \right\}, $$

where $\psi$ is the digamma function. The difference lies in the expectation of $\hat{b}_F$ and $\log \hat{b}_F$ in (4) and (5), with respect to the marginal distribution of $y$ under the designer and fitted models, respectively. The term $\hat{b}_F$ summarizes lack of fit (O’Hagan and Forster, 2004, page 319) of the fitted model so it is natural that the expectation of this quantity (or a function thereof) drives the difference between designer and fitted expected losses.

To demonstrate this difference, we consider Example 1 from Gilmour and Trinca (2012) involving an experiment with $n$ runs and $k = 3$ design variables. The fitted model is a second-order model including an intercept, three first-order terms, three quadratic terms and three pairwise interactions, i.e. $p = 10$. For the fitted model, we assume a non-informative improper prior for , i.e., , , and . For the unit treatment model, we assume that . We do however need to choose a positive-definite prior scale matrix for the designer expected loss to exist. We choose the unit information specification (Smith and Spiegelhalter, 1980) which is commonly used to represent prior ignorance but still leads to a proper prior. Under this prior, .

Minimizing the designer expected squared error loss is equivalent (dropping constants that do not depend on design ) to minimizing

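$$ l_{D,SE}(\Delta) = \mathrm{tr}\left\{ (I_n + n H_Z) \left( I_n - \frac{n}{n+1} H_X \right) \right\} \mathrm{tr}\left\{ (X^T X)^{-1} \right\}, \tag{8} $$

where $H_Z = Z (Z^T Z)^{-1} Z^T$ and $H_X = X (X^T X)^{-1} X^T$. Similarly, minimizing the delta method approximate designer expected self information loss is equivalent to minimizing

$$ l_{D,SI}(\Delta) = p \log \mathrm{tr}\left\{ (I_n + n H_Z) \left( I_n - \frac{n}{n+1} H_X \right) \right\} - \log |X^T X|. \tag{9} $$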

Designs, referred to as AD-optimal and DD-optimal, are found under loss functions (8) and (9), respectively. Additionally, A- and D-optimal designs are found, equivalent to minimizing, respectively

$$ l_{F,SE}(\Delta) = \mathrm{tr}\left\{ (X^T X)^{-1} \right\}, \quad \text{and} \quad l_{F,SI}(\Delta) = -\log |X^T X|. $$
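The four criteria above are straightforward to evaluate for a candidate design. The sketch below is illustrative only: it assumes $H_X$ and $H_Z$ are the usual orthogonal projection (hat) matrices, builds $Z$ as an indicator matrix with one column per distinct design point, and uses a hypothetical one-variable quadratic model with six runs rather than the three-variable example of Gilmour and Trinca (2012). The function names are not from the paper.

```python
import numpy as np

def hat(M):
    # Orthogonal projector onto the column space of M.
    return M @ np.linalg.pinv(M)

def treatment_matrix(design):
    # Indicator matrix Z with one column per distinct design point (support point).
    _, labels = np.unique(design, axis=0, return_inverse=True)
    labels = np.ravel(labels)
    return np.eye(labels.max() + 1)[labels]

def criteria(design, model):
    # Evaluate (8), (9) and the A- and D-criteria for a given design.
    X = model(design)
    n, p = X.shape
    HX, HZ = hat(X), hat(treatment_matrix(design))
    lack = np.trace((np.eye(n) + n * HZ) @ (np.eye(n) - n / (n + 1) * HX))
    XtX = X.T @ X
    a_crit = np.trace(np.linalg.inv(XtX))
    _, logdet = np.linalg.slogdet(XtX)
    return {"AD": lack * a_crit, "DD": p * np.log(lack) - logdet,
            "A": a_crit, "D": -logdet}

def quadratic(design):
    # Hypothetical fitted model: intercept, linear and quadratic term in one variable.
    x = design[:, 0]
    return np.column_stack([np.ones_like(x), x, x**2])

replicated = np.array([[-1.0], [-1.0], [0.0], [0.0], [1.0], [1.0]])  # 3 support points
spread = np.linspace(-1.0, 1.0, 6).reshape(-1, 1)                    # 6 support points
crit_rep = criteria(replicated, quadratic)
crit_spread = criteria(spread, quadratic)
```

Since the columns of $X$ lie in the column space of $Z$, the first trace in (8) and (9) simplifies to $n(1 + t - p)$ for a design with $t$ support points and a full-rank $X$. In this toy setting the replicated design therefore has the smaller lack-of-fit factor, which matches the preference for pure error degrees of freedom discussed in the text.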

Table 1 shows efficiencies for the four designs found. AD- and DD-efficiency of a design are

where and are the AD- and DD-optimal designs, respectively. Similar expressions are used for A- and D-efficiency.

Clearly, the A- and D-optimal designs are less robust to the unit-treatment model. The A-optimal design has 14 support points (unique design points) compared to 10 for the AD-optimal design. The equivalent values for the D- and DD-optimal designs are 16 and 10, respectively. The difference between $n$ and the number of support points is known as pure error degrees of freedom.

Gilmour and Trinca (2012) advocate finding designs that minimize the variance of an estimator of $\beta$ under the fitted model where $\sigma^2$ is estimated under the unit treatment model. Taking this approach favours designs that have larger pure error degrees of freedom than standard A- or D-optimal designs. Here it is demonstrated that this is also a consequence of a Bayesian approach having designed under the unit treatment model.

## 3 Large sample approximation

As discussed in Section 1, in general, the designer expected loss is not available in closed form and will require approximation to find a design in practice. In this section, a large sample approximation to the designer expected loss is derived which is analogous to approximations to the fitted expected loss (Chaloner and Verdinelli, 1995). The general form for these approximations is the prior expectation of a functional of the Fisher information. The Fisher information arises due to the following large sample approximation to the fitted posterior distribution, i.e.

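$$ N(\hat{\beta}_F, I_F^{-1}), \tag{10} $$

where $\hat{\beta}_F \in \hat{\mathcal{B}}_F$ is the maximum likelihood estimate of $\beta$ under the fitted model (with $\hat{\mathcal{B}}_F$ the containing set) and

$$ I_F = -\int_{\mathcal{Y}} \frac{\partial^2 \log \pi(y \mid \beta, F)}{\partial \beta \, \partial \beta^T} \, \pi(y \mid \beta, F) \, dy $$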

is the Fisher information under the fitted model.

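The loss can be approximated by replacing dependence on the fitted posterior by dependence on the approximate fitted posterior (10). First define $\tilde{\beta}_{FD}$ and $\tilde{\beta}_D$ to be the values of $\beta$ that minimize the Kullback-Leibler divergence between the fitted model and a) the fitted model having integrated out $\gamma$ (the parameters absent from the designer model), and b) the designer model. Since minimizing the divergence is equivalent to maximizing the expected log-likelihood, $\tilde{\beta}_{FD}$ and $\tilde{\beta}_D$ maximize

$$ f(t) = \int_{\mathcal{Y}} \log \pi(y \mid t, F) \, \pi(y \mid \theta, F) \, dy, \qquad d(t) = \int_{\mathcal{Y}} \log \pi(y \mid t, F) \, \pi(y \mid \theta, D) \, dy, $$

respectively. Furthermore, define

$$ I_{FD} = -\int_{\mathcal{Y}} \left. \frac{\partial^2 \log \pi(y \mid t, F)}{\partial t \, \partial t^T} \right|_{t = \tilde{\beta}_{FD}} \pi(y \mid \theta, F) \, dy, \tag{11} $$

$$ J_{FD} = \int_{\mathcal{Y}} \left. \frac{\partial \log \pi(y \mid t, F)}{\partial t} \right|_{t = \tilde{\beta}_{FD}} \left. \frac{\partial \log \pi(y \mid t, F)}{\partial t^T} \right|_{t = \tilde{\beta}_{FD}} \pi(y \mid \theta, F) \, dy, \tag{12} $$

$$ I_D = -\int_{\mathcal{Y}} \left. \frac{\partial^2 \log \pi(y \mid t, F)}{\partial t \, \partial t^T} \right|_{t = \tilde{\beta}_D} \pi(y \mid \theta, D) \, dy, \tag{13} $$

$$ J_D = \int_{\mathcal{Y}} \left. \frac{\partial \log \pi(y \mid t, F)}{\partial t} \right|_{t = \tilde{\beta}_D} \left. \frac{\partial \log \pi(y \mid t, F)}{\partial t^T} \right|_{t = \tilde{\beta}_D} \pi(y \mid \theta, D) \, dy. \tag{14} $$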

The following result can now be proved.

#### Theorem 3

A large sample approximation to the designer expected loss is

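$$ \hat{L}_D(\Delta) = \int_{\Theta} \int_{\Gamma} \int_{\hat{\mathcal{B}}_F} \lambda(\beta, \hat{\beta}_F) \, g(\hat{\beta}_F \mid \beta) \, d\hat{\beta}_F \, Q(\beta) \, \pi(\gamma \mid \theta, F) \, \pi(\theta \mid D) \, d\gamma \, d\theta. \tag{15} $$

In (15), $g(\hat{\beta}_F \mid \beta)$ is the density of $N(m_0, V_0)$ where

$$ V_0 = \left( I_F - I_{FD} J_{FD}^{-1} I_{FD} + I_D J_D^{-1} I_D \right)^{-1}, $$

$$ m_0 = V_0 \left( I_F \beta - I_{FD} J_{FD}^{-1} I_{FD} \tilde{\beta}_{FD} + I_D J_D^{-1} I_D \tilde{\beta}_D \right), $$

and

$$ Q(\beta) = \left| I_F^{-1} I_D^{-1} J_D I_D^{-1} I_{FD} J_{FD}^{-1} I_{FD} \right|^{-\frac{1}{2}} |V_0|^{\frac{1}{2}} \exp\left( \frac{1}{2} m_0^T V_0^{-1} m_0 + \frac{1}{2} \beta^T I_F \beta - \frac{1}{2} \tilde{\beta}_D^T I_D J_D^{-1} I_D \tilde{\beta}_D + \frac{1}{2} \tilde{\beta}_{FD}^T I_{FD} J_{FD}^{-1} I_{FD} \tilde{\beta}_{FD} \right). $$

The tractability of the normal distribution means that the inner expectation of the approximate loss with respect to $\hat{\beta}_F$ (i.e. $\int_{\hat{\mathcal{B}}_F} \lambda(\beta, \hat{\beta}_F) \, g(\hat{\beta}_F \mid \beta) \, d\hat{\beta}_F$) is often available in closed form. The approximation given by (15) then reduces to the prior expectation of functionals of the Fisher information, and the quantities in (11) to (14). However, the prior of $\beta$ is formed of two components: the distribution of $\gamma$ conditional on $\theta$ under the fitted prior and then the distribution of $\theta$ under the designer prior.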

Consider the case where $\beta = \theta$, i.e. all parameters of interest are present in both models. Under this scenario the following corollary can be proved.

#### Corollary 1

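Large sample approximations to the inner expectation of the squared error and self information loss with respect to $\hat{\beta}_F$ conditional on the designer model and $\beta$ are

$$ \hat{E}(\lambda_{SE}(\beta, y) \mid \beta) = \| \beta - \tilde{\beta}_D \|^2_{2, I_p} + \mathrm{tr}\left\{ I_D^{-1} J_D I_D^{-1} \right\}, \tag{16} $$

$$ \hat{E}(\lambda_{SI}(\beta, y) \mid \beta) = G - \frac{1}{2} \log |I_F| + \frac{1}{2} \| \beta - \tilde{\beta}_D \|^2_{2, I_F} + \frac{1}{2} \mathrm{tr}\left\{ I_F I_D^{-1} J_D I_D^{-1} \right\}, \tag{17} $$

where $G$ is a constant not depending on the design $\Delta$. Note the sandwich variance term appearing in the large sample approximation to the expectation of the squared error loss (16). This is exact in the case when the fitted posterior distribution is normal; see Section 2.2.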
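The link between the sandwich term in (16) and the linear model of Section 2.2 can be illustrated numerically. The sketch below assumes a fitted model $N(X\beta, \sigma^2 I_n)$ and a designer model whose marginal covariance is $\sigma^2 S$ with the same mean, for which a direct calculation gives $\tilde{\beta}_D = \beta$, $I_D = X^T X / \sigma^2$ and $J_D = X^T S X / \sigma^2$; these expressions are derived here for illustration rather than quoted from the paper, and the design, correlation parameter and parameter values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n, sigma2 = 8, 0.5
x = np.linspace(-1.0, 1.0, n)                      # hypothetical design points
X = np.column_stack([np.ones(n), x])
beta = np.array([1.0, -2.0])                       # hypothetical true parameters
R = np.exp(-3.0 * (x[:, None] - x[None, :]) ** 2)  # squared exponential correlation
S = np.eye(n) + R                                  # marginal covariance factor under D

# Ingredients of (16) for this Gaussian case (assumed forms, see lead-in).
I_D = X.T @ X / sigma2
J_D = X.T @ S @ X / sigma2
I_D_inv = np.linalg.inv(I_D)
approx_se = np.trace(I_D_inv @ J_D @ I_D_inv)      # (16) with beta_tilde_D = beta

# Closed form in the style of (S1) with a flat prior on beta (V^{-1} = 0).
VF = np.linalg.inv(X.T @ X)
exact_se = sigma2 * np.trace(VF @ X.T @ S @ X @ VF)

# Monte Carlo check: simulate y under the designer model, fit by least squares.
L = np.linalg.cholesky(sigma2 * S)
Y = X @ beta + rng.standard_normal((50_000, n)) @ L.T
beta_hat = np.linalg.lstsq(X, Y.T, rcond=None)[0].T
mc_se = np.mean(np.sum((beta_hat - beta) ** 2, axis=1))
```

In this setting the sandwich trace, the closed-form expression and the simulated mean squared error all agree (the last up to Monte Carlo error), which is a small-scale illustration of why the approximation is exact when the fitted posterior is normal.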

## Acknowledgement

The authors would like to thank Prof Dave Woods for initial discussions and feedback.

## Supplementary material

Supplementary material available at the end of the document includes proofs for all results in the manuscript.

Supplementary Material for “Bayesian decision-theoretic design of experiments under an alternative model”

This document includes proofs of results in the main manuscript. Equation numbers with no prefix refer to equations in the main manuscript whereas equation numbers with prefix S refer to equations in this document.

#### Proof of THEOREM 1

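The fitted posterior of $\beta$ is $N(\hat{\mu}_F, \sigma^2 \hat{V}_F)$ where $\hat{\mu}_F$ is the fitted posterior mean. Under the designer model, integrating out $\eta$ gives $y \mid \beta, \rho, D \sim N(X\beta, \sigma^2 S)$, where $S = I_n + R$. The designer expected squared error and self information losses, conditional on $\rho$, are

$$ E(\lambda_{SE}(\beta, y) \mid \rho, D) = \sigma^2 \, \mathrm{tr}\left\{ \hat{V}_F \left( V^{-1} + X^T S X \right) \hat{V}_F \right\}, \tag{S1} $$

$$ E(\lambda_{SI}(\beta, y) \mid \rho, D) = C + \frac{1}{2} \log |\hat{V}_F| + \frac{1}{2} \mathrm{tr}\left\{ \left( V^{-1} + X^T S X \right) \hat{V}_F \right\}. \tag{S2} $$

Since the trace is a linear operator, it is straightforward to take expectations of (S1) and (S2) with respect to the designer prior of $\rho$ resulting in (2) and (3), respectively.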

#### Proof of LEMMA 1

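The fitted posterior of $\beta$ is

$$ \beta \mid y, F \sim t\left( \hat{\mu}_F, \frac{\hat{b}_F}{a_F + n} \hat{V}_F, a_F + n \right), \tag{S3} $$

a multivariate $t$ distribution (e.g., Kotz and Nadarajah, 2004, page 1) with mean $\hat{\mu}_F$, scale matrix $\frac{\hat{b}_F}{a_F + n} \hat{V}_F$, degrees of freedom $a_F + n$, and negative log density

$$ -\log \pi(\beta \mid y, F) = Q + \frac{1}{2} \log |\hat{V}_F| + \frac{p}{2} \log \hat{b}_F + \frac{a_F + n + p}{2} \log \left( 1 + \frac{\| \beta - \hat{\mu}_F \|^2_{2, \hat{V}_F^{-1}}}{\hat{b}_F} \right), \tag{S4} $$

where $Q$ is a constant which does not depend on $y$ or $\Delta$. The fitted posterior expectation of the squared error loss (4) immediately follows, noting that the fitted posterior variance of $\beta$ is $\frac{\hat{b}_F}{a_F + n - 2} \hat{V}_F$. The fitted posterior expectation of the self information loss (5) follows from

$$ L = E\left\{ \frac{a_F + n + p}{2} \log \left( 1 + \frac{\| \beta - \hat{\mu}_F \|^2_{2, \hat{V}_F^{-1}}}{\hat{b}_F} \right) \,\middle|\, y, F \right\} = \frac{a_F + n + p}{2} \left\{ \psi\left( \frac{a_F + n + p}{2} \right) - \psi\left( \frac{a_F + n}{2} \right) \right\}, $$

(e.g., Kotz and Nadarajah, 2004, page 23) where $K = Q + L$ and $\psi$ is the digamma function.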

#### Proof of THEOREM 2

The proof follows from taking the expectation of (4) and (5) with respect to $\pi(y \mid D)$, the marginal distribution of $y$ under the designer model.

#### Proof of THEOREM 3

Approximate the loss by replacing dependence on the fitted posterior by dependence on the large sample approximation (10) giving the following approximation to the designer expected loss

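$$ \hat{L}_D(\Delta) = \int_{\Theta} \int_{\mathcal{Y}} \int_{\Gamma} \lambda(\beta, \hat{\beta}_F) \, \pi(\gamma \mid y, \theta, F) \, \pi(y \mid \theta, D) \, \pi(\theta \mid D) \, d\gamma \, dy \, d\theta. $$

The posterior distribution, $\pi(\gamma \mid y, \theta, F)$, can be approximated by deriving the conditional distribution of $\gamma$ given $\theta$ from (10) using the usual properties of the normal distribution. The key point is that this distribution only depends on $y$ through $\hat{\beta}_F$, so we can write

$$ \begin{aligned} \hat{L}_D(\Delta) &= \int_{\Theta} \int_{\mathcal{Y}} \int_{\Gamma} \lambda(\beta, \hat{\beta}_F) \, \pi(\gamma \mid \hat{\beta}_F, \theta, F) \, \pi(y \mid \theta, D) \, \pi(\theta \mid D) \, d\gamma \, dy \, d\theta, \\ &= \int_{\Theta} \int_{\mathcal{Y}} \int_{\Gamma} \lambda(\beta, \hat{\beta}_F) \, \frac{\pi(\hat{\beta}_F \mid \theta, \gamma, F) \, \pi(\gamma \mid \theta, F)}{\pi(\hat{\beta}_F \mid \theta, F)} \, \pi(y \mid \theta, D) \, \pi(\theta \mid D) \, d\gamma \, dy \, d\theta, \end{aligned} $$

where the last line follows from an application of Bayes’ theorem. Reordering the terms and noting that the expectation with respect to $y$ can be written as expectation with respect to $\hat{\beta}_F$ gives

$$ \hat{L}_D(\Delta) = \int_{\Theta} \int_{\Gamma} \int_{\hat{\mathcal{B}}_F} \lambda(\beta, \hat{\beta}_F) \, \frac{\pi(\hat{\beta}_F \mid \theta, \gamma, F)}{\pi(\hat{\beta}_F \mid \theta, F)} \, \pi(\hat{\beta}_F \mid \theta, D) \, d\hat{\beta}_F \, \pi(\gamma \mid \theta, F) \, \pi(\theta \mid D) \, d\gamma \, d\theta. $$

Large sample approximations to the distributions $\pi(\hat{\beta}_F \mid \theta, \gamma, F)$, $\pi(\hat{\beta}_F \mid \theta, F)$ and $\pi(\hat{\beta}_F \mid \theta, D)$ are

$$ N(\beta, I_F^{-1}), \qquad N(\tilde{\beta}_{FD}, I_{FD}^{-1} J_{FD} I_{FD}^{-1}), \qquad N(\tilde{\beta}_D, I_D^{-1} J_D I_D^{-1}), \tag{S5} $$

respectively, where the last two distributions follow from inference results for the wrong model (e.g., Davison, 2003, pages 147-148). The expression in (15) follows.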

#### Proof of COROLLARY 1

The squared error and self information losses are approximated by

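$$ \hat{\lambda}_{SE}(\beta, \hat{\beta}_F) = \| \beta - \hat{\beta}_F \|^2_{2, I_p}, \tag{S6} $$

$$ \hat{\lambda}_{SI}(\beta, \hat{\beta}_F) = G - \frac{1}{2} \log |I_F| + \frac{1}{2} \| \beta - \hat{\beta}_F \|^2_{2, I_F}, \tag{S7} $$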

respectively. Expectation of (S6) and (S7) with respect to (S5) results in (16) and (17), respectively.

## References

• Box (1980) Box, G. (1980) Sampling and Bayes’ inference in scientific modelling and robustness. J. R. Statist. Soc. A 143, 383–430.
• Chaloner and Verdinelli (1995) Chaloner, K. and Verdinelli, I. (1995) Bayesian experimental design: a review. Statist. Sci. 10, 273–304.
• Davison (2003) Davison, A. (2003) Statistical Models. Cambridge University Press, Cambridge.
• Etzioni and Kadane (1993) Etzioni, R. and Kadane, J. (1993) Optimal experimental design for another’s analysis. J. Am. Statist. Assoc. 88, 1404–1411.
• Gilmour and Trinca (2012) Gilmour, S. G. and Trinca, L. (2012) Optimum design of experiments for statistical inference (with discussion). J. R. Statist. Soc. C 61, 345–401.
• Kennedy and O’Hagan (2001) Kennedy, M. and O’Hagan, A. (2001) Bayesian calibration of computer models (with discussion). J. R. Statist. Soc. B 63, 425–464.
• Kotz and Nadarajah (2004) Kotz, S. and Nadarajah, S. (2004) Multivariate t Distributions and their Applications. Cambridge University Press, Cambridge.
• O’Hagan and Forster (2004) O’Hagan, A. and Forster, J. (2004) Kendall’s Advanced Theory of Statistics, Volume 2B: Bayesian Inference. Wiley, 2nd edn.
• Ryan et al. (2016) Ryan, E., Drovandi, C., McGree, J. and Pettitt, A. (2016) A review of modern computational algorithms for Bayesian optimal design. Int. Statist. Rev. 84, 128–154.
• Smith and Spiegelhalter (1980) Smith, A. and Spiegelhalter, D. (1980) Bayes factors and choice criteria for linear models. J. R. Statist. Soc. B 42, 213–220.