
# High dimensional optimal design using stochastic gradient optimisation and Fisher information gain

Finding high dimensional designs is increasingly important in applications of experimental design, but is computationally demanding under existing methods. We introduce an efficient approach applying recent advances in stochastic gradient optimisation methods. To allow rapid gradient calculations we work with a computationally convenient utility function, the trace of the Fisher information. We provide a decision theoretic justification for this utility, analogous to work by Bernardo (1979) on the Shannon information gain. Due to this similarity we refer to our utility as the Fisher information gain. We compare our optimisation scheme, SGO-FIG, to existing state-of-the-art methods and show our approach is quicker at finding designs which maximise expected utility, allowing designs with hundreds of choices to be produced in under a minute in one example.


## 1 Introduction

Selecting a good design for an experiment can be crucial to extracting useful information and controlling costs. Applications include medical studies (Amzal et al., 2006), epidemic modelling (Cook et al., 2008), pharmacokinetics (Ryan et al., 2014; Overstall and Woods, 2017) and ecology (Gillespie and Boys, 2019). In modern applications it is increasingly feasible to take a large number of measurements, for example placing sensors (Krause et al., 2009) or making observations in a numerical integration problem (Oates et al., 2019). Therefore high dimensional designs are increasingly relevant. However most methods for optimal design are expensive in the high dimensional setting, so there is a need for more efficient and scalable methods.

We focus on the Bayesian approach to optimal experimental design, which takes into account existing knowledge and uncertainty about the process being studied before the experiment is undertaken. In this framework an experimenter must select a design. They then receive some utility based on the outcome of the experiment. The aim is to select the design which maximises expected utility given the experimenter’s beliefs before the experiment takes place. Mathematically this is an optimisation problem. A practical challenge is that the expected utility to be optimised usually cannot be calculated exactly. Instead only random estimates can be produced through simulation. We approach the problem by applying methods from the machine learning literature for high dimensional optimisation in similar settings: automatic differentiation and adaptive variants of stochastic gradient descent. These rely on being able to estimate the gradient of the expected utility with respect to the design. Therefore our methods are relevant to problems where there is a continuous space of possible designs.

For many utility functions it is expensive to estimate utilities or their gradients. For instance, a single utility evaluation often requires performing some aspect of Bayesian inference for simulated data using a Monte Carlo calculation. Therefore we focus on a utility function which is particularly computationally convenient as it is often available in closed form: the trace of the Fisher information. This is commonly used in classical optimal design, but is often criticised in the Bayesian experimental design literature as effectively only relying on an approximation to the posterior (see Section 2 for more discussion). However Walker (2016) shows that it has an information theoretic justification. We provide a further justification of this utility as a Bayesian approach, by noting that there is a first principles decision theoretic derivation, analogous to Bernardo's (1979) argument supporting the use of the Shannon information gain utility. Due to this similarity, we refer to our utility as the Fisher information gain (FIG). The derivation uses an underlying utility function based on the Hyvärinen score (Hyvärinen, 2005).

We compare our optimisation scheme, SGO-FIG, to existing state-of-the-art methods such as the algorithm of Müller (1999) and ACE (Overstall and Woods, 2017), also using the FIG utility. Our approach is quicker at finding designs which maximise expected utility, and allows designs with hundreds of choices to be produced in under a minute. A drawback is that in one example it often converges to poor local maxima, and we show how post-processing methods from Overstall and Woods (2017) can be used to find the overall optimal design. Furthermore we find FIG evaluation is roughly 10 times quicker than SIG in one example.

Similar gradient-based optimisation approaches to ours have been explored previously. Pronzato and Walter (1985) optimise the expected determinant of the Fisher information using analytically derived gradients. Huan and Marzouk (2013, 2014) optimise expected Shannon information using gradients (either derived analytically or based on finite differences) for a biased numerical approximation to the utility. The novelty of our approach is to use recently developed adaptive stochastic gradient algorithms and automatic differentiation frameworks, as well as a utility function chosen to allow cheap calculation of unbiased gradient estimates.

The remainder of the paper is structured as follows. Section 2 presents background on Bayesian experimental design, including common utility functions, the role of decision theory and existing computational methods. Section 3 describes the utility function we use, Fisher information gain. Details of a decision theoretic justification for this utility are presented in an appendix. Section 4 presents our algorithm for optimal design. Sections 5–7 present three example applications, including a comparison of our algorithm to existing methods. Code to illustrate all of these applications is available at github.com/SophieHarbisher/SGO-FIG. Finally, Section 8 concludes with a discussion.

## 2 Bayesian experimental design

Optimal experimental design concerns the following problem. An experimenter must select a design τ from a set of feasible designs T. The experiment produces data y with likelihood f(y|θ;τ), where θ is a vector of unknown parameters. The goal is to select the design which optimises some notion of the quality of the experiment, typically based on its informativeness and its cost.

The Bayesian approach to experimental design involves selecting a function U(τ,θ,y), giving the utility of selecting design τ given observations y and true parameters θ. We try to maximise the expected utility of τ, i.e. the prior predictive utility

 J(τ) = E_{θ∼π(θ), y∼f(y|θ;τ)}[U(τ,θ,y)], (1)

where π(θ) is the prior density for θ. The optimal design is that maximising J(τ). Throughout we consider the case where a unique maximum exists, although much of the methodology will remain useful when this is not the case.

This section reviews relevant details of Bayesian experimental design. See Chaloner and Verdinelli (1995) and Ryan et al. (2016) for more comprehensive surveys. First, Section 2.1 introduces some useful notation. Section 2.2 summarises some common choices of utility function, and Section 2.3 describes a particularly principled approach: deriving it using decision theory. Section 2.4 reviews computational methods for estimating the optimal design.

### 2.1 Notation

As usual in Bayesian statistics, we will make use of the posterior density π(θ|y;τ) and the prior predictive density π(y;τ) for y. In our setting both depend on the experimental design τ,

 π(θ|y;τ) = π(θ) f(y|θ;τ) / π(y;τ), (2)
 π(y;τ) = E_{θ∼π(θ)}[f(y|θ;τ)]. (3)

We will also make use of the Fisher information matrix,

 I(θ;τ) = E_{y∼f(y|θ;τ)}[u(y,θ;τ) u(y,θ;τ)^T], (4)

which is based on the score function,

 u(y,θ;τ) = ∇_θ log f(y|θ;τ). (5)

We will focus on models where both of these are well defined.

We concentrate on the case where T = ℝ^d, i.e. a design is a fixed number, d, of real-valued quantities. Such a design typically represents times or locations of measurements. We will typically assume there are p parameters and that y is a vector of observations (y_1,…,y_n).

### 2.2 Common utility functions

Ideally a specific utility function for the situation at hand could be chosen, perhaps by eliciting preferences over different design and outcome combinations from the experimenter (e.g. Wolfson et al., 1996). However this is often infeasible in practice. Instead several generic choices of utility have been proposed, including: scalar summaries of posterior precision or of the difference between θ and the posterior mean E[θ|y]; information theoretic choices; utilities based on predicting future observations. Ryan et al. (2016) describe these utilities in more detail, and refer to them as fully Bayesian as they are functionals of the posterior distribution π(θ|y;τ).

Producing good estimates of posterior quantities can be computationally expensive. Hence another set of generic utility choices are based on cruder posterior approximations, in particular Gaussian approximations using I(θ;τ)^{-1}, the inverse of the Fisher information matrix (4), as the variance matrix (Chaloner and Verdinelli, 1995). The utility can then be taken to be a scalar summary of I(θ;τ). This corresponds to using alphabet optimality criteria from classical experimental design (Box, 1982). Such utilities include tr I(θ;τ) (T-optimality), log det I(θ;τ) (D-optimality) and −tr[I(θ;τ)^{-1}] (A-optimality). Ryan et al. (2016) refer to these as pseudo-Bayesian as they are not functionals of the posterior.

###### Remark 1.

The distinction between pseudo and fully Bayesian utility functions is not as clear cut as it appears. In particular, Section 3 will present an example where a fully Bayesian and a pseudo-Bayesian utility (under the preceding definitions) are equivalent, in the sense of always producing the same expected utility function up to an additive constant, and hence the same optimal design. We explore this issue further in Appendix A.

A utility of particular interest later is Shannon information gain (SIG),

 U_SIG(τ,θ,y) = log π(θ|y;τ) − log π(θ) (6)
 = log f(y|θ;τ) − log π(y;τ). (7)

A common SIG estimate replaces π(y;τ) in (7) with a Monte Carlo estimate

 ^π(y;τ) = (1/L) ∑_{ℓ=1}^{L} f(y|θ^{(ℓ)};τ), (8)

where θ^{(1)},…,θ^{(L)} are independent draws from the prior. A typical choice of L is 1000 (Overstall and Woods, 2017), which makes each utility evaluation somewhat computationally expensive. Furthermore, some approximation error is introduced. In general, numerical estimation of most fully Bayesian utility functions involves a similar mixture of computational cost and approximation error (see e.g. Ryan et al., 2016 and Overstall and Woods, 2017 for details).
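As a concrete illustration, one draw of this nested Monte Carlo SIG estimator can be sketched as follows. The linear-Gaussian toy model in the usage example is invented purely for illustration; it is not from the paper.

```python
import math
import random

def sig_estimate(tau, sample_prior, sample_data, log_lik, L=1000):
    """One draw of the SIG utility estimate: simulate (theta, y) from the
    prior predictive, then replace pi(y; tau) in (7) with the Monte Carlo
    estimate (8) based on L fresh prior draws."""
    theta = sample_prior()
    y = sample_data(theta, tau)
    inner = [math.exp(log_lik(y, sample_prior(), tau)) for _ in range(L)]
    log_pred = math.log(sum(inner) / L)          # log of estimate (8)
    return log_lik(y, theta, tau) - log_pred     # estimate of (7)

# Usage sketch on an assumed toy model: theta ~ N(0,1), y ~ N(theta * tau, 1).
random.seed(0)
log_lik = lambda y, th, tau: -0.5 * math.log(2 * math.pi) - 0.5 * (y - th * tau) ** 2
estimates = [
    sig_estimate(1.0, lambda: random.gauss(0, 1),
                 lambda th, tau: random.gauss(th * tau, 1), log_lik, L=500)
    for _ in range(200)
]
j_hat = sum(estimates) / len(estimates)  # crude estimate of J(tau) at tau = 1
```

Averaging many such draws estimates the expected SIG; for this toy model the true value is 0.5 log(1 + τ²), illustrating both the cost (L inner simulations per draw) and the bias introduced by estimate (8).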

### 2.3 Bayesian decision theory

Lindley (1972), following Raiffa and Schlaifer (1961), proposed viewing experimental design as a decision problem. As in the general framework at the start of Section 2, the experimenter selects a design τ and the experiment produces data y given parameters θ with an assumed prior distribution. Following this, the experimenter selects an action a based on observing y (but not θ). Their preferences are represented by a function V(a,θ,y,τ), which we will refer to as the base utility.

The utility required for Bayesian experimental design can be derived by assuming that the optimal action (which we assume exists) is always taken, giving

 U(τ,θ,y) = max_a E_{θ∼π(θ|y;τ)}[V(a,θ,y,τ)].

In principle the base utility could be elicited from the experimenter's preferences. However this is often not possible and instead a generic function can be used. One possibility is to let a be a point estimate of θ. Then a base utility function such as mean squared error can be used. See Chaloner and Verdinelli (1995) for a discussion.

Bernardo (1979) proposed that instead be an estimated density for . The utility can then be based on a strictly proper scoring rule, a functional for evaluating the quality of density estimates. Bernardo showed that this framework allows a decision theoretic derivation of Shannon information gain. We summarise the argument in Appendix A.

### 2.4 Existing computational methods

Below we discuss two popular algorithms for Bayesian experimental design, which we use in our comparisons: the Müller and ACE algorithms. Many other algorithms have been proposed. For example, Ryan et al. (2014) look at high dimensional designs with a low dimensional parametric form, Price et al. (2018) use evolutionary algorithms, and Gillespie and Boys (2019) perform a sophisticated search on a discrete grid of designs.

#### Müller algorithm

Müller (1999) performs optimal design using Markov chain Monte Carlo (see Amzal et al., 2006, and Kück et al., 2006, for similar approaches using sequential Monte Carlo). The target density is

 h(τ,θ_1,…,θ_J,y_1,…,y_J) ∝ ∏_{j=1}^{J} U(τ,θ_j,y_j) π(θ_j) f(y_j|θ_j;τ).

The marginal density for τ is proportional to a power of the expected utility, J(τ)^J. Hence the mode of τ under this marginal is the optimal design. Estimating the mode from MCMC samples can be non-trivial, especially for high dimensional designs. Taking larger values of J makes the mode easier to identify, but increases the computational cost of each iteration.

The Müller algorithm uses Metropolis-Hastings to sample from the target distribution. To draw a proposal, first a new design τ′ is sampled from a proposal kernel. Then J pairs (θ_j, y_j) are sampled independently from π(θ_j) f(y_j|θ_j;τ′). Therefore the tuning choices we require are J and the proposal kernel.
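A minimal sketch of the Müller algorithm for a scalar design follows. The Gaussian random walk proposal and the toy utility in the usage example, which peaks at τ = 2 and ignores θ and y, are invented choices for illustration; with fresh (θ_j, y_j) pairs proposed from π(θ) f(y|θ;τ′) and a symmetric proposal on τ, the acceptance ratio reduces to a ratio of utility products.

```python
import math
import random

def muller_mcmc(u, sample_prior, sample_data, tau0, n_iter=2000, J=1, step=0.5):
    """Metropolis-Hastings sketch of the Mueller (1999) algorithm.
    u(tau, theta, y) must be non-negative. The (theta_j, y_j) blocks are
    proposed fresh from the prior and likelihood, so those terms cancel
    in the acceptance ratio, leaving the ratio of utility products."""
    def draw_pairs(tau):
        out = []
        for _ in range(J):
            th = sample_prior()
            out.append((th, sample_data(th, tau)))
        return out

    tau = tau0
    pairs = draw_pairs(tau)
    u_prod = math.prod(u(tau, th, y) for th, y in pairs)
    samples = []
    for _ in range(n_iter):
        tau_new = tau + random.gauss(0, step)       # symmetric random walk on tau
        pairs_new = draw_pairs(tau_new)
        u_new = math.prod(u(tau_new, th, y) for th, y in pairs_new)
        if u_prod == 0 or random.random() < u_new / u_prod:
            tau, pairs, u_prod = tau_new, pairs_new, u_new
        samples.append(tau)
    return samples

# Usage sketch: an invented utility peaked at tau = 2.
random.seed(1)
samples = muller_mcmc(lambda tau, th, y: math.exp(-(tau - 2) ** 2),
                      lambda: random.gauss(0, 1),
                      lambda th, tau: random.gauss(th, 1),
                      tau0=0.0)
tau_mode = sum(samples[1000:]) / len(samples[1000:])  # crude mode estimate
```

With J = 1 the samples are spread widely around the optimum; increasing J concentrates them near the mode, at greater cost per iteration, as described above.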

#### Approximate co-ordinate exchange (ACE)

The ACE algorithm (Overstall and Woods, 2017) consists of two phases. Phase I is referred to as coordinate exchange. It loops over the components of the design, updating each in turn. To perform an update, designs are selected from the one dimensional search space and Monte Carlo estimates of expected utility calculated. A Gaussian process is fitted to the expected utility estimates and used to propose an improved value for the design component under consideration. This is accepted or rejected based on a Bayesian test of whether it improves expected utility, using a large number of simulations under the current and proposed designs.

Phase II is referred to as point exchange. It considers whether the design output by phase I can be improved by replacing some components of the design with replicates of other components. New designs are proposed in a greedy fashion, replicating the design point which would yield the largest improvement in estimated expected utility, then removing the point which would result in the least reduction of the estimated expected utility. Similarly to phase I, the proposed design is then accepted based on a Bayesian test of whether expected utility is improved, after sampling utilities under the existing and candidate designs.

ACE can converge to local optima so the authors suggest running the algorithm multiple times and selecting the design which returns the highest expected utility. These runs can be performed in parallel to reduce computation time. The algorithm is implemented in the acebayes R package (Overstall et al., 2017). The package allows ACE phase II to be run separately, so it can be used to post-process designs from any algorithm.

## 3 Fisher information gain

Section 3.1 introduces the Fisher information gain utility, and discusses its properties, with further details given in Appendix A. Section 3.2 discusses evaluating it computationally.

### 3.1 Definition and properties

Walker (2016) proposes maximising the following objective function for optimal design:

 J_FIG(τ) = E_{θ∼π(θ)}[tr I(θ;τ)]. (9)

Under the framework of Section 2.2, the utility function used is

 U_FIG(τ,θ,y) = tr I(θ;τ), (10)

corresponding to classical T-optimality. We’ll refer to this as the Fisher information gain (FIG) due to a rough analogy with SIG discussed in Appendix A.

###### Remark 2.

The FIG utility is not invariant to reparameterisation. This will be discussed further in Section 4.4.

Since the right hand side of (10) does not involve y, the FIG utility is not a functional of the posterior, and therefore is pseudo-Bayesian under the terminology of Ryan et al. (2016). However Walker (2016) shows that J_FIG also results (up to an additive constant) from using utilities which are functionals of the posterior (two utilities given in Section A.4 of the appendix). As noted in Remark 1, an expected utility equivalent to J_FIG can hence be derived from both pseudo-Bayesian and fully Bayesian utility functions.

The following result provides further theoretical support for J_FIG.

###### Corollary 1.

The expected utility J_FIG can be derived from Bayesian decision theory using the negative Hyvärinen score (Hyvärinen, 2005) as the base utility.

This follows as a special case of Theorem 1 given in Appendix A. Appendix A.4 gives the details. It also discusses how the various utilities just described correspond to information theoretic quantities derived from the Hyvärinen score. The most computationally convenient to use in practice is U_FIG, as discussed in the next section.

### 3.2 Evaluation of the Fisher information gain

An advantage of the FIG utility is that for many models the Fisher information is available in closed form. This avoids the computational cost and approximation error associated with many alternative utilities, discussed in Section 2.2. (The computational gains are reduced when a closed form is not available. See Appendix E for details of this case.)

In later examples we will consider a model with observation vector y ∼ N(x(θ,τ), Σ(τ)). That is, the observations are true values x(θ,τ) plus normal noise, which may be correlated. The entry in row i, column j of the associated Fisher information matrix is (see e.g. Miller, 1974, section V, equation (4.4))

 I_{i,j}(θ;τ) = v_i(θ,τ)^T Σ(τ)^{-1} v_j(θ,τ),

where v_i(θ,τ) is the vector of elementwise derivatives of x(θ,τ) with respect to θ_i. Thus

 tr I(θ;τ) = ∑_{i=1}^{p} v_i(θ,τ)^T Σ(τ)^{-1} v_i(θ,τ). (11)
###### Remark 3.

Under the normal model, multiplying Σ(τ) by a constant positive scalar only affects the FIG utility by a constant of proportionality, and so does not change the optimal design.
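For concreteness, (11) can be evaluated as in the sketch below. Here the derivative vectors v_i are formed by central finite differences, whereas the paper uses closed forms where available; the quadratic mean function in the usage example is purely illustrative.

```python
import numpy as np

def fig_utility(x_fun, Sigma_fun, theta, tau, eps=1e-6):
    """FIG utility (11) for y ~ N(x(theta, tau), Sigma(tau)):
    tr I = sum_i v_i^T Sigma^{-1} v_i, with v_i = d x / d theta_i."""
    Sigma_inv = np.linalg.inv(Sigma_fun(tau))
    total = 0.0
    for i in range(len(theta)):
        th_p = np.array(theta, dtype=float)
        th_m = np.array(theta, dtype=float)
        th_p[i] += eps
        th_m[i] -= eps
        # Central finite-difference approximation to v_i
        v_i = (x_fun(th_p, tau) - x_fun(th_m, tau)) / (2 * eps)
        total += v_i @ Sigma_inv @ v_i
    return total

# Usage sketch: mean x_j = theta_1 t_j + theta_2 t_j^2 and Sigma = I, so
# tr I = sum_j (t_j^2 + t_j^4); for tau = (1, 2) this is 5 + 17 = 22.
val = fig_utility(lambda th, tau: th[0] * tau + th[1] * tau ** 2,
                  lambda tau: np.eye(len(tau)),
                  theta=[0.5, 0.5], tau=np.array([1.0, 2.0]))
```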

## 4 Maximising expected Fisher information gain

This section describes our methodology for maximising J_FIG(τ), making use of stochastic gradient optimisation methods. Section 4.1 presents our algorithm and background on the methods it uses. The algorithm requires unbiased estimates of the gradient of J_FIG, and Section 4.2 discusses calculating these. Section 4.3 is on optimisation under constraints to the space of designs, and Section 4.4 is about optimising a weighted version of the FIG.

Note that throughout the paper we use ∇ to represent gradient with respect to τ. When differentiation with respect to another vector is required we add a subscript, e.g. ∇_θ.

### 4.1 Optimisation algorithm

Algorithm 1 summarises our algorithm to maximise . In this section we discuss the background to the stochastic gradient optimisation methods which it uses, and some implementation details.

Our aim is to maximise J_FIG(τ), although we cannot compute this function exactly. This is possible using stochastic gradient optimisation, an iterative method using (when maximisation is required) the update

 τ_{t+1} = τ_t + a_t g_t,

where g_t is an unbiased estimate of ∇J_FIG(τ_t) and (a_t) is a decreasing learning rate sequence. For unbiased gradient estimates and an appropriate learning rate sequence, convergence to a local optimum is guaranteed. See Kushner and Yin (2003) for an overview of the theory.

Modern variations on stochastic gradient optimisation replace the scalar learning rate a_t with a vector of learning rates applied to g_t elementwise. This vector is chosen adaptively to speed convergence. Various algorithms of this form are in common use and we use the popular "adaptive moments" Adam algorithm (Kingma and Ba, 2015). (This is technically a minimisation method, so we use it to minimise −J_FIG(τ).) An appealing feature of Adam is that it often does not require tuning, with the default choices performing well in many situations. We found the defaults worked well throughout this paper.

We typically run SGO-FIG for a fixed number of iterations. Alternatively it could be run until a convergence condition is reached. For example, J_FIG could be estimated at each iteration and a moving average recorded, with the algorithm terminating if the best moving average value is not beaten for a prespecified number of iterations.

### 4.2 Gradient estimation

SGO-FIG requires unbiased estimates of ∇J_FIG(τ). From the definition of J_FIG, (9), and assuming weak regularity conditions (see Appendix D) to allow interchange of differentiation and expectation,

 ∇J_FIG(τ) = E_{θ∼π(θ)}[∇ tr I(θ;τ)].

A closed form of the Fisher information is often available (see Appendix E for when this is not the case). In this case an unbiased Monte Carlo gradient estimate is

 ^∇J_FIG(τ) = (1/K) ∑_{k=1}^{K} ∇ tr I(θ^{(k)};τ), (12)

where θ^{(1)},…,θ^{(K)} are independent draws from the prior. Using a larger Monte Carlo batch size K reduces the variance of the estimates but increases computational cost.
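Combining the Monte Carlo gradient (12) with Adam can be sketched as below for a scalar design. The integrand tr I(θ;τ) = exp(−θ(τ−3)²) is an invented toy whose optimal design is τ = 3 for every θ > 0, and the step size is chosen for the example (slightly larger than Adam's usual default).

```python
import math
import random

def grad_trace_info(theta, tau):
    """Gradient (w.r.t. tau) of an invented toy tr I(theta; tau) =
    exp(-theta * (tau - 3)^2), maximised at tau = 3 for any theta > 0."""
    return -2.0 * theta * (tau - 3.0) * math.exp(-theta * (tau - 3.0) ** 2)

def sgo_fig(grad_fn, sample_prior, tau0, K=1, n_iter=2000,
            lr=0.01, b1=0.9, b2=0.999, eps=1e-8):
    """SGO-FIG sketch for a scalar design: Adam ascent driven by the
    unbiased Monte Carlo gradient estimate (12) with batch size K."""
    tau, m, v = tau0, 0.0, 0.0
    for t in range(1, n_iter + 1):
        g = sum(grad_fn(sample_prior(), tau) for _ in range(K)) / K  # estimate (12)
        m = b1 * m + (1 - b1) * g                 # first moment estimate
        v = b2 * v + (1 - b2) * g * g             # second moment estimate
        m_hat = m / (1 - b1 ** t)                 # bias corrections
        v_hat = v / (1 - b2 ** t)
        tau += lr * m_hat / (math.sqrt(v_hat) + eps)  # ascent step
    return tau

# Usage sketch with an assumed log normal prior on theta.
random.seed(2)
tau_opt = sgo_fig(grad_trace_info, lambda: math.exp(random.gauss(0, 0.5)), tau0=1.0)
```

Because Adam normalises each step by a running gradient scale, progress does not stall even where the raw gradients are tiny, which is helpful when tr I varies over many orders of magnitude across the prior.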

Typically we can calculate this gradient estimate using automatic differentiation (Baydin et al., 2017). Our code does this using the TensorFlow framework (Abadi et al., 2016). We manually compute the function to be differentiated, tr I(θ;τ). See Sections 5–7 for some examples. Deriving tr I can itself involve lengthy differentiation, and there may be scope for future work to avoid this using advanced automatic differentiation methods.

### 4.3 Optimisation under constraints

We often wish to optimise the expected utility under a constraint: τ ∈ T for some feasible subset T of ℝ^d. For the application in this paper, constrained optimisation was possible by the simple pragmatic approach of adding a large penalty to expected utilities for τ ∉ T, whose gradient moves designs back towards the feasible space T.

In more complex settings this penalisation method may not suffice. One more sophisticated alternative is to compose each stochastic gradient optimisation update with a projection operation into T (Kushner and Yin, 2003).
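For the common case where T is a box of allowed measurement times or locations, the projection is simply elementwise clipping; a minimal sketch:

```python
def project_box(tau, low, high):
    """Project a design onto the box [low, high]^d by elementwise clipping.
    Composed with each stochastic gradient update when penalties do not suffice."""
    return [min(max(t, low), high) for t in tau]

# Usage sketch: clip a design into a hypothetical interval [0, 24].
projected = project_box([-1.0, 12.0, 30.0], 0.0, 24.0)
```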

### 4.4 Optimisation using weights

The Fisher information gain can be written as

 tr I(θ;τ) = ∑_{i=1}^{p} E_{y∼f(y|θ;τ)}[u_i(y,θ;τ)²], (13)

where u_i is the ith element of the score function u(y,θ;τ). An alternative utility uses a weighted sum in (13). We now show that this is equivalent to rescaling the parameters. (This shows that the FIG is not invariant under reparameterisation, as mentioned in Remark 2.)

Consider ϑ = Wθ, where W is a diagonal weight matrix with diagonal entries w_1,…,w_p, all positive. Let w represent the vector of these weights. Then it is easy to show that

 tr I(ϑ;τ) = tr[I(θ;τ)W^{-2}] = ∑_{i=1}^{p} E_{y∼f(y|θ;τ)}[u_i(y,θ;τ)²/w_i²]. (14)

The corresponding expected utility is

 J_FIG(τ;w) = E_{θ∼π(θ)} tr[I(θ;τ)W^{-2}].

When there is no natural parameter scale to use, we argue it is reasonable to weight the parameters so that each contributes a comparable amount to the sum in (14). Otherwise optimal design may concentrate solely on maximising informativeness for a subset of parameters: those with the largest contributions (see Table 1 for an example).

In Appendix B we present Algorithm 2, which adaptively learns weights with the property just described. While this algorithm can be used for optimal design, its convergence is not guaranteed, as discussed in the appendix. However, it can generally be used to pick reasonable weights prior to running Algorithm 1 using the rescaled parameters ϑ = Wθ.
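The weighting identity (14) is easy to verify numerically; the matrix below is an arbitrary positive semi-definite stand-in for a Fisher information, not one from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 3))
I_mat = A @ A.T                          # arbitrary PSD "Fisher information"
w = np.array([1.0, 2.0, 10.0])           # positive weights
W = np.diag(w)

lhs = np.trace(I_mat @ np.linalg.inv(W @ W))          # tr[I W^{-2}]
rhs = sum(I_mat[i, i] / w[i] ** 2 for i in range(3))  # weighted diagonal sum
# Down-weighting the third parameter by w = 10 shrinks its contribution
# to the objective by a factor of 100, illustrating how weights rebalance
# the per-parameter terms in (14).
```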

## 5 Death model example

Several authors have investigated experimental design for the simple death process (Renshaw, 1993; Cook et al., 2008; Gillespie and Boys, 2019). To illustrate our method, we consider this setting with a single observation time, τ. In this scenario the observation model is y ∼ Bin(n, λ) where λ = exp(−θτ). Here n is a known constant and θ is the parameter of interest.

The Fisher information for this model can be derived from two standard results. First, the Fisher information for λ under a Bin(n,λ) observation is n/[λ(1−λ)]. Secondly, a reparameterisation λ = g(θ) produces I_θ(θ) = I_λ(g(θ)) g′(θ)². Hence for the death model, with g(θ) = e^{−θτ}, we have

 I_θ(θ) = nτ²λ/(1−λ) = nτ²e^{−θτ}/(1−e^{−θτ}).

Since this is scalar, the expected FIG utility is J_FIG(τ) = E_{θ∼π(θ)}[I_θ(θ)]. Following Cook et al. (2008), we take their choice of n and a log normal prior distribution for θ. In this example the expected utility is a univariate integral, so near-exact numerical calculation is possible. Figure 1a shows the resulting utility surface, and the optimal observation time. (Gillespie and Boys (2019) report the same optimal design to 2 decimal places, despite using a different utility: posterior precision. An explanation is that Fisher information is an approximation to posterior precision.)
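A sketch of this calculation, replacing the univariate integral by a Monte Carlo average over the prior and a grid search over τ. The values of n and the prior parameters below are placeholders for illustration, not those of Cook et al. (2008).

```python
import math
import random

def fisher_death(theta, tau, n=50):
    """Death model Fisher information: n tau^2 e^{-theta tau} / (1 - e^{-theta tau}).
    (n = 50 is an assumed value.)"""
    lam = math.exp(-theta * tau)
    return n * tau ** 2 * lam / (1.0 - lam)

def j_fig(tau, n_mc=2000):
    """Monte Carlo estimate of J_FIG(tau) under an assumed log normal prior
    on theta. A fixed seed gives common random numbers across tau values."""
    rng = random.Random(7)
    return sum(fisher_death(math.exp(rng.gauss(0.0, 0.1)), tau)
               for _ in range(n_mc)) / n_mc

# Grid search over observation times. With theta concentrated near 1, the
# maximiser of tau^2 / (e^tau - 1) solves tau = 2(1 - e^{-tau}), about 1.59.
grid = [t / 10 for t in range(1, 101)]
tau_star = max(grid, key=j_fig)
```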

Figure 1b shows independent runs of SGO-FIG from a range of initial values of τ, using a fixed batch size. (In practice, rather than an initial value of exactly zero, we used a small positive value to avoid numerical issues.) Regardless of the initial value, SGO-FIG quickly locates the optimal design. As would be expected, the closer the initial value is to the optimum, the quicker the algorithm converges. Each analysis uses 10,000 iterations, which runs in only a few seconds for this simple example.

## 6 Pharmacokinetic example

This section contains simulation studies using a pharmacokinetic model. The goal is to investigate the performance of our optimisation method and compare it with existing methods. For a fair comparison, we also use the FIG utility in the other methods.

Throughout we compare the methods based on how many utility evaluations they perform. For SGO-FIG each utility evaluation also has an associated gradient calculation. Due to this extra cost we also comment on the time taken by the different methods, although this is an imperfect comparison as it is influenced by implementation details, such as the programming language used.

#### Model

We assume that drug concentration, y_j, at time τ_j (in hours) is distributed as

 y_j ∼ N(x(θ,τ_j), σ²), where x(θ,τ_j) = D θ_2 (e^{−θ_1 τ_j} − e^{−θ_2 τ_j}) / [θ_3(θ_2−θ_1)],

and D is a fixed dose. Concentrations at different times are assumed to be independent. This is a modification of a model from Ryan et al. (2014) and Overstall and Woods (2017), removing a multiplicative noise term for simplicity.

Following this earlier work we assume independent log normal prior distributions

 θ_1 ∼ LN(log 0.1, 0.05), θ_2 ∼ LN(log 1, 0.05), θ_3 ∼ LN(log 20, 0.05),

and aim to find d observation times within a fixed interval of hours. Also we treat σ as known. Previous work required consecutive observations to be at least a minimum number of minutes apart. We do not enforce this condition as its implementation would vary between methods, making it difficult to draw fair conclusions.

#### Methods

The FIG utility for this model can be calculated using (11). See Appendix C for details. When implementing the algorithms, we found no need to use constrained optimisation, as the designs remained within the design interval in any case.
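The mean function and the resulting FIG utility (11) can be sketched as follows. Derivatives are formed by finite differences here rather than the closed forms of Appendix C, and the dose D and noise level σ are assumed values for illustration.

```python
import math

def pk_mean(theta, t, D=400.0):
    """Compartmental mean concentration x(theta, t); D is an assumed dose."""
    th1, th2, th3 = theta
    return D * th2 / (th3 * (th2 - th1)) * (math.exp(-th1 * t) - math.exp(-th2 * t))

def pk_fig(theta, taus, sigma=0.1, eps=1e-6):
    """FIG utility (11) with independent observations, Sigma = sigma^2 I:
    tr I = (1/sigma^2) sum_i sum_j (d x(theta, tau_j) / d theta_i)^2.
    Derivatives approximated by central finite differences."""
    total = 0.0
    for i in range(3):
        tp, tm = list(theta), list(theta)
        tp[i] += eps
        tm[i] -= eps
        for t in taus:
            d = (pk_mean(tp, t) - pk_mean(tm, t)) / (2 * eps)
            total += d * d
    return total / sigma ** 2

# At the prior medians theta = (0.1, 1, 20), the mean is exactly zero at
# t = 0, so that time is uninformative, while later times carry information.
u0 = pk_fig((0.1, 1.0, 20.0), [0.0])
u1 = pk_fig((0.1, 1.0, 20.0), [1.0, 8.0])
```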

The three terms in the sum representation of the FIG (13) are on widely different scales for this example. Table 1 shows that using SGO-FIG on the unweighted objective only increases the utility contribution for one parameter. Hence we use a weighted FIG (14) instead. After a short pilot run using adaptive weights (i.e. Algorithm 2), fixed weights were selected. Henceforth all methods use parameters scaled by these weights. Table 1 shows that SGO-FIG on the weighted FIG objective increases the utility contribution for all parameters.

Table 2 contains details of the algorithms we compare and their tuning choices. We sampled 100 initial designs from a uniform distribution over the search space. Each algorithm is run from each of these initial designs. We also investigate post-processing the results from each method using the ACE phase II algorithm (under its default tuning choices from Overstall and Woods, 2017).

#### Results

First we investigated the cost of evaluating the FIG and SIG utilities. As discussed earlier, we expect SIG evaluation to be slower as it requires computing a Monte Carlo estimate (8). In ACE, the mean time to produce 1000 utility evaluations was 0.008 seconds for FIG compared to 0.082 seconds for SIG. Hence FIG produces roughly a 10 times speed-up in utility evaluation compared to SIG. (This is despite our FIG calculations being performed in R, while SIG uses more efficient C code.)

Figure 2 shows expected utilities returned by each optimal design method. Müller makes only a small improvement on the initial designs. SGO-FIG makes a larger improvement but does worse than ACE. Post-processing helps all the methods, with SGO-FIG results now generally the best, but occasionally extremely poor (about 10% of results have expected utility close to 2.5).

Speed of convergence is also of interest. Figure 3 is a trace plot of utilities during the execution of the ACE and SGO-FIG algorithms. We see that SGO-FIG converges much more quickly than ACE (taking relatively few iterations), while ACE does not appear to have converged within the full computational budget. However SGO-FIG can converge to several different expected utilities, typically achieving a worse value than ACE.

For a fixed number of utility evaluations, ACE was much quicker to execute than the other methods. For one particular initial design, approximate times were (in seconds): ACE 130, SGO-FIG 7,800, Müller 1 9,000, Müller 2 4,500. (The cost of our implementation of the Müller algorithm is dominated by non-utility calculations in each iteration, which is why computation time is halved when using half as many iterations.) However SGO-FIG takes only 40 seconds to reach the number of utility evaluations at which it has effectively converged.

Figure 4 summarises the designs returned by the algorithms. The Müller algorithm produces designs approximately uniformly spaced across the design space. SGO-FIG converges to designs that have repeated observations at times close to 1 and 8. Across the runs only the proportions at these two times change, indicating these designs are local maxima. ACE also returns designs where all times are close to 1 and 8. However there is more variability in the observation times, suggesting that these runs have not yet converged to a local maximum. Also, ACE places more observation times near 1 than SGO-FIG does. This suggests it is better at avoiding poor local maxima.

Post-processing usually selects only observation times close to 1, rather than any close to 8. Since post-processing improves the expected utility, as shown by Figure 2, it appears that this is the optimal design. The SGO-FIG algorithm has 10 runs where the returned design places all observations around 8, and post-processing cannot move points to other times to improve the utility. These correspond to the outlying points in Figure 2 where SGO-FIG plus post-processing achieves a poor expected utility close to 2.5.

We also investigate different choices of batch size K in SGO-FIG. The trace plots in Figure 5 show that the batch sizes considered lead SGO-FIG to converge in roughly the same number of iterations. Hence the smallest batch size is most efficient in terms of computational cost.

#### Summary

Our simulation study shows that FIG is much quicker to compute than SIG. Furthermore, SGO-FIG finds utility maxima using fewer utility evaluations than the other algorithms investigated. However it is slower than ACE in terms of time taken to run a fixed number of utility evaluations. Overall, this does not prevent SGO-FIG from being quicker to converge to its final design. We expect the advantage to be more pronounced in examples with more expensive utility evaluations.

SGO-FIG is more susceptible than ACE to finding poor local maxima. However, combining SGO-FIG with a post-processing phase of ACE phase II generally seems to find the global maximum. In a small number of cases the SGO-FIG designs reached a local maximum that post-processing could not improve significantly. Therefore in practice we suggest running SGO-FIG followed by post-processing multiple times from different initial designs, and selecting the returned design which maximises expected utility.

## 7 Geostatistical regression example

#### Model

In this section we consider a geostatistical regression model. Here a design $\tau$ is a matrix whose $i$th row specifies the location of a measurement. (For the purposes of running the optimal design algorithms, $\tau$ can be flattened into a vector.) The model assumes normal observations with a linear trend and squared exponential covariance function with a nugget effect, giving

$$y \sim N(x(\theta,\tau), \Sigma(\tau)), \qquad x_i = \theta_1 \tau_{i1} + \theta_2 \tau_{i2}, \qquad \Sigma = \sigma_1^2 I + \sigma_2^2 R(\tau), \qquad R_{ij} = \exp\left[-\sum_{k=1}^{2} (\tau_{ik} - \tau_{jk})^2 / \ell^2\right].$$

For simplicity we assume that $\sigma_1, \sigma_2$ (observation variance components) and $\ell$ (covariance length scale) are known. Hence the unknown parameters are the trends $\theta_1$ and $\theta_2$. An alternative parameterisation, which will be useful shortly, introduces $\sigma^2 = \sigma_1^2$ and $\lambda = \sigma_2^2 / \sigma_1^2$ so that $\Sigma = \sigma^2 [I + \lambda R(\tau)]$.
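As a concrete illustration, the model above can be simulated in a few lines of NumPy. All function names and parameter values here (20 locations, $\theta = (1, -2)$, unit variance components, $\ell = 0.5$) are our own illustrative choices, not values from the paper:

```python
import numpy as np

def sq_exp_cov(tau, sigma1, sigma2, ell):
    """Covariance Sigma(tau) = sigma1^2 I + sigma2^2 R(tau) with a squared
    exponential correlation R and nugget effect sigma1^2."""
    # Pairwise squared distances between design locations (rows of tau).
    d2 = ((tau[:, None, :] - tau[None, :, :]) ** 2).sum(axis=2)
    R = np.exp(-d2 / ell ** 2)
    return sigma1 ** 2 * np.eye(len(tau)) + sigma2 ** 2 * R

def simulate(tau, theta, sigma1, sigma2, ell, rng):
    """Draw y ~ N(tau @ theta, Sigma(tau)); the mean uses
    x_i = theta1 * tau_i1 + theta2 * tau_i2."""
    mean = tau @ theta
    cov = sq_exp_cov(tau, sigma1, sigma2, ell)
    return rng.multivariate_normal(mean, cov)

rng = np.random.default_rng(0)
tau = rng.uniform(-0.5, 0.5, size=(20, 2))  # 20 locations in the unit square
y = simulate(tau, np.array([1.0, -2.0]), 1.0, 1.0, 0.5, rng)
```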

#### FIG utility

Using (11),

$$\operatorname{tr} I(\tau) = \sum_{i=1}^{2} \tau_i^T \Sigma(\tau)^{-1} \tau_i,$$

which does not depend on $\theta$; here $\tau_i$ denotes the $i$th column of $\tau$. Hence we can calculate the expected utility $J_{\text{FIG}}(\tau)$ deterministically as $\operatorname{tr} I(\tau)$. Also note that $\sigma^2$ only affects $\operatorname{tr} I(\tau)$ as a constant of proportionality, so it does not change the optimal design. (Recall Remark 3: multiplying the utility by a constant does not affect the optimal design.)
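A minimal sketch of this deterministic utility computation, assuming the column-wise trace formula above; `fig_utility` and the parameter values are our own naming:

```python
import numpy as np

def fig_utility(tau, sigma1, sigma2, ell):
    """Expected FIG for the geostatistical model: tr I(tau), computed as
    tr(tau^T Sigma(tau)^{-1} tau), i.e. summing the quadratic form over
    the two columns of tau."""
    d2 = ((tau[:, None, :] - tau[None, :, :]) ** 2).sum(axis=2)
    Sigma = sigma1 ** 2 * np.eye(len(tau)) + sigma2 ** 2 * np.exp(-d2 / ell ** 2)
    return float(np.trace(tau.T @ np.linalg.solve(Sigma, tau)))

rng = np.random.default_rng(1)
tau = rng.uniform(-0.5, 0.5, size=(10, 2))
u1 = fig_utility(tau, 1.0, 1.0, 0.5)
# Scaling both variance components by c rescales the utility by 1/c^2,
# so the optimal design is unchanged.
u2 = fig_utility(tau, 2.0, 2.0, 0.5)
```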

#### Simulation study

We performed various simulation studies to search for design locations restricted to a unit square centred at the origin. Throughout we use Algorithm 1 with a fixed batch size. We implemented constrained optimisation by adding a large penalty to designs outside the unit square. Adaptive weights were not considered, as by symmetry the parameters contribute similarly to the expected utility.
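The penalty approach can be sketched as follows; the penalty weight and the half-width of 0.5 are illustrative assumptions:

```python
import numpy as np

def penalty(tau, weight=1e4):
    """Quadratic penalty for design points outside the unit square
    [-0.5, 0.5]^2: zero inside, growing quadratically outside."""
    excess = np.maximum(np.abs(tau) - 0.5, 0.0)
    return weight * float((excess ** 2).sum())

def penalised_objective(tau, utility, weight=1e4):
    # Maximising this pushes iterates that leave the square back inside.
    return utility(tau) - penalty(tau, weight)
```

Because the penalty is differentiable, it can simply be added to the objective inside a stochastic gradient update.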

Figure 6 shows the resulting designs under various choices of the covariance parameters. For both large and small values, the design points cluster in the corners of the square. Between these extremes, the points are more uniformly spaced in the unit square.

For a comparison with ACE we fix the covariance parameters. Both algorithms were started from initial designs sampled uniformly over the design space; for a fair comparison, the same initial designs were used for both algorithms. The default settings for ACE were used, noting that the expected utility is deterministic, so it was not necessary to estimate it using Monte Carlo. The number of SGO-FIG iterations and its batch size determine its computational budget of utility evaluations per run. Each SGO-FIG run took roughly 45 seconds, while the ACE analyses took roughly 4 minutes each. Figure 7 shows that SGO-FIG converges after using many fewer utility evaluations than ACE, and returns designs with higher expected utility values.

## 8 Discussion

We have presented a stochastic gradient optimisation algorithm, SGO-FIG, for Bayesian experimental design which can quickly find optimal high dimensional designs under a particular fast-to-compute utility function, the Fisher information gain. We also provide a novel decision theoretic justification for this utility. In our simulation studies SGO-FIG finds maxima of the expected utility function faster than other state-of-the-art methods. We ran our experiments on a CPU, but our TensorFlow code can easily make use of GPU parallelisation, allowing further speed improvements.

In one application (Section 6) we found SGO-FIG often converged to sub-optimal local maxima, but this could be resolved by post-processing our results with ACE phase II. There may be scope for future work modifying stochastic gradient optimisation methods to avoid local maxima in optimal design problems e.g. using tempering methods, or non-local updates such as line search, as used in ACE.

In Section 6 we also observed that maximising the expected FIG often produces designs with repeated points. A similar finding was reported by Pronzato and Walter (1985). We speculate that the issue may be as follows. Repeated observation times can produce highly concentrated posteriors under some observed data, at the cost of less informative results otherwise. Overall the expected FIG rewards such trade-offs, as Fisher information is an approximation to posterior precision. In other words, FIG can be viewed as a risk seeking utility function. (For more on risk attitudes of utility functions see e.g. French and Insua, 2000.) This suggests modifying the utility to give decreasing returns to posterior concentration, for example by applying a concave transformation, effectively introducing more risk aversion. It would be interesting to explore whether there are variations on the FIG along these lines which avoid designs with repeated points while retaining its computational convenience and decision theoretic support.

More generally, stochastic gradient optimisation could also be used for other utility functions whose gradients can be calculated, such as other functionals of the Fisher information matrix, e.g. its log determinant or the negative trace of its inverse. However the cost of computing determinants or inverses of $I(\theta;\tau)$ is higher than simply computing the trace, and there is arguably less theoretical backing for these utility functions.

## References

• Abadi et al. (2016) Abadi, M. et al. (2016). Tensorflow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pages 265–283.
• Amzal et al. (2006) Amzal, B., Bois, F. Y., Parent, E., and Robert, C. P. (2006). Bayesian-optimal design via interacting particle systems. Journal of the American Statistical Association, 101(474):773–785.
• Baydin et al. (2017) Baydin, A. G., Pearlmutter, B. A., Radul, A. A., and Siskind, J. M. (2017). Automatic differentiation in machine learning: a survey. Journal of Machine Learning Research, 18(153):1–153.
• Bernardo (1979) Bernardo, J. M. (1979). Expected information as expected utility. The Annals of Statistics, 7(3):686–690.
• Box (1982) Box, G. E. P. (1982). Choice of response surface design and alphabetic optimality. Utilitas Math. B, 21:11–55.
• Chaloner and Verdinelli (1995) Chaloner, K. and Verdinelli, I. (1995). Bayesian experimental design: A review. Statistical Science, pages 273–304.
• Cook et al. (2008) Cook, A. R., Gibson, G. J., and Gilligan, C. A. (2008). Optimal observation times in experimental epidemic processes. Biometrics, 64(3):860–868.
• Figurnov et al. (2018) Figurnov, M., Mohamed, S., and Mnih, A. (2018). Implicit reparameterization gradients. In Advances in Neural Information Processing Systems, pages 441–452.
• French and Insua (2000) French, S. and Insua, D. R. (2000). Statistical decision theory. Wiley.
• Gillespie and Boys (2019) Gillespie, C. S. and Boys, R. J. (2019). Efficient construction of Bayes optimal designs for stochastic process models. Statistics and Computing (online preview).
• Gneiting and Raftery (2007) Gneiting, T. and Raftery, A. E. (2007). Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102(477):359–378.
• Huan and Marzouk (2013) Huan, X. and Marzouk, Y. M. (2013). Simulation-based optimal Bayesian experimental design for nonlinear systems. Journal of Computational Physics, 232(1):288–317.
• Huan and Marzouk (2014) Huan, X. and Marzouk, Y. M. (2014). Gradient-based stochastic optimization methods in Bayesian experimental design. International Journal for Uncertainty Quantification, 4(6).
• Hyvärinen (2005) Hyvärinen, A. (2005). Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, 6:695–709.
• Jankowiak and Obermeyer (2018) Jankowiak, M. and Obermeyer, F. (2018). Pathwise derivatives beyond the reparameterization trick. In International Conference on Machine Learning, pages 2240–2249.
• Kingma and Ba (2015) Kingma, D. P. and Ba, J. (2015). Adam: A method for stochastic optimization. International Conference on Learning Representations.
• Krause et al. (2009) Krause, A., Rajagopal, R., Gupta, A., and Guestrin, C. (2009). Simultaneous placement and scheduling of sensors. In Proceedings of the 2009 International Conference on Information Processing in Sensor Networks, pages 181–192. IEEE Computer Society.
• Kück et al. (2006) Kück, H., de Freitas, N., and Doucet, A. (2006). SMC samplers for Bayesian optimal nonlinear design. In 2006 IEEE Nonlinear Statistical Signal Processing Workshop, pages 99–102. IEEE.
• Kushner and Yin (2003) Kushner, H. and Yin, G. G. (2003). Stochastic approximation and recursive algorithms and applications. Springer Science & Business Media.
• Lindley (1972) Lindley, D. (1972). Bayesian statistics, a review. SIAM.
• Maddison et al. (2017) Maddison, C. J., Mnih, A., and Teh, Y. W. (2017). The concrete distribution: A continuous relaxation of discrete random variables. International Conference on Learning Representations.
• Miller (1974) Miller, K. S. (1974). Complex stochastic processes: an introduction to theory and application. Addison Wesley Publishing Company.
• Müller (1999) Müller, P. (1999). Simulation-based optimal design. In Bayesian Statistics 6: Proceedings of Sixth Valencia International Meeting, pages 459–474. Oxford University Press.
• Oates et al. (2019) Oates, C. J., Cockayne, J., Prangle, D., Sullivan, T. J., and Girolami, M. (2019). Optimality criteria for probabilistic numerical methods. arXiv preprint arXiv:1901.04326.
• Overstall et al. (2017) Overstall, A., Woods, D., and Adamou, M. (2017). acebayes: An R package for Bayesian optimal design of experiments via approximate coordinate exchange. arXiv preprint arXiv:1705.08096.
• Overstall and Woods (2017) Overstall, A. M. and Woods, D. C. (2017). Bayesian design of experiments using approximate coordinate exchange. Technometrics, 59(4):458–470.
• Parry et al. (2012) Parry, M., Dawid, A. P., and Lauritzen, S. (2012). Proper local scoring rules. The Annals of Statistics, 40(1):561–592.
• Price et al. (2018) Price, D. J., Bean, N. G., Ross, J. V., and Tuke, J. (2018). An induced natural selection heuristic for finding optimal Bayesian experimental designs. Computational Statistics & Data Analysis, 126:112–124.
• Pronzato and Walter (1985) Pronzato, L. and Walter, É. (1985). Robust experiment design via stochastic approximation. Mathematical Biosciences, 75(1):103–120.
• Raiffa and Schlaifer (1961) Raiffa, H. and Schlaifer, R. (1961). Applied statistical decision theory. Division of Research, Harvard Business School.
• Renshaw (1993) Renshaw, E. (1993). Modelling Biological Populations in Space and Time. Cambridge University Press.
• Rezende et al. (2014) Rezende, D. J., Mohamed, S., and Wierstra, D. (2014). Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning, pages 1278–1286.
• Ryan et al. (2016) Ryan, E. G., Drovandi, C. C., McGree, J. M., and Pettitt, A. N. (2016). A review of modern computational algorithms for Bayesian optimal design. International Statistical Review, 84(1):128–154.
• Ryan et al. (2014) Ryan, E. G., Drovandi, C. C., Thompson, M. H., and Pettitt, A. N. (2014). Towards Bayesian experimental design for nonlinear models that require a large number of sampling times. Computational Statistics & Data Analysis, 70:45–60.
• Walker (2016) Walker, S. G. (2016). Bayesian information in an experiment and the Fisher information distance. Statistics & Probability Letters, 112:5–9.
• Wolfson et al. (1996) Wolfson, L. J., Kadane, J. B., and Small, M. J. (1996). Expected utility as a policy-making tool: an environmental health example. Statistics Textbooks and Monographs, 151:261–278.

## Appendix A Decision theoretic support

Bernardo (1979) gives a decision theoretic derivation of Shannon information gain, which we extend here. First Section A.1 reviews background material on scoring rules. Section A.2 proves a general result linking scoring rules and utilities in experimental design. Then Section A.3 relates this result to Shannon information gain and Section A.4 to Fisher information gain.

### A.1 Scoring rules

A scoring rule $S(q(\theta), \theta)$ measures the quality of a distribution for modelling an uncertain quantity $\theta$, given a realised value of $\theta$. For the purposes of this paper we represent the distribution by its density $q(\theta)$. Low scores represent a good match.

A scoring rule is strictly proper if the expected score $E_{\theta \sim p(\theta)}[S(q(\theta), \theta)]$ is always uniquely minimised by $q = p$. An interpretation of this property is as follows. Consider a decision problem where a density must be reported for a quantity $\theta \sim p(\theta)$, and the loss function is a scoring rule. The scoring rule is strictly proper if and only if the action minimising expected loss is to report the actual density, $p(\theta)$. The expected loss resulting from this action is referred to as the entropy,

$$H[p(\theta)] = E_{\theta \sim p(\theta)}[S(p(\theta), \theta)].$$

The extra expected loss from making a non-optimal action is known as the divergence,

$$D[p(\theta), q(\theta)] = E_{\theta \sim p(\theta)}[S(q(\theta), \theta) - S(p(\theta), \theta)].$$

Table 3 summarises the quantities just described for two strictly proper scoring rules, logarithmic score and Hyvärinen score (Hyvärinen, 2005). The Hyvärinen results rely on the following regularity conditions:

1. $p(\theta)$ and $q(\theta)$ are twice differentiable with respect to all components of $\theta$.

2. $E_{\theta \sim p(\theta)}[\|\nabla \log p(\theta)\|^2]$ and $E_{\theta \sim p(\theta)}[\|\nabla \log q(\theta)\|^2]$ are finite.

3. $p(\theta)\, \partial \log q(\theta) / \partial \theta_i \to 0$ as $\|\theta\| \to \infty$, for $i = 1, \ldots, p$.

Note that $\|\cdot\|$ represents the Euclidean norm, i.e. $\|x\| = (\sum_i x_i^2)^{1/2}$. For more discussion of scoring rules see for example Gneiting and Raftery (2007) and Parry et al. (2012).
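To make strict propriety concrete, here is a small numerical check for the logarithmic score $S(q, \theta) = -\log q(\theta)$ with a Bernoulli quantity; the grid search is purely illustrative:

```python
import math

def expected_log_score(p, q):
    """E_{theta ~ Bernoulli(p)}[-log q(theta)] when the reported
    distribution is Bernoulli(q)."""
    return -(p * math.log(q) + (1 - p) * math.log(1 - q))

p = 0.3
grid = [i / 100 for i in range(1, 100)]
# The expected score is minimised by reporting the true distribution...
best_q = min(grid, key=lambda q: expected_log_score(p, q))
# ...and the excess over that minimum is the divergence (here Kullback-Leibler).
kl = expected_log_score(p, 0.5) - expected_log_score(p, p)
```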

### A.2 Main result

Recall the Bayesian decision theory framework of Section 2.3. After picking a design $\tau$ and observing data $y$, the experimenter must take an action $a$. They then receive some base utility depending on $a$ and the true parameter value $\theta$. Their utility for the design given the data is the expected base utility under the posterior distribution for $\theta$, assuming that the optimal $a$ is selected. (Throughout this appendix utilities have arguments $\tau$ and $y$, but these are suppressed to simplify notation.)

Following Bernardo (1979), suppose that the action is to report an estimated density $q(\theta)$ for $\theta$. Let the base utility be the negative of a strictly proper scoring rule. Then, by the discussion in the previous section, the experimenter's utility is the negative of the entropy associated with the scoring rule, evaluated on the posterior $\pi(\theta|y;\tau)$.

The following theorem shows that two other utility functions produce equivalent expected utilities to the negative entropy.

###### Theorem 1.

Given an underlying strictly proper score function $S$, the following choices of utility function produce the same expected utility in Bayesian experimental design:

$$U_{\text{entropy diff}} = H[\pi(\theta)] - H[\pi(\theta|y;\tau)], \qquad U_{\text{divergence}} = D[\pi(\theta|y;\tau), \pi(\theta)].$$

Furthermore the utility

$$U_{\text{entropy}} = -H[\pi(\theta|y;\tau)]$$

produces the same expected utility up to an additive constant and hence shares the same optimal design.

###### Proof.

In the framework of Section 2, Bayesian experimental design is concerned with optimising the expected utility $J(\tau) = E_{(\theta,y) \sim \pi(\theta,y;\tau)}[U]$, where $\pi(\theta,y;\tau) = \pi(\theta) f(y|\theta;\tau)$. Using (2) and (3), we also have $\pi(\theta,y;\tau) = \pi(y;\tau)\, \pi(\theta|y;\tau)$, so the expectation can equivalently be taken over $y \sim \pi(y;\tau)$ and then $\theta \sim \pi(\theta|y;\tau)$.

From the definitions of Section A.1,

$$U_{\text{divergence}} = U_{\text{entropy}} + E_{\theta \sim \pi(\theta|y;\tau)}[S(\pi(\theta), \theta)].$$

Hence

$$E_{(\theta,y) \sim \pi(\theta,y;\tau)}[U_{\text{divergence}}] = E_{(\theta,y) \sim \pi(\theta,y;\tau)}[U_{\text{entropy}}] + H[\pi(\theta)] = E_{(\theta,y) \sim \pi(\theta,y;\tau)}[U_{\text{entropy diff}}],$$

since the expectation of $S(\pi(\theta), \theta)$ over the joint distribution reduces to $E_{\theta \sim \pi(\theta)}[S(\pi(\theta), \theta)] = H[\pi(\theta)]$.

So $U_{\text{entropy diff}}$ and $U_{\text{divergence}}$ produce the same expected utility. Furthermore the expected utility of $U_{\text{entropy}}$ only differs by an additive constant, since $H[\pi(\theta)]$ does not depend on $\tau$. ∎

### A.3 Logarithmic score and Shannon information gain

In the case of the logarithmic score function the equivalent utilities from Theorem 1 are

$$\begin{aligned} U_{\text{entropy}} &= E_{\theta \sim \pi(\theta|y;\tau)}[\log \pi(\theta|y;\tau)] \quad \text{(negative Shannon entropy)}, \\ U_{\text{entropy diff}} &= E_{\theta \sim \pi(\theta|y;\tau)}[\log \pi(\theta|y;\tau)] - E_{\theta \sim \pi(\theta)}[\log \pi(\theta)], \\ U_{\text{divergence}} &= E_{\theta \sim \pi(\theta|y;\tau)}[\log \pi(\theta|y;\tau) - \log \pi(\theta)]. \end{aligned}$$

Another equivalent utility is the quantity within the expectation in $U_{\text{divergence}}$,

$$U_{\text{SIG}} = \log \pi(\theta|y;\tau) - \log \pi(\theta),$$

which is the Shannon information gain (6). Hence Theorem 1 provides decision theoretic support for this utility. It does so by essentially following the argument of Bernardo (1979).

### A.4 Hyvärinen score and Fisher information gain

Under the Hyvärinen score function the equivalent utilities from Theorem 1 are

$$\begin{aligned} U_{\text{entropy}} &= E_{\theta \sim \pi(\theta|y;\tau)}[\|\nabla \log \pi(\theta|y;\tau)\|^2], \\ U_{\text{entropy diff}} &= E_{\theta \sim \pi(\theta|y;\tau)}[\|\nabla \log \pi(\theta|y;\tau)\|^2] - E_{\theta \sim \pi(\theta)}[\|\nabla \log \pi(\theta)\|^2], \\ U_{\text{divergence}} &= E_{\theta \sim \pi(\theta|y;\tau)}[\|\nabla \log \pi(\theta|y;\tau) - \nabla \log \pi(\theta)\|^2]. \end{aligned}$$

Two further equivalent utilities are the quantity within the expectation in $U_{\text{divergence}}$,

$$U_{\text{score diff}} = \|\nabla \log \pi(\theta|y;\tau) - \nabla \log \pi(\theta)\|^2 = \|\nabla \log f(y|\theta;\tau)\|^2,$$

and its expectation

$$U_{\text{FIG}} = E_{y \sim f(y|\theta;\tau)}[\|\nabla \log f(y|\theta;\tau)\|^2],$$

i.e. the trace of the Fisher information. This argument proves Corollary 1 from the main text. We refer to $U_{\text{FIG}}$ as Fisher information gain in rough analogy to Shannon information gain. While a closer analogy would be to refer to $U_{\text{score diff}}$ as Fisher information gain, we prefer to reserve the term for the more computationally convenient form $U_{\text{FIG}}$.

Walker (2016) proved directly that these utilities give the same expected utility. We have shown that the same result arises naturally from a decision theoretic characterisation.
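The identity $\nabla \log \pi(\theta|y;\tau) - \nabla \log \pi(\theta) = \nabla \log f(y|\theta;\tau)$ underlying $U_{\text{score diff}}$ can be verified numerically in a conjugate normal toy model; the model and all names below are our own illustrative choices:

```python
def scores(theta, y, tau, prior_var):
    """Gradients (in theta) of log posterior, log prior and log likelihood
    for the model theta ~ N(0, prior_var), y | theta ~ N(theta * tau, 1)."""
    post_var = 1.0 / (1.0 / prior_var + tau ** 2)  # conjugate posterior variance
    post_mean = post_var * tau * y
    grad_post = -(theta - post_mean) / post_var    # d/dtheta log pi(theta | y)
    grad_prior = -theta / prior_var                # d/dtheta log pi(theta)
    grad_lik = tau * (y - theta * tau)             # d/dtheta log f(y | theta)
    return grad_post, grad_prior, grad_lik

gp, g0, gl = scores(theta=0.7, y=1.3, tau=2.0, prior_var=4.0)
```

Here $(g_{\text{post}} - g_{\text{prior}})^2$ coincides with $\|\nabla \log f(y|\theta;\tau)\|^2$, the quantity appearing inside $U_{\text{FIG}}$.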

## Appendix B Adaptive weights algorithm

Section 4.4 introduced a weighted FIG utility (14), which is equivalent to scaling each parameter $\theta_i$ by a weight $w_i$. For notational convenience below, we introduce $\tilde{w}_i = w_i^2$ and let $\tilde{w}$ be the corresponding vector of squared weights. The weighted FIG can then be written

$$U_{\text{FIG}}(\tau, \theta, y; \tilde{w}) = \operatorname{diag}(I(\theta;\tau))^T \tilde{w}^{-1} = \sum_{i=1}^{p} E_{y \sim f(y|\theta;\tau)}[u_i(y,\theta;\tau)^2 / \tilde{w}_i],$$

where $\operatorname{diag}(\cdot)$ maps a matrix to its leading diagonal and $\tilde{w}^{-1}$ is the vector of $1/\tilde{w}_i$ values. The resulting expected utility is $J_{\text{FIG}}(\tau; \tilde{w}) = E_{\theta \sim \pi(\theta)}[U_{\text{FIG}}(\tau, \theta, y; \tilde{w})]$.

This appendix considers the problem of selecting an optimal design under the weighted FIG utility, while also selecting optimal weights. More precisely, the aim is to find $(\tau^*, \tilde{w}^*)$ such that:

1. $\tau^*$ maximises $J_{\text{FIG}}(\tau; \tilde{w}^*)$,

2. $\tilde{w}^* = g(\tau^*)$, where $g(\tau)$ is the vector with entries $g_i(\tau) = E_{\theta \sim \pi(\theta),\, y \sim f(y|\theta;\tau)}[u_i(y,\theta;\tau)^2]$.
Equivalently, $\tilde{w}^*$ maximises $K(\tilde{w}; \tau^*)$, where

$$K(\tilde{w}; \tau) = -\tfrac{1}{2}\|\tilde{w} - g(\tau)\|^2 = -\tfrac{1}{2} \sum_{i=1}^{p} \left\{ \tilde{w}_i - E_{\theta \sim \pi(\theta),\, y \sim f(y|\theta;\tau)}[u_i(y,\theta;\tau)^2] \right\}^2.$$

As discussed in Section 4.4, the second condition aims to make each parameter give an equal contribution to the expected utility.

Algorithm 2 is an algorithm for optimal design with adaptive weights. It operates by applying stochastic gradient optimisation updates to $\tau$ and $\tilde{w}$. An unbiased Monte Carlo gradient estimate of $\nabla_\tau J_{\text{FIG}}(\tau; \tilde{w})$ is

$$\hat{\nabla} J_{\text{FIG}}(\tau; \tilde{w}) = \frac{1}{K} \sum_{k=1}^{K} \nabla \left[ \operatorname{diag}(I(\theta^{(k)}; \tau))^T \tilde{w}^{-1} \right], \qquad (15)$$

where $\theta^{(1)}, \ldots, \theta^{(K)}$ are independent draws from the prior. This corresponds to (12) for the unweighted case.

An unbiased Monte Carlo gradient estimate of $\nabla_{\tilde{w}} K(\tilde{w}; \tau)$ is

$$\hat{\nabla} K(\tilde{w}; \tau) = \frac{1}{K} \sum_{k=1}^{K} \operatorname{diag}(I(\theta^{(k)}; \tau)) - \tilde{w}. \qquad (16)$$

The same draws used for (15) can be reused in (16).

It is hard to guarantee that Algorithm 2 converges, as the gradients involved do not correspond to the gradient of a single overall loss function. Furthermore it is difficult to guarantee the existence of a solution to the adaptive optimisation problem, as discussed in Section B.1. Nonetheless we find reasonable performance of the algorithm in a simulation study (see Table 1). Also, as mentioned in the main text, the algorithm can be used simply to select reasonable weights, i.e. weights which increase the value of $K$ compared to using unit weights. Then Algorithm 1 can be run on parameters rescaled by these weights.
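The weight update can be sketched in a toy model where $\operatorname{diag} I(\theta;\tau)$ is available in closed form and does not depend on $\theta$, so the stochastic estimate (16) becomes exact; all names and values here are illustrative, not from Algorithm 2 itself:

```python
import numpy as np

def fisher_diag(tau, sigma=1.0):
    """Diagonal of the Fisher information for y_j ~ N(th1*t_j + th2*t_j^2, sigma^2):
    I_ii = sum_j (dx_j / dtheta_i)^2 / sigma^2, independent of theta here."""
    J = np.stack([tau, tau ** 2], axis=1)  # dx_j / dtheta_i
    return (J ** 2).sum(axis=0) / sigma ** 2

def update_weights(w_tilde, tau, step=0.5):
    """One gradient ascent step on K(w; tau) using the estimate (16);
    since diag(I) is deterministic here, the 'estimate' is exact."""
    grad = fisher_diag(tau) - w_tilde
    return w_tilde + step * grad

tau = np.linspace(0.1, 1.0, 5)
w = np.ones(2)
for _ in range(50):
    w = update_weights(w, tau)
```

As expected from the form of $K$, the iteration drives $\tilde{w}$ towards $g(\tau) = \operatorname{diag} I(\tau)$.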

### B.1 Existence of a solution

Theorem 2 shows that a solution to the adaptive weights optimisation problem described above can be guaranteed under a few conditions. However one of these, item 4, is hard to verify: the maximising design, which is the solution to a maximisation problem defined by $J_{\text{FIG}}$, must be shown to be a continuous function of the weights. Therefore it is difficult to guarantee the existence of a solution in practice.

###### Theorem 2.

There exists $(\tau^*, \tilde{w}^*)$ with the properties described above given the following conditions:

1. The set of possible designs is compact and convex.

2. $g$ is continuous.

3. $g$ is bounded above.

4. There exists a continuous function $h(\tilde{w})$ outputting a value of $\tau$ maximising $J_{\text{FIG}}(\tau; \tilde{w})$.

###### Proof.

Let $W = [0, b]^p$, where $b$ is an upper bound of the components of $g$. Define $F: W \to W$ by $F(\tilde{w}) = g(h(\tilde{w}))$, where $h$ is the continuous maximiser from condition 4. It follows from the assumptions that this is a continuous function from a compact convex set to itself. Hence by Brouwer's fixed point theorem, there exists some $\tilde{w}^*$ which is a fixed point of $F$. Therefore $\tau^* = h(\tilde{w}^*)$ maximises $J_{\text{FIG}}(\tau; \tilde{w}^*)$ and $\tilde{w}^* = g(\tau^*)$ as required. ∎

## Appendix C Fisher information gain for the pharmacokinetic example

Recall from Section 3.2 that for a model $y \sim N(x(\theta, \tau), \Sigma(\tau))$,

$$\operatorname{tr} I(\theta; \tau) = \sum_{i=1}^{p} v_i(\theta, \tau)^T \Sigma(\tau)^{-1} v_i(\theta, \tau)$$

where $v_i(\theta, \tau)$ is the vector of elementwise derivatives of $x(\theta, \tau)$ with respect to $\theta_i$. For the pharmacokinetic model $\Sigma(\tau) = \sigma^2 I$, so that

$$\operatorname{tr} I(\theta; \tau) = \sigma^{-2} \sum_{i=1}^{3} \sum_{j=1}^{n} v_{ij}(\theta, \tau)^2,$$

where $v_{ij}(\theta, \tau)$ denotes the $j$th element of $v_i(\theta, \tau)$. It remains to compute these terms, which are

$$\begin{aligned} v_{1j} &= \frac{1}{\theta_2 - \theta_1} x(\theta, \tau_j) - \frac{D \theta_2}{\theta_3 (\theta_2 - \theta_1)} \tau_j e^{-\theta_1 \tau_j}, \\ v_{2j} &= \frac{\theta_1}{\theta_2 (\theta_1 - \theta_2)} x(\theta, \tau_j) + \frac{D \theta_2}{\theta_3 (\theta_2 - \theta_1)} \tau_j e^{-\theta_2 \tau_j}, \\ v_{3j} &= -\frac{1}{\theta_3} x(\theta, \tau_j). \end{aligned}$$
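These expressions can be checked numerically against finite differences. The sketch below assumes the one-compartment mean $x(\theta, t) = D\theta_2 [\theta_3(\theta_2 - \theta_1)]^{-1}(e^{-\theta_1 t} - e^{-\theta_2 t})$, which is consistent with the derivatives above; the parameter values, dose $D$ and noise scale $\sigma$ are illustrative assumptions:

```python
import math

def x_mean(theta, t, D=400.0):
    """Assumed one-compartment pharmacokinetic mean response at time t."""
    th1, th2, th3 = theta
    return D * th2 / (th3 * (th2 - th1)) * (math.exp(-th1 * t) - math.exp(-th2 * t))

def v_terms(theta, t, D=400.0):
    """Analytic derivatives v_ij = dx/dtheta_i at time t (formulas above)."""
    th1, th2, th3 = theta
    x = x_mean(theta, t, D)
    c = D * th2 / (th3 * (th2 - th1))
    v1 = x / (th2 - th1) - c * t * math.exp(-th1 * t)
    v2 = th1 / (th2 * (th1 - th2)) * x + c * t * math.exp(-th2 * t)
    v3 = -x / th3
    return v1, v2, v3

def tr_fisher(theta, times, sigma=0.1, D=400.0):
    """tr I(theta; tau) = sigma^{-2} sum_i sum_j v_ij^2."""
    return sum(v ** 2 for t in times for v in v_terms(theta, t, D)) / sigma ** 2
```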

## Appendix D Regularity conditions

Under weak regularity conditions on $g(\theta, \tau)$ it is possible to interchange differentiation and expectation, so that

$$\nabla_\tau E_{\theta \sim \pi(\theta)}[g(\theta, \tau)] = E_{\theta \sim \pi(\theta)}[\nabla_\tau g(\theta, \tau)].$$

(This result was required in Section 4.2.) A sufficient regularity condition is that $E_{\theta \sim \pi(\theta)}[|\nabla_\tau g(\theta, \tau)|]$ is finite, where the absolute value acts elementwise. It then follows by Fubini's theorem that the interchange is possible.

## Appendix E More on estimation of FIG and its gradient

This appendix considers estimating the expected FIG and its gradient when the Fisher information is not available in closed form. The expected FIG can be expressed as

$$J_{\text{FIG}}(\tau) = E_{\theta \sim \pi(\theta),\, y \sim f(y|\theta;\tau)}[\|u(y, \theta; \tau)\|^2], \qquad (17)$$

where $u(y, \theta; \tau)$ is the score function, (5). Hence a simple unbiased Monte Carlo estimate is

$$\frac{1}{K} \sum_{k=1}^{K} \|u(y^{(k)}, \theta^{(k)}; \tau)\|^2,$$

where $(\theta^{(k)}, y^{(k)})$ for $k = 1, \ldots, K$ are independent draws from the prior and model.
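As a sanity check of this estimator, consider the toy model $\theta \sim N(0, 1)$, $y \sim N(\theta\tau, 1)$ with scalar design $\tau$, where $u(y, \theta; \tau) = \tau(y - \theta\tau)$ and the expected FIG is exactly $\tau^2$; the model and sample size are our own choices:

```python
import random

def score(y, theta, tau):
    # u(y, theta; tau) = d/dtheta log f(y | theta; tau) for y ~ N(theta*tau, 1)
    return tau * (y - theta * tau)

def mc_fig(tau, K=200_000, seed=1):
    """Simple Monte Carlo estimate (1/K) sum_k ||u(y_k, theta_k; tau)||^2."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(K):
        theta = rng.gauss(0.0, 1.0)       # prior draw
        y = rng.gauss(theta * tau, 1.0)   # model draw
        total += score(y, theta, tau) ** 2
    return total / K

estimate = mc_fig(2.0)  # J_FIG(tau) = tau^2 here, so this should be near 4.0
```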

However, on taking the gradient of (17) with respect to $\tau$, it is generally not possible to exchange the order of integration and differentiation, because the distribution of $y$ depends on $\tau$. This complicates obtaining a Monte Carlo estimate.

A standard approach is to use the “reparameterisation trick” (Rezende et al., 2014). The idea is to express $y$ as a transformation $y(\epsilon, \theta, \tau)$ of a random variable $\epsilon$ with fixed density $p(\epsilon)$. Then

$$J_{\text{FIG}}(\tau) = E_{\theta \sim \pi(\theta),\, \epsilon \sim p(\epsilon)}[\|u(y(\epsilon, \theta, \tau), \theta; \tau)\|^2].$$

Now it is possible to exchange integration and differentiation under mild regularity conditions (see Appendix D) to get

$$\nabla J_{\text{FIG}}(\tau) = E_{\theta \sim \pi(\theta),\, \epsilon \sim p(\epsilon)}[\nabla \|u(y(\epsilon, \theta, \tau), \theta; \tau)\|^2].$$

Hence an unbiased Monte Carlo estimate is

$$\frac{1}{K} \sum_{k=1}^{K} \nabla \|u(y(\epsilon^{(k)}, \theta^{(k)}, \tau), \theta^{(k)}; \tau)\|^2,$$

where $(\theta^{(k)}, \epsilon^{(k)})$ for $k = 1, \ldots, K$ are independent draws from $\pi(\theta)$ and $p(\epsilon)$. Unbiased estimates of the quantities required for Algorithm 2 can be derived similarly.
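For a toy model the pathwise estimator can be written out by hand. With $\theta \sim N(0,1)$ and $y = \theta\tau + \epsilon$, $\epsilon \sim N(0,1)$, the score is $u = \tau\epsilon$, so $\nabla_\tau \|u\|^2 = 2\tau\epsilon^2$ and the estimator should concentrate around $\nabla_\tau J_{\text{FIG}}(\tau) = 2\tau$; this is an illustrative sketch, not the paper's TensorFlow implementation:

```python
import random

def pathwise_grad_fig(tau, K=100_000, seed=2):
    """Reparameterisation-trick estimate of the gradient of J_FIG for the
    toy model y = theta * tau + eps. Here u(y, theta; tau) = tau * eps, so
    grad_tau ||u||^2 = 2 * tau * eps^2, differentiated by hand for clarity."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(K):
        eps = rng.gauss(0.0, 1.0)  # theta drops out of the score in this model
        total += 2.0 * tau * eps ** 2
    return total / K

grad_est = pathwise_grad_fig(1.5)  # true gradient is 2 * tau = 3.0
```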

In some cases $y$ cannot be represented as a suitable transformation of $\epsilon$. This includes the case of discrete $y$, and other common distributions such as the Beta and Gamma. Various techniques are available to deal with these cases, such as taking a continuous approximation to discrete variables (Maddison et al., 2017) or using implicit differentiation (Figurnov et al., 2018; Jankowiak and Obermeyer, 2018).