Deep Adaptive Design: Amortizing Sequential Bayesian Experimental Design

03/03/2021 · Adam Foster et al.

We introduce Deep Adaptive Design (DAD), a method for amortizing the cost of adaptive Bayesian experimental design that allows experiments to be run in real time. Traditional sequential Bayesian optimal experimental design approaches require substantial computation at each stage of the experiment. This makes them unsuitable for most real-world applications, where decisions must typically be made quickly. DAD addresses this restriction by learning an amortized design network upfront and then using this to rapidly run (multiple) adaptive experiments at deployment time. This network represents a design policy which takes as input the data from previous steps, and outputs the next design using a single forward pass; these design decisions can be made in milliseconds during the live experiment. To train the network, we introduce contrastive information bounds that are suitable objectives for the sequential setting, and propose a customized network architecture that exploits key symmetries. We demonstrate that DAD successfully amortizes the process of experimental design, outperforming alternative strategies on a number of problems.


1 Introduction

A key challenge across disciplines as diverse as psychology (Myung et al., 2013), bioinformatics (Vanlier et al., 2012), pharmacology (Lyu et al., 2019) and physics (Dushenko et al., 2020) is to design experiments so that the outcomes will be as informative as possible about the underlying process. Bayesian optimal experimental design (BOED) is a powerful mathematical framework for tackling this problem (Lindley, 1956; Chaloner and Verdinelli, 1995).

In the BOED framework, outcomes are modeled in a Bayesian manner (Gelman et al., 2013; Kruschke, 2014) using a likelihood $p(y|\theta,\xi)$ and a prior $p(\theta)$, where $\xi$ is our controllable design and $\theta$ is the set of parameters we wish to learn about. We then optimize $\xi$ to maximize the expected information gained about $\theta$ (equivalently, the mutual information between $\theta$ and $y$):

$$I_{\xi}(\theta) = \mathbb{E}_{p(\theta)p(y|\theta,\xi)}\left[\log \frac{p(y|\theta,\xi)}{p(y|\xi)}\right]. \quad (1)$$

The true power of BOED is realized when it is used to design a sequence of experiments $\xi_1, \xi_2, \ldots, \xi_T$, where it allows us to construct adaptive strategies which utilize information gathered from past data to tailor each successive design during the progress of the experiment. The conventional approach for selecting each $\xi_t$ is to fit the posterior $p(\theta|\xi_{1:t-1}, y_{1:t-1})$ representing the updated beliefs about $\theta$ after the first $t-1$ iterations have been conducted, and then substitute this for the prior in (1) (Ryan et al., 2016; Rainforth, 2017; Kleinegesse et al., 2020). The design $\xi_t$ is then chosen as the one which maximizes the resulting objective.

Unfortunately, this approach requires significant computation between each step of the experiment in order to update the posterior and compute the next optimal design. In particular, the mutual information objective is doubly intractable (Rainforth et al., 2018; Zheng et al., 2018) and its optimization constitutes a significant computational bottleneck. This can be prohibitive to the practical application of sequential BOED: design decisions, more often than not, need to be made quickly for the approach to be useful (Evans and Mathur, 2005).

To give a concrete example, consider running an adaptive survey to understand political opinions (Pasek and Krosnick, 2010). A question is put to a participant who gives their answer, and this data is used to update an underlying model with latent variables $\theta$. Here sequential BOED is of immense value because previous answers can be used to guide future questions, ensuring that they are pertinent to the particular participant. However, it is not acceptable to have lengthy delays between questions to compute the next design, precluding existing approaches from being used.

To alleviate this problem we propose amortizing the cost of sequential experimental design, performing upfront training before the start of the experiment to allow very fast design decisions at deployment, when time is at a premium. This amortization is particularly useful in the typical scenario where the same adaptive experimental framework will be deployed numerous times (e.g. having multiple participants in a survey). Adaptive experiments that are only run once are rare because they can often be dealt with manually by a human experimenter. Here amortization not only removes the computational burden from the live experiment, it also allows for sharing computation across multiple experiments, analogous to inference amortization that allows one to deal with multiple datasets (Stuhlmüller et al., 2013).

Our approach, called Deep Adaptive Design (DAD), constructs a single design network which takes as input the designs and observations from previous stages, and outputs the design to use for the next experiment. The network is learned by simulating hypothetical experimental trajectories and then using these to train the network to make near-optimal design decisions automatically. That is, it learns to make design decisions as a function of the past data, and we optimize the parameters of this function rather than an individual design. Once learned, the network can be used repeatedly for different instantiations of the experiment (e.g. different human participants), eliminating the computational bottleneck at each iteration and enabling experiments to be run both adaptively and quickly.

To allow for efficient, effective, and simple training, we show how DAD networks can be learned without any direct posterior or marginal likelihood estimation. This is achieved by deriving contrastive bounds that allow for end-to-end training with stochastic gradient ascent, thereby sidestepping both the need for inference and the double intractability of the mutual information objective.

We also derive a key permutation symmetry property of the optimal design function, and use this to propose a customized architecture for the experimental design network. This is critical to allowing effective amortization across time steps. The overall result of the theoretical formulation, novel contrastive bounds, and neural architecture is a training regime which enables us to bring the power of deep learning to bear on sequential experimental design.

We apply DAD to a range of problems relevant to applications such as epidemiology, physics and psychology. We find that DAD is able to accurately amortize experiments, opening the door to running adaptive BOED in real time.

2 Background

Because experimentation is a potentially costly endeavour, it is essential to design experiments so that we are likely to learn a lot from them. We adopt the BOED framework pioneered by Lindley (1956), in which the central measure of the quality of an experimental design is the expected amount of information gained about the model latent variable $\theta$ from the observation $y$.

We begin with the standard Bayesian modelling set-up consisting of an explicit likelihood model $p(y|\theta,\xi)$ for the experiment, and a prior $p(\theta)$ representing our initial beliefs about the unknown latent. After running an experiment with design $\xi$ and observing $y$, our updated beliefs are the posterior $p(\theta|\xi,y)$. The amount of information that has been gained about $\theta$ can be mathematically described by the reduction in entropy from the prior to the posterior

$$\text{IG}(\xi, y) = H[p(\theta)] - H[p(\theta|\xi,y)]. \quad (2)$$

The expected information gain (EIG) is formed by taking the expectation over the Bayesian marginal distribution $p(y|\xi) = \mathbb{E}_{p(\theta)}[p(y|\theta,\xi)]$ for the outcome $y$, yielding

$$I_{\xi}(\theta) = \mathbb{E}_{p(y|\xi)}\left[\text{IG}(\xi, y)\right] = \mathbb{E}_{p(\theta)p(y|\theta,\xi)}\left[\log \frac{p(\theta|\xi,y)}{p(\theta)}\right],$$

which is the mutual information between $\theta$ and $y$ under design $\xi$. The optimal design is defined as $\xi^* = \arg\max_{\xi \in \Xi} I_{\xi}(\theta)$, where $\Xi$ is the space of feasible designs.
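As a concrete illustration, the following sketch estimates $I_{\xi}(\theta)$ by simple nested Monte Carlo for a hypothetical toy model $\theta \sim N(0,1)$, $y|\theta,\xi \sim N(\theta\xi, 1)$. The model, sample sizes, and function names are illustrative assumptions made here for exposition, not anything prescribed by the paper, which addresses the general case.

import math
import torch

# Nested Monte Carlo estimate of the EIG I_xi(theta) for an assumed toy
# model: theta ~ N(0, 1) and y | theta, xi ~ N(theta * xi, 1). The framework
# itself applies to any explicit likelihood p(y | theta, xi).

def log_lik(y, theta, xi):
    # log N(y; theta * xi, 1)
    return -0.5 * (y - theta * xi) ** 2 - 0.5 * math.log(2 * math.pi)

def eig_nmc(xi, n_outer=2000, n_inner=2000):
    theta0 = torch.randn(n_outer)              # outer prior samples generate outcomes
    y = theta0 * xi + torch.randn(n_outer)
    theta_in = torch.randn(n_inner)            # inner samples estimate p(y | xi)
    log_marg = torch.logsumexp(
        log_lik(y.unsqueeze(1), theta_in.unsqueeze(0), xi), dim=1
    ) - math.log(n_inner)
    return (log_lik(y, theta0, xi) - log_marg).mean()

# For this toy model the EIG grows with |xi|: compare two candidate designs.
print(eig_nmc(0.5), eig_nmc(2.0))

Note the nested structure: each evaluation of the integrand requires its own inner Monte Carlo estimate of the marginal, which is exactly the double intractability discussed below.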

The true power of BOED lies in its application to sequential experimentation. In this setting, we run $T$ experiments with designs $\xi_1, \ldots, \xi_T$, observing outcomes $y_1, \ldots, y_T$. Importantly, each $\xi_t$ can be chosen dependent upon $\xi_{1:t-1}$ and $y_{1:t-1}$, enabling us to use what has already been learned in previous experiments to design the next one optimally, resulting in a virtuous cycle of improving beliefs and using updated beliefs to design good experiments for future iterations.

The conventional approach to computing designs adaptively is to fit the posterior distribution $p(\theta|\xi_{1:t-1}, y_{1:t-1})$ at each step, and then optimize the mutual information objective that uses this posterior in place of the prior (Ryan et al., 2016; Rainforth, 2017; Kleinegesse et al., 2020)

$$I_{\xi_t}(\theta) = \mathbb{E}_{p(\theta|\xi_{1:t-1},y_{1:t-1})\,p(y_t|\theta,\xi_t)}\left[\log \frac{p(y_t|\theta,\xi_t)}{p(y_t|\xi_{1:t-1},y_{1:t-1},\xi_t)}\right] \quad (3)$$

where $p(y_t|\xi_{1:t-1},y_{1:t-1},\xi_t) = \mathbb{E}_{p(\theta|\xi_{1:t-1},y_{1:t-1})}\left[p(y_t|\theta,\xi_t)\right]$.

Despite the great potential of the sequential BOED framework, this conventional approach is very computationally expensive. At each stage $t$ of the experiment we must compute the posterior $p(\theta|\xi_{1:t}, y_{1:t})$, which is costly and cannot be done in advance as it depends on the live data $y_{1:t}$. Furthermore, the posterior is then used to obtain $\xi_{t+1}$ by maximizing the objective in (3), which is computationally even more demanding as it involves the optimization of a doubly intractable quantity (Rainforth et al., 2018; Foster et al., 2019). Both of these steps must be done during the experiment, implying that it is infeasible to run adaptive BOED in real-time settings unless the model is unusually simple.

2.1 Contrastive information bounds

In Foster et al. (2020), the authors noted that if $\xi$ is continuous, approximate optimization of the mutual information at each stage of the experiment can be achieved in a single unified stochastic gradient procedure that both estimates and optimizes the mutual information simultaneously. A key component of this approach is the derivation of several contrastive lower bounds on mutual information, inspired by work in representation learning (van den Oord et al., 2018; Poole et al., 2019). One such bound is the Prior Contrastive Estimation (PCE) bound, given by

$$\mathcal{L}_{\text{PCE}}(\xi; L) = \mathbb{E}\left[\log \frac{p(y|\theta_0,\xi)}{\frac{1}{L+1}\sum_{\ell=0}^{L} p(y|\theta_\ell,\xi)}\right] \quad (4)$$

where $\theta_0 \sim p(\theta)$ is the sample used to generate $y \sim p(y|\theta_0,\xi)$ and $\theta_1, \ldots, \theta_L$ are contrastive samples drawn independently from $p(\theta)$; as $L \to \infty$ the bound becomes tight. The PCE bound can be maximized by stochastic gradient ascent (SGA) (Robbins and Monro, 1951) to approximate the optimal design $\xi^*$. As discussed previously, in a sequential setting this stochastic gradient optimization is repeated $T$ times, with $p(\theta)$ replaced by $p(\theta|\xi_{1:t-1}, y_{1:t-1})$ at step $t$.
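The sketch below shows how such an SGA procedure might look for a single design under the same hypothetical toy model as before. The tanh constraint to $[-1, 1]$ and all hyperparameters are illustrative assumptions, chosen only so the toy optimum is not at infinity.

import math
import torch

# Maximizing the PCE bound (4) over a single design by SGA, for the assumed
# toy model theta ~ N(0, 1), y | theta, xi ~ N(theta * xi, 1).

def log_lik(y, theta, xi):
    return -0.5 * (y - theta * xi) ** 2 - 0.5 * math.log(2 * math.pi)

lam = torch.zeros(1, requires_grad=True)       # unconstrained design parameter
opt = torch.optim.Adam([lam], lr=0.05)
L, batch = 127, 512                            # contrastive samples and batch size

for step in range(500):
    xi = torch.tanh(lam)                       # design constrained to [-1, 1]
    theta = torch.randn(batch, L + 1)          # theta_0 generates y; theta_{1:L} contrast
    y = theta[:, :1] * xi + torch.randn(batch, 1)      # reparametrized sample of y
    logp = log_lik(y, theta, xi)               # likelihoods under all theta_l
    denom = torch.logsumexp(logp, dim=1) - math.log(L + 1)
    pce = (logp[:, 0] - denom).mean()          # Monte Carlo estimate of (4)
    opt.zero_grad(); (-pce).backward(); opt.step()

print(torch.tanh(lam))                          # converges towards a boundary design

The single objective both estimates the mutual information (via the contrastive ratio) and provides gradients for the design, which is the key property DAD extends to the sequential setting.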

3 Rethinking Sequential BOED

To enable adaptive BOED to be deployed in settings where design decisions must be taken quickly, we first need to place the standard sequential BOED approach within a formal framework for adaptive sequential design. To this end, we introduce the concept of a design function, or policy, $\pi$, that maps from the set of all previous design–observation pairs to the next chosen design.

Let $h_t$ denote the experimental history $((\xi_1, y_1), \ldots, (\xi_t, y_t))$. We can simulate histories for a given policy $\pi$ by sampling $\theta \sim p(\theta)$ and then, for each $t = 1, \ldots, T$, fixing $\xi_t = \pi(h_{t-1})$ (where $h_0 = \emptyset$) and sampling $y_t \sim p(y_t|\theta,\xi_t)$. The density of this generative process is given by

$$p(h_T|\theta,\pi) = \prod_{t=1}^{T} p(y_t|\theta,\xi_t), \quad \text{where } \xi_t = \pi(h_{t-1}). \quad (5)$$

The standard sequential BOED approach described in § 2 now corresponds to a costly implicit policy that performs posterior estimation followed by mutual information optimization to choose each design. By contrast, in DAD, we will learn a deterministic policy $\pi$ that chooses designs directly.
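A minimal sketch of this generative process, again using the illustrative toy likelihood and a stand-in policy (neither is prescribed by the paper), might look as follows:

import torch

# Simulating a history h_T under a design policy, following (5): sample
# theta ~ p(theta), then repeatedly set xi_t = pi(h_{t-1}) and sample
# y_t ~ p(y_t | theta, xi_t).

def simulate_history(policy, T):
    theta = torch.randn(())                    # theta ~ p(theta) = N(0, 1)
    history = []                               # h_0 is the empty history
    for t in range(T):
        xi = policy(history)                   # xi_t = pi(h_{t-1}), deterministic
        y = theta * xi + torch.randn(())       # toy likelihood: y_t ~ N(theta * xi_t, 1)
        history.append((xi, y))
    return theta, history

# A non-adaptive example policy; DAD replaces this with a trained neural
# network mapping h_{t-1} to xi_t.
theta, h_T = simulate_history(lambda hist: torch.tensor(1.0), T=5)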

Another way to think about this implicit policy is that it piecewise optimizes the following objective for $t = 1, \ldots, T$

$$I_{h_{t-1}}(\xi_t) = \mathbb{E}_{p(\theta|h_{t-1})\,p(y_t|\theta,\xi_t)}\left[\log \frac{p(y_t|\theta,\xi_t)}{p(y_t|h_{t-1},\xi_t)}\right] \quad (6)$$

where $p(y_t|h_{t-1},\xi_t) = \mathbb{E}_{p(\theta|h_{t-1})}\left[p(y_t|\theta,\xi_t)\right]$. It is thus the optimal myopic policy (that is, a policy which fails to reason about its own future actions) for an objective given by the sum of EIGs from each experiment iteration. Note that this is not the optimal overall policy, as it fails to account for future decision making (González et al., 2016; Jiang et al., 2020).

Trying to learn an efficient policy that directly mimics this implicit policy would be extremely computationally challenging because of the difficulties of dealing with both inference and mutual information estimation at each iteration of the training. Indeed, the natural way to do this would involve running a full, very expensive, simulated sequential BOED process to generate each training example.

We address this problem by reformulating the sequential decision problem in a way that completely eliminates the need for calculating either posterior distributions or intermediate objectives. This is done by exploiting an important property of the EIG: the total EIG of a sequential experiment is the sum of the (conditional) EIGs for each experiment iteration. This is formalized in the following result, which allows us to write down a single expression for the expected information gained from the entire sequence of experiments.

Theorem 1.

The total expected information gain for policy $\pi$ over a sequence of $T$ experiments is

$$\mathcal{I}_T(\pi) = \mathbb{E}_{p(\theta)p(h_T|\theta,\pi)}\left[\log \frac{p(h_T|\theta,\pi)}{p(h_T|\pi)}\right] \quad (7)$$
$$\phantom{\mathcal{I}_T(\pi)} = \mathbb{E}_{p(\theta)p(h_T|\theta,\pi)}\left[\log \frac{p(\theta|h_T)}{p(\theta)}\right] \quad (8)$$

where $p(h_T|\pi) = \mathbb{E}_{p(\theta)}\left[p(h_T|\theta,\pi)\right]$.

The proof is given in Appendix A. Intuitively, the total EIG of (8) is the expected reduction in entropy from the prior $p(\theta)$ to the final posterior $p(\theta|h_T)$, without considering the intermediate posteriors at all. This formulation is thus key to being able to efficiently and effectively learn a policy that can, in turn, be used quickly at deployment time; our approach is to focus directly on estimating and optimizing the unified objective function $\mathcal{I}_T(\pi)$.

4 Deep Adaptive Design

Theorem 1 showed that the optimal design function is the one which maximizes the mutual information between the unknown latent $\theta$ and the full rollout of histories produced using that policy, $\mathcal{I}_T(\pi)$. DAD looks to approximate the optimal policy explicitly using a neural network, which we now refer to as the design network $\pi_\phi$, with trainable parameters $\phi$. This direct function approximation approach marks a major break from existing methods, which do not represent a policy explicitly as a function, but compute designs on the fly during the experiment.

DAD amortizes the cost of experimental design: by training the network parameters $\phi$, the design network is taught to make correct design decisions across a wide range of possible experimental outcomes. This removes the cost of adaptation from the live experiment itself: during deployment the design network selects the next design nearly instantaneously with a single forward pass of the network. Further, it offers a simplification and streamlining of the sequential BOED process: it only requires the upfront end-to-end training of a single neural network, and thus negates the need to set up complex automated inference and optimization schemes that would otherwise have to run in the background during a live experiment. A high-level summary of the DAD approach is given in Algorithm 1.

Two key technical challenges still stand in the way of realizing the potential of adaptive BOED in real time. First, whilst the unified objective does not require the computation of intermediate posterior distributions, it remains intractable due to the presence of the marginal $p(h_T|\pi)$. To deal with this, we derive a family of lower bounds that are appropriate for the sequential experiment case and use them to construct stochastic gradient training schemes for $\pi_\phi$. Second, to ensure that this network can efficiently learn a mapping from histories to designs, we require an effective architecture. As we show later, the optimal design function is invariant to the order of the history, and we use this key symmetry to maximize the effectiveness of our networks.

Input: Prior $p(\theta)$, likelihood $p(y|\theta,\xi)$, number of steps $T$
Output: Design network $\pi_\phi$

while training compute budget not exceeded do
    Sample $\theta_0 \sim p(\theta)$ and set $h_0 = \emptyset$
    for $t = 1, \ldots, T$ do
        Compute $\xi_t = \pi_\phi(h_{t-1})$
        Sample $y_t \sim p(y|\theta_0, \xi_t)$
        Set $h_t = h_{t-1} \cup \{(\xi_t, y_t)\}$
    end for
    Compute an estimate of $d\mathcal{L}_T(\pi_\phi, L)/d\phi$ as per § 4.2
    Update $\phi$ using a stochastic gradient ascent scheme
end while
At deployment time, $\phi$ is fixed and each $y_t$ is obtained in turn by running the experiment with design $\xi_t = \pi_\phi(h_{t-1})$.

Algorithm 1: Deep Adaptive Design (DAD)
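A minimal runnable skeleton of this algorithm for the toy Gaussian model used in the earlier sketches is given below. Here `design_net` is assumed to be any nn.Module mapping a list of (design, outcome) pairs to a batch of next designs; one suitable permutation-invariant architecture is sketched in § 4.3. All names and hyperparameters are illustrative assumptions.

import math
import torch

def log_lik(y, theta, xi):
    return -0.5 * (y - theta * xi) ** 2 - 0.5 * math.log(2 * math.pi)

def negative_spce(design_net, T, L, batch):
    theta = torch.randn(batch, L + 1)          # theta_0 plus L contrastive samples
    logp = torch.zeros(batch, L + 1)           # running log p(h_t | theta_l, pi)
    history = []
    for t in range(T):
        xi = design_net(history).expand(batch, 1)  # xi_t = pi_phi(h_{t-1})
        y = theta[:, :1] * xi + torch.randn(batch, 1)  # y_t generated under theta_0
        logp = logp + log_lik(y, theta, xi)
        history.append((xi, y))
    denom = torch.logsumexp(logp, dim=1) - math.log(L + 1)
    return -(logp[:, 0] - denom).mean()        # negative sPCE bound (11)

def train_dad(design_net, num_steps=5000, T=10, L=127, batch=256):
    opt = torch.optim.Adam(design_net.parameters(), lr=1e-3)
    for step in range(num_steps):              # "while budget not exceeded"
        loss = negative_spce(design_net, T, L, batch)
        opt.zero_grad(); loss.backward(); opt.step()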

4.1 Contrastive bounds for sequential experiments

Our high-level aim is to train $\pi_\phi$ to maximize the mutual information $\mathcal{I}_T(\pi_\phi)$. In contrast to most machine learning tasks, this objective is doubly intractable and cannot be directly evaluated or even estimated with a conventional Monte Carlo estimator, except in very special cases (Rainforth et al., 2018). In fact, it is extremely challenging and costly to derive any unbiased estimate of it or its gradients. To train with stochastic gradient methods, we will therefore introduce and optimize lower bounds on $\mathcal{I}_T(\pi_\phi)$, building on the ideas of § 2.1.

Equation (8) shows that the objective function is the expected logarithm of a ratio of two terms. The first is the likelihood of the history, $p(h_T|\theta,\pi)$, which can be directly evaluated using (5). The second term is the intractable marginal $p(h_T|\pi)$, which is different for each sample of the outer expectation and must thus be estimated separately each time.

Given a sample $\theta_0, h_T \sim p(\theta)p(h_T|\theta,\pi)$, we can perform this estimation by introducing $L$ independent contrastive samples $\theta_{1:L} \sim p(\theta)$. We can then approximate the log-ratio $\log p(h_T|\theta_0,\pi) - \log p(h_T|\pi)$ in two different ways, depending on whether or not we include $\theta_0$ in our estimate for $p(h_T|\pi)$:

$$g_L(\theta_{0:L}, h_T) = \log \frac{p(h_T|\theta_0,\pi)}{\frac{1}{L+1}\sum_{\ell=0}^{L} p(h_T|\theta_\ell,\pi)} \quad (9)$$
$$\hat{g}_L(\theta_{0:L}, h_T) = \log \frac{p(h_T|\theta_0,\pi)}{\frac{1}{L}\sum_{\ell=1}^{L} p(h_T|\theta_\ell,\pi)} \quad (10)$$

These forms can both be evaluated by recomputing the likelihood of the history under each of the contrastive samples $\theta_{1:L}$. We note that (9) cannot exceed $\log(L+1)$, whereas (10) is potentially unbounded.

We now show that using (9) to approximate the integrand leads to a lower bound on the overall objective $\mathcal{I}_T(\pi)$, whilst using (10) leads to an upper bound. During training, we focus on the lower bound, because it does not lead to unbounded ratio estimates and is therefore more numerically stable. We refer to this new lower bound as sequential PCE (sPCE).

Theorem 2 (Sequential PCE).

For a design function $\pi$ and a number of contrastive samples $L \geq 1$, let

$$\mathcal{L}_T(\pi, L) = \mathbb{E}\left[\log \frac{p(h_T|\theta_0,\pi)}{\frac{1}{L+1}\sum_{\ell=0}^{L} p(h_T|\theta_\ell,\pi)}\right] \quad (11)$$

where the expectation is over $\theta_0, h_T \sim p(\theta)p(h_T|\theta_0,\pi)$ and $\theta_{1:L} \sim p(\theta)$ independently. Given minor technical assumptions discussed in the proof, we have¹

$$\mathcal{L}_T(\pi, L) \nearrow \mathcal{I}_T(\pi) \ \text{ as } \ L \to \infty \quad (12)$$

at a rate $\mathcal{O}(L^{-1})$.

¹ Here $\nearrow$ means that $\mathcal{L}_T(\pi, L)$ is a monotonically increasing sequence in $L$ with limit $\mathcal{I}_T(\pi)$.

The proof is presented in Appendix A. For evaluation purposes, it is helpful to pair sPCE with an upper bound, which we obtain by using (10) as our estimate of the integrand:

$$\mathcal{U}_T(\pi, L) = \mathbb{E}\left[\log \frac{p(h_T|\theta_0,\pi)}{\frac{1}{L}\sum_{\ell=1}^{L} p(h_T|\theta_\ell,\pi)}\right]. \quad (13)$$

We refer to this bound as sequential Nested Monte Carlo (sNMC). Theorem 4 in Appendix A shows that $\mathcal{U}_T(\pi, L)$ satisfies complementary properties to $\mathcal{L}_T(\pi, L)$. In particular, $\mathcal{L}_T(\pi, L) \leq \mathcal{I}_T(\pi) \leq \mathcal{U}_T(\pi, L)$, and both bounds become monotonically tighter as $L$ increases, becoming exact as $L \to \infty$ at a rate $\mathcal{O}(L^{-1})$. We can thus directly control the trade-off between bias in our objective and the computational cost of training. Note that increasing $L$ has no impact on the cost at deployment time. Critically, as we will see in our experiments, we tend to only need relatively modest values of $L$ for $\mathcal{L}_T(\pi, L)$ to be an effective objective.
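To make the relationship between the two bounds concrete, the sketch below (again under the illustrative toy-model and policy assumptions of earlier sketches) computes both from the same batch of simulated histories; the only difference is whether $\theta_0$ is included in the denominator average.

import math
import torch

def log_lik(y, theta, xi):
    return -0.5 * (y - theta * xi) ** 2 - 0.5 * math.log(2 * math.pi)

def spce_snmc(policy, T=4, L=1000, batch=2000):
    theta = torch.randn(batch, L + 1)          # theta_0 and L contrastive samples
    logp = torch.zeros(batch, L + 1)           # running log p(h_t | theta_l, pi)
    history = []
    for t in range(T):
        xi = policy(history)
        y = theta[:, :1] * xi + torch.randn(batch, 1)  # y_t generated under theta_0
        logp = logp + log_lik(y, theta, xi)
        history.append((xi, y))
    lower = (logp[:, 0] - torch.logsumexp(logp, dim=1) + math.log(L + 1)).mean()
    upper = (logp[:, 0] - torch.logsumexp(logp[:, 1:], dim=1) + math.log(L)).mean()
    return lower, upper                        # sandwich: L_T <= I_T <= U_T

print(spce_snmc(lambda hist: torch.tensor(1.0)))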

If using a sufficiently large $L$ during training proves problematic (e.g. our available training time is strictly limited), one can further tighten these bounds for a fixed $L$ by introducing an amortized proposal $q$ for the contrastive samples $\theta_{1:L}$, rather than drawing them from the prior. By appropriately adapting (11), this proposal can then be trained simultaneously with the design network itself using a single unified objective, in a manner similar to a variational autoencoder (Kingma and Welling, 2014), allowing the bound itself to get tighter during training. The resulting more general class of bounds is described in detail in Appendix B and may offer further improvements for the DAD approach. We focus on training with sPCE here in the interest of simplicity of both exposition and implementation.

4.2 Gradient estimation

We optimize the design network parameters $\phi$ using a stochastic optimization scheme such as Adam (Kingma and Ba, 2014). For this, we need to take gradients of the sPCE objective (11). Throughout, we assume that the design space is continuous. Initially, we also assume that the observation space is continuous and that $p(y|\theta,\xi)$ is reparametrizable. This means that we can introduce random variables $\epsilon_{1:T}$, which are independent of $\phi$, such that we can write $y_t = f(\epsilon_t, \theta_0, \xi_t)$. As we already have $\xi_t = \pi_\phi(h_{t-1})$, we see that $h_T$ becomes a deterministic function of $\phi$ given $\theta_0$ and $\epsilon_{1:T}$. Under these assumptions, we can thus write²

$$\frac{d\mathcal{L}_T(\pi_\phi, L)}{d\phi} = \mathbb{E}\left[\frac{d g_L(\theta_{0:L}, h_T)}{d\phi}\right] \quad (14)$$

where the expectation is taken over $\theta_{0:L}$ and $\epsilon_{1:T}$.

² We use $\partial u/\partial v$ and $du/dv$ to represent the Jacobian matrices of partial and total derivatives respectively for vectors $u$ and $v$.

We can now straightforwardly construct SGA updates by sampling $\theta_{0:L}$ and $\epsilon_{1:T}$ and evaluating $dg_L/d\phi$. This can be computed via an automatic differentiation framework (Baydin et al., 2018; Paszke et al., 2019) and, using the chain rule for total derivatives, is given by

$$\frac{dg_L}{d\phi} = \sum_{t=1}^{T} \frac{\partial g_L}{\partial \xi_t} \frac{d\xi_t}{d\phi} \quad (15)$$
$$\phantom{\frac{dg_L}{d\phi}} = \sum \frac{\partial g_L}{\partial \xi_{t_k}} \left(\prod_{j=1}^{k-1} \frac{\partial \xi_{t_{j+1}}}{\partial \xi_{t_j}}\right) \frac{\partial \xi_{t_1}}{\partial \phi} \quad (16)$$

where the sum in (16) is over all $k \geq 1$ and all increasing sequences $t_1 < \cdots < t_k$ of length $k$ in the range $\{1, \ldots, T\}$, and an empty product is, by convention, equal to 1. See Appendix C for a derivation and complete description of this gradient.

Whilst this total gradient is suitable for small to moderate values of $T$, we found that the product term in (16) may sometimes lead to a training instability akin to the problem of exploding gradients (Hochreiter, 1991; Hochreiter and Schmidhuber, 1997). To mitigate these effects, we also propose an alternative gradient scheme that uses the partial derivative instead of the total derivative of $g_L$ w.r.t. each $\xi_t$, thus taking into account only the direct dependence of each $\xi_t$ on $\phi$, yielding

$$\frac{dg_L}{d\phi} \approx \sum_{t=1}^{T} \frac{\partial g_L}{\partial \xi_t} \frac{\partial \xi_t}{\partial \phi}. \quad (17)$$

Intuitively, what this means is that we follow the gradient that individually improves each design–observation pair whilst treating the other data pairs as fixed. Thus $\phi$ will be updated to improve the history in a way that accounts for the other pairs in terms of their combined effect on $g_L$, but not in terms of the knock-on effects that changing an earlier design would have on the later designs the policy would select. This is conceptually similar to Jiang et al. (2020), in which the existence of future steps is accounted for, but not their functional dependence.
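One possible way to realize the two schemes in an autodiff framework is sketched below, under the reparametrizable toy-model assumptions of the earlier sketches; the detach-based construction is our illustrative interpretation of (17), not code from the paper. Backpropagating through the full rollout yields the total derivative of (15)–(16), whereas detaching the history that is fed back into the policy removes the knock-on paths by which an earlier design influences later design choices, leaving only the direct dependencies.

import torch

def rollout_logp(design_net, theta, T, partial_grad=False):
    batch = theta.shape[0]
    logp, history = torch.zeros_like(theta), []
    for t in range(T):
        xi = design_net(history).expand(batch, 1)   # always depends on phi directly
        y = theta[:, :1] * xi + torch.randn(batch, 1)      # reparametrized y_t
        logp = logp - 0.5 * (y - theta * xi) ** 2   # Gaussian log-lik up to a constant
        pair = (xi.detach(), y.detach()) if partial_grad else (xi, y)
        history.append(pair)                        # detaching cuts cross-step paths
    return logp

# Either way, the sPCE objective is then formed from logp and .backward() is
# called; only the gradient paths through the rollout differ.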

Using an analogous estimator to (17), we extend the gradient estimation to the case where the observations $y_t$ are discrete/categorical variables (see Appendix C). Crucially, this estimator does not induce the high-variance gradient estimates that typically plague such settings (Schulman et al., 2015). We found that this training scheme was highly stable across a range of values for $T$.

4.3 Architecture

Finally, we discuss the deep learning architecture used for $\pi_\phi$. To allow efficient and effective training, we take into account a key permutation invariance of the BOED problem, as highlighted by the following result (proved in Appendix A).

Theorem 3 (Permutation invariance).

Consider a permutation $\sigma$ acting on a history $h_t$, yielding $h_t^\sigma = ((\xi_{\sigma(1)}, y_{\sigma(1)}), \ldots, (\xi_{\sigma(t)}, y_{\sigma(t)}))$. For all such $\sigma$, we have

$$p(\theta|h_t) = p(\theta|h_t^\sigma),$$

showing that the EIG is unchanged under permutation. Further, the optimal policies starting in $h_t$ and $h_t^\sigma$ are the same.

This permutation invariance is an important and well-studied property of many machine learning problems (Bloem-Reddy and Teh, 2019). The knowledge that a system exhibits permutation invariance can be exploited in neural architecture design to enable significant weight sharing. One common approach is pooling (Edwards and Storkey, 2016; Zaheer et al., 2017; Garnelo et al., 2018a, b). This involves summing or otherwise combining representations of multiple inputs into a single representation that is invariant to their order.

Using this idea, we represent the history with a fixed-dimensional representation that is formed by pooling representations of the distinct design–outcome pairs of the history

$$R(h_t) = \sum_{k=1}^{t} E_{\phi_1}(\xi_k, y_k) \quad (18)$$

where $E_{\phi_1}$ is a neural network encoder with parameters $\phi_1$ to be learned. Note that this pooled representation is the same if we reorder the design–outcome pairs. By convention, the sum of an empty sequence is 0, so $R(h_0) = 0$.

We then construct our design network to make decisions based on the pooled representation by setting $\pi_\phi(h_t) = F_{\phi_2}(R(h_t))$, where $F_{\phi_2}$ is a learned emitter network. The full set of trainable parameters is $\phi = (\phi_1, \phi_2)$. By combining simple networks in a way that is sensitive to the permutation invariance of the problem, we facilitate parameter sharing: the encoder $E_{\phi_1}$ is re-used for each input pair and for each time step $t$. This results in significantly improved performance compared to networks that are forced to learn the relevant symmetries of the problem.
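A sketch of this encoder–emitter construction is given below. The layer sizes and dimensions are illustrative choices, not those used in the paper's experiments.

import torch
from torch import nn

# Pooled architecture of (18): an encoder E embeds each (design, outcome)
# pair, embeddings are summed into R(h_t), and an emitter F maps R(h_t) to
# the next design.

class DesignNetwork(nn.Module):
    def __init__(self, design_dim=1, obs_dim=1, hidden=128, embed=32):
        super().__init__()
        self.encoder = nn.Sequential(              # E_{phi_1}(xi_k, y_k)
            nn.Linear(design_dim + obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, embed),
        )
        self.emitter = nn.Sequential(              # F_{phi_2}(R(h_t))
            nn.Linear(embed, hidden), nn.ReLU(),
            nn.Linear(hidden, design_dim),
        )
        self.embed = embed

    def forward(self, history):
        if len(history) == 0:
            R = torch.zeros(1, self.embed)         # convention: R(h_0) = 0
        else:
            pairs = torch.stack([torch.cat(p, dim=-1) for p in history])
            R = self.encoder(pairs).sum(dim=0)     # sum-pooling is order-invariant
        return self.emitter(R)                     # next design xi_{t+1}

net = DesignNetwork()
xi_1 = net([])                                     # first design, from the empty history

Because the same encoder weights are applied to every pair and the sum is order-invariant, the network cannot distinguish reorderings of the history, matching Theorem 3 by construction.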

5 Related Work

Existing approaches to sequential BOED typically follow the path outlined in § 2. The posterior inference that is performed at each stage of the conventional approach has been done using sequential Monte Carlo (SMC) (Del Moral et al., 2006; Drovandi et al., 2014), population Monte Carlo (PMC) (Rainforth, 2017), variational inference (Foster et al., 2019), and Laplace approximations (Lewi et al., 2009; Long et al., 2013). The estimation of the mutual information objective at each step has been performed by nested Monte Carlo (Myung et al., 2013; Vincent and Rainforth, 2017), variational bounds (Foster et al., 2019), Laplace approximation (Lewi et al., 2009), ratio estimation (Kleinegesse et al., 2020), and hybrid methods (Senarathne et al., 2020). The optimization over designs has been performed by Bayesian optimization (Kleinegesse et al., 2020), interacting particle systems (Amzal et al., 2006), simulated annealing (Müller, 2005), utilizing regret bounds (Zheng et al., 2020), or bandit methods (Rainforth, 2017). The mutual information estimation and optimization can be combined into a single stochastic gradient procedure, with gradients estimated by perturbation analysis (Huan and Marzouk, 2014), variational lower bounds (Foster et al., 2020), or multi-level Monte Carlo (Goda et al., 2020). Other work has sought to learn a non-myopic strategy focusing on specific tractable cases (Huan and Marzouk, 2016; Jiang et al., 2020).

6 Experiments

Figure 1: An example of the designs learnt by DAD [Left] and the fixed baseline [Right] for a given $\theta$ sampled from the prior.

We now compare DAD to a number of baselines across a range of real-world experimental design problems. As our aim is to adapt designs in real time, we primarily compare to strategies that are fast at deployment time. This includes a random design strategy, and a fixed design strategy in which we learn constant designs $\xi_1, \ldots, \xi_T$ before beginning the experiment, giving a non-adaptive strategy. We also compare to tailor-made heuristics for particular models as appropriate. We implement DAD by extending PyTorch (Paszke et al., 2019) and Pyro (Bingham et al., 2018) to provide an implementation that is abstracted from the specific probabilistic model. Code is provided in the Supplement.

Similarly to the notion of the amortization gap in amortized inference (Cremer et al., 2018), one would expect to see a gap between the performance of DAD and that of conventional (non-amortized) BOED methods that use the approach of § 2. To assess this we also compare DAD to the variational method of Foster et al. (2020), which optimizes over one-step designs using SGA. We also look at several baselines that are specifically tailored to the examples that we choose (Vincent and Rainforth, 2017; Kleinegesse et al., 2020). Rather surprisingly, we find that DAD is not only competitive with these non-amortized methods, but on occasion outperforms them. We discuss why in § 7.

The first performance metric that we focus on is the total EIG, $\mathcal{I}_T(\pi)$. When no direct estimate of $\mathcal{I}_T(\pi)$ is available, we estimate both the sPCE lower bound and the sNMC upper bound. We also present the standard error to indicate how the performance varies between different experiment realizations. We further consider the deployment time (i.e. the time to run the experiment itself), a critical metric for our aims. Full experiment details are given in Appendix D.

6.1 Location finding in 2D

Method        Lower bound       Upper bound
Random        8.30 ± 0.04       8.32 ± 0.04
Fixed         8.84 ± 0.04       8.91 ± 0.04
DAD           9.94 ± 0.05      10.40 ± 0.07
Variational   8.78 ± 0.14       9.06 ± 0.19

Table 1: Upper and lower bounds on the total information $\mathcal{I}_T(\pi)$ for the location finding experiment. Errors indicate the s.e. estimated over 256 (variational) or 2048 (others) rollouts.
Figure 2: Generalizing the sequence length for the location finding experiment. The DAD network and the fixed strategy were trained to perform $T$ experiments, whilst the other strategies do not require pre-training. The fixed strategy cannot be generalized to sequences longer than its training regime. We present sPCE estimates with error bars computed as in Table 1.

Inspired by the acoustic energy attenuation model of Sheng and Hu (2005), we consider the problem of finding the locations of multiple hidden sources, each of which emits a signal whose intensity attenuates according to the inverse-square law. The total intensity is a superposition of these signals. The design problem is to choose where to make observations of the total signal in order to learn the locations of the sources.

We train a DAD network to perform a sequence of experiments in a setting with two sources. The designs learned by DAD are visualized in Figure 1 [Left]. Here our network learns a complex strategy that initially explores in a spiral pattern. Once it detects a strong signal, multiple experiments are performed close together to refine knowledge of that source's location. This process is then repeated to learn about the second source. The fixed design strategy, displayed in Figure 1 [Right], must choose all design locations up front, leading to an evenly dispersed strategy that cannot "hone in" on the critical areas, thus gathering less information.

Table 1 reports upper and lower bounds on $\mathcal{I}_T(\pi)$ for each strategy and confirms that DAD significantly outperforms the fixed and random strategies. Surprisingly, the variational posterior baseline, which requires substantial computational resources at each step of the experiment, performs notably worse than DAD; we discuss this in Section 7.

In practical situations the exact number of experiments to perform may be unknown. Figure 2 indicates that a DAD network pretrained to perform $T$ experiments can generalize well to running longer sequences at deployment time, still outperforming the baselines; this indicates that DAD is robust to the length of training sequences. In Appendix D.1 we also show that the performance of DAD is stable across different training runs.

Method               Deployment time (s)
Frye et al. (2016)    0.0902 ± 0.0003
Kirby (2009)          0.0059 ± 0.0003
Fixed                 0.0048 ± 0.0002
DAD                   0.0844 ± 0.0005
Badapted (fast)       4.3055 ± 0.0339
Badapted (slow)      25.2679 ± 0.1854

Table 2: Deployment times for hyperbolic temporal discounting methods. We present the total design time for the full sequence of questions, taking the mean and s.e. over 10 realizations. Tests were conducted on a lightweight CPU (see Appendix D).
Method               Lower bound       Upper bound
Frye et al. (2016)    3.308 ± 0.015     3.322 ± 0.015
Kirby (2009)          1.861 ± 0.008     1.864 ± 0.009
Fixed                 2.518 ± 0.007     2.524 ± 0.007
DAD                   4.241 ± 0.011     4.263 ± 0.012
Badapted (fast)       3.985 ± 0.014     4.019 ± 0.015
Badapted (slow)       4.454 ± 0.016     4.536 ± 0.018

Table 3: Final lower and upper bounds on the total information for the hyperbolic temporal discounting experiment. The bounds are finite sample estimates of $\mathcal{L}_T$ (11) and $\mathcal{U}_T$ (13). The errors indicate the s.e. over the sampled histories.

6.2 Hyperbolic temporal discounting

In psychology, temporal discounting is the phenomenon that the perceived value of an item decreases the longer we have to wait to receive it (Critchfield and Kollins, 2001; Green and Myerson, 2004). For example, a participant might be willing to trade £90 today for £100 in a month's time, but not for £100 in a year. A common parametric model for temporal discounting in humans is the hyperbolic model (Mazur, 1987); we study a specific form of this model proposed by Vincent (2016) and Vincent and Rainforth (2017). The model is described in detail in Appendix D.

We design a sequence of experiments, each taking the form of a binary question "Would you prefer £$R$ today, or £100 in $D$ days?", with design $\xi = (R, D)$ that must be chosen at each stage. As real applications of this model involve human participants, the available time to choose designs is strictly limited. We consider DAD, the aforementioned fixed design policy, and strategies that have been used specifically for experiments of this kind. Kirby (2009) proposed a hand-picked fixed design; Frye et al. (2016) proposed a problem-specific adaptive strategy; and Vincent and Rainforth (2017) developed a partially customized sequential BOED method, called Badapted, that uses PMC (Cappé et al., 2004) to approximate the posterior distribution at each step and a bandit approach to optimize the EIG.
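For intuition, a heavily simplified sketch of a hyperbolic discounting response model is given below. The exact model studied (Vincent, 2016; Vincent and Rainforth, 2017) has further components, so everything here, including the sigmoidal choice rule and parameter values, is an illustrative assumption.

import torch

# A participant with latent discount rate k values "£100 in D days" at
# 100 / (1 + k * D) (hyperbolic discounting, Mazur, 1987), and prefers it to
# "£R today" with a sigmoidal choice probability.

def p_choose_delayed(R, D, k, temperature=1.0):
    discounted_value = 100.0 / (1.0 + k * D)
    return torch.sigmoid((discounted_value - R) / temperature)

k = torch.tensor(0.05)                             # latent per-day discount rate
print(p_choose_delayed(R=torch.tensor(90.0), D=torch.tensor(30.0), k=k))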

We begin by investigating the time required to deploy each of these methods. As shown in Table 2, the non-amortized Badapted method takes the longest; we consider two different computational budgets for it based on different numbers of PMC steps. For DAD, the total design time is under 0.1 seconds, almost imperceptible to a participant.

Table 3 shows the performance of each method. Excluding the slow Badapted method, we see that DAD performs best, outperforming bespoke design methods that have been proposed for this problem. The slower Badapted method outperforms DAD by a small margin due to its larger computation budget. This experiment demonstrates that DAD can successfully amortize the process of experimental design. It reaches a performance approaching that of the most successful non-amortized and highly problem-specific approach with a fraction of the cost during the real experiment.

Method        Deployment time (s)    Total EIG
Fixed             0.0009 ± 11%       2.023 ± 0.007
DAD               0.0051 ± 12%       2.119 ± 0.008
Variational    1935.0000 ±  2%       2.076 ± 0.034
SeqBED*       25911.0                1.590

Table 4: Total EIG and deployment times for the death process. We present the EIG ± s.e. over 10,000 rollouts (fixed and DAD), 500 rollouts (variational), or *1 rollout (SeqBED). The information gain can be efficiently evaluated in this case (see Appendix D). Runtimes were computed as per Table 2.

6.3 Death process

We conclude with an example from epidemiology (Cook et al., 2008) in which healthy individuals become infected at rate $\theta$. The design problem is to choose the times $\xi$ at which to observe the number of infected individuals, selecting designs sequentially, with an independent stochastic process observed at each iteration. We compare to SeqBED (Kleinegesse et al., 2020), a fixed design, and a variational BOED baseline.
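A sketch of the death process observation model, under the common formulation following Cook et al. (2008), is shown below: each of $N$ initially healthy individuals has independently become infected by observation time $\xi$ with probability $1 - e^{-\theta\xi}$. The population size and parameter values here are illustrative assumptions; see Appendix D for the exact setup used.

import torch

def sample_num_infected(theta, xi, N=50):
    p_infected = 1.0 - torch.exp(-theta * xi)      # infection prob. by time xi
    return torch.distributions.Binomial(N, p_infected).sample()

theta = torch.tensor(1.5)                          # latent infection rate
print(sample_num_infected(theta, xi=torch.tensor(1.0)))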

First, we examine the compute time required to deploy each method for a single run of the sequential experiment. The times in Table 4 show that the adaptive strategy learned by DAD can be deployed in under 0.01 seconds, many orders of magnitude faster than the non-amortized methods, with SeqBED taking hours for one rollout.

Next, we estimate the objective by averaging the information gain over simulated rollouts. The results in Table 4 reveal that DAD designs are superior to both the fixed design and the variational adaptive design, tending to uncover more information about the latent $\theta$ across many possible experimental trajectories. For the comparison with SeqBED, we were unable to perform sufficient rollouts to obtain a high-quality estimate of the total EIG. Instead, we conducted a single rollout of each method with $\theta$ fixed. The resulting information gains for this one rollout were: 1.590 (SeqBED), 1.719 (variational), 1.6780 (fixed), and 1.772 (DAD).

7 Discussion

In this paper we introduced DAD, a new method that utilizes the power of deep learning to amortize the cost of sequential BOED and allow adaptive experiments to be run in real time. In all experiments, DAD performed significantly better than baselines with comparable deployment times. Further, DAD showed competitive performance against conventional BOED approaches that do not use amortization but instead carry out costly computations at each stage of the experiment.

Surprisingly, we found DAD was often able to outperform these non-amortized approaches despite using a tiny fraction of the resources at deployment time. We suggest two reasons for this. Firstly, conventional methods must approximate the posterior at each stage. If this approximation is poor, the resulting design optimization will yield poor results regardless of the EIG optimization approach chosen. Careful tuning of the posterior approximation could alleviate this, but would further increase computational time, and it is difficult to do this in the required automated manner. DAD sidesteps this problem altogether by eliminating the need to directly approximate a posterior distribution.

Figure 3: 1D location finding with one source and $T = 2$ steps. [Left] The design function; dashed lines correspond to the first design $\xi_1$, which is independent of $y_1$. [Right] The total EIG $\mathcal{I}_2(\pi)$ ± s.e.

Secondly, the policy learnt by DAD has the potential to be non-myopic: it does not choose a design that is optimal for the current experiment in isolation, but takes into account the fact that there are more experiments to be performed in the future. We can see this in practice in a simple instance of the location finding example with one source in 1D and $T = 2$ steps. This setting is simple enough to compute the exact one-step optimal design via numerical integration. Figure 3 [Left] shows the design function learnt by DAD alongside the exact optimal myopic design. The optimal myopic strategy for $t = 1$ is to sample at the prior mean. At time $t = 2$ the myopic strategy selects a positive or negative design with equal probability. In contrast, the policy learnt by DAD takes a strictly positive first design $\xi_1$, which does not optimize the EIG for $t = 1$ in isolation, but leads to a better overall strategy that focuses on searching the positive regime in the second experiment. Figure 3 [Right] confirms that the policy learned by DAD achieves a higher total EIG from the two-step experiment than the exact myopic approach.

Acknowledgements

AF gratefully acknowledges funding from EPSRC grant no. EP/N509711/1. DRI is supported by EPSRC through the Modern Statistics and Statistical Machine Learning (StatML) CDT programme, grant no. EP/S023151/1.

References

• B. Amzal, F. Y. Bois, E. Parent, and C. P. Robert (2006). Bayesian-optimal design via interacting particle systems. Journal of the American Statistical Association 101(474), pp. 773–785.
• J. A. Angelova (2012). On moments of sample mean and variance. Int. J. Pure Appl. Math 79(1), pp. 67–85.
• A. G. Baydin, B. A. Pearlmutter, A. A. Radul, and J. M. Siskind (2018). Automatic differentiation in machine learning: a survey. Journal of Machine Learning Research 18.
• E. Bingham, J. P. Chen, M. Jankowiak, F. Obermeyer, N. Pradhan, T. Karaletsos, R. Singh, P. Szerlip, P. Horsfall, and N. D. Goodman (2018). Pyro: deep universal probabilistic programming. Journal of Machine Learning Research.
• B. Bloem-Reddy and Y. W. Teh (2019). Probabilistic symmetry and invariant neural networks. arXiv preprint arXiv:1901.06082.
• O. Cappé, A. Guillin, J. Marin, and C. P. Robert (2004). Population Monte Carlo. Journal of Computational and Graphical Statistics 13(4), pp. 907–929.
• K. Chaloner and I. Verdinelli (1995). Bayesian experimental design: a review. Statistical Science, pp. 273–304.
• A. R. Cook, G. J. Gibson, and C. A. Gilligan (2008). Optimal observation times in experimental epidemic processes. Biometrics 64(3), pp. 860–868.
• C. Cremer, X. Li, and D. Duvenaud (2018). Inference suboptimality in variational autoencoders. In Proceedings of the 35th International Conference on Machine Learning, PMLR 80, pp. 1078–1086.
• T. S. Critchfield and S. H. Kollins (2001). Temporal discounting: basic research and the analysis of socially important behavior. Journal of Applied Behavior Analysis 34(1), pp. 101–122.
• P. Del Moral, A. Doucet, and A. Jasra (2006). Sequential Monte Carlo samplers. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 68(3), pp. 411–436.
• C. C. Drovandi, J. M. McGree, and A. N. Pettitt (2014). A sequential Monte Carlo algorithm to incorporate model uncertainty in Bayesian sequential design. Journal of Computational and Graphical Statistics 23(1), pp. 3–24.
• S. Dushenko, K. Ambal, and R. D. McMichael (2020). Sequential Bayesian experiment design for optically detected magnetic resonance of nitrogen-vacancy centers. Physical Review Applied 14(5), 054036.
• H. Edwards and A. Storkey (2016). Towards a neural statistician. arXiv preprint arXiv:1606.02185.
• J. R. Evans and A. Mathur (2005). The value of online surveys. Internet Research.
• A. Foster, M. Jankowiak, E. Bingham, P. Horsfall, Y. W. Teh, T. Rainforth, and N. Goodman (2019). Variational Bayesian optimal experimental design. In Advances in Neural Information Processing Systems 32, pp. 14036–14047.
• A. Foster, M. Jankowiak, M. O'Meara, Y. W. Teh, and T. Rainforth (2020). A unified stochastic gradient approach to designing Bayesian-optimal experiments. In Proceedings of Machine Learning Research, Vol. 108, pp. 2959–2969.
• C. C. Frye, A. Galizio, J. E. Friedel, W. B. DeHart, and A. L. Odum (2016). Measuring delay discounting in humans using an adjusting amount task. JoVE (Journal of Visualized Experiments) (107), e53584.
• M. Garnelo, D. Rosenbaum, C. J. Maddison, T. Ramalho, D. Saxton, M. Shanahan, Y. W. Teh, D. J. Rezende, and S. Eslami (2018a). Conditional neural processes. arXiv preprint arXiv:1807.01613.
• M. Garnelo, J. Schwarz, D. Rosenbaum, F. Viola, D. J. Rezende, S. Eslami, and Y. W. Teh (2018b). Neural processes. arXiv preprint arXiv:1807.01622.
• A. Gelman, J. B. Carlin, H. S. Stern, D. B. Dunson, A. Vehtari, and D. B. Rubin (2013). Bayesian Data Analysis. Chapman and Hall/CRC.
• T. Goda, T. Hironaka, and W. Kitade (2020). Unbiased MLMC stochastic gradient-based optimization of Bayesian experimental designs. arXiv preprint arXiv:2005.08414.
• J. González, M. Osborne, and N. Lawrence (2016). GLASSES: relieving the myopia of Bayesian optimisation. In Artificial Intelligence and Statistics, pp. 790–799.
• L. Green and J. Myerson (2004). A discounting framework for choice with delayed and probabilistic rewards. Psychological Bulletin 130(5), 769.
• S. Hochreiter and J. Schmidhuber (1997). Long short-term memory. Neural Computation 9(8), pp. 1735–1780.
• S. Hochreiter (1991). Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, Technische Universität München.
• X. Huan and Y. M. Marzouk (2016). Sequential Bayesian optimal experimental design via approximate dynamic programming. arXiv preprint arXiv:1604.08320.
• X. Huan and Y. Marzouk (2014). Gradient-based stochastic optimization methods in Bayesian experimental design. International Journal for Uncertainty Quantification 4(6).
• S. Jiang, H. Chai, J. Gonzalez, and R. Garnett (2020). BINOCULARS for efficient, nonmyopic sequential experimental design. In International Conference on Machine Learning, pp. 4794–4803.
• D. P. Kingma and J. Ba (2014). Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
• D. P. Kingma and M. Welling (2014). Auto-encoding variational Bayes. In ICLR.
• K. N. Kirby (2009). One-year temporal stability of delay-discount rates. Psychonomic Bulletin & Review 16(3), pp. 457–462.
• S. Kleinegesse, C. Drovandi, and M. U. Gutmann (2020). Sequential Bayesian experimental design for implicit models via mutual information. arXiv preprint arXiv:2003.09379.
• J. Kruschke (2014). Doing Bayesian Data Analysis: A Tutorial with R, JAGS, and Stan.
• J. Lewi, R. Butera, and L. Paninski (2009). Sequential optimal design of neurophysiology experiments. Neural Computation 21(3), pp. 619–687.
• D. V. Lindley (1956). On a measure of the information provided by an experiment. The Annals of Mathematical Statistics, pp. 986–1005.
• Q. Long, M. Scavino, R. Tempone, and S. Wang (2013). Fast estimation of expected information gains for Bayesian experimental designs based on Laplace approximations. Computer Methods in Applied Mechanics and Engineering 259, pp. 24–39.
• J. Lyu, S. Wang, T. E. Balius, I. Singh, A. Levit, Y. S. Moroz, M. J. O'Meara, T. Che, E. Algaa, K. Tolmachova, et al. (2019). Ultra-large library docking for discovering new chemotypes. Nature 566(7743), 224.
• J. E. Mazur (1987). An adjusting procedure for studying delayed reinforcement. In Commons, M. L.; Mazur, J. E.; Nevin, J. A. (eds.), pp. 55–73.
• P. Müller (2005). Simulation based optimal design. Handbook of Statistics 25, pp. 509–518.
• J. I. Myung, D. R. Cavagnaro, and M. A. Pitt (2013). A tutorial on adaptive design optimization. Journal of Mathematical Psychology 57(3-4), pp. 53–67.
• S. Nowozin (2018). Debiasing evidence approximations: on importance-weighted autoencoders and jackknife variational inference. In International Conference on Learning Representations.
• J. Pasek and J. A. Krosnick (2010). Optimizing survey questionnaire design in political science. In The Oxford Handbook of American Elections and Political Behavior.
• A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala (2019). PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pp. 8024–8035.
• B. Poole, S. Ozair, A. van den Oord, A. Alemi, and G. Tucker (2019). On variational bounds of mutual information. In International Conference on Machine Learning, pp. 5171–5180.
• T. Rainforth, R. Cornish, H. Yang, A. Warrington, and F. Wood (2018). On nesting Monte Carlo estimators. In International Conference on Machine Learning, pp. 4267–4276.
• T. Rainforth (2017). Automating Inference, Learning, and Design using Probabilistic Programming. PhD thesis, University of Oxford.
• H. Robbins and S. Monro (1951). A stochastic approximation method. The Annals of Mathematical Statistics, pp. 400–407.
• E. G. Ryan, C. C. Drovandi, J. M. McGree, and A. N. Pettitt (2016). A review of modern computational algorithms for Bayesian optimal design. International Statistical Review 84(1), pp. 128–154.
• J. Schulman, N. Heess, T. Weber, and P. Abbeel (2015). Gradient estimation using stochastic computation graphs. In Advances in Neural Information Processing Systems.
• S. Senarathne, C. Drovandi, and J. McGree (2020). A Laplace-based algorithm for Bayesian adaptive design. Statistics and Computing 30(5), pp. 1183–1208.
• X. Sheng and Y. H. Hu (2005). Maximum likelihood multiple-source localization using acoustic energy measurements with wireless sensor networks. IEEE Transactions on Signal Processing.
• A. Stuhlmüller, J. Taylor, and N. Goodman (2013). Learning stochastic inverses. In Advances in Neural Information Processing Systems, pp. 3048–3056.
• A. van den Oord, Y. Li, and O. Vinyals (2018). Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748.
• J. Vanlier, C. A. Tiemann, P. A. Hilbers, and N. A. van Riel (2012). A Bayesian approach to targeted experiment design. Bioinformatics 28(8), pp. 1136–1142.
• B. T. Vincent and T. Rainforth (2017). The DARC toolbox: automated, flexible, and efficient delayed and risky choice experiments using Bayesian adaptive design.
• B. T. Vincent (2016). Hierarchical Bayesian estimation and hypothesis testing for delay discounting tasks. Behavior Research Methods 48(4), pp. 1608–1620.
• M. Zaheer, S. Kottur, S. Ravanbakhsh, B. Poczos, R. Salakhutdinov, and A. Smola (2017). Deep sets. arXiv preprint arXiv:1703.06114.
• S. Zheng, D. Hayden, J. Pacheco, and J. W. Fisher III (2020). Sequential Bayesian experimental design with variable cost structure. Advances in Neural Information Processing Systems 33.
• S. Zheng, J. Pacheco, and J. Fisher (2018). A robust approach to sequential information theoretic planning. In International Conference on Machine Learning, pp. 5941–5949.

Appendix A Proofs

Here we present proofs for all theorems in the main paper, with each restated for convenience.

Proof of Theorem 1.

We begin by rewriting $\mathcal{I}_T(\pi)$ in terms of the information gain. This closely mimics the development that we presented in Section 2. By repeated application of Bayes' theorem we have

$$p(\theta|h_T) = p(\theta) \prod_{t=1}^{T} \frac{p(y_t|\theta,\xi_t)}{p(y_t|h_{t-1},\xi_t)}.$$

Now noting that each $\xi_t$ is completely determined by $h_{t-1}$ and $\pi$ (in particular, noting that $\xi_t$ is deterministic given these, while $\theta$ is already marginalized out in each $p(y_t|h_{t-1},\xi_t)$), we can write

$$\mathcal{I}_T(\pi) = \mathbb{E}_{p(\theta)p(h_T|\theta,\pi)}\left[\log \frac{p(\theta|h_T)}{p(\theta)}\right]$$

and, substituting in our earlier formulation for $p(\theta|h_T)$,

$$\mathcal{I}_T(\pi) = \mathbb{E}_{p(\theta)p(h_T|\theta,\pi)}\left[\sum_{t=1}^{T} \log p(y_t|\theta,\xi_t) - \sum_{t=1}^{T} \log p(y_t|h_{t-1},\xi_t)\right].$$

We now observe that we can write $p(y_t|h_{t-1},\xi_t) = p(h_t|\pi)/p(h_{t-1}|\pi)$, which allows us to rewrite the second sum as

$$\sum_{t=1}^{T} \log p(y_t|h_{t-1},\xi_t) = \sum_{t=1}^{T} \left[\log p(h_t|\pi) - \log p(h_{t-1}|\pi)\right] = \log p(h_T|\pi),$$

where the last equality follows from the fact that we have a telescopic sum (and $p(h_0|\pi) = 1$). To complete the proof, we note that the first sum equals $\log p(h_T|\theta,\pi)$ by (5), and rearrange to give

$$\mathcal{I}_T(\pi) = \mathbb{E}_{p(\theta)p(h_T|\theta,\pi)}\left[\log \frac{p(h_T|\theta,\pi)}{p(h_T|\pi)}\right]$$

as required. ∎

Proof of Theorem 2.

We first show that $\mathcal{L}_T(\pi, L)$ is a lower bound to $\mathcal{I}_T(\pi)$. Using (8) and (11), the gap between the two is

$$\mathcal{I}_T(\pi) - \mathcal{L}_T(\pi, L) = \mathbb{E}_{p(\theta_0)p(h_T|\theta_0,\pi)p(\theta_{1:L})}\left[\log \frac{\frac{1}{L+1}\sum_{\ell=0}^{L} p(h_T|\theta_\ell,\pi)}{p(h_T|\pi)}\right];$$

now introducing the shorthand $\bar{p}(h_T|\theta_{0:L}) = \frac{1}{L+1}\sum_{\ell=0}^{L} p(h_T|\theta_\ell,\pi)$, this becomes

$$\mathcal{I}_T(\pi) - \mathcal{L}_T(\pi, L) = \mathbb{E}_{p(\theta_0)p(h_T|\theta_0,\pi)p(\theta_{1:L})}\left[\log \frac{\bar{p}(h_T|\theta_{0:L})}{p(h_T|\pi)}\right].$$

Now by the symmetry of the term inside the log, we see that this expectation would be the same if it were instead taken over $p(\theta_j)p(h_T|\theta_j,\pi)\prod_{\ell \neq j} p(\theta_\ell)$ for any $j \in \{0, \ldots, L\}$ (with $j = 0$ giving the original form). Furthermore, the result is unchanged if we take the expectation over the mixture of these distributions, and thus we have

$$\mathcal{I}_T(\pi) - \mathcal{L}_T(\pi, L) = \mathbb{E}_{p(\theta_{0:L})\,\bar{p}(h_T|\theta_{0:L})}\left[\log \frac{\bar{p}(h_T|\theta_{0:L})}{p(h_T|\pi)}\right] = \mathbb{E}_{p(\theta_{0:L})}\left[\text{KL}\left(\bar{p}(h_T|\theta_{0:L}) \,\middle\|\, p(h_T|\pi)\right)\right],$$

where $\bar{p}(h_T|\theta_{0:L})$ is indeed a distribution over $h_T$ since

$$\int \bar{p}(h_T|\theta_{0:L})\, dh_T = \frac{1}{L+1}\sum_{\ell=0}^{L} \int p(h_T|\theta_\ell,\pi)\, dh_T = 1.$$

Now by Gibbs' inequality the expected KL divergence above must be non-negative, establishing $\mathcal{L}_T(\pi, L) \leq \mathcal{I}_T(\pi)$ as required.

We next show monotonicity in $L$, i.e. $\mathcal{L}_T(\pi, L_1) \leq \mathcal{L}_T(\pi, L_2)$ for $L_1 \leq L_2$, using a similar argument as above