A Unified Stochastic Gradient Approach to Designing Bayesian-Optimal Experiments

11/01/2019 ∙ by Adam Foster, et al. ∙ 23

We introduce a fully stochastic gradient based approach to Bayesian optimal experimental design (BOED). This is achieved through the use of variational lower bounds on the expected information gain (EIG) of an experiment that can be simultaneously optimized with respect to both the variational and design parameters. This allows the design process to be carried out through a single unified stochastic gradient ascent procedure, in contrast to existing approaches that typically construct an EIG estimator on a pointwise basis, before passing this estimator to a separate optimizer. We show that this, in turn, leads to more efficient BOED schemes and provide a number of a different variational objectives suited to different settings. Furthermore, we show that our gradient-based approaches are able to provide effective design optimization in substantially higher dimensional settings than existing approaches.

READ FULL TEXT VIEW PDF
POST COMMENT

Comments

There are no comments yet.

Authors

page 1

page 2

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

The design of experiments is a key problem in almost every scientific discipline. Namely, one wishes to construct an experiment that is most informative about the process being investigated, while minimizing its cost. For example, in a psychological trial, we want to ensure questions posed to participants are pertinent and do not have predictable responses. In a pharmaceutical trial, we want to minimize the number of participants needed to successfully test our hypotheses. In an online automated help system, we want to ensure we ask questions that identify the user’s problem as quickly as possible.

In all these scenarios, our ultimate high-level aim is to choose designs that maximize the information gathered by the experiment. A powerful and broadly used approach for formalizing this aim is OED (chaloner1995; lindley1956; myung2013). In OED, we specify a Bayesian model for the experiment and then choose the design that maximizes the EIG from running it. More specifically, let denote the latent variables we wish to learn about from running the experiment and let represent the experimental design. By introducing a prior and a predictive distribution for experiment outcomes , we can calculate the EIG under this model by taking the expected reduction in posterior entropy

(1)

where represents the entropy of a distribution and . Our experimental design process now becomes that of the finding the design that maximizes .

Unfortunately, finding is typically a very challenging problem in practice. Even evaluating for a single design is computationally difficult because it represents a nested expectation and thus has no direct MC estimator (nmc). Though a large variety of approaches for performing this estimation have been suggested (myung2013; watson2017quest+; kleinegesse2018efficient; foster2019variational), the resulting OED strategies share a critical common feature: they estimate on a point-by-point basis and feed this estimator to an outer-level optimizer that selects which points to evaluate.

This framework can be highly inefficient for a number of reasons. For example, it adds an extra level of nesting to the overall computation process: must be separately estimated for each , substantially increasing the overall computational cost. Furthermore, one must typically resort to gradient-free methods to carry out the resulting optimization, which means it is difficult to scale the overall OED process to high dimensional design settings due to a dearth of optimization schemes which remain effective in such settings.

To alleviate these inefficiencies and open the door to applying OED in high-dimensional settings, we introduce an alternative to this two-stage framework by introducing unified objectives that can be directly maximized to simultaneously estimate and optimize . Specifically, by building on the work of foster2019variational, we construct variational lower bounds to that can be simultaneously optimized with respect to both the variational and design parameters. Optimizing the former ensures that we achieve a tight bound that in turn gives accurate estimates of , while simultaneously optimizing the latter circumvents the need for an expensive outer optimization process. Critically, this approach allows the optimization to be performed using stochastic gradient ascent (SGA) (robbins1951stochastic) and therefore scaled to substantially higher dimensional design problems than existing approaches.

To account for the varying needs of different problem settings, we introduce several classes of suitable variational lower bounds. Most notably, we introduce the ACE bound: an EIG variational lower bound that can be made arbitrarily tight, while remaining amenable to simultaneous SGA on both the variational parameters and designs.

We demonstrate the applicability of our unified gradient approach using a wide range of experimental design problems, including a real-world high-dimensional example from the pharmacology literature (lyu2019ultra). We find that our approaches are able to effectively optimize the EIG, consistently outperforming baseline two-stage approaches, with particularly large gains achieved for high-dimensional problems. These gains lead, in turn, to improved designs and more informative experiments.

2 Background

2.1 Bayesian optimal experimental design

When experimentation is costly, time consuming, or dangerous, it is essential to design experiments to learn the most from them. To choose between potential designs, we require a metric of the quality of a candidate design. In the OED framework dating back to lindley1956, this metric represents how much more certain we will become in our knowledge of the world after doing the experiment and analyzing the data. We prefer designs that will lead to strong conclusions even if we are not yet sure what those conclusions will be.

Specifically, the OED framework begins with Bayesian model of the experimental process. This is defined through the combination of a likelihood that predicts the experimental outcome under design and latent variable , and a prior which incorporates initial beliefs about the unknown . After conducting the experiment, our beliefs are updated to the posterior . The information gained about from doing the experiment with design and obtaining outcome is the reduction in entropy from the prior to the posterior

(2)

As it stands, information gain cannot be evaluated until after the experiment. To define a metric that will let us choose between designs before experimentation, we define the expected information gain (EIG), , by taking the expectation of IG over hypothesized outcomes using the marginal distribution under our model, , to give

(3)

which can be rewritten in the form of a mutual information between and with fixed, namely

(4)

The Bayesian optimal design, , is now the one which maximizes EIG over the set of feasible designs

(5)

In iterated experiment design, we design a sequence of experiments. At time , the prior in (4) is replaced by the posterior given the previous experiments and their observed outcomes , namely

(6)

This now provides a means to design a sequence of adaptive experiments, wherein we use information gathered from previous iterations to improve the designs used at future iterations.

2.2 Estimating expected information gain

Making even a single point estimate of EIG when solving (5) can be challenging because we must first estimate the unknown or , and then take an expectation over . Nested Monte Carlo (NMC) estimators (nmc), which make a Monte Carlo approximation of both the inner and outer integrals, converge relatively slowly: at a rate in the total computational budget .

foster2019variational noted that this approach is inefficient because it makes a separate MC approximation of the integrand for every sample of the outer integral. To share information between different samples, they proposed a number of variational estimators that used amortization, i.e. they attempted to learn the functional form of the integrand rather than approximating it afresh each time. One of their approaches was based on amortized variational inference and required an inference network which takes as input and outputs a distribution over . For any , we can construct a lower bound on . This is the BA, or posterior, lower bound (ba)

(7)

which has also found use representation learning (poole2018variational) and maximizing information transmission in noisy channels (ba).

To make high-quality approximations to , and simultaneously learn a good posterior approximation, foster2019variational maximize this bound with respect to . This approach is most effective when the bound is tight, i.e. . For , this occurs when it is possible to have , i.e. when the inference network is powerful enough to find the true posterior distribution for every .

The required optimization is performed using SGA (robbins1951stochastic), which requires unbiased gradient estimators. For , this is obtained by drawing Monte Carlo samples and forming the estimator

(8)

To obtain high-quality approximations of even when the inference network cannot capture the true posterior, foster2019variational also considered another variational estimator: variational nested Monte Carlo (VNMC). This uses the inference network in conjunction with additional samples to improve the estimate of the integrand. They showed that this leads to the following upper bound on

(9)

where the expectation is over . The inference network in VNMC is trained by minimization, in the same way is trained by maximization. has the attractive feature that the bound becomes tight as the number of inner samples becomes large, i.e. , even if is not powerful enough to directly represent the true posterior.

2.3 Optimizing the EIG

The experimental design problem is to find the design that maximizes the EIG. Therefore, as well as finding a way to estimate EIG, existing approaches subsequently need to find a way of searching across to find promising designs. At a high-level, most existing approaches propose a two-stage procedure in which noisy estimates of are made, and a separate optimization procedure selects the candidate design to evaluate next.

kleinegesse2018efficient and foster2019variational both use BO for this outer optimization step, a black-box optimization method (snoek2012practical) that is tolerant to noise in the estimates of the objective function, in this case . Some approaches (watson2017quest+; lyu2019ultra) instead select a finite number of candidate designs in and estimate at each candidate, with some refining this process further by adaptively allocating computational resources between these designs (vincent2017; rainforth2017thesis). Another suggested approach is to use MCMC methods to carry out this outer optimization (amzal2006bayesian; muller2005simulation).

3 Gradient-Based Design Optimization

Our central proposal is to replace the two-stage procedure outlined above with a single stage that both estimates and optimizes simultaneously. This has the critical advantage of allowing SGA to be directly applied to the design optimization. Not only does this provide substantial computational gains over approaches which must construct separate estimates for each hypothetical design considered, but it also provides the potential to scale to substantially higher dimensional design problems than those which can be effectively tackled with existing approaches. Since we take gradients with respect to , we henceforth assume that is continuous.

In our approach, we utilize variational lower bounds on . Specifically, suppose we have a bound with variational parameters . For fixed , the estimate of improves as we maximize with respect to . We propose to maximize jointly with respect to . As we train , the variational approximation improves; as we train our design moves to regions where the lower bound on EIG is largest. By tackling this as a single optimization problem over , we obviate the need to have an outer optimizer for . Using a lower bound is important because it allows us to perform a single maximization over , rather than a more complex optimization such as the max-min optimization that would result if we used an upper bound.

In practice, we do not have lower bounds on that we can evaluate and differentiate in closed form. Instead, we have lower bounds in the form of expectations over . Fortunately, we can still maximize the lower bound with respect to

in this case by using SGA. Stochastic gradient methods have been applied in machine learning to train millions of parameters

(bottou2010large) and are a good fit for high-dimensional experimental design.

3.1 Ba

We now make our first concrete proposal for the lower bound : the BA bound , as defined in (7). We now optimize jointly where previously only

was trained using gradients. To perform SGA, we require unbiased estimators for both

and . We use equation (8) to estimate the -gradient. For the -gradient we have

(10)

where , which is an unbiased estimator of . This is a score function estimator (williams1992simple) and other possibilities are discussed in Section 3.6.

3.2 Adaptive contrastive estimation (ACE)

The BA bound provides one specific case of our one-stage procedure for optimal experimental design. We now introduce a new lower bound that improves upon . The potential issue with the BA bound is that it may not be sufficiently tight, which happens when the inference network cannot represent the true posterior. One possible solution is to introduce additional samples, as in the VNMC estimator (9). However, we cannot use VNMC directly for a one-stage procedure: since it is an upper bound, we must minimize it with respect to , but we still wish to maximize with respect to . This precludes a joint maximization over .

Looking more closely at the VNMC bound, we see that its main failure case is when the denominator strongly under-estimates , which can happen when all the inner samples miss regions where the joint is large. In addition to the samples , we also have the original sample from which was sampled. One way to avoid the under-estimation in the denominator would be to include this sample, giving

(11)

where the expectation is with respect to . The samples can now be seen as contrasts to the original sample . For this reason, we call contrastive samples and we call (11) the adaptive contrastive estimate (ACE) of EIG. The following theorem establishes that is a valid lower bound on the EIG and that the bound becomes tight as .

Theorem 1.

For any model and inference network , we have the following:

  1. is a lower bound on the EIG and we can characterize the error term as an expected KL divergence:

  2. As , we recover the true EIG:
    .

  3. The ACE bound is monotonically increasing in : for .

  4. If the inference network equals the true posterior , then .

The proof is presented in Appendix A, along with Theorem 3 which concerns properties of the maximum of the ACE bound. Gradient estimation for ACE is discussed in Section 3.6. We note that, to the best of our knowledge, has not previously appeared in the OED literature.111Aside from a recent blog post (sobolev2019thoughts) we do not believe this bound has previously been studied in any context.

3.3 Prior contrastive estimation (PCE)

Theorem 1 tells us that can become close to in two scenarios: 1) the inference network becomes close to the true posterior , 2) we increase the number of contrastive samples . The BA bound only becomes tight in case 1). A special case of ACE is to replace the inference network with a fixed distribution and rely on the contrastive samples to make good estimates of , only becoming tight in case 2), i.e. as . This simplification can speed up training, since we no longer need to learn additional parameters .

To explore this, we propose the prior contrastive estimation (PCE) bound, in which the prior is used to generate contrastive samples:

(12)

where the expectation is over . Whilst inherently less powerful than ACE, PCE can be effective when the prior and posterior are not too different, so is a suitable proposal to estimate .

Though, to the best of our knowledge, this bound has not been applied to OED before, we note that it shares a connection to the information noise contrastive estimation (InfoNCE) bound on mutual information used in representation learning

(oord2018representation). Given data samples , corresponding representations , and a learned critic , we have

(13)

where the expectation is over , is the data distribution, and is the encoder. poole2018variational showed that the encoder density is the optimal critic, although it is rarely known in closed form in the representation learning context. Writing for and for , we note the mathematical connection between this optimal case and .

3.4 Likelihood-free ACE

In some models such as random effects models, the likelihood is not known in closed form but can be sampled from. This presents a problem when computing or its derivatives because the likelihood appears in (11). To allow ACE to be used for these kinds of models, we now show that using a unnormalized approximation to the likelihood still results in a valid lower bound on EIG. In fact, if using a parametrized likelihood approximation , it is then possible to train jointly with to approximate the likelihood, learn an inference network, and find the optimal design through the solution to a single optimization problem. The following theorem, whose proof is presented in Appendix A, shows that replacing the likelihood with an unnormalized approximation does result in a valid lower bound on EIG.

Theorem 2.

Consider a model and inference network . Let be an unnormalized likelihood approximation. Then,

(14)

where the expectation is over .

3.5 Iterated experimental design with ACE

In iterated experimental design, we replace by as per (6). We can sample by performing inference. Whilst variational inference also provides a closed form estimate of the posterior density, some other inference methods do not. This is problematic because the prior density appears in (11). Fortunately, it is sufficient to know the density up to proportionality (foster2019variational). Indeed if where does not depend on and is an unnormalized density, then

(15)

and the derivatives of are simply zero.

3.6 Gradient estimation for ACE

To optimize the ACE bound with respect to it is necessary to have unbiased gradient estimators of and .

The simplest form of the -gradient is

(16)

where the expectation is with respect to , and

(17)

Estimating the expectation (16) directly using Monte Carlo gives the score function, or REINFORCE, estimator (williams1992simple)

. Unfortunately, this is often high variance, and reducing gradient estimator variance is often important in solving challenging experimental design problems.

One variance reduction method is reparameterization. For this, we introduce random variables

whose distributions do not depend on along with representations of and as deterministic functions of these variables:

(18)

This now permits a reparameterized form of the gradient:

(19)

where the expectation is over . A Monte Carlo approximation of this expectation is typically a much lower variance estimator for the true -gradient.

Alternatively, if

is a discrete random variable we can sum over the possible values

. This approach is known as Rao-Blackwellization and gives

(20)

where the expectation is now over .

Turning to , we note that if are reparameterizable as in (18), then we can utilize the double reparameterization of tucker2018doubly; for full details see Appendix A.1.

4 Experiments

Figure 1: A sample trajectory for the death process. The grayscale shows the EIG surface (white is maximal), whilst crosses show the optimization trajectory of using ACE with pink representing later steps. See Sec. 4.2 for details.

In our experiments, we learn optimal experimental designs in five scenarios: the death process, a well known two-dimensional design problem from epidemiology; a non-conjugate regression model with a 400-dimensional design; an ablation study in the setting of advertising; a real-world biomolecular docking problem from pharmacology in 100 dimensions; and a constant elasticity of substitution iterated experimental design problem in behavioural economics with 6 dimensional designs.

4.1 Evaluating experimental designs

We first discuss which metrics we will use to judge the quality of the designs we obtain.

Our primary metric on designs is, of course, the EIG. We prefer designs with high EIGs. In some cases, we can evaluate the EIG analytically. In other cases, we can use a sufficiently large number of samples in a NMC (nmc) estimator to be sure that we have estimates that are sufficiently accurate to compare designs.

However, to explore the limits of our methods, we will also consider scenarios where neither of these approaches is suitable. In these cases, we pair the ACE lower bound (with fixed for evaluation) with the VNMC upper bound (foster2019variational) to trap the true EIG value—if the lower bound of one design is higher than the upper bound for another, we can be sure that the first design is superior (noting that the bounds themselves can be tractably estimated to a very high accuracy).

In some settings, when we know the true optimal design , we will also consider the design error , i.e. how close our design is to the optimal design.

In iterated experiment design, as well as designing experiments, we must also perform inference on the latent variable after each iteration. Here, we also investigate the quality of the final posterior. Specifically, if is the posterior after experiments, we use the posterior entropy, and the posterior RMSE . We prefer low entropy and low RMSE values.

4.2 Death process

Figure 2: Optimization of EIG for the death process as a function of wall clock time. We depict the mean and standard error (s.e.) from 100 runs. The final EIG values (rightmost points) are as follows: [ACE] , [PCE] , [BA] , [BO] . See Sec. 4.2 for details.

We consider an example from epidemiology, the death process (cook2008optimal; kleinegesse2018efficient), in which a population of individuals transitions from healthy to infected states at a constant but unknown rate . We can measure the number of infected individuals at two different times and where . Our aim is to infer the infection rate from these observations. For full details of the prior and likelihood used, see Appendix B.1.

On this problem, we apply gradient methods with Rao-Blackwellization (there are 66 possible outcomes). Figure 1 shows a sample optimization trajectory with the approximate EIG surface for illustration. We compare against BO using the Rao-Blackwellized NMC estimator of vincent2017. Figure 2 shows that, for the allowed time budget, all gradient methods perform better than BO even on this two-dimensional problem.

4.3 Regression

We now compare our one-stage gradient approaches to experimental design against a two-stage baseline on a high-dimensional design problem. We choose a general purpose Bayesian linear regression model with

observations and features. The design is an matrix; the latent variables are , where is the dimensional regression coefficient and is the scalar variance. The outcomes are generated using a Normal likelihood for . Here is the th row of . To avoid trivial solutions, we enforce the constraint for all . We use independent priors for and . See Appendix B.2 for complete details.

We set and applied five methods to this 400 dimensional design problem: BA, ACE and PCE, as well as the VNMC estimator of foster2019variational, with both BO and random search to optimize over . The results are presented in Table 1. We note that the gradient methods strongly outperform the gradient-free baselines, with about double the final EIG.


Method
EIG l.b. EIG u.b.
ACE
PCE
BA
BO + VNMC
Random Search + VNMC
Table 1: Regression results. We estimate lower and upper bounds on the final EIG and present the mean and s.e. from 10 runs. See Sec. 4.3 for details.

4.4 Advertising

We now conduct a detailed ablation study on the effects of dimension on the quality of experimental designs produced using our gradient approaches and BO. To further isolate the distinction between one-stage and two-stage approaches to OED, we choose a setting in which we can compute analytically. We give BO, but not the gradient methods, access to the EIG oracle when making point evaluations of . Thus we put BO in the best possible position and ensure any gains are due to improvements from using gradient-based optimization.

Suppose that we are given an advertising budget of dollars that we need to allocate among regions, i.e. we choose with

. After conducting an ad campaign, we observe a vector of sales

. We use this data to make inferences about the underlying market opportunities in each region. Our prior incorporates the knowledge that neighbouring regions are more correlated than distant ones—this leads to an interesting experimental design problem because information can be pooled between regions. We can also compute the true EIG and optimal design analytically. For full details, see Appendix B.3.

We compare the performance of four estimation and optimization methods on this problem, see Fig. 3 for the results. The three gradient-based methods (ACE, PCE, BA) perform best, with the BO baseline struggling in dimensions , even though the latter has access to an EIG oracle. PCE performed well in low dimensions, but degraded as the dimension increases and sampling from the prior becomes increasingly inefficient—something that ACE and BA avoid by learning adaptive proposal distributions. We note that since in this case the family of variational distributions used in ACE and BA include the true posterior, both methods yield similar performance.

Figure 3: Mean absolute EIG and design errors for the marketing model in Sec. 4.4 obtained with four different estimators and averaged over 10 runs. The EIG is normalized such that an EIG error of unity corresponds to doing no better than a uniform budget, i.e. for .

4.5 Biomolecular docking

(a) Entropy
(b) Posterior RMSE of
(c) Posterior RMSE of
(d) Posterior RMSE of
Figure 4: Improvement in the posterior in the sequential CES experiment. Each step took 120 seconds for each method. (a) Total entropy. (b)(c)(d) The root mean square error (RMSE) of the posterior approximations of , and compared to the true values. We present the mean and standard error from 10 runs. See Sec. 4.6 for details.
Figure 5: Designs for the biomolecular docking problem obtained by ACE and by lyu2019ultra. Designs consist of 100 docking scores at which to test compounds.

We now consider an experimental design problem of interest to the pharmacology community. Having demonstrated that our one-stage gradient methods compare favourably with two stage approaches, we now compare against designs crafted by domain experts.

In molecular docking, computational techniques are used to predict the binding affinity between a compound and a receptor. When synthesized in the lab, the two may bind—this is called a hit. Learning a well-calibrated hit-rate model can guide how many compounds to evaluate for additional objectives, such as drug-likeness or toxicity, before experimental testing. lyu2019ultra

modelled the probability of outcome

being a hit, given the predicted binding affinity, or docking score, , as

(21)

where with priors given in Appendix B.4.

Of 150 million compounds, lyu2019ultra selected a batch of compounds to experimentally test to best fit the sigmoid hit-rate model. They considered 6 candidate designs and selected one that maximized the EIG estimated by NMC. Here, we instead apply gradient-based BOED to search across candidate designs which consist of 100 docking scores . To evaluate our final designs, we present upper and lower bounds on the final EIG: see Table 2. We see that all gradient methods are able to outperform experts in terms of EIG, and that ACE appears the best of the gradient methods. Figure 5 shows our designs are qualitatively different to those produced by experts.

4.6 Constant elasticity of substitution

Method EIG lower bound EIG upper bound
ACE
PCE
BA
Expert
Table 2: Biomolecular docking results. We computed lower and upper bounds on the final EIG. We present the mean and s.e. from 10 runs. For the expert, we took the best design of lyu2019ultra appropriately rescaled to consist of 100 docking scores for comparison.

We now turn to iterated experimental design in which we produce designs, generate data and make inference repeatedly. This problem therefore captures the end-to-end-process of experimentation and inference.

We consider an experiment in behavioural economics that was previously also considered by foster2019variational. In this experiment, a participant is asked to compare baskets of goods. The model assumes that their response (on a slider) is based on the difference in utility of the baskets, and the constant elasticity of substitution (CES) model (arrow1961capital) governed by latent variables is then used for this utility. The aim is to learn characterizing the participant’s utility. In the experiment, there are 20 sequential steps of experimentation with the same participant. We compare our gradient-based approach against the most successful approach of foster2019variational that approximates the marginal density to form an upper bound on EIG, and BO to optimize . For full details, see Appendix B.5.

Figure 4 shows that gradient-based methods are effective on this problem; both ACE and PCE decrease the posterior entropy and RMSEs on the latent variables faster and further than the baseline, whereas BA does not do so well. We suggest that the similar performance of ACE and PCE is due to the smaller changes in the posterior at middle and late steps, after much data has been accumulated: when the posterior does not change much at each step, forms an effective proposal for estimating .

5 Conclusions

We have introduced a new approach for Bayesian experimental design that does away with the two stages of estimating EIG and separately optimizing over . We use stochastic gradients to maximize a lower bound on and so find optimal designs by solving a single optimization problem. This unification leads to substantially improved performance, especially on high-dimensional design problems.

Of the three lower bounds, and , we note that in all five experiments ACE generally did as well as the better of BA and PCE: we therefore recommend it as the default choice. BA performed well when the inference network could closely approximate the true posterior; PCE performed well when the prior was an adequate proposal for estimating and does not require the training of variational parameters.

Acknowledgements

AF gratefully acknowledges funding from EPSRC grant no. EP/N509711/1. YWT’s research leading to these results has received funding from the European Research Council under the European Union’s Seventh Framework Programme (FP7/2007-2013) ERC grant agreement no. 617071. TR gratefully acknowledges funding from Tencent AI Labs and a junior research fellowship supported by Christ Church, Oxford.

References

References

Appendix A Gradient-Based Design Optimization

We begin with the proof of Theorem 1, which we restate for convenience.

See 1

Proof.

To begin with 1., we have the error term which can be written

(22)
(23)
(24)

where the expectation is over . Note that the integrand is symmetric under a permutation of the labels , so its expectation will be the same under the distribution . This in turn implies that we can consider this as an expectation under

(25)

which is the expected KL divergence required. We therefore have .

For 2., we assume that is bounded. The ACE denominator is a consistent estimator of the marginal likelihood. Indeed,

(26)

and

(27)

as

by the Strong Law of Large Numbers, since

(28)

This establishes the a.s. pointwise convergence of the ACE integrand to . Hence by Bounded Convergence Theorem,

(29)

as .

To establish 3., we let . Then

(30)
(31)

where the expectation is over and

(32)

By symmetry, we can consider this as an expectation over (we can permute the labels ). We therefore recognise as the expectation of a KL divergence. Hence as required.

4. follows by Bayes Theorem, i.e.

(33)

which completes the proof. ∎

We also present the proof of Theorem 2.

See 2

Proof.

Initially, we note that the contrastive samples do not carry additional information about . We consider the mutual information between and the random variable

. Using the Chain Rule for mutual information we have

(34)

Now since () are conditionally independent of given . Therefore

(35)

We now use the Donsker-Varadhan representation of mutual information (donsker1975asymptotic). Specifically, for random variables

with joint distribution

and any measurable function we have

(36)

We now use this representation with and the integrand

(37)

We compute the second term in (36), .

(38)
(39)
(40)
(41)

where the second to last line follows by symmetry. This establishes that , and so (14) constitutes a valid lower bound on . That is

(42)

which completes the proof. ∎

The following theorem establishes a condition under which the maximum of the ACE objective converges to the maximum of the EIG as .

Theorem 3.

Consider a model such that

(43)

and . Let be an inference network and let

(44)

Then,

(45)

and in particular as .

Proof.

We have since is a lower bound on by Theorem 1.

Next, we consider . We have

(46)

where

(47)

we write (46) as

(48)

and we apply the inequality to give

(49)

We now observe that for , and hence, taking a partial expectation over we have

(50)
(51)

Hence

(52)
(53)