amci
Code to accompany "Amortized Monte Carlo Integration" ICML 2019
Current approaches to amortizing Bayesian inference focus solely on approximating the posterior distribution. Typically, this approximation is, in turn, used to calculate expectations for one or more target functions - a computational pipeline which is inefficient when the target function(s) are known upfront. In this paper, we address this inefficiency by introducing AMCI, a method for amortizing Monte Carlo integration directly. AMCI operates similarly to amortized inference but produces three distinct amortized proposals, each tailored to a different component of the overall expectation calculation. At runtime, samples are produced separately from each amortized proposal, before being combined to an overall estimate of the expectation. We show that while existing approaches are fundamentally limited in the level of accuracy they can achieve, AMCI can theoretically produce arbitrarily small errors for any integrable target function using only a single sample from each proposal at runtime. We further show that it is able to empirically outperform the theoretically optimal self-normalized importance sampler on a number of example problems. Furthermore, AMCI allows not only for amortizing over datasets but also amortizing over target functions.
At its core, Bayesian modeling is rooted in the calculation of expectations: the eventual aim of modeling is typically to make a decision or to construct predictions for unseen data, both of which take the form of an expectation under the posterior (Robert, 2007). This aim can thus be summarized in the form of one or more expectations $\mathbb{E}_{p(x|y)}[f(x)]$, where $f(x)$ is a target function and $p(x|y)$ is the posterior distribution on $x$ for some data $y$, which we typically only know up to a normalizing constant $p(y)$. More generally, expectations with respect to distributions with unknown normalization constant are ubiquitous throughout the sciences (Robert & Casella, 2013).
Sometimes $f(x)$ is not known up front. Here it is typically convenient to first approximate $p(x|y)$, e.g. in the form of Monte Carlo (MC) samples, and then later use this approximation to calculate estimates, rather than addressing the target expectations directly.
However, it is often the case in practice that a particular target function, or class of target functions, is known a priori. For example, in decision-based settings $f(x)$ takes the form of a loss function, while any posterior predictive distribution constitutes a set of expectations with respect to the posterior, parameterized by the new input. Though often overlooked, it is well established that in such target-aware settings the aforementioned pipeline of first approximating $p(x|y)$ and then using this as a basis for calculating $\mathbb{E}_{p(x|y)}[f(x)]$ is suboptimal as it ignores relevant information in $f(x)$ (Hesterberg, 1988; Wolpert, 1991; Oh & Berger, 1992; Evans & Swartz, 1995; Meng & Wong, 1996; Chen & Shao, 1997; Gelman & Meng, 1998; Lacoste-Julien et al., 2011; Owen, 2013; Rainforth et al., 2018b). As we will later show, the potential gains in such scenarios can be arbitrarily large.
In this paper, we extend these ideas to amortized inference settings (Stuhlmüller et al., 2013; Kingma & Welling, 2014; Ritchie et al., 2016; Paige & Wood, 2016; Le et al., 2017, 2018a; Webb et al., 2018), wherein one looks to amortize the cost of inference across different possible datasets by learning an artifact that assists the inference process at runtime for a given dataset. Existing approaches do not operate in a target-aware fashion, such that even if the inference network learns proposals that perfectly match the true posterior for every possible dataset, the resulting estimator is still sub-optimal.
To address this, we introduce AMCI, a framework for performing Amortized Monte Carlo Integration. Though still based on learning amortized proposal distributions, AMCI varies from standard amortized inference approaches in three respects. First, it operates in a target-aware fashion, incorporating information about $f(x)$ into the amortization artifacts. Second, rather than using self-normalization, AMCI employs three distinct proposals for separately estimating $\mathbb{E}_{p(x)}[\max(f(x),0)\,p(y|x)]$, $\mathbb{E}_{p(x)}[\max(-f(x),0)\,p(y|x)]$, and $\mathbb{E}_{p(x)}[p(y|x)]$, before combining these into an overall estimate. This breakdown allows for arbitrary performance improvements compared to self-normalized importance sampling (SNIS). Finally, to account for cases in which multiple possible target functions may be of interest, AMCI also allows for amortization over parametrized functions $f(x;\theta)$.
Remarkably, AMCI is able to achieve an arbitrarily low error at run-time using only a single sample from each proposal given sufficiently powerful amortization artifacts, contrary to the fundamental limitations on the accuracy of conventional amortization approaches. This ability is based around its novel breakdown of the target expectation into separate components, the subsequent utility of which extends beyond the amortized setting we consider here.
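Importance Sampling (IS), in its most basic form, is a method for approximating an expectation $\mathbb{E}_{\pi(x)}[f(x)]$ when it is either not possible to sample from $\pi(x)$ directly, or when the simple MC estimate, $\frac{1}{N}\sum_{n=1}^{N} f(x_n)$ where $x_n \sim \pi(x)$, has problematically high variance (Hesterberg, 1988; Wolpert, 1991). Given a proposal $q(x)$ from which we can sample and for which we can evaluate the density, it forms the following estimate
$$\mu := \mathbb{E}_{\pi(x)}[f(x)] = \int f(x)\,\pi(x)\,dx = \int f(x)\,\frac{\pi(x)}{q(x)}\,q(x)\,dx \tag{1}$$
$$\approx \hat{\mu} := \frac{1}{N}\sum_{n=1}^{N} f(x_n)\,w_n \tag{2}$$
where $x_n \sim q(x)$ and $w_n := \pi(x_n)/q(x_n)$ is known as the importance weight of sample $x_n$.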
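In practice, one often does not have access to the normalized form of $\pi(x)$. For example, in Bayesian inference settings, we typically have $\pi(x) = p(x|y) \propto p(x,y)$. Here we can use our samples to both approximate the normalization constant and the unnormalized integral. Thus if $\pi(x) = p(x|y)$, we have
$$\mu \approx \hat{\mu} := \frac{\sum_{n=1}^{N} f(x_n)\,w_n}{\sum_{n=1}^{N} w_n} \tag{3}$$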
where $x_n \sim q(x)$, and $w_n := p(x_n, y)/q(x_n)$. This approach is known as self-normalized importance sampling (SNIS). Conveniently, we can also construct the SNIS estimate lazily by calculating the empirical measure, i.e. storing weighted samples,
$$p(x|y) \approx \sum_{n=1}^{N} \frac{w_n}{\sum_{m=1}^{N} w_m}\,\delta_{x_n}(x) \tag{4}$$
and then using this to construct the estimate in (3) when $f(x)$ becomes available. As such, we can also think of SNIS as a method for Bayesian inference as, informally speaking, the empirical measure produced can be thought of as an approximation of the posterior.
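The lazy use of SNIS described above can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the conjugate Gaussian toy model, the proposal, and the helper names `snis_weighted_samples` and `snis_estimate` are all assumptions made here. The weighted samples are computed once and then reused for two different target functions.

```python
import numpy as np

def snis_weighted_samples(log_gamma, sample_q, log_q, n, rng):
    """Draw n proposal samples and return them with self-normalized weights."""
    x = sample_q(n, rng)
    log_w = log_gamma(x) - log_q(x)      # log of the unnormalized importance weights
    w = np.exp(log_w - log_w.max())      # subtract the max for numerical stability
    return x, w / w.sum()

def snis_estimate(f, x, w_bar):
    """Self-normalized estimate of E[f(x)] from stored weighted samples."""
    return float(np.sum(w_bar * f(x)))

# Toy model (assumed): p(x) = N(0, 1), p(y|x) = N(y; x, 1), observed y = 1,
# so the exact posterior is N(0.5, 0.5).
y_obs = 1.0
log_joint = lambda x: -0.5 * x**2 - 0.5 * (y_obs - x)**2   # log p(x, y) up to a constant
sample_q = lambda n, rng: rng.normal(0.0, 2.0, size=n)     # proposal N(0, 2^2)
log_q = lambda x: -0.5 * (x / 2.0)**2                      # log q(x) up to a constant

rng = np.random.default_rng(0)
x, w_bar = snis_weighted_samples(log_joint, sample_q, log_q, 200_000, rng)

# The same weighted samples serve any f that becomes available later.
post_mean = snis_estimate(lambda x: x, x, w_bar)
post_second_moment = snis_estimate(lambda x: x**2, x, w_bar)
```

Because the weights are normalized by their sum, additive constants dropped from the log densities cancel, which is why only the unnormalized joint is needed.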
For a general unknown target, the optimal proposal, i.e. the proposal which results in the estimator having the lowest possible variance, is the target distribution $\pi(x)$ (see e.g. (Rainforth, 2017, 5.3.2.2)). However, this no longer holds if we have some information about $f(x)$. In this target-aware scenario, the optimal behavior turns out to depend on whether we are self-normalizing or not.
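For the non-self-normalized case, the optimal proposal can be shown to be $q^*(x) \propto \pi(x)\,|f(x)|$ (Owen, 2013). Interestingly, in the case where $f(x) \ge 0 \;\forall x$, this leads to an exact estimator, i.e. $\hat{\mu} = \mu$ (with $\hat{\mu}$ as per (2)). To see this, notice that the normalizing constant for $q^*(x)$ is $\int \pi(x) f(x)\,dx = \mu$ and hence $q^*(x) = \pi(x) f(x)/\mu$. Therefore, even when $N = 1$, any possible value of the resulting sample $x_1$ yields an $\hat{\mu}$ satisfying $\hat{\mu} = f(x_1)\,\pi(x_1)/q^*(x_1) = \mu$.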
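The exactness argument above can be checked numerically on a case where the optimal proposal is available in closed form. The example is an assumption chosen for convenience: with $\pi(x) = \mathcal{N}(x; 0, 1)$ and $f(x) = e^{x}$, the product $\pi(x) f(x)$ is proportional to $\mathcal{N}(x; 1, 1)$ and the true expectation is $e^{1/2}$.

```python
import math
import numpy as np

def normal_pdf(x, mean, std):
    """Density of N(mean, std^2) at x."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))

rng = np.random.default_rng(1)
mu_true = math.exp(0.5)                    # E_{N(0,1)}[exp(x)]

estimates = []
for _ in range(5):
    x1 = rng.normal(1.0, 1.0)              # a single sample from q*(x) = N(x; 1, 1)
    w1 = normal_pdf(x1, 0.0, 1.0) / normal_pdf(x1, 1.0, 1.0)   # pi(x1) / q*(x1)
    estimates.append(math.exp(x1) * w1)    # f(x1) * w1, the N = 1 estimate from (2)
```

Every single-sample estimate equals `mu_true` up to floating-point rounding, whichever value of `x1` is drawn.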
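In the self-normalized case, the optimal proposal instead transpires to be $q^*(x) \propto \pi(x)\,|f(x) - \mu|$ (Hesterberg, 1988). In this case, one can no longer achieve a zero variance estimator for finite $N$ and nonconstant $f(x)$. Instead, the achievable error is lower bounded by (Owen, 2013)
$$\mathbb{E}\left[(\hat{\mu} - \mu)^2\right] \ge \frac{1}{N}\left(\mathbb{E}_{\pi(x)}\left[\,|f(x) - \mu|\,\right]\right)^2 \tag{5}$$
creating a fundamental limit on the performance of SNIS, even when information about $f(x)$ is incorporated.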
Given that these optimal proposals make use of the true expectation $\mu$, we will clearly not have access to them in practice. However, they provide a guide for the desirable properties of a proposal and can be used as targets for adaptive IS methods (see (Bugallo et al., 2017) for a recent review).
Inference amortization involves learning an amortization artifact that takes in datasets and produces proposals tailored to the corresponding inference problems. This amortization artifact typically takes the form of a parametrized proposal, $q(x; \varphi(y;\eta))$, which takes in data $y$ and produces proposal parameters using an inference network $\varphi(y;\eta)$, which itself has parameters $\eta$. When clear from the context, we will use the shorthand $q(x; y, \eta)$ for this proposal.
Though the exact process varies with context, the inference network is usually trained either by drawing latent-data sample pairs from the joint (Paige & Wood, 2016; Le et al., 2017, 2018b), or by drawing mini-batches from a large dataset using stochastic variational inference approaches (Hoffman et al., 2013; Kingma & Welling, 2014; Rezende et al., 2014; Ritchie et al., 2016). Once trained, it provides an efficient means of approximately sampling from the posterior of a particular dataset, e.g. using SNIS.
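Out of several variants, we focus on the method introduced by Paige & Wood (2016), as this is the one AMCI builds upon. In their approach, $\eta$ is trained to minimize the expectation of $D_{\mathrm{KL}}\left[p(x|y)\,\|\,q(x;y,\eta)\right]$ across possible datasets $y$, giving the objective
$$J(\eta) = \mathbb{E}_{p(y)}\left[D_{\mathrm{KL}}\left[p(x|y)\,\|\,q(x;y,\eta)\right]\right] = \mathbb{E}_{p(x,y)}\left[-\log q(x;y,\eta)\right] + \text{const wrt } \eta \tag{6}$$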
We note that the distribution over which we are taking the expectation is actually chosen somewhat arbitrarily: it simply dictates how much the network prioritizes a good amortization for one dataset over another; different choices are equally valid and imply different loss functions.
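This objective requires us to be able to sample from the joint distribution $p(x,y)$ and it can be optimized using gradient methods since the gradient can be easily evaluated:
$$\nabla_\eta J(\eta) = \mathbb{E}_{p(x,y)}\left[-\nabla_\eta \log q(x;y,\eta)\right] \tag{7}$$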
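As a rough illustration of training an amortized proposal by minimizing the expected negative log proposal density over samples from the joint, the sketch below uses a deliberately tiny setup. Everything in it is an assumption for illustration: the model is $p(x) = \mathcal{N}(0,1)$, $p(y|x) = \mathcal{N}(y; x, 1)$, the "inference network" is the linear map $y \mapsto (a y + b, s)$ defining $q(x; y, \eta) = \mathcal{N}(x; a y + b, s^2)$, and plain full-batch gradient descent with hand-written gradients stands in for stochastic gradient training of a neural network.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000
x = rng.normal(0.0, 1.0, size=n)          # x ~ p(x)
y = x + rng.normal(0.0, 1.0, size=n)      # y ~ p(y|x)

# Proposal parameters eta = (a, b, log_s).
a, b, log_s = 0.0, 0.0, 0.0
lr = 0.05
for _ in range(3000):
    s2 = np.exp(2.0 * log_s)
    r = x - (a * y + b)                    # residual of the proposal mean
    # Gradients of the Monte Carlo average of -log q(x; y, eta),
    # where -log q = log_s + r^2 / (2 s^2) + const.
    grad_a = np.mean(-r * y) / s2
    grad_b = np.mean(-r) / s2
    grad_log_s = np.mean(1.0 - r**2 / s2)
    a -= lr * grad_a
    b -= lr * grad_b
    log_s -= lr * grad_log_s

learned_var = float(np.exp(2.0 * log_s))
```

For this model the exact posterior is $\mathcal{N}(y/2, 1/2)$, so the learned values should approach $a \approx 0.5$, $b \approx 0$, and a variance of about $0.5$. Training never evaluates the posterior itself, only samples from the joint.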
Amortized Monte Carlo integration (AMCI) is a framework for amortizing the cost of calculating expectations $\mu(y,\theta) := \mathbb{E}_{\pi(x;y)}[f(x;\theta)]$. Here $y$ represents changeable aspects of the reference distribution $\pi(x;y)$ (e.g. the dataset) and $\theta$ represents changeable parameters of the target function $f(x;\theta)$. The reference distribution is typically known only up to a normalization constant, i.e. $\pi(x;y) = \gamma(x;y)/Z$ where $\gamma(x;y)$ can be evaluated pointwise, but $Z$ is unknown. AMCI can still be useful in settings where $Z$ is known, but here we can simply use its known value rather than constructing a separate estimator.
Amortization can be performed across $y$ and/or $\theta$. When amortizing over $y$, the function does not need to be explicitly parameterized; we just need to be able to evaluate it pointwise. Similarly, when amortizing over $\theta$, the reference distribution can be fixed. In fact, AMCI can be used for a parameterized set of conventional integration problems by exploiting the fact that
$$\int_{x \in \mathcal{X}} f(x;\theta)\,dx = \mathbb{E}_{\pi(x)}\left[\frac{f(x;\theta)}{\pi(x)}\right] \tag{8}$$
for any $\pi(x)$ where $\pi(x) \neq 0$ for all $x \in \mathcal{X}$ for which $f(x;\theta) \neq 0$.
For consistency of notation with the amortized inference literature, we will presume a Bayesian setting in the rest of this section, i.e. $\pi(x;y) = p(x|y)$ and $\gamma(x;y) = p(x,y)$.
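Existing amortized inference methods implicitly evaluate expectations using SNIS (or some other form of self-normalized estimator (Paige & Wood, 2016; Le et al., 2018a)), targeting the posterior as the optimal proposal $q(x;y) = p(x|y)$. Not only is this proposal suboptimal when information about the target function is available, there is also a lower bound on the accuracy the SNIS approach itself can achieve, as shown in (5).
AMCI overcomes these limitations by breaking down the overall expectation into separate components and constructing separate estimates for each. We can first break down the target expectation into the ratio of the “unnormalized expectation” and the normalization constant:
$$\mu(y,\theta) := \mathbb{E}_{p(x|y)}[f(x;\theta)] = \frac{\mathbb{E}_{q_1(x;y,\theta)}\left[\dfrac{f(x;\theta)\,p(x,y)}{q_1(x;y,\theta)}\right]}{\mathbb{E}_{q_2(x;y)}\left[\dfrac{p(x,y)}{q_2(x;y)}\right]} =: \frac{E_1}{E_2} \tag{9}$$
where $q_1(x;y,\theta)$ and $q_2(x;y)$ are two separate proposals, used respectively for each of the two expectations $E_1$ and $E_2$. We note that the proposal $q_1$ may depend not only on the observed dataset $y$, but also on the parameters of the target function $\theta$.
We can now generate separate MC estimates for $E_1$ and $E_2$, and take their ratio to estimate the overall expectation:
$$\mu(y,\theta) \approx \hat{\mu}(y,\theta) := \frac{\hat{E}_1}{\hat{E}_2}, \quad \hat{E}_1 := \frac{1}{N}\sum_{n=1}^{N} \frac{f(x'_n;\theta)\,p(x'_n,y)}{q_1(x'_n;y,\theta)}, \quad \hat{E}_2 := \frac{1}{M}\sum_{m=1}^{M} \frac{p(x^*_m,y)}{q_2(x^*_m;y)} \tag{10}$$
with $x'_n \sim q_1(x;y,\theta)$ and $x^*_m \sim q_2(x;y)$.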
The key idea behind AMCI is that we can now separately train each of these proposals to be good estimators for their respective expectation, rather than rely on a single proposal to estimate both, as is implicitly the case for SNIS.
Consider, for example, the case where $f(x;\theta) \ge 0 \;\forall x,\theta$. If $q_1(x;y,\theta) \propto f(x;\theta)\,p(x,y)$ and $q_2(x;y) \propto p(x,y)$ then both $\hat{E}_1$ and $\hat{E}_2$ will form exact estimators (as per Section 2.1), even if $N = M = 1$. Consequently, we achieve an exact estimator for $\mu(y,\theta)$, allowing for arbitrarily large improvements over any SNIS estimator, because SNIS forces $q_1$ and $q_2$ to be the same distribution.
More generally, the optimal proposals for $E_1$ and $E_2$ are $q_1(x;y,\theta) \propto |f(x;\theta)|\,p(x,y)$ and $q_2(x;y) \propto p(x,y)$ respectively, with the latter always resulting in an exact estimator for $E_2$. Thus the more $|f(x;\theta)|\,p(x|y)$ varies from $p(x|y)$, the worse the conventional approach of only amortizing the posterior will perform, while the harder it becomes to construct a reasonable SNIS estimator even when information about $f(x;\theta)$ is incorporated. Separately learning $q_1$ and $q_2$ means that each will become a better individual proposal and the overall estimator improves.
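It turns out that we do not actually require the previous assumption of $f(x;\theta) \ge 0 \;\forall x,\theta$ to achieve a zero variance estimator. Specifically, if we let
$$f^+(x;\theta) = \max(f(x;\theta), 0) \tag{11}$$
$$f^-(x;\theta) = -\min(f(x;\theta), 0) \tag{12}$$
denote truncations of the target function into its positive and negative components (as per the concept of positivisation (Owen, 2013, 9.13)), then we can break down the overall expectation as follows
$$\mu(y,\theta) = \frac{\mathbb{E}_{q_1^+(x;y,\theta)}\left[\dfrac{f^+(x;\theta)\,p(x,y)}{q_1^+(x;y,\theta)}\right] - \mathbb{E}_{q_1^-(x;y,\theta)}\left[\dfrac{f^-(x;\theta)\,p(x,y)}{q_1^-(x;y,\theta)}\right]}{\mathbb{E}_{q_2(x;y)}\left[\dfrac{p(x,y)}{q_2(x;y)}\right]} =: \frac{E_1^+ - E_1^-}{E_2} \tag{13}$$
where we now have three expectations and three proposals. Analogously to (10), we can construct estimates for each expectation separately and then combine them:
$$\mu(y,\theta) \approx \hat{\mu}(y,\theta) := \frac{\hat{E}_1^+ - \hat{E}_1^-}{\hat{E}_2} \tag{14}$$
where $\hat{E}_1^+ := \frac{1}{N}\sum_{n=1}^{N} \frac{f^+(x_n^+;\theta)\,p(x_n^+,y)}{q_1^+(x_n^+;y,\theta)}$ with $x_n^+ \sim q_1^+$, $\hat{E}_1^- := \frac{1}{K}\sum_{k=1}^{K} \frac{f^-(x_k^-;\theta)\,p(x_k^-,y)}{q_1^-(x_k^-;y,\theta)}$ with $x_k^- \sim q_1^-$, and $\hat{E}_2 := \frac{1}{M}\sum_{m=1}^{M} \frac{p(x_m^*,y)}{q_2(x_m^*;y)}$ with $x_m^* \sim q_2$,
which forms the AMCI estimator. The theoretical capability of this estimator is summarized in the following result, the proof for which is given in Appendix B.
Footnote 1: Practically, it may sometimes be beneficial to truncate the proposal about another point, $c$, by instead using $f^+(x;\theta) = \max(f(x;\theta) - c, 0)$ and $f^-(x;\theta) = -\min(f(x;\theta) - c, 0)$, then adding $c$ onto our final estimate.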
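Theorem 1. If the following hold for a given $\theta$ and $y$,
$$\mathbb{E}_{p(x)}\left[f^+(x;\theta)\,p(y|x)\right] < \infty \tag{15}$$
$$\mathbb{E}_{p(x)}\left[f^-(x;\theta)\,p(y|x)\right] < \infty \tag{16}$$
$$\mathbb{E}_{p(x)}\left[p(y|x)\right] < \infty \tag{17}$$
and we use the corresponding set of optimal proposals $q_1^+(x;y,\theta) \propto f^+(x;\theta)\,p(x,y)$, $q_1^-(x;y,\theta) \propto f^-(x;\theta)\,p(x,y)$, and $q_2(x;y) \propto p(x,y)$, then the AMCI estimator defined in (14) satisfies
$$\mathbb{E}\left[\hat{\mu}(y,\theta)\right] = \mu(y,\theta), \qquad \mathrm{Var}\left[\hat{\mu}(y,\theta)\right] = 0 \tag{18}$$
for any $N \ge 1$, $K \ge 1$, and $M \ge 1$, such that it forms an exact estimator for that $\{\theta, y\}$ pair.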
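The exactness claim can be illustrated on a toy problem where all three optimal proposals can be sampled directly. The sketch below is an assumption-laden illustration, not code from the paper: it uses $p(x) = \mathcal{N}(0,1)$, $p(y|x) = \mathcal{N}(y; x, 1)$, and a hypothetical target $f(x) = \mathbb{1}[x > \theta] - \mathbb{1}[x < -\theta]$, so that the optimal proposals for the positive and negative parts are the posterior truncated to the upper and lower tail, and the optimal proposal for the denominator is the posterior itself. Each term of the estimator uses a single sample.

```python
import random
from statistics import NormalDist

y_obs, theta = 1.0, 0.8
prior = NormalDist(0.0, 1.0)
posterior = NormalDist(y_obs / 2.0, 0.5 ** 0.5)    # exact p(x|y) for this model

def joint(x):
    """Unnormalized target p(x, y) = p(x) p(y|x)."""
    return prior.pdf(x) * NormalDist(x, 1.0).pdf(y_obs)

def f(x):
    """Hypothetical target function taking both signs."""
    if x > theta:
        return 1.0
    if x < -theta:
        return -1.0
    return 0.0

upper = 1.0 - posterior.cdf(theta)     # posterior mass where f = +1
lower = posterior.cdf(-theta)          # posterior mass where f = -1

def amci_single_sample(rng):
    """AMCI estimate with one sample from each of the three optimal proposals."""
    x_pos = posterior.inv_cdf(rng.uniform(posterior.cdf(theta), 1.0))  # x ~ q1+
    x_neg = posterior.inv_cdf(rng.uniform(0.0, lower))                 # x ~ q1-
    x_den = rng.gauss(posterior.mean, posterior.stdev)                 # x ~ q2
    e1_pos = max(f(x_pos), 0.0) * joint(x_pos) / (posterior.pdf(x_pos) / upper)
    e1_neg = max(-f(x_neg), 0.0) * joint(x_neg) / (posterior.pdf(x_neg) / lower)
    e2 = joint(x_den) / posterior.pdf(x_den)
    return (e1_pos - e1_neg) / e2

rng = random.Random(0)
mu_exact = upper - lower               # E_{p(x|y)}[f(x)]
estimates = [amci_single_sample(rng) for _ in range(5)]
```

All entries of `estimates` agree with `mu_exact` up to floating-point rounding, regardless of which samples were drawn. In practice the optimal proposals are unavailable and are replaced by learned approximations, so the estimator is no longer exact.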
Though our primary motivation for developing the AMCI estimator is its attractive properties in an amortization setting, we note that it may still be of use in static expectation calculation settings. Namely, the fact that it can achieve an arbitrarily low mean squared error for a given number of samples means it forms an attractive alternative to SNIS more generally, particularly when we are well-placed to hand-craft highly effective proposals and in adaptive importance sampling settings.
We note that individual elements of this estimator have previously appeared in the literature. For example, the general concept of using multiple proposals has been established in the context of multiple importance sampling (Veach & Guibas, 1995). The use of two separate proposals for the unnormalized target and the normalizing constant (i.e. (10)), on the other hand, was recently independently suggested by Lamberti et al. (2018) in a non-amortized setting. However, we believe that the complete form of the AMCI estimator in (14) has not previously been suggested, nor its theoretical benefits or amortization considered.
To evaluate (14), we need to learn three amortized proposals $q_1^+(x;y,\theta)$, $q_1^-(x;y,\theta)$, and $q_2(x;y)$.
Learning $q_2(x;y)$ is equivalent to the standard inference amortization problem and so we will just use the objective given by (6), as described in Section 2.2.
The approaches for learning $q_1^+$ and $q_1^-$ are equivalent, other than the function that is used in the estimators. Therefore, for simplicity, we introduce our amortization procedure in the case where $f(x;\theta) \ge 0 \;\forall x,\theta$, such that we need only learn a single proposal, $q_1(x;y,\theta)$, for the numerator as per (10). This trivially extends to the full AMCI setup by separately repeating the same training procedure for $q_1^+$ and $q_1^-$.
We first consider the scenario where $f(x)$ is fixed (i.e. we are not amortizing over function parameters $\theta$) and hence in this section we drop the dependence of $q_1$ on $\theta$.
To learn the parameters $\eta$ for the first amortized proposal $q_1(x;y,\eta)$, we need to adjust the target in (6) to incorporate the effect of the target function. Let $E_1(y) := \mathbb{E}_{p(x)}[f(x)\,p(y|x)]$ and $g(x|y) := f(x)\,p(x,y)/E_1(y)$, i.e. the normalized optimal proposal for $q_1$. Naively adjusting (6) leads to a double intractable objective
$$J'_1(\eta) = \mathbb{E}_{p(y)}\left[D_{\mathrm{KL}}\left[g(x|y)\,\|\,q_1(x;y,\eta)\right]\right] = \mathbb{E}_{p(y)}\left[-\int \frac{f(x)\,p(x,y)}{E_1(y)}\,\log q_1(x;y,\eta)\,dx\right] + \text{const wrt } \eta \tag{19}$$
Here the double intractability comes from the fact we do not know $E_1(y)$ and, at least at the beginning of the training process, we cannot estimate it efficiently either.
To address this, we use our previous observation that the expectation over $y$ in the above objective is chosen somewhat arbitrarily. Namely, it dictates the relative priority of different datasets during training and not the optimal proposal for each individual datapoint; disregarding the finite capacity of the network, the global optimum is still always $D_{\mathrm{KL}}\left[g(x|y)\,\|\,q_1(x;y,\eta)\right] = 0 \;\forall y$. We thus maintain a well-defined objective if we choose a different reference distribution over datasets. In particular, if we take the expectation with respect to $h(y) \propto E_1(y) = p(y)\,\mathbb{E}_{p(x|y)}[f(x)]$, we get
$$J_1(\eta) = \mathbb{E}_{h(y)}\left[D_{\mathrm{KL}}\left[g(x|y)\,\|\,q_1(x;y,\eta)\right]\right] = c^{-1}\,\mathbb{E}_{p(x,y)}\left[-f(x)\,\log q_1(x;y,\eta)\right] + \text{const wrt } \eta \tag{20}$$
where $c > 0$ is a positive constant that does not affect the optimization (it is the normalization constant for the distribution $h(y)$) and can thus be ignored. Each term in this expectation can now be evaluated directly, meaning we can again run stochastic gradient descent algorithms to optimize it. Note that this does not require evaluation of the density $p(x,y)$, only the ability to draw samples.
Interestingly, this choice of $h(y)$ can be interpreted as giving larger importance to the values of $y$ which yield larger $E_1(y)$. Informally, we could think about this choice as attempting to minimize the L1 errors of our estimates of $E_1(y)$, presuming that the error in our estimation scales as the magnitude of the true value $E_1(y)$.
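More generally, if we choose $h(y) \propto E_1(y)\,\lambda(y)$ for some positive evaluable function $\lambda : \mathcal{Y} \to \mathbb{R}^+$, we get a tractable objective of the form
$$J_1(\eta;\lambda) = \mathbb{E}_{p(x,y)}\left[-f(x)\,\lambda(y)\,\log q_1(x;y,\eta)\right]$$
up to a constant scaling factor and offset. We can thus use this trick to adjust the relative preference given to different datasets, while ensuring the objective is tractable.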
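A target-aware proposal can be fitted by weighting each joint sample's negative log proposal density by the value of the target function at that sample. The sketch below illustrates this on an assumed toy problem: $p(x) = \mathcal{N}(0,1)$, $p(y|x) = \mathcal{N}(y; x, 1)$, a hypothetical non-negative target $f(x) = e^{x}$, and a Gaussian proposal family $\mathcal{N}(x; a y + b, s^2)$. For this family the minimizer of the weighted objective has a closed form (weighted least squares plus a weighted residual variance), which is used here in place of stochastic gradient descent purely to keep the example short.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
x = rng.normal(0.0, 1.0, size=n)          # x ~ p(x)
y = x + rng.normal(0.0, 1.0, size=n)      # y ~ p(y|x)
w = np.exp(x)                             # f(x) >= 0, the weight on each -log q term

# Minimize sum_i w_i * [-log N(x_i; a*y_i + b, s2)] over (a, b, s2).
# The mean parameters solve a weighted least-squares problem ...
design = np.stack([y, np.ones(n)], axis=1)
sqrt_w = np.sqrt(w)
(a, b), *_ = np.linalg.lstsq(design * sqrt_w[:, None], x * sqrt_w, rcond=None)
# ... and the variance is the weighted mean squared residual.
s2 = float(np.sum(w * (x - a * y - b) ** 2) / np.sum(w))
```

Here $f(x)\,p(x|y) \propto \mathcal{N}(x; (y+1)/2, 1/2)$, so the fit should approach $a \approx 0.5$, $b \approx 0.5$, $s^2 \approx 0.5$: the learned proposal is shifted towards the region where the target function is large, unlike the posterior-matching proposal with mean $y/2$.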
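As previously mentioned, AMCI also allows for amortization over parametrized functions, to account for cases in which multiple possible target functions may be of interest. We can incorporate this by using a pseudo prior $p(\theta)$ to generate example parameters during our training.
Analogously to $h(y)$, the choice of $p(\theta)$ determines how much importance we assign to different possible functions that we would like to amortize over. Since, in practice, perfect performance is unattainable over the entire space of $\theta$, the choice of $p(\theta)$ is important and it will have an important effect on the performance of the system.
Incorporating $p(\theta)$ is straightforward: we take the expectation of the fixed target function training objective over $\theta$. In this setting, our inference network $\varphi$ needs to take $\theta$ as input when determining the parameters of $q_1$ and hence we let $q_1(x;y,\theta,\eta) := q_1(x;\varphi(y,\theta;\eta))$. If $E_1(y,\theta) := \mathbb{E}_{p(x)}[f(x;\theta)\,p(y|x)]$, $g(x|y;\theta) := f(x;\theta)\,p(x,y)/E_1(y,\theta)$, and $h(y,\theta) \propto p(\theta)\,E_1(y,\theta)$, we get an objective which is analogous to (20):
$$J_1(\eta) = \mathbb{E}_{h(y,\theta)}\left[D_{\mathrm{KL}}\left[g(x|y;\theta)\,\|\,q_1(x;y,\theta,\eta)\right]\right] = c^{-1}\,\mathbb{E}_{p(x,y)\,p(\theta)}\left[-f(x;\theta)\,\log q_1(x;y,\theta,\eta)\right] + \text{const wrt } \eta \tag{21}$$
where $c > 0$ is again a positive constant that does not affect the optimization.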
If $f(x;\theta)$ and $p(x)$ are mismatched, i.e. $f(x;\theta)$ is large in regions where $p(x)$ is low, training by naïvely sampling from $p(x)$ can be inefficient. Instead, it is preferable to try and sample from $g(\theta,x) \propto p(x)\,p(\theta)\,f(x;\theta)$. Though this is itself an intractable distribution, it represents a standard, rather than an amortized, inference problem and so it is much more manageable than the overall training. Namely, as the samples do not depend on the proposal we are learning or the datasets, we can carry out this inference process as a pre-training step that is substantially less costly than the problem of training the inference networks itself.
One approach is to construct an MCMC sampler targeting $g(\theta,x)$ to generate the samples, which can be done upfront before training. Another is to use an importance sampler
$$J_1(\eta) = c^{-1}\,\mathbb{E}_{q'(\theta,x)\,p(y|x)}\left[-\frac{p(\theta)\,p(x)\,f(x;\theta)}{q'(\theta,x)}\,\log q_1(x;y,\theta,\eta)\right] + \text{const wrt } \eta \tag{22}$$
where $q'(\theta,x)$ is a proposal as close to $g(\theta,x)$ as possible.
In the case of non-parameterized functions $f(x)$, there is no need to take an expectation over $p(\theta)$, and we instead desire to sample from $g(x) \propto p(x)\,f(x)$.
Figure caption (fragment): [...] is often dominated by a few large outliers. The dashed line shows the median of $\delta(y,\theta)$ with the $q$ corresponding to the ReMSE optimal SNIS estimator, namely the lower bound as per (5), which is itself estimated (with only nominal error) by sampling. We note that the error for SNIS with proposal $q_2$ is to a large extent flat because there is not a single sample in the estimator for which $f(x;\theta) > 0$, such that they return $\hat{\mu}(y,\theta) = 0$ and hence give $\delta(y,\theta) = 1$. In panel (b) the SNIS $q_m$ line reaches extremely high ReMSE values at small $N$ and the y-axis limits have been readjusted to allow clear comparison at higher $N$. This effect is caused by the bias of SNIS $q_m$: these extremely high errors arise when all $N$ samples happen to be drawn from distribution $q_1$; for further explanation and the full picture see Figure 5 in Appendix A.
Even though AMCI is theoretically able to achieve exact estimators with a finite number of samples, this will rarely be the case for practical problems, for which learning perfect proposals is not typically realistic, particularly in amortized contexts (Cremer et al., 2018). It is therefore necessary to test its empirical performance to assert that gains are possible with inexact proposals. To this end, we investigate AMCI’s performance on two illustrative examples.
Our primary baseline is the SNIS approach implicitly used by most existing inference amortization methods, namely the SNIS estimator with proposal $q_2(x;y)$. Though this effectively represents the previous state-of-the-art in amortized expectation calculation, it turns out to be a very weak baseline. We, therefore, introduce another simple approach one could hypothetically consider using: training separate proposals as per AMCI, but then using these to form a mixture distribution proposal for an SNIS estimator. For example, in the scenario where $f(x;\theta) \ge 0 \;\forall x,\theta$ (such that we only need to learn two proposals), we can use
$$q_m(x;y,\theta) = \tfrac{1}{2}\,q_1(x;y,\theta) + \tfrac{1}{2}\,q_2(x;y) \tag{23}$$
as an SNIS proposal that takes into account the needs of both $E_1$ and $E_2$. We refer to this method as the mixture SNIS estimator and emphasize that it represents a novel amortization approach in its own right.
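A mixture SNIS estimator of this kind can be sketched as follows. The setup is assumed for illustration only: a one-dimensional tail probability with target $f(x) = \mathbb{1}[x > \theta]$, a Gaussian posterior standing in for a perfectly learned posterior proposal, the posterior truncated to the tail standing in for a perfectly learned target-aware proposal, and equal mixture weights (an assumption of this sketch).

```python
import random
from statistics import NormalDist

rng = random.Random(0)
posterior = NormalDist(0.5, 0.5 ** 0.5)    # stands in for the posterior proposal
theta = 3.0
tail = 1.0 - posterior.cdf(theta)          # exact value of E[f(x)] for this toy problem

def f(x):
    return 1.0 if x > theta else 0.0

def q1_pdf(x):
    """Target-aware proposal: posterior truncated to x > theta."""
    return posterior.pdf(x) / tail if x > theta else 0.0

def qm_pdf(x):
    """Equal-weight mixture of the two proposals."""
    return 0.5 * q1_pdf(x) + 0.5 * posterior.pdf(x)

def sample_qm():
    """Pick a component with probability 1/2, then sample from it."""
    if rng.random() < 0.5:
        return posterior.inv_cdf(rng.uniform(posterior.cdf(theta), 1.0))
    return rng.gauss(posterior.mean, posterior.stdev)

n = 20_000
xs = [sample_qm() for _ in range(n)]
# Unnormalized target: the factor 3.0 is arbitrary and mimics an unknown normalizer.
ws = [3.0 * posterior.pdf(x) / qm_pdf(x) for x in xs]
mu_hat = sum(f(x) * w for x, w in zip(xs, ws)) / sum(ws)
```

Roughly half of the samples land in the tail, so the estimate is usable even though the tail probability is of order $10^{-4}$. Unlike the three-proposal estimator, however, it remains a self-normalized estimator and therefore retains a nonzero error for any finite sample size.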
We also compare AMCI to the theoretically optimal SNIS estimator, i.e. the error bound given by (5). As we will show, AMCI is often able to empirically outperform this bound, thereby giving better performance than any approach based on SNIS, whether that approach is amortized or not. This is an important result and, in particular, it highlights that the potential significance of the AMCI estimator extends beyond the amortized setting we consider here.
We further consider using SNIS with proposal $q_1$. However, this transpires to perform extremely poorly throughout (far worse than $q_2$) and so we omit its results from the main paper, giving them in Appendix A.
In all experiments, we use the same number of samples from each proposal to form the estimate (i.e. $N = M = K$).
An implementation for AMCI and our experiments is available at http://github.com/talesa/amci.
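We start with the conceptually simple problem of calculating tail integrals for Gaussian distributions, namely
$$p(x) = \mathcal{N}(x; 0, \Sigma_1), \quad p(y|x) = \mathcal{N}(y; x, \Sigma_2), \quad f(x;\theta) = \prod_{i=1}^{D} \mathbb{1}_{x_i > \theta_i}, \quad p(\theta) = \mathrm{Uniform}(\theta; [0, u_D]^D) \tag{24}$$
where $D$ is the dimensionality, we set $\Sigma_2 = I$, and $\Sigma_1$ is a fixed covariance matrix (for details see Appendix C).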
This problem was chosen because it permits easy calculation of the ground truth expectations by exploiting analytic simplifications, while remaining numerically challenging for values of $\theta$ far away from the mean when we do not use these simplifications. We performed one- and five-dimensional variants of the experiment.
We use normalizing flows (Rezende & Mohamed, 2015) to construct our proposals, providing a flexible and powerful means of representing the target distributions. Details are given in Appendix C. Training was done by using importance sampling to generate the values of and as per (22) with .
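To evaluate AMCI and our baselines we use the relative mean squared error (ReMSE) $\delta(y,\theta) = \mathbb{E}[\hat{\delta}(y,\theta)]$, where
$$\hat{\delta}(y,\theta) = \frac{\left(\mu(y,\theta) - \hat{\mu}(y,\theta)\right)^2}{\mu(y,\theta)^2} \tag{25}$$
and $\hat{\mu}(y,\theta)$ is our estimate for $\mu(y,\theta)$. We then consider summary statistics across different $\{y,\theta\}$, such as its median when $y,\theta \sim p(y)\,p(\theta)$. In calculating this, $\delta(y,\theta)$ was separately estimated for each value of $N$ and $\{y,\theta\}$ using repeated samples of $\hat{\delta}(y,\theta)$ (i.e. repeated realizations of the estimator).
Footnote 2: Variability in $\delta(y,\theta)$ between different instances of $\{y,\theta\}$ is considered in Figures 7 and 8 in Appendix A.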
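Estimating a relative mean squared error by averaging over repeated realizations of an estimator can be sketched as below. The helper name `remse`, the toy estimator, and the number of repetitions are assumptions made for illustration; the example uses a plain Monte Carlo mean of $N = 10$ draws from $\mathcal{N}(2, 1)$, for which the ReMSE is known to be $1/(N \mu^2) = 0.025$.

```python
import numpy as np

def remse(estimator, mu, n_rep, rng):
    """Average of (mu - mu_hat)^2 / mu^2 over n_rep independent realizations."""
    sq_rel_err = [(mu - estimator(rng)) ** 2 / mu ** 2 for _ in range(n_rep)]
    return float(np.mean(sq_rel_err))

mu = 2.0
n = 10
mc_mean = lambda rng: rng.normal(mu, 1.0, size=n).mean()   # toy estimator of mu

rng = np.random.default_rng(0)
remse_mc = remse(mc_mean, mu, 20_000, rng)
remse_exact = remse(lambda rng: mu, mu, 10, rng)           # an exact estimator has zero ReMSE
```

Normalizing by $\mu^2$ makes errors comparable across problem instances whose true expectations differ by orders of magnitude, which matters for tail integrals.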
As shown in Figure 1, AMCI outperformed SNIS in both the one- and five-dimensional cases. For the one-dimensional example, AMCI significantly outperformed all of SNIS $q_2$, SNIS $q_m$, and the theoretically optimal SNIS estimator. SNIS $q_2$, the approach implicitly taken by existing inference amortization methods, typically failed to place even a single sample in the tail of the distribution, even for large $N$. Interestingly, SNIS $q_m$ closely matched the theoretical SNIS bound, suggesting that this amortized proposal is very close to the theoretically optimal one. However, this still constituted significantly worse performance than AMCI, which needed far fewer samples to achieve the same relative error, demonstrating the ability of AMCI to outperform the best possible SNIS estimator.
For the five-dimensional example, AMCI again significantly outperformed our main baseline SNIS $q_2$. Though it still also outperformed SNIS $q_m$, its advantage was less than in the one-dimensional case, and it did not outperform the SNIS theoretical bound. SNIS $q_m$ itself did not match the bound as closely as in the one-dimensional example either, suggesting that the proposals learned were worse than in the one-dimensional case. Further comparisons based on using the mean squared error (instead of ReMSE) are given in Appendix A and show qualitatively similar behavior.
To demonstrate how AMCI might be used in a more real-world scenario, we now consider an illustrative example relating to cancer diagnostic decisions. Imagine that an oncologist is trying to decide whether to administer a treatment to a cancer patient. Because the treatment is highly invasive, they only want to administer it if there is a realistic chance of it being successful, i.e. that the tumor shrinks sufficiently to allow a future operation to be carried out. However, they are only able to make noisy observations about the current size of the tumor, and there are various unknown parameters pertaining to its growth, such as the patient's predisposition to the treatment. To aid in the oncologist's decision, the clinic provides a simulator of tumor evolution, a model of the latent factors required for this simulator, and a loss function for administering the treatment given the final tumor size. We wish to construct an amortization of this simulator, so that we can quickly and directly predict the expected loss function for administering the treatment from a pair of noisy observations of the tumor size taken at separate points in time. A detailed description of the model and proposal setup is in Appendix C.3.
To evaluate the learned proposals we followed the same procedure as for the tail integral example. Results are presented in Figure 2. AMCI again significantly outperformed the literature baseline of SNIS $q_2$, which needed many more samples to reach the level of relative error that AMCI achieved. AMCI further maintained an advantage over SNIS $q_m$, which itself again closely matched the optimal SNIS estimator. Further comparisons are given in Appendix A and show qualitatively similar behavior.
In all experiments AMCI performed better than SNIS with either $q_2$ or $q_m$ for its proposal. Moreover, it is clear that AMCI is indeed able to break the theoretical bound on the achievable performance of SNIS estimators: in some cases AMCI is outperforming the best achievable error by any SNIS estimator, regardless of the proposal the latter uses. Interestingly, the mixture SNIS estimator we also introduce proved to be a strong baseline as it closely matched the theoretical baseline in both experiments. However, such an effective mixture proposal is only possible thanks to learning the multiple inference artifacts we suggest as part of the AMCI framework, while its performance was still generally inferior to AMCI itself.
We now consider the question of when we expect AMCI to work particularly well compared to SNIS, and the scenarios where it is less beneficial, or potentially even harmful. We first note that scaling with increasing dimensionality is a challenge for both because the importance sampling upon which they rely suffers from the curse of dimensionality. However, the scaling of AMCI should be no worse than existing amortization approaches as each of the amortized proposals is trained in isolation and corresponds to a conventional inference amortization.
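We can gain more insights into the relative performance of the two approaches in different settings using an informal asymptotic analysis in the limit of a large number of samples. Assuming $f(x;\theta) \ge 0 \;\forall x,\theta$ for simplicity (Footnote 3: the results trivially generalize to general $f(x;\theta)$ with suitable adjustment of the definition of $\hat{E}_1$), both AMCI and SNIS can be expressed in the form of (10), where for SNIS we set $q_1 = q_2$, $N = M$, and share samples between the estimators. Separately applying the central limit theorem to $\hat{E}_1$ and $\hat{E}_2$ yields
$$\hat{\mu} \to \frac{E_1 + \sigma_1 \xi_1}{E_2 + \sigma_2 \xi_2} \quad \text{as } N, M \to \infty \tag{26}$$
where $\xi_1, \xi_2 \sim \mathcal{N}(0,1)$ and
$$\sigma_1^2 := \frac{1}{N}\,\mathrm{Var}_{q_1(x;y,\theta)}\left[\frac{f(x;\theta)\,p(x,y)}{q_1(x;y,\theta)}\right] \tag{27}$$
$$\sigma_2^2 := \frac{1}{M}\,\mathrm{Var}_{q_2(x;y)}\left[\frac{p(x,y)}{q_2(x;y)}\right] \tag{28}$$
Asymptotically, the mean squared error of $\hat{\mu}$ is dominated by its variance. Thus, by taking a first order Taylor expansion of $1/(E_2 + \sigma_2 \xi_2)$ about $1/E_2$, we get, for large $N, M$,
$$\mathbb{E}\left[(\hat{\mu} - \mu)^2\right] \approx \frac{\sigma_2^2}{E_2^2}\left[\left(\kappa - \mathrm{Corr}[\xi_1,\xi_2]\,\mu\right)^2 + \left(1 - \mathrm{Corr}[\xi_1,\xi_2]^2\right)\mu^2\right] \tag{29}$$
where the approximation from the Taylor expansion becomes exact in the limit $N, M \to \infty$ and $\kappa := \sigma_1/\sigma_2$ is a measure of the relative accuracy of the two estimators. See (43) in Appendix D.1 for a more verbose derivation.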
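The first-order expansion behind this kind of result can be sketched as follows. This is a hedged reconstruction, not the derivation from Appendix D.1: it assumes the central-limit form $\hat{\mu} \approx (E_1 + \sigma_1\xi_1)/(E_2 + \sigma_2\xi_2)$ with $\xi_1, \xi_2$ standard normal and $\mu = E_1/E_2$, and the shorthands $\kappa := \sigma_1/\sigma_2$ and $\rho := \mathrm{Corr}[\xi_1,\xi_2]$ are labels assumed here for illustration.

```latex
\begin{align*}
\hat{\mu}
  &\approx \frac{E_1 + \sigma_1 \xi_1}{E_2}\left(1 - \frac{\sigma_2 \xi_2}{E_2}\right)
   \approx \mu + \frac{\sigma_1 \xi_1}{E_2} - \mu\,\frac{\sigma_2 \xi_2}{E_2}
   && \text{(dropping the second-order term in } \xi_1 \xi_2\text{)} \\
\mathrm{Var}[\hat{\mu}]
  &\approx \frac{\sigma_1^2 - 2\rho\,\mu\,\sigma_1\sigma_2 + \mu^2\sigma_2^2}{E_2^2}
   = \frac{\sigma_2^2}{E_2^2}\left(\kappa^2 - 2\rho\,\kappa\,\mu + \mu^2\right) \\
  &= \frac{\sigma_2^2}{E_2^2}\left[(\kappa - \rho\,\mu)^2 + (1 - \rho^2)\,\mu^2\right]
\end{align*}
```

Two special cases help with interpretation. With independent estimators ($\rho = 0$) the bracket reduces to $\kappa^2 + \mu^2$, so the error is driven down by making both estimators accurate. With perfectly correlated estimators and $\kappa = \mu$ the bracket vanishes, which is the cancelling effect available to self-normalized estimators when numerator and denominator targets are well matched.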
For a given value of $\sigma_2$, the value of $\kappa$ for SNIS is completely dictated by the problem: in general, the larger the mismatch between $f(x;\theta)\,p(x|y)$ and $p(x|y)$, the larger $\kappa$ will be. This yields the expected result that the errors for SNIS become large in this setting. For AMCI, we can control $\kappa$ through ensuring a good proposal for both $E_1$ and $E_2$, and, if desired, by adjusting $N$ and $M$ (relative to a fixed budget $N + M$). Consequently, we can achieve better errors than SNIS by driving $\kappa$ down.
On the other hand, as $f(x;\theta)\,p(x|y)$ and $p(x|y)$ become increasingly well matched, then $\kappa \to \mu$ and we find that AMCI has little to gain over SNIS. In fact, we see that AMCI can potentially be worse than SNIS in this setting: when $f(x;\theta)\,p(x|y)$ and $p(x|y)$ are closely matched, we also have $\mathrm{Corr}[\xi_1,\xi_2] \to 1$ for SNIS, such that we observe a canceling effect, potentially leading to very low errors. Achieving $\mathrm{Corr}[\xi_1,\xi_2] \approx 1$ can be more difficult for AMCI, potentially giving rise to a higher error. However, it could be possible to mitigate this by correlating the estimates, e.g. through common random numbers.
To assess if this theory manifests in practice, we revisit our tail integral example, comparing large and small mismatch scenarios. The results, shown in Figure 3, agree with these theoretical findings. In Appendix D we further show that reusing samples for both $E_1$ and $E_2$ in AMCI can be beneficial when the targets are well matched.
More generally, as Theorem 1 tells us that the AMCI estimator can achieve an arbitrarily low error for any given target function, while SNIS cannot, we know that its potential gains are larger the more accurate we are able to make our proposals. As such, as advances elsewhere in the field allow us to produce increasingly effective amortized proposals, e.g. through advanced normalizing flow approaches (Grathwohl et al., 2019; Kingma & Dhariwal, 2018), the larger the potential gains are from using AMCI.
We would like to thank Yee Whye Teh for providing helpful discussions at the early stages of the project. AG is supported by the UK EPSRC CDT in Autonomous Intelligent Machines and Systems. FW is supported by DARPA D3M, under Cooperative Agreement FA8750-17-2-0093, Intel under its LBNL NERSC Big Data Center, and an NSERC Discovery grant. TR is supported by the European Research Council under the European Union’s Seventh Framework Programme (FP7/2007–2013) / ERC grant agreement no. 617071. His research leading to these results also received funding from EPSRC under grant EP/P026753/1.
Cremer et al., 2018. Inference suboptimality in variational autoencoders. Proceedings of the International Conference on Machine Learning (ICML), 2018.
Lacoste-Julien et al., 2011. Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), 2011.
Rezende et al., 2014. Stochastic backpropagation and approximate inference in deep generative models. Proceedings of the International Conference on Machine Learning (ICML), 2014.
Wolpert, 1991. Monte Carlo integration in Bayesian statistical analysis. Contemporary Mathematics, 115:101–116, 1991.
Proof of Theorem 1.
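The result follows straightforwardly from considering each estimator in isolation. Note that the normalization constants for distributions $q_1^+, q_1^-, q_2$ are $E_1^+, E_1^-, E_2$, respectively, e.g. $q_1^+(x;y,\theta) = f^+(x;\theta)\,p(x,y)/E_1^+$. Therefore, starting with $\hat{E}_1^+$, we have
$$\hat{E}_1^+ = \frac{1}{N}\sum_{n=1}^{N} \frac{f^+(x_n^+;\theta)\,p(x_n^+,y)}{q_1^+(x_n^+;y,\theta)} = \frac{1}{N}\sum_{n=1}^{N} E_1^+ = E_1^+ \tag{30}$$
for all possible values of $x_n^+$. Similarly, for $\hat{E}_1^-$
$$\hat{E}_1^- = \frac{1}{K}\sum_{k=1}^{K} \frac{f^-(x_k^-;\theta)\,p(x_k^-,y)}{q_1^-(x_k^-;y,\theta)} = \frac{1}{K}\sum_{k=1}^{K} E_1^- = E_1^- \tag{31}$$
for all possible values of $x_k^-$. Analogously, we have $\hat{E}_2 = E_2$ for all possible values of $x_m^*$. Combining all of the above, the result now follows. ∎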
Let us recall the model from (24),