1 Introduction
Probabilistic models provide a framework for describing abstract prior knowledge and using it to reason under uncertainty. Probabilistic programs are a powerful tool for probabilistic modeling. A probabilistic programming language (PPL) is a deterministic programming language augmented with random sampling and Bayesian conditioning operators. Performing inference on these programs then involves reasoning about the space of executions which satisfy some constraints, such as observed values. A universal PPL, one built on a Turing-complete language, can represent any computable probability distribution, including open-world models, Bayesian nonparametrics, and stochastic recursion Church ; Venture ; Anglican .
If we consider a probabilistic program to define a distribution p(x, y), where x are (latent) intermediate variables and y are (observed) output data, then sampling from this distribution is easy: just run the program forward. However, computing the posterior distribution p(x | y) is hard, involving an intractable integral. Typically, PPLs provide means to approximate the posterior using Monte Carlo methods (e.g. MCMC, SMC), dynamic programming, or analytic computation.
These inference methods are expensive because they (approximately) solve an intractable integral from scratch on every separate invocation. But many inference problems have shared structure: it is reasonable to expect that computing p(x | y_1) should give us some information about how to compute p(x | y_2). In fact, there is reason to believe that this is how people are able to perform certain inferences, such as visual perception, so quickly: we have perceived the world many times before, and can leverage that accumulated knowledge when presented with a new perception task AmortizedInference . This idea of using the results of previous inferences, or precomputation in general, to make later inferences more efficient is called amortized inference AmortizedInference ; StochasticInverses .
Learning a generative model from many data points is a particularly important task that leads to many related inferences. One wishes to update global beliefs about the true generative model from individual data points (or batches of data points). While many algorithms are possible for this task, they all require some form of 'parsing' for each data point: doing posterior inference in the current generative model to guess values of local latent variables given each observation. Because this local parsing inference is needed many, many times, it is a good candidate for amortization. It is plausible that learning to do local inference via amortization would support faster and better global learning, which gives more useful local inferences, leading to a virtuous cycle.
This paper proposes a system for amortized inference in PPLs, and applies it to model learning. Instead of computing p(x | y) from scratch for each y, our system instead constructs a program q(x | y) which takes y as input and, when run forward, produces samples distributed approximately according to the true posterior p(x | y). We call q a guide program, following terminology introduced in previous work GuidePrograms . The system can spend time upfront constructing a good approximation so that at inference time, sampling from q is both fast and accurate.
There is a huge space of possible programs one might consider for this task. Rather than posing the search for q as a general program induction problem (as was done in previous work GuidePrograms ), we restrict q to have the same control flow as the original program p, but a different data flow. That is, q samples the same random choices as p and in the same order, but the data flowing into those choices comes from a different computation. In our system, we represent this computation using neural networks. This design choice reduces the search for q to the much simpler continuous problem of optimizing the weights for these networks, which can be done using stochastic gradient descent.
Our system’s interface for specifying guide programs is flexible enough to subsume several popular recent approaches to variational inference, including those that perform both inference and model learning. To facilitate this common pattern we introduce the mapData construct which represents the boundary between global “model” variables and variables local to the data points. Our system leverages the independence between data points implied by mapData
to enable minibatches of data and variance reduction of gradient estimates. We evaluate our proof-of-concept system on a variety of Bayesian networks, topic models, and deep generative models.
Our system has been implemented as an extension to the WebPPL probabilistic programming language WebPPL . Its source code can be found in the WebPPL repository, with additional helper code at https://github.com/probmods/webppl-daipp.
2 Background
2.1 Probabilistic Programming Basics
For our purposes, a probabilistic program defines a generative model p(x, y) of latent variables x and data y. The model factors as:

(1)  p(x, y) = p(y | x) ∏_i p(x_i | x_1, …, x_{i-1})

The prior probability distribution p(x) decomposes as a product of conditionals p(x_i | x_1, …, x_{i-1}), one for each random choice x_i in the program. The use of x_1, …, x_{i-1} indicates that a random choice can potentially depend on any or all previous choices made by the program. p(y | x) is the likelihood of the data and need not be a proper probability distribution (i.e. unnormalized factors are acceptable). Note that x can vary in length across executions: a probabilistic program can sample a variable number of random variables.
Our system is implemented in the probabilistic programming language WebPPL, which we use for examples throughout this paper WebPPL . WebPPL is a PPL embedded in Javascript; that is, it adds sampling, conditioning, and inference operators to a purely-functional subset of Javascript. The following example program illustrates its basic features:
This program uses MCMC to compute an approximate posterior distribution over the return value of the function model. model is a simple generative model with one latent Bernoulli variable (x) and one observed Gaussian variable, which in this example is observed to have the value 0.5. The mean of the observed Gaussian variable (mu) is dependent on the value of x. Since model returns x, the result of this program is the posterior marginal distribution over the variable x. In the rest of this paper, we will build on this language, adding guide programs, amortized inference, and model-learning constructs.
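Since this toy model has one binary latent variable, its posterior can also be computed exactly by enumeration. The plain-JavaScript sketch below does so; the constants (a uniform prior on x, means 0 and 2, standard deviation 1) are illustrative assumptions, not necessarily the values used in the elided program listing.

```javascript
// Exact posterior for a toy model of the shape described above:
// a Bernoulli latent x, and a Gaussian observation y whose mean depends on x.
// Constants (prior 0.5, means 0 and 2, sigma 1) are illustrative assumptions.
function gaussianPdf(y, mu, sigma) {
  var z = (y - mu) / sigma;
  return Math.exp(-0.5 * z * z) / (sigma * Math.sqrt(2 * Math.PI));
}

function posteriorX(y) {
  var prior = 0.5;                                      // p(x = true)
  var jointTrue = prior * gaussianPdf(y, 2, 1);         // x = true  -> mu = 2
  var jointFalse = (1 - prior) * gaussianPdf(y, 0, 1);  // x = false -> mu = 0
  return jointTrue / (jointTrue + jointFalse);          // p(x = true | y)
}

var p = posteriorX(0.5);  // y = 0.5 is closer to mu = 0, so p < 0.5
```

An MCMC run of the corresponding WebPPL program would approximate this same marginal by sampling.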
2.2 Inference as Optimization: Variational Inference
Instead of approximating the posterior with a collection of samples, one could instead try to approximate it via a parameterized distribution q_φ(x) which is itself easy to sample from. This is the premise behind variational inference VariationalInference . The goal is to find parameters φ such that q_φ(x) is as close as possible to the true posterior p(x | y), where closeness is typically measured via KL-divergence.
To use variational inference, one must first choose a parameterized family of distributions; one common choice is the mean-field family:

q_φ(x) = ∏_i q(x_i ; φ_i)
This is a fully-factored distribution: it approximates the true posterior as an independent product of parameterized marginals, one for each random variable. Several existing general-purpose variational inference systems use this scheme AVIPP ; BBVI . This family is easy to work with, but it does not capture any of the dependencies between variables that may occur in the true posterior. This limitation is often acceptable because q_φ is defined relative to a particular observation set y, and thus the parameters φ are re-optimized for each new y. Thus, while this scheme provides an alternative to Monte Carlo methods (e.g. MCMC) that can be faster and more reliable, it still solves each inference problem from scratch.
2.3 Amortized (Variational) Inference
As mentioned in Section 1, amortized inference refers to the use of previous inference solutions (or other precomputation) to solve subsequent inference problems faster. There is experimental evidence that people leverage experience from prior inference tasks when asked to solve related ones AmortizedInference . This idea has inspired research into developing amortized inference systems for Bayesian networks StochasticInverses ; NeuralStochasticInverses . These systems model p(x | y) by inverting the network topology and attempting to learn the local conditional distributions of this inverted graphical model.
Amortized inference can also be achieved through variational inference. Instead of defining a parametric family q_φ(x) which is specific to a given y, we can instead define a general family q_φ(x | y) which is conditional on y; that is, it takes y as input. In this setting, the mean field family no longer applies, as the factors of q must now be functions of y. However, we can extend the mean field family to handle input data by using neural networks (or other 'side computations'):

q_φ(x | y) = ∏_i q(x_i ; nn_i(y; φ))
Here, the parameters of each local conditional in the guide are computed via a neural network function nn_i(y; φ) of the input data y. This variational family supports amortized inference: one can invest time upfront optimizing the neural network weights such that q_φ(x | y) is close to p(x | y). When given a never-before-seen y, the guide program forwards y through the trained networks for fast inference. Several recent approaches to 'neural variational inference' use some instantiation of this design pattern NVIL ; DLGM ; AEVB .
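The essence of this pattern can be sketched in a few lines of plain JavaScript: a tiny one-layer "network" (a single weight and bias, both hypothetical placeholder values here) maps the observation y to the success probability of a Bernoulli guide distribution. In a real system these weights would be optimized rather than fixed.

```javascript
// Minimal sketch of a neural guide: the guide's Bernoulli parameter is a
// learned function of the observation y. The weights are illustrative
// placeholders; in practice they are trained to make q(x | y) close to
// the true posterior p(x | y).
function sigmoid(z) { return 1 / (1 + Math.exp(-z)); }

// phi = {w, b} are the guide parameters to be optimized.
function guideParam(phi, y) {
  return sigmoid(phi.w * y + phi.b);  // success probability for q(x | y)
}

var phi = { w: 1.5, b: -0.5 };
var p = guideParam(phi, 0.5);  // guide's predicted P(x = true | y = 0.5)
```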
Such neural guide families are easy to express in our extensions of WebPPL. Our system also allows generalizations of this pattern, such as providing neural nets with previously-made random choices as additional input:
Here, x_{<i} are the random choices made before choice x_i is sampled. Such guide families have the potential to capture posterior dependencies between latent variables.
2.4 Variational Model Learning
The amortized variational inference framework presented above can also be used to learn the parameters of the generative model p(x, y). If p is also parameterized, i.e. p_θ(x, y), then its parameters θ can be optimized along with the parameters φ of the guide program NVIL ; DLGM ; AEVB .
Our system supports learning generative model parameters in addition to guide parameters. In the PPL setting, it is natural to think of this as a particular model pattern in which there are global parameters or random choices that affect a local 'observation model', which in turn is assumed to generate each data point independently; we call this the mapData model pattern. We will show below how this pattern makes it easy to do (regularized) maximum-likelihood learning, variational Bayes, or even different methods within the same model.
3 Specifying Guide Programs
In this section, we describe language extensions to WebPPL that allow for the specification of guide programs. For now, we focus on manually-specified guide programs. In Section 7, we build on this interface to automatically derive guide programs.
3.1 Sampling with Guide Distributions
Previously, we mentioned that our system restricts guide programs to have the same control flow as the original program p, meaning that the guide program samples the same variables in the same order. Our implementation enforces this restriction by defining the guide program inline with the regular program. At each sample statement, in addition to the distribution that the program samples from, one can also specify what distribution the guide program should sample from. For example, using the simple program from Section 2.1:
In this example, the guide samples from a Bernoulli with a different success probability p. This particular value happens to give the true posterior for this program, since this toy problem is simple enough to solve in closed form. We note that the guide distribution need not be of the same family as the prior distribution; we will see later how this property can be useful.
3.2 Declaring Optimizable Parameters
In real problems, we will not know the optimal guides a priori and will instead want to learn guides by specifying guide distributions with tunable parameters:
Here, param({name: nm}) declares an optimizable, real-valued parameter named nm; as we'll see later, this function can also take a dims argument specifying the dimensions of a vector, matrix, or tensor-valued parameter. Since the standard deviation sigma of the Gaussian guide distribution must be positive, we use the softplus function (x ↦ log(1 + eˣ)) to map the unbounded value returned by param to the positive reals; our system includes similar transforms for parameters with other domains (e.g. sigmoid for parameters defined over the interval (0, 1)). Parameters must be named so they can be disambiguated by the optimization engine.

Using variational parameters directly as the guide distribution parameters (as done above) results in a mean field approximation for the variable x, as mentioned in Section 2.2. We can also compute the guide parameters via a neural network:
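The softplus transform mentioned above is simple to state directly. A numerically stable version (the max/abs rearrangement avoids overflow for large inputs) can be sketched as:

```javascript
// softplus maps an unconstrained real to a positive number, so an
// unconstrained optimizable parameter can serve as a standard deviation.
// The max(z, 0) + log(1 + exp(-|z|)) form is algebraically equal to
// log(1 + exp(z)) but stays stable for large |z|.
function softplus(z) {
  return Math.max(z, 0) + Math.log(1 + Math.exp(-Math.abs(z)));
}

var sigma = softplus(-3);  // always > 0, even for negative raw parameters
```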
Explicitly declaring parameters and defining the structure of large neural networks can become verbose, so we can instead use the adnn neural net library (https://github.com/dritchie/adnn) to include neural nets in our programs:
In this case, the nn.mlp constructor has created a guideNet object with its own parameters; these parameters are registered with the optimization engine when nnEval is called.
3.3 Iterating over Observed Data
The examples thus far have conditioned on a single observation. But real models condition on multiple observations. Our system expresses this pattern with the mapData function:
mapData operates much like map in a typical functional programming language, but it has two important features: (1) the optimization engine treats every execution of the mapped function as independent, and thus (2) the optimization engine can operate on stochastic minibatches of the data, sized according to the batchSize option. Property (2) is clearly important for efficient, scalable optimization; we will see in Section 4 how property (1) can also be directly leveraged to improve optimization.
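The minibatching that batchSize enables can be sketched in plain JavaScript. The rescaling factor is an assumption about how such systems typically keep the minibatch objective an unbiased estimate of the full-data objective, not a statement about WebPPL's exact internals:

```javascript
// Sketch of minibatching: each optimization step processes a random
// subset of the data, and per-point objective terms are rescaled by
// N / batchSize so the minibatch objective estimates the full objective.
function sampleMinibatch(data, batchSize) {
  var batch = [];
  for (var i = 0; i < batchSize; i++) {
    batch.push(data[Math.floor(Math.random() * data.length)]);
  }
  return batch;
}

var data = [1, 2, 3, 4, 5, 6, 7, 8];
var batch = sampleMinibatch(data, 3);
var scale = data.length / batch.length;  // reweighting factor: 8 / 3
```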
3.4 Defining Learnable Models
Thus far we have focused on defining parameterized guides for inference. Parameterized guides can also be used to make models learnable. The following three code blocks show possible replacements for line 7 of the previous example, replacing the hardcoded constant mu_x = with a learnable version:
The code in the left block results in maximum likelihood estimation. By using a Delta distribution as a guide, the inference engine will optimize for the single best parameter value (i.e. the center of the Delta distribution). Maximum likelihood behavior comes from using an improper uniform distribution (i.e. a distribution with probability one everywhere) as a prior. This pattern is common enough that our system provides a convenient shorthand for it (the modelParam function). In the middle code block, we demonstrate L2-regularized maximum likelihood learning by replacing the improper uniform prior with a Gaussian prior. The inference engine will still predict a point estimate for mu_x, but the Gaussian prior results in L2 regularization. Finally, the right block shows a variational Bayesian model: mu_x has a Gaussian prior distribution, and the guide samples mu_x from an approximate variational Gaussian posterior with optimizable parameters. This form learns a distribution over generative models, maintaining an estimate of uncertainty about the true model.

Note that we could have alternatively implemented maximum likelihood via a direct parameterization, e.g. var mu_x = param({name: 'mu_x'}). However, this style results in p being parameterized in addition to q. This complicates both the implementation and the theoretical analyses that we show later in this paper. In contrast, our chosen scheme has only the guide parameterized; learning the model is just part of learning the guide.
3.5 Examples: Simple Bayesian Networks
The example code we have built up in this section describes a Bayesian network with one continuous latent variable per continuous observation. Figure 1 Top shows the fully assembled code (using maximum likelihood estimation for the generative model parameters), along with a graphical model depiction using the notation of Kingma and Welling AEVB . In this diagram, solid arrows indicate dependencies in the generative model given by the main program, and dashed arrows indicate dependencies in the guide program. φ is shorthand for all the neural network parameters in the guide program.
Figure 1
Bottom shows how to modify this code to instead have one discrete latent variable per observation; this is equivalent to a Gaussian mixture model. In this example, the simplex function maps a vector in ℝ^{n-1} to the n-dimensional simplex (i.e. a vector whose entries sum to one). This process produces a vector of weights suitable for use as the component probabilities of a discrete random variable.
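One common way to realize such a simplex mapping is to fix one logit to zero and apply a softmax; this padding-plus-softmax construction is an illustrative assumption, not necessarily WebPPL's exact definition of simplex:

```javascript
// Map an unconstrained vector in R^(n-1) onto the n-dimensional simplex
// by appending a fixed 0 logit and applying a (stabilized) softmax.
function simplex(v) {
  var padded = v.concat([0]);            // fix one logit to 0
  var m = Math.max.apply(null, padded);  // subtract max for stability
  var exps = padded.map(function (x) { return Math.exp(x - m); });
  var sum = exps.reduce(function (a, b) { return a + b; }, 0);
  return exps.map(function (e) { return e / sum; });
}

var weights = simplex([0.3, -1.2]);  // 3 component probabilities, sum to 1
```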
Figure 2 Top shows a slightly more complex Bayesian network with two latent continuous variables. Note that the guide program in this example predicts the two latent variables independently given the observation y. In Figure 2 Bottom, we make some small changes to the code (lines 3 and 17, highlighted in green) to instead have the guide program predict the second latent variable as a function of the first. This small change allows the guide to capture a posterior dependency that was ignored by the first version.
4 Optimizing Parameters
Now that we have seen how to author learnable guide programs, we will describe how to optimize the parameters of those programs.
4.1 ELBo: The Variational Objective
In Section 2.2, we mentioned that the goal of variational inference is to find values of the parameters φ for our guide program such that it is as close as possible to the true posterior p(x | y), where closeness is measured via KL-divergence. The KL-divergence between two general distributions is intractable to compute; however, some straightforward algebra produces an objective that is tractable (following the derivation of Wingate and Weber AVIPP ):
(2)  log p(y) = KL( q_φ(x | y) ‖ p(x | y) ) + E_{q_φ}[ log p(x, y) − log q_φ(x | y) ]
             ≥ E_{q_φ}[ log p(x, y) − log q_φ(x | y) ] = ℒ(y, φ)
where the last inequality follows because KL-divergence is non-negative. This in turn implies that ℒ(y, φ) is a lower bound on the log marginal likelihood of the data (i.e. the evidence) log p(y). Accordingly, ℒ is sometimes referred to as the 'Evidence Lower Bound', or ELBo BBVI . Maximizing the ELBo corresponds to minimizing the KL-divergence. When p can include unnormalized factors (as in our implementation), this is more properly called the variational free energy.
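The bound becomes tight exactly when the guide equals the true posterior. For a conjugate Gaussian toy model (a hypothetical example, not one from the paper: x ~ N(0, 1), y | x ~ N(x, 1)) the posterior is Gaussian in closed form, so this can be checked numerically: with q set to the exact posterior, every single-sample estimate log p(x, y) − log q(x) equals log p(y) exactly.

```javascript
// Conjugate check of the ELBo identity: with q equal to the true
// posterior, the per-sample quantity log p(x, y) - log q(x) equals the
// log evidence log p(y) for every x, so the bound is tight.
function logGaussian(v, mu, sigma) {
  var z = (v - mu) / sigma;
  return -0.5 * z * z - Math.log(sigma * Math.sqrt(2 * Math.PI));
}

var y = 0.7;
var postMu = y / 2, postSigma = Math.sqrt(0.5);     // exact posterior of x | y
var logEvidence = logGaussian(y, 0, Math.sqrt(2));  // exact log p(y)

var x = postMu + postSigma * 1.3;  // any sample location gives the same value
var elboSample = logGaussian(x, 0, 1) + logGaussian(y, x, 1)  // log p(x, y)
               - logGaussian(x, postMu, postSigma);           // - log q(x)
```

With an imperfect guide, the same quantity averaged over samples falls strictly below log p(y) by the KL-divergence.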
For an alternative derivation of the ELBo using Jensen's inequality, see Mnih and Gregor NVIL and Jordan et al. VariationalInference (p. 213).
4.2 ELBo Gradient Estimators
Maximizing the ELBo requires estimating its gradient with respect to the parameters. There are two well-known approaches to performing this estimation:
Likelihood Ratio (LR) Estimator:
In the general case, the gradient of the ELBo with respect to φ can be estimated by:
(3)  ∇_φ ℒ(y, φ) = E_{q_φ}[ ∇_φ log q_φ(x | y) ( log p(x, y) − log q_φ(x | y) ) ]
This estimator goes by the names 'likelihood ratio estimator' LikelihoodRatioEstimator and 'score function estimator' ScoreFunctionEstimator , and it is also equivalent to the REINFORCE policy gradient algorithm in the reinforcement learning literature REINFORCE . The derivations of this estimator most relevant to our setting can be found in Wingate and Weber AVIPP and Mnih and Gregor NVIL . Intuitively, each gradient update step using the LR estimator pushes the parameters in the direction ∇_φ log q_φ(x | y), that is, the direction that will maximize the probability of x under the guide. But since the goal of optimization is to learn an approximation to the true posterior, this update is weighted based on how probable x is under the joint distribution p(x, y) (which is proportional to the true posterior). If x has high probability under the joint but low probability under the current guide, then the log p(x, y) − log q_φ(x | y) term produces a large update (i.e. the guide should assign much higher probability to x than it currently does). If the joint and the guide assign equal probability to x, then the update will have zero magnitude. If the joint assigns lower probability to x than the guide does, the resulting gradient update will move the parameters in the opposite direction of ∇_φ log q_φ(x | y).

The LR estimator is straightforward to compute, requiring only that q_φ(x | y) be differentiable with respect to φ (the mean field and neural guide families presented in Section 2 satisfy this property). However, it is known to exhibit high variance. This problem is amenable to several variance reduction techniques, some of which we will employ later in this section.
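The mechanics of the score-function weighting can be checked on a toy problem small enough to enumerate, so there is no sampling noise. In the sketch below (an illustrative example, not the paper's setup), the gradient of E_{x ~ Bernoulli(θ)}[f(x)] with respect to θ is recovered exactly as the enumerated expectation E[f(x) · ∇_θ log q(x; θ)]:

```javascript
// Likelihood-ratio (score function) gradient, computed by enumerating the
// two outcomes of a Bernoulli(theta) choice rather than sampling.
function lrGradient(theta, f) {
  // x = 1: prob theta,     score = d/dtheta log(theta)       =  1 / theta
  // x = 0: prob 1 - theta, score = d/dtheta log(1 - theta)   = -1 / (1 - theta)
  return theta * f(1) * (1 / theta)
       + (1 - theta) * f(0) * (-1 / (1 - theta));
}

// E[f(x)] = theta * f(1) + (1 - theta) * f(0), so the true gradient
// with respect to theta is f(1) - f(0) = 5 - 2 = 3.
var grad = lrGradient(0.3, function (x) { return x === 1 ? 5 : 2; });
```

When the expectation is instead approximated by sampling, as it must be in a real program, these same terms appear but with the high variance discussed above.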
Pathwise (PW) Estimator:
Equation 3 suffers from high variance because the gradient cannot be pushed inside the expectation: the expectation is taken with respect to q_φ, which depends on the parameters φ with respect to which we are differentiating. However, in certain cases, it is possible to rewrite the ELBo such that the expectation distribution does not depend on φ. This situation occurs whenever the latent variables x can be expressed as samples from an un-parameterized distribution, followed by a parameterized deterministic transformation:

x = g_φ(ε),  ε ∼ r(·)
For example, sampling from a Gaussian distribution N(μ, σ) can be expressed as x = μ + σ · ε, where ε ∼ N(0, 1). Continuous random variables which are parameterized by a location and a scale parameter naturally support this type of transformation, and other types of continuous variables can often be well-approximated by deterministic transformations of unit normal variables ADVI . Using this 'reparameterization trick' AEVB allows the ELBo gradient to be rewritten as:
(4)  ∇_φ ℒ(y, φ) = E_{r(ε)}[ ∇_φ ( log p(g_φ(ε), y) − log q_φ(g_φ(ε) | y) ) ]
This estimator is called the 'pathwise derivative estimator' PathwiseEstimator . It transforms both the guide and target distributions into distributions over independent random 'noise' variables ε, followed by complex, parameterized, deterministic transformations. Given a fixed assignment to the noise variables, derivatives can propagate from the final log probabilities back to the input parameters, leading to much more stable gradient estimates than with the LR estimator.
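The pathwise path of differentiation can be illustrated on a toy objective (again an illustrative example, not the paper's): for x = μ + σ·ε with ε ~ N(0, 1), the gradient of E[x²] with respect to μ is E[2(μ + σε)] = 2μ. Averaging the per-sample derivative over a symmetric set of ε values recovers this exactly:

```javascript
// Pathwise (reparameterization) gradient of E[x^2] with respect to mu,
// where x = mu + sigma * eps. For each fixed eps, differentiate through
// the deterministic transform via the chain rule: d/dmu (x^2) = 2 * x.
function pathwiseGradMu(mu, sigma, epsilons) {
  var total = 0;
  for (var i = 0; i < epsilons.length; i++) {
    var x = mu + sigma * epsilons[i];  // reparameterized sample
    total += 2 * x;                    // per-sample derivative
  }
  return total / epsilons.length;
}

// A symmetric eps set makes the average exact here: expect 2 * mu = 3.
var grad = pathwiseGradMu(1.5, 0.8, [-1, 1]);
```

With random ε draws the estimate is noisy, but each sample's contribution is an exact derivative of a smooth function, which is why the variance is typically far lower than the LR estimator's.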
4.3 Unified Gradient Estimator for Probabilistic Programs
A general probabilistic program makes many random choices, some of which are amenable to the reparameterization trick and others of which are not. Discrete random choices are never reparameterizable. Continuous random choices are reparameterizable if they can be expressed as a parameterized, deterministic transformation of a parameterless random choice. In our implementation, every continuous random choice type either has this property or is well-approximated (and thus can be guided) by a random choice type that does (see Appendix A). Thus, for the rest of the paper, we will equate continuous with reparameterizable and discrete with non-reparameterizable. To optimize the ELBo for a probabilistic program p and an associated guide program q, we seek a single, unified gradient estimator that handles both discrete and continuous choices.
First, to simplify notation, we drop the dependency on φ from all derivations that follow. This is equivalent to making the parameters globally available, which is also true of our system (i.e. param can be called anywhere).
Second, we assume that all random choices made by the guide program are first drawn from a distribution r and then transformed by a deterministic function g. Under this assumption, the distribution defined by a guide program factors as:

q(x | y) = ∏_i q(x_i | x_{<i}, y),  where x_i = g_i(ε_i) and ε_i ∼ r_i(· | y)
As mentioned in Section 2.1, the length of x can vary across executions. However, since the guide program by construction samples the same random choices in the same order as the target program p, q factors the same way as p (Equation 1) for any given execution. (Also note that the gradient with respect to parameters that affect only variables that are not created on a given execution will be zero, allowing us to simply ignore them.)
The distributions r_i and functions g_i have different meanings depending on whether the variable x_i is continuous or discrete:

Continuous x_i: r_i is a unit normal or uniform distribution, and g_i is a parameterized transform. This is a direct application of the reparameterization trick. In this case, each local transformation g_i may also depend on the previous noise variables ε_{<i}, as choices occurring later in the program may be compound transformations of earlier choices.

Discrete x_i: r_i = q_i, and g_i is the identity function. This allows discrete choices to be represented in the reparameterization trick framework (without actually reparameterizing).
Given these assumptions, we can derive an estimator for the ELBo gradient:
(5)  ∇ℒ(y) = E_{r(ε | y)}[ ∇ log r(ε | y) · W(ε, y) + ∇ W(ε, y) ]
where W(ε, y) = log p(g(ε), y) − log q(g(ε) | y); see Appendix B.1 for the derivation.
This estimator includes the LR and PW estimators as special cases. If all random choices are reparameterized (i.e. they are all continuous), then r does not depend on the parameters, thus ∇ log r(ε | y) is zero and the LR term drops out, leaving only the PW term. If no random choices are reparameterized (i.e. they are all discrete), then the ∇ log q term inside ∇W drops out, using the identity E_q[∇ log q(x | y)] = 0 (see Appendix B.2). The ∇ log p term is also zero, since only q and not p depends on the parameters, which leaves only the LR term.
While Equation 5 is a correct estimator for the ELBo gradient, like the LR estimator, the presence of discrete (i.e. non-reparameterized) random choices can lead to high variance. Thus we modify this estimator through three variance reduction techniques:

Replace W(ε, y) with a separate w_i(x, y) for each factor of q, as there exist independencies that can be exploited to reveal zero-expectation terms in W for each x_i.

Subtract a constant 'baseline' term b_i from each w_i(x, y). This does not change the expectation, but it can reduce its variance, if designed carefully.

Factor ∇ log q(x | y) in the PW term and remove the factors corresponding to discrete (i.e. non-reparameterized) choices, since, as noted above, they have zero expectation.
Further details and correctness proofs for these three steps can be found in Appendix B. Applying them leads to the following estimator, which is the estimator actually used by our system:
(6)  ∇ℒ(y) = E_{r(ε | y)}[ ∑_{i ∈ D} ∇ log q(x_i | x_{<i}, y) ( w_i(x, y) − b_i )
                         + ∇ ( log p(x, y) − ∑_{i ∈ C} log q(x_i | x_{<i}, y) ) ]
where C and D are the sets of indices for all continuous and discrete random choices in the program execution, respectively. Here we used the fact (mentioned above) that ∇ log r_i = 0 for reparameterized (continuous) choices, and that r_i = q_i and g_i is the identity for non-reparameterized (discrete) choices.
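The claim that baseline subtraction leaves the expectation unchanged follows from E_q[∇ log q] = 0, and can be verified by enumeration on a toy Bernoulli choice (an illustrative example, not a fragment of the system):

```javascript
// Baseline invariance of the LR term, checked by enumerating a
// Bernoulli(theta) choice: subtracting any constant b from the weight
// leaves the expected gradient unchanged, because E[grad log q] = 0.
function lrGradWithBaseline(theta, f, b) {
  return theta * (f(1) - b) * (1 / theta)            // x = 1 branch
       + (1 - theta) * (f(0) - b) * (-1 / (1 - theta));  // x = 0 branch
}

var f = function (x) { return x * 3; };  // true gradient: f(1) - f(0) = 3
var g0 = lrGradWithBaseline(0.4, f, 0);  // no baseline
var g7 = lrGradWithBaseline(0.4, f, 7);  // arbitrary baseline
```

While the expectation is identical, a well-chosen b shrinks the magnitude of the per-sample terms, which is what reduces the variance of the sampled estimator.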
Equation 6 is similar to the 'surrogate loss function' gradient estimator of Schulman et al. StochasticComputationGraphs , as an execution of a probabilistic program corresponds to one of their stochastic computation graphs. Their analysis is concerned with general stochastic objectives, however, while we focus particularly on the ELBo.

4.4 Optimization Interface
In Section 2.1, we showed how WebPPL programs use the Infer function to perform nonamortized inference on a model function. To optimize parameters for amortized inference, WebPPL provides an Optimize function with a similar interface:
The code above performs 100 gradient update steps on model using the Adam stochastic optimization method Adam . The return value params of this function is a map from parameter names to optimized parameter values.
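The Adam update that Optimize applies to each parameter can be sketched for a single scalar; this is the standard Adam rule with its usual default hyperparameters, shown on a hypothetical quadratic loss rather than an ELBo, and is only illustrative of what the optimization engine does internally:

```javascript
// One Adam step for a single scalar parameter (standard rule, default
// hyperparameters beta1 = 0.9, beta2 = 0.999, eps = 1e-8).
function adamStep(state, grad, lr) {
  var b1 = 0.9, b2 = 0.999, eps = 1e-8;
  state.t += 1;
  state.m = b1 * state.m + (1 - b1) * grad;          // first-moment estimate
  state.v = b2 * state.v + (1 - b2) * grad * grad;   // second-moment estimate
  var mHat = state.m / (1 - Math.pow(b1, state.t));  // bias correction
  var vHat = state.v / (1 - Math.pow(b2, state.t));
  state.p -= lr * mHat / (Math.sqrt(vHat) + eps);
  return state;
}

// Minimize the toy loss L(p) = (p - 2)^2, whose gradient is 2 * (p - 2).
var state = { p: 0, m: 0, v: 0, t: 0 };
for (var i = 0; i < 200; i++) {
  adamStep(state, 2 * (state.p - 2), 0.1);
}
```

In the real system, the gradient fed to each step is the stochastic ELBo gradient estimate of Equation 6 rather than an analytic derivative.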
5 Using Learned Parameters
Given a set of learned parameters, our system can predict latent variables, generate synthetic data, or further refine parameters for a new dataset or a new model.
5.1 Predicting Latent Variables
A learned guide can be used to make inferences about new, never-before-seen observations. As an example, we'll use the Gaussian mixture model program in Figure 1 Bottom and show how to predict cluster assignments for new observations. Note that the observations used by the program are assigned to the obs variable, which is then passed to mapData. WebPPL is a purely functional language, so it does not support assigning a new dataset to obs. However, it does provide a special globalStore object whose fields can be reassigned. With this in mind, we modify the Gaussian mixture model program as follows (an alternative to using globalStore for mutation would be to re-create the model function, closing over the test data instead of the training data):
Now we can easily swap datasets using globalStore.data. Given a set of learned parameters params for this program, we can obtain a sample prediction for the latent variables for a new dataset:
We can make predictions either by running the guide program forward or, if the true posterior is very complex and the learned guide only partially approximates it, by using the guide program as an importance sampler within Sequential Monte Carlo.
5.2 Generating Synthetic Data
Forward sampling from the guide can also be used to generate synthetic data from the learned distribution. If we make a slightly modified version of the Gaussian mixture model (call it modelGen) that samples data instead of observing it, we can use forward sampling with the optimized parameters params to synthesize new data points:
5.3 Further Optimization
A set of learned parameters params can also be passed back into Optimize for further optimization:
This can be useful for, e.g., fine-tuning existing parameters for a new dataset. Indeed, model does not even need to be the same program that was originally used to learn params; it just needs to declare some parameters (via param) with the same names as parameters in params. This can be useful, for example, for making a modification to an existing model without having to re-train its guide program from scratch, or for bootstrapping training from a simpler model to a more complex one.
6 Experiments
Having detailed how to specify and optimize guide programs in our system, in this section, we experimentally evaluate how well programs written in our system can learn generative models and approximate posterior samplers. Unless stated otherwise, we use the following settings for the experiments in this section:
6.1 Gaussian Mixture Model
We first consider the simple Gaussian mixture model program from Figure 1 Bottom. This program samples discrete random choices, so its gradient estimator will include an LR term. Alternatively, we could rewrite the program slightly to explicitly marginalize out the discrete random choices; see Appendix C.1. Marginalizing out these choices leads to a tighter bound on the marginal log likelihood, so we would expect this version of the program to achieve a higher ELBo. As an extra benefit, the gradient estimator for this program then reduces to the PW estimator, which will have lower variance. These benefits come at the cost of amortized inference, however, as this version of the program does not have a guide which can predict the latent cluster assignment given an observed point. We also consider a non-amortized, mean field version of the program for comparison.
Figure 3 illustrates the performance of these programs after training for 200 steps on a synthetic dataset of 100 points. On the left, we show how the ELBo changes during optimization. As expected, ELBo progress asymptotes at a higher value for the marginalized model. On the right, we show the estimated negative log likelihood of a separate synthetic test set under each program after optimization. Here, we also include the true model (i.e. the model used to synthesize the test/training data) for comparison. As suggested by its optimization performance, the model with discrete choices marginalized out performs best. Note that the amortized guide program slightly outperforms the mean field guide program, indicating that the generalization provided by amortization has benefits for training generative models, in addition to enabling fast predictions of latent variables for previously-unseen observations.
6.2 QMR-DT
We next consider a more complicated Bayesian network model based on the QMR-DT medical diagnosis network QMR . QMR-DT is a bipartite graphical model with one layer of nodes corresponding to latent causes (e.g. diseases, in the medical setting) and a second layer of observed effects (e.g. symptoms). All nodes are binary (i.e. Bernoulli), and the cause nodes are connected to the effects via directed noisy-or links. Appendix C.2 shows our implementation.
Our amortized guide program for this model uses a neural network to jointly predict the probabilities of all latent cause variables given a set of observed effects. Since the QMR-DT model contains a large number of discrete random variables, we expect the variance reduction strategies introduced in Section 4.3 to have a significant effect. Thus, we consider training this guide program with no variance reduction (Amortized, step size ), with per-choice likelihood ratio weights (+ local weights, step size ), and with both per-choice weights and baselines (+ baselines, step size ). As a point of reference, we also include a mean field model (step size ) which uses all variance reduction strategies. Data for our experiments is sampled from a randomly-generated graph with 200 causes and 100 effects. We sample 1000 observations for the training set and an additional 100 for a held-out test set.
Figure 4
shows the results of our experiments. The left plot shows optimization progress under each condition. Without any of the variance reduction strategies, gradients are extremely noisy and optimization makes almost no progress. Using local, per-variable likelihood ratio weights allows optimization to make progress, and adding per-variable baselines further boosts performance. Though it uses all variance reduction strategies, the mean field model trains significantly more slowly than the variance-reduced amortized models. This happens because the mean field model has separate parameters for each training observation, rather than a single parameter set shared by a neural network, i.e. it has many more parameters that are each updated by very few gradient steps. Amortization thus both facilitates fast posterior prediction and exhibits faster training due to parameter sharing.
We next evaluate the guide program’s posterior prediction ability. We use the learned guide to sample latent causes given an observed set of effects from the test set, sample effects given those causes, and then record what percentage of active effects in the test set observation are correctly predicted by the effects ‘hallucinated’ from our model. Specifically, if is a vector of effect variables of length , then the metric we use is:
where are effects from the test set and are hallucinated from the model. Figure 4 Right plots the average score over 100 runs, where we compare our amortized guide program against using the prior program to sample latent causes. The learned guide program correctly predicts more than twice as many active effects.
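One plausible reading of this metric, sketched in Python (the function name and boolean encoding are our own; the score is the fraction of active test-set effects that are also active in the hallucinated effects):

```python
def active_effect_score(observed, hallucinated):
    """Fraction of active (true) effects in the test observation that are
    also active in the effects hallucinated from the model."""
    assert len(observed) == len(hallucinated)
    active = [i for i, e in enumerate(observed) if e]
    if not active:
        return 0.0  # no active effects to predict
    hits = sum(1 for i in active if hallucinated[i])
    return hits / len(active)
```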
In addition to the guide program described above, which predicts all latent causes jointly given the observed effects, we also experimented with 'factored' guide programs which predict each latent cause one-by-one given the observed effects. We consider a guide that predicts each latent cause independently (Factored), as well as a guide that introduces dependencies between all latent causes via a recurrent neural network (Factored + GRU). The recurrent network receives as input the value of each latent cause as it is sampled, maintaining a persistent hidden state that, in theory, can capture the values of all latent causes sampled thus far. We use the gated recurrent unit (GRU) architecture for its ability to capture longer-range dependencies GRU , with a hidden state of dimension 20. The code for these programs is shown in Appendix C.2. We use step size 0.01 for both during optimization. Figure 5 compares these guide programs with the joint guide used earlier in this section. While the independent factored guide performs slightly less well than the joint guide, adding the recurrent neural network to capture posterior dependencies improves performance to slightly better than the joint guide. One caveat is that the Factored + GRU guide takes significantly longer to train in our implementation due to the recurrent network computation. Later in the paper, we discuss how this guide structure might point the way toward automatically deriving guide programs.
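As an illustration of the 'Factored + GRU' idea, the following Python sketch predicts each latent cause in sequence while threading the sampled value back through a recurrent state. It uses a scalar hidden state and random placeholder weights purely for clarity; the actual guide is a WebPPL program with a learned 20-dimensional GRU (Appendix C.2):

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

class TinyGRUGuide:
    """Scalar-state sketch of a 'Factored + GRU' guide: each latent cause
    is predicted one-by-one, and the sampled value is fed back into the
    recurrent hidden state so later predictions can depend on it."""
    def __init__(self, seed=0):
        rng = random.Random(seed)
        # Placeholder weights; in the real guide these are learned.
        self.wz, self.wr, self.wh, self.wo = (rng.uniform(-1, 1) for _ in range(4))

    def step(self, h, x):
        z = sigmoid(self.wz * x + h)           # update gate
        r = sigmoid(self.wr * x + h)           # reset gate
        h_tilde = math.tanh(self.wh * x + r * h)
        return (1 - z) * h + z * h_tilde

    def sample_causes(self, n, rng):
        h, causes = 0.0, []
        for _ in range(n):
            p = sigmoid(self.wo * h)           # per-cause Bernoulli probability
            x = 1.0 if rng.random() < p else 0.0
            causes.append(bool(x))
            h = self.step(h, x)                # feed the sampled value back in
        return causes
```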
6.3 Latent Dirichlet Allocation
We also used our system to implement amortized inference in Latent Dirichlet Allocation topic models, over a data set of abstracts taken from the Stanford Computation and Cognition Lab's publication page (thanks to Robert Hawkins for creating this dataset).
We experimented with two different amortized guide programs. The first is local to each word in a document (Word-level guide): it learns to predict the latent topic for each word, given the word and the latent topic distribution for the document. The second is local to each document (Document-level guide): it learns to predict the latent topic distribution for the document, given the words in the document and the latent word distributions for each topic. These two guides support amortized inference at different granularities and thus have different parameter sharing characteristics, which may lead to different learning behavior. For comparison, we also included two non-amortized conditions: a mean field model, and a mean field model with the latent choice of topic per word marginalized out (Marginalized mean field). Code for all of these programs can be found in Appendix C.3. We use five topics and learning rate 0.01 in all experiments.
Figure 6.3 shows the results of these experiments. In the optimization progress plot on the left, we see that the marginalized mean field model achieves the highest ELBo. This is consistent with the results from the Gaussian mixture model experiments: marginalizing out latent variables when possible leads to a tighter bound on the marginal log likelihood.
Of our two amortized guides, the word-level guide performs better (and nearly as well as the marginalized model), likely due to increased parameter sharing. The document-level guide performs at least as well as the mean field model while also being able to efficiently predict topic distributions for previously-unseen documents.
Figure 6.3 Right shows the top ten highest probability words in each inferred topic for the marginalized mean field model. From left to right, these topics appear to be about experiments, pragmatic language models, knowledge acquisition, probabilistic programming languages, and a grab bag of remaining topics.
6.4 Neural Generative Models: Variational Autoencoder & Sigmoid Belief Network
Our system naturally supports generative models which use neural network components. Two prominent examples of models in this class include the Variational Autoencoder (VAE) and Sigmoid Belief Networks (SBN). Both models sample latent variables from a multivariate distribution and then transform the result via a neural network to produce observed variables, often in the form of an image. The VAE uses a latent multivariate Gaussian distribution, whereas the SBN uses a latent multivariate Bernoulli.
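The generative direction shared by these models can be sketched as follows in Python: sample a latent code, then transform it through a network into per-pixel Bernoulli probabilities. The one-layer decoder and all weights here are made-up illustrations, not the architectures of AEVB or NVIL; the Gaussian sample uses the reparameterization z = mu + sigma * eps:

```python
import math
import random

def vae_decode_sample(mu, log_sigma, weights, rng):
    """Sketch of a VAE-style generative step: draw a latent Gaussian code
    via the reparameterization trick, then push it through a hypothetical
    one-layer network to get Bernoulli pixel probabilities."""
    # Reparameterized latent sample: deterministic transform of eps ~ N(0, 1).
    z = [m + math.exp(ls) * rng.gauss(0.0, 1.0)
         for m, ls in zip(mu, log_sigma)]
    pixels = []
    for row in weights:  # one output unit per pixel
        a = sum(w * zi for w, zi in zip(row, z))
        pixels.append(1.0 / (1.0 + math.exp(-a)))  # sigmoid activation
    return z, pixels
```

An SBN differs in that `z` would be a vector of Bernoulli draws rather than reparameterized Gaussians, which is why its gradients are noisier.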
Appendix C shows implementations of these models in our system. Our VAE implementation follows the original description of the model by Kingma and Welling AEVB , and our SBN implementation follows that of Mnih and Gregor NVIL . The VAE uses a 20-dimensional latent code, and the SBN uses a single layer of 200 hidden variables. Our system cannot express the two-layer SBN of Mnih and Gregor, because its guide model samples the latent variables in the reverse order of the generative model.
Figure 7 Left shows results of training these models on the MNIST dataset, using Adam with a step size of 0.001. While both models train quickly at first, the SBN's training slows more noticeably than the VAE's due to its discrete nature. It takes more than three times as many iterations to achieve the same ELBo. In Figure 7 Right, we qualitatively evaluate both models by using them to reconstruct images from the MNIST test set. We use the guide program to sample latent variables conditional on the images in the "Target" column (i.e. the 'encoder' phase). We then transform these latent variables using the generative model's neural networks (i.e. the 'decoder' phase) to produce the reconstructed images in the "VAE" and "SBN" columns. As suggested by their training behavior, the VAE is able to generate higher-quality reconstructions after less training.
Our optimization exhibits some differences from the previous work. For the VAE, Kingma and Welling AEVB exploit the closed-form solution of the KL divergence between two Gaussians to create an even lower-variance estimator of the ELBo gradient. We use a more general formulation, but our system can still successfully train the model. For the SBN, Mnih and Gregor NVIL use neural networks to compute the per-variable baselines in Equation 6, whereas we use a simpler approach (see Appendix B). However, the key point is that each of these models was described in a simple WebPPL program with neural guides and optimized by the default system, without the need for additional implementation efforts.
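For reference, the closed-form KL divergence that Kingma and Welling exploit, between a diagonal Gaussian approximate posterior and a standard Gaussian prior, is the standard identity:

```latex
D_{\mathrm{KL}}\!\left(\mathcal{N}(\mu, \operatorname{diag}(\sigma^2)) \,\big\|\, \mathcal{N}(0, I)\right)
  = \frac{1}{2} \sum_{j=1}^{d} \left( \mu_j^2 + \sigma_j^2 - \log \sigma_j^2 - 1 \right)
```

Substituting this for the sampled KL term removes one source of Monte Carlo noise from the ELBo gradient, which is why it yields a lower-variance estimator than the general formulation our system uses.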
7 Deriving Guides Automatically
Thus far, we have shown how we can successfully create and train guide programs for several types of generative models. However, writing guide programs can sometimes be tedious and repetitive; for example, note the large amount of shared structure between the guides shown in Figures 1 and 2. Furthermore, it is not always obvious how to write a good guide program. In Figure 2, knowledge of the structure of this very simple generative model led us to add a direct dependency between the two latent variables in the guide. For general programs—especially large, complex ones—it will not always be clear what these dependencies are or how to capture them with a guide.
This section describes our early experience with automatically deriving guide programs. We first describe how our system provides sensible default behavior that can make writing some guides less cumbersome. We then outline how the system might be extended to automatically derive guides for any program using recurrent neural networks.
7.1 Mean Field by Default
If a call to sample is not provided with an explicit guide distribution, our system automatically inserts a mean field guide. For example, the code sample(Gaussian({mu: 0, sigma: 1})) results in:
where parameter bounding transforms such as softplus are applied based on bounds metadata provided with each primitive distribution type. We use reparameterizable guides for continuous distributions (see Appendix A).
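Conceptually, the inserted mean field guide behaves like the following Python sketch. The parameter store and naming scheme here are hypothetical stand-ins for our system's internals; the real guide is generated inside WebPPL:

```python
import math

def softplus(x):
    """Bounding transform mapping an unconstrained parameter to (0, inf)."""
    return math.log1p(math.exp(x))

def mean_field_gaussian_guide(params, name):
    """Sketch of the default mean field guide for a Gaussian sample:
    a Gaussian whose mean is a free optimizable parameter and whose
    sigma is a softplus-bounded free parameter. 'params' is a
    hypothetical name -> value store initialized lazily."""
    mu = params.setdefault(name + "_mu", 0.0)             # unconstrained
    sigma = softplus(params.setdefault(name + "_sigma", 0.0))  # bounded > 0
    return {"mu": mu, "sigma": sigma}
```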
Since this process declares new optimizable parameters automatically, we must automatically generate names for these parameters. Our system names parameters according to where they are declared in the program execution trace, using the same naming technique as is used for random choices in probabilistic programming MCMC engines Lightweight . Since the names of these parameters are tied to the structure of the program, they cannot be reused by other programs (as in the ‘Further Optimization’ example of Section 5.3).
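A minimal sketch of such structural naming follows; the address format is hypothetical, and the point is only that the name is fully determined by the chain of call sites leading to the declaration:

```python
def structural_address(stack):
    """Name a parameter (or random choice) by the chain of call sites
    leading to its declaration, as in lightweight PPL implementations.
    Entries are hypothetical (function name, call-site index) pairs."""
    return "/".join(f"{fn}:{site}" for fn, site in stack)
```

Two executions that reach a declaration through the same call path get the same name, so the same parameter is reused across executions; a different program produces different addresses, which is why these auto-named parameters cannot be shared between programs.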
7.2 Beyond Mean Field: Automatic Factored Guides with Recurrent Networks
In Section 6.2, we experimented with a factored guide program for the QMR-DT model. We think that this general style of guide—predicting each random choice in sequence, conditional on the hidden state of a recurrent neural network—might be generalized to an automatic guide for any program, as any probabilistic program can be decomposed into a sequence of random choices. In our QMR-DT experiments, we used a separate neural network (with separate parameters) to predict each latent variable (i.e. random choice). For complex models and large data sets, this approach would lead to a computationally infeasible explosion in the number of parameters. Furthermore, it is likely that the prediction computations for many random choices in the program are related. For example, in the QMR-DT program, latent causes that share many dependent effects may be well predicted by the same or very similar networks.
Given these insights, we imagine a universally-applicable guide that uses a single prediction network for all random choices, but to which each random choice provides an additional identifying input. These IDs should be elements in a vector space, such that more 'similar' random choices have IDs which are close to one another for some distance metric in the vector space. One possible way to obtain such IDs would be to learn an embedding of the program-structural addresses of each random choice Lightweight . These might be learned in an end-to-end fashion by making them learnable parameter vectors in the overall variational optimization (i.e. letting closeness in the embedding space be an emergent property of optimizing our overall objective).
8 Conclusion
In this paper, we presented a system for amortized inference in probabilistic programs. Amortization is achieved through parameterized guide programs which mirror the structure of the original program but can be trained to approximately sample from the posterior. We introduced an interface for specifying guide programs which is flexible enough to reproduce state-of-the-art variational inference methods. We also demonstrated how this interface supports model learning in addition to amortized inference. We developed and proved the correctness of an optimization method for training guide programs, and we evaluated its ability to optimize guides for Bayesian networks, topic models, and deep generative models.
8.1 Future Work
There are many exciting directions of future work to pursue in improving amortized inference for probabilistic programs. The system we have presented in this paper provides a platform from which to explore these and other possibilities:
More modeling paradigms
In this paper, we focused on the common machine learning modeling paradigm in which a global generative model generates many IID data points. There are many other modeling paradigms to consider. For example, time series data is common in machine learning applications. Just as we developed mapData to facilitate efficient inference in IID data models, we might develop an analogous data processing function for time series data (i.e. foldData). Using neural guides with such a setup would permit amortized inference in models such as Deep Kalman Filters DeepKalmanFilters . In computer vision and computer graphics, a common paradigm for generative image models is to factor image generation into multiple steps and condition each step on the partially-generated image thus far PixelCNN ; NGPM . Such 'yield-so-far' models should also be possible to implement in our system.
Better gradient estimators
While the variance reduction strategies employed by our optimizer make inference with discrete variables tractable, it is still noticeably less efficient than with purely continuous models. Fortunately, there are ongoing efforts to develop better, general-purpose discrete estimators for stochastic gradients MuProp ; VIMCO . It should be possible to adapt these methods for probabilistic programs.
Automatic guides
As discussed in Section 7, we believe that automatically deriving guide programs using recurrent neural networks may soon be possible. Recent enhancements to recurrent networks may be necessary to make this a reality. For example, the external memory of the Neural Turing Machine may be better at capturing certain long-range posterior dependencies NTM . We might also draw inspiration from the Neural Programmer-Interpreter NPI , whose stack of recurrent networks which communicate via arguments might better capture the posterior dataflow of arbitrary programs.
Other learning objectives
In this paper, we focused on optimizing the ELBo. If we flip the direction of KL divergence in Equation 2, the resulting functional is an upper bound on the log marginal likelihood of the data—an 'Evidence Upper Bound,' or EUBo. Computing the EUBo and its gradient requires samples from the true posterior and is thus unusable in many applications, where the entire goal of amortized inference is to find a way to tractably generate such samples. However, some applications can benefit from it, if the goal is to speed up an existing tractable inference algorithm (e.g. SMC NGPM ), or if posterior execution traces are available through some other means (e.g. input examples from the user). There may also be less extreme ways to exploit this idea for learning. For example, in a mapData-style program, we might interleave normal ELBo updates with steps that hallucinate data from the posterior predictive (using a guide for global model parameters) and train the local guide to correctly parse these 'dreamed-up' examples. Such a scheme bears resemblance to the wake-sleep algorithm WakeSleep .
Control flow
While our system’s one-to-one mapping between random choices in the guide and in the target program makes the definition and analysis of guides simple, there are scenarios in which more flexibility is useful. In some cases, one may want to insert random choices into the guide which do not occur in the target program (e.g. using a compound distribution, such as a mixture distribution, as a guide). And for models in which there is a natural hierarchy between the latent variables and the observed variables, having the guide run ‘backwards’ from the observed variables to the topmost latents has been shown to be useful StochasticInverses ; NeuralStochasticInverses ; NVIL . It is worth exploring how to support these (and possibly even more general) control flow deviations in a general-purpose probabilistic programming inference system.
Acknowledgments
This material is based on research sponsored by DARPA under agreement number FA87501420009. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of DARPA or the U.S. Government.
References
 [1] Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. In EMNLP 2014.
 [2] M. C. Fu. Gradient Estimation. Handbooks in operations research and management science, 13, 2006.
 [3] Sam Gershman and Noah D. Goodman. Amortized Inference in Probabilistic Reasoning. In Proceedings of the ThirtySixth Annual Conference of the Cognitive Science Society, 2014.
 [4] P. Glasserman. Monte Carlo Methods in Financial Engineering. Springer Science & Business, 2003.
 [5] P. W. Glynn. Likelihood ratio gradient estimation for stochastic systems. Communications of the ACM, 33(10), 1990.
 [6] Noah D. Goodman, Vikash K. Mansinghka, Daniel M. Roy, Keith Bonawitz, and Joshua B. Tenenbaum. Church: a language for generative models. In UAI 2008.
 [7] Noah D. Goodman and Andreas Stuhlmüller. The Design and Implementation of Probabilistic Programming Languages. http://dippl.org, 2014. Accessed: 20151223.
 [8] Alex Graves, Greg Wayne, and Ivo Danihelka. Neural Turing Machines. CoRR, arXiv:1410.5401, 2014.
 [9] E. Greensmith, P. L. Bartlett, and J. Baxter. Variance reduction techniques for gradient estimates in reinforcement learning. The Journal of Machine Learning Research, 5, 2004.
 [10] Shixiang Gu, Sergey Levine, Ilya Sutskever, and Andriy Mnih. MuProp: Unbiased Backpropagation for Stochastic Neural Networks. In ICLR 2016.
 [11] Georges Harik and Noam Shazeer. Variational Program Inference. CoRR, arXiv:1006.0991, 2010.
 [12] GE Hinton, P Dayan, BJ Frey, and RM Neal. The "wake-sleep" algorithm for unsupervised neural networks. Science, 268, 1995.
 [13] M. Jordan, Z. Ghahramani, T. Jaakkola, and L. Saul. Introduction to variational methods for graphical models. Machine Learning, 37, 1999.
 [14] Diederik P. Kingma and Jimmy Ba. Adam: A Method for Stochastic Optimization. In ICLR 2015.
 [15] Diederik P. Kingma and Max Welling. AutoEncoding Variational Bayes. In ICLR 2014.
 [16] Rahul G. Krishnan, Uri Shalit, and David Sontag. Deep Kalman Filters. CoRR, arXiv:1511.05121, 2015.
 [17] Alp Kucukelbir, Rajesh Ranganath, Andrew Gelman, and David M. Blei. Automatic Variational Inference in Stan. In NIPS 2015.
 [18] J. Manning, R. Ranganath, K. Norman, and D. Blei. Black Box Variational Inference. In AISTATS 2014.
 [19] Vikash K. Mansinghka, Daniel Selsam, and Yura N. Perov. Venture: a higherorder probabilistic programming platform with programmable inference. CoRR, arXiv:1404.0099, 2014.
 [20] Andriy Mnih and Karol Gregor. Neural Variational Inference and Learning in Belief Networks. In ICML 2014.
 [21] Andriy Mnih and Danilo J. Rezende. Variational inference for Monte Carlo objectives. In ICML 2016.
 [22] B. Paige and F. Wood. Inference Networks for Sequential Monte Carlo in Graphical Models. In ICML 2016.
 [23] Scott Reed and Nando de Freitas. Neural ProgrammerInterpreters. In ICLR 2016.
 [24] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic Backpropagation and Approximate Inference in Deep Generative Models. In ICML 2014.
 [25] Daniel Ritchie, Anna Thomas, Pat Hanrahan, and Noah D. Goodman. NeurallyGuided Procedural Models: Amortized Inference for Procedural Graphics Programs using Neural Networks. In NIPS 2016.
 [26] John Schulman, Nicolas Heess, Theophane Weber, and Pieter Abbeel. Gradient Estimation Using Stochastic Computation Graphs. In NIPS 2015.
 [27] M. Shwe, B. Middleton, D. Heckerman, M. Henrion, E. Horvitz, H. Lehmann, and G. Cooper. Probabilistic diagnosis using a reformulation of the INTERNIST1/QMR knowledge base. I. The probabilistic model and inference algorithms. Methods of Information in Medicine, 30.
 [28] Andreas Stuhlmüller, Jessica Taylor, and Noah D. Goodman. Learning Stochastic Inverses. In NIPS 2013.
 [29] Aaron van den Oord, Nal Kalchbrenner, Oriol Vinyals, Lasse Espeholt, Alex Graves, and Koray Kavukcuoglu. Conditional Image Generation with PixelCNN Decoders. CoRR, arXiv:1606.05328, 2016.
 [30] Ronald J. Williams. Simple statistical gradientfollowing algorithms for connectionist reinforcement learning. Machine Learning, 8, 1992.
 [31] David Wingate, Andreas Stuhlmüller, and Noah D. Goodman. Lightweight Implementations of Probabilistic Programming Languages Via Transformational Compilation. In AISTATS 2011.
 [32] David Wingate and Theophane Weber. Automated Variational Inference in Probabilistic Programming. In NIPS 2012 Workshop on Probabilistic Programming.
 [33] F. Wood, J. W. van de Meent, and V. Mansinghka. A New Approach to Probabilistic Programming Inference. In AISTATS 2014.
Appendix A Appendix: Reparameterizations
Examples of primitive random choice distributions that can be reparameterized via a locationscale transform:
Distribution | Reparameterization
Gaussian(μ, σ) | μ + σ · Gaussian(0, 1)
LogitNormal(μ, σ) | sigmoid(μ + σ · Gaussian(0, 1))
LogisticNormal(μ, σ) | softmax(μ + σ · Gaussian(0, 1))
InverseSoftplusNormal(μ, σ) | softplus(μ + σ · Gaussian(0, 1))
Exponential(λ) | −log(Uniform(0, 1)) / λ
Cauchy(x₀, γ) | x₀ + γ · tan(π · (Uniform(0, 1) − 1/2))
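The Gaussian and Exponential rows of the table above can be sketched in Python as follows (illustrative helper names); in both cases a parameterized sample is expressed as a deterministic transform of a parameter-free draw, so gradients can flow through the distribution parameters:

```python
import math
import random

def reparam_gaussian(mu, sigma, rng):
    """Location-scale reparameterization: Gaussian(mu, sigma) as a
    deterministic transform of a Gaussian(0, 1) draw."""
    eps = rng.gauss(0.0, 1.0)
    return mu + sigma * eps

def reparam_exponential(rate, rng):
    """Exponential(rate) via an inverse-CDF transform of Uniform(0, 1)."""
    u = rng.random()
    return -math.log(1.0 - u) / rate
```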
Examples of primitive distributions that do not have a locationscale transform but can be guided by a reparameterizable approximating distribution:
Distribution | Guide Distribution
Uniform | LogitNormal
Beta | LogitNormal
Gamma | InverseSoftplusNormal
Dirichlet | LogisticNormal
Appendix B Appendix: Gradient Estimator Derivations & Correctness Proofs
B.1 Derivation of Unified Gradient Estimator (Equation 5)
(7)  
Line 7 makes use of the log-derivative identity $\nabla_\phi q(x; \phi) = q(x; \phi)\,\nabla_\phi \log q(x; \phi)$.
B.2 Zero Expectation Identities
In what follows, we will make frequent use of the following:
Lemma 1.
If $q(x; \phi)$ is a probability distribution, then: $\mathbb{E}_{q(x;\phi)}\left[\nabla_\phi \log q(x; \phi)\right] = 0$
Proof.
$\mathbb{E}_{q(x;\phi)}\left[\nabla_\phi \log q(x;\phi)\right] = \int q(x;\phi)\,\frac{\nabla_\phi q(x;\phi)}{q(x;\phi)}\,dx = \nabla_\phi \int q(x;\phi)\,dx = \nabla_\phi 1 = 0$ ∎
Lemma 2.
For a discrete random choice and a function :
Proof.
where the last line makes use of Lemma 1. ∎
B.3 Variance Reduction Step 1: Zero Expectation Terms
In this section, we show that for each random choice , we can remove terms from to produce . Specifically, we prove that while , we still have .
To enable the computation of each , our system builds a directed acyclic dependency graph as the program executes. The graph is constructed as follows:

On program start: Create a root node root. Set prev = root. This holds the previous node and will be used to assign node parents.

On sample or observe: Create a new node node representing this random choice/observation. Set node.parent = prev. Update prev = node.

On mapData begin: Create two new nodes split and join. These nodes will delimit the beginning and ending of the mapData iteration. Push split, join onto a stack mapDataStack. This stack keeps track of split, join nodes when there are nested calls to mapData.

On mapData iteration begin: Retrieve split, join = top(mapDataStack). Update prev = split. This step reflects the independence assumptions of mapData: an iteration of mapData does not depend on any previous iterations, so there are no such edges in the graph. Instead, each iteration of mapData points back to the beginning of the mapData call.

On mapData iteration end: Retrieve split, join = top(mapDataStack). Add prev to join.parents. This step connects the last random choice in each mapData iteration to the join node.

On mapData end: Retrieve split, join = top(mapDataStack). Update prev = join. This step acknowledges that any subsequent computation may depend on the mapData as a whole.
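The construction above can be sketched directly in Python; method names mirror the event hooks in the text. Note that connecting split back to the node preceding the mapData call is our assumption, as the text does not state that edge explicitly:

```python
class Node:
    def __init__(self, label):
        self.label = label
        self.parents = []

class DepGraph:
    """Sketch of the dependency-graph construction described above; the
    real system builds this graph as the WebPPL program executes."""
    def __init__(self):
        self.root = Node("root")
        self.prev = self.root          # previous node, used to assign parents
        self.map_data_stack = []       # tracks split/join for nested mapData

    def on_choice(self, label):
        """On sample or observe: new node, parented to prev."""
        node = Node(label)
        node.parents.append(self.prev)
        self.prev = node
        return node

    def on_map_data_begin(self):
        split, join = Node("split"), Node("join")
        split.parents.append(self.prev)  # assumed edge to the preceding node
        self.map_data_stack.append((split, join))

    def on_iteration_begin(self):
        split, _ = self.map_data_stack[-1]
        self.prev = split              # iterations are mutually independent

    def on_iteration_end(self):
        _, join = self.map_data_stack[-1]
        join.parents.append(self.prev)  # fan-in: last choice of the chain

    def on_map_data_end(self):
        _, join = self.map_data_stack.pop()
        self.prev = join               # later computation depends on the whole mapData
```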
In this graph, all nodes correspond to random choices or observations, except for the special mapData nodes split and join. When there are no calls to mapData, the graph has a linear topology, where each node is connected via a parent edge to the previously-sampled/observed node. mapData calls introduce fan-out/fan-in subgraphs: the split node fans out into separate linear chains for each mapData iteration, and the last nodes of these chains fan in to the join node. Figure 8 shows the resulting graph for one execution of a simple example program.
By construction, this graph is an overly-conservative dataflow dependency graph: if random choice flows into random choice (or observation ), then a path (or ) exists in the graph. The converse is not necessarily true (i.e. there can exist edges between nodes that have no dataflow dependency). Note also that, by construction, the existence of a path implies that was sampled after in the program execution order.
From the perspective of a random choice node , the graph nodes can be partitioned into the following subsets:

: the nodes “downstream” from (i.e. the set of all nodes for which an edge exists).

: the nodes “upstream” from (i.e. the set of all nodes for which an edge exists).

: the set of nodes which are in neither nor .
Figure 9 illustrates these partitions on an example graph. For convenience, we also define .