Using probabilistic programs as proposals

01/11/2018 · by Marco F. Cusumano-Towner, et al. · MIT

Monte Carlo inference has asymptotic guarantees, but can be slow when using generic proposals. Handcrafted proposals that rely on user knowledge about the posterior distribution can be efficient, but are difficult to derive and implement. This paper proposes to let users express their posterior knowledge in the form of proposal programs, which are samplers written in probabilistic programming languages. One strategy for writing good proposal programs is to combine domain-specific heuristic algorithms with neural network models. The heuristics identify high probability regions, and the neural networks model the posterior uncertainty around the outputs of the algorithm. Proposal programs can be used as proposal distributions in importance sampling and Metropolis-Hastings samplers without sacrificing asymptotic consistency, and can be optimized offline using inference compilation. Support for optimizing and using proposal programs is easily implemented in a sampling-based probabilistic programming runtime. The paper illustrates the proposed technique with a proposal program that combines RANSAC and neural networks to accelerate inference in a Bayesian linear regression with outliers model.







1 Introduction

Monte Carlo approaches to approximate inference include importance sampling, Markov chain Monte Carlo, and sequential Monte Carlo del2006sequential . It is easy to construct Monte Carlo inference algorithms that have asymptotic guarantees, but constructing Monte Carlo inference algorithms that are accurate in practice often requires knowledge of the posterior distribution. Recent work in amortized inference stuhlmuller2013learning ; paige2016inference ; ritchie2016neurally ; le2016inference aims to acquire this knowledge by training neural networks on offline problem instances and generalizing to problem instances encountered at run-time. We propose to use probabilistic programming languages as a medium for users to encode their domain-specific knowledge about the posterior, and to learn parameters of the program offline using an amortized inference approach to fill in gaps in this knowledge. In our setting, the model in which inference is being performed may or may not itself be represented by a probabilistic program.

This paper makes three contributions. First, we introduce proposal programs, which are probabilistic programs that represent proposal distributions. Proposal programs can use internal random choices, and report estimates of their proposal probabilities instead of true proposal probabilities. We show that proposal programs can be used in place of regular proposal distributions within importance sampling and Metropolis-Hastings algorithms without sacrificing the asymptotic consistency of these algorithms. Second, we give a stochastic gradient descent algorithm for offline learning of proposal program parameters. Third, we propose a class of proposal programs that combine domain-specific heuristic randomized algorithms, which identify high-probability regions of the target distribution, with neural networks that model the uncertainty in the target distribution around the outputs of the algorithm. We also discuss how to implement proposal programs in a sampling-based probabilistic programming runtime, and give example proposal programs written in Gen.jl cusumano2017gen , a probabilistic language embedded in Julia bezanson2017julia .

2 Notation

Consider a target distribution π(z; x) on latent variables z, and let p̃(z; x) = Z(x) π(z; x) be the unnormalized target distribution, where Z(x) is a normalizing constant. Let x denote the problem instance—this may include the observed data, or any other context relevant to defining the inference problem. In importance sampling, we sample many values z_i from a proposal distribution q(z; x) that is parameterized by x, and evaluate the importance weight w_i = p̃(z_i; x) / q(z_i; x) for each sample, producing a weighted collection of samples that can be used to estimate expectations under π. In Markov chain Monte Carlo, we use one or more proposal distributions to construct a Markov chain that converges asymptotically to π, and use iterates of one or more such chains to estimate expectations under π. Proposal distributions may be combined in cycles or mixtures tierney1994markov , and each proposal distribution may update only a subset of the variables in z, leaving the remaining variables fixed.
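As a concrete illustration of the importance-weight bookkeeping above (a toy sketch; the three-point target, uniform proposal, and function names are hypothetical, not from the paper), a self-normalized importance sampler over a small discrete target can be written as:

```python
import random

# Toy unnormalized target on {0, 1, 2}: p_tilde = [1, 2, 3], so the
# normalized target is [1/6, 2/6, 3/6] and the true mean is
# (0*1 + 1*2 + 2*3)/6 = 4/3.
P_TILDE = [1.0, 2.0, 3.0]

def proposal():
    """Uniform proposal over {0, 1, 2}, so q(z) = 1/3."""
    return random.randrange(3)

def self_normalized_is(n):
    """Estimate E_pi[f] with f(z) = z using n weighted samples."""
    total_w, total_wf = 0.0, 0.0
    for _ in range(n):
        z = proposal()
        w = P_TILDE[z] / (1.0 / 3.0)  # importance weight p_tilde(z) / q(z)
        total_w += w
        total_wf += w * z
    return total_wf / total_w
```

With many samples the estimate approaches the true mean 4/3; the normalizing constant Z never needs to be computed, since it cancels in the self-normalized ratio.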

3 Monte Carlo inference with proposal programs

A proposal program is a probabilistic program that takes input x and makes random choices, where each random choice is given an address a ∈ A for some countable set A, and where some subset B ⊆ A of the addresses are the output choices. Random choices that are not output choices are called internal choices. We assume that all output choices are realized in all possible executions of the program. We assume that all random choices are discrete—a rigorous treatment of continuous random choices is left for future work. A trace of a probabilistic program is a mapping t: A' → V from the addresses A' ⊆ A of the random choices realized in an execution to their values, where V is the set of all possible values for a random choice. A function u: B → V from output addresses to values is called an output trace. Let t|_B denote the output trace constructed by restricting a trace t to the addresses B. Write t ∼ u if t(a) = u(a) for all a ∈ B (i.e. if the trace agrees with an output trace on the values of the output choices). Let p(t(a) | t_{<a}; x) denote the probability of the choice with address a taking value t(a) given the state of the program at that point in its execution, as determined by the values of the previous random choices t_{<a}. The probability of a trace t on input x is defined as:

p(t; x) = ∏_{a ∈ A'} p(t(a) | t_{<a}; x)
We assume that it is possible to execute the program, possibly with some output random choices fixed to given values, and to compute the probabilities p(t(a) | t_{<a}; x) for output random choices a ∈ B. Let t ∼ P(·; x) denote a trace generated by executing the program on input x, and let t ∼ Q(·; x, u) denote a trace generated by executing the program on input x where the output choices are fixed to the values given in an output trace u instead of being sampled (so that t ∼ u). Note that internal random choices can occur before or after output choices in program execution, and internal random choices may depend on the values of output random choices. We require that Q(t; x, u) > 0 implies P(t; x) > 0 for all t and u. That is, any trace that can be generated by executing the program with output choices fixed to u must also be possible under standard program execution. Let the product of the probabilities of the output choices in a trace given the previous choices be denoted p_B(t; x), and similarly p_I(t; x) for the internal choices, so that p(t; x) = p_B(t; x) p_I(t; x). The marginal probability of an output trace is called the proposal probability, and is given by:

λ(u; x) = Σ_{t ∼ u} p(t; x) = E_{t ∼ Q(·; x, u)}[ p_B(t; x) ]

P: Probabilistic program
A: All addresses of possible random choices
A': Addresses of realized random choices in an execution
V: Possible values for random choices
t: Trace (record of random choices in an execution)
B: Output random choices
x: Inputs to program
u: Output trace
t|_B: Output trace generated by restricting a trace to outputs
t ∼ P(·; x): Sampling a trace by executing program on input
t ∼ Q(·; x, u): Sampling a trace with constrained output trace
λ(u; x): Proposal probability
Figure 1: Notation for proposal programs
procedure simulate(P, x)
     t ∼ P(·; x)
     u ← t|_B ▷ Extract output trace
     ξ ← λ(u; x) ▷ Compute proposal probability
     return (u, ξ)
end procedure
procedure assess(P, x, u)
     ξ ← λ(u; x) ▷ Compute proposal probability
     return ξ
end procedure
(a) Idealized proposal program interface
procedure simulate(P, x, k)
     t_1 ∼ P(·; x)
     u ← t_1|_B ▷ Extract output trace
     for i = 2 to k do
          t_i ∼ Q(·; x, u) ▷ Execute with output choices fixed to u
     end for
     ξ ← (1/k) Σ_{i=1}^k p_B(t_i; x) ▷ Estimate proposal probability
     return (u, ξ)
end procedure
procedure assess(P, x, u, k)
     for i = 1 to k do
          t_i ∼ Q(·; x, u) ▷ Execute with output choices fixed to u
     end for
     ξ ← (1/k) Σ_{i=1}^k p_B(t_i; x) ▷ Estimate proposal probability
     return ξ
end procedure
(b) Approximate proposal program interface
Figure 2: An interface that is sufficient for using proposal programs in importance sampling (IS) and Metropolis-Hastings (MH). (a) shows an ideal implementation of the interface that is intractable in the general case. (b) shows an approximate implementation of the interface that returns estimates in place of proposal probabilities. Theorem 1 and Theorem 2 show that IS and MH remain asymptotically consistent even with the approximate implementation in (b).

In order to use the marginal distribution λ(·; x) on outputs of a proposal program as a proposal distribution within a standard importance sampling (IS) algorithm for a target distribution π(·; x), we require the ability to sample u ∼ λ(·; x) from the proposal distribution and to evaluate the proposal probability λ(u; x). The proposal probability is needed to compute the importance weight p̃(u; x) / λ(u; x). To use the proposal program inside standard Metropolis-Hastings (MH), we require the ability to sample from the proposal and evaluate the forward proposal probability λ_fwd for the resulting sample z′ (where the input x may contain the previous iterate z), as well as the ability to compute the reverse proposal probability λ_rev (the probability that the proposal generates the reverse move, where the input may contain the proposed iterate z′). The forward and reverse proposal probabilities are needed to compute the acceptance probability:

α = min{ 1, ( p̃(z′) λ_rev ) / ( p̃(z) λ_fwd ) }
Figure 1(a) shows an interface for proposal programs that, together with the ability to evaluate p̃, is sufficient for implementing IS and MH. The simulate procedure executes the proposal program and returns the output trace u and the proposal probability λ(u; x). The assess procedure takes an output trace u and returns λ(u; x). However, evaluating λ(u; x) is intractable in the general case, because Equation (2) contains a number of terms that is exponential in the number of internal random choices. Therefore, we propose an approximate implementation of the interface, shown in Figure 1(b). The approximate simulate procedure executes the proposal program and returns the output trace u together with a specific estimate ξ of the proposal probability λ(u; x). The approximate assess procedure takes an output trace u and returns a specific estimate of λ(u; x). Algorithm 1 shows an IS algorithm that uses the approximate simulate procedure of Figure 1(b), and Algorithm 2 shows an MH transition operator that uses the approximate simulate and assess procedures. The remainder of the section shows that substituting the approximate procedures in Figure 1(b) in place of the idealized procedures in Figure 1(a) within IS and MH preserves the asymptotic consistency of both algorithms.
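To make the approximate interface concrete, here is a minimal sketch (illustrative only; the two-choice program, addresses, and constants are hypothetical, not from the paper) of simulate and assess in the style of Figure 1(b), for a program with one internal coin and one boolean output choice. The estimates average the output-choice probability p_B over one forward execution plus k − 1 constrained executions (simulate), or over k constrained executions (assess):

```python
import random

def flip(p):
    return random.random() < p

def run(fixed=None):
    """Execute the toy program; output choices in `fixed` are constrained.

    Returns (trace, p_out), where p_out is the product of the probabilities
    of the output choices given the preceding choices (here, one output).
    """
    trace = {}
    coin = flip(0.5)                      # internal choice at address "coin"
    trace["coin"] = coin
    p = 0.9 if coin else 0.2              # output probability depends on it
    out = flip(p) if fixed is None else fixed["out"]  # constrained execution
    trace["out"] = out
    p_out = p if out else 1.0 - p
    return trace, p_out

def simulate(k):
    """One forward run plus k-1 constrained runs; average p_out values."""
    trace, p1 = run()
    u = {"out": trace["out"]}             # extract output trace
    ests = [p1] + [run(fixed=u)[1] for _ in range(k - 1)]
    return u, sum(ests) / k

def assess(u, k):
    """k constrained runs; unbiased estimate of the proposal probability."""
    return sum(run(fixed=u)[1] for _ in range(k)) / k
```

For this program the true proposal probability of out = True is 0.5 · 0.9 + 0.5 · 0.2 = 0.55, and averaging many assess estimates recovers it.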

Target distribution π(·; x); unnormalized target distribution p̃(·; x) = Z(x) π(·; x) for some Z(x) > 0; test function f; proposal program P; proposal arguments x; integers n, k.
for i = 1 to n do
     (u_i, ξ_i) ← simulate(P, x, k)
     w_i ← p̃(u_i; x) / ξ_i
end for
return ( Σ_{i=1}^n w_i f(u_i) ) / ( Σ_{i=1}^n w_i )
Algorithm 1 Importance sampling using a proposal program
Theorem 1.

If π(·; x) is a target distribution, P is a proposal program such that λ(u; x) > 0 for all u with π(u; x) > 0, and f is a function with finite expectation under π, then the importance sampling estimate returned by Algorithm 1 using the approximate simulate procedure of Figure 1(b) converges almost surely to E_π[f] as n → ∞.

See Appendix A.1 for the proof, which is based on an auxiliary variable construction. Algorithm 2 shows a Metropolis-Hastings transition operator that uses the simulate and assess procedures of Figure 1(b).

Unnormalized distribution p̃(·; x); proposal program P; integer k; previous iterate z.
(z′, ξ_fwd) ← simulate(P, x, k)
ξ_rev ← assess(P, x, z, k)
α ← min{ 1, ( p̃(z′; x) ξ_rev ) / ( p̃(z; x) ξ_fwd ) }
r ∼ Uniform(0, 1)
if r ≤ α then
     return z′
else
     return z
end if
Algorithm 2 Metropolis-Hastings transition using a proposal program
Theorem 2.

The transition operator of Algorithm 2 using the approximate simulate and assess procedures of Figure 1(b) admits π as a stationary distribution.

See Appendix A.2 for the proof. Note that this transition operator may be composed with other transition operators to construct an ergodic Markov chain. The non-asymptotic accuracies of Algorithm 1 and Algorithm 2 depend on the choice of k, the number of executions of the program used within each invocation of simulate or assess. As k → ∞, the algorithms behave like the corresponding IS and MH algorithms using the exact proposal probabilities. For finite k, the deviation of these algorithms from the limit depends on the details of the proposal program. If the output trace is independent of the internal random choices, or if the internal random choices are deterministic, then k = 1 already provides exact proposal probabilities. We expect that small k may be sufficient for proposal programs which have low mutual information between the internal trace and the output trace. A detailed analysis of how performance depends on k is left for future work.
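An end-to-end sketch of an MH transition with estimated proposal probabilities, in the style of Algorithm 2 (a toy, self-contained example with a hypothetical boolean target and a hypothetical one-coin proposal program, not code from the paper):

```python
import random

# Toy setup: the latent z is a single boolean, the unnormalized target is
# p_tilde(True) = 3 and p_tilde(False) = 1 (so pi(True) = 0.75), and the
# proposal program has one internal coin that shifts the output probability.
# simulate/assess estimate the proposal probability by averaging the
# output-choice probability p_B over executions, as in Figure 1(b).

def p_tilde(z):
    return 3.0 if z else 1.0

def run(fixed=None):
    coin = random.random() < 0.5            # internal choice
    p = 0.9 if coin else 0.2
    out = fixed if fixed is not None else (random.random() < p)
    p_out = p if out else 1.0 - p           # p_B of this trace
    return out, p_out

def simulate(k):
    out, p1 = run()                         # forward execution
    ests = [p1] + [run(fixed=out)[1] for _ in range(k - 1)]
    return out, sum(ests) / k

def assess(out, k):
    return sum(run(fixed=out)[1] for _ in range(k)) / k

def mh_step(z, k):
    """One Metropolis-Hastings transition in the style of Algorithm 2."""
    z_new, lam_fwd = simulate(k)            # proposed value + estimate
    lam_rev = assess(z, k)                  # reverse-move estimate
    alpha = min(1.0, (p_tilde(z_new) * lam_rev) / (p_tilde(z) * lam_fwd))
    return z_new if random.random() < alpha else z
```

Running the chain long enough, the empirical frequency of z = True approaches the target probability 3/4, even though only estimates of the proposal probability are used.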

The procedures in Figure 1(b) can be easily implemented in a sampling-based probabilistic programming runtime that records a probability for each random choice in a trace or ‘random database’ wingate2011lightweight . We implemented such a runtime for Gen.jl cusumano2017gen , a probabilistic programming language embedded in Julia bezanson2017julia . In Gen.jl, a probabilistic program is a Julia function that has been annotated with the @probabilistic keyword, and in which some random choices have been annotated with unique addresses using the @choice keyword.

4 Learning proposal program parameters

This section describes a technique to learn the parameters of a proposal program during an offline ‘inference compilation’ phase, prior to use in importance sampling or Metropolis-Hastings. Let π(·; x) denote a family of target distributions indexed by the problem instance x. We consider a proposal program parameterized by θ, with distribution on traces denoted P(t; x, θ) and proposal probabilities λ(u; x, θ). We assume that it is possible to sample from a training distribution D(x, u) on the arguments x of the proposal program and the output trace u of the proposal program. We factor the training distribution into a distribution D(x) on arguments, and a desired distribution D(u | x) on outputs given arguments. If the model in which we are doing inference is a generative model p(z, y) on latent variables z and data (observations) y, then we can define x = y, D(x) = p(y), and take the desired output distribution D(u | x) to be the posterior p(z | y). For this choice of D, sampling from the training distribution is achieved by ancestral sampling of the latents and observations from the generative model. A different training distribution, based on gold-standard SMC sampling instead of joint generative simulation, was previously used in ritchie2016neurally . Given D, we seek parameters θ that solve the following optimization problem, where KL denotes the Kullback-Leibler (KL) divergence:

θ* = argmin_θ E_{x ∼ D} [ KL( D(· | x) ‖ λ(·; x, θ) ) ]
This is equivalent to maximizing the expected conditional log likelihood of the training data:

J(θ) = E_{(x, u) ∼ D} [ log λ(u; x, θ) ]
Because the proposal program may use internal random choices, it may not be possible to evaluate λ(u; x, θ), making direct optimization of J(θ) difficult. Therefore, we instead maximize a lower bound on J(θ), obtained by applying Jensen’s inequality to λ(u; x, θ) = E[ p_B(t; x, θ) ] under constrained execution:

J(θ) ≥ L_k(θ) = E_{(x, u) ∼ D} E_{t_1, …, t_k ∼ Q(·; x, u, θ)} [ log ( (1/k) Σ_{j=1}^k p_B(t_j; x, θ) ) ]

We maximize L_k(θ) using stochastic gradient ascent. Because the internal random choices are themselves sampled from a distribution that depends on θ, the gradient of L_k(θ) contains a score-function term involving ∇_θ log p_I for the internal choices in addition to the direct term involving ∇_θ log p_B for the output choices. We use an unbiased estimator of this gradient based on the ‘per-sample’ baseline of mnih2016variational : the first term accounts for the effect of θ on the internal random choices, and the second term accounts for the direct effect of θ on the output random choices. The resulting algorithm is shown below:

Training distribution D; proposal program P; integer k; initial parameters θ; mini-batch size m; step size η.
while not converged do
     for i = 1 to m do
          (x_i, u_i) ∼ D ▷ Sample from training distribution
          for j = 1 to k do
               t_{i,j} ∼ Q(·; x_i, u_i, θ) ▷ Execute with output choices fixed to u_i
          end for
     end for
     θ ← θ + η ĝ ▷ Step along the estimated gradient ĝ of the objective
end while
Algorithm 3 Offline optimization of proposal program
Algorithm 3 Offline optimization of proposal program

Algorithm 3 can be easily implemented on top of a sampling-based probabilistic programming runtime that includes reverse-mode automatic differentiation. The runtime must provide (1) the log probability log p_B(t; x, θ) of the output random choices and its gradient with respect to θ, and (2) the gradient ∇_θ log p_I(t; x, θ) of the log probability of the internal random choices. The overall gradient estimate can then be computed from the collection of these quantities for each of the constrained executions.

Note that the log probability of the internal random choices is not required (but its gradient is). Recall that the log probability of the internal random choices is the sum of the log probabilities of each internal random choice in the internal trace. For internal random choices whose distributions do not depend deterministically on θ, the log probability contribution does not depend on θ, and its gradient with respect to θ is zero. Therefore, it suffices for the runtime to instrument only the internal random choices that depend deterministically on θ. This permits the use of black-box code within the proposal program, which can reduce the overhead introduced by the probabilistic runtime in performance-critical sections of the proposal, and is important for our proposed methodology for using proposal programs in Section 5. We implemented Algorithm 3 on top of the Gen.jl runtime cusumano2017gen . In Gen.jl, the user annotates random choices in a probabilistic program with a unique identifier. The user is free to use randomness in a probabilistic program without annotating it. In our implementation of Algorithm 3, we only accumulate the gradient taking into account random choices that were annotated, relying on the user to ensure that there is no deterministic dependency of un-annotated random choices on θ. In future work, automatic dependency analysis based on the approach of schulman2015gradient could remove this reliance on the user.
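The point that only the gradient of the internal log probability is needed can be checked on a toy program (hypothetical, not from the paper): one internal coin c ~ Bernoulli(sigmoid(θ)) and a fixed output whose probability is 0.9 if c else 0.2. The score-function estimator below is unbiased for the analytic gradient of the k = 1 objective E_c[log p_B]:

```python
import math, random

# For a logit-parameterized Bernoulli, d/dtheta log q(c; theta) = c - s,
# where s = sigmoid(theta). The objective J(theta) = E_c[log p_B] has the
# closed-form gradient s(1-s)(log 0.9 - log 0.2), which the single-sample
# score-function estimate log p_B(c) * (c - s) matches in expectation.

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def grad_estimate(theta):
    s = sigmoid(theta)
    c = 1 if random.random() < s else 0       # internal choice
    log_pb = math.log(0.9 if c else 0.2)      # output log probability
    return log_pb * (c - s)                   # score-function term only

def analytic_grad(theta):
    s = sigmoid(theta)
    return s * (1.0 - s) * (math.log(0.9) - math.log(0.2))
```

Averaging many estimates matches the closed-form derivative; note that the estimator never evaluates the internal log probability itself, only its gradient via the score function.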

5 Application: Proposals based on randomized mode-finding algorithms

Figure 3: Proposed high-level data flow for a proposal program based on a heuristic randomized algorithm. x is the input to the program, θ are the parameters of the program to be optimized, φ are the parameters of the heuristic algorithm that are difficult to set manually, and u is the output of the proposal program. The internal random choices of the proposal program include φ and any internal random choices made by the heuristic algorithm, which need not be instrumented by the probabilistic runtime.

Proposal programs permit the inference programmer to express their knowledge about the target distribution in the form of a sampling program. Consider the case when the user possesses a heuristic randomized algorithm that finds areas of high probability under the target distribution. Although such an algorithm contains some knowledge about the target distribution, the algorithm in isolation is not useful for performing sound probabilistic inference—it does not necessarily attempt to represent the variability within the mode(s) of the target distribution, and its support may not even include the posterior support. The heuristic may also be brittle—it may have parameters that need to be tuned for it to work on a particular problem instance, or it may fail completely on some problem instances.

Because Algorithm 3 minimizes the KL divergence from the posterior (or its proxy) to the proposal distribution, optimization will tend to force the proposal to have sufficient variability to cover the posterior mass, which provides a degree of robustness even when the heuristic is brittle. On problem instances where the heuristic gives wrong answers, the proposal variability should increase. On problem instances where the heuristic produces values near the posterior modes, the proposal variability should attempt to match the variability of each posterior mode.

To construct robust and sound probabilistic inference algorithms that can take advantage of heuristic randomized algorithms, we construct proposal programs that use the heuristic algorithm as a subroutine, and use the proposal program in importance sampling (Algorithm 1) or MCMC (Algorithm 2). Let x denote the problem instance (e.g. the observed data or context variables), and let φ, r, and v denote the parameters, internal random choices, and output of the heuristic algorithm, respectively. To construct the proposal program from the heuristic algorithm, we:

  1. Make the parameters φ that are difficult to set manually into random choices whose distribution may depend on x (other parameters can be hardcoded or can have a hardcoded dependency on x).

  2. Add random choices u that are equal to the output v of the heuristic algorithm plus some amount of noise, such that the support of u given any φ and r includes the support of the posterior. The choices u are the output choices of the proposal program.

  3. Parametrize the distribution of φ given x, and the distribution of u given v and x, using the output of neural network(s) with parameters θ. The networks take x as input and generate parameter values for the heuristic algorithm, as well as the amount of noise added to the output of the algorithm.

Figure 3 shows the high-level data flow of the resulting proposal program. We optimize the parameters θ during an offline inference compilation phase using Algorithm 3. Note that the internal random choices r of the heuristic algorithm do not depend deterministically on θ. Therefore, to use Algorithm 3, the heuristic algorithm need not be instrumented by the probabilistic runtime, and can be fast black-box code.
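A minimal sketch of steps 1 through 3 above (with a hypothetical heuristic and parameter values; the real construction in Section 6 uses RANSAC and a neural network): a black-box mode-finding heuristic returns an atomic guess, and adding Cauchy noise with a learned scale gives the output choice full support on the reals, as step 2 requires:

```python
import math, random

def heuristic_mode(data):
    """Black-box heuristic: here, just the sample median (a stand-in)."""
    s = sorted(data)
    return s[len(s) // 2]

def cauchy(loc, scale):
    """Sample from a Cauchy distribution via the inverse CDF."""
    return loc + scale * math.tan(math.pi * (random.random() - 0.5))

def propose(data, noise_scale):
    """Output choice: atomic heuristic guess plus heavy-tailed noise."""
    return cauchy(heuristic_mode(data), noise_scale)
```

The heavy Cauchy tails mean the proposal never assigns zero density to any region of the real line, so a posterior mode the heuristic misses is still reachable; in the full construction, noise_scale would be produced by a neural network with parameters θ.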

6 Example: Proposal program combining RANSAC with a neural network

function ransac(xs, ys, params::RANSACParams)
    best_num_inliers::Int = -1
    best_slope::Float64 = NaN
    best_intercept::Float64 = NaN
    for i=1:params.num_iters

        # Select a random pair of points
        rand_ind = StatsBase.sample(1:length(xs), 2, replace=false)
        subset_xs = xs[rand_ind]
        subset_ys = ys[rand_ind]

        # Estimate slope and intercept using least squares
        A = hcat(subset_xs, ones(length(subset_xs)))
        slope, intercept = A \ subset_ys

        # Count the number of inliers for this (slope, intercept) hypothesis
        ypred = intercept .+ slope .* xs
        inliers = abs.(ys .- ypred) .< params.epsilon
        num_inliers = sum(inliers)

        if num_inliers > best_num_inliers
            best_slope, best_intercept = slope, intercept
            best_num_inliers = num_inliers
        end
    end

    # Return the hypothesis that resulted in the most inliers
    (best_slope, best_intercept)
end
Figure 4: A Julia implementation of a RANSAC-based heuristic algorithm for generating hypotheses (lines parameterized by a slope and intercept) that explain a given data set of x-y coordinates.
@probabilistic function ransac_neural_proposal(xs, ys, params)

    # Generate parameters for RANSAC using learned parameters
    epsilon = @choice(gamma(exp(params.eps_alpha),
                            exp(params.eps_beta)), "epsilon")
    num_iters = @choice(categorical(params.iter_dist), "iters")
    ransac_params = RANSACParams(num_iters, epsilon)

    # Run RANSAC (uses many un-annotated random choices)
    slope_guess, intercept_guess = ransac(xs, ys, ransac_params)

    # Predict output variability using learned neural network
    nn_hidden = ewise(sigmoid, params.h_weights * vcat(xs, ys) + params.h_biases)
    nn_out = params.out_weights * nn_hidden + params.out_biases
    slope_scale, intercept_scale = (exp(nn_out[1]), exp(nn_out[2]))

    # Add noise
    slope = @choice(cauchy(slope_guess, slope_scale), "slope")
    intercept = @choice(cauchy(intercept_guess, intercept_scale), "intercept")

    # Generate outlier statuses from conditional distribution
    for (i, (x, y)) in enumerate(zip(xs, ys))
        p_outlier = conditional_outlier(x, y, slope, intercept)
        @choice(flip(p_outlier), "outlier-$i")
    end
end
Figure 5: Proposal program in Gen.jl that uses RANSAC and a neural network to predict the line, followed by conditional sampling for the outlier variables. The output choices are "slope", "intercept", and "outlier-1", "outlier-2", etc. In Gen.jl, the set of output choices is defined when querying the program using the methods of Figure 1(b), not in the program text itself.
@probabilistic function neural_proposal(xs, ys, params)

    # Generate randomness
    dim = 2
    latent = mvnormal(zeros(dim), eye(dim))

    # Construct feature vector
    features = vcat(latent, xs, ys)

    # Use neural network to predict parameters of line distribution
    hidden = ewise(sigmoid, params.hidden_weights * features + params.hidden_biases)
    output = params.output_weights * hidden + params.output_biases
    slope_mu = output[1]
    intercept_mu = output[2]
    slope_scale = exp(output[3])
    intercept_scale = exp(output[4])

    # Generate line
    slope = @choice(cauchy(slope_mu, slope_scale), "slope")
    intercept = @choice(cauchy(intercept_mu, intercept_scale), "intercept")

    # Generate outlier statuses from conditional distribution
    for (i, (x, y)) in enumerate(zip(xs, ys))
        p_outlier = conditional_outlier(x, y, slope, intercept)
        @choice(flip(p_outlier), "outlier-$i")
    end
end
Figure 6: Proposal program in Gen.jl that uses a neural network to predict the line, followed by conditional sampling for the outlier variables

We illustrate Algorithm 1 and the approach of Section 5 on a Bayesian linear regression with outliers inference problem. The data set is a set of x-y pairs; the latent variables consist of the parameters of the line (a slope and an intercept) and binary variables indicating whether each data point is an outlier or not. The generative model, which is conditioned on the x-coordinates, samples the line parameters and the per-point outlier indicators from priors, and then samples each y-coordinate given the line, with outliers subject to larger observation noise. The unnormalized target distribution is the joint probability of the latent variables and the observed y-coordinates.

Random sample consensus (RANSAC, fischler1981random ) is an iterative algorithm that can quickly find lines near the modes of the posterior for this problem, using a deterministic least-squares algorithm within each iteration. Figure 4 shows a Julia implementation of a RANSAC algorithm for our linear-regression with outliers problem. However, the output distribution of the algorithm is concentrated on an atomic set that does not sufficiently support the posterior distribution on lines, which is defined on ℝ². Also, the algorithm contains a parameter epsilon that needs to be roughly calibrated to the inlier variability in a particular data set, and it is not clear how many iterations of the algorithm to run.

To make use of RANSAC for robust and sound probabilistic inference, we wrote a proposal program in Gen.jl following the methodology proposed in Section 5, optimized the unknown parameters of the resulting proposal program using Algorithm 3, and employed the optimized proposal program within importance sampling (Algorithm 1). The proposal program, shown in Figure 5, (1) generates the parameter epsilon and the number of iterations of the RANSAC algorithm from distributions whose parameters are part of θ, then (2) runs RANSAC, (3) adds Cauchy-distributed noise to the resulting line parameters (slope and intercept), where the variability of the noise is determined by the output of a neural network whose parameters are also part of θ, and finally (4) samples the outlier binary variables from their conditional distributions given the line. Note that the random choices inside the ransac function are not annotated—the RANSAC procedure is executed as regular Julia code by Gen.jl’s probabilistic runtime without incurring any overhead.

(a) IS (prior)
(b) IS (RANSAC+NN)
(c) IS (NN)
(d) Training objective
Figure 7: (a) shows a dataset (points), and approximate posterior samples (lines) from an importance sampling algorithm using a prior proposal. (b) shows samples produced by Algorithm 1 using a proposal program (RANSAC+NN, Figure 5) that combines RANSAC with a neural network. (c) shows samples produced by Algorithm 1 using a proposal program (NN, Figure 6) based on a neural network alone. Six particles were used for both proposals. Significantly more accurate samples are obtained with comparable throughput using the RANSAC+NN proposal program. (d) shows the estimated approximation error of the three proposals as the parameters of the proposals are tuned offline using ADAM. Error is quantified using the expected KL divergence from the target distribution to the proposal distribution, up to an unknown constant that does not depend on the proposal distribution, where the expectation is taken over datasets sampled from the model.

For comparison, we also implemented a proposal program (Figure 6) that does not use RANSAC, but instead generates the slope and intercept of the line from a distribution parameterized by a neural network. We optimized both proposal programs using a modified version of Algorithm 3 that uses ADAM kingma2014adam instead of standard SGD, with the generative model as the training distribution, training with a fixed number of constrained executions per sample, a fixed mini-batch size, and a fixed number of iterations. The results are shown and discussed in Figure 7.

7 Related work

Some probabilistic programming systems support combining automated and custom inference strategies mansinghka2014venture . However, custom proposal distributions in mansinghka2014venture require user implementation of a density-evaluation function for the proposal (mirroring the interface of Figure 1(a)). Other probabilistic programming systems support the specification of ‘guide programs’ for use in importance sampling, sequential Monte Carlo, or variational inference ritchie2016neurally ; ritchie2016deep . However, guide programs either follow the control flow of a model program or are restricted to using the same set of random choices as the model. We focus on using general probabilistic programs to define proposal distributions, whether or not the model itself is a probabilistic program.

The Monte Carlo algorithms presented here derive their theoretical justification from auxiliary-variable constructions similar to those of pseudo-marginal MCMC andrieu2009 and the random-weight particle filter fearnhead2008particle . In particular, our Metropolis-Hastings algorithm can be seen as an application of the general auxiliary-variable MCMC formalism of storvik2011flexibility to the setting where proposals are defined as probabilistic programs.

Recent work in ‘amortized’ or ‘compiled’ inference has studied techniques for offline optimization of proposal distributions in importance sampling or sequential Monte Carlo stuhlmuller2013learning ; paige2016inference ; ritchie2016neurally ; le2016inference . Others have applied similar approaches to optimize proposal distributions for use in MCMC cusumano2017probabilistic ; wang2017neural . However, these efforts focus on optimizing over neurally-parameterized distributions that inherit their structure from the generative model being targeted and do not contain their own internal random choices, and that are therefore not suited to use with heuristic randomized algorithms. In contrast, we seek to allow the user to easily express and compute with custom proposal distributions that are defined as arbitrary probabilistic programs, independent of any structure in the probabilistic model being targeted, and that may include ‘internal’ random choices not present in the target model.

There has been much recent work in probabilistic machine learning on training generative latent variable models using stochastic gradient descent kingma2013auto ; mnih2014neural ; mnih2016variational . Our procedure for optimizing proposal programs is an application of the Monte Carlo variational objective approach of mnih2016variational to the setting where the generative model is itself a proposal distribution in a different probabilistic model. The observation that score-function gradient estimators permit optimization of probabilistic computations that utilize black-box randomized code has been made previously and used in reinforcement learning williams1992simple ; schulman2015gradient .

8 Discussion

This paper formalized proposal programs, discussed how to implement them in a sampling-based probabilistic runtime, showed how to use them within importance sampling and Metropolis-Hastings and how to optimize them offline, and suggested an application of the formalism to using randomized heuristics to accelerate Monte Carlo inference. Several directions for future work seem promising.

First, we note that the efficiency of Monte Carlo inference with proposal programs depends on the number of internal replicates used within the proposed simulate and assess procedures. It is important to better characterize how the efficiency depends on this number. Also, the proposed simulate and assess procedures use forward execution of the proposal program to generate the traces from which the proposal probability is estimated. Following the 'probabilistic module' formalism cusumano2016encapsulating , other distributions on traces that approximate the conditional distribution on traces given output choices can be used instead, yielding more accurate proposal probability estimates (and therefore more efficient Monte Carlo inference for a fixed proposal program). These distributions, which constitute nested inference samplers, should be able to reuse existing inference machinery in probabilistic programming languages that was originally developed for inference in model programs. It also seems promising to explore proposal programs that combine neural networks and domain-specific heuristic algorithms in different ways. For example, a proposal program may contain a switch statement that decides whether to propose using a heuristic (as in Section 5) or according to a pure neural network proposal (as shown in Figure 6), where the probability of taking each path can itself be predicted from the problem instance by a neural network. The proposal program formalism also suggests that other approaches to inference in probabilistic programs, such as symbolic integration gehr2016psi , could find applications in estimating or computing proposal probabilities. Finally, a rigorous theoretical analysis of importance sampling, Metropolis-Hastings, and offline optimization with proposal programs that contain continuous random variables would be useful.
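The role of the internal replicates can be made concrete with a small sketch. The following Python code is an illustrative rendering, not the paper's implementation, and all names (`simulate`, `assess`, `n_replicates`) are hypothetical: a toy proposal program makes one internal random choice z ~ N(0,1) and one output choice y ~ N(z,1), and the two procedures estimate the program's marginal output probability by rerunning it with the output choice pinned and averaging the pinned output's probability over the replicated traces.

```python
import math
import random

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# Toy proposal program (illustrative): internal choice z ~ N(0,1),
# output choice y ~ N(z,1).  Its marginal output density is N(y; 0, sqrt(2)),
# which is known in closed form only because the example is deliberately tiny.

def simulate(n_replicates, rng):
    """Run the program forward once, then estimate the marginal probability
    of the sampled output using n_replicates traces with the output pinned."""
    z = rng.gauss(0.0, 1.0)            # internal random choice
    y = rng.gauss(z, 1.0)              # output random choice
    # Each trace's weight is the probability of the pinned output given
    # the trace's internal choices.
    weights = [normal_pdf(y, z, 1.0)]
    for _ in range(n_replicates - 1):
        z_i = rng.gauss(0.0, 1.0)      # rerun with output fixed to y
        weights.append(normal_pdf(y, z_i, 1.0))
    xi_hat = sum(weights) / n_replicates
    return y, xi_hat

def assess(y, n_replicates, rng):
    """Estimate the marginal probability of a given output y."""
    weights = [normal_pdf(y, rng.gauss(0.0, 1.0), 1.0)
               for _ in range(n_replicates)]
    return sum(weights) / n_replicates
```

Because each weight averages the output probability over independently resampled internal choices, the estimate is unbiased for the marginal output density, and increasing `n_replicates` only reduces its variance, which is the efficiency trade-off discussed above.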


Acknowledgments

This research was supported by DARPA (PPAML program, contract number FA8750-14-2-0004), IARPA (under research contract 2015-15061000003), the Office of Naval Research (under research contract N000141310333), the Army Research Office (under agreement number W911NF-13-1-0212), and gifts from Analog Devices and Google. This research was conducted with Government support under and awarded by DoD, Air Force Office of Scientific Research, National Defense Science and Engineering Graduate (NDSEG) Fellowship, 32 CFR 168a.


Appendix A Proofs

A.1 Consistency of importance sampling using a proposal program

Consider a proposal program $\mathcal{P}$ with input $x$ and trace distribution $p(u; x)$, and a target distribution $\pi$ on outputs, such that for all $y$ where $\pi(y) > 0$ there exists a trace $u$ of $\mathcal{P}$ where $\mathrm{out}(u) = y$ and $p(u; x) > 0$. Let $q(u; x, y)$ denote the distribution on traces obtained by executing $\mathcal{P}$ on input $x$ with the output choices fixed to $y$, and define the weight $w(u) = p(u; x) / q(u; x, \mathrm{out}(u))$. Also assume that $p(u; x) > 0$ for all traces $u$. Consider the following extended target distribution on pairs of an index $j \in \{1, \ldots, N\}$ and traces $u_{1:N}$ that agree on their output choices $y = \mathrm{out}(u_1) = \cdots = \mathrm{out}(u_N)$:

$$\tilde{\pi}(j, u_{1:N}) = \pi(y) \left[ \prod_{i=1}^{N} q(u_i; x, y) \right] \frac{w(u_j)}{\sum_{i=1}^{N} w(u_i)}$$

Note that $\tilde{\pi}$ could hypothetically be sampled from by first sampling $y$ from $\pi$, then executing $\mathcal{P}$ $N$ times using fixed output choices $y$ to produce $u_{1:N}$, and finally sampling the index $j$ in proportion to $w(u_j)$. Consider the following extended proposal distribution:

$$\tilde{q}(j, u_{1:N}) = \frac{1}{N} \, p(u_j; x) \prod_{i \neq j} q(u_i; x, \mathrm{out}(u_j))$$

Note that $\tilde{q}$ can be sampled from by first sampling an index $j$ uniformly from $\{1, \ldots, N\}$, then executing $\mathcal{P}$ to produce $u_j$, extracting the output choices $y = \mathrm{out}(u_j)$, and executing $\mathcal{P}$ an additional $N-1$ times using fixed output choices $y$ to produce $u_i$ for $i \neq j$. First note that $\tilde{q}(j, u_{1:N}) > 0$ whenever $\tilde{\pi}(j, u_{1:N}) > 0$, since $\tilde{\pi}(j, u_{1:N}) > 0$ implies $p(u_j; x) > 0$ and $q(u_i; x, y) > 0$ for all $i \neq j$. Using $p(u_j; x) = w(u_j)\, q(u_j; x, y)$, the importance weight for the extended target distribution and the extended proposal distribution is:

$$\frac{\tilde{\pi}(j, u_{1:N})}{\tilde{q}(j, u_{1:N})} = \frac{\pi(y) \left[ \prod_{i=1}^{N} q(u_i; x, y) \right] w(u_j) \,/\, \sum_{i=1}^{N} w(u_i)}{\frac{1}{N}\, w(u_j) \prod_{i=1}^{N} q(u_i; x, y)} = \frac{\pi(y)}{\frac{1}{N} \sum_{i=1}^{N} w(u_i)} \tag{11}$$

Note that invoking simulate in Algorithm 1 is equivalent to sampling from the extended proposal distribution $\tilde{q}$, and that the importance weight in Algorithm 1 is equal to Equation (11). Therefore, Algorithm 1 is a standard self-normalized importance sampler for the extended target distribution $\tilde{\pi}$. By convergence of standard self-normalized importance sampling, the self-normalized estimate of the expectation of any bounded function of the output converges almost surely to its expectation under $\pi$ as the number of particles grows.
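As a concrete sketch of this construction, the following illustrative Python code (names hypothetical, not the paper's API) implements a self-normalized importance sampler in the shape of Algorithm 1 for a toy proposal program with internal choice z ~ N(0,1) and output y ~ N(z,1): each particle comes from `simulate`, which returns an output together with an estimate of its proposal probability, and each particle's weight is the target density divided by that estimate, matching the form of Equation (11).

```python
import math
import random

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def simulate(n, rng):
    # Toy proposal program: z ~ N(0,1) internally, output y ~ N(z,1).
    z = rng.gauss(0.0, 1.0)
    y = rng.gauss(z, 1.0)
    ws = [normal_pdf(y, z, 1.0)] + \
         [normal_pdf(y, rng.gauss(0.0, 1.0), 1.0) for _ in range(n - 1)]
    return y, sum(ws) / n        # output and estimated proposal probability

def importance_sample(target_density, n_particles, n_replicates, rng):
    """Self-normalized importance sampling with estimated proposal
    probabilities; returns an estimate of the target's mean."""
    ys, ws = [], []
    for _ in range(n_particles):
        y, xi_hat = simulate(n_replicates, rng)
        ys.append(y)
        ws.append(target_density(y) / xi_hat)   # weight = pi(y) / xi_hat
    total = sum(ws)
    return sum(w * y for w, y in zip(ws, ys)) / total
```

For example, with a N(1, 1) target the weighted mean of the particles converges to 1 even though each weight uses only a noisy estimate of the proposal probability, as the extended-space argument above guarantees.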

A.2 Stationary distribution for Metropolis-Hastings using a proposal program

Algorithm 2 defines a transition operator that takes as input an output trace of proposal program $\mathcal{P}$, denoted $t$, and produces as output another output trace, denoted $t_{\mathrm{new}}$. Note that other state besides $t$ (i.e. state that is not mutated by the transition operator) can be included in the input to the operator, but we write $t$ to simplify notation. Let $\delta(\cdot, \cdot)$ denote the Kronecker delta. Let $\bar{u}$ denote the tuple $(j, u_{1:N})$ and $\bar{u}'$ the tuple $(j', u'_{1:N})$. As in Appendix A.1, $p(u; s)$ denotes the trace distribution of $\mathcal{P}$ on input $s$, $q(u; s, t')$ denotes the distribution on traces obtained by executing $\mathcal{P}$ on input $s$ with output choices fixed to $t'$, and $w(u) = p(u; s) / q(u; s, \mathrm{out}(u))$ where $s$ is the input on which the trace $u$ was generated. To show that this transition operator admits the target distribution $\pi$ as a stationary distribution, we first define an extended target distribution on tuples $(t, j, u_{1:N})$ where $t$ and $\mathrm{out}(u_j)$ are both output traces of $\mathcal{P}$, where $j \in \{1, \ldots, N\}$, and where each $u_i$ for $i \in \{1, \ldots, N\}$ is a trace of $\mathcal{P}$:

$$\tilde{\pi}(t, j, u_{1:N}) = \pi(t)\, \frac{1}{N}\, p(u_j; t) \prod_{i \neq j} q(u_i; t, \mathrm{out}(u_j))$$

One could hypothetically sample from this distribution by first sampling $t$ from $\pi$, then sampling $j$ uniformly from $\{1, \ldots, N\}$, then executing the program on input $t$ to produce trace $u_j$, extracting the output choices $t^\star = \mathrm{out}(u_j)$, and executing the program $N-1$ times on input $t$ with output choices fixed to the values $t^\star$ given in $u_j$, producing $u_i$ for $i \neq j$. We also define a proposal kernel on this extended space:

$$\tilde{k}\big((t, \bar{u}) \to (t^\star, \bar{u}')\big) = \delta\big(t^\star, \mathrm{out}(u_j)\big) \left[ \prod_{i=1}^{N} q(u'_i; t^\star, t) \right] \frac{w(u'_{j'})}{\sum_{i=1}^{N} w(u'_i)}$$

This kernel can be sampled from by first executing $\mathcal{P}$ on input $t^\star = \mathrm{out}(u_j)$, $N$ times, with output random choices fixed to the values in $t$, producing traces $u'_{1:N}$, then sampling $j'$ from $\{1, \ldots, N\}$ in proportion to $w(u'_{j'})$, and then setting $\bar{u}' = (j', u'_{1:N})$. Consider a Metropolis-Hastings (MH) move on the extended space, using $\tilde{k}$ as the proposal kernel and the extended target $\tilde{\pi}$ as the target. Assume that $t^\star = \mathrm{out}(u_j)$, that $\mathrm{out}(u_i) = t^\star$ for all $i$, that $\mathrm{out}(u'_i) = t$ for all $i$, and that all of the densities involved are positive. The MH acceptance ratio for this move is:

$$\frac{\tilde{\pi}(t^\star, \bar{u}')\, \tilde{k}\big((t^\star, \bar{u}') \to (t, \bar{u})\big)}{\tilde{\pi}(t, \bar{u})\, \tilde{k}\big((t, \bar{u}) \to (t^\star, \bar{u}')\big)} = \frac{\pi(t^\star)\, \frac{1}{N} \sum_{i=1}^{N} w(u'_i)}{\pi(t)\, \frac{1}{N} \sum_{i=1}^{N} w(u_i)} \tag{18}$$

Consider a sampling process that begins with $t \sim \pi$ and then executes Algorithm 2 on input $t$. Observe that the invocation of simulate in Algorithm 2 is equivalent to sampling $\bar{u}$ from the conditional distribution $\tilde{\pi}(\bar{u} \mid t)$. Therefore, after simulate, we have a sample from the extended target distribution: $(t, \bar{u}) \sim \tilde{\pi}$. Next, note that the invocation of assess in Algorithm 2, on input $t^\star$ and output choices $t$, is equivalent to sampling from the extended proposal kernel $\tilde{k}$ on input $(t, \bar{u})$. Note that $j'$ is not literally sampled in assess but could be without changing the behavior of assess. Finally, note that Equation (18) is the acceptance ratio used in the accept/reject step of Algorithm 2. Therefore, we can interpret Algorithm 2 as first extending the input sample $t$ onto the extended space by sampling from the conditional distribution $\tilde{\pi}(\bar{u} \mid t)$, and then performing an MH step on the extended space. Since $t \sim \pi$, we have that $(t, \bar{u}) \sim \tilde{\pi}$. Since the MH step on the extended space has the extended target distribution $\tilde{\pi}$ as a stationary distribution, we have $(t_{\mathrm{new}}, \bar{u}') \sim \tilde{\pi}$ and therefore $t_{\mathrm{new}} \sim \pi$. Therefore $\pi$ is a stationary distribution of Algorithm 2.

Appendix B Offline optimization of proposal programs

Let $u_1, \ldots, u_N$ be traces sampled independently from $q(\cdot; x, y, \theta)$ and let $\hat{\xi} = \frac{1}{N} \sum_{i=1}^{N} p(u_i; x, \theta) / q(u_i; x, y, \theta)$ denote the resulting estimate of the proposal probability $p(y; x, \theta)$. To see that $\mathbb{E}[\log \hat{\xi}] \le \log p(y; x, \theta)$, note that $\hat{\xi}$ is unbiased:

$$\mathbb{E}\big[\hat{\xi}\big] = \mathbb{E}_{u \sim q(\cdot; x, y, \theta)}\left[\frac{p(u; x, \theta)}{q(u; x, y, \theta)}\right] = \sum_{u \,:\, \mathrm{out}(u) = y} p(u; x, \theta) = p(y; x, \theta)$$

By Jensen's inequality:

$$\mathbb{E}\big[\log \hat{\xi}\big] \le \log \mathbb{E}\big[\hat{\xi}\big] = \log p(y; x, \theta)$$
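The Jensen lower bound is easy to check numerically. The sketch below is illustrative (not from the paper): it uses a toy proposal program with internal choice z ~ N(0,1) and output y ~ N(z,1), whose true marginal output density N(y; 0, sqrt(2)) is available in closed form, so the Monte Carlo average of log ξ̂ can be compared against log p(y); the average stays below the true value, and the gap shrinks as the number of replicates grows.

```python
import math
import random

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def estimate(y, n, rng):
    """Unbiased estimate of the toy program's marginal output density p(y)."""
    ws = [normal_pdf(y, rng.gauss(0.0, 1.0), 1.0) for _ in range(n)]
    return sum(ws) / n

def mean_log_estimate(y, n, n_outer, rng):
    """Monte Carlo average of log xi_hat over n_outer independent estimates."""
    return sum(math.log(estimate(y, n, rng)) for _ in range(n_outer)) / n_outer
```

With more replicates the estimator's variance falls, so the Jensen gap between the averaged log-estimate and the true log-probability tightens, consistent with the inequality above.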