Perhaps somewhat counterintuitively, most probabilistic programs are deterministic. The property that distinguishes a probabilistic program is that a probabilistic program computes the (unnormalized) probability of its execution(Wood et al., 2014). An execution is summarized as an instantiation of the program trace, and the probability is a function of the trace. Some probabilistic programming frameworks require that the trace shape is specified upfront (Carpenter et al., 2017; Tolpin, 2019), others allow introduction of trace components dynamically (Goodman et al., 2008; Pfeffer, 2009; Goodman and Stuhlmüller, 2014; Ge et al., 2018; Tolpin et al., 2016), in the course of a program execution.
A probabilistic programming framework passes the probabilistic program as an argument to an inference algorithm. The algorithm may collect any program output for later summarization, but otherwise ignores the output and uses only the trace probability for inference. In some probabilistic programming languages the locations where the probability is computed are syntactically explicit, and inference algorithms may exploit the program structure. For example, in Stan (Carpenter et al., 2017) operator and an increment of variable target update the trace probability; in Anglican (Tolpin et al., 2016), both sample and observe update the trace probability, in addition, sample adds a component to the trace; but neither nor sample introduce randomness in the computation of probability.
However, stochastic computation of trace probability naturally comes up in at least two different contexts. In one context, nuisance parameters must be marginalized: instead of inferring the posterior distribution over both parameters of interest and nuisance parameters, and marginalizing over a representation of the posterior (for example, by summing up samples in Monte Carlo approximations), the program can compute the execution probability stochastically by sampling nuisance parameters, and perform marginalization during inference. For example, stochastic computation can be used to represent flipping a coin with an unobserved outcome. The other context is the handling of nondeterminism, where the posterior distribution of the trace should be inferred under any outcome of non-deterministic choices following a certain distribution, such as for policy learning in stochastic domains (van de Meent et al., 2016). These two contexts admit different inference schemes, but both involve stochastic computation of trace probability and would benefit from consistent representation in probabilistic programs.
In this work, we formally define a stochastic probabilistic program (Section 2) and then introduce inference schemes for both marginalization and nondeterminism (Section 3). We present a reference implementation of probabilistic programming facility that support stochastic probabilistic programs and show that such programs can be supported with little extra effort compared to traditional, deterministic probabilistic programs (Section 4). In case studies (Section 5), we implement benchmark problems from the literature as stochastic probabilistic programs and compare both specification of the programs and inference in them to deterministic implementations. We also bring examples of models which are naturally represented as stochastic probabilistic programs but are difficult to implement without introducing stochasticity. Probabilistic program examples in Sections 2–3 are in a subset of the Go programming language (Go team, 2009), but should be legible for readers who are not familiar with Go.
This work brings the following contributions:
definition of a stochastic probabilistic program;
posterior inference on stochastic probabilistic programs for marginalization and nondeterminism;
a reference implementation of probabilistic programming with support for stochastic probabilistic programs.
2. A Stochastic Probabilistic Program
Different definitions of (deterministic) probabilistic programs are given in the literature and reflect different views and accents. For the purpose of this work, let us give the following broad definition:
Definition 0 ().
A deterministic probabilistic program that defines a distribution over traces is a computer program that accepts a trace assignment as one of its arguments, and returns the unnormalized probability of with respect to , conditioned on other arguments :
where is an unknown normalization constant.
The trace assignment
may have the form of a vector, or of a list of address-value pairs, or any other form suitable for a particular implementation. Accordingly, let us define a stochastic probabilistic program:
Definition 0 ().
A stochastic probabilistic program that defines a distribution over traces is a computer program that accepts a trace assignment as one of its arguments, and returns the unnormalized probability of with respect to , conditioned on other arguments
and a random variabledrawn from distribution , conditioned on :
where is an unknown normalization constant.
A rationale for this definition is that corresponds to nuisance parameters or nondeterministic choices inside the program. Let us illustrate a stochastic probabilistic program on an example:
Example 0 ().
A survey is conducted among company’s employees. The survey contains a single question: ”Are you satisfied with your compensation?” To preserve employees’ privacy, the employee flips a coin before answering the question. If the coin shows head, the employee answers honestly; otherwise, the employee flips a coin again, and answers ’yes’ on head, ’no’ on tail. Based on survey outcomes, we want to know how many of the employees are satisfied with their compensations.
Stochastic probabilistic program
The stochastic probabilistic program modelling the survey (Figure 1) receives two parameters: probability theta that a randomly chosen employee is satisfied with the compensation and vector y of boolean survey outcomes. The program trace consists of a single variable theta. There are nuisance random variables coin representing point flips which are sampled inside the program from their prior distribution, but are not included in the trace — this makes the probabilistic program stochastic. The unnormalized probability of the trace is accumulated in variable prob. First, a Beta prior is imposed on theta (line 2). Then, prob is multiplied by the probability of each answer given theta and coin (lines 5–9).
The program in Figure 1 can be rewritten as a deterministic probabilistic program (Figure 2). Instead of flipping a coin as in line 4 of the stochastic program in Figure 1, the probability of observation y[i] is computed in lines 5–6 as the sum of probabilities given either head or tail, weighted by the probabilities of head and tail (both are 0.5). Due to explicit marginalization, the generative structure of the program becomes less obvious, but one can still argue that a stochastic probabilistic program can be straightforwardly reduced to a deterministic one.
It is though easy to imagine a problem that is almost as simple as in the example but for which manual marginalization is not possible. Imagine that the distribution of coin flip outcomes is not known but obtained from a black-box source. The survey still can be conducted, as long as the source is stationary, if the probability of each outcome is unknown and even if consequent coin flips are not independent, by replacing line 4 in Figure 1 by
where Coins is a black-box random source. However, after such modification the program code cannot be marginalized, and there is no obvious deterministic equivalent of the stochastic probabilistic program. Worse yet, even if all instantiations of coin are included in the trace, making the trace 1+len(y)-dimensional, inference cannot be performed on the resulting deterministic probabilistic program because the probability of a coin flip outcome with respect to the random source Coins is unknown. Meanwhile, the stochastic probabilistic program constitutes a valid inference model, as we show in the following sections.
3. Inference in Stochastic Probabilistic Programs
A stochastic probabilistic program of the same shape may appear in two different contexts. In one context, like in Example 2.3, stochasticity models nuisance parameters, such as latent noise, and the posterior distribution of the trace is marginalized over the nuisance parameters. In the other context, stochasticity models nondeterminism, and the posterior distribution of the trace is inferred in expectation over all possible outcomes of nondeterministic choices. As an illustration of the latter context, consider the following example:
Example 0 ().
A player throws a ball into the basket at distance . Knowing that the initial velocity of the ball varies with a known distribution , at which angle should the player throw the ball to hit the basket?
For simplicity, let’s assume that the player’s hands and the basket are at the same level and that air drag can be ignored. Then, for given and the final distance of the ball at the basket level is:
where is the acceleration of gravity. The corresponding stochastic probabilistic program is shown in Figure 3. In the program code, Pi denotes constant and g is the acceleration of gravity .
Note that the question we want to answer with this probabilistic program is different from just inferring the posterior distribution of conditioned on observation , with as a nuisance parameter (which would be the case if one observed the ball falling at distance ). To model nondeterminism, one needs to ‘cut’ the dependency between the distribution of and the observation, and infer for all drawn from .
In the rest of the section, we describe posterior inference for both contexts. We limit analysis to the case of and differentiable by . We represent the posterior inference as the problem of drawing samples from posterior distribution given unnormalized probability density and a random source of . In either context, the inference can be based on stochastic gradient Hamiltonian Monte Carlo (sgHMC) (Chen et al., 2014), but differs in the way the stochastic gradient is computed. Similarly, maximum a posterioriestimation can be performed using a stochastic gradient optimization algorithm (Ruder, 2016).
The unnormalized probability density is a marginalization of over :
By Monte Carlo approximation,
where are drawn from the random source of . Posterior inference and maximum a posteriori estimation require a stochastic estimate of :
By single-sample Monte Carlo gradient estimation,
Note that (7) involves two independent Monte Carlo approximations, of the numerator and of the denominator, and therefore is computed twice with respect to different samples of , and
. In practice, this is implemented by calling the probabilistic program twice: first to compute the probability and the gradient in the numerator, then to compute the probability in the denominator. The variance of estimate (7) can be reduced, at the cost of increased computation time, by using multiple samples to estimate the denominator. Due to Monte Carlo approximation in the denominator, Metropolis-Hastings correction is necessary.
When a stochastic probabilistic program is used to represent nondeterminism, the posterior inference is performed with respect to all possible instantiations of in (2) (rather than any instantiation, as in the case of marginalization). Thus, is a geometric integral of over :
By Monte Carlo approximation,
The stochastic gradient estimate is then just the average of Monte Carlo samples of the gradient:
And, by single-sample Monte Carlo gradient estimation:
The case of nondeterminism bears similarity to the setting for which sgHMC was introduced — posterior inference on large or streaming data, Equation (5) in (Chen et al., 2014). A difference is that in (Chen et al., 2014), the gradient scales up with the data size, and the influence of the prior distribution vanishes as the data size grows. In the nondeterminism setting, the expectation of the gradient estimate (10) remains the same independently of the number of Monte Carlo samples.
We implemented inference in stochastic probabilistic programs using Infergo (Tolpin, 2019). Thanks to reliance of Infergo on pure Go code for probabilistic programs, no changes were necessary to the way probabilistic programs are represented — it is sufficient just to draw samples from a random number generator provided by one of Go libraries, for example, the standard library or Gonum (authors, 2017).
Implementations of sgHMC variants for case studies in this paper are available at http://bitbucket.org/dtolpin/spp-studies and will be included in a future version of Infergo.
5. Case Studies
We evaluated inference in stochastic probabilistic programs on several models. In each evaluation, we specified two models: a stochastic model and a corresponding deterministic model. Inference on stochastic models was performed using either sgHMC for nondeterminism, or MHMC, a variant of sgHMC with the gradient computed as in (6), for marginalization; HMC was used on deterministic models. In each of the case studies we checked that stochastic and deterministic models yield consistent results. The probabilistic programs used in case studies are available at http://bitbucket.org/dtolpin/spp-studies. In the appendix, we provide the source code of stochastic and deterministic probabilistic programs for three studies:
In this paper, we introduced the notion of a stochastic probabilistic program and inference schemes for the contexts of marginalization and nondeterminism, in which stochastic probabilistic programs come up. Stochastic probabilistic programs facilitate natural specification of probabilistic models, and support inference in the models, for which deterministic probabilistic programs are impossible or difficult to write. We focused our analysis on the case of differentiable probabilistic programs with a real-valued parameter vector, however this is not a limitation in general — stochastic probabilistic programs can be specified for support of any type, albeit different inference algorithms may be required.
Our understanding of stochastic probabilistic programs, classes of models for which they are best suited, and inference schemes is still at an early stage and evolving. In particular, we are working, towards future publications, on the analysis of stochastic gradient estimate for the marginalization context. Another research direction worth exploring is programmatic transformation of stochastic probabilistic programs into deterministic ones when the latter exist. Such transformation would allow specifying models as stochastic probabilistic programs where such representation is natural but applying efficient and well developed inference techniques for deterministic probabilistic programs on transformed representations. Last but not least, we are looking into introducing stochastic probabilistic program support into existing probabilistic programming frameworks, through transformation to deterministic probabilistic programs or by extending the frameworks.
Development of Infergo is supported by PUB+, http://pubplus.com/.
- authors (2017) Gonum authors. 2017. Gonum numerical packages. http://gonum.org/.
- Carpenter et al. (2017) Bob Carpenter, Andrew Gelman, Matthew Hoffman, Daniel Lee, Ben Goodrich, Michael Betancourt, Marcus Brubaker, Jiqiang Guo, Peter Li, and Allen Riddell. 2017. Stan: a probabilistic programming language. Journal of Statistical Software, Articles 76, 1 (2017), 1–32.
et al. (2014)
Tianqi Chen, Emily B.
Fox, and Carlos Guestrin.
Stochastic Gradient Hamiltonian Monte Carlo. In
Proceedings of the 31st International Conference on International Conference on Machine Learning(ICML’14). JMLR.org, II–1683–II–1691.
et al. (2018)
Hong Ge, Kai Xu, and
Zoubin Ghahramani. 2018.
Turing: composable inference for probabilistic
International Conference on Artificial Intelligence and Statistics, AISTATS 2018, 9-11 April 2018, Playa Blanca, Lanzarote, Canary Islands, Spain. 1682–1690.
- Go team (2009) The Go team. 2009. The Go programming language. http://golang.org/.
- Goodman et al. (2008) Noah D. Goodman, Vikash K. Mansinghka, Daniel M. Roy, Keith Bonawitz, and Joshua B. Tenenbaum. 2008. Church: a language for generative models. In Proc. of Uncertainty in Artificial Intelligence.
- Goodman and Stuhlmüller (2014) N. D. Goodman and A. Stuhlmüller. 2014. The Design and Implementation of Probabilistic Programming Languages. http://dippl.org/ electronic; retrieved 2019/3/29.
- Pfeffer (2009) Avi Pfeffer. 2009. Figaro: An Object-Oriented Probabilistic Programming Language. In Charles River Analytics Technical Report (2009).
- Ruder (2016) Sebastian Ruder. 2016. An overview of gradient descent optimization algorithms. arXiv:1609.04747
- Tolpin (2019) David Tolpin. 2019. Deployable Probabilistic Programming. In Proceedings of the 2019 ACM SIGPLAN International Symposium on New Ideas, New Paradigms, and Reflections on Programming and Software (Onward! 2019). ACM, New York, NY, USA, 1–16.
- Tolpin et al. (2016) David Tolpin, Jan-Willem van de Meent, Hongseok Yang, and Frank Wood. 2016. Design and implementation of probabilistic programming language Anglican. In Proceedings of the 28th Symposium on the Implementation and Application of Functional Programming Languages (IFL 2016). ACM, New York, NY, USA, Article 6, 12 pages.
- van de Meent et al. (2016) Jan-Willem van de Meent, Brooks Paige, David Tolpin, and Frank Wood. 2016. Black-box policy search with probabilistic programs. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, AISTATS 2016, Cadiz, Spain, May 9-11, 2016. 1195–1204.
- Wood et al. (2014) Frank Wood, Jan-Willem van de Meent, and Vikash Mansinghka. 2014. A new approach to probabilistic programming inference. In Artificial Intelligence and Statistics.
Appendix A Source code for case studies
a.1. Compensation survey
a.2. Ball throw
To keep the deterministic model simple, the player throws the ball with one of two velocities: either Vw (weak throw) or Vs (strong throw) with equal probability. The stochastic model can handle any continuous or discrete random source of velocities.