 # Composing Modeling and Inference Operations with Probabilistic Program Combinators

Probabilistic programs with dynamic computation graphs can define measures over sample spaces with unbounded dimensionality, and thereby constitute programmatic analogues to Bayesian nonparametrics. Owing to the generality of this model class, inference relies on "black-box" Monte Carlo methods that are generally not able to take advantage of conditional independence and exchangeability, which have historically been the cornerstones of efficient inference. We here seek to develop a "middle ground" between probabilistic models with fully dynamic and fully static computation graphs. To this end, we introduce a combinator library for the Probabilistic Torch framework. Combinators are functions that accept models and return transformed models. We assume that models are dynamic, but that model composition is static, in the sense that combinator application takes place prior to evaluating the model on data. Combinators provide primitives for both model and inference composition. Model combinators take the form of classic functional programming constructs such as map and reduce. These constructs define a computation graph at a coarsened level of representation, in which nodes correspond to models, rather than individual variables. Inference combinators - such as enumeration, importance resampling, and Markov Chain Monte Carlo operators - assume a sampling semantics for model evaluation, in which application of combinators preserves proper weighting. Owing to this property, models defined using combinators can be trained using stochastic methods that optimize either variational or wake-sleep style objectives. As a validation of this principle, we use combinators to implement black box inference for hidden Markov models.


## 1 Introduction

Bayesian nonparametric models have traditionally leveraged exchangeability in order to define predictive distributions that marginalize over an unbounded number of degrees of freedom. In recent years, the field of probabilistic programming has explored a different (yet related) class of models. A probabilistic program can be thought of as a stochastic simulator that is conditioned on observed variables. When the probabilistic programming language supports recursion, probabilistic programs can define priors that sample from models with an unbounded number of random variables, providing a programmatic alternative to classic Bayesian nonparametric models.

A probabilistic program must support two operations. First, it must be possible to generate samples by evaluating the program. In general, any (halting) evaluation instantiates some finite set of random variables, whose values are referred to as a trace. The second operation that must be implemented is the evaluation of the unnormalized density function of a program for any trace. These operations can be formalized in two equivalent forms of denotational semantics, in which a program with inputs either evaluates to an unnormalized measure $\gamma_f(x \mid y)$ or to a weighted sample scibior2017denotational ; Scibior:2018:FPM:3243631.3236778 . Inference seeks to characterize the target density $\pi_f(x \mid y) = \gamma_f(x \mid y) / Z_f$, where $Z_f = \int \gamma_f(x \mid y)\,dx$. As in other inference problems, this integral is typically intractable, and is typically approximated using Monte Carlo methods.
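To make the two operations concrete, the following is a minimal sketch in plain Python of a single-latent-variable program. The names `run` and `log_joint` and the toy Gaussian model are illustrative assumptions, not Probabilistic Torch's actual API; a trace is modeled as a plain dict.

```python
import math
import random

def log_normal(x, mu, var):
    """Log density of a univariate normal with mean mu and variance var."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)

def run(y, rng):
    """First operation: evaluate the program forward, returning a trace
    (the values of the random variables instantiated on this run) and a
    log importance weight contributed by the conditioning statement."""
    trace = {"z": rng.gauss(0.0, 1.0)}       # latent z ~ N(0, 1), drawn from its prior
    log_w = log_normal(y, trace["z"], 1.0)   # observe y ~ N(z, 1): weight, do not sample
    return trace, log_w

def log_joint(trace, y):
    """Second operation: score an arbitrary trace under the unnormalized
    density gamma(x, y) = p(z) p(y | z)."""
    return log_normal(trace["z"], 0.0, 1.0) + log_normal(y, trace["z"], 1.0)
```

Under this likelihood-weighting evaluation, the score of a sampled trace decomposes as the prior log density of the trace plus the returned log weight.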

In both nonparametric models and general probabilistic programs, we can significantly improve the performance of approximate Bayesian inference by imposing some a priori assumptions about the graphical form of the joint distribution to be conditioned on our observations. For example, in some models we can alternate between updates to local and global plates of variables. In a Hidden Markov Model (HMM), for instance, we can predict transition probabilities from state sequences, or vice versa. Research in probabilistic programming has traditionally emphasized the development of assumption-free inference methods. To address model-specific inference optimizations, we develop abstractions to modularly and compositionally specify models and inference strategies.

## 2 Model and Inference Composition

In recent years, there have been a number of efforts to develop specialized inference methods for probabilistic programming. The Venture mansinghka2014venture platform provides primitives for inference programming that can act on subsets of variables in an execution trace. There has also been work to formalize notions of valid inference composition. The Hakaru language narayanan2016probabilistic frames inference as program transformations, which can be composed so as to preserve a measure-theoretic denotation zinkov2017composing . Work by Scibior et al. scibior2017denotational defines measure-theoretic validity criteria for compositional inference transformations.

Models in Probabilistic Torch are written in Python and can make use of expressions, loops, and other control flow constructs. Models can dynamically instantiate random variables in a data-dependent manner, although the resulting computation graph becomes difficult to analyze statically schulman2015gradient . To reason without static guarantees, we postulate the following requirements for model and inference composition:

1. Composition is static, evaluation is dynamic. A model is statically composed from other models, while each evaluation based on data generates a unique trace. In our later HMM example, we can compose a model that samples global parameters with a model for a sequence of variably many states and observations. Evaluations which traverse the same control-flow path while sampling different random values yield traces we can use as samples from the same distribution.

2. Inference operations preserve proper weights. A program $f$ defines a measure $\gamma_f(x \mid y)$, which may be unnormalized, conditioned on some set of inputs $y$. This measure can represent a prior distribution, or a distribution that is conditioned using observed variables or factors. We here assume that valid evaluation strategies for a program yield properly weighted naesseth2015nested samples $(X, W)$ such that, for all measurable functions $h$,

$$\mathbb{E}[h(X)\,W] = \int h(x)\,\gamma_f(x \mid y)\,dx. \tag{1}$$

This property implies that $\mathbb{E}[W] = Z_f$ (take $h \equiv 1$), i.e. the weight $W$ is an unbiased estimator of the normalizer. A default evaluation strategy that satisfies this assumption is likelihood weighting, in which samples are proposed from the program prior and conditioning operations define an importance weight.

We require that any inference combinator must preserve proper weighting. Operations that satisfy this requirement include importance sampling, importance resampling, Sequential Monte Carlo (SMC), and application of a transition kernel. It follows that any composition of these operations also preserves proper weighting, resulting in an inference strategy that is properly weighted by construction scibior2017denotational .
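The unbiasedness property in equation (1) can be checked numerically. The sketch below (an illustrative toy, not library code) runs likelihood weighting on a conjugate Gaussian model, where the normalizer is available in closed form, and compares the mean weight against it.

```python
import math
import random

def log_normal(x, mu, var):
    """Log density of a univariate normal with mean mu and variance var."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)

def likelihood_weighting(y, n, rng):
    """Propose the latent from its prior N(0, 1); conditioning on the
    observation y ~ N(z, 1) contributes the importance weight w = p(y | z)."""
    pairs = []
    for _ in range(n):
        z = rng.gauss(0.0, 1.0)
        pairs.append((z, math.exp(log_normal(y, z, 1.0))))
    return pairs

rng = random.Random(0)
y = 0.3
pairs = likelihood_weighting(y, 200_000, rng)

# E[W] estimates the normalizer; for conjugate Gaussians it is Z = N(y; 0, 2).
z_hat = sum(w for _, w in pairs) / len(pairs)
z_true = math.exp(log_normal(y, 0.0, 2.0))
```

With this many samples the Monte Carlo estimate `z_hat` lands within roughly one percent of `z_true`.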

## 3 Model Combinators

A model is a stochastic computation that returns a properly weighted sample. Model evaluation produces a trace, an object that holds values and densities for the set of random variables instantiated during a particular evaluation of the model. Traces can be conditioned on other traces to implement proposals. Combinators accept models as inputs and return a model.

Table 1 shows a number of combinators corresponding to functional programming constructs, with their semantics. In addition to those for which we give the semantics, we have also implemented reduce as a basic folding combinator, and such common model families as ssm (state-space model), mixture, and hmm (a hidden Markov model). Of particular note is that we can give semantics for higher-order stochastic functions built using partial and compose.
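As a rough illustration of the idea (not the library's actual implementation), model combinators can be sketched as higher-order functions over models, where a model is any function returning a (sample, log-weight) pair. The names `compose`, `pmap`, and `reduce_model` below are hypothetical stand-ins for the combinators in Table 1.

```python
# A "model" here is any function input -> (sample, log_weight).

def compose(g, f):
    """Chain two models: feed f's sample into g, accumulating log weights."""
    def model(x):
        z, lw_f = f(x)
        out, lw_g = g(z)
        return out, lw_f + lw_g
    return model

def pmap(f):
    """Lift a model over a sequence of inputs (a plate of independent evaluations)."""
    def model(xs):
        results = [f(x) for x in xs]
        return [v for v, _ in results], sum(lw for _, lw in results)
    return model

def reduce_model(step, init):
    """Fold a model over a sequence, threading state -- the shape of an
    ssm or hmm combinator, where step consumes (state, input)."""
    def model(xs):
        state, total_lw = init, 0.0
        trajectory = []
        for x in xs:
            state, lw = step(state, x)
            trajectory.append(state)
            total_lw += lw
        return trajectory, total_lw
    return model
```

Because each combinator only manipulates (sample, log-weight) pairs, composition is static while every evaluation on data remains fully dynamic, matching requirement 1 above.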

## 4 Inference Composition

Consider running a probabilistic program $f$ to draw a sample from an unnormalized density $\gamma_f(x \mid y)$. We denote drawing a properly weighted sample $x$ with weight $w$ from $\gamma_f$ via $f$ as $(x, w) \sim f$. Since proper weights are ratios of unnormalized densities, any joint distribution formed by a probabilistic program constitutes a proper weight. We can thus express as a properly weighted sampler any inference technique which only requires producing samples from programs.

Table 2 shows inference rule semantics for several inference combinators in terms of how they take properly weighted samplers as arguments and return properly weighted samplers in turn. Note that when the combinator applies a transition kernel that satisfies detailed balance with respect to the target, as used in Markov chain Monte Carlo methods, the new proper weight equals the old one.
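One of the simplest weight-preserving operations is importance resampling. The sketch below (illustrative, not the library's `resample`) draws survivors in proportion to their weights and assigns each the mean weight, so the returned collection is again properly weighted for the same unnormalized density.

```python
import random

def resample(pairs, rng):
    """Importance resampling over (value, weight) pairs: draw len(pairs)
    survivors in proportion to their weights and give each survivor the
    mean weight, which preserves E[h(X) W] from equation (1)."""
    values = [v for v, _ in pairs]
    weights = [w for _, w in pairs]
    mean_w = sum(weights) / len(weights)
    survivors = rng.choices(values, weights=weights, k=len(pairs))
    return [(v, mean_w) for v in survivors]
```

Applied inside an SMC sweep, this step equalizes weights while leaving the expectation of any weighted test function unchanged.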

The proposed combinator framework is a natural fit for modern variational methods for training deep probabilistic models. Because the weight is an unbiased estimator of the normalizer, we can use its logarithm to define an evidence lower bound (ELBO) kucukelbir2017automatic or evidence upper bound (EUBO) bornschein2015reweighted and perform variational inference by automatic differentiation in PyTorch. For a parameterized density $\gamma_\theta(x \mid y)$, we can approximate the gradient using the Monte Carlo estimator

$$\sum_k \frac{w_k}{\sum_l w_l} \nabla_\theta \log \gamma_\theta(x_k \mid y).$$

When we sample from an inference model $q_\phi(x \mid y)$ we can perform wake-sleep style inference by minimizing the inclusive KL divergence using the estimator

$$-\sum_k \frac{w_k}{\sum_l w_l} \nabla_\phi \log q_\phi(x_k \mid y).$$

Note that the gradient w.r.t. $\theta$ updates the generative model, whereas the gradient w.r.t. $\phi$ updates the inference model. In other words, we can perform variational inference in any properly weighted model by automatic differentiation on the importance weights.
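The self-normalized gradient estimator above can be sketched for a toy conjugate model where the score function is available analytically. Everything here is an illustrative assumption: the model $z \sim N(\theta, 1)$, $y \sim N(z, 1)$, for which $\nabla_\theta \log \gamma_\theta = z - \theta$, and the function name `model_gradient`.

```python
import math
import random

def log_normal(x, mu, var):
    """Log density of a univariate normal with mean mu and variance var."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)

def model_gradient(theta, y, n, rng):
    """Self-normalized estimator sum_k (w_k / sum_l w_l) * score(x_k) for the
    toy model z ~ N(theta, 1), y ~ N(z, 1), where score(z) = z - theta."""
    zs = [rng.gauss(theta, 1.0) for _ in range(n)]        # propose from the prior
    ws = [math.exp(log_normal(y, z, 1.0)) for z in zs]    # conditioning weights
    total = sum(ws)
    return sum((w / total) * (z - theta) for w, z in zip(ws, zs))
```

In this conjugate model the exact expected gradient is the posterior mean minus $\theta$, i.e. $(y - \theta)/2$, so at $\theta = 0$, $y = 1$ the estimator should concentrate near 0.5.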

## 5 Evaluation

Figure 1: Combinator-based variational inference in hidden Markov models (HMM). a) A bouncing ball trajectory with initial velocity. b) The displacement along the x and y axes, respectively. c) Inferred travel directions and transition probabilities from combinator-based wake-sleep Sequential Monte Carlo (SMC). d) Inferred travel directions and transition probabilities from Variational Bayesian Expectation Maximization (VBEM).

Figure 1 shows inference results on simulated data. The data models a bouncing particle trajectory in a closed box (Fig. 1a). This trajectory has a piece-wise constant noisy velocity, which means that the displacements at each time step (Fig. 1b) can be described by an HMM with Gaussian observations, where each state's observation mean corresponds to the average velocity along one of four possible directions of motion. We compare wake-sleep SMC inference results for a combinator-based implementation (Fig. 1c) to those obtained using variational Bayesian expectation maximization (VBEM) (Fig. 1d), for a set of 30 time series that each contain 200 time points. VBEM optimizes the exclusive Kullback-Leibler (KL) divergence $\mathrm{KL}(q \,\|\, \pi)$, while in our combinator-based inference we optimized the inclusive divergence $\mathrm{KL}(\pi \,\|\, q)$, approximating the posterior with greater variance. The combinator-based HMM implementation required 64 lines of model-specific code.

## 6 Extensions and Future Work

In this work, we have made as few assumptions as possible about the structural form of the generative model, and so we believe these inference techniques to be well suited to dynamic probabilistic programs with unbounded dimensionality. Our next step will be to apply combinators to modeling intuitive physics in perception. A variety of extensions are possible that make use of some information about the graphical structure of a model. One opportunity is to apply enumeration or belief propagation to components of the model that are amenable to such operations. Recent work by the Pyro team on sum-product implementations for deep probabilistic programs is relevant in this context uber2018ubersum . Recent just-in-time compilation strategies pytorch2018script could be adapted to construct static graphs for such models, which an optional API could expose to implement inference optimizations.

### Acknowledgments

The authors would like to thank David Tolpin for his early interest in this line of work, and the two anonymous reviewers at the Bayesian Nonparametrics workshop for their detailed feedback.

## References

•  Jörg Bornschein and Yoshua Bengio. Reweighted wake-sleep. International Conference on Learning Representations, 2015.
•  Alp Kucukelbir, Dustin Tran, Rajesh Ranganath, Andrew Gelman, and David M. Blei. Automatic differentiation variational inference. The Journal of Machine Learning Research, 18(1):430–474, 2017.
•  Vikash Mansinghka, Daniel Selsam, and Yura Perov. Venture: A higher-order probabilistic programming platform with programmable inference. arXiv preprint, March 2014.
•  Christian Naesseth, Fredrik Lindsten, and Thomas Schön. Nested sequential Monte Carlo methods. In International Conference on Machine Learning, pages 1292–1301, 2015.
•  Praveen Narayanan, Jacques Carette, Wren Romano, Chung-chieh Shan, and Robert Zinkov. Probabilistic inference by program transformation in Hakaru (system description). In International Symposium on Functional and Logic Programming, pages 62–79. Springer, 2016.
•  PyTorch. Torch Script.
•  John Schulman, Nicolas Heess, Theophane Weber, and Pieter Abbeel. Gradient estimation using stochastic computation graphs. In Advances in Neural Information Processing Systems 28, pages 3528–3536. Curran Associates, Inc., 2015.
•  Adam Ścibior, Ohad Kammar, and Zoubin Ghahramani. Functional programming for modular Bayesian inference. Proc. ACM Program. Lang., 2(ICFP):83:1–83:29, July 2018.
•  Adam Ścibior, Ohad Kammar, Matthijs Vákár, Sam Staton, Hongseok Yang, Yufei Cai, Klaus Ostermann, Sean K. Moss, Chris Heunen, and Zoubin Ghahramani. Denotational validation of higher-order Bayesian inference. Proc. ACM Program. Lang., 2(POPL):60:1–60:29, December 2017.
•  Uber AI Labs. Pyro documentation.
•  Robert Zinkov and Chung-chieh Shan. Composing inference algorithms as program transformations. Uncertainty in Artificial Intelligence, 2017.