An Introduction to Probabilistic Programming

09/27/2018 · by Jan-Willem van de Meent, et al. · Northeastern University, KAIST Department of Mathematical Sciences, The Alan Turing Institute, The University of British Columbia

This document is designed to be a first-year graduate-level introduction to probabilistic programming. It not only provides a thorough background for anyone wishing to use a probabilistic programming system, but also introduces the techniques needed to design and build these systems. It is aimed at people who have an undergraduate-level understanding of either or, ideally, both probabilistic machine learning and programming languages. We start with a discussion of model-based reasoning and explain why conditioning as a foundational computation is central to the fields of probabilistic machine learning and artificial intelligence. We then introduce a simple first-order probabilistic programming language (PPL) whose programs define static-computation-graph, finite-variable-cardinality models. In the context of this restricted PPL we introduce fundamental inference algorithms and describe how they can be implemented in the context of models denoted by probabilistic programs. In the second part of this document, we introduce a higher-order probabilistic programming language, with functionality analogous to that of established programming languages. This affords the opportunity to define models with dynamic computation graphs, at the cost of requiring inference methods that generate samples by repeatedly executing the program. Foundational inference algorithms for this kind of probabilistic programming language are explained in the context of an interface between program executions and an inference controller. This document closes with a chapter on advanced topics which we believe to be, at the time of writing, interesting directions for probabilistic programming research; directions that point towards a tight integration with deep neural network research and the development of systems for next-generation artificial intelligence applications.


Grammars

c              A constant value or primitive function.
v              A variable.
f              A user-defined procedure.
e ::= c | v | (let [v e] e) | (if e e e) | (f e1 ... en)
    | (c e1 ... en) | (sample e) | (observe e e)
               An expression in the first-order probabilistic programming language (FOPPL).
E ::= c | v | (if E E E) | (c E1 ... En)
               An expression in the (purely deterministic) target language.
e ::= c | v | f | (if e e e) | (e e1 ... en)
    | (sample e) | (observe e e) | (fn [v1 ... vn] e)
               An expression in the higher-order probabilistic programming language (HOPPL).
q ::= e | (defn f [v1 ... vn] e) q
               A program in the FOPPL or the HOPPL.

Sets, Lists, Maps, and Expressions

C                            A set of constants (c ∈ C refers to elements).
C = (c1, ..., cn)            A list of constants (ci indexes elements).
C = [v1 ↦ c1, ..., vn ↦ cn]  A map from variables to constants (C(vi) indexes entries).
C[v ↦ c]                     A map update in which c replaces C(v).
C(v) := c                    An in-place update in which c replaces C(v).
dom(C)                       The set of keys in a map.
E                            An expression literal.
E[v := c]                    An expression in which a constant c replaces the variable v.
free-vars(E)                 The free variables in an expression.

Directed Graphical Models

G = (V, A, P, Y)   A directed graphical model.
V                  The variable nodes in the graph.
Y ⊂ V              The observed variable nodes.
X ⊂ V              The unobserved variable nodes.
y ∈ Y              An observed variable node.
x ∈ X              An unobserved variable node.
A                  The directed edges (u, v) between parents u and children v.
P                  The probability mass or density for each variable v, represented as a target language expression P(v) = E.
Y                  The observed values Y(y) = c.
pa(v)              The set of parents of a variable v.

Factor Graphs

G = (V, F, A, Ψ)   A factor graph.
V                  The variable nodes in the graph.
F                  The factor nodes in the graph.
A                  The undirected edges (v, f) between variables v and factors f.
Ψ                  Potentials for factors f, represented as target language expressions Ψ(f) = E.

Probability Densities

p(Y, X)    The joint density over all variables.
p(X)       The prior density over unobserved variables.
p(Y | X)   The likelihood of observed variables Y given unobserved variables X.
p(X | Y)   The posterior density for unobserved variables X given observed variables Y.
X          A trace of values associated with the instantiated set of variables.
p(X)       The probability density evaluated at a trace X.
p(x | θ)   A probability mass or density function for a variable x with parameters θ.
p_x        The language expression that evaluates to the probability mass or density p(x | θ).

1.1 Model-based Reasoning

Model-building starts early. Children build model airplanes then blow them up with firecrackers just to see what happens. Civil engineers build physical models of bridges and dams then see what happens in scale-model wave pools and wind tunnels. Disease researchers use mice as model organisms to simulate how cancer tumors might respond to different drug dosages in humans.

These examples show exactly what a model is: a stand-in, an imposter, an artificial construct designed to respond in the same way as the system you would like to understand. A mouse is not a human but it is often close enough to get a sense of what a particular drug will do at particular concentrations in humans anyway. A scale-model of an earthen embankment dam has the wrong relative granularity of soil composition but studying overtopping in a wave pool still tells us something about how an actual dam might respond.

As computers have become faster and more capable, numerical models have come to the fore and computer simulations have replaced physical models. Such simulations are by nature approximations. However, now in many cases they can be as exacting as even the most highly sophisticated physical models – consider that the US was happy to abandon physical testing of nuclear weapons.

Numerical models emulate stochasticity, e.g. using pseudorandom number generators, to simulate truly random phenomena and other uncertainties. Running a simulator with stochastic value generation leads to a many-worlds-like explosion of possible simulation outcomes. Every little kid knows that even the slightest variation in the placement of a firecracker or the most seemingly minor imperfection of a glue joint will lead to dramatically different model airplane explosions. Effective stochastic modeling means writing a program that can produce all possible explosions, each corresponding to a particular set of random values, including for example the random final resting position of a rapidly dropped lit firecracker.

Arguably this intrinsic variability of the real world is the most significant complication for modeling and understanding. Did the mouse die in two weeks because of a particular individual drug sensitivity, because of its particular phenotype, or because the drug regimen trial arm it was in was particularly aggressive? If we are interested in average effects, a single trial is never enough to learn anything for sure because random things almost always happen. You need a population of mice to gain any kind of real knowledge. You need to conduct several wind-tunnel bridge tests, numerical or physical, because of variability arising everywhere – the particular stresses induced by a particular vortex, the particular frailty of an individual model bridge or component, etc. Stochastic numerical simulation aims to computationally encompass the complete distribution of possible outcomes.

When we write “model” we will generally mean a stochastic simulator and the measurable values it produces. Note, however, that this is not the only notion of model that one can adopt. Notably there is a related family of models that is specified solely in terms of an unnormalized density or “energy” function; this is treated in Chapter 3.

Models produce values for things we can measure in the real world; we call such measured values observations. What counts as an observation is model, experiment, and query specific – you might measure the daily weight of mice in a drug trial or you might observe whether or not a particular bridge design fails under a particular load.

Generally one does not observe every detail produced by a model, physical or numerical, and sometimes one simply cannot. Consider the standard model of physics and the large hadron collider. The standard model is arguably the most precise and predictive model ever conceived. It can be used to describe what can happen in fundamental particle interactions. At high energies these interactions can result in a particle jet that stochastically transitions between energy-equivalent decompositions with varying particle-type and momentum constituencies. It is simply not possible to observe the initial particle products and their first transitions because of how fast they occur. The energies that the particles making up the jet deposit into various detector elements constitute the observables.

So how does one use models? One way is to use them to falsify theories. To do this one encodes the theory as a model and then simulates from it many times. If the population distribution of observations generated by the model is not in agreement with observations generated by the real-world process then there is evidence that the theory can be falsified. This describes science to a large extent. Good theories take the form of models that can be used to make testable predictions. We can test those predictions and falsify model variants that fail to replicate observed statistics.

Models also can be used to make decisions. For instance when playing a game you either consciously or unconsciously use a model of how your opponent will play. To use such a model to make decisions about what move to play next yourself, you simulate taking a bunch of different actions, then pick one amongst them by simulating your opponent’s reaction according to your model of them, and so forth until reaching a game state whose value you know, for instance, the end of the game. Choosing the action that maximizes your chances of winning is a rational strategy that can be framed as model-based reasoning. Abstracting this to life being a game whose score you attempt to maximize while living requires a model of the entire world, including your own physical self, and is where model-based probabilistic machine learning meets artificial intelligence.

A useful model can take a number of forms. One kind takes the form of a reusable, interpretable abstraction, with a good associated inference algorithm, that describes summary statistics or features extracted from raw observable data. Another kind consists of a reusable but non-interpretable and entirely abstract model that can accurately generate complex observable data. Yet another kind of model, notably models in science and engineering, takes the form of a problem-specific simulator that describes a generative process very precisely in engineering-like terms. Over the course of this introduction it will become apparent how probabilistic programming addresses this complete spectrum.

All model types have parameters. Fitting these parameters, when few, can sometimes be performed manually, by intensive theory-based reasoning and a priori experimentation (the masses of particles in the standard model), by measuring conditional subcomponents of a simulator (the compressive strength of various concrete types and their action under load), or by simply fiddling with parameters to see which values produce the most realistic outputs.

Automated model fitting describes the process of using algorithms to determine either point or distributional estimates for model parameters and structure. Such automation is particularly useful when the parameters of a model are uninterpretable or many. We will return to model fitting in Chapter 7; however, it is important to realize that inference can be used for model learning too, simply by lifting the inference problem to include uncertainty about the model itself (e.g. see the neural network example in Section 2.3 and the program induction example in Section 5.3).

The key point now is to understand that models come in many forms, from scientific and engineering simulators in which the results of every subcomputation are interpretable to abstract models in statistics and computer science which are, by design, significantly less interpretable but are often valuable for predictive inference nonetheless.

1.1.1 Model Denotation

An interesting thing to think about, and arguably the foundational idea that led to the field of probabilistic programming, is how such models are denoted and, correspondingly, how such models are manipulated to compute quantities of interest.

To see what we mean about model denotation let us first look at a simple statistical model and see how it is denoted. Statistical models are typically denoted mathematically, subsequently manipulated algebraically, then “solved” computationally. By “solved” we mean that an inference problem involving conditioning on the values of a subset of the variables in the model is answered. Such a model denotation stands in contrast to simulators which are often denoted in terms of software source code that is directly executed. This also stands in contrast, though less so, to generative models in machine learning which usually take the form of probability distributions whose factorization properties can be read from diagrams like graphical models or factor graphs.

Nearly the simplest possible model one could write down is a beta-Bernoulli model for generating a coin flip from a potentially biased coin. Such a model is typically denoted

x ∼ Beta(α, β)
y ∼ Bernoulli(x)        (1.1)

where α and β are parameters, x is a latent variable (the bias of the coin) and y is the value of the flipped coin. A trained statistician will also ascribe a learned, folk-meaning to the symbol ∼ and the keywords Beta and Bernoulli. For example Beta(α, β) means that, given the value of arguments α and β, we can construct what is effectively an object with two methods. The first method being a probability density (or distribution) function that computes

p(x | α, β) = Beta(x; α, β),

and the second a method that draws exact samples from said distribution. A statistician will also usually be able to intuit not only that some variables in a model are to be observed, here for instance y, but that there is an inference objective, here for instance to characterize p(x | y). This denotation is extremely compact, and being mathematical in nature means that we can use our learned mathematical algebraic skills to manipulate expressions to solve for quantities of interest. We will return to this shortly.

In this tutorial we will generally focus on conditioning as the goal, namely the characterization of some conditional distribution given a specification of a model in the form of a joint distribution. This will involve the extensive use of Bayes rule

p(x | y) = p(y | x) p(x) / p(y) = p(y | x) p(x) / ∫ p(y | x) p(x) dx        (1.2)

Bayes rule tells us how to derive a conditional probability from a joint, conditioning tells us how to rationally update our beliefs, and updating beliefs is what learning and inference are all about.

The constituents of Bayes rule have common names that are well known and will appear throughout this text: the likelihood p(y | x), the prior p(x), the marginal likelihood (or evidence) p(y), and the posterior p(x | y). For our purposes a model is the joint distribution p(y, x) = p(y | x) p(x) of the observations y and the random choices made in the generative model x, also called latent variables.

x (latent)                           y (observable)
scene description                    image
simulation                           simulator output
program source code                  program return value
policy prior and world simulator     rewards
cognitive decision making process    observed behavior
Table 1.1: Probabilistic Programming Models

The subject of Bayesian inference, including both philosophical and methodological aspects, is in and of itself worthy of book length treatment. There are a large number of excellent references available, foremost amongst them the excellent book by Gelman et al. [26]. In the space of probabilistic programming arguably the recent books by Davidson-Pilon [17] and Pfeffer [88] are the best current references. They all aim to explain what we expect you to gain an understanding of as you continue to read and build experience, namely, that conditioning a joint distribution – the fundamental Bayesian update – describes a huge number of problems succinctly.

Before continuing on to the special-case analytic solution to the simple Bayesian statistical model and inference problem, let us build some intuition about the power of both programming languages for model denotation and automated conditioning by considering Table 1.1. In this table we list a number of x, y pairs where denoting the joint distribution p(x, y) is realistically only doable in a probabilistic programming language and the posterior distribution p(x | y) is of interest. Take the first, “scene description” and “image.” What would such a joint distribution look like? Thinking about it as p(x, y) directly is somewhat hard; however, thinking about p(x) as being some kind of distribution over a so-called scene graph – the actual object geometries, textures, and poses in a physical environment – is not unimaginably hard, particularly if you think about writing a simulator that only needs to stochastically generate reasonably plausible scene graphs. Noting that p(x, y) = p(y | x)p(x), all we then need is a way to go from scene graph to observable image and we have a complete description of a joint distribution. There are many kinds of renderers that do just this and, although deterministic in general, they are perfectly fine to use when specifying a joint distribution because they map from some latent scene description to observable pixel space and, with the addition of some image-level pixel noise reflecting, for instance, sensor imperfections or Monte-Carlo ray-tracing artifacts, form a perfectly valid likelihood.

An example of this “vision as inverse graphics” idea [52] appearing first in Mansinghka et al. [61] and then subsequently in Le et al. [55, 54] took the image to be a Captcha image and the scene description to include the obscured string. In all three papers the point was not Captcha-breaking per se but instead demonstrating both that such a model is denotable in a probabilistic programming language and that such a model can be solved by general purpose inference.

Let us momentarily consider alternative ways to solve such a “Captcha problem.” A non-probabilistic programming approach would require gathering a very large number of Captchas, hand-labeling them all, then designing and training a neural network to regress from the image to a text string [12]. The probabilistic programming approach in contrast merely requires one to write a program that generates Captchas that are stylistically similar to the Captcha family one would like to break – a model of Captchas – in a probabilistic programming language. Conditioning such a model on its observable output, the Captcha image, will yield a posterior distribution over text strings. This kind of conditioning is what probabilistic programming evaluators do.

Figure 1.1 shows a representation of the output of such a conditioning computation. Each Captcha/bar-plot pair consists of a held-out Captcha image and a truncated marginal posterior distribution over unique string interpretations. Drawing your attention to the middle of the bottom row, notice that the noise on the Captcha makes it more-or-less impossible to tell if the string is “aG8BPY” or “aG8RPY.” The posterior distribution arrived at by conditioning reflects this uncertainty.

Figure 1.1: Posterior uncertainties after inference in a probabilistic programming language model of 2017 Facebook Captchas (reproduced from Le et al. [54])

By this simple example, whose source code appears in Chapter 5 in a simplified form, we aim only to liberate your thinking in regards to what a model is (a joint distribution, potentially over richly structured objects, produced by adding stochastic choice to normal computer programs like Captcha generators) and what the output of a conditioning computation can be like. What probabilistic programming languages do is to allow denotation of any such model. What this tutorial covers in great detail is how to develop inference algorithms that allow computational characterization of the posterior distribution of interest, increasingly very rapidly as well (see Chapter 7).

1.1.2 Conditioning

Returning to our simple coin-flip statistics example, let us continue and write out the joint probability density for the distribution on x and y. The reason to do this is to paint a picture, by this simple example, of what the mathematical operations involved in conditioning are like and why the problem of conditioning is, in general, hard.

Assume that the symbol y denotes the observed outcome of the coin flip and that we encode the event “y comes up heads” using the mathematical value of the integer 1 and 0 for the converse. We will denote the bias of the coin, i.e. the probability that it comes up heads, using the symbol x and encode it using a real number between 0 and 1 inclusive, i.e. x ∈ [0, 1]. Then using standard definitions for the distributions indicated by the joint denotation in Equation (1.1) we can write

p(x, y) = Bernoulli(y; x) Beta(x; α, β)
        = x^y (1 − x)^(1−y) · [Γ(α + β) / (Γ(α) Γ(β))] x^(α−1) (1 − x)^(β−1)        (1.3)

and then use rules of algebra to simplify this expression to

p(x, y) = [Γ(α + β) / (Γ(α) Γ(β))] x^(y+α−1) (1 − x)^(β−y)        (1.4)

Note that we have been extremely pedantic here, using words like “symbol,” “denotes,” “encodes,” and so forth, to try to get you, the reader, to think in advance about other ways one might denote such a model and to realize if you don’t already that there is a fundamental difference between the symbol or expression used to represent or denote a meaning and the meaning itself. Where we haven’t been pedantic here is probably the most interesting thing to think about: What does it mean to use rules of algebra to manipulate Equation (1.3) into Equation (1.4)? To most reasonably trained mathematicians, applying expression transforming rules that obey the laws of associativity, commutativity, and the like are natural and are performed almost unconsciously. To a reasonably trained programming languages person these manipulations are meta-programs, i.e. programs that consume and output programs, that perform semantics-preserving transformations on expressions. Some probabilistic programming systems operate in exactly this way [76]. What we mean by semantics preserving in general is that, after evaluation, expressions in pre-simplified and post-simplified form have the same meaning; in other words, evaluate to the same object, usually mathematical, in an underlying formal language whose meaning is well established and agreed. In probabilistic programming semantics preserving generally means that the mathematical objects denoted correspond to the same distribution [109]. Here, after algebraic manipulation, we can agree that, when evaluated on inputs x and y, the expressions in Equations (1.3) and (1.4) would evaluate to the same value and thus are semantically equivalent alternative denotations. In Chapter 7 we touch on some of the challenges in defining the formal semantics of probabilistic programming languages.
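To make the claim of evaluating to the same value concrete, take, purely as an illustrative choice of inputs of ours, α = β = 1, x = 0.5 and y = 1, so that Γ(α + β)/(Γ(α)Γ(β)) = 1. Equation (1.3) then gives x^y (1 − x)^(1−y) · x^(α−1) (1 − x)^(β−1) = 0.5 · 1 · 1 · 1 = 0.5, while Equation (1.4) gives x^(y+α−1) (1 − x)^(β−y) = 0.5¹ · 0.5⁰ = 0.5; the two denotations agree at this input, as they do at every input.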

That said, our implicit objective here is not to compute the value of the joint probability of some variables, but to do conditioning instead, for instance, to compute p(x | y). Using Bayes rule this is theoretically easy to do. It is just

p(x | y) = p(y | x) p(x) / p(y) = p(y | x) p(x) / ∫ p(y | x) p(x) dx        (1.5)

In this special case the rules of algebra and semantics preserving transformations of integrals can be used to algebraically solve for an analytic form for this posterior distribution.

To start, substituting Equation (1.4) into the numerator and denominator and cancelling the constant factor Γ(α + β)/(Γ(α)Γ(β)), the preceding expression can be simplified to

p(x | y) = x^(y+α−1) (1 − x)^(β−y) / ∫₀¹ x^(y+α−1) (1 − x)^(β−y) dx        (1.6)

which still leaves a nasty looking integral in the denominator. This is the complicating crux of Bayesian inference. This integral is in general intractable as it involves integrating over the entire space of the latent variables. Consider the Captcha example: simply summing over the latent character sequence itself would require an exponential-time operation.

This special statistics example has a very special property, called conjugacy, which means that this integral can be performed by inspection, by identifying that the integrand is the same as the non-constant part of the beta distribution and using the fact that the beta distribution must sum to one

∫₀¹ x^(y+α−1) (1 − x)^(β−y) dx = Γ(y + α) Γ(β − y + 1) / Γ(α + β + 1)        (1.7)

Consequently,

p(x | y) = [Γ(α + β + 1) / (Γ(y + α) Γ(β − y + 1))] x^(y+α−1) (1 − x)^(β−y)        (1.8)

which is equivalent to

p(x | y) = Beta(x; α + y, β + 1 − y)        (1.9)

There are several things that can be learned about conditioning from even this simple example. The result of the conditioning operation is a distribution parameterized by the observed or given quantity. Unfortunately this distribution will in general not have an analytic form because, for instance, we usually will not be so lucky that the normalizing integral has an algebraic analytic solution, nor, when it does not, will it usually be easy to compute numerically.

This does not mean that all is lost. Remember that the ∼ operator is overloaded to mean two things, density evaluation and exact sampling. Neither of these is possible in general. However the latter, in particular, can be approximated, and often consistently, even without being able to do the former. For this reason amongst others our focus will be on sampling-based characterizations of conditional distributions in general.

1.1.3 Query

Either way, having such a handle on the resulting posterior distribution, density function or method for drawing samples from it, allows us to ask questions, “queries” in general. These are best expressed in integral form as well. For instance, we could ask: what is the probability that the bias of the coin is greater than some value x₀, given that the coin came up heads? This is mathematically denoted as

p(x > x₀ | y = 1) = ∫ I(x > x₀) p(x | y = 1) dx = ∫ I(x > x₀) Beta(x; α + 1, β) dx        (1.10)

where I(·) is an indicator function which evaluates to 1 when its argument takes value true and 0 otherwise, and which in this instance can be directly calculated using the cumulative distribution function of the beta distribution.

Fortunately we can still answer queries when we only have the ability to sample from the posterior distribution owing to the Markov strong law of large numbers which states under mild assumptions that

lim_{L→∞} (1/L) Σ_{ℓ=1}^{L} f(X^ℓ) = ∫ f(X) p(X) dX,    X^ℓ ∼ p(X)        (1.11)

for general distributions p(X) and functions f. This technique we will exploit repeatedly throughout. Note that the distribution p(X) on the right hand side is approximated by a set of samples {X^ℓ} on the left and that different functions f can be evaluated at the same sample points chosen to represent p(X) after the samples have been generated.
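As a concrete illustration, choose, purely for the sake of example, α = β = 1 and the threshold x₀ = 0.5. Observing y = 1 then gives the posterior Beta(x; 2, 1) by Equation (1.9), whose density is 2x on [0, 1]. The query (1.10) can be answered exactly,

p(x > 0.5 | y = 1) = ∫ I(x > 0.5) 2x dx = 1 − 0.5² = 0.75,

or approximately via Equation (1.11) with f(x) = I(x > 0.5), as (1/L) Σ_ℓ I(x^ℓ > 0.5) for samples x^ℓ ∼ Beta(2, 1), which converges to 0.75 as L grows.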

This more or less completes the small part of the computational statistics story we will tell, at least insofar as how models are denoted and then algebraically manipulated. We highly recommend that readers unfamiliar with the fundamental concepts of Bayesian analysis, and the mathematical evaluation strategies common there, read and study the “Bayesian Data Analysis” book by Gelman et al. [26].

The field of statistics recognized long ago, arguably first, that computerized systemization of the denotation of models and evaluators for inference was essential, and so developed specialized languages for model writing and query answering, amongst them BUGS [106] and, more recently, STAN [107]. We could start by explaining these and only these languages but this would do significant injustice to the emerging breadth and depth of the field, particularly as it applies to modern approaches to artificial intelligence, and would limit our ability to explain, in general, what is going on under the hood in all kinds of languages, not just those descended from Bayesian inference and computational statistics in finite dimensional models. What is common to all, however, is inference via conditioning as the objective.

1.2 Probabilistic Programming

The Bayesian approach, in particular the theory and utility of conditioning, is remarkably general in its applicability. One view of probabilistic programming is that it is about automating Bayesian inference. In this view probabilistic programming concerns the development of syntax and semantics for languages that denote conditional inference problems and the development of corresponding evaluators or “solvers” that computationally characterize the denoted conditional distribution. For this reason probabilistic programming sits at the intersection of the fields of machine learning, statistics, and programming languages, drawing on the formal semantics, compilers, and other tools from programming languages to build efficient inference evaluators for models and applications from machine learning using the inference algorithms and theory from statistics.

Figure 1.2: Probabilistic programming, an intuitive view.

Probabilistic programming is about doing statistics using the tools of computer science. Computer science, both the theoretical and engineering discipline, has largely been about finding ways to efficiently evaluate programs, given parameter or argument values, to produce some output. In Figure 1.2 we show the typical computer science programming pipeline on the left hand side: write a program, specify the values of its arguments or situate it in an evaluation environment in which all free variables can be bound, then evaluate the program to produce an output. The right hand side illustrates the approach taken to modeling in statistics: start with the output, the observations or data y, then specify a usually abstract generative model p(x, y), often denoted mathematically, and finally use algebra and inference techniques to characterize the posterior distribution p(x | y) of the unknown quantities in the model given the observed quantities. Probabilistic programming is about performing Bayesian inference using the tools of computer science: programming languages for model denotation and statistical inference algorithms for computing the conditional distribution of program inputs that could have given rise to the observed program output.

Thinking back to our earlier example, reasoning about the bias of a coin is an example of the kind of inference probabilistic programming systems do. Our data is the outcome, heads or tails, of one coin flip. Our model, specified in a forward direction, stipulates that a coin and its bias are generated according to a hand-specified model and then the coin flip outcome is observed and analyzed under this model. One challenge, the writing of the model, is a major focus of applied statistics research, where “useful” models are painstakingly designed for every new important problem. Model learning also shows up in programming languages under the name of program induction, in machine learning in the form of model learning, and in deep learning, particularly with respect to the decoder side of autoencoder architectures. The other challenge is computational, and is what Bayes rule gives us a theoretical framework for calculating: to computationally characterize the posterior distribution of the latent quantities (e.g. bias) given the observed quantity (e.g. “heads” or “tails”). In the beta-Bernoulli problem we were able to analytically derive the form of the posterior distribution, in effect allowing us to transform the original inference problem denotation into a denotation of a program that completely characterizes the inverse computation.

When performing inference in probabilistic programming systems, we need to design algorithms that are applicable to any program that a user could write in some language. In probabilistic programming the language used to denote the generative model is critical, ranging from intentionally restrictive modeling languages, such as the one used in BUGS, to arbitrarily complex computer programming languages like C, C++, and Clojure. What counts as observable are the outputs generated from the forward computation. The inference objective is to computationally characterize the posterior distribution of all of the random choices made during the forward execution of the program given that the program produces a particular output.

There are subtleties, but that is a fairly robust intuitive definition of probabilistic programming. Throughout most of this tutorial we will assume that the program is fixed and that the primary objective is inference in the model specified by the program. In the last chapter we will talk some about connections between probabilistic programming and deep learning, in particular through the lens of semi-supervised learning in the variational autoencoder family, where part or all of the generative model itself, i.e. the probabilistic program or “decoder,” is also learned from data.

Before that, though, let us consider how one would recognize or distinguish a probabilistic program from a non-probabilistic program. Quoting Gordon et al. [34], “probabilistic programs are usual functional or imperative programs with two added constructs: the ability to draw values at random from distributions, and the ability to condition values of variables in a program via observations.” We emphasize conditioning here. The meaning of a probabilistic program is that it simultaneously denotes a joint and conditional distribution, the latter by syntactically indicating where conditioning will occur, i.e. which random variable values will be observed. Almost all languages have pseudo-random value generators or packages; what they lack in comparison to probabilistic programming languages is syntactic constructs for conditioning and evaluators that implement conditioning. We will call languages that include such constructs probabilistic programming languages. We will call languages that do not but that are used for forward modeling stochastic simulation languages or, more simply, programming languages.

There are many libraries for constructing graphical models and performing inference; this software works by programmatically constructing a data structure which represents a model, and then, given observations, running graphical model inference. What distinguishes this kind of approach from probabilistic programming is that a program is used to construct a model as a data structure, rather than considering the “model” that arises implicitly from direct evaluation of the program expression itself. In probabilistic programming systems, either a model data structure is constructed explicitly via a non-standard interpretation of the probabilistic program itself (if it can be, see Chapter 3), or it is a general Markov model whose state is the evolving evaluation environment generated by the probabilistic programming language evaluator (see Chapter 4). In the former case, we often perform inference by compiling the model data structure to a density function (see Chapter 3), whereas in the latter case, we employ methods that are fundamentally generative (see Chapters 4 and 6).

1.2.1 Existing Languages

The design of any tutorial on probabilistic programming will have to include a mix of programming languages and statistical inference material along with a smattering of models and ideas germane to machine learning. In order to discuss modeling and programming languages one must choose a language to use in illustrating key concepts and for showing examples. Unfortunately there exist a very large number of languages from a number of research communities; programming languages: Hakaru [76], Augur [119], R2 [78], Figaro [87], IBAL [86], PSI [25]; machine learning: Church [31], Anglican [128] (updated syntax [129]), BLOG [66], Turing.jl [24], BayesDB [63], Venture [62], Probabilistic-C [82], WebPPL [32], CPProb [13], [50], [114]; and statistics: Biips [115], LibBi [73], Birch [72], STAN [107], JAGS [89], BUGS [106] (sincere apologies to the authors of any languages left off this list).

In this tutorial we will not attempt to explain each of the languages and catalogue their numerous similarities and differences. Instead we will focus on the concepts and implementation strategies that underlie most, if not all, of these languages. We will highlight one extremely important distinction, namely, between languages in which all programs induce models with a finite number of random variables and languages for which this is not true. The language we choose for the tutorial has to be a language in which a coherent shift from the former to the latter is possible. For this and other reasons we chose to write the tutorial using an abstract language similar in syntax and semantics to Anglican. Anglican is similar to WebPPL, Church, and Venture and is essentially a Lisp-like language which, by virtue of its syntactic simplicity, also makes for efficient and easy meta-programming, an approach many implementors will take. That said the real substance of this tutorial is entirely language agnostic and the main points should be understood in this light.

We have left off of the preceding extensive list of languages both one important class of language – probabilistic logic languages ([46], [101]) – and sophisticated, useful, and widely deployed libraries/embedded domain-specific languages for modeling and inference (Infer.NET [68], Factorie [64], Edward [117], PyMC3 [100]). One link between the material presented in this tutorial and these additional languages and libraries is that the inference methodologies we will discuss apply to advanced forms of probabilistic logic programs [3, 45] and, in general, to the graph representations constructed by such libraries. In fact the libraries can be thought of as compilation targets for appropriately restricted languages. In the latter case strong arguments can be made that these are also languages in the sense that there is an (implicit) grammar, a set of domain-specific values, and a library of primitives that can be applied to these values. The more essential distinction is the one we have structured this tutorial around, that being the difference between static languages in which the denoted model can be compiled to a finite-node graphical model and dynamic languages in which no such compilation can be performed.

1.3 Example Applications

Before diving into specifics, let us consider some motivating examples of what has been done with probabilistic programming languages and how phrasing things in terms of a model plus conditioning can lead to elegant solutions to otherwise extremely difficult tasks.

We argue that, besides the obvious benefits that derive from having an evaluator that implements inference automatically, the main benefit of probabilistic programming is having additional expressivity, significantly more compact and readable than mathematical notation, in the modeling language. While it is possible to write down the mathematical formalism for a model of latents and observables for each of the examples shown in Table 1.1, doing so is usually neither efficient nor helpful in terms of intuition and clarity. We have already given one example, Captcha from earlier in this chapter. Let us proceed to more.

Constrained Simulation
Figure 1.3: Posterior samples of procedurally generated, constrained trees (reproduced from [97])

Constrained procedural graphics [97] is a visually compelling and elucidating application of probabilistic programming. Consider how one makes a computer graphics forest for a movie or computer game. One does not hire one thousand designers to make each create a tree. Instead one hires a procedural graphics programmer who writes what we call a generative model – a stochastic simulator that generates a synthetic tree each time it is run. A forest is then constructed by calling such a program many times and arranging the trees on a landscape. What if, however, a director enters the design process and stipulates, for whatever reason, that the tree cannot touch some other elements in the scene, i.e. in probabilistic programming lingo we “observe” that the tree cannot touch some elements? Figure 1.3 shows examples of such a situation where the tree on the left must miss the back wall and grey bars and the tree on the right must miss the blue and red logo. In these figures you can see, visually, what we will examine in a high level of detail throughout the tutorial. The random choices made by the generative procedural graphics model correspond to branch elongation lengths, how many branches diverge from the trunk and subsequent branch locations, the angles that the diverged branches take, the termination condition for branching and elongation, and so forth. Each tree literally corresponds to one execution path or setting of the random variables of the generative program. Conditioning with hard constraints like these transforms the prior distribution on trees into a posterior distribution in which all posterior trees conform to the constraint. Valid program variable settings (those present in the posterior) have to make choices at all intermediate sampling points that allow all other sampling points to take at least one value that can result in a tree obeying the statistical regularities specified by the prior and the specified constraints as well.

Program Induction

How do you automatically write a program that performs an operation you would like it to? One approach is to use a probabilistic programming system and inference to invert a generative model that generates normal, regular, computer program code and conditions on its output, when run on examples, conforming to the observed specification. This is the central idea in the work of Perov and Wood [85], whose use of probabilistic programming is what distinguishes their work from the related literature [37, 42, 59]. Examples such as this, even more than the preceding visually compelling examples, illustrate the denotational convenience of a rich and expressive programming language as the generative modeling language. A program that writes programs is most naturally expressed as a recursive program with random choices that generates abstract syntax trees according to some learned prior on the same space. While models from the natural language processing literature exist that allow specification and generation of computer source code (e.g. adaptor grammars [43]), they are at best cumbersome to denote mathematically.

Recursive Multi-Agent Reasoning

Some of the most interesting uses for probabilistic programming systems derive from the rich body of work around the Church and WebPPL systems. The latter, in particular, has been used to study mutually-recursive reasoning among multiple agents. A number of examples of this are detailed in an excellent online tutorial [32].

The list goes on and could occupy a substantial part of a book itself. The critical realization to make is that, of course, any traditional statistical model can be expressed in a probabilistic programming framework, but, more importantly, so too can many others and with significantly greater ease. Models that take advantage of existing source code packages to do sophisticated nonlinear deterministic computations are particularly of interest. One exciting example application under consideration at the time of writing is to instrument the stochastic simulators that simulate the standard model and the detectors employed by the large hadron collider [7]. By “observing” the detector outputs, inference in the generative model specified by the simulation pipeline may prove to be able to produce the highest fidelity event reconstruction and science discoveries.

This last example highlights one of the principal promises of probabilistic programming. There exist a large number of software simulation modeling efforts to simulate, stochastically and deterministically, engineering and science phenomena of interest. Unlike in machine learning, where often the true generative model is not well understood, in engineering situations (like building, engine, or other system modeling) the forward model is sometimes in fact incredibly well understood and already written. Probabilistic programming techniques and evaluators that work within the framework of existing languages should prove to be very valuable in disciplines where significant effort has been put into modeling complex engineering or science phenomena of interest and the power of general purpose inverse reasoning has not yet been made available.

1.4 A First Probabilistic Program

Just before we dig in deeply, it is worth considering at least one simple probabilistic program to informally introduce a bit of syntax and relate a model denotation in a probabilistic programming language to the underlying mathematical denotation and inference objective. There will be source code examples provided throughout, though not always with accompanying mathematical denotation.

Recall the simple beta-Bernoulli model from Section 1.1. This is one in which the probabilistic program denotation is actually longer than the mathematical denotation. But that is largely unique to such trivially simple models. Here is a probabilistic program that represents the beta-Bernoulli model.

(let [prior (beta a b)
      x ( sample prior)
      likelihood (bernoulli x)
      y 1]
  ( observe likelihood y)
  x)
Program 1: The beta-Bernoulli model as a probabilistic program

This program is written in the Lisp dialect we will use throughout, and which we will define in glorious detail in the next chapter. Evaluating this program performs the same inference as described mathematically before, specifically to characterize the distribution on the return value x that is conditioned on the observed value y . The details of what this program means and how this is done form the majority of the remainder of this tutorial.

2.1 Syntax

v ::= variable
c ::= constant value or primitive operation
f ::= procedure
e ::= c | v | (let [v e] e) | (if e e e)
    | (f e1 ... en) | (c e1 ... en)
    | (sample e) | (observe e e)
q ::= e | (defn f [v1 ... vn] e) q
Language 1: First-order probabilistic programming language (FOPPL)

The FOPPL is a Lisp variant that is based on Clojure [41]. Lisp variants are all substantially similar and are often referred to as dialects. The syntax of the FOPPL is specified by the grammar in Language 1. A grammar like this formulates a set of production rules, which are recursive, from which all valid programs must be constructed.

We define the FOPPL in terms of two sets of production rules: one for expressions e and another for programs q. Each set of rules is shown on the right hand side of the ::= sign, separated by a |. We will here provide a very brief self-contained explanation of each of the production rules. For those who wish to read about programming languages essentials in further detail, we recommend the books by Abelson et al. [2] and Friedman and Wand [23].

The rules for q state that a program can either be a single expression e, or a function declaration (defn f ...) followed by any valid program q. Because the second rule is recursive, these two rules together state that a program is a single expression that can optionally be preceded by one or more function declarations.

The rules for expressions e are similarly defined recursively. For example, in the production rule (if e1 e2 e3), each of the sub-expressions e1, e2, and e3 can be expanded by choosing again from the matching rules on the left hand side. The FOPPL defines eight expression types. The first six are “standard” in the sense that they are commonly found in non-probabilistic Lisp dialects:

  • A constant c can be a value of a primitive data type such as a number, a string, or a boolean, a built-in primitive function such as + , or a value of any other data type that can be constructed using primitive procedures, such as lists, vectors, maps, and distributions, which we will briefly discuss below.

  • A variable v is a symbol that references the value of another expression in the program.

  • A let form (let [v e1] e2) binds the value of the expression e1 to the variable v, which can then be referenced in the expression e2, which is often referred to as the body of the let expression.

  • An if form (if e1 e2 e3) takes the value of e2 when the value of e1 is logically true and the value of e3 when e1 is logically false.

  • A function application (f e1 ... en) calls the user-defined function f, which we also refer to as a procedure, with arguments e1 through en. Here the notation e1 ... en refers to a variable-length sequence of arguments, which includes the case (f) for a procedure call with no arguments.

  • A primitive procedure application (c e1 ... en) calls a built-in function c, such as + .

The remaining two forms are what makes the FOPPL a probabilistic programming language:

  • A sample form (sample e) represents an unobserved random variable. It accepts a single expression e, which must evaluate to a distribution object, and returns a value that is a sample from this distribution. Distributions are constructed using primitives provided by the FOPPL. For example, (normal 0.0 1.0) evaluates to a standard normal distribution.

  • An observe form (observe e1 e2) represents an observed random variable. It accepts an argument e1, which must evaluate to a distribution, and conditions on the next argument e2, which is the value of the random variable. A small expression combining both forms is shown after this list.
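To make the two probabilistic forms concrete, here is a minimal expression of our own (it is not one of the numbered programs in this text). It samples a mean mu from a standard normal prior and conditions a normal likelihood centered at mu on a hypothetical observed value 0.5; the inner let merely discards the return value of the observe:

(let [mu (sample (normal 0.0 1.0))]
  (let [z (observe (normal mu 1.0) 0.5)]
    mu))

Under conditioning, the distribution on the return value mu is the posterior over the mean given the single observation 0.5.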

Some things to note about this language are that it is simple, i.e. the grammar only has a small number of special forms. It also has no input/output functionality which means that all data must be inlined in the form of an expression. However, despite this relative simplicity, we will see that we can express any graphical model as a FOPPL program. At the same time, the relatively small number of expression forms makes it much easier to reason about implementations of compilation and evaluation strategies.

Relative to other Lisp dialects, the arguably most critical characteristic of the FOPPL is that, provided that all primitives halt on all possible inputs, potentially non-halting computations are disallowed; in fact, there is a finite upper bound on the number of computation steps and this upper bound can be determined at compile time. This design choice has several consequences. The first is that all data needs to be inlined so that the number of data points is known at compile time. A second consequence is that the FOPPL grammar precludes higher-order functions, which is to say that user-defined functions cannot accept other functions as arguments. The reason for this is that a reference to a user-defined function is in itself not a valid expression type. Since arguments to a function call must be expressions, this means that we cannot pass a function as an argument to another function.

Finally, the FOPPL does not allow recursive function calls, although the syntax does not forbid them. This restriction can be enforced via the scoping rules in the language. In a program of the form

    (defn f1 ...) (defn f2 ...) e

we can call f1 inside of f2, but not vice versa, since f2 is defined after f1. Similarly, we impose the restriction that we cannot call f1 inside f1 itself, which we can intuitively think of as f1 not yet having been defined. Enforcing this restriction can be done using a pre-processing step.
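For instance, a definition like the following (a hypothetical example of ours, assuming = is available as a comparison primitive) is syntactically well-formed but is ruled out by this restriction, because sum-to calls itself:

(defn sum-to [n]
  (if (= n 0)
    0
    (+ n (sum-to (- n 1)))))

Unbounded recursion of this kind only becomes expressible in the higher-order language of Chapter 5.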

A second distinction between the FOPPL and other Lisp dialects is that we will make use of vector and map data structures, analogous to the ones provided by Clojure:

  • Vectors (vector e1 ... en) are similar to lists. A vector can be represented with the literal [e1 ... en]. This is often useful when representing data. For example, we can use [1 2] to represent a pair, whereas the expression (1 2) would throw an error, since the constant 1 is not a primitive function.

  • Hash maps (hash-map e1 e1' ... en en') are constructed from a sequence of key-value pairs ei ei'. A hash map can be represented with the literal {e1 e1' ... en en'}.

Note that we have not explicitly enumerated primitive functions in the FOPPL. We will implicitly assume existence of arithmetic primitives like + , - , * , and / , as well as distribution primitives like normal and discrete . In addition we will assume the following functions for interacting with data structures

  • (first e) retrieves the first element of a list or vector e.

  • (last e) retrieves the last element of a list or vector e.

  • (append e1 e2) appends e2 to the end of a list or vector e1. (Readers familiar with Lisp dialects will notice that append differs somewhat from the semantics of primitives like cons, which prepends to a list, or the Clojure primitive conj, which prepends to a list and appends to a vector.)

  • (get e1 e2) retrieves an element at index e2 from a list or vector e1, or the element at key e2 from a hash map e1.

  • (put e1 e2 e3) replaces the element at index/key e2 with the value e3 in a vector or hash map e1.

  • (remove e1 e2) removes the element at index/key e2 from a vector or hash map e1.

Note that FOPPL primitives are pure functions. In other words, the append, put, and remove primitives do not modify e1 in place, but instead return a modified copy of e1. Efficient implementations of such functionality may be advantageously achieved via pure functional data structures [80].
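The following small expression of ours (not one of the numbered programs) illustrates this purity: the original vector v is left unchanged by the put, so the whole expression evaluates to [1 7].

(let [v (vector 1 2 3)]
  (let [w (put v 0 7)]
    (vector (get v 0) (get w 0))))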

Finally, we note that we have not specified a type system or stated exactly which values are allowable in the language. For example, (sample e) will fail if at runtime e does not evaluate to a distribution-typed value.

(defn observe-data [slope intercept x y]
  (let [fx (+ (* slope x) intercept)]
    ( observe (normal fx 1.0) y)))
(let [slope ( sample (normal 0.0 10.0))]
  (let [intercept  ( sample (normal 0.0 10.0))]
    (let [y1 (observe-data slope intercept 1.0 2.1)]
    (let [y2 (observe-data slope intercept 2.0 3.9)]
    (let [y3 (observe-data slope intercept 3.0 5.3)]
    (let [y4 (observe-data slope intercept 4.0 7.7)]
    (let [y5 (observe-data slope intercept 5.0 10.2)]
      [slope intercept])))))))
Program 2: Bayesian linear regression in the FOPPL.

Now that we have defined our syntax, let us illustrate what a program in the FOPPL looks like. Program 2 shows a simple univariate linear regression model. The program defines a distribution on lines, expressed in terms of their slopes and intercepts, by first defining a prior distribution on slope and intercept and then conditioning it using five observed data pairs. The procedure observe-data conditions the generative model given a pair (x, y), by observing the value y from a normal centered around the value (+ (* slope x) intercept). Using a procedure lets us avoid rewriting observation code for each observation pair. The procedure returns the observed value, which is ignored in our case. The program defines a prior on slope and intercept using the primitive procedure normal for creating an object representing a normal distribution. After conditioning this prior with the data points, the program returns the pair [slope intercept], which is a sample from the posterior distribution conditioned on the 5 observed values.

2.2 Syntactic Sugar

The fact that the FOPPL only provides a small number of expression types is a big advantage when building a probabilistic programming system. We will see this in Chapter 3, where we will define a translation from any FOPPL program to a Bayesian network using only 8 rules (one for each expression type). At the same time, for the purposes of writing probabilistic programs, having a small number of expression types is not always convenient. For this reason we will provide a number of alternate expression forms, which are referred to as syntactic sugar, to aid readability and ease of use.

We have already seen two very simple forms of syntactic sugar: the vector literal [e1 ... en] is a sugared form of (vector e1 ... en) and the hash map literal {e1 e1' ... en en'} is a sugared form of (hash-map e1 e1' ... en en'). In general, each sugared expression form can be desugared, which is to say that it can be reduced to an expression in the grammar in Language 1. This desugaring is done as a preprocessing step, often implemented as a macro rewrite rule that expands each sugared expression into the equivalent desugared form.
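For example (an illustration of ours, with arbitrary made-up values), the two literal forms

[2.1 3.9 5.3]
{1 2.1 2 3.9}

desugar to

(vector 2.1 3.9 5.3)
(hash-map 1 2.1 2 3.9)

respectively.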

2.2.1 Let forms

The base let form (let [v e1] e2) binds a single variable v in the expression e2. Very often, we will want to define multiple variables, which leads to nested let expressions like the ones in Program 2. Another distracting piece of syntax in this program is that we define dummy variables y1 to y5 which are never used. The reason for this is that we are not interested in the values returned by calls to observe-data; we are using this function in order to observe values, which is a side-effect of the procedure call.

To accommodate both these use cases in let forms, we will make use of the following generalized let form

(let [v1 e1
      ...
      vn en]
  en+1
  ...
  en+m)

This allows us to simplify the nested let forms in Program 2 to

(let [slope ( sample (normal 0.0 10.0))
      intercept ( sample (normal 0.0 10.0))]
  (observe-data slope intercept 1.0 2.1)
  (observe-data slope intercept 2.0 3.9)
  (observe-data slope intercept 3.0 5.3)
  (observe-data slope intercept 4.0 7.7)
  (observe-data slope intercept 5.0 10.2)
  [slope intercept])

This form of let is desugared to the following expression in the FOPPL

(let [v1 e1]
  ...
  (let [vn en]
    (let [_ en+1]
      ...
      (let [_ en+m-1]
        en+m))))

Here the underscore _ is a second form of syntactic sugar that will be expanded to a fresh (i.e. previously unused) variable. For instance

(let [_ ( observe (normal 0 1) 2.0)] e)

will be expanded by generating some fresh variable symbol, say x284xu ,

(let [x284xu ( observe (normal 0 1) 2.0)] e)

We will assume each instance of _ is a guaranteed-to-be-unique or fresh symbol that is generated by some gensym primitive in the implementing language of the evaluator. We will use the concept of a fresh variable extensively throughout this tutorial, with the understanding that fresh variables are unique symbols in all cases.
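For illustration, a gensym primitive of this kind might look as follows in Python; the prefix and counter scheme are our own assumptions, since the only property that matters is that every returned symbol is new.

import itertools

_counter = itertools.count()

def gensym(prefix='x'):
    """Return a fresh, previously unused variable symbol."""
    return f'{prefix}{next(_counter)}'

print(gensym(), gensym())   # e.g. 'x0' 'x1'; never the same symbol twice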

2.2.2 For loops

A second syntactic inconvenience in Program 2 is that we have to repeat the call to observe-data once for each data point. Just about any language provides looping constructs for this purpose. In the FOPPL we will make use of two such constructs. The first is the foreach form, which has the following syntax

(foreach c
  [v_1 e_1 ... v_n e_n]
  e)

This form desugars into a vector containing let forms

(vector
  (let [v_1 (get e_1 0)
        ...
        v_n (get e_n 0)]
    e)
  ...
  (let [v_1 (get e_1 (- c 1))
        ...
        v_n (get e_n (- c 1))]
    e))

Note that this syntax looks very similar to that of the let form. However, whereas let binds each variable to a single value, the foreach form associates each variable with a sequence and then maps over the values in this sequence for a total of c steps, returning a vector of results. If the length of any of the bound sequences is less than c , then the desugared let forms will result in a runtime error.
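Operationally, foreach behaves like a bounded map. The Python sketch below is an illustrative rendering of this semantics (the dictionary-of-sequences encoding of the bindings and the function-valued body are our own choices), and it reproduces the runtime error when a bound sequence is shorter than c.

def foreach(c, bindings, body):
    """Evaluate body at indices 0..c-1, binding each variable to the i-th
    element of its sequence; return the vector of results."""
    results = []
    for i in range(c):
        env = {v: seq[i] for v, seq in bindings.items()}  # IndexError if a sequence is too short
        results.append(body(**env))
    return results

# Mirrors (foreach 3 [x [1 2 3] y [10 20 30]] (+ x y))
print(foreach(3, {'x': [1, 2, 3], 'y': [10, 20, 30]}, lambda x, y: x + y))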

With the foreach form, we can rewrite Program 2 without having to make use of the helper function observe-data

(let [y-values [2.1 3.9 5.3 7.7 10.2]
      slope ( sample (normal 0.0 10.0))
      intercept ( sample (normal 0.0 10.0))]
  (foreach 5
    [x (range 1 6)
     y y-values]
    (let [fx (+ (* slope x) intercept)]
      ( observe (normal fx 1.0) y)))
  [slope intercept])

There is a very specific reason why we defined the foreach syntax using a constant c for the number of loop iterations, (foreach c [v_1 e_1 ... v_n e_n] e) . Suppose we were to define the syntax using an arbitrary expression e' instead, as in (foreach e' [v_1 e_1 ... v_n e_n] e) . Then we could write programs such as

(let [m ( sample (poisson 10.0))]
  (foreach m []
    ( sample (normal 0 1))))

This defines a program in which there is no upper bound on the number of times that the expression ( sample (normal 0 1)) will be evaluated. By requiring c to be a constant, we can guarantee that the number of iterations is known at compile time.

Note that there are less obtrusive mechanisms for achieving the functionality of foreach , which is fundamentally a language feature that maps a function, here the body, over a sequence of arguments, here the let -like bindings. Such functionality is much easier to express and implement using higher-order language features like those discussed in Chapter 5.

2.2.3 Loop forms

The second looping construct that we will use is the loop form, which has the following syntax.

(loop c e f e_1 ... e_n)

Once again, c must be a non-negative integer constant, and f must be a procedure, either primitive or user-defined. This notation can be used to write most kinds of for loops. Desugaring this syntax rolls out a nested set of lets and function calls in the following precise way

(let [a_1 e_1
      a_2 e_2
      ...
      a_n e_n]
  (let [v_0 (f 0 e a_1 a_2 ... a_n)]
    (let [v_1 (f 1 v_0 a_1 a_2 ... a_n)]
      (let [v_2 (f 2 v_1 a_1 a_2 ... a_n)]
        ...
          (let [v_{c-1} (f (- c 1) v_{c-2} a_1 a_2 ... a_n)]
            v_{c-1}) ... )))

where a_1, ..., a_n and v_0, ..., v_{c-1} are fresh variables. Note that the loop sugar computes an iteration over the fixed set of indices 0, 1, ..., c - 1.
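Operationally, the loop form is a left fold over the indices 0 through c - 1. The Python sketch below (again an illustration in an assumed implementing language, not FOPPL code) makes the threading of the accumulator explicit.

def loop(c, e, f, *args):
    """Call f on (index, accumulator, extra args) for indices 0..c-1,
    feeding each return value back in as the next accumulator."""
    acc = e
    for i in range(c):
        acc = f(i, acc, *args)
    return acc

# Mirrors (loop 5 0.0 add-index) where add-index sums the loop indices
print(loop(5, 0.0, lambda i, acc: acc + i))   # prints 10.0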

To illustrate how the loop form differs from the foreach form, we show a new variant of the linear regression example in Program 3. In this version of the program, we not only observe a sequence of values y_n according to a normal centered at fx_n = (+ (* slope x_n) intercept) , but we also compute the sum of squared residuals \sum_n (y_n - fx_n)^2 . To do this, we define a function regr-step , which accepts an argument n , the index of the loop iteration. It also accepts a second argument r2 , which represents the sum of squares for the preceding datapoints. Finally it accepts the arguments xs , ys , slope , and intercept , which we have also used in previous versions of the program.

(defn regr-step [n r2 xs ys slope intercept]
  (let [x (get xs n)
        y (get ys n)
        fx (+ (* slope x) intercept)
        r (- y fx)]
    ( observe (normal fx 1.0) y)
    (+ r2 (* r r))))
(let [xs [1.0 2.0 3.0 4.0  5.0]
      ys [2.1 3.9 5.3 7.7 10.2]
      slope ( sample (normal 0.0 10.0))
      bias  ( sample (normal 0.0 10.0))
      r2 (loop 5 0.0 regr-step xs ys slope bias)]
  [slope bias r2])
Program 3: The Bayesian linear regression model, written using the loop form.

At each loop iteration, the function regr-step computes the residual and returns the value (+ r2 (* r r)) , which becomes the new value for r2 at the next iteration. The value of the entire loop form is the value of the final call to regr-step , which is the sum of squared residuals.

In summary, the difference between loop and foreach is that loop can be used to accumulate a result over the course of the iterations. This is useful when you want to compute some form of sufficient statistics, filter a list of values, or really perform any sort of computation that iteratively builds up a data structure. The foreach form provides a much more specific loop type that evaluates a single expression repeatedly with different values for its variables. From a statistical point of view, we can think of loop as defining a sequence of dependent variables, whereas foreach creates conditionally independent variables.
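In functional-programming terms, foreach plays the role of map and loop the role of a left fold (reduce). The Python sketch below illustrates the distinction on the regression data from Program 3, with slope and intercept fixed to arbitrary values purely for illustration.

from functools import reduce

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 5.3, 7.7, 10.2]
slope, intercept = 2.0, 0.0            # arbitrary fixed values, for illustration only

# foreach-like: independent per-datapoint computations (a map)
residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]

# loop-like: a result accumulated across iterations (a fold)
r2 = reduce(lambda acc, r: acc + r * r, residuals, 0.0)
print(r2)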

2.3 Examples

Now that we have defined the fundamental expression forms in the FOPPL, along with syntactic sugar for variable bindings and loops, let us look at how we would use the FOPPL to define some models that are commonly used in statistics and machine learning.

2.3.1 Gaussian mixture model

We will begin with a three-component Gaussian mixture model [65]. A Gaussian mixture model is a density estimation model, often used for clustering, in which each data point y_i is assigned to a latent class z_i . We will here consider the following generative model

(let [data [1.1 2.1 2.0 1.9 0.0 -0.1 -0.05]
      likes (foreach 3 []
              (let [mu ( sample (normal 0.0 10.0))
                    sigma ( sample (gamma 1.0 1.0))]
                (normal mu sigma)))
      pi ( sample (dirichlet [1.0 1.0 1.0]))
      z-prior (discrete pi)]
  (foreach 7 [y data]
    (let [z ( sample z-prior)]
      ( observe (get likes z) y)
      z)))
Program 4: FOPPL - Gaussian mixture model with three components
\mu_k \sim \mathrm{Normal}(0.0, 10.0),  for k = 1, 2, 3      (2.1)
\sigma_k \sim \mathrm{Gamma}(1.0, 1.0),  for k = 1, 2, 3      (2.2)
\pi \sim \mathrm{Dirichlet}(1.0, 1.0, 1.0)                    (2.3)
z_i \sim \mathrm{Discrete}(\pi),  for i = 1, ..., 7           (2.4)
y_i \mid z_i = k \sim \mathrm{Normal}(\mu_k, \sigma_k)        (2.5)

Program 4 shows a translation of this generative model to the FOPPL. In this model we first sample the mean mu and standard deviation sigma for each of the 3 mixture components. For each observation y we then sample a class assignment z , after which we observe y according to the likelihood of the sampled assignment. The return value of this program is the sequence of latent class assignments, which can be used to ask questions like “Are these two datapoints similar?”.
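To relate the program to the generative model above, the following numpy sketch draws one joint sample from the prior and scores the data under the sampled assignments. It is an illustration only (no probabilistic programming system is involved), and it assumes that (gamma 1.0 1.0) uses a shape-rate parameterization; with both parameters equal to 1.0 the distinction does not matter.

import numpy as np

rng = np.random.default_rng(0)
data = np.array([1.1, 2.1, 2.0, 1.9, 0.0, -0.1, -0.05])

mu = rng.normal(0.0, 10.0, size=3)           # component means
sigma = rng.gamma(1.0, 1.0, size=3)          # component standard deviations
pi = rng.dirichlet([1.0, 1.0, 1.0])          # mixture weights
z = rng.choice(3, p=pi, size=len(data))      # latent class assignments

# log-likelihood of each observation under its sampled component
log_lik = (-0.5 * ((data - mu[z]) / sigma[z]) ** 2
           - np.log(sigma[z]) - 0.5 * np.log(2 * np.pi))
print(z, log_lik.sum())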

2.3.2 Hidden Markov model

(defn hmm-step [t states data trans-dists likes]
  (let [z ( sample (get trans-dists
                       (last states)))]
    ( observe (get likes z)
             (get data t))
    (append states z)))
(let [data [0.9 0.8 0.7 0.0 -0.025 -5.0 -2.0 -0.1
            0.0 0.13 0.45 6 0.2 0.3 -1 -1]
      trans-dists [(discrete [0.10 0.50 0.40])
                   (discrete [0.20 0.20 0.60])
                   (discrete [0.15 0.15 0.70])]
      likes [(normal -1.0 1.0)
             (normal 1.0 1.0)
             (normal 0.0 1.0)]
      states [( sample (discrete [0.33 0.33 0.34]))]]
  (loop 16 states hmm-step
    data trans-dists likes))
Program 5: FOPPL - Hidden Markov model

As a second example, let us consider Program 5 which denotes a hidden Markov model (HMM) [91] with known initial state, transition, and observation distributions governing sequential observations.

In this program we begin by defining a vector of data points data , a vector of transition distributions trans-dists and a vector of state likelihoods likes . We then loop over the data using a function hmm-step , which returns a sequence of states.

At each loop iteration, the function hmm-step does three things. It first samples a new state z from the transition distribution associated with the preceding state. It then observes the data point at time t according to the likelihood component for the current state. Finally, it appends the state z to the sequence states . The vector of accumulated latent states is the return value of the program, and thus the object whose joint posterior distribution is of interest.
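The following numpy sketch simulates the same generative process forward, sampling states from the prior and accumulating the log-likelihood of the data; it is an illustration of the model, not an inference method, and the hard-coded unit standard deviation mirrors the likes vector in the program.

import numpy as np

rng = np.random.default_rng(0)
data = [0.9, 0.8, 0.7, 0.0, -0.025, -5.0, -2.0, -0.1,
        0.0, 0.13, 0.45, 6, 0.2, 0.3, -1, -1]
trans = [[0.10, 0.50, 0.40], [0.20, 0.20, 0.60], [0.15, 0.15, 0.70]]
like_means = [-1.0, 1.0, 0.0]                 # all likelihoods have standard deviation 1.0

states = [rng.choice(3, p=[0.33, 0.33, 0.34])]
log_lik = 0.0
for y in data:                                # one hmm-step per data point
    z = rng.choice(3, p=trans[states[-1]])    # transition from the preceding state
    log_lik += -0.5 * (y - like_means[z]) ** 2 - 0.5 * np.log(2 * np.pi)
    states.append(z)
print(states, log_lik)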

2.3.3 A Bayesian Neural Network

(let [weight-prior (normal 0 1)
      W_0 (foreach 10 []
            (foreach 1 [] ( sample weight-prior)))
      W_1 (foreach 10 []
            (foreach 10 [] ( sample weight-prior)))
      W_2 (foreach 1 []
            (foreach 10 [] ( sample weight-prior)))
      b_0 (foreach 10 []
            (foreach 1 [] ( sample weight-prior)))
      b_1 (foreach 10 []
            (foreach 1 [] ( sample weight-prior)))
      b_2 (foreach 1 []
            (foreach 1 [] ( sample weight-prior)))
      x   (mat-transpose [[1] [2] [3] [4] [5]])
      y   [[1] [4] [9] [16] [25]]
      h_0 (mat-tanh (mat-add (mat-mul W_0 x)
                             (mat-repmat b_0 1 5)))
      h_1 (mat-tanh (mat-add (mat-mul W_1 h_0)
                             (mat-repmat b_1 1 5)))
      mu  (mat-transpose
             (mat-tanh
               (mat-add (mat-mul W_2 h_1)
                        (mat-repmat b_2 1 5))))]
   (foreach 5 [y_r y
               mu_r mu]
     (foreach 1 [y_rc y_r
                 mu_rc mu_r]
       ( observe (normal mu_rc 1) y_rc)))
   [W_0 b_0 W_1 b_1])
Program 6: FOPPL - A Bayesian Neural Network

Traditional neural networks are fixed-dimension computation graphs, which means that they too can be expressed in the FOPPL. In the following we demonstrate this with an example taken from the documentation for Edward [118], a probabilistic programming library based on fixed computation graphs. The example shows a Bayesian approach to learning the parameters of a three-layer neural network with an input of dimension one, two hidden layers of dimension ten, an independent and identically Gaussian-distributed output of dimension one, and tanh activations at each layer. The program inlines five data points and represents the posterior distribution over the parameters of the neural network. We have assumed, in this code, the existence of matrix primitive functions, e.g. mat-mul , whose meaning is clear from context (matrix multiplication), sensible dimension-aware pointwise mat-add and mat-tanh functionality, vector-of-vectors matrix storage, and so on.

This example provides an opportunity to reinforce the close relationship between optimization and inference. The task of estimating neural-network parameters is typically framed as an optimization in which the free parameters of the network are adjusted, usually via gradient descent, so as to minimize a loss function. This neural-network example can be seen as doing parameter learning too, except using the tools of inference to discover the posterior distribution over model parameters. In general, all parameter estimation tasks can be framed as inference simply by placing a prior over the parameters of interest as we do here.

It can also be noted that, in this setting, any of the activations of the neural network trivially could be made stochastic, yielding a stochastic computation graph [102], rather than a purely deterministic neural network.

Finally, the point of this example is not to suggest that the FOPPL is the language that should be used for denoting neural network learning and inference problems; it is instead to show that the FOPPL is sufficiently expressive to denote neural networks based on fixed computation graphs. Even though we have shown only one example of a multilayer perceptron, it is clear that convolutional neural networks, recurrent neural networks of fixed length, and the like, can all be denoted in the FOPPL.

2.3.4 Translating BUGS models

# data
list(t = c(94.3, 15.7, 62.9, 126, 5.24,
           31.4, 1.05, 1.05, 2.1, 10.5),
     y = c(5, 1, 5, 14, 3, 19, 1, 1, 4, 22),
     N = 10)
# inits
list(a = 1, b = 1)
# model
{
  for (i in 1 : N) {
     theta[i] ~ dgamma(a, b)
     l[i] <- theta[i] * t[i]
     y[i] ~ dpois(l[i])
  }
  a ~ dexp(1)
  b ~ dgamma(0.1, 1.0)
}
Program 7: The Pumps example model from BUGS [81].
(defn data []
  [[94.3 15.7 62.9 126 5.24 31.4 1.05 1.05 2.1 10.5]
   [5 1 5 14 3 19 1 1 4 22]
   [10]])
(defn t [i] (get (get (data) 0) i))
(defn y [i] (get (get (data) 1) i))
(defn loop-iter [i _ a b]
  (let [theta ( sample (gamma a b))
        l     (* theta (t i))]
    ( observe (poisson l) (y i))))
(let [a ( sample (exponential 1))
      b ( sample (gamma 0.1 1.0))]
  (loop 10 nil loop-iter a b)
  [a b])
Program 8: FOPPL - the Pumps example model from BUGS

The FOPPL as specified is sufficiently expressive that, for instance, BUGS programs can be compiled to the FOPPL. Program 7 shows one of the examples included with the BUGS system [81]. This model is a conjugate gamma-Poisson hierarchical model, which is to say that it has the following generative model:

a \sim \mathrm{Exponential}(1.0)                              (2.6)
b \sim \mathrm{Gamma}(0.1, 1.0)                               (2.7)
\theta_i \sim \mathrm{Gamma}(a, b),  for i = 1, ..., 10       (2.8)
y_i \sim \mathrm{Poisson}(\theta_i \, t_i),  for i = 1, ..., 10   (2.9)

Program 7 shows this model in the BUGS language. Program 8 shows a translation to the FOPPL that was returned by an automated BUGS-to-FOPPL compiler. Note the similarities between these languages despite the substantial syntactic differences. In particular, both require that the number of loop iterations is fixed and finite. In BUGS the variables whose values are known appear in a separate data block. The symbol ~ is used to define random variables, which can be either latent or observed, depending on whether a value for the random variable is present. In our FOPPL the distinction between observed and latent random variables is made explicit through the syntactic difference between sample and observe. A second difference is that a BUGS program can in principle be used to compute a marginal on any variable in the program, whereas a FOPPL program specifies a marginal of the full posterior through its return value. As an example, in this particular translation, we treat theta as a nuisance variable, which is not returned by the program, although we could have used the loop construct to accumulate a sequence of theta values.

These minor differences aside, the BUGS language and the FOPPL essentially define equivalent families of probabilistic programs. An advantage of writing this text using the FOPPL, rather than an existing language like BUGS, is that FOPPL programs are comparatively easy to reason about and manipulate, since there are only 8 expression forms in the language. In the next chapter we will exploit this in order to mathematically define a translation from FOPPL programs to Bayesian networks and factor graphs, keeping in mind that all the basic concepts that we will employ also apply to other probabilistic programming systems, such as BUGS.

2.4 A Simple Purely Deterministic Language

There is no optimal place to put this section so it appears here, although it is very important for understanding what is written in the remainder of this tutorial.

In subsequent chapters it will become apparent that the FOPPL can be understood in two different ways – one as a language for specifying graphical-model data structures on which traditional inference algorithms may be run, the other as a language that requires a non-standard interpretation in some implementing language to characterize the denoted posterior distribution.

In the case of graphical-model construction it will be necessary to have a language for purely deterministic expressions. To foreshadow, this language will be used to express link functions in the graphical model. More precisely, and in contrast to the usual definition of a link function in statistics, the purely deterministic language will encode functions that take values of parent random variables and produce distribution objects for their children. These link functions cannot have random variables inside them; such a variable would instead be another node in the graphical model.

Moreover, we can further simplify this link-function language by removing user-defined functions, effectively requiring their function bodies, if used, to be inlined. This yields a cumbersome language in which to program manually, but an excellent language to target and evaluate because of its simplicity.

2.4.1 Deterministic Expressions

We will call expressions in the FOPPL that involve only deterministic computations and no user-defined procedure calls, e.g. (+ (/ 2.0 6.0) 17) , “0th-order expressions”. Such expressions will play a prominent role when we consider the translation of our probabilistic programs to graphical models in the next chapter. In order to identify and work with these deterministic expressions we define a language with the following extremely simple grammar:

c        A constant value or primitive operation
E ::=  c  |  v  |  (if E_1 E_2 E_3)  |  (c E_1 ... E_n)
Language 131: Sub-language for purely deterministic computations

Note that neither sample nor observe statements appear in this syntax, and that procedure calls are allowed only for primitive operations, not for user-defined procedures. These constraints ensure that expressions in this sub-language cannot depend on any probabilistic choices or conditioning.
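As an illustration of how an implementation might recognize such expressions, the following Python sketch checks membership in this sub-language over the nested-tuple representation used earlier; the set of primitive operation names is passed in, since it is implementation specific.

def is_deterministic(expr, primitives):
    """True if expr (a constant, a variable name, or a nested tuple) contains
    no sample or observe and calls only primitive operations."""
    if not isinstance(expr, tuple):               # constant or variable
        return True
    head = expr[0]
    if head in ('sample', 'observe', 'let'):
        return False
    args_ok = all(is_deterministic(e, primitives) for e in expr[1:])
    if head == 'if':
        return args_ok
    return head in primitives and args_ok         # primitive procedure call only

print(is_deterministic(('+', ('/', 2.0, 6.0), 17), {'+', '/'}))          # True
print(is_deterministic(('+', ('sample', ('normal', 0, 1)), 1), {'+'}))   # False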

The examples provided in this chapter should convince you that many common models and inference problems from statistics and machine learning can be denoted as FOPPL programs. What remains is to translate FOPPL programs into other mathematical or programming-language formalisms whose semantics are well established, so that we can define, at least operationally, the semantics of FOPPL programs and, in so doing, establish in your mind a clear idea of how probabilistic programming languages that are formally equivalent in expressivity to the FOPPL can be implemented.

3.1 Compilation to a Graphical Model

Programs written in the FOPPL specify probabilistic models over finitely many random variables. In this section, we will make this aspect clear by presenting the translation of these programs into finite graphical models. In the subsequent sections, we will show how this translation can be exploited to adapt inference algorithms for graphical models to probabilistic programs.

We specify translation using the following ternary relation, similar to the so-called big-step evaluation relation from the programming languages community:

\rho, \phi, e \Downarrow G, E.    (3.1)

In this relation, \rho is a mapping from procedure names to their definitions, \phi is a logical predicate for the flow control context, and e is an expression we intend to compile. This expression is translated to a graphical model G and an expression E in the deterministic sub-language described in Section 2.4.1. The expression E is deterministic in the sense that it involves neither sample nor observe. It describes the return value of the original expression in terms of random variables in G. Vertices in G represent random variables, and arcs dependencies among them. For each random variable in G, we will define a probability density or mass in the graph. For observed random variables, we additionally define the observed value, as well as a logical predicate that indicates whether the observe expression is on the control flow path, conditioned on the values of the latent variables.

Definition of a Graphical Model

We define a graphical model G = (V, A, P, Y) as a tuple containing (i) a set of vertices V that represent random variables; (ii) a set of arcs A (i.e. directed edges) that represent conditional dependencies between random variables; (iii) a map P from vertices to deterministic expressions that specify the probability density or mass function for each random variable; and (iv) a partial map Y that for each observed random variable contains a pair consisting of a deterministic expression for the observed value and a predicate expression that evaluates to true when this observation is on the control flow path.
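For concreteness, such a graphical model could be represented in an implementation by a small record type; the Python sketch below is one possible encoding (the field names simply mirror the definition above) and is not prescribed by the text.

from dataclasses import dataclass, field

@dataclass
class Graph:
    V: set = field(default_factory=set)      # random-variable vertices
    A: set = field(default_factory=set)      # arcs: (parent, child) pairs
    P: dict = field(default_factory=dict)    # vertex -> density/mass expression
    Y: dict = field(default_factory=dict)    # observed vertex -> (observed-value expression, predicate)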

Before presenting a set of translation rules that can be used to compile any FOPPL program to a graphical model, we will illustrate the intended translation using a simple example:

  (let [z ( sample (bernoulli 0.5))
        mu (if (= z 0) -1.0 1.0)
        d (normal mu 1.0)
        y 0.5]
    ( observe d y)
    z)
Program 132: A simple example FOPPL program.

Figure 3.1: The graphical model corresponding to Program 132

This program describes a two-component Gaussian mixture with a single observation. The program first samples z from a Bernoulli distribution, based on which it sets a likelihood parameter mu to -1.0 or 1.0 , and observes a value y = 0.5 from a normal distribution with mean mu . This program defines a joint distribution p(y, z) . The inference problem is then to characterize the posterior distribution p(z | y = 0.5) . Figure 3.1 shows the graphical model and the purely deterministic link functions that correspond to Program 132.

In the evaluation relation \rho, \phi, e \Downarrow G, E, the source code of the program is represented as a single expression e . The variable \rho is an empty map, since there are no procedure definitions. At the top level, the flow control predicate \phi is true . The graphical model G = (V, A, P, Y) and the result expression E that this program translates to are as follows. The vertex set V = {z, y} contains two variables, whereas the arc set A = {(z, y)} contains a single pair to mark the conditional dependence relationship between these two variables. In the map P, the probability mass for z is defined by the target language expression (pbern z 0.5) . Here pbern refers to a function in the target language that implements the probability mass function for the Bernoulli distribution. Similarly, the density for y is defined using (pnorm y (if (= z 0) -1.0 1.0) 1.0) , which implements the probability density function for the normal distribution. Note that the expression (if (= z 0) -1.0 1.0) for the program variable mu has been substituted into the density for y . Finally, the map Y contains a single entry that holds the observed value 0.5 for y , paired, as in the definition above, with the control flow predicate, which is here simply true . The result expression is E = z , the value returned by the program.
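Using the Graph encoding sketched above and storing each deterministic expression as a plain string, the result of translating Program 132 could be recorded as follows; the strings simply transcribe the densities and observed value discussed in the preceding paragraph.

g = Graph(
    V={'z', 'y'},
    A={('z', 'y')},
    P={'z': '(pbern z 0.5)',
       'y': '(pnorm y (if (= z 0) -1.0 1.0) 1.0)'},
    Y={'y': ('0.5', 'true')},        # observed value and (trivially true) control flow predicate
)
E = 'z'                              # the result expression: the program returns z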

Assigning Symbols to Variable Nodes

In the above example we used the mathematical symbol z to refer to the random variable associated with the expression ( sample (bernoulli 0.5)) and the symbol y to refer to the observed variable with the expression ( observe d y) . In general there will be one node in the network for each sample and observe expression that is evaluated in a program. In the above example, there also happens to be a program variable z that holds the value of the sample expression for node z , and a program variable y that holds the observed value for node y , but this is of course not necessarily always the case. A particularly common example of this arises in programs that have procedures. Here the same sample and observe expressions in the procedure body can be evaluated multiple times. Suppose for example that we were to modify our program as follows:

(defn norm-gamma
  [m l a b]
  (let [tau ( sample (gamma a b))
        sigma (/ 1.0 (sqrt tau))
        mu ( sample (normal m (/ sigma (sqrt l))))]
    (normal mu sigma)))
(let [z ( sample (bernoulli 0.5))
      d0 (norm-gamma -1.0 0.1 1.0 1.0)
      d1 (norm-gamma 1.0 0.1 1.0 1.0)]
  (