1 Introduction
Probabilistic approaches have become standard in system identification, machine learning and statistics, particularly in situations where the quantification of uncertainty or assessment of risk is paramount. A typical workflow proceeds through several stages, from experimental design, to data collection, to model development, to prior elicitation, to inference, to decision making. At least part of this workflow involves computer code. For the inference stage, this is often bespoke code tailored to a particular study: it couples the implementation of the model with the implementation of the chosen inference method. The model code is not easy to reuse with a different method, nor the method code with a different model.
As data size and compute capacity increase, the complexity of models, and their implementations, increases too. Complex models arise in numerous fields, including nonparametrics, where there is an unbounded number of variables, in object tracking and phylogenetics, where data structures such as random finite sets and random trees appear, and in numerical weather prediction and oceanography, where specialized numerical methods are used for continuous-time systems and partial differential equations. For such complex models, bespoke implementations may involve nontrivial—even tedious—manual work, such as deriving the full conditionals of the posterior distribution, or calculating the gradients of a complex likelihood function, or tuning numerical methods for stability. The effort may need repeating if the model or inference method is later changed.
A more scalable approach to implementation is desirable. Recognizing this, there has been a tradition of software that separates model specification from method implementation for the purposes of inference (e.g. WinBUGS [28], OpenBUGS [27], JAGS [44, 43], Stan [49], Infer.NET [32], LibBi [33], Biips [50]). Typically, this software supports one predominant method but many possible models. The methods include Markov chain Monte Carlo (MCMC) methods such as the Gibbs sampler for WinBUGS, OpenBUGS and JAGS, Hamiltonian Monte Carlo (HMC) for Stan, variational methods for Infer.NET, and Sequential Monte Carlo (SMC) methods for LibBi and Biips. The software provides a way to adapt the method for a large number of models and automate routine procedures, such as adaptation of Markov kernels in WinBUGS
[48, p. 6], and automatic differentiation in Stan. It is typical for these languages to restrict the set of probabilistic models that can be expressed, in order to provide an inference method that works well for this restricted set. Stan, for example, works only with differentiable models using HMC, while LibBi works only with state-space models using SMC-based methods. Models outside of these sets may require more specialist tools. In phylogenetics, for example, RevBayes [21] provides the particular modeling feature of tree plates to represent phylogenetic trees, for which specialized Markov kernels can be applied within MCMC.

Naturally, methods have also become more complex to accommodate these more complex models and larger data sets. Modern Monte Carlo methods often nest multiple baseline algorithms, such as SMC within MCMC, as in particle MCMC [2], or SMC within SMC, as in SMC² [7]. Data-subsampling algorithms [3] are becoming standard for dealing with large data sets. Monte Carlo samplers increasingly use gradient information, such as the Metropolis-adjusted Langevin algorithm (MALA) [45], HMC [36], and deterministic piecewise samplers [6, 4, 54]. Various methods manipulate the stream of random numbers (e.g. [34, 15, 10]) or potential functions (e.g. [56, 9]) to improve estimates. Software has begun to address this complexity in methods, too. NIMBLE [8], for example, uses models similar to those of WinBUGS, OpenBUGS and JAGS, but provides manual customization of the Markov kernels used within MCMC.

We see value in flexible tools that allow for the implementation of both complex models and complex methods, and in moving from one-to-many tools (one method, many models) to many-to-many tools (many methods, many models). To this end, we consider the potential of probabilistic programming: a programming paradigm that aims to accelerate the workflow with new programming languages and software tools tailored to probabilistic modeling and inference. In particular, it aims to develop Turing-complete programming languages for model implementation, extending existing languages for model specification with programming concepts such as conditionals, recursion (loops), and higher-order functions, for greater expressivity. It aims to decouple the implementation of models and methods into modular components that can be reassembled and reused in multiple configurations in a many-to-many manner. It aims to automate the selection of an inference method for a given model, and to automate the tuning necessary for it to work well in practice. These goals remain aspirational, and an active area of research across disciplines including machine learning, statistics, system identification, artificial intelligence and programming languages.
A number of probabilistic programming languages have been developed with such aims in recent years. Examples include Church [17], BLOG [31], Venture [30], WebPPL [16], Anglican [51], Figaro [41], Turing [14], Edward [52], and Pyro [1]. These all explore different approaches that reflect, in the first instance, the different problem domains to which they are oriented, and in the second instance, the relatively young age of the field. All of these languages are considered universal probabilistic programming languages—i.e. Turing-complete programming languages that admit arbitrary models rather than restricted sets. This is not to say, of course, that efficient inference is possible for all models that are admitted. But these languages can work well for the large class of models for which they do have efficient inference methods, they do provide useful libraries for implementing probabilistic models, and they may support the development or customization of inference methods from within the same language.
We introduce a new universal probabilistic programming language called Birch (www.birchlang.org), which implements the ideas presented in this work. Birch is an imperative language geared toward the object-oriented and generic programming paradigms. It draws inspiration from several sources, notably from LibBi [33]—for which it is something of a successor—but in moving from model specification language to universal probabilistic programming language it also draws ideas from modern object-oriented programming languages such as Swift. Birch is Turing complete, with control flow statements such as conditionals and loops, support for unbounded recursion and higher-order functions, and dynamic memory management. Birch code compiles to C++14 code, providing ready access to the established ecosystem of C/C++ libraries for scientific and numeric computing. A key component of Birch is its implementation of delayed sampling [35], a heuristic that provides optimizations via partial analytical solutions to inference problems. While broadly applicable across problem domains, the approach taken in Birch is invariably flavored by the perspective of its developers, and so by applications in statistics, machine learning and system identification.
This work is intended as a “big picture” perspective on the probabilistic workflow, and how new ideas in probabilistic programming can assist this. Birch is the concrete manifestation of these ideas. Throughout, we make use of the state-space model as a running example. While the ideas presented are not restricted to such models, they concretely illustrate some of the core concepts, and have numerous practical applications. In Section 2 we introduce a formal description of the class of models considered in probabilistic programming. In Section 3 we introduce some methods of inference and consider some of the ways in which probabilistic programming languages can automate the many-to-many matching of models with inference methods. In Section 4 we introduce the Birch probabilistic programming language as a specific implementation of these ideas. In Section 5 we work through the concrete example of a multiple object tracking model—a state-space model with random size—and show how it is implemented in Birch, and the inference results obtained. Finally, we summarize in Section 6.
2 Models expressed in a programming language
Probabilistic programming considers models expressed in Turing-complete programming languages. Such models are usually referred to as probabilistic programs—qualified as universal probabilistic programs when one wishes to regard only the broadest class in terms of expressivity—and described in programming language nomenclature. Here, we adopt probabilistic nomenclature instead, to provide a more accessible treatment for the intended audience. Taking the lead from the term graphical model—a model expressed in a graphical language—we suggest that the term programmatic model—a model expressed in a programming language—might be more appropriate for this audience, and adopt this term throughout. Specifically, we avoid the use of the term program when referring only to a model implementation, as in ordinary usage one thinks of a computer program as combining the implementation of both a model and an inference method, which can cause confusion. The term can also be misleading given unrelated but similarly named concepts in system identification, such as linear programs and stochastic programs.
We follow the statistics convention of using uppercase letters to denote random variables (e.g. $X$) and lowercase letters to denote instantiations of them (e.g. $x$), with $X = x$. We then adopt measure-theoretic notation to clearly distinguish between distributions (which we will ultimately simulate) and likelihood functions (which we will ultimately evaluate): the distribution of a random variable $X$ is denoted $P(\mathrm{d}x)$, while evaluation of an associated probability density function (pdf, for continuous-valued random variables) or probability mass function (pmf, for discrete-valued random variables) is denoted $p(x)$.

Assume that we have a countably infinite set of random variables $(X_1, X_2, \dots)$, with a joint probability distribution over them, which has been implemented in code in some programming language. The only stochasticity available to the code is via these random variables. We execute the code, and as it runs it encounters a finite subset of the random variables in some order determined by that code. Denote this order by a permutation $u$, with its (random) length denoted $N$, defining a sequence $(X_{u_1}, \dots, X_{u_N})$. The first element, $u_1$, is always the same. Each subsequent element, $u_n$, is given by a function of the random variables encountered so far, denoted $\nu$ (for next), so that $u_n = \nu(x_{u_1}, \dots, x_{u_{n-1}})$. This function is implied by the code. Note that $u_n$ is deterministic given the preceding random variates, as there is no stochasticity available to the code except via these. This is also why $u_1$ is always the same: no source of stochasticity precedes it.

As each random variable $X_{u_n}$ is encountered, the code associates it with a distribution
$$P(\mathrm{d}x_{u_n} \mid x_{\pi_n}),$$
where $\pi_n = \pi(x_{u_1}, \dots, x_{u_{n-1}})$ (for parents) is a deterministic function of the preceding random variates, selecting from them a subset on which the distribution of $X_{u_n}$ depends. It is possible that the distribution also depends on exogenous factors such as user input; we leave this implicit to simplify notation.
At some point the execution terminates, having established the distribution
$$P(\mathrm{d}x_{u_1}, \dots, \mathrm{d}x_{u_N}) = \prod_{n=1}^{N} P(\mathrm{d}x_{u_n} \mid x_{\pi_n}).$$
We will execute the code several times. The $i$th execution will be associated with the distribution $P_i$, given by
$$P_i(\mathrm{d}x_{u_{i,1}}, \dots, \mathrm{d}x_{u_{i,N_i}}) = \prod_{n=1}^{N_i} P(\mathrm{d}x_{u_{i,n}} \mid x_{\pi_{i,n}}),$$
with $u_{i,n} = \nu(x_{u_{i,1}}, \dots, x_{u_{i,n-1}})$. Subscript $i$ is used to denote execution-dependent variables. For different executions $i$ and $j$, it is possible for the number of random variables encountered ($N_i$ and $N_j$) to differ, for the sequences of random variables $(X_{u_{i,1}}, \dots, X_{u_{i,N_i}})$ and $(X_{u_{j,1}}, \dots, X_{u_{j,N_j}})$ to differ, and even for the two sets of random variables encountered to be disjoint but for the first (recall that the first random variable to be encountered is always the same). In general, we should therefore assume that $P_i$ and $P_j$ are not the same, but rather components of a larger mixture.
The above describes the class of models that we refer to as programmatic models. The permutation $u$ reflects the fact that a program can make conditional choices during execution that are based on the simulation of random variables, and that these may lead to very different outcomes. Consider, for example, a model implementation that begins with a coin flip: on heads it executes one model, on tails some other model. Such an implementation represents a mixture of two models, but each execution can encounter only one.
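A minimal Python sketch of such a two-component programmatic model (the variable names are illustrative):

```python
import random

def model(rng):
    """One execution of a programmatic model. The first random variable
    (the coin) is always encountered; the conditional chooses which
    sub-model, and hence which further random variables, are encountered."""
    trace = {}                               # variables encountered, in order
    trace["coin"] = rng.random() < 0.5       # first variable: always the same
    if trace["coin"]:
        trace["a"] = rng.gauss(0.0, 1.0)     # sub-model on heads
    else:
        trace["b"] = rng.randint(0, 9)       # sub-model on tails
    return trace

rng = random.Random(1)
t1, t2 = model(rng), model(rng)
```

Each call returns the trace of variables actually encountered; the two sub-models share only the initial coin flip.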
We are interested in two properties of a programmatic model from an inference perspective:

1. structure, by which we mean the factorization of the joint distribution $P(\mathrm{d}x_{u_1}, \dots, \mathrm{d}x_{u_N})$ into conditional distributions $P(\mathrm{d}x_{u_n} \mid x_{\pi_n})$, and
2. form, by which we mean the precise mathematical definition of the conditional distributions $P(\mathrm{d}x_{u_n} \mid x_{\pi_n})$.
2.1 Structure
The structure of a probabilistic model is typically defined as the dependency relationships between random variables. Popular model classes such as hidden Markov models (HMMs), state-space models (SSMs), Markov random fields, etc., encode particular structures for specialist purposes such as dynamical systems and spatial systems. Generalizing these, structure is perhaps most explicitly encoded by graphical models (see e.g. [23, 5, 24]), where a probabilistic model is represented as a graph, with nodes as random variables, and edges encoding the relationships between them. Generic inference techniques such as the sum-product algorithm [39] make explicit use of the graph—and thus the structure of the model—to perform inference efficiently with respect to the number of computations required.

We are interested in the same for programmatic models. For illustration, we can readily compare the class of programmatic models to the class of directed graphical models. Like all graphical models, the nodes of a directed graphical model represent random variables. The edges are directed and represent a conditional dependency relationship from a parent at the tail of the arrow to a child at the head of the arrow. The entire graph must be acyclic.
We can represent directed graphical models as programmatic models within the formal definition above. The nodes and edges are known a priori and establish a joint distribution, over the $N$ random variables, of:
$$P(\mathrm{d}x_1, \dots, \mathrm{d}x_N) = \prod_{n=1}^{N} P(\mathrm{d}x_n \mid x_{\mathrm{pa}(n)}),$$
where we use $\mathrm{pa}(n)$ to denote the set of parents of $X_n$ under the graph. The model implementation in code must necessarily encounter the nodes in a valid topological ordering given the edges. On each execution $i$, the same finite set of random variables $\{X_1, \dots, X_N\}$ is encountered. A directed edge from $X_m$ to $X_n$ indicates that, if $X_n$ is the $k$th random variable to be encountered (i.e. $u_k = n$), then $X_m$ must necessarily have been encountered already, with $m \in \pi_k = \mathrm{pa}(u_k)$. The conditional distribution assigned to any $X_n$ is always the same $P(\mathrm{d}x_n \mid x_{\mathrm{pa}(n)})$. Consequently, the execution establishes the distribution:
$$\prod_{n=1}^{N} P(\mathrm{d}x_{u_n} \mid x_{\pi_n}) = \prod_{n=1}^{N} P(\mathrm{d}x_n \mid x_{\mathrm{pa}(n)}). \tag{1}$$
This is to say that, while executions may encounter the random variables in different orders, according to how the directed graphical model has been implemented, each execution will always encounter the same finite subset of variables, and establish the same structure and form. If this is not the case, then the code must not be a correct implementation of the directed graphical model that was given.
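To make this concrete, the following Python sketch evaluates the joint of a toy three-node directed graphical model by visiting nodes in a valid topological ordering; any valid ordering yields the same joint probability (the Bernoulli tables are hypothetical):

```python
# A tiny directed graphical model: A -> B, A -> C, over binary variables.
# parents[v] lists the parents of node v; cpd evaluates p(x_v | x_parents).
parents = {"A": [], "B": ["A"], "C": ["A"]}

def cpd(v, value, parent_values):
    """Hand-picked Bernoulli tables, purely for illustration."""
    if v == "A":
        p = 0.6
    elif v == "B":
        p = 0.9 if parent_values["A"] else 0.2
    else:  # "C"
        p = 0.3 if parent_values["A"] else 0.7
    return p if value else 1.0 - p

def joint(assignment, order):
    """Evaluate the joint p(x) as the product of conditionals, visiting
    the nodes in the given topological order."""
    prob = 1.0
    for v in order:
        pv = {u: assignment[u] for u in parents[v]}
        prob *= cpd(v, assignment[v], pv)
    return prob

x = {"A": True, "B": False, "C": True}
p1 = joint(x, ["A", "B", "C"])   # one valid topological ordering
p2 = joint(x, ["A", "C", "B"])   # another valid ordering, same joint
```

Here both orderings give 0.6 × 0.1 × 0.3 = 0.018, illustrating the invariance stated in (1).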
This motivates the difficulty of dealing with a programmatic model: unlike for a directed graphical model, the structure of a programmatic model is not known a priori. If a programmatic model were expressed as a graph, the nodes of the graph would not be known until revealed through the function $\nu$ (for next), and only at that same time would incoming edges to each node be revealed through the function $\pi$ (for parents). The model must be executed to discover its structure. But, furthermore, the nodes and edges may differ between executions, so it is not simply a matter of executing the program once to determine the complete structure.
We should not get too caught up on the general at the expense of the useful, however. Directed graphical models are useful, and many models used in practice can be expressed this way—it is those that cannot be expressed this way that motivate the more general treatment of programmatic models. Probabilistic models tend not to be entangled assemblies of random variables connected in arbitrary ways, but rather arrange themselves into recursive substructures such as chains, grids, stars, trees, and layers. We refer to these as structural motifs (see Figure 1). These motifs occur so frequently in probabilistic models that there have been attempts to automatically learn model structure based on them [13]. For example, the chain motif is the dominant feature of HMMs and SSMs, the grid motif that of spatial models, the tree motif that of phylogenetic trees or stream networks, and the layer motif that of neural networks. The star motif occurs whenever there are repeated observations of some system: for example, repeated time series observations of the same dynamical system, where the parameters are identified across time series, or multiple individuals in a medical study sharing common parameters. Parameters with global influence are also a common occurrence, determining the conditional probability distributions over variables in a chain or grid, for example.
These motifs can be used to characterize a model, especially for the purposes of selecting inference methods. While the nature of a programmatic model is that its structure may change between executions, it may be the case that a particular motif persists between executions, and that this is known a priori. If we can characterize this motif, we can leverage specialized inference methods for it, while reserving generalized inference methods for the remainder of the structure. We return to this idea in Section 4.
2.2 Form
The form of a probabilistic model refers to the mathematical definition of its distributions. In the case of programmatic models, this refers to the conditional probability distributions $P(\mathrm{d}x_{u_n} \mid x_{\pi_n})$.
In many cases these are common parametric forms such as Gaussian, Poisson, binomial, beta and gamma distributions, with parameters given by the parents of the random variable. Parametric distributions such as these are readily simulated using standard algorithms (see e.g. [11]), and admit either a pdf or pmf that can be evaluated, and perhaps differentiated with respect to its parameters. More difficult forms are those that can be simulated or evaluated but not both. For example, a nonlinear diffusion process defines a distribution that can be simulated (at least numerically), but it may be prohibitively expensive to evaluate the associated pdf for given initial and final values, or this may have no closed form. Conversely, the classic Ising model is defined as a product of potentials that readily permits evaluation of the likelihood of any state, but requires expensive iterative computations to simulate.

Form may also carry information across structure. For example, where the form of the parents of a random variable is conjugate to the form of that random variable, a conjugate prior relationship is established. In such cases, analytical marginalization and conditioning optimizations may be possible within an inference method.
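For instance, a Gaussian prior on the mean of a Gaussian likelihood with known variance is conjugate, so the posterior follows in closed form; a minimal Python sketch:

```python
def gaussian_gaussian_update(m0, v0, y, v):
    """Conjugate update for mu ~ N(m0, v0) with observation
    y | mu ~ N(mu, v): returns the posterior mean and variance of mu."""
    v1 = 1.0 / (1.0 / v0 + 1.0 / v)   # posterior variance (precisions add)
    m1 = v1 * (m0 / v0 + y / v)       # precision-weighted mean
    return m1, v1

m1, v1 = gaussian_gaussian_update(0.0, 1.0, 2.0, 1.0)
# posterior is N(1.0, 0.5): halfway between prior mean and observation
```

An inference method that recognizes this relationship can condition analytically rather than sampling.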
2.3 Example
We demonstrate how these abstract ideas apply to the concrete case of an SSM. The SSM consists of a latent process $X_{1:T} = (X_1, \dots, X_T)$, observed process $Y_{1:T} = (Y_1, \dots, Y_T)$, and optional parameters $\theta$. The joint probability distribution is given by:
$$P(\mathrm{d}\theta, \mathrm{d}x_{1:T}, \mathrm{d}y_{1:T}) = \underbrace{P(\mathrm{d}\theta)}_{\text{parameter}}\ \underbrace{P(\mathrm{d}x_1 \mid \theta)}_{\text{initial}} \prod_{t=2}^{T} \underbrace{P(\mathrm{d}x_t \mid x_{t-1}, \theta)}_{\text{transition}} \prod_{t=1}^{T} \underbrace{P(\mathrm{d}y_t \mid x_t, \theta)}_{\text{observation}}, \tag{2}$$
where we have assigned common names to the various conditional probability distributions that make up this joint.

There are alternative representations. In the engineering literature, it is common to represent dependencies between the random variables using deterministic functions plus noise. Here, the parameter and initial distributions are as per (2), but now the transition and observation distributions are expressed via the (possibly nonlinear) functions $f$ and $g$, plus independent—often Gaussian—noise terms $V_t$ and $W_t$:
$$X_t = f(X_{t-1}, \theta) + V_t, \tag{3a}$$
$$Y_t = g(X_t, \theta) + W_t, \tag{3b}$$
where $P(\mathrm{d}v_t \mid \theta)$ denotes the distribution of the process noise $V_t$, and $P(\mathrm{d}w_t \mid \theta)$ the distribution of the observation noise $W_t$.
This is a mathematical description of the standard SSM. To understand it as a programmatic model, denote the set of random variables as $(\theta, X_1, Y_1, X_2, Y_2, \dots, X_T, Y_T)$. In this case the set is finite (or, equivalently, the infinite complement of the set is never encountered). An implementation of the model in Birch may look like the following (variable and function declarations have been removed for brevity):
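To fix ideas, the generative process of this model can be sketched in Python; this is an illustration rather than the Birch code, with hypothetical linear $f$ and $g$ and unit-variance Gaussian noise:

```python
import random

def simulate_ssm(T, rng):
    """Forward-simulate the SSM: parameter, initial state, then an
    alternating sequence of transitions and observations, matching the
    order in which the program encounters the random variables."""
    f = lambda x, theta: theta * x     # hypothetical transition function
    g = lambda x, theta: x             # hypothetical observation function
    theta = rng.uniform(0.0, 1.0)              # parameter
    x = [rng.gauss(0.0, 1.0)]                  # initial state X_1
    y = [g(x[0], theta) + rng.gauss(0.0, 1.0)] # observation Y_1
    for t in range(1, T):
        x.append(f(x[-1], theta) + rng.gauss(0.0, 1.0))  # transition to X_{t+1}
        y.append(g(x[-1], theta) + rng.gauss(0.0, 1.0))  # observation Y_{t+1}
    return theta, x, y

theta, x, y = simulate_ssm(5, random.Random(0))
```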
Recall that $\nu$ is the function that gives the next random variable to be encountered given the values of those encountered so far, and $\pi$ gives the parents of that random variable, given the same values. If we think through executing the above code line by line we see that (identifying each random variable with its index), for example, $\nu() = \theta$ and $\pi_1 = \emptyset$; also $\nu(\theta) = X_1$ and $\pi_2 = \{\theta\}$. For some execution $i$, the order of random variables encountered, $(X_{u_{i,1}}, X_{u_{i,2}}, \dots)$, will be:

$u_{i,1} = \nu()$, i.e. $\theta$,
$u_{i,2} = \nu(\theta)$, i.e. $X_1$,
$u_{i,3} = \nu(\theta, x_1)$, i.e. $Y_1$,
$u_{i,4} = \nu(\theta, x_1, y_1)$, i.e. $X_2$,
$u_{i,5} = \nu(\theta, x_1, y_1, x_2)$, i.e. $Y_2$,

and so on, alternating between states and observations up to time $T$.
Now consider an alternative implementation:
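Purely as illustration in Python (again with hypothetical linear $f$ and $g$, not the Birch code), such an alternative might simulate the full state trajectory before any observations; the joint distribution and parent relationships are unchanged, only the order of encounter differs:

```python
import random

def simulate_ssm_alt(T, rng):
    """Same model, different program: the whole state trajectory is
    simulated before any observation, so the random variables are
    encountered in a different order."""
    f = lambda x, theta: theta * x     # hypothetical transition function
    g = lambda x, theta: x             # hypothetical observation function
    theta = rng.uniform(0.0, 1.0)
    x = [rng.gauss(0.0, 1.0)]
    for t in range(1, T):
        x.append(f(x[-1], theta) + rng.gauss(0.0, 1.0))   # all states first
    y = [g(xt, theta) + rng.gauss(0.0, 1.0) for xt in x]  # then all observations
    return theta, x, y

theta, x, y = simulate_ssm_alt(5, random.Random(0))
```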
This expresses precisely the same mathematical model, but in a different programmatic form. When the code is executed, the order in which the random variables are encountered is different to the previous example. We can readily write down the new $\nu$, and the resulting order is given by the trivial permutation $u_n = n$. The function $\pi$ differs because the permutation does, but it establishes the same parent relationships as before. This is not surprising: an SSM can be represented as a directed graphical model, so that the joint distribution associated with each execution is always the same joint distribution that appears in (2), as explained in (1).
This is a simple example. One can imagine more complex code that includes conditionals (e.g. if statements and while loops) that may cause only a subset of the random variables to be encountered on any single execution. The random variables may even be encountered in different orders, or may be countably infinite (rather than finite) in number. SSMs that exhibit this complexity occur in, for example, multiple object tracking. We provide such an example in Section 5.
3 Inference methods for programmatic models
We wish to infer the conditional distribution of one set of random variables, given values assigned to some other set. In a Bayesian context, this amounts to inferring the posterior distribution. For this purpose, we partition the indices of the random variables into two disjoint sets: the observed set $O \in \mathcal{P}(\mathbb{N})$ (where $\mathcal{P}$ denotes the power set) containing the indices of all those random variables for which a value has been given, and the latent set $L$ with all other indices, so that $L = \mathbb{N} \setminus O$. We then clamp the observed random variables to have the given values, i.e. $X_O = x_O$, where we use the subscript $O$ to select a subset rather than a single variable. Inference involves computing the posterior distribution:
$$\underbrace{P(\mathrm{d}x_L \mid x_O)}_{\text{posterior}} = \frac{\overbrace{p(x_O \mid x_L)}^{\text{likelihood}}\ \overbrace{P(\mathrm{d}x_L)}^{\text{prior}}}{\underbrace{p(x_O)}_{\text{evidence}}}, \tag{4}$$
which decomposes into likelihood, prior and evidence terms as annotated. Having obtained the posterior distribution, we may be interested in estimating the posterior expectation of some test function of interest, say $h$:
$$\mathbb{E}\left[h(X_L) \mid x_O\right] = \int h(x_L)\, P(\mathrm{d}x_L \mid x_O),$$
and subsequently making decisions based on this result.
A particular execution of the model code may encounter some subset of the variables in $L$ and $O$, which we denote $L_i$ and $O_i$, respectively. The distribution that results from the execution is then proportional to:
$$\prod_{n \,:\, u_{i,n} \in L_i} P(\mathrm{d}x_{u_{i,n}} \mid x_{\pi_{i,n}}) \tag{5}$$
$$\times \prod_{n \,:\, u_{i,n} \in O_i} p(x_{u_{i,n}} \mid x_{\pi_{i,n}}). \tag{6}$$
As before, different executions $i$ and $j$ may yield different distributions, which may be interpreted as different components of a mixture. In this case, this mixture is the posterior distribution.
A baseline method for inference is importance sampling from the prior. The model code is executed: when encountering a random variable in $L$ it is simulated from the prior, and when encountering a random variable in $O$ a cumulative weight is updated by multiplying in the likelihood of the given value. We have then simulated from the first product (5), and assigned a weight according to the second product (6).
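As a sketch, consider a toy model with a latent mean mu ~ N(0, 1) and observations y_t | mu ~ N(mu, 1); importance sampling from the prior in Python looks like:

```python
import math
import random

def log_normal_pdf(x, mean, var):
    return -0.5 * math.log(2 * math.pi * var) - 0.5 * (x - mean) ** 2 / var

def importance_sample_prior(y_obs, n, rng):
    """Each execution simulates the latent variable from its prior and
    weights by the likelihood of the clamped observations."""
    samples, logw = [], []
    for _ in range(n):
        mu = rng.gauss(0.0, 1.0)                              # simulate from prior
        lw = sum(log_normal_pdf(y, mu, 1.0) for y in y_obs)   # likelihood weight
        samples.append(mu)
        logw.append(lw)
    return samples, logw

def posterior_mean(samples, logw):
    """Self-normalized importance-sampling estimate of E[mu | y]."""
    m = max(logw)
    w = [math.exp(l - m) for l in logw]
    return sum(wi * s for wi, s in zip(w, samples)) / sum(w)

rng = random.Random(42)
samples, logw = importance_sample_prior([1.0, 1.2, 0.8], 20000, rng)
est = posterior_mean(samples, logw)
# For this conjugate toy model the exact posterior mean is (sum of y) / (T + 1) = 0.75.
```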
Use of the prior distribution as a proposal is likely to produce estimates with high variance. We can improve upon this in a number of ways, such as by maintaining multiple executions simultaneously and selecting from amongst them in a resampling step, producing a particle filter [18]. The only complication here is the alignment of resampling points between the multiple executions, in order that resampling is actually beneficial—each execution may encounter observations in a different order. But because importance sampling and the most basic particle filters—up to alignment—require only forward simulation from the prior and pointwise evaluation of the likelihood, they are unaffected by many of the complexities of programmatic models, and so are particularly suitable for inference. For this reason they have become a common choice for inference in probabilistic programming (see e.g. [58, 38, 30]). Various optimizations are available, such as attaching alternative proposal distributions to random variables in $L$, or marginalizing out one or more of them. The manual use of these optimizations is well understood (see [12] for a review); a key ingredient of probabilistic programming is to automate their use (see e.g. [40, 35]). For a tutorial introduction to the use of the particle filter for nonlinear system identification we refer to [46, 47].

Because the structure of a programmatic model may change between executions, MCMC methods can be difficult to apply. Markov kernels on programmatic models are, in general, trans-dimensional, so that techniques such as reversible jump [19] are necessary. There are approaches to automating the design of reversible jump kernels that can work well in practice (see e.g. [57]). Random-walk Metropolis–Hastings kernels and more recent gradient-based kernels do not support trans-dimensional moves, but might still be applied to the full conditional distribution of a fixed set of random variables within, for example, a Gibbs sampler.
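A minimal bootstrap particle filter for a toy linear-Gaussian SSM (a Python sketch, not Birch code) illustrates the resample, propagate, weight pattern:

```python
import math
import random

def bootstrap_pf(y, n_particles, rng):
    """Bootstrap particle filter for the toy SSM
    x_t = 0.9 x_{t-1} + N(0,1), y_t = x_t + N(0,1), x_1 ~ N(0,1).
    Returns an estimate of the log-likelihood log p(y_{1:T})."""
    def log_normal_pdf(x, mean, var):
        return -0.5 * math.log(2 * math.pi * var) - 0.5 * (x - mean) ** 2 / var

    x = [rng.gauss(0.0, 1.0) for _ in range(n_particles)]
    loglik = 0.0
    for t, yt in enumerate(y):
        if t > 0:
            # Resample according to the previous weights, then propagate.
            x = rng.choices(x, weights=w, k=n_particles)
            x = [0.9 * xi + rng.gauss(0.0, 1.0) for xi in x]
        logw = [log_normal_pdf(yt, xi, 1.0) for xi in x]   # weight by likelihood
        m = max(logw)
        w = [math.exp(l - m) for l in logw]
        loglik += m + math.log(sum(w) / n_particles)       # log-mean-exp of weights
    return loglik

rng = random.Random(7)
ll = bootstrap_pf([0.5, -0.2, 0.1], 5000, rng)
```

Only forward simulation and pointwise likelihood evaluation are needed, which is why the method carries over to programmatic models.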
Particular structural motifs may suggest particular inference methods. For example, the chain motif suggests the use of specialized Bayesian filtering methods, while tree and grid motifs are conducive to divide-and-conquer methods [25]. Within probabilistic programming, recent attempts have been made to match inference methods to model substructures, using both manual and automated techniques (see e.g. [30, 14, 42, 29]).
The precise choice of inference method depends not only on structure, but also on form. For example, while the chain motif suggests the use of a Bayesian filtering method based on structure, the precise choice of filter depends on form: the Kalman filter for linear-Gaussian forms, the forward-backward algorithm on HMMs for discrete forms, the particle filter otherwise. In all cases the structure is the same, but the form differs. Recognizing this within program code requires compiler or library support.
Preferably, the choice of inference method based on structure and form is automated, and ideally by the programming language compiler [26], which has full access to the abstract syntax tree of the model code to inspect structure and form. There are fundamental limits to what can be known at compile time, however. In the general case, at least some structure and form is unknown until the program is run. For example, it is not possible to bound the trip count of a loop at compile time [37] if this is a stochastic quantity with unbounded support. The optimal inference method for a problem may also depend on posterior properties, such as correlations between random variables, that—by definition—are unknown a priori. In such situations it may be necessary to require manual hints provided by the programmer, or to use dynamic mechanisms that adapt the inference method during execution.
The delayed sampling [35] heuristic is an example of the latter. Delayed sampling works for programmatic models in a similar way to the sum-product algorithm [39] for graphical models. By keeping a graph of relationships between random variables as they are encountered by a program, and delaying the simulation of latent variables for as long as possible, it opens opportunities for analytical optimizations based on conjugate prior and other relationships. This includes analytical conditioning, variable elimination, Rao–Blackwellization, and locally optimal proposals. It is a heuristic in the sense that it must make myopic decisions based on the current state of the running program, without knowledge of its future execution, so that it may miss potential optimizations. Delayed sampling works through the control flow statements of a Turing-complete programming language, such as conditionals and loops, but does not attempt to marginalize over multiple branches in a single execution.
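As a minimal illustration of the underlying idea (a Python sketch, not Birch's actual implementation), the analytical marginalization and conditioning steps for a Gaussian chain can be written as:

```python
class DelayedGaussian:
    """A node stores its marginal (mean, variance) instead of a sampled
    value, so a downstream observation can be conditioned analytically
    rather than by simulation."""
    def __init__(self, mean, var):
        self.mean, self.var = mean, var

    def affine(self, a, b, noise_var):
        """Child a*x + b + N(0, noise_var): marginalize the parent analytically."""
        return DelayedGaussian(a * self.mean + b, a * a * self.var + noise_var)

    def observe(self, y, noise_var):
        """Condition on y = x + N(0, noise_var) in closed form (conjugacy)."""
        k = self.var / (self.var + noise_var)   # Kalman-style gain
        self.mean += k * (y - self.mean)
        self.var *= 1.0 - k
        return self

x1 = DelayedGaussian(0.0, 1.0)     # x1 ~ N(0, 1), never actually sampled
x2 = x1.affine(1.0, 0.0, 1.0)      # x2 | x1 ~ N(x1, 1); marginal is N(0, 2)
x2.observe(1.0, 1.0)               # condition on an observation of x2
```

Because simulation of x1 was delayed, the observation is absorbed exactly; here x2's posterior is N(2/3, 2/3).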
3.1 Example
For the SSM, the task is to infer the latent variables $X_{1:T}$ and $\theta$ given observations $y_{1:T}$. In a Bayesian context, this is to infer the posterior distribution
$$P(\mathrm{d}x_{1:T}, \mathrm{d}\theta \mid y_{1:T}) = P(\mathrm{d}x_{1:T} \mid \theta, y_{1:T})\, P(\mathrm{d}\theta \mid y_{1:T}). \tag{7}$$
The first factor in (7) provides information about the states. For $t = 1, \dots, T$, the marginals $P(\mathrm{d}x_t \mid \theta, y_{1:t})$ are called the filtering distributions, and are the target of Bayesian filtering methods. The second factor is the target of parameter estimation methods. We may be interested in obtaining the posterior distribution over parameters, or obtaining the maximum likelihood estimate instead by solving the optimization problem
$$\hat{\theta} = \operatorname*{arg\,max}_{\theta}\ p(y_{1:T} \mid \theta). \tag{8}$$
In either case the central object of interest is the likelihood $p(y_{1:T} \mid \theta)$. By repeated use of conditional probabilities this can be written
$$p(y_{1:T} \mid \theta) = \prod_{t=1}^{T} p(y_t \mid y_{1:t-1}, \theta), \tag{9}$$
with the convention that $p(y_1 \mid y_{1:0}, \theta) = p(y_1 \mid \theta)$. The terms in the likelihood are recursively computed via marginalization as
$$p(y_t \mid y_{1:t-1}, \theta) = \int p(y_t \mid x_t, \theta)\, P(\mathrm{d}x_t \mid y_{1:t-1}, \theta), \tag{10}$$
so that we obtain
$$p(y_{1:T} \mid \theta) = \prod_{t=1}^{T} \int p(y_t \mid x_t, \theta)\, P(\mathrm{d}x_t \mid y_{1:t-1}, \theta). \tag{11}$$
There are numerous ways to compute this likelihood, but for this particular structure, the family of Bayesian filtering methods is ideal. Within this family, the preferred method depends on the form of the model. Recall Code 1. In general, the functions $f$ and $g$ must be considered nonlinear, and for such cases the particle filter is the Bayesian filtering method of choice. Now consider the code that results from the particular choice of a linear observation function $g$, with Gaussian observation noise:
A particle filter can still be used, but it is possible to improve its performance with Rao–Blackwellization and a locally optimal proposal (see e.g. [12]), by using the local conjugacy between the Gaussian transition and Gaussian observation models. If we make the further choice of a linear transition function $f$, with Gaussian process noise, we have:
On inspection, it is clear that this is now a linear-Gaussian SSM. In this case we would like to choose the Kalman filter, which is the optimal Bayesian filtering method for such a form.
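For the linear-Gaussian case, the likelihood (11) is computed exactly by a Kalman filter via the prediction error decomposition; a scalar Python sketch (the particular parameter values below are illustrative):

```python
import math

def kalman_loglik(y, a, c, q, r, m0, p0):
    """Scalar Kalman filter computing log p(y_{1:T}) for the model
    x_t = a x_{t-1} + N(0, q), y_t = c x_t + N(0, r), x_1 ~ N(m0, p0),
    accumulating one predictive term per observation as in (9)-(11)."""
    m, p = m0, p0
    loglik = 0.0
    for t, yt in enumerate(y):
        if t > 0:
            m, p = a * m, a * a * p + q   # predict: x_t | y_{1:t-1}
        s = c * c * p + r                 # innovation (predictive) variance
        loglik += -0.5 * math.log(2 * math.pi * s) - 0.5 * (yt - c * m) ** 2 / s
        k = p * c / s                     # Kalman gain
        m += k * (yt - c * m)             # update: x_t | y_{1:t}
        p -= k * c * p
    return loglik

ll = kalman_loglik([0.5, -0.2, 0.1], 0.9, 1.0, 1.0, 1.0, 0.0, 1.0)
```

Each loop iteration evaluates one factor of (9) in closed form, with no Monte Carlo error.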
Note the important distinction here: the structure of the model is the same in all cases—that of the SSM—and suggests the use of a Bayesian filtering method over other means to compute the likelihood. But the form of the model differs, and it is these various forms that suggest the specific Bayesian filtering method to use: the particle filter, the particle filter with various optimizations, or the Kalman filter.
4 Implementation in Birch
In Birch, models are ideally implemented by specifying the joint probability distribution. Where possible, this means that the code for the model does not distinguish between latent and observed random variables. Instead, the value of any random variable can be clamped at runtime to establish, from the joint distribution, particular conditional distributions of interest. Methods are also implemented in the Birch language, rather than being external to the system.
We are particularly interested in considering the structure and form of a model when choosing a method for inference. This requires some interface that allows the method implementation to query the model implementation, and perhaps manipulate its execution. This becomes complicated with the random structure of programmatic models.
Birch uses a twofold approach. Firstly, metaprogramming techniques are used to represent fine-grain structure and form within the various mathematical expressions that make up a model. Secondly, the model programmer has a range of classes available from which they can implement their model to explicitly describe coarse-grain structure. We deal with each of these in turn.
4.1 Fine-grain structure and form
Birch adopts the metaprogramming technique of expression templates to represent mathematical expressions involving random variables. When such expressions are encountered, they are not evaluated immediately, but rather arranged into a data structure for later evaluation. In the meantime, it is possible to inspect and modify that data structure to facilitate optimizations based on lazy or reordered evaluation. This provides some of the power of a compiler to inspect structure and form via an abstract syntax tree, but within the language itself.
Expression templates are common in linear algebra libraries such as Boost uBLAS and Eigen [20], where they are used to omit unnecessary memory allocations, reads and writes, and so produce more efficient code. They are also used in reverse-mode automatic differentiation to compute the gradient of a function, such as in Stan for HMC, and to implement delayed sampling. At this stage, they are primarily used in Birch for this latter purpose.
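The essential mechanism can be illustrated with a minimal sketch in Python (the class names are ours, and this is not Birch's actual implementation): overloaded operators build a graph instead of evaluating, and the graph can later be inspected or evaluated.

```python
class Expr:
    """Node in a lazily built expression graph (cf. expression templates)."""
    def __add__(self, other): return Op('+', self, lift(other))
    def __radd__(self, other): return Op('+', lift(other), self)
    def __mul__(self, other): return Op('*', self, lift(other))
    def __rmul__(self, other): return Op('*', lift(other), self)

class Const(Expr):
    def __init__(self, value): self.value = value
    def eval(self): return self.value

class Random(Expr):
    def __init__(self): self.value = None   # may be clamped later, at runtime
    def eval(self):
        assert self.value is not None, 'random variable not yet assigned'
        return self.value

class Op(Expr):
    def __init__(self, op, left, right):
        self.op, self.left, self.right = op, left, right
    def eval(self):
        a, b = self.left.eval(), self.right.eval()
        return a + b if self.op == '+' else a * b

def lift(x):
    return x if isinstance(x, Expr) else Const(x)

def randoms(expr):
    """Walk the graph to inspect its structure, collecting Random leaves."""
    if isinstance(expr, Random): return [expr]
    if isinstance(expr, Op): return randoms(expr.left) + randoms(expr.right)
    return []
```

Here `2 * x + 1` constructs a three-node graph rather than a number; an inference method can traverse it, looking for exploitable form, before any evaluation takes place.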
Recall Code 1; declarations were omitted but might be as follows:
This declares the model variables, including the arrays x and y, as ordinary variables of type Real. We can instead declare them as random variates, of type Random<Real>, like so:
Various mathematical functions and operators are overloaded for this type Random<Real>, to construct a data structure for lazy evaluation, rather than eager evaluation.
When Code 1 is executed using these random types, a data structure such as that in Figure 2 is constructed. This represents all the functions that are called, and their arguments, without evaluating them until necessary. As the transition and observation functions are opaque, there is little of value to discover here at first. If, however, we were to code the model with linear-Gaussian transition and observation components, as in Code 4, we would obtain the data structure in Figure 3. Now, Birch identifies a chain of Gaussian random variables for which it is able to marginalize and condition forward analytically, using closed form solutions with which it has been programmed. This path is annotated in Figure 3 (it is precisely the path used in delayed sampling [35]). Ultimately, the computations performed are identical to running a Kalman filter forward, followed by recursively sampling backward. This utilizes structure and form to yield an exact sample from the posterior distribution, using a method that is more efficient than other means.
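The forward-marginalize, backward-sample computation that delayed sampling arrives at can be sketched for a Gaussian random-walk chain as follows (again in Python, with our own function names, assuming unit transition and observation coefficients):

```python
import math, random

def forward_filter(ys, m0, p0, q, r):
    """Forward pass: marginalize and condition each Gaussian in the chain.
    x[0] ~ N(m0, p0), x[t] = x[t-1] + N(0, q), y[t] = x[t] + N(0, r)."""
    ms, ps = [], []
    m, p = m0, p0
    for t, y in enumerate(ys):
        if t > 0:
            p = p + q                          # marginalize over x[t-1]
        s = p + r
        k = p / s
        m, p = m + k * (y - m), (1.0 - k) * p  # condition on y[t]
        ms.append(m)
        ps.append(p)
    return ms, ps

def backward_sample(ms, ps, q, rng):
    """Backward pass: draw x[T-1], then each x[t] given x[t+1] and y[1:t]."""
    xs = [0.0] * len(ms)
    xs[-1] = rng.gauss(ms[-1], math.sqrt(ps[-1]))
    for t in range(len(ms) - 2, -1, -1):
        g = ps[t] / (ps[t] + q)                # backward gain
        mean = ms[t] + g * (xs[t + 1] - ms[t])
        var = g * q
        xs[t] = rng.gauss(mean, math.sqrt(var))
    return xs
```

The forward pass is exactly a Kalman filter, and the backward pass yields an exact joint sample from the posterior over the chain.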
4.2 Coarse-grain structure
Birch provides abstract classes for various structural motifs and model classes. These automate some of the task of assembling a model by encapsulating boilerplate code, and provide crucial information on coarsegrain structure that can be used by an inference method. Consider, again, Code 1. A more complete implementation, as a class, may look something like this:
This declares a new class called Example that inherits from the class Model, provided by the Birch standard library. At this stage, classes in Birch use a limited subset of the functionality of the classes in C++ to which they are compiled. They support single but not multiple inheritance, and all member functions are virtual. Code 7 makes use of a fiber (also called a coroutine). Intuitively, this is a function for which execution can be paused and resumed. Each time execution is paused, the fiber yields a value to the caller, in a manner analogous to the way that a function returns a value to its caller—although a fiber may yield many times while a function returns only once. Like member functions, member fibers are virtual in Birch.
The Example class in Code 7 declares some member variables to represent the random variables of the model, then specifies the joint probability distribution over them by overriding the simulate fiber inherited from Model. This fiber simply simulates the model forward, but does so incrementally via implicit yields in the ~ statements, which have a particular behavior. If the random variable on the left has not yet been assigned a value, the ~ statement associates it with the distribution on the right, so that a value can be simulated for it later. If, on the other hand, the random variable on the left has previously been assigned a value, the ~ statement computes its log-likelihood under the distribution on the right and yields the result. This yield value is always of type Real, as shown in the fiber declaration. This forms the most basic interface by which an inference method can incrementally evaluate likelihoods and posteriors, and even condition by assigning values to random variables in the model before executing the simulate fiber. It also represents the ideal of writing code to specify the joint distribution, while assigning values to variables before execution to simulate from a conditional distribution instead.
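The behavior of such a simulate fiber can be mimicked with a Python generator (a hedged sketch with our own names; Birch's actual types and operators differ): each yield carries a log-likelihood increment, zero when the variable was simulated rather than observed.

```python
import math, random

class RandomVar:
    """A random variable whose value may be clamped before execution."""
    def __init__(self): self.value = None

class Gaussian:
    def __init__(self, mu, var): self.mu, self.var = mu, var
    def sample(self): return random.gauss(self.mu, math.sqrt(self.var))
    def logpdf(self, x):
        return -0.5 * (math.log(2.0 * math.pi * self.var)
                       + (x - self.mu) ** 2 / self.var)

def tilde(rv, dist):
    """Mimic a ~ statement: simulate if unclamped, else return a log-likelihood."""
    if rv.value is None:
        rv.value = dist.sample()
        return 0.0
    return dist.logpdf(rv.value)

def simulate(x, y):
    """The joint distribution, written once; clamping chooses the conditional."""
    yield tilde(x, Gaussian(0.0, 1.0))
    yield tilde(y, Gaussian(x.value, 0.5))
```

Clamping y and summing the yields gives the incremental log-likelihood of the observation; leaving both variables unclamped simulates from the joint.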
It is clear that the model represented by Code 7 has the structure of an SSM as in (2). A compiler with sophisticated static analysis may recognize this. Lacking such sophistication, it is the responsibility of the programmer to provide some hints. The choice to inherit from the Model class provides no such hints—the model is essentially a black box. Alternatives do provide hints. The Birch standard library provides a generic class called StateSpaceModel that itself inherits from Model, but reveals more information about structure to an inference method, and handles the boilerplate code for such structure. StateSpaceModel is a generic class that takes a number of additional type arguments to specify the type of the parameters, state variables, and observed variables. A class that inherits from StateSpaceModel should also override the parameter, initial, transition, and observation member fibers to describe the components of the model. These four fibers replace the simulate member fiber that is overridden when inheriting directly from Model.
The implementation may look something like Code 8. Type arguments are given to StateSpaceModel<...> in the inheritance clause to indicate that the parameters, state variables and observed variables are all of type Random<Real>. The x’ that appears in the transition fiber is simply a name; the prime is a valid character for names, motivated by bringing the representation in code closer to the representation in mathematics.
The model structure defined through the StateSpaceModel class is a straightforward extension of the sorts of interfaces that existing software provides. LibBi [33], for example, is based entirely on the SSM structure, and the user implements their model by writing code blocks with the same four names as the fibers above. But while LibBi supports only one interface, for SSMs, Birch can support many interfaces, for many model classes, with an object-oriented approach.
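The object-oriented idea can be sketched in Python (class and hook names mirror the fibers described above, but the code is illustrative rather than Birch's API): a black-box Model exposes only forward simulation, while a StateSpaceModel subclass reveals SSM structure through four hooks and supplies the boilerplate.

```python
class Model:
    """Black-box model: an inference method can only simulate it forward."""
    def simulate(self):
        raise NotImplementedError

class StateSpaceModel(Model):
    """Reveals SSM structure via four hooks, mirroring the parameter,
    initial, transition and observation fibers; simulate() is boilerplate."""
    def parameter(self): raise NotImplementedError
    def initial(self, theta): raise NotImplementedError
    def transition(self, x, theta): raise NotImplementedError
    def observation(self, x, theta): raise NotImplementedError

    def simulate(self, T):
        theta = self.parameter()
        x = self.initial(theta)
        ys = [self.observation(x, theta)]
        for _ in range(T - 1):
            x = self.transition(x, theta)
            ys.append(self.observation(x, theta))
        return ys

class Doubling(StateSpaceModel):
    # deterministic toy instance, for illustration only
    def parameter(self): return 2.0
    def initial(self, theta): return 1.0
    def transition(self, x, theta): return theta * x
    def observation(self, x, theta): return x
```

An inference method handed a StateSpaceModel, rather than a bare Model, knows it may apply a filtering recursion over the transition and observation hooks.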
5 Example in Birch
We demonstrate the above ideas on a multiple object tracking problem (see e.g. [55] for an overview). Such problems arise in application domains including air traffic control, space debris monitoring (e.g. [22]), and cell tracking (e.g. [53]). In all of these cases, some unknown number of objects appear, move, and disappear in some physical space, with noisy sensor measurements of their positions during this time. The task is to recover the object tracks from these noisy sensor measurements.
The model to be introduced is an SSM within the class of programmatic models described in Section 2. The size of the latent state varies according to the number of objects, which is unknown a priori, and furthermore changes in time as objects appear and disappear. Similarly, the size of the observation varies according to the number of objects and, in addition, rates of detection and noise. While it is straightforward to simulate from the joint distribution, simulation from the posterior distribution is complicated by a data association problem: the random matching of observations to objects, which includes both missing and spurious detections (clutter). That is, the structure of the relationships between the components of the latent state and the observation is not fixed. For a given number of observations and detected objects, the number of possible associations of equal probability under the prior grows as a rising factorial. Naive inference with forward simulation of this association, as in a bootstrap particle filter, will not scale beyond a small number of objects and observations. Optimizations that instead leverage the structure and form of the model are necessary.
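The combinatorics can be made concrete: if each of D detected objects must be matched to a distinct one of M observations, the number of equally probable associations is M(M-1)···(M-D+1). A small sketch (the function name is ours):

```python
def association_count(M, D):
    """Number of ways to match D detected objects to M observations,
    one distinct observation per object: M * (M-1) * ... * (M-D+1)."""
    count = 1
    for i in range(D):
        count *= M - i
    return count
```

Already at M = 10 observations and D = 8 detections there are 1 814 400 associations, which is why forward simulation of the association does not scale.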
For this example, we work on a two-dimensional rectangular domain with given lower and upper corners. The model is described—and ultimately implemented—in a hierarchical way, by first specifying a model for single objects, then combining several of these into a model for multiple objects.
5.1 Single object model
The single object model describes the dynamics and observations (including missing detections) of a single object. The state of that object is represented by a six-dimensional vector giving its position, velocity and acceleration in the two-dimensional space. These evolve according to a linear-Gaussian SSM. Using superscripts to denote the object index, the initial and transition models are Gaussian, where the initial mean has its position component drawn uniformly on the domain, with its velocity and acceleration components set to zero, and the transition and noise covariance matrices are block matrices built from the identity matrix and the zero matrix.
At each time, the object may or may not be detected. This is indicated by a binary variable, which takes value 1 with some fixed detection probability to indicate detection, and value 0 otherwise. If detected, the observation is of position only, via a linear-Gaussian observation model whose matrices again consist of identity and zero blocks.
5.2 Multiple object model
The multiple object model mediates several single object models, including their appearance and disappearance, and clutter. At each time, the number of new objects to appear is Poisson-distributed with some rate parameter. The lifetime of each object is likewise Poisson-distributed with its own rate parameter.
In practice this is implemented as a probability of disappearance at each time step. If an object has been present for some number of time steps, its probability of disappearing on the next is the probability that its lifetime equals that number of steps, conditioned on the lifetime being at least that long, with these probabilities easily computed under the Poisson distribution.
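This disappearance probability is the hazard of the lifetime distribution, which for a Poisson lifetime can be computed as follows (a sketch in Python with our own names):

```python
import math

def poisson_pmf(k, lam):
    return math.exp(-lam) * lam ** k / math.factorial(k)

def disappearance_prob(s, lam):
    """Probability that an object which has survived s steps disappears
    on the next: P(lifetime = s) / P(lifetime >= s), with a Poisson lifetime."""
    survive = 1.0 - sum(poisson_pmf(k, lam) for k in range(s))
    return poisson_pmf(s, lam) / survive
```

For a newly appeared object (s = 0) this reduces to the Poisson probability of a zero lifetime.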
At each time, some number of spurious observations (clutter) occur that are not associated with any object. Their number follows the distribution given in (12), for some rate parameter, and the position of each is uniformly distributed on the domain. That (12) guarantees at least one spurious observation at each time merely simplifies the implementation slightly; we revisit this point in Section 5.4 below.
5.3 Inference method
Inference can leverage the structure and form of the model. The structure is that of an SSM, with random latent state size and random observation size. Within this SSM is further structure, as each of the single objects follows an SSM independently of the others. The form of those inner SSMs is linear-Gaussian, suggesting the use of a Kalman filter for optimal tracking, while the outer SSM requires the use of a particle filter. The inference problem is further complicated by data association, specifically matching detected objects with given observations. This is handled in the multiple object model, with a specific proposal distribution used within the particle filter to propose associations of high likelihood.
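For reference, the outer filter can be sketched generically as a bootstrap particle filter with a normalizing-constant estimate (a minimal Python illustration with our own names, omitting the Rao–Blackwellization and the tailored association proposal described here):

```python
import math, random

def bootstrap_pf(ys, init, trans, loglike, N, rng):
    """Minimal bootstrap particle filter. init(rng) draws x[0], trans(x, rng)
    draws x[t] given x[t-1], loglike(x, y) evaluates log p(y | x).
    Returns the log of the normalizing-constant estimate."""
    xs = [init(rng) for _ in range(N)]
    logZ = 0.0
    for t, y in enumerate(ys):
        if t > 0:
            xs = [trans(x, rng) for x in xs]      # propagate particles
        logw = [loglike(x, y) for x in xs]        # weight by likelihood
        mx = max(logw)
        w = [math.exp(lw - mx) for lw in logw]
        logZ += mx + math.log(sum(w) / N)         # incremental evidence estimate
        xs = rng.choices(xs, weights=w, k=N)      # multinomial resampling
    return logZ
```

In the Rao-Blackwellized variant used in this example, each particle carries Kalman filter statistics for each object instead of sampled states.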
Let each detected object at each time be associated with the index of one observation; for an object that is not detected, this index is set to zero. Given the number of objects, the number of these that are detected, and the number of observations, the prior distribution over associations is uniform on the valid possibilities, expressed as a product of conditional distributions whose pmfs are built from indicator functions. These expressions simply limit the uniform distribution to the correct domain: as long as all indices are in the support, and the nonzero indices (corresponding to detected objects) are distinct, each valid association has the same probability.
The proposal distribution is to iterate through the detected objects in turn, choosing for each an associated observation in proportion to its likelihood, excluding those observations already associated. The proposal thus factorizes over the detected objects, with the pmf for each conditional distribution given by the renormalized likelihoods of the remaining observations.
The delayed sampling heuristic within Birch automatically applies a further optimization to this. As each object follows a linear-Gaussian SSM, its latent states are marginalized out using a Kalman filter, so that the proposal evaluates each observation under a predictive density: the pdf of the multivariate normal distribution, with mean and covariance as predicted by the Kalman filter for the corresponding object. At each time, then, it is straightforward to propose a data association and, in the usual importance sampling fashion, weight with the ratio of the prior probability to the proposal probability. Unassociated observations at the end of the procedure are considered clutter.
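The sequential proposal over associations can be sketched as follows (Python, with our own names; pred_like stands in for the Kalman-filter predictive densities): each detected object draws an observation index in proportion to likelihood among those still available, and the accumulated log proposal probability enters the importance weight.

```python
import math, random

def propose_association(pred_like, rng):
    """Draw an observation index for each detected object in turn, in
    proportion to predictive likelihood, excluding indices already taken.
    pred_like[n][m] is the predictive density of observation m for object n.
    Returns the association and its log proposal probability."""
    available = list(range(len(pred_like[0])))
    assoc, log_q = [], 0.0
    for likes in pred_like:
        weights = [likes[m] for m in available]
        total = sum(weights)
        u = rng.random() * total
        i = 0
        while i < len(weights) - 1 and u > weights[i]:  # inverse-CDF draw
            u -= weights[i]
            i += 1
        assoc.append(available[i])
        log_q += math.log(weights[i] / total)
        available.pop(i)
    return assoc, log_q
```

The importance weight for the proposed association is then the log prior probability (uniform over valid associations) minus log_q.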
5.4 Implementation
The model is implemented in Birch in a modular fashion. It consists of three classes: one for the parameters, one for the single object model, and one for the multiple object model.
A class called Global is declared with a member variable for each of the parameters of the model, given in Code 9.
The implementation of the single object model is given in Code 10. This declares a class called Track that, as before, inherits from the class StateSpaceModel in the Birch standard library. The parameter type is Global, while the state and observation types are both Random<Real[_]>. The initial, transition, and observation member fibers are overridden to specify the model. An unfamiliar operator appears in the code: the meaning of <~ is to simulate from the distribution on the right and assign the value to the variable on the left.
The implementation of the multiple object model is given in Code 11. This declares a class called Multi that, again, inherits from the class StateSpaceModel in the Birch standard library. The parameter type is given as Global, the state type as a List of Track objects, and the observation type as a List of Random<Real[_]> objects; List is a generic class provided by the standard library. The transition and observation member fibers are overridden to specify the model. The observation fiber is complicated by the data association problem. Recall, as in (12), that there is always at least one point of clutter; an empty list of observations can therefore be interpreted as missing observations to be simulated, rather than present observations to be conditioned upon. When observations are present, the code defers to an alternative association member fiber, detailed below. When missing, they are simulated. This is an instance where it is not possible to write a single piece of code that specifies the joint distribution and covers both cases.
Finally, we show the association member fiber, which is the most difficult part of the model and inference method. This is given in Code 12. A number of new language features appear. First, the yield statement is used to yield a value from a fiber, much like the return statement is used to return a value from a function. In previous examples, yielding from fibers has been implicit, via operators, rather than explicit, via yield statements. Here, explicit yields are used to correct weights for the data association proposal described in Section 5.3. Another unfamiliar operator appears: the meaning of ~> is to compute the pdf or pmf of the value of the variable on the left under the distribution on the right, and implicitly yield its logarithm. The symmetry with the previously-introduced <~ operator is intentional: these two operators work as a pair for simulation and observation. Other than this, the code makes use of the interface to the Random<Real[_]> class to query for detection and compute likelihoods.
5.5 Results
The model is simulated for 100 time steps to produce the ground truth and data set shown in the top left of Figure 4. Using Birch, we then run a particle filter several times to produce the remaining plots in Figure 4. The delayed sampling feature of Birch ensures that, within each particle of the particle filter, a separate Kalman filter is applied to each object. Each run uses 32768 particles, from which a single path is drawn as a posterior sample, to be weighted by the normalizing constant estimate obtained from the particle filter.
Generally, these posterior samples show good alignment with the ground truth. The longer tracks in the posterior samples correspond to similar tracks in the ground truth. Some shorter tracks in the ground truth are missing in the posterior samples (for example, the short track in the top right of the ground truth that appears at time 97), and conversely, some spurious tracks that do not appear in the ground truth appear sporadically in some posterior samples. These spurious tracks should be expected: the prior puts positive probability on objects being detected very few times—or not at all—in their lifetime.
6 Summary
Probabilistic programming is a relatively young field that seeks to accelerate the workflow of probabilistic modeling and inference using new probabilistic programming languages. A key concept is to decouple the implementation of models and inference methods, using various techniques to match them together for efficient inference, preferably in an automated way.
In this work, we have provided a definition of the class of programmatic models, with an emphasis on structure and form. The class reflects the nature of models expressed in programming languages where, in general, structure and form are not known a priori, but may instead depend on random choices made during program execution. We have discussed some of the complexity that this brings to inference, especially that different executions of the model code may encounter different sets of random variables. SMC methods have issues around alignment for resampling, where different particles may encounter observations in different orders. MCMC methods encounter issues around Markov kernels needing to be transdimensional. Nevertheless, we argue that persistent substructure is common, can be represented by structural motifs, and can be utilized in implementation to match models with appropriate inference methods.
We have shown the particular implementation of these ideas in the universal probabilistic programming language Birch. Here, expression templates are used to explore fine-grain structure and form, while class interfaces reveal coarse-grain structure. The expression templates in particular are important to enable analytical optimizations via the delayed sampling heuristic.
Finally, we have shown a multiple object tracking example, where the model involved resides in the class of programmatic models, featuring a random number of latent variables, random number of observed variables, and random associations between them. Nonetheless, the model exhibits a clear structure and form: multiple linear-Gaussian SSMs for single objects, within a single nonlinear SSM for multiple objects. The delayed sampling heuristic provided by Birch automatically adapts the inference method to a particle filter for the multiple object model, with each particle within that filter using a Kalman filter for each single object model.
7 Acknowledgements
This research was financially supported by the Swedish Foundation for Strategic Research (SSF) via the project ASSEMBLE (contract number: RIT15-0012). The authors thank Karl Granström for helpful discussions around the multiple object tracking example.
References
 [1] Pyro: Deep universal probabilistic programming. URL http://pyro.ai.
 Andrieu et al. [2010] C. Andrieu, A. Doucet, and R. Holenstein. Particle Markov chain Monte Carlo methods. Journal of the Royal Statistical Society B, 72:269–302, 2010.
 Bardenet et al. [2017] R. Bardenet, A. Doucet, and C. Holmes. On Markov chain Monte Carlo methods for tall data. Journal of Machine Learning Research, 18:1–43, 2017.
 Bierkens et al. [2016] J. Bierkens, P. Fearnhead, and G. Roberts. The Zig-Zag process and super-efficient sampling for Bayesian analysis of big data. arXiv e-prints, 2016.
 Bishop [2007] C. M. Bishop. Pattern Recognition and Machine Learning. Springer, 2007.
 Bouchard-Côté et al. [2017] A. Bouchard-Côté, S. J. Vollmer, and A. Doucet. The bouncy particle sampler: A non-reversible rejection-free Markov chain Monte Carlo method. arXiv preprint arXiv:1510.02451, 2017.
 Chopin et al. [2013] N. Chopin, P. Jacob, and O. Papaspiliopoulos. SMC²: An efficient algorithm for sequential analysis of state space models. Journal of the Royal Statistical Society B, 75:397–426, 2013.
 de Valpine et al. [2017] P. de Valpine, D. Turek, C. Paciorek, C. Anderson-Bergman, D. Temple Lang, and R. Bodik. Programming with models: writing statistical algorithms for general model structures with NIMBLE. Journal of Computational and Graphical Statistics, 26:403–417, 2017.
 Del Moral and Murray [2015] P. Del Moral and L. M. Murray. Sequential Monte Carlo with highly informative observations. SIAM/ASA Journal on Uncertainty Quantification, 3(1):969–997, 2015. doi: 10.1137/15M1011214.
 Deligiannidis et al. [2016] G. Deligiannidis, A. Doucet, and M. K. Pitt. The correlated pseudo-marginal method. arXiv preprint arXiv:1511.04992, 2016.
 Devroye [1986] L. Devroye. Non-Uniform Random Variate Generation. Springer-Verlag, New York, 1986.
 Doucet and Johansen [2011] A. Doucet and A. M. Johansen. A tutorial on particle filtering and smoothing: fifteen years later, chapter 24, pages 656–704. Oxford University Press, 2011.
 Ellis et al. [2013] K. Ellis, E. Dechter, R. P. Adams, and J. B. Tenenbaum. Learning graphical concepts. In NIPS Workshop on Constructive Machine Learning, 2013.
 Ge et al. [2018] H. Ge, K. Xu, and Z. Ghahramani. Turing: A language for flexible probabilistic inference. In Proceedings of the 21st International Conference on Artificial Intelligence and Statistics (AISTATS), pages 1682–1690, 2018.
 Gerber and Chopin [2015] M. Gerber and N. Chopin. Sequential quasi-Monte Carlo. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 77(3):509–579, 2015. doi: 10.1111/rssb.12104.
 Goodman and Stuhlmüller [2014] N. D. Goodman and A. Stuhlmüller. The design and implementation of probabilistic programming languages. http://dippl.org, 2014. Accessed: 2017-5-17.
 Goodman et al. [2008] N. D. Goodman, V. K. Mansinghka, D. M. Roy, K. Bonowitz, and J. B. Tenenbaum. Church: a language for generative models. Uncertainty in Artificial Intelligence, 2008.
 Gordon et al. [1993] N. Gordon, D. Salmond, and A. Smith. Novel approach to nonlinear/non-Gaussian Bayesian state estimation. IEE Proceedings F, 140:107–113, 1993.
 Green [1995] P. J. Green. Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika, 82(4):711–732, 1995.
 Guennebaud et al. [2010] G. Guennebaud, B. Jacob, et al. Eigen v3. http://eigen.tuxfamily.org, 2010.
 Höhna et al. [2016] S. Höhna, M. J. Landis, T. A. Heath, B. Boussau, N. Lartillot, B. R. Moore, J. P. Huelsenbeck, and F. Ronquist. RevBayes: Bayesian phylogenetic inference using graphical models and an interactive modelspecification language. Systematic Biology, 65(4):726–736, 2016.
 Jones et al. [2015] B. A. Jones, D. S. Bryant, B.-T. Vo, and B.-N. Vo. Challenges of multitarget tracking for space situational awareness. In 2015 18th International Conference on Information Fusion (Fusion), pages 1278–1285, 2015.
 Jordan [2004] M. I. Jordan. Graphical models. Statistical Science, 19(1):140–155, 2004.
 Koller and Friedman [2009] D. Koller and N. Friedman. Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.
 Lindsten et al. [2017] F. Lindsten, A. M. Johansen, C. A. Naesseth, B. Kirkpatrick, T. B. Schön, J. A. D. Aston, and A. Bouchard-Côté. Divide-and-conquer with sequential Monte Carlo. Journal of Computational and Graphical Statistics, 26(2):445–458, 2017.
 Lundén et al. [2018] D. Lundén, D. Broman, and L. M. Murray. Combining static and dynamic optimizations using closed-form solutions. Proceedings of the Workshop on Probabilistic Programming Semantics (PPS), 2018.
 Lunn et al. [2012] D. Lunn, C. Jackson, N. Best, A. Thomas, and D. Spiegelhalter. The BUGS Book: A Practical Introduction to Bayesian Analysis. CRC Press / Chapman and Hall, 2012.
 Lunn et al. [2000] D. J. Lunn, A. Thomas, N. Best, and D. Spiegelhalter. WinBUGS – a Bayesian modelling framework: Concepts, structure and extensibility. Statistics and Computing, 10:325–337, 2000.
 Mansinghka et al. [2018] V. Mansinghka, U. Schaechtle, S. Handa, A. Radul, Y. Chen, and M. Rinard. Probabilistic programming with programmable inference. In PLDI, 2018.
 Mansinghka et al. [2014] V. K. Mansinghka, D. Selsam, and Y. N. Perov. Venture: a higher-order probabilistic programming platform with programmable inference. arXiv preprint arXiv:1404.0099, 2014.
 Milch et al. [2007] B. Milch, B. Marthi, S. Russell, D. Sontag, D. L. Ong, and A. Kolobov. Statistical Relational Learning, chapter BLOG: Probabilistic Models with Unknown Objects. MIT Press, 2007.
 Minka et al. [2018] T. Minka, J. Winn, J. Guiver, Y. Zaykov, D. Fabian, and J. Bronskill. Infer.NET 2.7, 2018. URL http://research.microsoft.com/infernet. Microsoft Research Cambridge.
 Murray [2015] L. M. Murray. Bayesian state-space modelling on high-performance hardware using LibBi. Journal of Statistical Software, 67(10):1–36, 2015.
 Murray et al. [2013] L. M. Murray, E. M. Jones, and J. Parslow. On disturbance statespace models and the particle marginal Metropolis–Hastings sampler. SIAM/ASA Journal of Uncertainty Quantification, 1(1):494–521, 2013. doi: 10.1137/130915376.
 Murray et al. [2018] L. M. Murray, D. Lundén, J. Kudlicka, D. Broman, and T. B. Schön. Delayed sampling and automatic Rao–Blackwellization of probabilistic programs. In Proceedings of the 21st International Conference on Artificial Intelligence and Statistics (AISTATS), Lanzarote, Spain, 2018.
 Neal [2011] R. M. Neal. Handbook of Markov Chain Monte Carlo, chapter MCMC Using Hamiltonian Dynamics. Chapman and Hall/CRC, 2011.
 Nori et al. [2014] A. Nori, C.-K. Hur, S. Rajamani, and S. Samuel. R2: An efficient MCMC sampler for probabilistic programs. AAAI Conference on Artificial Intelligence (AAAI), 2014.
 Paige and Wood [2014] B. Paige and F. Wood. A compilation target for probabilistic programming languages. 31st International Conference on Machine Learning (ICML), 2014.
 Pearl [1988] J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, 1988.
 Perov et al. [2015] Y. N. Perov, T. A. Le, and F. Wood. Data-driven sequential Monte Carlo in probabilistic programming. NIPS 2015 Workshop: Black Box Learning and Inference, 2015.
 Pfeffer [2016] A. Pfeffer. Practical Probabilistic Programming. Manning, 2016.
 Pfeffer et al. [2018] A. Pfeffer, B. Ruttenberg, W. Kretschmer, and A. O'Connor. Structured factored inference for probabilistic programming. Proceedings of Machine Learning Research, 84:1224–1232, 2018.
 Plummer [2003] M. Plummer. JAGS: A program for analysis of Bayesian graphical models using Gibbs sampling. In Proceedings of the 3rd International Workshop on Distributed Statistical Computing, 2003.
 Plummer [2013] M. Plummer. JAGS, 2013. URL http://mcmc-jags.sourceforge.net. Just Another Gibbs Sampler.
 Roberts and Rosenthal [1998] G. O. Roberts and J. S. Rosenthal. Optimal scaling of discrete approximations to Langevin diffusions. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 60(1):255–268, 1998.
 Schön et al. [2015] T. B. Schön, F. Lindsten, J. Dahlin, J. Wågberg, A. C. Naesseth, A. Svensson, and L. Dai. Sequential Monte Carlo methods for system identification. In Proceedings of the 17th IFAC Symposium on System Identification (SYSID), Beijing, China, October 2015.
 Schön et al. [2018] T. B. Schön, A. Svensson, L. M. Murray, and F. Lindsten. Probabilistic learning of nonlinear dynamical systems using sequential Monte Carlo. Mechanical Systems and Signal Processing (MSSP), 104:866–883, 2018.
 Spiegelhalter et al. [2003] D. Spiegelhalter, A. Thomas, N. Best, and D. Lunn. WinBUGS user manual. Technical report, 2003. URL http://www.mrc-bsu.cam.ac.uk/bugs.
 Stan Development Team [2013] Stan Development Team. Stan: A C++ library for probability and sampling, 2013. URL http://mc-stan.org.
 Todeschini et al. [2014] A. Todeschini, F. Caron, M. Fuentes, P. Legrand, and P. Del Moral. Biips: Software for Bayesian inference with interacting particle systems. arXiv abs/1412.3779, 2014.
 Tolpin et al. [2016] D. Tolpin, J. van de Meent, H. Yang, and F. Wood. Design and implementation of probabilistic programming language Anglican. arXiv abs/1608.05263, 2016.
 Tran et al. [2016] D. Tran, A. Kucukelbir, A. B. Dieng, M. Rudolph, D. Liang, and D. M. Blei. Edward: A library for probabilistic modeling, inference, and criticism. arXiv preprint arXiv:1610.09787, 2016.
 Ulman et al. [2017] V. Ulman, M. Maška, K. E. G. Magnusson, O. Ronneberger, C. Haubold, N. Harder, P. Matula, P. Matula, D. Svoboda, M. Radojevic, I. Smal, K. Rohr, J. Jaldén, H. M. Blau, O. Dzyubachyk, B. Lelieveldt, P. Xiao, Y. Li, S.-Y. Cho, A. C. Dufour, J.-C. Olivo-Marin, C. C. Reyes-Aldasoro, J. A. Solis-Lemus, R. Bensch, T. Brox, J. Stegmaier, R. Mikut, S. Wolf, F. A. Hamprecht, T. Esteves, P. Quelhas, O. Demirel, L. Malmström, F. Jug, P. Tomancak, E. Meijering, A. Muñoz-Barrutia, M. Kozubek, and C. Ortiz-de-Solorzano. An objective comparison of cell-tracking algorithms. Nature Methods, 14:1141, 2017. doi: 10.1038/nmeth.4473.
 Vanetti et al. [2017] P. Vanetti, A. Bouchard-Côté, G. Deligiannidis, and A. Doucet. Piecewise-deterministic Markov chain Monte Carlo. arXiv e-prints, 2017.
 Vo et al. [2015] B.-N. Vo, M. Mallick, Y. Bar-Shalom, S. Coraluppi, R. Osborne, R. Mahler, and B.-T. Vo. Multitarget Tracking. Wiley Encyclopedia of Electrical and Electronics Engineering, pages 1–15. Wiley, 2015. doi: 10.1002/047134608X.W8275.
 Whiteley and Lee [2014] N. Whiteley and A. Lee. Twisted particle filters. Annals of Statistics, 42(1):115–141, 2014. doi: 10.1214/13-AOS1167.
 Wingate et al. [2011] D. Wingate, A. Stuhlmueller, and N. Goodman. Lightweight implementations of probabilistic programming languages via transformational compilation. Proceedings of the 14th International Conference on Artificial Intelligence and Statistics (AISTATS), pages 770–778, 2011.
 Wood et al. [2014] F. Wood, J. W. van de Meent, and V. Mansinghka. A new approach to probabilistic programming inference. Proceedings of the 17th International Conference on Artificial Intelligence and Statistics (AISTATS), 2014.