A compiler for BLOG probabilistic programming language
A probabilistic program defines a probability measure over its semantic structures. One common goal of probabilistic programming languages (PPLs) is to compute posterior probabilities for arbitrary models and queries, given observed evidence, using a generic inference engine. Most PPL inference engines---even the compiled ones---incur significant runtime interpretation overhead, especially for contingent and open-universe models. This paper describes Swift, a compiler for the BLOG PPL. Swift-generated code incorporates optimizations that eliminate interpretation overhead, maintain dynamic dependencies efficiently, and handle memory management for possible worlds of varying sizes. Experiments comparing Swift with other PPL engines on a variety of inference problems demonstrate speedups ranging from 12x to 326x.READ FULL TEXT VIEW PDF
Deep probabilistic programming languages try to combine the advantages o...
Probabilistic programming languages represent complex data with intermin...
This work offers a broad perspective on probabilistic modeling and infer...
This paper proposes π, a formal semantic framework for compiler
Probabilistic programming languages (PPLs) are receiving widespread atte...
Probabilistic programming is the idea of writing models from statistics ...
We propose Edward, a Turing-complete probabilistic programming language....
A compiler for BLOG probabilistic programming language
Probabilistic programming languages (PPLs) aim to combine sufficient expressive power for writing real-world probability models with efficient, general-purpose inference algorithms that can answer arbitrary queries with respect to those models. One underlying motive is to relieve the user of the obligation to carry out machine learning research and implement new algorithms for each problem that comes along. Another is to support a wide range of cognitive functions in AI systems and to model those functions in humans.
General-purpose inference for PPLs is very challenging; they may include unbounded numbers of discrete and continuous variables, a rich library of distributions, and the ability to describe uncertainty over functions, relations, and the existence and identity of objects (so-called open-universe models). Existing PPL inference algorithms include likelihood weighting (LW) [Milch et al.2005b], parental Metropolis–Hastings (PMH) [Milch and Russell2006, Goodman et al.2008], generalized Gibbs sampling (Gibbs) [Arora et al.2010], generalized sequential Monte Carlo [Wood et al.2014], Hamiltonian Monte Carlo (HMC) [Stan Development Team2014], variational methods [Minka et al.2014, Kucukelbir et al.2015] and a form of approximate Bayesian computation [Mansinghka et al.2013]. While better algorithms are certainly possible, our focus in this paper is on achieving orders-of-magnitude improvement in the execution efficiency of a given algorithmic process.
A PPL system takes a probabilistic program (PP) specifying a probabilistic model as its input and performs inference to compute the posterior distribution of a query given some observed evidence. The inference process does not (in general) execute the PP, but instead executes the steps of an inference algorithm (e.g., Gibbs) guided by the dependency structures implicit in the PP. In many PPL systems the PP exists as an internal data structure consulted by the inference algorithm at each step [Pfeffer2001, Lunn et al.2000, Plummer2003, Milch et al.2005a, Pfeffer2009]. This process is in essence an interpreter
for the PP, similar to early Prolog systems that interpreted the logic program. Particularly when running sampling algorithms that involve millions of repetitive steps, the overhead can be enormous. A natural solution is to produce model-specific compiled inference code, but, as we show in Sec.2, existing compilers for general open-universe models [Wingate et al.2011, Yang et al.2014, Hur et al.2014, Chaganty et al.2013, Nori et al.2014] miss out on optimization opportunities and often produce inefficient inference code.
The paper analyzes the optimization opportunities for PPL compilers and describes the Swift compiler, which takes as input a BLOG program [Milch et al.2005a] and one of three inference algorithms (LW, PMH, Gibbs) and generates target code for answering queries. Swift includes three main contributions: (1) elimination of interpretative overhead by joint analysis of the model structure and inference algorithm; (2) a dynamic slice maintenance method (FIDS) for incremental computation of the current dependency structure as sampling proceeds; and (3) efficient runtime memory management for maintaining the current-possible-world data structure as the number of objects changes. Comparisons between Swift and other PPLs on a variety of models demonstrate speedups ranging from 12x to 326x, leading in some cases to performance comparable to that of hand-built model-specific code. To the extent possible, we also analyze the contributions of each optimization technique to the overall speedup.
Although Swift is developed for the BLOG language, the overall design and the choices of optimizations can be applied to other PPLs and may bring useful insights to similar AI systems for real-world applications.
In a general purpose programming language (e.g., C++), the compiler compiles exactly what the user writes (the program). By contrast, a PPL compiler essentially compiles the inference algorithm, which is written by the PPL developer, as applied to a PP. This means that different implementations of the same inference algorithm for the same PP result in completely different target code.
As shown in Fig. 1, a PPL compiler first produces an intermediate representation combining the inference algorithm () and the input model () as the inference code (), and then compiles to the target code (). Swift focuses on optimizing the inference code and the target code given a fixed input model .
Although there have been successful PPL compilers, these compilers are all designed for a restricted class of models with fixed dependency structures: see work by Tristan NIPS2014_5531 and the Stan Development Team sdt2014stan for closed-world Bayes nets, Minka minka2014infernet for factor graphs, and Kazemi and Poole kazemi2016knowledge for Markov logic networks.
Church [Goodman et al.2008] and its variants [Wingate et al.2011, Yang et al.2014, Ritchie et al.2016] provide a lightweight compilation framework for performing the Metropolis–Hastings algorithm (MH) over general open-universe probability models (OUPMs). However, these approaches (1) are based on an inefficient implementation of MH, which results in overhead in , and (2) rely on inefficient data structures in the target code .
For the first point, consider an MH iteration where we are proposing a new possible world from accoding to the proposal
by resampling random variablefrom to . The computation of acceptance ratio follows
Since only is resampled, it is sufficient to compute using merely the Markov blanket of , which leads to a much simplified formula for from Eq.(1) by cancelling terms in and . However, when applying MH for a contingent model, the Markov blanket of a random variable cannot be determined at compile time since the model dependencies vary during inference. This introduces tremendous runtime overhead for interpretive systems (e.g., BLOG, Figaro), which track all the dependencies at runtime even though typically only a tiny portion of the dependency structure may change per iteration. Church simply avoids keeping track of dependencies and uses Eq.(1), including a potentially huge overhead for redundant probability computations.
For the second point, in order to track the existence of the variables in an open-universe model, similar to BLOG, Church also maintains a complicated dynamic string-based hashing scheme in the target code, which again causes interpretation overhead.
Lastly, techniques proposed by Hur et al. hur2014slicing, Chaganty et al. chaganty2013efficiently and Nori et al. nori2014r2 primarily focus on optimizing by analyzing the static properties of the input PP, which are complementary to Swift.
This paper focuses on the BLOG language, although our approach also applies to other PPLs with equivalent semantics [McAllester et al.2008, Wu et al.2014]. Other PPLs can be also converted to BLOG via static single assignment form (SSA form) transformation [Cytron et al.1989, Hur et al.2014].
The BLOG language [Milch et al.2005a] defines probability measures over first-order (relational) possible worlds; in this sense it is a probabilistic analogue of first-order logic. A BLOG program declares types for objects and defines distributions over their numbers, as well as defining distributions for the values of random functions applied to objects. Observation statements supply evidence, and a query statement specifies the posterior probability of interest. A random variable in BLOG corresponds to the application of a random function to specific objects in a possible world. BLOG naturally supports open universe probability models (OUPMs) and context-specific dependencies.
Fig. 2 demonstrates the open-universe urn-ball model. In this version the query asks for the color of the next random pick from an urn given the colors of balls drawn previously. Line 4 is a number statement stating that the number variable
, corresponding to the total number of balls, is uniformly distributed between 1 and 20. Lines 5–6 declare a random function, applied to balls, picking
Greenwith a biased probability. Lines 7–8 state that each draw chooses a ball at random from the urn, with replacement.
All PPLs need to infer the posterior distribution of a query given the observed evidence. The great majority of PPLs use Monte Carlo sampling methods such as likelihood weighting (LW) and Metropolis–Hastings MCMC (MH). LW samples unobserved random variables sequentially in topological order, conditioned on evidence and sampled values earlier in the sequence; each complete sample is weighted by the likelihood of the evidence variables appearing in the sequence. The MH algorithm is summarized in Alg. 1. denotes a BLOG model, its evidence, a query, the number of samples, a possible world, the value of random variable in , denotes the conditional probability of in , and denotes the likelihood of possible world . In particular, when the proposal distribution in Alg.1 samples a variable conditioned on its parents, the algorithm becomes the parental MH algorithm (PMH). Our discussion in the paper focuses on LW and PMH but our proposed solution applies to other Monte Carlo methods as well.
Swift has the following design goals for handling open-universe probability models.
Provide a general framework to automatically and efficiently (1) track the exact Markov blanket for each random variable and (2) maintain the minimum set of random variables necessary to evaluate the query and the evidence (the dynamic slice [Agrawal and Horgan1990], or equivalently the minimum self-supporting partial world [Milch et al.2005a]).
Provide efficient memory management and fast data structures for Monte Carlo sampling algorithms to avoid interpretive overhead at runtime.
For the first goal, we propose a novel framework, FIDS (Framework for Incrementally updating Dynamic Slices), for generating the inference code ( in Fig. 1). For the second goal, we carefully choose data structures in the target C++ code ().
For convenience, r.v. is short for random variable. We demonstrate examples of compiled code in the following discussion: the inference code () generated by FIDS in Sec. 4.1 is in pseudo-code while the target code () by Swift shown in Sec. 4.2 is in C++. Due to limited space, we show the detailed transformation rules only for the adaptive contingency updating (ACU) technique and omit the others.
Our discussion in this section focuses on LW and PMH, although FIDS can be also applied to other algorithms. FIDS includes three optimizations: dynamic backchaining (DB), adaptive contingency updating (ACU), and reference counting (RC). DB is applied to both LW and PMH while ACU and RC are specific to PMH.
Dynamic backchaining (DB) [Milch et al.2005a] constructs a dynamic slice incrementally in each LW iteration and samples only those variables necessary at runtime. DB in Swift is an example of compiling lazy evaluation: in each iteration, DB backtracks from the evidence and query and samples an r.v. only when its value is required during inference.
For every r.v. declaration random T ; in the input PP, FIDS generates a getter function get X(), which is used to (1) sample a value for from its declaration and (2) memoize the sampled value for later references in the current iteration. Whenever an r.v. is referred to during inference, its getter function will be evoked.
Compiling DB is the fundamental technique for Swift. The key insight is to replace dependency look-ups and method invocations from some internal PP data structure with direct machine address accessing and branching in the target executable file.
For example, the inference code for the getter function of in the -GMM model (Fig. 3) is shown below.
memoization denotes some pseudo-code snippet for performing memoization. Since
get_z() is called before
get_mu(), only those corresponding to non-empty clusters will be sampled. Notably, with the inference code above, no explicit dependency look-up is ever required at runtime to discover those non-empty clusters.
When sampling a number variable, we need to allocate memory for the associated variables. For example, in the urn-ball model (Fig. 2), we need to allocate memory for after sampling . The corresponding generated inference code is shown below.
denotes some pseudo-code segment for allocating
val chunks of memory for the values of .
Reference counting (RC) generalize the idea of DB to incrementally maintain the dynamic slice in PMH. RC is an efficient compilation strategy for the interpretive BLOG to dynamically maintain references to variables and exclude those without any references from the current possible world with minimal runtime overhead. RC is also similar to dynamic garbage collection in programming language community.
For an r.v. being tracked, say , RC maintains a reference counter cnt(X) defined by . denote the children of in the possible world . When cnt(X) becomes zero, is removed from ; when cnt(X) become positive, is instantiated and added back to . This procedure is performed recursively.
Tracking references to every r.v. might cause unnecessary overhead. For example, for classical Bayes nets, RC never excludes any variables. Hence, as a trade-off, Swift only counts references in open-universe models, particularly, to those variables associated with number variables (e.g., in the urn-ball model).
Take the urn-ball model as an example. When resampling and accepting the proposed value , the generated code for accepting the proposal will be
dec_cnt(X) update the references to . The function
accept_value_drawn() is specialized to r.v. : Swift analyzes the input program and generates specialized codes for different variables.
The goal of ACU is to incrementally maintain the Markov Blanket for every r.v. with minimal efforts.
denotes the declaration of r.v. in the input PP; denote the parents of in the possible world ; denotes the new possible world derived from by only changing the value of to . Note that deriving may require instantiating new variables not existing in due to dependency changes.
Computing for is straightforward by executing its generation process, , within , which is often inexpensive. Hence, the principal challenge for maintaining the Markov blanket is to efficiently maintain a children set for each r.v. with the aim of keeping identical to the true set for the current possible world .
With a proposed possible world (PW) , it is easy to add all missing dependencies from a particular r.v. . That is, for an r.v. and every r.v. , since must be in , we can keep up-to-date by adding to if . Therefore, the key step is tracking the set of variables, , which will have a set of parents in different from .
Precisely computing is very expensive at runtime but computing an upper bound for does not influence the correctness of the inference code. Therefore, we proceed to find an upper bound that holds for any value . We call this over-approximation the contingent set .
Note that for every r.v. with dependency that changes in , must be reached in the condition of a control statement on the execution path of within . This implies . One straightforward idea is set . However, this leads to too much overhead. For example, should be always empty for models with fixed dependencies.
Another approach is to first find out the set of switching variables for every r.v. , which is defined by
This brings an immediate advantage: can be statically computed at compile time. However, computing is NP-Complete111Since the declaration may contain arbitrary boolean formulas, one can reduce the 3-SAT problem to computing ..
Our solution is to derive the approximation by taking the union of free variables of if/case conditions, function arguments, and set expression conditions in , and then set
is a function of the sampling variable and the current PW . Likewise, for every r.v. , we maintain a runtime contingent set identical to the true set under the current PW . can be incrementally updated as well.
Back to the original focus of ACU, suppose we are adding new the dependencies in the new PW . There are three steps to accomplish the goal: (1) enumerating all the variables , (2) for all , add to , and (3) for all , add to . These steps can be also repeated in a similar way to remove the vanished dependencies.
Take the -GMM model (Fig. 3) as an example , when resampling , we need to change the dependency of r.v. since for any . The generated code for this process is shown below.
We omit the code for
del_from_Ch() for conciseness, which is essentially the same as
The formal transformation rules for ACU are demonstrated in Fig. 4. takes in a random variable declaration statement in the BLOG program and outputs the inference code containing methods accept value() and add to Ch(). takes in a BLOG expression and generates the inference code inside method add to Ch(). It keeps track of already seen variables in , which makes sure that a variable will be added to the contingent set or the children set at most once. Code for removing variables can be generated similarly.
Since we maintain the exact children for each r.v. with ACU, the computation of acceptance ratio in Eq. 1 for PMH can be simplified to
Here denotes the total number of random variables existing in , which can be maintained via RC.
Finally, the computation time of ACU is strictly shorter than that of acceptance ratio (Eq. 2).
Here we introduce the implementation details of the memoization code in the getter function mentioned in section 4.1.1.
Objects will be converted to integers in the target code. Swift analyzes the input PP and allocates static memory for memoization in the target code. For open-universe models where the number of random variables are unknown, the dynamic table data structure (e.g., vector in C++) is used to reduce the amount of dynamic memory allocations: we only increase the length of the array when the number becomes larger than the capacity.
One potential weakness of directly applying a dynamic table is that, for models with multiple nested number statements (e.g. the aircraft tracking model with multiple aircrafts and blips for each one), the memory consumption can be large. In Swift, the user can force the target code to clear all the unused memory every fixed number of iterations via a compilation option, which is turned off by default.
The following code demonstrates the target code for the number variable and its associated ’s in the urn-ball model (Fig. 2) where iter is the current iteration number.
Note that when generating a new PW, by simply increasing the counter iter, all the memoization flags (e.g., mark color for ) will be automatically cleared.
Lastly, in PMH, we need to randomly select an r.v. to sample per iteration. In the target C++ code, this process is accomplished via polymorphism by declaring an abstract class with a resample() method and a derived class for every r.v. in the PP. An example code snippet for the -GMM model is shown below.
Although one PMH iteration only samples a single variable, generating a new PW may still involve (de-)instantiating an arbitrary number of random variables due to RC or sampling a number variable. The general approach commonly adopted by many PPL systems is to construct the proposed possible world in dynamically allocated memory and then copy it to the current world [Milch et al.2005a, Wingate et al.2011], which suffers from significant overhead.
In order to generate and accept new PWs in PMH with negligible memory management overhead, we extend the lightweight memoization to manipulate proposals: Swift statically allocates an extra memoized cache for each random variable, which is dedicated to storing the proposed value for that variable. During resampling, all the intermediate results are stored in the cache. When accepting the proposal, the proposed value is directly loaded from the cache; when rejection, no action is needed due to our memoization mechanism.
Here is an example target code fragment for in the -GMM model. proposed vars is an array storing all the variables to be updated if the proposal gets accepted.
We demonstrate that FIDS can be applied to new algorithms by implementing a translator for the Gibbs sampling [Arora et al.2010] (Gibbs) in Swift.
Gibbs is a variant of MH with the proposal distribution set to be the posterior distribution. The acceptance ratio
is always 1 while the disadvantage is that it is only possible to explicitly construct the posterior distribution with conjugate priors. In Gibbs, the proposal distribution is still constructed from the Markov blanket, which again requires maintaining the children set. Hence, FIDS can be fully utilized.
However, different conjugate priors yield different forms of posterior distributions. In order to support a variety of proposal distributions, we need to (1) implement a conjugacy checker to check the form of posterior distribution for each r.v. in the PP at compile time (the checker has nothing to do with FIDS); (2) implement a posterior sampler for every conjugate prior in the runtime library of Swift.
In the experiments, Swift generates target code in C++ with C++ standard <random> library for random number generation and the armadillo package [Sanderson2010] for matrix computation. The baseline systems include BLOG (version 0.9.1), Figaro (version 3.3.0), Church (webchurch), Infer.NET (version 2.6), BUGS (winBUGS 1.4.3) and Stan (cmdStan 2.6.0).
We collect a set of benchmark models222Most models are simplified to ensure that all the benchmark systems can produce an output in reasonably short running time. All of these models can be enlarged to handle more data without adding extra lines of BLOG code. which exhibit various capabilities of a PPL (Tab. 1), including the burglary model (Bur), the hurricane model (Hur), the urn-ball model (U-B(,) denotes the urn-ball model with at most balls and draws), the TrueSkill model [Herbrich et al.2007] (T-K), 1-dimensional Gaussian mixture model (GMM, 100 points with 4 clusters) and the infinite GMM model (-GMM). We also include a real-world dataset: handwritten digits [LeCun et al.1998] using the PPCA model [Tipping and Bishop1999]. All the models can be downloaded from the GitHub repository of BLOG.
We first evaluate the speedup promoted by each of the three optimizations in FIDS individually.
We compare the running time of the following versions of code: (1) the code generated by Swift (“Swift”); (2) the modified compiled code with DB manually turned off (“No DB”), which sequentially samples all the variables in the PP; (3) the hand-optimized code without unnecessary memoization or recursive function calls (“Hand-Opt”).
We measure the running time for all the 3 versions and the number of calls to the random number generator for “Swift” and “No DB”. The result is concluded in Tab. 2.
|Running Time (s): Swift v.s. No DB|
|Running Time (s): Swift v.s. Hand-Optimized|
|Calls to Random Number Generator|
The overhead due to memoization compared against the hand-optimized code is less than . We can further notice that the speedup is proportional to the number of calls saved.
RC only applies to open-universe models in Swift. Hence we focus on the urn-ball model with various model parameters. The urn-ball model represents a common model structure appearing in many applications with open-universe uncertainty. This experiment reveals the potential speedup by RC for real-world problems with similar structures. RC achieves greater speedup when the number of balls and the number of observations become larger.
We compare the code produced by Swift with RC (“Swift”) and RC manually turned off (“No RC”). For “No RC”, we traverse the whole dependency structure and reconstruct the dynamic slice every iteration. Fig. 5 shows the running time of both versions. RC is indeed effective – leading to up to 2.3x speedup.
We compare the code generated by Swift with ACU (“Swift”) and the manually modified version without ACU (“Static Dep”) and measure the number of calls to the likelihood function in Tab. 3.
The version without ACU (“Static Dep”) demonstrates the best efficiency that can be achieved via compile-time analysis without maintaining dependencies at runtime. This version statically computes an upper bound of the children set by and again uses Eq.(2) to compute the acceptance ratio. Thus, for models with fixed dependencies, “Static Dep” should be faster than “Swift”.
In Tab. 3, “Swift” is up to 1.7x faster than “Static Dep”, thanks to up to reduction of likelihood function evaluations. Note that for the model with fixed dependencies (Bur), the overhead by ACU is almost negligible: due to traversing the hashmap to access to child variables.
|Running Time (s): Swift v.s. Static Dep|
|Calls to Likelihood Functions|
We compare Swift with other systems using LW, PMH and Gibbs respectively on the benchmark models. The running time is presented in Tab. 4. The speedup is measured between Swift and the fastest one among the rest. An empty entry means that the corresponding PPL fails to perform inference on that model. Though these PPLs have different host languages (C++, Java, Scala, etc.), the performance difference resulting from host languages is within a factor of 2333http://benchmarksgame.alioth.debian.org/.
|LW with samples|
|PMH with samples|
|Gibbs with samples|
We use Swift with Gibbs to handle the probabilistic principal component analysis (PPCA)[Tipping and Bishop1999], with real-world data: 5958 images of digit “2” from the hand-written digits dataset (MNIST) [LeCun et al.1998]. We compute 10 principal components for the digits. The training and testing sets include 5958 and 1032 images respectively, each with 28x28 pixels and pixel value rescaled to .
Since most benchmark PPL systems are too slow to handle this amount of data, we compare Swift against other two scalable PPLs, Infer.NET and Stan, on this model. Both PPLs have compiled inference and are widely used for real-world applications. Stan uses HMC as its inference algorithm. For Infer.NET, we select variational message passing algorithm (VMP), which is Infer.NET’s primary focus. Note that HMC and VMP are usually favored for fast convergence.
Stan requires a tuning process before it can produce samples. We ran Stan with 0, 5 and 9 tuning steps respectively444We also ran Stan with 50 and 100 tuning steps, which took more than 3 days to finish 130 iterations (including tuning). However, the results are almost the same as that with 9 tuning steps.. We measure the perplexity of the generated samples over test images w.r.t. the running time in Fig. 6, where we also visualize the produced principal components. For Infer.NET, we consider the mean of the approximation distribution.
Swift quickly generates visually meaningful outputs with around Gibbs iterations in 5 seconds. Infer.NET takes 13.4 seconds to finish the first iteration555Gibbs by Swift samples a single variable while Infer.NET processes all the variables per iteration. and converges to a result with the same perplexity after 25 iterations and 150 seconds. The overall convergence of Gibbs w.r.t. the running time significantly benefits from the speedup by Swift.
For Stan, its no-U-turn sampler [Homan and Gelman2014] suffers from significant parameter tuning issues. We also tried to manually tune the parameters, which does not help much. Nevertheless, Stan does work for the simplified PPCA model with 1-dimensional data. Although further investigation is still needed, we conjecture that Stan is very sensitive to its parameters in high-dimensional cases. Lastly, this experiment also suggests that those parameter-robust inference engines, such as Swift with Gibbs and Infer.NET with VMP, would be preferred at practice when possible.
We have developed Swift, a PPL compiler that generates model-specific target code for performing inference. Swift uses a dynamic slicing framework for open-universe models to incrementally maintain the dependency structure at runtime. In addition, we carefully design data structures in the target code to avoid dynamic memory allocations when possible. Our experiments show that Swift achieves orders of magnitudes speedup against other PPL engines.
The next step for Swift is to support more algorithms, such as SMC [Wood et al.2014], as well as more samplers, such as the block sampler for (nearly) deterministic dependencies [Li et al.2013]. Although FIDS is a general framework that can be naturally extended to these cases, there are still implementation details to be studied.
Another direction is partial evaluation, which allows the compiler to reason about the input program to simplify it. Shah et al. rohin2016simpl proposes a partial evaluation framework for the inference code ( in Fig. 1). It is very interesting to extend this work to the whole Swift pipeline.
We would like to thank our anonymous reviewers, as well as Rohin Shah for valuable discussions. This work is supported by the DARPA PPAML program, contract FA8750-14-C-0011.
Approximate inference for infinite contingent Bayesian networks.In
Tenth International Workshop on Artificial Intelligence and Statistics, Barbados, 2005.