Venture: a higher-order probabilistic programming platform with programmable inference

04/01/2014 ∙ by Vikash Mansinghka, et al. ∙ MIT

We describe Venture, an interactive virtual machine for probabilistic programming that aims to be sufficiently expressive, extensible, and efficient for general-purpose use. Like Church, probabilistic models and inference problems in Venture are specified via a Turing-complete, higher-order probabilistic language descended from Lisp. Unlike Church, Venture also provides a compositional language for custom inference strategies built out of scalable exact and approximate techniques. We also describe four key aspects of Venture's implementation that build on ideas from probabilistic graphical models. First, we describe the stochastic procedure interface (SPI) that specifies and encapsulates primitive random variables. The SPI supports custom control flow, higher-order probabilistic procedures, partially exchangeable sequences and "likelihood-free" stochastic simulators. It also supports external models that do inference over latent variables hidden from Venture. Second, we describe probabilistic execution traces (PETs), which represent execution histories of Venture programs. PETs capture conditional dependencies, existential dependencies and exchangeable coupling. Third, we describe partitions of execution histories called scaffolds that factor global inference problems into coherent sub-problems. Finally, we describe a family of stochastic regeneration algorithms for efficiently modifying PET fragments contained within scaffolds. Stochastic regeneration yields linear runtime scaling in cases where many previous approaches scaled quadratically. We show how to use stochastic regeneration and the SPI to implement general-purpose inference strategies such as Metropolis-Hastings, Gibbs sampling, and blocked proposals based on particle Markov chain Monte Carlo and mean-field variational inference techniques.


1 Introduction

Probabilistic modeling and approximate Bayesian inference have proven to be powerful tools in multiple fields, from machine learning (Bishop, 2006) and statistics (Green et al., 2003; Gelman et al., 1995) to robotics (Thrun et al., 2005), artificial intelligence (Russell and Norvig, 2002), and cognitive science (Tenenbaum et al., 2011). Unfortunately, even relatively simple probabilistic models and their associated inference schemes can be difficult and time-consuming to design, specify, analyze, implement, and debug. Applications in different fields, such as robotics and statistics, involve differing modeling idioms, inference techniques and dataset sizes. Different fields also often impose varying speed and accuracy requirements that interact with modeling and algorithmic choices. Small changes in the modeling assumptions, data, performance/accuracy requirements or compute budget frequently necessitate end-to-end redesign of the probabilistic model and inference strategy, in turn necessitating reimplementation of the underlying software.

These difficulties impose a high cost on practitioners, making state-of-the-art modeling and inference approaches impractical for many problems. Minor variations on standard templates can be out of reach for non-specialists. The cost is also high for experts: the development time and failure rate make it difficult to innovate on methodology except in simple settings. This limits the richness of probabilistic models of cognition and artificial intelligence systems, as these kinds of models push the boundaries of what is possible with current knowledge representation and inference techniques.

Probabilistic programming languages could potentially mitigate these problems. They provide a formal representation for models — often via executable code that makes a random choice for every latent variable — and attempt to encapsulate and automate inference. Several languages and systems have been built along these lines over the last decade (Lunn et al., 2000; Stan Development Team, 2013; Milch et al., 2007; Pfeffer, 2001; McCallum et al., 2009). Each of these systems is promising in its own domain; some of the strengths of each are described below. However, none of the probabilistic programming languages and systems that have been developed thus far is suitable for general purpose use. Examples of drawbacks include inadequate and unpredictable runtime performance, limited expressiveness, batch-only operation, lack of extensibility, and overly restrictive and/or opaque inference schemes. In this paper, we describe Venture, a new probabilistic language and inference engine that attempts to address these limitations.

Several probabilistic programming tools have sought efficiency by restricting expressiveness. For example, Microsoft’s Infer.NET system (Minka et al., 2010) leverages fast message passing techniques originally developed for graphical models, but as a result restricts the use of stochastic choice in the language so that it cannot influence control flow. Such random choices would yield models over sets of random variables with varying or even unbounded size, and therefore preclude compilation to a graphical model. BUGS, arguably the first (and still most widely used) probabilistic programming language, has essentially the same restrictions (Lunn et al., 2000). Random compound data types, procedures and stochastic control flow constructs that could lead to a priori unbounded executions are all out of scope. STAN, a BUGS-like language being developed in the Bayesian statistics community, has limited support for discrete random variables, as these are incompatible with the hybrid (gradient-based) Monte Carlo strategy it uses to overcome convergence issues with Gibbs sampling (Stan Development Team, 2013). Other probabilistic programming tools that have seen real-world use include FACTORIE (McCallum et al., 2009) and Markov Logic (Richardson and Domingos, 2006); applications of both have emphasized problems in information extraction. The probabilistic models that can be defined using FACTORIE and Markov Logic are finite and undirected, specified imperatively (for FACTORIE) or declaratively (for Markov Logic). Both systems make use of specialized, efficient approximation algorithms for inference and parameter estimation. Infer.NET, STAN, BUGS, FACTORIE and Markov Logic each capture important modeling and approximate inference idioms, but there are also interesting models that each cannot express. Additionally, a number of probabilistic extensions of classical logic programming languages have been developed (Poole, 1997; Sato and Kameya, 1997; De Raedt and Kersting, 2008), motivated by problems in statistical relational learning. As with FACTORIE and Markov Logic, these languages have interesting and useful properties, but have thus far not yielded compact descriptions of many useful classes of probabilistic generative models from domains such as statistics and robotics.

In contrast, probabilistic programming languages such as BLOG (Milch et al., 2007), IBAL (Pfeffer, 2001), Figaro (Pfeffer, 2009) and Church (Goodman*, Mansinghka*, Roy, Bonowitz, and Tenenbaum, 2008; Mansinghka, 2009) emphasize expressiveness. Each language was designed around the needs of models whose fine-grained structure cannot be represented using directed or undirected graphical models, and where standard inference algorithms for graphical models such as belief propagation do not directly apply. Examples include probabilistic grammars (Jelinek et al., 1992), nonparametric Bayesian models (Rasmussen, 1999; Johnson et al., 2007; Rasmussen and Williams, 2006; Griffiths and Ghahramani, 2005), probabilistic models over worlds with a priori unknown numbers of objects, models for learning the structure of graphical models (Heckerman, 1998; Friedman and Koller, 2003; Mansinghka et al., 2006), models for inductive learning of symbolic expressions (Grosse et al., 2012; Duvenaud et al., 2013) and models defined in terms of complex stochastic simulation software that lack tractable likelihoods (Marjoram et al., 2003).

Each of these model classes is the basis of real-world applications, where inference over richly structured models can address limitations of classic statistical modeling and pattern recognition techniques. Example domains include natural language processing (Manning and Schütze, 1999), speech recognition (Baker, 1979), information extraction (Pasula et al., 2002), multitarget tracking and sensor fusion (Oh et al., 2009; Arora et al., 2010), ecology (Csilléry et al., 2010) and computational biology (Friedman et al., 2000; Yu et al., 2004; Toni et al., 2009; Dowell and Eddy, 2004). However, the performance engineering needed to turn specialized inference algorithms for these models into viable implementations is challenging. Direct deployment of probabilistic program implementations in real-world applications is often infeasible. The elaborations on these models that expressive probabilistic languages enable can thus seem completely impractical.

Church makes the most extreme tradeoffs with respect to expressiveness and efficiency. It can represent models from all the classes listed above, partly through its support for higher-order probabilistic procedures, and it can also represent generative models defined in terms of algorithms for simulation and inference in arbitrary Church programs. This flexibility makes Church especially suitable for nonparametric Bayesian modeling (Roy et al., 2008), as well as artificial intelligence and cognitive science problems that involve reasoning about reasoning, such as sequential decision making, planning and theory of mind (Goodman and Tenenbaum, 2013; Mansinghka, 2009). Additionally, probabilistic formulations of learning Church programs from data — including both program structure and parameters — can be formulated in terms of inference in an ordinary Church program (Mansinghka, 2009). But although various Church implementations provide automatic Metropolis-Hastings inference mechanisms that in principle apply to all these problems, these mechanisms have exhibited limitations in practice. It has not been clear how to make general-purpose sampling-based inference in Church sufficiently scalable for typical machine learning applications, including problems for which standard techniques based on graphical models have been applied successfully. It is also not easy for Church programmers to override built-in inference mechanisms or add new higher-order stochastic primitives.

In this paper we describe Venture, an interactive, Turing-complete, higher-order probabilistic programming platform that aims to be sufficiently expressive, extensible and efficient for general-purpose use. Venture includes a virtual machine, a language for specifying probabilistic models, and a language for specifying inference problems along with custom inference strategies for solving those problems. Venture’s implementation of standard MCMC schemes scales linearly with dataset size on problems where many previous inference architectures scale quadratically and are therefore impractical. Venture also supports a larger class of primitives — including “likelihood-free” primitives arising from complex stochastic simulators — and enables programmers to incrementally migrate performance-critical portions of their probabilistic program to optimized external inference code. Venture thus improves over Church in terms of expressiveness, extensibility, and scalability. Although it remains to be seen if these improvements are sufficient for general-purpose use, unoptimized Venture prototypes have begun to be successfully applied in real-world system building, Bayesian data analysis, cognitive modeling and machine intelligence research.


1.1 Contributions

This paper makes two main contributions. First, it describes key aspects of Venture’s design, including support for interactive modeling and programmable inference. Due to these and other innovations, Venture provides broad coverage in terms of models, approximation strategies for those models and overall applicability to inference problems with varying model/data complexities and time/accuracy requirements. Second, this paper describes key aspects of Venture’s implementation: the stochastic procedure interface (SPI) for encapsulating primitives, the probabilistic execution trace (PET) data structure for efficient representation and updating of execution histories, and a suite of stochastic regeneration algorithms for scalable inference within trace fragments called scaffolds. Other important aspects of Venture, including the VentureScript front-end syntax, formal language definitions, software architecture, standard library, and performance measurements of optimized implementations, are all beyond the scope of this paper.


It is helpful to consider the relationships between PETs and the SPI. The SPI generalizes the notion of elementary random procedure associated with many previous probabilistic programming languages. The SPI encapsulates Venture primitives and enables interoperation with external modeling components, analogously to a foreign function interface in a traditional programming language. External modeling components can represent sets of latent variables that are hidden from Venture and handled by specialized inference code. The SPI also supports custom control flow, higher-order probabilistic procedures, exchangeable sequences, and the “likelihood-free” stochastic primitives that can arise from complex simulations. Probabilistic execution traces are used to represent generative models written in Venture, along with particular realizations of these models and the data values they must explain. PETs generalize Bayesian networks to handle the patterns of conditional dependence, existential dependence and exchangeable coupling amongst invocations of stochastic procedures conforming to the SPI. PETs thus must handle a priori unbounded sets of random variables that can themselves be arbitrary probabilistic generative processes written in Venture and that may lack tractable probability densities.

Using these tools, we show how to define coherent local inference steps over arbitrary sets of random choices within a PET, conditioned on the surrounding trace. The core idea is the notion of a scaffold. A scaffold is a subset of a PET that contains those random variables that must exist regardless of the values chosen for a set of variables of interest, along with a set of random variables whose values will be conditioned on. We show how to construct scaffolds efficiently. Inference given a scaffold proceeds via stochastic regeneration algorithms that efficiently consume and either restore or resample PET fragments without visiting conditionally independent random choices. The proposal probabilities, local priors, local likelihoods and gradients needed for several approximate inference strategies can all be obtained via small variations on stochastic regeneration.

We use stochastic regeneration to implement both single-site and composite Metropolis-Hastings, Gibbs sampling, and also blocked proposals based on hybrids with conditional sequential Monte Carlo and variational techniques. The uniform implementation of these approaches to incremental inference, along with analytical tools for converting randomly chosen local transition operators into ergodic global transition operators on PETs, constitutes another contribution.

2 The Venture Language

Consider the following example Venture program for determining if a coin is fair or tricky:

[ASSUME is_tricky_coin (bernoulli 0.1)]
[ASSUME coin_weight (if is_tricky_coin (uniform 0.0 1.0) 0.5)]
[OBSERVE (bernoulli coin_weight) True]
[OBSERVE (bernoulli coin_weight) True]
[INFER (mh default one 10)]
[PREDICT (bernoulli coin_weight)]

We will informally discuss this program before defining the Venture language more precisely.

The ASSUME instructions induce the hypothesis space for the probabilistic model, including a random variable for whether or not the coin is tricky, and either a deterministic coin weight or a potential random variable corresponding to the unknown weight. The model selection problem is expressed via an if with a stochastic predicate, with the alternative models on the consequent and alternate branches. After executing the ASSUME instructions, particular values for is_tricky_coin and coin_weight will have been sampled, though the meaning of the program so far corresponds to a probability distribution over possible executions.

The OBSERVE instructions describe a data generator that produces two flips of a coin with the generated weight, along with data that is assumed to be generated by the given generator. In this program, the OBSERVEs encode constraints that both of these coin flips landed heads up.

The INFER instruction causes Venture to find a hypothesis (execution trace) that is probable given the data using 10 iterations of its default Markov chain for inference. (This default Markov chain is a variant of the algorithm from Church (Goodman*, Mansinghka*, Roy, Bonowitz, and Tenenbaum, 2008; Mansinghka, 2009): a simple random-scan single-site Metropolis-Hastings algorithm that chooses random choices uniformly at random from the current execution, resimulates them conditioned on the rest of the trace, and accepts or rejects the result.) INFER evolves the probability distribution over execution traces inside Venture from whatever distribution is present before the instruction — in this case, the prior — closer to its conditional given any observations that have been added prior to that INFER, using a user-specified inference technique. In this example program, the resulting marginal distribution on whether or not the coin is tricky shifts from the prior to an approximation of the posterior given the two observed flips, increasing the probability of the coin being tricky ever so slightly. Increasing 10 to 100 shifts the distribution closer to the true posterior; other inference strategies, including exact sampling techniques, will be covered later. The execution trace inside Venture after the instruction is sampled from this new distribution.


Once inference has finished, the PREDICT instruction causes Venture to report a sampled prediction for the outcome of another flip of the coin. The weight used to generate this sample comes from the current execution trace.
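To make the semantics of this example concrete, the following is a minimal Python sketch (not Venture's implementation) that forward-simulates the tricky-coin model and approximates conditioning by rejection, keeping only executions in which both observed flips land heads:

```python
import random

def simulate(rng):
    # ASSUME is_tricky_coin (bernoulli 0.1)
    is_tricky = rng.random() < 0.1
    # ASSUME coin_weight (if is_tricky_coin (uniform 0.0 1.0) 0.5)
    weight = rng.uniform(0.0, 1.0) if is_tricky else 0.5
    # OBSERVE (bernoulli coin_weight) True, twice
    flips = [rng.random() < weight for _ in range(2)]
    return is_tricky, flips

rng = random.Random(0)
kept = []
while len(kept) < 20000:
    is_tricky, flips = simulate(rng)
    if all(flips):                       # rejection: keep only matching traces
        kept.append(is_tricky)

# Exact posterior: 0.1 * (1/3) / (0.1 * (1/3) + 0.9 * 0.25) ≈ 0.129,
# a slight increase over the 0.1 prior probability of a tricky coin.
posterior = sum(kept) / len(kept)
```

With only 10 iterations of the default Markov chain, INFER approximates this rejection semantics; increasing the transition count moves the sampled trace distribution toward it.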


2.1 Modeling and Inference Instructions

Venture programs consist of sequences of modeling instructions and inference instructions, each given a positive integer index by a global instruction counter. (Venture implementations also support labels for the instruction language. However, the reliance on instruction indexes reflects the ways in which the current instruction language is primitive compared to the modeling language: for example, it currently lacks control flow constructs, procedural abstraction, and recursion, let alone runtime generation and execution of statements in the instruction language. Current research is focused on addressing these limitations.) Interactive Venture sessions have the same structure. The modeling instructions are used to specify the probabilistic model of interest, any conditions on the model that the inference engine needs to enforce, and any requests for prediction of values against the conditioned distribution.

The core modeling instructions in Venture are:

  1. [ASSUME <name> <expr>]: binds the result of simulating the model expression <expr> to the symbol <name> in the global environment. This is used to build up the model that will be used to interpret data. Returns the value taken on by <name>, along with the index of the instruction.

  2. [OBSERVE <expr> <literal-value>]: adds the constraint that the model expression <expr> must yield <literal-value> in every execution. Note that this constraint is not enforced until inference is performed.

  3. [PREDICT <expr>]: samples a value for the model expression <expr> from the current distribution on executions in the engine and returns the value. As the amount of inference done since the last OBSERVE approaches infinity, this distribution converges to the conditioned distribution that reconciles the OBSERVEs.

  4. [FORGET <instruction-index-or-label>]: This instruction causes the engine to undo and then forget the given instruction, which must be either an OBSERVE or PREDICT. Forgetting an observation removes the constraint it represents from the inference problem. Note that the effect may not be visible until an INFER is performed.
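As a rough illustration of how these four instructions compose, here is a toy Python sketch; the class and method names are hypothetical, and conditioning is implemented by whole-trace rejection rather than Venture's incremental inference machinery:

```python
import random

class ToyEngine:
    """Toy sketch of ASSUME / OBSERVE / PREDICT / FORGET (assumed names,
    not Venture's API): the program is a forward sampler, and PREDICT
    resimulates whole traces until every recorded OBSERVE holds."""
    def __init__(self, rng):
        self.rng, self.assumes, self.observes, self.counter = rng, [], {}, 0

    def assume(self, name, thunk):
        self.counter += 1
        self.assumes.append((name, thunk))
        return self.counter            # instruction index, as in the text

    def observe(self, thunk, value):
        self.counter += 1
        self.observes[self.counter] = (thunk, value)
        return self.counter

    def forget(self, index):
        self.observes.pop(index)       # remove the constraint

    def predict(self, thunk):
        while True:                    # rejection: resimulate whole traces
            env = {}
            for name, t in self.assumes:
                env[name] = t(env, self.rng)
            if all(t(env, self.rng) == v for t, v in self.observes.values()):
                return thunk(env, self.rng)

# The tricky-coin program from Section 2, expressed against the toy engine:
e = ToyEngine(random.Random(0))
e.assume("is_tricky", lambda env, r: r.random() < 0.1)
e.assume("weight", lambda env, r: r.uniform(0, 1) if env["is_tricky"] else 0.5)
obs = e.observe(lambda env, r: r.random() < env["weight"], True)
flip = e.predict(lambda env, r: r.random() < env["weight"])
e.forget(obs)                          # the constraint no longer applies
```

Note how FORGET simply drops the constraint from the inference problem, matching the description above: its effect is only visible the next time inference (here, the rejection loop) runs.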

Venture supports additional instructions for inference and read-out, including:

  1. [INFER <inference-expr>]: This instruction first incorporates any observations that have occurred after the last INFER, then evolves the probability distribution on executions according to the inference strategy described by <inference-expr>. Two inference expressions, corresponding to general-purpose exact and approximate sampling schemes, are useful to consider here:

    1. (rejection default all) corresponds to the use of rejection sampling to generate an exact sample from the conditioned distribution on traces. The runtime requirements may be substantial, and exact sampling applies to a smaller class of programs than approximate sampling. However, rejection is crucial for understanding the meaning of a probabilistic model and for debugging models without simultaneously debugging inference strategies.

    2. (mh default one 1) corresponds to one transition of the standard uniform mixture of single-site Metropolis-Hastings transition operators used as a general-purpose “automatic” inference scheme in many probabilistic programming systems. As the number of transitions is increased from 1 towards infinity, the semantics of the instruction approach an exact implementation of conditioning via rejection.

  2. [SAMPLE <expr>]: This instruction simulates the model expression <expr> against the current trace, returns the value, and then forgets the trace associated with the simulation. It is equivalent to [PREDICT <expr>] followed by [FORGET <index-of-predict>], but is provided as a single instruction for convenience.

  3. [FORCE <expr> <literal-value>]: Modify the current trace so that the simulation of <expr> takes on the value <literal-value>. Its implementation can be roughly thought of as an OBSERVE immediately followed by an INFER and then a FORGET. This instruction can be used for controlling initialization and for debugging.

2.2 Modeling Expressions

Venture modeling expressions describe stochastic generative processes. The space of all possible executions of all the modeling expressions in a Venture program constitutes the hypothesis space that the program represents. Each Venture program thus represents a probabilistic model by defining a stochastic generative process that samples from it.

At the expression level, Venture is similar to Scheme and to Church, though there are several differences. (We sometimes refer to the s-expression syntax, including syntactic sugar, as Venchurch, and the desugared language, represented as JSON objects corresponding to parse trees, as Venture.) For example, branching and procedure construction can both be desugared into applications of stochastic procedures — that is, ordinary combinations — and do not need to be treated as special forms. Additionally, Venture supports a dynamic scoping construct called scope_include for tagging portions of an execution history such that they can be referred to by inference instructions; to the best of our knowledge, analogous constructs have not yet been introduced in other probabilistic programming languages.

Venture modeling expressions can be broken down into a few simple cases:

  1. Self-evaluating or “literal” values: These describe constant values in the language, and are discussed below.

  2. Combinations: (<operator-expr> <operand0-expr> ... <operandk-expr>) first evaluates all of its expressions in arbitrary order, then applies the value of <operator-expr> (which must be a stochastic procedure) to the values of all the <operand-expr>s. It returns the value of the application as its own result.

  3. Quoted expressions: (quote <expr>) returns the expression <expr> itself, as a value. As compared to combinations, quote suppresses evaluation.

  4. Lambda expressions: (lambda <args> <body-expr>) returns a stochastic procedure with the given formal parameters and procedure body. <args> is a list of argument names (<arg0> ... <argk>).

  5. Conditionals: (if <predicate-expr> <consequent-expr> <alternate-expr>) evaluates the <predicate-expr>; if the resulting value is true, it evaluates and returns the value of the <consequent-expr>, and otherwise evaluates and returns the value of the <alternate-expr>.

  6. Inference scope annotations: (scope_include <scope-expr> <block-expr> <expr>) provides a mechanism for naming random choices in a probabilistic model, so that they can be referred to during inference programming. The scope_include form simulates <scope-expr> and <block-expr> to obtain a scope value and a block value, and then simulates <expr>, tagging all the random choices in that process with the given scope and block. More details on inference scopes are given below.
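A compact Python sketch can make these expression forms concrete; here nested tuples stand in for desugared parse trees, and the representation and helper names are invented for illustration rather than taken from Venture:

```python
import random

def simulate(expr, env):
    if not isinstance(expr, tuple):
        # Literals evaluate to themselves; strings act as symbols and are
        # looked up in the environment (unbound symbols stay as-is here).
        return env.get(expr, expr) if isinstance(expr, str) else expr
    head = expr[0]
    if head == "quote":                 # suppress evaluation
        return expr[1]
    if head == "lambda":                # build a procedure (a closure)
        args, body = expr[1], expr[2]
        return lambda *vals: simulate(body, {**env, **dict(zip(args, vals))})
    if head == "if":                    # only the taken branch is simulated
        return simulate(expr[2] if simulate(expr[1], env) else expr[3], env)
    # Combination: evaluate operator and operands, then apply.
    op, *operands = [simulate(sub, env) for sub in expr]
    return op(*operands)

env = {"bernoulli": lambda p: random.random() < p}
# (if (bernoulli 0.0) (uniform 0.0 1.0) 0.5): the predicate is always false,
# so the uniform branch is never simulated and need not even be bound.
weight = simulate(("if", ("bernoulli", 0.0), ("uniform", 0.0, 1.0), 0.5), env)
proc = simulate(("lambda", ("x",), ("if", "x", 1, 2)), {})
```

Note that this sketch treats if as a special form only for brevity; as the text explains, Venture can desugar branching into ordinary combinations.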

2.3 Inference Scopes

Venture programs may attach metadata to fragments of execution traces via a dynamic scoping construct called inference scopes. Scopes are defined in modeling expressions via the special form (scope_include <scope-expr> <block-expr> <expr>), which assigns all random choices required to simulate <expr> to a scope named by the value resulting from simulating <scope-expr>, and to a block within that scope named by the value resulting from simulating <block-expr>. A single random choice can be in multiple inference scopes, but can only be in one block within each scope. Also, a random choice is annotated with a scope each time the choice is simulated within the context of a scope_include form, not just the first time it is simulated.

Inference scopes can be referred to in inference expressions, thus providing a mechanism for associating custom inference strategies with different model fragments. For example, in a parameter estimation problem for hidden Markov models, it might be natural to have one scope for the hidden states, another for the hyperparameters, and a third for the parameters, where the blocks for the hidden state scope correspond to indexes into the hidden state sequence. We will see later how to write cycle hybrid kernels that use Metropolis-Hastings to make proposals for the hyperparameters and either single-site or particle Gibbs over the hidden states. Inference scopes also provide a means of controlling the allocation of computational effort across different aspects of a hypothesis, e.g. by only performing inference over scopes whose random choices are conditionally dependent on the choices made by a given PREDICT instruction of interest.

Random choices are currently tagged with (scope, block) pairs. Blocks can be thought of as subdivisions of scopes into meaningful (and potentially ordered) subsets. We will see later how inference expressions can make use of block structure to provide fine-grained control over inference and enable novel inference strategies. For example, the order in which a set of random choices is traversed by conditional sequential Monte Carlo can be controlled via blocks, regardless of the order in which they were constructed during initial simulation.

Scopes and blocks can be produced by random choices; <scope-expr>s and <block-expr>s are ordinary Venture modeling expressions. (Our current implementations restrict the values of scope and block names to symbols and integers for simplicity, but this restriction is not intrinsic to the Venture specification.) This enables the use of random choices in one scope to control the scope or block allocation of random choices in another scope. The random choices used to construct scopes and blocks may be auxiliary variables independent of the rest of the model, or latent variables whose distributions depend on the interaction of modeling assumptions, data and inference. At present, the only restriction is that inference on the random choices in a set of blocks cannot add or remove random choices from that set of blocks, though the probability of membership can be affected. Applications of randomized, inference-driven scope and block assignments include variants of cluster sampling techniques, beyond the spin glass (Swendsen and Wang, 1986) and regression (Nott and Green, 2004) settings where they have typically been deployed.
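The dynamic tagging behavior of scope_include can be sketched in a few lines of Python; the trace and tagging machinery here are invented for illustration and stand in for the bookkeeping a real trace would maintain:

```python
import random

active_tags = []   # stack of (scope, block) pairs currently in dynamic extent
choices = []       # one (value, tags) record per simulated random choice

def scope_include(scope, block, thunk):
    """Simulate thunk with (scope, block) added to the active tags."""
    active_tags.append((scope, block))
    try:
        return thunk()
    finally:
        active_tags.pop()

def flip(p, rng=random):
    """A random choice: records the tags of every enclosing scope_include."""
    value = rng.random() < p
    choices.append((value, list(active_tags)))
    return value

# Nested scope_includes place a single choice in two scopes at once,
# consistent with the rule that a choice may belong to multiple scopes
# but only one block within each scope.
scope_include("states", 0,
              lambda: scope_include("params", "mu", lambda: flip(0.5)))
tags = choices[0][1]
```

Because tagging happens dynamically, the same choice re-simulated under a different scope_include would be re-annotated, matching the behavior described above.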


The implementation details needed to handle random scopes and blocks are beyond the scope of this paper. However, we will later see analytical machinery that is sufficient for justifying the correctness of complex transition operators involving randomly chosen scopes and blocks.

Venture provides two built-in scopes (some implementations so far have merged the default and latents scopes, or triggered inference over latents automatically after every transition over the default scope):

  1. default — This scope contains every random choice. Previously proposed inference schemes for Church as well as concurrently developed generic inference schemes for variants of Venture correspond to single-line inference instructions acting on this default, global scope.

  2. latents — This scope contains all the latent random choices made by stochastic procedures but hidden from Venture. Using this scope, programmers can control the frequency with which any external inference systems are invoked, and interleave inference over external variables with inference over the latent variables managed by Venture.

2.4 Inference Expressions

Inference expressions specify transition operators that evolve the probability distribution on traces inside the Venture virtual machine. This is in contrast to instructions, which extend a model, add data, or initiate inference using a valid transition operator.

Venture provides several primitive forms for constructing transition operators that leave the conditioned distribution invariant, each of which is a valid inference expression. In each of these forms, scope must be a literal scope, and block must either be a literal block within that scope, or the keyword one or all. The “selected” set of random choices on which each inference expression acts is given by the specified scope and block. If the block specification is all, then the union of all blocks within the scope is taken. If the block specification is one, then one block is chosen uniformly at random from the set of all blocks within the given scope.
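The selection rule just described can be sketched as follows; the scope-to-block index is a hypothetical stand-in for the bookkeeping a PET maintains:

```python
import random

# scope -> block -> set of random-choice identifiers (illustrative data)
scopes = {
    "states": {0: {"x0"}, 1: {"x1"}, 2: {"x2"}},
}

def select(scope, block, rng=random):
    """Return the set of random choices an inference expression acts on."""
    blocks = scopes[scope]
    if block == "all":                      # union of every block in the scope
        return set().union(*blocks.values())
    if block == "one":                      # one block, uniformly at random
        return set(blocks[rng.choice(sorted(blocks))])
    return set(blocks[block])               # a literal block name

everything = select("states", "all")        # all three hidden-state choices
```

Each primitive inference form below then applies its transition operator to the choices returned by this kind of selection.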

The core set of inference expressions in Venture are as follows:

  1. (mh <scope> <block> <#-transitions>) — Propose new values for the selected choices either by resimulating them or by invoking a custom local proposal kernel if one has been provided. Accept or reject the results via the Metropolis-Hastings rule, accounting for changes to the mapping between random choices and scopes/blocks using the machinery provided later in this paper. Repeat the whole process #-transitions times.

  2. (rejection <scope> <block> <#-transitions>) — Use rejection sampling to generate an exact sample from the conditioned distribution on all the selected random choices. Repeat the whole process #-transitions times, potentially improving convergence if the selected set is randomly chosen, i.e. block is one. This transition operator is often computationally intractable, but is optimal, in the sense that it makes the most progress per completed transition towards the conditioned distribution on traces. All the other transition operators exposed by the Venture inference language can be viewed as asymptotically convergent approximations to it.

  3. (pgibbs <scope> <block> <#-particles> <#-transitions>) — Use conditional sequential Monte Carlo to propose from an approximation to the conditioned distribution over the selected set of random choices. If block is ordered, all the blocks in the scope are sorted, and each distribution in the sequence of distributions includes all the random choices from the next block. Otherwise, each distribution in the sequence includes a single random choice drawn from the selected set, and the ordering is arbitrary.

  4. (meanfield <scope> <block> <iters> <#-transitions>) — Use iters steps of stochastic gradient to optimize the parameters of a partial mean-field approximation to the conditioned distribution over the random choices in the given scope and block (with block interpreted as with mh). Make a single Metropolis-Hastings proposal using this approximation. Repeat the process #-transitions times.

  5. (enumerative_gibbs <scope> <block> <#-transitions>) — Use exhaustive enumeration to perform a transition over all the selected random choices from a proposal corresponding to the optimal conditional proposal (conditioned on the values of any newly created random variables). Random choices whose domains cannot be enumerated are resimulated from their prior unless they have been equipped with custom simulation kernels. If all selected random choices are discrete and no new random choices are created, this is equivalent to the rejection transition operator, and corresponds to a discrete, enumerative implementation of Gibbs sampling, hence the name. The computational cost scales exponentially with the number of selected random choices, rather than with the KL divergence between the prior and the conditional, as is the case for rejection (Freer et al., 2010).
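The acceptance step behind mh can be sketched for a single continuous random choice. In this minimal Python sketch (illustrative only; Venture applies the rule to a selected set of choices inside a trace, not to a bare float), resimulating from the prior makes the prior and proposal densities cancel, leaving a likelihood ratio:

```python
import math, random

def mh_resimulate(log_likelihood, prior_sample, x0, steps, rng):
    """One-variable sketch of the resimulation-based mh operator:
    propose by resimulating the choice from its prior, then accept
    or reject via the Metropolis-Hastings rule. Because the
    proposal is the prior, the acceptance ratio reduces to a
    likelihood ratio."""
    x = x0
    for _ in range(steps):
        proposal = prior_sample(rng)
        log_alpha = log_likelihood(proposal) - log_likelihood(x)
        if math.log(rng.random()) < log_alpha:
            x = proposal
    return x
```

For a N(0, 1) prior on the choice and a single observation y = 1 with unit observation noise, the posterior is N(0.5, 0.5), and long runs of this kernel recover its mean.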

Both mh and pgibbs are implemented by in-place mutation. However, versions of each that use simultaneous particles to represent alternative possibilities are given by func-mh and func-pgibbs. These are prepended with func to signal an aspect of their implementation: the simultaneously accessible sets of particles are implemented using persistent data structure techniques typically associated with pure functional programming. func-pgibbs can yield improvements in order of growth of runtime as compared to pgibbs, but it imposes restrictions on the selected random choices (to support multiple simultaneous particles, all stochastic procedures within the given scope and block must support a clone operation for their auxiliary state storage, or have the ability to emulate it; this is feasible for standard exponential family models, but may not be feasible for external inference systems hosted on distributed hardware).

There are also currently two composition rules for transition operators, enabling the creation of cycle and mixture hybrids:

  1. (cycle (<inference-expr-1> <inference-expr-2> ...) <#-transitions>) — This produces a cycle hybrid of the transition operators represented by the given inference expressions: each transition operator is run in sequence, and the whole sequence is repeated #-transitions times.

  2. (mixture ((<w1> <inference-expr-1>) (<w2> <inference-expr-2>) ...) <#-transitions>) — This produces a mixture hybrid of the given transition operators, using the given mixing weights, that is invoked #-transitions times.
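These two composition rules can be sketched in a few lines of Python. The kernel signature (state, rng) -> state is a hypothetical stand-in for a Venture transition operator acting on a trace:

```python
import random

def cycle(kernels, transitions):
    """Cycle hybrid: run each transition operator in sequence and
    repeat the whole sequence `transitions` times."""
    def composed(state, rng):
        for _ in range(transitions):
            for kernel in kernels:
                state = kernel(state, rng)
        return state
    return composed

def mixture(weighted_kernels, transitions):
    """Mixture hybrid: on each of `transitions` invocations, pick
    one operator at random according to the mixing weights."""
    weights = [w for w, _ in weighted_kernels]
    kernels = [k for _, k in weighted_kernels]
    def composed(state, rng):
        for _ in range(transitions):
            kernel = rng.choices(kernels, weights=weights)[0]
            state = kernel(state, rng)
        return state
    return composed
```

Both combinators return another kernel of the same shape, so cycles of mixtures (and vice versa) compose freely, mirroring the nesting of inference expressions.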

This language is flexible enough to express a broad class of standard approximate inference strategies as well as novel combinations of standard inference algorithm templates such as conditional sequential Monte Carlo, Metropolis-Hastings, Gibbs sampling and mean-field variational inference. Additionally, the ability to use random variables to map random choices to inference strategies and to perform inference over these variables may enable new cluster sampling techniques. That said, from an aesthetic standpoint, the current inference language also has many limitations, some of which seem straightforward to relax. For example, it seems natural to expand inference expressions to support arbitrary modeling expressions, and thereby also support arbitrary computation to produce inference schemes. The machinery needed to support these and other natural extensions is discussed later in this paper.

2.5 Values

Venture values include the usual scalar and symbolic data types from Scheme, along with extended support for collections and additional datatypes corresponding to primitive objects from probability theory and statistics. Venture also supports the stochastic procedure datatype, used for built-in and user-added primitive procedures as well as compound procedures returned by lambda. A full treatment of the value hierarchy is out of scope, but we provide a brief list of the most important values here:

  1. Numbers: roughly analogous to floating point numbers, e.g. 1, 2.4, -23, and so on.

  2. Atoms: discrete items with no internal structure or ordering. These are generated by categorical draws, as well as by Dirichlet and Pitman-Yor processes.

  3. Symbols: symbol values, such as the name of a formal argument being passed to lambda, the name associated with an ASSUME instruction, or the result of evaluating a quote special form.

  4. Collections: vectors, which map numbers to values and support O(1) random access, and maps, which map values to values and support O(1) amortized random access (via a hash table that relies on the built-in hash function associated with each kind of value).

  5. Stochastic procedures: these include the components of the standard library, and can also be created by lambda and other stochastic procedures.

2.6 Automatic inference versus inference programming

Although Venture programs can incorporate custom inference strategies, Venture does not require them. Interfaces that are as automatic as those of existing probabilistic programming systems are straightforward to implement. Single-site Metropolis-Hastings and Gibbs sampling algorithms — the sole automatic inference option in many probabilistic programming systems — can be invoked with a single instruction. We have also seen that global sequential Monte Carlo and mean-field algorithms are similarly straightforward to describe. Support for programmable inference does not necessarily increase the educational burden on would-be probabilistic programmers, although it does provide a way to avoid limiting probabilistic programmers to a potentially inadequate set of inference strategies.

The idea that inference strategies can be formalized as structured, compositionally specified inference programs operating on model programs is, to the best of our knowledge, new to Venture. Under this view, standard inference algorithms actually correspond to primitive inference programming operations or program templates, some of which depend on specific features of the model program being acted upon. This perspective suggests that far more complex inference strategies should be possible if the right primitives, means of combination and means of abstraction can be identified. Considerations of modularity, analyzability, soundness, completeness, and reuse will become central, and will be complicated by the interaction between inference programs and model programs. For example, inference programmers will need to be able to predict the asymptotic scaling of inference instructions, factoring out the contribution of the computational complexity of the model expressions to which a given inference instruction is being applied. Another example comes from considering abstraction and reuse. It should be possible to write compound inference procedures that can be reused across different models, and perhaps even use inference to learn these procedures via an appropriate hierarchical Bayesian formulation.

Another view, arguably closer to the mainstream view in machine learning, is that inference algorithms are better thought of by analogy to mathematical programming and operations research, with each algorithm corresponding to a “solver” for a well-defined class of problems with certain structure. This perspective suggests that there is likely to be a small set of monolithic, opaque mechanisms that are sufficient for most important problems. In this setting, one might hope that inference mechanisms can be matched to models and problems via simple heuristics, and that the problem of automatically generating high-quality inference strategies will prove easier than query planning for databases, and will be vastly easier than automatic programming.


It remains to be seen whether the traditional view is sufficient in practice or if it underestimates the richness of inference and its interaction with modeling and problem specification.

2.7 Procedural and Declarative Interpretations


We briefly consider the relationship between procedural and declarative interpretations of Venture programs.

Venture code has a direct procedural reading: it defines a probabilistic generative process that samples hypotheses, checks constraints, and invokes inference instructions that trigger specific algorithms for reconciling the hypotheses with the constraints. Re-orderings of the instructions can significantly impact runtime and change the distribution on outputs. The divergence between the true conditioned distribution on execution traces and the distribution encoded by the program may depend strongly on what inference instructions are chosen and how they are interleaved with the incorporation of data.

Venture code also has declarative readings that are unaffected by some of these procedural details. One way to formalize the meaning of a Venture program is as a probability distribution over execution traces. A second approach is to ignore the details of execution and restrict attention to the joint probability distribution of the values of all PREDICTs so far. A third approach, consistent with Venture’s interactive interface, is to equate the meaning of a program with the probability distribution of the values of all the PREDICTs in all possible sequences of instructions that could be executed in the future. Under the second and third readings, many programs are equivalent, in that they induce the same distribution albeit with different scaling behavior.

As the amount of inference performed at each INFER instruction increases, these interpretations coalesce, recovering a simple semantics based on sequential Bayesian reasoning. Consider a program in which all inference instructions are replaced with exact sampling — [INFER (rejection default one 1)] — or with a sufficiently large number of transitions of a generic inference operator, such as [INFER (mh default one 1000000)]. In this case, each INFER implements a single step of sequential Bayesian reasoning, conditioning the distribution on traces on all the OBSERVEs since the last INFER. The distribution after each INFER becomes equivalent to the distribution represented by all programs with the same ASSUMEs, all the OBSERVEs before the INFER (in any order), and a single INFER. The computational complexity varies based on the ordering and interleaving of INFERs and OBSERVEs, but the declarative meaning is unchanged. Although correspondence with these declarative, fully Bayesian semantics may require an unrealistic amount of computation in real-world applications, close approximations can be useful tools for debugging, and the presence of the limit may prove useful for probabilistic program analysis and transformation.
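The order-invariance of the declarative reading can be checked concretely with exact conditioning by enumeration over a discrete hypothesis space, a stand-in for the rejection (or long-mh) limit discussed above. All names in this Python sketch are illustrative:

```python
from itertools import permutations

def condition(prior, likelihood, observations):
    """Exact sequential conditioning by enumeration: multiply the
    prior over a discrete hypothesis space by each observation's
    likelihood, then normalize. A stand-in for exact inference in
    the limit; not Venture's implementation."""
    weights = dict(prior)
    for y in observations:
        # Each OBSERVE multiplies in the likelihood of its datum.
        weights = {h: w * likelihood(h, y) for h, w in weights.items()}
    z = sum(weights.values())
    return {h: w / z for h, w in weights.items()}
```

For a coin whose bias is 0.2 or 0.8 with equal prior probability, every ordering of the observations [1, 0, 1] yields the same posterior (0.8 on the upward-biased coin), illustrating that the declarative meaning is independent of OBSERVE ordering.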

Venture programs represent distributions by combining modeling operations that sample values for expressions, constraint specification operations that build up a conditioner, and inference operations that evolve the distribution closer to the conditional distribution induced by a conditioner. Later in this paper we will see how to evaluate the partial probability densities of probabilistic execution traces under these distributions, as well as the ratios and gradients of these partial densities that are needed for a wide range of inference schemes.

2.8 Markov chain and sequential Monte Carlo architectures

The current Venture implementation maintains a single probabilistic execution trace per virtual machine instance. This trace is initialized by simulating the code from the ASSUME and OBSERVE instructions, and stochastically modified during inference via transition operators that leave the current conditioned distribution on traces invariant. The prior and posterior probability distributions on traces are implicit, but can be probed by repeatedly re-executing the program and forming Monte Carlo estimates.

This Markov chain architecture has been chosen for simplicity, but sequential Monte Carlo architectures based on weighted collections of traces are also possible and indeed straightforward. The number of initial traces could be specified via an INFER instruction at the beginning of the program. Forward simulation would be nearly unchanged. OBSERVE instructions would attach weights to traces based on the “likelihood” probability density corresponding to the constrained random choice in each observation, and PREDICT instructions would read out their values from a single, arbitrarily chosen “active” trace. An [INFER (resample <k>)] instruction would then implement multinomial resampling, and also change the active trace to ensure that PREDICTs are always mutually consistent. Venture’s other inference programming instructions could be treated as rejuvenation kernels (Del Moral et al., 2006), and would not need to modify the weights (the only subtlety is that transition operators must not change which random choice is being constrained, as this would require changing the weight of the trace). This kind of sequential Monte Carlo implementation would have the advantage that the weights could be used to estimate marginal probability densities of given OBSERVEs, and that another source of non-embarrassing parallelism would be exposed. Integrating sophisticated coupling strategies from the SMC framework (Whiteley et al., 2013) into the inference language could also prove fruitful.
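The weighting and resampling steps of this hypothetical architecture can be sketched in Python (traces are opaque placeholders here; real traces would be PETs):

```python
import math, random

def observe(weights, log_likelihoods):
    """OBSERVE under a weighted-trace architecture: each trace's
    weight is multiplied by the likelihood of the observed value
    under that trace."""
    return [w * math.exp(ll) for w, ll in zip(weights, log_likelihoods)]

def resample(traces, weights, k, rng):
    """[INFER (resample k)] as multinomial resampling: draw k
    traces with probability proportional to weight, then reset the
    weights to uniform. The first surviving trace can serve as the
    'active' trace that PREDICTs read from."""
    survivors = rng.choices(traces, weights=weights, k=k)
    return survivors, [1.0] * k
```

Note that a trace whose observation likelihood is zero receives weight zero and can never survive resampling, which is how inconsistent hypotheses are pruned.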

Running separate Venture virtual machines is guaranteed to produce independent samples from the distribution represented by the Venture program. This distribution will typically only approximate some desired conditional. If there is no need to quantify uncertainty precisely, then Venture programmers can append repetitions of a sequence of INFER and PREDICT instructions to their program. Unless the INFER instructions use rejection sampling, this choice yields PREDICT outputs that are dependent under both Markov chain and sequential Monte Carlo architectures. It only approximates the behavior of independent runs of the program. Application constraints will determine what approximation strategies are most appropriate for each problem.


2.9 Examples

Here we give simple illustrations of the Venture language, including some standard modeling idioms as well as the use of custom inference instructions. Venture has also been used to implement applications of several advanced modeling and inference techniques; examples include generative probabilistic graphics programming (Mansinghka*, Kulkarni*, Perov, and Tenenbaum, 2013) and topic modeling (Blei et al., 2003). A description of these and other applications is beyond the scope of this paper.


2.9.1 Hidden Markov Models

To represent a Hidden Markov model in Venture, one can use a stochastic recursion to capture the hidden state sequence, and index into it by a stochastic observation procedure. Here we give a variant with continuous observations, a binary latent state, and an a priori unbounded number of observation sequences to model:

[ASSUME observation_noise (scope_include ’hypers unique (gamma 1.0 1.0))]
[ASSUME get_state
  (mem (lambda (seq t)
      (scope_include ’state t
         (if (= t 0)
             (bernoulli 0.3)
             (transition_fn (get_state seq (- t 1)))))))]
[ASSUME get_observation
  (mem (lambda (seq t)
         (observation_fn (get_state seq t))))]
[ASSUME transition_fn
  (lambda (state)
    (bernoulli (if state 0.7 0.3)))]
[ASSUME observation_fn
  (lambda (state)
    (normal (if state 3 -3) observation_noise))]
[OBSERVE (get_observation 1 1) 3.6]
[INFER (mh default one 10)]
[OBSERVE (get_observation 1 2) -2.8]
[INFER (mh default one 10)]
<...>

This is a sequentialized variant of the “default” resimulation-based Metropolis-Hastings inference scheme. If all but the last INFER statement were removed, the program would yield the same stochastic transitions as several Church implementations, but with linear (rather than quadratic) scaling in the length of the sequence. Interleaving inference with the addition of observations improves over bulk incorporation of observations by mitigating some of the strong conditional dependencies in the posterior.
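The generative process above can be mirrored in a host language, with memoization playing the role of mem. This is an illustrative sketch only: in Venture these random choices live inside the trace and remain subject to inference, whereas here they are frozen once sampled.

```python
import random
from functools import lru_cache

def make_hmm(rng, observation_noise):
    """Host-language mirror of the Venture HMM above: a memoized
    stochastic recursion over the binary latent state, indexed by
    a stochastic observation procedure."""
    @lru_cache(maxsize=None)
    def get_state(seq, t):
        if t == 0:
            return rng.random() < 0.3
        prev = get_state(seq, t - 1)
        return rng.random() < (0.7 if prev else 0.3)

    @lru_cache(maxsize=None)
    def get_observation(seq, t):
        state = get_state(seq, t)
        return rng.gauss(3 if state else -3, observation_noise)

    return get_state, get_observation
```

Memoization is what makes repeated references to (get_state seq t) denote a single random variable rather than fresh draws.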

Another inference strategy is particle Markov chain Monte Carlo. For example, one could combine Metropolis-Hastings moves on the hyperparameters, given the latent states, with a conditional sequential Monte Carlo approximation to Gibbs over the hidden states given the hyperparameters and observations. Here is one implementation of this scheme, where 10 transitions of Metropolis-Hastings are done on the hyperparameters for every 5 transitions of approximate Gibbs based on 30 particles, all repeated 50 times:

[INFER (cycle ((mh hypers one 10) (pgibbs state ordered 30 5)) 50)]

The global particle Gibbs algorithm from (Wood et al., 2014) with 30 particles would be expressed as follows:

[INFER (pgibbs default ordered 30 100)]

Note that Metropolis-Hastings transitions can be more effective than pure conditional sequential Monte Carlo for handling global parameters (Andrieu et al., 2010). This is because MH moves allow hyperparameter inference to be constrained by all latent states.

In real-time applications, hyperparameter inference is sometimes skipped. Here is one representation of a close relative of a standard 30-particle particle filter that uses randomly chosen hyperparameters and yields a single latent trajectory (the only difference is that a particle filter exposes all its weighted particles for forming Monte Carlo expectations or for rapidly obtaining a set of approximate samples; only straightforward modifications are needed for Venture to expose a set of weighted traces instead of a single trace and literally recover particle filtering):

[INFER (pgibbs state ordered 30 1)]

2.9.2 Hierarchical Nonparametric Bayesian Models

Here we show how to implement one version of a multidimensional Dirichlet process mixture of Gaussians (Rasmussen, 1999):

[ASSUME alpha (scope_include ’hypers 0 (gamma 1.0 1.0))]
[ASSUME scale (scope_include ’hypers 1 (gamma 1.0 1.0))]
[ASSUME crp (make_crp alpha)]
[ASSUME get_cluster (mem (lambda (id)
  (scope_include ’clustering id (crp))))]
[ASSUME get_mean (mem (lambda (cluster dim)
  (scope_include ’parameters cluster (normal 0 10))))]
[ASSUME get_variance (mem (lambda (cluster dim)
  (scope_include ’parameters cluster (gamma 1 scale))))]
[ASSUME get_component_model (lambda (cluster dim)
  (lambda () (normal (get_mean cluster dim) (get_variance cluster dim))))]
[ASSUME get_datapoint (mem (lambda (id dim)
  ((get_component_model (get_cluster id dim)))))]
[OBSERVE (get_datapoint 0 0) 0.2]
<...>
; default resimulation-based Metropolis-Hastings scheme
[INFER (mh default one 100)]

The parameters are explicitly represented, i.e. “uncollapsed”, rather than integrated out as they often are in practice. While the default resimulation-based Metropolis-Hastings scheme can be effective on this problem, it is also straightforward to balance the computational effort differently:

[INFER (cycle ((mh hypers one 1)
                        (mh parameters one 5)
                        (mh clustering one 5))
                        1000)]

On each execution of this cycle, one hyperparameter transition, five parameter transitions (each to both parameters of a randomly chosen cluster), and five cluster reassignments are made. Which hyperparameter, parameters and cluster assignments are chosen is random. As the number of data points grows, the ratio of computational effort devoted to inference over the hyperparameters and cluster parameters to the cluster assignments is higher than it is for the default scheme. Note that the complexity of this inference instruction, as well as the computational effort ratio, depend on the number of clusters in the current trace.

It is also straightforward to use a structured particle Markov chain Monte Carlo scheme over the cluster assignments:

[INFER (mixture ((0.2 (mh hypers one 10))
                            (0.5 (mh parameters one 5))
                            (0.3 (pgibbs clustering ordered 2 2)))
                            100)]

Due to the choice of only 2 particles for the pgibbs inference strategy, this scheme closely resembles an approximation to blocked Gibbs over the indicators based on a sequential initialization of the complete model. Also note that despite the low mixing weight on the clustering scope in the mixture, this inference program allocates asymptotically greater computational effort to inference over the cluster assignments than the previous strategy. This is because the pgibbs transition operator is guaranteed to reconsider every single cluster assignment.


2.9.3 Inverse Interpretation

We now describe inverse interpretation, a modeling idiom that is only possible in Turing-complete languages. Recall that Venture modeling expressions are easy to represent as Venture data objects, and Venture models can invoke the evaluation and application of arbitrary Venture stochastic procedures. These Scheme-like features make it straightforward to write an evaluator — perhaps better termed a simulator — for a Turing-complete, higher-order probabilistic programming language.

This application highlights Turing-completeness and also embodies a potentially appealing new path for solving problems of probabilistic program synthesis. In less expressive languages, learning programs (or structure) requires custom machinery that goes beyond what is provided by the language itself. In Venture, the same inference machinery used for state estimation or causal inference can be brought to bear on problems of probabilistic program synthesis. The dependency tracking and inference programming machinery that is general to Venture can thus be applied to approximately Bayesian learning of probabilistic programs in a Venture-like language (although we are still far from a study of the expressiveness of probabilistic languages via definitional interpretation (Reynolds, 1972; Abelson and Sussman, 1983), it seems likely that probabilistic programming formulations of probabilistic program synthesis — inference over a space of probabilistic programs, possibly including inference instructions, and an interpreter for those programs — will be revealing).

We first define some utility procedures for manipulating references, symbols and environments:

[ASSUME make_ref (lambda (x) (lambda () x))]
[ASSUME deref (lambda (x) (x))]
[ASSUME incremental_initial_environment
  (lambda ()
    (list
      (dict
        (list (quote bernoulli)
              (quote normal)
              (quote plus)
              (quote times)
              (quote branch))
        (list (make_ref bernoulli)
              (make_ref normal)
              (make_ref plus)
              (make_ref times)
              (make_ref branch)))))]
[ASSUME extend_env
  (lambda (outer_env syms vals)
    (pair (dict syms vals) outer_env))]
[ASSUME find_symbol
  (lambda (sym env)
    (if (contains (first env) sym)
        (lookup (first env) sym)
        (find_symbol sym (rest env))))]

The most interesting of these are make_ref and deref. These use closures to pass references around the trace, using an idiom that avoids unnecessary growth of scaffolds. Consider an execution trace in which a value that is the argument to make_ref becomes the principal node of a transition. Only those uses of the reference that have been passed to deref will become resampling nodes. The reference itself is unchanged, though the value it refers to is not. This permits dependence tracking through the construction of complex data structures.
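In any host language with closures, the idiom is two one-liners; here is a Python sketch of the same two procedures:

```python
def make_ref(x):
    # Wrap the value in a closure; passing the closure around a
    # data structure does not itself create a dependence on x.
    return lambda: x

def deref(ref):
    # Forcing the closure is what creates the dependence: only
    # call sites that deref would become resampling nodes.
    return ref()
```

A list of references can thus be built, copied, and passed through many procedures, with only the eventual deref sites depending on the underlying values.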

Given this machinery, it is straightforward to write an evaluator for a simple function language that has access to arbitrary Venture primitives:

[ASSUME incremental_venture_apply
  (lambda (op args) (eval (pair op (map_list deref args)) (get_empty_environment)))]
[ASSUME incremental_apply
  (lambda (operator operands)
    (incremental_eval (deref (lookup operator 2))
                      (extend_env (deref (lookup operator 0))
                                          (deref (lookup operator 1))
                                          operands)))]
[ASSUME incremental_eval
  (lambda (expr env)
    (if (is_symbol expr)
        (deref (find_symbol expr env))
            (if (not (is_pair expr))
                expr
            (if (= (deref (lookup expr 0)) (quote lambda))
                          (pair (make_ref env) (rest expr))
                          ((lambda (operator operands)
                             (if (is_pair operator)
                               (incremental_apply operator operands)
                                 (incremental_venture_apply operator operands)))
                           (incremental_eval (deref (lookup expr 0)) env)
                           (map_list (lambda (x)
                                             (make_ref (incremental_eval (deref x) env)))
                                     (rest expr)))))))]
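The control flow of incremental_eval can be transliterated into Python to make it concrete. This is a deterministic, illustrative sketch: strings stand in for symbols, Python lists for Venture pairs, a closure is represented as [ref-to-env, formals, body], and only two host primitives are installed. The point of the Venture version, which this sketch cannot capture, is that it runs inside a trace with dependence tracking.

```python
def make_ref(x): return lambda: x
def deref(ref): return ref()

def find_symbol(sym, env):
    # env is a list of frames (dicts), innermost first.
    for frame in env:
        if sym in frame:
            return frame[sym]
    raise KeyError(sym)

def incremental_eval(expr, env):
    """Symbols look up references in the environment, `lambda`
    builds a closure, and applications evaluate the operator and
    operands, then apply either a closure or a host primitive."""
    if isinstance(expr, str):
        return deref(find_symbol(expr, env))
    if not isinstance(expr, list):
        return expr  # self-evaluating literal
    if expr[0] == "lambda":
        return [make_ref(env)] + expr[1:]
    operator = incremental_eval(expr[0], env)
    operands = [make_ref(incremental_eval(sub, env)) for sub in expr[1:]]
    if isinstance(operator, list):  # compound procedure
        env_ref, formals, body = operator
        frame = dict(zip(formals, operands))
        return incremental_eval(body, [frame] + deref(env_ref))
    return operator(*[deref(arg) for arg in operands])  # host primitive

INITIAL_ENV = [{
    "plus": make_ref(lambda a, b: a + b),
    "times": make_ref(lambda a, b: a * b),
}]
```

Evaluating ["plus", 1, ["times", 2, 3]] in INITIAL_ENV follows the same operator/operand recursion as the Venture code above.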

It is also possible to generate the input exprs using another Venture program, and use general-purpose, Turing-complete inference mechanisms to explore a hypothesis space of expressions given constraints on the values that result. We call this the inverse interpretation approach to probabilistic program synthesis. As in Church — and contrary to (Liang et al., 2010) — inverse interpretation algorithms are not limited to rejection sampling. But while Church was limited to a single-site Metropolis-Hastings scheme, Venture programmers have more options. In Venture it is possible to associate portions of the program source (and portions of the induced program’s executions) with custom inference strategies.

Here is an example expression grammar that can be used with the incremental evaluator:

[ASSUME genBinaryOp (lambda () (if (flip) (quote plus) (quote times)))]
[ASSUME genLeaf (lambda () (normal 4 3))]
[ASSUME genVar (lambda (x) x)]
[ASSUME genExpr
  (lambda (x)
    (if (flip 0.4)
      (genLeaf)
      (if (flip 0.8)
         (genVar x)
         (list (make_ref (genBinaryOp)) (make_ref (genExpr x)) (make_ref (genExpr x))))))]
[ASSUME noise (gamma 5 .2)]
[ASSUME expr (genExpr (quote x))]
[ASSUME f
  (mem
    (lambda (y)
      (incremental_eval expr
                        (extend_env (incremental_initial_environment)
                                            (list (quote x))
                                            (list (make_ref y))))))]
[ASSUME g (lambda (z) (normal (f z) noise))]
[OBSERVE (g 1) 10] ;f(x) = 5x + 5
[OBSERVE (g 3) 20]
<...>

Indirection via references substantially improves the asymptotic scaling of programs like these. When a given production rule in the grammar is resimulated, only those portions of the execution of the program that depend on the changed source code are resimulated. A naive implementation of an evaluator would not have this property.

Scaling up this approach to larger symbolic expressions and small programs will require multiple advances. Overall system efficiency improvements will be necessary. Inverse interpretation also may benefit from additional inference operators, such as Hamiltonian Monte Carlo for the continuous parameters. Expression priors with inference-friendly structures would also help. For example, a prior where resimulation recovers some of the search moves from (Duvenaud et al., 2013) may be expressible by separately generating symbolic expressions and the contents of the environment into which they are evaluated. Longer term, it may be fruitful to explore formalizations of some of the knowledge taught to programmers using probabilistic programming.


3 Stochastic Procedures

Random choices in Venture programs arise due to the invocation of stochastic procedures (SPs). These stochastic procedures accept input arguments that are values in Venture and sample output values given those inputs. Venture includes a built-in stochastic procedure library, which includes SPs that construct other SPs, such as make_csp, the SP to which the special form lambda is desugared. Stochastic procedures can also be added as extensions to Venture, and provide a mechanism for incremental optimization of Venture programs. Model fragments for which Venture delivers inadequate performance can be migrated to native inference code that interoperates with the enclosing Venture program.

3.1 Expressiveness and extensibility

Many typical random variables, such as draws from a Bernoulli or Gaussian distribution, are straightforward to represent computationally. One common approach has been to use a pair of functions: a simulator that maps from an input space of values (together with a stream of random bits) to an output space of values, and a marginal density that maps each input-output pair to a nonnegative real number. This representation corresponds to the “elementary random procedures” supported by early Church implementations. Repeated invocations of such procedures correspond to IID sequences of random variables whose densities are known. While simple and intuitive, this interface does not naturally handle many useful classes of random objects. In fact, many objects that are easy to express as compound procedures in Church and in Venture cannot be made to fit in this form.

Stochastic procedures in Venture support a broader class of random objects:

  1. Higher-order stochastic procedures, such as mem (including stochastic memoization), apply and map. Higher-order procedures may accept procedures as arguments, apply these procedures internally, and produce procedures as return values. Stochastic procedures in Venture are equipped with a simple mechanism for handling these cases. In fact, it turns out that all structural changes to execution traces — including those arising from the execution of constructs that affect control flow, such as IF — can be handled by this mechanism. This simplifies the development of inference algorithms, and permits users to extend Venture by adding new primitives that affect the flow of control.

  2. Stochastic procedures whose applications are exchangeably coupled. Examples include collapsed representations of conjugate models from Bayesian statistics, combinatorial objects from Bayesian nonparametrics such as the Chinese Restaurant and Pitman-Yor processes, and probabilistic sequence models (such as HMMs) whose hidden state sequences can be efficiently marginalized out. Support for these primitives whose applications are coupled is important for recovering the efficiency of manually optimized samplers, which frequently make use of collapsed representations. Whereas exchangeable primitives in Church are thunks, which prohibits collapsing many important models such as HMMs, Venture supports primitives whose applications are row-wise partially exchangeable across different sets of arguments. The formal requirement is that the cumulative log probability density of any sequence of input-output pairs is invariant under permutation.

  3. Likelihood-free stochastic procedures that lack tractable marginal densities. Complex stochastic simulations can be incorporated into Venture even if the marginal probability density of the outputs of these simulations given the inputs cannot be efficiently calculated. Models from the literature on Approximate Bayesian Computation (ABC), where priors are defined over the outcome of forward simulation code, can thus be naturally supported in Venture. Additionally, a range of “doubly intractable” inference problems, including applications of Venture to reasoning about the behavior of approximately Bayesian reasoning agents, can be included using this mechanism.

  4. Stochastic procedures with external latent variables that are invisible to Venture. There will always be models that admit specialized inference strategies whose efficiency cannot be recovered by performing generic inference on execution traces. One of the principal design decisions in Venture is to allow these strategies to be exploited whenever possible by supporting a broad class of stochastic procedures with custom inference over internal latent variables, hidden from the rest of Venture. The stochastic procedure interface thus serves as a flexible bridge between Venture and foreign inference code, analogous to the role that foreign function interfaces (FFIs) play in traditional programming languages.

3.2 Primitive stochastic procedures

Informally, a primitive stochastic procedure (PSP) is an object that can simulate from a family of distributions indexed by some arguments. In addition to simulating, PSPs may be able to report the logdensity of an output given an input, and may incorporate and unincorporate information about the samples drawn from them using mutation, e.g. in the case of a conjugate prior. This mutation cannot be arbitrary: the draws from the PSP must remain row-wise partially exchangeable as discussed above. A PSP may also have custom proposal kernels, in which case it must be able to return its contribution to the Metropolis-Hastings acceptance rate. For example, the PSP that simulates Gaussian random variables may provide a drift kernel that proposes small steps around its previous location, rather than resampling from the prior distribution.

Primitive stochastic procedures are parameterized by the following properties and behaviors:

  1. isStochastic() — does this PSP consume randomness when it is invoked?

  2. canAbsorbArgumentChanges() — can this PSP absorb changes to its input arguments? If true, then this PSP must correctly implement logdensity() as described below.

  3. childrenCanAbsorbAtApplications() — does this PSP return an SP that implements the short-cut “absorbing at applications” optimization, needed to integrate optimized expressions for the log marginal probability of sufficient statistics in standard conjugate models?

  4. value = simulate(args) — samples a value for an application of the PSP, given the arguments args

  5. logp = logdensity(value, args) — an optional procedure that evaluates the log probability density of an output given the input arguments. (This density is implicitly defined with respect to a PSP-specific, but argument-independent, choice of dominating measure. For PSPs that are guaranteed to produce discrete outputs, the measure is assumed to be the counting measure, so logdensity is equivalent to the log probability mass function. A careful measure-theoretic treatment of Venture is left for future work.)

  6. incorporate(aux, value, args) — incorporate the value stored in the variable named value into the auxiliary storage aux associated with the SP that contains the PSP. incorporate() is used to implement SPs whose applications are exchangeably coupled. While it is always sufficient to store and update the full set of values returned for each observed args, often only the counts (or some other sufficient statistics) are necessary.

  7. unincorporate(aux, value, args) — remove value from the auxiliary storage, restoring it to a state consistent with all other values that have been incorporated but not unincorporated; this is done when an application is unevaluated.
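Putting these behaviors together, a collapsed beta-Bernoulli PSP can be sketched as follows. The class and method names are illustrative rather than Venture's actual API; the points being made above are the counts-in-aux pattern for exchangeably coupled applications and the permutation invariance of the cumulative log density:

```python
import math
import random

class BetaBernoulliAux:
    """Sufficient statistics for a collapsed beta-Bernoulli (illustrative)."""
    def __init__(self):
        self.heads = 0
        self.tails = 0

class CollapsedBetaBernoulliPSP:
    """Sketch of a PSP whose applications are exchangeably coupled."""
    def __init__(self, alpha, beta):
        self.alpha, self.beta = alpha, beta

    def is_stochastic(self):
        return True

    def can_absorb_argument_changes(self):
        return True

    def simulate(self, aux, args=()):
        # Posterior predictive probability of True given the counts so far.
        n = self.alpha + self.beta + aux.heads + aux.tails
        return random.random() < (self.alpha + aux.heads) / n

    def logdensity(self, aux, value, args=()):
        n = self.alpha + self.beta + aux.heads + aux.tails
        p = (self.alpha + aux.heads) / n
        return math.log(p if value else 1.0 - p)

    def incorporate(self, aux, value, args=()):
        if value:
            aux.heads += 1
        else:
            aux.tails += 1

    def unincorporate(self, aux, value, args=()):
        if value:
            aux.heads -= 1
        else:
            aux.tails -= 1
```

Only the counts are stored, not the full sequence of return values, and the cumulative log probability of any sequence of draws is invariant under permutation, as the formal requirement above demands.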


3.3 The Stochastic Procedure Interface

The stochastic procedure interface specifies the contract that Venture primitives must satisfy to preserve the coherence of Venture’s inference mechanisms. It also serves as the vehicle by which external inference systems can be integrated into Venture. This interface preserves the ability of primitives to dynamically create and destroy internal latent variables hidden from Venture and to perform custom inference over these latent variables.

3.3.1 Definition

Definition 3.1 (Stochastic procedure).

A stochastic procedure is a pair of PSPs (a request PSP and an output PSP), along with a latent variable simulator, where

  1. The request PSP returns:

    1. A list of (address, expression, environment) tuples that represent expressions whose values must be available to the output PSP before it can start its simulation. We refer to requests of this form as exposed simulation requests (ESRs),

    2. A list of opaque tokens that can be interpreted by the SP as the latent variables that the output PSP will need in order to simulate its output, along with the values of the exposed simulation requests. We refer to requests of this form as latent simulation requests (LSRs).

  2. The latent variable simulator responds to LSRs by simulating any latent variables requested.

  3. The output PSP returns the final output of the procedure, conditioned on the inputs, the results of any of the exposed simulation requests, and the results of any latent simulation requests.

3.3.2 Exposed Simulation Requests

We want our procedures to be able to pass (expression, environment) pairs to Venture for evaluation, and make use of the results in some way. A procedure may also have multiple applications that all make use of a shared evaluation; in these cases the procedure must take care to request the same address each time, so that the expression will only be evaluated the first time and reused thereafter.

Specifically, an ESR request of the form (addr, expr, env) is handled by Venture as follows. First Venture checks the requesting SP’s namespace to see if it already has an entry with address addr. If it does not, then Venture evaluates expr in env, and adds the mapping from addr to root to the SP’s namespace, where root is the root of the evaluation tree. If the SP’s namespace does contain addr, then Venture can look up root directly. Either way, Venture wires in root as an extra argument to the output node.
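A minimal sketch of this bookkeeping, with hypothetical names (handle_esr, the namespace dict, and the evaluate callback are ours, not Venture's API):

```python
class Node:
    """Toy stand-in for a PET node."""
    def __init__(self, value=None):
        self.value = value
        self.extra_args = []  # ESR roots wired in as extra arguments

def handle_esr(sp_namespace, addr, expr, env, evaluate, output_node):
    """Sketch of ESR handling: evaluate on first request, reuse thereafter.

    sp_namespace maps ESR addresses to evaluation-tree roots and lives in
    the requesting SP's auxiliary state."""
    if addr not in sp_namespace:
        # First request for this address: evaluate expr in env, record the root.
        sp_namespace[addr] = evaluate(expr, env)
    root = sp_namespace[addr]
    # Wire the root in as an extra argument to the output node.
    output_node.extra_args.append(root)
    return root
```

Two applications requesting the same address then share a single evaluation, which is how an SP family can have multiple requesting parents.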

3.3.3 Latent Simulation Requests and the Foreign Inference Interface

Some procedures may want to simulate and perform inference over latent variables that are hidden from Venture. Consider an optimized implementation of a hidden Markov model integrated into the following Venture program:

[ASSUME my_hmm (make_hmm 10 0.1 5 0.2)]
[OBSERVE (my_hmm 0 0) 0]
[OBSERVE (my_hmm 0 1) 1]
[OBSERVE (my_hmm 0 2) 0]
[INFER (mh default one 20)]
[PREDICT (my_hmm 0 3)]
[PREDICT (my_hmm 1 0)]

Here we have a constructor SP (make_hmm <num-states> <transition-hyper> <num-output-symbols> <observation-hyper>) that generates an (uncollapsed) hidden Markov model by generating the rows of the transition and observation matrices at random. The assumption is that make_hmm is a primitive, although it would be straightforward to implement make_hmm as a compound procedure in Venture, using variations of the examples presented earlier. The constructor returns a procedure, bound to the symbol my_hmm, that permits observations from this process to be queried via (my_hmm <sequence-id> <index>). The program adds a sequence fragment of length three and then requests predictions for the next observation in the sequence, as well as the initial observation from an entirely new sequence.

This probabilistic program captures a common pattern: integrating Venture with a foreign probabilistic model fragment that can be dynamically queried and contains latent variables hidden from Venture. It is useful to partition the random choices in this program as follows. The transition and observation matrices of the HMM can be viewed as part of the value of the my_hmm SP and therefore returned by the output PSP of the make_hmm SP. The observations are managed by Venture as the applications of the my_hmm SP. The hidden states, however, are fully latent from the standpoint of Venture, yet need to be created, updated and destroyed as invocations of my_hmm are created or destroyed and as their arguments (or the arguments to make_hmm) change.

Venture makes it possible for procedures to instantiate latent variables only as necessary to simulate a given program. The mechanism is similar to that for exposed simulation requests, except in this case the requests, which we call latent simulation requests (LSRs), are opaque to Venture. Venture simply calls the appropriate methods on the SP at the appropriate times to ensure that all the bookkeeping is handled correctly.

This framework is straightforward to apply to make_hmm and the stochastic procedure(s) that it returns. If my_hmm is queried for an observation at time t, the requestPSP can return the time t as an LSR. When Venture tells the HMM to simulate that LSR, the HMM will either do nothing, if the hidden state at time t already exists in its internal store of simulations, or else continue simulating from its current position up until time t. The outputPSP then samples an observation conditioned on the latent state at time t. If the application at time t is ever unevaluated, Venture will tell the HMM to detach the LSR t, which will cause the HMM to place the latents that are no longer necessary to simulate the program into a “latentDB”, which it returns to Venture. Later on, Venture may tell the HMM to restore latents from a latentDB, for example if a proposal is rejected and the starting trace is being restored.

The main reason to encapsulate latent variables in this way, as opposed to requesting them as ESRs, is so that the SP can use optimized implementations of inference over their values, potentially utilizing special-purpose inference methods. For example, the uncollapsed HMM can implement forward-filtering backward-sampling to efficiently sample the latent variables conditioned on all observations. Such procedures are integrated into Venture by defining an “Arbitrary Ergodic Kernel” (AEKernel) which Venture may call during inference, and which is simply a black box to Venture. Note that this same mechanism may be used by SPs that do not make latent simulation requests at all, but which have latent variables instantiated upon creation, such as a finite-time HMM or an uncollapsed Dirichlet-multinomial.


To implement this functionality, stochastic procedures must implement three procedures in addition to the procedures needed for their ESR requestor, LSR requestor and output PSPs:

  1. simulateLatents(aux, LSR, shouldRestore, latentDB) — simulate the latents corresponding to LSR, using the tokens in latentDB (indexed by LSR) to find a previous value if shouldRestore is true.

  2. detachLatents(aux, LSR, latentDB) — signal that the latents corresponding to the request LSR are no longer needed, and store enough information in latentDB so that the value can be recovered later.

  3. AEInfer(aux) — trigger the external implementation to perform inference over all latent variables using the contents of aux. It is often convenient for simulateLatents to store latent variables in the aux, and for incorporate to store the return values of applications in aux, along with the arguments that produced them.

Examples of this use of the stochastic procedure interface can be found in current releases of the Venture system.

3.3.4 Optimizations for higher-order SPs

Venture provides a special mechanism that allows certain SPs to exploit the ability to quickly compute the logdensity of their applications. Consider the following program:

[ASSUME alpha (gamma 1 1)]
[ASSUME collapsed_coin (make_beta_bernoulli alpha alpha)]
[OBSERVE (collapsed_coin) False]
[OBSERVE (collapsed_coin) True]
<repeat 10^9 times>
[OBSERVE (collapsed_coin) False]
[INFER]

A hand-written inference scheme would only keep track of the counts of the observations, and could perform rapid Metropolis-Hastings proposals on alpha by exploiting conjugacy. On the other hand, a naive generic inference scheme might visit all one billion observation nodes to compute the acceptance ratio for each proposal. We can achieve this efficient inference scheme by letting make_beta_bernoulli be responsible for tracking the sufficient statistics from the applications of collapsed_coin, and for evaluating the log density of all those applications as a block. Stochastic procedures that return other stochastic procedures and implement this optimization are said to be absorbing at applications, often abbreviated AAA. We will discuss techniques for implementing this mechanism in a later section.
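For the collapsed beta-Bernoulli, the block log density follows from conjugacy: the marginal probability of a particular exchangeable sequence with h heads and t tails is B(α + h, β + t) / B(α, β), computable in O(1) from the sufficient statistics alone. A sketch:

```python
import math

def log_beta(a, b):
    # log of the Beta function via log-gamma.
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def block_logdensity(alpha, beta, heads, tails):
    """Log marginal probability of a particular exchangeable sequence with
    the given counts under a collapsed beta-Bernoulli. O(1) regardless of
    the number of observations."""
    return log_beta(alpha + heads, beta + tails) - log_beta(alpha, beta)
```

A proposal to alpha then only needs this one number per coin, rather than a visit to each of the billion observation nodes.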

3.3.5 Auxiliary State

An SP is itself stateless, but may have an associated auxiliary store, called SPAux, that carries any mutable information. SPAuxs have several uses:

  1. If an SP makes exposed simulation requests, Venture uses the SPAux to store mappings from the addresses of the ESRs to the node that stores the result of that simulation.

  2. If a PSP keeps track of its sample counts or other sufficient statistics, such as the collapsed-beta-bernoulli which stores the number of trues and falses, it will store this information in the SPAux.

  3. If an SP makes latent simulation requests, then all latent variables it simulates to respond to those requests are stored in the SPAux.

  4. Some SPs may optionally store part of the value of the SP directly in the SPAux. This is necessary if an SP cannot easily store its value, its latents, and its outputs separately.


4 Probabilistic Execution Traces

Bayesian networks decompose probabilistic models into nodes for random variables, equipped with conditional probability tables, and edges for conditional dependencies. They can be interpreted as probabilistic programs via the ancestral simulation algorithm (Frey, 1997). The network is traversed in an order consistent with the topology, and each random variable is generated given its immediate parents. Bayesian networks can also be viewed as expressing a function for evaluating the joint probability density of all the nodes in terms of a factorization given by the graph structure.

Here we describe probabilistic execution traces (PETs), which serve analogous functions for Venture programs and address the additional complications that arise from Turing-completeness and the presence of higher-order probabilistic procedures. We also describe recursive procedures for constructing and destroying probabilistic execution traces as Venture modeling language expressions are evaluated and unevaluated.


4.1 Definition of a probabilistic execution trace

Probabilistic execution traces consist of a directed graphical model representing the dependencies in the current execution, along with the stateful auxiliary data for each stochastic procedure, the Venture program itself, and metadata for existential dependencies and exchangeable coupling. We will typically identify executions with their PETs, and denote them by symbols such as ρ and ξ.

PETs contain the following nodes:

  1. One constant node for the global environment.

  2. One constant node for every value bound in the global environment, which includes all built-in SPs.

  3. One constant node for every call to eval on an expression that is either self-evaluating or quoted.

  4. One lookup node for every call to eval that triggers a symbol lookup.

  5. One request node and one output node for every call to eval that triggers an SP application. We refer to the operator nodes and operand nodes of request nodes and output nodes, but note that these are not special node types.

PETs also contain the following edges:

  1. One lookup edge to each lookup node from the node it is looking up.

  2. One operator edge to every request node from its operator node.

  3. One operator edge to every output node from its operator node.

  4. One operand edge to every request node from each of its operand nodes.

  5. One operand edge to every output node from each of its operand nodes.

  6. One requester edge to every output node from its corresponding request node.

  7. One ESR edge from the root node of every SP family to each SP application that requests it.

Every node represents a random variable and has a value that cannot change during an execution. The PET also includes the SPAux for every SP that needs one. Unlike the values in the nodes, the SPAuxs may be mutated during an execution, for example to increment the number of trues for a beta-bernoulli.

4.2 Families

We divide our traces into families: one Venture family for every assume, predict, or observe directive, and one SP family for every unique ESR requested during forward simulation. Because of our uniform treatment of conditional simulation, executions of programs satisfy the following property: the structure of every family is a function of the expression only, and does not depend on the random choices made while evaluating that expression. The only part of the topology of the graph that can change is which ESRs are requested.

4.3 Exchangeable coupling


Given our exchangeability assumptions for SPs, we can cite (generalized) de Finetti (Orbanz and Roy, ) to conclude that there is, in addition to the observed random variables explicitly represented in the PET, one latent random variable θ for every SP s in the PET, corresponding to the unobserved de Finetti measure, and one latent variable θ_x for every set of arguments x that s is called on, with an edge from s’s maker-node to θ, edges from θ to every θ_x, and an edge from each θ_x to every node corresponding to an application of the form (s x). However, each θ and θ_x is marginalized out by s by way of mutation on its SPAux, effectively introducing a hyperedge that indirectly couples the application nodes of all applications of s.

We have chosen not to represent these dependencies in the graphical structure of the PET for the following reasons. First, once we integrate out θ and the θ_x, we can only encode these dependencies with directed edges by fixing a specific ordering for the applications. Second, for the orderings we will be interested in, the graph that combines both types of directed edges would be cyclic. This would complicate future efforts to develop a causal semantics for PETs and Venture programs.

4.4 Existential dependence and contingent evaluation

An SP family is existentially dependent on the nodes that request it as part of an ESR, in the sense that if at any point it is not requested by any nodes, then the family would not have been computed while simulating the PET. Existential dependence is handled with garbage-collection semantics: an SP family that is no longer requested by any node can no longer be part of the PET, and thus should be unevaluated.

4.5 Examples

Here we briefly give example PETs for simple Venture programs.

4.5.1 Trick coin

This is a variant of our running example. It defines a model that can be used for inferences about whether or not a coin is tricky, where a trick coin is allowed to have any weight between 0 and 1. The version we use includes one observation that a single flip of the coin came up heads.

To keep the PET as simple as possible, we give both the program and the PET in the form where IF has already been desugared into an SP application: (IF <predicate> <consequent> <alternate>) has been replaced with (branch <predicate> (quote <consequent>) (quote <alternate>)), where branch is an ordinary stochastic procedure.

[ASSUME coin_is_tricky (bernoulli 0.1)]
[ASSUME weight (branch coin_is_tricky (quote (beta 1.0 1.0)) (quote 0.5))]
[OBSERVE (bernoulli weight) true]

Figure 1 shows the two PET structures that can arise from simulating this program, along with arbitrarily chosen values.


Figure 1: The two different PET structures corresponding to the trick coin program. (a) An execution trace where the coin is fair. (b) An execution trace where the coin is tricky, containing additional nodes that depend existentially on the coin flip.
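For this particular program the posterior is available in closed form, which makes it a useful sanity check for inference over the two PET structures: marginalizing the Beta(1,1) weight gives P(head | tricky) = 1/2 = P(head | fair), so a single observed head is uninformative and the posterior probability of trickiness remains at the prior, 0.1. A short check of this arithmetic:

```python
# Exact posterior for the trick coin program after observing one head.
p_tricky = 0.1
p_head_given_tricky = 1.0 / (1.0 + 1.0)   # mean of a Beta(1, 1) weight
p_head_given_fair = 0.5

posterior = (p_tricky * p_head_given_tricky) / (
    p_tricky * p_head_given_tricky + (1 - p_tricky) * p_head_given_fair)
```

With more observed flips the two hypotheses would separate, and inference would have to move between the two trace structures in Figure 1.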

4.5.2 A simple Bayesian network

Figure 2 shows a PET for the following program, implementing a simple Bayesian network:

[ASSUME rain (bernoulli 0.2)]
[ASSUME sprinkler (bernoulli (branch rain 0.01 0.4))]
[ASSUME grassWet
    (bernoulli (branch rain
                       (branch sprinkler 0.99 0.8)
                       (branch sprinkler 0.9 0.00001)))]
[OBSERVE grassWet True]

Note that each of the Venture families in the PET corresponds to a node in the Bayesian network. A coarsened version of the PET contains the same conditional dependence and independence information as the Bayesian network would.
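Because the program is finite and discrete, queries against it can be checked by enumerating all executions. The following sketch computes P(rain | grassWet = true) directly from the conditional probabilities appearing in the program; the helper names are ours:

```python
from itertools import product

def p_bernoulli(value, p):
    return p if value else 1.0 - p

def joint(rain, sprinkler, grass_wet):
    """Joint probability of one execution of the three-node network above."""
    p = p_bernoulli(rain, 0.2)
    p *= p_bernoulli(sprinkler, 0.01 if rain else 0.4)
    if rain:
        p_wet = 0.99 if sprinkler else 0.8
    else:
        p_wet = 0.9 if sprinkler else 0.00001
    p *= p_bernoulli(grass_wet, p_wet)
    return p

# P(rain | grassWet = True) by summing over executions, mirroring a sum
# over the possible PETs of the program.
num = sum(joint(True, s, True) for s in (False, True))
den = sum(joint(r, s, True) for r, s in product((False, True), repeat=2))
posterior_rain = num / den
```

This is exactly the computation a coarsened PET supports, since each Venture family plays the role of one Bayesian network node.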

Figure 2: The PET corresponding to a three-node Bayesian network. Each of the three numbered families correspond to a node in the Bayesian network, capturing the execution history needed to simulate the node given its parents.

4.5.3 Stochastic memoization


We now illustrate a program that exhibits stochastic memoization. This program constructs a stochastically memoized procedure and then applies it three times. Unlike deterministic memoization, a stochastically memoized procedure has a stochastic PSP which sometimes returns a previously sampled value, and sometimes samples a fresh one. These random choices follow a Pitman-Yor process. Figure 3 shows a PET corresponding to a typical simulation; note the overlapping requests.

[ASSUME f (pymem bernoulli 1.0 0.1)]
[PREDICT (f)]
[PREDICT (f)]
[PREDICT (f)]
Figure 3: A PET corresponding to an execution of a program with stochastic memoization. The procedure being stochastically memoized, bernoulli, is applied twice, based on the value of the requests arising in invocations of (f).
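The table-reuse behavior behind pymem can be sketched as follows (illustrative, not Venture's implementation; here pymem memoizes a zero-argument simulator, with concentration alpha and discount d):

```python
import random

def pymem(base_sim, alpha, d):
    """Sketch of stochastic memoization via a Pitman-Yor process.

    base_sim: zero-argument simulator to memoize.
    alpha: concentration; d: discount, 0 <= d < 1.
    """
    tables = []   # tables[i] = [value, count]

    def f():
        n = sum(count for _, count in tables)
        # Reuse table i with probability (count_i - d) / (n + alpha);
        # otherwise draw a fresh value from base_sim.
        r = random.random() * (n + alpha)
        for table in tables:
            r -= table[1] - d
            if r < 0:
                table[1] += 1
                return table[0]
        value = base_sim()
        tables.append([value, 1])
        return value

    return f
```

With alpha = 0 and d = 0 every call after the first reuses the first table, while larger alpha or d makes fresh draws (and hence additional requests in the PET) more likely.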

4.6 Constructing PETs via forward simulation

Let P be a constrained program. Venture’s primary inference strategies require an execution of P with positive probability before they can even begin. Therefore the first thing we do is simply evaluate the program, interpreting it in a fairly standard way, except with all conditional evaluation handled uniformly through the ESR machinery presented above. For simplicity, we elide details related to inference scoping.

4.6.1 Pseudocode for EVAL, APPLY and EVAL-REQUESTS

Evaluation is generally similar to that of a pure Scheme, but there are a few noteworthy differences. First, evaluation creates nodes for every recursive call, and connects them together to form the directed graph structure of the probabilistic execution trace. Second, there is no distinction between primitive procedures and compound procedures. We call the top-level evaluation procedure EvalFamily to emphasize the family block structure in PETs. Third, a scaffold and a database of random choices db are threaded through the recursions, to support their use in inference, including restoring trace fragments when a transition is rejected by reusing random choices from the db. Note that an environment model evaluator (Abelson and Sussman, 1983) is used, even though the underlying language is pure, with the PET storing both the environment structure and the recursive invocations of eval.

The call to Regenerate ensures that if EvalFamily is called in the context of a pre-existing PET, it is traversed in an order that is compatible with the dependence structure of the program. We will sometimes refer to such orders as “evaluation-consistent”. From the standpoint of forward simulation, however, Regenerate can be safely ignored.
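The shape of EvalFamily can be sketched as a recursive evaluator that records one trace node per call, with parent edges for lookups and applications. This toy version omits requestPSPs, the scaffold, and the db, and uses plain Python callables as operators; environments map symbols to trace nodes:

```python
class TraceNode:
    """Toy PET node: a kind, a fixed value, and parent edges."""
    def __init__(self, kind, value=None, parents=()):
        self.kind, self.value, self.parents = kind, value, list(parents)

def eval_family(trace, expr, env):
    """Minimal sketch of EvalFamily over constants, lookups, applications."""
    if not isinstance(expr, (list, str)):
        # Self-evaluating constant.
        node = TraceNode("constant", expr)
        trace.append(node)
        return node
    if isinstance(expr, str):
        # Symbol lookup: a lookup edge to the node being looked up.
        source = env[expr]
        node = TraceNode("lookup", source.value, parents=[source])
        trace.append(node)
        return node
    # Application: evaluate operator and operands recursively, then apply.
    operator = eval_family(trace, expr[0], env)
    operands = [eval_family(trace, e, env) for e in expr[1:]]
    value = operator.value(*[n.value for n in operands])
    node = TraceNode("output", value, parents=[operator] + operands)
    trace.append(node)
    return node
```

The recursion visits subexpressions in evaluation order, which is exactly the evaluation-consistent order that Regenerate must respect when reusing a pre-existing PET.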


Stochastic procedures are allowed to request evaluations of expressions (to prevent this flexibility from introducing arbitrary dependencies, the expressions and environments are restricted to those constructible from the arguments to the procedure or the procedure that constructed it); this enables higher-order procedures as well as the encapsulation of custom control flow constructs. For example, compound procedures do this to evaluate their body in an environment extended with their formal parameters and argument values.


Stochastic procedures are also allowed to perform opaque operations to produce outputs given the value of their arguments and any requested evaluations. Note that this may result in random choices being added to the record maintained by the trace.


This functionality is supported by book-keeping that links new stochastic procedures into the trace and registers any customizations they implement, so that inference transitions can make use of them.

4.7 Undoing simulation of PET fragments

Venture also requires unevaluation procedures that are dual to the evaluation procedures described earlier. This is because PET fragments need to be removed from the trace in two situations. First, when FORGET instructions are triggered, the expression corresponding to the directive is removed. Second, during inference, changes to the values of certain requestPSPs may cause some SP families to no longer be requested. In the first case, all of the random choices are permanently removed, whereas in the second case, the random choices may need to be restored if the proposal is rejected.

4.7.1 Pseudocode for UNEVAL, UNAPPLY and UNEVAL-REQUESTS

When a trace fragment is unevaluated, we must visit all application nodes in the trace fragment so that the PSPs have a chance to unincorporate the (input, output) pairs. The operations needed to do this are essentially inverses of the simulation procedures described above, designed to visit nodes in the reverse of evaluation order, to ensure compatibility with exchangeable coupling. Here we give pseudocode for these operations, eliding the details of garbage collection. (Previous work has anecdotally explored the possibility of preserving unused trace fragments and treating them as auxiliary variables, following the treatment of component model parameters from Algorithm 8 in (Neal, 1998). In theory, this delay of garbage collection, where multiple copies of each trace fragment are maintained for each branch point, could support adaptation to the posterior. A detailed empirical evaluation of these strategies is pending a comprehensive benchmark suite for Venture as well as a high-performance implementation.)

To unapply a stochastic procedure, Venture must undo its random choices and also unapply any requests it generated.


To unapply a primitive stochastic procedure, we remove its random choices from the trace, unincorporate them, and update the weight accordingly. Note that we store the value in the db so that we can restore it later if necessary.
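A minimal sketch of unapplying and restoring a single PSP application follows; the toy PSP, the dict-based node, and the function names are ours. The key ordering point is that we unincorporate first, so that the weight is computed against the statistics of the remaining applications, and we stash the value so a rejected proposal can be restored:

```python
import math

class CountingPSP:
    """Toy PSP whose density depends on a count kept in aux (illustrative)."""
    def incorporate(self, aux, value):
        aux["n"] += 1
    def unincorporate(self, aux, value):
        aux["n"] -= 1
    def logdensity(self, aux, value):
        # Arbitrary aux-dependent density, standing in for exchangeable coupling.
        return -math.log(aux["n"] + 1)

def unapply_psp(psp, aux, node, db):
    """Unincorporate the application, record its weight, stash its value."""
    psp.unincorporate(aux, node["value"])
    weight = psp.logdensity(aux, node["value"])  # contribution being removed
    db[node["addr"]] = node["value"]
    node["value"] = None
    return weight

def restore_psp(psp, aux, node, db):
    """Dual operation, used when a transition is rejected."""
    node["value"] = db[node["addr"]]
    psp.incorporate(aux, node["value"])
```

Running unapply and then restore returns both the node and the auxiliary statistics to their original state, which is the invariant the rejection path relies on.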