Deployable probabilistic programming

06/20/2019 ∙ by David Tolpin, et al. ∙ 0

We propose design guidelines for a probabilistic programming facility suitable for deployment as a part of a production software system. As a reference implementation, we introduce Infergo, a probabilistic programming facility for Go, a modern programming language of choice for server-side software development. We argue that a similar probabilistic programming facility can be added to most modern general-purpose programming languages. Probabilistic programming enables automatic tuning of program parameters and algorithmic decision making through probabilistic inference based on the data. To facilitate addition of probabilistic programming capabilities to other programming languages, we share implementation choices and techniques employed in development of Infergo. We illustrate applicability of Infergo to various use cases on case studies, and evaluate Infergo's performance on several benchmarks, comparing Infergo to dedicated inference-centric probabilistic programming frameworks.

READ FULL TEXT VIEW PDF
POST COMMENT

Comments

There are no comments yet.

Authors

page 1

page 2

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1. Introduction

Probabilistic programming (Goodman et al., 2008; Mansinghka et al., 2014; Wood et al., 2014; Goodman and Stuhlmüller, 2014)

represents statistical models as programs written in an otherwise general programming language that provides syntax for the definition and conditioning of random variables. Inference can be performed on probabilistic programs to obtain the posterior distribution or point estimates of the variables. Inference algorithms are provided by the probabilistic programming framework, and each algorithm is usually applicable to a wide class of probabilistic programs in a black-box manner. The algorithms include Metropolis-Hastings

(Wingate et al., 2011; Mansinghka et al., 2014; Yang et al., 2014), Hamiltonian Monte Carlo (Carpenter et al., 2017), expectation propagation (Minka et al., 2010), extensions of Sequential Monte Carlo (Wood et al., 2014; van de Meent et al., 2015; Paige et al., 2014; Rainforth et al., 2016), variational inference (Wingate and Weber, 2013; Kucukelbir et al., 2017), gradient-based optimization (Carpenter et al., 2017; Bingham et al., 2019), and others.

There are two polar views on the role of probabilistic programming. One view is that probabilistic programming is a flexible framework for specification and analysis of statistical models (Wingate et al., 2011; Tolpin et al., 2016; Ge et al., 2018)

. Proponents of this view perceive probabilistic programming as a tool for data science practitioners. A typical workflow consists of data acquisition and pre-processing, followed by several iterations of exploratory model design and testing of inference algorithms. Once a sufficiently robust statistical model is obtained, analysis results are post-processed, visualized, and occasionally integrated into algorithms of a production software system.

The other, emerging, view is that probabilistic programming is an extension of a regular programming toolbox allowing algorithms implemented in general-purpose programming languages to have learnable parameters. In this view, statistical models are integrated into the software system, and inference and optimization take place during data processing, prediction, and algorithmic decision making in production. Probabilistic programming code runs unattended, and inference results are an integral part of the algorithms. Natural applications arise whenever the algorithm is a generative model of a mental or physical process, in particular when latent variables of the algorithm are used for decision making (Goodman et al., 2016; Winn et al., 2019); for example, inference about user behavior in social networks, ongoing health monitoring, or online motion planning of autonomous robotic devices.

In this work we are concerned with the later view on probabilistic programming. Our objective is to pave a path to deployment of probabilistic programs in production systems, promoting a wider adoption of probabilistic programming as a flexible tool for ubiquitous statistical inference. Most probabilistic programming frameworks are suited, to a certain extent, to be used in either the exploratory or the production scenario. However, the proliferation of probabilistic programming languages and frameworks 111There are 45 probabilistic programming languages listed on Wikipedia (Wikipedia, 2019). 18 probabilistic programming languages were presented during the developer meetup at PROBPROG’2018, the inaugural conference on probabilistic programming, http://probprog.cc/, half of which are not on the Wikipedia list. on one hand, and scarcity of success stories about production use of probabilistic programming on the other hand, suggest that there is a need for new ideas and approaches to make probabilistic programming models available for production use.

Our attitude differs from the established approach in that instead of proposing yet another probabilistic programming language, either genuinely different (Milch et al., 2007; Carpenter et al., 2017) or derived from an existing general-purpose programming language by extending the syntax or changing the semantics (Goodman et al., 2008; Tolpin et al., 2016; Ge et al., 2018), we advocate enabling probabilistic inference on models written in a general-purpose probabilistic programming language, the same one as the language used to program the bulk of the system. We formulate guidelines for such implementation, and, based on the guidelines, introduce a probabilistic programming facility for the Go programming language (team, 2009). By explaining our implementation choices and techniques, we argue that a similar facility can be added to any modern general-purpose purpose programming language, giving a production system modeling and inference capabilities which do not fall short from and sometimes exceed those of probabilistic programming-centric frameworks with custom languages and runtimes. Availability of such facilities in major general-purpose programming languages would free probabilistic programming researchers and practitioners alike from language wars (Stefik and Hanenberg, 2014) and foster practical applications of probabilistic programming.

Contributions

This work brings the following major contributions:

  • Guidelines for design and implementation of a probabilistic programming facility deployable as a part of a production software system.

  • A reference implementation of deployable probabilistic programming facility for the Go programming language.

  • Implementation highlights outlining important choices we made in implementation of the probabilistic programming facility, and providing a basis for implementation of a similar facility for other general-purpose programming languages.

2. Related Work

Our work is related to three interconnected areas of research:

  1. Design and implementation of probabilistic programming languages.

  2. Integration and interoperability between probabilistic programming models and the surrounding software systems.

  3. Differentiable programming for machine learning.

Probabilistic programming is traditionally associated with design and implementation of probabilistic programming languages. An apparent justification for a new programming language is that a probabilistic program has a different semantics (Gordon et al., 2014; Staton et al., 2016a) than a ‘regular’ program of similar structure. Goodman and Stuhlmüller (2014) offer a systematic approach to design and implementation of probabilistic programming languages based on extension of the syntax of a (subset of) general-purpose programming language and transformational compilation to continuation-passing style (Appel and Jim, 1989; Appel, 2007). van de Meent et al. (2018) give a comprehensive overview of available choices of design and implementation of first-order and higher-order probabilistic programming languages. Stan (Carpenter et al., 2017) is built around an imperative probabilistic programming language with program structure tuned for most efficient execution of inference on the model. SlicStan (Gorinova et al., 2019) is built on top of Stan as a source-to-source compiler, providing the model developer with a more intuitive language while retaining performance benefits of Stan. Our work differs from existing approaches in that we advocate enabling probabilistic programming in a general-purpose programming language by leveraging existing capabilities of the language, rather than building a new language on top or besides an existing one.

The need for integration between probabilistic programs and the surrounding software environment is well understood in the literature (Tolpin et al., 2016; Bingham et al., 2019). Probabilistic programming languages are often implemented as embedded domain-specific languages (Scibior et al., 2015; Ge et al., 2018; Tolpin et al., 2016) and include a mechanism of calling functions from libraries of the host languages. Our work goes a step further in this direction: since the language of implementation of probabilistic models coincides with the host language, any library or user-defined function of the host language can be directly called from the probabilistic model, and model methods can be directly called from the host language.

Automatic differentiation is widely employed in machine learning (Baydin et al., 2017; van Merrienboer et al., 2018a) where it is also known as ‘differentiable programming’, and is responsible for enabling efficient inference in many probabilistic programming frameworks (Carpenter et al., 2017; Ge et al., 2018; Innes et al., 2018; Bingham et al., 2019; Tran et al., 2017). Different automatic differentiation techniques (Griewank and Walther, 2008) allow different compromises between flexibility, efficiency, and feature-richness (Stan Development Team, 2014; Paszke et al., 2017; Innes et al., 2018). Automatic differentiation is usually implemented through either operator overloading (Carpenter et al., 2017; Paszke et al., 2017; Ge et al., 2018) or source code transformation (Innes et al., 2018; van Merrienboer et al., 2018b). Our work, too, relies on automatic differentiation, implemented as source code transformation. However, a novelty of our approach is that instead of using explicit calls or directives to denote parts of code which need to be differentiated, we rely on the type system of Go to selectively differentiate the code relevant for inference, thus combining advantages of both operator overloading and source code transformation.

3. Challenges

Incorporating a probabilistic program, or rather a probabilistic procedure, within a larger code body appears to be rather straightforward: one implements the model in the probabilistic programming language, fetches and preprocesses the data in the host programming language, passes the data and the model to an inference algorithm, and post-processes the results in the host programming language again to make algorithmic decisions based on inference outcomes. To choose the best option for each particular application, probabilistic programming languages and frameworks are customarily compared by the expressive power (classes of probabilistic models that can be represented), conciseness (e.g. the number of lines of code to describe a particular model), and time and space efficiency of inference algorithms (Wood et al., 2014; Perov, 2016; Rainforth, 2017; Ge et al., 2018). However, complex software systems make integration of probabilistic inference challenging, and considerations beyond expressiveness and performance come into play.

Simulation vs. inference

Probabilistic models often follow the design pattern of simulation-inference: a significant part of the model is a simulator running an algorithm with given parameters; the optimal parameters, or their distribution, are to be inferred. The inferred parameters are then used by the software system to execute the simulation independently of inference for prediction and decision making.

This pattern suggests re-use of the simulator: instead of implementing the simulator twice, in the probabilistic model and in the host environment, one can invoke the same code for both purposes. However to achieve this, the host language must coincide with the implementation language of the probabilistic model, on one hand, and allow a computationally efficient implementation of the simulation, on the other hand. Some probabilistic programming frameworks (Figaro (Pfeffer, 2009), Anglican (Tolpin et al., 2016), Turing (Ge et al., 2018)) are built with tight integration with the host environment in mind; more often than not though the probabilistic code is not trivial to re-use.

Data interface

The data for inference may come from a variety of sources: network, databases, distributed file systems, and in many different formats. Efficient inference depends on fast data access and updating. Libraries for data access and manipulation are available in the host environment. While the host environment can be used as a proxy retrieving and transforming the data, such as in the case of Stan (Carpenter et al., 2017)

integrations, sometimes direct access from the probabilistic code is the preferred option, for example when the data is streamed or retrieved conditionally, as in active learning 

(Sun et al., 2015).

A flexible data interface is also beneficial for support of mini-batch optimization (Li et al., 2014). Mini-batch optimization is used when the data set is too big to fit in memory, and the objective function gradient is estimated based on mini-batches — small portions of data. Different models and inference objectives may require different mini-batch loading schemes. For best performance of mini-batch optimization, it should be possible to program loading of mini-batches on case-by-case basis.

Integration and deployment

Deployment of software systems is a delicate process involving automatic builds and maintenance of dependencies. Adding a component, which possibly introduces additional software dependencies or even a separate runtime, complicates deployment. Minimizing the burden of probabilistic programming on the integration and deployment process should be a major consideration in design or selection of probabilistic programming tools. Probabilistic programming frameworks that are implemented or provide an interface in a popular programming language, e.g. Python (Edward (Tran et al., 2017), Pyro (Bingham et al., 2019)) are easier to integrate and deploy, however the smaller the footprint of a probabilistic programming framework, the easier is the adoption. An obstacle in integration of probabilistic programming lies also in adoption of the probabilistic programming language (Meyerovich and Rabkin, 2012), if the latter differs from the implementation language of the rest of the system.

4. Guidelines

Based on the experience of developing and deploying solutions using different probabilistic programming environments, we came up with guidelines to implementation of a probabilistic programming facility for server-side applications. We believe that these guidelines, when followed, help easier integration of probabilistic programming inference into large-scale software systems.

  1. A probabilistic model should be programmed in the host programming language. The facility may impose a discipline on model implementation, such as through interface constraints, but otherwise supporting unrestricted use of the host language for implementation of the model.

  2. Built-in and user-defined data structures and libraries should be accessible in the probabilistic programming model. Inference techniques relying on the code structure, such as those based on automatic differentiation, should support the use of common data structures of the host language.

  3. The model code should be reusable between inference and simulation. The code which is not required solely for inference should be written once for both inference of parameters and use of the parameters in the host environment. It should be possible to run simulation outside the probabilistic model without runtime or memory overhead imposed by inference needs.

5. Probabilistic Programming in Go

In line with the guidelines, we have implemented a probabilistic programming facility for the Go programming language, Infergo. We have chosen Go because Go is a small but expressive programming language with efficient implementation that has recently become quite popular for computation-intensive server-side programming. Infergo is used in production environment for inference of mission-critical algorithm parameters.

5.1. An Overview of Infergo

Infergo comprises a command-line tool that augments the source code of a model to facilitate inference, a collection of basic models (distributions) serving as building blocks for other models, and a library of inference algorithms.

A probabilistic model in Infergo is an implementation of an interface requiring a single method Observe222The method name follows Venture (Mansinghka et al., 2014), Anglican (Tolpin et al., 2016), and other probabilistic programming languages in which observe is the language construct for conditioning of the model on observations.

accepts a vector (a Go

slice) of floats, the parameters to infer, and returns a single float, interpreted as unnormalized log-likelihood of the posterior distribution. The model object usually encapsulates the observations, either as a data field or as a reference to a data source. Implementation of model methods can be written in virtually unrestricted Go and use any Go libraries.

For inference, Infergo relies on automatic differentiation. The source code of the model is translated by the command-line tool into an equivalent model with reverse-mode automatic differentiation of the log-likelihood with respect to the parameters applied. The differentiation operates on the built-in floating-point type and incurs only a small computational overhead. However, even this overhead is avoided when the model code is executed outside inference algorithms: both the original and the differentiated model are simultaneously available to the rest of the program code, so the methods can be called on the differentiated model for inference, and on the original model for the most efficient execution with the inferred parameters.

5.2. A Basic Example

Let us illustrate the use of Infergo on a basic example serving as a probabilistic programming equivalent of the ”Hello, World!” program — inferring parameters of a unidimensional Normal distribution. In this example, the model object holds the set of observations from which the distribution parameters are to be inferred:

type ExampleModel struct {
  Data []float64
}

The Observe method computes the log-likelihood of the distribution parameters. The Normal distribution has two parameters: mean

and standard deviation

. To ease the inference, positive is transformed to unrestricted , and the parameter vector x is . Observe first imposes a prior on the parameters, and then conditions the posterior distribution on the observations:

// x[0] is the mean,
// x[1] is the log stddev of the distribution
func (m *ExampleModel)
     Observe(x []float64) float64 {
  // Our prior is a unit normal ...
  ll := Normal.Logps(0, 1, x...)
  // ... but the posterior is based
  // on data observations.
  ll += Normal.Logps(x[0], math.Exp(x[1]),
                     m.Data...)
  return ll
}

For inference, we supply the data and optionally initialize the parameter vector. Here the data is hard-coded, but in general can be read from a file, a database, a network connection, or dynamically generated by another part of the program:

// Data
m := &ExampleModel{[]float64{
  -0.854, 1.067, -1.220, 0.818, -0.749,
  0.805, 1.443, 1.069, 1.426, 0.308}}
// Parameters
x := []float64{}

The fastest and most straightforward inference is the maximum a posteriori (MAP) estimation, which can be done using a gradient descent algorithm (Adam (Kingma and Ba, 2015) in this case):

opt := &infer.Adam{
  Rate:  0.01,
}
for iter := 0; iter != 1000; iter++ {
  opt.Step(m, x)
}

To recover the full posterior distribution of the parameters we use Hamiltonian Monte Carlo (Neal, 2012):

hmc := &infer.HMC{
  Eps: 0.1,
}
samples := make(chan []float64)
hmc.Sample(m, x, samples)
for i := 0; i != 5000; i++ {
  x = <-samples
}
hmc.Stop()

Models are of course not restricted to linear code, and may comprise multiple methods containing loops, conditional statements, function and method calls, and recursion. Case studies (Section 7) provide more model examples.

5.3. Model Syntax and Rules of Differentiation

In Infergo, the inference is performed on a model. Just like in Stan (Carpenter et al., 2017) models, latent random variables in Infergo models are continuous. Additionally, Infergo can express and perform inference on non-deterministic models by including stochastic choices in the model code.

An Infergo model is a data type and methods defined on the data type. Inference algorithms accept a model object as an argument. A model object usually encapsulates observations — the data to which the parameter values are adapted.

A model is implemented in Go, virtually unrestricted333There are insignificant implementation limitations which are gradually lifted as Infergo is being developed.. Because of this, the rules of model specification and differentiation are quite straightforward. Nonetheless, the differentiation rules ensure that provided that the model’s Observe method (including calls to other model methods) is differentiable, the gradient is properly computed (Griewank and Walther, 2008).

A model is defined in its own package. The model must implement interface Model containing a single method Observe. Observe accepts a slice of float64 — the model parameters — and returns a float64 scalar. For probabilistic inference, the returned value is interpreted as the log-likelihood of the model parameters. Inference models rely on computing the gradient of the returned value with respect to the model parameters through automatic differentiation. In the model’s source code:

  1. Methods on the type implementing Model returning a single float64 or nothing are differentiated.

  2. Within the methods, the following is differentiated:

    • assignments to float64 (including parallel assignments if all values are of type float64);

    • returns of float64;

    • standalone calls to methods on the type implementing Model (apparently called for side effects on the model).

Functions for which the gradient is provided rather than derived through automatic differentiation are called elementals (Griewank and Walther, 2008). Derivatives do not propagate through a function that is not an elemental or a call to a model method. If a derivative is not registered for an elemental, calling the elemental in a differentiated context will cause a run-time error.

Functions are considered elementals (and must have a registered derivative) if their signature is of kind func (float64, float64*) float64; that is, one or more non-variadic float64 arguments and float64 return value. For example, function func (float64, float64, float64) float64 is considered elemental, while functions

  • func (...float64) float64

  • func ([]float64) float64

  • func (int, float64) float64

are not. Gradients for selected functions from the math package are pre-defined (Sqrt, Exp, Log, Pow, Sin, Cos, Tan). Auxiliary elemental functions with pre-defined gradients are provided in a separate package.

Any user-defined function of appropriate kind can be used as an elemental inside model methods. The gradient must be registered for the function using a call to

func RegisterElemental(
  f interface{},
  g func(value float64,
         params ...float64) []float64)

For example, the call

RegisterElemental(Sigm,
  func(value float64,
       _ ...float64) []float64 {
    return []float64{value*(1-value)}
  })

registers the gradient for function Sigm.

5.4. Inference

Out of the box, Infergo offers

When only a point estimate of the model parameters is required, as, for example, in the simulation-inference case, stochastic gradient descent is a fast and robust inference family. Infergo offers stochastic gradient descent with momentum and Adam (Kingma and Ba, 2015). The inference is performed by calling the Step method of the optimizer until a termination condition is met:

for !<termination condition> {
    optimizer.Step(model, parameters)
}

When the full posterior must be recovered for integration, Hamiltonian Monte Carlo (HMC) can be used instead. Vanilla HMC as well as the No-U-Turn sampler (NUTS) (Hoffman and Gelman, 2011) are provided. To support lazy any-time computation, the inference runs in a separate goroutine, connected by a channel to the caller, and asynchronously writes samples to the channel. The caller reads as many samples from the channel as needed, and further postprocesses them:

samples := make(chan []float64)
hmc.Sample(model, parameters, samples)
for !<termination condition> {
  sample = <-samples
  // postprocess the sample
}
hmc.Stop()

Other inference schemes, most notably automatic differentiation variational inference (Kucukelbir et al., 2017), are planned for addition in future versions of Infergo. However, an inference algorithm does not have to be a part of Infergo to work with Infergo models. Since Infergo models operate on a built-in Go floating point type, float64, any third-party inference or optimization library can be used as well.

As an example, the Gonum (authors, 2017) library for numerical computations contains an optimization package offering several algorithms for gradient-based optimization. A Gonum optimization algorithm requires the function to optimize, as well as the function’s gradient, as parameters. An Infergo model can be trivially wrapped for use with Gonum, and Infergo provides a convenience function

func FuncGrad(m Model) (
  Func func(x []float64) float64,
  Grad func(grad []float64, x []float64))

which accepts a model and returns the model’s Observe method and its gradient suitable for passing to Gonum optimization algorithms. Among available algorithms are advanced versions of stochastic gradient descent, as well as algorithms of BFGS family, in particular L-BFGS (Liu and Nocedal, 1989), providing for a more efficient MAP estimation than stochastic gradient descent at the cost of higher memory consumption:

function, gradient := FuncGrad(model)
problem := gonum.Problem{Func: function,
                         Grad: gradient}
result, err := gonum.Minimize(problem, parameters)

Other third-party implementations of optimization algorithms can be used just as easily, thanks to Infergo models being pure Go code operating on built-in Go data types.

6. Implementation Highlights

In our journey towards implementation and deployment of Infergo, we considered options and made implementation decisions which we believe to have been crucial for addressing the challenges and following the guidelines we formulated for implementation of a deployable probabilistic programming facility. While different choices may be as well suited for the purpose as the ones we made, we feel it is important to share our contemplations and justify decisions, to which the rest of this section is dedicated, from choosing the implementation language, through details of automatic differentiation, to support for inference on streaming and stochastic observations.

6.1. Choice of Programming Language

Python (Rossum, 1995), Scala (Odersky, 2006), and Julia (Bezanson et al., 2014) have been popular choices for development of probabilistic programming languages and frameworks integrated with general-purpose programming languages (Salvatier et al., 2016; van Merrienboer et al., 2018b; Bingham et al., 2019; Pfeffer, 2009; Bryant, 2018; Ge et al., 2018; Innes et al., 2018). Each of these languages has advantages for implementing a probabilistic programming facility. However, for a reference implementation of a facility best suited for probabilistic inference in a production software system, we were looking for a language that

  • compiles into fast and compact code;

  • does not incur dependency on a heavy runtime; and

  • is a popular choice for building server-side software.

In addition, since we were going to automatically differentiate the language’s source code, we were looking for a relatively small language, preferably one for which parsing and program analysis tools are readily available.

We chose Go (team, 2009). Go is a small but expressive programming language widely used for implementing computationally-intensive server-side software systems. The standard libraries and development environment offer capabilities which made implementation of Infergo affordable and streamlined:

  • The Go parser and abstract syntax tree serializer are a part of the standard library. Parsing, transforming, and generating Go source code is straightforward and effortless.

  • Type inference (or type checking as it is called in the Go ecosystem), also provided in the standard library, augments parsing, and allows to selectively apply transformation-based automatic differentiation based on static expression types.

  • Go provides reflection capabilities for almost any aspect of the running program, without compromising efficiency. Reflection gives the flexibility highly desirable to implement features such as calling Go library functions from the model code.

  • Go compiles and runs fast. Fast compilation and execution speeds allow to use the same facility for both exploratory design of probabilistic models and for inference in production environment.

  • Go is available on many platforms and for a wide range of operating systems and environments, from portable devices to supercomputers. Go programs can be cross-compiled and reliably developed and deployed in heterogeneous environments.

  • Go offers efficient parallel execution as a first-class feature, via so-called goroutines. Goroutines streamline implementation of sampling-based inference algorithms. Sample generators and consumers are run in parallel, communicating through channels. Inference is easy to parallelize in order to exploit hardware multi-processing, and samples are retrieved lazily for postprocessing.

Considering and choosing Go for development of what later became Infergo also affected the shape of the software system in which Infergo was deployed first. Formerly implemented mostly in Python, and both enjoying a wide choice of libraries for machine learning and data processing and suffering from limitations of Python as a language for implementation of computation-intensive algorithms, the system has been gradually drifting towards adopting Go, with performance-critical parts re-implemented and running faster due to better hardware and network utilization, and simpler and more maintainable code for data processing, analysis, and decision making.

6.2. Automatic Differentiation

Infergo employs reverse-mode automatic differentiation via source code transformation (Griewank and Walther, 2008). The reverse mode is the default choice in machine learning applications, because it works best with differentiating a scalar return value with respect to multiple parameters. Automatic differentiation is usually implemented using either operator overloading or source code transformation. Operator overloading is often the method of choice (Carpenter et al., 2017; Ge et al., 2018; Paszke et al., 2017) due to relative simplicity, but the differentiated code must use a special type (on which the operators are overloaded) for representing floating point numbers. Regardless of the issue of computational overhead, this restricts the language used to specify the model to operations on that special type and impairs interoperability: library functions cannot be directly called but have to use an adaptor, or a ‘lifted’ version of the standard library must be supplied. Similarly, the data must be cast into the special type before it is passed to the model for inference.

Reverse mode source code transformation

Source code transformation allows to automatically differentiate program code operating on the built-in floating point type. This method involves generation of new code, for the function gradient or for the function itself, ahead of time or on the fly. To achieve the best performance, some implementations generate static code for computing the function gradient (van Merrienboer et al., 2018b; Innes, 2018). This requires a rather elaborated source code analysis, and although a significant progress has been made in this direction, not all language constructs are supported. Alternatively, the code of the function itself can be transformed by augmenting arithmetics and function calls on the floating point type by function calls that record the computations on the tape, just like in the case of operator overloading. Then, the backward pass is identical to that of operator overloading. This method allows the use of the built-in floating point type along with virtually unrestricted language for model specification, at the cost of slight performance loss.

This latter method is used in Infergo: the model is written in Go and is placed in a separate Go package. A command-line tool is run on the package to generate a sub-package containing a differentiated version of the model. Both the original and the differentiated model are available simultaneously in the calling code, so that model methods that do not require computation of gradients can be invoked without the overhead of automatic differentiation. However, the performance loss is rarely a major issue — differentiated versions of model methods are only 2–4 times slower on models from the case studies (Section 7), and the backward pass, which computes the gradient, is roughly as fast as the original, undifferentiated version. At least partially, fast backward pass is due to careful implementation and thorough profiling of the tape.

Automatic differentiation of models

It may seem desirable to implement or use an automatic differentiation library that is more general than needed for probabilistic programming. However the use of such library, in particular with source code transformation, may make implementation restricted and less efficient: first, if a model comprises several functions, all functions must have a signature suitable for differentiation (that is, accept and return floating point scalars or vectors); second, the gradient of every nested call must be fully computed first and then used to compute the gradient of the caller. In Infergo, only the gradient of Observe is needed, and any other differentiated model method can only be called from Observe or another differentiated method. Consequently, only the gradient of Observe is computed explicitly, and other differentiated methods may have arbitrary arguments. In particular, derivatives may propagate through fields of the model object and structured arguments.

6.3. Model Composition

Just like it is possible to compose library and user-defined functions into more complex functions, it should be possible to compose models to build other models (Tran et al., 2017; Sennesh et al., 2018). Infergo supports model composition organically: a differentiated model method can be called from another differentiated method, not necessarily a method of the same model object or type. Derivatives can propagate through methods of different models without restriction. Each model is a building block for another model. For example, if there are two independent single-parameter models A and B describing the data, then the product model AB can be obtained by composition of A and B (Figure 1).

1type A struct {Data []float64}
2func (model A) Observe(x []float64) {
3  ...
4}
5
6type B struct {Data []float64}
7func (model B) Observe(x []float64) {
8  ...
9}
10
11type AB struct {Data []float64}
12func (model AB) Observe(x []float64) float64 {
13  return A{model.Data}.Observe(x[:1]) +
14    B{model.Data}.Observe(x[1:])
15}
Figure 1. Model composition. Model AB is the product of A and B.

Combinators such as product, mixture, or hierarchy can be realized as functions operating on Infergo models.

Distributions

A probabilistic programming facility should provide a library of distributions for definition and conditioning of random variables. To use a distribution in a differentiable model, the distribution’s probability density function must be differentiable by both the distribution’s parameters and by the value of the random variable. Distributions are usually supplied as a part of the probabilistic programming framework (Tolpin et al., 2016; Carpenter et al., 2017; Ge et al., 2018), and, if at all possible, adding a user-defined distribution requires programming at a lower level of abstraction than while specifying a probabilistic model (Salvatier et al., 2016; Tolpin et al., 2016; Annis et al., 2017).

In Infergo, distributions are just models, and adding and using a custom distribution is indistinguishable from adding a model and calling the model’s method from another model. By convention, library distributions, in addition to Observe, also have methods Logp and Logps for the scalar and vector version of probability density functions, for convenience of use in models. The distribution library in Infergo is just a model package which is automatically differentiated like any other model. New distributions can be added in any package. Figure 2

shows the Infergo definition of the exponential distribution.

1// Exponential distribution.
2type expon struct{}
3
4// Exponential distribution, singleton instance.
5var Expon expon
6
7// Observe implements the Model interface. The
8// parameter vector is lambda, observations.
9func (dist expon) Observe(x []float64)
10  float64 {
11  lambda, y := x[0], x[1:]
12  if len(y) == 1 {
13    return dist.Logp(lambda, y[0])
14  } else {
15    return dist.Logps(lambda, y...)
16  }
17}
18
19// Logp computes the log pdf of a single
20// observation.
21func (_ expon) Logp(lambda float64, y float64)
22  float64 {
23  logl := math.Log(lambda)
24  return logl - lambda*y
25}
26
27// Logps computes the log pdf of a vector
28// of observations.
29func (_ expon) Logps(lambda float64, y ...float64)
30  float64 {
31  ll := 0.
32  logl := math.Log(lambda)
33  for i := range y {
34    ll += logl - lambda*y[i]
35  }
36  return ll
37}
Figure 2. The exponential distribution in Infergo.

6.4. Stochasticity and Streaming

Perhaps somewhat surprisingly, most probabilistic programs are semantically deterministic (Staton et al., 2016b) — probabilistic programs define distributions, and although they manipulate random variables, there is no randomization or non-determinism per se. Randomization arises in certain inference algorithms (notably of the Monte Carlo family) for tractable approximate inference, but this does not introduce non-determinism into the programs themselves. Non-deterministic probabilistic programs arise in applications. For example, van de Meent et al. (2016) explored using probabilistic programming for policy search in non-deterministic domains. To represent non-determinism, appropriate constructs had to be added to the probabilistic programming language used in the study.

In Infergo, non-determinism can be easily introduced through data. In the basic example if Section 5.2, the data is encapsulated in the model as a vector, populated before inference. However, this is not at all required by Infergo. Instead of being stored in a Go slice, the data can be read from a Go channel or network connection and generated by another goroutine or process. Figure 3 shows a modified model from the basic example with the data gradually read from a channel rather than passed as a whole in a field of the model object.

1type StreamingModel struct {
2  Data chan float64  // data is a channel
3  N    int           // batch size
4}
5
6func (m *StreamingModel)
7     Observe(x []float64) float64 {
8  ll := Normal.Logps(0, 1, x...)
9  // observe a batch of data from the channel
10  for i := 0; i != m.N; i++ {
11    ll += Normal.Logp(x[0], math.Exp(x[1]),
12                      <- m.Data)
13  }
14  return ll
15}
Figure 3. The basic example with the data read from a channel.

A similar approach can be applied to streaming data: the model may read the data gradually as it arrives, and the inference algorithm will emit samples from the time process inferred from the data. Work is still underway on algorithms ensuring robust and sound inference on streams, but Infergo models connected to data sources via channels are already used in practice.

7. Case Studies

In this section, we present three case studies of probabilistic programs, each addressing a different aspect of Infergo. Eight schools (Section 7.1

) is a popular example we use to compare model specification in Stan and Infergo. Linear regression (Session 

7.2) formulated as a probabilistic program lets us explore a model comprising multiple methods and illustrate the simulation-inference pattern. Gaussian mixture (Section 7.3) is the most complex model of the three, featuring nested loops and conditionals.

By no means do the models presented in this section reflect the complexity or the size of real-world Infergo applications, the latter reaching hundreds of lines of code in length and performing inference on tens of thousands of data entries. Rather, their purpose is to provide a general picture of probabilistic programming with Infergo.

7.1. Eight Schools

The eight schools problem (Gelman et al., 2013)

is an application of the hierarchical normal model that often serves as an introductory example of Bayesian statistical modeling. Without going into details, in this problem the effect of coaching programs on test outcomes is compared among eight schools. Figure 

4 shows specifications of the same model in Stan and Infergo, side-by-side.

1data { 2  int<lower=0> J; 3  vector[J] y; 4  vector<lower=0>[J] sigma; 5} 6 7parameters { 8  real mu; 9  real<lower=0> tau; 10  vector[J] eta; 11} 12 13transformed parameters { 14  vector[J] theta; 15  theta = mu + tau * eta; 16} 17 18model { 19  eta ~ normal(0, 1); 20  y ~ normal(theta, sigma); 21} a. Stan 1type SchoolsModel struct { 2  J          int 3  Y          []float64 4  Sigma      []float64 5} 6 7 8func (m *SchoolsModel) Observe(x []float64) float64 { 9  mu := x[0] 10  tau := math.Exp(x[1]) 11  eta := x[2:] 12 13  ll := Normal.Logp(0, 1, eta) 14 15  for i, y := range m.Y { 16    theta := mu + tau*eta[i] 17    ll += Normal.Logp(theta, m.Sigma[i], y) 18  } 19 20  return ll 21} b. Infergo
Figure 4. Eight schools: Stan vs. Infergo. The Go implementation has a similar length and structure to the Stan model.

Stan models are written in a probabilistic programming language tailored to statistical modeling. Infergo models are implemented in Go. One can see that the models have roughly the same length and structure. The definition of the model type in Go (lines 1–5) corresponds to the data section in Stan (lines 1–5). Destructuring of the parameters of Observe in Go (lines 9–11) names the parameters just like the parameters section in Stan (lines 8–10). Introduction of theta in Go (line 16) corresponds to the transformed parameters section in Stan (lines 13–16). The only essential difference is the explicit loop over the data in Go (lines 15–18) versus vectorized computations in Stan (lines 15 and 20). However, vectorized computations are used in Stan to speed-up execution more than to improve readability. If repeated calls to Normal.Logp (line 17) become the bottleneck, an optimized vectorized version can be implemented in Go in the same package. While subjective, our feeling is that the Go version is as readable as the Stan one. In addition, a Go programmer is less likely to resist adoption of statistical modeling if the models are implemented in Go.

7.2. Linear Regression

The model in Figure 5 specifies linear regression on unidimensional data: an improper uniform prior is placed on the parameters , , and (lines 9–10), which are then conditioned on observations such that

(1)

(lines 12–17).

1type LRModel struct {
2  Data  [][]float64
3}
4
5func (m *LRModel) Observe(x []float64)
6    float64 {
7  ll := 0.
8
9  alpha, beta := x[0], x[1]
10  sigma := math.Exp(x[2])
11
12  for i := range m.Data {
13    ll += Normal.Logp(
14      m.Simulate(m.Data[i][0], alpha, beta),
15      sigma,
16      m.Data[i][1])
17  }
18  return ll
19}
20
21// Simulate simulates y based on x and
22// regression parameters.
23func (m *LRModel) Simulate(x, alpha, beta float64)
24    float64 {
25  y := alpha + beta*x
26  return y
27}
Figure 5. Linear regression: the simulator is re-used in both inference and prediction.

If only the maximum a posteriori is estimated, the parameter values will maximize their likelihood given the observations:

(2)

Once the values of and are estimated, they are used to predict for unseen values of

. Hence, the linear transformation

is computed in two different contexts: first, to compute the likelihood of the parameter values; then, to predict based on a given . This is an example of the simulation-inference pattern (Section 3): method Simulate simulates based on and is invoked both from Observe and then directly for prediction. Since both original and differentiated model are simultaneously available in the source code, can be called for prediction on the original model, without any differentiation overhead.

In this greatly simplified case study, the simulator is a one-line function. In a real-world scenario though, the simulator can be laborious to re-implement correctly and efficiently; for example, a simulator that executes a parameterized policy in an uncertain or dynamic domain, and depends on both parameter values and observations (van de Meent et al., 2016). The ability to re-use a single implementation both as a part of the probabilistic program, where the simulator must be augmented for inference (e.g. by automatic differentiation), and for prediction or decision making, where the simulator may be executed repeatedly and under time pressure, saves the effort otherwise wasted on rewriting the simulator and removes the need to maintain consistency between different implementations.

7.3. Gaussian Mixture Model

The Gaussian mixture model is a probabilistic model of a mixture of components, where each component comes from a Gaussian distribution. The model variant in Figure 

6 accepts a vector of single-dimensional observations and the number of components, and infers the distribution of parameters of each component in the mixture. Following Stan User’s Guide (Stan Development Team, 2018), component memberships are summed out to make the model differentiable.

1type GMModel struct {
2  Data  []float64 // observations
3  NComp int       // number of components
4}
5
6func (m *GMModel) Observe(x []float64) float64 {
7  ll := 0.0
8  // Impose a prior on component parameters
9  ll += Normal.Logps(0, 1, x...)
10
11  // Fetch the parameters
12  mu := make([]float64, m.NComp)
13  sigma := make([]float64, m.NComp)
14  for j := 0; j != m.NComp; j++ {
15    mu[j] = x[2*j]
16    sigma[j] = math.Exp(x[2*j+1])
17  }
18
19  // Compute log likelihood of the mixture
20  for i := 0; i != len(m.Data); i++ {
21    var l float64
22    for j := 0; j != m.NComp; j++ {
23      lj := Normal.Logp(mu[j], sigma[j],
24                        m.Data[i])
25      if j == 0 {
26        l = lj
27      } else {
28        l = logSumExp(l, lj)
29      }
30    }
31    ll += l
32  }
33  return ll
34}
35
36// logSumExp computes log(exp(x)+exp(y)) robustly.
37func LogSumExp(x, y float64) float64 {
38  z := x
39  if y > z {
40    z = y
41  }
42  return z + math.Log(math.Exp(x-z)+math.Exp(y-z))
43}
44
45// logSumExp gradient must be supplied.
46func init() {
47  ad.RegisterElemental(logSumExp,
48    func(_ float64, params ...float64)
49      []float64 {
50      z := math.Exp(params[1] - params[0])
51      t := 1 / (1 + z)
52      return []float64{t, t * z}
53    })
54}
Figure 6. Gaussian mixture model: an Infergo model with dynamic control structure and a user-defined elemental.

The model contains most of the steps commonly found in Infergo models. First, a prior on the model parameters is imposed (lines 8–9). Then, the parameter vector is destructured into a form convenient for formulation of conditioning (lines 11–17). Finally, the parameters are conditioned on the observations (lines 19–34).

The conditioning of parameters involves a nested loop over all observations and over all mixture components for each observation. Computing log-likelihood of the model parameters implies summing up likelihoods over components for each observation. In the log-domain, a trick called ‘log-sum-exp’ is required to avoid loss of precision. The trick is implemented as function logSumExp and called for each observation and component (line 28).

logSumExp could have been implemented as a model method and differentiated automatically. However, this function is called on each iteration of the inner loop. Hence, efficiency of inference in general is affected by an efficient implementation of logSumExp (this can also be confirmed by profiling). To save on the cost of a differentiated call and speed up gradient computation, logSumExp is implemented as an elemental (lines 36–43). The hand-coded gradient is supplied for logSumExp (lines 45–54) to make automatic differentiation work.

While still a very simple model compared to real-world Infergo models, this case study illustrates common parts and coding patterns of an elaborated model. The model code may be quite complicated algorithmically and manipulate Go data structures freely. When efficiency becomes an issue, parts of the model can be implemented at a lower level of abstraction, but nonetheless in the same programming language as the rest of the model.

8. Performance

model time, seconds
compilation execution
Infergo Turing Stan Infergo Turing Stan
Eight schools 0.50 - 50 0.60 2.8 0.12
Gaussian mixture model 0.50 - 54 32 14 4.9
Latent Dirichlet allocation 0.50 - 54 8.9 12 3.7
Table 1. Compilation and execution times for 1000 iterations of HMC with 10 leapfrog steps.

It is customary to include performance evaluation on benchmark problems in publications on new probabilistic programming frameworks (Paige and Wood, 2014; Wood et al., 2014; Tolpin et al., 2016; Ge et al., 2018). We provide here a comparison of running times of Infergo and two other probabilistic programming frameworks: Turing (Ge et al., 2018) and Stan (Carpenter et al., 2017). Turing is a Julia (Bezanson et al., 2014) library and DSL offering exploratory statistical modeling and inference composition. Stan is a compiled language and a library optimized for high performance.

We use three models of different size and structure for the comparison: two of the models, Eight schools and Gaussian mixture model, were introduced as a part of the case studies (Section 7); the code for the third model, Latent Dirichlet allocation (Blei et al., 2002), is included in Appendix A. The code and data for all models were originally included in Stan examples and translated for use with Turing and Infergo. Eight schools is the smallest model. One may expect that most of the time execution time is spent in executing the boilerplate code rather than the model. Gaussian mixture model is a more elaborated model which benefits greatly from vectorization. The greater the number of observations (1000 data points were used in the evaluation), the more significant is the benefit of vectorized computations. Latent Dirichlet allocation is, again, a relatively elaborated model, which is, however, not as easy to vectorize.

We report both compilation time (for Stan and Infergo) and execution time. Compilation time is an important characteristic of a probabilistic programming framework — statistical modeling often involves many cycles of model modifications and numerical experiments on subsampled data. The ability to modify and instantly re-run a model is crucial for exploratory data analysis and model design. Turing is based on Julia, which is a dynamic language with JIT compiler, and there is no separate compilation time. For Stan, the compilation time includes compiling the Stan model to C++, and then compiling the C++ source into executable code for inference. For Infergo, the compilation time consists of automatic differentiation of the model, and then building a command-line executable that calls an inference algorithm on the model.

Table 1 shows running time measurements on the models. The measurements were obtained on a 1.25GHz Intel(R) Core(TM) i5 CPU with 8GB of memory for 1000 iterations of Hamiltonian Monte Carlo with 10 leapfrog steps and averaged over 10 runs. In execution, Infergo is slower than Stan, which comes as little surprise — Stan is a mature highly optimized inference and optimization library. However, Infergo and Turing show similar execution times, despite Turing relying on Julia’s numerical libraries. The slowest relative execution time is on the Gaussian mixture model, where Stan is 6–7 times faster, apparently because Infergo model is not vectorized; the model could be made more efficient at the cost of verbosity and poorer readability, but we opted to use idiomatic Infergo models for the comparison. On the Latent Dirichlet allocation model, run on Stan example data with 25 documents and 262 word instances, Infergo is only 2.5 times slower than Stan, and faster than Turing, an illustration of the benefits of transformation-based differentiation and efficiency of Go.

Infergo programs compile very fast, in particular in comparison to Stan models. Both automatic differentiation and code generation combined take a fraction of a second. Compilation time of Stan models is dominated by compilation of the generated C++ code which takes close to a minute on our hardware. The ability to modify a model and then re-run inference almost instantly, and with reasonable performance, helps in exploratory model development with Infergo, which was conceived with server-side, unattended execution in mind, but fits the exploration and development stage just as well.

9. Conclusion

Building a production software system which employs probabilistic inference as an integral part of prediction and decision making algorithms is challenging. To address the challenges, we proposed guidelines for implementation of a deployable probabilistic programming facility. Based on the guidelines, we introduced Infergo, a probabilistic programming facility in Go, as a reference implementation in accordance with the guidelines. Infergo allows to program inference models in virtually unrestricted Go, and to perform inference on the model using either supplied or third-party algorithms. We demonstrated on case studies how Infergo overcomes the challenges by following the guidelines.

A probabilistic programming facility similar to Infergo can be added to most modern general-purpose programming languages, in particular those used for implementation of large-scale software systems, making probabilistic programming inference more accessible in large scale server-side applications. We described Infergo implementation choices and techniques in intent to encourage broader adoption of probabilistic programming in production software systems and ease implementation of similar facilities in other languages and software environments.

Infergo is an ongoing project. Development of Infergo is influenced both by feedback from the open-source Go community and by the needs of the software system in which Infergo was initially deployed, and which still supplies the most demanding use cases in terms of flexibility, scalability, and ease of integration and maintenance. Deploying probabilistic programming as a part of production software system has a mutual benefit of both improving the core algorithms of the system and of helping to shape and improve probabilistic modeling and inference, thus further advancing the field of probabilistic programming.

References

  • (1)
  • Annis et al. (2017) Jeffrey Annis, Brent J. Miller, and Thomas J. Palmeri. 2017. Bayesian inference with Stan: A tutorial on adding custom distributions. Behavior Research Methods 49, 3 (01 Jun 2017), 863–886.
  • Appel (2007) Andrew W. Appel. 2007. Compiling with Continuations. Cambridge University Press, New York, NY, USA.
  • Appel and Jim (1989) A. W. Appel and T. Jim. 1989. Continuation-passing, Closure-passing Style. In Proceedings of the 16th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (POPL ’89). ACM, New York, NY, USA, 293–302.
  • authors (2017) Gonum authors. 2017. Gonum numerical packages. http://gonum.org/.
  • Baydin et al. (2017) Atılım Günes Baydin, Barak A. Pearlmutter, Alexey Andreyevich Radul, and Jeffrey Mark Siskind. 2017. Automatic Differentiation in Machine Learning: A Survey. J. Mach. Learn. Res. 18, 1 (Jan. 2017), 5595–5637.
  • Bezanson et al. (2014) Jeff Bezanson, Alan Edelman, Stefan Karpinski, and Viral B. Shah. 2014. Julia: A Fresh Approach to Numerical Computing. CoRR abs/1411.1607 (2014). arXiv:1411.1607
  • Bingham et al. (2019) Eli Bingham, Jonathan P. Chen, Martin Jankowiak, Fritz Obermeyer, Neeraj Pradhan, Theofanis Karaletsos, Rohit Singh, Paul Szerlip, Paul Horsfall, and Noah D. Goodman. 2019. Pyro: Deep Universal Probabilistic Programming. Journal of Machine Learning Research 20, 28 (2019), 1–6.
  • Blei et al. (2002) David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2002. Latent Dirichlet Allocation. Journal of Machine Learning Research 3 (2002), 2003.
  • Bryant (2018) Avi Bryant. 2018. Rainier. https://github.com/stripe/rainier.
  • Carpenter et al. (2017) Bob Carpenter, Andrew Gelman, Matthew Hoffman, Daniel Lee, Ben Goodrich, Michael Betancourt, Marcus Brubaker, Jiqiang Guo, Peter Li, and Allen Riddell. 2017. Stan: A Probabilistic Programming Language. Journal of Statistical Software, Articles 76, 1 (2017), 1–32.
  • Ge et al. (2018) Hong Ge, Kai Xu, and Zoubin Ghahramani. 2018. Turing: Composable inference for probabilistic programming. In

    International Conference on Artificial Intelligence and Statistics, AISTATS 2018, 9-11 April 2018, Playa Blanca, Lanzarote, Canary Islands, Spain

    . 1682–1690.
  • Gelman et al. (2013) A. Gelman, J.B. Carlin, H.S. Stern, and D.B. Rubin. 2013. Bayesian Data Analysis, Third Edition. Taylor & Francis.
  • Goodman et al. (2008) Noah D. Goodman, Vikash K. Mansinghka, Daniel M. Roy, Keith Bonawitz, and Joshua B. Tenenbaum. 2008. Church: a language for generative models. In Proceedings of Uncertainty in Artificial Intelligence.
  • Goodman and Stuhlmüller (2014) N. D. Goodman and A. Stuhlmüller. 2014. The Design and Implementation of Probabilistic Programming Languages. http://dippl.org/ electronic; retrieved 2019/3/29.
  • Goodman et al. (2016) Noah D Goodman, Joshua B. Tenenbaum, and The ProbMods Contributors. 2016. Probabilistic Models of Cognition. http://probmods.org/v2. Accessed: 2019-4-29.
  • Gordon et al. (2014) Andrew D. Gordon, Thomas A. Henzinger, Aditya V. Nori, and Sriram K. Rajamani. 2014. Probabilistic Programming. In International Conference on Software Engineering (ICSE, FOSE track).
  • Gorinova et al. (2019) Maria I. Gorinova, Andrew D. Gordon, and Charles Sutton. 2019. Probabilistic Programming with Densities in SlicStan: Efficient, Flexible, and Deterministic. Proceedings of the ACM on Programming Languages 3, POPL, Article 35 (Jan. 2019), 30 pages.
  • Griewank and Walther (2008) Andreas Griewank and Andrea Walther. 2008. Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation (second ed.). Society for Industrial and Applied Mathematics, Philadelphia, PA, USA.
  • Hoffman and Gelman (2011) Matthew D. Hoffman and Andrew Gelman. 2011. The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo. arXiv:1111.4246
  • Innes (2018) Michael Innes. 2018. Don’t Unroll Adjoint: Differentiating SSA-Form Programs. arXiv:1810.07951
  • Innes et al. (2018) Michael Innes, Elliot Saba, Keno Fischer, Dhairya Gandhi, Marco Concetto Rudilosso, Neethu Mariya Joy, Tejan Karmali, Avik Pal, and Viral Shah. 2018. Fashionable Modelling with Flux. arXiv:1811.01457
  • Kingma and Ba (2015) Diederik P. Kingma and Jimmy Ba. 2015. Adam: A Method for Stochastic Optimization. In 3rd International Conference for Learning Representations, San Diego. arXiv:1412.6980
  • Kucukelbir et al. (2017) Alp Kucukelbir, Dustin Tran, Rajesh Ranganath, Andrew Gelman, and David M. Blei. 2017. Automatic Differentiation Variational Inference. J. Mach. Learn. Res. 18, 1 (Jan. 2017), 430–474.
  • Li et al. (2014) Mu Li, Tong Zhang, Yuqiang Chen, and Alexander J. Smola. 2014. Efficient Mini-batch Training for Stochastic Optimization. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’14). ACM, New York, NY, USA, 661–670.
  • Liu and Nocedal (1989) Dong C. Liu and Jorge Nocedal. 1989. On the limited memory BFGS method for large scale optimization. Mathematical Programming 45, 1 (01 Aug 1989), 503–528.
  • Mansinghka et al. (2014) Vikash K. Mansinghka, Daniel Selsam, and Yura N. Perov. 2014. Venture: a higher-order probabilistic programming platform with programmable inference. arXiv:1404.0099
  • Meyerovich and Rabkin (2012) Leo A. Meyerovich and Ariel S. Rabkin. 2012. Socio-PLT: Principles for Programming Language Adoption. In Proceedings of the ACM International Symposium on New Ideas, New Paradigms, and Reflections on Programming and Software (Onward! 2012). ACM, New York, NY, USA, 39–54.
  • Milch et al. (2007) Brian Milch, Bhaskara Marthi, Stuart Russell, David Sontag, Daniel L. Ong, and Andrey Kolobov. 2007. BLOG: Probabilistic Models with Unknown Objects. In Statistical Relational Learning, Lise Getoor and Ben Taskar (Eds.). MIT Press.
  • Minka et al. (2010) T Minka, J Winn, J Guiver, and D Knowles. 2010. Infer .NET 2.4, Microsoft Research Cambridge.
  • Neal (2012) Radford M. Neal. 2012. MCMC using Hamiltonian dynamics. Published as Chapter 5 of the Handbook of Markov Chain Monte Carlo, 2011. arXiv:1206.1901
  • Odersky (2006) Martin Odersky. 2006. The Scala Experiment: Can We Provide Better Language Support for Component Systems?. In Conference Record of the 33rd ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (POPL ’06). ACM, New York, NY, USA, 166–167.
  • Paige and Wood (2014) Brooks Paige and Frank Wood. 2014. A compilation target for probabilistic programming languages. In Proceedings of The 31st International Conference on Machine Learning. 1935–1943.
  • Paige et al. (2014) B. Paige, F. Wood, A. Doucet, and Y.W. Teh. 2014. Asynchronous Anytime Sequential Monte Carlo. In Advances in Neural Information Processing Systems.
  • Paszke et al. (2017) Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017.

    Automatic differentiation in PyTorch. In

    NIPS Autodiff Workshop.
  • Perov (2016) Yura N. Perov. 2016. Applications of Probabilistic Programming (Master’s thesis, 2015). CoRR abs/1606.00075 (2016).
  • Pfeffer (2009) Avi Pfeffer. 2009. Figaro: An object-oriented probabilistic programming language. Charles River Analytics Technical Report 137 (2009), 96.
  • Rainforth (2017) Tom Rainforth. 2017. Automating Inference, Learning, and Design using Probabilistic Programming. Ph.D. Dissertation.
  • Rainforth et al. (2016) Tom Rainforth, Christian A Naesseth, Fredrik Lindsten, Brooks Paige, Jan-Willem van de Meent, Arnaud Doucet, and Frank Wood. 2016.

    Interacting Particle Markov Chain Monte Carlo. In

    Proceedings of the 33rd International Conference on Machine Learning (JMLR: W&CP), Vol. 48.
  • Rossum (1995) Guido Rossum. 1995. Python Reference Manual. Technical Report. Amsterdam, The Netherlands, The Netherlands.
  • Ruder (2016) Sebastian Ruder. 2016. An overview of gradient descent optimization algorithms. arXiv:1609.04747
  • Salvatier et al. (2016) John Salvatier, Thomas V. Wiecki, and Christopher Fonnesbeck. 2016. Probabilistic programming in Python using PyMC3. PeerJ Computer Science 2 (apr 2016), e55.
  • Scibior et al. (2015) Adam Scibior, Zoubin Ghahramani, and Andrew D. Gordon. 2015. Practical probabilistic programming with monads. In Proceedings of the 8th ACM SIGPLAN Symposium on Haskell, Haskell 2015, Vancouver, BC, Canada, September 3-4, 2015. 165–176.
  • Sennesh et al. (2018) Eli Sennesh, Adam Scibior, Hao Wu, and Jan-Willem van de Meent. 2018. Composing Modeling and Inference Operations with Probabilistic Program Combinators. CoRR abs/1811.05965 (2018). arXiv:1811.05965 http://arxiv.org/abs/1811.05965
  • Stan Development Team (2014) Stan Development Team. 2014. Stan: A C++ Library for Probability and Sampling, Version 2.4. (2014).
  • Stan Development Team (2018) Stan Development Team. 2018. Stan Modeling Language User’s Guide and Reference Manual, Version 2.18.0. http://mc-stan.org/
  • Staton et al. (2016a) S. Staton, H. Yang, C. Heunen, O. Kammar, and F. Wood. 2016a. Semantics for probabilistic programming: higher-order functions, continuous distributions, and soft constraints. In Thirty-First Annual ACM/IEEE Symposium on Logic In Computer Science.
  • Staton et al. (2016b) Sam Staton, Hongseok Yang, Frank Wood, Chris Heunen, and Ohad Kammar. 2016b. Semantics for Probabilistic Programming: Higher-order Functions, Continuous Distributions, and Soft Constraints. In Proceedings of the 31st Annual ACM/IEEE Symposium on Logic in Computer Science (LICS ’16). ACM, New York, NY, USA, 525–534. https://doi.org/10.1145/2933575.2935313
  • Stefik and Hanenberg (2014) Andreas Stefik and Stefan Hanenberg. 2014. The Programming Language Wars: Questions and Responsibilities for the Programming Language Community. In Proceedings of the 2014 ACM International Symposium on New Ideas, New Paradigms, and Reflections on Programming & Software (Onward! 2014). ACM, New York, NY, USA, 283–299.
  • Sun et al. (2015) Q. Sun, A. Laddha, and D. Batra. 2015. Active learning for structured probabilistic models with histogram approximation. In

    IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

    . 3612–3621.
  • team (2009) The Go team. 2009. The Go programming language. http://golang.org/.
  • Tolpin et al. (2016) David Tolpin, Jan-Willem van de Meent, Hongseok Yang, and Frank Wood. 2016. Design and Implementation of Probabilistic Programming Language Anglican. In Proceedings of the 28th Symposium on the Implementation and Application of Functional Programming Languages (IFL 2016). ACM, New York, NY, USA, Article 6, 12 pages.
  • Tran et al. (2017) Dustin Tran, Matthew D. Hoffman, Rif A. Saurous, Eugene Brevdo, Kevin Murphy, and David M. Blei. 2017. Deep probabilistic programming. In International Conference on Learning Representations.
  • van de Meent et al. (2016) Jan-Willem van de Meent, Brooks Paige, David Tolpin, and Frank Wood. 2016. Black-Box Policy Search with Probabilistic Programs. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, AISTATS 2016, Cadiz, Spain, May 9-11, 2016. 1195–1204.
  • van de Meent et al. (2018) Jan-Willem van de Meent, Brooks Paige, Hongseok Yang, and Frank Wood. 2018. An Introduction to Probabilistic Programming. arXiv:1809.10756
  • van de Meent et al. (2015) Jan-Willem van de Meent, Hongseok Yang, Vikash Mansinghka, and Frank Wood. 2015. Particle Gibbs with Ancestor Sampling for Probabilistic Programs. In Artificial Intelligence and Statistics. arXiv:1501.06769
  • van Merrienboer et al. (2018a) Bart van Merrienboer, Olivier Breuleux, Arnaud Bergeron, and Pascal Lamblin. 2018a. Automatic differentiation in ML: Where we are and where we should be going. In Advances in Neural Information Processing Systems 31, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Eds.). Curran Associates, Inc., 8757–8767.
  • van Merrienboer et al. (2018b) Bart van Merrienboer, Dan Moldovan, and Alexander Wiltschko. 2018b. Tangent: Automatic differentiation using source-code transformation for dynamically typed array programming. In Advances in Neural Information Processing Systems 31, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Eds.). Curran Associates, Inc., 6256–6265.
  • Wikipedia (2019) Wikipedia. 2019. Probabilistic programming language — Wikipedia, The Free Encyclopedia. [Online; accessed 16-April-2019].
  • Wingate et al. (2011) David Wingate, Andreas Stuhlmüller, and Noah D. Goodman. 2011. Lightweight Implementations of Probabilistic Programming Languages Via Transformational Compilation. In Proceedings of the 14th Artificial Intelligence and Statistics.
  • Wingate and Weber (2013) David Wingate and Theophane Weber. 2013. Automated variational inference in probabilistic programming. arXiv:1301.1299
  • Winn et al. (2019) John Winn, Christopher M. Bishop, Thomas Diethe, and Yordan Zaykov. 2019. Model Based Machine Learning. http://mbmlbook.com/ electronic; retrieved 2019/4/21.
  • Wood et al. (2014) Frank Wood, Jan-Willem van de Meent, and Vikash Mansinghka. 2014. A New Approach to Probabilistic Programming Inference. In Artificial Intelligence and Statistics.
  • Yang et al. (2014) Lingfeng Yang, Pat Hanrahan, and Noah D Goodman. 2014. Generating Efficient MCMC Kernels from Probabilistic Programs. In Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics. 1068–1076.

Appendix A Latent Dirichlet Allocation Model

Infergo

1type LDAModel struct {
2  K     int       // num topics
3  V     int       // num words
4  M     int       // num docs
5  N     int       // total word instances
6  Word  []int     // word n
7  Doc   []int     // doc for word n
8  Alpha []float64 // topic prior
9  Beta  []float64 // word prior
10}
11
12func (m *LDAModel) Observe(x []float64) float64 {
13  ll := 0.0
14
15  // Regularize the parameter vector
16  ll += Normal.Logps(0, 1, x...)
17  // Destructure parameters
18  theta := make([][]float64, m.M)
19  m.FetchSimplices(&x, m.K, theta)
20  phi := make([][]float64, m.K)
21  m.FetchSimplices(&x, m.V, phi)
22
23  // Impose priors
24  ll += Dirichlet{m.K}.Logps(m.Alpha, theta...)
25  ll += Dirichlet{m.V}.Logps(m.Beta, phi...)
26
27  // Condition on observations
28  gamma := make([]float64, m.K)
29  for in := 0; in != m.N; in++ {
30    for ik := 0; ik != m.K; ik++ {
31      gamma[ik] = math.Log(theta[m.Doc[in]-1][ik]) +
32        math.Log(phi[ik][m.Word[in]-1])
33    }
34    ll += D.LogSumExp(gamma)
35  }
36  return ll
37}
38
39func (m *Model) FetchSimplices(
40    px *[]float64,
41    k int,
42    simplices [][]float64,
43) {
44    for i := range simplices {
45        simplices[i] = make([]float64, k)
46        D.SoftMax(model.Shift(px, k), simplices[i])
47    }
48}

Stan

data {
  int<lower=2> K;               // num topics
  int<lower=2> V;               // num words
  int<lower=1> M;               // num docs
  int<lower=1> N;           // total word instances
  int<lower=1,upper=V> w[N];    // word n
  int<lower=1,upper=M> doc[N];  // doc for word n
  vector<lower=0>[K] alpha;     // topic prior
  vector<lower=0>[V] beta;      // word prior
}
parameters {
  simplex[K] theta[M]; // topic dist for doc m
  simplex[V] phi[K];   // word dist for topic k
}
model {
  for (m in 1:M)
    theta[m] ~ dirichlet(alpha);  // prior
  for (k in 1:K)
    phi[k] ~ dirichlet(beta);     // prior
  for (n in 1:N) {
    real gamma[K];
    for (k in 1:K)
      gamma[k] <- log(theta[doc[n],k])
        + log(phi[k,w[n]]);
    increment_log_prob(log_sum_exp(gamma));
  }
}