Log In Sign Up

Deep Probabilistic Programming Languages: A Qualitative Study

by   Guillaume Baudart, et al.

Deep probabilistic programming languages try to combine the advantages of deep learning with those of probabilistic programming languages. If successful, this would be a big step forward in machine learning and programming languages. Unfortunately, as of now, this new crop of languages is hard to use and understand. This paper addresses this problem directly by explaining deep probabilistic programming languages and indirectly by characterizing their current strengths and weaknesses.


page 1

page 2

page 3

page 4


FSE/CACM Rebuttal^2: Correcting A Large-Scale Study of Programming Languages and Code Quality in GitHub

Ray, Devanbu and Filkov issued a rebuttal of our TOPLAS paper "On the Im...

PRoMoTo 2013 proceedings

Programming for Mobile and Touch (PRoMoTo'13) was held at the 2013 ACM S...

On the Evolution of Programming Languages

This paper attempts to connects the evolution of computer languages with...

Linguistic Relativity and Programming Languages

The use of programming languages can wax and wane across the decades. We...

A Survey of Machine Learning for Big Code and Naturalness

Research at the intersection of machine learning, programming languages,...

Deployable probabilistic programming

We propose design guidelines for a probabilistic programming facility su...

Extending Stan for Deep Probabilistic Programming

Deep probabilistic programming combines deep neural networks (for automa...

1. Introduction

A deep probabilistic programming language (PPL) is a language for specifying both deep neural networks and probabilistic models. In other words, a deep PPL draws upon programming languages, Bayesian statistics, and deep learning to ease the development of powerful machine-learning applications.

For decades, scientists have developed probabilistic models in various fields of exploration without the benefit of either dedicated programming languages or deep neural networks (Ghahramani, 2015)

. But since these models involve Bayesian inference with often intractable integrals, they sap the productivity of experts and are beyond the reach of non-experts. PPLs address this issue by letting users express a probabilistic model as a program 

(Gordon et al., 2014)

. The program specifies how to generate output data by sampling latent probability distributions. The compiler checks this program for type errors and translates it to a form suitable for an inference procedure, which uses observed output data to fit the latent distributions. Probabilistic models show great promise: they overtly represent uncertainty 

(Bornholt et al., 2014) and they have been demonstrated to enable explainable machine learning even in the important but difficult case of small training data (Lake et al., 2015; Rezende et al., 2016; Siddharth et al., 2017).

Over the last few years, machine learning with deep neural networks (deep learning, DL) has become enormously popular. This is because in several domains, DL solves what was previously a vexing problem (Domingos, 2012), namely manual feature engineering. Each layer of a neural network can be viewed as learning increasingly higher-level features. In other words, the essence of DL is automatic hierarchical representation learning (LeCun et al., 2015). Hence, DL powered recent breakthrough results in accurate supervised large-data tasks such as image recognition (Krizhevsky et al., 2012) and natural language translation (Wu et al., 2016)

. Today, most DL is based on frameworks that are well-supported, efficient, and expressive, such as TensorFlow 

(Abadi et al., 2016)

and PyTorch 

(Facebook, 2016)

. These frameworks provide automatic differentiation (users need not manually calculate gradients for gradient descent), GPU support (to efficiently execute vectorized computations), and Python-based embedded domain-specific languages 

(Hudak, 1998).

Deep PPLs, which have emerged just recently (Uber, 2017; Siddharth et al., 2017; Tran et al., 2017; Salvatier et al., 2016)

, aim to combine the benefits of PPLs and DL. Ideally, programs in deep PPLs would overtly represent uncertainty, yield explainable models, and require only a small amount of training data; be easy to write in a well-designed programming language; and match the breakthrough accuracy and fast training times of DL. Realizing all of these promises would yield tremendous advantages. Unfortunately, this is hard to achieve. Some of the strengths of PPLs and DL are seemingly at odds, such as explainability vs. automated feature engineering, or learning from small data vs. optimizing for large data. Furthermore, the barrier to entry for work in deep PPLs is high, since it requires non-trivial background in fields as diverse as statistics, programming languages, and deep learning. To tackle this problem, this paper characterizes deep PPLs, thus lowering the barrier to entry, providing a programming-languages perspective early when it can make a difference, and shining a light on gaps that the community should try to address.

This paper uses the Stan PPL as a representative of the state of the art in regular (not deep) PPLs (Carpenter et al., 2017). Stan is a main-stream, mature, and widely-used PPL: it is maintained by a large group of developers, has a yearly StanCon conference, and has an active forum. Stan is Turing complete and has its own stand-alone syntax and semantics, but provides bindings for several languages including Python.

Most importantly, this paper uses Edward (Tran et al., 2017) and Pyro (Uber, 2017) as representatives of the state of the art in deep PPLs. Edward is based on TensorFlow and Pyro is based on PyTorch. Edward was first released in mid-2016 and has a single main maintainer, who is focusing on a new version. Pyro is a much newer framework (released late 2017), but seems to be very responsive to community questions.

This paper characterizes deep PPLs by explaining them (Sections 2, 3, and 4), comparing them to each other and to regular PPLs and DL frameworks (Section 5), and envisioning next steps (Section 6). Additionally, the paper serves as a comparative tutorial to both Edward and Pyro. To this end, it presents examples of increasing complexity written in both languages, using deliberately uniform terminology and presentation style. By writing this paper, we hope to help the research community contribute to the exciting new field of deep PPLs, and ultimately, combine the strengths of both DL and PPLs.

2. Probabilistic Model Example

This section explains PPLs using an example that is probabilistic but not deep. The example, adapted from Section 9.1 of (Barber, 2012), is about learning the bias of a coin. We picked this example because it is simple, lets us introduce basic concepts, and shows how different PPLs represent these concepts.

We write if the result of the th coin toss is head and if it is tail. We assume that individual coin tosses are independent and identically distributed (IID) and that each toss follows a Bernoulli distribution with parameter : and . The latent (i.e., unobserved) variable is the bias of the coin. The task is to infer given the results of previously observed coin tosses, that is, . Figure 1 shows the corresponding graphical model. The model is generative: once the distribution of the latent variable has been inferred, one can draw samples to generate data points similar to the observed data.

Figure 1.

Graphical model for biased coin tosses. Circles represent random variables. The white circle for 

indicates that it is latent and the gray circle for  indicates that it is observed. The arrow represents dependency. The rounded rectangle is a plate, representing  distributions that are IID.

We now present this simple example in Stan, Edward, and Pyro. In all these languages we follow a Bayesian approach: the programmer first defines a probabilistic model of the problem. Assumptions are encoded with prior distributions over the variables of the model. Then the programmer launches an inference procedure to automatically compute the posterior distributions of the parameters of the model based on observed data. In other words, inference adjusts the prior distribution using the observed data to give a more precise model. Compared to other machine-learning models such as deep neural networks, the result of a probabilistic program is a probability distribution, which allows the programmer to explicitly visualize and manipulate the uncertainty associated with a result. This overt uncertainty is an advantage of PPLs. Furthermore, a probabilistic model has the advantage that it directly describes the corresponding world based on the programmer’s knowledge. Such descriptive models are more explainable than deep neural networks, whose representation is big and does not overtly resemble the world they model.

1# Model 2coin_code = ””” 3   data { 4     int<lower=0,upper=1> x[10]; 5   } 6   parameters { 7     real<lower=0,upper=1> theta; 8   } 9   model { 10     theta ~ uniform(0,1); 11     for (i in 1:10) 12        x[i] ~ bernoulli(theta); 13   }””” 14# Data 15data = {’x’: [0, 1, 0, 0, 0, 0, 0, 0, 0, 1]} 16# Inference 17fit = pystan.stan(model_code=coin_code, 18                            data=data, iter=1000) 19# Results 20samples = fit.extract()[’theta’] 21print(”Posterior mean:”, np.mean(samples)) 22print(”Posterior stddev:”, np.std(samples)) 1# Model 2theta = Uniform(0.0, 1.0) 3x = Bernoulli(probs=theta, sample_shape=10) 4# Data 5data = np.array([0, 1, 0, 0, 0, 0, 0, 0, 0, 1]) 6# Inference 7qtheta = Empirical( 8                   tf.Variable(tf.ones(1000) * 0.5)) 9inference = ed.HMC({theta: qtheta}, 10                                   data={x: data}) 12# Results 13mean, stddev = ed.get_session().run( 14                    [qtheta.mean(),qtheta.stddev()]) 15print(”Posterior mean:”, mean) 16print(”Posterior stddev:”, stddev) 1# Model 2def coin(): 3   theta = pyro.sample(”theta”, Uniform( 4                                      Variable(torch.Tensor([0])), 5                                      Variable(torch.Tensor([1]))) 6   pyro.sample(”x”, Bernoulli( 7                                  theta * Variable(torch.ones(10))) 8# Data 9data = {”x”: Variable(torch.Tensor( 10                                       [0, 1, 0, 0, 0, 0, 0, 0, 0, 1]))} 11# Inference 12cond = pyro.condition(coin, data=data) 13sampler = pyro.infer.Importance(cond, 14                                                      num_samples=1000) 15post = pyro.infer.Marginal(sampler, sites=[”theta”]) 16# Result 17samples = [post()[”theta”].data[0] for _ in range(1000)] 18print(”Posterior mean:”, np.mean(samples)) 19print(”Posterior stddev:”, np.std(samples))
(a) Stan (b) Edward (c) Pyro
Figure 2. Probabilistic model for learning the bias of a coin.

Figure 2(a) solves the biased-coin task in Stan, a well-established PPL (Carpenter et al., 2017). This example uses PyStan, the Python interface for Stan. Lines 2-13 contain code in Stan syntax in a multi-line Python string. The data block introduces observed random variables, which are placeholders for concrete data to be provided as input to the inference procedure, whereas the parameters block introduces latent random variables, which will not be provided and must be inferred. Line 4 declares as a vector of ten discrete random variables, constrained to only take on values from a finite set, in this case, either 0 or 1. Line 7 declares as a continuous random variable, which can take on values from an infinite set, in this case, real numbers between 0 and 1. Stan uses the tilde operator (~) for sampling. Line 10 samples from a uniform distribution (same probability for all values) between 0 and 1. Since is a latent variable, this distribution is merely a prior belief, which inference will replace by a posterior distribution. Line 12 samples the

from a Bernoulli distribution with parameter 

. Since the are observed variables, this sampling is really a check for how well the model corresponds to data provided at inference time. One can think of sampling an observed variable like an assertion in verification (Gordon et al., 2014). Line 15 specifies the data and Lines 17-18 run the inference using the model and the data. By default, Stan uses a form of Monte-Carlo sampling for inference (Carpenter et al., 2017)

. Lines 20-22 extract and print the mean and standard deviation of the posterior distribution of 


Figure 2(b) solves the same task in Edward (Tran et al., 2017). Line 2 samples  from the prior distribution, and Line 3 samples a vector of random variables from a Bernoulli distribution of parameter , one for each coin toss. Line 5 specifies the data. Lines 7-8 define a placeholder that will be used by the inference procedure to compute the posterior distribution of . The shape and size of the placeholder depends on the inference procedure. Here we use the Hamiltonian Monte-Carlo inference HMC, the posterior distribution is thus computed based on a set of random samples and follows an empirical distribution. The size of the placeholder corresponds to the number of random samples computed during inference. Lines 9-11 launch the inference. The inference takes as parameter the prior:posterior pair theta:qtheta and links the data to the variable . Lines 13-16 extract and print the mean and standard deviation of the posterior distribution of .

Figure 2(c) solves the same task in Pyro (Uber, 2017). Lines 2-7 define the model as a function coin. Lines 3-5 sample  from the prior distribution, and Lines 6-7 sample a vector of random variable  from a Bernoulli distribution of parameter . Pyro stores random variables in a dictionary keyed by the first argument of the function pyro.sample. Lines 9-10 define the data as a dictionary. Line 12 conditions the model using the data by matching the value of the data dictionary with the random variables defined in the model. Lines 13-15 apply inference to this conditioned model, using importance sampling. Compared to Stan and Edward, we first define a conditioned model with the observed data before running the inference instead of passing the data as an argument to the inference. The inference returns a probabilistic model, post, that can be sampled to extract the mean and standard deviation of the posterior distribution of  in Lines 17-19.

The deep PPLs Edward and Pyro are built on top of two popular deep learning frameworks TensorFlow (Abadi et al., 2016) and PyTorch (Facebook, 2016). They benefit from efficient computations over large datasets, automatic differentiation, and optimizers provided by these frameworks that can be used to efficiently implement inference procedures. As we will see in the next sections, this design choice also reduces the gap between DL and probabilistic models, allowing the programmer to combine the two. On the other hand, this choice leads to piling up abstractions (Edward/TensorFlow/Numpy/Python or Pyro/PyTorch/Numpy/Python) that can complicate the code. We defer a discussion of these towers of abstraction to Section 5.

1# Inference Guide 2qalpha = tf.Variable(1.0) 3qbeta = tf.Variable(1.0) 4qtheta = Beta(qalpha, qbeta) 5# Inference 6inference = ed.KLqp({theta: qtheta}, {x: data}) 1# Inference Guide 2def guide(): 3    qalpha = pyro.param(”qalpha”, Variable(torch.Tensor([1.0]), requires_grad=True)) 4    qbeta = pyro.param(”qbeta”, Variable(torch.Tensor([1.0]), requires_grad=True)) 5    pyro.sample(”theta”, Beta(qalpha, qbeta)) 6# Inference 7svi = SVI(cond, guide, Adam({}), loss=”ELBO”, num_particles=7) 8for step in range(1000): 9    svi.step()
(a) Edward (b) Pyro
Figure 3. Variational Inference for learning the bias of a coin.

Variational Inference

Inference, for Bayesian models, computes the posterior distribution of the latent parameters given a set of observations , that is, . For complex models, computing the exact posterior distribution can be costly or even intractable. Variational inference turns the inference problem into an optimization problem and tends to be faster and more adapted to large datasets than sampling-based Monte-Carlo methods (Blei et al., 2017).

Variational infence tries to find the member  of a family  of simpler distribution over the latent variables that is the closest to the true posterior . The fitness of a candidate is measured using the Kullback-Leibler (KL) divergence from the true posterior, a similarity measure between probability distributions.

It is up to the programmer to choose a family of candidates, or guides, that is sufficiently expressive to capture a close approximation of the true posterior, but simple enough to make the optimization problem tractable.

Both Edward and Pyro support variational inference. Figure 3 shows how to adapt Figure 2 to use it. In Edward (Figure 3

(a)), the programmer defines the family of guides by changing the shape of the placeholder used in the inference. Lines 2-4 use a beta distribution with unknown parameters 


that will be computed during inference. Lines 6-7 do variational inference using the Kullback-Leibler divergence. In Pyro (Figure 

3(b)), this is done by defining a guide function. Lines 2-5 also define a beta distribution with parameters  and . Lines 7-9 do inference using Stochastic Variational Inference, an optimized algorithm for variational inference. Both Edward and Pyro rely on the underlying framework to solve the optimization problem. Probabilistic inference thus closely follows the scheme used for training procedures of DL models.

This section gave a high-level introduction to PPLs and introduced basic concepts (generative models, sampling, prior and posterior, latent and observed, discrete and continuous). Next, we turn our attention to deep learning.

3. Probabilistic Models in DL

Figure 4.

Multi-layer perceptron (MLP) for classifying images. Circles and squares are probabilistic and non-probabilistic variables. Black rectangles are pure functions. Arrows represent dependencies and forward data flow.

1batch_size, nx, nh, ny = 128, 28 * 28, 1024, 10 2# Model 3x = tf.placeholder(tf.float32, [batch_size, nx]) 4l = tf.placeholder(tf.int32, [batch_size]) 5def mlp(theta, x): 6    h = tf.nn.relu(tf.matmul(x, theta[”Wh”]) + theta[”bh”]) 7    yhat = tf.matmul(h, theta[”Wy”]) + theta[”by”] 8    log_pi = tf.nn.log_softmax(yhat) 9    return log_pi 10theta = { 11    ’Wh’: Normal(loc=tf.zeros([nx, nh]), scale=tf.ones([nx, nh])), 12    ’bh’: Normal(loc=tf.zeros(nh), scale=tf.ones(nh)), 13    ’Wy’: Normal(loc=tf.zeros([nh, ny]), scale=tf.ones([nh, ny])), 14    ’by’: Normal(loc=tf.zeros(ny), scale=tf.ones(ny)) } 15lhat = Categorical(logits=mlp(theta, x)) 16# Inference Guide 17def vr(*shape): 18    return tf.Variable(tf.random_normal(shape)) 19qtheta = { 20    ’Wh’: Normal(loc=vr(nx, nh), scale=tf.nn.softplus(vr(nx, nh))), 21    ’bh’: Normal(loc=vr(nh), scale=tf.nn.softplus(vr(nh))), 22    ’Wy’: Normal(loc=vr(nh, ny), scale=tf.nn.softplus(vr(nh, ny))), 23    ’by’: Normal(loc=vr(ny), scale=tf.nn.softplus(vr(ny))) } 24# Inference 25inference = ed.KLqp({ theta[”Wh”]: qtheta[”Wh”], 26                                    theta[”bh”]: qtheta[”bh”], 27                                    theta[”Wy”]: qtheta[”Wy”], 28                                    theta[”by”]: qtheta[”by”] }, 29                                  data={lhat: l}) 1# Model 2class MLP(nn.Module): 3    def __init__(self): 4        super(MLP, self).__init__() 5        self.lh = torch.nn.Linear(nx, nh) 6 = torch.nn.Linear(nh, ny) 7    def forward(self, x): 8        h = F.relu(self.lh(x.view((-1, nx)))) 9        log_pi = F.log_softmax(, dim=-1) 10        return log_pi 11mlp = MLP() 12def v0s(*shape): return Variable(torch.zeros(*shape)) 13def v1s(*shape): return Variable(torch.ones(*shape)) 14def model(x, l): 15    theta = { ’lh.weight’: Normal(v0s(nh, nx), v1s(nh, nx)), 16                   ’lh.bias’: Normal(v0s(nh), v1s(nh)), 17                   ’ly.weight’: Normal(v0s(ny, nh), v1s(ny, nh)), 18                   ’ly.bias’: Normal(v0s(ny), v1s(ny)) } 19    lifted_mlp = pyro.random_module(”mlp”, mlp, theta)() 20    pyro.observe(”obs”, Categorical(logits=lifted_mlp(x)), one_hot(l)) 21# Inference Guide 22def vr(name, *shape): 23    return pyro.param(name, 24               Variable(torch.randn(*shape), requires_grad=True)) 25def guide(x, l): 26    qtheta = { 27       ’lh.weight’: Normal(vr(”Wh_m”, nh, nx), F.softplus(vr(”Wh_s”, nh, nx))), 28       ’lh.bias’: Normal(vr(”bh_m”, nh), F.softplus(vr(”bh_s”, nh))), 29       ’ly.weight’: Normal(vr(”Wy_m”, ny, nh), F.softplus(vr(”Wy_s”, ny, nh))), 30       ’ly.bias’: Normal(vr(”by_m”, ny), F.softplus(vr(”by_s”, ny))) } 31    return pyro.random_module(”mlp”, mlp, qtheta)() 32# Inference 33inference = SVI(model, guide, Adam({”lr”: 0.01}), loss=”ELBO”)
(a) Edward (b) Pyro
Figure 5.

Probabilistic multilayer perceptron for classifying images.

This section explains DL using an example of a deep neural network and shows how to make that probabilistic. The task is multiclass classification: given input features , e.g., an image of a handwritten digit (LeCun et al., 1998) comprising pixels, predict a label , e.g., one of digits. Before we explain how to solve this task using DL, let us clarify some terminology. In cases where DL uses different terminology from PPLs, this paper favors the PPL terminology. So we say inference for what DL calls training; predicting for what DL calls inferencing; and observed variables for what DL calls training data. For instance, the observed variables for inference in the classifier task are the image  and label .

Among the many neural network architectures suitable for this task, we chose a simple one: a multi-layer perceptron (MLP (Rumelhart et al., 1986)). We start with the non-probabilistic version. Figure 4(a) shows an MLP with a 2-feature input layer, a 3-feature hidden layer, and a 2-feature output layer; of course, this generalizes to wider (more features) and deeper (more layers) MLPs. From left to right, there is a fully-connected subnet where each input feature  contributes to each hidden feature , multiplied with a weight and offset with a bias . The weights and biases are latent variables. Treating the input, biases, and weights as vectors and a matrix lets us cast this subnet in linear algebra

, which can run efficiently via vector instructions on CPUs or GPUs. Next, a rectified linear unit

computes the hidden feature vector . The ReLU lets the MLP discriminate input spaces that are not linearly separable. The hidden layer exhibits both the advantage and disadvantage of deep learning: automatically learned features that need not be hand-engineered but would require non-trivial reverse engineering to explain in real-world terms. Next, another fully-connected subnet computes the output layer . Then, the softmax computes a vector  of scores that add up to one. The higher the value of , the more strongly the classifier predicts label . Using the output of the MLP, the argmax extracts the label  with highest score.

Traditional methods to train such a neural network incrementally update the latent parameters of the network to optimize a loss function via gradient descent 

(Rumelhart et al., 1986). In the case of hand-written digits, the loss function is a distance metrics between observed and predicted labels. The variables and computations are non-probabilistic, in that they use concrete values rather than probability distributions.

Deep PPLs can express Bayesian neural networks with probabilistic weights and biases (Blundell et al., 2015). One way to visualize this is by replacing rectangles with circles for latent variables in Figure 4(a) to indicate that they hold probability distributions. Figure 4(b) shows the corresponding graphical model, where the latent variable  denotes all the parameters of the MLP: , , , .

Bayesian inference starts from prior beliefs about the parameters and learns distributions that fit observed data (such as, images and labels). We can then sample concrete weights and biases to obtain a concrete MLP. In fact, we do not need to stop at a single MLP: we can sample an ensemble of as many MLPs as we like. Then, we can feed a concrete image to all the sampled MLPs to get their predictions, followed by a vote.

Figure 5(a) shows the probabilistic MLP example in Edward. Lines 3-4 are placeholders for observed variables (i.e., batches of images  and labels ). Lines 5-9 defines the MLP parameterized by , a dictionary containing all the network parameters. Lines 10-14 sample the parameters from the prior distributions. Line 15 defines the output of the network: a categorical distribution over all possible label values parameterized by the output of the MLP. Line 17-23 define the guides for the latent variables, initialized with random numbers. Later, the variational inference will update these during optimization, so they will ultimately hold an approximation of the posterior distribution after inference. Lines 25-29 set up the inference with one prior:posterior pair for each parameter of the network and link the output of the network to the observed data.

Figure 5(b) shows the same example in Pyro. Lines 2-11 contain the basic neural network, where torch.nn.Linear wraps the low-level linear algebra. Lines 3-6 declare the structure of the net, that is, the type and dimension of each layer. Lines 7-10 combine the layers to define the network. It is possible to use equivalent high-level TensorFlow APIs for this in Edward as well, but we refrained from doing so to illustrate the transition of the parameters to random variables. Lines 14-20 define the model. Lines 15-18 sample priors for the parameters, associating them with object properties created by torch.nn.Linear (i.e., the weight and bias of each layer). Line 19 lifts the MLP definition from concrete to probabilistic. We thus obtain a MLP where all parameters are treated as random variables. Line 20 conditions the model using a categorical distribution over all possible label values. Lines 26-31 define the guide for the latent variables, initialized with random numbers, just like in the Edward version. Line 33 sets up the inference.

1def predict(x): 2  theta_samples = [ { Wh”: qtheta[”Wh”].sample(), bh”: qtheta[”bh”].sample(), 3                                   Wy”: qtheta[”Wy”].sample(), by”: qtheta[”by”].sample() } 4                                   for _ in range(args.num_samples) ] 5  yhats = [ mlp(theta_samp, x).eval() 6                  for theta_samp in theta_samples ] 7  mean = np.mean(yhats, axis=0) 8  return np.argmax(mean, axis=1) 1def predict(x): 2  sampled_models = [ guide(None) 3                                    for _ in range (args.num_samples) ] 4  yhats = [ model(Variable(x)).data 5                  for model in sampled_models ] 6  mean = torch.mean(torch.stack(yhats), 0) 7  return np.argmax(mean, axis=1)
(a) Edward (b) Pyro
Figure 6. Predictions by the probabilistic multilayer perceptron.

After the inference, Figure 6 shows how to use the posterior distribution of the MLP parameters to classify unknown data. In Edward (Figure 6(a)), Lines 2-4 draw several samples of the parameters from the posterior distribution. Then, Lines 5-6 execute the MLP with each concrete model. Line 7 computes the score of a label as the average of the scores returned by the MLPs. Finally, Line 8 predicts the label with the highest score. In Pyro (Figure 6(b)), the prediction is done similarly but we obtain multiple versions of the MLP by sampling the guide (Line 2-3), not the parameters.

This section showed how to use probabilistic variables as building blocks for a DL model. Compared to non-probabilistic DL, this approach has the advantage of reduced overfitting and accurately quantified uncertainty (Blundell et al., 2015). On the other hand, this approach requires inference techniques, like variational inference, that are more advanced than classic back-propagation. The next section will present the dual approach, showing how to use neural networks as building blocks for a probabilistic model.

4. DL in Probabilistic Models

(a) Graphical models. 1batch_size, nx, nh, nz = 256, 28 * 28, 1024, 4 2def vr(*shape): return tf.Variable(0.01 * tf.random_normal(shape)) 3# Model 4X = tf.placeholder(tf.int32, [batch_size, nx]) 5def decoder(theta, z): 6    hidden = tf.nn.relu(tf.matmul(z, theta[’Wh’]) + theta[’bh’]) 7    mu = tf.matmul(hidden, theta[’Wy’]) + theta[’by’] 8    return mu 9theta = { ’Wh’: vr(nz, nh), 10               ’bh’: vr(nh), 11               ’Wy’: vr(nh, nx), 12               ’by’: vr(nx) } 13z = Normal(loc=tf.zeros([batch_size, nz]), scale=tf.ones([batch_size, nz])) 14logits = decoder(theta, z) 15x = Bernoulli(logits=logits) 16# Inference Guide 17def encoder(phi, x): 18    x = tf.cast(x, tf.float32) 19    hidden = tf.nn.relu(tf.matmul(x, phi[’Wh’]) + phi[’bh’]) 20    z_mu = tf.matmul(hidden, phi[’Wy_mu’]) + phi[’by_mu’] 21    z_sigma = tf.nn.softplus( 22        tf.matmul(hidden, phi[’Wy_sigma’]) + phi[’by_sigma’]) 23    return z_mu, z_sigma 24phi = { ’Wh’: vr(nx, nh), 25             ’bh’: vr(nh), 26             ’Wy_mu’: vr(nh, nz), 27             ’by_mu’: vr(nz), 28             ’Wy_sigma’: vr(nh, nz), 29             ’by_sigma’: vr(nz) } 30loc, scale = encoder(phi, X) 31qz = Normal(loc=loc, scale=scale) 32# Inference 33inference = ed.KLqp({z: qz}, data={x: X}) 1# Model 2class Decoder(nn.Module): 3    def __init__(self): 4        super(Decoder, self).__init__() 5        self.lh = nn.Linear(nz, nh) 6        self.lx = nn.Linear(nh, nx) 7        self.relu = nn.ReLU() 8    def forward(self, z): 9        hidden = self.relu(self.lh(z)) 10        mu = self.lx(hidden) 11        return mu 12decoder = Decoder() 13def model(x): 14    z_mu = Variable(torch.zeros(x.size(0), nz)) 15    z_sigma = Variable(torch.ones(x.size(0), nz)) 16    z = pyro.sample(”z”, dist.Normal(z_mu, z_sigma)) 17    pyro.module(”decoder”, decoder) 18    mu = decoder.forward(z) 19    pyro.sample(”xhat”, dist.Bernoulli(mu), obs=x.view(-1, nx)) 20# Inference Guide 21class Encoder(nn.Module): 22    def __init__(self): 23        super(Encoder, self).__init__() 24        self.lh = torch.nn.Linear(nx, nh) 25        self.lz_mu = torch.nn.Linear(nh, nz) 26        self.lz_sigma = torch.nn.Linear(nh, nz) 27        self.softplus = nn.Softplus() 28    def forward(self, x): 29        hidden = F.relu(self.lh(x.view((-1, nx)))) 30        z_mu = self.lz_mu(hidden) 31        z_sigma = self.softplus(self.lz_sigma(hidden)) 32        return z_mu, z_sigma 33encoder = Encoder() 34def guide(x): 35    pyro.module(”encoder”, encoder) 36    z_mu, z_sigma = encoder.forward(x) 37    pyro.sample(”z”, dist.Normal(z_mu, z_sigma)) 38# Inference 39inference = SVI(model, guide, Adam({”lr”: 0.01}), loss=”ELBO”)
(b) Edward (c) Pyro
Figure 7.

Variational autoencoder for encoding and decoding images.

This section explains how deep PPLs can use non-probabilistic deep neural networks as components in probabilistic models. The example task is learning a vector-space representation. Such a representation reduces the number of input dimensions to make other machine-learning tasks more manageable by counter-acting the curse of dimensionality 

(Domingos, 2012). The observed random variable is , for instance, an image of a hand-written digit with pixels. The latent random variable is , the vector-space representation, for instance, with features. Learning a vector-space representation is an unsupervised problem, requiring only images but no labels. While not too useful on its own, such a representation is an essential building block. For instance, it can help in other image generation tasks, e.g., to generate an image for a given writing style (Siddharth et al., 2017). Furthermore, it can help learning with small data, e.g., via a K-nearest neighbors approach in vector space (Babkin et al., 2017).

Each image  depends on the latent representation  in a complex non-linear way (i.e., via a deep neural network). The task is to learn this dependency between  and . The top half of Figure 7(a) shows the corresponding graphical model. The output of the neural network, named decoder, is a vector  that parameterizes a Bernoulli distribution over each pixel in the image . Each pixel is thus associated to a probability of being present in the image. Similarly to Figure 4(b) the parameter  of the decoder is global (i.e., shared by all data points) and is thus drawn outside the plate. Compared to Section 3 the network here is not probabilistic, hence the square around .

The main idea of the VAE (Kingma and Welling, 2013; Rezende et al., 2014) is to use variational inference to learn the latent representation. As for the examples presented in the previous sections, we need to define a guide for the inference. The guide maps each  to a latent variable  via another neural network. The bottom half of Figure 7(a) shows the graphical model of the guide. The network, named encoder, returns, for each image , the parameters  and 

of a Gaussian distribution in the latent space. Again the parameter 

of the network is global and not probabilistic. Then inference tries to learn good values for parameter  and , simultaneously training the decoder and the encoder, according to the data and the prior beliefs on the latent variables (e.g., Gaussian distribution).

After the inference, we can generate a latent representation of an image with the encoder and reconstruct the image with the decoder. The similarity of the two images gives an indication of the success of the inference. The model and the guide together can thus be seen as an autoencoder, hence the term variational autoencoder.

Figure 7(b) shows the VAE examples in Edward. Lines 4-12 define the decoder: a simple 2-layers neural network similar to the one presented in Section 3. The parameter is initialized with random noise. Line 13 samples the priors for the latent variable  from a Gaussian distribution. Lines 14-15 define the dependency between  and , as a Bernoulli distribution parameterized by the output of the decoder. Lines 17-29 define the encoder: a neural network with one hidden layer and two distinct output layers for  and . The parameter is also initialized with random noise. Lines 30-31 define the inference guide for the latent variable, that is, a Gaussian distribution parameterized by the outputs of the encoder. Line 33 sets up the inference matching the prior:posterior pair for the latent variable and linking the data with the output of the decoder.

Figure 7(c) shows the VAE example in Pyro. Lines 2-12 define the decoder. Lines 13-19 define the model. Lines 14-16 sample the priors for the latent variable . Lines 18-19 define the dependency between  and  via the decoder. In contrast to Figure 5(b), the decoder is not probabilistic, so there is no need for lifting the network. Lines 34-37 define the guide as in Edward linking  and  via the decoder defined Lines 21-33. Line 39 sets up the inference.

This example illustrates that we can embed non-probabilistic DL models inside probabilistic programs and learn the parameters of the DL models during inference. Sections 2, 3, and 4 were about explaining deep PPLs with examples. The next section is about comparing deep PPLs with each other and with their potential.

5. Characterization

This section attempts to answer the following research question: At this point in time, how well do deep PPLs live up to their potential? Deep PPLs combine probabilistic models, deep learning, and programming languages in an effort to combine their advantages. This section explores those advantages grouped by pedigree and uses them to characterize Edward and Pyro.

Before we dive in, some disclaimers are in order. First, both Edward and Pyro are young, not mature projects with years of improvements based on user experiences, and they enable new applications that are still under active research. We should keep this in mind when we criticize them. On the positive side, early criticism can be a positive influence. Second, since getting even a limited number of example programs to support direct side-by-side comparisons was non-trivial, we kept our study qualitative. A respectable quantitative study would require more programs and data sets. On the positive side, all of the code shown in this paper actually runs. Third, doing an outside-in study risks missing subtleties that the designers of Edward and Pyro may be more expert in. On the positive side, the outside-in view resembles what new users experience.

5.1. Advantages from Probabilistic Models

Probabilistic models support overt uncertainty: they give not just a prediction but also a meaningful probability. This is useful to avoid uncertainty bugs (Bornholt et al., 2014)

, track compounding effects of uncertainty, and even make better exploration decisions in reinforcement learning 

(Blundell et al., 2015). Both Edward and Pyro support overt uncertainty well, see e.g. the lines under the comment “# Results” in Figure 2.

Probabilistic models give users a choice of inference procedures: the user has the flexibility to pick and configure different approaches. Deep PPLs support two primary families of inference procedures: those based on Monte-Carlo sampling and those based on variational inference. Edward supports both and furthermore flexible compositions, where different inference procedures are applied to different parts of the model. Pyro supports primarily variational inference and focuses less on Monte-Carlo sampling. In comparison, Stan makes a form of Monte-Carlo sampling the default, focusing on making it easy-to-tune in practice (Carpenter, 2017).

Probabilistic models can help with small data: even when inference uses only small amount of labeled data, there have been high-profile cases where probabilistic models still make accurate predictions (Lake et al., 2015). Working with small data is useful to avoid the cost of hand-labeling, to improve privacy, to build personalized models, and to do well on underrepresented corners of a big-data task. The intuition for how probabilistic models help is that they can make up for lacking labeled data for a task by domain knowledge incorporated in the model, by unlabeled data, or by labeled data for other tasks. There are some promising initial successes of using deep probabilistic programming on small data (Rezende et al., 2016; Siddharth et al., 2017); at the same time, this remains an active area of research.

Probabilistic models can support explainability: when the components of a probabilistic model correspond directly to concepts of a real-world domain being modeled, predictions can include an explanation in terms of those concepts. Explainability is useful when predictions are consulted for high-stakes decisions, as well as for transparency around bias (Calmon et al., 2017). Unfortunately, the parameters of a deep neural network are just as opaque with as without probabilistic programming. There is cause for hope though. For instance, Siddharth et al. advocate disentangled representations that help explainability (Siddharth et al., 2017). Overall, the jury is still out on the extent to which deep PPLs can leverage this advantage from PPLs.

5.2. Advantages from Deep Learning

Deep learning is automatic hierarchical representation learning (LeCun et al., 2015): each unit in a deep neural network can be viewed as an automatically learned feature. Learning features automatically is useful to avoid the cost of engineering features by hand. Fortunately, this DL advantage remains true in the context of a deep PPL. In fact, a deep PPL makes the trade-off between automated and hand-crafted features more flexible than most other machine-learning approaches.

Deep learning can accomplish high accuracy

: for various tasks, there have been high-profile cases where deep neural networks beat out earlier approaches with years of research behind them. Arguably, the victory of DL at the ImageNet competition in 2012 ushered in the latest craze around DL 

(Krizhevsky et al., 2012). Record-breaking accuracy is useful not just for publicity but also to cross thresholds where practical deployments become desirable, for instance, in machine translation (Wu et al., 2016). Since a deep PPL can use deep neural networks, in principle, it inherits this advantage from DL (Tran et al., 2017). However, even non-probabilistic DL requires tuning, and in our experience with the example programs in this paper, the tuning burden is exacerbated with variational inference.

Deep learning supports fast inference: even for large models and a large data set, the wall-clock time for a batch job to infer posteriors is short. The fast-inference advantage is the result of the back-propagation algorithm (Rumelhart et al., 1986), novel techniques for parallelization (Niu et al., 2011) and data representation (Gupta et al., 2015)

, and massive investments in the efficiency of DL frameworks such as TensorFlow and PyTorch, with vectorization on CPU and GPU. Fast inference is useful for iterating more quickly on ideas, trying more hyperparameter during, and wasting fewer resources. Tran et al. measured the efficiency of the Edward deep PPL, demonstrating that it does benefit from the efficiency advantage of the underlying DL framework 

(Tran et al., 2017).

5.3. Advantages from Programming Languages

Programming language design is essential for composability: bigger models can be composed from smaller components. Composability is useful for testing, team-work, and reuse. Conceptually, both graphical probabilistic models and deep neural networks compose well. On the other hand, some PPLs impose structure in a way that reduces composability; fortunately, this can be mitigated (Gorinova et al., 2018). Both Edward and Pyro are embedded in Python, and, as our example programs demonstrate, work with Python functions and classes. For instance, users are not limited to declaring all latent variables in one place; instead, they can compose models, such as MLPs, with separately declared latent variables. Edward and Pyro also work with higher-level DL framework modules such as tf.layers.dense or torch.nn.Linear, and Pyro even supports automatically lifting those to make them probabilistic. Edward and Pyro also do not prevent users from composing probabilistic models with non-probabilistic code, but doing so requires care. For instance, when Monte-Carlo sampling runs the same code multiple times, it is up to the programmer to watch out for unwanted side-effects. One area where more work is needed is the extensibility of Edward or Pyro itself (Carpenter, 2017). Finally, in addition to composing models, Edward also emphasizes composing inference procedures.

Not all PPLs have the same expressiveness: some are Turing complete, others not (Gordon et al., 2014). For instance, BUGS is not Turing complete, but has nevertheless been highly successful (Gilks et al., 1994). The field of deep probabilistic programming is too young to judge which levels of expressiveness are how useful. Edward and Pyro are both Turing complete. However, Edward makes it harder to express while-loops and conditionals than Pyro. Since Edward builds on TensorFlow, the user must use special APIs to incorporate dynamic control constructs into the computational graph. In contrast, since Pyro builds on PyTorch, it can use native Python control constructs, one of the advantages of dynamic DL frameworks (Neubig et al., 2017).

Programming language design affects conciseness: it determines whether a model can be expressed in few lines of code. Conciseness is useful to make models easier to write and, when used in good taste, easier to read. In our code examples, Edward is more concise than Pyro. Pyro arguably trades conciseness for structure, making heavier use of Python classes and functions. Wrapping the model and guide into functions allows compiling them into co-routines, an ingredient for implementing efficient inference procedures (Goodman and Stuhlmüller, 2014). In both Edward and Pyro, conciseness is hampered by the Bayesian requirement for explicit priors and by the variational-inference requirement for explicit guides.

Programming languages can offer watertight abstractions: they can abstract away lower-level concepts and prevent them from leaking out, for instance, using types and constraints (Carpenter, 2017). Consider the expression from Section 3. At face value, this looks like eager arithmetic on concrete scalars, running just once in the forward direction. But actually, it may be lazy (building a computational graph) arithmetic on probability distributions (not concrete values) of tensors (not scalars), running several times (for different Monte-Carlo samples or data batches), possibly in the backward direction (for back-propagation of gradients). Abstractions are useful to reduce the cognitive burden on the programmer, but only if they are watertight. Unfortunately, abstractions in deep PPLs are leaky. Our code examples directly invoke features from several layers of the technology stack (Edward or Pyro, on TensorFlow or PyTorch, on NumPy, on Python). Furthermore, we found that error messages rarely refer to source code constructs. For instance, names of Python variables are erased from the computational graph, making it hard to debug tensor dimensions, a common cause for mistakes. It does not have to be that way. For instance, DL frameworks are good at hiding the abstraction of back-propagation. More work is required to make deep PPL abstractions more watertight.

6. Conclusion and Outlook

This paper is a study of two deep PPLs, Edward and Pyro. The study is qualitative and driven by code examples. This paper explains how to solve common tasks, contributing side-by-side comparisons of Edward and Pyro. The potential of deep PPLs is to combine the advantages of probabilistic models, deep learning, and programming languages. In addition to comparing Edward and Pyro to each other, this paper also compares them to that potential. A quantitative study is left to future work. Based on our experience, we confirm that Edward and Pyro combine three advantages out-of-the-box: the overt uncertainty of probabilistic models; the hierarchical representation learning of DL; and the composability of programming languages.

Following are possible next steps in deep PPL research.

  • [leftmargin=]

  • Choice of inference procedures: Especially Pyro should support Monte-Carlo methods at the same level as variational inference.

  • Small data: While possible in theory, this has yet to be demonstrated on Edward and Pyro, with interesting data sets.

  • High accuracy: Edward and Pyro need to be improved to simplify the tuning required to improve model accuracy.

  • Expressiveness: While Turing complete in theory, Edward should adopt recent dynamic TensorFlow features for usability.

  • Conciseness: Both Edward and Pyro would benefit from reducing the repetitive idioms of priors and guides.

  • Watertight abstractions: Both Edward and Pyro fall short on this goal, necessitating more careful language design.

  • Explainability: This is inherently hard with deep PPLs, necessitating more machine-learning innovation.

In summary, deep PPLs show great promises and remain an active field with many research opportunities.