1. Introduction
A deep probabilistic programming language (PPL) is a language for specifying both deep neural networks and probabilistic models. In other words, a deep PPL draws upon programming languages, Bayesian statistics, and deep learning to ease the development of powerful machine-learning applications.
For decades, scientists have developed probabilistic models in various fields of exploration without the benefit of either dedicated programming languages or deep neural networks (Ghahramani, 2015). But since these models involve Bayesian inference with often intractable integrals, they sap the productivity of experts and are beyond the reach of nonexperts. PPLs address this issue by letting users express a probabilistic model as a program (Gordon et al., 2014). The program specifies how to generate output data by sampling latent probability distributions. The compiler checks this program for type errors and translates it to a form suitable for an inference procedure, which uses observed output data to fit the latent distributions. Probabilistic models show great promise: they overtly represent uncertainty (Bornholt et al., 2014) and they have been demonstrated to enable explainable machine learning even in the important but difficult case of small training data (Lake et al., 2015; Rezende et al., 2016; Siddharth et al., 2017).
Over the last few years, machine learning with deep neural networks (deep learning, DL) has become enormously popular. This is because in several domains, DL solves what was previously a vexing problem (Domingos, 2012), namely manual feature engineering. Each layer of a neural network can be viewed as learning increasingly higher-level features. In other words, the essence of DL is automatic hierarchical representation learning (LeCun et al., 2015). Hence, DL powered recent breakthrough results in accurate supervised large-data tasks such as image recognition (Krizhevsky et al., 2012) and natural language translation (Wu et al., 2016). Today, most DL is based on frameworks that are well-supported, efficient, and expressive, such as TensorFlow (Abadi et al., 2016) and PyTorch (Facebook, 2016). These frameworks provide automatic differentiation (users need not manually calculate gradients for gradient descent), GPU support (to efficiently execute vectorized computations), and Python-based embedded domain-specific languages (Hudak, 1998).
Deep PPLs, which have emerged just recently (Uber, 2017; Siddharth et al., 2017; Tran et al., 2017; Salvatier et al., 2016), aim to combine the benefits of PPLs and DL. Ideally, programs in deep PPLs would overtly represent uncertainty, yield explainable models, and require only a small amount of training data; be easy to write in a well-designed programming language; and match the breakthrough accuracy and fast training times of DL. Realizing all of these promises would yield tremendous advantages. Unfortunately, this is hard to achieve. Some of the strengths of PPLs and DL are seemingly at odds, such as explainability vs. automated feature engineering, or learning from small data vs. optimizing for large data. Furthermore, the barrier to entry for work in deep PPLs is high, since it requires nontrivial background in fields as diverse as statistics, programming languages, and deep learning. To tackle this problem, this paper characterizes deep PPLs, thus lowering the barrier to entry, providing a programming-languages perspective early when it can make a difference, and shining a light on gaps that the community should try to address.
This paper uses the Stan PPL as a representative of the state of the art in regular (not deep) PPLs (Carpenter et al., 2017). Stan is a mainstream, mature, and widely used PPL: it is maintained by a large group of developers, has a yearly StanCon conference, and has an active forum. Stan is Turing complete and has its own standalone syntax and semantics, but provides bindings for several languages including Python.
Most importantly, this paper uses Edward (Tran et al., 2017) and Pyro (Uber, 2017) as representatives of the state of the art in deep PPLs. Edward is based on TensorFlow and Pyro is based on PyTorch. Edward was first released in mid-2016 and has a single main maintainer, who is focusing on a new version. Pyro is a much newer framework (released late 2017), but its developers seem to be very responsive to community questions.
This paper characterizes deep PPLs by explaining them (Sections 2, 3, and 4), comparing them to each other and to regular PPLs and DL frameworks (Section 5), and envisioning next steps (Section 6). Additionally, the paper serves as a comparative tutorial to both Edward and Pyro. To this end, it presents examples of increasing complexity written in both languages, using deliberately uniform terminology and presentation style. By writing this paper, we hope to help the research community contribute to the exciting new field of deep PPLs, and ultimately, combine the strengths of both DL and PPLs.
2. Probabilistic Model Example
This section explains PPLs using an example that is probabilistic but not deep. The example, adapted from Section 9.1 of (Barber, 2012), is about learning the bias of a coin. We picked this example because it is simple, lets us introduce basic concepts, and shows how different PPLs represent these concepts.
We write $x_i = 1$ if the result of the $i^{th}$ coin toss is head and $x_i = 0$ if it is tail. We assume that individual coin tosses are independent and identically distributed (IID) and that each toss follows a Bernoulli distribution with parameter $\theta$: $p(x_i = 1 \mid \theta) = \theta$ and $p(x_i = 0 \mid \theta) = 1 - \theta$. The latent (i.e., unobserved) variable $\theta$ is the bias of the coin. The task is to infer $\theta$ given the results of previously observed coin tosses, that is, $p(\theta \mid x_1, x_2, \ldots, x_N)$. Figure 1 shows the corresponding graphical model. The model is generative: once the distribution of the latent variable has been inferred, one can draw samples to generate data points similar to the observed data.
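For this particular model the posterior can be written down exactly, which gives a useful reference point for the inference procedures discussed in this section. The sketch below is an illustration under two assumptions: the prior on the bias is uniform, i.e., Beta(1, 1), as in the programs of this section, and the ten tosses are those used in Figure 2. With a Beta prior and Bernoulli observations, the posterior is again a Beta distribution whose parameters are the prior parameters plus the numbers of heads and tails. The helper names `beta_posterior` and `beta_mean_std` are hypothetical.

```python
import math

def beta_posterior(tosses, prior_a=1.0, prior_b=1.0):
    """Return (a, b) of the Beta posterior over the coin bias theta.

    Assumes a Beta(prior_a, prior_b) prior; Beta(1, 1) is the uniform prior.
    """
    heads = sum(tosses)
    tails = len(tosses) - heads
    return prior_a + heads, prior_b + tails

def beta_mean_std(a, b):
    """Mean and standard deviation of a Beta(a, b) distribution."""
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1))
    return mean, math.sqrt(var)

# Same ten tosses as in Figure 2: two heads, eight tails.
tosses = [0, 1, 0, 0, 0, 0, 0, 0, 0, 1]
a, b = beta_posterior(tosses)
mean, std = beta_mean_std(a, b)
print("Posterior: Beta(%g, %g)" % (a, b))
print("Posterior mean:", mean)
print("Posterior stddev:", std)
```

Sampling-based or variational inference on the same data should produce a posterior mean and standard deviation close to these closed-form values.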
We now present this simple example in Stan, Edward, and Pyro. In all these languages we follow a Bayesian approach: the programmer first defines a probabilistic model of the problem. Assumptions are encoded with prior distributions over the variables of the model. Then the programmer launches an inference procedure to automatically compute the posterior distributions of the parameters of the model based on observed data. In other words, inference adjusts the prior distribution using the observed data to give a more precise model. Compared to other machine-learning models such as deep neural networks, the result of a probabilistic program is a probability distribution, which allows the programmer to explicitly visualize and manipulate the uncertainty associated with a result. This overt uncertainty is an advantage of PPLs. Furthermore, a probabilistic model has the advantage that it directly describes the corresponding world based on the programmer’s knowledge. Such descriptive models are more explainable than deep neural networks, whose representation is big and does not overtly resemble the world they model.
Figure 2(a): Stan

     1  # Model
     2  coin_code = """
     3    data {
     4      int<lower=0,upper=1> x[10];
     5    }
     6    parameters {
     7      real<lower=0,upper=1> theta;
     8    }
     9    model {
    10      theta ~ uniform(0,1);
    11      for (i in 1:10)
    12        x[i] ~ bernoulli(theta);
    13    }"""
    14  # Data
    15  data = {'x': [0, 1, 0, 0, 0, 0, 0, 0, 0, 1]}
    16  # Inference
    17  fit = pystan.stan(model_code=coin_code,
    18                    data=data, iter=1000)
    19  # Results
    20  samples = fit.extract()['theta']
    21  print("Posterior mean:", np.mean(samples))
    22  print("Posterior stddev:", np.std(samples))

Figure 2(b): Edward

     1  # Model
     2  theta = Uniform(0.0, 1.0)
     3  x = Bernoulli(probs=theta, sample_shape=10)
     4  # Data
     5  data = np.array([0, 1, 0, 0, 0, 0, 0, 0, 0, 1])
     6  # Inference
     7  qtheta = Empirical(
     8      tf.Variable(tf.ones(1000) * 0.5))
     9  inference = ed.HMC({theta: qtheta},
    10                     data={x: data})
    11  inference.run()
    12  # Results
    13  mean, stddev = ed.get_session().run(
    14      [qtheta.mean(), qtheta.stddev()])
    15  print("Posterior mean:", mean)
    16  print("Posterior stddev:", stddev)

Figure 2(c): Pyro

     1  # Model
     2  def coin():
     3      theta = pyro.sample("theta", Uniform(
     4          Variable(torch.Tensor([0])),
     5          Variable(torch.Tensor([1]))))
     6      pyro.sample("x", Bernoulli(
     7          theta * Variable(torch.ones(10))))
     8  # Data
     9  data = {"x": Variable(torch.Tensor(
    10      [0, 1, 0, 0, 0, 0, 0, 0, 0, 1]))}
    11  # Inference
    12  cond = pyro.condition(coin, data=data)
    13  sampler = pyro.infer.Importance(cond,
    14      num_samples=1000)
    15  post = pyro.infer.Marginal(sampler, sites=["theta"])
    16  # Result
    17  samples = [post()["theta"].data[0] for _ in range(1000)]
    18  print("Posterior mean:", np.mean(samples))
    19  print("Posterior stddev:", np.std(samples))
Figure 2(a) solves the biased-coin task in Stan, a well-established PPL (Carpenter et al., 2017). This example uses PyStan, the Python interface for Stan. Lines 2-13 contain code in Stan syntax in a multi-line Python string. The data block introduces observed random variables, which are placeholders for concrete data to be provided as input to the inference procedure, whereas the parameters block introduces latent random variables, which will not be provided and must be inferred. Line 4 declares $x$ as a vector of ten discrete random variables, constrained to only take on values from a finite set, in this case, either 0 or 1. Line 7 declares $\theta$ as a continuous random variable, which can take on values from an infinite set, in this case, real numbers between 0 and 1. Stan uses the tilde operator (~) for sampling. Line 10 samples $\theta$ from a uniform distribution (same probability for all values) between 0 and 1. Since $\theta$ is a latent variable, this distribution is merely a prior belief, which inference will replace by a posterior distribution. Line 12 samples the $x_i$ from a Bernoulli distribution with parameter $\theta$. Since the $x_i$ are observed variables, this sampling is really a check for how well the model corresponds to data provided at inference time. One can think of sampling an observed variable like an assertion in verification (Gordon et al., 2014). Line 15 specifies the data and Lines 17-18 run the inference using the model and the data. By default, Stan uses a form of Monte-Carlo sampling for inference (Carpenter et al., 2017). Lines 20-22 extract and print the mean and standard deviation of the posterior distribution of $\theta$.
Figure 2(b) solves the same task in Edward (Tran et al., 2017). Line 2 samples $\theta$ from the prior distribution, and Line 3 samples a vector of random variables from a Bernoulli distribution of parameter $\theta$, one for each coin toss. Line 5 specifies the data. Lines 7-8 define a placeholder that will be used by the inference procedure to compute the posterior distribution of $\theta$. The shape and size of the placeholder depend on the inference procedure. Here we use Hamiltonian Monte-Carlo (HMC) inference; the posterior distribution is thus computed based on a set of random samples and follows an empirical distribution. The size of the placeholder corresponds to the number of random samples computed during inference. Lines 9-11 launch the inference. The inference takes as parameter the prior:posterior pair theta:qtheta and links the data to the variable $x$. Lines 13-16 extract and print the mean and standard deviation of the posterior distribution of $\theta$.
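To make the idea of a posterior represented by a set of random samples concrete, the following sketch runs a random-walk Metropolis sampler on the coin model. This is a deliberately simpler relative of HMC, swapped in for illustration; it is not the algorithm used by Stan or Edward. The function names, the step size, the number of samples, and the burn-in length are assumptions chosen for this example.

```python
import math
import random

def log_joint(theta, tosses):
    """Log of uniform prior times Bernoulli likelihood, up to a constant."""
    if not 0.0 < theta < 1.0:
        return float("-inf")  # outside the support of the uniform prior
    heads = sum(tosses)
    tails = len(tosses) - heads
    return heads * math.log(theta) + tails * math.log(1.0 - theta)

def metropolis(tosses, num_samples=20000, step=0.2, seed=0):
    """Random-walk Metropolis chain over theta, started at 0.5."""
    rng = random.Random(seed)
    theta = 0.5
    current = log_joint(theta, tosses)
    chain = []
    for _ in range(num_samples):
        proposal = theta + rng.gauss(0.0, step)
        candidate = log_joint(proposal, tosses)
        # Accept with probability min(1, p(proposal) / p(theta)).
        if rng.random() < math.exp(min(0.0, candidate - current)):
            theta, current = proposal, candidate
        chain.append(theta)
    return chain

tosses = [0, 1, 0, 0, 0, 0, 0, 0, 0, 1]
samples = metropolis(tosses)[1000:]  # drop an assumed burn-in of 1000 steps
mean = sum(samples) / len(samples)
std = math.sqrt(sum((s - mean) ** 2 for s in samples) / len(samples))
print("Posterior mean:", mean)
print("Posterior stddev:", std)
```

The list `samples` plays the role of the empirical distribution: its mean and standard deviation estimate those of the posterior of the coin bias.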
Figure 2(c) solves the same task in Pyro (Uber, 2017). Lines 2-7 define the model as a function coin. Lines 3-5 sample $\theta$ from the prior distribution, and Lines 6-7 sample a vector of random variables from a Bernoulli distribution of parameter $\theta$. Pyro stores random variables in a dictionary keyed by the first argument of the function pyro.sample. Lines 9-10 define the data as a dictionary. Line 12 conditions the model using the data by matching the values of the data dictionary with the random variables defined in the model. Lines 13-15 apply inference to this conditioned model, using importance sampling. In contrast to Stan and Edward, where the data is passed as an argument to the inference, in Pyro we first define a conditioned model with the observed data and then run the inference. The inference returns a probabilistic model, post, that can be sampled to extract the mean and standard deviation of the posterior distribution of $\theta$ in Lines 17-19.
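The importance-sampling step can be sketched in plain Python. The sketch assumes that the prior serves as the proposal distribution, so each sampled bias is weighted by the likelihood of the observed tosses; whether a given framework uses exactly this proposal is not something the sketch establishes. The function name `importance_posterior` and the sample count are hypothetical choices for this example.

```python
import math
import random

def importance_posterior(tosses, num_samples=20000, seed=0):
    """Self-normalized importance sampling for the coin bias.

    Proposal: the uniform prior on [0, 1). Weight: the Bernoulli likelihood.
    Returns the weighted posterior mean and standard deviation.
    """
    rng = random.Random(seed)
    heads = sum(tosses)
    tails = len(tosses) - heads
    thetas, weights = [], []
    for _ in range(num_samples):
        theta = rng.random()                              # draw from the prior
        weight = theta ** heads * (1.0 - theta) ** tails  # likelihood of the data
        thetas.append(theta)
        weights.append(weight)
    total = sum(weights)
    mean = sum(w * t for w, t in zip(weights, thetas)) / total
    var = sum(w * (t - mean) ** 2 for w, t in zip(weights, thetas)) / total
    return mean, math.sqrt(var)

tosses = [0, 1, 0, 0, 0, 0, 0, 0, 0, 1]
mean, std = importance_posterior(tosses)
print("Posterior mean:", mean)
print("Posterior stddev:", std)
```

Samples whose bias explains the data poorly receive small weights and contribute little to the estimates.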
The deep PPLs Edward and Pyro are built on top of two popular deep learning frameworks, TensorFlow (Abadi et al., 2016) and PyTorch (Facebook, 2016), respectively. They benefit from efficient computations over large datasets, automatic differentiation, and optimizers provided by these frameworks that can be used to efficiently implement inference procedures. As we will see in the next sections, this design choice also reduces the gap between DL and probabilistic models, allowing the programmer to combine the two. On the other hand, this choice leads to piling up abstractions (Edward/TensorFlow/NumPy/Python or Pyro/PyTorch/NumPy/Python) that can complicate the code. We defer a discussion of these towers of abstraction to Section 5.
Figure 3(a): Edward

    1  # Inference Guide
    2  qalpha = tf.Variable(1.0)
    3  qbeta = tf.Variable(1.0)
    4  qtheta = Beta(qalpha, qbeta)
    5  # Inference
    6  inference = ed.KLqp({theta: qtheta}, {x: data})
    7  inference.run()

Figure 3(b): Pyro

    1  # Inference Guide
    2  def guide():
    3      qalpha = pyro.param("qalpha", Variable(torch.Tensor([1.0]), requires_grad=True))
    4      qbeta = pyro.param("qbeta", Variable(torch.Tensor([1.0]), requires_grad=True))
    5      pyro.sample("theta", Beta(qalpha, qbeta))
    6  # Inference
    7  svi = SVI(cond, guide, Adam({}), loss="ELBO", num_particles=7)
    8  for step in range(1000):
    9      svi.step()
Variational Inference
Inference, for Bayesian models, computes the posterior distribution of the latent parameters $\theta$ given a set of observations $x$, that is, $p(\theta \mid x)$. For complex models, computing the exact posterior distribution can be costly or even intractable. Variational inference turns the inference problem into an optimization problem and tends to be faster and better suited to large datasets than sampling-based Monte-Carlo methods (Blei et al., 2017).
Variational inference tries to find the member $q(\theta)$ of a family of simpler distributions over the latent variables that is the closest to the true posterior $p(\theta \mid x)$. The fitness of a candidate is measured using the Kullback-Leibler (KL) divergence from the true posterior, a similarity measure between probability distributions.
It is up to the programmer to choose a family of candidates, or guides, that is sufficiently expressive to capture a close approximation of the true posterior, but simple enough to make the optimization problem tractable.
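As a small worked illustration, consider the coin model of Section 2 with a Beta family of guides. Minimizing the KL divergence to the posterior is equivalent to maximizing the evidence lower bound (ELBO), which for this model has a closed form. The sketch below makes several assumptions that should be read as such: the prior is uniform, the digamma function is approximated by a numerical derivative of `math.lgamma` to stay within the standard library, and a coarse grid search over integer parameters stands in for the gradient-based optimizers that deep PPLs actually use. All function names are hypothetical.

```python
import math

def digamma(x, h=1e-5):
    """Numerical derivative of log-gamma (a stdlib-only stand-in for digamma)."""
    return (math.lgamma(x + h) - math.lgamma(x - h)) / (2.0 * h)

def log_beta_fn(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def elbo(a, b, heads, tails):
    """ELBO of the guide q = Beta(a, b), uniform prior, Bernoulli likelihood."""
    e_log_theta = digamma(a) - digamma(a + b)      # E_q[log theta]
    e_log_1m_theta = digamma(b) - digamma(a + b)   # E_q[log(1 - theta)]
    # The uniform prior has log density 0, so only the likelihood contributes.
    expected_log_joint = heads * e_log_theta + tails * e_log_1m_theta
    entropy = (log_beta_fn(a, b) - (a - 1) * digamma(a)
               - (b - 1) * digamma(b) + (a + b - 2) * digamma(a + b))
    return expected_log_joint + entropy

tosses = [0, 1, 0, 0, 0, 0, 0, 0, 0, 1]
heads = sum(tosses)
tails = len(tosses) - heads

# Coarse grid search over integer guide parameters (illustration only).
candidates = [(a, b) for a in range(1, 16) for b in range(1, 16)]
best = max(candidates, key=lambda ab: elbo(ab[0], ab[1], heads, tails))
log_evidence = log_beta_fn(1 + heads, 1 + tails)
print("Best guide: Beta(%d, %d)" % best)
print("ELBO at best guide:", elbo(best[0], best[1], heads, tails))
print("Log evidence:", log_evidence)
```

Because the Beta family contains the exact posterior of this model, the best guide attains an ELBO equal to the log evidence, i.e., a KL divergence of zero. For richer models the family usually does not contain the posterior and the ELBO stays strictly below the log evidence.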
Both Edward and Pyro support variational inference. Figure 3 shows how to adapt Figure 2 to use it. In Edward (Figure 3(a)), the programmer defines the family of guides by changing the shape of the placeholder used in the inference. Lines 2-4 use a beta distribution with unknown parameters $\alpha$ and $\beta$ that will be computed during inference. Lines 6-7 do variational inference using the Kullback-Leibler divergence. In Pyro (Figure 3(b)), this is done by defining a guide function. Lines 2-5 also define a beta distribution with parameters $\alpha$ and $\beta$. Lines 7-9 do inference using Stochastic Variational Inference, an optimized algorithm for variational inference. Both Edward and Pyro rely on the underlying framework to solve the optimization problem. Probabilistic inference thus closely follows the scheme used for training procedures of DL models.
This section gave a high-level introduction to PPLs and introduced basic concepts (generative models, sampling, prior and posterior, latent and observed, discrete and continuous). Next, we turn our attention to deep learning.
3. Probabilistic Models in DL
Figure 4: Multilayer perceptron (MLP) for classifying images. Circles and squares are probabilistic and non-probabilistic variables. Black rectangles are pure functions. Arrows represent dependencies and forward data flow.
Figure 5(a): Edward

     1  batch_size, nx, nh, ny = 128, 28 * 28, 1024, 10
     2  # Model
     3  x = tf.placeholder(tf.float32, [batch_size, nx])
     4  l = tf.placeholder(tf.int32, [batch_size])
     5  def mlp(theta, x):
     6      h = tf.nn.relu(tf.matmul(x, theta["Wh"]) + theta["bh"])
     7      yhat = tf.matmul(h, theta["Wy"]) + theta["by"]
     8      log_pi = tf.nn.log_softmax(yhat)
     9      return log_pi
    10  theta = {
    11      'Wh': Normal(loc=tf.zeros([nx, nh]), scale=tf.ones([nx, nh])),
    12      'bh': Normal(loc=tf.zeros(nh), scale=tf.ones(nh)),
    13      'Wy': Normal(loc=tf.zeros([nh, ny]), scale=tf.ones([nh, ny])),
    14      'by': Normal(loc=tf.zeros(ny), scale=tf.ones(ny)) }
    15  lhat = Categorical(logits=mlp(theta, x))
    16  # Inference Guide
    17  def vr(*shape):
    18      return tf.Variable(tf.random_normal(shape))
    19  qtheta = {
    20      'Wh': Normal(loc=vr(nx, nh), scale=tf.nn.softplus(vr(nx, nh))),
    21      'bh': Normal(loc=vr(nh), scale=tf.nn.softplus(vr(nh))),
    22      'Wy': Normal(loc=vr(nh, ny), scale=tf.nn.softplus(vr(nh, ny))),
    23      'by': Normal(loc=vr(ny), scale=tf.nn.softplus(vr(ny))) }
    24  # Inference
    25  inference = ed.KLqp({ theta["Wh"]: qtheta["Wh"],
    26                        theta["bh"]: qtheta["bh"],
    27                        theta["Wy"]: qtheta["Wy"],
    28                        theta["by"]: qtheta["by"] },
    29                      data={lhat: l})

Figure 5(b): Pyro

     1  # Model
     2  class MLP(nn.Module):
     3      def __init__(self):
     4          super(MLP, self).__init__()
     5          self.lh = torch.nn.Linear(nx, nh)
     6          self.ly = torch.nn.Linear(nh, ny)
     7      def forward(self, x):
     8          h = F.relu(self.lh(x.view((-1, nx))))
     9          log_pi = F.log_softmax(self.ly(h), dim=1)
    10          return log_pi
    11  mlp = MLP()
    12  def v0s(*shape): return Variable(torch.zeros(*shape))
    13  def v1s(*shape): return Variable(torch.ones(*shape))
    14  def model(x, l):
    15      theta = { 'lh.weight': Normal(v0s(nh, nx), v1s(nh, nx)),
    16                'lh.bias': Normal(v0s(nh), v1s(nh)),
    17                'ly.weight': Normal(v0s(ny, nh), v1s(ny, nh)),
    18                'ly.bias': Normal(v0s(ny), v1s(ny)) }
    19      lifted_mlp = pyro.random_module("mlp", mlp, theta)()
    20      pyro.observe("obs", Categorical(logits=lifted_mlp(x)), one_hot(l))
    21  # Inference Guide
    22  def vr(name, *shape):
    23      return pyro.param(name,
    24          Variable(torch.randn(*shape), requires_grad=True))
    25  def guide(x, l):
    26      qtheta = {
    27          'lh.weight': Normal(vr("Wh_m", nh, nx), F.softplus(vr("Wh_s", nh, nx))),
    28          'lh.bias': Normal(vr("bh_m", nh), F.softplus(vr("bh_s", nh))),
    29          'ly.weight': Normal(vr("Wy_m", ny, nh), F.softplus(vr("Wy_s", ny, nh))),
    30          'ly.bias': Normal(vr("by_m", ny), F.softplus(vr("by_s", ny))) }
    31      return pyro.random_module("mlp", mlp, qtheta)()
    32  # Inference
    33  inference = SVI(model, guide, Adam({"lr": 0.01}), loss="ELBO")

Figure 5: Probabilistic multilayer perceptron for classifying images.
This section explains DL using an example of a deep neural network and shows how to make that probabilistic. The task is multiclass classification: given input features $x$, e.g., an image of a handwritten digit (LeCun et al., 1998) comprising $n_x = 28 \times 28$ pixels, predict a label $l$, e.g., one of $n_y = 10$ digits. Before we explain how to solve this task using DL, let us clarify some terminology. In cases where DL uses different terminology from PPLs, this paper favors the PPL terminology. So we say inference for what DL calls training; predicting for what DL calls inferencing; and observed variables for what DL calls training data. For instance, the observed variables for inference in the classifier task are the image $x$ and label $l$.
Among the many neural network architectures suitable for this task, we chose a simple one: a multilayer perceptron (MLP (Rumelhart et al., 1986)). We start with the non-probabilistic version. Figure 4(a) shows an MLP with a 2-feature input layer, a 3-feature hidden layer, and a 2-feature output layer; of course, this generalizes to wider (more features) and deeper (more layers) MLPs. From left to right, there is a fully-connected subnet where each input feature $x_i$ contributes to each hidden feature $h_j$, multiplied with a weight $w^h_{ji}$ and offset with a bias $b^h_j$. The weights and biases are latent variables. Treating the input, biases, and weights as vectors and a matrix $x$, $b^h$, and $W^h$ lets us cast this subnet in linear algebra as $W^h x + b^h$, which can run efficiently via vector instructions on CPUs or GPUs. Next, a rectified linear unit $\mathrm{ReLU}(z) = \max(0, z)$ computes the hidden feature vector $h$. The ReLU lets the MLP discriminate input spaces that are not linearly separable. The hidden layer exhibits both the advantage and disadvantage of deep learning: automatically learned features that need not be hand-engineered but would require nontrivial reverse engineering to explain in real-world terms. Next, another fully-connected subnet computes the output layer $y = W^y h + b^y$. Then, the softmax computes a vector $\pi$ of scores that add up to one. The higher the value of $\pi_l$, the more strongly the classifier predicts label $l$. Using the output of the MLP, the argmax extracts the label $l$ with highest score.
Traditional methods to train such a neural network incrementally update the latent parameters of the network to optimize a loss function via gradient descent (Rumelhart et al., 1986). In the case of handwritten digits, the loss function is a distance metric between observed and predicted labels. The variables and computations are non-probabilistic, in that they use concrete values rather than probability distributions.
Deep PPLs can express Bayesian neural networks with probabilistic weights and biases (Blundell et al., 2015). One way to visualize this is by replacing rectangles with circles for latent variables in Figure 4(a) to indicate that they hold probability distributions. Figure 4(b) shows the corresponding graphical model, where the latent variable $\theta$ denotes all the parameters of the MLP: $W^h$, $b^h$, $W^y$, $b^y$.
Bayesian inference starts from prior beliefs about the parameters and learns distributions that fit observed data (such as images and labels). We can then sample concrete weights and biases to obtain a concrete MLP. In fact, we do not need to stop at a single MLP: we can sample an ensemble of as many MLPs as we like. Then, we can feed a concrete image to all the sampled MLPs to get their predictions, followed by a vote.
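The sample-then-vote idea can be sketched without any DL framework. The example below uses the 2-3-2 layer sizes of Figure 4(a) and assumes, purely for illustration, that inference has already produced independent Gaussian distributions over the weights and biases, with made-up means and a shared standard deviation. The vote is taken by averaging the softmax scores of the sampled MLPs. The names `sample_like`, `predict`, and `q_means` are hypothetical.

```python
import math
import random

def linear(weights, bias, inputs):
    """Fully connected layer; weights holds one row per output feature."""
    return [sum(w * v for w, v in zip(row, inputs)) + b
            for row, b in zip(weights, bias)]

def softmax(scores):
    """Scores that are positive and add up to one."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def mlp(theta, x):
    """Concrete MLP: linear layer, ReLU, linear layer, softmax."""
    h = [max(0.0, v) for v in linear(theta["Wh"], theta["bh"], x)]
    return softmax(linear(theta["Wy"], theta["by"], h))

def sample_like(mean, std, rng):
    """Draw Gaussian values with the same (nested list) shape as mean."""
    if isinstance(mean, list):
        return [sample_like(m, std, rng) for m in mean]
    return rng.gauss(mean, std)

def predict(q_means, x, std=0.1, num_samples=50, seed=0):
    """Sample an ensemble of MLPs, average their scores, take the argmax."""
    rng = random.Random(seed)
    scores = []
    for _ in range(num_samples):
        theta = {name: sample_like(mean, std, rng)
                 for name, mean in q_means.items()}
        scores.append(mlp(theta, x))
    mean_scores = [sum(column) / num_samples for column in zip(*scores)]
    return mean_scores.index(max(mean_scores)), mean_scores

# Made-up posterior means for a 2-3-2 MLP (illustration only).
q_means = {
    "Wh": [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    "bh": [0.0, 0.0, 0.0],
    "Wy": [[2.0, 0.0, 1.0], [0.0, 2.0, 1.0]],
    "by": [0.0, 0.0],
}
label, mean_scores = predict(q_means, [1.0, 0.0])
print("Predicted label:", label)
print("Mean scores:", mean_scores)
```

The spread of the individual score vectors around `mean_scores` gives a rough indication of how uncertain the ensemble is about its prediction.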
Figure 5(a) shows the probabilistic MLP example in Edward. Lines 3-4 are placeholders for observed variables (i.e., batches of images $x$ and labels $l$). Lines 5-9 define the MLP parameterized by $\theta$, a dictionary containing all the network parameters. Lines 10-14 sample the parameters from the prior distributions. Line 15 defines the output of the network: a categorical distribution over all possible label values parameterized by the output of the MLP. Lines 17-23 define the guides for the latent variables, initialized with random numbers. Later, the variational inference will update these during optimization, so they will ultimately hold an approximation of the posterior distribution after inference. Lines 25-29 set up the inference with one prior:posterior pair for each parameter of the network and link the output of the network to the observed data.
Figure 5(b) shows the same example in Pyro. Lines 2-11 contain the basic neural network, where torch.nn.Linear wraps the low-level linear algebra. Lines 3-6 declare the structure of the net, that is, the type and dimension of each layer. Lines 7-10 combine the layers to define the network. It is possible to use equivalent high-level TensorFlow APIs for this in Edward as well, but we refrained from doing so to illustrate the transition of the parameters to random variables. Lines 14-20 define the model. Lines 15-18 sample priors for the parameters, associating them with object properties created by torch.nn.Linear (i.e., the weight and bias of each layer). Line 19 lifts the MLP definition from concrete to probabilistic. We thus obtain an MLP where all parameters are treated as random variables. Line 20 conditions the model using a categorical distribution over all possible label values. Lines 26-31 define the guide for the latent variables, initialized with random numbers, just like in the Edward version. Line 33 sets up the inference.
Figure 6(a): Edward

    1  def predict(x):
    2      theta_samples = [ { "Wh": qtheta["Wh"].sample(), "bh": qtheta["bh"].sample(),
    3                          "Wy": qtheta["Wy"].sample(), "by": qtheta["by"].sample() }
    4                        for _ in range(args.num_samples) ]
    5      yhats = [ mlp(theta_samp, x).eval()
    6                for theta_samp in theta_samples ]
    7      mean = np.mean(yhats, axis=0)
    8      return np.argmax(mean, axis=1)

Figure 6(b): Pyro

    1  def predict(x):
    2      sampled_models = [ guide(None)
    3                         for _ in range(args.num_samples) ]
    4      yhats = [ model(Variable(x)).data
    5                for model in sampled_models ]
    6      mean = torch.mean(torch.stack(yhats), 0)
    7      return np.argmax(mean, axis=1)
After the inference, Figure 6 shows how to use the posterior distribution of the MLP parameters to classify unknown data. In Edward (Figure 6(a)), Lines 2-4 draw several samples of the parameters from the posterior distribution. Then, Lines 5-6 execute the MLP with each concrete model. Line 7 computes the score of a label as the average of the scores returned by the MLPs. Finally, Line 8 predicts the label with the highest score. In Pyro (Figure 6(b)), the prediction is done similarly, but we obtain multiple versions of the MLP by sampling the guide (Lines 2-3), not the parameters.
This section showed how to use probabilistic variables as building blocks for a DL model. Compared to non-probabilistic DL, this approach has the advantage of reduced overfitting and accurately quantified uncertainty (Blundell et al., 2015). On the other hand, this approach requires inference techniques, like variational inference, that are more advanced than classic back-propagation. The next section will present the dual approach, showing how to use neural networks as building blocks for a probabilistic model.
4. DL in Probabilistic Models
Figure 7(a): Graphical models.

Figure 7(b): Edward

     1  batch_size, nx, nh, nz = 256, 28 * 28, 1024, 4
     2  def vr(*shape): return tf.Variable(0.01 * tf.random_normal(shape))
     3  # Model
     4  X = tf.placeholder(tf.int32, [batch_size, nx])
     5  def decoder(theta, z):
     6      hidden = tf.nn.relu(tf.matmul(z, theta['Wh']) + theta['bh'])
     7      mu = tf.matmul(hidden, theta['Wy']) + theta['by']
     8      return mu
     9  theta = { 'Wh': vr(nz, nh),
    10            'bh': vr(nh),
    11            'Wy': vr(nh, nx),
    12            'by': vr(nx) }
    13  z = Normal(loc=tf.zeros([batch_size, nz]), scale=tf.ones([batch_size, nz]))
    14  logits = decoder(theta, z)
    15  x = Bernoulli(logits=logits)
    16  # Inference Guide
    17  def encoder(phi, x):
    18      x = tf.cast(x, tf.float32)
    19      hidden = tf.nn.relu(tf.matmul(x, phi['Wh']) + phi['bh'])
    20      z_mu = tf.matmul(hidden, phi['Wy_mu']) + phi['by_mu']
    21      z_sigma = tf.nn.softplus(
    22          tf.matmul(hidden, phi['Wy_sigma']) + phi['by_sigma'])
    23      return z_mu, z_sigma
    24  phi = { 'Wh': vr(nx, nh),
    25          'bh': vr(nh),
    26          'Wy_mu': vr(nh, nz),
    27          'by_mu': vr(nz),
    28          'Wy_sigma': vr(nh, nz),
    29          'by_sigma': vr(nz) }
    30  loc, scale = encoder(phi, X)
    31  qz = Normal(loc=loc, scale=scale)
    32  # Inference
    33  inference = ed.KLqp({z: qz}, data={x: X})

Figure 7(c): Pyro

     1  # Model
     2  class Decoder(nn.Module):
     3      def __init__(self):
     4          super(Decoder, self).__init__()
     5          self.lh = nn.Linear(nz, nh)
     6          self.lx = nn.Linear(nh, nx)
     7          self.relu = nn.ReLU()
     8      def forward(self, z):
     9          hidden = self.relu(self.lh(z))
    10          mu = self.lx(hidden)
    11          return mu
    12  decoder = Decoder()
    13  def model(x):
    14      z_mu = Variable(torch.zeros(x.size(0), nz))
    15      z_sigma = Variable(torch.ones(x.size(0), nz))
    16      z = pyro.sample("z", dist.Normal(z_mu, z_sigma))
    17      pyro.module("decoder", decoder)
    18      mu = decoder.forward(z)
    19      pyro.sample("xhat", dist.Bernoulli(mu), obs=x.view(-1, nx))
    20  # Inference Guide
    21  class Encoder(nn.Module):
    22      def __init__(self):
    23          super(Encoder, self).__init__()
    24          self.lh = torch.nn.Linear(nx, nh)
    25          self.lz_mu = torch.nn.Linear(nh, nz)
    26          self.lz_sigma = torch.nn.Linear(nh, nz)
    27          self.softplus = nn.Softplus()
    28      def forward(self, x):
    29          hidden = F.relu(self.lh(x.view((-1, nx))))
    30          z_mu = self.lz_mu(hidden)
    31          z_sigma = self.softplus(self.lz_sigma(hidden))
    32          return z_mu, z_sigma
    33  encoder = Encoder()
    34  def guide(x):
    35      pyro.module("encoder", encoder)
    36      z_mu, z_sigma = encoder.forward(x)
    37      pyro.sample("z", dist.Normal(z_mu, z_sigma))
    38  # Inference
    39  inference = SVI(model, guide, Adam({"lr": 0.01}), loss="ELBO")

Figure 7: Variational autoencoder for encoding and decoding images.
This section explains how deep PPLs can use non-probabilistic deep neural networks as components in probabilistic models. The example task is learning a vector-space representation. Such a representation reduces the number of input dimensions to make other machine-learning tasks more manageable by counteracting the curse of dimensionality (Domingos, 2012). The observed random variable is $x$, for instance, an image of a handwritten digit with $n_x = 28 \times 28$ pixels. The latent random variable is $z$, the vector-space representation, for instance, with $n_z = 4$ features. Learning a vector-space representation is an unsupervised problem, requiring only images but no labels. While not too useful on its own, such a representation is an essential building block. For instance, it can help in other image generation tasks, e.g., to generate an image for a given writing style (Siddharth et al., 2017). Furthermore, it can help learning with small data, e.g., via a K-nearest neighbors approach in vector space (Babkin et al., 2017).
Each image $x$ depends on the latent representation $z$ in a complex non-linear way (i.e., via a deep neural network). The task is to learn this dependency between $x$ and $z$. The top half of Figure 7(a) shows the corresponding graphical model. The output of the neural network, named decoder, is a vector $\mu$ that parameterizes a Bernoulli distribution over each pixel in the image $x$. Each pixel is thus associated with a probability of being present in the image. As in Figure 4(b), the parameter $\theta$ of the decoder is global (i.e., shared by all data points) and is thus drawn outside the plate. In contrast to Section 3, the network here is not probabilistic, hence the square around $\theta$.
The main idea of the VAE (Kingma and Welling, 2013; Rezende et al., 2014) is to use variational inference to learn the latent representation. As for the examples presented in the previous sections, we need to define a guide for the inference. The guide maps each $x$ to a latent variable $z$ via another neural network. The bottom half of Figure 7(a) shows the graphical model of the guide. The network, named encoder, returns, for each image $x$, the parameters $\mu_z$ and $\sigma_z$ of a Gaussian distribution in the latent space. Again the parameter $\phi$ of the network is global and not probabilistic. Then inference tries to learn good values for the parameters $\theta$ and $\phi$, simultaneously training the decoder and the encoder, according to the data and the prior beliefs on the latent variables (e.g., Gaussian distribution).
After the inference, we can generate a latent representation of an image with the encoder and reconstruct the image with the decoder. The similarity of the two images gives an indication of the success of the inference. The model and the guide together can thus be seen as an autoencoder, hence the term variational autoencoder.
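The quantity that inference optimizes for a VAE can be sketched for a single image. The example below is a simplification under stated assumptions: the encoder and decoder are single linear layers instead of networks with hidden layers, the image has only four binary pixels, the latent space has two features, and the prior on the latent variable is a standard Gaussian. It draws one reparameterized sample of the latent variable and returns the reconstruction log-likelihood minus the KL divergence between the guide and the prior. All names (`elbo_estimate`, `small_weights`, the dictionary keys) are hypothetical.

```python
import math
import random

def softplus(v):
    """Numerically stable log(1 + exp(v))."""
    return max(v, 0.0) + math.log1p(math.exp(-abs(v)))

def linear(weights, bias, inputs):
    """Fully connected layer; weights holds one row per output feature."""
    return [sum(w * v for w, v in zip(row, inputs)) + b
            for row, b in zip(weights, bias)]

def gaussian_kl(mu, sigma):
    """KL( N(mu, sigma^2) || N(0, 1) ), summed over latent dimensions."""
    return sum(0.5 * (s * s + m * m - 1.0) - math.log(s)
               for m, s in zip(mu, sigma))

def bernoulli_log_lik(logits, pixels):
    """log p(x | z) for binary pixels under a Bernoulli with given logits."""
    # log sigmoid(l) = -softplus(-l) and log(1 - sigmoid(l)) = -softplus(l)
    return sum(-softplus(-l) if p == 1 else -softplus(l)
               for l, p in zip(logits, pixels))

def elbo_estimate(x, phi, theta, rng):
    """Single-sample ELBO estimate for one image x."""
    mu = linear(phi["W_mu"], phi["b_mu"], x)                       # encoder mean
    sigma = [softplus(v) for v in linear(phi["W_sigma"], phi["b_sigma"], x)]
    z = [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]   # z ~ q(z | x)
    logits = linear(theta["W"], theta["b"], z)                     # decoder
    return bernoulli_log_lik(logits, x) - gaussian_kl(mu, sigma)

def small_weights(rows, cols, rng):
    """Weight matrix initialized with small random noise."""
    return [[0.01 * rng.gauss(0.0, 1.0) for _ in range(cols)]
            for _ in range(rows)]

rng = random.Random(0)
nx, nz = 4, 2  # toy sizes, far smaller than a real image
phi = {"W_mu": small_weights(nz, nx, rng), "b_mu": [0.0] * nz,
       "W_sigma": small_weights(nz, nx, rng), "b_sigma": [0.0] * nz}
theta = {"W": small_weights(nx, nz, rng), "b": [0.0] * nx}
x = [1, 0, 1, 1]
value = elbo_estimate(x, phi, theta, rng)
print("Single-sample ELBO estimate:", value)
```

Training would adjust both `theta` and `phi` by gradient ascent on the average of this estimate over many images, which is the part that the DL framework underneath a deep PPL automates.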
Figure 7(b) shows the VAE example in Edward. Lines 4-12 define the decoder: a simple 2-layer neural network similar to the one presented in Section 3. The parameter $\theta$ is initialized with random noise. Line 13 samples the priors for the latent variable $z$ from a Gaussian distribution. Lines 14-15 define the dependency between $x$ and $z$, as a Bernoulli distribution parameterized by the output of the decoder. Lines 17-29 define the encoder: a neural network with one hidden layer and two distinct output layers for $\mu_z$ and $\sigma_z$. The parameter $\phi$ is also initialized with random noise. Lines 30-31 define the inference guide for the latent variable, that is, a Gaussian distribution parameterized by the outputs of the encoder. Line 33 sets up the inference, matching the prior:posterior pair for the latent variable and linking the data with the output of the decoder.
Figure 7(c) shows the VAE example in Pyro. Lines 2-12 define the decoder. Lines 13-19 define the model. Lines 14-16 sample the priors for the latent variable $z$. Lines 18-19 define the dependency between $x$ and $z$ via the decoder. In contrast to Figure 5(b), the decoder is not probabilistic, so there is no need for lifting the network. Lines 34-37 define the guide as in Edward, linking $z$ and $x$ via the encoder defined in Lines 21-33. Line 39 sets up the inference.
This example illustrates that we can embed non-probabilistic DL models inside probabilistic programs and learn the parameters of the DL models during inference. Sections 2, 3, and 4 were about explaining deep PPLs with examples. The next section is about comparing deep PPLs with each other and with their potential.
5. Characterization
This section attempts to answer the following research question: At this point in time, how well do deep PPLs live up to their potential? Deep PPLs combine probabilistic models, deep learning, and programming languages in an effort to combine their advantages. This section explores those advantages grouped by pedigree and uses them to characterize Edward and Pyro.
Before we dive in, some disclaimers are in order. First, both Edward and Pyro are young, not mature projects with years of improvements based on user experiences, and they enable new applications that are still under active research. We should keep this in mind when we criticize them. On the positive side, early criticism can be a positive influence. Second, since getting even a limited number of example programs to support direct side-by-side comparisons was nontrivial, we kept our study qualitative. A respectable quantitative study would require more programs and data sets. On the positive side, all of the code shown in this paper actually runs. Third, doing an outside-in study risks missing subtleties that the designers of Edward and Pyro may be more expert in. On the positive side, the outside-in view resembles what new users experience.
5.1. Advantages from Probabilistic Models
Probabilistic models support overt uncertainty: they give not just a prediction but also a meaningful probability. This is useful to avoid uncertainty bugs (Bornholt et al., 2014), track compounding effects of uncertainty, and even make better exploration decisions in reinforcement learning (Blundell et al., 2015). Both Edward and Pyro support overt uncertainty well; see, e.g., the lines under the comment “# Results” in Figure 2.
Probabilistic models give users a choice of inference procedures: the user has the flexibility to pick and configure different approaches. Deep PPLs support two primary families of inference procedures: those based on Monte Carlo sampling and those based on variational inference. Edward supports both and furthermore flexible compositions, where different inference procedures are applied to different parts of the model. Pyro supports primarily variational inference and focuses less on Monte Carlo sampling. In comparison, Stan makes a form of Monte Carlo sampling the default, focusing on making it easy to tune in practice (Carpenter, 2017).
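To make the Monte Carlo family and the notion of overt uncertainty concrete, the following is a minimal sketch in plain Python rather than in Edward or Pyro. It assumes a hypothetical coin-bias model with a uniform prior and uses importance sampling, one simple Monte Carlo method; the function name and the data (7 heads, 3 tails) are illustrative assumptions, not taken from the paper's figures.

```python
import math
import random


def posterior_by_importance_sampling(heads, tails, num_samples=20000, seed=0):
    """Approximate the posterior over a coin's bias with Monte Carlo sampling.

    Assumed model: bias ~ Uniform(0, 1), flips are independent Bernoulli draws.
    Returns the posterior mean and standard deviation of the bias.
    """
    rng = random.Random(seed)
    # Draw candidate biases from the prior.
    thetas = [rng.random() for _ in range(num_samples)]
    # Weight each candidate by the likelihood of the observed flips.
    weights = [t ** heads * (1.0 - t) ** tails for t in thetas]
    total = sum(weights)
    mean = sum(w * t for w, t in zip(weights, thetas)) / total
    var = sum(w * (t - mean) ** 2 for w, t in zip(weights, thetas)) / total
    return mean, math.sqrt(var)


mean, std = posterior_by_importance_sampling(heads=7, tails=3)
print(f"posterior mean {mean:.3f}, posterior std {std:.3f}")
```

The result is a prediction together with a spread: for this model the exact posterior is Beta(8, 4), with mean 2/3, so the estimate can be checked against a closed form. A variational procedure would instead fit the parameters of a guide distribution by optimization.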
Probabilistic models can help with small data: even when inference uses only a small amount of labeled data, there have been high-profile cases where probabilistic models still make accurate predictions (Lake et al., 2015). Working with small data is useful to avoid the cost of hand-labeling, to improve privacy, to build personalized models, and to do well on underrepresented corners of a big-data task. The intuition for how probabilistic models help is that they can make up for lacking labeled data for a task by domain knowledge incorporated in the model, by unlabeled data, or by labeled data for other tasks. There are some promising initial successes of using deep probabilistic programming on small data (Rezende et al., 2016; Siddharth et al., 2017); at the same time, this remains an active area of research.
Probabilistic models can support explainability: when the components of a probabilistic model correspond directly to concepts of a real-world domain being modeled, predictions can include an explanation in terms of those concepts. Explainability is useful when predictions are consulted for high-stakes decisions, as well as for transparency around bias (Calmon et al., 2017). Unfortunately, the parameters of a deep neural network are just as opaque with as without probabilistic programming. There is cause for hope though. For instance, Siddharth et al. advocate disentangled representations that help explainability (Siddharth et al., 2017). Overall, the jury is still out on the extent to which deep PPLs can leverage this advantage from PPLs.
5.2. Advantages from Deep Learning
Deep learning is automatic hierarchical representation learning (LeCun et al., 2015): each unit in a deep neural network can be viewed as an automatically learned feature. Learning features automatically is useful to avoid the cost of engineering features by hand. Fortunately, this DL advantage remains true in the context of a deep PPL. In fact, a deep PPL makes the tradeoff between automated and handcrafted features more flexible than most other machine-learning approaches.
Deep learning can accomplish high accuracy: for various tasks, there have been high-profile cases where deep neural networks beat out earlier approaches with years of research behind them. Arguably, the victory of DL at the ImageNet competition in 2012 ushered in the latest craze around DL (Krizhevsky et al., 2012). Record-breaking accuracy is useful not just for publicity but also to cross thresholds where practical deployments become desirable, for instance, in machine translation (Wu et al., 2016). Since a deep PPL can use deep neural networks, in principle, it inherits this advantage from DL (Tran et al., 2017). However, even nonprobabilistic DL requires tuning, and in our experience with the example programs in this paper, the tuning burden is exacerbated with variational inference.
Deep learning supports fast inference: even for large models and a large data set, the wall-clock time for a batch job to infer posteriors is short. The fast-inference advantage is the result of the backpropagation algorithm (Rumelhart et al., 1986), novel techniques for parallelization (Niu et al., 2011) and data representation (Gupta et al., 2015), and massive investments in the efficiency of DL frameworks such as TensorFlow and PyTorch, with vectorization on CPU and GPU. Fast inference is useful for iterating more quickly on ideas, trying more hyperparameters during tuning, and wasting fewer resources. Tran et al. measured the efficiency of the Edward deep PPL, demonstrating that it does benefit from the efficiency advantage of the underlying DL framework (Tran et al., 2017).
5.3. Advantages from Programming Languages
Programming language design is essential for composability: bigger models can be composed from smaller components. Composability is useful for testing, teamwork, and reuse. Conceptually, both graphical probabilistic models and deep neural networks compose well. On the other hand, some PPLs impose structure in a way that reduces composability; fortunately, this can be mitigated (Gorinova et al., 2018). Both Edward and Pyro are embedded in Python, and, as our example programs demonstrate, work with Python functions and classes. For instance, users are not limited to declaring all latent variables in one place; instead, they can compose models, such as MLPs, with separately declared latent variables. Edward and Pyro also work with higher-level DL framework modules such as tf.layers.dense or torch.nn.Linear, and Pyro even supports automatically lifting those to make them probabilistic. Edward and Pyro also do not prevent users from composing probabilistic models with nonprobabilistic code, but doing so requires care. For instance, when Monte Carlo sampling runs the same code multiple times, it is up to the programmer to watch out for unwanted side effects. One area where more work is needed is the extensibility of Edward or Pyro itself (Carpenter, 2017). Finally, in addition to composing models, Edward also emphasizes composing inference procedures.
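The caveat about side effects can be illustrated with a small sketch in plain Python. It does not use Edward or Pyro; the names `model`, `call_log`, and `monte_carlo_mean` are hypothetical, and the sampler is a deliberately naive stand-in for a real Monte Carlo procedure.

```python
import random

call_log = []  # nonprobabilistic state shared across model runs


def model(rng):
    """A hypothetical model that mixes sampling with a side effect."""
    z = rng.gauss(0.0, 1.0)  # latent variable drawn from a standard Gaussian
    call_log.append(z)       # side effect: happens on every model execution
    return z


def monte_carlo_mean(model_fn, num_runs, seed=0):
    """Naive Monte Carlo estimate that re-runs the model num_runs times."""
    rng = random.Random(seed)
    return sum(model_fn(rng) for _ in range(num_runs)) / num_runs


estimate = monte_carlo_mean(model, num_runs=1000)
# The log grew once per model run, not once per program execution.
print(len(call_log), round(estimate, 3))
```

A programmer who expected `call_log` to record a single value would be surprised; code that is composed with a probabilistic model should therefore either be free of side effects or be written with repeated execution in mind.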
Not all PPLs have the same expressiveness: some are Turing complete, others not (Gordon et al., 2014). For instance, BUGS is not Turing complete, but has nevertheless been highly successful (Gilks et al., 1994). The field of deep probabilistic programming is too young to judge how useful each level of expressiveness is. Edward and Pyro are both Turing complete. However, Edward makes it harder to express while-loops and conditionals than Pyro. Since Edward builds on TensorFlow, the user must use special APIs to incorporate dynamic control constructs into the computational graph. In contrast, since Pyro builds on PyTorch, it can use native Python control constructs, one of the advantages of dynamic DL frameworks (Neubig et al., 2017).
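As a sketch of what native control constructs buy, the following generative model uses an ordinary Python while-loop whose number of iterations is itself random. It is written against the standard `random` module as a stand-in for a PPL's sampling primitive; the function `flips_until_heads` and the probability 0.25 are assumptions chosen for illustration.

```python
import random


def flips_until_heads(p_heads, rng):
    """Generative model with data-dependent control flow.

    Counts coin flips until the first heads. The loop length depends on
    the sampled values, so the model has no fixed, static graph shape.
    """
    count = 1
    while rng.random() >= p_heads:  # tails: flip again
        count += 1
    return count


rng = random.Random(0)
samples = [flips_until_heads(0.25, rng) for _ in range(20000)]
average = sum(samples) / len(samples)
print(f"average number of flips: {average:.2f}")
```

The samples follow a geometric distribution, whose mean is 1 / p_heads. In a static-graph framework, the same loop would have to be expressed through a dedicated graph-level looping API rather than a plain `while` statement.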
Programming language design affects conciseness: it determines whether a model can be expressed in few lines of code. Conciseness is useful to make models easier to write and, when used in good taste, easier to read. In our code examples, Edward is more concise than Pyro. Pyro arguably trades conciseness for structure, making heavier use of Python classes and functions. Wrapping the model and guide into functions allows compiling them into coroutines, an ingredient for implementing efficient inference procedures (Goodman and Stuhlmüller, 2014). In both Edward and Pyro, conciseness is hampered by the Bayesian requirement for explicit priors and by the variational-inference requirement for explicit guides.
Programming languages can offer watertight abstractions: they can abstract away lower-level concepts and prevent them from leaking out, for instance, using types and constraints (Carpenter, 2017). Consider the expression from Section 3. At face value, this looks like eager arithmetic on concrete scalars, running just once in the forward direction. But actually, it may be lazy (building a computational graph) arithmetic on probability distributions (not concrete values) of tensors (not scalars), running several times (for different Monte Carlo samples or data batches), possibly in the backward direction (for backpropagation of gradients). Abstractions are useful to reduce the cognitive burden on the programmer, but only if they are watertight. Unfortunately, abstractions in deep PPLs are leaky. Our code examples directly invoke features from several layers of the technology stack (Edward or Pyro, on TensorFlow or PyTorch, on NumPy, on Python). Furthermore, we found that error messages rarely refer to source code constructs. For instance, names of Python variables are erased from the computational graph, making it hard to debug tensor dimensions, a common cause for mistakes. It does not have to be that way. For instance, DL frameworks are good at hiding the abstraction of backpropagation. More work is required to make deep PPL abstractions more watertight.
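The gap between how an expression reads and what it does can be shown with a toy example. The sketch below is a hypothetical, minimal computational-graph class in plain Python; it is not how TensorFlow, PyTorch, Edward, or Pyro are implemented, and the expression `w * x + b` is an assumed stand-in for the kind of arithmetic discussed above.

```python
class Node:
    """A node in a tiny computational graph (hypothetical, for illustration)."""

    def __init__(self, op, inputs=(), value=None):
        self.op = op          # "const", "add", or "mul"
        self.inputs = inputs  # child nodes
        self.value = value    # only used by "const" nodes

    @staticmethod
    def wrap(x):
        """Turn a plain number into a constant node; leave nodes unchanged."""
        return x if isinstance(x, Node) else Node("const", value=x)

    def __add__(self, other):
        return Node("add", (self, Node.wrap(other)))

    def __mul__(self, other):
        return Node("mul", (self, Node.wrap(other)))

    # Addition and multiplication commute, so the reflected forms can reuse them.
    __radd__ = __add__
    __rmul__ = __mul__

    def evaluate(self):
        """Run the graph; nothing is computed until this is called."""
        if self.op == "const":
            return self.value
        left, right = (child.evaluate() for child in self.inputs)
        return left + right if self.op == "add" else left * right


w, x, b = Node.wrap(2.0), Node.wrap(3.0), Node.wrap(1.0)
y = w * x + b            # looks like eager arithmetic on scalars ...
print(type(y).__name__)  # ... but y is a graph node, not a number
print(y.evaluate())      # the arithmetic only happens here
```

Note that the graph stores operations and values but not the Python variable names `w`, `x`, and `b`, which mirrors the observation that such names are erased and are unavailable for error messages.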
6. Conclusion and Outlook
This paper is a study of two deep PPLs, Edward and Pyro. The study is qualitative and driven by code examples. This paper explains how to solve common tasks, contributing side-by-side comparisons of Edward and Pyro. The potential of deep PPLs is to combine the advantages of probabilistic models, deep learning, and programming languages. In addition to comparing Edward and Pyro to each other, this paper also compares them to that potential. A quantitative study is left to future work. Based on our experience, we confirm that Edward and Pyro combine three advantages out of the box: the overt uncertainty of probabilistic models; the hierarchical representation learning of DL; and the composability of programming languages.
Following are possible next steps in deep PPL research.

Choice of inference procedures: Especially Pyro should support Monte Carlo methods at the same level as variational inference.

Small data: While possible in theory, this has yet to be demonstrated on Edward and Pyro, with interesting data sets.

High accuracy: Edward and Pyro need to be improved to simplify the tuning required to improve model accuracy.

Expressiveness: While Turing complete in theory, Edward should adopt recent dynamic TensorFlow features for usability.

Conciseness: Both Edward and Pyro would benefit from reducing the repetitive idioms of priors and guides.

Watertight abstractions: Both Edward and Pyro fall short on this goal, necessitating more careful language design.

Explainability: This is inherently hard with deep PPLs, necessitating more machine-learning innovation.
In summary, deep PPLs show great promise and remain an active field with many research opportunities.
References
 Abadi et al. (2016) Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2016. TensorFlow: A System for Large-Scale Machine Learning. In Operating Systems Design and Implementation (OSDI). 265–283. https://www.usenix.org/conference/osdi16/technicalsessions/presentation/abadi
 Babkin et al. (2017) Petr Babkin, Md. Faisal Mahbub Chowdhury, Alfio Gliozzo, Martin Hirzel, and Avraham Shinnar. 2017. Bootstrapping Chatbots for Novel Domains. In Workshop at NIPS on Learning with Limited Labeled Data (LLD). https://lldworkshop.github.io/papers/LLD_2017_paper_10.pdf
 Barber (2012) David Barber. 2012. Bayesian Reasoning and Machine Learning. Cambridge University Press. http://www.cs.ucl.ac.uk/staff/d.barber/brml/
 Blei et al. (2017) David M. Blei, Alp Kucukelbir, and Jon D. McAuliffe. 2017. Variational inference: A review for statisticians. J. Amer. Statist. Assoc. 112, 518 (2017), 859–877. https://arxiv.org/abs/1601.00670
 Blundell et al. (2015) Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. 2015. Weight Uncertainty in Neural Network. In International Conference on Machine Learning (ICML). 1613–1622. http://proceedings.mlr.press/v37/blundell15.html
 Bornholt et al. (2014) James Bornholt, Todd Mytkowicz, and Kathryn S. McKinley. 2014. Uncertain<T>: A First-order Type for Uncertain Data. In Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS). 51–66. https://doi.org/10.1145/2541940.2541958
 Calmon et al. (2017) Flavio P. Calmon, Dennis Wei, Bhanukiran Vinzamuri, Karthikeyan Natesan Ramamurthy, and Kush R. Varshney. 2017. Optimized Pre-Processing for Discrimination Prevention. In Neural Information Processing Systems (NIPS). 3995–4004. http://papers.nips.cc/paper/6988optimizedpreprocessingfordiscriminationprevention.pdf
 Carpenter (2017) Bob Carpenter. 2017. Hello, world! Stan, PyMC3, and Edward. (2017). http://andrewgelman.com/2017/05/31/comparestanpymc3edwardhelloworld/ (Retrieved February 2018).
 Carpenter et al. (2017) Bob Carpenter, Andrew Gelman, Matt Hoffman, Daniel Lee, Ben Goodrich, Michael Betancourt, Michael A. Brubaker, Jiqiang Guo, Peter Li, and Allen Riddell. 2017. Stan: A probabilistic programming language. Journal of Statistical Software 76, 1 (2017), 1–37. https://www.jstatsoft.org/article/view/v076i01
 Domingos (2012) Pedro Domingos. 2012. A Few Useful Things to Know About Machine Learning. Communications of the ACM (CACM) 55, 10 (Oct. 2012), 78–87. https://doi.org/10.1145/2347736.2347755
 Facebook (2016) Facebook. 2016. PyTorch. (2016). http://pytorch.org/ (Retrieved February 2018).

 Ghahramani (2015) Zoubin Ghahramani. 2015. Probabilistic machine learning and artificial intelligence. Nature 521, 7553 (May 2015), 452–459. https://www.nature.com/articles/nature14541
 Gilks et al. (1994) W. R. Gilks, A. Thomas, and D. J. Spiegelhalter. 1994. A Language and Program for Complex Bayesian Modelling. The Statistician 43, 1 (Jan. 1994), 169–177.
 Goodman and Stuhlmüller (2014) Noah D. Goodman and Andreas Stuhlmüller. 2014. The Design and Implementation of Probabilistic Programming Languages. (2014). http://dippl.org (Retrieved February 2018).
 Gordon et al. (2014) Andrew D. Gordon, Thomas A. Henzinger, Aditya V. Nori, and Sriram K. Rajamani. 2014. Probabilistic Programming. In ICSE track on Future of Software Engineering (FOSE). 167–181. https://doi.org/10.1145/2593882.2593900
 Gorinova et al. (2018) Maria I. Gorinova, Andrew D. Gordon, and Charles Sutton. 2018. SlicStan: Improving Probabilistic Programming using Information Flow Analysis. In Workshop on Probabilistic Programming Languages, Semantics, and Systems (PPS). https://pps2018.soic.indiana.edu/files/2017/12/SlicStanPPS.pdf
 Gupta et al. (2015) Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan, and Pritish Narayanan. 2015. Deep learning with limited numerical precision. In International Conference on Machine Learning (ICML). 1737–1746. http://proceedings.mlr.press/v37/gupta15.pdf
 Hudak (1998) Paul Hudak. 1998. Modular domain specific languages and tools. In International Conference on Software Reuse (ICSR). 134–142. https://doi.org/10.1109/ICSR.1998.685738
 Kingma and Welling (2013) Diederik P. Kingma and Max Welling. 2013. Auto-encoding variational Bayes. (2013). https://arxiv.org/abs/1312.6114
 Krizhevsky et al. (2012) Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet Classification with Deep Convolutional Networks. In Advances in Neural Information Processing Systems (NIPS). http://papers.nips.cc/paper/4824imagenetclassificationwithdeepconvolutionalneuralnetworks
 Lake et al. (2015) Brenden M. Lake, Ruslan Salakhutdinov, and Joshua B. Tenenbaum. 2015. Human-level concept learning through probabilistic program induction. Science 350 (Dec. 2015), 1332–1338. Issue 6266. http://science.sciencemag.org/content/350/6266/1332
 LeCun et al. (2015) Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. 2015. Deep Learning. Nature 521, 7553 (May 2015), 436–444. https://www.nature.com/articles/nature14539

 LeCun et al. (1998) Yann LeCun, Corinna Cortes, and Christopher J.C. Burges. 1998. The MNIST Database of Handwritten Digits. (1998). http://yann.lecun.com/exdb/mnist/ (Retrieved February 2018).
 Neubig et al. (2017) Graham Neubig, Chris Dyer, Yoav Goldberg, Austin Matthews, Waleed Ammar, Antonios Anastasopoulos, Miguel Ballesteros, David Chiang, Daniel Clothiaux, Trevor Cohn, Kevin Duh, Manaal Faruqui, Cynthia Gan, Dan Garrette, Yangfeng Ji, Lingpeng Kong, Adhiguna Kuncoro, Gaurav Kumar, Chaitanya Malaviya, Paul Michel, Yusuke Oda, Matthew Richardson, Naomi Saphra, Swabha Swayamdipta, and Pengcheng Yin. 2017. DyNet: The Dynamic Neural Network Toolkit. (2017).

 Niu et al. (2011) Feng Niu, Benjamin Recht, Christopher Ré, and Stephen J. Wright. 2011. Hogwild: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent. In Conference on Neural Information Processing Systems (NIPS). 693–701. http://papers.nips.cc/paper/4390hogwildalockfreeapproachtoparallelizingstochasticgradientdescent
 Rezende et al. (2016) Danilo J. Rezende, Shakir Mohamed, Ivo Danihelka, Karol Gregor, and Daan Wierstra. 2016. One-shot Generalization in Deep Generative Models. In International Conference on Machine Learning (ICML). 1521–1529. http://proceedings.mlr.press/v48/rezende16.html

 Rezende et al. (2014) Danilo J. Rezende, Shakir Mohamed, and Daan Wierstra. 2014. Stochastic Backpropagation and Approximate Inference in Deep Generative Models. In International Conference on Machine Learning (ICML). 1278–1286. http://proceedings.mlr.press/v32/rezende14.html
 Rumelhart et al. (1986) David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. 1986. Learning representations by back-propagating errors. Nature 323 (Oct. 1986), 533–536. https://doi.org/doi:10.1038/323533a0
 Salvatier et al. (2016) John Salvatier, Thomas V. Wiecki, and Christopher Fonnesbeck. 2016. Probabilistic programming in Python using PyMC3. PeerJ Computer Science 2 (April 2016), e55. https://doi.org/10.7717/peerjcs.55

 Siddharth et al. (2017) N. Siddharth, Brooks Paige, Jan-Willem van de Meent, Alban Desmaison, Noah D. Goodman, Pushmeet Kohli, Frank Wood, and Philip Torr. 2017. Learning Disentangled Representations with Semi-Supervised Deep Generative Models. In Advances in Neural Information Processing Systems (NIPS). 5927–5937. http://papers.nips.cc/paper/7174learningdisentangledrepresentationswithsemisuperviseddeepgenerativemodels.pdf
 Tran et al. (2017) Dustin Tran, Matthew D. Hoffman, Rif A. Saurous, Eugene Brevdo, Kevin Murphy, and David M. Blei. 2017. Deep Probabilistic Programming. In International Conference on Learning Representations (ICLR). https://arxiv.org/abs/1701.03757
 Uber (2017) Uber. 2017. Pyro. (2017). http://pyro.ai/ (Retrieved February 2018).
 Wu et al. (2016) Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Lukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2016. Google’s neural machine translation system: Bridging the gap between human and machine translation. (2016). https://arxiv.org/abs/1609.08144