We propose Edward, a Turing-complete probabilistic programming language. Edward defines two compositional representations: random variables and inference. By treating inference as a first class citizen, on a par with modeling, we show that probabilistic programming can be as flexible and computationally efficient as traditional deep learning. For flexibility, Edward makes it easy to fit the same model using a variety of composable inference methods, ranging from point estimation to variational inference to MCMC. In addition, Edward can reuse the modeling representation as part of inference, facilitating the design of rich variational models and generative adversarial networks. For efficiency, Edward is integrated into TensorFlow, providing significant speedups over existing probabilistic systems. For example, we show on a benchmark logistic regression task that Edward is at least 35x faster than Stan and 6x faster than PyMC3. Further, Edward incurs no runtime overhead: it is as fast as handwritten TensorFlow.
The nature of deep neural networks is compositional. Users can connect layers in creative ways, without having to worry about how to perform testing (forward propagation) or inference (gradient-based optimization, with back propagation and automatic differentiation).
In this paper, we design compositional representations for
probabilistic programming. Probabilistic programming lets users
specify generative probabilistic models as programs and then
“compile” those models down into inference procedures.
Probabilistic models are also compositional in nature, and much work
has enabled rich probabilistic programs via compositions of random
variables
(Goodman et al., 2012; Ghahramani, 2015; Lake et al., 2016).
Less work, however, has considered an analogous compositionality for
inference. Rather, many existing probabilistic programming languages (PPLs) treat
the inference engine as a black box, abstracted away from the model.
These cannot capture probabilistic inferences that
reuse the model’s representation, a key idea in
recent advances in variational
inference (Kingma & Welling, 2014; Rezende & Mohamed, 2015; Tran et al., 2016b),
generative adversarial networks (GANs) (Goodfellow et al., 2014),
and also in more classic inferences
(Dayan et al., 1995; Gutmann & Hyvärinen, 2010).
We propose Edward, a
Turing-complete PPL which builds on two compositional
representations: one for random variables and one for inference.
(Footnote 1: See Tran et al. (2016a)
for details of the API. A companion webpage for this paper is available at
http://edwardlib.org/iclr2017. It contains more complete
examples with runnable code.)
By treating inference as a first class citizen, on a
par with modeling, we show that probabilistic programming can be as
flexible and computationally efficient as traditional deep learning.
For flexibility, we show how Edward makes it easy to fit
the same model using a variety of composable inference methods,
ranging from point estimation to variational inference to
MCMC.
For efficiency, we
show how to integrate Edward into existing computational graph
frameworks such as TensorFlow (Abadi et al., 2016). Frameworks like TensorFlow provide computational benefits like distributed training, parallelism, vectorization, and GPU support “for free.” For example, we show on a benchmark logistic regression task that Edward’s HMC is 35x faster than Stan and 6x faster than PyMC3. Further, Edward incurs no runtime overhead: it is as fast as handwritten TensorFlow.
PPLs typically trade off the expressiveness of the language with the computational efficiency of inference. On one side, there are languages which emphasize expressiveness (Pfeffer, 2001; Milch et al., 2005; Pfeffer, 2009; Goodman et al., 2012), representing a rich class beyond graphical models. Each employs a generic inference engine, but scales poorly with respect to model and data size. On the other side, there are languages which emphasize efficiency (Spiegelhalter et al., 1995; Murphy, 2001; Plummer, 2003; Salvatier et al., 2015; Carpenter et al., 2016). The PPL is restricted to a specific class of models, and inference algorithms are optimized to be efficient for this class. For example, Infer.NET enables fast message passing for graphical models (Minka et al., 2014), and Augur enables data parallelism with GPUs for Gibbs sampling in Bayesian networks (Tristan et al., 2014). Edward bridges this gap. It is Turing complete (it supports any computable probability distribution), and it supports efficient algorithms, such as those that leverage model structure and those that scale to massive data.
There has been some prior research on efficient algorithms in Turing-complete languages. Venture and Anglican design inference as a collection of local inference problems, defined over program fragments (Mansinghka et al., 2014; Wood et al., 2014). This produces fast program-specific inference code, which we build on. Neither system supports inference methods such as programmable posterior approximations, inference models, or data subsampling. Concurrent with our work, WebPPL features amortized inference (Ritchie et al., 2016). Unlike Edward, WebPPL does not reuse the model’s representation; rather, it annotates the original program and leverages helper functions, which is a less flexible strategy. Finally, inference is designed as program transformations in Kiselyov & Shan (2009); Ścibior et al. (2015); Zinkov & Shan (2016). This enables the flexibility of composing inference inside other probabilistic programs. Edward builds on this idea to compose not only inference within modeling but also modeling within inference (e.g., variational models).
We first develop compositional representations for probabilistic models. We desire two criteria: (a) integration with computational graphs, an efficient framework where nodes represent operations on data and edges represent data communicated between them (Culler, 1986); and (b) invariance of the representation under the graph, that is, the representation can be reused during inference.
Edward defines random variables as the key compositional representation. They are class objects with methods, for example, to compute the log density and to sample. Further, each random variable $\mathbf{x}$ is associated to a tensor (multi-dimensional array) $\mathbf{x}^*$, which represents a single sample $\mathbf{x}^* \sim p(\mathbf{x})$. This association embeds the random variable onto a computational graph on tensors.

The design’s simplicity makes it easy to develop probabilistic programs in a computational graph framework. Importantly, all computation is represented on the graph. This enables one to compose random variables with complex deterministic structure such as deep neural networks, a diverse set of math operations, and third party libraries that build on the same framework. The design also enables compositions of random variables to capture complex stochastic structure.
As an illustration, we use a Beta-Bernoulli model, $p(\mathbf{x}, \theta) = \text{Beta}(\theta \mid 1, 1) \prod_{n=1}^{50} \text{Bernoulli}(x_n \mid \theta)$, where $\theta$ is a latent probability shared across the 50 data points $\mathbf{x} \in \{0, 1\}^{50}$. The random variable x is 50-dimensional, parameterized by the random tensor $\theta^*$. Fetching the object x runs the graph: it simulates from the generative process and outputs a binary vector of 50 elements.

All computation is registered symbolically on random variables and not over their execution. Symbolic representations do not require reifying the full model, a step that leads to unreasonable memory consumption for large models (Tristan et al., 2014). Moreover, this enables us to simplify both deterministic and stochastic operations in the graph, before executing any code (Ścibior et al., 2015; Zinkov & Shan, 2016).
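To make the generative process concrete, here is a minimal plain-Python sketch of the Beta-Bernoulli simulation described above. It does not use Edward or TensorFlow: the function name `sample_beta_bernoulli` and its arguments are hypothetical, and the standard library's `random.betavariate` stands in for a Beta random variable.

```python
import random


def sample_beta_bernoulli(n_points=50, a=1.0, b=1.0, seed=None):
    """Draw theta ~ Beta(a, b), then n_points Bernoulli(theta) outcomes."""
    rng = random.Random(seed)
    theta = rng.betavariate(a, b)  # latent probability shared by all points
    # One binary draw per data point, all using the same theta.
    x = [1 if rng.random() < theta else 0 for _ in range(n_points)]
    return theta, x


theta, x = sample_beta_bernoulli(seed=0)
print(theta, x)
```

Each call plays the role of fetching x once: a fresh value of theta is drawn first, and the 50 binary outcomes are then simulated conditionally on it.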
With computational graphs, it is also natural to build mutable states within the probabilistic program. As a typical use of computational graphs, such states can define model parameters; in TensorFlow, this is given by a tf.Variable. Another use case is for building discriminative models $p(\mathbf{y} \mid \mathbf{x})$, where $\mathbf{x}$ are features that are input as training or test data. The program can be written independent of the data, using a mutable state (tf.placeholder) for $\mathbf{x}$ in its graph. During training and testing, we feed the placeholder the appropriate values.
In Appendix A, we provide examples of a Bayesian neural network for classification (A.1), latent Dirichlet allocation (A.2), and Gaussian matrix factorization (A.3). We present others below.
Figure 2 implements a VAE (Kingma & Welling, 2014; Rezende et al., 2014) in Edward. It comprises a probabilistic model over data and a variational model designed to approximate the former’s posterior. Here we use random variables to construct both the probabilistic model and the variational model; they are fit during inference (more details in Section 4).
There are $N$ data points $\mathbf{x}_n \in \{0, 1\}^{28 \cdot 28}$, each with $d$ latent variables, $\mathbf{z}_n \in \mathbb{R}^d$. The program uses Keras (Chollet, 2015) to define neural networks. The probabilistic model is parameterized by a 2-layer neural network, with 256 hidden units (and ReLU activation), and generates $28 \times 28$ pixel images. The variational model is parameterized by a 2-layer inference network, with 256 hidden units, and outputs the parameters of a normal posterior approximation.

The probabilistic program is concise. Core elements of the VAE, such as its distributional assumptions and neural net architectures, are all extensible. With model compositionality, we can embed it into more complicated models (Gregor et al., 2015; Rezende et al., 2016) and for other learning tasks (Kingma et al., 2014). With inference compositionality (which we discuss in Section 4), we can embed it into more complicated algorithms, such as with expressive variational approximations (Rezende & Mohamed, 2015; Tran et al., 2016b; Kingma et al., 2016) and alternative objectives (Ranganath et al., 2016a; Li & Turner, 2016; Dieng et al., 2016).
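Random variables can also be composed with control flow operations. As an example, Figure 3 implements a Bayesian recurrent neural network (RNN) with variable length. The data is a sequence of inputs $(\mathbf{x}_1, \ldots, \mathbf{x}_T)$ and outputs $(y_1, \ldots, y_T)$ of length $T$ with $\mathbf{x}_t \in \mathbb{R}^D$ and $y_t \in \mathbb{R}$ per time step. For $t = 1, \ldots, T$, a RNN applies the update

$$\mathbf{h}_t = \tanh(\mathbf{W}_h \mathbf{h}_{t-1} + \mathbf{W}_x \mathbf{x}_t + \mathbf{b}_h),$$

where the previous hidden state is $\mathbf{h}_{t-1} \in \mathbb{R}^H$. We feed each hidden state into the output’s likelihood, $y_t \sim \text{Normal}(\mathbf{W}_y \mathbf{h}_t + \mathbf{b}_y, 1)$, and we place a standard normal prior over all parameters $\{\mathbf{W}_h, \mathbf{W}_x, \mathbf{W}_y, \mathbf{b}_h, \mathbf{b}_y\}$. Our implementation is dynamic: it differs from a RNN with fixed length, which pads and unrolls the computation.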
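The recurrence can be illustrated with a small plain-Python sketch that loops over a sequence of arbitrary length instead of padding to a fixed one. This is only an illustration under stated assumptions: the weights are fixed toy numbers rather than random variables with priors, the initial hidden state is assumed to be zero, and `rnn_hidden_states` is a hypothetical helper, not part of Edward.

```python
import math


def rnn_hidden_states(xs, W_h, W_x, b_h):
    """Return the hidden states h_1..h_T for an input sequence of any length T."""
    H = len(b_h)
    h = [0.0] * H  # assumed initial state h_0 = 0
    states = []
    for x_t in xs:  # the loop runs exactly T times; nothing is padded
        h = [
            math.tanh(
                sum(W_h[i][j] * h[j] for j in range(H))
                + sum(W_x[i][k] * x_t[k] for k in range(len(x_t)))
                + b_h[i]
            )
            for i in range(H)
        ]
        states.append(h)
    return states


# Hypothetical toy weights: H = 2 hidden units, D = 1 input feature.
W_h = [[0.5, 0.0], [0.0, 0.5]]
W_x = [[1.0], [-1.0]]
b_h = [0.0, 0.0]

short = rnn_hidden_states([[0.1], [0.2]], W_h, W_x, b_h)
long = rnn_hidden_states([[0.1], [0.2], [0.3], [0.4]], W_h, W_x, b_h)
```

The same function handles both sequences; the number of hidden states simply follows the length of the input.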
Random variables can also be placed in the control flow itself, enabling probabilistic programs with stochastic control flow. Stochastic control flow defines dynamic conditional dependencies, known in the literature as contingent or existential dependencies (Mansinghka et al., 2014; Wu et al., 2016). See Figure 4, where one random variable may or may not depend on another for a given execution. In Section A.4, we use stochastic control flow to implement a Dirichlet process mixture model. Tensors with stochastic shape are also possible: for example, tf.zeros(Poisson(lam=5.0)) defines a vector of zeros with length given by a Poisson draw with rate 5.
Stochastic control flow produces difficulties for algorithms that use the graph structure because the relationship of conditional dependencies changes across execution traces. The computational graph, however, provides an elegant way of teasing out static conditional dependence structure from dynamic dependence structure. We can perform model parallelism (parallel computation across components of the model) over the static structure with GPUs and batch training. We can use more generic computations to handle the dynamic structure.
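The idea of a tensor with stochastic shape can be mimicked in plain Python. The sketch below is an assumption-laden illustration rather than Edward code: `poisson_draw` is a hypothetical helper that uses Knuth's multiplication method, and a Python list stands in for a tensor.

```python
import math
import random


def poisson_draw(rate, rng):
    """Sample from Poisson(rate) with Knuth's multiplication method."""
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def zeros_with_random_length(rate, rng):
    """Return a list of zeros whose length is itself a Poisson draw."""
    return [0.0] * poisson_draw(rate, rng)


rng = random.Random(0)
vectors = [zeros_with_random_length(5.0, rng) for _ in range(2000)]
mean_length = sum(len(v) for v in vectors) / len(vectors)
print(mean_length)  # should be close to the rate, 5
```

Because the length changes from one execution to the next, any computation that depends on the shape has to be handled dynamically, which is the difficulty the paragraph above describes.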
We described random variables as a representation for building rich probabilistic programs over computational graphs. We now describe a compositional representation for inference. We desire two criteria: (a) support for many classes of inference, where the form of the inferred posterior depends on the algorithm; and (b) invariance of inference under the computational graph, that is, the posterior can be further composed as part of another model.
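To explain our approach, we will use a simple hierarchical model as a running example. Figure 5 displays a joint distribution $p(\mathbf{x}, \mathbf{z}, \beta)$ of data $\mathbf{x}$, local variables $\mathbf{z}$, and global variables $\beta$. The ideas here extend to more expressive programs.

The goal of inference is to calculate the posterior distribution $p(\mathbf{z}, \beta \mid \mathbf{x}_{\text{train}}; \theta)$ given data $\mathbf{x}_{\text{train}}$, where $\theta$ are any model parameters that we will compute point estimates for. (Footnote 2: For example, we could replace x’s sigma argument with tf.exp(tf.Variable(0.0))*tf.ones([N, D]). This defines a model parameter initialized at 0 and positive-constrained.) We formalize this as the following optimization problem:

$$\min_{\lambda, \theta} \; \mathcal{L}\big(p(\mathbf{z}, \beta \mid \mathbf{x}_{\text{train}}; \theta), \; q(\mathbf{z}, \beta; \lambda)\big), \tag{1}$$

where $q(\mathbf{z}, \beta; \lambda)$ is an approximation to the posterior $p(\mathbf{z}, \beta \mid \mathbf{x}_{\text{train}})$, and $\mathcal{L}$ is a loss function with respect to $p$ and $q$.

The choice of approximation $q$, loss $\mathcal{L}$, and rules to update parameters $\{\theta, \lambda\}$ are specified by an inference algorithm. (Note $q$ can be nonparametric, such as a point or a collection of samples.)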
In Edward, this problem is expressed with Inference, an abstract class which takes two inputs. The first is a collection of latent random variables beta and z, associated to their “posterior variables” qbeta and qz respectively. The second is a collection of observed random variables x, which is associated to their realizations x_train.
The idea is that Inference defines and solves the optimization in Equation 1. It adjusts parameters of the distribution of qbeta and qz (and any model parameters) to be close to the posterior.
Class methods are available to finely control the inference. Calling inference.initialize() builds a computational graph to update $\{\theta, \lambda\}$. Calling inference.update() runs this computation once to update $\{\theta, \lambda\}$; we call the method in a loop until convergence. Importantly, no efficiency is lost in Edward’s language: the computational graph is the same as if it were handwritten for a specific model. This means the runtime is the same; also see our experiments in Section 5.2.
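The initialize-then-update pattern can be sketched in plain Python for a point estimate in a Bernoulli model with a Beta prior. Everything here is hypothetical and only mirrors the shape of the workflow: `ToyPointEstimate` is not Edward's Inference class, the precomputation in `initialize()` merely stands in for building a graph, and the estimate is the mode of the log joint in the probability parameterization.

```python
import math


class ToyPointEstimate:
    """Hypothetical stand-in for an inference object; not Edward's API."""

    def __init__(self, data, prior_a=1.0, prior_b=1.0, step_size=0.5):
        self.data = data
        self.prior_a, self.prior_b = prior_a, prior_b
        self.step_size = step_size

    def initialize(self):
        # Precompute everything update() needs (the analogue of building a graph).
        successes = sum(self.data)
        self.a = self.prior_a - 1.0 + successes
        self.b = self.prior_b - 1.0 + len(self.data) - successes
        self.logit = 0.0  # unconstrained parameter; theta = sigmoid(logit)

    def theta(self):
        return 1.0 / (1.0 + math.exp(-self.logit))

    def update(self):
        # One gradient-ascent step on the log joint with respect to the logit:
        # d/dlogit [a * log(theta) + b * log(1 - theta)] = a - (a + b) * theta
        grad = self.a - (self.a + self.b) * self.theta()
        self.logit += self.step_size * grad / (self.a + self.b)


x_train = [1] * 35 + [0] * 15
inference = ToyPointEstimate(x_train)
inference.initialize()
for _ in range(500):  # call update() in a loop until convergence
    inference.update()
theta_map = inference.theta()
```

With the flat Beta(1, 1) prior assumed here, the point estimate converges to the fraction of ones in the data.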
A key concept in Edward is that there is no distinct “model” or “inference” block. A model is simply a collection of random variables, and inference is a way of modifying parameters in that collection subject to another. This reductionism offers significant flexibility. For example, we can infer only parts of a model (e.g., layer-wise training (Hinton et al., 2006)), infer parts used in multiple models (e.g., multi-task learning), or plug in a posterior into a new model (e.g., Bayesian updating).
The design of Inference is very general. We describe subclasses to represent many algorithms below: variational inference, Monte Carlo, and GAN.
Variational inference posits a family of approximating distributions and finds the closest member in the family to the posterior (Jordan et al., 1999). In Edward, we build the variational family in the graph; see Figure 6 (left). For our running example, the family has mutable variables as parameters $\lambda = \{\pi, \mu, \sigma\}$, where $q(\beta; \mu, \sigma) = \text{Normal}(\beta; \mu, \sigma)$ and $q(\mathbf{z}; \pi) = \text{Categorical}(\mathbf{z}; \pi)$.
Specific variational algorithms inherit from the VariationalInference class. Each defines its own methods, such as a loss function and gradient. For example, we represent MAP estimation with an approximating family (qbeta and qz) of PointMass random variables, i.e., with all probability mass concentrated at a point. MAP inherits from VariationalInference and defines the negative log joint density as the loss function; it uses existing optimizers inside TensorFlow. In Section 5.1, we experiment with multiple gradient estimators for black box variational inference (Ranganath et al., 2014). Each estimator implements the same loss (an objective proportional to the divergence $\text{KL}(q \,\|\, p)$) and a different update rule (stochastic gradient).
Monte Carlo approximates the posterior using samples (Robert & Casella, 1999). Monte Carlo is an inference where the approximating family is an empirical distribution, $q(\beta; \{\beta^{(t)}\}) = \frac{1}{T} \sum_{t=1}^{T} \delta(\beta, \beta^{(t)})$ and $q(\mathbf{z}; \{\mathbf{z}^{(t)}\}) = \frac{1}{T} \sum_{t=1}^{T} \delta(\mathbf{z}, \mathbf{z}^{(t)})$. The parameters are $\lambda = \{\beta^{(t)}, \mathbf{z}^{(t)}\}$. See Figure 6 (right). Monte Carlo algorithms proceed by updating one sample at a time in the empirical approximation. Specific MCMC samplers determine the update rules: they can use gradients such as in Hamiltonian Monte Carlo (Neal, 2011) and graph structure such as in sequential Monte Carlo (Doucet et al., 2001).
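As a small illustration of an empirical approximation that is filled in one sample at a time, the sketch below runs random-walk Metropolis on the posterior of a Bernoulli probability. It is a simpler sampler swapped in for illustration, not Hamiltonian Monte Carlo and not Edward's implementation; the function names, the proposal scale, and the burn-in length are all assumptions.

```python
import math
import random


def log_target(theta, successes, failures):
    """Unnormalized log posterior of theta under a flat Beta(1, 1) prior."""
    return successes * math.log(theta) + failures * math.log(1.0 - theta)


def metropolis(successes, failures, n_samples, step=0.1, seed=0):
    rng = random.Random(seed)
    theta = 0.5
    samples = []  # the empirical approximation, grown one sample at a time
    for _ in range(n_samples):
        proposal = theta + rng.gauss(0.0, step)
        if 0.0 < proposal < 1.0:  # proposals outside the support are rejected
            log_ratio = (log_target(proposal, successes, failures)
                         - log_target(theta, successes, failures))
            if rng.random() < math.exp(min(0.0, log_ratio)):
                theta = proposal
        samples.append(theta)
    return samples


samples = metropolis(successes=35, failures=15, n_samples=20000)
kept = samples[1000:]  # discard an assumed burn-in period
posterior_mean = sum(kept) / len(kept)
exact_mean = (1 + 35) / (2 + 50)  # mean of the exact Beta(36, 16) posterior
```

Averaging over the stored samples approximates posterior expectations; here the conjugate model gives an exact mean to compare against.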
Edward also supports non-Bayesian methods such as GANs (Goodfellow et al., 2014). See Figure 7. The model posits random noise eps over $N$ data points, each with $d$ dimensions; this random noise feeds into a generative_network function, a neural network that outputs real-valued data x. In addition, there is a discriminative_network which takes data as input and outputs the probability that the data is real (in logit parameterization). We build GANInference; running it optimizes parameters inside the two neural network functions. This approach extends to many advances in GANs (e.g., Denton et al. (2015); Li et al. (2015)).
Finally, one can design algorithms that would otherwise require tedious algebraic manipulation. With symbolic algebra on nodes of the computational graph, we can uncover conjugacy relationships between random variables. Users can then integrate out variables to automatically derive classical Gibbs (Gelfand & Smith, 1990), mean-field updates (Bishop, 2006), and exact inference. These algorithms are currently being developed in Edward.
Core to Edward’s design is that inference can be written as a collection of separate inference programs. Below we demonstrate variational EM, with an (approximate) E-step over local variables and an M-step over global variables. We instantiate two algorithms, each of which conditions on inferences from the other, and we alternate with one update of each (Neal & Hinton, 1993).
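The alternating scheme can be sketched in plain Python on a toy problem. This sketch uses exact EM for a two-component Gaussian mixture with unit variances and equal weights rather than variational EM, and it does not use Edward; `e_step`, `m_step`, the synthetic data, and the starting means are all assumptions made for illustration.

```python
import math
import random


def e_step(data, means):
    """Responsibility of component 1 for each point, given the current means."""
    resp = []
    for x in data:
        w0 = math.exp(-0.5 * (x - means[0]) ** 2)
        w1 = math.exp(-0.5 * (x - means[1]) ** 2)
        resp.append(w1 / (w0 + w1))
    return resp


def m_step(data, resp):
    """Means that maximize the expected log joint, given the responsibilities."""
    n1 = sum(resp)
    n0 = len(data) - n1
    mean0 = sum((1.0 - r) * x for r, x in zip(resp, data)) / n0
    mean1 = sum(r * x for r, x in zip(resp, data)) / n1
    return [mean0, mean1]


rng = random.Random(0)
data = ([rng.gauss(-2.0, 1.0) for _ in range(200)]
        + [rng.gauss(2.0, 1.0) for _ in range(200)])

means = [-1.0, 1.0]
for _ in range(50):
    resp = e_step(data, means)   # local variables, conditioned on the means
    means = m_step(data, resp)   # global variables, conditioned on the local ones
```

Each step conditions on the latest output of the other, which is the structure that two separate inference programs express.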
This extends to many other cases such as exact EM for exponential families, contrastive divergence (Hinton, 2002), pseudo-marginal methods (Andrieu & Roberts, 2009), and Gibbs sampling within variational inference (Wang & Blei, 2012; Hoffman & Blei, 2015). We can also write message passing algorithms, which solve a collection of local inference problems (Koller & Friedman, 2009). For example, classical message passing uses exact local inference and expectation propagation locally minimizes the Kullback-Leibler divergence, $\text{KL}(p \,\|\, q)$ (Minka, 2001).

Stochastic optimization (Bottou, 2010) scales inference to massive data and is key to algorithms such as stochastic gradient Langevin dynamics (Welling & Teh, 2011) and stochastic variational inference (Hoffman et al., 2013). The idea is to cheaply estimate the model’s log joint density in an unbiased way. At each step, one subsamples a data set $\{x_m\}$ of size $M$ and then scales densities with respect to local variables,

$$\log p(\mathbf{x}, \mathbf{z}, \beta) = \log p(\beta) + \sum_{n=1}^{N} \big[\log p(x_n \mid z_n, \beta) + \log p(z_n \mid \beta)\big] \approx \log p(\beta) + \frac{N}{M} \sum_{m=1}^{M} \big[\log p(x_m \mid z_m, \beta) + \log p(z_m \mid \beta)\big].$$
To support stochastic optimization, we represent only a subgraph of the full model. This prevents reifying the full model, which can lead to unreasonable memory consumption (Tristan et al., 2014). During initialization, we pass in a dictionary to properly scale the arguments. See Figure 8.
Conceptually, the scale argument represents scaling for each random variable’s plate, as if we had seen that random variable $N/M$ as many times. As an example, Appendix B shows how to implement stochastic variational inference in Edward. The approach extends naturally to streaming data (Doucet et al., 2000; Broderick et al., 2013; McInerney et al., 2015), dynamic batch sizes, and data structures in which working on a subgraph does not immediately apply (Binder et al., 1997; Johnson & Willsky, 2014; Foti et al., 2014).
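The effect of the scaling can be checked with a few lines of plain Python. The sketch below uses a toy Bernoulli likelihood with no local latent variables, so it covers only the likelihood term; the data, the batch size, and the function names are assumptions made for illustration.

```python
import math


def log_lik(x, theta):
    """Bernoulli log-likelihood of one binary observation."""
    return math.log(theta) if x == 1 else math.log(1.0 - theta)


def full_log_lik(data, theta):
    return sum(log_lik(x, theta) for x in data)


def scaled_minibatch_log_lik(batch, n_total, theta):
    """Minibatch sum rescaled by N / M so that it estimates the full-data sum."""
    return n_total / len(batch) * sum(log_lik(x, theta) for x in batch)


data = [1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1]  # N = 12 toy observations
batches = [data[i:i + 4] for i in range(0, len(data), 4)]  # M = 4
estimates = [scaled_minibatch_log_lik(b, len(data), 0.6) for b in batches]
average_estimate = sum(estimates) / len(estimates)
exact = full_log_lik(data, 0.6)
```

Individual minibatch estimates differ from the full-data value, but their average over a partition of the data matches it, which is what unbiasedness means for a batch picked uniformly from that partition.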
In this section, we illustrate two main benefits of Edward: flexibility and efficiency. For the former, we show how it is easy to compare different inference algorithms on the same model. For the latter, we show how it is easy to get significant speedups by exploiting computational graphs.
Inference method | Negative log-likelihood |
---|---|
VAE (Kingma & Welling, 2014) | 88.2 |
VAE without analytic KL | 89.4 |
VAE with analytic entropy | 88.1 |
VAE with score function gradient | 87.9 |
Normalizing flows (Rezende & Mohamed, 2015) | 85.8 |
Hierarchical variational model (Ranganath et al., 2016b) | 85.4 |
Importance-weighted auto-encoders () (Burda et al., 2016) | 86.3 |
HVM with IWAE objective () | 85.2 |
Rényi divergence () (Li & Turner, 2016) | 140.5 |
Table 1: Inference methods for a probabilistic decoder on binarized MNIST. The Edward PPL is a convenient research platform, making it easy to both develop and experiment with many algorithms.
We demonstrate Edward’s flexibility for experimenting with complex inference algorithms. We consider the VAE setup from Figure 2 and the binarized MNIST data set (Salakhutdinov & Murray, 2008). We use latent variables per data point and optimize using ADAM. We study different components of the VAE setup using different methods; Section C.1 is a complete script. After training we evaluate held-out log likelihoods, which are lower bounds on the true value.
Table 1 shows the results. The first method uses the VAE from Figure 2. The next three methods use the same VAE but apply different gradient estimators: reparameterization gradient without an analytic KL; reparameterization gradient with an analytic entropy; and the score function gradient (Paisley et al., 2012; Ranganath et al., 2014). This typically leads to the same optima but at different convergence rates. The score function gradient was slowest. Gradients with an analytic entropy produced difficulties around convergence: we switched to stochastic estimates of the entropy as it approached an optimum. We also use HVM (Ranganath et al., 2016b) with a normalizing flow prior; it produced results similar to a normalizing flow on the latent variable space (Rezende & Mohamed, 2015), and better than IWAE (Burda et al., 2016).
We also study novel combinations, such as HVM with the IWAE objective, GAN-based optimization on the decoder (with pixel intensity-valued data), and Rényi divergence on the decoder. GAN-based optimization does not enable calculation of the log-likelihood; Rényi divergence does not directly optimize for log-likelihood so it does not perform well. The key point is that Edward is a convenient research platform: all of these methods are easy modifications of a given script.
Probabilistic programming system | Runtime (s) |
---|---|
Handwritten NumPy (1 CPU) | 534 |
Stan (1 CPU) (Carpenter et al., 2016) | 171 |
PyMC3 (12 CPU) (Salvatier et al., 2015) | 30.0 |
Edward (12 CPU) | 8.2 |
Handwritten TensorFlow (GPU) | 5.0 |
Edward (GPU) | 4.9 |
We benchmark runtimes for a fixed number of Hamiltonian Monte Carlo (HMC; Neal, 2011) iterations on modern hardware: a 12-core Intel i7-5930K CPU at 3.50GHz and an NVIDIA Titan X (Maxwell) GPU. We apply logistic regression on the Covertype dataset (, ; responses were binarized) using Edward, Stan (with PyStan) (Carpenter et al., 2016), and PyMC3 (Salvatier et al., 2015). We ran 100 HMC iterations, with 10 leapfrog updates per iteration, a step size of , and single precision. Figure 9 illustrates the program in Edward.
Table 2 displays the runtimes. (Footnote 3: In a previous version of this paper, we reported PyMC3 took 361s. This was caused by a bug preventing PyMC3 from correctly handling single-precision floating point. (PyMC3 with double precision is roughly 14x slower than Edward (GPU).) This has been fixed after discussion with Thomas Wiecki. The reported numbers also exclude compilation time, which is significant for Stan.) Edward (GPU) features a dramatic 35x speedup over Stan (1 CPU) and 6x speedup over PyMC3 (12 CPU). This showcases the value of building a PPL on top of computational graphs. The speedup stems from fast matrix multiplication when calculating the model’s log-likelihood; GPUs can efficiently parallelize this computation. We expect similar speedups for models whose bottleneck is also matrix multiplication, such as deep neural networks.
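The quoted speedups follow directly from the runtimes in Table 2. The short calculation below reproduces them; the dictionary layout and variable names are only illustrative.

```python
# Runtimes in seconds, copied from Table 2.
runtimes = {
    "Stan (1 CPU)": 171.0,
    "PyMC3 (12 CPU)": 30.0,
    "Handwritten TensorFlow (GPU)": 5.0,
    "Edward (GPU)": 4.9,
}

baseline = runtimes["Edward (GPU)"]
speedups = {name: seconds / baseline for name, seconds in runtimes.items()}

for name, ratio in speedups.items():
    print(f"{name}: {ratio:.1f}x the runtime of Edward (GPU)")
```

The ratios are roughly 34.9 for Stan and 6.1 for PyMC3, which round to the 35x and 6x figures, while handwritten TensorFlow is within about 2% of Edward on the GPU.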
There are various reasons for the speedup. Stan only used 1 CPU as it leverages multiple cores by running HMC chains in parallel. Stan also used double-precision floating point as it does not allow single-precision. For PyMC3, we note Edward’s speedup is not a result of PyMC3’s Theano backend compared to Edward’s TensorFlow. Rather, PyMC3 does not use Theano for all its computation, so it experiences communication overhead with NumPy. (PyMC3 was actually slower when using the GPU.) We predict that porting Edward’s design to Theano would feature similar speedups.
In addition to these speedups, we highlight that Edward has no runtime overhead: it is as fast as handwritten TensorFlow. Following Section 4.1, this is because the computational graphs for inference are in fact the same for Edward and the handwritten code.
In addition to Edward, we also release the Probability Zoo, a community repository for pre-trained probability models and their posteriors. (Footnote 4: The Probability Zoo is available at http://edwardlib.org/zoo. It includes model parameters and inferred posterior factors, such as local and global variables during training and any inference networks.) It is inspired by the model zoo in Caffe (Jia et al., 2014), which provides many pre-trained discriminative neural networks, and which has been key to making large-scale deep learning more transparent and accessible. It is also inspired by Forest (Stuhlmüller, 2012), which provides examples of probabilistic programs.

We described Edward, a Turing-complete PPL with compositional representations for probabilistic models and inference. Edward expands the scope of probabilistic programming to be as flexible and computationally efficient as traditional deep learning. For flexibility, we showed how Edward can use a variety of composable inference methods, capture recent advances in variational inference and GANs, and finely control the inference algorithms. For efficiency, we showed how Edward leverages computational graphs to achieve fast, parallelizable computation, scales to massive data, and incurs no runtime overhead over handwritten code.
In present work, we are applying Edward as a research platform for developing new probabilistic models (Rudolph et al., 2016; Tran et al., 2017) and new inference algorithms (Dieng et al., 2016). As with any language design, Edward makes tradeoffs in pursuit of its flexibility and speed for research. For example, an open challenge in Edward is to better facilitate programs with complex control flow and recursion. While possible to represent, it is unknown how to enable their flexible inference strategies. In addition, it is open how to expand Edward’s design to dynamic computational graph frameworks—which provide more flexibility in their programming paradigm—but may sacrifice performance. A crucial next step for probabilistic programming is to leverage dynamic computational graphs while maintaining the flexibility and efficiency that Edward offers.
We thank the probabilistic programming community—for sharing our enthusiasm and motivating further work—including developers of Church, Venture, Gamalon, Hakaru, and WebPPL. We also thank Stan developers for providing extensive feedback as we developed the language, as well as Thomas Wiecki for experimental details. We thank the Google BayesFlow team—Joshua Dillon, Ian Langmore, Ryan Sepassi, and Srinivas Vasudevan—as well as Amr Ahmed, Matthew Johnson, Hung Bui, Rajesh Ranganath, Maja Rudolph, and Francisco Ruiz for their helpful feedback. This work is supported by NSF IIS-1247664, ONR N00014-11-1-0651, DARPA FA8750-14-2-0009, DARPA N66001-15-C-4032, Adobe, Google, NSERC PGS-D, and the Sloan Foundation.
International Joint Conference on Artificial Intelligence, 1997.
Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT’2010, pp. 177–186. Springer, 2010.
Importance weighted autoencoders. In International Conference on Learning Representations, 2016.
Stochastic variational inference for hidden Markov models. In Neural Information Processing Systems, 2014.
DRAW: A recurrent neural network for image generation. In International Conference on Machine Learning, 2015.
Generative moment matching networks. In International Conference on Machine Learning, 2015.
The Population Posterior and Bayesian Inference on Streams. In Neural Information Processing Systems, 2015.
Handbook of Markov Chain Monte Carlo, 2011.
pp. 733–740. Citeseer, 2001.
Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning, 2014.
On the quantitative analysis of deep belief networks. In International Conference on Machine Learning, 2008.

There are many examples available at http://edwardlib.org, including models, inference methods, and complete scripts. Below we describe several model examples; Appendix B describes an inference example (stochastic variational inference); Appendix C describes complete scripts. All examples in this paper are comprehensive, only leaving out import statements and fixed values. See the companion webpage for this paper (http://edwardlib.org/iclr2017) for examples in a machine-readable format with runnable code.
A Bayesian neural network is a neural network with a prior distribution on its weights.
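Define the likelihood of an observation $(\mathbf{x}_n, y_n)$ with binary label $y_n \in \{0, 1\}$ as

$$p(y_n \mid \mathbf{z}; \mathbf{x}_n) = \text{Bernoulli}(y_n \mid \text{NN}(\mathbf{x}_n; \mathbf{z})),$$

where $\text{NN}$ is a 2-layer neural network whose weights and biases form the latent variables $\mathbf{z}$. Define the prior on the weights and biases to be the standard normal. See Figure 10. There are $N$ data points, $D$ features, and $H$ hidden units.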
See Figure 11. Note that the program is written for illustration. We recommend vectorization in practice: instead of storing scalar random variables in lists of lists, one should prefer to represent a few random variables, each of which has many dimensions.
See Figure 12.
See Figure 13.
In the subgraph setting, we do data subsampling while working with a subgraph of the full model. This setting is necessary when the data and model do not fit in memory. It is scalable in that both the algorithm’s computational complexity (per iteration) and memory complexity are independent of the data set size.
For the code, we use the running example, a mixture model described in Figure 5.
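The model is

$$p(\mathbf{x}, \mathbf{z}, \beta) = p(\beta) \prod_{n=1}^{N} p(z_n \mid \beta)\, p(x_n \mid z_n, \beta).$$

To avoid memory issues, we work on only a subgraph of the model,

$$p(\mathbf{x}, \mathbf{z}, \beta) = p(\beta) \prod_{m=1}^{M} p(z_m \mid \beta)\, p(x_m \mid z_m, \beta).$$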
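Assume the variational model is

$$q(\mathbf{z}, \beta) = q(\beta; \lambda) \prod_{n=1}^{N} q(z_n \mid \beta; \gamma_n),$$

parameterized by $\{\lambda, \{\gamma_n\}\}$. Again, we work on only a subgraph of the model,

$$q(\mathbf{z}, \beta) = q(\beta; \lambda) \prod_{m=1}^{M} q(z_m \mid \beta; \gamma_m),$$

parameterized by $\{\lambda, \{\gamma_m\}\}$. Importantly, only $M$ parameters are stored in memory for $\{\gamma_m\}$ rather than $N$.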
We use KLqp, a variational method that minimizes the divergence measure $\text{KL}(q \,\|\, p)$ (Jordan et al., 1999). We instantiate two algorithms: a global inference over $\beta$ given the subset of $\mathbf{z}$ and a local inference over the subset of $\mathbf{z}$ given $\beta$. We also pass in a TensorFlow placeholder x_ph for the data, so we can change the data at each step.
We initialize the algorithms with the scale argument, so that computation on z and x
will be scaled appropriately. This enables unbiased estimates for stochastic gradients.
We now run the algorithm, assuming there is a next_batch function which provides the next batch of data.
After each iteration, we also reinitialize the parameters of qz; this is because we do inference on a new set of local variational factors for each batch. This demo readily applies to other inference algorithms such as SGLD (stochastic gradient Langevin dynamics): simply replace qbeta and qz with Empirical random variables; then call ed.SGLD instead of ed.KLqp.
Note that if the data and model fit in memory but you’d still like to perform data subsampling for fast inference, we recommend not defining subgraphs. You can reify the full model, and simply index the local variables with a placeholder. The placeholder is fed at runtime to determine which of the local variables to update at a time. (For more details, see the website’s API.)
See Figure 14.
See Figure 15. This example uses data subsampling (Section 4.4). The priors and conditional likelihoods are defined only for a minibatch of data. Similarly the variational model only models the embeddings used in a given minibatch. TensorFlow variables contain the embedding vectors for the entire vocabulary. TensorFlow placeholders ensure that the correct embedding vectors are used as variational parameters for a given minibatch.
The Bernoulli variables y_pos and y_neg are fixed to be 1’s and 0’s respectively. They model whether a word is indeed the target word for a given context window or has been drawn as a negative sample. Without regularization (via priors), the objective we optimize is identical to negative sampling.
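The equivalence with negative sampling can be checked numerically in plain Python. In the sketch below the embedding vectors are made-up numbers and the helper names are hypothetical; it assumes the Bernoulli probability is the sigmoid of a dot product between embeddings, and it only verifies that a Bernoulli log-likelihood with labels fixed to 1 and 0 equals the usual negative-sampling sum of log-sigmoids.

```python
import math


def log_sigmoid(t):
    """Numerically stable log(1 / (1 + exp(-t)))."""
    return -math.log1p(math.exp(-t)) if t >= 0 else t - math.log1p(math.exp(t))


def bernoulli_log_prob(y, logit):
    """log Bernoulli(y | sigmoid(logit)) for y in {0, 1}."""
    return log_sigmoid(logit) if y == 1 else log_sigmoid(-logit)


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


# Hypothetical embeddings: one target word, its context, two negative samples.
target = [0.2, -0.1, 0.4]
context = [0.3, 0.5, -0.2]
negatives = [[-0.4, 0.1, 0.3], [0.6, -0.2, 0.1]]

y_pos, y_neg = 1, 0  # labels fixed as described in the text
log_lik = bernoulli_log_prob(y_pos, dot(target, context)) + sum(
    bernoulli_log_prob(y_neg, dot(neg, context)) for neg in negatives)

negative_sampling = log_sigmoid(dot(target, context)) + sum(
    log_sigmoid(-dot(neg, context)) for neg in negatives)
```

The two quantities agree because the probability of a 0 label under a sigmoid is the sigmoid of the negated score.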