VI is a technique for approximating the posterior distribution in probabilistic models (Jordan et al., 1999; Wainwright and Jordan, 2008). Given a probabilistic model of observed variables and hidden variables , the goal of VI is to approximate the posterior , which is intractable to compute exactly for many models. The idea of VI is to posit a family of distributions over the latent variables with free variational parameters . VI then fits those parameters to find the member of the family that is closest in KL divergence to the exact posterior, . This turns inference into optimization, and different ways of doing VI amount to different optimization algorithms for solving this problem.
For a certain class of probabilistic models, those where each conditional distribution is in an exponential family, we can easily use coordinate ascent optimization to minimize the KL divergence (Ghahramani and Beal, 2001)
. However, many important models do not fall into this class (e.g., probabilistic neural networks or Bayesian generalized linear models). This is the scenario that we focus on in this paper. Much recent research in VI has focused on these difficult settings, seeking effective optimization algorithms that can be used with any model. This has enabled the application of VI on nonconjugate probabilistic models(Carbonetto et al., 2009; Paisley et al., 2012; Ranganath et al., 2014; Titsias and Lázaro-Gredilla, 2014), deep neural networks (Neal, 1992; Hinton et al., 1995; Mnih and Gregor, 2014; Kingma and Welling, 2014), and probabilistic programming (Wingate and Weber, 2013; Kucukelbir et al., 2015; van de Meent et al., 2016).
One strategy for VI in nonconjugate models is to obtain Monte Carlo estimates of the gradient of the variational objective and to use stochastic optimization to fit the variational parameters. Within this strategy, there have been two main lines of research: BBVI(Ranganath et al., 2014) and reparameterization gradients (Salimans and Knowles, 2013; Kingma and Welling, 2014). Each enjoys different advantages and limitations.
BBVI expresses the gradient of the variational objective as an expectation with respect to the variational distribution using the log-derivative trick, also called reinforce or score function method (Glynn, 1990; Williams, 1992). It then takes samples from the variational distribution to calculate noisy gradients. BBVI is generic—it can be used with any type of latent variables and any model. However, the gradient estimates typically suffer from high variance, which can lead to slow convergence. Ranganath et al. (2014) reduce the variance of these estimates using Rao-Blackwellization (Casella and Robert, 1996) and control variates (Ross, 2002; Paisley et al., 2012; Gu et al., 2016). Other researchers have proposed further reductions, e.g., through local expectations (Titsias and Lázaro-Gredilla, 2015) and importance sampling (Ruiz et al., 2016).
The second approach to Monte Carlo gradients of the variational objective is through reparameterization (Price, 1958; Bonnet, 1964; Salimans and Knowles, 2013; Kingma and Welling, 2014; Rezende et al., 2014). This approach reparameterizes the latent variable
in terms of a set of auxiliary random variables whose distributions do not depend on the variational parameters (typically, a standard normal). This facilitates taking gradients of the variational objective because the gradient operator can be pushed inside the expectation, and because the resulting procedure only requires drawing samples from simple distributions, such as standard normals. We describe this in detail in Section2.
Reparameterization gradients exhibit lower variance than BBVI gradients. They typically need only one Monte Carlo sample to estimate a noisy gradient, which leads to fast algorithms. Further, for some models, their variance can be bounded (Fan et al., 2015). However, reparameterization is not as generic as BBVI. It is typically used with Gaussian variational distributions and does not easily generalize to other common distributions, such as the gamma or beta, without using further approximations. (See Knowles (2015)
for an alternative approach to deal with the gamma distribution.)
We develop the G-REP gradient, a new method to extend reparameterization to other variational distributions. The main idea is to define an invertible transformation of the latent variables such that the distribution of the transformed variables is only weakly governed by the variational parameters. (We make this precise in Section 3.) Our technique naturally combines both BBVI and reparameterization; it applies to a wide class of nonconjugate models; it maintains the black-box criteria of reusing variational families; and it avoids approximations. We empirically show in two probabilistic models—a nonconjugate factorization model and a deep exponential family (Ranganath et al., 2015)—that a single Monte Carlo sample is enough to build an effective low-variance estimate of the gradient. In terms of speed, G-REP outperforms BBVI. In terms of accuracy, it outperforms ADVI (Kucukelbir et al., 2016), which considers Gaussian variational distributions on a transformed space.
Consider a probabilistic model , where denotes the latent variables and the observations. We assume that the posterior distribution is analytically intractable and we wish to apply VI. We introduce a tractable distribution to approximate and minimize the KL divergence with respect to the variational parameters . This minimization is equivalently expressed as the maximization of the so-called ELBO (Jordan et al., 1999),
to be the model log-joint density and to be the entropy of the variational distribution. When the expectation is analytically tractable, the maximization of the ELBO can be carried out using standard optimization methods. Otherwise, when it is intractable, other techniques are needed. Recent approaches rely on stochastic optimization to construct Monte Carlo estimates of the gradient with respect to the variational parameters. Below, we review the two main methods for building such Monte Carlo estimates: the score function method and the reparameterization trick.
Score function method. A general way to obtain unbiased stochastic gradients is to use the score function method, also called log-derivative trick or reinforce (Williams, 1992; Glynn, 1990), which has been recently applied to VI (Paisley et al., 2012; Ranganath et al., 2014; Mnih and Gregor, 2014). It is based on writing the gradient of the ELBO with respect to as
and then building Monte Carlo estimates by approximating the expectation with samples from . The resulting estimator suffers from high variance, making it necessary to apply variance reduction methods such as control variates (Ross, 2002) or Rao-Blackwellization (Casella and Robert, 1996). Such variance reduction techniques have been used in BBVI (Ranganath et al., 2014).
Reparameterization. The reparameterization trick (Salimans and Knowles, 2013; Kingma and Welling, 2014) expresses the latent variables as an invertible function of another set of variables , i.e., , such that the distribution of the new random variables does not depend on the variational parameters . Under these assumptions, expectations with respect to can be expressed as , and the gradient with respect to can be pushed into the expectation, yielding
The assumption here is that the log-joint is differentiable with respect to the latent variables. The gradient depends on the model, but it can be computed using automatic differentiation tools (Baydin et al., 2015). Monte Carlo estimates of the reparameterization gradient typically present much lower variance than those based on Eq. 3. In practice, a single sample from is enough to estimate a low-variance gradient.111In the literature, there is no formal proof that reparameterization has lower variance than the score function estimator, except for some simple models (Fan et al., 2015). Titsias and Lázaro-Gredilla (2014) provide some intuitions, and Rezende et al. (2014) show some benefits of reparameterization in the Gaussian case.
The reparameterization trick is thus a powerful technique to reduce the variance of the estimator, but it requires a transformation such that does not depend on the variational parameters . For instance, if the variational distribution is Gaussian with mean and covariance , a straightforward transformation consists of standardizing the random variable , i.e.,
This transformation ensures that the (Gaussian) distribution does not depend on or . For a general variational distribution , Kingma and Welling (2014) discuss three families of transformations: inverse CDF, location-scale, and composition. However, these transformations may not apply in certain cases.222The inverse CDF approach sets to the CDF. This leads to a uniform distribution over
to the CDF. This leads to a uniform distribution overon the unit interval, but it is not practical because the inverse CDF, , does not have analytical solution in general. We develop an approach that does not require computation of CDF or their derivatives. Notably, none of them apply to the gamma333Composition is only available when it is possible to express the gamma as a sum of exponentials, i.e., its shape parameter is an integer, which is not generally the case in VI.
and the beta distributions, although these distributions are often used in VI.
Next, we show how to relax the constraint that the transformed density must not depend on the variational parameters . We follow a standardization procedure similar to the Gaussian case in Eq. 5, but we allow the distribution of the standardized variable to depend (at least weakly) on .
3 The Generalized Reparameterization Gradient
We now generalize the reparameterization idea to distributions that, like the gamma or the beta, do not admit the standard reparameterization trick. We assume that we can efficiently sample from the variational distribution , and that is differentiable with respect to and . We introduce a random variable defined by an invertible transformation
where we can think of as a standardization procedure that attempts to make the distribution of weakly dependent on the variational parameters
. “Weakly” means that at least its first moment does not depend on. For instance, if is defined to have zero mean, then its first moment has become independent of . However, we do not assume that the resulting distribution of is completely independent of the variational parameters , and therefore we write it as . We use the distribution in the derivation of G-REP, but we write the final gradient as an expectation with respect to the original variational distribution , from which we can sample.
More in detail, by the standard change-of-variable technique, the transformed density is
is a short-hand for the absolute value of the determinant of the Jacobian. We first use the transformation to rewrite the gradient of in (1) as
We now express the gradient as the sum of two terms, which we name and for reasons that we will explain below. We apply the log-derivative trick and the product rule for derivatives, yielding
We rewrite Eq. 9 as an expression that involves expectations with respect to the original variational distribution only. For that, we define the following two auxiliary functions that depend on the transformation :
After some algebra (see the Supplement for details), we obtain
Thus, we can finally write the full gradient of the ELBO as
Interpretation of the generalized reparameterization gradient. The term is easily recognizable as the standard reparameterization gradient, and hence the label “rep.” Indeed, if the distribution does not depend on the variational parameters , then the term in Eq. 9 vanishes, making . Thus, we may interpret as a “correction” term that appears when the transformed density depends on the variational parameters.
Furthermore, we can recover the score function gradient in Eq. 3 by choosing the identity transformation, . In such case, the auxiliary functions in Eq. 10 become zero because the transformation does not depend on , i.e., and . This implies that and .
where the second line is the control variate, which involves the reparameterization gradient.
Transformations. Eqs. 9 and 11 are valid for any transformation . However, we may expect some transformations to perform better than others, in terms of the variance of the resulting estimator. It seems sensible to search for transformations that make small, as the reparameterization gradient is known to present low variance in practice under standard smoothness conditions of the log-joint (Fan et al., 2015).444Techniques such as Rao-Blackwellization could additionally be applied to reduce the variance of . We do not apply any such technique in this paper. Transformations that make small are such that becomes weakly dependent on the variational parameters . In the standard reparameterization of Gaussian random variables, the transformation takes the form in (5), and thus is a standardized version of . We mimic this standardization idea for other distributions as well. In particular, for exponential family distributions, we use transformations of the form (sufficient statistic expected sufficient statistic)(scale factor). We present several examples in the next section.
For concreteness, we show here some examples of the equations above for well-known probability distributions. In particular, we choose the gamma, log-normal, and beta distributions.
Gamma distribution. Let be a gamma distribution with shape and rate . We use a transformation based on standardization of the sufficient statistic , i.e.,
where denotes the digamma function, and is its -th derivative. This ensures that has zero mean and unit variance, and thus its two first moments do not depend on the variational parameters and . We now compute the auxiliary functions in Eq. 10 for the components of the gradient with respect to and , which take the form
The terms and are obtained after substituting these results in Eq. 11. We provide the final expressions in the Supplement. We remark here that the component of corresponding to the derivative with respect to the rate equals zero, i.e., , meaning that the distribution of does not depend on the parameter . Indeed, we can compute this distribution following Eq. 7 as
where we can verify that it does not depend on .
For a log-normal distribution with locationand scale , we can standardize the sufficient statistic as
This leads to a standard normal distribution on , which does not depend on the variational parameters, and thus . The auxiliary function , which is needed for , takes the form
Thus, the reparameterization gradient is given in this case by
This corresponds to ADVI (Kucukelbir et al., 2016) with a logarithmic transformation over a positive random variable, since the variational distribution over the transformed variable is Gaussian. For a general variational distribution, we recover ADVI if the transformation makes Gaussian.
Beta distribution. For a random variable , we could rewrite for and , and apply the gamma reparameterization for and . Instead, in the spirit of applying standardization directly over
, we define a transformation to standardize the logit function,(sum of sufficient statistics of the beta),
This ensures that
has zero mean. We can set the denominator to the standard deviation of. However, for larger-scaled models we found better performance with a denominator that makes for the currently drawn sample (see the Supplement for details), even though the variance of the transformed variable is not one in such case.555Note that this introduces some bias since we are ignoring the dependence of on . The reason is that suffers from high variance in the same way as the score function estimator does.
We now present our full algorithm for G-REP. It requires the specification of the variational family and the transformation . Given these, the full procedure is summarized in Algorithm 1. We use the adaptive step-size sequence proposed by Kucukelbir et al. (2016), which combines rmsprop (Tieleman and Hinton, 2012) and Adagrad (Duchi et al., 2011). Let be the -th component of the gradient at the -th iteration, and the step-size for that component. We set
where we set , , , and we explore several values of . Thus, we update the variational parameters as , where ‘’ is the element-wise product.
3.3 Related work
A closely related VI method is ADVI, which also relies on reparameterization and has been incorporated into Stan (Kucukelbir et al., 2015, 2016). ADVI applies a transformation to the random variables such that their support is on the reals and then uses a Gaussian variational posterior on the transformed space. For instance, random variables that are constrained to be positive are first transformed through a logarithmic function and then a Gaussian variational approximating distribution is placed on the unconstrained space. Thus, ADVI struggles to approximate probability densities with singularities, which are useful in models where sparsity is appropriate. In contrast, the G-REP method allows to estimate the gradient for a wider class of variational distributions, including gamma and beta distributions, which are more appropriate to encode sparsity constraints.
to automatically estimate the gradient through a backpropagation algorithm in the context of stochastic computation graphs. However, they do not provide additional insight into this equation, do not apply it to general VI, do not discuss transformations for any distributions, and do not report experiments. Thus, our paper complementsSchulman et al. (2015) and provides an off-the-shelf tool for general VI.
We apply G-REP to perform mean-field VI on two nonconjugate probabilistic models: the sparse gamma DEF and a beta-gamma MF model. The sparse gamma DEF (Ranganath et al., 2015) is a probabilistic model with several layers of latent locations and latent weights, mimicking the architecture of a deep neural network. The weights of the model are denoted by , where and run over latent components, and indexes the layer. The latent locations are , where
denotes the observation. We consider Poisson-distributed observationsfor each dimension . Thus, the model is specified as
We place gamma priors over the weights with rate and shape , and a gamma prior with rate and shape over the top-layer latent variables
. We set the hyperparameter, and we use layers with , , and latent factors.
The second model is a beta-gamma MF model with weights and latent locations . We use this model to describe binary observations , which are modeled as
where and is the inverse logit function. We place a gamma prior with shape and rate over the weights , a uniform prior over the variables , and we use latent components.
Datasets. We apply the sparse gamma DEF on two different databases: (i) the Olivetti database at AT&T,666http://www.cl.cam.ac.uk/research/dtg/attarchive/facedatabase.html which consists of 400 (320 for training and 80 for test) images of human faces in a 8 bit scale (); and (ii) the collection of papers at the Neural Information Processing Systems (nips) 2011 conference, which consists of documents and a vocabulary of effective words in a bag-of-words format ( of words from all documents are set aside to form the test set).
We apply the beta-gamma MF on: (i) the binarizedmnist data,777http://yann.lecun.com/exdb/mnist which consists of images of hand-written digits (we use training and test images); and (ii) the Omniglot dataset (Lake et al., 2015), which consists of images of hand-written characters from different alphabets (we select 10 alphabets, with training images, test images, and characters).
Evaluation. We apply mean-field VI and we compare G-REP with BBVI (Ranganath et al., 2014) and ADVI (Kucukelbir et al., 2016). We do not apply BBVI on the Omniglot dataset due to its computational complexity. At each iteration, we evaluate the ELBO using one sample from the variational distribution, except for ADVI, for which we use samples (for the Omniglot dataset, we only use one sample). We run each algorithm with a fixed computational budget of CPU time. After that time, we also evaluate the predictive log-likelihood on the test set, averaging over 100 posterior samples. For the nips data, we also compute the test perplexity (with one posterior sample) every 10 iterations, given by
Experimental setup. To estimate the gradient, we use Monte Carlo samples for BBVI, and only for ADVI and G-REP. For BBVI, we use Rao-Blackwellization and control variates (we use a separate set of samples to estimate the control variates). For BBVI and G-REP, we use beta and gamma variational distributions, whereas ADVI uses Gaussian distributions on the transformed space, which correspond to log-normal or logit-normal distributions on the original space. Thus, only G-REP and BBVI optimize the same variational family. We parameterize the gamma distribution in terms of its shape and mean, and the beta in terms of its shape parameters and . To avoid constrained optimization, we apply the transformation to the variational parameters that are constrained to be positive and take stochastic gradient steps with respect to . We use the analytic gradient of the entropy terms. We implement ADVI as described by Kucukelbir et al. (2016).
We use the step-size schedule in Eq. 13, and we explore the parameter . For each algorithm and each dataset, we report the results based on the value of for which the best ELBO was achieved. We report the values of in Table 1 (left).
Results. We show in Figure 1 the evolution of the ELBO as a function of the running time for three of the considered datasets. BBVI converges slower than the rest of the methods, since each iteration involves drawing multiple samples and evaluating the log-joint for each of them. ADVI and G-REP achieve similar bounds, except for the mnist dataset, for which G-REP provides a variational approximation that is closer to the posterior, since the ELBO is higher. This is because a variational family with sparse gamma and beta distributions provides a better fit to the data than the variational family to which ADVI is limited (log-normal and logit-normal). ADVI seems to converge slower; however, we do not claim that ADVI converges slower than G-REP in general. Instead, the difference may be due to the different step-sizes schedules that we found to be optimal (see Table 1). We also report in Table 1 (right) the average time per iteration888On the full mnist with training images, G-REP (ADVI) took () seconds per iteration. for each method: BBVI is the slowest method, and ADVI is the fastest because it involves simulation of Gaussian random variables only.
However, G-REP provides higher likelihood values than ADVI. We show in Figure (a)a the evolution of the perplexity (lower is better) for the nips dataset, and in Figure (b)b the resulting test log-likelihood (larger is better) for the rest of the considered datasets. In Figure (b)b, we report the mean and standard deviation over
posterior samples. ADVI cannot fit the data as well as G-REP or BBVI because it is constrained to log-normal and logit-normal variational distributions. These cannot capture sparsity, which is an important feature for the considered models. We can also conclude this by a simple visual inspection of the fitted models. In the Supplement, we compare images sampled from the G-REP and the ADVI posteriors, where we can observe that the latter are more blurry or lack some details.
We have introduced the generalized reparameterization gradient (G-REP), a technique to extend the standard reparameterization gradient to a wider class of variational distributions. As the standard reparameterization method, our method is applicable to any probabilistic model that is differentiable with respect to the latent variables. We have demonstrated the generalized reparameterization gradient on two nonconjugate probabilistic models to fit a variational approximation involving gamma and beta distributions. We have also empirically shown that a single Monte Carlo sample is enough to obtain a noisy estimate of the gradient, therefore leading to a fast inference procedure.
This project has received funding from the EU H2020 programme (Marie Skłodowska-Curie grant agreement 706760), NFS IIS-1247664, ONR N00014-11-1-0651, DARPA FA8750-14-2-0009, DARPA N66001-15-C-4032, Adobe, the John Templeton Foundation, and the Sloan Foundation. The authors would also like to thank Kriste Krstovski, Alp Kuckukelbir, and Christian Naesseth for helpful comments and discussions.
- Baydin et al. (2015) Baydin, A. G., Pearlmutter, B. A., and Radul, A. A. (2015). Automatic differentiation in machine learning: a survey. arXiv:1502.05767.
- Bonnet (1964) Bonnet, G. (1964). Transformations des signaux aléatoires a travers les systemes non linéaires sans mémoire. Annals of Telecommunications, 19(9):203–220.
- Carbonetto et al. (2009) Carbonetto, P., King, M., and Hamze, F. (2009). A stochastic approximation method for inference in probabilistic graphical models. In Advances in Neural Information Processing Systems.
- Casella and Robert (1996) Casella, G. and Robert, C. P. (1996). Rao-Blackwellisation of sampling schemes. Biometrika, 83(1):81–94.
Duchi et al. (2011)
Duchi, J., Hazan, E., and Singer, Y. (2011).
Adaptive subgradient methods for online learning and stochastic
Journal of Machine Learning Research, 12:2121–2159.
- Fan et al. (2015) Fan, K., Wang, Z., Beck, J., Kwok, J., and Heller, K. A. (2015). Fast second order stochastic backpropagation for variational inference. In Advances in Neural Information Processing Systems.
- Ghahramani and Beal (2001) Ghahramani, Z. and Beal, M. J. (2001). Propagation algorithms for variational Bayesian learning. In Advances in Neural Information Processing Systems.
- Glynn (1990) Glynn, P. W. (1990). Likelihood ratio gradient estimation for stochastic systems. Communications of the ACM, 33(10):75–84.
- Gu et al. (2016) Gu, S., Levine, S., Sutskever, I., and Mnih, A. (2016). MuProp: Unbiased backpropagation for stochastic neural networks. In International Conference on Learning Representations.
- Hinton et al. (1995) Hinton, G., Dayan, P., Frey, B. J., and Neal, R. M. (1995). The wake-sleep algorithm for unsupervised neural networks. Science, 268(5214):1158–1161.
- Jordan et al. (1999) Jordan, M. I., Ghahramani, Z., Jaakkola, T. S., and Saul, L. K. (1999). An introduction to variational methods for graphical models. Machine Learning, 37(2):183–233.
- Kingma and Welling (2014) Kingma, D. P. and Welling, M. (2014). Auto-encoding variational Bayes. In International Conference on Learning Representations.
- Knowles (2015) Knowles, D. A. (2015). Stochastic gradient variational Bayes for gamma approximating distributions. arXiv:1509.01631v1.
- Kucukelbir et al. (2015) Kucukelbir, A., Ranganath, R., Gelman, A., and Blei, D. M. (2015). Automatic variational inference in Stan. In Advances in Neural Information Processing Systems.
- Kucukelbir et al. (2016) Kucukelbir, A., Tran, D., Ranganath, R., Gelman, A., and Blei, D. M. (2016). Automatic differentiation variational inference. arXiv:1603.00788.
- Lake et al. (2015) Lake, B. M., Salakhutdinov, R., and Tenenbaum, J. B. (2015). Human-level concept learning through probabilistic program induction. Science, 350(6266):1332–1338.
- Mnih and Gregor (2014) Mnih, A. and Gregor, K. (2014). Neural variational inference and learning in belief networks. In International Conference on Machine Learning.
- Neal (1992) Neal, R. (1992). Connectionist learning of belief networks. Artificial Intelligence, 56(1):71–113.
Paisley et al. (2012)
Paisley, J. W., Blei, D. M., and Jordan, M. I. (2012).
Variational Bayesian inference with stochastic search.In International Conference on Machine Learning.
- Price (1958) Price, R. (1958). A useful theorem for nonlinear devices having Gaussian inputs. IRE Transactions on Information Theory, 4(2):69–72.
- Ranganath et al. (2014) Ranganath, R., Gerrish, S., and Blei, D. M. (2014). Black box variational inference. In Artificial Intelligence and Statistics.
- Ranganath et al. (2015) Ranganath, R., Tang, L., Charlin, L., and Blei, D. M. (2015). Deep exponential families. In Artificial Intelligence and Statistics.
- Rezende et al. (2014) Rezende, D. J., Mohamed, S., and Wierstra, D. (2014). Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning.
- Ross (2002) Ross, S. M. (2002). Simulation. Elsevier.
- Ruiz et al. (2016) Ruiz, F. J. R., Titsias, M. K., and Blei, D. M. (2016). Overdispersed black-box variational inference. In Uncertainty in Artificial Intelligence.
Salimans and Knowles (2013)
Salimans, T. and Knowles, D. A. (2013).
Fixed-form variational posterior approximation through stochastic linear regression.Bayesian Analysis, 8(4):837–882.
- Schulman et al. (2015) Schulman, J., Heess, N., Weber, T., and Abbeel, P. (2015). Gradient estimation using stochastic computation graphs. In Advances in Neural Information Processing Systems.
- Tieleman and Hinton (2012) Tieleman, T. and Hinton, G. (2012). Lecture 6.5-RMSPROP: Divide the gradient by a running average of its recent magnitude. Coursera: Neural Networks for Machine Learning, 4.
- Titsias and Lázaro-Gredilla (2014) Titsias, M. K. and Lázaro-Gredilla, M. (2014). Doubly stochastic variational Bayes for non-conjugate inference. In International Conference on Machine Learning.
- Titsias and Lázaro-Gredilla (2015) Titsias, M. K. and Lázaro-Gredilla, M. (2015). Local expectation gradients for black box variational inference. In Advances in Neural Information Processing Systems.
- van de Meent et al. (2016) van de Meent, J.-W., Tolpin, D., Paige, B., and Wood, F. (2016). Black-box policy search with probabilistic programs. In Artificial Intelligence and Statistics.
- Wainwright and Jordan (2008) Wainwright, M. J. and Jordan, M. I. (2008). Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1–2):1–305.
Williams, R. J. (1992).
Simple statistical gradient-following algorithms for connectionist reinforcement learning.Machine Learning, 8(3–4):229–256.
- Wingate and Weber (2013) Wingate, D. and Weber, T. (2013). Automated variational inference in probabilistic programming. arXiv:1301.1299.
Appendix A Derivation of the Generalized Reparameterization Gradient
Here we show the mathematical derivation of the generalized reparameterization gradient. Firstly, recall the definition of the functions
which are provided in the main text.
We start from the following expression of the gradient, also derived in the main text:
We can write the former term, , as
where we have first replaced the variational distribution on the transformed space with its form as a function of , i.e., . We have then applied the chain rule, and finally we have made a new change of variables back to the original space
. We have then applied the chain rule, and finally we have made a new change of variables back to the original space(thus multiplying by the inverse Jacobian).
For the latter, , we have that
The derivative can be obtained by the chain rule. If , then . We substitute this result in the above equation and revert the change of variables back to the original space (also multiplying by the inverse Jacobian), yielding
where we have used the definition of the functions and .
Appendix B Particularization for the Gamma Distribution
For the gamma distribution we choose the transformation
Thus, we have that
The derivatives of with respect to its arguments are given by
Therefore, the auxiliary functions and for the components of the gradient with respect to and can be written as
Thus, we finally obtain that the components of corresponding to the derivatives with respect to and are given by
while the components of can be similarly obtained by substituting the expressions above into Eq. 25. Remarkably, we obtain that
Appendix C Particularization for the Beta Distribution
For a random variable , we could rewrite for and , and apply the above method for the gamma-distributed variables and . Instead, in the spirit of applying standardization directly over , we define a transformation to standardize the logit function. This leads to
This transformation ensures that has mean zero. However, in this case we do not specify the form of , and we let it be a function of and . This allows us to choose in such a way that for the sampled value of , which we found to work well (even though this introduces some bias). For simplicity, we write .
Thus, we have that
The derivatives of with respect to its arguments are given by
Therefore, the auxiliary functions and for the components of the gradient with respect to and can be written as
Note that the term above can be computed from without knowledge of the value of as .
Thus, we finally obtain that the components of corresponding to the derivatives with respect to and are given by
where we are still free to choose and . We have found that the choice of these values such that works well in practice. Thus, we set the derivatives of such that the relationships
hold for the sampled value of . This involves solving a simple linear equation for and .
Appendix D Particularization for the Dirichlet Distribution
For a distribution, with , we can apply the standardization
where the mean is a -length vector and the covariance
-length vector and the covarianceis a matrix,999Instead, we could define a transformation that ignores the off-diagonal terms of the covariance matrix. This would lead to faster computations but higher variance of the resulting estimator.101010We could also apply the full-covariance transformation for the beta distribution. which are respectively given by
Here, we have defined . The covariance matrix can be rewritten as a diagonal matrix plus a rank one update, which can be exploited for faster computations: