1 Introduction
Neural networks, typically trained using backpropagation for parameter optimization, have recently demonstrated significant success across a wide range of applications. There has been interest in coupling neural networks with random variables, so as to embrace greater descriptive capacity. Recent examples of this include blackbox variational inference (BBVI) (Kingma & Welling, 2014; Rezende et al., 2014; Ranganath et al., 2014; HernándezLobato et al., 2016; Ranganath et al., 2016b; Li & Turner, 2016; Ranganath et al., 2016a; Zhang et al., 2018) and generative adversarial networks (GANs) (Goodfellow et al., 2014; Radford et al., 2015; Zhao et al., 2016; Arjovsky et al., 2017; Li et al., 2017; Gan et al., 2017; Li et al., 2018)
. Unfortunately, efficiently backpropagating gradients through general distributions (random variables) remains a bottleneck. Most current methodology focuses on distributions with continuous random variables, for which the reparameterization trick may be readily applied
(Kingma & Welling, 2014; Grathwohl et al., 2017).As an example, the aforementioned bottleneck greatly constrains the applicability of BBVI, by limiting variational approximations to reparameterizable distributions. This limitation excludes discrete random variables and many types of continuous ones. From the perspective of GAN, the need to employ reparameterization has constrained most applications to continuous observations. There are many forms of data that are morenaturally discrete.
The fundamental problem associated with the aforementioned challenges is the need to efficiently calculate an unbiased lowvariance gradient wrt parameters for an expectation objective of the form ^{1}^{1}1In this paper, we consider expectation objectives meeting basic assumptions: () is differentiable wrt ; () is differentiable for continuous ; and () for discrete . For simplicity of the main paper, those assumptions are implicitly made, as well as fundamental rules like Leibniz’s rule. . We are interested in general distributions , for which the components of may be either continuous or discrete. Typically the components of have a hierarchical structure, and a subset of the components of play a role in evaluating .
Unfortunately, classical methods for estimating gradients of
wrt have limitations. The REINFORCE gradient (Williams, 1992), although generally applicable (, for continuous and discrete random variables), exhibits high variance with Monte Carlo (MC) estimation of the expectation, forcing one to apply additional variancereduction techniques. The reparameterization trick (Rep) (Salimans et al., 2013; Kingma & Welling, 2014; Rezende et al., 2014) works well, with as few as only one MC sample, but it is limited to continuous reparameterizable . Many efforts have been devoted to improving these two formulations, as detailed in Section 6. However, none of these methods is characterized by generalization (applicable to general distributions) and efficiency (working well with as few as one MC sample).The key contributions of this work are based on the recognition that REINFORCE and Rep are seeking to solve the same objective, but in practice Rep yields lowervariance estimations, albeit for a narrower class of distributions. Recent work (Ranganath et al., 2016b) has made a connection between REINFORCE and Rep, recognizing that the former estimates a term the latter evaluates analytically. The high variance by which REINFORCE approximates this term manifests high variance in the gradient estimation. Extending these ideas, we make the following main contributions. () We propose a new General and Onesample (GO) gradient in Section 3, that principally generalizes Rep to many nonreparameterizable distributions and justifies two recent methods (Figurnov et al., 2018; Jankowiak & Obermeyer, 2018); the “One sample” motivating the name GO is meant to highlight the low variance of the proposed method, although of course one may use more than one sample if desired. () We find that the core of the GO gradient is something we term a variablenabla, which can be interpreted as the gradient of a random variable wrt a parameter. () Utilizing variablenablas to propagate the chain rule through distributions, we broaden the applicability of the GO gradient in Sections 45 and present statistical backpropagation, a statistical generalization of classic backpropagation (Rumelhart & Hinton, 1986). Through this generalization, we may couple neural networks to general random variables, and compute needed gradients with low variance.
2 Background
To motivate this paper, we begin by briefly elucidating common machine learning problems for which there is a need to efficiently estimate gradients of for functions of the form . Assume access to data samples , drawn i.i.d. from the true (and unknown) underlying distribution . We seek to learn a model to approximate . A classic approach to such learning is to maximize the expected log likelihood , perhaps with an added regularization term on . Expectation is approximated via the available data samples, as .
It is often convenient to employ a model with latent variables , , , with prior on . The integral wrt is typically intractable, motivating introduction of the approximate posterior , with parameters . The wellknown evidence lower bound (ELBO) (Jordan et al., 1999; Bishop, 2006) is defined as
(1)  
(2) 
where is the true posterior, and
represents the KullbackLeibler divergence. Variational learning seeks
.While computation of the ELBO has been considered for many years, a problem introduced recently concerns adversarial learning of , or, more precisely, learning a model that allows one to efficiently and accurately draw samples that are similar to . With generative adversarial networks (GANs) (Goodfellow et al., 2014), one seeks to solve
(3) 
where is a discriminator with parameters
, quantifying the probability
was drawn from , with representing the probability that it was drawn from . There have been many recent extensions of GAN (Radford et al., 2015; Zhao et al., 2016; Arjovsky et al., 2017; Li et al., 2017; Gan et al., 2017; Li et al., 2018), but the basic setup in (3) holds for most.To optimize (1) and (3), the most challenging gradients that must be computed are of the form ; for (1) and , while for (3) and . The need to evaluate expressions like arises in many other machine learning problems, and consequently it has generated much prior attention.
Evaluation of the gradient of the expectation is simplified if the gradient can be moved inside the expectation. REINFORCE (Williams, 1992) is based on the identity
(4) 
While simple in concept, this estimate is known to have high variance when the expectation is approximated (as needed in practice) by a finite number of samples.
An approach (Salimans et al., 2013; Kingma & Welling, 2014; Rezende et al., 2014) that has attracted recent attention is termed the reparameterization trick, applicable when can be reparametrized as , with , where is a differentiable transformation and is a simple distribution that may be readily sampled. To keep consistency with the literature, we call a distribution reparameterizable if and only if those conditions are satisfied^{2}^{2}2A Bernoulli random variable is identical to with . But is called nonreparameterizable because the transformation is not differentiable.. In this case we have
(5) 
This gradient, termed Rep, is typically characterized by relatively low variance, when approximating with a small number of samples . This approach has been widely employed for computation of the ELBO and within GAN, but it limits one to models that satisfy the assumptions of Rep.
3 GO Gradient
The reparameterization trick (Rep) is limited to reparameterizable random variables with continuous components. There are situations for which Rep is not readily applicable, where the components of
may be discrete or nonnegative Gamma distributed. We seek to gain insights from the relationship between REINFORCE and Rep, and generalize the types of random variables
for which the latter approach may be effected. We term our proposed approach a General and Onesample (GO) gradient. In practice, we find that this approach works well with as few as one sample for evaluating the expectation, and it is applicable to more general settings than Rep.Recall that Rep was first applied within the context of variational learning (Kingma & Welling, 2014), as in (1). Specifically, it was assumed , omitting explicit dependence on data , for notational convenience; is component of . In Kingma & Welling (2014)
corresponded to a Gaussian distribution
, with mean and variance . In the following we generalize such that it need not be Gaussian. Applying integration by parts (Ranganath et al., 2016b)(6)  
(7) 
where denotes with excluded, and
is the cumulative distribution function (CDF) of
. The “0” term is readily proven to be zero for any , with the assumption that doesn’t tend to infinity faster than tending to zero when .The “Key” term exactly recovers the onedimensional Rep when reparameterization , exists (Ranganath et al., 2016b). Further, applying in (6) yields REINFORCE. Consequently, it appears that Rep yields low variance by analytically setting to zero the unnecessary but highvarianceinjecting “0” term, while in contrast REINFORCE implicitly seeks to numerically implement both terms in (7).
We generalize for discrete , here assuming . It is shown in Appendix A.2 that this framework is also applicable to discrete with a finite alphabet. It may be shown (see Appendix A.2) that
(8)  
where , and for all .
Theorem 1 (GO Gradient).
For expectation objectives , where satisfies () ; () the corresponding CDF is differentiable wrt parameters ; and () one can calculate , the General and Onesample (GO) gradient is defined as
(9) 
where specifies with variablenabla , , and
All proofs are provided in Appendix A, where we also list for a wide selection of possible , for both continuous and discrete . Note for the special case with continuous , GO reduces to Implicit Rep gradients (Figurnov et al., 2018) and pathwise derivatives (Jankowiak & Obermeyer, 2018); in other words, GO provides a principled explanation for their low variance, namely their foundation (implicit differentiation) originates from integration by parts. For highdimensional discrete , calculating is computationally expensive. Fortunately, for often used in practice special properties hold that can be exploited for efficient parallel computing. Also for discrete with finite support, it is possible that one could analytically evaluate a part of expectations in (9) for lower variance, mimicking the local idea in Titsias & LázaroGredilla (2015); Titsias (2015). Appendix I shows an example illustrating how to handle these two issues in practice.
4 Deep GO Gradient
The GO gradient in Theorem 1 can only handle singlelayer meanfield , characterized by an independence assumption on the components of . One may enlarge the descriptive capability of by modeling it as a marginal distribution of a deep model (Ranganath et al., 2016b; Bishop, 2006). Hereafter, we focus on this situation, and begin with a 2layer model for simple demonstration. Specifically, consider
where , is the leaf variable, and the internal variable is assumed to be continuous. Components of are assumed to be conditionally independent given , but upon marginalizing out this independence is removed.
The objective becomes and via Theorem 1
(10) 
Lemma 1.
Lemma 1 shows that Rep is a special case of our deep GO gradient in the following Theorem 2. Note neither Implicit Rep gradients nor pathwise derivatives can recover Rep in general, because a neuralnetworkparameterized may lead to nontrivial CDF .
For the gradient wrt , we first apply Theorem 1, yielding
For continuous internal variable one can apply Theorem 1 again, from which
(11) 
Now extending the same procedure to deeper models with layers, we generalize the GO gradient in Theorem 2. Random variable is assumed to be the leaf variable of interest, and may be continuous or discrete; latent/internal random variables are assumed continuous (these generalize from above).
Theorem 2 (Deep GO Gradient).
For expectation objectives with being the marginal distribution of
where , all internal variables are continuous, the leaf variable is either continuous or discrete, , and one has access to variablenablas and , as defined in Theorem 1, the General and Onesample (GO) gradient is defined as
(12) 
where and for .
Corollary 1.
The deep GO gradient in Theorem 2 exactly recovers backpropagation (Rumelhart & Hinton, 1986) when each element distribution
is specified as the Dirac delta function located at the activated value after activation function.
5 Statistical BackPropagation and Hierarchical Variational Inference
Recall the motivating discussion in Section 2, in which we considered generative model and inference model , the former used to model synthesis of the observed data and the latter used for inference of given observed . In recent deep architectures, a hierarchical representation for cumulative latent variables has been considered (Rezende et al., 2014; Ranganath et al., 2015; Zhou et al., 2015, 2016; Ranganath et al., 2016b; Cong et al., 2017; Zhang et al., 2018). As an example, there are models with
. When performing inference for such models, it is intuitive to consider firstorder Markov chain structure for
. The discussion in this section is most relevant for variational inference, for computation of , and consequently we specialize to that notation in the subsequent discussion (we consider representations in terms of , rather than the more general notation employed in Section 4).Before proceeding, we seek to make clear the distinction between this section and Section 4. In the latter, only the leaf variable appears in ; see (12), with . That is because in Section 4 the underlying model is a marginal distribution of , , , which is relevant to the generators of GANs; see (3), with and . Random variables were marginalized out of to represent . As discussed in Section 4, were added there to enhance the modeling flexibility of . In this section, the deep set of random variables are inherent components of the underlying generative model for , , . Hence, all components of manifested via inference model play a role in . Besides, no specific structure is imposed on and in this section, moving beyond the aforementioned firstorder Markov structure. For a practical application, one may employ domain knowledge to design suitable graphical models for and , and then use the following Theorem 3 for training.
Theorem 3 (Statistical BackPropagation).
For expectation objectives
where denotes continuous internal variables with at least one child variable, represents continuous or discrete leaf variables with no children except , and is constructed as a hierarchical probabilistic graphical model
with each element distribution having accessible variablenablas as defined in Theorem 1, denotes the parent variables of , the General and Onesample (GO) gradient for is defined as
(13) 
where denotes the children variables of , and with ,
and is iteratively calculated as until leaf variables where
Statistical backpropagation in Theorem 3 is relevant to hierarchical variational inference (HVI) (Ranganath et al., 2016b; Hoffman & Blei, 2015; Mnih & Gregor, 2014) (see Appendix G), greatly generalizing GO gradients to the inference of directed acyclic probabilistic graphical models. In HVI variational distributions are specified as hierarchical graphical models constructed by neural networks. Using statistical backpropagation, one may rely on GO gradients to perform HVI with low variance, while greatly broadening modeling flexibility.
6 Related Work
There are many methods directed toward lowvariance gradients for expectationbased objectives. Attracted by the generalization of REINFORCE, many works try to improve its performance via efficient variancereduction techniques, like control variants (Mnih & Gregor, 2014; Titsias & LázaroGredilla, 2015; Gu et al., 2015; Mnih & Rezende, 2016; Tucker et al., 2017; Grathwohl et al., 2017) or via data augmentation and permutation techniques (Yin & Zhou, 2018). Most of this research focuses on discrete random variables, likely because Rep (if it exists) works well for continuous random variables but it may not exist for discrete random variables. Other efforts are devoted to continuously relaxing discrete variables, to combine both REINFORCE and Rep for variance reduction (Jang et al., 2016; Maddison et al., 2016; Tucker et al., 2017; Grathwohl et al., 2017).
Inspired by the low variance of Rep, there are methods that try to generalize its scope. The Generalized Rep (GRep) gradient (Ruiz et al., 2016) employs an approximate reparameterization whose transformed distribution weakly depends on the parameters of interest. Rejection sampling variational inference (RSVI) (Naesseth et al., 2016) exploits highlytuned transformations in mature rejection sampling simulation to better approximate Rep for nonreparameterizable distributions. Compared to the aforementioned methods, the proposed GO gradient, containing Rep as a special case for continuous random variables, applies to both continuous and discrete random variables with the same lowvariance as the Rep gradient. Implicit Rep gradients (Figurnov et al., 2018) and pathwise derivatives (Jankowiak & Obermeyer, 2018) are recent lowvariance methods that exploit the gradient of the expected function; they are special cases of GO in the singlelayer continuous settings.
The idea of gradient backpropagation through random variables has been exploited before. RELAX (Grathwohl et al., 2017), employing neuralnetworkparametrized control variants to assist REINFORCE for that goal, has a variance potentially as low as the Rep gradient. SCG (Schulman et al., 2015) utilizes the generalizability of REINFORCE to construct widelyapplicable stochastic computation graphs. However, REINFORCE is known to have high variance, especially for highdimensional problems, where the proposed methods are preferable when applicable (Schulman et al., 2015). Stochastic backpropagation (Rezende et al., 2014; Fan et al., 2015), focusing mainly on reparameterizable Gaussian random variables and deep latent Gaussian models, exploits the product rule for an integral to derive gradient backpropagation through several continuous random variables. By comparison, the proposed statistical backpropagation based on the GO gradient is applicable to most distributions for continuous random variables. Further, it also flexibly generalizes to hierarchical probabilistic graphical models with continuous internal variables and continuous/discrete leaf ones.
7 Experiments
We examine the proposed GO gradients and statistical backpropagation with four experiments: () simple onedimensional (gamma and negative binomial) examples are presented to verify the GO gradient in Theorem 1, corresponding to nonnegative and discrete random variables; (
) the discrete variational autoencoder experiment from
Tucker et al. (2017) and Grathwohl et al. (2017) is reproduced to compare GO with the stateoftheart variancereduction methods; () a multinomial GAN, generating discrete observations, is constructed to demonstrate the deep GO gradient in Theorem 2; () hierarchical variational inference (HVI) for two deep nonconjugate Bayesian models is developed to verify statistical backpropagation in Theorem 3. Note the experiments of Figurnov et al. (2018) and Jankowiak & Obermeyer (2018) additionally support our GO in the singlelayer continuous settings.Many mature machine learning frameworks, like TensorFlow
(Abadi et al., )and PyTorch
(Paszke et al., 2017), are optimized for implementation of methods like backpropagation. Fortunately, all gradient calculations in the proposed theorems obey the chain rule in expectation, enabling convenient incorporation of the proposed approaches into existing frameworks. Experiments presented below were implemented in TensorFlow or PyTorch with a Titan Xp GPU. Code for all experiments can be found at github.com/YulaiCong/GOgradient.Notation denotes the gamma distribution with shape and rate ,
the negative binomial distribution with number of failures
and success probability ,the Bernoulli distribution with probability
, the multinomial distribution with number of trials and event probabilities ,the Poisson distribution with rate
, and the Dirichlet distribution with concentration parameters .7.1 Gamma and NB Onedimensional Simple Examples
We first consider illustrative onedimensional “toy” problems, to examine the GO gradient for both continuous and discrete random variables. The optimization objective is expressed as
where for continuous we assume for given set , with and ; for discrete we assume for given set , with and . Stochastic gradient ascent with onesampleestimated gradients is used to optimize the objective, which is equivalent to minimizing .
Figure 2 shows the results (see Appendix H for additional details). For the nonnegative continuous associated with the gamma distribution, we compare our GO gradient with GRep (Ruiz et al., 2016), RSVI (Naesseth et al., 2016), and their modified version using the “sticking” idea (Roeder et al., 2017), denoted as GRepStick and RSVIStick, respectively. For RSVI and RSVIStick, the shape augmentation parameter is set as by default. The only difference between GRep and GRepStick (also RSVI and RSVIStick) is the latter does not analytically express the entropy . Figure 2 clearly shows the utility of employing sticking to reduce variance; without it, GRep and RSVI exhibit high variance, that destabilizes the optimization for small gamma shape parameters, as shown in Figures 2 and 2. We adopt the sticking approach hereafter for all the compared methods. Among methods with sticking, GO exhibits the lowest variance in general, as shown in Figures 2 and 2. GO empirically provides more stable learning curves, as shown in Figure 2. For the discrete case corresponding to the NB distribution, GO is compared to REINFORCE (Williams, 1992). To address the concern about comparison with the same number of evaluations of the expected function^{3}^{3}3In the NB experiments, REINFORCE uses sample and evaluation of the expected function; REINFORCE2 uses sample and evaluations; and GO uses sample and evaluations., another curve of REINFORCE using samples is also added, termed REINFORCE2. It is apparent from Figure 2 that, thanks to analytically removing the “0” terms in (8), the GO gradient has much lower variance, even in this simple onedimensional case.
7.2 Discrete Variational Autoencoder
To demonstrate the low variance of the proposed GO gradient, we consider the discrete variational autoencoder (VAE) experiment from REBAR (Tucker et al., 2017) and RELAX (Grathwohl et al., 2017), to make a direct comparison with stateoftheart variancereduction methods. Since the statistical backpropagation in Theorem 3 cannot handle discrete internal variables, we focus on the singlelatentlayer settings (1 layer of Bernoulli random variables), i.e.,
where is the parameters of the prior , means using a neural network to project the latent binary code to the parameters of the likelihood , and is similarly defined for . The objective is given in (1). See Appendix I for more details.
) on the stochastically binarized MNIST dataset
(Salakhutdinov & Murray, 2008). All methods are run with the same learning rate for iterations. The black line represents the best training ELBO of REBAR and RELAX. ELBOs are calculated using all training/validation data.Dataset  Model  Training  Validation  

REBAR  RELAX  GO  REBAR  RELAX  GO  
MNIST  Linear 1 layer  112.16  112.89  110.21  114.85  115.36  114.27 
Nonlinear  96.99  95.99  82.26  112.96  112.42  111.48  
Omniglot  Linear 1 layer  122.19  122.17  119.03  124.81  124.95  123.84 
Nonlinear  79.51  80.67  54.96  129.00  128.99  126.59 
Figure 3 (also Figure 9 of Appendix I) shows the training curves versus iteration and running time for the compared methods. Even without any variancereduction techniques, GO provides better performance, faster convergence rate, and better running efficiency (about ten times faster in achieving the best training ELBO of RERAR/RELAX in this experiment). We believe GO’s better performance originates from: () its inherent lowvariance nature; () GO has less parameters compared to REBAR and RELAX (no control variant is adopted for GO); () efficient batch processing methods (see Appendix I) are adopted to benefit from parallel computing. Table 5 presents the best training/validation ELBOs under various experimental settings for the compared methods. GO provides the best performance in all situations. Additional experimental results are given in Appendix I.
Many variancereduction techniques can be used to further reduce the variance of GO, especially when complicated models are of interest. Compared to RELAX, GO cannot be directly applied when is not computable or where the interested model has discrete internal variables (like multilayer Sigmoid belief networks (Neal, 1992)). For the latter issue, we present in Appendix B.4 a procedure to assist GO (or statistical backpropagation in Theorem 3) in handling discrete internal variables.
7.3 Multinomial GAN
To demonstrate the deep GO gradient in Theorem 2, we adopt multinomial leaf variables and construct a new multinomial GAN (denoted as MNGANGO) for generating discrete observations with a finite alphabet. The corresponding generator is expressed as
For brevity, we integrate the generator’s parameters into the NN notation, and do not explicitly express them. Details for this example are provided in Appendix J.
We compare MNGANGO with the recently proposed boundaryseeking GAN (BGAN) (Hjelm et al., 2018) on 1bit (1state, Bernoulli leaf variables ), 1bit (2state), 2bit (4state), 3bit (8state) and 4bit (16state) discrete image generation tasks, using quantized MNIST datasets (LeCun et al., 1998). Table 2 presents inception scores (Salimans et al., 2016) of both methods. MNGANGO performs better in general. Further, with GO’s assistance, MNGANGO shows more potential to benefit from richer information coming from more quantized states. For demonstration, Figure 4 shows the generated samples from the 4bit experiment, where better image quality and higher diversity are observed for the samples from MNGANGO.
7.4 HVI for Deep Exponential Families and Deep Latent Dirichlet Allocation
To demonstrate statistical backpropagation in Theorem 3, we design variational inference nets for two nonconjugate hierarchical Bayesian models, i.e., deep exponential families (DEF) (Ranganath et al., 2015) and deep latent Dirichlet allocation (DLDA) (Zhou et al., 2015, 2016; Cong et al., 2017).
DEF  
DLDA 
For demonstration, we design the inference nets following the firstorder Markov chain construction in Section 5, namely
Further details are provided in Appendix K. One might also wish to design inference nets that have structure beyond the above firstorder Markov chain construction, as in Zhang et al. (2018); we do not consider that here, but Theorem 3 is applicable to that case.
HVI for a 2layer DEF is first performed, with the ELBO curves shown in Figure 5. GO enables faster and more stable convergence. Figure 6 presents the HVI results for a 3layer DLDA, for which stable ELBOs are again observed. More importantly, with the GO gradient, one can utilize pure gradientbased methods to efficiently train such complicated nonconjugate models for meaningful dictionaries (see Appendix K for more implementary details).
8 Conclusions
For expectationbased objectives, we propose a General and Onesample (GO) gradient that applies to continuous and discrete random variables. We further generalize the GO gradient to cases for which the underlying model is deep and has a marginal distribution corresponding to the latent variables of interest, and to cases for which the latent variables are hierarchical. The GOgradient setup is demonstrated to yield the same lowvariance estimation as the reparameterization trick, which is only applicable to reparameterizable continuous random variables. Alongside the GO gradient, we constitute a means of propagating the chain rule through distributions. Accordingly, we present statistical backpropagation, to flexibly integrate deep neural networks with general classes of random variables. s
Acknowledgments
We thank the anonymous reviewers for their useful comments. The research was supported by part by DARPA, DOE, NIH, NSF and ONR. The Titan Xp GPU used was donated by the NVIDIA Corporation. We also wish to thank Chenyang Tao, Liqun Chen, and Chunyuan Li for helpful discussions.
References
 (1) M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, et al. TensorFlow: Largescale machine learning on heterogeneous systems. URL https://www.tensorflow.org/. Software available from tensorflow.org.
 Arjovsky et al. (2017) M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein GAN. In ICLR, 2017.
 Bishop (2006) C. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
 Cong et al. (2017) Y. Cong, B. Chen, H. Liu, and M. Zhou. Deep latent Dirichlet allocation with topiclayeradaptive stochastic gradient Riemannian MCMC. In ICML, 2017.
 Fan et al. (2015) K. Fan, Z. Wang, J. Beck, J. Kwok, and K. Heller. Fast second order stochastic backpropagation for variational inference. In NIPS, pp. 1387–1395, 2015.
 Figurnov et al. (2018) M. Figurnov, S. Mohamed, and A. Mnih. Implicit reparameterization gradients. arXiv preprint arXiv:1805.08498, 2018.
 Gan et al. (2017) Z. Gan, L. Chen, W. Wang, Y. Pu, Y. Zhang, H. Liu, C. Li, and L. Carin. Triangle generative adversarial networks. In NIPS, pp. 5253–5262, 2017.
 Geddes et al. (1990) K. Geddes, M. Glasser, R. Moore, and T. Scott. Evaluation of classes of definite integrals involving elementary functions via differentiation of special functions. Applicable Algebra in Engineering, Communication and Computing, 1(2):149–165, 1990.
 Goodfellow et al. (2014) I. Goodfellow, J. PougetAbadie, M. Mirza, B. Xu, D. WardeFarley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, pp. 2672–2680, 2014.
 Grathwohl et al. (2017) W. Grathwohl, D. Choi, Y. Wu, G. Roeder, and D. Duvenaud. Backpropagation through the void: Optimizing control variates for blackbox gradient estimation. arXiv:1711.00123, 2017.
 Gu et al. (2015) S. Gu, S. Levine, I. Sutskever, and A. Mnih. MuProp: Unbiased backpropagation for stochastic neural networks. arXiv:1511.05176, 2015.
 HernándezLobato et al. (2016) J. M. HernándezLobato, Y. Li, M. Rowland, D. HernándezLobato, T. Bui, and R. E. Turner. Blackbox divergence minimization. In ICML, 2016.
 Hjelm et al. (2018) R. Hjelm, A. Jacob, T. Che, A. Trischler, K. Cho, and Y. Bengio. Boundary seeking GANs. In ICLR, 2018.
 Hoffman & Blei (2015) M. Hoffman and D. Blei. Stochastic structured variational inference. In AISTATS, pp. 361–369, 2015.
 Jang et al. (2016) E. Jang, S. Gu, and B. Poole. Categorical reparameterization with gumbelsoftmax. arXiv:1611.01144, 2016.
 Jankowiak & Obermeyer (2018) M. Jankowiak and F. Obermeyer. Pathwise derivatives beyond the reparameterization trick. arXiv preprint arXiv:1806.01851, 2018.
 Jordan et al. (1999) M. Jordan, Z. Ghahramani, T. Jaakkola, and L. Saul. An introduction to variational methods for graphical models. Machine learning, 37(2):183–233, 1999.
 Kingma & Welling (2014) D. P. Kingma and M. Welling. Autoencoding variational Bayes. In ICLR, 2014.
 LeCun et al. (1998) Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradientbased learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

Li et al. (2017)
C. Li, H. Liu, C. Chen, Y. Pu, L. Chen, R. Henao, and L. Carin.
Alice: Towards understanding adversarial learning for joint distribution matching.
In NIPS, pp. 5501–5509, 2017.  Li et al. (2018) C. Li, J. Li, G. Wang, and L. Carin. Learning to sample with adversarially learned likelihoodratio. 2018.
 Li & Turner (2016) Y. Li and R. E. Turner. Rényi divergence variational inference. In NIPS, pp. 1073–1081, 2016.
 Maddison et al. (2016) C. J. Maddison, A. Mnih, and Y. W. Teh. The concrete distribution: A continuous relaxation of discrete random variables. arXiv:1611.00712, 2016.
 Mnih & Gregor (2014) A. Mnih and K. Gregor. Neural variational inference and learning in belief networks. In ICML, pp. 1791–1799, 2014.
 Mnih & Rezende (2016) A. Mnih and D. Rezende. Variational inference for monte carlo objectives. In ICML, pp. 2188–2196, 2016.
 Naesseth et al. (2016) C. A. Naesseth, F. J. R. Ruiz, S. W. Linderman, and D. M. Blei. Rejection sampling variational inference. arXiv:1610.05683, 2016.
 Neal (1992) R. M. Neal. Connectionist learning of belief networks. Artif. Intell., 56(1):71–113, 1992.
 Paszke et al. (2017) A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in PyTorch. 2017.
 Radford et al. (2015) A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
 Ranganath et al. (2014) R. Ranganath, S. Gerrish, and D. M. Blei. Black box variational inference. In AISTATS, 2014.
 Ranganath et al. (2015) R. Ranganath, L. Tang, L. Charlin, and D. M. Blei. Deep exponential families. In AISTATS, 2015.
 Ranganath et al. (2016a) R. Ranganath, D. Tran, J. Altosaar, and D. Blei. Operator variational inference. In NIPS, pp. 496–504, 2016a.
 Ranganath et al. (2016b) R. Ranganath, D. Tran, and D. Blei. Hierarchical variational models. In ICML, pp. 324–333, 2016b.
 Rezende et al. (2014) D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In ICML, 2014.
 Roeder et al. (2017) G. Roeder, Y. Wu, and D. K. Duvenaud. Sticking the landing: Simple, lowervariance gradient estimators for variational inference. In NIPS, pp. 6928–6937, 2017.
 Ruiz et al. (2016) F. J. R. Ruiz, M. K. Titsias, and D. Blei. The generalized reparameterization gradient. In NIPS, pp. 460–468, 2016.
 Rumelhart & Hinton (1986) D. Rumelhart and G. Hinton. Learning representations by backpropagating errors. Nature, 323(9), 1986.

Salakhutdinov & Murray (2008)
R. Salakhutdinov and I. Murray.
On the quantitative analysis of deep belief networks.
In ICML, pp. 872–879. ACM, 2008. 
Salimans et al. (2013)
T. Salimans, D. A. Knowles, et al.
Fixedform variational posterior approximation through stochastic linear regression.
Bayesian Analysis, 8(4):837–882, 2013.  Salimans et al. (2016) T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training gans. In NIPS, pp. 2234–2242, 2016.
 Schulman et al. (2015) J. Schulman, N. Heess, T. Weber, and P. Abbeel. Gradient estimation using stochastic computation graphs. In NIPS, pp. 3528–3536, 2015.
 Titsias (2015) M. Titsias. Local expectation gradients for doubly stochastic variational inference. arXiv preprint arXiv:1503.01494, 2015.
 Titsias & LázaroGredilla (2015) M. Titsias and M. LázaroGredilla. Local expectation gradients for black box variational inference. In NIPS, pp. 2638–2646, 2015.
 Tucker et al. (2017) G. Tucker, A. Mnih, C. Maddison, J. Lawson, and J. SohlDickstein. REBAR: Lowvariance, unbiased gradient estimates for discrete latent variable models. In NIPS, pp. 2624–2633, 2017.

Williams (1992)
R. Williams.
Simple statistical gradientfollowing algorithms for connectionist reinforcement learning.
Machine learning, 8(34):229–256, 1992.  Yin & Zhou (2018) M. Yin and M. Zhou. ARM: AugmentREINFORCEmerge gradient for discrete latent variable models. arXiv preprint arXiv:1807.11143, 2018.
 Zhang et al. (2018) H. Zhang, B. Chen, D. Guo, and M. Zhou. WHAI: Weibull hybrid autoencoding inference for deep topic modeling. In ICLR, 2018.
 Zhao et al. (2016) J. Zhao, M. Mathieu, and Y. LeCun. Energybased generative adversarial network. In ICLR, 2016.
 Zhou et al. (2015) M. Zhou, Y. Cong, and B. Chen. The Poisson gamma belief network. In NIPS, pp. 3025–3033, 2015.
 Zhou et al. (2016) M. Zhou, Y. Cong, and B. Chen. Augmentable gamma belief networks. JMLR, 17(1):5656–5699, 2016.
Appendix A Proof of Theorem 1
We first prove (7) in the main manuscript, followed by its discrete counterpart, i.e., (8) in the main manuscript. Then, it is easy to verify Theorem 1.
a.1 Proof of Equation (7) in the Main Manuscript
Similar proof in onedimension is also given in the supplemental materials of Ranganath et al. (2016b).
We want to calculate
where denotes with excluded. Without loss of generality, we assume .
Let , and we have
where is the cumulative distribution function (CDF) of . Further define , we then apply integration by parts (or partial integration) to get
(14)  
With and , it’s straightforward to verify that the first term is always zero for any , thus named the “0” term.
a.2 Proof of Equation (8) in the Main Manuscript
For discrete variables , we have
where and is the size of the alphabet.
To handle the summation of products of two sequences and develop discrete counterpart of (7), we first introduce Abel transformation.
Abel transformation. Given two sequences and , with , we define and for .
Accordingly, we have
Substituting , , , and