Reducing Reparameterization Gradient Variance

05/22/2017 ∙ by Andrew C. Miller, et al. ∙ Harvard University berkeley college University of Washington 0

Optimization with noisy gradients has become ubiquitous in statistics and machine learning. Reparameterization gradients, or gradient estimates computed via the "reparameterization trick," represent a class of noisy gradients often used in Monte Carlo variational inference (MCVI). However, when these gradient estimators are too noisy, the optimization procedure can be slow or fail to converge. One way to reduce noise is to use more samples for the gradient estimate, but this can be computationally expensive. Instead, we view the noisy gradient as a random variable, and form an inexpensive approximation of the generating procedure for the gradient sample. This approximation has high correlation with the noisy gradient by construction, making it a useful control variate for variance reduction. We demonstrate our approach on non-conjugate multi-level hierarchical models and a Bayesian neural net where we observed gradient variance reductions of multiple orders of magnitude (20-2,000x).

READ FULL TEXT VIEW PDF
POST COMMENT

Comments

There are no comments yet.

Authors

page 1

page 2

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Representing massive datasets with flexible probabilistic models has been central to the success of many statistics and machine learning applications, but the computational burden of fitting these models is a major hurdle. For optimization-based fitting methods, a central approach to this problem has been replacing expensive evaluations of the gradient of the objective function with cheap, unbiased, stochastic estimates of the gradient. For example, stochastic gradient descent using small mini-batches of (conditionally) i.i.d. data to estimate the gradient at each iteration is a popular approach with massive data sets. Alternatively, some learning methods sample directly from a generative model or approximating distribution to estimate the gradients of interest, for example, in learning algorithms for implicit models 

Mohamed and Lakshminarayanan (2016); Tran et al. (2017) and generative adversarial networks Arjovsky et al. (2017); Goodfellow et al. (2014).

Approximate Bayesian inference using variational techniques (variational inference, or VI) has also motivated the development of new stochastic gradient estimators, as the variational approach reframes the integration problem of inference as an optimization problem 

Blei et al. (2017). VI approaches seek out the distribution from a well-understood variational family of distributions that best approximates an intractable posterior distribution. The VI objective function itself is often intractable, but recent work has shown that it can be optimized with stochastic gradient methods that use Monte Carlo estimates of the gradient Kingma and Welling (2014); Ranganath et al. (2014); Rezende et al. (2014), We call this approach Monte Carlo variational inference (MCVI). In MCVI, generating samples from an approximate posterior distribution is the source of gradient stochasticity. Alternatively, stochastic variational inference (SVI) Hoffman et al. (2013) and other stochastic optimization procedures induce stochasticity through data subsampling; MCVI can be augmented with data subsampling to accelerate computation for large data sets.

The two commonly used MCVI gradient estimators are the score function gradient Ranganath et al. (2014) and the reparameterization gradient Kingma and Welling (2014); Rezende et al. (2014)

. Broadly speaking, score function estimates can be applied to both discrete and continuous variables, but often have high variance and thus are frequently used in conjunction with variance reduction techniques. On the other hand, the reparameterization gradient often has lower variance, but is restricted to continuous random variables. See

Ruiz et al. (2016) for a unifying perspective on these two estimators. Like other stochastic gradient methods, the success of MCVI depends on controlling the variance of the stochastic gradient estimator.

In this work, we present a novel approach to controlling the variance of the reparameterization gradient estimator in MCVI. Existing MCVI methods control this variance naïvely by averaging several gradient estimates, which becomes expensive for large data sets and complex models, with error that only diminishes as . Our approach exploits the fact that, in MCVI, the randomness in the gradient estimator is completely determined by a known Monte Carlo generating process; this allows us to leverage knowledge about this generating process to de-noise the gradient estimator. In particular, we construct a cheaply computed control variate based on an analytical linear approximation to the gradient estimator. Taking a linear combination of a naïve gradient estimate with this control variate yields a new estimator for the gradient that remains unbiased but has lower variance. We apply the idea to Gaussian approximating families and measure the reduction in gradient variance under various conditions. We observe a 20-2,000 reduction in variance of the gradient norm in some conditions, and much faster convergence and more stable behavior of optimization traces.

2 Background

Variational Inference

Given a model, , of data  and parameters/latent variables , the goal of VI is to approximate the posterior distribution . VI approximates this intractable posterior distribution with one from a simpler family, , parameterized by variational parameters . VI procedures seek out the member of that family, , that minimizes some divergence between the approximation  and the true posterior .

Variational inference can be framed as an optimization problem, usually in terms of Kullback-Leibler (KL) divergence, of the following form

The task is to find a setting of that makes close to the posterior in KL divergence.111We use and interchangeably. Directly computing the KL divergence requires evaluating the posterior itself; therefore, VI procedures use the evidence lower bound (ELBO) as the optimization objective

(1)

which, when maximized, minimizes the KL divergence between and .

To maximize the ELBO with gradient methods, we need to compute the gradient of the ELBO, . The gradient inherits the ELBO’s form as an expectation, which is in general an intractable quantity to compute. In this work, we focus on reparameterization gradient estimators (RGEs) computed using the reparameterization trick. The reparameterization trick exploits the structure of the variational data generating procedure — the mechanism by which  is simulated from . To compute the RGE, we first express the sampling procedure from as a differentiable map

independent of (2)
differentiable map (3)

where the initial distribution and are jointly defined such that  has the desired distribution. As a simple concrete example, if we set to be a diagonal Gaussian where , , and . The sampling procedure could then be defined as

(4)

where denotes an element-wise product. We will also use and to denote pointwise division and squaring, respectively. Given this map, the reparameterization gradient estimator is simply the gradient of a Monte Carlo ELBO estimate with respect to . For a single sample, this is

and the -sample approximation can be computed by the sample average

(5)

Crucially, the reparameterization gradient is unbiased, , guaranteeing the convergence of stochastic gradient optimization procedures that use it Robbins and Monro (1951).

(a) step size = .05
(b) step size = .1
Figure 1:

Optimization traces for MCVI applied to a Bayesian neural network with various hyperparameter settings. Each trace is running

adam Kingma and Ba (2015). The three lines in each plot correspond to three different numbers of samples, , used to estimate the gradient at each step. (Left) small stepsize; (Right) stepsize 10 times larger. Large step sizes allow for quicker progress, however noisier (i.e., small ) gradients combined with large step sizes result in chaotic optimization dynamics. The converging traces reach different ELBOs due to the illustrative constant learning rates; in practice, one decreases the step size over time in keeping with Robbins and Monro (1951).
Gradient Variance and Convergence

The efficiency of Monte Carlo variational inference hinges on the magnitude of gradient noise and the step size chosen for the optimization procedure. When the gradient noise is large, smaller gradient steps must be taken to avoid unstable dynamics of the iterates. However, a smaller step size increases the number of iterations that must be performed to reach convergence.

We illustrate this trade-off in Figure 1, which shows realizations of an optimization procedure applied to a Bayesian neural network using reparameterization gradients. The posterior is over  parameters that we approximate with a diagonal Gaussian (see Appendix C.2). We compare the progress of the Adam algorithm using various numbers of samples Kingma and Ba (2015), fixing the learning rate. The noise present in the single-sample estimator causes extremely slow convergence, whereas the lower noise 50-sample estimator quickly converges, albeit at 50 times the cost.

The upshot is that with low noise gradients we are able to safely take larger steps, enabling faster convergence to a local optimum. The natural question is, how can we reduce the variance of gradient estimates without introducing too much extra computation? Our approach is to use information about the variational model, and carefully construct a control variate to the gradient.

Control Variates

Control variates are random quantities that are used to reduce the variance of a statistical estimator without introducing any bias by injecting information into the estimator. Given an unbiased estimator

such that (the quantity of interest), our goal is to construct another unbiased estimator with lower variance. We can do this by defining a control variate with known expectation . We can write our new estimator as

(6)

where for -dimensional . Clearly the new estimator has the same expectation as the original estimator, but a different variance. We can attain optimal variance reduction by appropriately setting . Intuitively, the optimal is very similar to a regression coefficient — it is related to the covariance between the control variate and the original estimator. See Appendix A for further details on optimally setting .

3 Method: Modeling Reparameterization Gradients

In this section we develop our main contribution, a new gradient estimator that can dramatically reduce reparameterization gradient variance. In MCVI, the reparameterization gradient estimator (RGE) is a Monte Carlo estimator of the true gradient — the estimator itself is a random variable. This random variable is generated using the “reparameterization trick” — we first generate some randomness and then compute the gradient of the ELBO with respect to holding fixed. This results in a complex distribution from which we can generate samples, but in general cannot characterize due to the complexity of the term arising from the gradient of the true posterior.

However, we do have a lot of information about the sampling procedure — we know the variational distribution , the transformation , and we can evaluate the model joint density 

pointwise. Furthermore, with automatic differentiation, it is often straightforward to obtain gradients and Hessian-vector products of our model

. We propose a scheme that uses the structure of and curvature of to construct a tractable approximation of the distribution of the RGE.222We require the model to be twice differentiable. This approximation has a known mean and is correlated with the RGE distribution, allowing us to use it as a control variate to reduce the RGE variance.

Given a variational family parameterized by , we can decompose the ELBO gradient into a few terms that reveal its “data generating procedure”

(7)
(8)

Certain terms in Eq. (8) have tractable distributions. The Jacobian of , given by , is defined by our choice of . For some transformations we can exactly compute the distribution of the Jacobian given the distribution of . The pathwise and parameter score terms are gradients of our approximate distribution with respect to (via or directly). If our approximation is tractable (e.g., a multivariate Gaussian), we can exactly characterize the distribution for these components.333In fact, we know that the expectation of the parameter score term is zero, and removing that term altogether can sometimes be a source of variance reduction that we do not explore here Roeder et al. (2017).

However, the data term in Eq. (8) involves a potentially complicated function of the latent variable (and therefore a complicated function of ), resulting in a difficult-to-characterize distribution. Our goal is to construct an approximation to the distribution of  and its interaction with  given a fixed distribution over . If the approximation yields random variables that are highly correlated with , then we can use it to reduce the variance of that RGE sample.

Linearizing the data term

To simplify notation, we write the data term of the gradient as

(9)

where  since . We then linearize about some value

(10)

where is the Hessian of the model, , with respect to evaluated at ,

(11)

Note that even though this uses second-order information about the model, it is a first-order approximation of the gradient. We also view this as a transformation of the random for a fixed

(12)

which is linear in . For some forms of we can analytically derive the distribution of the random variable . In Eq. (8), the data term interacts with the Jacobian of , given by

(13)

which importantly is a function of the same as in Eq. (12). We form our approximation of the first term in Eq. (8) by multiplying Eqs. (12) and (13):

(14)

The tractability of this approximation hinges on how Eq. (14) depends on . When is multivariate normal, we show that this approximation has a computable mean and can be used to reduce variance in MCVI settings. In the following sections we describe and empirically test this variance reduction technique applied to diagonal Gaussian posterior approximations.

3.1 Gaussian Variational Families

Perhaps the most common choice of approximating distribution for MCVI is a diagonal Gaussian, parameterized by a mean and scales . 444For diagonal Gaussian , we define . The log pdf is

To generate a random variate from this distribution, we use the sampling procedure in Eq. (4). We denote the Monte Carlo RGE as . From this variational distribution, it is straightforward to derive the distributions of the pathwise score, param score, and Jacobian terms in Eq. (8).

The Jacobian term of the sampling procedure has two straightforward components

(15)

The pathwise score term is the partial derivative of the approximate log density with respect to , ignoring variation due to the variational distribution parameters and noting that :

(16)

The parameter score term is the partial derivative of the approximation log density with respect to variational parameters , ignoring variation due to . The and components are given by

(17)
(18)

The data term, , multiplied by the Jacobian of is all that remains to be approximated in Eq. (8). We linearize around where the approximation is expected to be accurate

(19)
(20)
Putting It Together: Full RGE Approximation

We write the complete approximation of the RGE in Eq. (8) by combining Eqs. (15), (16), (17), (18), and (20) which results in two components that are concatenated, . Each component is defined as

(21)
(22)

To summarize, we have constructed an approximation, , of the reparameterization gradient, , as a function of . Because both and are functions of the same random variable , and because we have mimicked the random process that generates true gradient samples, the two gradient estimators will be correlated. This approximation yields two tractable distributions — a Gaussian for the mean parameter gradient, , and a location shifted, scaled non-central for the scale parameter gradient . Importantly, we can compute the mean of each component

(23)

We use (along with its expectation) as a control variate to reduce the variance of the RGE .

3.2 Reduced Variance Reparameterization Gradient Estimators

Figure 2: Relationship between the base randomness , RGE , and approximation . Arrows indicate deterministic functions. Sharing correlates the random variables. We know the distribution of , which allows us to use it as a control variate for .
1:procedure RV-RGE-Optimize()
2:       
3:       
4:       for  do
5:              Base randomness
6:              Reparameterization gradients
7:              Mean approx
8:              Scale approx
9:              Mean approx expectation
10:              Scale approx expectation
11:              Subtract control variate
12:              Gradient step (sgd, adam, etc.)        
13:       return
Algorithm 1 Gradient descent with RV-RGE with a diagonal Gaussian variational family

Now that we have constructed a tractable gradient approximation, , with high correlation to the original reparameterization gradient estimator, , we can use it as a control variate as in Eq. (6)

(24)

The optimal value for is the covariance between and (see Appendix A). We can try to estimate the value of (or a diagonal approximation to ) on the fly, or we can simply fix this value. In our case, because we are using an accurate linear approximation to the transformation of a spherical Gaussian, the optimal value of will be close to the identity (see Appendix A.1).

High Dimensional Models

For models with high dimensional posteriors, direct manipulation of the Hessian is computationally intractable. However, our approximations in Eqs. (21) and (22) only require a Hessian-vector product, which can be computed nearly as efficiently as the gradient Pearlmutter (1994). We note that the mean of the control variate (Eq. (23)), depends on the diagonal of the Hessian matrix. While computing the Hessian diagonal may be tractable in some cases, in general it may cost the time equivalent of function evaluations to compute Martens et al. (2012). Given a high dimensional problem, we can avoid this bottleneck in multiple ways. The first is simply to ignore the random variation in the Jacobian term due to — if we fix to be (as we do with the data term), the portion of the Jacobian that corresponds to will be zero (in Eq. (15)). This will result in the same Hessian-vector-product-based estimator for but will set , yielding variance reduction for the mean parameter but not the scale.

Alternatively, we can estimate the Hessian diagonal on the fly. If we use samples at each iteration, we can create a per-sample estimate of the -scaled diagonal of the Hessian using the other samples Bekas et al. (2007). As the scaled diagonal estimator is unbiased, we can construct an unbiased estimate of the control variate mean to use in lieu of the actual mean (possibly increasing the final variance). A similar local baseline strategy is used for variance reduction in Mnih and Rezende (2016).

RV-RGE Estimators

We introduce three different estimators based on variations of the gradient approximation defined in Eqs. (21), (22), and (23), each adressing the Hessian operations differently.

  • The Full Hessian estimator implements the three equations as written and can be used when it is computationally feasible to use the full Hessian.

  • The Hessian Diagonal estimator replaces the Hessian in (21) with a diagonal approximation, useful for models with a cheap Hessian diagonal.

  • The Hessian-vector product + local approximation (HVP+Local) uses an efficient Hessian-vector product in Eqs. (21) and (22), while approximating the diagonal term in Eq. (23) using a local baseline. The HVP+Local approximation is geared toward models where Hessian-vector products can be computed, but the exact diagonal of the Hessian cannot.

We detail the RV-RGE algorithm in Algorithm 1 and compare properties of these three estimators to the pure Monte Carlo estimator in the following section.

3.3 Related Work

Recently, Roeder et al. (2017) introduced a variance reduction technique for reparameterization gradients that ignores the parameter score component of the gradient and can be viewed as a type of control variate for the gradient throughout the optimization procedure. This approach is complementary to our method — our approximation is typically more accurate near the beginning of the optimization procedure, whereas the estimator in Roeder et al. (2017) is low-variance near convergence. We hope to incorporate information from both control variates in future work. Per-sample estimators in a multi-sample setting for variational inference were used in Mnih and Rezende (2016). We employ this technique in a different way; we use it to estimate computationally intractable quantities needed to keep the gradient estimator unbiased. Black box variational inference used control variates and Rao-Blackwellization to reduce the variance of score-function estimators Ranganath et al. (2014). Our development of variance reduction for reparameterization gradients compliments their work. Other variance reduction techniques for stochastic gradient descent have focused on stochasticity due to data subsampling Johnson and Zhang (2013); Wang et al. (2013). Johnson and Zhang (2013)

cache statistics about the entire dataset at each epoch to use as a control variate for noisy mini-batch gradients.

4 Experiments and Analysis

Iteration Estimator Ave Ave Ave
early Full Hessian 1.279 1.139 0.001 0.002 0.008 1.039
Hessian Diag 34.691 23.764 0.003 0.012 0.194 21.684
HVP + Local 1.279 1.139 0.013 0.039 0.020 1.037
mid Full Hessian 0.075 0.068 0.113 0.143 0.076 0.068
Hessian Diag 38.891 21.283 6.295 7.480 38.740 21.260
HVP + Local 0.075 0.068 30.754 39.156 0.218 0.071
late Full Hessian 0.042 0.030 1.686 0.431 0.043 0.030
Hessian Diag 40.292 53.922 23.644 28.024 40.281 53.777
HVP + Local 0.042 0.030 98.523 99.811 0.110 0.022
Table 1: Comparing variances for RV-RGEs: -sample estimators. Values are percentage of pure MC RGE variance — a value of 100 indicates equal variation samples, a value of 1 percent indicates a 100-fold decrease in variance (lower is better).
(a) adam with step size = 0.05
(b) adam with step size = .10
Figure 3: MCVI optimization trace applied to the frisk model for two values of and step size. We run the standard MC gradient estimator (solid line) and the RV-RGE with and samples.
(a) adam with step size = 0.05
(b) adam with step size = 0.10
Figure 4: MCVI optimization for the bnn model applied to the wine data for various and step sizes. The standard MC gradient estimator (dotted) was run with 2, 10, and 50 samples; RV-RGE (solid) was run with 2 and 10 samples. In 3(b) the 2-sample MC estimator falls below the frame.

In this section we empirically examine the variance properties of RVRGs and stochastic optimization for two real-data examples — a hierarchical Poisson GLM and a Bayesian neural network.555Code is available at https://github.com/andymiller/ReducedVarianceReparamGradients.

  • Hierarchical Poisson GLM The frisk model is a hierarchical Poisson GLM, described in Appendix C.1. This non-conjugate model has a dimensional posterior.

  • Bayesian Neural Network The non-conjugate bnn model is a Bayesian neural network applied to the wine dataset, (see Appendix C.2) and has a dimensional posterior.

Quantifying Gradient Variance Reduction

We measure the variance reduction of the RGE observed at various iterates, , during execution of gradient descent. Both the gradient magnitude, and the marginal variance of the gradient elements — using a sample of 1000 gradients — are reported. Further, we inspect both the mean and log-scale parameters separately. Table 3 compares gradient variances for the frisk model for four estimators: i) pure Monte Carlo (MC), ii) Full Hessian, iii) Hessian Diagonal, and iv) Hessian-vector product + local approximation (HVP+Local).

Each entry in the table measures the percent of the variance of the pure Monte Carlo estimator. We show the average variance over each component , and the variance of the norm . We separate out variance in mean parameters, , log scale parameters, , and the entire vector . The reduction in variance is dramatic. Using HVP+Local, in the norm of the mean parameters we see between a and reduction in variance depending on the progress of the optimizer. The importance of the full Hessian-vector product for reducing mean parameter variance is also demonstrated as the Hessian diagonal only reduces mean parameter variance by a factor of 2-5.

For the variational scale parameters, , we see that early on the HVP+Local approximation is able to reduce parameter variance by a large factor (). However, at later iterates the HVP+Local scale parameter variance is on par with the Monte Carlo estimator, while the full Hessian estimator still enjoys huge variance reduction. This indicates that, by this point, most of the noise is the local Hessian diagonal estimator. We also note that in this problem, most of the estimator variance is in the mean parameters. Because of this, the norm of the entire parameter gradient, is reduced by . In Appendix D we report results for other values of as a comparison.

Optimizer Convergence and Stability

We compare the optimization traces for the frisk and bnn model for the MC and the HVP+Local estimators under various conditions. At each iteration we estimate the true ELBO value using 2000 Monte Carlo samples. We optimize the ELBO objective using adam Kingma and Ba (2015) for two step sizes, each trace starting at the same value of .

Figure 3 compares ELBO optimization traces for  and  samples and step sizes  and  for the frisk model. We see that the HVP+Local estimators make early progress and converge quickly. We also see that the  pure MC estimator results in noisy optimization paths. Figure 4 shows objective value as a function of wall clock time under various settings for the bnn model. The HVP+Local estimator does more work per iteration, however it tends to converge faster. We observe the HVP+Local outperforming the MC estimator.

5 Conclusion

Variational inference reframes an integration problem as an optimization problem with the caveat that each step of the optimization procedure solves an easier integration problem. For general models, each sub-integration problem is itself intractable, and must be estimated, typically with Monte Carlo samples. Our work has shown that we can use more information about the variational family to create tighter estimators of the ELBO gradient, which leads to faster and more stable optimization. The efficacy of our approach relies on the complexity of the RGE distribution to be well-captured by linear structure which may not be true for all models. However, we found the idea effective for non-conjugate hierarchical Bayesian models and a neural network.

Our presentation is a specific instantiation of a more general idea — using cheap linear structure to remove variation from stochastic gradient estimates. We would like to extend this idea to more flexible variational distributions, including flow distributions Rezende and Mohamed (2015) and hierarchical distributions Ranganath et al. (2016), as well as model/inference schemes with recognition networks Kingma and Welling (2014).

Acknowledgements

The authors would like to thank Finale Doshi-Velez, Mike Hughes, Taylor Killian, Andrew Ross, and Matt Hoffman for helpful conversations and comments on this work. ACM is supported by the Applied Mathematics Program within the Office of Science Advanced Scientific Computing Research of the U.S. Department of Energy under contract No. DE-AC02-05CH11231. NJF is supported by a Washington Research Foundation Innovation Postdoctoral Fellowship in Neuroengineering and Data Science. RPA is supported by NSF IIS-1421780 and the Alfred P. Sloan Foundation.

References

  • Arjovsky et al. [2017] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein GAN. arXiv preprint arXiv:1701.07875, 2017.
  • Bekas et al. [2007] Costas Bekas, Effrosyni Kokiopoulou, and Yousef Saad. An estimator for the diagonal of a matrix. Applied numerical mathematics, 57(11):1214–1229, 2007.
  • Blei et al. [2017] David M Blei, Alp Kucukelbir, and Jon D McAuliffe. Variational inference: A review for statisticians. Journal of the American Statistical Association, 2017.
  • Gelman and Hill [2006] Andrew Gelman and Jennifer Hill. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press, 2006.
  • Gelman et al. [2007] Andrew Gelman, Jeffrey Fagan, and Alex Kiss. An analysis of the NYPD’s stop-and-frisk policy in the context of claims of racial bias. Journal of the American Statistical Association, 102:813–823, 2007.
  • Goodfellow et al. [2014] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative Adversarial Nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
  • Hoffman et al. [2013] Matthew D Hoffman, David M Blei, Chong Wang, and John William Paisley. Stochastic variational inference. Journal of Machine Learning Research, 14(1):1303–1347, 2013.
  • Johnson and Zhang [2013] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pages 315–323, 2013.
  • Kingma and Ba [2015] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Proceedings of the International Conference on Learning Representations, 2015.
  • Kingma and Welling [2014] Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. In Proceedings of the International Conference on Learning Representations, 2014.
  • Maclaurin et al. [2015] Dougal Maclaurin, David Duvenaud, Matthew Johnson, and Ryan P. Adams. Autograd: Reverse-mode differentiation of native Python, 2015. URL http://github.com/HIPS/autograd.
  • Martens et al. [2012] James Martens, Ilya Sutskever, and Kevin Swersky. Estimating the Hessian by back-propagating curvature. In Proceedings of the International Conference on Machine Learning, 2012.
  • Mnih and Rezende [2016] Andriy Mnih and Danilo Rezende. Variational inference for Monte Carlo objectives. In Proceedings of The 33rd International Conference on Machine Learning, pages 2188–2196, 2016.
  • Mohamed and Lakshminarayanan [2016] Shakir Mohamed and Balaji Lakshminarayanan. Learning in implicit generative models. arXiv preprint arXiv:1610.03483, 2016.
  • Pearlmutter [1994] Barak A Pearlmutter. Fast exact multiplication by the Hessian. Neural computation, 6(1):147–160, 1994.
  • Ranganath et al. [2014] Rajesh Ranganath, Sean Gerrish, and David M Blei. Black box variational inference. In AISTATS, pages 814–822, 2014.
  • Ranganath et al. [2016] Rajesh Ranganath, Dustin Tran, and David M Blei. Hierarchical variational models. In International Conference on Machine Learning, 2016.
  • Rezende and Mohamed [2015] Danilo Rezende and Shakir Mohamed. Variational inference with normalizing flows. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pages 1530–1538, 2015.
  • Rezende et al. [2014] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra.

    Stochastic backpropagation and approximate inference in deep generative models.

    In International Conference on Machine Learning, 2014.
  • Robbins and Monro [1951] Herbert Robbins and Sutton Monro. A stochastic approximation method. The Annals of Mathematical Statistics, pages 400–407, 1951.
  • Roeder et al. [2017] Geoffrey Roeder, Yuhuai Wu Wu, and David Duvenaud. Sticking the landing: An asymptotically zero-variance gradient estimator for variational inference. arXiv preprint arXiv:1703.09194, 2017.
  • Ruiz et al. [2016] Francisco R Ruiz, Michalis Titsias RC AUEB, and David Blei. The generalized reparameterization gradient. In Advances in Neural Information Processing Systems, pages 460–468, 2016.
  • Tran et al. [2017] Dustin Tran, Matthew D Hoffman, Rif A Saurous, Eugene Brevdo, Kevin Murphy, and David M Blei. Deep probabilistic programming. In Proceedings of the International Conference on Learning Representations, 2017.
  • Wang et al. [2013] Chong Wang, Xi Chen, Alexander J Smola, and Eric P Xing. Variance reduction for stochastic gradient optimization. In Advances in Neural Information Processing Systems, pages 181–189, 2013.

Appendix A Control Variates

Control variates are random quantities that are used to reduce the variance of a statistical estimator without trading any bias. Concretely, given an unbiased estimator such that (the quantity of interest), our goal is to construct another unbiased estimator with lower variance. We can do this by defining a control variate with known expectation . We can write our new estimator as

(25)

Clearly the new estimator has the same expectation as the original estimator, but a different variance. We can reduce the variance of by setting optimally.

Consider a univariate and , and without loss of generality, take . The variance of can be written

(26)
(27)
(28)

We minimize the variance with respect to by taking the derivative and setting equal to zero, which implies

(29)

The covariance is typically not known a priori and must be estimated. It can be shown, under the optimal , that the variance of is

(30)

where is the correlation coefficient between and .

When and are length vectors, we can construct an estimator that depends on a matrix-valued free parameter,

(31)

We can show that the that minimizes the — the sum of the marginal variances — is given by

(32)

where is the covariance matrix of the control variate vector, and is the cross covariance between and .

Intuitively, a control variate is injecting information into the estimator in the form of linear structure. If the two quantities, and are perfectly correlated, then we already know the mean and estimation is not necessary. As the two become uncorrelated, the linear estimator becomes less and less informative, and reverts to the original quantity.

a.1 Control Variates and Approximate Functions

In our setting, we approximate the distribution of some function where by a first order Taylor expansion about — for now we examine the univariate case

(33)

If we wish to use as a control variate for , we need to characterize the covariance between the two random variables. Because the form of is general, it is difficult to analyze. We instead derive the covariance between and the second-order expansion

(34)

as a surrogate.

(35)
(36)
(37)
(38)
(39)

where note that . Recall that the optimal control variate can be written

(40)
(41)

Appendix B Algorithm Details

We summarize an optimization routine using RV-RGE in Algorithm 1. The different variants rely on the different forms of and . The full Hessian estimator calculates these terms exactly. The diagonal Hessian estimates the Hessian-vector product with the diagonal of the Hessian. The HVP+Local estimator computes the Hessian-vector product exactly, but estimates the scale approximation mean using other samples.

We also note that there are ways to optimize the additional Hessian-vector product computation. Because each Hessian is evaluated at the same , we can cache the computation in the forward pass, and only repeat the backwards pass for each sample, as implemented in Maclaurin et al. [2015].

Appendix C Model Definitions

c.1 Multi-level Poisson GLM

Our second test model is a 37-dimensional posterior resulting from a hierarchical Poisson GLM. This model measures the relative rates of stop-and-frisk events for different ethnicities and in different precincts Gelman et al. [2007], and has been used as illustrative example of multi-level modeling [Gelman and Hill, 2006, Chapter 15, Section 1].

mean offset
group variances
ethnicity effect
precinct effect
log rate
stop-and-frisk events

where are the number of stop-and-frisk events within ethnicity group  and precinct  over some fixed period of time;  is the total number of arrests of ethnicity group  in precinct  over the same period of time;  and  are the ethnicity and precinct effects.

c.2 Bayesian Neural Network

We implement a 50-unit hidden layer neural network with ReLU activation functions. We place a normal prior over each weight in the neural network, governed by the same variance (with an inverse Gamma prior). We also place an inverse Gamma prior over the observation variance The model can be written as

weight prior hyper (42)
noise prior hyper (43)
weights (44)
output distribution (45)

where is the set of weights, and

is a multi-layer perceptron that maps input

to approximate output as a function of parameters . We denote the set of parameters as . We approximate the posterior , where is the training set of input-output pairs.

We use a 100-row subsample of the wine dataset from the UCI repository https://archive.ics.uci.edu/ml/datasets/Wine+Quality.

Appendix D Variance Reduction

Below are additional variance reduction measurements for the frisk model for different values of , samples drawn per iteration.

Iteration Estimator Ave Ave Ave
early MC 100.000 100.000 100.000 100.000 100.000 100.000
Full Hessian 1.184 1.022 0.001 0.002 0.007 0.902
Hessian Diag 35.541 25.012 0.003 0.011 0.201 22.090
HVP + Local 1.184 1.022 0.012 0.039 0.019 0.900
mid MC 100.000 100.000 100.000 100.000 100.000 100.000
Full Hessian 0.080 0.075 0.122 0.169 0.081 0.075
Hessian Diag 39.016 22.832 6.617 8.097 38.868 22.804
HVP + Local 0.080 0.075 31.992 46.160 0.227 0.078
late MC 100.000 100.000 100.000 100.000 100.000 100.000
Full Hessian 0.044 0.024 1.782 0.879 0.045 0.023
Hessian Diag 39.280 38.799 22.915 21.913 39.268 38.725
HVP + Local 0.044 0.024 98.290 99.679 0.116 0.014
Table 2: frisk model variance comparison: -sample estimators
Iteration Estimator Ave Ave Ave
early MC 100.000 100.000 100.000 100.000 100.000 100.000
Full Hessian 1.276 1.127 0.001 0.002 0.008 1.080
Hessian Diag 35.146 24.018 0.003 0.012 0.197 23.028
HVP + Local 1.276 1.127 0.013 0.039 0.020 1.079
mid MC 100.000 100.000 100.000 100.000 100.000 100.000
Full Hessian 0.081 0.074 0.125 0.121 0.081 0.074
Hessian Diag 37.534 21.773 7.204 7.035 37.394 21.752
HVP + Local 0.081 0.074 31.278 32.275 0.225 0.076
late MC 100.000 100.000 100.000 100.000 100.000 100.000
Full Hessian 0.042 0.043 1.894 0.296 0.044 0.043
Hessian Diag 39.972 101.263 24.450 27.174 39.961 101.019
HVP + Local 0.042 0.043 98.588 99.539 0.112 0.033
Table 3: frisk model variance comparison: -sample estimators