This paper concerns estimating the gradient of an expectation $\mathbb{E}_{q_w(z)} f(z)$ with respect to the parameters $w$ [mohamed2019monte; jordan1999introduction; sutton2018reinforcement; chaloner1995bayesian]. A popular technique is the “reparameterization trick” [pflug2012optimization; vaes_welling; glasserman2013monte]. Here, one defines a mapping $\mathcal{T}_w$ that transforms some base density $q_0(\epsilon)$ into $q_w(z)$. Then, the gradient is estimated by drawing $\epsilon \sim q_0$ and evaluating $\nabla_w f(\mathcal{T}_w(\epsilon))$.
In any application using stochastic gradients, variance is a concern. Several variance reduction methods exist, with control variates being a popular option [mcbook]. A control variate is a random variable with expectation zero, which can be added to an estimator to cancel noise and decrease variance. Previous work has shown that control variates can significantly reduce the variance of reparameterization gradients, and thereby improve optimization performance [cvs; traylorreducevariance_adam; stickingthelanding].
Miller et al. [traylorreducevariance_adam] proposed a Taylor-expansion based control variate for the case where $q_w$ is a fully-factorized Gaussian parameterized by its mean and scale. Their method works well for the gradient with respect to the mean parameters. However, for the scale parameters, computational issues force the use of further approximations. In a new analysis (Sec. 4) we observe that this amounts to using a constant Taylor approximation (i.e. an approximation of order zero). As a consequence, for the scale parameters, the control variate has little effect. Still, the approach is very helpful with fully-factorized Gaussians, because in this case most variance is contributed by the gradient with respect to the mean parameters.
The situation is different for non-fully-factorized distributions: often, most of the variance is contributed by the gradient with respect to the scale parameters. This renders Taylor-based control variates practically useless. Indeed, empirical results in Section 5 show that, with diagonal plus low rank Gaussians or Gaussians with arbitrary dense covariances, Taylor-based control variates yield almost no benefit over the no-control-variate baseline. (We generalize the Taylor approach to full-rank and diagonal plus low-rank distributions in Appendix D.3.)
For VI, fully factorized variational distributions are typically much less accurate than those representing dependence among latent variables [diagLR; householderflows]. Thus, we seek a control variate that can aid the use of more powerful distributions, such as Gaussians with any covariance structure (full-rank, factorized as diagonal plus low rank [diagLR], Householder flows [householderflows]), Student-t distributions, and, more generally, location-scale and elliptical families. This paper introduces such a control variate.
Our proposed method can be described in two steps. First, given any quadratic function $\tilde f$ that approximates $f$, we define the control variate as $c_w(\epsilon) = \nabla_w \mathbb{E}_{q_w(z)} \tilde f(z) - \nabla_w \tilde f(\mathcal{T}_w(\epsilon))$. We show that this control variate is tractable for any distribution $q_w$ with known mean and covariance. Intuitively, the more accurately $\tilde f$ approximates $f$, the more adding $c_w$ will decrease the variance of the original reparameterization estimator $\nabla_w f(\mathcal{T}_w(\epsilon))$. Second, we fit the parameters of $\tilde f$ through a “double descent” procedure aimed at reducing the estimator’s variance.
We empirically show that the use of our control variate leads to reductions in variance several orders of magnitude larger than the state-of-the-art method when diagonal plus low rank or full-rank Gaussians are used as variational distributions. Optimization speed and reliability are greatly improved as a consequence.
Stochastic Gradient Variational Inference (SGVI). Take a model $p(z, x)$, where $x$ is observed data and $z$ are latent variables. The posterior $p(z|x)$ is often intractable. VI finds the parameters $w$ to approximate the target with the simpler distribution $q_w(z)$ [jordan1999introduction; jaakkola2000bayesian; blei2017variational; zhang2017advances]. It does this by maximizing the “evidence lower bound”
$\mathrm{ELBO}(w) = \mathbb{E}_{q_w(z)} \log p(z, x) - \mathbb{E}_{q_w(z)} \log q_w(z), \qquad (1)$
which is equivalent to minimizing the KL divergence from the approximating distribution $q_w(z)$ to the posterior $p(z|x)$. Using $f(z) = \log p(z, x)$ and letting $\mathbb{H}[q_w]$ denote the entropy of $q_w$, we can express the ELBO’s gradient as
$\nabla_w \mathrm{ELBO}(w) = \nabla_w \mathbb{E}_{q_w(z)} f(z) + \nabla_w \mathbb{H}[q_w]. \qquad (2)$
SGVI’s idea is that, while the first term from Eq. 2 typically has no closed form, there are many unbiased estimators that can be used with stochastic optimization algorithms to maximize the ELBO [neuralVI_minh; viasstochastic_jordan; blackbox_blei; doublystochastic_titsias; VIforMCobjectives_mnih; stickingthelanding; generalreparam_blei; overdispersed_blei]. (We assume that the entropy term can be computed in closed form. If it cannot, one can “absorb” $-\log q_w(z)$ into $f(z)$ and estimate its gradient alongside the first term.) These gradient estimators are usually based on the score function method [williams_reinforce] or the reparameterization trick [vaes_welling; doublystochastic_titsias; rezende2014stochastic]. Since the latter usually provides lower-variance gradients in practice, it is the method of choice whenever applicable. It requires a fixed distribution $q_0(\epsilon)$ and a transformation $\mathcal{T}_w$ such that if $\epsilon \sim q_0$, then $\mathcal{T}_w(\epsilon) \sim q_w$. Then, an unbiased estimator for the first term in Eq. 2 is given by drawing $\epsilon \sim q_0$ and evaluating
$g(\epsilon) = \nabla_w f(\mathcal{T}_w(\epsilon)). \qquad (3)$
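To make the estimator in Eq. 3 concrete, the following is a minimal numpy sketch for a diagonal Gaussian $q_w$ with a toy target. The function `grad_f` and all parameter values are illustrative assumptions, not anything from the paper's experiments; a real application would use automatic differentiation of $\log p(z, x)$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for f(z) = log p(z, x); its exact gradient is known.
def grad_f(z):
    return -z  # gradient of f(z) = -0.5 ||z||^2

# Diagonal Gaussian q_w with w = (mu, log_sigma); T_w(eps) = mu + exp(log_sigma) * eps.
mu = np.array([1.0, -2.0])
log_sigma = np.array([0.1, -0.3])

def reparam_grad(eps):
    sigma = np.exp(log_sigma)
    z = mu + sigma * eps
    g = grad_f(z)
    # Chain rule: the mu-block is grad_f(z); the log_sigma-block is grad_f(z) * sigma * eps.
    return np.concatenate([g, g * sigma * eps])

# Averaging many single-sample estimates approaches the true gradient, which for
# this quadratic f is (-mu, -sigma^2) in the (mu, log_sigma) coordinates.
samples = rng.standard_normal((200000, 2))
est = np.mean([reparam_grad(e) for e in samples], axis=0)
true_grad = np.concatenate([-mu, -np.exp(2 * log_sigma)])
```

Each call to `reparam_grad` is a single draw of the (unbiased but noisy) estimator $g(\epsilon)$; the variance of these draws is exactly what the control variates below aim to reduce.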
Control Variates. A control variate is a zero-mean random variable used to reduce the variance of another random variable [mcbook]. Control variates are widely used in SGVI to reduce a gradient estimator’s variance [traylorreducevariance_adam; neuralVI_minh; REBAR; backpropvoid; blackbox_blei; cvs; boustati2020amortized]. Let $g(\epsilon)$ be the base gradient estimator, using random variables $\epsilon \sim q_0$, and let the function $c(\epsilon)$ define the control variate, whose expectation over $q_0$ is zero. Then, for any scalar $a$ we can get an unbiased gradient estimator as
$g_a(\epsilon) = g(\epsilon) + a\, c(\epsilon). \qquad (4)$
The hope is that $c(\epsilon)$ approximates and cancels the error in the gradient estimator $g(\epsilon)$. It can be shown that the optimal weight is $a^* = -\mathrm{Cov}[g, c] / \mathbb{V}[c]$,[1] which results in a variance of $\mathbb{V}[g_{a^*}] = \mathbb{V}[g] - \mathrm{Cov}[g, c]^2 / \mathbb{V}[c]$. Thus, a good control variate will have high correlation with the gradient estimator (while still being zero mean). In the extreme case that $c(\epsilon) = g(\epsilon) - \mathbb{E}[g]$, variance would be reduced to zero. In practice, $a^*$ must be estimated. This can be done approximately using empirical estimates of $\mathrm{Cov}[g, c]$ and $\mathbb{V}[c]$ from recent evaluations [cvs].

[1] Since $g$ and $c$ are vectors, the expressions for $a^*$ and $\mathbb{V}[g_a]$ should be interpreted using the total variance $\mathbb{V}[g] = \mathrm{tr}\,\mathrm{Cov}[g, g]$, $\mathbb{V}[c] = \mathrm{tr}\,\mathrm{Cov}[c, c]$, and $\mathrm{Cov}[g, c] = \mathbb{E}\,\langle g - \mathbb{E}g,\; c \rangle$.
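The weight estimation can be sketched in a few lines of numpy. The toy estimator below is a scalar with a hand-built correlated control variate (both are illustrative assumptions); the same recipe applies per-coordinate or with the summed covariances of the footnote.

```python
import numpy as np

rng = np.random.default_rng(1)

# Shared noise x makes g and c strongly correlated; c has mean zero by construction.
x = rng.standard_normal(100000)
g = x + 0.1 * rng.standard_normal(100000)  # base estimator draws
c = x                                      # zero-mean control variate draws

# a* = -Cov[g, c] / V[c], estimated from the same samples.
a_star = -np.cov(g, c)[0, 1] / np.var(c)
g_a = g + a_star * c                       # estimator with control variate applied

var_base = np.var(g)
var_cv = np.var(g_a)
```

Here `a_star` lands near $-1$ and `var_cv` is roughly the variance of the residual noise alone, illustrating the $\mathbb{V}[g] - \mathrm{Cov}[g,c]^2/\mathbb{V}[c]$ formula.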
3 New Control Variate
This section presents our control variate. The goal is to estimate the gradient $\nabla_w \mathbb{E}_{q_w(z)} f(z)$ with low variance. The core idea behind our method is simple: if $f$ is replaced with a simpler function $\tilde f$, a closed form for $\nabla_w \mathbb{E}_{q_w(z)} \tilde f(z)$ may be available. Then, the control variate is defined as the difference between this term computed exactly and estimated using reparameterization. Intuitively, if the approximation is good, this control variate will yield large reductions in variance.
We use a quadratic function as our approximation (Sec. 3.1). The resulting control variate is tractable as long as the mean and covariance of $q_w$ are known (Sec. 3.2). While this is valid for any quadratic function $\tilde f$, the effectiveness of the control variate depends on the approximation’s quality. We propose to find the parameters of $\tilde f$ by minimizing the final gradient estimator’s variance or a proxy to it (Sec. 3.3). We do this via a double-descent scheme that simultaneously optimizes the parameters of $\tilde f$ alongside the parameters of $q_w$ (Sec. 3.4).
3.1 Definition, Validity, and Motivation
Given a function $\tilde f$ that approximates $f$, we define the control variate as
$c_w(\epsilon) = \nabla_w \mathbb{E}_{q_w(z)} \tilde f(z) - \nabla_w \tilde f(\mathcal{T}_w(\epsilon)). \qquad (5)$
Since the second term is an unbiased estimator of the first one, $c_w(\epsilon)$ has expectation zero and thus represents a valid control variate. To understand the motivation behind this control variate, consider the final gradient estimator,
$g(\epsilon) + c_w(\epsilon) = \nabla_w \mathbb{E}_{q_w(z)} \tilde f(z) + \nabla_w \left[ f(\mathcal{T}_w(\epsilon)) - \tilde f(\mathcal{T}_w(\epsilon)) \right]. \qquad (6)$
Intuitively, making $\tilde f$ a better approximation of $f$ will tend to make the stochastic term smaller, thus reducing the estimator’s variance. We propose to set the approximating function to be a quadratic parameterized by $v$,
$\tilde f_v(z) = b_v^\top z + \tfrac{1}{2}\, z^\top A_v z, \qquad (7)$
where $b_v$ and $A_v$ are a vector and a square matrix parameterized by $v$, and $z$ is a vector. (We avoid including an additive constant in the quadratic since it would not affect the gradient.)
3.2 Tractability of the Control Variate
We now consider computational issues associated with the control variate from Eq. 5. Our first result is that, given $b_v$ and $A_v$, the control variate is tractable for any distribution with known mean and covariance. We begin by giving a closed form for the expectation in Eq. 5 (proven in Appendix C).
Let $\tilde f_v$ be defined as in Eq. 7. If $q_w$ has mean $\mu_w$ and covariance $\Sigma_w$, then
$\mathbb{E}_{q_w(z)} \tilde f_v(z) = b_v^\top \mu_w + \tfrac{1}{2} \left( \mu_w^\top A_v \mu_w + \mathrm{tr}(A_v \Sigma_w) \right). \qquad (8)$
If we substitute this result into Eq. 5, we can easily use automatic differentiation tools to compute the gradient with respect to $w$, and thus compute the control variate. Therefore, our control variate can be easily used for any reparameterizable distribution with known mean and covariance matrix. These include fully-factorized Gaussians, Gaussians with arbitrary full-rank covariance, Gaussians with structured covariance (e.g. diagonal plus low rank [diagLR], Householder flows [householderflows]), Student-t distributions, and, more generally, distributions in a location-scale family or elliptical family.
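The closed-form expectation of a quadratic under a Gaussian can be checked numerically. The sketch below compares the formula $\mathbb{E}\,\tilde f(z) = b^\top \mu + \tfrac12(\mu^\top A \mu + \mathrm{tr}(A\Sigma))$ against a Monte Carlo estimate; all matrices and vectors are random illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)
M = 4

# Quadratic f~(z) = b^T z + 0.5 z^T A z, with A symmetric.
b = rng.standard_normal(M)
A = rng.standard_normal((M, M))
A = 0.5 * (A + A.T)

# Gaussian q with mean mu and covariance Sigma = L L^T.
mu = rng.standard_normal(M)
L = 0.3 * np.tril(rng.standard_normal((M, M))) + np.eye(M)
Sigma = L @ L.T

# Closed form: E f~(z) = b^T mu + 0.5 (mu^T A mu + tr(A Sigma)).
closed = b @ mu + 0.5 * (mu @ A @ mu + np.trace(A @ Sigma))

# Monte Carlo check: z = mu + L eps with eps standard normal.
zs = mu + rng.standard_normal((100000, M)) @ L.T
vals = zs @ b + 0.5 * np.einsum("ni,ij,nj->n", zs, A, zs)
mc = vals.mean()
```

This is the quantity whose $w$-gradient, taken by automatic differentiation of the closed form, gives the first term of the control variate in Eq. 5.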
Computational cost. The cost of computing the control variate depends on the cost of computing matrix-vector products with $A_v$ and the trace $\mathrm{tr}(A_v \Sigma_w)$ (see Eqs. 7 and 8). These costs depend on the structure of $A_v$ and $\Sigma_w$. We consider the case where $A_v$ and $\Sigma_w$ are parameterized as diagonal plus low rank matrices, with ranks $k_A$ and $k_\Sigma$, respectively. Then, computing the control variate has cost $O\!\left(M (k_A + k_\Sigma + k_A k_\Sigma)\right)$, where $M$ is the dimensionality of $z$.
Notice that the cost of evaluating the reparameterization estimator is at least $O(M)$, since $q_w$ has at least $M$ parameters. However, constant factors there are usually significantly higher than for the control variate, since evaluating $g(\epsilon)$ requires a pass through a dataset. Thus, as long as the ranks $k_A$ and $k_\Sigma$ are “small”, the control variate does not affect the algorithm’s overall scalability.
These complexity results extend to cases where $A_v$ and/or $\Sigma_w$ are diagonal or full-rank matrices by replacing the corresponding rank, $k_A$ or $k_\Sigma$, by $0$ or $M$. For example, if $A_v$ is a full-rank matrix and $\Sigma_w$ is diagonal plus rank-$k$, the control variate’s cost is $O(M^2 k)$. If both matrices are full-rank and $\Sigma_w$ is parameterized by its Cholesky factor $L$, the cost is $O(M^3)$. This cubic cost comes entirely from instantiating $\Sigma_w = L L^\top$; all other costs are $O(M^2)$.
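The two primitives behind these costs can be sketched for diagonal plus low-rank factors. The factor names (`d`, `U`, `V`) are our own illustrative parameterization, $\mathrm{diag}(d) + U V^\top$; the point is that neither operation forms an $M \times M$ matrix.

```python
import numpy as np

def matvec_dplr(d, U, V, x):
    # (diag(d) + U V^T) x in O(M k) instead of O(M^2).
    return d * x + U @ (V.T @ x)

def trace_product_dplr(dA, UA, VA, dS, US, VS):
    # tr(A S) for A = diag(dA) + UA VA^T and S = diag(dS) + US VS^T,
    # in O(M kA kS), without instantiating either matrix.
    t = np.sum(dA * dS)                      # tr(diag(dA) diag(dS))
    t += np.sum((UA * dS[:, None]) * VA)     # tr(UA VA^T diag(dS))
    t += np.sum((US * dA[:, None]) * VS)     # tr(diag(dA) US VS^T)
    t += np.sum((VA.T @ US) * (UA.T @ VS))   # tr(UA VA^T US VS^T)
    return t
```

Both functions agree with the dense computations `(np.diag(d) + U @ V.T) @ x` and `np.trace(A @ S)`, while scaling linearly in $M$ for fixed ranks.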
3.3 Constructing the Quadratic Approximation
The results in the previous section hold for any quadratic function $\tilde f_v$. However, for the control variate to reduce variance, it is important that $\tilde f_v$ is a good approximation of $f$. This section proposes two methods to find such an approximation.
A natural idea would be to use a Taylor approximation of $f$ [traylorreducevariance_adam; viasstochastic_jordan]. However, as we discuss in Section 4, this leads to serious computational challenges (and is suboptimal). Instead, we directly seek parameters that minimize the variance of the final gradient estimator $g(\epsilon) + c_w(\epsilon)$. For a given set of variational parameters $w$, we set $\tilde f = \tilde f_v$ and find the parameters $v$ by minimizing an objective $L(v)$. We present two different objectives that can be used:
Method 1. Find $v$ by minimizing the variance of the final gradient estimator (assuming $a = 1$),
$L_{\mathrm{var}}(v) = \mathbb{V}\left[ g(\epsilon) + c_{w,v}(\epsilon) \right] = \mathbb{E} \left\| g(\epsilon) + c_{w,v}(\epsilon) \right\|^2 - \left\| \nabla_w \mathbb{E}_{q_w(z)} f(z) \right\|^2. \qquad (9)$
Since the second term does not depend on $v$, it can be ignored during optimization. Using a sample $\epsilon \sim q_0$, an unbiased estimate of $L_{\mathrm{var}}(v)$ (up to this constant) can be obtained as
$\hat L_{\mathrm{var}}(v) = \left\| g(\epsilon) + c_{w,v}(\epsilon) \right\|^2. \qquad (10)$
Method 2. While the above method works well, it imposes a modest constant-factor overhead, due to the need to differentiate through the control variate. As an alternative, we propose a simple proxy. The motivation is that the difference between the base gradient estimator and its approximation based on $\tilde f_v$ is given by
$g(\epsilon) - \nabla_w \tilde f_v(\mathcal{T}_w(\epsilon)) = \frac{\partial \mathcal{T}_w(\epsilon)}{\partial w}^{\!\top} \left[ \nabla f(z) - \nabla \tilde f_v(z) \right], \quad z = \mathcal{T}_w(\epsilon). \qquad (11)$
Thus, the closer $\nabla \tilde f_v$ is to $\nabla f$, the better the control variate can approximate and cancel the estimator $g$’s noise. Accordingly, we propose the proxy objective
$L_{\mathrm{proxy}}(v) = \mathbb{E}_{q_w(z)} \left\| \nabla f(z) - \nabla \tilde f_v(z) \right\|^2. \qquad (12)$
Using a sample $\epsilon \sim q_0$, an unbiased estimate of $L_{\mathrm{proxy}}(v)$ can be obtained as
$\hat L_{\mathrm{proxy}}(v) = \left\| \nabla f(\mathcal{T}_w(\epsilon)) - \nabla \tilde f_v(\mathcal{T}_w(\epsilon)) \right\|^2. \qquad (13)$
We observed that both methods lead to reductions in variance of similar magnitude (see Fig. 1 for a comparison). However, the second method introduces a smaller overhead.
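The proxy of Method 2 is just stochastic least squares on the gradient of the quadratic. The sketch below fits a diagonal $A_v$ and a vector $b_v$ by SGD on single-sample residuals; the toy target (whose gradient is exactly linear, so the fit is realizable) and all constants are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
M = 3
B_true = np.array([1.0, 2.0, 3.0])   # diagonal "Hessian" of the toy target
r = np.array([0.5, -0.5, 1.0])

def grad_f(z):
    # Gradient of the toy target f(z) = 0.5 z^T diag(B_true) z + r^T z.
    return B_true * z + r

# Quadratic approximation f~(z) = b^T z + 0.5 z^T diag(a) z, so grad f~(z) = b + a * z.
b = np.zeros(M)
a = np.zeros(M)

lr = 0.05
for step in range(4000):
    z = 1.0 + 0.5 * rng.standard_normal(M)  # draw z from a fixed Gaussian q
    res = grad_f(z) - (b + a * z)           # single-sample proxy residual
    b += lr * res                           # SGD step on ||res||^2
    a += lr * res * z
```

Because the residual can be driven exactly to zero here, `a` converges to `B_true` and `b` to `r`; with a non-quadratic $f$ the fit instead finds the best quadratic under $q_w$.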
The idea of using a double-descent scheme to minimize gradient variance was explored in previous work. It has been done to set the parameters of a sampling distribution [overdispersed_blei], and to set the parameters of a control variate for discrete latent variable models [backpropvoid; REBAR] using a continuous relaxation for discrete distributions [gumbel1; gumbel2].
3.4 Final Algorithm
This section presents an efficient algorithm to use our control variate for SGVI. The approach involves maximizing the ELBO and finding a good quadratic approximation simultaneously, via a double-descent scheme. We maximize the ELBO using stochastic gradient ascent with the gradient estimator from Eq. 3 and our control variate for variance reduction. Simultaneously, we find an approximation $\tilde f_v$ by minimizing the objective $L(v)$ (Eq. 9 or 12) using stochastic gradient descent with the gradient estimators from Eq. 10 or 13. Our procedure, summarized in Alg. 1, alternates steps of each optimization process. Notably, optimizing $v$ as in Alg. 1 does not involve extra likelihood evaluations, since the model evaluations used to estimate the ELBO’s gradient are re-used to estimate $L(v)$.
Alg. 1 includes the control variate weight $a$. This is useful in practice, especially at the beginning of training, when $v$ is far from optimal and $\tilde f_v$ is a poor approximation of $f$. The (approximate) optimal weight $a^*$ can be obtained by keeping estimates of $\mathrm{Cov}[g, c]$ and $\mathbb{V}[c]$ as optimization proceeds [cvs].
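The alternating structure can be sketched end-to-end on a one-dimensional toy problem. Everything below is an illustrative assumption (a Gaussian target, a diagonal-Gaussian $q$, scalar quadratic coefficients `a_v`, `b_v`, and fixed weight $a=1$); it is a sketch of the double-descent loop, not the paper's Alg. 1 verbatim.

```python
import numpy as np

rng = np.random.default_rng(5)
t = 2.0                                    # toy target: log p(z, x) = -0.5 (z - t)^2

def grad_log_p(z):
    return -(z - t)

# Variational params w = (mu, m) for q = N(mu, e^{2m}); entropy is m + const.
mu, m = -1.0, 1.0
# Quadratic approximation f~(z) = b_v z - 0.5 a_v z^2; grad f~(z) = b_v - a_v z.
a_v, b_v = 0.0, 0.0

lr_w, lr_v = 0.02, 0.05
for step in range(6000):
    eps = rng.standard_normal()
    sigma = np.exp(m)
    z = mu + sigma * eps

    # Base reparameterization gradient (entropy term handled in closed form).
    g = grad_log_p(z)
    g_mu = g
    g_m = g * sigma * eps + 1.0            # +1 from the entropy gradient

    # Control variate: exact grad of E f~ minus its reparameterization estimate.
    # E f~ = b_v mu - 0.5 a_v (mu^2 + sigma^2); grad f~(z) = b_v - a_v z.
    c_mu = (b_v - a_v * mu) - (b_v - a_v * z)
    c_m = -a_v * sigma**2 - (b_v - a_v * z) * sigma * eps

    mu += lr_w * (g_mu + c_mu)             # ELBO ascent step on w
    m += lr_w * (g_m + c_m)

    # Double descent: fit (a_v, b_v) by the gradient-matching proxy, re-using z.
    res = grad_log_p(z) - (b_v - a_v * z)
    b_v += lr_v * res
    a_v += lr_v * (-res * z)
```

Once `a_v` and `b_v` match the target's curvature, the bracketed updates for `mu` and `m` become essentially deterministic, which is the zero-variance limit discussed in Sec. 2.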
4 Comparison of Approximations
Taylor-Based Approximations. There is closely related work exploring Taylor-expansion based control variates for reparameterization gradients [traylorreducevariance_adam]. These control variates can be expressed in the form of Eq. 5, with the approximation $\tilde f$ set to be a Taylor expansion of $f$. In general this leads to an intractable control variate: the expectation $\mathbb{E}_{q_w(z)} \tilde f(z)$ may not be known, or the Taylor approximation itself may be intractable (e.g. it requires computing Hessians). However, in some cases, it can be computed efficiently. For this discussion we focus on Gaussian variational distributions, where the parameters $w$ are the mean and scale.
For the gradient with respect to the mean parameters, $\tilde f$ can be set to a second-order Taylor expansion of $f$ around the current mean. This might appear problematic, since computing the Hessian of $f$ is intractable in general. However, it turns out that, for the mean parameters, this leads to a control variate that can be computed using only Hessian-vector products. This was first observed by Miller et al. [traylorreducevariance_adam] for diagonal Gaussians.
For the scale parameters, even with a diagonal Gaussian, using a second-order Taylor expansion requires the diagonal of the Hessian, which is intractable in general. For this reason, Miller et al. [traylorreducevariance_adam] propose an approach equivalent[2] to setting $\tilde f$ to a first-order Taylor expansion, so that the approximated gradient is constant.

[2] The original paper [traylorreducevariance_adam] describes the control variate for the scale parameters as using a second-order Taylor expansion, and then applies an additional approximation based on a minibatch to deal with intractable Hessian computations. In Appendix D we show these formulations are exactly equivalent.
The biggest drawback of Taylor-based control variates is that the crude first-order Taylor approximation used for the scale parameters provides almost no variance reduction. Interestingly, this poses very little problem with diagonal Gaussians, because in that case the gradient with respect to the mean parameters typically contributes almost all the variance. However, the approach may be useless in other situations: with non-diagonal distributions, the scale parameters often contribute the majority of the variance (see Fig. 1).
A second drawback is that even a second-order Taylor expansion is not optimal: it provides a good local approximation, but this can be a poor global fit for distributions with large variance.
Demonstration. Fig. 1 compares four gradient estimators on a Bayesian logistic regression model (see Sec. 5): plain reparameterization, reparameterization with a Taylor-based control variate, and reparameterization with our control variate (minimizing Eq. 9 or Eq. 12, using a diagonal plus low-rank matrix $A_v$). The variational distribution is either a diagonal Gaussian or a Gaussian with arbitrary full-rank covariance. We set the mean $\mu_w$ and covariance $\Sigma_w$ to a range of parameter settings, and measure each estimator’s variance at each setting. For transparency, in all cases we use a fixed weight $a = 1$.
There are four key observations: (i) Our variance reduction for the mean parameters is somewhat better than a Taylor approximation (which even increases variance in some cases). This is not surprising, since a Taylor expansion was never claimed to be optimal; (ii) our control variate is vastly better for the scale parameters; (iii) the variance for fully-factorized distributions is dominated by the mean parameters, while the variance for full-covariance distributions is dominated by the scale parameters; (iv) the proxy for the gradient variance (Eq. 12) performs extremely similarly to the true gradient variance (Eq. 9).
It should be emphasized that, for this analysis, the parameters $v$ are trained to completion for each parameter setting. This does not exactly reflect what would be expected in practice, where the double-descent scheme “tracks” the optimal $v$ as $w$ changes. Experiments in the next section consider this practical setting.
5 Experiments and Results
We present results that empirically validate the proposed control variate and algorithm. We perform SGVI on several probabilistic models using different variational distributions. We maximize the ELBO using the reparameterization estimator with the proposed control variate to reduce its variance (Alg. 1). We compare against optimizing using the reparameterization estimator without any control variates, and against optimizing using Taylor-based control variates for variance reduction.
5.1 Experimental details
Tasks and datasets: We use three different models: logistic regression with the a1a dataset, hierarchical regression with the frisk dataset [frisk], and a Bayesian neural network with the red-wine dataset. The latter two are the ones used by Miller et al. [traylorreducevariance_adam]. (Details for each model are in App. B.)
Variational distributions: We consider diagonal Gaussians parameterized by the log-scale parameters, and diagonal plus low-rank Gaussians, whose covariance is parameterized by a diagonal component $D$ and a low-rank factor $F$ (i.e. $\Sigma = D + F F^\top$) [diagLR]. For the simpler models, logistic regression and hierarchical regression, we also consider full-rank Gaussians parameterized by the Cholesky factor of the covariance.
Algorithmic details: We use Adam [adam] to optimize the parameters of the variational distribution (with a grid of step sizes). We also use Adam to optimize the parameters $v$ of the control variate, by minimizing the proxy to the variance from Eq. 12. We parameterize $A_v$ as a diagonal plus low-rank matrix, with a smaller rank when diagonal or diagonal plus low-rank variational distributions are used, and a larger rank when a full-rank variational distribution is used.
Baselines considered: We compare against optimization using the base reparameterization estimator (Eq. 3). We also compare against using Taylor-based control variates. (We generalize the Taylor approach to full-rank and diagonal plus low-rank distributions in Appendix D.3.) For all control variates we find the (approximate) optimal weight $a^*$ using the method from Geffner and Domke [cvs] (fixing the weight to 1 led to strictly worse results). We estimate gradients using several sample sizes (numbers of draws from $q_0$).
For the sake of reproducibility, we show results in terms of iterations. Table 1 shows the per-iteration time cost of each method in our experiments. Our method’s overhead is around 50%, while the Taylor approach has an overhead of around 150%. These numbers of course depend on the implementation and computational platform, but should give a rough estimate of the overhead in practice (we use PyTorch 1.1.0 on an Intel i5 2.3GHz).
[Table 1: per-iteration time cost for each number of samples and model, for $q_w$ diagonal plus low rank and $q_w$ with full-rank covariance; columns: Base, Our CV, Taylor. Values omitted here.]
Fig. 2 shows optimization results for the diagonal plus low-rank Gaussian variational distribution. The two leftmost columns show ELBO vs. iteration plots for two specific learning rates. The third column shows, for each method and iteration, the ELBO for the best learning rate chosen retrospectively. In all cases, our method improves over competing approaches. In fact, our method using fewer samples to estimate the gradients performs better than competing approaches using more. On the other hand, Taylor-based control variates give practically no improvement over the base estimator alone. This is because most of the gradient variance comes from estimating the gradient with respect to the scale parameters, for which Taylor-based control variates do little.
To test the robustness of optimization, in Fig. 3 we show the final training ELBO after 80000 steps as a function of the step size used. Our method is less sensitive to the choice of step size. In particular, our method gives reasonable results with larger learning rates, which translates to better results with a smaller number of iterations.
For space reasons, results for diagonal Gaussians and Gaussians with arbitrary full-rank covariances as variational distributions are shown in Appendix A. Results for full-rank Gaussians are similar to the ones shown in Figs. 2 and 3: our method performs considerably better than competing approaches (even with fewer samples per gradient estimate), and Taylor-based control variates yield no improvement over the no-control-variate baseline.
On the other hand, with diagonal Gaussians, our approach and Taylor-based control variates perform similarly – both are significantly better than the no-control-variate baseline. We attribute the success of Taylor-based approaches in this case to two related factors. First, diagonal approximations tend to under-estimate the true variance, so a local Taylor approximation may be more effective. Second, for diagonal Gaussians most of the gradient variance comes from the mean parameters, where a second-order Taylor approach is tractable.
6 Broader Impact
In this work we present a new algorithm that yields improved performance for VI with non-factorized distributions. We believe this algorithm could be included in VI-based automatic inference tools to improve their performance. This could have an impact in several areas, since these tools, such as ADVI [advi] (in Stan [stan]), are used by researchers and practitioners in many different fields.
Appendix A Results with Other Variational Distributions
A.1 Gaussian with Arbitrary Full-rank Covariance Variational Distribution
A.2 Fully-factorized Gaussian Variational Distribution
Appendix B Models Used
Bayesian logistic regression: We use a subset of rows of the a1a dataset. Let $(x_i, y_i)$, where $y_i$ is binary, represent the $i$-th sample in the dataset. The posterior dimensionality equals the number of regression weights. The model places a standard Gaussian prior on the weights and a Bernoulli likelihood on the labels,
$w \sim \mathcal{N}(0, I), \qquad y_i \sim \mathrm{Bernoulli}\!\left( \sigma(w^\top x_i) \right),$
where $\sigma$ denotes the sigmoid function.
Hierarchical Poisson model: By Gelman et al. [frisk]. The model measures the relative stop-and-frisk events in different precincts in New York City, for different ethnicities. The posterior dimensionality is the total number of latent variables (the overall mean, the group log-scales, and the per-group offsets). The model places Gaussian priors on an overall mean $\mu$ and on the log-scales $\log \sigma_\alpha, \log \sigma_\beta$, and is given by
$\alpha_e \sim \mathcal{N}(0, \sigma_\alpha^2), \qquad \beta_p \sim \mathcal{N}(0, \sigma_\beta^2), \qquad Y_{ep} \sim \mathrm{Poisson}\!\left( N_{ep}\, \exp(\mu + \alpha_e + \beta_p) \right).$
Here, $e$ stands for ethnicity, $p$ for precinct, $Y_{ep}$ for the number of stops in precinct $p$ within ethnicity group $e$ (observed), and $N_{ep}$ for the total number of arrests in precinct $p$ within ethnicity group $e$ (also observed).
Bayesian neural network: As done by Miller et al. [traylorreducevariance_adam], we use a subset of 100 rows from the “red-wine” dataset. We implement a neural network with one hidden layer with 50 units and ReLU activations. The posterior dimensionality is determined by the number of network weights and biases. Let $(x_i, y_i)$, where $y_i$ is an integer between one and ten, represent the $i$-th sample in the dataset. The model places Gaussian priors on the network parameters $\theta$ (weights and biases) and a Gaussian likelihood on the responses,
$\theta \sim \mathcal{N}(0, I), \qquad y_i \sim \mathcal{N}\!\left( \mathrm{NN}_\theta(x_i), \sigma^2 \right).$
Appendix C Proof of Lemma
Let $\tilde f_v$ be defined as in Eq. 7. If $q_w$ is a distribution with mean $\mu_w$ and covariance matrix $\Sigma_w$, then
$\mathbb{E}_{q_w(z)} \tilde f_v(z) = b_v^\top \mu_w + \tfrac{1}{2} \left( \mu_w^\top A_v \mu_w + \mathrm{tr}(A_v \Sigma_w) \right).$
The first term follows from linearity of expectation, $\mathbb{E}[b_v^\top z] = b_v^\top \mu_w$. The second follows from the standard identity for the expectation of a quadratic form, $\mathbb{E}[z^\top A_v z] = \mu_w^\top A_v \mu_w + \mathrm{tr}(A_v \Sigma_w)$.
Appendix D Details on Taylor-based Control Variates
There is closely related work exploring Taylor-expansion based control variates for reparameterization gradients by Miller et al. [traylorreducevariance_adam]. They develop a control variate for the case where $q$ is a fully-factorized Gaussian.
Note: In their paper, Miller et al. derived a control variate for the case where $q$ is a fully-factorized Gaussian parameterized by its mean $\mu$ and standard deviation $\sigma$; that is, $q_{\mu,\sigma}(z) = \mathcal{N}(z; \mu, \mathrm{diag}(\sigma)^2)$. However, in their code (publicly available) they use a different parameterization. Instead of $\sigma$, they use a different set of parameters, $m$, to represent the log of the standard deviation of each component; that is, $\sigma = e^m$. In order to explain, replicate, and compare against the method they use, we derive the details of their approach for the latter case. (This derivation is not present in their paper, but follows all its steps closely.)
Miller et al. introduced a control variate to reduce the variance of the estimator of the gradient with respect to the mean parameters $\mu$, and a control variate for the gradient with respect to the log-scale parameters $m$. We will denote these control variates $c_\mu$ and $c_m$, respectively. Their main idea is to use curvature information about the model (via its Hessian) to construct both control variates. The control variate they propose for the mean parameters can be computed efficiently via Hessian-vector products. On the other hand, the original proposal for $c_m$ requires computing the (often) intractable Hessian of $f$. To avoid this, the authors propose an alternative control variate based on some tractable approximations.
The authors noted that the use of these approximations leads to a significant deterioration of the control variate’s variance reduction capability. However, no formal analysis explaining this was presented. We study these approximations in detail and explain exactly why this quality reduction is observed. Simply put, we observe that they lead to a control variate that does not use curvature information about the model at all.
The rest of this section is organized as follows. In D.1, we present the control variates obtained after applying the approximations required to deal with the intractable Hessian: $c_\mu$ and $c_m$. In D.2, we present Miller et al.'s original (intractable) control variate for $m$, explain the source of intractability, and explain how the approximation used leads to the “weaker” control variate presented in D.1. Finally, in D.3 we describe the drawbacks of the approach, and extend it to the case where $q$ is a Gaussian with a full-rank or diagonal plus low rank covariance matrix.
D.1 Final control variate after approximations
Let $q_w$ be the variational distribution, with parameters $w$. The gradient that must be estimated is given by
$\nabla_w \mathbb{E}_{q_w(z)} f(z) = \mathbb{E}_{q_0(\epsilon)} \left[ \frac{\partial \mathcal{T}_w(\epsilon)}{\partial w}^{\!\top} \nabla f \right],$
where $\nabla f$ is evaluated at $z = \mathcal{T}_w(\epsilon)$. The gradient estimator obtained with a sample $\epsilon \sim q_0$ is given by
$g(\epsilon) = \frac{\partial \mathcal{T}_w(\epsilon)}{\partial w}^{\!\top} \nabla f(\mathcal{T}_w(\epsilon)).$
Miller et al. [traylorreducevariance_adam] propose to build a control variate using an approximation $h(z)$ of the gradient $\nabla f(z)$. The control variate is given by the difference between the gradient estimator using this approximation and its expectation,
$c(\epsilon) = \frac{\partial \mathcal{T}_w(\epsilon)}{\partial w}^{\!\top} h(\mathcal{T}_w(\epsilon)) - \mathbb{E}_{q_0(\epsilon)} \left[ \frac{\partial \mathcal{T}_w(\epsilon)}{\partial w}^{\!\top} h(\mathcal{T}_w(\epsilon)) \right].$
The quality of the control variate directly depends on the quality of the approximation $h$. If $h$ is very close to $\nabla f$, the control variate is able to approximate and cancel the estimator’s noise. On the other hand, a bad approximation leads to a small (or no) reduction in variance.
This idea is applied to a fully-factorized Gaussian with parameters $m$ representing the log-scale. The reparameterization transformation is given by
$\mathcal{T}_{\mu,m}(\epsilon) = \mu + e^m \odot \epsilon,$
where $\odot$ is the element-wise product between vectors. The parameters are $w = (\mu, m)$. The control variate is derived differently for $\mu$ and $m$. We discuss the two cases separately.
Control variate for $\mu$. For $\mu$, the authors set $h$ to be a first-order Taylor expansion of the true gradient around $\mu$. That is, $h(z) = \nabla f(\mu) + H(\mu)(z - \mu)$, where $H(\mu)$ is the Hessian of $f$ evaluated at $\mu$. Then, it is not hard to show that the control variate becomes[3]
$c_\mu(\epsilon) = H(\mu)\,(e^m \odot \epsilon).$

[3] To see this, observe that the Jacobian of $\mathcal{T}_{\mu,m}(\epsilon)$ with respect to $\mu$ is the identity, and that $\mathbb{E}_{q_0(\epsilon)}\, h(\mathcal{T}_{\mu,m}(\epsilon)) = \nabla f(\mu)$, since $\mathbb{E}[e^m \odot \epsilon] = 0$.

This control variate can be computed efficiently using Hessian-vector products, and will be effective when the approximation $h(z)$ is close to $\nabla f(z)$ for typical samples $z \sim q_w$.
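A quick numerical illustration: on a toy target whose Hessian is constant, the first-order Taylor expansion of the gradient is exact, so $c_\mu$ cancels the noise completely. The target, its Hessian stand-in `hvp`, and the sign convention (the control variate is subtracted, i.e. weight $a = -1$ in the $g + a c$ convention) are all illustrative choices; with a real model, `hvp` would be an automatic-differentiation Hessian-vector product.

```python
import numpy as np

rng = np.random.default_rng(7)
M = 5
mu = rng.standard_normal(M)
sigma = np.exp(0.2 * rng.standard_normal(M))

# Toy target with constant Hessian: f(z) = -0.5 z^T P z, so grad f(z) = -P z.
Q = rng.standard_normal((M, M))
P = Q @ Q.T / M + np.eye(M)

def grad_f(z):
    return -P @ z

def hvp(v):
    # Hessian-vector product H(mu) v; here the Hessian is -P everywhere.
    return -P @ v

base, with_cv = [], []
for _ in range(5000):
    eps = rng.standard_normal(M)
    z = mu + sigma * eps
    g_mu = grad_f(z)             # base estimator for the mean parameters
    c_mu = hvp(sigma * eps)      # control variate H(mu)(z - mu); zero mean
    base.append(g_mu)
    with_cv.append(g_mu - c_mu)

var_base = np.var(np.array(base), axis=0).sum()
var_cv = np.var(np.array(with_cv), axis=0).sum()
```

For a non-quadratic $f$ the cancellation is only approximate, and its quality degrades as $q_w$ gets wider, which is the locality drawback discussed in Sec. 4.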
The following derivation for $c_m$ is different from the one given by Miller et al. We show the two are equivalent in Sec. D.2.
Control variate for $m$. For $m$, it is necessary – in order to obtain a closed-form expectation – to use a constant approximation of the form $h(z) = \nabla f(\mu)$ (using the first-order Taylor expansion as for $\mu$ leads to intractable terms, see Section D.2). Then, it turns out that the expectation part of the control variate is zero, and so the control variate becomes[4]
$c_m(\epsilon) = (e^m \odot \epsilon) \odot \nabla f(\mu). \qquad (32)$

[4] In this case the Jacobian of $\mathcal{T}_{\mu,m}(\epsilon)$ with respect to $m$ is $\mathrm{diag}(e^m \odot \epsilon)$. It follows that $\mathbb{E}_{q_0(\epsilon)} \left[ (e^m \odot \epsilon) \odot \nabla f(\mu) \right] = 0$, since $\mathbb{E}[\epsilon] = 0$.

It can be observed that $c_m$ does not use curvature information about the model. This control variate will be effective only in cases where $\nabla f(z)$ is close to $\nabla f(\mu)$ for typical samples $z \sim q_w$.
D.2 Original Derivation
Miller et al. [traylorreducevariance_adam] gave a more elaborate derivation of the above control variate for $m$. They start with the same first-order Taylor expansion as used for $\mu$, $h(z) = \nabla f(\mu) + H(\mu)(z - \mu)$. Applied directly, this suggests the control variate[5]
$c_m(\epsilon) = (e^m \odot \epsilon) \odot h(z) - e^{2m} \odot \mathrm{diag}\!\left( H(\mu) \right), \qquad (37)$
since $\mathbb{E}_{q_0(\epsilon)} \left[ (e^m \odot \epsilon) \odot H(\mu)(e^m \odot \epsilon) \right] = e^{2m} \odot \mathrm{diag}(H(\mu))$.

[5] Again, $z = \mathcal{T}_{\mu,m}(\epsilon)$, and the Jacobian of $\mathcal{T}_{\mu,m}(\epsilon)$ with respect to $m$ is $\mathrm{diag}(e^m \odot \epsilon)$.

The first term from Eq. 37 can be computed efficiently using Hessian-vector products. The second term, however, is often intractable, since it requires the diagonal of the Hessian. In such cases, the authors propose to apply a further estimation process using a baseline [bekas2007estimator; VIforMCobjectives_mnih]. The idea is that gradients are often estimated in a minibatch, based on a set of samples $\epsilon_1, \ldots, \epsilon_B$. Then, the expectation can be estimated without bias using the other samples in the minibatch. This results in the control variate for sample $i$ of
$c_m(\epsilon_i) = (e^m \odot \epsilon_i) \odot h(z_i) - \frac{1}{B-1} \sum_{j \neq i} (e^m \odot \epsilon_j) \odot H(\mu)(e^m \odot \epsilon_j).$
At first glance it may appear that this control variate uses curvature information from the model via the Hessian $H(\mu)$. However, a careful inspection shows that all these terms cancel out. The control variate for the full minibatch is simply
$\sum_{i=1}^{B} c_m(\epsilon_i) = \sum_{i=1}^{B} (e^m \odot \epsilon_i) \odot \nabla f(\mu).$
This, of course, is exactly the same as taking a minibatch of the control variate derived in Eq. 32. Thus, the ideas of minibatch and baseline may somewhat obscure what is happening. It is not necessary to invoke the machinery of a baseline, nor to draw samples in a minibatch: a zero-th order Taylor expansion of the gradient is equivalent, and has the practical advantage of remaining valid with a single sample. While some details of the baseline procedure were not available in the published paper, we confirmed this is equivalent to the control variate used in the publicly available code.
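The cancellation can be verified numerically. The sketch below builds the Eq. 37-style per-sample control variates with the expectation replaced by the leave-one-out minibatch baseline, and checks that their minibatch sum equals the sum of the curvature-free Eq. 32 control variates. The Hessian and gradient stand-ins are random illustrative values.

```python
import numpy as np

rng = np.random.default_rng(6)
M, B = 4, 8
sigma = np.exp(0.3 * rng.standard_normal(M))
H = rng.standard_normal((M, M))
H = 0.5 * (H + H.T)                  # stand-in for the Hessian H(mu)
grad_mu = rng.standard_normal(M)     # stand-in for grad f(mu)

eps = rng.standard_normal((B, M))
u = sigma * eps                      # rows: sigma ⊙ eps_i
quad = u * (u @ H.T)                 # rows: (sigma ⊙ eps_i) ⊙ H(mu)(sigma ⊙ eps_i)

# Eq. 37-style control variate per sample, with the intractable expectation
# replaced by a baseline computed from the other minibatch samples.
cv = np.empty((B, M))
for i in range(B):
    baseline = (quad.sum(axis=0) - quad[i]) / (B - 1)
    cv[i] = u[i] * grad_mu + quad[i] - baseline

# Zero-th order (Eq. 32) control variate per sample.
simple = u * grad_mu
```

The per-sample control variates differ, but their sums over the minibatch agree exactly: all Hessian-dependent terms cancel.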
D.3 Limitations of the approach and extensions
One limitation of the above approach is that the control variate for $m$ is not very effective. Unless the diagonal of the Hessian is tractable, it uses a very crude (constant) approximation of $\nabla f$. Thus, one would naturally expect this control variate to perform worse when the diagonal of the Hessian is not tractable. Indeed, this can be observed in the results obtained by Miller et al. [traylorreducevariance_adam]: Table 1 in their paper shows that the tractable control variate (Eq. 32) leads to a variance reduction several orders of magnitude worse than the one obtained using the control variate based on the true Hessian (Eq. 37, often intractable to compute).
In their simulations, this relatively poor performance for $c_m$ does not represent a big inconvenience. That is because of the following empirical observation: when using a fully-factorized Gaussian as variational distribution, most of the gradient variance comes from the mean parameters $\mu$, for which a much better approximation of $\nabla f$ can be used. However, our results in this paper show that with non-fully-factorized distributions most of the variance is often contributed by the scale parameters (see Fig. 1).
A second limitation is that their approach requires manual distribution-specific derivations. More specifically, in order to use the control variate with another distribution, the expectation
$\mathbb{E}_{q_0(\epsilon)} \left[ \frac{\partial \mathcal{T}_w(\epsilon)}{\partial w}^{\!\top} h(\mathcal{T}_w(\epsilon)) \right]$
must be computed. In order to do so, a closed-form expression for the Jacobian of $\mathcal{T}_w(\epsilon)$ is required. (One cannot use automatic differentiation for this, since a mathematical expression for the Jacobian is needed in order to derive the expectation.) Thus, extending the approach to other variational distributions is not trivial, and the difficulty depends on the variational distribution chosen. We now present three cases: two for which the extension can be done without much work (full-rank and diagonal plus low rank Gaussians), and one for which the extension requires extensive calculations (Householder flows [householderflows]).
Full-rank Gaussian: In this case we have $q_w(z) = \mathcal{N}(z; \mu, \Sigma)$. The parameters are $w = (\mu, S)$, where $S$ parameterizes the covariance matrix as $\Sigma = S S^\top$, and reparameterization is given by $\mathcal{T}_w(\epsilon) = \mu + S \epsilon$. If we let $\mathrm{vec}(S)$ be a vector that contains all rows of $S$ in order, we get that the required Jacobians are given by