PyTorch implementation for "Differentiable Antithetic Sampling for Variance Reduction in Stochastic Variational Inference" (https://arxiv.org/abs/1810.02555).
Stochastic optimization techniques are standard in variational inference algorithms. These methods estimate gradients by approximating expectations with independent Monte Carlo samples. In this paper, we explore a technique that uses correlated, but more representative, samples to reduce estimator variance. Specifically, we show how to generate antithetic samples that match sample moments with the true moments of an underlying importance distribution. Combining a differentiable antithetic sampler with modern stochastic variational inference, we showcase the effectiveness of this approach for learning a deep generative model.
A wide class of problems in science and engineering can be solved by gradient-based optimization of function expectations. This is especially prevalent in machine learning (Schulman et al., 2015), including variational inference (Ranganath et al., 2014; Rezende et al., 2014). On the face of it, problems of this nature require solving an intractable integral. Most practical approaches instead use Monte Carlo estimates of expectations and their gradients. These techniques are unbiased but can suffer from high variance when the sample size is small: one unlikely sample in the tail of a distribution can heavily skew the final estimate. A simple way to reduce variance is to increase the number of samples; however, the computational cost grows quickly. We would like to reap the benefits of a larger sample size using as few samples as possible. With a fixed computational budget, how do we choose samples?
A large body of work has been dedicated to reducing variance in sampling, with the most popular approaches in machine learning being reparameterizations for some continuous distributions (Kingma and Welling, 2013; Jang et al., 2016) and control variates to adjust for estimated error (Mnih and Gregor, 2014; Weaver and Tao, 2001). These techniques sample i.i.d., but perhaps it is possible to choose correlated samples that are more representative of their parent distribution. Several such non-independent sampling approaches have been proposed in statistics. In this work we investigate antithetics, where for every sample we draw, we include a negatively correlated sample to minimize the distance between sample and population moments.
The key challenges in applying antithetic sampling to modern machine learning are (1) ensuring that antithetic samples are still correctly distributed such that they provide unbiased estimators for Monte Carlo simulation, and (2) ensuring that the sampling process is differentiable to permit end-to-end gradient-based optimization. We focus on stochastic variational inference and explore the effectiveness of antithetics on learning the parameters for a deep generative model. Critically, our method of antithetic sampling is differentiable and can be composed with reparametrizations of the underlying distributions to provide a fully differentiable sampling process. This yields a simple and low variance way to optimize the parameters of the variational posterior. Empirically, we demonstrate that differentiable antithetic sampling outperforms traditional i.i.d. sampling on a variety of objectives, posterior families, and datasets.
Our contributions are as follows:
We review a method for generating Gaussian variates with given sample moments and apply it to antithetic sampling.
We extend the algorithm beyond the Gaussian family using deterministic transformations.
We show that differentiating through the sampling computation improves variational inference.
We show that training VAEs with antithetic samples improves learning across seven datasets.
Consider a generative model that specifies a joint distribution p(x, z) over a set of observed variables x and stochastic latent variables z, parameterized by θ. We are interested in the posterior distribution p(z|x) = p(x, z)/p(x), which is intractable since p(x) = ∫ p(x, z) dz has no closed form. Instead, we introduce a variational posterior q(z|x), parameterized by φ, that approximates p(z|x) but is easy to sample from and score.
Our objective is to maximize the likelihood of the data (the "evidence"), log p(x). Again, this is intractable, so we optimize the evidence lower bound (ELBO) instead:

log p(x) ≥ E_{q(z|x)}[log p(x|z)] − KL(q(z|x) ‖ p(z)) (Eqn. 1)
The likelihood p(x|z) and the variational posterior q(z|x) are both deep neural networks composed with a simple likelihood (e.g. Bernoulli or Gaussian).
Since q(z|x) can impact the ELBO (though not the true marginal likelihood it lower bounds), we jointly optimize over θ and φ. The gradients of the ELBO objective are:

∇_θ ELBO = E_{q(z|x)}[∇_θ log p(x|z)] (Eqn. 2)
∇_φ ELBO = ∇_φ E_{q(z|x)}[log p(x|z) + log p(z) − log q(z|x)] (Eqn. 3)
Eqn. 2 can be directly estimated using Monte Carlo techniques. However, as it stands, Eqn. 3 is difficult to approximate, as we cannot move the gradient inside the expectation: the measure itself depends on φ. Luckily, if we constrain q(z|x) to certain families, we can reparameterize.
Reparameterization refers to isolating sampling from the gradient computation graph (Kingma and Welling, 2013; Rezende et al., 2014). If we can sample z ~ q(z|x) by applying a deterministic function g_φ to a sample from an unparametrized distribution, ε ~ p(ε), then we can rewrite Eqn. 3 as:

∇_φ ELBO = E_{p(ε)}[∇_φ (log p(x|g_φ(ε)) + log p(g_φ(ε)) − log q(g_φ(ε)|x))] (Eqn. 4)
which can now be estimated by sampling ε i.i.d. from p(ε). As an example, if q(z|x) is a Gaussian N(μ, σ²) and we choose p(ε) to be N(0, 1), then g_φ(ε) = μ + σε.
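As a concrete sketch of the pathwise (reparameterized) estimator, consider a numpy toy where the true gradient is known; the objective f(z) = z² and all variable names are our own illustration, not from the paper:

```python
import numpy as np

# Pathwise (reparameterized) gradient sketch. With z = mu + sigma * eps,
# d/dmu E[f(z)] = E[f'(z) * dz/dmu] and dz/dmu = 1. For the toy choice
# f(z) = z^2 we know E[z^2] = mu^2 + sigma^2, so the true gradient is 2*mu.
mu, sigma = 1.5, 1.0
rng = np.random.default_rng(0)
eps = rng.standard_normal(500_000)   # samples from the unparametrized N(0, 1)
z = mu + sigma * eps                 # deterministic in (mu, sigma) given eps
grad_est = (2 * z).mean()            # Monte Carlo estimate of E[f'(z)]
assert abs(grad_est - 2 * mu) < 0.02
```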
Normally, we sample i.i.d. from q(z|x) and p(ε) to approximate Eqns. 2 and 4, respectively. However, drawing correlated samples could reduce variance in our estimation. Suppose we are given k samples z₁, …, z_k. We could choose a second set of samples z_{k+1}, …, z_{2k} such that the second set is somehow the "opposite" of the first. Then, we can write down a new estimator using both sample sets. For example, Eqn. 2 can be approximated by:

(1/2k) Σᵢ₌₁²ᵏ ∇_θ log p(x|zᵢ) (Eqn. 5)
Assuming each zᵢ is marginally distributed according to q(z|x), Eqn. 5 is unbiased. Moreover, if the integrand is near symmetric, the variance of this new estimator will be cut significantly. But what does "opposite" mean? One idea is to define "opposite" as choosing the second set such that the moments of the combined sample set match the moments of q(z|x). Intuitively, if the first sample mean is too large, then choosing the second set to be smaller can help rebalance the sample mean, reducing first order errors. Similarly, if our first set of samples is too condensed at the mode, then choosing antithetic samples with higher spread can stabilize the sample variance closer to its expectation. However, sampling with particular sample statistics in mind is a difficult challenge. To solve this, we first narrow our scope to Gaussian distributions, and later extend to other distribution families.
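The variance-reduction intuition can be checked numerically. The following numpy sketch (the toy integrand f and all names are ours) compares an i.i.d. estimator against one that pairs each Gaussian sample x with its reflection 2μ − x, a classical antithetic pairing:

```python
import numpy as np

mu = 1.0
def f(x):
    # Toy integrand, near-linear around the mean, so antithetic
    # pairing cancels most of the variance.
    return x + 0.1 * x ** 2

rng = np.random.default_rng(0)

def iid_est(k):
    x = mu + rng.standard_normal(k)
    return f(x).mean()

def anti_est(k):
    # Half as many draws; each is paired with its reflection 2*mu - x.
    x = mu + rng.standard_normal(k // 2)
    return 0.5 * (f(x).mean() + f(2 * mu - x).mean())

iid = np.array([iid_est(8) for _ in range(20_000)])
anti = np.array([anti_est(8) for _ in range(20_000)])
# Both target E[f(x)] = mu + 0.1 * (mu**2 + 1) = 1.2 and are unbiased;
# the antithetic pairing cancels the linear part of f exactly.
assert abs(iid.mean() - 1.2) < 0.02 and abs(anti.mean() - 1.2) < 0.02
assert anti.var() < iid.var()
```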
We present the constrained sampling problem: given a Gaussian distribution with population mean μ and population variance σ², we wish to generate samples x₁, …, xₙ subject to the conditions:

(1/n) Σᵢ₌₁ⁿ xᵢ = m (Eqn. 6)
(1/(n−1)) Σᵢ₌₁ⁿ (xᵢ − m)² = s² (Eqn. 7)

where the constants m and s² are given and represent the desired sample mean and sample variance. In other words, how can we draw samples from the correct marginal distribution conditioned on matching desired sample moments? For example, we might wish to match sample and population moments: m = μ and s² = σ².
Over the past forty years, there have been a handful of solutions. We review the algorithm introduced by Marsaglia and Good (1980). In our experiments, we also reference a second algorithm (Pullin, 1979; Cheng, 1984), which is detailed in the supplement. We chose Marsaglia and Good (1980) due to its simplicity, low computational overhead, and the fact that it makes the fewest random choices among the proposed solutions.
Since x₁, …, xₙ are independent, we can write the joint density function as follows:

p(x₁, …, xₙ) = ∏ᵢ₌₁ⁿ (2πσ²)^{−1/2} exp(−(xᵢ − μ)²/(2σ²)) (Eqn. 8)
We can interpret the mean constraint (Eqn. 6) as a hyperplane and the variance constraint (Eqn. 7) as the surface of a sphere in n dimensions. Let S be the set of all points that satisfy both constraints. Geometrically, we can view S as the intersection between the hyperplane and the n-dimensional sphere, i.e. the surface of a sphere of one lower dimension (a circle when n = 3).
We make the following important observation: the joint density (Eqn. 8) is constant for all points in S. To see this, we can expand the exponent using the two constraints:

Σᵢ₌₁ⁿ (xᵢ − μ)² = Σᵢ₌₁ⁿ (xᵢ − m)² + n(m − μ)² = (n−1)s² + n(m − μ)² (Eqn. 9)

Critically, Eqn. 9 is independent of the individual xᵢ; it depends only on m, s², and n. For any fixed (m, s²), the density of every point in S is therefore the same constant. In other words, the conditional distribution of x₁, …, xₙ given that the constraints hold is the uniform distribution over S. Surprisingly, it does not depend on μ or σ².
Therefore, to solve the constrained sampling problem, we need only be able to sample uniformly from the surface of a unit sphere in n − 1 dimensions.
More precisely, we can generate the required samples x = (x₁, …, xₙ) from a point z on the unit sphere in R^{n−1} centered at the origin via the linear system:

x = m·1 + s·√(n−1)·Bᵀz (Eqn. 11)
B1 = 0, BBᵀ = I (Eqn. 12)

where 1 is an n-dimensional vector of ones and B is an (n−1)-by-n matrix whose rows form an orthonormal basis for the null space of 1ᵀ. This choice satisfies our constraints: since B1 = 0, the sample mean of x is m; and since BBᵀ = I and ‖z‖ = 1, the sum of squared deviations is s²(n−1)‖Bᵀz‖² = (n−1)s².

As z is uniformly distributed over the unit sphere, Eqns. 11 and 12 guarantee that x is uniformly distributed in S. We can generate z by sampling w ~ N(0, I) and setting z = w/‖w‖. As in (Marsaglia and Good, 1980), we set B to RowNormalize(A), where A is a fixed (n−1)-by-n matrix whose rows span the null space of 1ᵀ (see Marsaglia and Good, 1980, for the exact construction).
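The construction above can be sketched in a few lines of numpy. Here `marsaglia_sample` is our illustrative name, and we use a Helmert-style matrix for A rather than the exact matrix from Marsaglia and Good (1980); any A whose rows span the null space of the ones vector works after row normalization:

```python
import numpy as np

def marsaglia_sample(m, s2, n, rng=None):
    # Generate n variates whose sample mean is exactly m and whose sample
    # variance (1/(n-1) normalization) is exactly s2, by mapping a uniform
    # point z on the unit sphere in R^{n-1} through
    #   x = m * 1 + sqrt((n-1) * s2) * B^T z.
    rng = np.random.default_rng() if rng is None else rng
    # Rows of A span the null space of the all-ones vector: row i is
    # (1, ..., 1, -(i+1), 0, ..., 0) with i+1 leading ones (a Helmert-style
    # choice; the original paper specifies its own fixed matrix).
    A = np.zeros((n - 1, n))
    for i in range(n - 1):
        A[i, : i + 1] = 1.0
        A[i, i + 1] = -(i + 1.0)
    B = A / np.linalg.norm(A, axis=1, keepdims=True)  # RowNormalize(A)
    z = rng.standard_normal(n - 1)
    z /= np.linalg.norm(z)                            # uniform on unit sphere
    return m + np.sqrt((n - 1) * s2) * (B.T @ z)

x = marsaglia_sample(2.5, 1.3, 8, rng=np.random.default_rng(0))
assert np.isclose(x.mean(), 2.5)           # sample mean matched exactly
assert np.isclose(x.var(ddof=1), 1.3)      # sample variance matched exactly
```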
Given i.i.d. samples z₁, …, z_{n−1} ~ N(0, 1) and desired sample moments m and s², the generated samples x₁, …, xₙ are independent normal variates with a common mean and variance such that their sample mean is exactly m and their sample variance is exactly s².
If we happen to know the population mean μ and variance σ² (as we do in variational inference), we could generate i.i.d. Gaussian variates by sampling m ~ N(μ, σ²/n) and s² such that (n−1)s²/σ² ~ χ²_{n−1}, and passing m, s² to MarsagliaSample. Formally, we have the following corollary to Proposition 1:
For any μ, σ², and n, if m ~ N(μ, σ²/n) and (n−1)s²/σ² ~ χ²_{n−1}, and m and s² are independent, then the generated samples x₁, …, xₙ are independent normal variates sampled from N(μ, σ²).
We might be inclined to use MarsagliaSample to directly generate samples with some fixed deterministic m and s². However, Corollary 1 holds only if the desired sample moments are random variables. If we choose them deterministically, we can no longer guarantee the correct marginal distribution for the samples, thus precluding their use for Monte Carlo estimates. Instead, what we can do is compute m and s² from i.i.d. samples from q, derive antithetic sample moments, and use MarsagliaSample to generate a second set of samples distributed accordingly.
More precisely, given a set of k independent normal variates x₁, …, x_k ~ N(μ, σ²), we would like to generate a new set of normal variates x_{k+1}, …, x_{2k} such that the combined sample moments match the population moments: the combined sample mean matches μ and the combined sample variance matches σ². We call the second set of samples antithetic to the first set.
We compute sample statistics from the first set:

m₁ = (1/k) Σᵢ₌₁ᵏ xᵢ,  s₁² = (1/(k−1)) Σᵢ₌₁ᵏ (xᵢ − m₁)²
The statistics m₁ and s₁² are random variables, specifically m₁ ~ N(μ, σ²/k) and (k−1)s₁²/σ² ~ χ²_{k−1}. Ideally, we would want the second set to come from an "opposing" mean m₂ and variance s₂². To choose m₂ and s₂², we leverage the inverse CDF transform: given the cumulative distribution function (CDF) of a random variable x, denoted F, we can define a uniform variate u = F(x). The antithetic uniform variate is then u' = 1 − u, which upon application of the inverse CDF function is mapped back to a properly distributed antithetic variate x' = F⁻¹(1 − F(x)). Crucially, x and x' have the same marginal distribution, but are not independent.
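As a quick numeric check of the inverse CDF transform, here is a sketch using Python's `statistics.NormalDist` (our choice for illustration):

```python
from statistics import NormalDist

# Antithetic variate via the inverse CDF transform: u = F(x), x' = F^{-1}(1 - u).
mu, sigma = 1.0, 2.0
d = NormalDist(mu, sigma)
x = 3.7
x_anti = d.inv_cdf(1 - d.cdf(x))
# For a symmetric distribution such as the Gaussian, this reduces to
# reflection about the mean: x' = 2 * mu - x.
assert abs(x_anti - (2 * mu - x)) < 1e-6
```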
Let F_N represent the Gaussian CDF of m₁ and F_χ the chi-squared CDF of (k−1)s₁²/σ². We can derive m₂ and s₂² as:

m₂ = F_N⁻¹(1 − F_N(m₁)) (Eqn. 14)
s₂² = (σ²/(k−1)) F_χ⁻¹(1 − F_χ((k−1)s₁²/σ²)) (Eqn. 15)
Crucially, m₂ and s₂² chosen this way are random variables with the correct marginal distributions, i.e., m₂ ~ N(μ, σ²/k) and (k−1)s₂²/σ² ~ χ²_{k−1}. Knowing m₂ and s₂², it is straightforward to generate antithetic samples with MarsagliaSample. We summarize the algorithm in Alg. 4 and its properties in Prop. 2.
Given k i.i.d. samples x₁, …, x_k ~ N(μ, σ²) and i.i.d. samples z₁, …, z_{k−1} ~ N(0, 1), let x_{k+1}, …, x_{2k} be the generated antithetic samples. Then:
x_{k+1}, …, x_{2k} are independent normal variates sampled from N(μ, σ²).
The combined sample mean of x₁, …, x_{2k} is equal to the population mean μ.
The sample variance of x_{k+1}, …, x_{2k} is anticorrelated with the sample variance of x₁, …, x_k.
The first property follows immediately from Corollary 1, as by construction m₂ and s₂² have the correct marginal distributions as noted above. Simple algebra shows that the inverse Gaussian CDF transform simplifies to m₂ = 2μ − m₁, giving the desired relationship (m₁ + m₂)/2 = μ. The third property follows from the definition in Eqn. 15. ∎
Since both sets of samples share the same (correct) marginal distribution, the combined set x₁, …, x_{2k} can be used to obtain unbiased Monte Carlo estimates.
Given k i.i.d. samples x₁, …, x_k ~ N(μ, σ²) and i.i.d. samples z₁, …, z_{k−1} ~ N(0, 1), let x_{k+1}, …, x_{2k} be the generated antithetic samples. Then (1/2k) Σᵢ₌₁²ᵏ f(xᵢ) is an unbiased estimator of E_{x∼N(μ,σ²)}[f(x)].
Let p denote the joint distribution of all 2k samples. Note the two groups of samples, x₁, …, x_k and x_{k+1}, …, x_{2k}, are not independent. However,

E_p[(1/2k) Σᵢ f(xᵢ)] = (1/2k) Σᵢ E_p[f(xᵢ)] = E_{x∼N(μ,σ²)}[f(x)]

because by assumption and Prop. 2, each xᵢ is marginally distributed as N(μ, σ²). ∎
If F_N⁻¹ and F_χ⁻¹ were computable in closed form, we could use Alg. 4 as is, with its good guarantees. On one hand, since m₁ is normally distributed, the inverse CDF transform simplifies to:

m₂ = 2μ − m₁ (Eqn. 16)
However, there is no general closed form expression for F_χ⁻¹. Our options are then to either use a discretized table of probabilities or approximate the inverse CDF. Because we want differentiability, we choose a normal approximation to the chi-squared distribution: that of Hawkins and Wixley (2005), which models the fourth root of a chi-squared variate as Gaussian. We choose it as (P1) it uses an even power, (P2) it contains only one term involving a random variate, and (P3) it is shown to work better for smaller degrees of freedom (smaller sample sizes). We derive a closed form for computing s₂² from s₁² by combining the normal approximation with Eqn. 16. We denote this final transform as the antithetic Hawkins-Wixley transform:
where v = k − 1 is the degree of freedom. Therefore, if we set m₂ as in Eqn. 16 and s₂² as in Eqn. 17, we can generate the antithetic set; the derivation of Eqn. 17 can be found in the supplementary material.
P1 is important as odd-degree approximations, e.g. Wilson and Hilferty (1931), can result in a negative value for s₂² under small v. P2 is required to derive a closed form, as most linear combinations do not factor. P3 is desirable for variational inference, where sample sizes are small.
To update Alg. 4, we swap the fourth line with Eqn. 16 and the sixth line with Eqn. 17. The first property in Prop. 2 and therefore also Prop. 3 do not hold anymore: the approximate AntitheticSample has bias that depends on the approximation error in Eqn. 17. In practice, we find the approximate AntitheticSample to be effective. From now on, when we refer to AntitheticSample, we refer to the approximate version. See Fig. 2 for an illustration of the impact of antithetics: sampling i.i.d. could result in skewed sample distributions that over-emphasize the mode or tails, especially when drawing very few samples. Including antithetic samples helps to “stabilize” the sample distribution to be closer to the true distribution.
Fig. 2: For each subplot, we plot the true distribution (solid black line), a kernel density estimate (KDE) of the empirical distribution of i.i.d. samples (dotted blue line), and a KDE of the empirical distribution (dashed red line) of i.i.d. samples pooled with antithetic samples. This snapshot was taken from a data point after the first epoch of training an AntiVAE on dynamic MNIST.
Marsaglia and Good (1980)'s algorithm is restricted to distribution families that can be transformed to a unit sphere (primarily Gaussians), as are many similar algorithms (Cheng, 1984; Pullin, 1979). However, we can explore "generalizing" AntitheticSample to a wider class of families by first antithetically sampling from a Gaussian distribution, then transforming those samples into samples from another family using a deterministic function g. Although we are not explicitly matching the moments of the derived distributions, we expect that transformations of more representative samples in an initial distribution may remain more representative in the transformed distribution. We now discuss a few candidates for g.
Devroye (1996) presents a large suite of "one line" transformations between distributions. We focus on three examples starting from a Gaussian: to (1) Log Normal, (2) Exponential, and (3) Cauchy. Many additional transformations (e.g. to Pareto, Gumbel, Weibull, etc.) can be used in a similar fashion. Let F_X refer to the CDF of a random variable X. See the supplementary material for derivations.
Exponential: Let ε ~ N(0, 1), where λ is a learnable rate parameter. Then Φ(ε) ~ Uniform(0, 1), with Φ the standard Gaussian CDF. Thus X = −log(Φ(ε))/λ, where X ~ Exponential(λ).
Cauchy: Let ε ~ N(0, 1), where x₀ (location) and γ (scale) are learnable parameters. Then Φ(ε) ~ Uniform(0, 1). Given u = Φ(ε), we define X = x₀ + γ·tan(π(u − 1/2)), where X ~ Cauchy(x₀, γ).
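These one-liners can be sketched in numpy as follows (function names are ours; Φ is implemented with Python's `statistics.NormalDist` for illustration):

```python
import numpy as np
from statistics import NormalDist

phi = np.vectorize(NormalDist().cdf)  # standard Gaussian CDF, elementwise

def gaussian_to_exponential(eps, lam):
    # One-liner: U = phi(eps) is Uniform(0,1), so -log(U)/lam ~ Exponential(lam)
    return -np.log(phi(eps)) / lam

def gaussian_to_cauchy(eps, x0, gamma):
    # One-liner: apply the Cauchy inverse CDF to U = phi(eps)
    return x0 + gamma * np.tan(np.pi * (phi(eps) - 0.5))

rng = np.random.default_rng(0)
eps = rng.standard_normal(100_000)
x = gaussian_to_exponential(eps, lam=2.0)        # Exponential(2) has mean 1/2
assert abs(x.mean() - 0.5) < 0.01
c = gaussian_to_cauchy(eps, x0=-1.0, gamma=0.5)  # Cauchy(-1, 0.5) has median -1
assert abs(np.median(c) + 1.0) < 0.05
```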
One-liners are an example of a simple flow where we know how to score the transformed sample. If we want more flexible distributions, we can apply normalizing flows (NF). A normalizing flow (Rezende and Mohamed, 2015) applies a sequence of invertible transformations to samples from a simple distribution, leaving the result as a sample from a complex distribution. A common normalizing flow is a linear-time transformation: f(z) = z + u·h(wᵀz + b), where u, w, and b are learnable parameters and h is a non-linearity. In variational inference, flows enable us to parameterize a wider set of posterior families.
We can also achieve flexible posteriors using volume-preserving flows (VPF), of which Tomczak and Welling (2016) introduced the Householder transformation: f(z) = (I − 2vvᵀ/‖v‖²)z, where v is a trainable parameter. Critically, the Jacobian determinant is 1.
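A minimal numpy sketch of the Householder transformation (`householder_flow` is our illustrative name):

```python
import numpy as np

def householder_flow(z, v):
    # H z with H = I - 2 v v^T / ||v||^2. H is orthogonal (a reflection),
    # so |det H| = 1 and the transformation is volume preserving.
    v = v / np.linalg.norm(v)
    return z - 2.0 * v * (v @ z)

rng = np.random.default_rng(1)
z = rng.standard_normal(5)
v = rng.standard_normal(5)
z2 = householder_flow(z, v)
# A reflection preserves norms (and hence volumes).
assert np.isclose(np.linalg.norm(z2), np.linalg.norm(z))
```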
Finally, we can use AntitheticSample to approximate the ELBO for variational inference.
For a given observed x, we write the antithetic gradient estimators analogously to Eqns. 2 and 4, averaging over 2k samples: the first k samples are drawn i.i.d. from q(z|x) via reparameterization, and the remaining k are produced by AntitheticSample from the first set. Optionally, each sample is further passed through Transform, where Transform denotes any of the sample transformation(s) above with its own learnable parameters.
Alternative variational bounds have been considered recently, including an importance-weighted estimator of the ELBO, or IWAE (Burda et al., 2015). Antithetic sampling can be applied in a similar fashion.
Importantly, AntitheticSample is a special instance of a reparameterization estimator. Aside from samples from a parameter-less distribution (unit Gaussian), AntitheticSample is completely deterministic, meaning that it is differentiable with respect to the population moments μ and σ² by any modern auto-differentiation library. Allowing backpropagation through AntitheticSample means that the parameters are aware of the sampling strategy. Thus, including antithetics will change the optimization trajectory, resulting in a different variational posterior than if we had used i.i.d. samples alone. In Sec. 8, we show that most of the benefit of differentiable antithetic sampling comes from being differentiable.
Alg. 3 summarizes inference in a VAE using differentiable antithetic sampling (denoted AntiVAE; for some experiments, we use Cheng's algorithm instead of Marsaglia's, which we refer to as AntiVAE (Cheng)). To the best of our knowledge, the application of antithetic sampling to stochastic optimization, and especially to variational inference, is novel. Both the application of (Marsaglia and Good, 1980) to drawing antithetics and the extension of AntitheticSample to other distribution families by transformation are novel. This is also the first instance of differentiating through an antithetic sample generator.
A comparison of test log likelihoods over 500 epochs between VAE and AntiVAE. Transforming samples to match moments has different degrees of effectiveness depending on the data domain. However, we find that the test ELBO with AntiVAE is almost always greater than or equal to that of the VAE. This behavior is not sensitive to hyperparameters, e.g. learning rate or MLP hidden dimension. For each subplot, we plot from epoch 20 to 500. We cannot resample observations in Caltech101, leading to overfitting.
| Model | stat. MNIST | dyn. MNIST | FashionMNIST | Omniglot | Caltech | Frey | Hist. |
We compare performance of the VAE and AntiVAE on seven image datasets: static MNIST (Larochelle and Murray, 2011), dynamic MNIST (LeCun et al., 1998), FashionMNIST (Xiao et al., 2017), OMNIGLOT (Lake et al., 2015), Caltech 101 Silhouettes (Marlin et al., 2010), Frey Faces (https://cs.nyu.edu/~roweis/data.html), and Histopathology patches (Tomczak and Welling, 2016). See supplement for details.
In both VAE and AntiVAE, the encoder and decoder are two-layer MLPs with 300 hidden units, Xavier initialization (Glorot and Bengio, 2010), and ReLU activations. By default, we draw k = 8 samples (i.e. 4 antithetic samples) and optimize either the ELBO or IWAE. For grayscale images, the decoder parameterizes a discretized logistic distribution as in (Kingma et al., 2016; Tomczak and Welling, 2017). The log variance from the decoder is clamped between −4.5 and 0.0 (Tomczak and Welling, 2017). We use Adam (Kingma and Ba, 2014) with a fixed learning rate and a mini-batch size of 128. We train for 500 epochs with a random seed of 1. Test marginal log likelihoods are estimated via importance sampling using 100 i.i.d. samples.
AntiVAE consistently achieves higher log likelihoods, usually by a margin of 2 to 5 log units. With FashionMNIST and Histopathology, the margin grows to as much as 30 log units. In the 3 cases where AntiVAE performs worse than VAE, the log-likelihoods are almost equal (within 1 log unit). In Fig. 2(b), we see a case where, even when the final performance is equivalent, AntiVAE learns faster.
We find similar behavior using the tighter IWAE bound. A better sampling strategy is effective regardless of the choice of objective.
The positive effect of antithetics extends to other posterior families. With one liners and both flow families, we see improvements of up to 25 log units, averaging around 2.
Fig. 4a illustrates that as the number of samples grows, posterior samples will match the true moments of q(z|x) regardless of the sampling strategy. But as the number of samples shrinks, the effectiveness of antithetics grows quickly. We expect the best performance at small (but not too small) k, where the normal approximation (Eqn. 17) is decent and the value of antithetics is high.
Fig. 4b illustrates that the importance of sampling strategy increases as the dimensionality grows due to an exponential explosion in the volume of the sample space. With higher dimensionality, we find antithetic sampling to be more effective.
From Figs. 3(c) and 3(d), we see that the majority of the improvement from antithetic sampling relies on differentiating through AntitheticSample. This is sensible, as the model can adjust its parameters if it is aware of AntitheticSample, likely leading to better optima. However, even if we do not backpropagate through antithetic sampling (i.e. draw antithetic samples from N(0, 1) followed by standard reparameterization), we still find modest improvement over i.i.d. sampling.
AntitheticSample can be vectorized resulting in low overhead. We measure an average 22.8%, or 0.004 sec., increase in wallclock time for a forward and backward pass when adding in antithetics. Compared to the linear increase in runtime with using more samples, this cost is negligible. Cheng’s algorithm is more expensive, resulting in a 46.3% increase in runtime.
It is straightforward for users to swap i.i.d. sampling for antithetics. (We will release code in Pyro and PyTorch after review.)
We presented a differentiable antithetic sampler for variance reduction in stochastic variational inference (SVI). We showcased its practical benefits for the VAE on seven datasets and generalized it to a wider family of posteriors using sample transformations. Of all the customizable parts to SVI, we find the sampling strategy to be an unexplored but fruitful avenue. In the future, we hope to research applications to reinforcement learning using pathwise derivatives.
Inverse distributions and independent gamma-distributed products of random variables. Biometrika, 50(3/4):505–508, 1963.
A normal approximation for the chi-square distribution. Computational Statistics & Data Analysis, 48(4):803–808, 2005.
Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 249–256, 2010.
Inductive principles for restricted Boltzmann machine learning. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 509–516, 2010.
We provide details on preprocessing for the datasets used in the experiments in the main text. In total, we tested AntiVAE on seven datasets: static MNIST, dynamic MNIST, FashionMNIST, OMNIGLOT, Caltech 101 Silhouettes, Frey Faces, and Histopathology patches. As in previous literature, static MNIST uses a fixed binarization of images, whereas dynamic MNIST resamples images from the training dataset at each minibatch. In dynamic MNIST, the validation and test sets have fixed binarization. We do an identical dynamic resampling procedure for OMNIGLOT. Caltech101 is given as binary data, so we cannot resample at training time, which we find to cause overfitting on the test set. For grayscale images that cannot be binarized, we would like to parameterize the generative model as a Gaussian distribution. In practice, we found this choice to cause over-prioritization of the reconstruction term, essentially causing the VAE to behave like a regular autoencoder. Instead, we find that a logistic distribution over the 256 grayscale domain avoids this failure mode. We use the default variance constraints as in (Tomczak and Welling, 2017). We now provide a brief description to introduce each dataset.
MNIST is a dataset of hand-written digits from 0 to 9, split into 60,000 examples for training and 10,000 for testing. We use 10,000 randomly chosen images from the training set as a validation group.
FashionMNIST: Similar to MNIST, this more difficult dataset contains 28x28 grayscale images of 10 different articles of clothing, e.g. skirts, shoes, shirts, etc. The sizing and splits are identical to MNIST.
OMNIGLOT is a dataset with 1,623 hand-written characters from 50 different alphabets. Unlike the MNIST family of datasets, each character is represented by only 20 images, making this dataset more difficult. The training set is 24,345 examples with 8,070 test images. We again take 10% of the training data as validation.
Caltech 101 Silhouettes contains silhouettes of 101 different object classes in black and white: each image has a filled polygon on a white background. There are 4,100 training images, 2,264 validation datapoints, and 2,307 test examples. Like OMNIGLOT, this task is difficult due to the limited dataset.
Frey Faces is a collection of portrait photos of one individual with varying emotional expressions, for a total of 2,000 images. We use 1,565 for training, 200 validation, and 200 test examples.
Histopathology is a dataset from ten different biopsies of patients with cancer (e.g. lymphoma, leukemia) or anemia. The dataset is originally in color with 336 x 448 pixel images; the data was processed to be 28 x 28 grayscale. The images are split into 6,800 training, 2,000 validation, and 2,000 test images. We refer to (Tomczak and Welling, 2016) for exact details.
All splitting was either given by the dataset owners or decided by a random seed of 1.
To compute the test log-likelihood (in any of the experiments), we use 100 samples to estimate the following importance-weighted bound:

log p(x) ≈ log (1/k) Σᵢ₌₁ᵏ p(x|zᵢ)p(zᵢ)/q(zᵢ|x), zᵢ ~ q(z|x) (Eqn. 20)
Notably, we use i.i.d. (rather than antithetic) samples to estimate Eqn. 20. The final number reported is the average test log likelihood across the test split of the dataset.
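The importance-sampling estimator can be sanity-checked on a toy model where the marginal likelihood is known exactly; the model, proposal, and names below are our own illustration, not from the paper:

```python
import numpy as np

def log_normal(x, mu, var):
    # log density of N(x; mu, var)
    return -0.5 * (np.log(2 * np.pi * var) + (x - mu) ** 2 / var)

# Toy model where the answer is known exactly: p(z) = N(0, 1) and
# p(x|z) = N(z, 1), so p(x) = N(x; 0, 2). The proposal q(z|x) = N(x/2, 1/2)
# is the exact posterior, so the importance weights are constant and the
# estimator recovers log p(x) to float precision.
x = 0.8
rng = np.random.default_rng(0)
z = x / 2 + np.sqrt(0.5) * rng.standard_normal(100)
log_w = log_normal(z, 0.0, 1.0) + log_normal(x, z, 1.0) - log_normal(z, x / 2, 0.5)
est = np.logaddexp.reduce(log_w) - np.log(len(z))  # log (1/k) sum_i w_i
assert abs(est - log_normal(x, 0.0, 2.0)) < 1e-6
```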
We explicitly write out the approximate algorithm. Note the similarity to Alg. 2 in the main text. Here, we refer to this as ApproxAntitheticSample; in the main text, this algorithm is often referred to as AntitheticSample.
To measure runtime, we compute the average wall-time of the forward and backward pass over a single epoch with fixed hyperparameters for VAE and AntiVAE. Namely, we use a minibatch size of 128 and vary the number of samples. The measurements are in seconds using a Titan X GPU with CUDA 9.0. The implementation of the forward pass in PyTorch is vectorized across samples for both VAE and AntiVAE, so the comparison of runtime should be fair. We report the results in Table 2.
To compute the additional cost of antithetic sampling, we divided the average runtimes of AntiVAE by the average runtimes of VAE and took the mean, resulting in a 22.8% increase in running time (about 0.004 seconds). We note that AntiVAE (Cheng) is much more expensive, as it is difficult to vectorize Helmert's transformation.
We provide a step-by-step derivation of the one-liner transformations, namely from Gaussian to Cauchy and Exponential. We skip Log Normal as its formulation from a Gaussian variate is trivial. Below, let ε represent a standard normal variate and let X be a random variable in the desired distribution family.
Exponential: Let X ~ Exponential(λ). Parameter: λ.
We start with X = −log(U)/λ, where U ~ Uniform(0, 1).
Since ε ~ N(0, 1) and Φ(ε) ~ Uniform(0, 1), we can replace U with Φ(ε).
Cauchy: Let X ~ Cauchy(x₀, γ). Parameters: x₀ (location), γ (scale).