1 Introduction
Gradientbased optimization is the workhorse of contemporary machine learning. For deterministic models, the backpropagation algorithm has allowed computing of gradients relative to parameters of large neural network models. Recent work has sought to extend this success to models where some of the hidden variables are stochastic. The
reparameterization trick [14]has enabled efficient training of models that have continuous stochastic hidden variables by allowing computation of unbiased and lowvariance gradients of loss functions. Models with discrete stochastic variables, although attractive from a modeling point of view in terms of their power and interpretability, have proven difficult to optimize using gradientbased optimization. Neither backpropagation nor the reparametrization trick are directly applicable to training discrete variable models. Optimizing discrete variables models using REINFORCE and REINFORCElike algorithms is difficult due to highvariance of these gradient estimators.
In this paper we use ideas from the theory of harmonic analysis of Boolean functions to provide another perspective on the REINFORCE [4] gradient estimator. Harmonic analysis seeks to analyze Boolean functions using Fourier expansions which express Boolean functions as linear combinations of orthonormal basis functions with coefficients called the Fourier coefficients
of the Boolean function. Each Fourier coefficient corresponds to a subset of input coordinates and depends on the product probability distribution imposed on the Boolean cube. We show that REINFORCE estimates singleton Fourier coefficients of a Boolean function. We then introduce the noise operator from harmonic analysis and construct a control variate that can be plugged into existing gradient estimators to further reduce their variance.
The contribution of this paper is twofold.

The first is a reinterpretation of existing variancereducing gradient estimators in terms of the Fourier coefficients of the objective function seen as a function of the stochastic hidden variables. We show that they can be seen as eliminating or reducing terms with irrelevant Fourier coefficients from the Fourier expansion of the objective function.

The second contribution is an unbiased estimator for the gradient of the objective function relative to the parameters of the stochastic binary latent variables. The estimator depends on the properties of the noise operator from Boolean Fourier theory.
We begin with a review of the Fourier analysis of Boolean functions.
2 Background
2.1 Boolean Fourier Analysis
We reiterate some facts from the analysis of Boolean functions. A comprehensive introduction can be found in [2]. We work with Boolean functions . We assume a product probability distribution on the Boolean input, with being the probability of the th coordinate being one, its mean and its variance. Given Boolean functions and , we define an inner product:
We define the norm of :
with the property that if .
2.1.1 Fourier Expansion
Let . For a set , define . The functions are an orthonormal basis for the space of Boolean functions. The expansion of in this basis
is known as the biased Fourier expansion of and the coefficients of this expansion, , are the Fourier coefficients. The cardinality of is the degree or weight of the coefficient. The Fourier expansion of a Boolean function expresses the function as a multilinear polynomial in . It should be noted that the Fourier coefficients of a function depend on both that function and the product probability distribution imposed on the input and changing the input probability distribution changes the Fourier expansion for the same function.
There is also an inverse expansion for the Fourier coefficients
In particular, the degree coefficient is the average of the function under the input distribution: .
2.1.2 Discrete Derivative
The th discrete derivative of is formal derivative of the biased expansion of relative to . This is denoted by and has the Fourier expansion given by
Notice that does not depend on .
2.1.3 Noise Operator
Given and , we say that are correlated if is generated by independently setting each to with probability and a sample from with probability .
We also denote this by . We use this to define the noise operator acting on as the expectation over correlated inputs as follows:
The noise operator has a special action on the functions : it multiplies them by , i.e.,
Similarly, its action on the basis function is to multiply it by :
By linearity of expectation it follows that
3 Related Work
Perhaps the best known example of a gradient estimator for (but not limited to) discrete stochastic hidden variable models is the REINFORCE algorithm [4]. Also known as the likelihood ratio estimator and the score function estimator, this estimator uses the logderivative trick i.e., , to convert the gradient of an expectation with respect to parameters of a probability distribution over which the expectation is computed into an expectation of a term with a gradient of a log probability. That is,
The score function estimator has been applied to optimize the variational lower bound on the loglikelihood with an inference network [14; 16]. In this form, however, the estimator has high variance which leads to slow convergence. A number of subsequent methods have been proposed to deal with this high variance. For continuous stochastic variables [14] propose the reparameterization trick which involves rewriting the function in a form that allows gradients to go through.
Another method to reduce the variance of this estimator, which also works for discrete variables, is to subtract a term , called a control variate, from . If the control variate depends on the samples, , then we must add the analytical expectation of the control variate under to keep estimator unbiased.
A number of control variate schemes have been proposed: NVIL [5] subtracts two baselines from the objective to reduce variance: the first is a constant baseline set to the moving average of the function and the second is an inputdependent baseline computed by a feedforward neural network. Since the baselines do not depend on the samples, the analytical expectation is identically 0. MuProp [6] uses the firstorder Taylor approximation of the function as a baseline. is usually set to the distribution mean which allows gradients to pass through but requires a separate pass through the network. Since the baseline depends on , the term must be added to keep the estimator unbiased. DARN [17] also uses the firstorder Taylor expansion as a baseline but does not add the analytical expectation, making the estimator biased. Another very simple biased estimator is the straightthrough estimator [19] which uses the gradient relative to the sample as that relative to the parameter. Another class of estimators employ multiple samples such as in [15] to construct tighter lowerbounds on the log likelihood and in [18] to construct persample baselines.
The Fourier expansion is widely used in computational learning theory with applications to learning lowdegree functions
[10][11], constantdepth circuits [12], juntas [13].4 REINFORCE Estimates Fourier Coefficients
The following lemma is a slight variant of the MargulisRusso [8; 7; 2] formula. See the appendix for proof.
Lemma 1.
Let be a Boolean function. Then
where is the Fourier coefficient of under the biased expansion.
From this we see that:
(1)  
(2)  
(3) 
5 Variance Reduction in Gradient Estimation
From the above we see that computing gradients of loss functions with respect to parameters of binary latent variables requires computing weight1 Fourier coefficients. Since a Fourier coefficient is an expectation, Monte Carlo averaging is frequently used as an estimation technique. Naive averaging however results in an estimator that has high variance. What is the source of this variance? To see this we can use the Fourier expansion as follows. A Fourier coefficient is written
(4)  
(5) 
Since and because of orthonormality, we end up with in expectation. But using a Monte Carlo average to estimate the expectation will get a contribution to its variance coming from the first and third terms. This variance will be high or low if the Fourier coefficients in these terms are high or low respectively. From this perspective, to reduce the variance of a gradient estimator we should attempt to remove those terms with irrelevant Fourier coefficients or ensure that those Fourier coefficients are small in magnitude.
In the following we look at some gradient estimators from the literature with this view.
5.1 StraightThrough Estimator
The straightthrough estimator [19] is a simple but biased estimator that computes that derivative with respect to a sample and uses it as an estimate of the derivative with respect to the probability parameter .
We can view this estimator as approximating the discrete derivative of or equivalently the derivative of the Fourier expansion of . The discrete derivative of can be expanded as follows.
Here denotes without the th coordinate. If we were able to compute we would have an unbiased estimator of the gradient since However, since we normally have available to us as a neural network (and not as a multilinear polynomial), we approximate the discrete derivative with the derivative of the function as a neural network.
5.2 Nvil
NVIL [5] subtracts two baselines, a constant baseline and an input dependent baseline from the function before computing its Fourier coefficient. Since the baselines do not depend on samples from the distribution, they can be seen as subtracting the weight0 Fourier coefficient, , from the function. Notice that this does not affect the remaining Fourier coefficients of the resulting function.
5.3 MuProp
The MuProp [6] estimator builds a control variate based on the first order Taylor series expansion as follows.
This can be viewed as an approximation to the following.
The term includes all the terms of that contain and subtracting from removes all terms containing from , reducing the variance in . Once again, since we do not have as a multilinear polynomial, we cannot compute , therefore we approximate it with the gradient of the function as a neural network.
6 Proposed Estimator
We develop an unbiased gradient estimator using the variance reduction properties of the noise operator. The noise operator, , is a smoothing operator that when applied to a functions decays the effect of the higherorder terms of the function. The extent of decay also depends on degree: the higher the degree of a term, the greater is the decaying factor applied to the term. The expected value of a function under the input distribution is unaffected by application of the noise operator i.e., . Therefore the noise operated version of a function is a function with the same expected value but with much reduced variance.
This is intuitively clear from the Fourier expansion of the noise operator with parameter :
Here, we see that the noise operator multiplies the Fourier coefficients of with an exponentially decaying factor in that depends on the degree of the term. We also see that as , and as , . In other words, the variance of goes to with . For intermediate values of , has exponentially small higher degree terms and lower variance than . In fact, we can make the stronger statement that even higher norms of are bounded by the second norm of . This fact is expressed by saying that the noise operator is hypercontractive and is the content of Bonami’s hypercontractivity [9; 2] theorem. One version of the theorem states
Theorem 1 (Hypercontractivity of Noise Operator).
Given a product distribution on a finite probability space where each outcome in the constituent distributions has probability at least and a function defined on such space, for any and , we have that
In particular for appropriate and we have that . Here we use that fact that for , . Squaring and subtracting the square of the expectation , we see that
showing that the variance of the noise operated version of is no more than the variance of .
6.1 Constructing the Control Variate
We can use this fact to construct a control variate in a number of ways. Recall that a control variate for an unbiased estimator is essentially a Boolean function for which all degree1 coefficients are 0. If is any Boolean function then
has all degree1 coefficients 0 and can be used as a control variate.
If is well correlated with , then subtracting the lowvariance term from results in a function with 0 degree1 coefficients with higherorder terms that are still well correlated with . Subtracting this from to give then gives a function with small higherorder terms where the degree1 terms are unaffected. A correlated function can be learnt by using a feedforward network and minimizing the mean squared error with .
Another possible way to build the control variate is to use
6.2 Building the Estimator
We built our final estimator used in the experiments upon MuProp. To do so we construct a function that includes the muProp terms, an input dependent baseline and our sample dependent baseline as follows.
The gradient is and we use the single sample estimate of the expectation where is estimated using a single correlated sample.
MNIST Generative Modeling  

Model  NVIL  MuProp  MuProp+Baseline 
200784  112.8  111.9  111.8 
200200784  100.55  99.14  98.97 
200200200784  96.55  96.1  95.77 
Omniglot Generative Modeling  

Model  NVIL  MuProp  MuProp+Baseline 
200784  118.7  118.3  118.4 
200200784  110.8  110.2  109.44 
200200200784  108.8  107.94  107.76 
7 Experiments
To test the variance reduction properties of our proposed estimator, we compared the estimator against NVIL and the MuProp estimator with an input dependent baseline. We built our implementation upon the code base made available by [1]^{1}^{1}1TensorFlow code available at: https://github.com/alpz/fourierREINFORCE
. In our experiments we used sigmoid belief nets with one, two and three layers. Each stochastic layer in the network is 200 units wide. The input dependent baseline is a feedforward newtork with a single layer of 100 tanh units and the sampledependent baseline has two 100 unit layers of either tanh or relu units. The datasets are the statically binarized MNIST digit and Omniglot character datasets. For the optimization we used SGD with a momentum of 0.9 and a minibatch size of 24. For 3 layer models on Omniglot we found SGD to be slow to converge regardless of the gradient estimator; for those we performed the optimization using Adam. The gradient variance is estimated using an exponential moving average. We use a constant
value of 0.5. The experiments were run with learning rates in and with the best performing result chosen. We used a different learning rate for the baseline networks which was set to times the learning rate for the model.7.1 Generative Modeling with Sigmoid Belief Nets
We train generative models using the autoencoding variational Bayes framework of
[14]. The framework simplifies the training of sigmoid belief networks using an inference network to generate samples from an approximate variational posterior distribution. The inference network is parameterized as a feedforward network with a structure that is the reverse of the model. We optimize the single sample variational lower bound (ELBO) of the loglikelihood.Here is the approximate variational posterior and is a data sample.
The final training ELBO results for MNIST and Omniglot after 2,000,000 steps are given in tables 2 and 2. The training ELBO for 2layer and 3layer models for MNIST is plotted in figure 1. The estimated log of gradient variance is plotted in figure 2. As can be seen from the last figure we get a substantial improvement in gradient variance for the 2layer model and the difference becomes significant early in the training when model depth is increased. This can also be seen from the ELBO plots where the 3layer ELBO diverges from the MuProp ELBO much earlier than for the 2layer model.
7.1.1 Multiple Samples
The results above were performed using a single sample to estimate . We also performed experiments using more samples to estimate . Notice that this still uses only a single sample from the model. This resulted in only very slight reduction in gradient variance in the initial phase of training. This can be seen in figure 3
for the 2layer model on MNIST. However since we did not optimize over hyperparameters for this, it still might be possible to improve these results for multiple samples to compute
.References
 [1] Tucker, George, et al. "REBAR: Lowvariance, unbiased gradient estimates for discrete latent variable models." Advances in Neural Information Processing Systems. 2017.
 [2] O’Donnell, Ryan. Analysis of boolean functions. Cambridge University Press, 2014.
 [3] Raiko, Tapani, et al. "Techniques for learning binary stochastic feedforward neural networks." arXiv preprint arXiv:1406.2989 (2014).

[4]
Williams, Ronald J. "Simple statistical gradientfollowing algorithms for connectionist reinforcement learning." Reinforcement Learning. Springer, Boston, MA, 1992. 532.
 [5] Mnih, Andriy, and Karol Gregor. "Neural Variational Inference and Learning in Belief Networks." International Conference on Machine Learning. 2014.
 [6] Gu, Shixiang, et al. "MuProp: Unbiased backpropagation for stochastic neural networks." arXiv preprint arXiv:1511.05176 (2015).

[7]
Russo, Lucio. "An approximate zeroone law." Probability Theory and Related Fields 61.1 (1982): 129139.
 [8] Margulis, Grigorii Aleksandrovich. "Probabilistic characteristics of graphs with large connectivity." Problemy peredachi informatsii 10.2 (1974): 101108.
 [9] Bonami, Aline. "Étude des coefficients de Fourier des fonctions de Lp (G)." Ann. Inst. Fourier (Grenoble) 20.2 (1970): 335402.

[10]
Mansour, Yishay. "Learning Boolean functions via the Fourier transform." Theoretical advances in neural computation and learning. Springer, Boston, MA, 1994. 391424.
 [11] Kushilevitz, Eyal, and Yishay Mansour. "Learning decision trees using the Fourier spectrum." SIAM Journal on Computing 22.6 (1993): 13311348.
 [12] Linial, Nathan, Yishay Mansour, and Noam Nisan. "Constant depth circuits, Fourier transform, and learnability." Journal of the ACM (JACM) 40.3 (1993): 607620.

[13]
Mossel, Elchanan, Ryan O’Donnell, and Rocco P. Servedio. "Learning juntas." Proceedings of the thirtyfifth annual ACM symposium on Theory of computing. ACM, 2003.
 [14] Kingma, Diederik P., and Max Welling. "Autoencoding variational bayes." arXiv preprint arXiv:1312.6114 (2013).
 [15] Burda, Yuri, Roger Grosse, and Ruslan Salakhutdinov. "Importance weighted autoencoders." arXiv preprint arXiv:1509.00519 (2015).
 [16] Rezende, Danilo Jimenez, Shakir Mohamed, and Daan Wierstra. "Stochastic backpropagation and approximate inference in deep generative models." arXiv preprint arXiv:1401.4082 (2014).
 [17] Gregor, Karol, et al. "Deep autoregressive networks." arXiv preprint arXiv:1310.8499 (2013).
 [18] Mnih, Andriy, and Danilo J. Rezende. "Variational inference for monte carlo objectives." arXiv preprint arXiv:1602.06725 (2016).
 [19] Bengio, Yoshua, Nicholas Léonard, and Aaron Courville. "Estimating or propagating gradients through stochastic neurons for conditional computation." arXiv preprint arXiv:1308.3432 (2013
Appendix A Proof of Lemma 1
We follow [2, 8.4] in the following proof.
Proof.
We consider as a multilinear polynomial over . Let be the biased Fourier representation of the Boolean function . Then by linearity of expectation
We also have that
Then
where is the Fourier coefficient in the biased representation. Given that , and
, the result follows by the chain rule. ∎
Comments
There are no comments yet.