1 Introduction
There has been significant recent interest in deep reinforcement learning (DRL), which combines reinforcement learning (RL) with powerful function approximators such as neural networks, leading to a wide variety of successful applications ranging from board/video game playing to simulated and real-life robotic control (Silver et al., 2016; Mnih et al., 2013; Schulman et al., 2015a; Levine et al., 2016). One major area of DRL is on-policy optimization (Silver et al., 2016; Schulman et al., 2015a), which progressively improves upon the current policy iterate until a local optimum is found. As on-policy optimization flattens RL into a stochastic optimization problem, unbiased gradient estimation is carried out using REINFORCE or its more stable variants (Williams, 1992; Mnih et al., 2016). In general, however, on-policy gradient estimators suffer from high variance and need many more samples to construct high-quality updates. Prior works have proposed variance reduction using variants of control variates (Gu et al., 2015, 2017; Grathwohl et al., 2017; Wu et al., 2018). However, recently Tucker et al. (2018)
cast doubt on some of the aforementioned variance reduction techniques by showing that their implementations deviate from the methods proposed in the corresponding papers, as we detail in the related work below. In other cases, biased gradients are deliberately constructed to heuristically compute trust region policy updates (Schulman et al., 2017b), which also achieve state-of-the-art performance on a wide range of tasks. In this work, we consider an unbiased policy gradient estimator based on the Augment-REINFORCE-Merge (ARM) gradient estimator for binary latent variable models (Yin & Zhou, 2019). We design a practical on-policy algorithm for RL tasks with a binary action space, and show that the theoretical guarantee for variance reduction in Yin & Zhou (2019) carries over straightforwardly to the policy gradient setting. The proposed ARM policy gradient estimator is a plug-in alternative to the REINFORCE gradient estimator and its variants (Mnih et al., 2016), requiring only minor algorithmic modifications.
The remainder of the paper is organized as follows. In Section 2, we introduce background and related work on RL and the ARM estimator for binary latent variable models. In Section 3, we describe the ARM policy gradient estimator, including the derivation, theoretical guarantees, and onpolicy optimization algorithm. In Section 4, we show via thorough experiments that our proposed estimator consistently outperforms stable baselines, such as the A2C (Mnih et al., 2016) and recently proposed RELAX (Grathwohl et al., 2017) estimators.
2 Background
2.1 Reinforcement Learning
Consider a Markov decision process (MDP) where, at time $t$, the agent is in state $s_t \in \mathcal{S}$, takes action $a_t \in \mathcal{A}$, transitions to the next state $s_{t+1} \sim p(\cdot \mid s_t, a_t)$, and receives an instant reward $r_t = r(s_t, a_t)$. A policy is a mapping $\pi: \mathcal{S} \mapsto \mathcal{P}(\mathcal{A})$, where $\mathcal{P}(\mathcal{A})$ is the space of distributions over the action space $\mathcal{A}$. The objective of RL is to search for a policy $\pi$ such that the expected discounted cumulative reward is maximized,
$$J(\pi) = \mathbb{E}_{\pi}\Big[\sum_{t=0}^{T} \gamma^t r_t\Big], \qquad (1)$$
where $\gamma \in (0, 1)$ is a discount factor and $T$ is the horizon. Let $\pi^\ast = \arg\max_\pi J(\pi)$ be the optimal policy. For convenience, we define under policy $\pi$ the value function $V^{\pi}(s) = \mathbb{E}_{\pi}[\sum_{t=0}^{T} \gamma^t r_t \mid s_0 = s]$ and the action-value function $Q^{\pi}(s, a) = \mathbb{E}_{\pi}[\sum_{t=0}^{T} \gamma^t r_t \mid s_0 = s, a_0 = a]$. We also define the advantage function as $A^{\pi}(s, a) = Q^{\pi}(s, a) - V^{\pi}(s)$. By construction, we have $\mathbb{E}_{a \sim \pi(\cdot \mid s)}[A^{\pi}(s, a)] = 0$ for all $s$, i.e., the expected advantage under policy $\pi$ is zero.
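A quick numerical check of the zero-expected-advantage identity, as a minimal NumPy sketch (the Q-values and the action probability here are arbitrary illustrative choices, not values from the paper):

```python
import numpy as np

# Toy check of the identity E_{a~pi(.|s)}[A^pi(s, a)] = 0 for a binary action
# space. Q-values and the policy are arbitrary; V is defined as the
# policy-weighted average of Q, so the advantage must average to zero.
rng = np.random.default_rng(0)
q = rng.normal(size=2)            # Q^pi(s, 0), Q^pi(s, 1), arbitrary values
p1 = 0.3                          # pi(a = 1 | s)
probs = np.array([1 - p1, p1])    # distribution over actions {0, 1}
v = probs @ q                     # V^pi(s) = E_a[Q^pi(s, a)]
adv = q - v                       # A^pi(s, a) = Q^pi(s, a) - V^pi(s)
expected_adv = probs @ adv        # should be exactly zero (up to round-off)
print(abs(expected_adv) < 1e-12)
```

This identity is what later allows the advantage of the pseudo action to be recovered from on-policy quantities alone.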
2.2 OnPolicy Optimization
One way to find $\pi^\ast$ is through direct policy search. Consider parameterizing the policy as $\pi_\theta$ with parameter $\theta \in \Theta$, where $\Theta$ is the space of parameters. If the policy class is expressive enough, we can recover $\pi^\ast = \pi_{\theta^\ast}$ with some parameter $\theta^\ast \in \Theta$. In on-policy optimization, we start with a random policy and iteratively improve the policy through gradient updates $\theta_{k+1} = \theta_k + \alpha \nabla_\theta J(\pi_\theta)\big|_{\theta = \theta_k}$ for some learning rate $\alpha > 0$. We can compute the gradient
$$g(\theta) := \mathbb{E}_{\pi_\theta}\Big[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, Q^{\pi_\theta}(s_t, a_t)\Big]. \qquad (2)$$
It is worth noting that $g(\theta)$ is almost but not exactly the REINFORCE gradient of $J(\pi_\theta)$ (Williams, 1992). The approximation is due to the absence of the discount factor $\gamma^t$ in each term of the summation in (2). In recent practice, $g(\theta)$ is used instead of the exact REINFORCE gradient, since the factor $\gamma^t$ aggressively weighs down terms with large $t$ (Schulman et al., 2015a, 2017a; Mnih et al., 2016), which leads to poor empirical performance. In our subsequent derivations, we treat $g(\theta)$ as the standard gradient and we let
$$\hat{g}(\theta) = \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \hat{Q}^{\pi_\theta}(s_t, a_t)$$
be a one-sample unbiased estimate of $g(\theta)$, such that $\mathbb{E}[\hat{g}(\theta)] = g(\theta)$. However, $\hat{g}(\theta)$ generally exhibits very high variance and does not entail stable updates. Actor-critic policy gradients (Mnih et al., 2016) subtract from the original estimator a state-dependent baseline function $b(s_t)$ as a control variate for variance reduction. A near-optimal choice of $b(s_t)$ is the value function $V^{\pi_\theta}(s_t)$, which yields the following one-sample unbiased actor-critic gradient estimator
$$\hat{g}_{\mathrm{A2C}}(\theta) = \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\big(\hat{Q}^{\pi_\theta}(s_t, a_t) - V^{\pi_\theta}(s_t)\big), \qquad (3)$$
where we still have $\mathbb{E}[\hat{g}_{\mathrm{A2C}}(\theta)] = g(\theta)$. We also call (3) the A2C policy gradient estimator (Mnih et al., 2016). In practice, the action-value function is estimated via Monte Carlo samples and the value function is approximated by a parameterized critic $V_\phi$ with parameter $\phi$.
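The estimator in (3) can be sketched in a few lines of NumPy. The sketch below assumes a Bernoulli policy with a linear logit, $\pi_\theta(a = 1 \mid s) = \sigma(\theta^\top s)$ (our illustrative choice, not the paper's architecture), for which the score function has the closed form $(a - \sigma(\theta^\top s))\, s$:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def a2c_grad_sample(theta, states, actions, advantages):
    """One-sample A2C-style gradient for a linear-logit Bernoulli policy
    pi_theta(a=1|s) = sigmoid(theta . s):  sum_t grad log pi(a_t|s_t) * A_t.
    For a in {0, 1}, grad_theta log pi(a|s) = (a - sigmoid(theta . s)) * s."""
    g = np.zeros_like(theta)
    for s, a, adv in zip(states, actions, advantages):
        p = sigmoid(theta @ s)
        g += (a - p) * s * adv
    return g

theta = np.array([0.5, -0.2])
states = [np.array([1.0, 2.0]), np.array([0.5, -1.0])]
actions = [1, 0]
advantages = [0.7, -0.3]   # placeholder advantage estimates
g = a2c_grad_sample(theta, states, actions, advantages)
print(g.shape)
```

In a real implementation the advantages come from Monte Carlo returns minus a learned critic; here they are placeholders.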
2.3 AugmentReinforceMerge Gradient for Binary Random Variable Models
Discrete random variables are ubiquitous in probabilistic generative modeling. For the presentation of subsequent results, we limit our attention to the binary case. Let $z \in \{0, 1\}$ denote a binary random variable such that
$$z \sim \mathrm{Bernoulli}(\sigma(\phi)), \qquad (4)$$
where $\phi$ is the logit of the Bernoulli probability parameter and $\sigma(\phi) = 1/(1 + e^{-\phi})$ is the sigmoid function. For multidimensional distributions, we have a vector of $V$ binary random variables $z = [z_1, \ldots, z_V]$ such that each component $z_v$ follows an independent Bernoulli distribution with logit $\phi_v$, which is denoted as $z \sim \mathrm{Bernoulli}(\sigma(\phi))$. In general, we consider an expected optimization objective over the vector $\phi = [\phi_1, \ldots, \phi_V]$,
$$\mathcal{E}(\phi) = \mathbb{E}_{z \sim \mathrm{Bernoulli}(\sigma(\phi))}[f(z)]. \qquad (5)$$
To optimize (5), we can construct a gradient estimator of $\nabla_\phi \mathcal{E}(\phi)$ for iterative updates. Due to the discrete nature of the variables $z$, the REINFORCE gradient estimator is the naive baseline, but its variance can be too high to be of practical use. The ARM (Augment-REINFORCE-Merge) gradient estimator (Yin & Zhou, 2019) provides the following alternative (see also Theorem 1 in Yin & Zhou (2019)).
Theorem 2.1 (ARM estimator for multivariate binary).
For a vector of $V$ binary random variables $z \sim \mathrm{Bernoulli}(\sigma(\phi))$, the gradient of $\mathcal{E}(\phi) = \mathbb{E}_z[f(z)]$ with respect to $\phi$, the logits of the Bernoulli distributions, can be expressed as
$$\nabla_\phi \mathcal{E}(\phi) = \mathbb{E}_{u \sim \mathrm{Uniform}[0,1]^V}\Big[\big(f(\mathbf{1}[u > \sigma(-\phi)]) - f(\mathbf{1}[u < \sigma(\phi)])\big)\,(u - 1/2)\Big], \qquad (6)$$
where $u = [u_1, \ldots, u_V]$. Here $\mathbf{1}[u > \sigma(-\phi)]$ is a $V$-dimensional vector with the $v$-th component equal to $\mathbf{1}[u_v > \sigma(-\phi_v)]$, where $\mathbf{1}[\cdot]$ is the indicator function.
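Unbiasedness of the estimator in Theorem 2.1 is easy to verify numerically in the scalar case, where the exact gradient is $\sigma(\phi)(1 - \sigma(\phi))(f(1) - f(0))$. A minimal Monte Carlo sketch (the quadratic $f$ and the logit value are our arbitrary choices):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def arm_grad_samples(phi, f, u):
    """ARM estimates of d/dphi E_{z~Bern(sigmoid(phi))}[f(z)] (scalar case):
    (f(1[u > sigmoid(-phi)]) - f(1[u < sigmoid(phi)])) * (u - 1/2)."""
    z1 = (u > sigmoid(-phi)).astype(float)   # first configuration of z
    z2 = (u < sigmoid(phi)).astype(float)    # second configuration of z
    return (f(z1) - f(z2)) * (u - 0.5)

phi = 0.4
f = lambda z: (z - 0.3) ** 2                 # any function of the binary variable
rng = np.random.default_rng(0)
u = rng.uniform(size=200_000)
est = arm_grad_samples(phi, f, u).mean()
p = sigmoid(phi)
exact = p * (1 - p) * (f(1.0) - f(0.0))      # analytic gradient
print(abs(est - exact) < 1e-2)
```

Note that the two indicator configurations share the same uniform draw $u$; this coupling is exactly what yields the variance reduction of the merge step.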
The ARM gradient estimator was originally derived through a sequence of steps in Yin & Zhou (2019): first, an AR estimator is derived from augmenting the variable space (A) and applying REINFORCE (R); then a final merge step (M) is applied to several AR estimators for variance reduction. It was shown that through the merge step, the resulting ARM estimator is equivalent to the original AR estimator combined with an optimal control variate subject to certain constraints, which leads to substantial variance reduction with theoretical guarantees. We refer the readers to Yin & Zhou (2019) for more details.
2.4 Training Stochastic Binary Network
One important application of the ARM gradient estimator is training stochastic binary neural networks. Consider a binary latent variable model with $T$ stochastic hidden layers $z_1, \ldots, z_T$ conditioned on observations $x$. We construct their joint distribution as
$$q(z_1, \ldots, z_T \mid x) = q(z_1 \mid x) \prod_{t=2}^{T} q(z_t \mid z_{t-1}), \qquad (7)$$
where the $z_t$ are binary random variables and the conditional distributions $q(z_t \mid z_{t-1})$ are Bernoulli distributions parameterized by $w = \{w_t\}_{t=1}^{T}$. In general, we construct the following objective in the form of (5),
$$\mathcal{E}(w) = \mathbb{E}_{q(z_1, \ldots, z_T \mid x)}\big[f(z_1, \ldots, z_T)\big], \qquad (8)$$
for some function $f$. In the context of the variational autoencoder (VAE) (Kingma & Welling, 2013), $f$ is the evidence lower bound (ELBO) (Blei et al., 2017). We would like to optimize $\mathcal{E}(w)$ using gradients $\nabla_{w_t} \mathcal{E}(w)$, and this is enabled by the following ARM backpropagation theorem. Proposition 6 in Yin & Zhou (2019) addresses the VAE model, but there is no loss of generality in considering a general function $f$ as in the following theorem.
Theorem 2.2 (ARM Backpropagation).
For a stochastic binary network with $T$ binary hidden layers, let $z_0 = x$ and construct the conditional distributions as
$$q(z_t \mid z_{t-1}) = \mathrm{Bernoulli}\big(\sigma(\mathcal{T}_{w_t}(z_{t-1}))\big), \qquad (9)$$
for some function $\mathcal{T}_{w_t}$. Then the gradient of $\mathcal{E}(w)$ w.r.t. $w_t$ can be expressed as
$$\nabla_{w_t} \mathcal{E}(w) = \mathbb{E}\Big[f_{\Delta}(u_t, z_{1:t-1})\,(u_t - 1/2)^{\top} \nabla_{w_t} \mathcal{T}_{w_t}(z_{t-1})\Big],$$
where $u_t \sim \mathrm{Uniform}[0,1]$ and $f_{\Delta}(u_t, z_{1:t-1})$ is the difference in the expected value of $f$ under the two configurations $z_t = \mathbf{1}[u_t > \sigma(-\mathcal{T}_{w_t}(z_{t-1}))]$ and $z_t = \mathbf{1}[u_t < \sigma(\mathcal{T}_{w_t}(z_{t-1}))]$, which can be estimated via a single Monte Carlo sample.
2.5 Related Work
On-Policy Optimization.
On-policy optimization is driven by policy gradients with function approximation (Sutton et al., 2000). Due to the non-differentiable nature of RL, the REINFORCE gradient estimator (Williams, 1992) is the default policy gradient estimator. In practice, the REINFORCE gradient estimator has very high variance and the updates can become unstable. Recently, Mnih et al. (2016) propose advantage actor-critic (A2C), which reduces the variance of the policy gradient estimator using a value function critic. Further, Schulman et al. (2015b) introduce generalized advantage estimation (GAE), a combination of multi-step returns and a value function critic to trade off bias and variance in the advantage estimation, in order to compute lower-variance downstream policy gradients.
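The GAE recursion mentioned above can be sketched compactly; the following is a minimal NumPy version of the backward pass from Schulman et al. (2015b) (the reward/value arrays are illustrative placeholders):

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimation (Schulman et al., 2015b).
    `values` has length T+1 (includes the bootstrap value for the final state).
    A_t = sum_{l>=0} (gamma*lam)^l * delta_{t+l}, with TD residual
    delta_t = r_t + gamma*V(s_{t+1}) - V(s_t), computed by a backward recursion."""
    T = len(rewards)
    adv = np.zeros(T)
    last = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        last = delta + gamma * lam * last
        adv[t] = last
    return adv

rewards = np.array([1.0, 0.0, 1.0])
values = np.array([0.5, 0.4, 0.6, 0.0])   # includes bootstrap value V(s_T) = 0
adv = gae_advantages(rewards, values)
print(adv.shape)
```

Setting $\lambda = 1$ recovers Monte Carlo returns minus the value baseline; $\lambda = 0$ recovers the one-step TD residual.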
Variance Reduction for Stochastic Gradient Estimators.
Policy gradient estimators are special cases of general stochastic gradient estimation for an objective function written as an expectation. To address the typical high-variance issues of the REINFORCE gradient estimator, prior works have proposed adding control variates (or baseline functions) for variance reduction (Paisley et al., 2012; Ranganath et al., 2014; Gu et al., 2015; Kucukelbir et al., 2017). The reparameterization trick greatly reduces the variance when the variables are continuous and the underlying distribution is reparameterizable (Kingma & Welling, 2013). When the variables are discrete, Maddison et al. (2016) and Jang et al. (2016) introduce a biased yet low-variance gradient estimator based on a continuous relaxation of the discrete variables. More recently, REBAR (Tucker et al., 2017) and RELAX (Grathwohl et al., 2017) construct unbiased gradient estimators by using baseline functions derived from continuous relaxations of the discrete variables, whose parameters need to be estimated online. Yin & Zhou (2019) propose an unbiased estimator motivated as a self-control baseline, and display substantial gains over prior works when the variables are binary. In this work, we borrow ideas from Yin & Zhou (2019) and extend the ARM gradient estimator for binary stochastic networks into the ARM policy gradient for RL.

Variance Reduction for Policy Gradients.
By default, the baseline function for the REINFORCE on-policy gradient estimator is only state-dependent, and the value function critic is typically applied. Gu et al. (2015) propose using Taylor expansions of the value functions as the baseline to construct an unbiased gradient estimator. Grathwohl et al. (2017) and Wu et al. (2018) propose carefully designed action-dependent baselines, which yield unbiased gradient estimators while in theory achieving more substantial variance reduction. Despite their reported success, Tucker et al. (2018) observe that subtle implementation decisions cause the released code to diverge from the unbiased methods presented in the papers: (1) Gu et al. (2015, 2017) achieve reported gains potentially by introducing bias into their advantage estimates; when such bias is removed, they do not outperform baseline methods. (2) Liu et al. (2018) achieve reported gains over state-dependent baselines potentially because the baselines are poorly trained; when properly trained, state-dependent baselines achieve similar results to the proposed action-dependent baselines. (3) Grathwohl et al. (2017) achieve gains potentially due to a bug that leads to different advantage estimators for their proposed RELAX estimator and the baseline; when this bug is fixed, they do not achieve significant gains. In this work, we propose the ARM policy gradient estimator, an unbiased plug-in alternative that consistently outperforms A2C and RELAX on tasks with binary action spaces.
3 Augment-REINFORCE-Merge Policy Gradient
Below we present the derivation of the Augment-REINFORCE-Merge (ARM) policy gradient. The high-level idea is to draw connections between RL and stochastic binary networks: for an RL problem of horizon $T$, we interpret the action sequence $a_0, \ldots, a_{T-1}$ as the stochastic binary hidden layers and derive the gradient estimator as in Theorem 2.2. We consider RL problems with binary action space $\mathcal{A} = \{0, 1\}$.
3.1 Time-Dependent Policy
To make a full analogy to stochastic binary networks with $T$ layers, we consider an RL problem with horizon $T$. We make the policy time-dependent by specifying a different policy $\pi_{\theta_t}$ for each time $t$, with parameter $\theta_t$. We define an RL objective similar to (1),
$$J(\theta_0, \ldots, \theta_{T-1}) = \mathbb{E}\Big[\sum_{t=0}^{T-1} \gamma^t r_t\Big], \qquad (10)$$
where we jointly optimize over all $\theta_t$. To make the connection between (10) and (8) explicit, we observe the following: we can interpret (10) as a special form of (8) by setting the binary hidden variables $z_t$ as the actions $a_t$ and the observation $x$ as the initial state $s_0$. The conditional distribution can be defined as $q(a_t \mid a_{t-1}) = \int \pi_{\theta_t}(a_t \mid s_t)\, p(s_t \mid s_{t-1}, a_{t-1})\, \mathrm{d}s_t$, which consists of the policy $\pi_{\theta_t}$ and the transition dynamics $p$. Finally, we define the objective function $f$ as the cumulative rewards, which depends directly only on the actions (after marginalizing out the states).

Since the action space is binary, we introduce a policy parameterization similar to a stochastic binary network, $\pi_{\theta_t}(a_t = 1 \mid s_t) = \sigma(\mathcal{T}_{\theta_t}(s_t))$, where $\mathcal{T}_{\theta_t}$ is the logit function. For any given $t$, consider the ARM estimator of (10) w.r.t. $\theta_t$ according to Theorem 2.2. To sample $a_t \sim \pi_{\theta_t}(\cdot \mid s_t)$, we can sample a uniform random variable $u_t \sim \mathrm{Uniform}[0, 1]$, then set $a_t = \mathbf{1}[u_t < \sigma(\mathcal{T}_{\theta_t}(s_t))]$. We define the pseudo action as $\tilde{a}_t = \mathbf{1}[u_t > \sigma(-\mathcal{T}_{\theta_t}(s_t))]$. By converting all variables in the binary stochastic network example into their RL counterparts as described above, we can derive the gradient of (10) w.r.t. $\theta_t$ for any given $t$,
$$\nabla_{\theta_t} J = \mathbb{E}\Big[\big(Q_t(s_t, \tilde{a}_t) - Q_t(s_t, a_t)\big)\,(u_t - 1/2)\, \nabla_{\theta_t} \mathcal{T}_{\theta_t}(s_t)\Big], \qquad (11)$$
where $Q_t(s_t, a)$ is defined as the expected cumulative rewards obtained by first executing $a$ at $s_t$ and following the time-dependent policy thereafter. Notice that this is not exactly the same as the action-value function we defined in Section 2.1, since the policy is time-dependent.
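The coupled sampling of the on-policy action and the pseudo action can be sketched directly, since both are deterministic functions of the same uniform draw (function names here are ours; the logit value is an arbitrary example):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def actions_and_pseudo(logit, u):
    """Given the logit T_theta(s_t) and uniform draws u_t, form the on-policy
    action a_t = 1[u_t < sigmoid(logit)] and the ARM pseudo action
    a~_t = 1[u_t > sigmoid(-logit)] from the SAME uniform draws; sharing u_t
    is what couples the two Q evaluations in (11)."""
    a = (u < sigmoid(logit)).astype(float)
    a_pseudo = (u > sigmoid(-logit)).astype(float)
    return a, a_pseudo

rng = np.random.default_rng(0)
u = rng.uniform(size=100_000)
a, a_pseudo = actions_and_pseudo(0.8, u)
# Both are marginally Bernoulli(sigmoid(0.8)); for a positive logit they
# coincide exactly when sigmoid(-logit) < u < sigmoid(logit).
print(abs(a.mean() - sigmoid(0.8)) < 0.01,
      abs(a_pseudo.mean() - sigmoid(0.8)) < 0.01)
```

Note that the pseudo action is never executed in the environment; it only enters the gradient computation.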
3.2 Stationary Policy
In practice we are interested in a stationary policy, which is invariant over time: $\pi_{\theta_t} = \pi_\theta$ for all $t$. Since we have derived the gradient estimator for a time-dependent policy, the most naive approach is to share weights by letting $\theta_t = \theta$ and linearly combining the per-step gradients (11) across time steps. Since the policy is now stationary, we define $g_t^{\mathrm{ARM}}(\theta)$ as the stationary version of (11), where $\theta_t$ is replaced by $\theta$,
$$g_t^{\mathrm{ARM}}(\theta) = \mathbb{E}\Big[\big(Q^{\pi_\theta}(s_t, \tilde{a}_t) - Q^{\pi_\theta}(s_t, a_t)\big)\,(u_t - 1/2)\, \nabla_\theta \mathcal{T}_\theta(s_t)\Big]. \qquad (12)$$
The combined gradient is
$$g^{\mathrm{ARM}}(\theta) = \sum_{t=0}^{T-1} g_t^{\mathrm{ARM}}(\theta). \qquad (13)$$
We denote by $\hat{g}_t^{\mathrm{ARM}}(\theta)$ and $\hat{g}^{\mathrm{ARM}}(\theta)$ the unbiased sample estimates of $g_t^{\mathrm{ARM}}(\theta)$ and $g^{\mathrm{ARM}}(\theta)$, respectively. Importantly, we can show that $g^{\mathrm{ARM}}(\theta)$ is unbiased, i.e., $g^{\mathrm{ARM}}(\theta) = g(\theta)$, where $g(\theta)$ is the standard gradient defined in Section 2.2. We summarize the result in the following theorem.
Theorem 3.1 (Unbiased ARM policy gradient).
When the ARM policy gradient is constructed as in (13), it is unbiased w.r.t. the true gradient of the RL objective: $g^{\mathrm{ARM}}(\theta) = g(\theta)$.
Proof.
Recall the standard gradient $g(\theta)$ in (2) and write $g(\theta) = \sum_{t=0}^{T-1} g_t(\theta)$, where $g_t(\theta)$ is the term at time step $t$. We show that by combining ARM gradients across all time steps (which correspond to all layers of a stochastic binary network), we can compute an unbiased policy gradient. Recall that the gradient at time $t$ is computed based on (12); we now compute it explicitly. For simplicity, we remove the dependencies on $s_t$ and denote $\phi_t = \mathcal{T}_\theta(s_t)$ and $Q(a) = Q^{\pi_\theta}(s_t, a)$. Conditional on $s_t$, the per-step ARM term evaluates to
$$\mathbb{E}_{u_t}\big[(Q(\tilde{a}_t) - Q(a_t))(u_t - 1/2)\big]\, \nabla_\theta \phi_t = \sigma(\phi_t)\big(1 - \sigma(\phi_t)\big)\big(Q(1) - Q(0)\big)\, \nabla_\theta \phi_t. \qquad (14)$$
We also explicitly write down the standard gradient at time step $t$ with the same notation,
$$\mathbb{E}_{a_t}\big[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, Q(a_t)\big] = \nabla_\theta\big[\sigma(\phi_t) Q(1) + (1 - \sigma(\phi_t)) Q(0)\big] = \sigma(\phi_t)\big(1 - \sigma(\phi_t)\big)\big(Q(1) - Q(0)\big)\, \nabla_\theta \phi_t.$$
We see that, conditional on the same state $s_t$, the ARM policy gradient sample estimate at time step $t$ has its expectation equal to the standard gradient at time $t$. To complete the proof, we marginalize out $s_t$ via the state visitation probability under the current policy and sum over time steps: $g^{\mathrm{ARM}}(\theta) = \sum_t g_t(\theta) = g(\theta)$. ∎
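Assembling the combined gradient (13) from per-step quantities is then mechanical. A minimal sketch, assuming a linear-logit policy $\pi_\theta(a = 1 \mid s) = \sigma(\theta^\top s)$ (our illustrative choice, for which the logit gradient is just $s_t$) and externally supplied Q-differences:

```python
import numpy as np

def arm_policy_grad(theta, states, us, q_diffs):
    """Sample estimate of the combined ARM policy gradient (13) for a
    linear-logit policy. Each per-step term follows (12):
    (Q(s_t, a~_t) - Q(s_t, a_t)) * (u_t - 1/2) * grad_theta(logit),
    where grad_theta(logit) = s_t for a linear logit."""
    g = np.zeros_like(theta)
    for s, u, dq in zip(states, us, q_diffs):
        g += dq * (u - 0.5) * s
    return g

theta = np.zeros(3)
states = [np.array([1.0, 0.0, 2.0]), np.array([0.0, 1.0, -1.0])]
us = [0.7, 0.2]
q_diffs = [0.5, 0.0]   # second step: pseudo action equals on-policy action
g = arm_policy_grad(theta, states, us, q_diffs)
print(g)
```

Note that steps where the pseudo action coincides with the on-policy action contribute exactly zero, which is the source of the stability discussed in Section 3.3.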
3.3 Variance Reduction for ARM Policy Gradient
Here we compare the variance of the ARM policy gradient estimator $\hat{g}^{\mathrm{ARM}}(\theta)$ against the REINFORCE gradient estimator $\hat{g}(\theta)$. Analyzing the variance of the full gradient is very challenging, since we need to account for the covariance between gradient components across time steps. We settle for analyzing the variance of the per-time-step components $\hat{g}_t^{\mathrm{ARM}}(\theta)$ and $\hat{g}_t(\theta)$ for any $t$. A similar analysis has been applied in Fujita & Maeda (2018).

To simplify the analysis, we assume that all action-value functions (or advantage functions) can be obtained exactly, i.e., we do not consider the additional bias and variance introduced by advantage function estimation. Conditional on a given $s_t$, we can bound the conditional variance of the ARM term using results from Yin & Zhou (2019) (15). Since we have established that both estimators have the same conditional expectation, we can relate their total variances via the variance decomposition formula (16). We cannot obtain a tighter bound without making further assumptions on the relative scales of the variances and the expectations. Still, we can show that the ARM policy gradient estimator can achieve potentially much smaller variance than the standard gradient estimator, which entails faster policy optimization in practice.
We provide an intuition for why the ARM policy gradient estimator can achieve substantial variance reduction and stabilize training in practice. In Figure 1, we show the running average of $\mathbf{1}[\tilde{a}_t \neq a_t]$, i.e., how often the on-policy actions and pseudo actions differ, as training progresses. We show two settings of advantage estimation (left: A2C; right: GAE), which will be detailed in Section 4. We see that as training progresses, the two actions are increasingly less likely to differ, often yielding $\tilde{a}_t = a_t$. In such cases the ARM policy gradient estimator takes the exact value zero. On the contrary, the prior baseline gradient estimators (2), (3) cannot take exact zero values and will cause the parameters to stumble around due to noisy estimates. We provide more detailed discussion in the Appendix.
3.4 Algorithm
Here we present a practical algorithm that applies the ARM policy gradient estimator. The on-policy optimization procedure alternates between collecting on-policy samples using the current policy and computing gradient estimates from these samples for parameter updates (Schulman et al., 2015a, 2017a).
We see that a primary difficulty with computing (12) is that it requires the difference of two action-value functions, $Q^{\pi_\theta}(s_t, \tilde{a}_t) - Q^{\pi_\theta}(s_t, a_t)$. Unless the difference is estimated by additional parameterized critics, in typical on-policy algorithm implementations we only have access to Monte Carlo estimates of the action-value functions corresponding to the on-policy actions. To overcome this, we estimate the difference using the property $\mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}[A^{\pi_\theta}(s, a)] = 0$. Specifically, when $\tilde{a}_t \neq a_t$, the advantage function of the pseudo action can be expressed as
$$A^{\pi_\theta}(s_t, \tilde{a}_t) = -\frac{\pi_\theta(a_t \mid s_t)}{\pi_\theta(\tilde{a}_t \mid s_t)}\, A^{\pi_\theta}(s_t, a_t). \qquad (17)$$
Since the difference of action-value functions is identical to the difference of advantage functions, we have in general
$$Q^{\pi_\theta}(s_t, \tilde{a}_t) - Q^{\pi_\theta}(s_t, a_t) = -\mathbf{1}[\tilde{a}_t \neq a_t]\, \frac{A^{\pi_\theta}(s_t, a_t)}{\pi_\theta(\tilde{a}_t \mid s_t)}. \qquad (18)$$
We can hence estimate the difference in (12) with only on-policy advantage estimates, along with the sampled pseudo action $\tilde{a}_t$ and on-policy action $a_t$. Notice that when $\tilde{a}_t = a_t$, the difference is zero; therefore the gradient is zero with high frequency as the algorithm approaches convergence, which leads to faster convergence and better stability.
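The identity (18) can be checked numerically against directly computed Q-values. A minimal sketch (function name and the toy Q-values/probability are our choices):

```python
import numpy as np

def q_diff_from_advantage(adv_onpolicy, a, a_pseudo, p_pseudo):
    """Q(s, a~) - Q(s, a) recovered from the on-policy advantage alone, as in
    (18): zero when a~ == a, and -A(s, a) / pi(a~ | s) otherwise (binary actions)."""
    if a_pseudo == a:
        return 0.0
    return -adv_onpolicy / p_pseudo

# Numerical check against directly computed Q-values.
q = np.array([0.2, 1.1])      # Q(s, 0), Q(s, 1), arbitrary values
p1 = 0.7                      # pi(a = 1 | s)
v = (1 - p1) * q[0] + p1 * q[1]
a, a_pseudo = 1, 0            # on-policy action 1, pseudo action 0
adv_a = q[a] - v              # A(s, a) = Q(s, a) - V(s)
direct = q[a_pseudo] - q[a]
via_identity = q_diff_from_advantage(adv_a, a, a_pseudo, p_pseudo=1 - p1)
print(abs(direct - via_identity) < 1e-12)
```

This is why the algorithm never needs to execute or evaluate the pseudo action in the environment.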
We design the on-policy optimization procedure as follows. At each iteration, we have a policy $\pi_\theta$ with parameter $\theta$. We generate on-policy rollouts: at time $t$, we sample a uniform random variable $u_t \sim \mathrm{Uniform}[0, 1]$, then construct the on-policy action $a_t = \mathbf{1}[u_t < \sigma(\mathcal{T}_\theta(s_t))]$ and pseudo action $\tilde{a}_t = \mathbf{1}[u_t > \sigma(-\mathcal{T}_\theta(s_t))]$. Only the on-policy actions are executed in the environment, which returns instant rewards $r_t$. We estimate the advantage functions from on-policy samples using techniques from A2C (Mnih et al., 2016) or GAE (Schulman et al., 2015b), and replace the advantages in (12) by these estimates. The details of the advantage estimators are provided in the Appendix. Finally, the difference of the action-value functions can be computed using (18) based purely on on-policy samples. With all the above components, we compute the ARM policy gradient (12) to update the policy parameters. The main algorithm is summarized in Algorithm 1 in the Appendix.
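The full loop can be illustrated end-to-end on a toy one-step problem. This is a deliberately minimal sketch, not the paper's Algorithm 1: the "environment" is a single state with known exact Q-values, and the learning rate and iteration count are arbitrary choices.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy one-step problem with binary action and known Q(s,0)=0, Q(s,1)=1.
# Each iteration mirrors the on-policy procedure: draw u, form the on-policy
# and pseudo actions from the same u, compute the Q-difference, and apply
# the ARM update to the scalar logit theta (gradient ascent).
Q = np.array([0.0, 1.0])
theta, lr = 0.0, 0.5
rng = np.random.default_rng(0)
for _ in range(2000):
    u = rng.uniform()
    a = int(u < sigmoid(theta))          # on-policy action
    a_pseudo = int(u > sigmoid(-theta))  # pseudo action (never "executed")
    g = (Q[a_pseudo] - Q[a]) * (u - 0.5)  # per-step ARM gradient of the logit
    theta += lr * g
print(sigmoid(theta) > 0.9)   # the policy learns to prefer the better action
```

In the actual algorithm the Q-difference would come from on-policy advantage estimates via (18) rather than exact Q-values.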
ARM policy gradient as a plug-in alternative.
We note that despite minor algorithmic differences (e.g., the need to sample pseudo actions $\tilde{a}_t$ along with $a_t$), the ARM policy gradient estimator is a convenient plug-in alternative to other baseline on-policy gradient estimators (2), (3). All the other components of standard on-policy optimization algorithms (e.g., value function baselines) remain the same.
4 Experiments
We aim to address the following questions through the experiments: (1) Does the proposed ARM policy gradient estimator outperform previous policy gradient estimators on binary-action benchmark tasks? (2) How sensitive are these policy gradient estimators to hyperparameters, e.g., the batch size?

To address (1), we compare the ARM policy gradient estimator with the A2C gradient estimator (Mnih et al., 2016) and the RELAX gradient estimator (Grathwohl et al., 2017). We aim to study how advantage estimators affect the quality of the downstream gradient estimators, comparing the applicable gradient estimators both with A2C advantage estimation and with GAE.¹ We evaluate on-policy optimization with the various gradient estimators on benchmark tasks with binary action space: for each policy gradient estimator, we train the policy for a fixed number of time steps and across 5 random seeds. Each curve in the plots below shows the mean performance, with the shaded area showing the standard deviation across seeds. In Figures 3 and 4, the x-axis shows the number of time steps at training time and the y-axis shows the performance. The results are reported in Section 4.1. To address (2), we vary the batch size to assess its effect on the variance of the policy gradient estimators. We also evaluate the estimators' sensitivity to learning rates. The results are reported in Sections 4.2 and 4.3.

¹Here, for the A2C gradient estimator, we just replace the original A2C advantage estimators by GAE estimators and keep all other algorithmic components the same.

Benchmark Environments.
We focus on benchmark environments with binary action space, illustrated in Figure 2. All tasks are simulated by OpenAI Gym and the DeepMind Control Suite (Todorov, 2008; Brockman et al., 2016; Tassa et al., 2018). Some tasks have a binary action space by default. In cases where the action space is a real interval, e.g., for Pendulum, we binarize the action space.
Implementations and Hyperparameters.
All implementations are in TensorFlow (Abadi et al., 2016) and the RL algorithms are based on OpenAI Baselines (Dhariwal et al., 2017).² All policies are optimized with Adam, using the best learning rate selected from a fixed grid. We refer to the original code of RELAX³ but notice potential issues in its implementation, which we discuss in the Appendix. We therefore implement our own version of RELAX (Grathwohl et al., 2017) by modifying OpenAI Baselines (Dhariwal et al., 2017). Recall that on-policy optimization algorithms alternate between collecting batch samples and computing gradient estimates based on these samples; we use the default batch size unless otherwise noted.

²https://github.com/openai/baselines
³https://github.com/wgrathwohl/BackpropThroughTheVoidRL

Policy Architectures.
We parameterize the logit function $\mathcal{T}_\theta$ as a two-layer neural network with the state $s$ as input, applying a ReLU nonlinearity as the activation function in each hidden layer. The output is a scalar logit $\mathcal{T}_\theta(s)$, where $\theta$ consists of the weight matrices and biases of the neural network. For variance reduction, we also parameterize a value function baseline with two hidden layers and ReLU activations. Both the logit function and the value function have a linear activation for the output layer. For the RELAX gradient estimator, we use two parameterized baseline functions, each with two hidden layers.

4.1 Benchmark Comparison
Here we compare the ARM policy gradient estimator with two baseline methods: the A2C gradient estimator (Mnih et al., 2016) and the unbiased RELAX gradient estimator (Grathwohl et al., 2017). We note some critical implementation details. All gradient estimators require advantage estimation; we do not normalize the advantage estimates before computing the policy gradients, as is commonly implemented in Dhariwal et al. (2017). As observed in Tucker et al. (2018), such normalization biases the original gradients for variance reduction, especially for action-dependent baselines such as RELAX (Grathwohl et al., 2017). Since our focus is on unbiased gradient estimators, we remove such normalization.⁴

⁴Since GAE applies a combination of Monte Carlo returns and value function critics, the advantage estimator is still slightly biased in order to achieve small variance.
A2C Advantage Estimator.
A2C advantage estimators are simple combinations of Monte Carlo sampled returns and baseline functions. For A2C and ARM, the baseline is the value function critic, while for RELAX the baseline is parameterized and trained to minimize the squared norm of the gradients computed on minibatches. Consider a generalized RELAX gradient estimator defined as a convex combination of the RELAX and A2C gradient estimators: at one extreme we recover an A2C-style gradient estimator (which is still different from our original A2C gradient estimator, since the baseline functions are parameterized and trained differently), and at the other extreme we recover the original RELAX estimator. In practice, we find that intermediate interpolations tend to significantly outperform the pure RELAX estimator.

In Figure 3, we show the performance of the estimators, along with RELAX under varying interpolation coefficients. We see that the ARM policy gradient estimator outperforms the other baseline estimators on most tasks. The exception is Inverted Pendulum, where all estimators tend to learn slowly; while some RELAX interpolations perform best in terms of mean performance, they do not significantly outperform the others once the standard deviation is accounted for. On the other benchmark tasks, the ARM policy gradient estimator enables significantly faster policy optimization, while the other baselines either become very unstable or learn very slowly.
Generalized Advantage Estimator (GAE).
GAE constructs advantage estimates using a more complex combination of Monte Carlo samples and the baseline function, to better trade off bias and variance. By construction, the baseline function here must be the value function critic, hence we only evaluate the estimators compatible with it.

In Figure 4, we show the comparison. We make several observations: (1) Comparing A2C with GAE (Figure 4) to A2C with the A2C advantage estimator (Figure 3), we see that in most cases GAE speeds up policy optimization significantly. This demonstrates the importance of stable advantage estimation for the downstream policy gradient estimator. A notable exception is MountainCar, where A2C with the A2C advantage estimator performs better. (2) Comparing ARM with A2C in Figure 4, we still see that the ARM estimator significantly outperforms the A2C estimator. For almost all presented tasks, the ARM policy gradient estimator allows for much faster learning and significantly better asymptotic performance.
4.2 Effect of Batch Size
On-policy gradient estimators are plagued by low sample efficiency, and we typically need many samples to construct a high-quality gradient estimate. We vary the batch size for each iteration and evaluate the final performance of the gradient estimators. Here, we use GAE as the advantage estimator. We fix the number of iterations (gradient updates) for training, obtained by dividing the default total number of training steps by the default batch size. Under this setting, we expect the variance of the policy gradients to increase with decreasing batch size, and the final performance to degrade accordingly.
In Figure 5, we show the performance of policies trained via the gradient estimators. The x-axis shows the batch size, while the y-axis shows the cumulative returns of the last 10 training iterations (with standard deviation across 5 seeds). As expected, the final performance generally improves with larger batch sizes. Across all presented tasks, the performance of the ARM gradient estimator significantly dominates that of the A2C gradient estimator. This is also compatible with the results in Figures 3 and 4, where the ARM policy gradient estimator achieves convergence within significantly fewer training iterations.
4.3 Sensitivity to Hyperparameters
Figure 6: Gradient estimators under various learning rates and random initializations, shown as quantile plots over distinct hyperparameter configurations. The ARM gradient estimator is generally more robust than the A2C gradient estimator across the presented tasks.

In addition to batch size, we evaluate the policy gradient estimators' sensitivity to hyperparameters such as the learning rate and the random initialization of parameters. In the following setting, we sample the log learning rate and the parameter initialization (random seed) uniformly at random. We train policies under each sampled hyperparameter configuration for a fixed number of steps and record the performance of the last iterations. For each policy gradient estimator, we sample distinct hyperparameter configurations and plot the quantiles of their performance in Figure 6. In general, we see that the ARM gradient estimator is more robust than the A2C gradient estimator across all presented tasks.
5 Conclusion
We propose the ARM policy gradient estimator as a convenient low-variance plug-in alternative to prior baseline on-policy gradient estimators, with a simple on-policy optimization algorithm for tasks with binary action space. We leave the extension to more general discrete action spaces as exciting future work.
References

Abadi et al. (2016)
Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey
Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al.
Tensorflow: a system for large-scale machine learning.
In OSDI, volume 16, pp. 265–283, 2016.
 Blei et al. (2017) David M Blei, Alp Kucukelbir, and Jon D McAuliffe. Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859–877, 2017.
 Brockman et al. (2016) Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. Openai gym. arXiv preprint arXiv:1606.01540, 2016.
 Dhariwal et al. (2017) Prafulla Dhariwal, Christopher Hesse, Oleg Klimov, Alex Nichol, Matthias Plappert, Alec Radford, John Schulman, Szymon Sidor, and Yuhuai Wu. Openai baselines. https://github.com/openai/baselines, 2017.
 Fujita & Maeda (2018) Yasuhiro Fujita and Shinichi Maeda. Clipped action policy gradient. arXiv preprint arXiv:1802.07564, 2018.
 Grathwohl et al. (2017) Will Grathwohl, Dami Choi, Yuhuai Wu, Geoff Roeder, and David Duvenaud. Backpropagation through the void: Optimizing control variates for black-box gradient estimation. arXiv preprint arXiv:1711.00123, 2017.
 Gu et al. (2015) Shixiang Gu, Sergey Levine, Ilya Sutskever, and Andriy Mnih. MuProp: Unbiased backpropagation for stochastic neural networks. arXiv preprint arXiv:1511.05176, 2015.
 Gu et al. (2017) Shixiang Shane Gu, Timothy Lillicrap, Richard E Turner, Zoubin Ghahramani, Bernhard Schölkopf, and Sergey Levine. Interpolated policy gradient: Merging on-policy and off-policy gradient estimation for deep reinforcement learning. In Advances in Neural Information Processing Systems, pp. 3846–3855, 2017.
 Jang et al. (2016) Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with Gumbel-Softmax. arXiv preprint arXiv:1611.01144, 2016.
 Kingma & Welling (2013) Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
 Kucukelbir et al. (2017) Alp Kucukelbir, Dustin Tran, Rajesh Ranganath, Andrew Gelman, and David M. Blei. Automatic differentiation variational inference. Journal of Machine Learning Research, 18(14):1–45, 2017.
 Levine et al. (2016) Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research, 17(1):1334–1373, 2016.
 Liu et al. (2018) Hao Liu, Yihao Feng, Yi Mao, Dengyong Zhou, Jian Peng, and Qiang Liu. Action-dependent control variates for policy optimization via Stein identity. In International Conference on Learning Representations, 2018.
 Maddison et al. (2016) Chris J Maddison, Andriy Mnih, and Yee Whye Teh. The Concrete distribution: A continuous relaxation of discrete random variables. arXiv preprint arXiv:1611.00712, 2016.
 Mnih et al. (2013) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
 Mnih et al. (2016) Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pp. 1928–1937, 2016.
 Paisley et al. (2012) John Paisley, David Blei, and Michael Jordan. Variational Bayesian inference with stochastic search. arXiv preprint arXiv:1206.6430, 2012.
 Ranganath et al. (2014) Rajesh Ranganath, Sean Gerrish, and David Blei. Black box variational inference. In Artificial Intelligence and Statistics, pp. 814–822, 2014.
 Schulman et al. (2015a) John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International Conference on Machine Learning, pp. 1889–1897, 2015a.
 Schulman et al. (2015b) John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438, 2015b.
 Schulman et al. (2017a) John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017a.
 Schulman et al. (2017b) John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017b.
 Silver et al. (2016) David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
 Sutton et al. (2000) Richard S Sutton, David A McAllester, Satinder P Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, pp. 1057–1063, 2000.
 Tassa et al. (2018) Yuval Tassa, Yotam Doron, Alistair Muldal, Tom Erez, Yazhe Li, Diego de Las Casas, David Budden, Abbas Abdolmaleki, Josh Merel, Andrew Lefrancq, et al. DeepMind control suite. arXiv preprint arXiv:1801.00690, 2018.
 Todorov (2008) Emanuel Todorov. General duality between optimal control and estimation. In 2008 47th IEEE Conference on Decision and Control, pp. 4286–4292. IEEE, 2008.
 Tucker et al. (2017) George Tucker, Andriy Mnih, Chris J Maddison, John Lawson, and Jascha Sohl-Dickstein. REBAR: Low-variance, unbiased gradient estimates for discrete latent variable models. In Advances in Neural Information Processing Systems, pp. 2627–2636, 2017.
 Tucker et al. (2018) George Tucker, Surya Bhupatiraju, Shixiang Gu, Richard E Turner, Zoubin Ghahramani, and Sergey Levine. The mirage of action-dependent baselines in reinforcement learning. arXiv preprint arXiv:1802.10031, 2018.
 Williams (1992) Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. In Reinforcement Learning, pp. 5–32. Springer, 1992.
 Wu et al. (2018) Cathy Wu, Aravind Rajeswaran, Yan Duan, Vikash Kumar, Alexandre M Bayen, Sham Kakade, Igor Mordatch, and Pieter Abbeel. Variance reduction for policy gradient with action-dependent factorized baselines. arXiv preprint arXiv:1803.07246, 2018.
 Yin & Zhou (2019) Mingzhang Yin and Mingyuan Zhou. ARM: Augment-REINFORCE-merge gradient for stochastic binary networks. In International Conference on Learning Representations, 2019.
Appendix A Further Experiment Details
a.1 CartPole Environment Setup
The CartPole experiments are defined by an environment parameter $T$, which specifies the maximum reward the agent can achieve (i.e., the agent balances the system for at most $T$ time steps before the episode terminates).
In our main experiments, $T$ increases from CartPole-v0 to CartPole-v3. The difficulty increases with $T$: with large $T$, the agent is less likely to obtain many full trajectories within a single iteration, which makes return-based estimation more difficult; long horizons also make policy optimization harder.
a.2 Advantage Estimator
We consider two popular advantage estimators widely used in on-policy optimization (Mnih et al., 2016; Schulman et al., 2015b). The objective of both estimators is to approximate the advantage function $A^{\pi}(s_t, a_t)$ under the current policy $\pi$.
A2C Advantage Estimator.
We construct the A2C estimator at time $t$ as
$$\hat{A}_t = \hat{R}_t - V_\phi(s_t), \qquad \hat{R}_t = \sum_{t' \geq t} \gamma^{t' - t} r_{t'} \qquad (19)$$
where $\hat{R}_t$ are Monte Carlo estimates of the partial sums of returns (along the sampled trajectories). The critic $V_\phi$ is trained by regression over the partial returns to approximate the value function $V^\pi$.
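As a concrete illustration of (19), the computation can be sketched in a few lines; this is a minimal sketch with our own naming, assuming a single complete episode.

```python
def a2c_advantages(rewards, values, gamma):
    """A2C advantage estimates: discounted rewards-to-go minus the value baseline.

    rewards: list of rewards r_t for one complete episode.
    values:  list of critic predictions V(s_t), same length as rewards.
    gamma:   discount factor.
    """
    returns = [0.0] * len(rewards)
    running = 0.0
    # Accumulate discounted partial returns R_t backwards through the episode.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    # Advantage estimate: A_t = R_t - V(s_t).
    return [r - v for r, v in zip(returns, values)]
```

For example, with rewards `[1, 1, 1]`, a constant baseline of `0.5`, and `gamma = 0.9`, the partial returns are `[2.71, 1.9, 1.0]` and the advantages `[2.21, 1.4, 0.5]`.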
Generalized Advantage Estimator (GAE).
GAE is indexed by two parameters $\gamma$ and $\lambda$, where $\gamma$ is the discount factor and $\lambda$ is an additional trace parameter that determines the bias-variance trade-off of the final estimate. We define the TD-errors
$$\delta_t = r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t) \qquad (20)$$
where $V_\phi$ is a value function critic trained by regression over returns. The GAE at time $t$ is computed as an exponentially weighted average of TD-errors across time:
$$\hat{A}_t^{\mathrm{GAE}} = \sum_{l \geq 0} (\gamma \lambda)^l \delta_{t + l}. \qquad (21)$$
Though the optimal parameter $\lambda$ is problem dependent, a common practice is to set $\lambda$ close to $1$.
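Equation (21) admits the backward recursion $\hat{A}_t = \delta_t + \gamma \lambda \hat{A}_{t+1}$, which yields a simple backward pass; the sketch below uses our own naming and assumes a complete episode, so the value after the terminal step is zero.

```python
def gae_advantages(rewards, values, gamma, lam):
    """Generalized Advantage Estimation via the backward recursion
    A_t = delta_t + gamma * lam * A_{t+1}, with A_T = 0 at episode end."""
    advantages = [0.0] * len(rewards)
    next_value, next_adv = 0.0, 0.0  # terminal state: V = 0, A = 0
    for t in reversed(range(len(rewards))):
        # TD-error: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * next_value - values[t]
        next_adv = delta + gamma * lam * next_adv
        advantages[t] = next_adv
        next_value = values[t]
    return advantages
```

Setting $\lambda = 1$ recovers the Monte Carlo return minus the value baseline, while $\lambda = 0$ reduces to the one-step TD-error.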
a.3 Baseline Implementations
We compare the ARM policy gradient estimator with the A2C policy gradient estimator and the recently proposed RELAX gradient estimator. We implement all three policy gradient estimators on top of OpenAI baselines (Dhariwal et al., 2017). Though each policy gradient estimator requires slightly different implementation details (e.g., recording the pseudo action and the random noise for the ARM policy gradient), we ensure that the three implementations share as much common structure as possible.
We note that the RELAX code (Grathwohl et al., 2017) is publicly available and is likewise built on top of OpenAI baselines. Nevertheless, we did not directly run their code because of potential issues in their original implementation of RELAX. We instead implement our own version of RELAX and note some of the differences from their code below.
Difference 1: Policy gradient computation.
In the following we point out potential issues with the implementation of Grathwohl et al. (2017); we refer to the latest commit of the RELAX repository (https://github.com/wgrathwohl/BackpropThroughTheVoidRL, commit 0e6623d) as of this writing.
Recall that on-policy optimization algorithms alternate between performing rollouts and performing updates based on the rollout samples. At rollout time, on-policy samples $(s_t, a_t)$ are collected. Recall that the A2C policy gradient estimator takes the following form:
$$\hat{g} = \sum_t \hat{A}_t \nabla_\theta \log \pi_\theta(a_t \mid s_t) \qquad (22)$$
where $\hat{A}_t$ are the advantage estimates computed from the on-policy samples. Importantly, in (22) the actions $a_t$ should be the on-policy samples: the intuition is that on-sample actions $a_t$ match their corresponding advantages $\hat{A}_t$, so that if $\hat{A}_t > 0$ for a certain action $a_t$ in state $s_t$, the gradient update will increase the probability $\pi_\theta(a_t \mid s_t)$. In some cases, one implements the A2C gradient estimator by resampling actions $a_t' \sim \pi_\theta(\cdot \mid s_t)$ at each state during training, resulting in the following estimator:
$$\hat{g}' = \sum_t \hat{A}_t \nabla_\theta \log \pi_\theta(a_t' \mid s_t) \qquad (23)$$
We remark that this new gradient estimator with resampled actions is biased, i.e., its expectation does not equal the true policy gradient $\nabla_\theta J(\theta)$. The bias comes from the fact that when $a_t' \neq a_t$, we assign the advantage $\hat{A}_t$ to the wrong actions $a_t'$, leading to mismatched credit assignment. We find that the latest version of the RELAX code (Grathwohl et al., 2017) implements this biased estimator (23); in fact, they implement the biased estimator both for the A2C baseline and for their proposed RELAX gradient estimator.
Specifically, in TensorFlow terminology (Abadi et al., 2016), the advantages $\hat{A}_t$, actions $a_t$, and states $s_t$ should be fed into the loss and gradient computation via placeholders. However, in the RELAX implementation, the advantages and states are fed via placeholders, while the actions are taken from train_model.a0, where a0 stands for actions sampled from the policy network train_model. In practice, this causes the policy model to resample an independent set of actions $a_t' \sim \pi_\theta(\cdot \mid s_t)$, leading to biased estimates. The resampling bias is severe when $a_t' \neq a_t$ with high probability, especially when the policy is still close to random during the initial stage of training. Later in training, when the policy becomes more deterministic, the bias decreases since it is more likely that $a_t' = a_t$. In our implementation, we correct such potential bugs.
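The resampling bias can be checked numerically in a toy single-state, binary-action setting (this toy construction and all names are ours, not from the RELAX codebase): when the advantage is paired with an independently resampled action, the two factors decouple, and the estimator's expectation collapses to zero instead of the true gradient.

```python
import math, random

random.seed(0)
phi = 0.5                       # policy logit; pi(a=1) = sigmoid(phi)
p = 1.0 / (1.0 + math.exp(-phi))
A = {1: 2.0, 0: -1.0}           # fixed toy "advantages" for each action

# True gradient of J(phi) = E[A(a)]: dJ/dphi = p * (1 - p) * (A(1) - A(0))
true_grad = p * (1 - p) * (A[1] - A[0])

n = 200_000
unbiased = biased = 0.0
for _ in range(n):
    a = 1 if random.random() < p else 0      # on-policy action
    a_re = 1 if random.random() < p else 0   # independently resampled action
    # Bernoulli score function: d log pi(a) / d phi = a - p
    unbiased += A[a] * (a - p)
    biased += A[a] * (a_re - p)              # advantage paired with wrong action
unbiased /= n
biased /= n
# unbiased ~= true_grad, while biased ~= 0 (mismatched credit assignment)
```

The Monte Carlo average of the on-policy estimator matches the analytic gradient, while the resampled variant averages to roughly zero, since the advantage and the score function are independent.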
Difference 2: Average gradients over states not trajectories.
In the original development of RELAX, policy gradients are computed per trajectory and averaged across multiple trajectories. A common practice in on-policy algorithm implementations (Dhariwal et al., 2017) is instead to average policy gradients across states. We follow this latter practice. As a result, we collect a fixed number of steps per iteration instead of a fixed number of rollouts (which can result in a varying number of steps), as in the original work (Grathwohl et al., 2017). We believe this practice allows for a fair comparison.
Appendix B Algorithm
In the pseudocode below, we omit the training of the value function baseline $V_\phi$. Following common practice (Schulman et al., 2015a; Mnih et al., 2016; Schulman et al., 2017a), the value function baseline is trained by regression over Monte Carlo returns.
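As a rough sketch of this regression step (a toy linear value function with our own naming, not the architecture used in the experiments), the baseline is fit by gradient descent on the squared error against Monte Carlo returns:

```python
def fit_value_baseline(features, returns, lr=0.1, epochs=500):
    """Fit a linear value baseline V(s) = w . s + b by least-squares regression
    on Monte Carlo returns, via plain full-batch gradient descent."""
    dim = len(features[0])
    w, b = [0.0] * dim, 0.0
    n = len(features)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, ret in zip(features, returns):
            # Residual of the current prediction against the return target.
            err = sum(wi * xi for wi, xi in zip(w, x)) + b - ret
            for i in range(dim):
                grad_w[i] += 2 * err * x[i] / n
            grad_b += 2 * err / n
        w = [wi - lr * gi for wi, gi in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b
```

In practice, a neural network critic trained with the same squared-error objective plays this role.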
Appendix C Further Discussions on ARM Policy Gradient Estimator
For simplicity we fix a state $s$ and use the simplified notation $\pi(a) \equiv \pi_\theta(a \mid s)$ for the policy over binary actions $a \in \{0, 1\}$, $f(a)$ for the fixed-state advantage, $\phi$ for the logit, and $\sigma(\cdot)$ for the sigmoid function. Here the logit is parameterized as $\phi = \phi_\theta(s)$ by parameter $\theta$, so that $\pi(a = 1) = \sigma(\phi)$. Let $u \sim \mathrm{Uniform}(0, 1)$ be the noise used for generating actions, such that the action $a = 1$ if $u < \sigma(\phi)$ and the pseudo action $\tilde{a} = 1$ if $u > \sigma(-\phi)$. Without loss of generality, assume $a = 1$.
The ARM policy gradient estimator can then be simplified as follows:
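As a numerical sanity check (a minimal sketch with our own naming, following the binary ARM construction of Yin & Zhou (2019)): for a Bernoulli action with logit $\phi$, shared noise $u \sim \mathrm{Uniform}(0, 1)$, action $a = \mathbf{1}[u < \sigma(\phi)]$, and pseudo action $\tilde{a} = \mathbf{1}[u > \sigma(-\phi)]$, the single-sample estimate $(f(\tilde{a}) - f(a))(u - 1/2)$ is an unbiased estimate of $\nabla_\phi \, \mathbb{E}_a[f(a)] = \sigma(\phi)(1 - \sigma(\phi))(f(1) - f(0))$:

```python
import math, random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def arm_grad(phi, f, u):
    """Single-sample ARM estimate of d/dphi E_{a ~ Bern(sigmoid(phi))}[f(a)].

    u: uniform(0, 1) noise shared by the action and the pseudo action.
    """
    a = 1 if u < sigmoid(phi) else 0          # true action
    a_pseudo = 1 if u > sigmoid(-phi) else 0  # pseudo (antithetic) action
    return (f(a_pseudo) - f(a)) * (u - 0.5)

random.seed(0)
phi = 0.3
f = lambda a: [0.0, 1.0][a]               # toy fixed-state advantage
p = sigmoid(phi)
true_grad = p * (1 - p) * (f(1) - f(0))   # analytic gradient

estimate = sum(arm_grad(phi, f, random.random()) for _ in range(200_000)) / 200_000
# estimate ~= true_grad
```

Note that the estimate is nonzero only when the action and pseudo action disagree, which is the mechanism behind the variance reduction of ARM relative to REINFORCE.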