Augment-Reinforce-Merge Policy Gradient for Binary Stochastic Policy

Due to the high variance of policy gradients, on-policy optimization algorithms are plagued by low sample efficiency. In this work, we propose the Augment-Reinforce-Merge (ARM) policy gradient estimator as an unbiased, low-variance alternative to previous baseline estimators on tasks with a binary action space, inspired by the recent ARM gradient estimator for discrete random variable models. We show that the ARM policy gradient estimator achieves variance reduction with theoretical guarantees, and leads to significantly more stable and faster convergence of policies parameterized by neural networks.





1 Introduction

There has been significant recent interest in deep reinforcement learning (DRL), which combines reinforcement learning (RL) with powerful function approximators such as neural networks and has led to a wide variety of successful applications, ranging from board and video game playing to simulated and real-life robotic control (Silver et al., 2016; Mnih et al., 2013; Schulman et al., 2015a; Levine et al., 2016). One major area of DRL is on-policy optimization (Silver et al., 2016; Schulman et al., 2015a), which progressively improves upon the current policy iterate until a local optimum is found.

As on-policy optimization flattens RL into a stochastic optimization problem, unbiased gradient estimation is carried out using REINFORCE or its more stable variants (Williams, 1992; Mnih et al., 2016). In general, however, on-policy gradient estimators suffer from high variance and need many more samples to construct high-quality updates. Prior works have proposed variance reduction using variants of control variates (Gu et al., 2015, 2017; Grathwohl et al., 2017; Wu et al., 2018). However, Tucker et al. (2018) recently cast doubt on some of these variance reduction techniques by showing that their implementations deviate from the methods proposed in the corresponding papers, which we detail in the related work below. In other cases, biased gradients are deliberately constructed to heuristically compute trust region policy updates (Schulman et al., 2017b), which also achieve state-of-the-art performance on a wide range of tasks.

In this work, we consider an unbiased policy gradient estimator based on the Augment-Reinforce-Merge (ARM) gradient estimator for binary latent variable models (Yin & Zhou, 2019). We design a practical on-policy algorithm for RL tasks with binary action space, and show that the theoretical guarantee for variance reduction in Yin & Zhou (2019) can be straightforwardly applied to the policy gradient setting. The proposed ARM policy gradient estimator is a plug-in alternative to REINFORCE gradient estimator and its variants (Mnih et al., 2016), with minor algorithmic modifications.

The remainder of the paper is organized as follows. In Section 2, we introduce background and related work on RL and the ARM estimator for binary latent variable models. In Section 3, we describe the ARM policy gradient estimator, including the derivation, theoretical guarantees, and on-policy optimization algorithm. In Section 4, we show via thorough experiments that our proposed estimator consistently outperforms stable baselines, such as the A2C (Mnih et al., 2016) and recently proposed RELAX (Grathwohl et al., 2017) estimators.

2 Background

2.1 Reinforcement Learning

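Consider a Markov decision process (MDP), where at time $t \ge 0$ the agent is in state $s_t \in \mathcal{S}$, takes action $a_t \in \mathcal{A}$, transitions to the next state $s_{t+1} \sim p(\cdot \mid s_t, a_t)$ and receives instant reward $r_t = r(s_t, a_t)$. A policy is a mapping $\pi: \mathcal{S} \mapsto \mathcal{P}(\mathcal{A})$, where $\mathcal{P}(\mathcal{A})$ is the space of distributions over the action space $\mathcal{A}$. The objective of RL is to search for a policy $\pi$ such that the expected discounted cumulative reward is maximized

$$J(\pi) = \mathbb{E}_{\pi}\Big[\sum_{t=0}^{T-1} \gamma^t r_t\Big], \qquad (1)$$

where $\gamma \in (0, 1]$ is a discount factor and $T$ is the horizon. Let $\pi^* = \arg\max_\pi J(\pi)$ be the optimal policy. For convenience, we define under policy $\pi$ the value function as $V^\pi(s) = \mathbb{E}_\pi[\sum_t \gamma^t r_t \mid s_0 = s]$ and the action-value function as $Q^\pi(s, a) = \mathbb{E}_\pi[\sum_t \gamma^t r_t \mid s_0 = s, a_0 = a]$. We also define the advantage function as $A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)$. By construction, we have $\mathbb{E}_{a \sim \pi(\cdot \mid s)}[A^\pi(s, a)] = 0$, i.e., the expected advantage under policy $\pi$ is zero.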
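As a concrete illustration of the objective, the following minimal Python sketch computes the discounted cumulative reward $\sum_t \gamma^t r_t$ of one sampled trajectory. The function name `discounted_return` and its arguments are illustrative assumptions, not part of the original work; averaging this quantity over many rollouts of a policy gives a Monte Carlo estimate of its expected discounted cumulative reward.

```python
import numpy as np

def discounted_return(rewards, gamma):
    """Return sum_t gamma^t * r_t for one trajectory (hypothetical helper)."""
    rewards = np.asarray(rewards, dtype=float)
    # Discount weights gamma^0, gamma^1, ..., one per time step.
    discounts = gamma ** np.arange(len(rewards))
    return float(np.sum(discounts * rewards))
```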

2.2 On-Policy Optimization

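One way to find $\pi^*$ is through direct policy search. Consider parameterizing the policy $\pi_\theta$ with parameter $\theta \in \Theta$, where $\Theta$ is the space of parameters. If the policy class is expressive enough such that $\pi^* \in \{\pi_\theta : \theta \in \Theta\}$, we can recover $\pi^*$ with some parameter $\theta^*$. In on-policy optimization, we start with a random policy $\pi_\theta$ and iteratively improve the policy through gradient updates $\theta \leftarrow \theta + \alpha \nabla_\theta J(\pi_\theta)$ for some learning rate $\alpha > 0$. We can compute the gradient

$$\nabla_\theta J(\pi_\theta) \approx g_\theta = \mathbb{E}_{\pi_\theta}\Big[\sum_{t=0}^{T-1} Q^{\pi_\theta}(s_t, a_t)\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\Big]. \qquad (2)$$

It is worth noting that $g_\theta$ is almost but not exactly the REINFORCE gradient of $J(\pi_\theta)$ (Williams, 1992). The approximation is due to the absence of the discount factors $\gamma^t$ at each term in the summation of (2). In recent practice, $g_\theta$ is used instead of the exact REINFORCE gradient (Schulman et al., 2015a, 2017a; Mnih et al., 2016), since the factor $\gamma^t$ aggressively weighs down terms with large $t$, which leads to poor empirical performance. In our subsequent derivations, we treat $g_\theta$ as the standard gradient and we let $\hat{g}_\theta = \sum_{t=0}^{T-1} Q^{\pi_\theta}(s_t, a_t)\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)$ be a one-sample unbiased estimate of $g_\theta$ such that $\mathbb{E}[\hat{g}_\theta] = g_\theta$. However, $\hat{g}_\theta$ generally exhibits very high variance and does not entail stable updates. Actor-critic policy gradients (Mnih et al., 2016) subtract a state-dependent baseline function $b(s_t)$ from the action-value term of the original estimator as a control variate for variance reduction. A near-optimal choice of $b(s_t)$ is the value function $V^{\pi_\theta}(s_t)$, which yields the following one-sample unbiased actor-critic gradient estimator

$$\hat{g}^{\mathrm{A2C}}_\theta = \sum_{t=0}^{T-1} A^{\pi_\theta}(s_t, a_t)\, \nabla_\theta \log \pi_\theta(a_t \mid s_t), \qquad (3)$$

where we still have $\mathbb{E}[\hat{g}^{\mathrm{A2C}}_\theta] = g_\theta$. We also call (3) the A2C policy gradient estimator (Mnih et al., 2016). In practice, the action-value function $Q^{\pi_\theta}(s_t, a_t)$ is estimated via Monte Carlo samples and the value function $V^{\pi_\theta}(s_t)$ is approximated by a critic with its own parameters.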
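To see why subtracting a state-dependent baseline leaves the expected gradient unchanged while reducing variance, consider a single state with a binary action and a logit-parameterized policy $\pi(a = 1) = \sigma(\phi)$ (a simplifying assumption made for this sketch). The score is then $\partial_\phi \log \pi(a) = a - \sigma(\phi)$, and expectations over the two actions can be enumerated exactly. The helper names below are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def expected_score_gradient(phi, q_values, baseline=0.0):
    """Exact E_a[(Q(a) - b) * d/dphi log pi(a)] for a Bernoulli(sigmoid(phi)) policy."""
    p = sigmoid(phi)
    probs = np.array([1.0 - p, p])          # pi(a=0), pi(a=1)
    scores = np.array([0.0 - p, 1.0 - p])   # d/dphi log pi(a) = a - p
    return float(np.sum(probs * (np.asarray(q_values, dtype=float) - baseline) * scores))

def score_gradient_variance(phi, q_values, baseline=0.0):
    """Exact variance of the one-sample estimator (Q(a) - b) * (a - p)."""
    p = sigmoid(phi)
    probs = np.array([1.0 - p, p])
    samples = (np.asarray(q_values, dtype=float) - baseline) * np.array([-p, 1.0 - p])
    mean = np.sum(probs * samples)
    return float(np.sum(probs * (samples - mean) ** 2))
```

With the value function $V = \sum_a \pi(a) Q(a)$ as the baseline, the mean is unchanged and, in this two-action example, the variance drops.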

2.3 Augment-Reinforce-Merge Gradient for Binary Random Variable Models

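Discrete random variables are ubiquitous in probabilistic generative modeling. For the presentation of subsequent work, we limit our attention to the binary case. Let $z$ denote a binary random variable such that

$$z \sim \mathrm{Bernoulli}(\sigma(\phi)), \qquad P(z = 1) = \sigma(\phi) = \frac{e^{\phi}}{1 + e^{\phi}}, \qquad (4)$$

where $\phi$ is the logit of the Bernoulli probability parameter and $\sigma(\cdot)$ is the sigmoid function. For multi-dimensional distributions, we have a vector of $V$ binary random variables $z = (z_1, \ldots, z_V)$ such that each component $z_v$ follows an independent Bernoulli distribution $z_v \sim \mathrm{Bernoulli}(\sigma(\phi_v))$, which is denoted as $z \sim \prod_{v=1}^{V} \mathrm{Bernoulli}(z_v; \sigma(\phi_v))$. In general, we consider an expected optimization objective of the vector $\phi = (\phi_1, \ldots, \phi_V)$

$$\mathcal{E}(\phi) = \mathbb{E}_{z \sim \prod_{v=1}^{V} \mathrm{Bernoulli}(z_v; \sigma(\phi_v))}\big[f(z)\big]. \qquad (5)$$

To optimize (5), we can construct a gradient estimator of $\nabla_\phi \mathcal{E}(\phi)$ for iterative updates. Due to the discrete nature of the variables $z$, the REINFORCE gradient estimator is the naive baseline but its variance can be too high to be of practical use. The ARM (Augment-Reinforce-Merge) gradient estimator (Yin & Zhou, 2019) provides the following alternative (see also Theorem 1 in Yin & Zhou (2019)).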

Theorem 2.1 (ARM estimator for multivariate binary).

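For a vector of $V$ binary random variables $z = (z_1, \ldots, z_V)$, the gradient of

$$\mathcal{E}(\phi) = \mathbb{E}_{z \sim \prod_{v=1}^{V} \mathrm{Bernoulli}(z_v; \sigma(\phi_v))}\big[f(z)\big]$$

with respect to $\phi = (\phi_1, \ldots, \phi_V)$, the logits of the Bernoulli distributions, can be expressed as

$$\nabla_\phi \mathcal{E}(\phi) = \mathbb{E}_{u \sim \prod_{v=1}^{V} \mathrm{Uniform}(u_v; 0, 1)}\Big[\big(f(\mathbb{1}[u > \sigma(-\phi)]) - f(\mathbb{1}[u < \sigma(\phi)])\big)\,\big(u - \tfrac{1}{2}\big)\Big], \qquad (6)$$

where $u = (u_1, \ldots, u_V)$ has independent $\mathrm{Uniform}(0, 1)$ components. Here $\mathbb{1}[u > \sigma(-\phi)]$ is a $V$-dimensional vector whose $v$-th component is $\mathbb{1}[u_v > \sigma(-\phi_v)]$, where $\mathbb{1}[\cdot]$ is the indicator function.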

The ARM gradient estimator was originally derived through a sequence of steps in Yin & Zhou (2019): first, an AR estimator is derived from augmenting the variable space (A) and applying REINFORCE (R); then a final merge step (M) is applied to several AR estimators for variance reduction. It was shown that through the merge step, the resulting ARM estimator is equivalent to the original AR estimator combined with an optimal control variate subject to certain constraints, which leads to substantial variance reduction with theoretical guarantees. We refer the readers to Yin & Zhou (2019) for more details.
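The ARM identity of Yin & Zhou (2019), $\nabla_\phi \mathbb{E}_z[f(z)] = \mathbb{E}_u[(f(\mathbb{1}[u > \sigma(-\phi)]) - f(\mathbb{1}[u < \sigma(\phi)]))(u - \tfrac12)]$ with $u$ uniform on $(0,1)^V$, can be checked numerically on a small problem. The sketch below is a minimal NumPy illustration, not the authors' implementation: the function names and the test objective `example_objective` are assumptions made here, and the exact gradient is obtained by enumerating all $2^V$ configurations with the score-function identity.

```python
import itertools
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def example_objective(z):
    """Hypothetical non-separable test function; z has shape (n, V) with V >= 2."""
    return np.sum((z - 0.4) ** 2, axis=1) + z[:, 0] * z[:, 1]

def arm_gradient_samples(f, phi, u):
    """One-sample ARM estimates, one row per row of the uniform noise u of shape (n, V)."""
    z_pseudo = (u > sigmoid(-phi)).astype(float)  # 1[u > sigma(-phi)]
    z_sample = (u < sigmoid(phi)).astype(float)   # 1[u < sigma(phi)]
    diff = f(z_pseudo) - f(z_sample)              # shape (n,)
    return diff[:, None] * (u - 0.5)

def exact_gradient(f, phi):
    """Exact gradient by enumerating all 2^V binary configurations."""
    p = sigmoid(phi)
    grad = np.zeros_like(phi, dtype=float)
    for bits in itertools.product([0.0, 1.0], repeat=len(phi)):
        z = np.array(bits)
        prob = np.prod(np.where(z == 1.0, p, 1.0 - p))
        # d/dphi_v log Bernoulli(z_v; sigma(phi_v)) = z_v - sigma(phi_v)
        grad += prob * f(z[None, :])[0] * (z - p)
    return grad
```

Averaging the rows returned by `arm_gradient_samples` over many draws of `u` should approach `exact_gradient` up to Monte Carlo error.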

2.4 Training Stochastic Binary Network

One important application of the ARM gradient estimator is training stochastic binary neural networks. Consider a binary latent variable model with $T$ stochastic hidden layers $b_{1:T}$ conditional on observations $x$. We construct their joint distribution as

$$q(b_{1:T} \mid x) = q(b_1 \mid x) \prod_{t=1}^{T-1} q(b_{t+1} \mid b_t), \qquad (7)$$

where $b_t$ are binary random variables and the conditional distributions are Bernoulli distributions parameterized by $w_{1:T}$. In general, we construct the following objective in the form of (5)

$$\mathcal{E}(w_{1:T}) = \mathbb{E}_{b_{1:T} \sim q(b_{1:T} \mid x)}\big[f(b_{1:T})\big] \qquad (8)$$

for some function $f$. In the context of the variational auto-encoder (VAE) (Kingma & Welling, 2013), $f$ is the evidence lower bound (ELBO) (Blei et al., 2017). We would like to optimize $\mathcal{E}(w_{1:T})$ using the gradients $\nabla_{w_t} \mathcal{E}(w_{1:T})$, and this is enabled by the following ARM back-propagation theorem. Proposition 6 in Yin & Zhou (2019) addresses the VAE model, but there is no loss of generality when considering a general function $f$ as in the following theorem.

Theorem 2.2 (ARM Backpropagation).

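For a stochastic binary network with $T$ binary hidden layers, let $b_0 = x$ and construct the conditional distributions as

$$q(b_t \mid b_{t-1}) = \mathrm{Bernoulli}\big(\sigma(\mathcal{T}_{w_t}(b_{t-1}))\big), \qquad t = 1, \ldots, T, \qquad (9)$$

for some function $\mathcal{T}_{w_t}(\cdot)$. Then the gradient of $\mathcal{E}(w_{1:T})$ w.r.t. $w_t$ can be expressed as

$$\nabla_{w_t} \mathcal{E}(w_{1:T}) = \mathbb{E}_{q(b_{1:t-1} \mid x)}\Big[\mathbb{E}_{u_t}\big[f_\Delta(u_t, \mathcal{T}_{w_t}(b_{t-1}), b_{1:t-1})\,(u_t - \tfrac{1}{2})\big]\, \nabla_{w_t} \mathcal{T}_{w_t}(b_{t-1})\Big],$$

where $u_t$ has independent $\mathrm{Uniform}(0, 1)$ components and $f_\Delta(u_t, \mathcal{T}_{w_t}(b_{t-1}), b_{1:t-1}) = \mathbb{E}\big[f(b_{1:T}) \mid b_t = \mathbb{1}[u_t > \sigma(-\mathcal{T}_{w_t}(b_{t-1}))]\big] - \mathbb{E}\big[f(b_{1:T}) \mid b_t = \mathbb{1}[u_t < \sigma(\mathcal{T}_{w_t}(b_{t-1}))]\big]$, with both expectations taken over the later layers $b_{t+1:T}$, which can be estimated via a single Monte Carlo sample.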

2.5 Related Work

On-Policy Optimization.

On-policy optimization is driven by policy gradients with function approximation (Sutton et al., 2000). Due to the non-differentiable nature of RL, REINFORCE gradient estimator (Williams, 1992) is the default policy gradient estimator. In practice, REINFORCE gradient estimator has very high variance and the updates can become unstable. Recently, Mnih et al. (2016) propose advantage actor critic (A2C), which reduces the variance of the policy gradient estimator using a value function critic. Further, Schulman et al. (2015b) introduce generalized advantage estimation (GAE), a combination of multi-step return and value function critic to trade-off the bias and variance in the advantage function estimation, in order to compute lower-variance downstream policy gradients.

Variance Reduction for Stochastic Gradient Estimators.

Policy gradient estimators are special cases of the general stochastic gradient estimation of an objective function written as an expectation $\mathbb{E}_{z \sim p_\theta(z)}[f(z)]$. To address the typical high-variance issues of the REINFORCE gradient estimator, prior works have proposed to add control variates (or baseline functions) for variance reduction (Paisley et al., 2012; Ranganath et al., 2014; Gu et al., 2015; Kucukelbir et al., 2017). The re-parameterization trick greatly reduces the variance when the variables are continuous and the underlying distribution is re-parameterizable (Kingma & Welling, 2013). When the variables are discrete, Maddison et al. (2016) and Jang et al. (2016) introduce a biased yet low-variance gradient estimator based on a continuous relaxation of the discrete variables. More recently, REBAR (Tucker et al., 2017) and RELAX (Grathwohl et al., 2017) construct unbiased gradient estimators by using baseline functions derived from continuous relaxations of the discrete variables, whose parameters need to be estimated online. Yin & Zhou (2019) propose an unbiased estimator motivated as a self-control baseline, and display substantial gains over prior works when $z$ are binary variables. In this work, we borrow ideas from Yin & Zhou (2019) and extend the ARM gradient estimator for binary stochastic networks into the ARM policy gradient for RL.

Variance Reduction for Policy Gradients.

By default, the baseline function for REINFORCE on-policy gradient estimator is only state dependent, and the value function critic is typically applied. Gu et al. (2015) propose Taylor expansions of the value functions as the baseline and construct an unbiased gradient estimator. Grathwohl et al. (2017); Wu et al. (2018) propose carefully designed action-dependent baselines, which can construct unbiased gradient estimator while in theory achieving more substantial variance reduction. Despite their reported success, Tucker et al. (2018) observe that subtle implementation decisions cause their code to diverge from the unbiased methods presented in the paper: (1) Gu et al. (2015, 2017) achieve reported gains potentially by introducing bias into their advantage estimates. When such bias is removed, they do not outperform baseline methods; (2) Liu et al. (2018) achieve reported gains over state-dependent baselines potentially because the baselines are poorly trained. When properly trained, state-dependent baselines achieve similar results as the proposed action-dependent baselines; (3) Grathwohl et al. (2017) achieve gains potentially due to a bug that leads to different advantage estimators for their proposed RELAX estimator and the baseline. When this bug is fixed, they do not achieve significant gains. In this work, we propose ARM policy gradient estimator, a plug-in alternative which is unbiased and consistently outperforms A2C and RELAX for tasks with binary action space.

3 Augment-Reinforce-Merge Policy Gradient

Below we present the derivation of the Augment-Reinforce-Merge (ARM) policy gradient. The high-level idea is to draw a connection between RL and stochastic binary networks: for an RL problem of horizon $T$, we interpret the action sequence $a_{0:T-1}$ as the stochastic hidden layers and derive the gradient estimator similarly as in Theorem 2.2. We consider RL problems with binary action space $\mathcal{A} = \{0, 1\}$.

3.1 Time-Dependent Policy

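To make a full analogy to stochastic binary networks with $T$ layers, we consider an RL problem with horizon $T$. We make the policy time-dependent by specifying a different policy $\pi_{\theta_t}$ for time $t$ with parameter $\theta_t$. We define an RL objective similar to (1)

$$J(\theta_{0:T-1}) = \mathbb{E}_{a_t \sim \pi_{\theta_t}(\cdot \mid s_t)}\Big[\sum_{t=0}^{T-1} \gamma^t r_t\Big], \qquad (10)$$

where we jointly optimize over all $\theta_t$. To make the connection between (10) and (8) explicit, we observe the following: we can interpret (10) as a special form of (8) by setting the binary hidden variables as the actions $a_{0:T-1}$ and the observation as the initial state $s_0$. The conditional distribution of each action given the preceding ones is induced by the policy $\pi_{\theta_t}(a_t \mid s_t)$ and the transition dynamics $p(s_{t+1} \mid s_t, a_t)$. Finally, we define the objective function $f(a_{0:T-1})$ as the expected discounted cumulative reward given the action sequence, which depends directly only on $a_{0:T-1}$ (after marginalizing out the states $s_{1:T-1}$).

Since the action space is binary, we introduce a policy parameterization similar to a stochastic binary network, i.e., $\pi_{\theta_t}(a_t = 1 \mid s_t) = \sigma(\phi_{\theta_t}(s_t))$. For any given $t$, consider the ARM estimator of (10) w.r.t. $\theta_t$ according to Theorem 2.2. When we sample $a_t \sim \pi_{\theta_t}(\cdot \mid s_t)$, we can sample a uniform random variable $u_t \sim \mathrm{Uniform}(0, 1)$ and then set $a_t = \mathbb{1}[u_t < \sigma(\phi_{\theta_t}(s_t))]$. We define the pseudo action as $a'_t = \mathbb{1}[u_t > \sigma(-\phi_{\theta_t}(s_t))]$. By converting all variables in the binary stochastic network example into their RL counterparts as described above, we can derive the gradient of (10) w.r.t. $\theta_t$ for any given $t$

$$\nabla_{\theta_t} J(\theta_{0:T-1}) = \mathbb{E}\Big[\big(Q_t(s_t, a'_t) - Q_t(s_t, a_t)\big)\,\big(u_t - \tfrac{1}{2}\big)\, \nabla_{\theta_t} \phi_{\theta_t}(s_t)\Big], \qquad (11)$$

where $Q_t(s_t, a_t)$ is defined as the expected cumulative rewards obtained by first executing $a_t$ at $s_t$ and following the time-dependent policy thereafter. Notice that this is not exactly the same as the action-value function we defined in Section 2.1 since the policy is time-dependent.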

3.2 Stationary Policy

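In practice we are interested in a stationary policy, which is invariant over time $t$. Since we have derived the gradient estimator for a time-dependent policy $\pi_{\theta_t}$, the most naive approach would be to share weights by letting $\theta_t = \theta$ for all $t$ and linearly combining the per-step gradients (11) across time steps. Since the policy is now stationary, we define $g^{\mathrm{ARM}}_{\theta,t}$ as the stationary version of (11), where $Q_t$ is replaced by the action-value function $Q^{\pi_\theta}$ of the stationary policy

$$g^{\mathrm{ARM}}_{\theta,t} = \mathbb{E}\Big[\big(Q^{\pi_\theta}(s_t, a'_t) - Q^{\pi_\theta}(s_t, a_t)\big)\,\big(u_t - \tfrac{1}{2}\big)\, \nabla_\theta \phi_\theta(s_t)\Big], \qquad (12)$$

with $u_t \sim \mathrm{Uniform}(0, 1)$ and the on-policy action $a_t$ and pseudo action $a'_t$ constructed from $u_t$ as in Section 3.1. The combined gradient is

$$g^{\mathrm{ARM}}_\theta = \sum_{t=0}^{T-1} g^{\mathrm{ARM}}_{\theta,t}. \qquad (13)$$

We denote by $\hat{g}^{\mathrm{ARM}}_{\theta,t}$ and $\hat{g}^{\mathrm{ARM}}_\theta$ the unbiased sample estimates of $g^{\mathrm{ARM}}_{\theta,t}$ and $g^{\mathrm{ARM}}_\theta$, respectively. Importantly, we can show that $\hat{g}^{\mathrm{ARM}}_\theta$ is unbiased, i.e., $\mathbb{E}[\hat{g}^{\mathrm{ARM}}_\theta] = g_\theta$, where $g_\theta$ is the standard gradient defined in Section 2.2. We summarize the result in the following theorem.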

Theorem 3.1 (Unbiased ARM policy gradient).

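When the ARM policy gradient is constructed as in (13), it is unbiased w.r.t. the true gradient of the RL objective

$$\mathbb{E}\big[\hat{g}^{\mathrm{ARM}}_\theta\big] = g_\theta,$$

where $\hat{g}^{\mathrm{ARM}}_\theta$ is the sample estimate of (13) and $g_\theta$ is the standard gradient of Section 2.2.
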
Recall the standard gradient in (2), we define . Now we show that by combining ARM gradients across all time steps (which correspond to all layers of a stochastic binary network), we can compute an unbiased policy gradient. Recall the gradient at is computed based on (12), we explicitly compute the gradient in the following. For simplicity, we remove all dependencies on and denote , . Also the advantage function . Assume also without loss of generality,


We also explicitly write down the standard gradient at time step with the same notation

We see that conditional on the same state , the ARM policy gradient sample estimate at time step has its expectation equal to the standard gradient at time . To complete the proof, we marginalize out via the visitation probability under current policy and sum over time steps: . ∎
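The conditional expectation at the heart of the proof can be checked by hand. The following is a sketch in shorthand introduced here (not the paper's notation): fix a state, write $p = \sigma(\phi)$ for the probability of action 1, write $Q_1$ and $Q_0$ for the action values of the two actions, and assume $p \le 1/2$ (the other case is symmetric). With $a = \mathbb{1}[u < p]$ and $a' = \mathbb{1}[u > 1 - p]$, the two actions differ only when $u < p$ (then $a = 1$, $a' = 0$) or $u > 1 - p$ (then $a = 0$, $a' = 1$), so

```latex
\begin{aligned}
\mathbb{E}_{u}\big[(Q_{a'} - Q_{a})\,(u - \tfrac12)\big]
&= (Q_0 - Q_1)\int_0^{p} (u - \tfrac12)\,du
 \;+\; (Q_1 - Q_0)\int_{1-p}^{1} (u - \tfrac12)\,du \\
&= (Q_0 - Q_1)\Big(-\tfrac{p(1-p)}{2}\Big) \;+\; (Q_1 - Q_0)\,\tfrac{p(1-p)}{2} \\
&= p\,(1-p)\,(Q_1 - Q_0).
\end{aligned}
```

The score-function side gives the same value: since $\partial_\phi \log \pi(a) = a - p$, we have $\sum_a \pi(a)\, Q_a\, \partial_\phi \log \pi(a) = p(1-p)\,Q_1 - (1-p)\,p\,Q_0 = p(1-p)(Q_1 - Q_0)$. Multiplying both sides by $\nabla_\theta \phi_\theta(s_t)$ gives the per-state equality used in the proof.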

3.3 Variance Reduction for ARM Policy Gradient

Here we compare the variance of the ARM policy gradient estimator $\hat{g}^{\mathrm{ARM}}_\theta$ against the REINFORCE gradient estimator $\hat{g}_\theta$. Analyzing the variance of the full gradient is very challenging, since we need to account for the covariance between gradient components across time steps. We settle for analyzing the variance of each time component $\hat{g}^{\mathrm{ARM}}_{\theta,t}$ and $\hat{g}_{\theta,t}$ for any $t$. A similar analysis has been applied in Fujita & Maeda (2018).

For simplicity of analysis, we assume that all action-value functions (or advantage functions) can be obtained exactly, i.e., we do not consider additional bias and variance introduced by advantage function estimation. Conditional on a given $s_t$, we can show with results from Yin & Zhou (2019) that


Since we have established , we can show via the variance decomposition formula that for


We cannot obtain a tighter bound without making further assumptions on the relative scale of variance and the expectations . Still, we can show that the ARM policy gradient estimator can achieve potentially much smaller variance than the standard gradient estimator, which entails faster policy optimization in practice.

We provide an intuition for why the ARM policy gradient estimator can achieve substantial variance reduction and stabilize training in practice. In Figure 1, we show the running average of $\mathbb{1}[a_t \neq a'_t]$, i.e., how often the on-policy actions $a_t$ and the pseudo actions $a'_t$ differ as training progresses. We show two settings of advantage estimation (left: A2C, right: GAE), which will be detailed in Section 4. We see that as training progresses, the two actions are increasingly less likely to differ, often yielding $a_t = a'_t$. In such cases the per-step ARM policy gradient estimate is exactly zero. On the contrary, prior baseline gradient estimators (2,3) cannot take exactly zero values and will cause the parameters to stumble around due to noisy estimates. We provide more detailed discussions in the Appendix.

Figure 1 (panels: (a) A2C, (b) GAE): Illustration of how frequently the on-policy action $a_t$ differs from the pseudo action $a'_t$ as training progresses. The x-axis shows the time steps in training; the y-axis shows the running average of $\mathbb{1}[a_t \neq a'_t]$, a measure of how often they differ. Each curve shows the result for a different environment, and we show the settings for both A2C (left) and GAE (right). As training progresses, on-policy actions $a_t$ are increasingly less likely to differ from the pseudo actions $a'_t$.
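The trend in Figure 1 is consistent with a simple property of the shared-noise construction. Assuming, as in the ARM construction, that the on-policy action is $\mathbb{1}[u < p]$ and the pseudo action is $\mathbb{1}[u > 1 - p]$ with $p = \sigma(\phi)$ and $u$ uniform on $(0, 1)$, the two actions differ with probability $2\min(p, 1 - p)$, which shrinks to zero as the policy becomes more deterministic. The sketch below checks this on a dense grid of $u$ values; the function name is hypothetical.

```python
import numpy as np

def disagreement_rate(p, n_grid=100_000):
    """Fraction of u in (0, 1) for which the action and pseudo action differ."""
    # Midpoints of a uniform grid stand in for the integral over u ~ Uniform(0, 1).
    u = (np.arange(n_grid) + 0.5) / n_grid
    action = u < p                # on-policy action 1[u < p]
    pseudo_action = u > 1.0 - p   # pseudo action 1[u > 1 - p]
    return float(np.mean(action != pseudo_action))
```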

3.4 Algorithm

Here we present a practical algorithm that applies the ARM policy gradient estimator. The on-policy optimization procedure should alternate between collecting on-policy samples using current policy and computing gradient estimator using these on-policy samples for parameter updates (Schulman et al., 2015a, 2017a).

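We see that a primary difficulty with computing (12) is that it requires the difference of two action-value functions $Q^{\pi_\theta}(s_t, a'_t) - Q^{\pi_\theta}(s_t, a_t)$. Unless the difference is estimated by additional parameterized critics, in a typical on-policy algorithm implementation we only have access to Monte Carlo estimators of the action-value functions corresponding to on-policy actions. To overcome this, we estimate the difference by using the property $\mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}[A^{\pi_\theta}(s, a)] = 0$. To be specific, when $a'_t \neq a_t$, the advantage function of the pseudo action can be expressed as

$$A^{\pi_\theta}(s_t, a'_t) = -\frac{\pi_\theta(a_t \mid s_t)}{1 - \pi_\theta(a_t \mid s_t)}\, A^{\pi_\theta}(s_t, a_t). \qquad (17)$$

Since the difference of action-value functions is identical to the difference of advantage functions, we have in general

$$Q^{\pi_\theta}(s_t, a'_t) - Q^{\pi_\theta}(s_t, a_t) = -\mathbb{1}[a'_t \neq a_t]\, \frac{A^{\pi_\theta}(s_t, a_t)}{1 - \pi_\theta(a_t \mid s_t)}. \qquad (18)$$

We can hence estimate the difference in (12) with only on-policy advantage estimates $\hat{A}(s_t, a_t)$, along with the sampled pseudo action $a'_t$ and on-policy action $a_t$. Notice that when $a'_t = a_t$ the difference is zero; the gradient is therefore zero with high frequency as the algorithm approaches convergence, which leads to faster convergence and better stability.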
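The identity follows from the zero-mean property of the advantage: with two actions, $\pi(a \mid s) A(s, a) + (1 - \pi(a \mid s)) A(s, a') = 0$, so $A(s, a') - A(s, a) = -A(s, a) / (1 - \pi(a \mid s))$ whenever $a' \neq a$. A minimal numerical sketch, with a hypothetical helper name and argument layout chosen here for illustration:

```python
import numpy as np

def q_difference_from_advantage(advantage, prob_action, differs):
    """Estimate Q(s, a') - Q(s, a) from the on-policy advantage A(s, a).

    advantage:   A(s, a) for the executed action a
    prob_action: pi(a | s) for the executed action a
    differs:     True where the pseudo action a' differs from a
    """
    advantage = np.asarray(advantage, dtype=float)
    prob_action = np.asarray(prob_action, dtype=float)
    differs = np.asarray(differs, dtype=bool)
    # Avoid dividing by zero where the two actions coincide; those entries are zeroed anyway.
    safe_denominator = np.where(differs, 1.0 - prob_action, 1.0)
    return np.where(differs, -advantage / safe_denominator, 0.0)
```

For example, with made-up values $Q(s, 0) = 1$, $Q(s, 1) = 3$ and $\pi(1 \mid s) = 0.25$, the value is $V(s) = 1.5$; executing $a = 1$ gives $A = 1.5$ and the formula returns $-1.5 / 0.75 = -2 = Q(s, 0) - Q(s, 1)$.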

We design the on-policy optimization procedure as follows. At any iteration, we have policy $\pi_\theta$ with parameter $\theta$. We generate an on-policy rollout by sampling at each time $t$ a uniform random variable $u_t \sim \mathrm{Uniform}(0, 1)$, then constructing the on-policy action $a_t = \mathbb{1}[u_t < \sigma(\phi_\theta(s_t))]$ and the pseudo action $a'_t = \mathbb{1}[u_t > \sigma(-\phi_\theta(s_t))]$. Only the on-policy actions $a_t$ are executed in the environment, which returns instant rewards $r_t$. We estimate the advantage functions $A^{\pi_\theta}(s_t, a_t)$ from on-policy samples using techniques from A2C (Mnih et al., 2016) or GAE (Schulman et al., 2015b), and replace the exact values by the estimates $\hat{A}(s_t, a_t)$ when evaluating (12). The details of the advantage estimators are provided in the Appendix. Finally, the difference of the action-value functions can be computed using (18) based purely on on-policy samples. With all the above components, we compute the ARM policy gradient (12) to update the policy parameters. The main algorithm is summarized in Algorithm 1 in the Appendix.
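The procedure above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the logit is taken to be linear, $\phi_\theta(s) = \theta^\top s$ (the paper uses a neural network), the advantages are supplied by the caller, and the function names are hypothetical. The per-step estimate is $(Q(s_t, a'_t) - Q(s_t, a_t))(u_t - \tfrac12)\nabla_\theta \phi_\theta(s_t)$, with the action-value difference recovered from the on-policy advantage as $-\mathbb{1}[a'_t \neq a_t]\, \hat{A}(s_t, a_t) / (1 - \pi_\theta(a_t \mid s_t))$.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_actions(theta, states, u):
    """Shared-noise construction of on-policy and pseudo actions."""
    p1 = sigmoid(states @ theta)                  # pi(a = 1 | s) for a linear logit
    actions = (u < p1).astype(int)                # executed in the environment
    pseudo_actions = (u > 1.0 - p1).astype(int)   # 1[u > sigmoid(-phi)], never executed
    prob_taken = np.where(actions == 1, p1, 1.0 - p1)  # pi(a_t | s_t)
    return actions, pseudo_actions, prob_taken

def arm_policy_gradient(states, u, actions, pseudo_actions, prob_taken, advantages):
    """Sum over time steps of the one-sample ARM policy gradient terms."""
    differs = actions != pseudo_actions
    safe_denominator = np.where(differs, 1.0 - prob_taken, 1.0)
    q_diff = np.where(differs, -advantages / safe_denominator, 0.0)
    # For a linear logit, the gradient of phi_theta(s) w.r.t. theta is the state s itself.
    per_step = (q_diff * (u - 0.5))[:, None] * states
    return per_step.sum(axis=0)
```

In an actual training loop, `u` would be drawn once per step, only `actions` would be sent to the environment, and `advantages` would come from an A2C or GAE estimator; steps where the two actions coincide contribute exactly zero.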

ARM policy gradient as a plug-in alternative.

We note here that despite minor differences in the algorithm (e.g., the need to sample pseudo actions $a'_t$ along with $a_t$), the ARM policy gradient estimator is a convenient plug-in alternative to other baseline on-policy gradient estimators (2,3). All the other components of standard on-policy optimization algorithms (e.g., value function baselines) remain the same.

4 Experiments

We aim to address the following questions through the experiments: (1) Does the proposed ARM policy gradient estimator outperform previous policy gradient estimators on binary-action benchmark tasks? (2) How sensitive are these policy gradient estimators to hyper-parameters, e.g., the batch size?

To address (1), we compare the ARM policy gradient estimator with the A2C gradient estimator (Mnih et al., 2016) and the RELAX gradient estimator (Grathwohl et al., 2017). We aim to study how advantage estimators affect the quality of the downstream gradient estimators: with A2C advantage estimation, we compare the ARM, A2C, and RELAX gradient estimators; with GAE, we compare the ARM and A2C gradient estimators (for the A2C gradient estimator, we simply replace the original A2C advantage estimators with GAE estimators and keep all other algorithmic components the same). We evaluate on-policy optimization with the various gradient estimators on benchmark tasks with binary action space: for each policy gradient estimator, we train the policy for a fixed number of time steps and across 5 random seeds. Each curve in the plots below shows the mean performance, with the shaded area showing the standard deviation across seeds. In Figures 3 and 4, the x-axis shows the number of time steps at training time and the y-axis shows the performance. The results are reported in Section 4.1. To address (2), we vary the batch size to assess its effect on the variance of the policy gradient estimators. We also evaluate the estimators' sensitivity to learning rates. The results are reported in Sections 4.2 and 4.3.

Benchmark Environments.

We focus on benchmark environments with binary action space, illustrated in Figure 2. All tasks are simulated by OpenAI gym and DeepMind control suite (Todorov, 2008; Brockman et al., 2016; Tassa et al., 2018). Some tasks have binary action space by default. In cases where the action space is a real interval, e.g., for Pendulum, we binarize the action space.


Figure 2 (panels: (a) CartPole, (b) MountainCar, (c) Inverted Pendulum, (d) Acrobot Swingup, (e) Pendulum Swingup): Illustration of benchmark tasks. Tasks (a)-(c) are from OpenAI gym (Brockman et al., 2016) and (d)-(e) are from DeepMind Control Suite (Tassa et al., 2018). These tasks either have a binary action space by design ((a)) or have a continuous action space by design that is binarized for our setting ((b)-(e)).

Implementations and Hyper-parameters.

All implementations are in TensorFlow (Abadi et al., 2016) and the RL algorithms are based on OpenAI baselines (Dhariwal et al., 2017). All policies are optimized with Adam using the best learning rate selected from a set of candidate values. We refer to the original code of RELAX but notice potential issues in the original implementation, which we discuss in the Appendix. We implement our own version of RELAX (Grathwohl et al., 2017) by modifying the OpenAI baselines (Dhariwal et al., 2017). Recall that on-policy optimization algorithms alternate between collecting a batch of samples and computing gradient estimators based on these samples; a default batch size is used unless stated otherwise.

Policy Architectures.

We parameterize the logit function $\phi_\theta(s)$ as a two-layer neural network with the state $s$ as input and the same number of hidden units in each layer. Each hidden layer applies a ReLU non-linearity as the activation function. The output is a logit scalar $\phi_\theta(s)$, where $\theta$ consists of the weight matrices and biases of the neural network. For variance reduction we also parameterize a value function baseline with two hidden layers, again with ReLU activations. Both the logit function and the value function have a linear output layer. For the RELAX gradient estimator, we use two parameterized baseline functions, each with two hidden layers.

4.1 Benchmark Comparison

Here we compare the ARM policy gradient estimator with two baseline methods: the A2C gradient estimator (Mnih et al., 2016) and the unbiased RELAX gradient estimator (Grathwohl et al., 2017). We note a critical implementation detail: all gradient estimators require advantage estimates $\hat{A}(s_t, a_t)$, and we do not normalize these estimates before computing the policy gradients, as is commonly done in Dhariwal et al. (2017). As observed in Tucker et al. (2018), such normalization biases the original gradients in exchange for variance reduction, especially for action-dependent baselines such as RELAX (Grathwohl et al., 2017). Since our focus is on unbiased gradient estimators, we remove such normalization (Dhariwal et al., 2017). Note that since GAE applies a combination of Monte Carlo returns and value function critics, the GAE advantage estimator is still slightly biased in order to achieve small variance.

A2C Advantage Estimator.

A2C advantage estimators are simple combinations of Monte Carlo sampled returns and baseline functions. For ARM and A2C, the baseline is the value function critic, while for RELAX the baseline is parameterized and trained to minimize the squared norm of the gradients computed on mini-batches. Let $\hat{g}^{\mathrm{RELAX}}_\theta$ be the RELAX gradient estimator and recall $\hat{g}^{\mathrm{A2C}}_\theta$ as the A2C gradient estimator. Define a generalized RELAX gradient as $\hat{g}_\theta(\alpha) = \alpha\, \hat{g}^{\mathrm{RELAX}}_\theta + (1 - \alpha)\, \hat{g}^{\mathrm{A2C}}_\theta$ for $\alpha \in [0, 1]$. When $\alpha = 0$ we recover an A2C gradient estimator (which is still different from our original A2C gradient estimator since the baseline functions are parameterized and trained differently). When $\alpha = 1$ we recover the original RELAX estimator. In practice we find that $\alpha < 1$ tends to significantly outperform the pure RELAX estimator.

Figure 3 (panels: (a) MountainCar, (b) Inverted Pendulum, (c) Acrobot Swingup, (d) Pendulum Swingup, (e) CartPole v2, (f) CartPole v3): Comparison of gradient estimators for on-policy optimization, with A2C advantage estimation. Here the RELAX gradient estimator is the generalized version with a combination coefficient $\alpha$ such that $\alpha = 1$ recovers the original RELAX estimator while $\alpha = 0$ recovers the A2C estimator. We observe that the ARM gradient estimator consistently outperforms all gradient estimator baselines, both in terms of convergence rate and asymptotic performance.

In Figure 3, we show the performance of the ARM and A2C estimators along with RELAX with a varying combination coefficient $\alpha$. We see that the ARM policy gradient estimator outperforms the other baseline estimators on most tasks. The exception is Inverted Pendulum, where all estimators tend to learn slowly: although some RELAX variants perform best in terms of mean performance, they do not significantly outperform the others when accounting for the standard deviation. For the other benchmark tasks, the ARM policy gradient estimator enables significantly faster policy optimization, while the other baselines either become very unstable or learn very slowly.

Generalized Advantage Estimator (GAE).

GAE constructs advantage estimates using a more complex combination of Monte Carlo samples and the baseline function, to better trade off bias and variance. By construction, the baseline function here must be the value function critic, hence we only evaluate the ARM and A2C gradient estimators.
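For reference, the following is a minimal sketch of the standard GAE recursion from Schulman et al. (2015b), written from that paper's definition rather than from the implementation used in these experiments. With TD residuals $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$, the estimate is $\hat{A}_t = \sum_{l \ge 0} (\gamma\lambda)^l \delta_{t+l}$. The function name and the convention that `values` carries one extra bootstrap entry are assumptions of this sketch.

```python
import numpy as np

def gae_advantages(rewards, values, gamma, lam):
    """Generalized advantage estimates for one trajectory segment.

    rewards: r_0, ..., r_{T-1}
    values:  V(s_0), ..., V(s_T); the last entry bootstraps (0 at a terminal state)
    """
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    deltas = rewards + gamma * values[1:] - values[:-1]  # TD residuals
    advantages = np.zeros_like(rewards)
    running = 0.0
    # Backward recursion: A_t = delta_t + gamma * lam * A_{t+1}.
    for t in reversed(range(len(rewards))):
        running = deltas[t] + gamma * lam * running
        advantages[t] = running
    return advantages
```

Setting `lam=0` reduces to one-step TD residuals (low variance, more bias), while `lam=1` with a zero bootstrap recovers Monte Carlo returns minus the value baseline.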

In Figure 4, we show the comparison. We make several observations: (1) Comparing A2C with GAE, as shown in Figure 4, to A2C with A2C advantage estimator, as shown in Figure 3, we see that in most cases A2C with GAE speeds up the policy optimization significantly. This demonstrates the importance of stable advantage estimation for policy gradient estimator. A notable exception is MountainCar, where A2C with A2C advantage estimator performs better. (2) Comparing ARM with A2C in Figure 4, we still see that the ARM estimator significantly outperforms A2C estimator. For almost all presented tasks, the ARM policy gradient estimator allows for much faster learning and significantly better asymptotic performance.

Figure 4 (panels: (a) MountainCar, (b) Inverted Pendulum, (c) Acrobot Swingup, (d) Pendulum Swingup, (e) CartPole v2, (f) CartPole v3): Comparison of gradient estimators for on-policy optimization, with GAE. We observe that the ARM gradient estimator still consistently outperforms the A2C gradient estimator. For the hard-exploration environment MountainCar, ARM leads to fast convergence while A2C does not converge.

4.2 Effect of Batch Size

On-policy gradient estimators are plagued by low sample efficiency, and we typically need many samples to construct a high-quality gradient estimate. We vary the batch size for each iteration and evaluate the final performance of the gradient estimators. Here, we use GAE as the advantage estimator. We fix the number of iterations (gradient updates) for training, obtained by dividing the number of training steps by the default batch size. Under this setting, we expect the variance of the policy gradients to increase as the batch size decreases, and the final performance to degrade accordingly.

Figure 5 (panels: (a) CartPole-v0, (b) CartPole-v2, (c) Pendulum Swingup, (d) Acrobot Swingup): Comparison of gradient estimators under various batch sizes, with the number of iterations (gradient updates) fixed. The final performance of the trained policy generally increases with the batch size. We observe that across all presented tasks, the performance of ARM dominates that of A2C under all batch sizes. The x-coordinates of the ARM curve are slightly offset to separate it from the A2C curve.

In Figure 5, we show the performance of policies trained via the ARM and A2C gradient estimators. The x-axis shows the batch size, while the y-axis shows the cumulative returns of the last 10 training iterations (with standard deviation across 5 seeds). We see that, as expected, the final performance generally increases with larger batch size. Across all presented tasks, the performance of the ARM gradient estimator significantly dominates that of the A2C gradient estimator. This is also consistent with the results in Figures 3 and 4, where the ARM policy gradient estimator achieves convergence with significantly fewer training iterations.

4.3 Sensitivity to Hyper-parameters

Figure 6 (panels: (a) CartPole v0, (b) CartPole v3, (c) MountainCar, (d) Inverted Pendulum, (e) Acrobot Swingup, (f) Pendulum Swingup): Comparison of gradient estimators under various learning rates and random initializations. We show the quantile plots over distinct hyper-parameter configurations. The ARM gradient estimator is generally more robust than the A2C gradient estimator across the presented tasks.

In addition to batch size, we evaluate the policy gradient estimators’ sensitivity to hyper-parameters such as the learning rate and the random initialization of parameters. In the following setting, we sample the log learning rate uniformly at random and draw the parameter initialization from a fixed set of random seeds. We train policies under each hyper-parameter configuration for a fixed number of steps and record the performance of the final training iterations. For each policy gradient estimator, we sample distinct hyper-parameter configurations and plot the quantiles of their performance in Figure 6. In general, we see that the ARM gradient estimator is more robust than the A2C gradient estimator across all presented tasks.
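The sampling and plotting procedure described above could be sketched as follows. This is a minimal illustration rather than the code used for the experiments: the function names, the base-10 logarithm, and the example range and seeds in the usage section are all assumptions, since the exact values are not reproduced here.

```python
import random


def sample_configs(n_configs, log_lr_low, log_lr_high, seeds, rng):
    """Draw hyper-parameter configurations.

    The log10 learning rate is sampled uniformly from [log_lr_low, log_lr_high]
    and the initialization seed is drawn uniformly from `seeds`.
    All ranges are placeholders supplied by the caller.
    """
    configs = []
    for _ in range(n_configs):
        log_lr = rng.uniform(log_lr_low, log_lr_high)
        configs.append({"lr": 10.0 ** log_lr, "seed": rng.choice(seeds)})
    return configs


def quantile_curve(scores):
    """Return (quantile, score) pairs with scores sorted in ascending order."""
    ordered = sorted(scores)
    n = len(ordered)
    return [((i + 1) / n, s) for i, s in enumerate(ordered)]


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical range and seeds, for illustration only.
    for cfg in sample_configs(3, -5.0, -2.0, [0, 1, 2], rng):
        print(cfg)
```

Plotting the output of `quantile_curve` for each estimator gives a curve in which a more robust estimator lies higher across the whole range of quantiles, not only at the top.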

5 Conclusion

We propose the ARM policy gradient estimator as a convenient low-variance plug-in alternative to prior baseline on-policy gradient estimators, with a simple on-policy optimization algorithm for tasks with binary action spaces. We leave the extension to more general discrete action spaces as exciting future work.


Appendix A Further Experiment Details

A.1 CartPole Environment Setup

The CartPole experiments are defined by an environment parameter that specifies the maximum reward the agent can achieve (i.e., the number of time steps for which it balances the system before the episode terminates).

In our main experiments, we set this parameter separately for CartPole-v0, CartPole-v1, CartPole-v2 and CartPole-v3. The difficulty increases with the parameter: when it is large, the agent is less likely to obtain many full trajectories within a single iteration, which makes return-based estimation more difficult; long horizons also make policy optimization hard.

A.2 Advantage Estimator

We consider two popular advantage estimators in wide use (Mnih et al., 2016; Schulman et al., 2015b) for on-policy optimization applications. The objective of both estimators is to approximate the advantage function under the current policy.

A2C Advantage Estimator.

We construct the A2C estimator at time $t$ as

$$\hat{A}_t = \hat{R}_t - V(s_t),$$

where $\hat{R}_t = \sum_{t' \ge t} \gamma^{t'-t} r_{t'}$ are Monte-Carlo estimates of the partial sum of returns (along the sampled trajectories). The critic $V$ is trained by regression over the partial returns to approximate the value function $V^\pi$.
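As a concrete illustration of the A2C advantage estimator, the sketch below computes discounted partial returns along one trajectory and subtracts the critic's predictions. The function names and the optional `bootstrap_value` argument (for trajectories cut off before termination) are assumptions made for this example, not part of the original implementation.

```python
def discounted_returns(rewards, gamma, bootstrap_value=0.0):
    """Discounted partial sums of rewards along a single trajectory."""
    returns = [0.0] * len(rewards)
    running = bootstrap_value
    # Accumulate from the end of the trajectory backwards.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def a2c_advantages(rewards, values, gamma, bootstrap_value=0.0):
    """Monte-Carlo return minus the critic's value prediction at each step."""
    returns = discounted_returns(rewards, gamma, bootstrap_value)
    return [ret - v for ret, v in zip(returns, values)]


if __name__ == "__main__":
    print(a2c_advantages([1.0, 1.0, 1.0], [0.5, 0.5, 0.5], gamma=0.5))
```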

Generalized Advantage Estimator (GAE).

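GAE is indexed by two parameters $\gamma$ and $\lambda$, where $\gamma$ is the discount factor and $\lambda$ is an additional trace parameter that determines the bias-variance trade-off of the final estimate. We define the TD-errors

$$\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t),$$

where $V$ is a value function critic trained by regression over returns. GAE at time $t$ is computed as an exponentially weighted sum of TD-errors across time

$$\hat{A}_t^{\mathrm{GAE}(\gamma,\lambda)} = \sum_{l \ge 0} (\gamma\lambda)^l \, \delta_{t+l}.$$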

Though the optimal parameter is problem dependent, a common practice is to set .
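The GAE computation is usually implemented as a backward recursion over TD-errors. The sketch below follows the standard formulation of Schulman et al. (2015b) for a single trajectory; the function name and the `last_value` argument (the critic's prediction for the state after the final step) are assumptions made for this example, and episode boundaries inside a batch are not handled.

```python
def gae_advantages(rewards, values, gamma, lam, last_value=0.0):
    """Generalized advantage estimates for one trajectory.

    rewards[t] and values[t] are the reward and critic prediction at step t;
    last_value is the critic prediction for the state following the last step.
    """
    advantages = [0.0] * len(rewards)
    next_value = last_value
    running = 0.0
    for t in reversed(range(len(rewards))):
        # TD-error at step t.
        delta = rewards[t] + gamma * next_value - values[t]
        # Exponentially weighted sum of the current and future TD-errors.
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages


if __name__ == "__main__":
    print(gae_advantages([1.0, 1.0], [0.0, 0.0], gamma=1.0, lam=1.0))
```

Setting `lam` to zero recovers the one-step TD-error, while setting it to one recovers the Monte-Carlo return minus the critic's prediction, which makes the bias-variance role of the trace parameter explicit.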

A.3 Baseline Implementations

We have compared the ARM policy gradient estimator with the A2C policy gradient estimator and the recently proposed RELAX gradient estimator. We implement all three policy gradient estimators based on OpenAI baselines (Dhariwal et al., 2017). Though each policy gradient estimator requires a somewhat different implementation (e.g., recording the pseudo action and random noise for the ARM policy gradient), we have ensured that the three implementations share as much common structure as possible.

Although the RELAX code (Grathwohl et al., 2017) is publicly available and is built on top of OpenAI baselines, we did not run it directly because of potential issues in the original implementation of RELAX. We implement our own version of RELAX and note some of the differences from their code.

Difference 1: Policy gradient computation.

In the following, we point out potential issues with the implementation of Grathwohl et al. (2017), referring to the latest commit to the RELAX repository (commit 0e6623d) as of this writing.

Recall that on-policy optimization algorithms alternate between performing rollouts and performing updates based on the rollout samples. At rollout time, on-policy samples $(s_t, a_t, r_t)$ are collected. Recall that the A2C policy gradient estimator takes the following form

$$\hat{g}_{\mathrm{A2C}} = \sum_t \hat{A}_t \, \nabla_\theta \log \pi_\theta(a_t \mid s_t), \qquad (22)$$

where $\hat{A}_t$ are the advantage estimators computed from the on-policy samples. Importantly, in (22) the actions $a_t$ should be the on-policy samples. The intuition is that on-policy actions match their corresponding advantages $\hat{A}_t$: if $\hat{A}_t > 0$ for a certain action $a_t$ in state $s_t$, the gradient update will increase the probability $\pi_\theta(a_t \mid s_t)$. In some cases, one implements the A2C gradient estimator by re-sampling actions $\tilde{a}_t \sim \pi_\theta(\cdot \mid s_t)$ at each $s_t$ during training, resulting in the following estimator

$$\tilde{g}_{\mathrm{A2C}} = \sum_t \hat{A}_t \, \nabla_\theta \log \pi_\theta(\tilde{a}_t \mid s_t). \qquad (23)$$

We remark that this new gradient estimator with re-sampled actions is biased, i.e., $\mathbb{E}[\tilde{g}_{\mathrm{A2C}}] \neq \mathbb{E}[\hat{g}_{\mathrm{A2C}}]$. The bias comes from the fact that when $\tilde{a}_t \neq a_t$, we assign the advantage $\hat{A}_t$ to the wrong action $\tilde{a}_t$, leading to mismatched credit assignment. We find that the latest version of the RELAX code (Grathwohl et al., 2017) implements this biased estimator (23); in fact, it implements the biased estimator for both the A2C baseline and the proposed RELAX gradient estimator.

Specifically, in TensorFlow terminology (Abadi et al., 2016), the advantages, actions and states should all be fed into the loss and gradient computation via placeholders. However, the RELAX implementation feeds the advantages and states via placeholders while taking the actions from train_model.a0, where a0 stands for actions sampled from the policy network train_model. In practice, this causes the policy model to re-sample an independent set of actions, leading to biased estimates. The re-sampling bias is severe when the re-sampled actions frequently differ from the on-policy actions, especially when the policy is still random during the initial stage of training. Later in training, when the policy becomes more deterministic, the bias decreases since the re-sampled actions are more likely to coincide with the on-policy actions. In our implementation, we correct such potential bugs.
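The effect of re-sampling can be checked exactly in a toy setting. The sketch below assumes a single state, a Bernoulli policy with logit `phi`, and a hypothetical advantage function; it enumerates all outcomes instead of sampling. It is meant only to illustrate the mismatch described above and does not reproduce either code base. When the re-sampled action is independent of the action that generated the advantage, the expected gradient factorizes and the score-function term averages to zero.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def expected_gradients(phi, advantage):
    """Exact expectations of the on-policy and re-sampled estimators.

    Returns (on_policy, resampled), both gradients with respect to the logit phi.
    """
    p = sigmoid(phi)
    prob = {1: p, 0: 1.0 - p}
    # Score function d/dphi log pi(a) of a Bernoulli policy with logit phi.
    score = {1: 1.0 - p, 0: -p}

    # The advantage is paired with the action that produced it.
    on_policy = sum(prob[a] * advantage(a) * score[a] for a in (0, 1))

    # The advantage of action a is paired with an independent re-sampled action b.
    resampled = sum(
        prob[a] * prob[b] * advantage(a) * score[b]
        for a in (0, 1)
        for b in (0, 1)
    )
    return on_policy, resampled


def toy_advantage(a):
    # Hypothetical advantage: action 1 is better than action 0.
    return 1.0 if a == 1 else -1.0


if __name__ == "__main__":
    print(expected_gradients(0.0, toy_advantage))
```

In this single-state example the re-sampled estimator carries no learning signal in expectation, whereas the on-policy pairing yields a positive gradient that favors the better action.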

Difference 2: Average gradients over states not trajectories.

In the original development of RELAX, policy gradients are computed per trajectory and averaged across multiple trajectories. A common practice in on-policy algorithm implementations (Dhariwal et al., 2017) is to average policy gradients across states. We follow this latter practice. As a result, we can collect a fixed number of steps per iteration instead of a fixed number of rollouts (which can result in a varying number of steps), as in the original work (Grathwohl et al., 2017). We believe that this practice allows for a fair comparison.

Appendix B Algorithm

In the pseudocode below, we omit the training of the value function baseline. Following common practice (Schulman et al., 2015a; Mnih et al., 2016; Schulman et al., 2017a), the value function baseline is trained by regression over Monte Carlo returns.

1:  Input: total number of time steps , training batch size , learning rate for policy , discount factor
2:  Initialize: Policy network with parameters .
3:  for  do
4:     // Rollout
5:     1. In state , compute logit . Sample . If , let the action and otherwise ; if , let the shadow action and otherwise . Receive instant reward . Label .
6:     if  then
7:        // Training policy
8:        2. Estimate advantage function for each step from the collected sample tuples labeled as .
9:        3. Construct differences of advantage functions at each step based on (18)

10:        4. Construct the surrogate loss function

11:        5. Update .
12:     end if
13:  end for
Algorithm 1 On-Policy Optimization with ARM Policy Gradient
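To make the roles of the action, the shadow (pseudo) action and the shared noise concrete, the following sketch checks the one-step ARM identity for a single Bernoulli variable numerically. It is not a reproduction of Algorithm 1: the function names, the reward function `f`, and the convention that the executed action is the indicator of the noise falling below the sigmoid of the logit are assumptions taken from the general ARM construction for binary variables. In the policy gradient setting, `f` would be replaced by return or advantage estimates.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def arm_sample(phi, u, f):
    """Single-sample ARM estimate of d/dphi E[f(a)], a ~ Bernoulli(sigmoid(phi))."""
    action = 1 if u < sigmoid(phi) else 0          # assumed executed action
    pseudo_action = 1 if u > sigmoid(-phi) else 0  # shadow action from the same noise
    # The estimate vanishes whenever the action and the shadow action agree.
    return (f(pseudo_action) - f(action)) * (u - 0.5)


def arm_expectation(phi, f, n_grid=100_000):
    """Approximate E_u[arm_sample] with a midpoint grid over u in (0, 1)."""
    total = 0.0
    for i in range(n_grid):
        u = (i + 0.5) / n_grid
        total += arm_sample(phi, u, f)
    return total / n_grid


def true_gradient(phi, f):
    """Closed-form d/dphi of p * f(1) + (1 - p) * f(0) with p = sigmoid(phi)."""
    p = sigmoid(phi)
    return p * (1.0 - p) * (f(1) - f(0))


def toy_reward(a):
    # Hypothetical reward used only for this check.
    return 2.0 if a == 1 else -1.0


if __name__ == "__main__":
    print(arm_expectation(0.7, toy_reward), true_gradient(0.7, toy_reward))
```

Because the estimate is exactly zero whenever the two actions coincide, only the steps on which they differ contribute to the update, which is one intuition for the variance reduction.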

Appendix C Further Discussions on ARM Policy Gradient Estimator

For simplicity we fix a state and use a simplified notation: , , , and . Here the logit is parameterized as by parameter . Let be the noise used for generating actions such that if and the pseudo action if . Without loss of generality, assume .

The ARM policy gradient estimator can be simplified to be the following