1 Introduction
Model-free reinforcement learning (RL) with flexible function approximators, such as neural networks (i.e., deep reinforcement learning), has shown success in goal-directed sequential decision-making problems in high-dimensional state spaces (Mnih et al., 2015; Schulman et al., 2015b; Lillicrap et al., 2015; Silver et al., 2016). Policy gradient methods (Williams, 1992; Sutton et al., 2000; Kakade, 2002; Peters & Schaal, 2006; Silver et al., 2014; Schulman et al., 2015a, 2017) are a class of model-free RL algorithms that have found widespread adoption due to their stability and ease of use. Because these methods directly estimate the gradient of the expected reward RL objective, they exhibit stable convergence both in theory and practice (Sutton et al., 2000; Kakade, 2002; Schulman et al., 2015a; Gu et al., 2017b). In contrast, methods such as Q-learning lack convergence guarantees in the case of nonlinear function approximation (Sutton & Barto, 1998).

On-policy Monte Carlo policy gradient estimates suffer from high variance, and therefore require large batch sizes to reliably estimate the gradient for stable iterative optimization (Schulman et al., 2015a). This limits their applicability to real-world problems, where sample efficiency is a critical constraint. Actor-critic methods (Sutton et al., 2000; Silver et al., 2014) and weighted return estimation (Tesauro, 1995; Schulman et al., 2015b) replace the high-variance Monte Carlo return with an estimate based on the sampled return and a function approximator. This reduces variance at the expense of introducing bias from the function approximator, which can lead to instability and sensitivity to hyperparameters. In contrast, state-dependent baselines (Williams, 1992; Weaver & Tao, 2001) reduce variance without introducing bias, and hence do not compromise the stability of the original method.

Gu et al. (2017a); Grathwohl et al. (2018); Liu et al. (2018); Wu et al. (2018) present promising results extending the classic state-dependent baselines to state-action-dependent baselines. The standard explanation for the benefits of such approaches is that they achieve large reductions in variance (Grathwohl et al., 2018; Liu et al., 2018), which translates to improvements over methods that only condition the baseline on the state. This line of investigation is attractive because, by definition, baselines do not introduce bias and thus do not compromise the stability of the underlying policy gradient algorithm, while still providing improved sample efficiency. In other words, they retain the advantages of the underlying algorithms with no unintended side effects.
In this paper, we aim to improve our understanding of state-action-dependent baselines and to identify targets for further unbiased variance reduction. Toward this goal, we present a decomposition of the variance of the policy gradient estimator which isolates the potential variance reduction due to state-action-dependent baselines. We numerically evaluate the variance components on a synthetic linear-quadratic-Gaussian (LQG) task, where the variances are nearly analytically tractable, and on benchmark continuous control tasks, and draw two conclusions: (1) on these tasks, a learned state-action-dependent baseline does not significantly reduce variance over a learned state-dependent baseline, and (2) the variance caused by using a function approximator for the value function or state-dependent baseline is much larger than the variance reduction from adding action dependence to the baseline.
To resolve the apparent contradiction arising from (1), we carefully reviewed the open-source implementations accompanying Q-Prop (Gu et al., 2017a), Stein control variates (Liu et al., 2018), and LAX (Grathwohl et al., 2018) (at the time of submission, code for Wu et al. (2018) was not available) and show that subtle implementation decisions cause the code to diverge from the unbiased methods presented in the papers. We explain and empirically evaluate variants of these prior methods to demonstrate that these subtle implementation details, which trade variance for bias, are in fact crucial for their empirical success. These results motivate further study of these design decisions.
The second observation (2), that function approximators poorly estimate the value function, suggests that there is room for improvement. Although many common benchmark tasks are finite-horizon problems, most value function parameterizations ignore this fact. We propose a horizon-aware value function parameterization, and this improves performance compared with the state-action-dependent baseline without biasing the underlying method.
We emphasize that without the open-source code accompanying (Gu et al., 2017a; Liu et al., 2018; Grathwohl et al., 2018), this work would not be possible. Releasing the code has allowed us to present a new view on their work and to identify interesting implementation decisions for further study that the original authors may not have been aware of.
We have made our code and additional visualizations available at https://sites.google.com/view/miragerl.
2 Background
Reinforcement learning aims to learn a policy for an agent to maximize a sum of reward signals (Sutton & Barto, 1998). The agent starts at an initial state $s_0 \sim P(s_0)$. Then, the agent repeatedly samples an action $a_t$ from a policy $\pi_\theta(a_t \mid s_t)$ with parameters $\theta$, receives a reward $r_t \sim P(r_t \mid s_t, a_t)$, and transitions to a subsequent state $s_{t+1}$ according to the Markovian dynamics $P(s_{t+1} \mid a_t, s_t)$ of the environment. This generates a trajectory of states, actions, and rewards $(s_0, a_0, r_0, s_1, a_1, \ldots)$. We abbreviate the trajectory after the initial state and action by $\tau$.
The goal is to maximize the discounted sum of rewards along sampled trajectories,
$$J(\theta) = \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^t r_t \right],$$
where $\gamma \in [0, 1)$ is a discount parameter and $\rho_\pi(s) = \sum_{t=0}^{\infty} \gamma^t P(s_t = s)$ is the unnormalized discounted state visitation frequency.
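As a small concrete check of the objective above, the discounted return of a single sampled trajectory can be computed directly (a minimal NumPy sketch; the function name is ours, not from the paper):

```python
import numpy as np

def discounted_return(rewards, gamma):
    """Sum_t gamma^t * r_t for one sampled trajectory."""
    rewards = np.asarray(rewards, dtype=np.float64)
    discounts = gamma ** np.arange(len(rewards))
    return float(np.sum(discounts * rewards))
```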
Policy gradient methods differentiate the expected return objective with respect to the policy parameters and apply gradient-based optimization (Sutton & Barto, 1998). The policy gradient can be written as an expectation amenable to Monte Carlo estimation,
$$\nabla_\theta J(\theta) = \mathbb{E}_{s \sim \rho_\pi, a \sim \pi}\left[ Q^\pi(s, a) \nabla_\theta \log \pi_\theta(a \mid s) \right] = \mathbb{E}_{s \sim \rho_\pi, a \sim \pi}\left[ A^\pi(s, a) \nabla_\theta \log \pi_\theta(a \mid s) \right],$$
where $Q^\pi(s, a)$ is the state-action value function, $V^\pi(s) = \mathbb{E}_{a \sim \pi}\left[ Q^\pi(s, a) \right]$ is the value function, and $A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)$ is the advantage function. The equality in the last line follows from the fact that $\mathbb{E}_{a \sim \pi}\left[ \nabla_\theta \log \pi_\theta(a \mid s) \right] = 0$ (Williams, 1992).
In practice, most policy gradient methods (including this paper) use the undiscounted state visitation frequencies (i.e., $\gamma = 1$ for $\rho_\pi$), which produces a biased estimator for $\nabla_\theta J(\theta)$ and more closely aligns with maximizing average reward (Thomas, 2014).
We can estimate the gradient with a Monte Carlo estimator,
$$\hat{g}(s, a, \tau) = \hat{A}(s, a, \tau) \nabla_\theta \log \pi_\theta(a \mid s), \qquad (1)$$
where $s \sim \rho_\pi$, $a \sim \pi(\cdot \mid s)$, $\tau \sim P(\tau \mid s, a)$, and $\hat{A}$ is an estimator of the advantage function up to a state-dependent constant (e.g., the discounted return $\sum_t \gamma^t r_t$).
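To make the estimator concrete, the following sketch (ours, not from the paper) applies the Monte Carlo estimator of Eq. 1 to a one-step bandit with a one-dimensional Gaussian policy, where the true gradient is known in closed form; the reward function and all names are invented for illustration:

```python
import numpy as np

def reinforce_grad_estimate(mu, n_samples, baseline=0.0, seed=0):
    """Score-function (REINFORCE) estimate of dJ/dmu for a 1-D Gaussian
    policy N(mu, 1) on a toy bandit with reward r(a) = -(a - 2)^2.
    The analytic gradient is -2 * (mu - 2)."""
    rng = np.random.default_rng(seed)
    a = rng.normal(mu, 1.0, size=n_samples)  # a ~ pi_mu
    r = -(a - 2.0) ** 2                      # reward for each sampled action
    score = a - mu                           # d/dmu log N(a; mu, 1)
    return float(np.mean((r - baseline) * score))
```

Subtracting any constant `baseline` leaves the expectation unchanged, which is the control-variate property discussed in Section 2.2.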
2.1 Advantage Function Estimation
Given a value function estimator, $\hat{V}(s)$, we can form a $k$-step advantage function estimator,
$$\hat{A}^{(k)}(s_t, a_t, \tau_{t+1}) = \sum_{i=0}^{k-1} \gamma^i r_{t+i} + \gamma^k \hat{V}(s_{t+k}) - \hat{V}(s_t),$$
where $k \in \{1, 2, \ldots, \infty\}$. $\hat{A}^{(\infty)}$ produces an unbiased gradient estimator when used in Eq. 1 regardless of the choice of $\hat{V}$. However, the other estimators ($k < \infty$) produce biased estimates unless $\hat{V} = V^\pi$. Advantage actor-critic (A2C and A3C) methods (Mnih et al., 2016) and generalized advantage estimators (GAE) (Schulman et al., 2015b) use a single $k$ or a linear combination of $\hat{A}^{(k)}$ estimators as the advantage estimator in Eq. 1. In practice, the value function estimator is never perfect, so these methods produce biased gradient estimates. As a result, the hyperparameters that control the combination of $\hat{A}^{(k)}$ must be carefully tuned to balance bias and variance (Schulman et al., 2015b), demonstrating the perils and sensitivity of biased gradient estimators. For the experiments in this paper, unless stated otherwise, we use the GAE estimator. Our focus will be on the additional bias introduced beyond that of GAE.
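The GAE estimator referenced above is an exponentially weighted combination of the $k$-step estimators and is usually computed with a backward recursion over temporal-difference residuals. A minimal sketch under the standard formulation (Schulman et al., 2015b); the names are ours:

```python
import numpy as np

def gae_advantages(rewards, values, gamma, lam):
    """Generalized advantage estimates via the standard backward recursion:
    A_t = delta_t + gamma * lam * A_{t+1}, with
    delta_t = r_t + gamma * V(s_{t+1}) - V(s_t).
    `values` has length T+1: it includes a bootstrap value for the final state."""
    T = len(rewards)
    adv = np.zeros(T)
    last = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        last = delta + gamma * lam * last
        adv[t] = last
    return adv
```

With `lam=1` and a zero value function this recovers the discounted Monte Carlo return; with `lam=0` it reduces to the one-step TD residual.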
2.2 Baselines for Variance Reduction
The policy gradient estimator in Eq. 1 typically suffers from high variance. Control variates are a well-studied technique for reducing variance in Monte Carlo estimators without biasing the estimator (Owen, 2013). They require a correlated function whose expectation we can analytically evaluate or estimate with low variance. Because $\mathbb{E}_{a \sim \pi}\left[ \nabla_\theta \log \pi_\theta(a \mid s) \right] = 0$, any function of the form $\phi(s) \nabla_\theta \log \pi_\theta(a \mid s)$ can serve as a control variate, where $\phi(s)$ is commonly referred to as a baseline (Williams, 1992). With a baseline, the policy gradient estimator becomes
$$\hat{g}(s, a, \tau) = \left( \hat{A}(s, a, \tau) - \phi(s) \right) \nabla_\theta \log \pi_\theta(a \mid s),$$
which does not introduce bias. Several recent methods (Gu et al., 2017a; Thomas & Brunskill, 2017; Grathwohl et al., 2018; Liu et al., 2018; Wu et al., 2018) have extended the approach to state-action-dependent baselines (i.e., $\phi(s, a)$ is a function of the state and the action). With a state-action-dependent baseline $\phi(s, a)$, the policy gradient estimator is
$$\hat{g}(s, a, \tau) = \left( \hat{A}(s, a, \tau) - \phi(s, a) \right) \nabla_\theta \log \pi_\theta(a \mid s) + \nabla_\theta \mathbb{E}_{a \sim \pi}\left[ \phi(s, a) \right]. \qquad (2)$$
Now, $\nabla_\theta \mathbb{E}_{a \sim \pi}\left[ \phi(s, a) \right] \neq 0$ in general, so it must be analytically evaluated or estimated with low variance for the baseline to be effective.

When the action set is discrete and not large, it is straightforward to analytically evaluate the expectation in the second term (Gu et al., 2017b; Gruslys et al., 2017). In the continuous action case, Gu et al. (2017a) set $\phi(s, a)$ to be the first-order Taylor expansion of a learned advantage function approximator. Because $\phi(s, a)$ is linear in $a$, the expectation can be analytically computed. Gu et al. (2017b); Liu et al. (2018); Grathwohl et al. (2018) set $\phi(s, a)$ to be a learned function approximator and leverage the reparameterization trick to estimate $\nabla_\theta \mathbb{E}_{a \sim \pi}\left[ \phi(s, a) \right]$ with low variance when $\pi$ is reparameterizable (Kingma & Welling, 2013; Rezende et al., 2014).
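As an illustration of the Q-Prop-style construction, when $\phi(s, a)$ is a first-order Taylor expansion in $a$, its expectation under a Gaussian policy is available in closed form. A minimal sketch (all names are ours; the expansion point and coefficients are placeholders for a learned critic):

```python
import numpy as np

def taylor_baseline(a, phi0, grad_phi, a_bar):
    """First-order Taylor expansion of an advantage approximator around a_bar:
    phi(s, a) = phi(s, a_bar) + grad_phi(s, a_bar) . (a - a_bar)."""
    return phi0 + grad_phi @ (a - a_bar)

def expected_taylor_baseline(mu, phi0, grad_phi, a_bar):
    """E_{a ~ N(mu, sigma^2 I)}[phi(s, a)] in closed form: the linear term
    averages to grad_phi . (mu - a_bar), independent of sigma."""
    return phi0 + grad_phi @ (mu - a_bar)
```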
3 Policy Gradient Variance Decomposition
Now, we analyze the variance of the policy gradient estimator with a state-action-dependent baseline (Eq. 2). This is an unbiased estimator of $\nabla_\theta J(\theta)$ for any choice of $\phi(s, a)$. For theoretical analysis, we assume that we can analytically evaluate the expectation over $a$ in the second term, because it only depends on $\phi$ and $\pi$, which we can evaluate multiple times without querying the environment.

The variance of the policy gradient estimator in Eq. 2, $\Sigma := \mathrm{Var}_{s, a, \tau}\left( \hat{g}(s, a, \tau) \right)$, can be decomposed using the law of total variance as
$$\Sigma = \mathbb{E}_s\left[ \mathrm{Var}_{a, \tau \mid s}\left( \hat{g}(s, a, \tau) \right) \right] + \mathrm{Var}_s\left( \mathbb{E}_{a \mid s}\left[ \hat{A}(s, a) \nabla_\theta \log \pi_\theta(a \mid s) \right] \right),$$
where the simplification of the second term is because the baseline does not introduce bias. We can further decompose the first term,
$$\mathbb{E}_s\left[ \mathrm{Var}_{a, \tau \mid s}\left( \hat{g} \right) \right] = \mathbb{E}_{s, a}\left[ \mathrm{Var}_{\tau \mid s, a}\left( \hat{A}(s, a, \tau) \nabla_\theta \log \pi_\theta(a \mid s) \right) \right] + \mathbb{E}_s\left[ \mathrm{Var}_{a \mid s}\left( \left( \hat{A}(s, a) - \phi(s, a) \right) \nabla_\theta \log \pi_\theta(a \mid s) \right) \right],$$
where $\hat{A}(s, a) := \mathbb{E}_\tau\left[ \hat{A}(s, a, \tau) \right]$. Putting the terms together, we arrive at the following:
$$\Sigma = \underbrace{\mathbb{E}_{s, a}\left[ \mathrm{Var}_{\tau \mid s, a}\left( \hat{A}(s, a, \tau) \nabla_\theta \log \pi_\theta(a \mid s) \right) \right]}_{\Sigma_\tau} + \underbrace{\mathbb{E}_s\left[ \mathrm{Var}_{a \mid s}\left( \left( \hat{A}(s, a) - \phi(s, a) \right) \nabla_\theta \log \pi_\theta(a \mid s) \right) \right]}_{\Sigma_a} + \underbrace{\mathrm{Var}_s\left( \mathbb{E}_{a \mid s}\left[ \hat{A}(s, a) \nabla_\theta \log \pi_\theta(a \mid s) \right] \right)}_{\Sigma_s}. \qquad (3)$$
Notably, only $\Sigma_a$ involves $\phi$, and it is clear that the variance-minimizing choice of $\phi(s, a)$ is $\hat{A}(s, a)$. For example, if $\hat{A}(s, a, \tau) = \sum_t \gamma^t r_t$, the discounted return, then the optimal choice of $\phi(s, a)$ is $\hat{A}(s, a) = \mathbb{E}_\tau\left[ \sum_t \gamma^t r_t \right] = Q^\pi(s, a)$, the state-action value function.

The variance in the on-policy gradient estimate arises from the fact that we only collect data from a limited number of states $s$, that we only take a single action $a$ in each state, and that we only roll out a single path $\tau$ from there on. Intuitively, $\Sigma_\tau$ describes the variance due to sampling a single $\tau$, $\Sigma_a$ describes the variance due to sampling a single $a$, and lastly $\Sigma_s$ describes the variance coming from visiting a limited number of states. The magnitudes of these terms depend on task-specific parameters and the policy.

The relative magnitudes of the variance terms will determine the effectiveness of the optimal state-action-dependent baseline. In particular, denoting the value of the second term when using a state-dependent baseline by $\Sigma_a^{\phi(s)}$, the variance of the policy gradient estimator with a state-dependent baseline is $\Sigma_\tau + \Sigma_a^{\phi(s)} + \Sigma_s$. When $\phi(s, a)$ is optimal, $\Sigma_a$ vanishes, so the variance is $\Sigma_\tau + \Sigma_s$. Thus, an optimal state-action-dependent baseline will be beneficial when $\Sigma_a^{\phi(s)}$ is large relative to $\Sigma_\tau + \Sigma_s$. We expect this to be the case when single actions have a large effect on the overall discounted return (e.g., in a Cliffworld domain, where a single action could cause the agent to fall off the cliff and suffer a large negative reward). Practical implementations of a state-action-dependent baseline require learning $\phi(s, a)$, which will further restrict the potential benefits.
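The law-of-total-variance decomposition above can be checked numerically on a toy nested sampling model where all three terms are analytically 1. This sketch (ours, not from the paper) uses the model's known conditional means; the paper's actual measurement procedure is given in Appendix 10:

```python
import numpy as np

def variance_decomposition(n=200_000, seed=0):
    """Checks the three-term law-of-total-variance decomposition behind Eq. 3
    on a toy nested sampler: s ~ N(0,1), a|s ~ N(s,1), g|s,a ~ N(a,1).
    Uses the known conditional means E[g|s,a] = a and E[g|s] = s.
    Analytically, each term equals 1 and the total variance is 3."""
    rng = np.random.default_rng(seed)
    s = rng.normal(0.0, 1.0, n)
    a = rng.normal(s, 1.0)
    g = rng.normal(a, 1.0)
    sigma_tau = np.var(g - a)  # E_{s,a}[Var(g | s, a)]
    sigma_a = np.var(a - s)    # E_s[Var_a(E[g | s, a] | s)]
    sigma_s = np.var(s)        # Var_s(E[g | s])
    return sigma_tau, sigma_a, sigma_s, np.var(g)
```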
3.1 Variance in LQG Systems
Linear-quadratic-Gaussian (LQG) systems (Stengel, 1986) are a family of widely studied continuous control problems with closed-form expressions for optimal controls, quadratic value functions, and Gaussian state marginals. We first analyze the variance decomposition in an LQG system because it allows nearly analytic measurement of the variance terms in Eq. 3 (see Appendix 9 for measurement details).
Figure 1 plots the variance terms for a simple 2D point mass task using discounted returns as the choice of $\hat{A}(s, a, \tau)$ (see Appendix 9 for task details). As expected, without a baseline ($\phi = 0$), $\Sigma_a$ is much larger than $\Sigma_\tau$ and $\Sigma_s$. Further, using the value function as a state-dependent baseline ($\phi(s) = V^\pi(s)$) results in a large variance reduction (compare the lines for $\phi = 0$ and $\phi(s) = V^\pi(s)$ in Figure 1). An optimal state-action-dependent baseline would reduce $\Sigma_a$ to 0; however, for this task, such a baseline would not significantly reduce the total variance because $\Sigma_\tau$ is already large relative to $\Sigma_a^{\phi(s)}$ (Figure 1).

We also plot the effect of using GAE (Schulman et al., 2015b) on $\Sigma_\tau$. (For the LQG system, we use the oracle value function to compute the GAE estimator. In the rest of the experiments, GAE is computed using a learned value function.) Baselines and GAE reduce different components of the gradient variance, and this figure compares their effects throughout the learning process.
3.2 Empirical Variance Measurements
We estimate the magnitude of the three terms for benchmark continuous action tasks as training proceeds. Once we decide on the form of $\phi$, approximating it is a learning problem in itself. To understand the effect of the approximation error, we evaluate both the case where we have access to an oracle $\phi$ and the case where we learn a function approximator for $\phi$. Estimating the terms in Eq. 3 is nontrivial because the expectations and variances are not available in closed form. We construct unbiased estimators of the variance terms and repeatedly draw samples to drive down the measurement error (see Appendix 10 for details). We train a policy using TRPO (Schulman et al., 2015a) and, as training proceeds, we plot each of the individual terms $\Sigma_\tau$, $\Sigma_a$, and $\Sigma_s$ of the gradient estimator variance for Humanoid in Figure 2 and for HalfCheetah in Appendix Figure 9. (The relative magnitudes of the variance terms depend on the task, policy, and network structures. For evaluation, we use a well-tuned implementation of TRPO; see Appendix 8.4.) Additionally, we repeat the experiment with the horizon-aware value functions (described in Section 5) in Appendix Figures 10 and 11.

We plot the variance decomposition for two choices of $\hat{A}(s, a, \tau)$: the discounted return, $\sum_t \gamma^t r_t$, and GAE (Schulman et al., 2015b). In both cases, we set $\phi(s) = \mathbb{E}_a\left[ \hat{A}(s, a) \right]$ and $\phi(s, a) = \hat{A}(s, a)$ (the optimal state-action-dependent baseline). When using the discounted return, we found that $\Sigma_\tau$ dominates $\Sigma_a^{\phi(s)}$, suggesting that even an optimal state-action-dependent baseline (which would reduce $\Sigma_a$ to 0) would not improve over a state-dependent baseline (Figure 2). In contrast, with GAE, $\Sigma_\tau$ is reduced, and now the optimal state-action-dependent baseline would reduce the overall variance compared to a state-dependent baseline. However, when we used function approximators for $\phi$, we found that the state-dependent and state-action-dependent function approximators produced similar variance, and much higher variance than their oracle counterparts (Figure 2). This suggests that, in practice, we would not see improved learning performance using a state-action-dependent baseline over a state-dependent baseline on these tasks. We confirm this in later experiments in Sections 4 and 5.

Furthermore, we see that closing the function approximation gap of $\hat{V}(s)$ and $\phi(s, a)$ would produce much larger reductions in variance than using the optimal state-action-dependent baseline over the state-dependent baseline. This suggests that improved function approximation of both $\hat{V}(s)$ and $\phi(s, a)$ should be a priority. Finally, $\Sigma_s$ is relatively small in both cases, suggesting that focusing on reducing variance from the first two terms of Eq. 3, $\Sigma_\tau$ and $\Sigma_a$, will be more fruitful.
4 Unveiling the Mirage
In the previous section, we decomposed the policy gradient variance into several sources, and we found that in practice, the source of variance targeted by the state-action-dependent baseline is not reduced when a function approximator for $\phi$ is used. However, this appears to be a paradox: if the state-action-dependent baseline does not reduce variance, how are prior methods that propose state-action-dependent baselines able to report significant improvements in learning performance? We analyze implementations accompanying these works, and show that they actually introduce bias into the policy gradient due to subtle implementation decisions. (The implementation of the state-action-dependent baselines for continuous control in Grathwohl et al. (2018) suffered from two critical issues (see Appendix 8.3 for details), so it was challenging to determine the source of their observed performance. After correcting these issues in their implementation, we do not observe an improvement over a state-dependent baseline, as shown in Appendix Figure 13. We emphasize that these observations are restricted to the continuous control experiments, as the rest of the experiments in that paper use a separate codebase that is unaffected.) We find that these methods are effective not because of unbiased variance reduction, but instead because they introduce bias in exchange for variance reduction.
4.1 Advantage Normalization
Although Q-Prop and IPG (Gu et al., 2017b) (when $\nu = 0$) claim to be unbiased, the implementations of Q-Prop and IPG apply an adaptive normalization to only some of the estimator terms, which introduces a bias. Practical implementations of policy gradient methods (Mnih & Gregor, 2014; Schulman et al., 2015b; Duan et al., 2016) often normalize the advantage estimate, also commonly referred to as the learning signal, to unit variance with batch statistics. This effectively serves as an adaptive learning rate heuristic that bounds the gradient variance.

The implementations of Q-Prop and IPG normalize the learning signal $\hat{A}(s, a, \tau) - \phi(s, a)$, but not the bias correction term $\nabla_\theta \mathbb{E}_a\left[ \phi(s, a) \right]$. Explicitly, the estimator with such a normalization is
$$\hat{g}(s, a, \tau) = \frac{1}{\hat{\sigma}}\left( \hat{A}(s, a, \tau) - \phi(s, a) - \hat{\mu} \right) \nabla_\theta \log \pi_\theta(a \mid s) + \nabla_\theta \mathbb{E}_a\left[ \phi(s, a) \right],$$
where $\hat{\mu}$ and $\hat{\sigma}$ are batch-based estimates of the mean and standard deviation of $\hat{A}(s, a, \tau) - \phi(s, a)$. This deviates from the method presented in the paper and introduces bias. In fact, IPG (Gu et al., 2017b) analyzes the bias in the implied objective that would be introduced when the first term has a different weight from the bias correction term, proposing such a weight as a means to trade off bias and variance. We analyze the bias and variance of the gradient estimator in Appendix 11. However, the weight actually used in the implementation is off by the factor $\hat{\sigma}$, and never one (which corresponds to the unbiased case). This introduces an adaptive bias-variance tradeoff that constrains the learning signal variance to 1 by automatically adding bias if necessary.
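The asymmetric normalization can be sketched schematically as follows (ours, with invented names; `correction` stands in for the analytic bias-correction term). Normalizing only the learning signal rescales the first term while leaving the correction term untouched, which mismatches their weights:

```python
import numpy as np

def qprop_style_grad(signal, score, correction, normalize_all):
    """Schematic per-batch gradient combining a learning signal
    (advantage minus baseline), score-function terms, and the analytic
    bias-correction term. `normalize_all=False` mimics the released
    Q-Prop/IPG code: only the learning signal is standardized, so the
    correction term's relative weight becomes sigma-hat (biased).
    `normalize_all=True` rescales both terms, which only changes the
    effective step size and remains unbiased."""
    mu, sigma = signal.mean(), signal.std() + 1e-8
    normalized = (signal - mu) / sigma
    first_term = np.mean(normalized * score)
    if normalize_all:
        return first_term + correction / sigma  # both terms scaled: unbiased
    return first_term + correction              # mismatched weights: biased
```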
In Figure 3, we compare the implementation of Q-Prop from (Gu et al., 2017a), an unbiased implementation of Q-Prop that applies the normalization to all terms, and TRPO. We found that the adaptive bias-variance tradeoff induced by the asymmetric normalization is crucial for the gains observed in (Gu et al., 2017a). If implemented as unbiased, it does not outperform TRPO.
4.2 Poorly Fit Value Functions
In contrast to our results, Liu et al. (2018) report that state-action-dependent baselines significantly reduce variance over state-dependent baselines on continuous action benchmark tasks (in some cases by six orders of magnitude). We find that this conclusion was caused by a poorly fit value function.

The GAE advantage estimator has mean zero when $\hat{V}(s) = V^\pi(s)$, which suggests that a state-dependent baseline is unnecessary if $\hat{V}$ is well fit. As a result, a state-dependent baseline is typically omitted when the GAE advantage estimator is used. This is the case in (Liu et al., 2018). However, when $\hat{V}(s)$ poorly approximates $V^\pi(s)$, the GAE advantage estimator has nonzero mean, and a state-dependent baseline can reduce variance. We show that this is the case by taking the open-source code accompanying (Liu et al., 2018) and implementing a state-dependent baseline. It achieves comparable variance reduction to the state-action-dependent baseline (Appendix Figure 12).

This situation can occur when the value function approximator is not trained sufficiently (e.g., if a small number of SGD steps are used to train $\hat{V}$). Then, it can appear that adding a state-action-dependent baseline reduces variance where a state-dependent baseline would have the same effect.
4.3 SampleReuse in Baseline Fitting
Recent work on state-action-dependent baselines fits the baselines using on-policy samples (Liu et al., 2018; Grathwohl et al., 2018), either by regressing to the Monte Carlo return or by minimizing an approximation to the variance of the gradient estimator. This must be carefully implemented to avoid bias. Specifically, fitting the baseline to the current batch of data and then using the updated baseline to form the estimator results in a biased gradient (Jie & Abbeel, 2010).

Although this can reduce the variance of the gradient estimator, it is challenging to analyze the bias introduced. The bias is controlled by the implicit or explicit regularization (e.g., early stopping, size of the network, etc.) of the function approximator used to fit $\phi$. A powerful enough function approximator can trivially overfit the current batch of data and reduce the learning signal to 0. This is especially important when flexible neural networks are used as the function approximators.

Liu et al. (2018) fit the baseline using the current batch before computing the policy gradient estimator. Using the open-source code accompanying (Liu et al., 2018), we evaluate several variants: an unbiased version that fits the state-action-dependent baseline after computing the policy step, an unbiased version that fits a state-dependent baseline after computing the policy step, and a version that estimates $\mathbb{E}_a\left[ \phi(s, a) \right]$ with an extra sample of $a \sim \pi(\cdot \mid s)$ instead of importance-weighting samples from the current batch. Our results are summarized in Appendix Figure 8. Notably, we found that using an extra sample, which should reduce variance by avoiding importance sampling, decreases performance because the baseline is overfit to the current batch. The performance of the unbiased state-dependent baseline matched the performance of the unbiased state-action-dependent baseline. On Humanoid, the biased method implemented in (Liu et al., 2018) performs best. However, on HalfCheetah, the biased methods suffer from instability.
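The unbiased fit-after-use ordering described above can be sketched with a toy constant baseline (names are ours; any learned baseline follows the same pattern):

```python
import numpy as np

class MeanBaseline:
    """Toy constant baseline: running estimate set to the mean return."""
    def __init__(self):
        self.value = 0.0
    def fit(self, returns):
        self.value = float(np.mean(returns))

def unbiased_signal(returns, baseline):
    """Fit-after-use ordering: form the learning signal with the baseline
    fit only on PREVIOUS batches, then update the baseline. Fitting first
    and evaluating on the same batch (fit-before-use) biases the estimator
    (Jie & Abbeel, 2010)."""
    signal = returns - baseline.value  # uses the stale baseline: unbiased
    baseline.fit(returns)              # baseline updated only afterwards
    return signal
```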
5 HorizonAware Value Functions
The empirical variance decomposition illustrated in Figure 2 and Appendix Figure 9 reveals deficiencies in the commonly used value function approximator, and as we showed in Section 4.2, a poor value approximator can produce misleading results. To fix one deficiency with the value function approximator, we propose a new horizon-aware parameterization of the value function. As with the state-action-dependent baselines, such a modification is appealing because it does not introduce bias into the underlying method.
The standard continuous control benchmarks use a fixed time horizon (Duan et al., 2016; Brockman et al., 2016), yet most value function parameterizations are stationary, as though the task had an infinite horizon. Near the end of an episode, the expected return will necessarily be small because there are few remaining steps to accumulate reward. To remedy this, our value function approximator outputs two values, $\hat{r}(s)$ and $\hat{b}(s)$, and we combine them with the discounted time left to form a value function estimate
$$\hat{V}(s_t) = \left( \sum_{t'=t}^{T} \gamma^{t'-t} \right) \hat{r}(s_t) + \hat{b}(s_t) = \frac{1 - \gamma^{T - t + 1}}{1 - \gamma} \, \hat{r}(s_t) + \hat{b}(s_t),$$
where $T$ is the maximum length of the episode. Conceptually, we can think of $\hat{r}(s)$ as predicting the average reward over future states and $\hat{b}(s)$ as a state-dependent offset. $\hat{r}(s)$ is a rate of return, so we multiply it by the remaining discounted time in the episode.
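A minimal sketch of the horizon-aware combination (ours), assuming the closed form above with maximum episode length $T$ and timestep $t$:

```python
def horizon_aware_value(r_hat, b_hat, t, T, gamma):
    """Combine a predicted per-step reward rate r_hat(s) and a
    state-dependent offset b_hat(s) with the discounted time remaining:
    V(s_t) = (sum_{t'=t}^{T} gamma^(t'-t)) * r_hat + b_hat
           = ((1 - gamma^(T - t + 1)) / (1 - gamma)) * r_hat + b_hat."""
    discounted_time_left = (1.0 - gamma ** (T - t + 1)) / (1.0 - gamma)
    return discounted_time_left * r_hat + b_hat
```

In a network, `r_hat` and `b_hat` would be two output heads evaluated at the same state.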
Including time as an input to the value function can also resolve this issue (e.g., Duan et al. (2016); Pardo et al. (2017)). We compare our horizon-aware parameterization against including time as an input to the value function and find that the horizon-aware value function performs favorably (Appendix Figures 6 and 7).

In Figure 4, we compare TRPO with a horizon-aware value function against TRPO, TRPO with a state-dependent baseline, and TRPO with a state-action-dependent baseline. Across environments, the horizon-aware value function outperforms the other methods. By prioritizing the largest variance components for reduction, we can realize practical performance improvements without introducing bias.
6 Related Work
Baselines (Williams, 1992; Weaver & Tao, 2001) in RL fall under the umbrella of control variates, a general technique for reducing variance in Monte Carlo estimators without biasing the estimator (Owen, 2013). Weaver & Tao (2001) analyze the optimal state-dependent baseline, and in this work, we extend the analysis to state-action-dependent baselines, in addition to analyzing the variance of the GAE estimator (Tesauro, 1995; Schulman et al., 2015b).

Dudík et al. (2011) introduced the community to doubly-robust estimators, a specific form of control variate, for off-policy evaluation in bandit problems. The state-action-dependent baselines (Gu et al., 2017a; Wu et al., 2018; Liu et al., 2018; Grathwohl et al., 2018; Gruslys et al., 2017) can be seen as the natural extension of the doubly-robust estimator to the policy gradient setting. In fact, in the discrete action case, the policy gradient estimator with a state-action-dependent baseline can be seen as the gradient of a doubly-robust estimator.

Prior work has explored model-based (Sutton, 1990; Heess et al., 2015; Gu et al., 2016) and off-policy critic-based gradient estimators (Lillicrap et al., 2015). In off-policy evaluation, practitioners have long realized that constraining the estimator to be unbiased is too limiting. Instead, recent methods mix unbiased doubly-robust estimators with biased model-based estimates and minimize the mean squared error (MSE) of the combined estimator (Thomas & Brunskill, 2016; Wang et al., 2016a). In this direction, several recent methods have successfully mixed high-variance, unbiased on-policy gradient estimates with low-variance, biased off-policy or model-based gradient estimates to improve performance (O’Donoghue et al., 2016; Wang et al., 2016b; Gu et al., 2017b). It would be interesting to see if the ideas from off-policy evaluation could be further adapted to the policy gradient setting.
7 Discussion
State-action-dependent baselines promise variance reduction without introducing bias. In this work, we clarify the practical effect of state-action-dependent baselines in common continuous control benchmark tasks. Although an optimal state-action-dependent baseline is guaranteed not to increase variance and has the potential to reduce variance, in practice, currently used function approximators for the state-action-dependent baselines are unable to achieve significant variance reduction. Furthermore, we found that much larger gains could be achieved by instead improving the accuracy of the value function or the state-dependent baseline function approximators.

With these insights, we re-examined previous work on state-action-dependent baselines and identified a number of pitfalls. We were also able to correctly attribute the previously observed results to implementation decisions that introduce bias in exchange for variance reduction. We intend to further explore the tradeoff between bias and variance in future work.

Motivated by the gap between the value function approximator and the true value function, we propose a novel modification of the value function parameterization that makes it aware of the finite time horizon. This gave consistent improvements over TRPO, whereas the unbiased state-action-dependent baseline did not outperform TRPO.

Finally, we note that the relative contributions of each of the terms to the policy gradient variance are problem-specific. A learned state-action-dependent baseline will be beneficial when $\Sigma_a^{\phi(s)}$ is large relative to $\Sigma_\tau + \Sigma_s$. In this paper, we focused on continuous control benchmarks, where we found this not to be the case. We speculate that in environments where single actions have a strong influence on the discounted return, $\Sigma_a^{\phi(s)}$ may be large; for example, in a discrete task with a critical decision point, such as a Cliffworld domain, where a single action could cause the agent to fall off the cliff and suffer a large negative reward. Future work will investigate the variance decomposition in additional domains.
Acknowledgments
We thank Jascha Sohl-Dickstein, Luke Metz, Gerry Che, Yuchen Lu, and Cathy Wu for helpful discussions. We thank Hao Liu and Qiang Liu for assisting our understanding of their code. SG acknowledges support from a Cambridge-Tübingen PhD Fellowship. RET acknowledges support from Google and EPSRC grants EP/M026957/1 and EP/L000776/1. ZG acknowledges support from EPSRC grant EP/J012300/1.
References
 Brockman et al. (2016) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.
 Duan et al. (2016) Duan, Y., Chen, X., Houthooft, R., Schulman, J., and Abbeel, P. Benchmarking deep reinforcement learning for continuous control. In International Conference on Machine Learning, pp. 1329–1338, 2016.
 Dudík et al. (2011) Dudík, M., Langford, J., and Li, L. Doubly robust policy evaluation and learning. arXiv preprint arXiv:1103.4601, 2011.
 Grathwohl et al. (2018) Grathwohl, W., Choi, D., Wu, Y., Roeder, G., and Duvenaud, D. Backpropagation through the void: Optimizing control variates for black-box gradient estimation. International Conference on Learning Representations (ICLR), 2018.
 Gruslys et al. (2017) Gruslys, A., Azar, M. G., Bellemare, M. G., and Munos, R. The reactor: A sample-efficient actor-critic architecture. arXiv preprint arXiv:1704.04651, 2017.
 Gu et al. (2016) Gu, S., Lillicrap, T., Sutskever, I., and Levine, S. Continuous deep Q-learning with model-based acceleration. In International Conference on Machine Learning, pp. 2829–2838, 2016.
 Gu et al. (2017a) Gu, S., Lillicrap, T., Ghahramani, Z., Turner, R. E., and Levine, S. Q-Prop: Sample-efficient policy gradient with an off-policy critic. International Conference on Learning Representations (ICLR), 2017a.
 Gu et al. (2017b) Gu, S., Lillicrap, T., Turner, R. E., Ghahramani, Z., Schölkopf, B., and Levine, S. Interpolated policy gradient: Merging on-policy and off-policy gradient estimation for deep reinforcement learning. In Advances in Neural Information Processing Systems, pp. 3849–3858, 2017b.
 Heess et al. (2015) Heess, N., Wayne, G., Silver, D., Lillicrap, T., Erez, T., and Tassa, Y. Learning continuous control policies by stochastic value gradients. In Advances in Neural Information Processing Systems, pp. 2944–2952, 2015.
 Jie & Abbeel (2010) Jie, T. and Abbeel, P. On a connection between importance sampling and the likelihood ratio policy gradient. In Advances in Neural Information Processing Systems, pp. 1000–1008, 2010.
 Kakade (2002) Kakade, S. M. A natural policy gradient. In Advances in Neural Information Processing Systems, pp. 1531–1538, 2002.
 Kingma & Ba (2014) Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 Kingma & Welling (2013) Kingma, D. P. and Welling, M. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
 Lillicrap et al. (2015) Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
 Liu & Nocedal (1989) Liu, D. C. and Nocedal, J. On the limited memory BFGS method for large scale optimization. Mathematical Programming, 45(1-3):503–528, 1989.
 Liu et al. (2018) Liu, H., Feng, Y., Mao, Y., Zhou, D., Peng, J., and Liu, Q. Action-dependent control variates for policy optimization via Stein identity. International Conference on Learning Representations (ICLR), 2018.
 Mnih & Gregor (2014) Mnih, A. and Gregor, K. Neural variational inference and learning in belief networks. arXiv preprint arXiv:1402.0030, 2014.
 Mnih et al. (2015) Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.
 Mnih et al. (2016) Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D., and Kavukcuoglu, K. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pp. 1928–1937, 2016.
 O’Donoghue et al. (2016) O’Donoghue, B., Munos, R., Kavukcuoglu, K., and Mnih, V. PGQ: Combining policy gradient and Q-learning. arXiv preprint arXiv:1611.01626, 2016.
 Owen (2013) Owen, A. B. Monte Carlo Theory, Methods and Examples. 2013.
 Pardo et al. (2017) Pardo, F., Tavakoli, A., Levdik, V., and Kormushev, P. Time limits in reinforcement learning. arXiv preprint arXiv:1712.00378, 2017.
 Peters & Schaal (2006) Peters, J. and Schaal, S. Policy gradient methods for robotics. In Intelligent Robots and Systems, 2006 IEEE/RSJ International Conference on, pp. 2219–2225. IEEE, 2006.
 Rezende et al. (2014) Rezende, D. J., Mohamed, S., and Wierstra, D. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082, 2014.
 Schulman et al. (2015a) Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. Trust region policy optimization. In International Conference on Machine Learning, pp. 1889–1897, 2015a.
 Schulman et al. (2015b) Schulman, J., Moritz, P., Levine, S., Jordan, M., and Abbeel, P. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438, 2015b.
 Schulman et al. (2017) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
 Silver et al. (2014) Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D., and Riedmiller, M. Deterministic policy gradient algorithms. In ICML, 2014.
 Silver et al. (2016) Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
 Stengel (1986) Stengel, R. F. Optimal control and estimation. Courier Corporation, 1986.
 Sutton (1990) Sutton, R. S. Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In Machine Learning Proceedings 1990, pp. 216–224. Elsevier, 1990.
 Sutton & Barto (1998) Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction, volume 1. MIT Press Cambridge, 1998.
 Sutton et al. (2000) Sutton, R. S., McAllester, D. A., Singh, S. P., and Mansour, Y. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, pp. 1057–1063, 2000.
 Tesauro (1995) Tesauro, G. Temporal difference learning and TD-Gammon. Communications of the ACM, 38(3):58–68, 1995.
 Thomas (2014) Thomas, P. Bias in natural actor-critic algorithms. In International Conference on Machine Learning, pp. 441–448, 2014.
 Thomas & Brunskill (2016) Thomas, P. and Brunskill, E. Data-efficient off-policy policy evaluation for reinforcement learning. In International Conference on Machine Learning, pp. 2139–2148, 2016.
 Thomas & Brunskill (2017) Thomas, P. S. and Brunskill, E. Policy gradient methods for reinforcement learning with function approximation and action-dependent baselines. arXiv preprint arXiv:1706.06643, 2017.
 Wang et al. (2016a) Wang, Y.-X., Agarwal, A., and Dudik, M. Optimal and adaptive off-policy evaluation in contextual bandits. arXiv preprint arXiv:1612.01205, 2016a.
 Wang et al. (2016b) Wang, Z., Bapst, V., Heess, N., Mnih, V., Munos, R., Kavukcuoglu, K., and de Freitas, N. Sample efficient actor-critic with experience replay. arXiv preprint arXiv:1611.01224, 2016b.

 Weaver & Tao (2001) Weaver, L. and Tao, N. The optimal reward baseline for gradient-based reinforcement learning. In Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence, pp. 538–545. Morgan Kaufmann Publishers Inc., 2001.
 Williams (1992) Williams, R. J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. In Reinforcement Learning, pp. 5–32. Springer, 1992.
 Wu et al. (2018) Wu, C., Rajeswaran, A., Duan, Y., Kumar, V., Bayen, A. M., Kakade, S., Mordatch, I., and Abbeel, P. Variance reduction for policy gradient with actiondependent factorized baselines. International Conference on Learning Representations (ICLR), 2018.
Appendix
8 Experiment Details
8.1 Q-Prop Experiments
We modified the Q-Prop implementation published by the authors at https://github.com/shaneshixiang/rllabplusplus (commit: 4d55f96). We used the conservative variant of Q-Prop, as is used throughout the experimental section of the original paper. We used the default choices of policy and value functions, learning rates, and other hyperparameters as dictated by the code.
We used a batch size of 5000 steps. We ran each of the three algorithms on one discrete environment (CartPole-v0) and two continuous environments (HalfCheetah-v1, Humanoid-v1) using OpenAI Gym (Brockman et al., 2016) and MuJoCo 1.3.
To generate the TRPO and (biased) Q-Prop results, we run the code as is. The bias in the Q-Prop gradient estimator arises because the implementation normalizes the learning signal but does not apply the same scaling to the analytic bias correction term. To debias the Q-Prop gradient estimator, we divide the bias correction term by the normalization constant applied to the learning signal.
8.2 Stein Control Variate Experiments
We used the Stein control variate implementation published by the authors at https://github.com/DartML/PPO-Stein-Control-Variate (commit: 6eec471). We used the default hyperparameters and tested on two continuous control environments, HalfCheetah-v1 and Humanoid-v1, using OpenAI Gym and MuJoCo 1.3.
 We evaluated five algorithms in this experiment: PPO, the Stein control variate algorithm as implemented by Liu et al. (2018), a variant of the biased Stein algorithm that does not use importance sampling to compute the bias correction term (described below), an unbiased state-dependent baseline, and an unbiased state-action-dependent Stein baseline. All of the learned baselines were trained to minimize the approximation to the variance of the gradient estimator described in Liu et al. (2018).
We use the code as is to run the first two variants. In the next variant, we estimate the bias correction term with an extra action sample drawn from the current policy, instead of importance weighting the action samples from the current batch (see Eq. 20 in Liu et al. (2018)). For the unbiased baselines, we ensure that the policy update steps for the current batch are performed before updating the baselines.
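To make the distinction concrete, the following sketch contrasts the two estimators of an expectation over actions: importance weighting a batch drawn from an older policy versus drawing an extra fresh action sample from the current policy. The one-dimensional Gaussian policy, the quadratic baseline, and all constants are hypothetical illustrations, not taken from the Stein control variate code.

```python
import numpy as np

rng = np.random.default_rng(0)

def grad_phi(a):
    # Gradient of a hypothetical quadratic baseline phi(a) = a**2 / 2 + 0.3 * a.
    return a + 0.3

mu, sigma = 0.7, 1.0   # current Gaussian policy N(mu, sigma^2)
mu_old = 0.2           # behavior policy that generated the current batch

# Importance-weighted estimate of E_{a ~ current policy}[grad_phi(a)]
# using actions sampled from the old policy (Gaussian normalizers cancel).
a_old = rng.normal(mu_old, sigma, size=100_000)
w = np.exp((-(a_old - mu) ** 2 + (a_old - mu_old) ** 2) / (2 * sigma ** 2))
est_iw = np.mean(w * grad_phi(a_old))

# Extra-sample estimate: draw fresh actions from the current policy.
a_new = rng.normal(mu, sigma, size=100_000)
est_fresh = np.mean(grad_phi(a_new))

exact = mu + 0.3  # E[a] + 0.3 under N(mu, sigma^2)
print(est_iw, est_fresh, exact)
```

Both estimators are unbiased here; the extra-sample variant simply avoids the importance weights (and the variance they contribute) at the cost of one additional action sample per state.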
8.3 Backpropagating through the Void
We used the implementation published by the authors (https://github.com/wgrathwohl/BackpropThroughTheVoidRL, commit: 0e6623d) with the following modification: we measure the variance of the policy gradient estimator. In the original code, the authors accidentally measure the variance of a gradient estimator that neither method uses. We note that Grathwohl et al. (2018) recently corrected a bug in the code that caused the LAX method to use a different advantage estimator than the base method. We use this bug fix.
8.4 Horizon-Aware Value Function Experiments
For these experiments, we modify the open-source TRPO implementation at https://github.com/ikostrikov/pytorch-trpo (commit: 27400b8). We test four different algorithms on three continuous control environments: HalfCheetah-v1, Walker2d-v1, and Humanoid-v1, using OpenAI Gym and MuJoCo 1.3.
The policy network is a two-layer MLP with 64 units per hidden layer and tanh nonlinearities. It parameterizes the mean and standard deviation of a Gaussian distribution from which actions are sampled. The value function is parameterized similarly. We implement the two outputs of the horizon-aware value function as a single neural network with two heads, each a separate linear layer on top of the last hidden layer. We estimate the value functions by regressing on Monte Carlo returns, optimized with L-BFGS (Liu & Nocedal, 1989). We use GAE, a batch size of 5000 steps, and a maximum episode length of 1000 steps.

For the experiments in which we train an additional state-dependent or state-action-dependent baseline on the advantage estimates, we parameterize these baselines similarly to the normal value function and train them with Adam (Kingma & Ba, 2014). We fit the baselines to minimize the mean squared error in predicting the GAE estimates. With the state-action-dependent baseline, we estimate the expectation over actions using the reparameterization trick, as in Liu et al. (2018) and Grathwohl et al. (2018).
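A minimal sketch of the two-headed value network described above, in plain NumPy. The rule for combining the heads with the number of remaining steps (one per-remaining-step head plus one horizon-independent head) is an illustrative assumption, not the exact form from our implementation; all names and initializations here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def two_head_value(obs, steps_remaining, params):
    """Two-layer tanh MLP trunk with two separate linear heads."""
    W1, W2, w_step, w_offset = params
    h = np.tanh(np.tanh(obs @ W1) @ W2)   # shared hidden layers
    per_step = h @ w_step                 # head 1: per-remaining-step value
    offset = h @ w_offset                 # head 2: horizon-independent part
    return steps_remaining * per_step + offset

obs_dim, hidden = 10, 64
params = (rng.normal(size=(obs_dim, hidden)) / np.sqrt(obs_dim),
          rng.normal(size=(hidden, hidden)) / np.sqrt(hidden),
          rng.normal(size=hidden) / np.sqrt(hidden),
          rng.normal(size=hidden) / np.sqrt(hidden))

obs = rng.normal(size=(4, obs_dim))
v = two_head_value(obs, steps_remaining=np.array([100.0, 50.0, 10.0, 1.0]),
                   params=params)
print(v.shape)  # one value prediction per state
```

In practice both heads share the trunk and are trained jointly by the same regression on Monte Carlo returns.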
9 Variance Computations in Linear-Quadratic-Gaussian (LQG) Systems
Linear-Quadratic-Gaussian (LQG) systems (Stengel, 1986) are among the most studied problems in control and in continuous state-action RL. An LQG problem is specified, for a finite-horizon scenario, by linear-Gaussian dynamics and quadratic rewards. To simplify exposition, we focus on open-loop control without observation matrices; however, it is straightforward to extend our analysis.
The problem is parameterized by policy, dynamics, and reward parameters. While the goal of LQG is to find the optimal control parameters, which can be obtained analytically through dynamic programming, we are interested in analyzing the properties of the gradient estimator when a policy gradient algorithm is applied.
9.1 Computing the Value Functions
In an LQG system, both the state-action value function and the state value function corresponding to a policy can be derived analytically. To do so, we first note that the marginal state distribution at each time step remains Gaussian and can be computed iteratively from the linear-Gaussian dynamics.
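The recursion is the standard Gaussian marginal propagation for linear dynamics. As a sanity check, the sketch below propagates the mean and covariance iteratively and compares them against Monte Carlo rollouts. The notation is assumed (s_{t+1} = A s_t + B a_t + eps with eps ~ N(0, Sw), and open-loop actions a_t ~ N(mu_a[t], Sa)), and the matrices are arbitrary examples, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative LQG-style setup:
#   s_{t+1} = A s_t + B a_t + eps,  eps ~ N(0, Sw)
#   open-loop Gaussian actions a_t ~ N(mu_a[t], Sa)
A = np.array([[0.9, 0.1], [0.0, 0.8]])
B = np.array([[0.0], [0.5]])
Sw = 0.01 * np.eye(2)
Sa = np.array([[0.04]])
mu_a = [np.array([1.0]), np.array([-0.5])]

# Exact Gaussian marginal recursion for the state distribution.
mu, Sig = np.zeros(2), np.zeros((2, 2))   # s_0 = 0 deterministically
for t in range(2):
    mu = A @ mu + B @ mu_a[t]
    Sig = A @ Sig @ A.T + B @ Sa @ B.T + Sw

# Monte Carlo check of the same marginal.
n = 200_000
s = np.zeros((n, 2))
for t in range(2):
    a = mu_a[t] + rng.normal(size=(n, 1)) @ np.linalg.cholesky(Sa).T
    e = rng.multivariate_normal(np.zeros(2), Sw, size=n)
    s = s @ A.T + a @ B.T + e
print(mu, s.mean(axis=0))
```

The recursion gives the exact marginal in closed form, which is what makes the value functions below analytically tractable.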
To compute the state-action value function, we modify the above recursion to first compute all future state marginals conditioned on a given state-action pair, and then integrate the quadratic rewards at each time step against those marginals. The resulting state-action value function is quadratic in the state and action, up to constant terms that depend on neither; note that it is nonstationary due to the finite time horizon. Given the quadratic state-action value function, it is straightforward to derive the state value function by integrating out the action under the policy, and the advantage function as the difference of the two.
9.2 Computing the LQG analytic gradients
Given quadratic state-action value and advantage functions, it is tractable to compute the exact gradient with respect to the Gaussian policy parameters. We derive the gradient with respect to the mean, dropping the time index since the derivation applies to every time step.
Written with time indices, the state-conditional gradient and the full gradient are tractably expressed for estimators based on either the state-action value function or the advantage function.
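As an illustration of the mean-gradient computation, the sketch below checks the analytic gradient of the expected quadratic under a Gaussian against a score-function Monte Carlo estimate. The quadratic stands in for the state-action value function at a fixed state; all coefficients are arbitrary examples, not derived from the paper's LQG parameters.

```python
import numpy as np

rng = np.random.default_rng(1)

# Quadratic "Q-function" in the action only (state held fixed).
P = np.array([[-1.0, 0.2], [0.2, -0.5]])
p = np.array([0.3, -0.7])

mu = np.array([0.5, -0.2])
Sig = 0.25 * np.eye(2)

# Analytic gradient of E_{a ~ N(mu, Sig)}[a^T P a + p^T a] w.r.t. mu:
#   E[Q] = mu^T P mu + tr(P Sig) + p^T mu,  so  grad = (P + P^T) mu + p.
g_exact = (P + P.T) @ mu + p

# Score-function (likelihood-ratio) Monte Carlo estimate of the same gradient.
n = 500_000
a = rng.multivariate_normal(mu, Sig, size=n)
score = (a - mu) @ np.linalg.inv(Sig)          # grad_mu log N(a; mu, Sig)
qa = np.einsum('ni,ij,nj->n', a, P, a) + a @ p
g_mc = (score * qa[:, None]).mean(axis=0)
print(g_exact, g_mc)
```

The closed-form gradient is what the analytic LQG computation uses; the Monte Carlo estimate corresponds to what a policy gradient algorithm would average over samples.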
Similarly, the gradients with respect to the covariance can be computed analytically. For the LQG system, we analyze the variance with respect to the mean parameters, so we omit the derivations here.
9.3 Estimating the LQG variances
Given analytic expressions for the value functions, the advantage function, and the exact gradients, it is simple to estimate the variance terms in Eq. 3. The first term can be computed exactly from the analytic gradients. The second term can also be computed analytically; however, to avoid integrating complex fourth-order terms, we use a sample-based estimate over states and actions, which is sufficiently low variance. The sample-based estimate averages the squared norm of the per-sample gradient over sampled states and actions and subtracts the squared norm of the exact mean gradient. The estimate for the advantage-based estimator is the same, except the state-action value function is replaced by the advantage function. In addition, an unbiased estimate of the variance reduction terms can be derived. Importantly, one term is the variance reduction from using the state value function as the state-dependent baseline, and the other is the extra variance reduction from using the optimal variance-reducing state-action-dependent baseline.
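Sample-based variance estimation of a gradient estimator follows the usual identity Var(g) = E[g^2] - E[g]^2. A one-dimensional sketch (hypothetical linear reward; all constants are illustrative, not from the LQG setup) shows this computation and the variance reduction from subtracting a baseline:

```python
import numpy as np

rng = np.random.default_rng(2)

# Score-function gradient samples for a 1-D Gaussian policy N(mu, 1):
#   g = d/dmu log N(a; mu, 1) * (Q(a) - b)  for a baseline b.
mu = 0.0
Q = lambda a: a + 5.0        # hypothetical linear "Q"; constants illustrative
n = 200_000
a = rng.normal(mu, 1.0, size=n)
score = a - mu               # d/dmu log N(a; mu, 1)

def grad_variance(b):
    g = score * (Q(a) - b)
    return np.mean(g ** 2) - np.mean(g) ** 2   # Var(g) = E[g^2] - E[g]^2

print(grad_variance(0.0), grad_variance(np.mean(Q(a))))
```

Subtracting the baseline lowers the variance here without changing the mean of the estimator (in expectation), which is the property the variance-reduction terms above quantify.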
Lastly, analytic computation of the remaining variance term is also tractable, since the conditional action distribution is Gaussian and the integrand is quadratic. However, computing this term in full is computationally expensive, so we instead use a sample-based form that is still sufficiently low variance.
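The tractability claim rests on the standard identity for quadratics under a Gaussian: for x ~ N(mu, Sig), E[x^T M x + m^T x + c] = mu^T M mu + tr(M Sig) + m^T mu + c. A quick numerical check with arbitrary example coefficients:

```python
import numpy as np

rng = np.random.default_rng(3)

# For x ~ N(mu, Sig) and quadratic f(x) = x^T M x + m^T x + c:
#   E[f(x)] = mu^T M mu + tr(M Sig) + m^T mu + c
M = np.array([[2.0, 0.3], [0.3, 1.0]])
m = np.array([0.5, -1.0])
c = 0.7
mu = np.array([1.0, -0.5])
Sig = np.array([[0.3, 0.1], [0.1, 0.2]])

exact = mu @ M @ mu + np.trace(M @ Sig) + m @ mu + c

x = rng.multivariate_normal(mu, Sig, size=500_000)
mc = np.mean(np.einsum('ni,ij,nj->n', x, M, x) + x @ m + c)
print(exact, mc)
```

Because every integrand in the LQG derivation is at most quadratic, all such expectations reduce to closed-form expressions of this kind.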