Compatible features for Monotonic Policy Improvement

by Marcin B. Tomczak, et al.
University of Cambridge

Recent policy optimization approaches have achieved substantial empirical success by constructing surrogate optimization objectives. The Approximate Policy Iteration objective (Schulman et al., 2015a; Kakade and Langford, 2002) has become a standard optimization target for reinforcement learning problems. Using this objective in practice requires an estimator of the advantage function. Policy optimization methods such as those proposed in Schulman et al. (2015b) estimate the advantages using a parametric critic. In this work we establish conditions under which the parametric approximation of the critic does not introduce bias to the updates of the surrogate objective. These results hold for a general class of parametric policies, including deep neural networks. We obtain a result analogous to the compatible features derived for the original Policy Gradient Theorem (Sutton et al., 1999). As a result, we also identify a previously unknown bias that current state-of-the-art policy optimization algorithms (Schulman et al., 2015a, 2017) have introduced by not employing these compatible features.



1 Introduction

Model-free reinforcement learning has been applied successfully to a variety of problems (Duan et al., 2016). Nevertheless, many of the state-of-the-art approaches introduce bias to the policy updates and hence do not guarantee convergence. This can make the algorithms oversensitive to changes in initial conditions and adversely affects reproducibility (Henderson et al., 2017).

Policy updates often use values learned by a critic to reduce the variance of stochastic gradients. However, an arbitrary parametrization of the critic can introduce bias into policy updates. Early work in Sutton et al. (1999) provides restrictions on the form of the function approximator under which the policy gradient remains unbiased and the convergence of learning is guaranteed. The notion of compatible features introduced in Sutton et al. (1999) also has connections to the natural policy gradient (Kakade, 2001). Compatible features remain an active area of research (Pajarinen et al., 2019).

Modern policy optimization algorithms focus on optimizing a surrogate objective that approximates the value of the policy (Schulman et al., 2015, 2017). This surrogate objective remains an accurate optimization target when the policy used to gather the data is close to the policy being optimized (Schulman et al., 2015; Kakade and Langford, 2002). In this work we investigate the conditions on the parametrization of the critic under which the updates of the surrogate objective remain unbiased.

2 Preliminaries

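We assume a classical MDP formulation as in Sutton et al. (1999). We consider an MDP given by a tuple $(\mathcal{S}, \mathcal{A}, P, r, \rho_0, \gamma)$, where $\mathcal{S}$ is a set of states, $\mathcal{A}$ is a set of actions, $P$ is a transition model, $r$ is a reward function, $\rho_0$ is an initial distribution over states and $\gamma \in [0, 1)$ is a discount factor. We denote a trajectory as $\tau = (s_0, a_0, s_1, a_1, \dots)$. Given a stochastic policy $\pi$, we use $\tau \sim \pi$ to denote that the trajectory has been generated by following policy $\pi$, i.e. $s_0 \sim \rho_0$, $a_t \sim \pi(\cdot \mid s_t)$, and $s_{t+1} \sim P(\cdot \mid s_t, a_t)$.

We denote the unnormalized discounted state occupancy measure as $\rho_\pi(s) = \sum_{t=0}^{\infty} \gamma^t P(s_t = s \mid \pi)$. The action-value function is defined as $Q^\pi(s, a) = \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \mid s_0 = s, a_0 = a\right]$ and the state-value function is $V^\pi(s) = \mathbb{E}_{a \sim \pi(\cdot \mid s)}\left[Q^\pi(s, a)\right]$. The advantage function is defined as $A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)$. To emphasize that the policy is parametrized by parameters $\theta$ we use the notation $\pi_\theta$. We consider policies such that $\nabla_\theta \pi_\theta(a \mid s)$ exists and is continuous for every $s$ and $a$. This is a general class of policies that includes, e.g., expressive deep neural networks. Under this discounted cost setting, reinforcement learning algorithms seek a policy that maximizes its value defined as:

$$\eta(\pi) = \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t)\right]. \tag{1}$$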


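When dealing with two different parametric policies, $\pi_\theta$ and $\pi_{\tilde\theta}$, with parameters $\theta$ and $\tilde\theta$, respectively, we follow the notation in Schulman et al. (2015) and define a surrogate policy optimization objective as

$$L_{\pi_\theta}(\pi_{\tilde\theta}) = \eta(\pi_\theta) + \int_{\mathcal{S}} \rho_{\pi_\theta}(s) \int_{\mathcal{A}} \pi_{\tilde\theta}(a \mid s)\, A^{\pi_\theta}(s, a)\, da\, ds. \tag{2}$$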


When the expected divergence between $\pi_\theta$ and $\pi_{\tilde\theta}$ is small, $L_{\pi_\theta}(\pi_{\tilde\theta})$ can be treated as a good approximation of $\eta(\pi_{\tilde\theta})$ (Schulman et al., 2015; Kakade and Langford, 2002).

3 Related work

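We begin by restating a well-known result presented in Sutton et al. (1999) that allows us to calculate the gradient of $\eta(\pi_\theta)$ w.r.t. the policy parameters $\theta$:

$$\nabla_\theta \eta(\pi_\theta) = \int_{\mathcal{S}} \rho_{\pi_\theta}(s) \int_{\mathcal{A}} \nabla_\theta \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a)\, da\, ds. \tag{3}$$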


The remarkable property of the expression for the policy gradient given by Equation (3) is that calculating the gradient does not require calculating the gradient of the occupancy measure $\rho_{\pi_\theta}$. Thus, if we know $Q^{\pi_\theta}$, the gradient can be approximated with Monte Carlo sampling using the trajectories obtained by following policy $\pi_\theta$.
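The claim that the gradient needs no derivative of the occupancy measure can be checked numerically. The following is a minimal sketch on a hypothetical three-state, two-action MDP with a sigmoid policy; the MDP, the policy form and all helper names are illustrative assumptions and are not taken from the paper. It compares the policy gradient expression with central finite differences of the policy value.

```python
import math

GAMMA = 0.9
STATES, ACTIONS = [0, 1, 2], [0, 1]
X = [-1.0, 0.0, 1.0]                       # scalar feature of each state (assumed)
P = [[[0.8, 0.2, 0.0], [0.1, 0.6, 0.3]],   # P[s][a][s_next] (assumed toy dynamics)
     [[0.5, 0.5, 0.0], [0.0, 0.3, 0.7]],
     [[0.2, 0.3, 0.5], [0.0, 0.1, 0.9]]]
R = [[0.0, 1.0], [2.0, 0.0], [0.5, 3.0]]   # R[s][a]
RHO0 = [1 / 3, 1 / 3, 1 / 3]               # initial state distribution


def policy(theta):
    """pi[s][a], with pi(a=1|s) = sigmoid(theta[0] + theta[1] * X[s])."""
    ps = [1.0 / (1.0 + math.exp(-(theta[0] + theta[1] * x))) for x in X]
    return [[1.0 - p, p] for p in ps]


def dpolicy(theta):
    """Gradient of pi(a|s) w.r.t. theta, stored as dpi[s][a][k]."""
    pi = policy(theta)
    return [[[pi[s][a] * (a - pi[s][1]) * f for f in (1.0, X[s])]
             for a in ACTIONS] for s in STATES]


def q_values(pi, iters=1000):
    """Q^pi via repeated Bellman expectation backups."""
    q = [[0.0, 0.0] for _ in STATES]
    for _ in range(iters):
        v = [sum(pi[s][a] * q[s][a] for a in ACTIONS) for s in STATES]
        q = [[R[s][a] + GAMMA * sum(P[s][a][n] * v[n] for n in STATES)
              for a in ACTIONS] for s in STATES]
    return q


def occupancy(pi, iters=1000):
    """Unnormalized discounted occupancy rho(s) = sum_t gamma^t Pr(s_t = s)."""
    d, rho = list(RHO0), [0.0, 0.0, 0.0]
    for t in range(iters):
        rho = [rho[s] + GAMMA ** t * d[s] for s in STATES]
        d = [sum(d[s] * pi[s][a] * P[s][a][n] for s in STATES for a in ACTIONS)
             for n in STATES]
    return rho


def eta(theta):
    """Value of the policy: expected Q under the initial state distribution."""
    pi = policy(theta)
    q = q_values(pi)
    return sum(RHO0[s] * pi[s][a] * q[s][a] for s in STATES for a in ACTIONS)


def policy_gradient(theta):
    """sum_s rho(s) sum_a grad pi(a|s) Q(s, a); rho itself is not differentiated."""
    pi = policy(theta)
    q, rho, dpi = q_values(pi), occupancy(pi), dpolicy(theta)
    return [sum(rho[s] * dpi[s][a][k] * q[s][a] for s in STATES for a in ACTIONS)
            for k in range(2)]


def finite_difference(theta, h=1e-5):
    """Central finite differences of eta, used only as a reference."""
    grads = []
    for k in range(2):
        up, down = list(theta), list(theta)
        up[k] += h
        down[k] -= h
        grads.append((eta(up) - eta(down)) / (2 * h))
    return grads


theta = [0.3, -0.5]
pg = policy_gradient(theta)
fd = finite_difference(theta)
print("policy gradient theorem:", pg)
print("finite differences:     ", fd)
```

In this sketch the occupancy measure and the action values are computed exactly, so the two gradients should agree up to finite-difference error; with sampled trajectories the same expression would be estimated by Monte Carlo instead.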

However, since $Q^{\pi_\theta}$ is not known in advance, it has to be estimated. One possible choice of the estimator is to use empirical returns: $\hat{Q}(s_t, a_t) = \sum_{k=t}^{\infty} \gamma^{k-t} r(s_k, a_k)$. The problem with this choice of estimator is the large variance of the stochastic version of the gradients, which results in poor practical performance (Ciosek and Whiteson, 2018).

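To reduce the variance in the estimation of $\nabla_\theta \eta(\pi_\theta)$, many algorithms learn a parametric approximation $f_w$ of $Q^{\pi_\theta}$ by solving the regression problem:

$$\min_w \sum_{i=1}^{N} \left(\hat{Q}(s_i, a_i) - f_w(s_i, a_i)\right)^2, \tag{4}$$

where $(s_i, a_i)$ are state action pairs sampled with policy $\pi_\theta$ and $\hat{Q}(s_i, a_i)$ are the corresponding regression targets. Using such $f_w$ in place of $Q^{\pi_\theta}$ can introduce bias to policy updates. The following theorem, derived in Sutton et al. (1999); Konda and Tsitsiklis (2003), provides conditions under which the function approximator $f_w$ of $Q^{\pi_\theta}$ does not introduce bias to $\nabla_\theta \eta(\pi_\theta)$.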

Theorem 1.

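Let $f_w$ be a differentiable function approximator that satisfies the following conditions:

$$\int_{\mathcal{S}} \rho_{\pi_\theta}(s) \int_{\mathcal{A}} \pi_\theta(a \mid s) \left[Q^{\pi_\theta}(s, a) - f_w(s, a)\right] \nabla_w f_w(s, a)\, da\, ds = 0, \tag{5}$$

$$\nabla_w f_w(s, a) = \nabla_\theta \log \pi_\theta(a \mid s). \tag{6}$$

Then:

$$\nabla_\theta \eta(\pi_\theta) = \int_{\mathcal{S}} \rho_{\pi_\theta}(s) \int_{\mathcal{A}} \nabla_\theta \pi_\theta(a \mid s)\, f_w(s, a)\, da\, ds. \tag{7}$$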






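Note that condition (5) can be satisfied by solving the regression problem (4). Satisfying the condition given by (6) can be done by setting $f_w(s, a) = w^\top \nabla_\theta \log \pi_\theta(a \mid s) + c$. The constant $c$ does not change the resulting gradient, since the expectation $\mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\left[\nabla_\theta \log \pi_\theta(a \mid s)\right] = 0$ for every $s$. See the discussion in Sutton et al. (1999) for details.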
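The effect of the two conditions can be illustrated with exact quantities on a small problem. The sketch below uses a hypothetical three-state MDP and a two-parameter sigmoid policy (all names and numbers are illustrative assumptions, not the paper's setup). It fits one critic that is linear in the score function of the policy and one critic with a non-compatible feature set, both by least squares weighted with the on-policy state-action distribution, and compares the resulting gradients with the one computed from the true action values.

```python
import math

GAMMA = 0.9
STATES, ACTIONS = [0, 1, 2], [0, 1]
X = [-1.0, 0.0, 1.0]                       # scalar feature of each state (assumed)
P = [[[0.8, 0.2, 0.0], [0.1, 0.6, 0.3]],   # P[s][a][s_next] (assumed toy dynamics)
     [[0.5, 0.5, 0.0], [0.0, 0.3, 0.7]],
     [[0.2, 0.3, 0.5], [0.0, 0.1, 0.9]]]
R = [[0.0, 1.0], [2.0, 0.0], [0.5, 3.0]]   # R[s][a]
RHO0 = [1 / 3, 1 / 3, 1 / 3]


def policy(theta):
    """pi[s][a], with pi(a=1|s) = sigmoid(theta[0] + theta[1] * X[s])."""
    ps = [1.0 / (1.0 + math.exp(-(theta[0] + theta[1] * x))) for x in X]
    return [[1.0 - p, p] for p in ps]


def dpolicy(theta):
    """Gradient of pi(a|s) w.r.t. theta, stored as dpi[s][a][k]."""
    pi = policy(theta)
    return [[[pi[s][a] * (a - pi[s][1]) * f for f in (1.0, X[s])]
             for a in ACTIONS] for s in STATES]


def q_values(pi, iters=1000):
    """Q^pi via repeated Bellman expectation backups."""
    q = [[0.0, 0.0] for _ in STATES]
    for _ in range(iters):
        v = [sum(pi[s][a] * q[s][a] for a in ACTIONS) for s in STATES]
        q = [[R[s][a] + GAMMA * sum(P[s][a][n] * v[n] for n in STATES)
              for a in ACTIONS] for s in STATES]
    return q


def occupancy(pi, iters=1000):
    """Unnormalized discounted occupancy rho(s) = sum_t gamma^t Pr(s_t = s)."""
    d, rho = list(RHO0), [0.0, 0.0, 0.0]
    for t in range(iters):
        rho = [rho[s] + GAMMA ** t * d[s] for s in STATES]
        d = [sum(d[s] * pi[s][a] * P[s][a][n] for s in STATES for a in ACTIONS)
             for n in STATES]
    return rho


def fit(phi, target, weight):
    """Weighted least squares for f_w(s, a) = w . phi[s][a] (2x2 normal equations)."""
    pairs = [(s, a) for s in STATES for a in ACTIONS]
    g = [[sum(weight[s][a] * phi[s][a][i] * phi[s][a][j] for s, a in pairs)
          for j in range(2)] for i in range(2)]
    b = [sum(weight[s][a] * phi[s][a][i] * target[s][a] for s, a in pairs)
         for i in range(2)]
    det = g[0][0] * g[1][1] - g[0][1] * g[1][0]
    return [(b[0] * g[1][1] - g[0][1] * b[1]) / det,
            (g[0][0] * b[1] - g[1][0] * b[0]) / det]


def gradient_with(values, rho, dpi):
    """sum_s rho(s) sum_a grad pi(a|s) * values[s][a]."""
    return [sum(rho[s] * dpi[s][a][k] * values[s][a] for s in STATES for a in ACTIONS)
            for k in range(2)]


theta = [0.3, -0.5]
pi, dpi = policy(theta), dpolicy(theta)
q, rho = q_values(pi), occupancy(pi)
weight = [[rho[s] * pi[s][a] for a in ACTIONS] for s in STATES]

# Compatible features: grad_theta log pi(a|s) = grad pi(a|s) / pi(a|s).
compatible = [[[d / pi[s][a] for d in dpi[s][a]] for a in ACTIONS] for s in STATES]
# A non-compatible choice (hypothetical): constant plus action indicator.
naive = [[[1.0, float(a)] for a in ACTIONS] for s in STATES]


def critic(phi):
    """Fitted critic values f_w(s, a) for the given feature table."""
    w = fit(phi, q, weight)
    return [[w[0] * phi[s][a][0] + w[1] * phi[s][a][1] for a in ACTIONS]
            for s in STATES]


grad_true = gradient_with(q, rho, dpi)
grad_compatible = gradient_with(critic(compatible), rho, dpi)
grad_naive = gradient_with(critic(naive), rho, dpi)
err_compatible = max(abs(x - y) for x, y in zip(grad_true, grad_compatible))
err_naive = max(abs(x - y) for x, y in zip(grad_true, grad_naive))
print("error with compatible features:    ", err_compatible)
print("error with non-compatible features:", err_naive)
```

The compatible critic cannot represent the true action values exactly here (three states, two parameters), yet the gradient it produces should match the true one up to floating-point error, while the non-compatible critic leaves a gradient error that more data would not remove.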

Performing gradient ascent on $\eta(\pi_\theta)$ requires resampling the data with every policy update, as the expectation in Equation (3) is taken over the distribution of trajectories induced by $\pi_\theta$. The approach presented in Kakade and Langford (2002); Schulman et al. (2015) tackles the problem of improving the policy in a different way.

Given the data gathered using the current policy $\pi_\theta$, we want to estimate a lower bound on the performance of an arbitrary policy $\pi_{\tilde\theta}$ and maximise it w.r.t. $\tilde\theta$. Since the lower bound is tight at $\tilde\theta = \theta$, maximizing this lower bound w.r.t. $\tilde\theta$ guarantees an improvement in the value of the policy (Schulman et al., 2015). Hence, sequential optimization of the discussed lower bound is called Monotonic Policy Improvement. The lower bound derived in Schulman et al. (2015) is summarized in the following theorem.

Theorem 2.

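Let $\alpha = \max_s D_{TV}\left(\pi_\theta(\cdot \mid s) \,\|\, \pi_{\tilde\theta}(\cdot \mid s)\right)$ and $\epsilon = \max_{s, a} \left|A^{\pi_\theta}(s, a)\right|$. Then $\eta(\pi_{\tilde\theta})$ can be lower bounded as follows:

$$\eta(\pi_{\tilde\theta}) \ge L_{\pi_\theta}(\pi_{\tilde\theta}) - \frac{4 \epsilon \gamma}{(1 - \gamma)^2}\, \alpha^2. \tag{8}$$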


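The work done in Achiam et al. (2017) extends this result so that the $\max$ operator in $\alpha$ can be replaced with the expected value taken w.r.t. $\rho_{\pi_\theta}$. Calculating the gradient $\nabla_{\tilde\theta} L_{\pi_\theta}(\pi_{\tilde\theta})$ is straightforward, as the occupancy measure $\rho_{\pi_\theta}$ in the definition of $L_{\pi_\theta}(\pi_{\tilde\theta})$ does not depend on $\tilde\theta$.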
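To make the surrogate objective and its lower bound concrete, the following sketch evaluates both exactly on a hypothetical three-state MDP. It assumes the form of the bound stated in Schulman et al. (2015), with the maximum total variation distance over states and the maximum absolute advantage; the MDP, the sigmoid policy and the function names are illustrative assumptions.

```python
import math

GAMMA = 0.9
STATES, ACTIONS = [0, 1, 2], [0, 1]
X = [-1.0, 0.0, 1.0]                       # scalar feature of each state (assumed)
P = [[[0.8, 0.2, 0.0], [0.1, 0.6, 0.3]],   # P[s][a][s_next] (assumed toy dynamics)
     [[0.5, 0.5, 0.0], [0.0, 0.3, 0.7]],
     [[0.2, 0.3, 0.5], [0.0, 0.1, 0.9]]]
R = [[0.0, 1.0], [2.0, 0.0], [0.5, 3.0]]   # R[s][a]
RHO0 = [1 / 3, 1 / 3, 1 / 3]


def policy(theta):
    """pi[s][a], with pi(a=1|s) = sigmoid(theta[0] + theta[1] * X[s])."""
    ps = [1.0 / (1.0 + math.exp(-(theta[0] + theta[1] * x))) for x in X]
    return [[1.0 - p, p] for p in ps]


def q_values(pi, iters=1000):
    """Q^pi via repeated Bellman expectation backups."""
    q = [[0.0, 0.0] for _ in STATES]
    for _ in range(iters):
        v = [sum(pi[s][a] * q[s][a] for a in ACTIONS) for s in STATES]
        q = [[R[s][a] + GAMMA * sum(P[s][a][n] * v[n] for n in STATES)
              for a in ACTIONS] for s in STATES]
    return q


def occupancy(pi, iters=1000):
    """Unnormalized discounted occupancy rho(s) = sum_t gamma^t Pr(s_t = s)."""
    d, rho = list(RHO0), [0.0, 0.0, 0.0]
    for t in range(iters):
        rho = [rho[s] + GAMMA ** t * d[s] for s in STATES]
        d = [sum(d[s] * pi[s][a] * P[s][a][n] for s in STATES for a in ACTIONS)
             for n in STATES]
    return rho


def eta(pi):
    """Value of a policy under the initial state distribution."""
    q = q_values(pi)
    return sum(RHO0[s] * pi[s][a] * q[s][a] for s in STATES for a in ACTIONS)


def surrogate_and_bound(theta, theta_new):
    """Return (L, L - penalty) for data policy theta and target policy theta_new."""
    pi, pi_new = policy(theta), policy(theta_new)
    q, rho = q_values(pi), occupancy(pi)
    v = [sum(pi[s][a] * q[s][a] for a in ACTIONS) for s in STATES]
    adv = [[q[s][a] - v[s] for a in ACTIONS] for s in STATES]
    surrogate = eta(pi) + sum(rho[s] * pi_new[s][a] * adv[s][a]
                              for s in STATES for a in ACTIONS)
    eps = max(abs(adv[s][a]) for s in STATES for a in ACTIONS)
    # Total variation distance between action distributions, maximized over states.
    alpha = max(0.5 * sum(abs(pi_new[s][a] - pi[s][a]) for a in ACTIONS)
                for s in STATES)
    penalty = 4.0 * eps * GAMMA / (1.0 - GAMMA) ** 2 * alpha ** 2
    return surrogate, surrogate - penalty


theta, theta_new = [0.3, -0.5], [0.4, -0.6]
surrogate, lower = surrogate_and_bound(theta, theta_new)
same, same_lower = surrogate_and_bound(theta, theta)
eta_old, eta_new = eta(policy(theta)), eta(policy(theta_new))
print("eta(new):", eta_new, " surrogate:", surrogate, " lower bound:", lower)
```

When both policies coincide the penalty vanishes and the surrogate equals the policy value, which is the tightness property that makes maximizing the bound an improvement step.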

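In the practical setting, $L_{\pi_\theta}(\pi_{\tilde\theta})$ is optimized by constraining the divergence between $\pi_\theta$ and $\pi_{\tilde\theta}$ (Schulman et al., 2015, 2017). The algorithm derived in Schulman et al. (2015) is closely related to natural gradient policy optimisation (Kakade, 2001). Note that $\nabla_{\tilde\theta} L_{\pi_\theta}(\pi_{\tilde\theta})$ is a biased approximation of $\nabla_{\tilde\theta} \eta(\pi_{\tilde\theta})$, but the bias can be controlled by restricting the distance between $\pi_\theta$ and $\pi_{\tilde\theta}$.

Optimizing $L_{\pi_\theta}(\pi_{\tilde\theta})$ requires knowing the values of $A^{\pi_\theta}(s, a)$. There are various ways of estimating the critic, for instance see Schulman et al. (2015). However, similarly to the case of Theorem 1, using an arbitrary parametrization of the critic introduces bias to an estimate of $\nabla_{\tilde\theta} L_{\pi_\theta}(\pi_{\tilde\theta})$.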

4 Compatible features for surrogate policy optimization

In this section, we seek a parametric form of an approximator $f_w$ of $Q^{\pi_\theta}$ for which $\nabla_{\tilde\theta} L_{\pi_\theta}(\pi_{\tilde\theta})$ remains unbiased. To this end, we follow the approach presented in Sutton et al. (1999). We derive the following theorem.

Theorem 3 (Compatible features for Monotonic Policy Improvement).

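Assuming the following condition is satisfied:

$$\int_{\mathcal{S}} \rho_{\pi_\theta}(s) \int_{\mathcal{A}} \pi_\theta(a \mid s) \left[Q^{\pi_\theta}(s, a) - f_w(s, a)\right] \nabla_w f_w(s, a)\, da\, ds = 0, \tag{9}$$

and $f_w$ is parametrized as:

$$f_w(s, a) = w^\top \frac{\pi_{\tilde\theta}(a \mid s)}{\pi_\theta(a \mid s)} \nabla_{\tilde\theta} \log \pi_{\tilde\theta}(a \mid s) + c, \tag{10}$$

then:

$$\nabla_{\tilde\theta} L_{\pi_\theta}(\pi_{\tilde\theta}) = \int_{\mathcal{S}} \rho_{\pi_\theta}(s) \int_{\mathcal{A}} \nabla_{\tilde\theta} \pi_{\tilde\theta}(a \mid s)\, f_w(s, a)\, da\, ds. \tag{11}$$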






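Firstly, we note that the value function $V^{\pi_\theta}$ in the definition of $A^{\pi_\theta}$ fulfils the role of a control variate, i.e. it does not influence the expectation of the gradient. To see this, we analyse the gradient $\nabla_{\tilde\theta} L_{\pi_\theta}(\pi_{\tilde\theta})$:

$$\begin{aligned}
\nabla_{\tilde\theta} L_{\pi_\theta}(\pi_{\tilde\theta}) &= \nabla_{\tilde\theta} \int_{\mathcal{S}} \rho_{\pi_\theta}(s) \int_{\mathcal{A}} \pi_{\tilde\theta}(a \mid s) \left[Q^{\pi_\theta}(s, a) - V^{\pi_\theta}(s)\right] da\, ds \\
&= \int_{\mathcal{S}} \rho_{\pi_\theta}(s) \int_{\mathcal{A}} \nabla_{\tilde\theta} \pi_{\tilde\theta}(a \mid s)\, Q^{\pi_\theta}(s, a)\, da\, ds - \int_{\mathcal{S}} \rho_{\pi_\theta}(s)\, V^{\pi_\theta}(s)\, \nabla_{\tilde\theta} \int_{\mathcal{A}} \pi_{\tilde\theta}(a \mid s)\, da\, ds \\
&= \int_{\mathcal{S}} \rho_{\pi_\theta}(s) \int_{\mathcal{A}} \nabla_{\tilde\theta} \pi_{\tilde\theta}(a \mid s)\, Q^{\pi_\theta}(s, a)\, da\, ds,
\end{aligned} \tag{12}$$

where the second step is allowed since $\nabla_{\tilde\theta} \pi_{\tilde\theta}(a \mid s)$ exists and is continuous for every $s$ and $a$; and the last step is due to the fact that $\int_{\mathcal{A}} \pi_{\tilde\theta}(a \mid s)\, da = 1$ does not depend on $\tilde\theta$, i.e. $\nabla_{\tilde\theta} \int_{\mathcal{A}} \pi_{\tilde\theta}(a \mid s)\, da = 0$.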

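Next we define the gradient computed with the approximator, $\hat{g} = \int_{\mathcal{S}} \rho_{\pi_\theta}(s) \int_{\mathcal{A}} \nabla_{\tilde\theta} \pi_{\tilde\theta}(a \mid s)\, f_w(s, a)\, da\, ds$. We seek a condition under which the following equality holds:

$$\int_{\mathcal{S}} \rho_{\pi_\theta}(s) \int_{\mathcal{A}} \nabla_{\tilde\theta} \pi_{\tilde\theta}(a \mid s)\, Q^{\pi_\theta}(s, a)\, da\, ds = \hat{g}. \tag{13}$$

By subtracting $\hat{g}$ and the left-hand side of the assumption given by (9), which equals zero, from the gradient expression (12), we obtain:

$$\begin{aligned}
\nabla_{\tilde\theta} L_{\pi_\theta}(\pi_{\tilde\theta}) - \hat{g} &= \int_{\mathcal{S}} \rho_{\pi_\theta}(s) \int_{\mathcal{A}} \left[Q^{\pi_\theta}(s, a) - f_w(s, a)\right] \left[\nabla_{\tilde\theta} \pi_{\tilde\theta}(a \mid s) - \pi_\theta(a \mid s) \nabla_w f_w(s, a)\right] da\, ds \\
&= \int_{\mathcal{S}} \rho_{\pi_\theta}(s) \int_{\mathcal{A}} \pi_\theta(a \mid s) \left[Q^{\pi_\theta}(s, a) - f_w(s, a)\right] \left[\frac{\pi_{\tilde\theta}(a \mid s)}{\pi_\theta(a \mid s)} \nabla_{\tilde\theta} \log \pi_{\tilde\theta}(a \mid s) - \nabla_w f_w(s, a)\right] da\, ds,
\end{aligned} \tag{14}$$

where we have used the log trick, $\nabla_{\tilde\theta} \pi_{\tilde\theta}(a \mid s) = \pi_{\tilde\theta}(a \mid s) \nabla_{\tilde\theta} \log \pi_{\tilde\theta}(a \mid s)$, in the second line. Hence Equation (13) can be satisfied by requiring:

$$\nabla_w f_w(s, a) = \frac{\pi_{\tilde\theta}(a \mid s)}{\pi_\theta(a \mid s)} \nabla_{\tilde\theta} \log \pi_{\tilde\theta}(a \mid s). \tag{15}$$

Integrating this last equation w.r.t. $w$ yields $f_w(s, a) = w^\top \frac{\pi_{\tilde\theta}(a \mid s)}{\pi_\theta(a \mid s)} \nabla_{\tilde\theta} \log \pi_{\tilde\theta}(a \mid s) + c$, which completes the proof. ∎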
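The two steps of the proof can be checked numerically with exact quantities. The sketch below is an illustration on a hypothetical three-state MDP with a two-parameter sigmoid policy (all names and numbers are assumptions, not the paper's experiment). It verifies that replacing action values by advantages leaves the surrogate gradient unchanged, and that a critic built from importance-weighted score features and fitted under the data policy reproduces the surrogate gradient.

```python
import math

GAMMA = 0.9
STATES, ACTIONS = [0, 1, 2], [0, 1]
X = [-1.0, 0.0, 1.0]                       # scalar feature of each state (assumed)
P = [[[0.8, 0.2, 0.0], [0.1, 0.6, 0.3]],   # P[s][a][s_next] (assumed toy dynamics)
     [[0.5, 0.5, 0.0], [0.0, 0.3, 0.7]],
     [[0.2, 0.3, 0.5], [0.0, 0.1, 0.9]]]
R = [[0.0, 1.0], [2.0, 0.0], [0.5, 3.0]]   # R[s][a]
RHO0 = [1 / 3, 1 / 3, 1 / 3]


def policy(theta):
    """pi[s][a], with pi(a=1|s) = sigmoid(theta[0] + theta[1] * X[s])."""
    ps = [1.0 / (1.0 + math.exp(-(theta[0] + theta[1] * x))) for x in X]
    return [[1.0 - p, p] for p in ps]


def dpolicy(theta):
    """Gradient of pi(a|s) w.r.t. theta, stored as dpi[s][a][k]."""
    pi = policy(theta)
    return [[[pi[s][a] * (a - pi[s][1]) * f for f in (1.0, X[s])]
             for a in ACTIONS] for s in STATES]


def q_values(pi, iters=1000):
    """Q^pi via repeated Bellman expectation backups."""
    q = [[0.0, 0.0] for _ in STATES]
    for _ in range(iters):
        v = [sum(pi[s][a] * q[s][a] for a in ACTIONS) for s in STATES]
        q = [[R[s][a] + GAMMA * sum(P[s][a][n] * v[n] for n in STATES)
              for a in ACTIONS] for s in STATES]
    return q


def occupancy(pi, iters=1000):
    """Unnormalized discounted occupancy rho(s) = sum_t gamma^t Pr(s_t = s)."""
    d, rho = list(RHO0), [0.0, 0.0, 0.0]
    for t in range(iters):
        rho = [rho[s] + GAMMA ** t * d[s] for s in STATES]
        d = [sum(d[s] * pi[s][a] * P[s][a][n] for s in STATES for a in ACTIONS)
             for n in STATES]
    return rho


def fit(phi, target, weight):
    """Weighted least squares for f_w(s, a) = w . phi[s][a] (2x2 normal equations)."""
    pairs = [(s, a) for s in STATES for a in ACTIONS]
    g = [[sum(weight[s][a] * phi[s][a][i] * phi[s][a][j] for s, a in pairs)
          for j in range(2)] for i in range(2)]
    b = [sum(weight[s][a] * phi[s][a][i] * target[s][a] for s, a in pairs)
         for i in range(2)]
    det = g[0][0] * g[1][1] - g[0][1] * g[1][0]
    return [(b[0] * g[1][1] - g[0][1] * b[1]) / det,
            (g[0][0] * b[1] - g[1][0] * b[0]) / det]


def gradient_with(values, rho, dpi):
    """sum_s rho(s) sum_a grad pi_new(a|s) * values[s][a]."""
    return [sum(rho[s] * dpi[s][a][k] * values[s][a] for s in STATES for a in ACTIONS)
            for k in range(2)]


theta, theta_new = [0.3, -0.5], [0.6, -0.9]   # data policy and target policy
pi, dpi_new = policy(theta), dpolicy(theta_new)
q, rho = q_values(pi), occupancy(pi)
v = [sum(pi[s][a] * q[s][a] for a in ACTIONS) for s in STATES]
adv = [[q[s][a] - v[s] for a in ACTIONS] for s in STATES]

# Importance-weighted score features:
# (pi_new / pi) * grad log pi_new = grad pi_new / pi.
phi = [[[d / pi[s][a] for d in dpi_new[s][a]] for a in ACTIONS] for s in STATES]
weight = [[rho[s] * pi[s][a] for a in ACTIONS] for s in STATES]  # actions drawn from pi
w = fit(phi, q, weight)
f = [[w[0] * phi[s][a][0] + w[1] * phi[s][a][1] for a in ACTIONS] for s in STATES]

grad_q = gradient_with(q, rho, dpi_new)      # gradient from true action values
grad_adv = gradient_with(adv, rho, dpi_new)  # same gradient, value function subtracted
grad_f = gradient_with(f, rho, dpi_new)      # gradient from the fitted critic
print(grad_q, grad_adv, grad_f)
```

All three printed gradients should coincide up to floating-point error, even though the two-parameter critic cannot represent the action values of the three-state problem exactly.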

Again setting ensures that:


In Sutton et al. (1999) the authors conjecture that the compatible features might be the only choice of features that leads to an unbiased policy gradient $\nabla_\theta \eta(\pi_\theta)$. In the case of $L_{\pi_\theta}(\pi_{\tilde\theta})$ we can provide another choice of features leading to an unbiased gradient $\nabla_{\tilde\theta} L_{\pi_\theta}(\pi_{\tilde\theta})$. The features derived in Theorem 3 depend on the importance sampling weights $\pi_{\tilde\theta}(a \mid s) / \pi_\theta(a \mid s)$. To remove the importance sampling weights from the derived compatible features, we modify the condition in Equation (9) by replacing the action sampling distribution $\pi_\theta$ with $\pi_{\tilde\theta}$.

Theorem 4.

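Assuming that the following condition is satisfied:

$$\int_{\mathcal{S}} \rho_{\pi_\theta}(s) \int_{\mathcal{A}} \pi_{\tilde\theta}(a \mid s) \left[Q^{\pi_\theta}(s, a) - f_w(s, a)\right] \nabla_w f_w(s, a)\, da\, ds = 0, \tag{17}$$

and $f_w$ is parametrized as:

$$f_w(s, a) = w^\top \nabla_{\tilde\theta} \log \pi_{\tilde\theta}(a \mid s) + c, \tag{18}$$

then:

$$\nabla_{\tilde\theta} L_{\pi_\theta}(\pi_{\tilde\theta}) = \int_{\mathcal{S}} \rho_{\pi_\theta}(s) \int_{\mathcal{A}} \nabla_{\tilde\theta} \pi_{\tilde\theta}(a \mid s)\, f_w(s, a)\, da\, ds. \tag{19}$$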






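We derive the result by following a similar line of thought as in the proof of Theorem 3. Again we use $\hat{g} = \int_{\mathcal{S}} \rho_{\pi_\theta}(s) \int_{\mathcal{A}} \nabla_{\tilde\theta} \pi_{\tilde\theta}(a \mid s)\, f_w(s, a)\, da\, ds$, the gradient computed with the approximator. We subtract $\hat{g}$ and the left-hand side of the assumption given by (17), which equals zero, from $\nabla_{\tilde\theta} L_{\pi_\theta}(\pi_{\tilde\theta})$, which yields:

$$\nabla_{\tilde\theta} L_{\pi_\theta}(\pi_{\tilde\theta}) - \hat{g} = \int_{\mathcal{S}} \rho_{\pi_\theta}(s) \int_{\mathcal{A}} \pi_{\tilde\theta}(a \mid s) \left[Q^{\pi_\theta}(s, a) - f_w(s, a)\right] \left[\nabla_{\tilde\theta} \log \pi_{\tilde\theta}(a \mid s) - \nabla_w f_w(s, a)\right] da\, ds. \tag{20}$$

The integral in Equation (20) can be set to zero by setting $f_w$ to solve the differential equation $\nabla_w f_w(s, a) = \nabla_{\tilde\theta} \log \pi_{\tilde\theta}(a \mid s)$, which results in the desired compatible features $f_w(s, a) = w^\top \nabla_{\tilde\theta} \log \pi_{\tilde\theta}(a \mid s) + c$. ∎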

Note that the features derived in Theorem 4 are analogous to the ones derived in Theorem 1, with the difference that the occupancy measure is taken w.r.t. the data gathering policy in the case of Theorem 4.
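The role of the action weighting in Theorem 4 can be illustrated with exact quantities. The sketch below uses a hypothetical three-state MDP and a sigmoid policy (illustrative assumptions throughout, not the paper's experiment). Both critics use the score of the target policy as features; one is fitted with actions weighted by the target policy, as the theorem's condition requires, and the other with actions weighted by the data policy, as an ordinary on-policy regression would do.

```python
import math

GAMMA = 0.9
STATES, ACTIONS = [0, 1, 2], [0, 1]
X = [-1.0, 0.0, 1.0]                       # scalar feature of each state (assumed)
P = [[[0.8, 0.2, 0.0], [0.1, 0.6, 0.3]],   # P[s][a][s_next] (assumed toy dynamics)
     [[0.5, 0.5, 0.0], [0.0, 0.3, 0.7]],
     [[0.2, 0.3, 0.5], [0.0, 0.1, 0.9]]]
R = [[0.0, 1.0], [2.0, 0.0], [0.5, 3.0]]   # R[s][a]
RHO0 = [1 / 3, 1 / 3, 1 / 3]


def policy(theta):
    """pi[s][a], with pi(a=1|s) = sigmoid(theta[0] + theta[1] * X[s])."""
    ps = [1.0 / (1.0 + math.exp(-(theta[0] + theta[1] * x))) for x in X]
    return [[1.0 - p, p] for p in ps]


def dpolicy(theta):
    """Gradient of pi(a|s) w.r.t. theta, stored as dpi[s][a][k]."""
    pi = policy(theta)
    return [[[pi[s][a] * (a - pi[s][1]) * f for f in (1.0, X[s])]
             for a in ACTIONS] for s in STATES]


def q_values(pi, iters=1000):
    """Q^pi via repeated Bellman expectation backups."""
    q = [[0.0, 0.0] for _ in STATES]
    for _ in range(iters):
        v = [sum(pi[s][a] * q[s][a] for a in ACTIONS) for s in STATES]
        q = [[R[s][a] + GAMMA * sum(P[s][a][n] * v[n] for n in STATES)
              for a in ACTIONS] for s in STATES]
    return q


def occupancy(pi, iters=1000):
    """Unnormalized discounted occupancy rho(s) = sum_t gamma^t Pr(s_t = s)."""
    d, rho = list(RHO0), [0.0, 0.0, 0.0]
    for t in range(iters):
        rho = [rho[s] + GAMMA ** t * d[s] for s in STATES]
        d = [sum(d[s] * pi[s][a] * P[s][a][n] for s in STATES for a in ACTIONS)
             for n in STATES]
    return rho


def fit(phi, target, weight):
    """Weighted least squares for f_w(s, a) = w . phi[s][a] (2x2 normal equations)."""
    pairs = [(s, a) for s in STATES for a in ACTIONS]
    g = [[sum(weight[s][a] * phi[s][a][i] * phi[s][a][j] for s, a in pairs)
          for j in range(2)] for i in range(2)]
    b = [sum(weight[s][a] * phi[s][a][i] * target[s][a] for s, a in pairs)
         for i in range(2)]
    det = g[0][0] * g[1][1] - g[0][1] * g[1][0]
    return [(b[0] * g[1][1] - g[0][1] * b[1]) / det,
            (g[0][0] * b[1] - g[1][0] * b[0]) / det]


def gradient_with(values, rho, dpi):
    """sum_s rho(s) sum_a grad pi_new(a|s) * values[s][a]."""
    return [sum(rho[s] * dpi[s][a][k] * values[s][a] for s in STATES for a in ACTIONS)
            for k in range(2)]


theta, theta_new = [0.3, -0.5], [0.6, -0.9]   # data policy and target policy
pi, pi_new, dpi_new = policy(theta), policy(theta_new), dpolicy(theta_new)
q, rho = q_values(pi), occupancy(pi)

# Score features of the target policy: grad log pi_new = grad pi_new / pi_new.
phi = [[[d / pi_new[s][a] for d in dpi_new[s][a]] for a in ACTIONS] for s in STATES]
target_weight = [[rho[s] * pi_new[s][a] for a in ACTIONS] for s in STATES]
data_weight = [[rho[s] * pi[s][a] for a in ACTIONS] for s in STATES]


def surrogate_gradient(weight):
    """Surrogate gradient computed from a critic fitted with the given weights."""
    w = fit(phi, q, weight)
    f = [[w[0] * phi[s][a][0] + w[1] * phi[s][a][1] for a in ACTIONS] for s in STATES]
    return gradient_with(f, rho, dpi_new)


grad_true = gradient_with(q, rho, dpi_new)
err_weighted = max(abs(x - y) for x, y in
                   zip(grad_true, surrogate_gradient(target_weight)))
err_unweighted = max(abs(x - y) for x, y in
                     zip(grad_true, surrogate_gradient(data_weight)))
print("error, target-policy weighting:", err_weighted)
print("error, data-policy weighting:  ", err_unweighted)
```

With target-policy weighting the fitted critic should reproduce the surrogate gradient up to floating-point error. The data-policy weighting uses the same features but solves a different regression, so its gradient is expected to deviate whenever the two policies differ and the critic cannot represent the action values exactly.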

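Assumption (17) in Theorem 4 has a natural interpretation. Intuitively, the critic values $f_w(s, a)$ should be more accurate for actions with high probability under policy $\pi_{\tilde\theta}$. More formally, the condition in (17) can be satisfied by finding a parameter $w$ that minimises the weighted quadratic error $\int_{\mathcal{S}} \rho_{\pi_\theta}(s) \int_{\mathcal{A}} \pi_{\tilde\theta}(a \mid s) \left[Q^{\pi_\theta}(s, a) - f_w(s, a)\right]^2 da\, ds$, where the weights $\pi_{\tilde\theta}(a \mid s)$ ensure improved accuracy for actions that are likely under the target policy $\pi_{\tilde\theta}$. Similarly, when the target policy assigns low probability to an action, the error in the critic estimate is not relevant, as it will not influence the gradient $\nabla_{\tilde\theta} L_{\pi_\theta}(\pi_{\tilde\theta})$.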

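Given a sequence of state action pairs $(s_i, a_i)_{i=1}^{N}$ gathered by following policy $\pi_\theta$, the integrals in condition (17) can also be approximated with samples, which leads to a weighted regression problem emerging from Theorem 4:

$$\min_w \sum_{i=1}^{N} \frac{\pi_{\tilde\theta}(a_i \mid s_i)}{\pi_\theta(a_i \mid s_i)} \left(\hat{Q}(s_i, a_i) - f_w(s_i, a_i)\right)^2, \tag{21}$$

where $\hat{Q}(s_i, a_i)$ denotes the regression targets for $Q^{\pi_\theta}(s_i, a_i)$.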
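A sample-based version of this weighted regression might look as follows. This is a minimal sketch under several assumptions that are not taken from the paper: a hypothetical three-state MDP, a sigmoid policy, true action values used as regression targets, and each sample additionally weighted by the discount factor raised to its time step so that the samples mimic the discounted occupancy measure. The helper names `rollouts` and `fit_weighted` are illustrative.

```python
import math
import random

GAMMA = 0.9
STATES, ACTIONS = [0, 1, 2], [0, 1]
X = [-1.0, 0.0, 1.0]                       # scalar feature of each state (assumed)
P = [[[0.8, 0.2, 0.0], [0.1, 0.6, 0.3]],   # P[s][a][s_next] (assumed toy dynamics)
     [[0.5, 0.5, 0.0], [0.0, 0.3, 0.7]],
     [[0.2, 0.3, 0.5], [0.0, 0.1, 0.9]]]
R = [[0.0, 1.0], [2.0, 0.0], [0.5, 3.0]]   # R[s][a]
RHO0 = [1 / 3, 1 / 3, 1 / 3]


def policy(theta):
    """pi[s][a], with pi(a=1|s) = sigmoid(theta[0] + theta[1] * X[s])."""
    ps = [1.0 / (1.0 + math.exp(-(theta[0] + theta[1] * x))) for x in X]
    return [[1.0 - p, p] for p in ps]


def q_values(pi, iters=1000):
    """Q^pi via repeated Bellman expectation backups (used as regression targets)."""
    q = [[0.0, 0.0] for _ in STATES]
    for _ in range(iters):
        v = [sum(pi[s][a] * q[s][a] for a in ACTIONS) for s in STATES]
        q = [[R[s][a] + GAMMA * sum(P[s][a][n] * v[n] for n in STATES)
              for a in ACTIONS] for s in STATES]
    return q


def grad_log(pi, s, a):
    """Score grad_theta log pi(a|s) of the sigmoid policy."""
    return [a - pi[s][1], (a - pi[s][1]) * X[s]]


def rollouts(pi, episodes, horizon, rng):
    """Sample (t, s, a) triples by following pi from the initial distribution."""
    samples = []
    for _ in range(episodes):
        s = rng.choices(STATES, weights=RHO0)[0]
        for t in range(horizon):
            a = 1 if rng.random() < pi[s][1] else 0
            samples.append((t, s, a))
            s = rng.choices(STATES, weights=P[s][a])[0]
    return samples


def fit_weighted(samples, pi, pi_new, target):
    """Importance-weighted least squares with score features of the target policy."""
    g = [[0.0, 0.0], [0.0, 0.0]]
    b = [0.0, 0.0]
    for t, s, a in samples:
        # gamma^t mimics the discounted occupancy (an assumption of this sketch);
        # pi_new / pi reweights actions drawn from the data policy.
        weight = GAMMA ** t * pi_new[s][a] / pi[s][a]
        phi = grad_log(pi_new, s, a)
        for i in range(2):
            b[i] += weight * phi[i] * target[s][a]
            for j in range(2):
                g[i][j] += weight * phi[i] * phi[j]
    det = g[0][0] * g[1][1] - g[0][1] * g[1][0]
    return [(b[0] * g[1][1] - g[0][1] * b[1]) / det,
            (g[0][0] * b[1] - g[1][0] * b[0]) / det]


theta, theta_new = [0.3, -0.5], [0.6, -0.9]   # data policy and target policy
pi, pi_new = policy(theta), policy(theta_new)
samples = rollouts(pi, episodes=200, horizon=40, rng=random.Random(0))
w = fit_weighted(samples, pi, pi_new, q_values(pi))

# Sanity check: targets that are exactly linear in the features are recovered.
w_true = [1.5, -0.7]
linear = [[sum(c * f for c, f in zip(w_true, grad_log(pi_new, s, a)))
           for a in ACTIONS] for s in STATES]
w_recovered = fit_weighted(samples, pi, pi_new, linear)
print("critic weights from true action values:", w)
print("recovered weights for linear targets:  ", w_recovered)
```

The importance ratio appears because the actions in the data come from the data gathering policy, while the condition of Theorem 4 weights actions by the target policy. In practice the true action values would be replaced by estimated targets such as empirical returns.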


5 Experiments

In this section we use the NChain (Strens, 2000) environment to compare the gradient $\nabla_{\tilde\theta} L_{\pi_\theta}(\pi_{\tilde\theta})$ calculated in two different ways: i) using a standard linear critic of the form , learned with the standard least squares regression approach given by (4); and ii) using a linear critic with compatible features given by (18), learned by solving the weighted least squares regression problem (21). We use the following parametric policy function . Hence, the proposed compatible features are given by , since and . We provide the true state-action value function $Q^{\pi_\theta}$ as targets to learn the critics. We estimate the gradient $\nabla_{\tilde\theta} L_{\pi_\theta}(\pi_{\tilde\theta})$ using rollouts from policy $\pi_\theta$. We set policy parameters to ; and ; . We use the expression for $\nabla_{\tilde\theta} L_{\pi_\theta}(\pi_{\tilde\theta})$ given by Equation (4).

We compare the obtained gradients to the ground truth calculated using the true state-action value function $Q^{\pi_\theta}$. We provide an increasing number of rollouts from policy $\pi_\theta$ to both approaches and analyse the errors in estimation. We report the bias, variance and RMSE of the estimated gradients in Figure 1.

Figure 1: Comparison of the gradient obtained with the compatible parametrization and with a standard critic. Employing a critic with the derived compatible features yields an unbiased gradient of the surrogate objective. The results are averaged over trials.

As expected, the bias in the estimation of $\nabla_{\tilde\theta} L_{\pi_\theta}(\pi_{\tilde\theta})$ introduced by the standard critic cannot be removed by increasing the number of provided trajectories, as it is caused by employing a non-compatible function approximator. The gradient estimates obtained with the derived compatible features are unbiased by construction, and hence provide a substantially improved estimate of $\nabla_{\tilde\theta} L_{\pi_\theta}(\pi_{\tilde\theta})$. Interestingly, using compatible features also provides slightly lower variance than using the standard critic.

6 Conclusions

We have systematically analysed the use of parametric critics to estimate the policy optimization surrogate objective. As a consequence, we can provide conditions under which the approximation does not introduce bias to the policy updates. This result holds for a general class of policies, including policies parametrized by deep neural networks. We have shown that for the investigated surrogate objective there exist two different choices of compatible features. We empirically demonstrated that the compatible features allow the gradient of the surrogate objective to be estimated more accurately.


  • [1] J. Achiam, D. Held, A. Tamar, and P. Abbeel (2017) Constrained policy optimization. In Proceedings of the 34th International Conference on Machine Learning, D. Precup and Y. W. Teh (Eds.), Proceedings of Machine Learning Research, Vol. 70, Sydney, Australia, pp. 22–31.
  • [2] K. Ciosek and S. Whiteson (2018) Expected policy gradients. In Thirty-Second AAAI Conference on Artificial Intelligence.
  • [3] Y. Duan, X. Chen, R. Houthooft, J. Schulman, and P. Abbeel (2016) Benchmarking deep reinforcement learning for continuous control. In Proceedings of the 33rd International Conference on Machine Learning, Volume 48, ICML'16, pp. 1329–1338.
  • [4] P. Henderson, R. Islam, P. Bachman, J. Pineau, D. Precup, and D. Meger (2017) Deep reinforcement learning that matters. CoRR abs/1709.06560.
  • [5] S. Kakade and J. Langford (2002) Approximately optimal approximate reinforcement learning. In Proceedings of the Nineteenth International Conference on Machine Learning, ICML '02, San Francisco, CA, USA, pp. 267–274.
  • [6] S. Kakade (2001) A natural policy gradient. In Proceedings of the 14th International Conference on Neural Information Processing Systems: Natural and Synthetic, NIPS'01, Cambridge, MA, USA, pp. 1531–1538.
  • [7] V. R. Konda and J. N. Tsitsiklis (2003) On actor-critic algorithms. SIAM J. Control Optim. 42 (4), pp. 1143–1166.
  • [8] J. Pajarinen, H. L. Thai, R. Akrour, J. Peters, and G. Neumann (2019) Compatible natural gradient policy search. CoRR abs/1902.02823.
  • [9] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz (2015) Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning, F. Bach and D. Blei (Eds.), Proceedings of Machine Learning Research, Vol. 37, Lille, France, pp. 1889–1897.
  • [10] J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel (2015) High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438.
  • [11] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov (2017) Proximal policy optimization algorithms. CoRR abs/1707.06347.
  • [12] M. J. A. Strens (2000) A Bayesian framework for reinforcement learning. In Proceedings of the Seventeenth International Conference on Machine Learning, ICML '00, San Francisco, CA, USA, pp. 943–950.
  • [13] R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour (1999) Policy gradient methods for reinforcement learning with function approximation. In Proceedings of the 12th International Conference on Neural Information Processing Systems, NIPS'99, Cambridge, MA, USA, pp. 1057–1063.