1 Introduction
Model-free reinforcement learning has been applied successfully to a variety of problems Duan et al. (2016). Nevertheless, many state-of-the-art approaches introduce bias into the policy updates and hence do not guarantee convergence. This can make the algorithms oversensitive to changes in initial conditions and has an adverse influence on reproducibility Henderson et al. (2017).
Policy updates often use values learned by a critic to reduce the variance of stochastic gradients. However, an arbitrary parametrization of the critic can introduce bias into the policy updates. Early work in Sutton et al. (1999) provides restrictions on the form of the function approximator under which the policy gradient remains unbiased and the convergence of learning is guaranteed. The notion of compatible features introduced in Sutton et al. (1999) also has connections to the natural policy gradient Kakade (2001). Compatible features remain an active area of research Pajarinen et al. (2019).

Modern policy optimization algorithms focus on optimizing a surrogate objective that approximates the value of the policy Schulman et al. (2015, 2017). This surrogate objective remains an accurate optimization target when the policy used to gather the data is close to the policy being optimized Schulman et al. (2015); Kakade and Langford (2002). In this work we investigate the conditions on the parametrization of the critic under which the updates of the surrogate objective remain unbiased.
2 Preliminaries
We assume a classical MDP formulation as in Sutton et al. (1999). We consider an MDP given by a tuple $(\mathcal{S}, \mathcal{A}, P, r, \rho_0, \gamma)$, where $\mathcal{S}$ is a set of states, $\mathcal{A}$ is a set of actions, $P(s_{t+1} \mid s_t, a_t)$ is a transition model, $r(s, a)$ is a reward function, $\rho_0$ is an initial distribution over states and $\gamma \in (0, 1)$ is a discount factor. We denote a trajectory as $\tau = (s_0, a_0, s_1, a_1, \dots)$. Given a stochastic policy $\pi(a \mid s)$, we use $\tau \sim \pi$ to denote that the trajectory has been generated by following policy $\pi$, i.e. $s_0 \sim \rho_0$, $a_t \sim \pi(\cdot \mid s_t)$, and $s_{t+1} \sim P(\cdot \mid s_t, a_t)$.
We denote the unnormalized discounted state occupancy measure as $\rho_\pi(s) = \sum_{t=0}^{\infty} \gamma^t P(s_t = s \mid \pi)$. The value function is defined as $V^\pi(s) = \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \mid s_0 = s\right]$ and the state-action value function is $Q^\pi(s, a) = \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \mid s_0 = s, a_0 = a\right]$. The advantage function is defined as $A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)$. To emphasize that the policy is parametrized by parameters $\theta$ we use the notation $\pi_\theta$. We consider policies such that $\nabla_\theta \pi_\theta(a \mid s)$ exists and is continuous for every $s$ and $a$. This is a general class of policies that includes, e.g., expressive deep neural networks. Under this discounted cost setting, reinforcement learning algorithms seek a policy that maximizes its value, defined as:
$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t)\right] \qquad (1)$$
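As a minimal illustration, the value in Equation (1) can be estimated by averaging discounted returns over sampled trajectories. The `sample_rewards` callable below is a hypothetical interface standing in for a rollout of $\pi_\theta$ in the environment:

```python
import numpy as np

def discounted_return(rewards, gamma):
    """Discounted return sum_t gamma^t * r_t of one trajectory."""
    return float(sum(gamma ** t * r for t, r in enumerate(rewards)))

def estimate_J(sample_rewards, gamma, n_rollouts=1000):
    """Monte Carlo estimate of J(theta): average discounted return over
    rollouts. sample_rewards() returns the reward sequence of one
    trajectory generated by following pi_theta (hypothetical interface)."""
    return float(np.mean([discounted_return(sample_rewards(), gamma)
                          for _ in range(n_rollouts)]))
```

For example, a deterministic reward sequence $[1, 1, 1]$ with $\gamma = 0.5$ has discounted return $1 + 0.5 + 0.25 = 1.75$.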
When dealing with two different parametric policies, $\pi_\theta$ and $\pi_{\theta'}$, with parameters $\theta$ and $\theta'$, respectively, we follow the notation in Schulman et al. (2015) and define a surrogate policy optimization objective as
$$L_\theta(\theta') = J(\theta) + \mathbb{E}_{s \sim \rho_{\pi_\theta}} \mathbb{E}_{a \sim \pi_{\theta'}}\left[A^{\pi_\theta}(s, a)\right] \qquad (2)$$
When the expected divergence between $\pi_\theta$ and $\pi_{\theta'}$ is small, $L_\theta(\theta')$ can be treated as a good approximation of $J(\theta')$ Schulman et al. (2015); Kakade and Langford (2002).
3 Related work
We begin by restating a well-known result presented in Sutton et al. (1999) that allows calculating the gradient of $J(\theta)$ w.r.t. policy parameters $\theta$:
$$\nabla_\theta J(\theta) = \mathbb{E}_{s \sim \rho_{\pi_\theta}} \mathbb{E}_{a \sim \pi_\theta}\left[\nabla_\theta \log \pi_\theta(a \mid s) \, Q^{\pi_\theta}(s, a)\right] \qquad (3)$$
The remarkable property of the expression for the policy gradient given by Equation (3) is that calculating the gradient does not require calculating the gradient of the occupancy measure $\rho_{\pi_\theta}$. Thus, if we know $Q^{\pi_\theta}$, the gradient can be approximated with Monte Carlo sampling using the trajectories obtained by following policy $\pi_\theta$.
However, since $Q^{\pi_\theta}$ is not known in advance, it has to be estimated. One possible choice of estimator is to use the empirical returns: $\hat{Q}(s_t, a_t) = \sum_{l=0}^{\infty} \gamma^l r(s_{t+l}, a_{t+l})$. The problem with this choice of estimator is the large variance of the stochastic version of the gradients, which results in poor practical performance Ciosek and Whiteson (2018).
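The empirical-return estimator can be computed for a whole trajectory in one backward pass; a small sketch:

```python
import numpy as np

def returns_to_go(rewards, gamma):
    """Empirical estimates hat{Q}(s_t, a_t) = sum_{l>=0} gamma^l r_{t+l},
    computed with the backward recursion G_t = r_t + gamma * G_{t+1}."""
    G = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        G[t] = running
    return G
```

For rewards $[1, 2, 3]$ and $\gamma = 0.5$ this yields $[2.75, 3.5, 3.0]$.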
To reduce the variance in the estimation of $Q^{\pi_\theta}$, many algorithms learn a parametric approximation $Q_w$ by solving the regression problem:
$$\min_w \sum_i \left(Q_w(s_i, a_i) - \hat{Q}(s_i, a_i)\right)^2 \qquad (4)$$
where $(s_i, a_i)$ are state-action pairs sampled with policy $\pi_\theta$. Using such a $Q_w$ in place of $Q^{\pi_\theta}$ can introduce bias into the policy updates. The following theorem, derived in Sutton et al. (1999); Konda and Tsitsiklis (2003), provides conditions under which the function approximator of $Q^{\pi_\theta}$ does not introduce bias into $\nabla_\theta J(\theta)$.
Theorem 1.
Let $Q_w$ be a differentiable function approximator that satisfies the following conditions:
$$\mathbb{E}_{s \sim \rho_{\pi_\theta}} \mathbb{E}_{a \sim \pi_\theta}\left[\left(Q_w(s, a) - Q^{\pi_\theta}(s, a)\right) \nabla_w Q_w(s, a)\right] = 0 \qquad (5)$$
and
$$\nabla_w Q_w(s, a) = \nabla_\theta \log \pi_\theta(a \mid s). \qquad (6)$$
Then,
$$\nabla_\theta J(\theta) = \mathbb{E}_{s \sim \rho_{\pi_\theta}} \mathbb{E}_{a \sim \pi_\theta}\left[\nabla_\theta \log \pi_\theta(a \mid s) \, Q_w(s, a)\right]. \qquad (7)$$
Note that condition (5) can be satisfied by solving the regression problem $\min_w \mathbb{E}_{s \sim \rho_{\pi_\theta}, a \sim \pi_\theta}\left[\left(Q_w(s, a) - Q^{\pi_\theta}(s, a)\right)^2\right]$. The condition given by (6) can be satisfied by setting $Q_w(s, a) = w^\top \nabla_\theta \log \pi_\theta(a \mid s) + c(s)$. The constant $c(s)$ can be set to $V^{\pi_\theta}(s)$ to ensure that $w^\top \nabla_\theta \log \pi_\theta(a \mid s)$ approximates the advantage $A^{\pi_\theta}(s, a)$, since the expectation $\mathbb{E}_{a \sim \pi_\theta}\left[\nabla_\theta \log \pi_\theta(a \mid s)\right] = 0$ for every $s$. See the discussion in Sutton et al. (1999) for details.
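A small sketch of this compatible form for a tabular softmax policy. The Q targets here are arbitrary placeholders, used only to show the least-squares fit; the zero-mean check at the end illustrates why the compatible part models an advantage:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 3, 2
theta = rng.normal(size=(n_states, n_actions))  # softmax policy logits

def pi(s):
    z = np.exp(theta[s] - theta[s].max())
    return z / z.sum()

def grad_log_pi(s, a):
    """Compatible features: nabla_theta log pi_theta(a|s), flattened.
    Only the block belonging to state s is nonzero."""
    g = np.zeros((n_states, n_actions))
    g[s] = -pi(s)
    g[s, a] += 1.0
    return g.ravel()

# Fit w by least squares on placeholder Q targets: Q_w = w^T grad_log_pi.
samples = [(s, a) for s in range(n_states) for a in range(n_actions)]
Phi = np.stack([grad_log_pi(s, a) for s, a in samples])
q_targets = rng.normal(size=len(samples))  # placeholder targets
w, *_ = np.linalg.lstsq(Phi, q_targets, rcond=None)

# The compatible part has zero mean under the policy in every state.
zero_means = [sum(pi(s)[a] * grad_log_pi(s, a) for a in range(n_actions))
              for s in range(n_states)]
```

Each entry of `zero_means` is the zero vector, confirming $\mathbb{E}_{a \sim \pi_\theta}\left[\nabla_\theta \log \pi_\theta(a \mid s)\right] = 0$.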
Performing gradient ascent on $J(\theta)$ requires resampling the data with every policy update, as the expectation in Equation (3) is taken over the distribution of trajectories induced by $\pi_\theta$. The approach presented in Kakade and Langford (2002); Schulman et al. (2015) tackles the problem of improving the policy in a different way.
Given the data gathered using the current policy $\pi_\theta$, we want to estimate a lower bound on the performance of an arbitrary policy $\pi_{\theta'}$ and perform maximisation w.r.t. $\theta'$. Since the lower bound is tight at $\theta' = \theta$, maximizing this lower bound w.r.t. $\theta'$ guarantees an improvement in the value of the policy Schulman et al. (2015). Hence, sequential optimization of the discussed lower bound is called Monotonic Policy Improvement. The lower bound derived in Schulman et al. (2015) is summarized in the following theorem.
Theorem 2.
Let $\epsilon = \max_{s, a} \left|A^{\pi_\theta}(s, a)\right|$ and $D_{\mathrm{KL}}^{\max}(\theta, \theta') = \max_s D_{\mathrm{KL}}\left(\pi_\theta(\cdot \mid s) \,\|\, \pi_{\theta'}(\cdot \mid s)\right)$. Then the value $J(\theta')$ can be lower bounded as follows:
$$J(\theta') \geq L_\theta(\theta') - \frac{4 \epsilon \gamma}{(1 - \gamma)^2} D_{\mathrm{KL}}^{\max}(\theta, \theta'). \qquad (8)$$
The work in Achiam et al. (2017) extends this result so that the $\max$ operator in $D_{\mathrm{KL}}^{\max}$ can be replaced with the expected value taken w.r.t. $\rho_{\pi_\theta}$. Calculating the gradient $\nabla_{\theta'} L_\theta(\theta')$ is straightforward, as the occupancy measure in the definition of $L_\theta(\theta')$ does not depend on $\theta'$.
In the practical setting, $L_\theta(\theta')$ is optimized by constraining the divergence between $\pi_\theta$ and $\pi_{\theta'}$ Schulman et al. (2015, 2017). The algorithm derived in Schulman et al. (2015) is closely related to natural gradient policy optimisation Kakade (2001). Note that $L_\theta(\theta')$ is a biased approximation of $J(\theta')$, but the bias can be controlled by restricting the distance between $\pi_\theta$ and $\pi_{\theta'}$.
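To make the surrogate gradient concrete, a sketch for a single-state (bandit) softmax policy: the gradient is estimated from actions sampled by the data-gathering policy via importance ratios, and agrees with the exact expectation under the new policy. The advantage values are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)
n_actions = 3
theta_old = np.zeros(n_actions)          # data-gathering policy pi_theta
theta_new = np.array([0.5, 0.0, -0.5])   # policy pi_theta' being optimized
adv = np.array([1.0, -0.5, -0.5])        # hypothetical advantages A^{pi_theta}

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def score(x, a):
    """Gradient of the log softmax policy w.r.t. its logits at action a."""
    g = -softmax(x)
    g[a] += 1.0
    return g

p_old, p_new = softmax(theta_old), softmax(theta_new)

# Monte Carlo surrogate gradient: sample actions from the OLD policy and
# reweight each score by the importance ratio pi_new / pi_old.
actions = rng.choice(n_actions, size=50_000, p=p_old)
mc_grad = np.mean([(p_new[a] / p_old[a]) * score(theta_new, a) * adv[a]
                   for a in actions], axis=0)

# Exact gradient: E_{a ~ pi_new}[grad log pi_new(a) * A(a)].
exact = sum(p_new[a] * score(theta_new, a) * adv[a] for a in range(n_actions))
```

With enough samples `mc_grad` converges to `exact`; the importance ratio is what lets old-policy data estimate the new policy's gradient.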
4 Compatible features for surrogate policy optimization
In this section, we seek a parametric form of an approximator $A_w$ of $A^{\pi_\theta}$ for which $\nabla_{\theta'} L_\theta(\theta')$ remains unbiased. To this end, we follow the approach presented in Sutton et al. (1999). We derive the following theorem.
Theorem 3 (Compatible features for Monotonic Policy Improvement).
Assume the following conditions are satisfied:
$$\mathbb{E}_{s \sim \rho_{\pi_\theta}} \mathbb{E}_{a \sim \pi_\theta}\left[\left(A_w(s, a) - A^{\pi_\theta}(s, a)\right) \nabla_w A_w(s, a)\right] = 0 \qquad (9)$$
and
$$\nabla_w A_w(s, a) = \frac{\pi_{\theta'}(a \mid s)}{\pi_\theta(a \mid s)} \nabla_{\theta'} \log \pi_{\theta'}(a \mid s). \qquad (10)$$
Then,
$$\nabla_{\theta'} L_\theta(\theta') = \mathbb{E}_{s \sim \rho_{\pi_\theta}} \mathbb{E}_{a \sim \pi_\theta}\left[\frac{\pi_{\theta'}(a \mid s)}{\pi_\theta(a \mid s)} \nabla_{\theta'} \log \pi_{\theta'}(a \mid s) \, A_w(s, a)\right]. \qquad (11)$$
Proof.
Firstly, we note that the value function $V^{\pi_\theta}$ in the definition of $L_\theta(\theta')$ fulfils the role of a control variate, i.e. it does not influence the expectation of the gradient. To see this, we analyse the gradient $\nabla_{\theta'} \mathbb{E}_{s \sim \rho_{\pi_\theta}} \mathbb{E}_{a \sim \pi_{\theta'}}\left[V^{\pi_\theta}(s)\right]$:
$$\nabla_{\theta'} \mathbb{E}_{s \sim \rho_{\pi_\theta}} \mathbb{E}_{a \sim \pi_{\theta'}}\left[V^{\pi_\theta}(s)\right] = \mathbb{E}_{s \sim \rho_{\pi_\theta}}\left[V^{\pi_\theta}(s) \, \nabla_{\theta'} \int \pi_{\theta'}(a \mid s) \, da\right] = 0, \qquad (12)$$
where the second step is allowed since $\nabla_{\theta'} \pi_{\theta'}(a \mid s)$ exists and is continuous for every $s$ and $a$, and the last step is due to the fact that $\int \pi_{\theta'}(a \mid s) \, da = 1$ does not depend on $\theta'$, i.e. $\nabla_{\theta'} \int \pi_{\theta'}(a \mid s) \, da = \nabla_{\theta'} 1 = 0$.
Next we define $\epsilon_w(s, a) = A_w(s, a) - A^{\pi_\theta}(s, a)$. We seek a condition under which the following equality holds:
$$\mathbb{E}_{s \sim \rho_{\pi_\theta}} \mathbb{E}_{a \sim \pi_\theta}\left[\frac{\pi_{\theta'}(a \mid s)}{\pi_\theta(a \mid s)} \nabla_{\theta'} \log \pi_{\theta'}(a \mid s) \, \epsilon_w(s, a)\right] = 0. \qquad (13)$$
By subtracting the assumption given by (9), which equals zero, from (13), we obtain:
$$\mathbb{E}_{s \sim \rho_{\pi_\theta}} \mathbb{E}_{a \sim \pi_\theta}\left[\left(\frac{\nabla_{\theta'} \pi_{\theta'}(a \mid s)}{\pi_\theta(a \mid s)} - \nabla_w A_w(s, a)\right) \epsilon_w(s, a)\right] = 0, \qquad (14)$$
where we have used the log trick, $\nabla_{\theta'} \pi_{\theta'}(a \mid s) = \pi_{\theta'}(a \mid s) \nabla_{\theta'} \log \pi_{\theta'}(a \mid s)$, in the second line. Hence Equation (13) can be satisfied by requiring:
$$\nabla_w A_w(s, a) = \frac{\pi_{\theta'}(a \mid s)}{\pi_\theta(a \mid s)} \nabla_{\theta'} \log \pi_{\theta'}(a \mid s). \qquad (15)$$
Integrating this last equation w.r.t. $w$ yields $A_w(s, a) = w^\top \frac{\pi_{\theta'}(a \mid s)}{\pi_\theta(a \mid s)} \nabla_{\theta'} \log \pi_{\theta'}(a \mid s)$, which completes the proof. ∎
Again, the resulting compatible features have zero expectation under $\pi_\theta$, mirroring the fact that $\mathbb{E}_{a \sim \pi_\theta}\left[A^{\pi_\theta}(s, a)\right] = 0$, which ensures that:
$$\mathbb{E}_{a \sim \pi_\theta}\left[w^\top \frac{\pi_{\theta'}(a \mid s)}{\pi_\theta(a \mid s)} \nabla_{\theta'} \log \pi_{\theta'}(a \mid s)\right] = \mathbb{E}_{a \sim \pi_{\theta'}}\left[w^\top \nabla_{\theta'} \log \pi_{\theta'}(a \mid s)\right] = 0. \qquad (16)$$
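A numeric sanity check of this construction on a single-state softmax policy (the advantage values are hypothetical): fitting $w$ by the regression in (9) with the features in (15) makes the surrogate gradient computed from the critic coincide with the one computed from the true advantages:

```python
import numpy as np

n_actions = 3
theta_old = np.array([0.2, -0.1, 0.0])   # pi_theta (data-gathering)
theta_new = np.array([0.6, 0.0, -0.4])   # pi_theta'
adv = np.array([0.8, -0.3, -0.6])        # hypothetical advantages

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def score(x, a):
    g = -softmax(x)
    g[a] += 1.0
    return g

p_old, p_new = softmax(theta_old), softmax(theta_new)

# Compatible features of Theorem 3: importance ratio times the new score.
phi = np.stack([(p_new[a] / p_old[a]) * score(theta_new, a)
                for a in range(n_actions)])

# Solve the regression E_{a~pi_old}[(w^T phi(a) - A(a))^2] exactly via
# weighted normal equations over the finite action set.
W = np.diag(p_old)
w = np.linalg.pinv(phi.T @ W @ phi) @ phi.T @ W @ adv

# Exact surrogate gradients with the true advantages vs. with the critic.
g_true = sum(p_old[a] * phi[a] * adv[a] for a in range(n_actions))
g_critic = sum(p_old[a] * phi[a] * (phi[a] @ w) for a in range(n_actions))
```

By the first-order optimality condition of the regression, `g_critic` equals `g_true` up to numerical precision.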
In Sutton et al. (1999) the authors conjecture that the compatible features might be the only choice of features that leads to an unbiased policy gradient $\nabla_\theta J(\theta)$. In the case of $\nabla_{\theta'} L_\theta(\theta')$ we can provide another choice of features leading to an unbiased gradient. The features derived in Theorem 3 depend on the importance sampling weights $\frac{\pi_{\theta'}(a \mid s)}{\pi_\theta(a \mid s)}$. To remove the importance sampling weights from the derived compatible features, we modify the condition in Equation (9) by replacing the action sampling distribution $\pi_\theta$ with $\pi_{\theta'}$.
Theorem 4.
Assume that the following conditions are satisfied:
$$\mathbb{E}_{s \sim \rho_{\pi_\theta}} \mathbb{E}_{a \sim \pi_{\theta'}}\left[\left(A_w(s, a) - A^{\pi_\theta}(s, a)\right) \nabla_w A_w(s, a)\right] = 0 \qquad (17)$$
and
$$\nabla_w A_w(s, a) = \nabla_{\theta'} \log \pi_{\theta'}(a \mid s). \qquad (18)$$
Then,
$$\nabla_{\theta'} L_\theta(\theta') = \mathbb{E}_{s \sim \rho_{\pi_\theta}} \mathbb{E}_{a \sim \pi_{\theta'}}\left[\nabla_{\theta'} \log \pi_{\theta'}(a \mid s) \, A_w(s, a)\right]. \qquad (19)$$
Proof.
We derive the result by following a similar line of thought as in the proof of Theorem 3. Again we use $\epsilon_w(s, a) = A_w(s, a) - A^{\pi_\theta}(s, a)$. We subtract the assumption given by (17), which equals zero, from the required condition $\mathbb{E}_{s \sim \rho_{\pi_\theta}} \mathbb{E}_{a \sim \pi_{\theta'}}\left[\nabla_{\theta'} \log \pi_{\theta'}(a \mid s) \, \epsilon_w(s, a)\right] = 0$, which yields:
$$\mathbb{E}_{s \sim \rho_{\pi_\theta}} \mathbb{E}_{a \sim \pi_{\theta'}}\left[\left(\nabla_{\theta'} \log \pi_{\theta'}(a \mid s) - \nabla_w A_w(s, a)\right) \epsilon_w(s, a)\right] = 0. \qquad (20)$$
The integral in Equation (20) can be set to zero by choosing $A_w$ to solve the differential equation $\nabla_w A_w(s, a) = \nabla_{\theta'} \log \pi_{\theta'}(a \mid s)$, which results in the desired compatible features $A_w(s, a) = w^\top \nabla_{\theta'} \log \pi_{\theta'}(a \mid s)$. ∎
Note that the features derived in Theorem 4 are analogous to the ones derived in Theorem 1, with the difference that the occupancy measure is taken w.r.t. the data-gathering policy $\pi_\theta$ in the case of Theorem 4.
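The analogous check for Theorem 4, again on a single-state softmax sketch with hypothetical advantages: a regression weighted by $\pi_{\theta'} / \pi_\theta$ under $\pi_\theta$-sampling is equivalent to an ordinary regression under $\pi_{\theta'}$, and the resulting critic leaves the surrogate gradient unchanged:

```python
import numpy as np

n_actions = 3
theta_old = np.array([0.1, 0.0, -0.2])   # pi_theta (data-gathering)
theta_new = np.array([0.5, -0.1, 0.0])   # pi_theta'
adv = np.array([0.7, -0.2, -0.5])        # hypothetical advantages

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def score(x, a):
    g = -softmax(x)
    g[a] += 1.0
    return g

p_old, p_new = softmax(theta_old), softmax(theta_new)
phi = np.stack([score(theta_new, a) for a in range(n_actions)])  # Theorem 4

# Weighted regression under pi_old with weights pi_new/pi_old reduces to an
# ordinary regression under pi_new: use pi_new as the weighting measure.
W = np.diag(p_new)
w = np.linalg.pinv(phi.T @ W @ phi) @ phi.T @ W @ adv

# Importance-sampled surrogate gradient with true advantages vs. the critic.
ratio = p_new / p_old
g_true = sum(p_old[a] * ratio[a] * phi[a] * adv[a] for a in range(n_actions))
g_critic = sum(p_old[a] * ratio[a] * phi[a] * (phi[a] @ w)
               for a in range(n_actions))
```

As in the Theorem 3 check, the normal equations of the weighted fit force the bias term to vanish, so `g_critic` matches `g_true` exactly.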
Assumption (17) in Theorem 4 has a natural interpretation. Intuitively, the critic values should be more accurate for actions with high probability under policy $\pi_{\theta'}$. More formally, the condition in (17) can be satisfied by finding a parameter $w$ that minimises the weighted quadratic error $\mathbb{E}_{s \sim \rho_{\pi_\theta}} \mathbb{E}_{a \sim \pi_\theta}\left[\frac{\pi_{\theta'}(a \mid s)}{\pi_\theta(a \mid s)} \left(A_w(s, a) - A^{\pi_\theta}(s, a)\right)^2\right]$, where the weights $\frac{\pi_{\theta'}(a \mid s)}{\pi_\theta(a \mid s)}$ ensure improved accuracy for actions that are likely under the target policy $\pi_{\theta'}$. Similarly, when the target policy assigns low probability to an action, the error in the critic estimation is not relevant, as it will not influence the gradient $\nabla_{\theta'} L_\theta(\theta')$.

5 Experiments
In this section we use the NChain environment Strens (2000) to compare the gradient $\nabla_{\theta'} L_\theta(\theta')$ calculated in two different ways: i) using a standard linear critic learned with the standard least squares regression approach given by (4); and ii) using a linear critic with the compatible features given by (18), learned by solving the weighted least squares regression problem. We use a parametric policy for which the proposed compatible features $\nabla_{\theta'} \log \pi_{\theta'}(a \mid s)$ can be computed in closed form. We provide the true state-action value function as targets to learn the critics, and estimate the gradient using rollouts from policy $\pi_\theta$ with fixed policy parameters $\theta$ and $\theta'$.
We compare the obtained gradients to the ground truth calculated using the true state-action value function. We provide an increasing number of rollouts from policy $\pi_\theta$ to both approaches and analyse the errors in estimation. We report the bias, variance and RMSE of the estimated gradients in Figure 1.
As expected, the bias in the estimation of $\nabla_{\theta'} L_\theta(\theta')$ introduced by the standard critic cannot be removed by increasing the number of provided trajectories, as it is caused by employing a non-compatible function approximator. The derived compatible features are unbiased by construction and hence provide a substantially improved quality of estimation of $\nabla_{\theta'} L_\theta(\theta')$. Interestingly, using compatible features also provides slightly lower variance than using the standard critic.
6 Conclusions
We have systematically analysed the use of parametric critics to estimate the policy optimization surrogate objective. As a consequence, we can provide conditions under which the approximation does not introduce bias into the policy updates. This result holds for a general class of policies, including policies parametrized by deep neural networks. We have shown that for the investigated surrogate objective there exist two different choices of compatible features. We empirically demonstrated that the compatible features allow estimating the gradient of the surrogate objective more accurately.
References
[1] J. Achiam, D. Held, A. Tamar, and P. Abbeel (2017). Constrained policy optimization. In Proceedings of the 34th International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol. 70, pp. 22–31.

[2] K. Ciosek and S. Whiteson (2018). Expected policy gradients. In Thirty-Second AAAI Conference on Artificial Intelligence.

[3] Y. Duan, X. Chen, R. Houthooft, J. Schulman, and P. Abbeel (2016). Benchmarking deep reinforcement learning for continuous control. In Proceedings of the 33rd International Conference on Machine Learning, ICML'16, pp. 1329–1338.

[4] P. Henderson, R. Islam, P. Bachman, J. Pineau, D. Precup, and D. Meger (2017). Deep reinforcement learning that matters. CoRR abs/1709.06560.

[5] S. Kakade and J. Langford (2002). Approximately optimal approximate reinforcement learning. In Proceedings of the Nineteenth International Conference on Machine Learning, ICML '02, pp. 267–274.

[6] S. Kakade (2001). A natural policy gradient. In Advances in Neural Information Processing Systems 14, NIPS'01, pp. 1531–1538.

[7] V. R. Konda and J. N. Tsitsiklis (2003). On actor-critic algorithms. SIAM Journal on Control and Optimization 42(4), pp. 1143–1166.

[8] J. Pajarinen et al. (2019). Compatible natural gradient policy search. CoRR abs/1902.02823.

[9] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz (2015). Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol. 37, pp. 1889–1897.

[10] J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel (2015). High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438.

[11] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov (2017). Proximal policy optimization algorithms. CoRR abs/1707.06347.

[12] M. Strens (2000). A Bayesian framework for reinforcement learning. In Proceedings of the Seventeenth International Conference on Machine Learning, ICML '00, pp. 943–950.

[13] R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour (1999). Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems 12, NIPS'99, pp. 1057–1063.