Many deep reinforcement learning (RL) algorithms are based on approximate dynamic programming. For example, the celebrated DQN  is based on approximate value iteration. As a pure critic approach, it can only deal with finite action spaces. A more versatile approach, which allows handling both discrete and continuous action spaces (an important case in robot learning), consists in using actor-critic architectures, where both the value function and the policy are represented. Most such recent approaches are either variations of policy gradient [2, 3, 4], inspired by conservative policy iteration [5, 6, 7], or make use of entropy regularization [8, 9, 10].
While approximate policy iteration has been the building block of actor-critics in the past , it has not been considered with deep learning approximators, as far as we know. We assume that this is due to the fact that the greedy operator is unstable (much like gradient descent with too large step sizes). A clever way to address this issue has been introduced by Kakade and Langford  with Conservative Policy Iteration (CPI). Instead of taking the greedy policy, the new policy is a stochastic mixture of the current one and of the greedy one. This softens greediness and stabilizes learning.
With this classical approach, the current policy is a stochastic mixture of all past policies, which is not very practical. The core idea of CPI has been adapted in the deep RL literature by modifying how the greediness is softened. For example, Trust Region Policy Optimization (TRPO)  or Actor-Critic using Kronecker-factored Trust Region (ACKTR)  add a constraint on the greedy step, imposing that the average Kullback-Leibler (KL) divergence between consecutive policies is below a given threshold, and Proximal Policy Optimization (PPO) 
modifies the greedy step with a clipping loss that forces the ratio of action probabilities of consecutive policies to remain close to 1. To some extent, even policy gradient approaches can be seen as such, as following the policy gradient should provide a soft improvement (a connection between CPI and policy gradient has also been studied ). Other approaches consider an entropy penalty [8, 9, 10], whose effect is also to soften greediness (but which can also modify the evaluation step).
In this paper, we will generically call “Soft Policy Iteration” (SPI) any approach that combines policy evaluation with a softened greedy step. As they require policy evaluation, these approaches are naturally on-policy. In classical dynamic programming, Modified Policy Iteration (MPI)  replaces the full evaluation of the policy by a partial evaluation. This idea has been extended to the approximate setting (Approximate MPI, or AMPI ), but never with deep learning approximators, as far as we know. This is probably due to the instability of the greedy step.
Yet, a partial evaluation presents some interest, compared to a full policy evaluation. It allows for an easier extension to off-policy learning by making use of Temporal Difference (TD) learning instead of using rollouts. It also draws a bridge between value and policy iterations (because MPI has these two algorithms as special cases). In this work, we propose an abstract actor-critic framework that brings together MPI and SPI, by mixing the partial evaluation of MPI with the softened greediness of SPI. We name the resulting approach Modified Soft Policy Iteration (MoSoPI).
To justify this approach, we show that MoSoPI converges to the optimal value function in the ideal case (no approximation error). This is a bare minimum. As a proof of concept of this general idea, we instantiate it with the PPO greediness, and compare it to the original PPO on a set of continuous control tasks . The only difference between both algorithms is the way state(-action) value functions are (partially) estimated, yet this yields large gains in sample efficiency. To be complete, we will also compare this modified PPO to a state-of-the-art off-policy actor-critic, Soft Actor-Critic (SAC). It is often competitive with it, while usually being more sample efficient.
A Markov Decision Process (MDP) is a tuple $\{\mathcal{S}, \mathcal{A}, P, r, \gamma\}$, with $\mathcal{S}$ the state space, $\mathcal{A}$ the action space, $P$ the transition kernel ($P(s'|s,a)$ denotes the probability to go from $s$ to $s'$ under action $a$), $r$ the reward function and $\gamma \in (0,1)$ the discount factor. A (stochastic) policy $\pi$ is a mapping from states to distributions over actions ($\pi(a|s)$ denotes the probability of choosing $a$ in $s$). The quality of a policy is quantified by the value function,
$$v_\pi(s) = \mathbb{E}_\pi\Big[\sum_{t\ge 0} \gamma^t r(s_t, a_t) \,\Big|\, s_0 = s\Big],$$
where $\mathbb{E}_\pi$ denotes the expectation with respect to the trajectories sampled by the policy $\pi$ and the dynamics $P$.
Write $T_\pi$ the Bellman operator, defined for any function $v$ over $\mathcal{S}$ as
$$[T_\pi v](s) = \mathbb{E}_{a\sim\pi(\cdot|s)}\Big[r(s,a) + \gamma\, \mathbb{E}_{s'\sim P(\cdot|s,a)}\big[v(s')\big]\Big].$$
The value function $v_\pi$ is the unique fixed point of the operator $T_\pi$. The aim of RL is to maximize either the value function for each state or an average value function. To do so, the notion of Bellman optimality operator is useful:
$$[T_* v](s) = \max_\pi [T_\pi v](s).$$
The optimal value function $v_*$ is the unique fixed point of $T_*$. The notion of greedy operator can be derived from $T_*$. We say that $\pi'$ is greedy with respect to $v$ (where $v$ is not necessarily a value function), noted $\pi' \in \mathcal{G}(v)$, if $T_{\pi'} v = T_* v$.
The value function might not be convenient from a practical viewpoint, as applying the operators $T_\pi$ and $T_*$ requires knowing the dynamics. To alleviate this issue, a classical approach is to consider a $q$-function, which adds a degree of freedom on the first action to be chosen,
$$q_\pi(s, a) = \mathbb{E}_\pi\Big[\sum_{t\ge 0} \gamma^t r(s_t, a_t) \,\Big|\, s_0 = s, a_0 = a\Big].$$
Similarly to the value function, we can define the associated $T_\pi$, $T_*$ and $\mathcal{G}$ operators. Value and $q$-functions are linked by $v_\pi(s) = \mathbb{E}_{a\sim\pi(\cdot|s)}[q_\pi(s, a)]$, and the advantage function is defined as $a_\pi(s, a) = q_\pi(s, a) - v_\pi(s)$ (it is the state-wise centered $q$-function).
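As a quick sanity check of these definitions, the link between value, $q$-function and advantage can be verified numerically on a single state (a toy illustration, not code from the paper):

```python
# Toy numerical check (not from the paper) of v_pi(s) = E_{a~pi}[q_pi(s,a)]
# and of the advantage a_pi(s,a) = q_pi(s,a) - v_pi(s), for a single state.

def value_and_advantage(q_row, pi_row):
    """q_row: q_pi(s, .); pi_row: pi(. | s) as a probability list."""
    v = sum(p * q for p, q in zip(pi_row, q_row))
    return v, [q - v for q in q_row]

v, adv = value_and_advantage([2.0, 0.0], [0.5, 0.5])
# v = 1.0 and the advantage [1.0, -1.0] is state-wise centered:
# sum_a pi(a|s) * a_pi(s, a) == 0
```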
3 Modified Soft Policy Iteration
In this section, we present the abstract variations of policy iteration that lead to MoSoPI, as well as briefly how they can be transformed into practical algorithms. We also justify MoSoPI by showing its convergence in an ideal case.
3.1 Policy Iteration
Policy iteration (PI) alternates policy improvement and policy evaluation:
$$\pi_{k+1} \in \mathcal{G}(v_{\pi_k}), \qquad v_{\pi_{k+1}} = \text{fixed point of } T_{\pi_{k+1}}.$$
In the exact case, everything can be computed analytically (given finite and small enough state and action spaces), and this PI scheme will converge in finite time. In an approximate setting, one has to approximate both the value function and the policy (possibly implicitly), and to learn them from samples.
We start by discussing the approximation of policy evaluation. First, as explained before, it is more convenient to work with $q$-functions. Let $q_\theta$ be a parameterized $q$-function; $q_{\pi_{k+1}}$ can be estimated using rollouts. Write generally $\hat{\cdot}$ for an empirical estimation, assume that a set of state-action couples $(s_i, a_i)$ is available, and that we can simulate the return $R(s_i, a_i)$ (the cumulative discounted reward from a rollout starting in $(s_i, a_i)$ and following the policy afterwards), then the $q$-function can be estimated by minimizing
$$\hat J(\theta) = \sum_i \big(q_\theta(s_i, a_i) - R(s_i, a_i)\big)^2.$$
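This regression can be sketched as follows (a minimal tabular stand-in for the parameterized $q$-function; states, actions and returns are made up for illustration):

```python
# Minimal tabular stand-in (hypothetical data) for the rollout regression:
# fit q to Monte Carlo returns R(s, a) by gradient descent on the
# squared loss sum_i (q(s_i, a_i) - R_i)^2.

def fit_q_from_rollouts(samples, lr=0.1, epochs=200):
    """samples: list of ((state, action), observed return) pairs."""
    q = {}
    for _ in range(epochs):
        for (s, a), ret in samples:
            q.setdefault((s, a), 0.0)
            # gradient of 0.5 * (q - ret)^2 with respect to q is (q - ret)
            q[(s, a)] -= lr * (q[(s, a)] - ret)
    return q

samples = [(("s0", "a0"), 1.0), (("s0", "a0"), 3.0), (("s1", "a1"), -1.0)]
q = fit_q_from_rollouts(samples)
# q[("s0", "a0")] settles near the mean return 2.0, q[("s1", "a1")] near -1.0
```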
There exist approaches for estimating -functions directly from transitions, such as LSTD , but they usually assume a linear parameterization.
If the action space is finite, the greedy policy can be deduced from the estimated $q$-function $\hat q$:
$$\pi_{k+1}(s) \in \operatorname*{argmax}_{a\in\mathcal{A}} \hat q(s, a). \quad (6)$$
Generally, one can also adopt a parameterized policy $\pi_w$ and solve the greedy step by maximizing the following objective:
$$\hat J(w) = \sum_i \mathbb{E}_{a\sim\pi_w(\cdot|s_i)}\big[\hat q(s_i, a)\big]. \quad (9)$$
Notice that this corresponds to solving the greedy step in average over some state distribution, instead of state-wise as in (6). Adding a state-dependent baseline to $\hat q$ does not change the maximizer, and one usually considers an estimated advantage function
$$\hat a(s, a) = \hat q(s, a) - \hat v(s)$$
to reduce the variance of the gradient. With discrete actions, this corresponds to a cost-sensitive multi-class classification problem.
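The parameterized greedy step can be sketched with a tabular softmax policy trained by gradient ascent on the expected advantage (a hypothetical toy setup; the real setting uses neural networks and sampled states):

```python
import math

# Hypothetical toy version (tabular softmax policy, no sampling) of the
# approximate greedy step: ascend sum_s sum_a pi(a|s) * A(s,a) in the
# logits; names and values are illustrative, not from the paper.

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def soft_greedy_step(advantages, lr=0.5, steps=300):
    """advantages: dict state -> list of A(s, a), one entry per action."""
    logits = {s: [0.0] * len(adv) for s, adv in advantages.items()}
    for _ in range(steps):
        for s, adv in advantages.items():
            p = softmax(logits[s])
            baseline = sum(pi_a * a for pi_a, a in zip(p, adv))
            for i in range(len(adv)):
                # gradient of the objective w.r.t. logit i is p_i (A_i - b)
                logits[s][i] += lr * p[i] * (adv[i] - baseline)
    return {s: softmax(l) for s, l in logits.items()}

pi = soft_greedy_step({"s0": [1.0, -1.0, 0.0]})
# probability mass concentrates on the highest-advantage action (index 0)
```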
3.2 Soft Policy Iteration
The greedy step can be unstable in an approximate setting. To alleviate this problem, Kakade and Langford  proposed to soften it by mixing the greedy policy with the current one. Let $\alpha \in (0, 1)$; the greedy step is replaced by
$$\pi_{k+1} = (1 - \alpha)\, \pi_k + \alpha\, \mathcal{G}(v_{\pi_k}).$$
This comes with a monotonic improvement guarantee, given a small enough $\alpha$. However, it is not very practical, as the new policy is a mixture of all previous policies.
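The CPI mixture update itself is straightforward (toy illustration; the tabular policies below are hypothetical):

```python
# Toy illustration (not the paper's code) of the CPI mixture update:
# pi_{k+1} = (1 - alpha) * pi_k + alpha * pi_greedy, applied state-wise.

def cpi_mixture(pi_current, pi_greedy, alpha):
    """Both policies map state -> list of action probabilities."""
    return {
        s: [(1 - alpha) * p + alpha * g
            for p, g in zip(pi_current[s], pi_greedy[s])]
        for s in pi_current
    }

pi_k = {"s0": [0.5, 0.5]}
greedy = {"s0": [1.0, 0.0]}
pi_next = cpi_mixture(pi_k, greedy, alpha=0.2)
# pi_next["s0"] ≈ [0.6, 0.4]; after k such steps the policy is a mixture
# of all past greedy policies, which is what makes CPI impractical
```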
To address this issue, Schulman et al.  proposed to soften the greediness with a KL penalty between consecutive policies, which leads to minimizing:
$$\hat J(w) = \sum_i \Big(-\mathbb{E}_{a\sim\pi_w(\cdot|s_i)}\big[\hat a(s_i, a)\big] + \beta\, \mathrm{KL}\big(\pi_{w_k}(\cdot|s_i)\,\|\,\pi_w(\cdot|s_i)\big)\Big).$$
Other approaches are possible. For example, PPO combines the approximate greedy step (9) with importance sampling and a clipping of the ratio of probabilities:
$$\hat J(w) = \sum_i \min\Big(\rho_w(s_i, a_i)\, \hat a(s_i, a_i),\; \operatorname{clip}\big(\rho_w(s_i, a_i), 1-\epsilon, 1+\epsilon\big)\, \hat a(s_i, a_i)\Big), \quad \text{with } \rho_w(s, a) = \frac{\pi_w(a|s)}{\pi_{w_k}(a|s)}. \quad (12)$$
The clipping operator saturates the ratio of probabilities when it deviates too much from 1 (at $1+\epsilon$ if the advantage is positive, at $1-\epsilon$ otherwise); without it, the objective would be equivalent to (9).
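The clipped objective can be illustrated on a single state-action pair (a sketch of the per-sample PPO loss term; eps = 0.2 is the usual choice, not a value taken from this paper):

```python
# Sketch of PPO's clipped surrogate for one sample: the probability ratio
# rho = pi_new(a|s) / pi_old(a|s) is clipped to [1 - eps, 1 + eps], and
# the objective takes the minimum of the clipped and unclipped terms.

def ppo_clip_objective(rho, advantage, eps=0.2):
    clipped = max(1.0 - eps, min(1.0 + eps, rho))
    return min(rho * advantage, clipped * advantage)

# positive advantage: the ratio saturates at 1 + eps
assert ppo_clip_objective(2.0, 1.0) == 1.2
# negative advantage: the ratio saturates at 1 - eps
assert ppo_clip_objective(0.5, -1.0) == -0.8
```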
In this work, we call SPI any policy iteration combined with a soft greedy step, which we frame as satisfying $T_{\pi_{k+1}} v_{\pi_k} \ge T_{\pi_k} v_{\pi_k}$ (so, we ask the policy to provide some improvement, without being the greedy one). In that sense, even a policy gradient step can be seen as softened greediness.
3.3 Modified Policy Iteration
If SPI modifies the greedy step, MPI  modifies the evaluation step. The operator $T_\pi$ being a contraction, we can write $v_\pi = \lim_{n\to\infty} T_\pi^n v$ for any $v$, so notably for the previous value function. MPI does a partial evaluation by iterating the operator a finite number of times. Let $m \ge 1$, MPI iterates
$$\pi_{k+1} \in \mathcal{G}(v_k), \qquad v_{k+1} = T_{\pi_{k+1}}^m v_k.$$
For $m = \infty$, we retrieve PI, and for $m = 1$ we retrieve value iteration (VI): as $\pi_{k+1} \in \mathcal{G}(v_k)$, with $m = 1$ the evaluation reduces to $v_{k+1} = T_{\pi_{k+1}} v_k = T_* v_k$, that is VI.
We have that
$$[T_{\pi}^m v](s) = \mathbb{E}_\pi\Big[\sum_{t=0}^{m-1} \gamma^t r(s_t, a_t) + \gamma^m v(s_m) \,\Big|\, s_0 = s\Big].$$
This suggests two ways of estimating a value function (or next, directly a $q$-function). First, consider the case $m = 1$ and a parameterized $q$-function $q_\theta$. The classical approach consists in solving the following regression problem:
$$\hat J(\theta) = \sum_i \Big(q_\theta(s_i, a_i) - \big(r_i + \gamma\, \hat q(s'_i, a'_i)\big)\Big)^2, \quad a'_i \sim \pi_{k+1}(\cdot|s'_i), \quad (15)$$
where $\hat q$ is the previous estimate of the $q$-function (in practice, a target network) and $(s_i, a_i, r_i, s'_i)$ are sampled transitions. With $m > 1$, a solution is to perform an $m$-step rollout (using $\pi_{k+1}$) and to replace the target in Eq. (15) by the $m$-step return.
This can be corrected for off-policy learning, using for example importance sampling or Retrace .
Another approach is to solve $m$ times the regression problem of Eq. (15), replacing $\hat q$ by the newly computed $q_\theta$ after each regression but keeping the policy fixed over the $m$ regressions. In other words, solving Eq. (15) is one application of an approximate Bellman evaluation operator, and this amounts to applying it $m$ times.
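This second scheme, applying the (here exact, tabular) Bellman evaluation operator $m$ times with a fixed policy, can be sketched on a deterministic toy chain (hypothetical dynamics, with the policy folded into the transitions):

```python
# Sketch of the second estimation scheme of Sec. 3.3: apply an exact,
# tabular Bellman evaluation operator m times with the policy fixed,
# as a stand-in for solving m successive regressions. The 2-state chain
# below is a made-up example (deterministic policy and dynamics).

def bellman_backup(q, transitions, gamma=0.9):
    """transitions: (s, a) -> (reward, next_state, next_action)."""
    return {
        (s, a): r + gamma * q[(s2, a2)]
        for (s, a), (r, s2, a2) in transitions.items()
    }

def partial_evaluation(q, transitions, m, gamma=0.9):
    for _ in range(m):
        q = bellman_backup(q, transitions, gamma)
    return q

trans = {("s0", "a"): (1.0, "s1", "a"), ("s1", "a"): (0.0, "s0", "a")}
q0 = {("s0", "a"): 0.0, ("s1", "a"): 0.0}
q_m = partial_evaluation(q0, trans, m=4)
# m = 1 is a single TD-style backup (the VI flavour); as m grows, the
# estimate approaches the full evaluation q_pi(s0, a) = 1/(1 - 0.81) ≈ 5.26
```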
Although using $m$-step returns is pretty standard in deep RL (even if their relation to the classical MPI scheme is rarely acknowledged, as far as we know), the second approach is less usual and, to the best of our knowledge, has never been experimented with in a deep RL context.
3.4 Modified Soft Policy Iteration
MoSoPI simply consists in bringing together a soft policy step of SPI (so any kind of soft greediness) and the partial evaluation step of MPI:
$$\pi_{k+1} \text{ such that } T_{\pi_{k+1}} v_k \ge T_{\pi_k} v_k, \qquad v_{k+1} = T_{\pi_{k+1}}^m v_k. \quad (17)$$
To get a practical algorithm, one just has to choose a soft greedy step (e.g., one of those presented in Sec. 3.2) and to estimate the partial evaluation of the $q$-function (e.g., with one of the approaches depicted in Sec. 3.3). We present such an instantiation in more detail in Sec. 4; it uses the greedy step of PPO and applies $m$ times the approximate Bellman operator for evaluation. However, before this, a bare minimum is to check that the algorithmic scheme (17) converges toward the optimal value function in the ideal case (no approximation error).
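Before turning to the convergence result, the abstract scheme (17) can be illustrated end to end on a tabular toy MDP. Here the softened greediness is a CPI-style mixture with the argmax policy (one possible choice among those of Sec. 3.2) and the partial evaluation applies $m$ exact backups; the MDP itself is a made-up example:

```python
GAMMA = 0.9
# hypothetical 2-state, 2-action deterministic MDP: the only reward is
# for taking "go" in s0 (which leads to s1); "go" in s1 leads back to s0
R = {("s0", "stay"): 0.0, ("s0", "go"): 1.0,
     ("s1", "stay"): 0.0, ("s1", "go"): 0.0}
NXT = {("s0", "stay"): "s0", ("s0", "go"): "s1",
       ("s1", "stay"): "s1", ("s1", "go"): "s0"}
STATES, ACTIONS = ("s0", "s1"), ("stay", "go")

def backup(q, pi):
    """One application of the Bellman operator T_pi to a tabular q."""
    return {(s, a): R[(s, a)] + GAMMA * sum(
                pi[NXT[(s, a)]][b] * q[(NXT[(s, a)], b)] for b in ACTIONS)
            for s in STATES for a in ACTIONS}

def soft_greedy(q, pi, alpha=0.5):
    """CPI-style softened greediness: mix pi with the argmax policy."""
    new = {}
    for s in STATES:
        best = max(ACTIONS, key=lambda a: q[(s, a)])
        new[s] = {a: (1 - alpha) * pi[s][a] + alpha * (a == best)
                  for a in ACTIONS}
    return new

def mosopi(m, iters=50):
    q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
    pi = {s: {a: 0.5 for a in ACTIONS} for s in STATES}
    for _ in range(iters):
        pi = soft_greedy(q, pi)   # softened greedy step
        for _ in range(m):        # partial evaluation: m backups
            q = backup(q, pi)
    return q, pi

q, pi = mosopi(m=3)
# the policy concentrates on "go" in s0, and q approaches the optimal
# value q(s0, go) = 1 / (1 - GAMMA**2) ≈ 5.26
```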
To show this convergence, we make some mild assumptions, discussed in the Appendix.
Assumption 1 (Initialization).
The initial policy and value satisfy $T_{\pi_0} v_0 \ge v_0$.
Assumption 2 (Improvement).
Since MoSoPI does not require the policy to be greedy with respect to $v_k$, we assume that the improvement is strict whenever possible. That is, if there exists a policy $\pi$ such that $T_\pi v_k > T_{\pi_k} v_k$ (in at least one state), then $T_{\pi_{k+1}} v_k > T_{\pi_k} v_k$ (in at least one state).
Theorem 1 (Convergence of MoSoPI). Under Asm. 1 and 2, the sequence of value functions computed by scheme (17) converges to the optimal value function $v_*$.
Notice that while Thm. 1 shows convergence, it says nothing about the rate of convergence. Indeed, convergence can be arbitrarily slow (there could be an infinitesimal improvement at each step). This is the price to pay for a very general notion of conservative greediness; obtaining a convergence rate would require specifying this conservative greediness further. The same holds for studying what would happen with approximation errors in the soft greedy or partial evaluation steps.
4 Modified Proximal Policy Optimization
As a proof of concept, we show how PPO can be modified using the MoSoPI idea, and call the resulting algorithm Modified PPO (MoPPO). Both algorithms will have the same greedy step, and differ by the way value functions are estimated.
We partially presented PPO in Sec. 3.2. Its greedy step is depicted in Eq. (12). The advantage is estimated with the temporal difference error computed with an approximate value function. The value function is estimated as in Eq. (15) (with a state-value function in place of the $q$-function), that is, it corresponds to one application of the approximate Bellman operator. The advantage can then be estimated by a temporal difference (TD) error, $\hat\delta_t = r_t + \gamma \hat v(s_{t+1}) - \hat v(s_t)$. Schulman et al.  go further and consider an advantage estimated by combining successive TD errors with eligibility traces. Let $T$ be the length of the trajectory; the advantage is estimated as
$$\hat a(s_t, a_t) = \sum_{t'=t}^{T-1} (\gamma\lambda)^{t'-t}\, \hat\delta_{t'}.$$
These estimates are computed in an on-policy manner. Notice that in practice a value of $\lambda$ close to 1 is chosen, which makes this close to the (full) rollouts we described in Sec. 3.1.
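This traces-based estimator can be sketched as the standard backward recursion (the discount and trace values below are illustrative, not taken from the paper):

```python
# Sketch of the eligibility-traces advantage estimator: the advantage is
# a discounted sum of TD errors, A_t = sum_{l>=0} (gamma*lam)^l * delta_{t+l},
# with delta_t = r_t + gamma * V(s_{t+1}) - V(s_t), computed backward.

def traces_advantages(rewards, values, gamma=0.99, lam=0.95):
    """values has length len(rewards) + 1 (bootstrap on the last state)."""
    advantages = [0.0] * len(rewards)
    acc = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        acc = delta + gamma * lam * acc
        advantages[t] = acc
    return advantages

adv = traces_advantages([1.0, 0.0], [0.5, 0.5, 0.0], gamma=1.0, lam=1.0)
# with gamma = lam = 1 this reduces to full return minus baseline:
# A_0 = (1 + 0) - 0.5 = 0.5 and A_1 = 0 - 0.5 = -0.5
```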
MoPPO uses exactly the same greedy step as PPO (12), but the partial evaluation step depicted in Sec. 3.3. We use a replay buffer to store gathered transitions, and we evaluate the $q$-function, in an off-policy manner, by solving $m$ times the regression problem (15). We estimate the advantage of a state-action couple by subtracting a Monte Carlo empirical average of the state-action values from the estimated $q$-function. If $\hat q$ has been estimated based on the policy $\pi_k$, we sample actions $a_j \sim \pi_k(\cdot|s)$, $1 \le j \le N$, and we estimate
$$\hat a(s, a) = \hat q(s, a) - \frac{1}{N} \sum_{j=1}^N \hat q(s, a_j).$$
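The Monte Carlo baseline subtraction can be sketched as follows (hypothetical discrete toy policy and $q$-function; the paper's setting uses continuous actions and networks):

```python
import random

# Sketch of the Monte Carlo advantage estimate: subtract from q(s, a) an
# empirical average of q over actions sampled from the current policy.
# The toy discrete q-function and uniform policy below are assumptions.

def mc_advantage(q_fn, state, action, sample_action, n=1000):
    baseline = sum(q_fn(state, sample_action()) for _ in range(n)) / n
    return q_fn(state, action) - baseline

random.seed(0)
q = lambda s, a: {"a0": 2.0, "a1": 0.0}[a]
pi_sample = lambda: random.choice(["a0", "a1"])  # uniform toy policy
adv = mc_advantage(q, "s", "a0", pi_sample)
# baseline is close to 1.0 (average of 2.0 and 0.0), so adv is close to 1.0
```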
MoPPO is summarized in the Appendix. Current and target $q$- and policy networks are initialized. The algorithm feeds a replay buffer by interacting with the environment. At regular steps, the $q$-network is updated by performing $m$ successive optimizations of (15) with stochastic gradient descent, using the same policy during all $m$ optimizations, but updating the target network between each optimization. In the case of continuous actions, the expected state-action value of each next state is estimated using Monte Carlo. This is followed by the optimization of the policy by stochastic gradient ascent on (12). As the transitions are sampled from the buffer (which is bigger than the update frequency), MoPPO is off-policy. As it only uses (repeatedly) one-step rollouts, it does not require off-policy corrections.
5 Related Works
Being built upon MPI and SPI, MoSoPI is obviously related to these approaches. However, the combination of the two induces key differences.
For example, MoPPO can be related to AMPI , as both use the partial evaluation of MPI. However, AMPI does this with $m$-step rollouts, and the $q$-function is learnt on-policy. Moreover, the greedy step is not softened. AMPI has never been combined with neural networks, as far as we know; we hypothesize that it would be unstable, due to the greedy step. It has also never been considered practically with continuous actions. As it uses a soft greedy step, MoPPO can be related to various approaches such as TRPO, ACKTR  and even more so PPO , from which the only difference is the evaluation step. Thanks to this, MoPPO is off-policy, contrary to the preceding algorithms.
As an off-policy deep actor-critic, MoPPO can also be related to approaches such as SAC , DDPG  or TD3 . They share the same characteristics (off-policy, actor-critic), but they are derived from different principles. SAC is built upon entropy-regularized policy iteration, while DDPG and TD3 are based on the deterministic policy gradient theorem . The proposed MoSoPI framework is somehow more general, as it allows considering any soft greedy step (and thus those of the aforementioned approaches). Notice that these approaches are made off-policy by (somewhat implicitly) replacing the full policy evaluation by a single TD backup. This corresponds to setting $m = 1$ in our framework (but learning and sample collection are entangled, contrary to our approach).
Our approach can also be linked to others that could be seen as quite different at a first look. For example, consider Maximum a posteriori Policy Optimisation (MPO) 
. It is derived by applying the expectation-maximization principle to a relative entropy objective. However, looking at the final algorithm, it is a kind of soft policy iteration approach. The greedy step is close to the one of TRPO, except that the resulting policy is computed analytically on a subset of states, and generalized by minimizing a KL divergence between a policy network and this analytical policy. The evaluation is done by applying the approximate Bellman evaluation operator, combined with something close to $m$-step rollouts corrected by Retrace (they consider a truncated eligibility-traces-based estimator, that is, a weighted sum of $j$-step returns)  for off-policy learning. As such, it can be (roughly) seen as an instantiation of the proposed general MoSoPI framework.
6 Experiments

First, we study the influence of some parameters of our algorithm on one task (Hopper). The parameter $m$, which allows going from VI-like to PI-like approaches, is a first natural candidate (we explained earlier that the partial evaluation could be done using either $m$-step rollouts or by applying $m$ times the approximate Bellman operator; we have presented MoPPO with the second option, but we have tried both, and the former is very unstable, even with a Retrace-based off-policy correction. We assume that this is due to a too high degree of “off-policyness”; see the Appendix for more details). MoPPO separates sample collection from value/policy optimization. As shown in the Alg. provided in the Appendix, we indeed collect transitions and then train sequentially the $q$-function ($m$ times) and the policy. Separating both processes is not that common in deep RL, so we also experiment with the influence of this parameter.
Second, we compare MoPPO to its natural competitor, PPO, on a set of tasks. Our simple modification learns consistently better and faster, and sometimes requires up to ten times fewer samples. Both PPO and MoPPO are run by ourselves, using the OpenAI implementation for PPO. To get a better view of the efficiency of the proposed approach, we also compare it to a recent state-of-the-art off-policy actor-critic deep RL algorithm, SAC. For this comparison, we used the results provided by the authors for their experiments on the same environments (https://sites.google.com/corp/view/soft-actor-critic), although not with the same random seeds nor computer architecture. On most benchmarks, MoPPO performs better and/or faster.
The algorithms are evaluated using either the approach of Haarnoja et al.  or of Wu et al. . For the first approach, the policy is evaluated every 1000 steps by using the mean action (instead of sampling). For the second, we average the 10 best evaluation scores acquired so far, every 1000 steps (which requires keeping track of the 10 best past policies). Results are averaged over 5 seeds (for SAC, we used the provided results, corresponding to five seeds, but we do not know their values nor how they were chosen; for PPO and MoPPO, we took the best 5 seeds out of 8, with values evenly spaced between 1000 and 8000. Without this, the results would be slightly less stable, but it does not change the overall conclusion). Notice that we provide additional details and discuss the choice of hyper-parameters in the Appendix.
6.1 Effect of some parameters
First, we study the influence of $m$ (performance is evaluated as by Wu et al. ). With $m = 1$, we have an approximate soft value iteration approach, and it gets closer to policy iteration as $m$ increases (which can lead to an increased computational budget).
In Fig. 1(a), we increase $m$ while keeping the budget of each regression fixed. That is, we process the same number of minibatches for each regression, here 50; the budget thus increases linearly with $m$. A first increase of $m$ speeds up learning and improves the final performance (it almost doubles); increasing $m$ further provides a smaller improvement. This suggests that we can gain something by repeatedly solving the regression problem corresponding to the approximate Bellman operator, with a fixed policy.
This does not consume more samples, but it increases the computational budget. In Fig. 1(b), we study the effect of increasing $m$ at a fixed budget (that is, keeping fixed the number of minibatches processed for the whole set of $m$ regressions). In this example, increasing $m$, even at a fixed budget, still helps (but less than with an increased budget).
MoPPO (see Alg. in Appx.) decouples sample collection and learning, by updating both networks sequentially after a fixed number of interactions with the environment. One can expect that if this parameter is too large, learning will be less sample efficient (as networks are updated less frequently), while if it is too small, learning could become more unstable. This is illustrated in Fig. 1(c) for the Hopper task. With a more frequent update, learning is faster but the variance is also higher.
6.2 Comparative Results
Here we compare MoPPO to PPO, which is quite natural: the only difference between both approaches is the way the advantage function is estimated. PPO is quite recent, and a standard baseline, but on-policy. Off-policy actor-critics form a fast-evolving field, so we also compare our approach to SAC, which is also off-policy, even if derived from different principles, and offers state-of-the-art performance.
We evaluate results using the approaches of Haarnoja et al.  or of Wu et al. . The first one is representative of how learning progresses (current policy is evaluated every 1000 steps), while the other one is representative of the global efficiency of the algorithm (every 1000 steps, average of the best 10 policy evaluations computed so far).
In some environments, the performance of MoPPO degrades when the policy becomes too deterministic. We stop learning when this occurs, which explains why MoPPO curves sometimes stop earlier (with the evaluation of Haarnoja et al. , continuing the curve could show a degradation, while with the evaluation of Wu et al. , it would lead to a saturation). We would like to highlight that this phenomenon occurs for most actor-critics, mainly due to the policy becoming too deterministic; it simply happens much later (and training curves are usually stopped earlier).
Fig. 2 shows the performance of PPO and MoPPO on five MuJoCo tasks (see graph titles). We observe that MoPPO consistently learns competitive or better policies faster (up to 5 to 10 times faster, e.g., Hopper or Walker). This was to be expected, as MoPPO is off-policy while PPO is on-policy. However, we recall that MoPPO is a simple modification of PPO; this illustrates that the general MoSoPI framework can be useful regarding sample efficiency. We can also observe that MoPPO can be less stable: its policy tends to become close to deterministic earlier, and learning can have more variance (e.g., Ant, even if it still performs better on average than PPO).
Fig. 3 compares the average of the past top-ten policies for PPO, SAC and MoPPO. The comparison to PPO is as before. MoPPO performs as well as SAC in most environments, usually with far fewer samples. For example, on Walker it takes five times fewer samples to reach the same score as SAC. MoPPO is slower than SAC for Humanoid, but it reaches a better score (which SAC eventually reaches after 1.75 million interactions). Only for HalfCheetah does SAC obtain clearly better results, and faster.
In this paper, we proposed MoSoPI, a general framework that mixes the general idea of soft greediness, initiated by Kakade and Langford , with the partial evaluation approach of MPI . As a proof of concept, we introduced MoPPO, a modification of PPO that changes the way the advantage is estimated, and allows for off-policy learning. In our experiments, MoPPO consistently learns faster and provides better policies than PPO. This simple modification is also competitive with SAC, and often (but not always) performs better and/or faster.
MoPPO is a proof of concept, but the general MoSoPI framework can be used to derive other algorithms. For example, while MoPPO learns much faster than PPO, this comes at the price of some stability, as mentioned earlier. As such, it is efficient for learning a good policy from a small amount of samples (as we can keep the best policy computed so far), but it would not be an ideal solution for continual learning.
We envision using different kinds of soft greediness to help stabilize learning. More specifically, we plan to take inspiration from Abdolmaleki et al.  or Haarnoja et al. , who basically stabilize learning by better controlling the entropy of the policy and/or how it evolves. Another approach could be to select in a smarter way which experience to keep in the replay buffer . We think that combining these ideas with the partial policy evaluation scheme proposed here could further improve sample efficiency.
- Mnih et al.  V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 2015.
- Lillicrap et al.  T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra. Continuous control with deep reinforcement learning. International Conference on Learning Representations (ICLR), 2016.
- Wang et al.  Z. Wang, V. Bapst, N. Heess, V. Mnih, R. Munos, K. Kavukcuoglu, and N. de Freitas. Sample efficient actor-critic with experience replay. International Conference on Learning Representations (ICLR), 2017.
- Mnih et al.  V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu. Asynchronous methods for deep reinforcement learning. International Conference on Machine Learning (ICML), 2016.
- Schulman et al.  J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz. Trust region policy optimization. In International Conference on Machine Learning (ICML), 2015.
- Schulman et al.  J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
- Wu et al.  Y. Wu, E. Mansimov, R. B. Grosse, S. Liao, and J. Ba. Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation. In Advances in Neural Information Processing Systems (NeurIPS), 2017.
- Haarnoja et al.  T. Haarnoja, H. Tang, P. Abbeel, and S. Levine. Reinforcement learning with deep energy-based policies. In International Conference on Machine Learning (ICML), 2017.
- Haarnoja et al. [2018a] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. International Conference on Machine Learning (ICML), 2018a.
- Haarnoja et al. [2018b] T. Haarnoja, A. Zhou, K. Hartikainen, G. Tucker, S. Ha, J. Tan, V. Kumar, H. Zhu, A. Gupta, P. Abbeel, and S. Levine. Soft actor-critic algorithms and applications. arXiv preprint arXiv:1812.05905, 2018b.
- Gabillon et al.  V. Gabillon, A. Lazaric, M. Ghavamzadeh, and B. Scherrer. Classification-based policy iteration with a critic. In International Conference on Machine Learning (ICML), 2011.
- Kakade and Langford  S. Kakade and J. Langford. Approximately optimal approximate reinforcement learning. In International Conference on Machine Learning (ICML), 2002.
- Scherrer and Geist  B. Scherrer and M. Geist. Local policy search in a convex space and conservative policy iteration as boosted policy search. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases(ECML-KDD). Springer, 2014.
- Puterman and Shin  M. L. Puterman and M. C. Shin. Modified policy iteration algorithms for discounted Markov decision problems. Management Science, 1978.
- Scherrer et al.  B. Scherrer, M. Ghavamzadeh, V. Gabillon, B. Lesner, and M. Geist. Approximate modified policy iteration and its application to the game of Tetris. Journal of Machine Learning Research (JMLR), 2015.
- Todorov et al.  E. Todorov, T. Erez, and Y. Tassa. MuJoCo: A physics engine for model-based control. In International Conference on Intelligent Robots and Systems (IROS). IEEE, 2012.
- Bradtke and Barto  S. J. Bradtke and A. G. Barto. Linear least-squares algorithms for temporal difference learning. Machine learning, 1996.
- Munos et al.  R. Munos, T. Stepleton, A. Harutyunyan, and M. Bellemare. Safe and efficient off-policy reinforcement learning. In Advances in Neural Information Processing Systems (NeurIPS), 2016.
- Fujimoto et al.  S. Fujimoto, H. van Hoof, and D. Meger. Addressing function approximation error in actor-critic methods. International Conference on Machine Learning (ICML), 2018.
- Silver et al.  D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller. Deterministic policy gradient algorithms. In International conference on machine learning (ICML), 2014.
- Abdolmaleki et al.  A. Abdolmaleki, J. T. Springenberg, Y. Tassa, R. Munos, N. Heess, and M. Riedmiller. Maximum a posteriori policy optimisation. International Conference on Learning Representations (ICLR), 2018.
- Brockman et al.  G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.
- Abdolmaleki et al.  A. Abdolmaleki, J. T. Springenberg, J. Degrave, S. Bohez, Y. Tassa, D. Belov, N. Heess, and M. Riedmiller. Relative entropy regularized policy iteration. arXiv preprint arXiv:1812.02256, 2018.
- De Bruin et al.  T. De Bruin, J. Kober, K. Tuyls, and R. Babuška. Experience selection in deep reinforcement learning for control. The Journal of Machine Learning Research, 19(1):347–402, 2018.
- Touati et al.  A. Touati, P.-L. Bacon, D. Precup, and P. Vincent. Convergent tree-backup and retrace with function approximation. International Conference on Machine Learning (ICML), 2018.
- Hessel et al.  M. Hessel, J. Modayil, H. Van Hasselt, T. Schaul, G. Ostrovski, W. Dabney, D. Horgan, B. Piot, M. Azar, and D. Silver. Rainbow: Combining improvements in deep reinforcement learning. Association for the Advancement of Artificial Intelligence (AAAI), 2018.
Appendix A Theoretical Analysis
Here, we prove Thm. 1 and discuss the required assumptions.
Assumption 1 (Initialization).
The initial policy and value satisfy $T_{\pi_0} v_0 \ge v_0$.
This assumption allows showing monotonicity of values from the beginning. It is a mild assumption. For example, it is satisfied by taking $v_0 = v_{\pi_0}$ (as $T_{\pi_0} v_{\pi_0} = v_{\pi_0}$). Otherwise, if an initial $v_0$ does not satisfy the assumption, subtracting a large enough constant allows satisfying Asm. 1. Let $\mathbf{1}$ be the vector whose components are all equal to 1, and define $\tilde v_0 = v_0 - c\, \mathbf{1}$ for some $c \ge 0$. We have that
$$T_{\pi_0} \tilde v_0 = T_{\pi_0} v_0 - \gamma c\, \mathbf{1} \ge v_0 - c\, \mathbf{1} = \tilde v_0 \iff (1 - \gamma)\, c\, \mathbf{1} \ge v_0 - T_{\pi_0} v_0.$$
The last equation provides the lower bound on $c$ (for example, $c \ge \|v_0 - T_{\pi_0} v_0\|_\infty / (1 - \gamma)$) such that $\pi_0$ and $\tilde v_0$ satisfy Asm. 1.
We also need an assumption regarding the conservative greedy step.
Assumption 2 (Improvement).
Since MoSoPI does not require the policy to be greedy with respect to $v_k$, we assume that the improvement is strict whenever possible. That is, if there exists a policy $\pi$ such that $T_\pi v_k > T_{\pi_k} v_k$ (in at least one state), then $T_{\pi_{k+1}} v_k > T_{\pi_k} v_k$ (in at least one state).
This assumption is also mild. Notice that without it, we could show monotonic improvement, but not convergence toward the optimal policy. To see this, it is sufficient to consider the constant sequence of policies $\pi_{k+1} = \pi_k$, which would be valid for MoSoPI without Asm. 2. We can now state our convergence result.
Theorem 1 (Convergence of MoSoPI). Under Asm. 1 and 2, the sequence of value functions computed by scheme (17) converges to the optimal value function $v_*$.
First, we start by showing by induction that for any $k \ge 0$, we have $v_k \le T_{\pi_{k+1}} v_k$. For $k = 0$, we have by the greedy step that $T_{\pi_1} v_0 \ge T_{\pi_0} v_0$. By Asm. 1, we have $T_{\pi_0} v_0 \ge v_0$. Putting this together shows that the induction hypothesis is true for $k = 0$: $v_0 \le T_{\pi_1} v_0$. Now, assume that $v_k \le T_{\pi_{k+1}} v_k$. By monotonicity of the Bellman operator, for any $j \ge 0$,
$$T_{\pi_{k+1}}^j v_k \le T_{\pi_{k+1}}^{j+1} v_k. \quad (25)$$
Then, we have
$$T_{\pi_{k+2}} v_{k+1} \ge T_{\pi_{k+1}} v_{k+1} \quad \text{(by the greedy step)} \quad (26)$$
$$= T_{\pi_{k+1}}^{m+1} v_k \quad \text{(as } v_{k+1} = T_{\pi_{k+1}}^m v_k\text{)} \quad (27)$$
$$\ge T_{\pi_{k+1}}^m v_k = v_{k+1}. \quad \text{(by Eq. (25))} \quad (28)$$
The induction hypothesis is true at any iteration.
Next, we show that the sequence of values is increasing and bounded. First, we have
$$v_{k+1} = T_{\pi_{k+1}}^m v_k \ge v_k. \quad \text{(by Eq. (25))} \quad (31)$$
By definition of the Bellman optimality operator and using the induction hypothesis, $v_k \le T_{\pi_{k+1}} v_k \le T_* v_k$. By direct induction, this shows that $v_k \le v_*$. So we have that
$$v_k \le v_{k+1} \le v_*.$$
The sequence being increasing and upper-bounded, it converges to some $v_\infty$ satisfying $v_\infty \le v_*$. We still have to show that $v_\infty = v_*$.
Asymptotically, $v_\infty$ is a fixed point: $v_\infty = T_{\pi_\infty}^m v_\infty$. The operator $T_{\pi_\infty}^m$ has the same unique fixed point as $T_{\pi_\infty}$, so $v_\infty = v_{\pi_\infty}$. Notice that while the fixed point of a Bellman operator is unique, more than one policy can have the same fixed point (so the policy $\pi_\infty$ in the previous equations might not be unique, but any member of a set of policies having the same value). Assume that $v_\infty \ne v_*$ (that is, $\pi_\infty$ is not optimal). In this case, there exists a policy $\pi$ (not in this set of policies) such that $T_\pi v_\infty > T_{\pi_\infty} v_\infty$ in at least one state. This contradicts Asm. 2. So $v_\infty = v_*$, and the set of associated policies is optimal. ∎
Appendix B MoPPO pseudocode
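The overall structure of MoPPO can be summarized as follows (an illustrative sketch assembled from Sec. 3.3, App. C and App. D; variable names, minibatch details and the stopping rule are ours, and the exact settings are given in Tab. 1):

```
Input: policy network π_θ, Q-network Q_φ, target network Q_φ', replay buffer B
repeat
    collect transitions (s, a, r, s') with π_θ and store them in B
    // partial evaluation: m successive regressions (App. C)
    for i = 1, …, m do
        build targets y = r + γ E_{a'∼π_θ}[Q_φ'(s', a')] on samples from B
        regress Q_φ towards y (50–1000 gradient steps, App. D); Q_φ' ← Q_φ
    end for
    // softened greedy step: PPO-like clipped surrogate with a small clip ratio
    update π_θ by gradient ascent on the clipped surrogate built from Q_φ
until the environment-step budget is exhausted
```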
Appendix C Approximate partial policy evaluation
As discussed in Sec. 3.3, to approximate the operator , one can either use -step rollouts (corrected for off-policy learning) or apply the approximate Bellman operator times (or, said otherwise, solve successive regression problems). We adopted the latter approach for MoPPO, but we also experimented with the former.
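On a toy tabular MDP, the repeated-regression view can be made concrete: each application of the Bellman operator is one regression, which is exact in the tabular case (the code below is purely illustrative, with sizes and names of our choosing):

```python
import numpy as np

# Illustrative tabular sketch: applying the Bellman operator T_pi m times,
# one (here exact) regression per application.
rng = np.random.default_rng(0)
nS, nA, gamma, m = 4, 2, 0.9, 5
P = rng.dirichlet(np.ones(nS), size=(nS, nA))  # transition kernel P(s'|s,a)
R = rng.uniform(size=(nS, nA))                 # reward function
pi = rng.dirichlet(np.ones(nA), size=nS)       # policy being evaluated

Q = np.zeros((nS, nA))
for _ in range(m):
    V = (pi * Q).sum(axis=1)   # V(s') = E_{a'~pi} Q(s', a')
    Q = R + gamma * P @ V      # regression target: (T_pi Q)(s, a)
```

With function approximation, the update `Q = R + gamma * P @ V` becomes a fitted regression on replay-buffer samples; no importance weights are needed, since the expectation over the next action is taken under the current policy.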
The off-policy correction of rollouts is based on importance sampling, which can cause huge variance (importance weights are ratios of probabilities, which can explode if the probabilities are very different). To mitigate this effect, one can use the idea of Retrace , which consists in capping the importance weights at 1. We directly considered -step rollouts corrected with Retrace.
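The capping can be sketched for a single trajectory as follows (an illustrative NumPy version of the Retrace target; the function name and argument layout are ours):

```python
import numpy as np

def retrace_target(q, r, pi, mu, q_next_expected, gamma=0.99, lam=1.0):
    """Retrace(lambda) target for the first state-action pair of a trajectory.

    q:  Q(x_t, a_t) along the trajectory, shape (T,)
    r:  rewards, shape (T,)
    pi, mu: target/behavior probabilities of the actions taken, shape (T,)
    q_next_expected: E_{a ~ pi} Q(x_{t+1}, a), shape (T,)
    """
    target = q[0]
    c = 1.0  # running product of truncated importance weights (empty at t=0)
    for t in range(len(r)):
        if t > 0:
            c *= lam * min(1.0, pi[t] / mu[t])  # Retrace: cap the weights at 1
        delta = r[t] + gamma * q_next_expected[t] - q[t]
        target += (gamma ** t) * c * delta
    return target
```

When pi equals mu, the weights reduce to lam and the estimator is an on-policy lambda-return; when pi assigns much more probability than mu, the cap prevents the product of weights from exploding.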
Both approaches ( regressions and -step rollouts) work well on simple problems such as InvertedPendulum, as seen in Fig. 4. Yet, one can observe that for the regression approach, increasing helps, while for the rollout approach, increasing degrades performance. When applied to a mid-size problem such as Hopper (Fig. 5), PPO combined with (off-policy) -step returns reaches much worse performance than PPO when .
Our hypothesis is that MoPPO is too aggressive regarding the degree of “off-policyness” for an -step return-based approach to work. On the contrary, performing successive regressions does not require any off-policy correction, and thus does not suffer from this variance problem. Also, it has been reported that Retrace might not be stable under function approximation . Moreover, we notice that -step returns are sometimes used without correction in an off-policy context, for example in Rainbow . We hypothesize that this is because learning is slow enough (and is small enough) that the transitions in the replay buffer are close to on-policy.
These experiments suggest that performing regressions rather than using -step returns is beneficial, and that choosing is also beneficial (with , both approaches are equivalent). As a consequence, we think that the general MoSoPI scheme could also be useful for other off-policy actor-critics (which roughly consider ).
Appendix D Hyper-parameters
Here, we provide the hyper-parameters used in Sec. 6. For all environments, we used the state normalization provided by the OpenAI framework. The network architectures are as follows.
For MoPPO, in all experiments except Ant, the actor is a Gaussian policy with 2 hidden layers, with activation for each layer output, and the critic is a feedforward neural network with hidden layers of size (400, 300), using activations for each layer output. For the Ant environment, a larger policy is used, with 2 hidden layers of size ; the critic is the same. PPO shares the same policy architecture, and uses for its state value function the same architecture as MoPPO’s state-action value function, the only difference being the bigger input for MoPPO (state-action pair instead of state only). All other hyper-parameters are provided in Tab. 1.
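The critic's forward pass can be sketched as follows (an illustrative NumPy version, not the actual implementation; the hidden activation and initializer below are assumptions, the real choices being given in the text and Tab. 1):

```python
import numpy as np

def init_params(in_dim, hidden=(400, 300), seed=0):
    """Weights for a feedforward critic with hidden sizes (400, 300)."""
    rng = np.random.default_rng(seed)
    sizes = (in_dim, *hidden, 1)
    return [(rng.normal(scale=0.1, size=(o, i)), np.zeros(o))
            for i, o in zip(sizes[:-1], sizes[1:])]

def critic(state, action, params):
    # MoPPO's critic takes the state-action pair as input; PPO's state-value
    # critic is identical up to its smaller, state-only input.
    x = np.concatenate([state, action])
    for W, b in params[:-1]:
        x = np.tanh(W @ x + b)   # assumed hidden activation
    W, b = params[-1]
    return (W @ x + b)[0]        # scalar Q(s, a) estimate
```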
Besides the and discussed in Sec. 6, there are indeed a few other parameters to set (though less strongly linked to the change in the way the policy is evaluated). In our experiments, we observed that hyper-parameter setting is important for getting the best results from the algorithm, due to the variety of MuJoCo tasks. For example, while Hopper has an 11-dimensional state, Humanoid has a 376-dimensional state (and is usually considered a difficult problem).

MoPPO is an off-policy algorithm that uses a replay buffer. The best empirical results are achieved with a buffer size of 20k. Larger buffer sizes reduced performance, maybe because larger buffers increase the degree of “off-policyness”, other things being fixed (they contain data from more different policies). Using smaller buffers tends to make the learnt policies greedier on a smaller set of actions, and training focuses on state-action pairs that are sampled by similar policies.

The clip ratio of PPO is set to 0.2, its typical value, while MoPPO uses a much smaller clipping ratio ( to ). We do so because more gradient descent steps are applied to the policy network (compared to PPO). The number of gradient descent steps applied to the -function, for each of the regressions, depends on the scale of the environment: in Hopper, we used 50–250 gradient steps, while larger tasks required more (up to 1000 steps for Humanoid). We also note that while consistently provides good results, better results were obtained with on two tasks (Ant and Humanoid) within 1 million environment steps. On the Humanoid task, can achieve around higher results, but requires more environment steps (an additional 120k) and computation (10 times more).