Exploiting the sign of the advantage function to learn deterministic policies in continuous domains

06/10/2019 ∙ by Matthieu Zimmer, et al. ∙ Shanghai Jiao Tong University

In the context of learning deterministic policies in continuous domains, we revisit an approach, which was first proposed in Continuous Actor Critic Learning Automaton (CACLA) and later extended in Neural Fitted Actor Critic (NFAC). This approach is based on a policy update different from that of deterministic policy gradient (DPG). Previous work has observed its excellent performance empirically, but a theoretical justification is lacking. To fill this gap, we provide a theoretical explanation to motivate this unorthodox policy update by relating it to another update and making explicit the objective function of the latter. We furthermore discuss in depth the properties of these updates to get a deeper understanding of the overall approach. In addition, we extend it and propose a new trust region algorithm, Penalized NFAC (PeNFAC). Finally, we experimentally demonstrate in several classic control problems that it surpasses the state-of-the-art algorithms to learn deterministic policies.

1 Introduction

Model-free reinforcement learning combined with neural networks achieved several recent successes over a large range of domains [Mnih et al.2015, Lillicrap et al.2016, Schulman et al.2017]. Yet those methods are still difficult to apply without any expert knowledge, lack robustness and are very sensitive to hyperparameter optimization [Henderson et al.2018, Colas et al.2018].

In this context, we focus in this paper on improving methods that learn deterministic policies. Such policies have three main advantages during the learning phase: 1) they usually require less interactive data because fewer parameters need to be learned, 2) their performances are less costly to estimate during testing phases because randomness only comes from the environment (as opposed to randomized policies), and 3) they are also less sensitive to the premature convergence problem, because they cannot directly control exploration. Moreover, deterministic policies are preferred in some domains (e.g., robotics), because we do not want the agent to act stochastically after the learning phase.

In continuous state and action space domains, solution methods require function approximation. Neural control architectures are excellent representations for policies because they can handle continuous domains, are easily scalable, and have a high degree of expressiveness. The weights of such neural networks are usually updated with a policy gradient method. As vanilla policy gradient suffers from high variance, it is generally implemented in an actor-critic architecture where an estimated value function helps to reduce the variance at the cost of introducing some bias [Konda and Tsitsiklis1999]. In this architecture, the parameters (e.g., weights of neural networks) of the policy (i.e., actor) and its value function (i.e., critic) are updated simultaneously.

The basic version of an actor-critic architecture for learning deterministic policies in continuous domains is the deterministic policy gradient (DPG) method [Silver et al.2014]. Learning the value function is crucial but also difficult, which is why several extensions of DPG have been proposed. Deep Deterministic Policy Gradient (DDPG) [Lillicrap et al.2016] brings batch normalization [Ioffe and Szegedy2015], target networks and replay buffer [Mnih et al.2015] to DPG and is one of the most used actor-critic methods for learning continuous deterministic policies. However, it has several limitations: 1) the critic learns the state-action value function (Q function), which is difficult to estimate, 2) it relies on the fact that non-biased estimates of the gradient of the Q function are accessible, which is not the case in the model-free setting, 3) it does not use compatible functions: the policy gradient might be poorly estimated.

In this work, we focus on an alternative method that estimates the state value function (V function) instead of the Q function to learn continuous deterministic policies. Van Hasselt and Wiering [2007] were the first to propose to reinforce the policy toward an action with a positive temporal difference. They experimentally showed that using such a method, in an incremental actor-critic algorithm called Continuous Actor Critic Learning Automaton (CACLA), provided better results than both the stochastic and the deterministic policy gradients (in their paper, the deterministic policy gradient algorithm was called ADHDP [Prokhorov and Wunsch1997]) in the Mountain Car and the Acrobot environments. Zimmer et al. [2016, 2018] validated those results in higher-dimensional environments, Half-Cheetah and Humanoid in Open Dynamics Engine [Smith2005], and proposed several extensions with the Neural Fitted Actor Critic (NFAC) algorithm. However, neither a theoretical explanation for their good performance nor a clear discussion of which objective function those methods optimize was given. Providing such an explanation would help to better understand why those algorithms work well, what their properties and limitations are, and how to further improve them.

We first show that CACLA and NFAC can be viewed as policy gradient methods and that they are closely related to a specific form of the stochastic policy gradient (SPG) [Sutton et al.1999]. Then we discuss some of their properties and limitations. Moreover, we extend them with trust region updates and call the new algorithm Penalized Neural Fitted Actor Critic (PeNFAC). Finally, we experimentally show that PeNFAC performs well on three high-dimensional continuous environments compared to the state-of-the-art methods.

2 Background

A continuous Markov Decision Process (MDP) [Sutton1988] is a tuple $\mathcal{M} = (\mathcal{S}, \mathcal{A}, T, R, T_0)$ where $\mathcal{S}$ is a continuous state space, $\mathcal{A}$ is a continuous action space with $m$ dimensions, $T$ is a transition function, $R$ is a reward function, and $T_0$ is a distribution over initial states. In the model-free setting, it is assumed that the transition function $T$ and the reward function $R$ are unknown and can only be sampled at specific states according to the interaction between the agent and the environment.

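The following notations are used: $\mu$ represents a deterministic policy and $\pi$ a stochastic one. Thus, for a given state $s \in \mathcal{S}$, $\mu(s)$ is an action, $\pi(a|s)$ is the probability of sampling action $a$ from the policy $\pi$, and $\pi(\cdot|s)$ is a distribution over the action space $\mathcal{A}$. For a policy $\pi$, we denote the discounted state distribution by:

$$d^\pi_\gamma(s) = \int_{\mathcal{S}} T_0(s_0) \sum_{t=0}^{\infty} \gamma^t \, p(s \mid s_0, t, \pi) \, ds_0$$

where $\gamma \in [0, 1)$ is a discount factor and $p(s \mid s_0, t, \pi)$ is the probability of being in state $s$ after applying policy $\pi$ for $t$ timesteps from state $s_0$. Its state value function is defined by $V^\pi(s) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t R(S_t, A_t) \mid S_0 = s\right]$ where $\mathbb{E}_\pi$ is the expectation induced by $T$ and $\pi$, and for all $t$, $S_t$ and $A_t$ are random variables. Its action value function is given by $Q^\pi(s, a) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t R(S_t, A_t) \mid S_0 = s, A_0 = a\right]$ and its advantage function by $A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)$.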

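In reinforcement learning, the goal is to find a policy that optimizes the expectation of the discounted rewards:

$$J(\pi) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t R(S_t, A_t) \mid S_0 \sim T_0\right]$$

Due to the continuity of the state/action spaces, this optimization problem is usually restricted to a class of parametrized policies, which we denote $\pi_\theta$ (stochastic case) or $\mu_\theta$ (deterministic case). To simplify notations, we may write $\pi$ or $\mu$ instead of $\pi_\theta$ or $\mu_\theta$. The stochastic policy gradient (SPG) in the continuous case can be written as [Sutton et al.1999]:

$$\nabla_\theta J(\pi_\theta) = \int_{\mathcal{S}} d^\pi_\gamma(s) \int_{\mathcal{A}} \pi_\theta(a|s) \, A^\pi(s, a) \, \nabla_\theta \log \pi_\theta(a|s) \, da \, ds \qquad (1)$$

The DPG is defined as [Silver et al.2014]:

$$\nabla_\theta J(\mu_\theta) = \int_{\mathcal{S}} d^\mu_\gamma(s) \, \Delta_{DPG}(s, \mu_\theta) \, ds \qquad (2)$$
where
$$\Delta_{DPG}(s, \mu_\theta) = \nabla_a A^\mu(s, a)\big|_{a = \mu_\theta(s)} \nabla_\theta \mu_\theta(s) \qquad (3)$$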

Policy gradient methods usually take a step according to those directions: $\theta_{t+1} \leftarrow \theta_t + \alpha \nabla_\theta J$. However, it is difficult to select a proper learning rate $\alpha$ to control the step size. If $\alpha$ is too big, the method may diverge. If it is too low, the learning will converge slowly (thus requiring more samples). To overcome this difficulty, a trust region method can be used to control the step size [Schulman et al.2015]. Indeed, one can guarantee monotonic gradient updates by exploiting an approximation of the policy advantage function [Kakade and Langford2002] of $\tilde\pi$ with respect to $\pi$, which measures the difference of performance between the two policies:

$$J(\tilde\pi) - J(\pi) = \int_{\mathcal{S}} d^{\tilde\pi}_\gamma(s) \int_{\mathcal{A}} \tilde\pi(a|s) \, A^\pi(s, a) \, da \, ds \approx \int_{\mathcal{S}} d^{\pi}_\gamma(s) \int_{\mathcal{A}} \tilde\pi(a|s) \, A^\pi(s, a) \, da \, ds \qquad (4)$$

The latter approximation holds when the two policies are close, which can be enforced by a KL divergence constraint in trust region policy optimization [Schulman et al.2015].

3 Algorithms

In this section, we recall three related algorithms (CACLA, CAC, NFAC) that we discuss later.

3.1 Continuous Actor Critic Learning Automaton

Continuous Actor Critic Learning Automaton (CACLA) [Van Hasselt and Wiering2007] is an actor-critic method that learns a stochastic policy $\pi$ and its estimated value function $\hat V^\pi$. We assume in this paper that CACLA uses isotropic Gaussian exploration, which implies that $\pi$ can be written as follows:

$$\pi_{\theta,\sigma}(\cdot|s) = \mathcal{N}\left(\mu_\theta(s), \sigma^2 I\right) \qquad (5)$$

where $I$ is the identity matrix and $\sigma > 0$ is possibly annealed during learning. CACLA alternates between two phases:

1) a hill climbing step in the action space using a random optimization (RO) algorithm [Matyas1965],

2) a gradient-like update in the policy parameter space.

RO consists in repeating the following two steps:

i) sample a new action $a'$, which is executed in the environment in current state $s$, by adding a normally distributed noise to the current action $a$,

ii) if $R(s, a') + \gamma \hat V^\pi(s') > \hat V^\pi(s)$ then $a \leftarrow a'$, else $a$ does not change.

Phase 2) is based on the following update:

$$\text{if } \delta(s, a) > 0: \quad \tilde\theta \leftarrow \theta - \alpha \left(\mu_\theta(s) - a\right) \nabla_\theta \mu_\theta(s) \qquad (6)$$

where $\delta(s, a) = R(s, a) + \gamma \hat V^\pi(s') - \hat V^\pi(s)$ is the temporal difference (TD) error. As the expectation of the TD error is equal to the advantage function, this update can be interpreted as follows: if an exploratory action $a$ has a positive advantage then policy $\mu$ should be updated towards $a$.
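The update rule above can be sketched for a linear policy. This is a minimal illustration, not the authors' implementation: the linear parametrization `mu_theta(s) = theta @ phi_s`, the feature vector `phi_s`, and all numeric values are assumptions made for the example.

```python
import numpy as np

def cacla_step(theta, phi_s, a, td_error, alpha=0.1):
    """One CACLA actor update for a (hypothetical) linear policy mu(s) = theta @ phi_s."""
    if td_error <= 0.0:
        # Non-positive TD error: the exploratory action is not reinforced.
        return theta
    mu = theta @ phi_s
    # (mu - a) grad_theta mu is outer(mu - a, phi_s) for a linear policy.
    return theta - alpha * np.outer(mu - a, phi_s)

rng = np.random.default_rng(0)
theta = np.zeros((2, 3))                 # 2 action dimensions, 3 state features
phi_s = np.array([1.0, 0.5, -0.5])       # features of the current state (made up)
sigma = 0.2                              # exploration standard deviation

# Isotropic Gaussian exploration around the deterministic action.
a = theta @ phi_s + sigma * rng.standard_normal(2)

theta_pos = cacla_step(theta, phi_s, a, td_error=0.3)   # moves mu(s) towards a
theta_neg = cacla_step(theta, phi_s, a, td_error=-0.3)  # leaves theta unchanged
```

With a positive TD error the new action `theta_pos @ phi_s` is closer to `a` than before; with a negative one the parameters are untouched, which is the asymmetry the section describes.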

Note that although CACLA executes a stochastic policy $\pi$, it can be seen as learning a deterministic policy $\mu$. Van Hasselt and Wiering [2007] state that when learning in continuous action space, moving away from a bad action could be meaningless. Indeed, while for stochastic policies, the probability of a bad action can be decreased, for deterministic policies, moving in the action space in the opposite direction of an action with a negative advantage may not necessarily lead to better actions. Thus, CACLA’s update is particularly appropriate for learning continuous deterministic policies.

3.2 Continuous Actor Critic

In our discussion, we also refer to a slightly different version of CACLA, Continuous Actor Critic (CAC) [Van Hasselt and Wiering2007]. The only difference between CAC and CACLA is that the update in CAC is scaled by the TD error:

$$\text{if } \delta(s, a) > 0: \quad \tilde\theta \leftarrow \theta - \alpha \, \delta(s, a) \left(\mu_\theta(s) - a\right) \nabla_\theta \mu_\theta(s) \qquad (7)$$

Thus an action with a larger positive advantage (here, estimated by the TD error) will have a bigger impact over the global objective.
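The effect of the TD-error scaling can be seen on a toy case. The sketch below again assumes a hypothetical linear policy `mu(s) = theta @ phi_s`; names and values are illustrative only.

```python
import numpy as np

def cac_step(theta, phi_s, a, td_error, alpha=0.1):
    """One CAC actor update: like CACLA, but the step is scaled by the TD error."""
    if td_error <= 0.0:
        return theta
    mu = theta @ phi_s
    return theta - alpha * td_error * np.outer(mu - a, phi_s)

theta = np.zeros((1, 2))
phi_s = np.array([1.0, 2.0])
a = np.array([0.5])          # exploratory action (made up)

# Same action, two different positive TD errors.
step_small = cac_step(theta, phi_s, a, td_error=1.0) - theta
step_large = cac_step(theta, phi_s, a, td_error=2.0) - theta
```

Doubling the TD error doubles the parameter displacement, whereas a CACLA step would be identical in both cases.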

3.3 Neural Fitted Actor Critic

The Neural Fitted Actor Critic (NFAC) [Zimmer et al.2016, Zimmer et al.2018] algorithm is an efficient instantiation of the CACLA update, which integrates the following techniques: batch normalization, $\lambda$-returns for both the critic and the actor, and batch learning with Adam [Kingma and Ba2015]. In this algorithm, the update of the parameters is no longer done at each time step, but at the end of a given number of episodes.
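As a reminder of what $\lambda$-returns compute, here is a minimal sketch of the standard backward recursion $G_t = r_t + \gamma\left[(1-\lambda) V(s_{t+1}) + \lambda G_{t+1}\right]$. The text does not give NFAC's exact formula, so this textbook form and the function name `lambda_returns` are assumptions.

```python
import numpy as np

def lambda_returns(rewards, next_values, gamma=0.99, lam=0.9):
    """Lambda-returns for one trajectory.

    rewards[t] is r_t and next_values[t] is the critic estimate V(s_{t+1})
    (use 0.0 for a terminal next state).
    """
    T = len(rewards)
    returns = np.zeros(T)
    g = next_values[-1]  # bootstrap from the value of the last next state
    for t in reversed(range(T)):
        g = rewards[t] + gamma * ((1.0 - lam) * next_values[t] + lam * g)
        returns[t] = g
    return returns

rewards = [1.0, 1.0, 1.0]
next_values = [0.5, 0.2, 0.0]
td_targets = lambda_returns(rewards, next_values, gamma=0.5, lam=0.0)
mc_returns = lambda_returns(rewards, next_values, gamma=0.5, lam=1.0)
```

Setting `lam=0` recovers one-step TD targets and `lam=1` recovers Monte Carlo returns, so intermediate values trade bias against variance.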

4 Discussions

In this section, we discuss the algorithms to provide some theoretical explanation for their good performance.

4.1 CACLA

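We first explain the relationship between an algorithm based on stochastic policy gradient (SPG) and CACLA. For this discussion, we assume that SPG is applied to parametrized policies that are Gaussian policies $\pi_{\theta,\sigma}$ (i.e., Gaussian around $\mu_\theta$). Then the first common feature between the two algorithms is that the distributions over states they induce during learning are the same (i.e., $d^\pi_\gamma(s)$) because they both use the same exploratory policy to interact with the environment. Moreover, SPG can be written as follows:

$$\nabla_\theta J(\pi_{\theta,\sigma}) = \int_{\mathcal{S}} d^\pi_\gamma(s) \int_{\mathcal{A}} \pi_{\theta,\sigma}(a|s) \, A^\pi(s, a) \, \nabla_\theta \log \pi_{\theta,\sigma}(a|s) \, da \, ds = \frac{1}{\sigma^2} \int_{\mathcal{S}} d^\pi_\gamma(s) \int_{\mathcal{A}} \pi_{\theta,\sigma}(a|s) \, A^\pi(s, a) \left(a - \mu_\theta(s)\right) \nabla_\theta \mu_\theta(s) \, da \, ds$$

For CACLA, we interpret update (6) as a stochastic update in the following direction:

$$\int_{\mathcal{S}} d^\pi_\gamma(s) \, \Delta_{CACLA}(s, \mu_\theta) \, ds \qquad (8)$$
with
$$\Delta_{CACLA}(s, \mu_\theta) = \int_{\mathcal{A}} \pi_{\theta,\sigma}(a|s) \, H\left(A^\pi(s, a)\right) \left(\mu_\theta(s) - a\right) \nabla_\theta \mu_\theta(s) \, da$$

where $H$ is the Heaviside function. Indeed, the inner integral is estimated using a single Monte Carlo sample during the run of CACLA.

Under this form, it is easy to see the similarity between SPG and CACLA. The constant factor $\frac{1}{\sigma^2}$ can be neglected because it may be integrated into the learning rate. The sign difference of the term $\left(a - \mu_\theta(s)\right)$ is because SPG performs gradient ascent and CACLA gradient descent. So the main difference between SPG and CACLA is the replacement of $A^\pi(s, a)$ by $H\left(A^\pi(s, a)\right)$. Therefore CACLA optimizes its exploratory stochastic policy through an approximation of SPG hoping to improve the underlying deterministic policy (for a fixed state, the direction of CACLA and SPG are the same up to a scalar).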
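The claim that, for a fixed state, the two directions agree up to a scalar can be checked numerically on a toy problem. The sketch below assumes a one-dimensional action, a policy whose parameter is the action itself, and a made-up quadratic advantage; none of this comes from the paper's experiments.

```python
import numpy as np

def advantage(a, mu=0.0, a_star=1.0):
    """Toy advantage: zero at the current action mu, maximal at a_star."""
    return (mu - a_star) ** 2 - (a - a_star) ** 2

rng = np.random.default_rng(0)
mu, sigma, n = 0.0, 0.1, 200_000

# Exploratory actions drawn from the Gaussian policy around mu.
a = mu + sigma * rng.standard_normal(n)
adv = advantage(a, mu)

# SPG inner integral (ascent direction), Gaussian score (a - mu) / sigma^2.
spg_dir = np.mean(adv * (a - mu)) / sigma**2

# CACLA inner integral (descent direction), advantage replaced by its Heaviside.
cacla_dir = np.mean((adv > 0) * (mu - a))
```

Here `spg_dir` is positive and `cacla_dir` is negative, so ascending along the former and descending along the latter both push `mu` towards the better actions; only the magnitudes differ.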

Moreover, relating CACLA’s update with (8) also brings to light two main limitations. The first one concerns the inner integral over the action space, which has a high variance. Therefore, we expect CACLA to be less and less data efficient in high-dimensional action spaces (which is the main theoretical justification of DPG over SPG; see Appendix B.1). The second limitation is that, over one update, CACLA does not share exactly the same optimal solutions as DPG or SPG. Indeed, if $\theta^*$ is a parameter at which the DPG or SPG gradient vanishes, it is not possible to prove that (8) will also be 0 (because of the integral over the state space). It means that CACLA could decrease the performance of this locally optimal solution.

4.2 CAC

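Similarly, the update in CAC can be seen as a stochastic update in the following direction:

$$\int_{\mathcal{S}} d^\pi_\gamma(s) \, \Delta_{CAC}(s, \mu_\theta) \, ds$$
with
$$\Delta_{CAC}(s, \mu_\theta) = \int_{\mathcal{A}} \pi_{\theta,\sigma}(a|s) \, A^\pi(s, a) \, H\left(A^\pi(s, a)\right) \left(\mu_\theta(s) - a\right) \nabla_\theta \mu_\theta(s) \, da$$

This shows that CAC is even closer to SPG than CACLA and provides a good theoretical justification of this update at a local level (not moving toward potentially worse actions). However, there is also a justification at a more global level.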

Lemma 4.1.

For a fixed state, when the exploration tends to zero, CAC maintains the sign of the DPG update with a scaled magnitude:

(9)

where is a positive function between with as the number of parameters of the deterministic policy and is the Hadamard product (element-wise product).

The proof is provided in Appendix A.1. The consequence of this lemma is that, for a given state and low exploration, a local optimal solution for DPG will also be one for CAC. However, this is still not the case for the overall update because of the integral over the different states: the weights given to each direction over different states are not the same in CAC and DPG. One might think that in such a case, it would be better to use DPG. However, in practice, the CAC update may in fact be more accurate when using an approximate advantage function. Indeed, there exist cases where DPG with an approximate critic might update towards a direction which could decrease the performance. For instance, when the estimated advantage $\hat A(s, \mu(s))$ is negative, the advantage around $\mu(s)$ is therefore known to be poorly estimated. In such a case, thanks to the Heaviside function, CAC will not perform any update for actions $a$ in the neighborhood of $\mu(s)$ such that $\hat A(s, a) \le 0$. However, in such a case, DPG will still perform an update according to this poorly estimated gradient.

5 Extension to Trust Region

In this section, we extend the approach to use a trust region method.

5.1 Trust Region for Deterministic Policies

We now introduce a trust region method dedicated to continuous deterministic policies. Given current deterministic policy $\mu$, and an exploratory policy $\pi$ defined from $\mu$, the question is to find a new deterministic policy $\tilde\mu$ that improves upon $\mu$. Because a deterministic policy is usually never played in the environment outside of testing phases, a direct measure between two deterministic policies (i.e., a deterministic equivalent of Equation 4) is not directly exploitable. Instead we introduce the following measure:

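Lemma 5.1.

The performance $J(\tilde\mu)$ of a deterministic policy $\tilde\mu$ can be expressed by the advantage function of another stochastic policy $\pi$ built upon a deterministic policy $\mu$ as:

$$J(\tilde\mu) = J(\mu) + \int_{\mathcal{S}} d^\pi_\gamma(s) \int_{\mathcal{A}} \pi(a|s) \, A^\mu(s, a) \, da \, ds + \int_{\mathcal{S}} d^{\tilde\mu}_\gamma(s) \, A^\pi(s, \tilde\mu(s)) \, ds \qquad (10)$$

See Appendix A.2 for the proof. The first two quantities in the RHS of (10) are independent of $\tilde\mu$. The second one represents the difference of performance from moving from the deterministic policy $\mu$ to its stochastic version $\pi$. Because $d^{\tilde\mu}_\gamma$ would be too costly to estimate, we approximate it with the simpler quantity $d^\pi_\gamma$, as done by Schulman et al. [2015] for TRPO, a predecessor to PPO.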
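One way to read this lemma is as two successive applications of the performance-difference identity behind Equation 4. The sketch below is a reconstruction under that assumption rather than the authors' proof from Appendix A.2; it writes $\tilde\mu$ for the new deterministic policy, $\mu$ for the current one, and $\pi$ for the stochastic policy built upon $\mu$.

```latex
\begin{align*}
% Step 1: from the deterministic policy \mu to its stochastic version \pi
J(\pi) - J(\mu) &= \int_{\mathcal{S}} d^{\pi}_{\gamma}(s) \int_{\mathcal{A}} \pi(a|s)\, A^{\mu}(s,a)\, da\, ds, \\
% Step 2: from \pi to the new deterministic policy \tilde\mu
J(\tilde\mu) - J(\pi) &= \int_{\mathcal{S}} d^{\tilde\mu}_{\gamma}(s)\, A^{\pi}(s, \tilde\mu(s))\, ds, \\
% Summing both steps gives the decomposition of the lemma
J(\tilde\mu) - J(\mu) &= \big(J(\pi) - J(\mu)\big) + \big(J(\tilde\mu) - J(\pi)\big).
\end{align*}
```

In the second line the inner integral over actions collapses to a single evaluation because $\tilde\mu$ is deterministic.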

Theorem 5.2.

Given two deterministic policies $\mu$ and $\tilde\mu$, a stochastic Gaussian policy $\pi$ with mean $\mu(s)$ in state $s$ and independent variance $\sigma$, if the transition function $T$ is L-Lipschitz continuous with respect to the action from any state then:

where .

The proof is available in Appendix A.3. Thus, to ensure a stable improvement at each update, we need to keep both the distance between $\tilde\mu$ and $\mu$ and the exploration variance $\sigma$ small. Note that the Lipschitz continuity condition is natural in continuous action spaces. It simply states that for a given state, actions that are close will produce similar transitions.

5.2 Practical Algorithm

To obtain a concrete and efficient algorithm, the trust region method can be combined with the previous algorithms. Its integration to NFAC with a CAC update for the actor is called Penalized Neural Fitted Actor Critic (PeNFAC).

Van Hasselt and Wiering [2007] observed that the CAC update performs worse than the CACLA update in their algorithms. In their setting where the policy and the critic are updated at each timestep, we believe this observation is explained by the use of the TD error (computed from a single sample) to estimate the advantage function. However, when using variance reduction techniques such as $\lambda$-returns and learning from a batch of interactions, or when mitigating the update with a trust region constraint, we observe that this estimation becomes better (see Figure 4). This explains why we choose a CAC update in PeNFAC.

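In order to ensure that the distance between $\mu_\theta$ and the previous policy $\mu_{old}$ stays small over the whole state space, we approximate it with a Euclidean norm over the states visited by $\pi$. To implement this constraint, we add a regularization term to the update and automatically adapt its coefficient, for a trajectory $\tau$:

$$\sum_{s_t \in \tau} \Delta_{CAC}(s_t, \mu_\theta) + \beta \, \nabla_\theta \left\| \mu_{old}(s_t) - \mu_\theta(s_t) \right\|_2^2$$

where $\beta$ is a regularization coefficient. Similarly to the adaptive version of Proximal Policy Optimization (PPO) [Schulman et al.2017], $\beta$ is updated in the following way (starting from $\beta \leftarrow 1$):

  • if $\hat d(\mu_\theta, \mu_{old}) < d_{target} / 1.5$: $\beta \leftarrow \beta / 2$,

  • if $\hat d(\mu_\theta, \mu_{old}) > d_{target} \times 1.5$: $\beta \leftarrow \beta \times 2$,

where $\hat d(\mu_\theta, \mu_{old})$ is the empirical Euclidean distance between $\mu_{old}$ and $\mu_\theta$, averaged over the $L$ gathered states. Those hyper-parameters (1.5 and 2) are usually not optimized because the learning is not too sensitive to them. The essential value to adapt for the designer is $d_{target}$. Note that the introduction of this hyperparameter mitigates the need to optimize the learning rate for the update of the policy, which is generally a much harder task.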
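The coefficient adaptation can be sketched as a small function. The thresholds 1.5 and the factor 2 are the ones used by the adaptive-KL variant of PPO; assuming that PeNFAC reuses exactly these constants is a guess based on the "similarly to PPO" remark, and the names `adapt_beta`, `d_hat` and `d_target` are illustrative.

```python
def adapt_beta(beta, d_hat, d_target, tol=1.5, factor=2.0):
    """Adapt the penalty coefficient from the measured policy distance d_hat.

    tol and factor follow the adaptive-KL PPO heuristic (assumed here).
    """
    if d_hat < d_target / tol:
        # Policy moved less than allowed: relax the penalty.
        return beta / factor
    if d_hat > d_target * tol:
        # Policy moved too far: strengthen the penalty.
        return beta * factor
    return beta

beta = 1.0
relaxed = adapt_beta(beta, d_hat=0.001, d_target=0.01)
tightened = adapt_beta(beta, d_hat=0.1, d_target=0.01)
unchanged = adapt_beta(beta, d_hat=0.01, d_target=0.01)
```

Only the target distance needs tuning in this scheme; the coefficient itself drifts until the measured distance sits within the tolerance band around the target.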

6 Experiments

We performed two sets of experiments to answer the following questions:

1) How does PeNFAC compare with state-of-the-art algorithms for learning deterministic policies?

2) Which components of PeNFAC contribute the most to its performance?

The experiments were performed on environments with continuous state and action spaces in a time-discretized simulation. We chose to perform the experiments on OpenAI Roboschool [Schulman et al.2017], a free open-source software, which allows anyone to easily reproduce our experiments. In order to evaluate the performance of an algorithm, deterministic policies obtained during learning are evaluated at a constant interval during testing phases: policy is played in the environment without exploration. The interactions gathered during this evaluation are not available to any algorithms. The source code of the PeNFAC algorithm is available at github.com/matthieu637/ddrl. The hyperparameters used are reported in Appendix E as well as the considered range during the grid search.

6.1 Performance of PeNFAC

We compared the performance of PeNFAC to learn continuous deterministic policies with two state-of-the-art algorithms: PPO and DDPG. A comparison with NFAC is available in the ablation study (Section 6.2) and in Appendix C. Because PPO learns a stochastic policy, for the testing phases, we built a deterministic policy as follows: $\mu(s) = \mathbb{E}\left[a \mid a \sim \pi_\theta(\cdot|s)\right]$. We denote this algorithm as "deterministic PPO". In Appendix D, we experimentally show that this does not penalize the comparison with PPO, as deterministic PPO provides better results than standard PPO. For PPO, we used the OpenAI Baselines implementation. To implement PeNFAC and compare it with NFAC, we use the DDRL library [Zimmer et al.2018]. Given that DDPG is present in those two libraries, we provided the two performances for it. The OpenAI Baselines version uses an exploration in the parameter space and the DDRL version uses n-step returns.

Figure 1: Comparison of PeNFAC, DDPG and deterministic PPO over 60 different seeds for each algorithm in Hopper.
Figure 2: Comparison of PeNFAC, DDPG and deterministic PPO over 60 different seeds for each algorithm in HalfCheetah.
Figure 3: Comparison of PeNFAC, DDPG and deterministic PPO over 60 seeds for each algorithm in Humanoid.

We performed learning experiments over three high-dimensional domains: Hopper, HalfCheetah and Humanoid. Dimensions of are (Hopper), (HalfCheetah) and (Humanoid).

The neural network architecture is composed of two hidden layers of 64 units for either the policy or the value function. The choice of the activation function in the hidden units was optimized for each algorithm: we found that ReLU was better for all of them except for PPO (where tanh was better). The output activation of the critic is linear and the output activation of the actor is tanh.
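The architecture described above can be sketched as a plain NumPy forward pass. This only illustrates the layer shapes and activations; the initialization scheme, the state and action dimensions, and the helper names `init_mlp` and `forward` are assumptions, and batch normalization is omitted.

```python
import numpy as np

def init_mlp(rng, sizes):
    """Random weights and zero biases for a fully connected network."""
    return [(rng.standard_normal((n_in, n_out)) / np.sqrt(n_in), np.zeros(n_out))
            for n_in, n_out in zip(sizes[:-1], sizes[1:])]

def forward(params, x, out_activation=None):
    """ReLU hidden layers, then a linear or tanh output layer."""
    h = x
    for w, b in params[:-1]:
        h = np.maximum(h @ w + b, 0.0)
    w, b = params[-1]
    out = h @ w + b
    return np.tanh(out) if out_activation == "tanh" else out

rng = np.random.default_rng(0)
state_dim, action_dim = 15, 3    # hypothetical sizes, for illustration only

actor = init_mlp(rng, [state_dim, 64, 64, action_dim])   # tanh output
critic = init_mlp(rng, [state_dim, 64, 64, 1])           # linear output

s = rng.standard_normal(state_dim)
action = forward(actor, s, out_activation="tanh")
value = forward(critic, s)
```

The tanh output keeps every action component within [-1, 1], while the critic's linear output leaves the value estimate unbounded.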

In Figures 1-4, the lighter shade depicts one standard deviation around the average, while the darker shade is the standard deviation divided by the square root of the number of seeds.
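The two shaded bands correspond to the standard deviation and the standard error of the mean across seeds. A minimal sketch with synthetic scores (the array of results is made up; only the 60-seed count comes from the text):

```python
import numpy as np

rng = np.random.default_rng(0)
n_seeds, n_evals = 60, 50
# Synthetic evaluation scores: one row per seed, one column per testing phase.
scores = rng.normal(loc=1000.0, scale=200.0, size=(n_seeds, n_evals))

mean = scores.mean(axis=0)
std = scores.std(axis=0, ddof=1)     # lighter band: mean +/- std
sem = std / np.sqrt(n_seeds)         # darker band: mean +/- std / sqrt(n_seeds)
```

The darker band shrinks as more seeds are added, whereas the lighter band reflects the run-to-run spread and does not.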

In Figures 1-3, PeNFAC outperforms DDPG and deterministic PPO during the testing phase. On Humanoid, even after optimizing the hyperparameters, we could not obtain the same results as those of Schulman et al. [2017]. We conjecture that this may be explained as follows: 1) the RoboschoolHumanoid moved from version 0 to 1, 2) deterministic PPO might be less efficient than PPO, 3) neither LinearAnneal for the exploration, nor adaptive Adam step size is present in the OpenAI Baselines implementation. However, we argue that the comparison should still be fair since PeNFAC also does not use those two components. On Humanoid, we did not find a set of hyperparameters where DDPG could work correctly with either implementation.

Figure 4: Comparison of the different components ($\lambda$-returns, fitted value-iteration, CAC vs CACLA update, batch normalization) of the PeNFAC algorithm during the testing phase over the HalfCheetah environment and 60 seeds for each version.

6.2 Components of PeNFAC

In Figure 4, we present an ablation analysis in the HalfCheetah domain to understand which components of the PeNFAC algorithm are the most essential to its good performance. From top to bottom plots of Figure 4, we ran PeNFAC with or without trust region, with or without $\lambda$-returns, with or without fitted value iteration, with CACLA update or CAC update, and finally with or without batch normalization.

It appears that $\lambda$-returns and fitted value iteration are the most needed, while the effect of batch normalization is small and mostly helps in the beginning of the learning.

We also tried updating the actor every timestep without taking into account the sign of the advantage function (i.e., using SPG instead of CAC), but the algorithm was not able to learn at all. This also demonstrates that the CAC update is an essential component of PeNFAC.

7 Conclusion

In the context of learning deterministic policies, we studied the properties of two not very well-known but efficient updates, Continuous Actor Critic Learning Automaton (CACLA) and Continuous Actor Critic (CAC). We first showed how closely both are related to the stochastic policy gradient (SPG). We explained why they are well designed to learn continuous deterministic policies when the value function is only approximated. We also highlighted the limitations of those methods: a potentially poor sample efficiency when the dimension of the action space increases, and no guarantee that the underlying deterministic policy will converge toward a local optimum of $J(\mu)$ even with a linear approximation.

In the second part, we extended Neural Fitted Actor Critic (NFAC), itself an extension of CACLA, with a trust region constraint designed for deterministic policies and proposed a new algorithm, Penalized NFAC (PeNFAC). Finally, we tried our implementation on various high-dimensional continuous environments and showed that PeNFAC performs better than DDPG and PPO to learn continuous deterministic policies.

As future work, we plan to consider off-policy learning and the combination of the updates of CAC and DPG together to ensure the convergence toward a local optimum while benefiting from the good updates of CAC.

Acknowledgments

This work has been supported in part by the program of National Natural Science Foundation of China (No. 61872238). Experiments presented in this paper were carried out using the Grid’5000 testbed, supported by a scientific interest group hosted by Inria and including CNRS, RENATER and several Universities as well as other organizations (see https://www.grid5000.fr).

References

  • [Colas et al.2018] Cédric Colas, Olivier Sigaud, and Pierre-Yves Oudeyer. GEP-PG: Decoupling Exploration and Exploitation in Deep Reinforcement Learning Algorithms. In International Conference on Machine Learning (ICML), Stockholm, Sweden, July 2018.
  • [Henderson et al.2018] Peter Henderson, Riashat Islam, Philip Bachman, Joelle Pineau, Doina Precup, and David Meger. Deep reinforcement learning that matters. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
  • [Ioffe and Szegedy2015] Sergey Ioffe and Christian Szegedy. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. arXiv preprint arXiv:1502.03167, 2015.
  • [Kakade and Langford2002] Sham Kakade and John Langford. Approximately optimal approximate reinforcement learning. In ICML, volume 2, pages 267–274, 2002.
  • [Kingma and Ba2015] Diederik P. Kingma and Jimmy L. Ba. Adam: a Method for Stochastic Optimization. International Conference on Learning Representations, pages 1–13, 2015.
  • [Konda and Tsitsiklis1999] Vijay R. Konda and John N. Tsitsiklis. Actor-Critic Algorithms. Neural Information Processing Systems, 13:1008–1014, 1999.
  • [Lillicrap et al.2016] Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. ICLR, 2016.
  • [Matyas1965] J Matyas. Random optimization. Automation and Remote control, 26(2):246–253, 1965.
  • [Mnih et al.2015] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei a Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
  • [Prokhorov and Wunsch1997] Danil V. Prokhorov and Donald C. Wunsch. Adaptive critic designs. IEEE Transactions on Neural Networks, 8(5):997–1007, 1997.
  • [Schulman et al.2015] John Schulman, Sergey Levine, Michael Jordan, and Pieter Abbeel. Trust Region Policy Optimization. International Conference on Machine Learning, page 16, 2015.
  • [Schulman et al.2017] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. CoRR, abs/1707.06347, 2017.
  • [Silver et al.2014] David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and Martin Riedmiller. Deterministic Policy Gradient Algorithms. Proceedings of the 31st International Conference on Machine Learning, pages 387–395, 2014.
  • [Smith2005] Russell Smith. Open dynamics engine. 2005.
  • [Sutton et al.1999] Richard S. Sutton, David Mcallester, Satinder Singh, and Yishay Mansour. Policy Gradient Methods for Reinforcement Learning with Function Approximation. In Advances in Neural Information Processing Systems 12, pages 1057–1063, 1999.
  • [Sutton1988] Richard S. Sutton. Learning to predict by the methods of temporal differences. Machine learning, 3(1):9–44, 1988.
  • [Van Hasselt and Wiering2007] Hado Van Hasselt and Marco A. Wiering. Reinforcement learning in continuous action spaces. In Proceedings of the IEEE Symposium on Approximate Dynamic Programming and Reinforcement Learning, pages 272–279, 2007.
  • [Zimmer et al.2016] Matthieu Zimmer, Yann Boniface, and Alain Dutech. Neural Fitted Actor-Critic. In European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, 2016.
  • [Zimmer et al.2018] Matthieu Zimmer, Yann Boniface, and Alain Dutech. Developmental reinforcement learning through sensorimotor space enlargement. In The 8th Joint IEEE International Conference on Development and Learning and on Epigenetic Robotics, September 2018.

Appendix A Proofs

A.1 Relation between DPG and CAC update for a given state

For simplification, the proof of a single dimension of the parameter space is provided. To denote the $j$th dimension of a vector $x$, we write $x_j$. If $x$ is a matrix, $x_{:,j}$ represents the $j$th column vector. We will use the following result from Silver et al. [2014]:

Thus, the following standard regularity conditions are required: are continuous in all variables and bounded. From this result, we derive the following equation for a fixed state :

We first study the special case of and want to show that is also zero:

Now, we study the more general case :

A.2 Performance of a deterministic policy expressed from a Gaussian stochastic policy

The proof is very similar to [Kakade and Langford2002, Schulman et al.2015] and easily extends to mixtures of stochastic and deterministic policies:

A.3 Trust region for continuous deterministic policies

For this theorem we also use the following standard regularity conditions: and . denotes the number of dimension of the action space. We start from the two terms we want to bound:

(11)

where .

So, we need to bound the difference between and for a given state :

(12)

Finally, we have to bound the difference between and . To do so, we define , and all the possible path from the state to the state .

(13)
(14)
(15)

To obtain (13), we use the assumption that the transition function is L-Lipschitz continuous with respect to the action and the L2 norm. To obtain (14), we use (5). Equation 15 does no longer depend on and , thus added to (12) and (11) it gives:

(16)

To obtain (16), we suppose that is smaller than 1. We can make this assumption without losing in generality: it would only affect the magnitude of the Lipschitz constant. Thus if stays smaller than , the optimal will be , and (16) could be reduced to:

Appendix B Additional experiments on CACLA’s update

In those two experiments, we want to highlight the good performance of CACLA compared to SPG and DPG without neural networks. The main argument to use DPG instead of SPG is its efficiency when the action dimensions become large. In the first experiment, we study if CACLA suffers from the same variance problem as SPG. The second experiment supports our claim that CACLA is more robust than SPG and DPG when the approximation made by the critic is less accurate.

B.1 Sensitivity to action space dimensionality

We used a setup similar to that of Silver et al. [2014]: those environments contain only one state and the horizon is fixed to one. They are designed such that the dimensionality of the action space can easily be controlled but there is only little bias in the critic approximation. The policy parameters directly represent the action: $\mu_\theta(s) = \theta$.

Compatible features are used to learn the Q value function for both SPG and DPG. For CACLA, the value function V is approximated through a single parameter. The Gaussian exploration noise and the learning rate of both the critic and actor have been optimized for each algorithm on each environment. In Figure 5, similarly to Silver et al. [2014], we observe that SPG is indeed more sensitive to larger action dimensions. CACLA is also sensitive to this increase in dimensionality but not as much as SPG. Finally, we also note that even if the solutions of CACLA and DPG are not exactly the same theoretically, they are very similar in practice.

Figure 5: Comparison of DPG, SPG and CACLA over three domains with 100 seeds for each algorithm. On the left, the action dimension is 5; on the right, it is 50.

B.2 Robustness to the critic approximation errors

Compared to the previous experiment, we introduce a bigger bias in the approximation of the critic by changing the application domains: the horizon is deeper and there is an infinite number of states. The policy is represented as $\mu_\theta(s) = \phi(s) \cdot \theta$ where $\phi(s)$ are tile coding features.

Figure 6: Comparison of CACLA, DPG and SPG over two environments of OpenAI Gym and one environment of Roboschool (60 seeds are used for each algorithm).

In Figure 6, we observe that as soon as value functions become harder to learn, CACLA performs better than both SPG and DPG.

Appendix C Broader comparison between PeNFAC and NFAC

To avoid overloading previous curves, we did not report the performance of NFAC (except in the ablation study on the HalfCheetah environment). In Figure 7, we extend this study to two other domains of Roboschool: Hopper and Humanoid.

Figure 7: Comparison of PeNFAC and NFAC over RoboschoolHopper and RoboschoolHumanoid with 60 seeds for each algorithm.

We observe that PeNFAC is significantly better than NFAC which demonstrates the efficiency of the trust region update combined with CAC.

Appendix D Impact of evaluating PPO with a deterministic policy

Figure 8: Comparison of evaluating PPO with a deterministic policy instead of the stochastic policy produced by PPO.

In Figure 8, we observe that using a deterministic policy to evaluate the performance of PPO is not penalizing. This is the only experiment of the paper where deterministic policies and stochastic policies are compared.

Appendix E Hyperparameters

For the sake of reproducibility [Henderson et al.2018], the hyperparameters used during the grid search are reported here. In Tables 2-5, "ho", "ha" and "hu" stand respectively for Hopper, HalfCheetah, and Humanoid Roboschool environments.

Actor network
Critic network
Actor output activation TanH
Table 1: Set of hyperparameters used during the training with every algorithm.
Network hidden activation , TanH
Actor learning rate , ,
Critic learning rate , ,
Batch norm first layer of the actor
, ,
ADAM ,
Number of ADAM iteration (actor) 10, ,
Number of ADAM iteration (critic) 1
, , , , ,
(Truncated Gaussian law) , , , ,
Number fitted iteration , , ,
Update each episodes , , , , , , , , ,
Table 2: Set of hyperparameters used during the training with PeNFAC.
Network hidden activation , ReLu, Leaky ReLU (0.01)
Layer norm no
ADAM
Entropy coefficient 0
Clip range 0.2
Learning rate
nminibatches ,
noptepochs , , ,
nsteps , , , , ,
sample used to make the policy more deterministic 15
Table 3: Set of hyperparameters used during the training with PPO.
Network hidden activation
Actor learning rate
Critic learning rate
Batch norm first layer of the actor
ADAM ,
L2 regularization of the critic ,
Exploration Gaussian (),
Mini batch size , ,
Reward scale , ,
Soft update of target networks ,
Replay memory
N-step returns ,
Table 4: Set of hyperparameters used during the training with DDPG (DDRL implementation).
Network hidden activation , TanH
Actor learning rate
Critic learning rate
Layer norm no
ADAM
L2 regularization of the critic
Exploration Ornstein Uhlenbeck (),
Mini batch size
Reward scale ,
Soft update of target networks ,
Replay memory
nb_rollout_steps ,
nb_train_steps ,,
Table 5: Set of hyperparameters used during the training with DDPG (OpenAI baselines implementation).