FiDi-RL: Incorporating Deep Reinforcement Learning with Finite-Difference Policy Search for Efficient Learning of Continuous Control

07/01/2019 ∙ by Longxiang Shi, et al. ∙ Zhejiang University

In recent years, significant progress has been made on challenging problems using reinforcement learning. Despite this success, reinforcement learning still faces challenges in continuous control tasks. Conventional methods compute the derivatives of the optimization objective at a high computational cost, and are often inefficient, unstable and lacking in robustness on such tasks. Alternatively, derivative-free methods treat the optimization process as a black box and show robustness and stability in learning continuous control tasks, but they are not data efficient. Combining the two approaches to get the best of both has attracted attention. However, most existing combinations adopt complex neural networks (NNs) as the control policy. Deep NNs are a double-edged sword: they can yield better performance, but they also make parameter tuning and computation difficult. To this end, in this paper we present a novel method called FiDi-RL, which incorporates deep RL with Finite-Difference (FiDi) policy search. FiDi-RL combines Deep Deterministic Policy Gradients (DDPG) with Augmented Random Search (ARS) and aims at improving the data efficiency of ARS. The empirical results show that FiDi-RL can improve the performance and stability of ARS, and provides competitive results against some existing deep reinforcement learning methods.




I Introduction

Recent years have witnessed thriving development in reinforcement learning (RL). Significant progress has been made in areas such as playing the game of Go [1, 2], Atari [3], robotics [4], and real-time strategy games [5, 6]. Despite this success, reinforcement learning still faces challenges in physical control tasks, which have real-valued continuous action spaces. Conventional reinforcement learning methods for continuous policy search seek to approximate the optimal policy by estimating the gradient of the expected return from sampled trajectories. Well-known methods such as vanilla policy gradient [7], (deep) deterministic policy gradient (DPG/DDPG) [8, 4], Asynchronous Advantage Actor-Critic (A3C) [9], Trust Region Policy Optimization (TRPO) [10] and Proximal Policy Optimization (PPO) [11] are widely adopted in practice. Those methods have shown promising results in continuous control tasks such as the MuJoCo locomotion tasks [12]. Unfortunately, existing gradient methods are not robust to the environment and are sensitive to hyperparameters and even random seeds [13, 14]. In addition, gradient estimation is computationally costly and not easy to parallelize [15], which narrows its application.

In contrast, an alternative approach for solving continuous control tasks without computing gradients is based on random search theory. Those methods, also known as "black-box optimization", treat the optimization process as a black box and only use the return of each simulation to search for the optimal policy directly, i.e., direct policy search [16]. Gradient-free methods such as the Cross-Entropy Method (CEM) [17], Evolution Strategies (ES) [18, 15] and the Finite Difference Method (FDM) [14, 19] offer low computation cost, competitive results and fast training, while also being easy to understand [10]. Although gradient-free methods are robust to the environment and easy to run [14, 15], they suffer from low sample efficiency. Unlike derivative-based methods, which can learn from each elementary step of the simulated path repeatedly [9], gradient-free methods discard the numerous generated trajectories after evaluating the estimated return.

Recently, combining gradient-free and gradient-based methods to get the best of both has attracted attention. For example, GEP-PG [20] adopts a goal exploration process to fill the replay buffer and then applies DDPG to learn the policy. Their algorithm is more sample efficient than DDPG. Evolutionary Reinforcement Learning (ERL) [21] presents an efficient combination of ES and DDPG. The DDPG policy is periodically inserted into the population of ES, and the result outperforms both methods. CEM-RL [22] combines Covariance Matrix Adaptation (CMA) with deep RL methods, and the results show that learning can be faster and better. Maheswaranathan et al. [23] theoretically analyze optimization problems with a surrogate gradient and incorporate ES with such a surrogate gradient. However, most of the existing combinations adopt complex neural networks (NNs) as the control policy. Deep NNs are a double-edged sword: they can yield better performance, but they also introduce numerous hyperparameters defining the network structure, which makes parameter tuning difficult. In addition, backpropagation through such networks is computationally costly when performing gradient-based learning. Nevertheless, [24, 14] have shown that a simple linear policy representation is enough to control those tasks.

To address this issue, in this paper we propose FiDi-RL, which incorporates deep RL insights with Finite-Difference (FiDi) policy search based on a linear representation of policies. FiDi-RL combines DDPG with ARS and aims at improving the data efficiency of ARS. We evaluate FiDi-RL on the well-known MuJoCo locomotion tasks [12]. Our empirical results show that FiDi-RL can accelerate learning compared with the original ARS and DDPG, and can improve the performance and stability of ARS as well. In addition, FiDi-RL also shows competitive results against popular policy search methods such as CEM, PPO and TRPO.

The rest of this paper is organized as follows: Section II discusses the related work. Section III presents the preliminaries of reinforcement learning for continuous control and the problem definition of this paper. Section IV introduces the proposed framework. Section V provides experimental analysis, followed by conclusion and discussion in Section VI.

II Related Works

Here, we summarize some related works. We divide the related works into 3 categories:

II-A Gradient-based Methods for Continuous Control

Various works on derivative-based reinforcement learning have been proposed for solving continuous control tasks. In this part we mainly focus on policy gradient methods. Since conventional policy gradient methods such as REINFORCE [7] are too fragile to scale to difficult problems [25] [26], many new policy gradient methods have been proposed with the development of deep learning. For example, [27] and [28] use an actor-critic framework to train stochastic policies with experience replay. Inspired by off-policy actor-critic [29], Deterministic Policy Gradient (DPG) [8] and its derivatives play an important role among RL methods for continuous control. However, the original DPG was only evaluated on several simple toy environments, not on complex environments with high-dimensional observation spaces. The most popular version, the Deep Deterministic Policy Gradient algorithm (DDPG) [4], integrates DPG with the insights of DQN [3] and has been successfully adopted in many robotic control tasks. Despite its success, DDPG suffers from low stability and a lack of sample efficiency. To address this issue, Q-prop [30] was proposed, using a Taylor expansion of the off-policy critic as a control variate, which improves the stability and sample efficiency of DDPG. TD3 [31] was designed to overcome the overestimation problem in DDPG, and improves the performance as well. In addition, Soft Actor-Critic [32] was proposed based on the maximum entropy reinforcement learning framework, which further improves the stability of DDPG.

Another approach is Trust Region Policy Optimization (TRPO) [10], which directly builds stochastic policies without the actor-critic framework. TRPO produces near-monotonic improvements in return by making careful updates to the policy. However, as no action-value function is learned, TRPO appears to be less data efficient [4]. Based on TRPO, ACKTR [33] uses a Kronecker-factored approximation and proposes an actor-critic based TRPO, which improves the performance and data efficiency of TRPO. Furthermore, Proximal Policy Optimization (PPO) [11] improves TRPO by alternating between sampling and optimizing the policy, which improves the sample efficiency of TRPO.

To conclude, most of the current gradient-based reinforcement learning methods for continuous control face problems such as instability and high computation cost. To overcome those issues, the state-of-the-art methods have become more and more complicated.

II-B Gradient-free Methods for Continuous Control

As an alternative to gradient-based RL, most gradient-free methods for continuous control can be categorized into 3 classes:

  1. Evolution strategies (ES): As a popular black-box optimization approach, ES adopts genetic algorithms to optimize the policy. The simplest form optimizes the policy directly using ES [34][35]. Covariance Matrix Adaptation (CMA) is another well-known ES method, which combines ES with the ideas of derandomization and cumulation [36][37]. [38] gives a review of ES for learning various tasks. Recent work on ES [15] shows that ES can learn continuous tasks with high scalability and efficiency.

  2. Cross-entropy methods (CEM): Similar to CMA, CEM evolves the policy by generating the parameters with high reward [39][40]. For example, [41] proposes a CEM method with adaptive basis functions. [42] uses CEM to learn policies in decentralized partially observable Markov decision process (POMDP) environments. CEM can learn very fast, but is also easily trapped in local optima.

  3. Finite difference methods (FDM): FDM uses finite differences to approximate the gradients and then optimizes the policy by gradient ascent/descent. For example, Pegasus [19] is based on FDM and was proposed for solving POMDP problems. Recently, [14] proposed ARS, an efficient method for learning continuous control tasks with low computation cost and high efficiency.

In summary, although gradient-free methods can learn the policy fast with low computation cost, the generated trajectories are only used once to obtain the rewards for evaluation.

II-C Combining Gradient-based Methods with Gradient-free Methods

Several works have explored the combination of gradient-based and gradient-free methods. For example, Goal Exploration Process-Policy Gradient (GEP-PG) [20] adopts a goal exploration process to fill the replay buffer and then uses DDPG to learn the policy. The GEP is very close to evolutionary methods. Their experiments show that GEP-PG is more sample-efficient and has lower variance than DDPG. However, their combination does not improve the efficiency of the gradient update of DDPG. Evolutionary Reinforcement Learning (ERL) [21] introduces a hybrid algorithm that periodically inserts the DDPG agent into the evolutionary optimization process, and improves the stability and efficiency of learning and exploration. However, ERL does not benefit from the search efficiency of ES. Their experimental results also show high variance in some tasks. Similar to our work, the authors of [23] analyze optimization problems with a surrogate gradient, and show that incorporating ES with a surrogate gradient can improve the performance and efficiency of traditional RL methods. However, their experiments were only performed on simple tasks and lack practical demonstrations. In addition, CEM-RL [22] is also very close to our work. In that work, CMA and the gradient-based methods DDPG/TD3 are combined. Their results show that learning can be accelerated with CEM-RL. However, CEM-RL uses neural networks for policy representation, and their experimental results show that the performance is sensitive to the structure of the policy networks.

III Preliminaries and Problem Settings

In this section, we introduce some basic concepts, notation and our target problems.

III-A Preliminaries

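Reinforcement learning problems can be formulated as a Markov decision process (MDP) $(\mathcal{S}, \mathcal{A}, \gamma, P)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $\gamma$ is the discount factor and $P$ is the transition function that maps each state-action pair to some distribution over $\mathcal{S}$. We consider the standard reinforcement learning setup: an agent interacts with the environment in discrete time steps; at each time step $t$, the agent observes a state $s_t$, takes an action $a_t$ under some policy $\pi$, and receives a scalar reward $r_t$. A policy $\pi$ describes the agent's behavior, which is a probability distribution that maps a state to an action:

$$\pi: \mathcal{S} \rightarrow \mathcal{P}(\mathcal{A})$$

The return from a state is defined as the total discounted future reward: $R_t = \sum_{i=t}^{T} \gamma^{i-t} r_i$, where $s_T$ is the terminal state. The state-action value $Q^\pi$ is a mapping from $\mathcal{S} \times \mathcal{A}$ to $\mathbb{R}$, which describes the expected discounted future reward when taking action $a_t$ at observation $s_t$ and following policy $\pi$ thereafter:

$$Q^\pi(s_t, a_t) = \mathbb{E}_\pi \left[ R_t \mid s_t, a_t \right]$$

The goal of reinforcement learning is to find an optimal policy $\pi^*$ which maximizes the expected return from the starting state:

$$\pi^* = \arg\max_\pi \mathbb{E}_{s_i \sim \rho^\pi, a_i \sim \pi} \left[ R_1 \right]$$

Here $\rho^\pi$ is the state distribution under policy $\pi$.

The Bellman equation describes the recursive relationship in the state-action value:

$$Q^\pi(s_t, a_t) = \mathbb{E}_{s_{t+1}} \left[ r_t + \gamma \, \mathbb{E}_{a_{t+1} \sim \pi} \left[ Q^\pi(s_{t+1}, a_{t+1}) \right] \right]$$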
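The return and the one-step Bellman target defined above can be illustrated with a short sketch. The function names are hypothetical and are not taken from the paper; the code only mirrors the standard definitions of a discounted return and a bootstrapped target.

```python
def discounted_return(rewards, gamma):
    """Total discounted reward of a finished episode, starting at its first step."""
    total = 0.0
    # Work backwards so each step adds its reward to the discounted future.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


def bellman_target(reward, next_q, gamma, terminal=False):
    """One-step target r + gamma * Q(s', a'); no bootstrapping at a terminal state."""
    return reward if terminal else reward + gamma * next_q
```

Computing the return backwards avoids recomputing powers of the discount factor, and the `terminal` flag reflects that no future reward follows the terminal state.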

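One popular RL method for continuous tasks is Deep Deterministic Policy Gradient (DDPG) [4], which adopts the actor-critic architecture in policy gradient. The actor of DDPG optimizes the policy directly using the deterministic policy gradient theorem [8]. Denoting $\mu_\theta$ and $Q_w$ as the parameterized policy and the action-value function respectively, the gradient of the loss function $J(\mu_\theta)$ can be calculated as:

$$\nabla_\theta J(\mu_\theta) = \mathbb{E}_{s \sim \rho^\mu} \left[ \nabla_\theta \mu_\theta(s) \, \nabla_a Q_w(s, a) \big|_{a = \mu_\theta(s)} \right] \qquad (1)$$

The loss of the critic function can be calculated as:

$$L(w) = \mathbb{E}_{s_t, a_t, r_t, s_{t+1}} \left[ \left( r_t + \gamma Q_w(s_{t+1}, \mu_\theta(s_{t+1})) - Q_w(s_t, a_t) \right)^2 \right] \qquad (2)$$

During learning, gradient ascent is performed on the actor function to maximize $J(\mu_\theta)$, while gradient descent is performed on the critic function to minimize $L(w)$.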
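For a linear policy, the actor gradient and critic loss described above take a particularly simple form. The following is a minimal sketch, assuming a policy of the form a = M s and a critic supplied as plain Python callables; the names `actor_gradient`, `critic_loss`, `grad_q_wrt_action` and `q_fn` are hypothetical, and the sketch omits the target networks used in the original DDPG.

```python
import numpy as np


def actor_gradient(M, states, grad_q_wrt_action):
    """Deterministic policy gradient for a linear policy a = M @ s.

    grad_q_wrt_action(s, a) returns dQ/da at (s, a). By the chain rule,
    dQ/dM is the outer product of dQ/da and s, averaged over the states.
    """
    grads = [np.outer(grad_q_wrt_action(s, M @ s), s) for s in states]
    return np.mean(grads, axis=0)


def critic_loss(q_fn, M, batch, gamma):
    """Mean squared one-step TD error over (s, a, r, s_next) transitions."""
    errors = [r + gamma * q_fn(s2, M @ s2) - q_fn(s, a) for s, a, r, s2 in batch]
    return float(np.mean(np.square(errors)))
```

As a usage illustration, with a toy critic whose action gradient is `-2 * (a - W @ s)`, repeated ascent steps `M += lr * actor_gradient(...)` move `M` toward `W`.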

Recently, a new method named Augmented Random Search (ARS) [14] was proposed, aiming at solving continuous control tasks with high efficiency and low computation cost. As a gradient-free algorithm, ARS optimizes the policy directly by calculating a finite-difference approximation along random directions. Denoting $\xi$ as the unknown dynamics of the environment, $\delta$ as the generated random variables, $r(\pi_\theta, \xi)$ as the total discounted reward received by evaluating policy $\pi_\theta$ and $\nu$ as the exploration range, the loss function can be expressed as:

$$J(\theta) = \mathbb{E}_{\delta} \, \mathbb{E}_{\xi} \left[ r(\pi_{\theta + \nu\delta}, \xi) \right] \qquad (3)$$

The gradient can be approximated with:

$$\nabla_\theta J(\theta) \approx \frac{r(\pi_{\theta + \nu\delta}, \xi_1) - r(\pi_{\theta - \nu\delta}, \xi_2)}{\nu} \, \delta \qquad (4)$$

Based on the normalization applied to the observations and gradients, ARS has 4 versions: ARS-V1, ARS-V2, ARS-V1-t and ARS-V2-t. In ARS-V1, the gradients from Equation 4 are normalized by dividing by the standard deviation of the collected rewards. In ARS-V2, state normalization is added to ARS-V1. In the versions with the suffix -t, only the policies with the top $b$ rewards are selected to compute the gradients.
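The finite-difference update described above can be sketched as follows. This is an illustrative reading of an ARS-style step with top-direction selection and reward-standard-deviation scaling, not the authors' implementation; `ars_step`, `evaluate` and the default hyperparameter values are assumptions made for the example, and state normalization is left out.

```python
import numpy as np


def ars_step(theta, evaluate, rng, n_dirs=8, top_b=4, alpha=0.02, nu=0.05):
    """One ARS-style update of the parameter array theta.

    evaluate(params) is assumed to return the episode return of the policy
    with the given parameters.
    """
    deltas = rng.standard_normal((n_dirs,) + theta.shape)
    r_plus = np.array([evaluate(theta + nu * d) for d in deltas])
    r_minus = np.array([evaluate(theta - nu * d) for d in deltas])
    # Keep the b directions whose better perturbation scored highest.
    top = np.argsort(-np.maximum(r_plus, r_minus))[:top_b]
    # Scale by the standard deviation of the returns that are actually used.
    sigma = np.concatenate([r_plus[top], r_minus[top]]).std() + 1e-8
    diff = (r_plus[top] - r_minus[top]).reshape((-1,) + (1,) * theta.ndim)
    step = (diff * deltas[top]).sum(axis=0) / (top_b * sigma)
    return theta + alpha * step
```

Each direction is evaluated twice, once with a positive and once with a negative perturbation, so one call costs `2 * n_dirs` episodes; this is why the sample efficiency of the plain method depends entirely on how informative each episode return is.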

III-B Problem Settings

In this paper we consider the standard model-free reinforcement learning setting in control, in which an agent interacts with the environment. We focus specifically on continuous control tasks, where the action spaces are real-valued: $\mathcal{A} \subseteq \mathbb{R}^p$. The policy is parameterized by some function $\pi_\theta$. We aim to find the optimal policies that maximize the average return under some environment variables $\xi$:

$$\max_\theta \, \mathbb{E}_\xi \left[ r(\pi_\theta, \xi) \right]$$

To address the above issues with deep neural networks, our goal is to find an optimal policy with a linear representation using both gradient-based and gradient-free methods.

IV FiDi-RL: Combining Deep Deterministic Policy Gradient with Augmented Random Search

In this section, we introduce FiDi-RL in detail. We first illustrate the overall architecture and then describe the algorithm in detail.

IV-A Architecture

Proposed in 2018 by Mania et al., ARS has shown great potential in solving continuous control tasks with simple linear policies and low computation cost. Compared with conventional RL methods, ARS optimizes the policy by performing random search on simulated trajectories. However, ARS can only learn from complete episodes and lacks data efficiency. In contrast, off-policy learning methods can learn from the elementary steps repeatedly and can greatly improve the sample efficiency. Therefore, in this paper we incorporate ARS with off-policy RL methods. Since ARS optimizes the policy directly, an off-policy policy gradient method is required for the combination. Popular off-policy policy gradient methods such as DPG, DDPG, TD3 and algorithms derived from them are capable of relearning from the trajectories. Our FiDi-RL architecture is able to integrate ARS with those kinds of RL methods, but for convenience of expression, we use DDPG as an example here.

As illustrated before, ARS estimates the gradients of the policy by evaluating the policies with some designated random noise. In contrast, DDPG computes the gradients based on the deterministic policy gradient theorem [8] through off-policy learning. At each iteration, both methods use a single policy for optimization, and the overall optimization goal of the two methods is essentially the same. Therefore, we can use a linear combination to describe the integrated loss function of FiDi-RL as below:


Since the loss functions of both methods aim at maximizing the total expected discounted reward, their linear combination holds the same optimization goal as well. A newly introduced parameter denotes the update proportion of DDPG, which serves as a control variable for the trade-off between gradient-free learning and gradient-based learning. Based on this equation, the gradient of the FiDi-RL loss function can be calculated as a weighted summation of the gradients of the two methods:
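One possible reading of this linear combination is written out here for concreteness. The symbol $\beta$ for the DDPG update proportion, the subscripts, and the convex $(1-\beta)$ / $\beta$ weighting are all assumptions made for illustration; the exact weighting used by the authors may differ.

```latex
% Assumed notation: \beta \in [0, 1] is the DDPG update proportion,
% J_{ARS} is the random-search objective and J_{DDPG} the actor objective.
J_{\mathrm{FiDi}}(\theta) = (1 - \beta)\, J_{\mathrm{ARS}}(\theta) + \beta\, J_{\mathrm{DDPG}}(\theta)

% The gradient follows by linearity, with the ARS term replaced by its
% finite-difference estimate \hat{g}_{\mathrm{ARS}}.
\nabla_\theta J_{\mathrm{FiDi}}(\theta) \approx (1 - \beta)\, \hat{g}_{\mathrm{ARS}} + \beta\, \nabla_\theta J_{\mathrm{DDPG}}(\theta)
```

Under this reading, setting $\beta = 0$ recovers a pure random-search update and $\beta = 1$ a pure deterministic policy gradient update.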


As DDPG is an off-policy RL method, the loss functions of the actor and critic can be calculated using pre-collected data from the replay buffer. Therefore, we can directly fill the replay buffer of DDPG with the trajectories generated by ARS. Since ARS performs exploration by adding parameter-space noise, such integration can also facilitate the exploration of DDPG [43]. The overall framework of FiDi-RL is illustrated in Figure 1. Note that the critic function is not mandatory in our framework. The return can also be evaluated by the Monte-Carlo method based on the sampled trajectories [7, 8], which can further reduce the computation cost. By performing gradient ascent iteratively, the policy can be improved.

Our architecture differs from the existing combination works [20, 21, 22] in several ways. First, we use a single policy during the whole learning process. Because no elites need to be drawn from a population, learning can be more efficient. Second, the combination between gradient-based and gradient-free methods is conducted through a joint loss function. The optimization is performed through both gradient-free and gradient-based learning, thus making the optimization process more direct.

Fig. 1: Architecture of FiDi-RL

IV-B Algorithm

Based on the architecture, we give the pseudocode of the prototype algorithm in Algorithm 1. We use linear policies for learning, denoting by $M$ the policy matrix, which is a $p \times n$ matrix for a $p$-dimensional action and an $n$-dimensional state.

For each episode of learning, we first generate rollouts and store them in the experience replay buffer. We then use the ARS algorithm (either V1, V2 or V1-t, V2-t) to compute the gradients w.r.t. the current policy. Subsequently, the DDPG agent learns from the generated rollouts to optimize both the actor and the critic function. The initial actor function for learning is identical to the original policy at the beginning of the episode. The gradients of the policy are accumulated over the learning steps of the derivative-based method. Finally, the gradients of the two methods are aggregated.

Input: ARS learning rate, number of directions sampled per iteration of ARS, standard deviation of the exploration noise, number of top-performing directions to use, number of DDPG learning steps, DDPG update coefficient, DDPG critic learning rate, experience replay refresh period.
Initialize: initial policy, randomly initialized critic network, experience replay buffer.
for each iteration do
   Generate perturbed policies from random directions with i.i.d. standard normal entries.
   Collect rollouts based on the policies and evaluate the rollouts by their total discounted rewards.
   Store the rollouts in the experience replay buffer of DDPG.
   Compute the ARS gradient of the policy based on Equation 4.
   Initialize the DDPG gradient.
   for each DDPG learning step do
      Calculate the actor gradient based on Equation 1 for the current policy.
      Accumulate the actor gradient into the DDPG gradient.
      Update the critic network by gradient descent with the critic learning rate, using the loss function in Equation 2.
   end for
   Update the policy with the weighted sum of the ARS gradient and the DDPG gradient.
   if the iteration number is a multiple of the refresh period then
      Flush the replay buffer.
   end if
end for
Algorithm 1 FiDi-RL algorithm: Augmented random search with deep deterministic policy gradients
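To make the control flow of Algorithm 1 concrete, the following is a minimal, self-contained sketch on a toy task. Everything specific in it is an assumption made for illustration: the toy environment with target gain `K_TRUE`, the quadratic-feature critic (the paper uses a small neural network), the convex weighting with `beta`, and all default hyperparameter values. It is not the authors' implementation.

```python
import numpy as np

K_TRUE = np.array([1.0, -1.0])  # hypothetical optimal gain of the toy task


def features(s, a):
    # Quadratic features of a 2-D state s and a scalar action a.
    return np.array([a * a, a * s[0], a * s[1], s[0] ** 2, s[0] * s[1], s[1] ** 2, 1.0])


def dq_da(w, s, a):
    # Derivative of the critic Q_w(s, a) = w . features(s, a) w.r.t. the action.
    return 2.0 * w[0] * a + w[1] * s[0] + w[2] * s[1]


def rollout(theta, rng, gamma, horizon=10):
    """Run one toy episode; return its discounted return and its transitions."""
    s, total, transitions = rng.uniform(-1, 1, size=2), 0.0, []
    for t in range(horizon):
        a = float(theta @ s)
        r = -float((a - K_TRUE @ s) ** 2)
        s_next = rng.uniform(-1, 1, size=2)
        transitions.append((s, a, r, s_next))
        total += gamma ** t * r
        s = s_next
    return total, transitions


def fidi_rl(iters=100, n_dirs=8, top_b=4, alpha=0.05, nu=0.1, beta=0.1,
            ddpg_steps=20, critic_lr=0.01, refresh=5, gamma=0.9, batch=32, seed=0):
    rng = np.random.default_rng(seed)
    theta, w, buffer = np.zeros(2), np.zeros(7), []
    for it in range(1, iters + 1):
        # Gradient-free part: evaluate perturbed policies, keep their transitions.
        deltas = rng.standard_normal((n_dirs, 2))
        r_plus, r_minus = np.empty(n_dirs), np.empty(n_dirs)
        for k, d in enumerate(deltas):
            r_plus[k], tr_p = rollout(theta + nu * d, rng, gamma)
            r_minus[k], tr_m = rollout(theta - nu * d, rng, gamma)
            buffer.extend(tr_p + tr_m)
        top = np.argsort(-np.maximum(r_plus, r_minus))[:top_b]
        sigma = np.concatenate([r_plus[top], r_minus[top]]).std() + 1e-8
        g_ars = ((r_plus[top] - r_minus[top])[:, None] * deltas[top]).mean(axis=0) / sigma
        # Gradient-based part: accumulate actor gradients, train the critic.
        g_ddpg = np.zeros(2)
        for _ in range(ddpg_steps):
            g_actor, g_critic = np.zeros(2), np.zeros(7)
            for i in rng.integers(len(buffer), size=batch):
                s, a, r, s2 = buffer[i]
                g_actor += dq_da(w, s, float(theta @ s)) * s
                target = r + gamma * (w @ features(s2, float(theta @ s2)))
                phi = features(s, a)
                g_critic += (w @ phi - target) * phi
            g_ddpg += g_actor / batch
            w -= critic_lr * g_critic / batch
        g_ddpg /= ddpg_steps
        # Aggregate both gradients (assumed convex weighting) and ascend.
        theta = theta + alpha * ((1.0 - beta) * g_ars + beta * g_ddpg)
        if it % refresh == 0:
            buffer.clear()  # periodic refresh of the replay buffer
    return theta
```

Calling `fidi_rl()` returns the learned gain vector, which can be compared against `K_TRUE`. The sketch keeps a single policy throughout, reuses the random-search rollouts as off-policy data, and omits target networks, which matches the structure described in this section but leaves out state normalization and everything specific to MuJoCo.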

Considering the integration of DDPG with ARS, we make two modifications to the original DDPG algorithm:

  1. No soft target policy updates. In the original DDPG algorithm, soft target updates are used to improve the learning stability at the risk of slowing down learning [4]. With our method, as the policy is updated by both ARS and DDPG, the DDPG update is controlled by the update step parameter and the update proportion. Hence, it is unnecessary to use soft target policy updates. In addition, since the critic network is critical to the policy update, the synchronous update of actor and critic is indispensable. In our experiments, we also find that the critic network becomes prone to divergence with soft updates.

  2. Periodic refresh of the experience replay buffer. Experience replay is an important mechanism for reinforcement learning [1, 44]. With experience replay, the robustness and data efficiency of learning are improved. Instead of replacing the old transitions in the replay buffer, we use a more controllable version of the original experience replay to make full use of the generated data and facilitate learning as well. The experience replay buffer is periodically flushed according to a refresh parameter w.r.t. the episode number.

By iteratively improving the policy with FiDi-RL, we can finally obtain a policy that meets our goal.

V Experimental Results

V-A Experiment Setup

We evaluate the FiDi-RL method on the MuJoCo environment [12], which is a popular environment for evaluating RL algorithms for continuous control tasks. The implementation is based on OpenAI Gym [45]. Six robotic control tasks are used for evaluation: HalfCheetah-v2, Ant-v2, Hopper-v2, Swimmer-v2, Walker2d-v2 and Humanoid-v2:

  • HalfCheetah-v2: Agent controls a cheetah-like body to run forward as quickly as possible. The state dimension is 17 and the action dimension is 6.

  • Ant-v2: Agent controls a 4-leg ant to move forward as quickly as possible. The state dimension is 111 and the action dimension is 8.

  • Hopper-v2: Agent controls a monoped to keep it from falling. The state dimension is 11 and the action dimension is 3.

  • Swimmer-v2: Agent controls a snake-like robot to swim forward as fast as possible. The state dimension is 8 and the action dimension is 2.

  • Walker2d-v2: Agent controls a bipedal walker to move forward as fast as possible. The state dimension is 17 and the action dimension is 6.

  • Humanoid-v2: Agent controls a human-like robot to stand up. The state dimension is 376 and the action dimension is 17.

In the performance evaluation part, we compare FiDi-RL with state-of-the-art RL methods, namely DDPG [4], TRPO [10], PPO [11], ARS [14] and CEM [17], under 5 random seeds. In addition, to evaluate the learning efficiency, we also compare the total learning steps needed to reach a prescribed threshold against some derivative-based methods. The details of the experiment settings can be found in the Appendix.

Fig. 2: Illustration of MuJoCo robotic tasks [12]: (a) HalfCheetah-v2, (b) Ant-v2, (c) Hopper-v2, (d) Swimmer-v2, (e) Walker2d-v2, (f) Humanoid-v2

V-B Performance Evaluation under 5 Random Seeds

We evaluate our method against several state-of-the-art RL methods. For each method, we evaluated the policy periodically during training by testing the policy without random exploration noise. We also evaluate our method against some derivative-based methods: DDPG, TRPO and PPO.

Fig. 3: Learning curves on MuJoCo robotic tasks

For ARS, CEM and FiDi-RL (ARS+DDPG), each episode runs a whole learning step: policy generation, simulation, evaluation and policy update. Then we test the policy without any randomness for evaluation. The policies used in ARS, FiDi-RL and CEM are linear policies that simply map observations to actions. For the critic function in FiDi-RL, we adopt a 2-layer neural network with rectified linear units as the activation function for the hidden layer and a direct output without any activation function. Each layer contains 64 neurons. We run ARS and FiDi-RL with 10 fixed random seeds and select the 5 seeds with the best results. To make a fair comparison between ARS and FiDi-RL, the random seeds of both methods are identical for each task. In addition, we also give the learning curves of the baseline methods. We use OpenAI Spinning Up to run the baseline experiments. For the gradient-based methods, each epoch contains one episode of 1000 steps, with the hyperparameters given in the appendix. The evaluation results are shown in Figure 3. The solid lines represent the mean total reward of 5 independent runs and the shaded regions represent the standard deviation of the results. All the results are smoothed using a right-centered moving average over 50 successive epochs. From the figures we can see that the learning speed of FiDi-RL is generally faster than that of the others (except on Humanoid-v2). For most of the tasks, FiDi-RL outperforms the other algorithms. In addition, we find that with DDPG incorporated into ARS, the algorithm turns out to be more stable, with smaller variance than the original ARS. We also notice that linear policies are sufficient for training those tasks, which is nearly impossible with deep RL algorithms.
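The smoothing and shaded bands described above can be reproduced along the following lines. This sketch assumes that "right-centered" means a trailing window (each point is averaged with the points before it) and that each run is smoothed before the across-run statistics are taken; both are interpretations, and the function names are hypothetical.

```python
import numpy as np


def trailing_moving_average(values, window=50):
    """Average each point with up to window - 1 preceding points."""
    values = np.asarray(values, dtype=float)
    csum = np.cumsum(np.insert(values, 0, 0.0))
    idx = np.arange(1, len(values) + 1)
    start = np.maximum(idx - window, 0)
    # Early points use a shorter window instead of being dropped.
    return (csum[idx] - csum[start]) / (idx - start)


def mean_and_std_band(runs, window=50):
    """Smooth every run, then return the across-run mean and standard deviation."""
    smoothed = np.stack([trailing_moving_average(r, window) for r in runs])
    return smoothed.mean(axis=0), smoothed.std(axis=0)
```

The returned mean corresponds to the solid line and the standard deviation to the half-width of the shaded region when plotted per epoch.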

We also find that in Hopper-v2 and Walker2d-v2, the original ARS is often trapped in local optima, which causes the high variance in its performance. In Figure 4, we give the results of 20 runs in the Hopper-v2 environment with 20 arbitrary random seeds. In this setting, ARS converges to a score of around 1000 in 19 runs and reaches 3000 in only 1 run, while FiDi-RL converges above 3000 in 4 runs. FiDi-RL can help ARS escape local optima in the learning process, thus enhancing the stability of ARS.

Fig. 4: Hopper-v2 with 20 random seeds

V-C Average Learning Steps to Reach the Threshold

We also compare the average number of learning steps needed to reach a prescribed threshold for FiDi-RL against the gradient-based methods. The thresholds we adopt here are the same as in [30]. The hyperparameters of FiDi-RL are the same as those listed in Table II. We compare FiDi-RL against the gradient-based methods SAC, DDPG, PPO and TD3. The learning steps of FiDi-RL are calculated as the sum of the gradient-based iterations and the finite-difference-based iterations. The learning steps for SAC, DDPG, PPO and TD3 are estimated from the learning curves in [32, 11]. Table I shows the results. From the results we can see that, with the help of the gradient-free policy update, FiDi-RL requires fewer learning timesteps than the gradient-based methods to reach the threshold of each environment.

Environment Threshold SAC DDPG PPO TD3 FiDi-RL
Ant-v2 NA NA
Swimmer-v2 unknown unknown unknown
Walker2d-v2 NA
Humanoid-v2 NA NA

TABLE I: Average learning steps before reaching the threshold
Environment | number of directions | number of best directions | noise | ARS learning rate | DDPG update portion | DDPG step | critic learning rate

TABLE II: Parameters for ARS and FiDi-RL

V-D Learned Behavior

We also study the learned behavior of FiDi-RL and ARS in the HalfCheetah environment. We use the well-trained models of FiDi-RL and ARS without exploration noise and record a video of their performance in the MuJoCo environment. From the video we can see that the FiDi-RL agent runs much faster than the ARS agent. The agent controlled by FiDi-RL is also more energy-saving in running, as the swing range of each leg is smaller than with ARS. Moreover, at the beginning, the FiDi-RL controlled agent shows a higher adaptation capacity than the ARS controlled agent, as it accelerates faster. By learning from the elementary transitions via gradient-based learning, FiDi-RL improves the performance of gradient-free learning.

VI Conclusion and Discussion

In this paper we propose FiDi-RL, a novel method that incorporates the deep reinforcement learning method DDPG with the gradient-free policy search method ARS. FiDi-RL can use simple linear policies to cope with complex continuous control tasks through policy iteration that integrates gradient-based and gradient-free methods. The trade-off between those methods can also be adjusted by the newly introduced parameters. Empirical results show that FiDi-RL can improve the data efficiency, stability and performance of augmented random search. Moreover, our results also provide a competitive alternative to current deep reinforcement learning methods.

Limitations and future works: Incorporating deep RL methods with ARS also adds new hyperparameters to the original ARS. Compared with the original ARS, parameter tuning is more difficult. In the experiments we also find that those newly introduced parameters may need to be adjusted dynamically during the learning process. We will study this issue in the future. In addition, as inherited from ARS and DDPG, FiDi-RL is also sensitive to random seeds. In the future we will focus on further improving the stability of FiDi-RL.

Appendix A Hyperparameters for ARS and FiDi-RL

Table II gives the parameters of FiDi-RL and ARS. For each environment, FiDi-RL and ARS share the same parameters for the gradient-free update. For all the experiments conducted in this paper, we use the discount factor of . The batch size of gradient-based learning in FiDi-RL is set to . All the experiments of FiDi-RL and ARS are conducted under 10 fixed random seeds, and we select the 5 best ones to draw the performance comparison in Figure 3.

Appendix B Hyperparameters for CEM

For the CEM method, we use the same number of simulation episodes as in FiDi-RL and ARS. The parameters are sampled from a Gaussian distribution. Other parameters are listed in Table III.


Environment Population size Top % Initial std of Parameters
HalfCheetah-v2 32 20% 0.5
Ant-v2 120 20% 1
Hopper-v2 16 20% 1
Swimmer-v2 32 20% 1
Walker2d-v2 80 20% 1
Humanoid-v2 460 20% 1

TABLE III: Hyperparameters for CEM
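The roles of the three hyperparameters in Table III (population size, top percentage, initial standard deviation) can be seen in a short sketch of a cross-entropy search over a flat parameter vector. This is a generic CEM with a diagonal Gaussian, written under the assumption that `evaluate` returns an episode return; the function name, the iteration count and the small standard-deviation floor are illustrative choices, not values from the paper.

```python
import numpy as np


def cem_search(evaluate, dim, rng, population=32, top_frac=0.2, init_std=0.5, iters=50):
    """Cross-entropy method with a diagonal Gaussian over a flat parameter vector."""
    mean, std = np.zeros(dim), np.full(dim, init_std)
    n_elite = max(1, int(round(population * top_frac)))
    for _ in range(iters):
        samples = mean + std * rng.standard_normal((population, dim))
        scores = np.array([evaluate(x) for x in samples])
        # Refit the Gaussian to the best-scoring fraction of the population.
        elite = samples[np.argsort(-scores)[:n_elite]]
        mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-3
    return mean
```

For a linear policy, the flat vector would be reshaped into the policy matrix inside `evaluate` before running an episode.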

Appendix C Hyperparameters for Gradient-based Methods

As described before, we run the gradient-based methods using OpenAI Spinning Up. For each algorithm, the actor and critic functions are 2-layer neural networks with neurons in each layer. The discount factor is set to . For each epoch we run learning iterations in all the environments. The main parameters we tuned in the experiments are the learning rates of the actor and critic networks. The other parameters are left at their defaults in Spinning Up. Table IV gives the learning rates of the different methods.

Environment Actor learning rate Critic learning rate
HalfCheetah 0.0001 0.0001
Ant-v2 0.001 0.0001
Hopper-v2 0.0001 0.0001
Swimmer-v2 0.0001 0.0001
Walker2d-v2 0.0001 0.0001
Humanoid-v2 0.0001 0.0001
HalfCheetah 0.0001 -
Ant-v2 0.0001 -
Hopper-v2 0.0001 -
Swimmer-v2 0.0001 -
Walker2d-v2 0.0001 -
Humanoid-v2 0.0001 -
HalfCheetah 0.0001 -
Ant-v2 0.0001 -
Hopper-v2 0.0001 -
Swimmer-v2 0.0001 -
Walker2d-v2 0.0001 -
Humanoid-v2 0.0001 -

TABLE IV: Hyperparameters for Gradient-based Methods


The authors would like to thank the anonymous reviewers for their valuable comments and suggestions.