Recent years have witnessed the thriving development of reinforcement learning (RL). Significant progress has been made in areas such as playing the game of Go [1, 2], Atari games, robotics, and real-time strategy games [5, 6]. Despite its great success, reinforcement learning still faces challenges in physical control tasks, which have real-valued continuous action spaces. Conventional reinforcement learning methods for continuous policy search seek to approximate the optimal policy by estimating the gradient from the expected return of sampled trajectories. Well-known methods such as vanilla policy gradient, (deep) deterministic policy gradient (DPG/DDPG) [8, 4], Asynchronous Advantage Actor-Critic (A3C), Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO) are widely adopted in practice. These methods have shown promising results in continuous control tasks such as the Mujoco locomotion tasks. Unfortunately, existing gradient methods are not robust to the environment and are fragile to hyper-parameters or even random seeds [13, 14]. In addition, the estimation of gradients is often computationally costly and not easy to parallelize, which narrows their application.
In contrast, an alternative approach for solving continuous control tasks without computing gradients is based on random search theory. These methods, also known as "black-box optimization", treat the optimization process as a black box and use only the return of each simulation to search for the optimal policy directly, i.e., direct policy search. Gradient-free methods such as the Cross-Entropy Method (CEM), Evolution Strategies (ES) [18, 15] and the Finite Difference Method (FDM) [14, 19] offer low computation cost, competitive results and fast training speed, while also being easy to understand. Although gradient-free methods are robust to the environment and easy to implement [14, 15], they suffer from low sample efficiency. Unlike derivative-based methods, which can learn repeatedly from each elementary step of the simulated trajectories, gradient-free methods discard the numerous generated trajectories after evaluating the estimated return.
Recently, combining gradient-free and gradient-based methods to get the best of both has attracted attention. For example, GEP-PG adopts a goal exploration process to fill the replay buffer and then applies DDPG to learn the policy; the resulting algorithm is more sample efficient than DDPG. Evolutionary reinforcement learning (ERL) presents an efficient combination of ES and DDPG: the DDPG policy is periodically inserted into the ES population, and the resulting performance outperforms both methods. CEM-RL combines Covariance Matrix Adaptation (CMA) with deep RL methods, and the results show that learning can be faster and better. Maheswaranathan et al. theoretically analyze optimization problems with a surrogate gradient and incorporate ES with such surrogate gradients. However, most of the existing combination works adopt complex neural networks (NNs) as the control policy. The double-edged sword of deep NNs can yield better performance, but it also introduces numerous hyper-parameters defining the network structure, which makes parameter tuning difficult. In addition, the backpropagation through such networks is computationally costly when performing gradient-based learning. Nevertheless, [24, 14] have shown that a simple linear policy representation is sufficient for controlling those tasks.
To address this issue, in this paper we propose FiDi-RL, which incorporates deep RL insights into Finite-Difference (FiDi) policy search based on linear policy representations. FiDi-RL combines DDPG with ARS and aims at improving the data efficiency of ARS. We evaluate FiDi-RL on the well-known Mujoco locomotion tasks. Our empirical results show that FiDi-RL can accelerate learning compared to the original ARS and DDPG, and can improve the performance and stability of ARS as well. In addition, FiDi-RL also shows competitive results against popular policy search methods such as CEM, PPO and TRPO.
The rest of this paper is organized as follows: Section II discusses related work. Section III presents the preliminaries of reinforcement learning for continuous control and the problem definition of this paper. Section IV introduces the proposed framework. Section V provides the experimental analysis, followed by the conclusion and discussion in Section VI.
II Related Works
Here, we summarize related work, divided into three categories:
II-A Gradient-based Methods for Continuous Control
Various existing works on derivative-based reinforcement learning have been proposed for solving continuous control tasks. In this part we mainly focus on policy gradient methods. Since conventional policy gradient methods such as REINFORCE are too fragile to scale to difficult problems, many new policy gradient methods have been proposed alongside the development of deep learning. For example, some works use an actor-critic framework to train stochastic policies with experience replay. Inspired by off-policy actor-critic, Deterministic Policy Gradient (DPG) and its derivatives play an important role in RL methods for continuous control. However, the original DPG was only evaluated on several simple toy environments, not on complex environments with high-dimensional observation spaces. The most popular variant, the Deep Deterministic Policy Gradient algorithm (DDPG), integrates DPG with the insights of DQN and has been successfully adopted in many robotic control tasks. Despite its success, DDPG suffers from low stability and a lack of sample efficiency. To address this issue, Q-Prop was proposed, using a Taylor expansion of the off-policy critic as a control variate, which improves the stability and sample efficiency of DDPG. TD3 was designed to overcome the overestimation problem in DDPG and improves its performance as well. In addition, Soft Actor-Critic was proposed based on the maximum entropy reinforcement learning framework, which further improves the stability of DDPG.
Another approach is Trust Region Policy Optimization (TRPO), which directly builds stochastic policies without the actor-critic framework. TRPO produces near-monotonic improvements in return by making carefully controlled updates to the policy. However, as no action-value function is learned, TRPO appears to be less data efficient. Based on TRPO, ACKTR uses a Kronecker-factored approximation and proposes an actor-critic variant of TRPO, which improves the performance and data efficiency of TRPO. Furthermore, Proximal Policy Optimization (PPO) improves TRPO by alternating between sampling and optimizing the policy, which improves the sample efficiency of TRPO.
To conclude, most current gradient-based reinforcement learning methods for continuous control face problems such as instability and high computational cost. To overcome these issues, the state-of-the-art methods have become more and more complicated.
II-B Gradient-free Methods for Continuous Control
As an alternative to gradient-based RL, most gradient-free methods for continuous control can be categorized into three classes:
Evolution strategies (ES): As a popular black-box optimization approach, ES adopts evolutionary algorithms to optimize the policy. The simplest form of ES optimizes the policy parameters directly. Covariance Matrix Adaptation (CMA) is another well-known ES method, which combines ES with the ideas of derandomization and cumulation. A review of ES for learning various tasks is also available. Recent work with ES shows that ES can learn continuous tasks with high scalability and efficiency.
Cross-entropy method (CEM): Similar to CMA, CEM evolves the policy by generating parameters with high rewards. For example, a CEM method with adaptive basis functions has been proposed, and CEM has been used to learn policies in decentralized partially observable Markov Decision Process (POMDP) environments. CEM can learn very fast, but is also easily trapped in local optima.
Finite difference methods (FDM): FDM uses finite differences to compute an approximation of the gradients and optimizes the policy by gradient ascent/descent. For example, Pegasus was proposed based on FDM for solving POMDP problems. Recently, Mania et al. proposed ARS, an efficient method for learning continuous control tasks with low computation cost and high efficiency.
In summary, although gradient-free methods can learn the policy quickly with low computation cost, the generated trajectories are only used once to obtain the rewards for evaluation.
II-C Combining Gradient-based Methods with Gradient-free Methods
Several works have explored the combination of gradient-based and gradient-free methods. For example, Goal Exploration Process-Policy Gradient (GEP-PG) adopts a goal exploration process to fill the replay buffer and then uses DDPG to learn the policy. The GEP is very close to evolutionary methods. Their experiments show that GEP-PG is more sample efficient and has lower variance than DDPG. However, the combination does not improve the efficiency of the gradient updates of DDPG. Evolutionary Reinforcement Learning (ERL) introduces a hybrid algorithm that periodically inserts the DDPG agent into the evolutionary optimization process, and improves stability and efficiency in learning and exploration. However, ERL does not benefit from the search efficiency of ES, and their experimental results also show high variance in some tasks. Similar to our work, Maheswaranathan et al. [23] analyze optimization problems with a surrogate gradient and show that incorporating ES with surrogate gradients can improve the performance and efficiency of traditional RL methods. However, their experiments are only performed on simple tasks and lack practical demonstrations. In addition, CEM-RL is also very close to our work: it combines CMA with the gradient-based methods DDPG/TD3, and the results show that learning can be accelerated. However, CEM-RL uses neural networks for policy representation, and their results show that the performance is sensitive to the structure of the policy networks.
III Preliminaries and Problem Settings
In this section, we introduce basic concepts, notation, and our target problem.
Reinforcement learning problems can be formulated as a Markov Decision Process (MDP) $(\mathcal{S}, \mathcal{A}, P, r, \gamma)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $\gamma \in [0, 1)$ is the discount factor and $P(s' \mid s, a)$ is the transition function that maps each state-action pair to some distribution over $\mathcal{S}$. We consider the standard reinforcement learning setup: an agent interacting with the environment in discrete time steps. At each time step $t$, the agent observes a state $s_t$, takes an action $a_t$ under some policy $\pi$, and receives a scalar reward $r_t$. A policy $\pi(a \mid s)$ describes the agent's behavior: a probability distribution that maps a state to an action.
The return from a state is defined as the total discounted future reward $R_t = \sum_{i=t}^{T} \gamma^{i-t} r(s_i, a_i)$, where $T$ is the terminal step. The state-action value function $Q^\pi$ is a mapping from $\mathcal{S} \times \mathcal{A}$ to $\mathbb{R}$, which describes the expected discounted future reward when taking action $a_t$ at state $s_t$ and following policy $\pi$ thereafter:

$$Q^\pi(s_t, a_t) = \mathbb{E}\left[ R_t \mid s_t, a_t \right].$$
The goal of reinforcement learning is to find an optimal policy $\pi^*$ which maximizes the expected return from the starting state:

$$J(\pi) = \mathbb{E}_{s \sim \rho^\pi,\, a \sim \pi}\left[ R_1 \right].$$

Here $\rho^\pi$ is the state distribution under policy $\pi$.
The Bellman equation describes the recursive relationship in the state-action value:

$$Q^\pi(s_t, a_t) = \mathbb{E}\left[ r(s_t, a_t) + \gamma\, \mathbb{E}_{a_{t+1} \sim \pi}\left[ Q^\pi(s_{t+1}, a_{t+1}) \right] \right].$$
One popular RL method for dealing with continuous tasks is Deep Deterministic Policy Gradient (DDPG), which adopts an actor-critic architecture in policy gradient. The actor of DDPG optimizes the policy directly using the deterministic policy gradient theorem. Denoting $\mu_\theta$ and $Q_\phi$ as the parameterized policy (actor) and the action-value function (critic), respectively, the gradient of the actor loss function $J(\theta)$ can be calculated as:

$$\nabla_\theta J(\theta) = \mathbb{E}_{s}\left[ \nabla_a Q_\phi(s, a)\big|_{a = \mu_\theta(s)}\, \nabla_\theta \mu_\theta(s) \right].$$
The loss of the critic function can be calculated as:

$$L(\phi) = \mathbb{E}\left[ \left( r(s_t, a_t) + \gamma\, Q_\phi\big(s_{t+1}, \mu_\theta(s_{t+1})\big) - Q_\phi(s_t, a_t) \right)^2 \right].$$
During learning, gradient ascent is performed on the actor to maximize $J(\theta)$, while gradient descent is performed on the critic to minimize $L(\phi)$.
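For concreteness, the following is a minimal sketch of these two updates in PyTorch. It is illustrative only: target networks, exploration noise and the replay mechanism are omitted or simplified, and the dimensions, network sizes and learning rates are assumed values rather than the settings used in our experiments.

import torch
import torch.nn as nn

obs_dim, act_dim, gamma = 17, 6, 0.99  # illustrative dimensions (HalfCheetah-like)

actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                      nn.Linear(64, act_dim), nn.Tanh())
critic = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(),
                       nn.Linear(64, 1))
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

def ddpg_update(s, a, r, s2, done):
    """One DDPG update on a mini-batch of transitions (torch tensors; done is 0/1 floats)."""
    # Critic: gradient descent on the squared TD error (the critic loss above).
    with torch.no_grad():
        target = r + gamma * (1 - done) * critic(torch.cat([s2, actor(s2)], -1)).squeeze(-1)
    q = critic(torch.cat([s, a], -1)).squeeze(-1)
    critic_loss = ((q - target) ** 2).mean()
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()
    # Actor: gradient ascent on Q(s, mu(s)) via the deterministic policy gradient.
    actor_loss = -critic(torch.cat([s, actor(s)], -1)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()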
Recently, a new method named Augmented Random Search (ARS) was proposed, aiming at solving continuous control tasks with high efficiency and low computation cost. As a gradient-free algorithm, ARS optimizes the policy directly by computing finite-difference approximations along random directions. Denoting $\xi$ as the unknown dynamics of the environment, $\delta$ as a generated random (zero-mean Gaussian) perturbation, $r(\pi_\theta, \xi)$ as the total discounted reward received by evaluating policy $\pi_\theta$, and $\nu$ as the exploration range, the loss function can be expressed as:

$$\max_\theta \; \mathbb{E}_\xi\!\left[ r(\pi_\theta, \xi) \right].$$

The gradient can be approximated with:

$$\nabla_\theta\, \mathbb{E}_\xi\!\left[ r(\pi_\theta, \xi) \right] \approx \frac{1}{N\nu} \sum_{k=1}^{N} \left[ r(\pi_{\theta + \nu\delta_k}, \xi^{+}_{k}) - r(\pi_{\theta - \nu\delta_k}, \xi^{-}_{k}) \right] \delta_k.$$
Based on the normalization applied to the observations and the gradients, ARS has four versions: ARS-V1, ARS-V1-t, ARS-V2 and ARS-V2-t. In ARS-V1, the finite-difference gradients above are normalized by dividing by the standard deviation of the collected rewards. In ARS-V2, state normalization is added on top of ARS-V1. In the versions with the suffix -t, only the directions yielding the top rewards are used to compute the gradients.
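As an illustration of this update, the following is a minimal numpy sketch of one ARS-V1-t style step for a linear policy. The rollout(M) function (returning the episode return of policy matrix M) and the hyperparameter values are assumptions for illustration, not the exact settings of ARS.

import numpy as np

def ars_step(M, rollout, n_dirs=8, n_top=4, noise=0.03, lr=0.02):
    """One ARS-V1-t style update of the linear policy matrix M."""
    deltas = [np.random.randn(*M.shape) for _ in range(n_dirs)]
    r_pos = np.array([rollout(M + noise * d) for d in deltas])   # evaluate +nu*delta
    r_neg = np.array([rollout(M - noise * d) for d in deltas])   # evaluate -nu*delta
    # Keep only the directions with the largest max(r+, r-) (the "-t" variants).
    top = np.argsort(-np.maximum(r_pos, r_neg))[:n_top]
    sigma_r = np.concatenate([r_pos[top], r_neg[top]]).std() + 1e-8
    grad = sum((r_pos[k] - r_neg[k]) * deltas[k] for k in top) / (n_top * sigma_r)
    return M + lr * grad   # gradient ascent on the estimated return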
III-B Problem Settings
In this paper we consider the standard model-free reinforcement learning setting for control, in which an agent interacts with the environment. We specifically focus on continuous control tasks, where the action space is real-valued: $\mathcal{A} \subseteq \mathbb{R}^p$. The policy is parameterized by some function $\pi_\theta: \mathcal{S} \to \mathcal{A}$. We aim at finding the optimal policy that maximizes the average return under the environment random variables $\xi$:

$$\max_\theta \; \mathbb{E}_\xi\!\left[ r(\pi_\theta, \xi) \right].$$
To address the above issues of deep neural networks, our goal is to find an optimal policy with a linear representation using both gradient-based and gradient-free methods.
IV FiDi-RL: Combining Deep Deterministic Policy Gradient with Augmented Random Search
In this section, we introduce FiDi-RL in detail. We first illustrate the overall architecture and then describe the detailed algorithm.
Proposed in 2018 by Mania et al., ARS has shown great potential in solving continuous control tasks with simple linear policies and low computation cost. Compared to conventional RL methods, ARS optimizes the policy by performing random search over simulated trajectories. However, ARS can only learn from complete episodes and lacks data efficiency. In contrast, off-policy learning methods can learn from elementary steps repeatedly, which greatly improves sample efficiency. Therefore, in this paper we incorporate ARS with off-policy RL methods. Since ARS optimizes the policy directly, off-policy policy gradient methods are mandatory for the combination. Popular off-policy policy gradient methods such as DPG, DDPG, TD3 and algorithms derived from them are capable of relearning from stored trajectories. Our FiDi-RL architecture can integrate ARS with any of these RL methods, but for convenience of exposition, we use DDPG as an example here.
As illustrated before, ARS estimates the gradient of the policy by evaluating perturbed policies with designated random noise. In contrast, DDPG computes the gradient based on the deterministic policy gradient theorem through off-policy learning. At each iteration, the two methods use a single shared policy for optimization, and the overall optimization goal of both methods is essentially the same. Therefore, we can use a linear combination to describe the integrated loss function of FiDi-RL as below:

$$L_{\text{FiDi}}(\theta) = (1 - \beta)\, L_{\text{ARS}}(\theta) + \beta\, L_{\text{DDPG}}(\theta).$$

Since the loss functions of both methods aim at maximizing the total expected discounted reward, their linear combination holds the same optimization goal as well. The newly introduced parameter $\beta$ denotes the update proportion of DDPG and serves as a control variable for the trade-off between gradient-free and gradient-based learning. Based on this equation, the gradient of the FiDi-RL loss function can be calculated as a weighted summation of the two methods:

$$\nabla_\theta L_{\text{FiDi}}(\theta) = (1 - \beta)\, \nabla_\theta L_{\text{ARS}}(\theta) + \beta\, \nabla_\theta L_{\text{DDPG}}(\theta).$$
As DDPG is an off-policy RL method, the loss functions of the actor and the critic can be calculated using pre-collected data from a replay buffer. Therefore, we can directly fill the replay buffer of DDPG with the trajectories generated by ARS. Since ARS performs exploration by adding parameter-space noise, such integration can also facilitate the exploration of DDPG. The overall framework of FiDi-RL is illustrated in Figure 1. Note that the critic function in our framework is not mandatory: the return can also be evaluated by Monte-Carlo estimation based on the sampled trajectories [7, 8], which further reduces the computation cost. By performing gradient ascent iteratively, the policy can be improved.
Our architecture differs from the existing combination works [20, 21, 22] in several ways. First, we use a single policy during the whole learning process; the absence of drawing elites from a population makes the learning more efficient. Second, the combination of gradient-based and gradient-free methods is conducted through a collaborative loss function, and the optimization is performed through both gradient-free and gradient-based learning, which makes the optimization process more direct.
Based on this architecture, we give the algorithm in Algorithm 1. We use linear policies for learning, denoting $M$ as the policy matrix, a $p \times n$ matrix mapping $n$-dimensional observations to $p$-dimensional actions. The pseudo-code of the prototype algorithm is described in Algorithm 1.
For each episode of learning, we first generate rollouts and store them into the experience replay buffer. We then use the ARS algorithm (either V1, V2 or V1-t, V2-t) to compute the gradient w.r.t. the current policy. Subsequently, the DDPG agent learns from the generated rollouts to optimize both the actor and the critic. The initial actor for this learning is identical to the policy at the beginning of the episode. The policy gradient is accumulated over the learning steps of the derivative-based method. Finally, the gradients of the two methods are aggregated.
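The sketch below outlines one such episode at a high level, under the same assumptions as the earlier sketches; env_rollout, replay, ddpg_critic_update and ddpg_policy_grad are illustrative placeholders rather than the exact routines of Algorithm 1, and the hyperparameter values are arbitrary.

import numpy as np

def fidi_rl_episode(M, env_rollout, replay, ddpg_critic_update, ddpg_policy_grad,
                    beta=0.3, ddpg_steps=50, n_dirs=8, noise=0.03, lr=0.02):
    """One FiDi-RL episode: ARS finite-difference gradient plus accumulated DDPG gradient."""
    # 1) Generate perturbed rollouts; env_rollout stores every transition in the replay buffer.
    deltas = [np.random.randn(*M.shape) for _ in range(n_dirs)]
    r_pos = np.array([env_rollout(M + noise * d, replay) for d in deltas])
    r_neg = np.array([env_rollout(M - noise * d, replay) for d in deltas])
    # 2) Finite-difference (ARS) gradient estimate for the current linear policy.
    sigma_r = np.concatenate([r_pos, r_neg]).std() + 1e-8
    g_ars = sum((r_pos[k] - r_neg[k]) * deltas[k] for k in range(n_dirs)) / (n_dirs * sigma_r)
    # 3) DDPG: update the critic and accumulate the policy gradient over mini-batches
    #    sampled from the replay buffer, starting from the episode's initial policy M.
    g_ddpg = np.zeros_like(M)
    for _ in range(ddpg_steps):
        batch = replay.sample()
        ddpg_critic_update(batch)
        g_ddpg += ddpg_policy_grad(M, batch)
    # 4) Aggregate the two gradients (beta = DDPG update proportion) and ascend.
    return M + lr * ((1.0 - beta) * g_ars + beta * g_ddpg)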
Considering the integration of DDPG with ARS, we here make two improvements to the original DDPG algorithm:
No soft target policy updates. In the original DDPG algorithm, soft target updates are used to improve learning stability at the risk of slowing down learning. With our method, as the policy is updated by both ARS and DDPG, the DDPG update is controlled by the update step parameter and the update proportion $\beta$. Hence, it is unnecessary to use soft target policy updates. In addition, since the critic network is critical to the policy update, synchronous updates of the actor and the critic are indispensable. In our experiments, we also find that the critic network becomes prone to divergence with soft updates.
Periodically refresh the experience replay buffer. Experience replay is an important mechanism for reinforcement learning [1, 44]. With experience replay, the robustness and data efficiency of learning are improved. Instead of replacing the old transitions in the replay buffer, we here use a more controllable version of the original experience replay to make full use of the generated data and to facilitate learning. The experience replay buffer is periodically flushed according to a refresh parameter defined w.r.t. the episode number, as sketched below.
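A minimal sketch of this controlled refresh follows; the refresh interval and the replay object's clear() interface are illustrative assumptions, not the names used in our implementation.

def maybe_refresh(replay, episode, refresh_every=100):
    """Flush the replay buffer every refresh_every episodes (illustrative values)."""
    if episode > 0 and episode % refresh_every == 0:
        replay.clear()  # discard old transitions so DDPG learns from recent ARS rollouts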
By iteratively improving the policy with FiDi-RL, we can finally obtain a policy that meets our goal.
V Experimental Results
V-A Experiment Setup
We evaluate the FiDi-RL method in the Mujoco environment, a popular benchmark for evaluating RL algorithms on continuous control tasks. The implementation is based on OpenAI Gym. Six robotic control tasks are used for evaluation: HalfCheetah-v2, Ant-v2, Hopper-v2, Swimmer-v2, Walker2d-v2 and Humanoid-v2:
HalfCheetah-v2: Agent controls a cheetah-like body to run forward as quickly as possible. The state dimension is 17 and the action dimension is 6.
Ant-v2: Agent controls a 4-leg ant to move forward as quickly as possible. The state dimension is 111 and the action dimension is 8.
Hopper-v2: Agent controls a monoped to keep it from falling. The state dimension is 11 and the action dimension is 3.
Swimmer-v2: Agent controls a snake-like robot to swim forward as fast as possible. The state dimension is 8 and the action dimension is 2.
Walker2d-v2: Agent controls a bipedal walker to move forward as fast as possible. The state dimension is 17 and the action dimension is 6.
Humanoid-v2: Agent controls a human-like robot to stand up. The state dimension is 376 and the action dimension is 17.
In the performance evaluation part, we compare our method with state-of-the-art RL methods (DDPG, TRPO, PPO, ARS and CEM) under 5 random seeds to evaluate the performance of FiDi-RL. In addition, we also measure the total learning steps required to reach a prescribed threshold against several derivative-based methods to evaluate the learning efficiency. The details of the experiment settings can be found in the Appendix.
V-B Performance Evaluation under 5 Random Seeds
We evaluate our method against several state-of-the-art RL methods. For each method, we evaluate the policy periodically during training by testing it without random exploration noise. We also compare our method against the derivative-based methods DDPG, TRPO and PPO.
For ARS, CEM and FiDi-RL (ARS+DDPG), each episode runs a whole learning step: policy generation, simulation, evaluation and policy update. We then test the policy without any randomness for evaluation. The policies used in ARS, FiDi-RL and CEM are linear policies that simply map observations to actions. For the critic function in FiDi-RL, we adopt a two-layer neural network with 64 neurons per layer, rectified linear units as the activation function for the hidden layers, and a linear output without any activation function. We run ARS and FiDi-RL with 10 fixed random seeds and select the 5 seeds with the best results. To make a fair comparison between ARS and FiDi-RL, the random seeds of both methods are identical for each task. In addition, we also give the learning curves of the baseline methods. We use OpenAI Spinning Up (https://spinningup.openai.com/en/latest/) to run the baseline experiments. For the gradient-based methods, each epoch contains one episode consisting of 1000 iteration steps, with the hyperparameters listed in the appendix. The evaluation results are shown in Figure 3. The solid lines represent the mean total reward of 5 independent runs and the shaded regions represent the standard deviation of the results. All results are smoothed using a right-centered moving average over 50 successive epochs. From the figures we can see that the learning speed of FiDi-RL is generally faster than the others (except on Humanoid-v2), and for most of the tasks FiDi-RL outperforms the remaining algorithms. We also find that with DDPG incorporated into ARS, the algorithm becomes more stable, with smaller variance than the original ARS. Moreover, we notice that linear policies are also sufficient for training these tasks, which is nearly impossible with deep RL algorithms.
We also find that in Hopper-v2 and Walker2d-v2, the original ARS is frequently trapped in local optima, which causes the high variance in its performance. In Figure 4, we give the results of 20 runs in the Hopper-v2 environment with 20 arbitrary random seeds. In this setting, ARS converges to a score of around 1000 in 19 runs and only once rises above 3000, while FiDi-RL converges above 3000 in 4 runs. FiDi-RL can thus help ARS escape local optima during learning, enhancing the stability of ARS.
V-C Average Learning Steps to Reach the Threshold
We also compare the average number of learning steps needed to reach a prescribed threshold for FiDi-RL against gradient-based methods. The thresholds we adopt are the same as in previous work. The hyperparameters of FiDi-RL are the same as those reported in Table II. We compare FiDi-RL against the gradient-based methods SAC, DDPG, PPO and TD3. The learning steps of FiDi-RL are calculated as the sum of gradient-based iterations and finite-difference-based iterations. The learning steps for SAC, DDPG, PPO and TD3 are estimated according to the learning curves in [32, 11]. Table I shows the results. From the results we can see that, with the help of gradient-free policy updates, FiDi-RL requires fewer learning timesteps than the gradient-based methods to reach the threshold in each environment.
Environment | number of | number of best | noise | ARS learning rate | DDPG update portion | DDPG step | critic learning rate
V-D Learned Behavior
We also study the learned behavior of FiDi-RL and ARS in the HalfCheetah environment. We use the well-trained models of FiDi-RL and ARS without exploration noise and record their performance in the Mujoco environment. The video can be found at https://youtu.be/a1Qtfg1kql8. From the video we can see that FiDi-RL can run much faster than ARS. The agent controlled by FiDi-RL is also more energy-efficient when running, as the swing range of each leg is smaller than with ARS. Moreover, at the beginning of the run, the FiDi-RL-controlled agent shows higher adaptation capacity than the ARS-controlled agent, as it accelerates faster. By learning from the elementary transitions via gradient-based learning, FiDi-RL improves the performance of gradient-free learning.
VI Conclusion and Discussion
In this paper we propose FiDi-RL, a novel method that incorporates the deep reinforcement learning method DDPG with the gradient-free policy search method ARS. FiDi-RL can use simple linear policies to cope with complex continuous control tasks by iteratively improving the policy through an integration of gradient-based and gradient-free methods. The trade-off between the two can also be adjusted via the newly introduced parameters. Empirical results show that FiDi-RL can improve the data efficiency, stability and performance of augmented random search. Moreover, our results also provide a competitive alternative to current deep reinforcement learning methods.
Limitations and future work: Incorporating deep RL methods with ARS also introduces new hyperparameters to the original ARS, which makes parameter tuning more difficult than for the original ARS. In the experiments we also find that these newly introduced parameters may need to be dynamically adjusted according to the learning process; we will study this issue in the future. In addition, inherited from ARS and DDPG, FiDi-RL is also sensitive to random seeds. In the future we will focus on further improving the stability of FiDi-RL.
Appendix A Hyperparameters for ARS and FiDi-RL
Table II gives the parameters of FiDi-RL and ARS. For each environment, FiDi-RL and ARS share the same parameters for the gradient-free update. All experiments in this paper use the same discount factor, and the batch size of gradient-based learning in FiDi-RL is fixed. All experiments of FiDi-RL and ARS are conducted under 10 fixed random seeds, and we select the 5 best ones to draw the performance comparison in Figure 3.
Appendix B Hyperparameters for CEM
For the CEM method, we use the same number of simulation episodes as in FiDi-RL and ARS. The parameters are sampled from a Gaussian distribution. The other parameters are listed in Table III.
Environment | Population size | Top % | Initial std of parameters
Appendix C Hyperparameters for Gradient-based Methods
As described before, we run the gradient-based methods using OpenAI Spinning Up. For each algorithm, the actor and critic functions are two-layer neural networks with the same number of neurons in each layer. The discount factor is fixed, and each epoch runs the same number of learning iterations across all environments. The main parameter tuned in the experiments is the learning rate of the actor and critic networks; the other parameters are left at their defaults in Spinning Up. Table IV gives the learning rates of the different methods.
Environment | Actor learning rate | Critic learning rate
The authors would like to thank the anonymous reviewers for their valuable comments and suggestions.
-  D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot et al., “Mastering the game of go with deep neural networks and tree search,” Nature, vol. 529, no. 7587, p. 484, 2016.
-  D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton et al., “Mastering the game of go without human knowledge,” Nature, vol. 550, no. 7676, p. 354, 2017.
-  V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, p. 529, 2015.
-  T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, “Continuous control with deep reinforcement learning,” arXiv preprint arXiv:1509.02971, 2015.
-  N. Usunier, G. Synnaeve, Z. Lin, and S. Chintala, “Episodic exploration for deep deterministic policies: An application to starcraft micromanagement tasks,” arXiv preprint arXiv:1609.02993, 2016.
-  J. N. Foerster, G. Farquhar, T. Afouras, N. Nardelli, and S. Whiteson, “Counterfactual multi-agent policy gradients,” in Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
-  R. J. Williams, “Simple statistical gradient-following algorithms for connectionist reinforcement learning,” Machine learning, vol. 8, no. 3-4, pp. 229–256, 1992.
-  D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller, “Deterministic policy gradient algorithms,” in ICML, 2014.
-  V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, “Asynchronous methods for deep reinforcement learning,” in International conference on machine learning, 2016, pp. 1928–1937.
-  J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz, “Trust region policy optimization,” in International Conference on Machine Learning, 2015, pp. 1889–1897.
-  J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” arXiv preprint arXiv:1707.06347, 2017.
-  E. Todorov, T. Erez, and Y. Tassa, “Mujoco: A physics engine for model-based control,” in 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 2012, pp. 5026–5033.
-  P. Henderson, R. Islam, P. Bachman, J. Pineau, D. Precup, and D. Meger, “Deep reinforcement learning that matters,” in Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
-  H. Mania, A. Guy, and B. Recht, “Simple random search of static linear policies is competitive for reinforcement learning,” in Advances in Neural Information Processing Systems, 2018, pp. 1800–1809.
-  T. Salimans, J. Ho, X. Chen, S. Sidor, and I. Sutskever, “Evolution strategies as a scalable alternative to reinforcement learning,” arXiv preprint arXiv:1703.03864, 2017.
-  J. Schmidhuber and J. Zhao, “Direct policy search and uncertain policy evaluation,” in AAAI spring symposium on search under uncertain and incomplete information, stanford univ, 1998, pp. 119–124.
-  I. Szita and A. Lörincz, “Learning tetris using the noisy cross-entropy method,” Neural computation, vol. 18, no. 12, pp. 2936–2941, 2006.
-  F. Stulp and O. Sigaud, “Path integral policy improvement with covariance matrix adaptation,” in Proceedings of the 29th International Conference on Machine Learning. Omnipress, 2012, pp. 1547–1554.
-  A. Y. Ng and M. Jordan, “Pegasus: A policy search method for large mdps and pomdps,” in Proceedings of the Sixteenth conference on Uncertainty in artificial intelligence. Morgan Kaufmann Publishers Inc., 2000, pp. 406–415.
-  C. Colas, O. Sigaud, and P.-Y. Oudeyer, “Gep-pg: Decoupling exploration and exploitation in deep reinforcement learning algorithms,” in International Conference on Machine Learning, 2018, pp. 1038–1047.
-  S. Khadka and K. Tumer, “Evolution-guided policy gradient in reinforcement learning,” in Advances in Neural Information Processing Systems, 2018, pp. 1188–1200.
-  A. Pourchot and O. Sigaud, “Cem-rl: Combining evolutionary and gradient-based methods for policy search,” arXiv preprint arXiv:1810.01222, 2018.
-  N. Maheswaranathan, L. Metz, G. Tucker, D. Choi, and J. Sohl-Dickstein, “Guided evolutionary strategies: augmenting random search with surrogate gradients,” in International Conference on Machine Learning, 2019, pp. 4264–4273.
-  A. Rajeswaran, K. Lowrey, E. V. Todorov, and S. M. Kakade, “Towards generalization and simplicity in continuous control,” in Advances in Neural Information Processing Systems, 2017, pp. 6550–6561.
-  S. Levine, C. Finn, T. Darrell, and P. Abbeel, “End-to-end training of deep visuomotor policies,” The Journal of Machine Learning Research, vol. 17, no. 1, pp. 1334–1373, 2016.
-  J. Kober, J. A. Bagnell, and J. Peters, “Reinforcement learning in robotics: A survey,” The International Journal of Robotics Research, vol. 32, no. 11, pp. 1238–1274, 2013.
-  P. Wawrzyński, “Real-time reinforcement learning by sequential actor–critics and experience replay,” Neural Networks, vol. 22, no. 10, pp. 1484–1497, 2009.
-  P. WawrzyńSki and A. K. Tanwani, “Autonomous reinforcement learning with experience replay,” Neural Networks, vol. 41, pp. 156–167, 2013.
-  T. Degris, M. White, and R. S. Sutton, “Off-policy actor-critic,” arXiv preprint arXiv:1205.4839, 2012.
-  S. Gu, T. Lillicrap, Z. Ghahramani, R. E. Turner, and S. Levine, “Q-prop: Sample-efficient policy gradient with an off-policy critic,” arXiv preprint arXiv:1611.02247, 2016.
-  S. Fujimoto, H. van Hoof, and D. Meger, “Addressing function approximation error in actor-critic methods,” arXiv preprint arXiv:1802.09477, 2018.
-  T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, “Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor,” arXiv preprint arXiv:1801.01290, 2018.
-  Y. Wu, E. Mansimov, R. B. Grosse, S. Liao, and J. Ba, “Scalable trust-region method for deep reinforcement learning using kronecker-factored approximation,” in Advances in neural information processing systems, 2017, pp. 5279–5288.
-  J. Koutnik, F. Gomez, and J. Schmidhuber, “Evolving neural networks in compressed weight space,” in Proceedings of the 12th annual conference on Genetic and evolutionary computation. ACM, 2010, pp. 619–626.
-  J. Koutník, G. Cuccu, J. Schmidhuber, and F. Gomez, “Evolving large-scale neural networks for vision-based reinforcement learning,” in Proceedings of the 15th annual conference on Genetic and evolutionary computation. ACM, 2013, pp. 1061–1068.
-  N. Hansen and A. Ostermeier, “Completely derandomized self-adaptation in evolution strategies,” Evolutionary computation, vol. 9, no. 2, pp. 159–195, 2001.
-  V. Heidrich-Meisner and C. Igel, “Evolution strategies for direct policy search,” in International Conference on Parallel Problem Solving from Nature. Springer, 2008, pp. 428–437.
-  S. Risi and J. Togelius, “Neuroevolution in games: State of the art and open challenges,” IEEE Transactions on Computational Intelligence and AI in Games, vol. 9, no. 1, pp. 25–41, 2015.
-  R. Y. Rubinstein and D. P. Kroese, The cross-entropy method: a unified approach to combinatorial optimization, Monte-Carlo simulation and machine learning. Springer Science & Business Media, 2013.
-  S. Mannor, R. Y. Rubinstein, and Y. Gat, “The cross entropy method for fast policy search,” in Proceedings of the 20th International Conference on Machine Learning (ICML-03), 2003, pp. 512–519.
-  L. Busoniu, D. Ernst, B. De Schutter, and R. Babuska, “Cross-entropy optimization of control policies with adaptive basis functions,” IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), vol. 41, no. 1, pp. 196–209, 2010.
-  F. A. Oliehoek, J. F. Kooij, and N. Vlassis, “The cross-entropy method for policy search in decentralized pomdps,” Informatica, vol. 32, no. 4, pp. 341–357, 2008.
-  M. Plappert, R. Houthooft, P. Dhariwal, S. Sidor, R. Y. Chen, X. Chen, T. Asfour, P. Abbeel, and M. Andrychowicz, “Parameter space noise for exploration,” arXiv preprint arXiv:1706.01905, 2017.
-  S. Zhang and R. S. Sutton, “A deeper look at experience replay,” CoRR, vol. abs/1712.01275, 2017. [Online]. Available: http://arxiv.org/abs/1712.01275
-  G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba, “Openai gym,” arXiv preprint arXiv:1606.01540, 2016.