Guiding Evolutionary Strategies by Differentiable Robot Simulators

by Vladislav Kurenkov et al.

In recent years, Evolutionary Strategies have been actively explored for policy search in robotic tasks, as they provide a simpler alternative to reinforcement learning algorithms. However, this class of algorithms is often claimed to be extremely sample-inefficient. On the other hand, there is growing interest in Differentiable Robot Simulators (DRS), as they can potentially find successful policies with only a handful of trajectories. But the resulting gradient is not always useful for first-order optimization. In this work, we demonstrate how the DRS gradient can be used in conjunction with Evolutionary Strategies. Preliminary results suggest that this combination can reduce the sample complexity of Evolutionary Strategies by 3x-5x in both simulation and the real world.




1 Introduction

Evolutionary Strategies are a class of zeroth-order black-box optimization algorithms that have been successfully applied to various simulated robotics tasks [13, 16]. They are easy to parallelize, and their small number of hyperparameters makes them easy to tune. However, these methods have been observed to exhibit higher sample complexity (in comparison to Reinforcement Learning algorithms) and a strong dependence on the initial network initialization [2], making their adoption for training robots directly in the real world troublesome.

Recently, there has been rapid growth in work on Differentiable Robot Simulators (DRS) [6, 10, 8, 15, 17, 5]. Their promise is to drastically reduce the number of samples needed to find successful policies in comparison to reinforcement learning or black-box policy search methods. However, implementing a DRS is non-trivial and requires specific techniques (especially for collision handling [8, 15]) to make the resulting gradient useful for first-order optimization. Moreover, first-order optimization methods that rely on the DRS gradient may get stuck in local optima and may require specific treatment.

Lately, multiple attempts have been made to merge Evolutionary Strategies (ES) and Reinforcement Learning (RL) algorithms [4, 11, 14]. Following this line of work, instead of combining ES and RL, we aim to combine the two ends of the expected sample-complexity spectrum, namely DRS and Evolutionary Strategies, specifically in cases where the DRS gradient is not useful for optimization with first-order methods. We propose to use the recently introduced Guided Evolutionary Strategies (Guided-ES) [12] and to treat the DRS gradient as a surrogate. Using this combination, we demonstrate that it is possible to reduce the sample complexity of Evolutionary Strategies by 3x-5x when training directly on a real robot (Section 3.1). Furthermore, we show in simulation that even misleading gradients from a DRS can be utilized to speed up the convergence of Evolutionary Strategies (Section 3.2).

2 Evolutionary Strategies with Differentiable Robot Simulators

To unite DRS and Evolutionary Strategies, we propose to use an algorithm introduced in [12], Guided Evolutionary Strategies. This method can use any surrogate gradients that are correlated with the true gradient to accelerate the convergence of evolutionary strategies. The surrogate gradients can be corrupted or biased in any way; the only requirement is that they preserve a positive correlation with the true gradient. We presume that this is the case for gradients computed with a DRS and therefore propose to use them as the surrogate in Guided-ES.

The proposed approach is outlined in Algorithm 1. The key idea is to compute a surrogate gradient using a differentiable robot simulator. A history of the k most recent surrogates is then formed into an n × k matrix (where n is the dimension of the search space), and an orthonormal basis U is extracted from it. This basis is used to define the covariance matrix from which the perturbations are generated. A hyperparameter (α in [12]) controls how strongly the perturbations are biased in the direction of the surrogate gradient.

Require:  Initial solution θ₀; optimizer OPT; cost function f; DRS gradient ∇f_DRS
Ensure:  Final solution θ_T
1:  for t = 0 to T-1 do
2:     // DRS Part
3:     Get DRS gradient g_t = ∇f_DRS(θ_t)
4:     // Guided-ES Part
5:     Update low-dimensional guiding subspace U with the DRS gradient g_t
6:     Define search covariance Σ = (α/n) I + ((1-α)/k) U Uᵀ
7:     for i = 1 to P do
8:        Sample perturbation εᵢ ~ N(0, σ²Σ)
9:        Compute antithetic pair of losses f(θ_t + εᵢ) and f(θ_t - εᵢ)
10:     end for
11:     Compute Guided-ES gradient estimate ĝ = β/(2σ²P) Σᵢ εᵢ [f(θ_t + εᵢ) - f(θ_t - εᵢ)]
12:     Update parameters θ_{t+1} = OPT(θ_t, ĝ)
13:  end for
14:  return θ_T
Algorithm 1 Evolutionary Strategies Guided by DRS
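The core of Algorithm 1 can be sketched in a few lines of NumPy. This is a minimal illustration of the Guided-ES estimator from [12] with the surrogate subspace built from recent DRS gradients; all names, default hyperparameter values, and the antithetic-sampling loop are illustrative, not the authors' implementation.

```python
import numpy as np

def guided_es_step(f, x, surrogate_grads, sigma=0.1, alpha=0.5, beta=2.0, pop=8):
    """One Guided-ES gradient estimate (sketch of Maheswaranathan et al. [12]).

    f: scalar cost function; x: current parameters, shape (n,);
    surrogate_grads: list of recent DRS gradients spanning the guiding subspace.
    """
    n = x.size
    # Orthonormal basis U of the k-dimensional guiding subspace via QR.
    G = np.stack(surrogate_grads, axis=1)              # (n, k)
    U, _ = np.linalg.qr(G)
    k = U.shape[1]
    grad_est = np.zeros(n)
    for _ in range(pop):
        # Sample from N(0, sigma^2 * (alpha/n * I + (1-alpha)/k * U U^T)).
        eps = sigma * (np.sqrt(alpha / n) * np.random.randn(n)
                       + np.sqrt((1 - alpha) / k) * U @ np.random.randn(k))
        # Antithetic pair of evaluations.
        grad_est += (f(x + eps) - f(x - eps)) * eps
    return beta / (2 * sigma**2 * pop) * grad_est
```

For a quadratic cost the estimate is positively correlated with the true gradient even from a single surrogate, which is the property Guided-ES relies on.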

We note that there is no need to rely on the proposed zeroth-order optimization if one has access to the exact gradient of the objective function. However, there are cases where the gradient obtained with DRS is not effective when used with first-order optimization methods to optimize the objective. In the following section, we will demonstrate two of them and show how one can benefit from the proposed approach to improve the sample efficiency of evolutionary strategies.

3 Experiments

3.1 Accelerated Learning on Real Robot

Evolutionary strategies and their variants have been observed to be on par with reinforcement learning algorithms [16], but in some cases require a larger number of episodes for training. Still, algorithms of this class have been successfully applied to train robots directly in the real world at the expense of longer experimental time [7]. Here, we expect that incorporating information from a DRS into the training process should accelerate the convergence of evolutionary strategies.

We use the approach described in Section 2, where the surrogate gradient is taken as the difference between the current parameters and the parameters obtained after a fixed number of DRS optimization steps in simulation. The ascent direction obtained with DRS does not exactly match the ascent direction of the real-world objective, but we expect the two to be at least positively correlated. A detailed algorithm can be found in the appendix (Algorithm 2).

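The surrogate construction described above can be sketched as follows: run a few gradient steps inside the differentiable simulator and take the resulting displacement as the surrogate direction. The function names, the plain-SGD inner loop, and the step count are assumptions for illustration; the paper does not specify these details.

```python
import numpy as np

def drs_surrogate(theta, sim_cost_grad, lr=1e-2, steps=10):
    """Surrogate ascent direction from a differentiable simulator (sketch).

    sim_cost_grad(theta) returns the DRS gradient of the simulated cost; the
    surrogate is the displacement accumulated over `steps` descent steps, so it
    points along the simulated gradient direction.
    """
    theta_sim = theta.copy()
    for _ in range(steps):
        theta_sim -= lr * sim_cost_grad(theta_sim)   # descend the simulated cost
    return theta - theta_sim
```

For a simple quadratic simulated cost, the returned direction is a positive multiple of the true gradient, i.e. positively correlated with it, as assumed in Section 2.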
Figure 1: Guided-ES with DRS achieves considerably lower sample complexity for training on a real robot in comparison to vanilla evolutionary strategies (Vanilla-ES) and the Covariance Matrix Adaptation Evolution Strategy (CMA-ES). Moreover, the method is more robust across random seeds, as indicated by the interquartile cost range graph.

Due to limited access to the experimental platform with physical robots, we consider only one problem, a swinging pendulum [3], where the goal is to sustain constant swinging at 180 degrees. The cost function is defined as the sum of differences between the current energy and the target energy at each timestep. The state space is represented by the last four measurements of position and velocity.
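A minimal sketch of such an energy-based cost is given below. The exact functional form (absolute differences, unit masses and lengths) is an assumption; the paper only states that the cost sums per-timestep differences between the current and target energy.

```python
import numpy as np

def pendulum_cost(thetas, omegas, m=1.0, l=1.0, g=9.81):
    """Energy-based swing-up cost (sketch; exact form assumed, not from the paper).

    thetas, omegas: per-timestep angle (rad, 0 = hanging down) and angular velocity.
    Compares total mechanical energy at each step to the energy of the upright
    (180 degree) position with zero velocity.
    """
    energy = 0.5 * m * l**2 * omegas**2 + m * g * l * (1 - np.cos(thetas))
    target = 2 * m * g * l                  # energy at 180 degrees, zero velocity
    return float(np.sum(np.abs(energy - target)))
```

A trajectory resting at the upright position incurs zero cost, while hanging motionless at the bottom incurs the maximal per-step penalty of 2mgl.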

We compare Guided-ES with DRS against Vanilla Evolutionary Strategies (Vanilla-ES) [16] and against the Covariance Matrix Adaptation Evolution Strategy (CMA-ES) [9], to include a method that adapts the covariance matrix during training. In Figure 1, we observe that the proposed method converges faster than both Vanilla-ES and CMA-ES; moreover, its convergence is robust to random seeds, unlike the other considered algorithms. We note that Vanilla-ES and CMA-ES are highly dependent on the initial network initialization (to the extent of non-convergence under our experimental budget), which is not the case for Guided-ES with DRS.

We find the results of this experiment encouraging, as they provide evidence in favor of leveraging DRS (and closed-loop transfer from simulation in general) for data-efficient evolutionary strategies. However, more sophisticated robots and problems should be probed in future work.

3.2 When DRS Gradients Are Misleading

To demonstrate that the convergence of Guided-ES with DRS is possible even when the gradients are not useful for first-order optimization methods, we rely on the Mass-Spring simulator depicted in Figure 4. Hu et al. [10] observed that a naive implementation of this simulator results in a gradient that cannot be used with stochastic gradient descent: the optimization process does not converge to a satisfactory solution unless a specific approach to collision handling (the time-of-impact fix) is applied. In this setup, we want to probe how well Guided-ES performs when provided with gradients computed by the naive version of the simulator; we therefore remove the fix proposed by Hu et al. [10].
We observe in Figure 2 that the proposed approach is able to find a satisfactory solution even when guided by the ineffective gradient. Moreover, the convergence is on average faster than with simple evolutionary strategies. However, the variation between training iterations is quite high; we believe this can be mitigated with proper learning-rate scheduling. On the other hand, the in-sample variation (within a training iteration but across seeds) is low, suggesting that a satisfactory result is found quickly regardless of network initialization (which is not the case for Vanilla-ES). The results of this experiment suggest that a differentiable robot simulator with inaccurate backward propagation should not be abandoned: it may still be used to reduce the sample complexity of evolutionary strategies, which are easier to tune than reinforcement learning algorithms.

Figure 2: Guided-ES with DRS achieves lower sample complexity than vanilla evolutionary strategies (Vanilla-ES), with lower in-sample variation. Computed across 5 seeds for each algorithm; the shaded area represents the 95% confidence interval.

4 Conclusion and Future Work

In this work, we proposed a natural way to combine Differentiable Robot Simulators and Evolutionary Strategies, and demonstrated two cases where such a combination can reduce the sample complexity of the latter. We find these results encouraging, especially for training robots directly in the real world. However, future work should include more robotic control problems to probe the viability of the proposed method in more intricate setups involving real and simulated robots.


The authors would like to thank Oleg Balakhnov and Sergei Savin for helpful discussions and help with the robot hardware.


  • [1] J. Bernstein, A. Vahdat, Y. Yue, and M. Liu (2021-01) On the distance between two neural networks and the stability of learning. arXiv:2002.03432 [cs, math, stat]. External Links: 2002.03432 Cited by: §5.1.
  • [2] P. Chrabaszcz, I. Loshchilov, and F. Hutter (2018-02) Back to Basics: Benchmarking Canonical Evolution Strategies for Playing Atari. arXiv:1802.08842 [cs]. External Links: 1802.08842 Cited by: §1.
  • [3] C. C. Chung and J. Hauser (1995-06) Nonlinear control of a swinging pendulum. Automatica 31 (6), pp. 851–862 (en). External Links: ISSN 0005-1098, Document Cited by: §3.1.
  • [4] E. Conti, V. Madhavan, F. Petroski Such, J. Lehman, K. Stanley, and J. Clune (2018) Improving Exploration in Evolution Strategies for Deep Reinforcement Learning via a Population of Novelty-Seeking Agents. In Advances in Neural Information Processing Systems, Vol. 31. Cited by: §1.
  • [5] F. de Avila Belbute-Peres, K. Smith, K. Allen, J. Tenenbaum, and J. Z. Kolter (2018) End-to-End Differentiable Physics for Learning and Control. In Advances in Neural Information Processing Systems, Vol. 31. Cited by: §1.
  • [6] J. Degrave, M. Hermans, J. Dambre, and F. wyffels (2019) A Differentiable Physics Engine for Deep Learning in Robotics. Frontiers in Neurorobotics 13, pp. 6. External Links: ISSN 1662-5218, Document Cited by: §1.
  • [7] D. Floreano and S. Nolfi (2000-11) Evolutionary Robotics: The Biology, Intelligence, and Technology of Self-Organizing Machines. Intelligent Robotics and Autonomous Agents Series, A Bradford Book, Cambridge, MA, USA (en). External Links: ISBN 978-0-262-14070-6 Cited by: §3.1.
  • [8] M. Geilinger, D. Hahn, J. Zehnder, M. Bächer, B. Thomaszewski, and S. Coros (2020-11) ADD: analytically differentiable dynamics for multi-body systems with frictional contact. ACM Transactions on Graphics 39 (6), pp. 190:1–190:15. External Links: ISSN 0730-0301, Document Cited by: §1.
  • [9] N. Hansen, S. D. Müller, and P. Koumoutsakos (2003-03) Reducing the Time Complexity of the Derandomized Evolution Strategy with Covariance Matrix Adaptation (CMA-ES). Evolutionary Computation 11 (1), pp. 1–18. External Links: ISSN 1063-6560, Document Cited by: §3.1.
  • [10] Y. Hu, L. Anderson, T. Li, Q. Sun, N. Carr, J. Ragan-Kelley, and F. Durand (2020-02) DiffTaichi: Differentiable Programming for Physical Simulation. arXiv:1910.00935 [physics, stat]. External Links: 1910.00935 Cited by: §1, §3.2.
  • [11] S. Khadka and K. Tumer (2018) Evolution-Guided Policy Gradient in Reinforcement Learning. In Advances in Neural Information Processing Systems, Vol. 31. Cited by: §1.
  • [12] N. Maheswaranathan, L. Metz, G. Tucker, D. Choi, and J. Sohl-Dickstein (2019-06) Guided evolutionary strategies: Augmenting random search with surrogate gradients. arXiv:1806.10230 [cs, stat]. External Links: 1806.10230 Cited by: §1, §2.
  • [13] H. Mania, A. Guy, and B. Recht (2018-03) Simple random search provides a competitive approach to reinforcement learning. arXiv:1803.07055 [cs, math, stat]. External Links: 1803.07055 Cited by: §1.
  • [14] A. Pourchot and O. Sigaud (2019-02) CEM-RL: Combining evolutionary and gradient-based methods for policy search. arXiv:1810.01222 [cs, stat]. External Links: 1810.01222 Cited by: §1.
  • [15] Y. Qiao, J. Liang, V. Koltun, and M. C. Lin (2020-07) Scalable Differentiable Physics for Learning and Control. arXiv:2007.02168 [cs, stat]. External Links: 2007.02168 Cited by: §1.
  • [16] T. Salimans, J. Ho, X. Chen, S. Sidor, and I. Sutskever (2017-09) Evolution Strategies as a Scalable Alternative to Reinforcement Learning. arXiv:1703.03864 [cs, stat]. External Links: 1703.03864 Cited by: §1, §3.1, §3.1.
  • [17] A. Sanchez-Gonzalez, J. Godwin, T. Pfaff, R. Ying, J. Leskovec, and P. Battaglia (2020-11) Learning to Simulate Complex Physics with Graph Networks. In Proceedings of the 37th International Conference on Machine Learning, pp. 8459–8468 (en). External Links: ISSN 2640-3498 Cited by: §1.

5 Appendix

5.1 Hyperparameters

For the experiments on the real robot, we performed an extensive hyperparameter search for each algorithm in simulation and used the best-performing set for training on the real robot. For the experiments in Section 3.2, we likewise performed an extensive hyperparameter search and report the best curves for both algorithms.

We tried several optimization algorithms (SGD, Adam, RMSProp, AdaGrad, Fromage) and found Fromage [1] to perform most stably when DRS gradients are involved. We believe this is due to its inherent normalization of the gradients, which we observed to be a necessary additional step for all other optimization algorithms to converge with the DRS gradient.
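The normalization step referred to above can be sketched as a plain SGD update with the gradient rescaled to unit norm. This is an illustration of the extra step we apply to the other optimizers, not Fromage itself (which normalizes per layer and also accounts for parameter norms).

```python
import numpy as np

def normalized_sgd_step(theta, grad, lr=1e-2, eps=1e-8):
    """SGD with gradient normalization (sketch of the extra step described above).

    DRS gradients can vary wildly in magnitude across iterations; dividing by
    the gradient norm makes the step length controlled by the learning rate alone.
    """
    norm = np.linalg.norm(grad)
    return theta - lr * grad / (norm + eps)
```

Regardless of the raw gradient magnitude, each update moves the parameters by (approximately) the learning rate.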

5.2 Pseudocode for Accelerated Learning on Real Robot

Require:  Initial solution θ₀; optimizers OPT_sim, OPT_real; cost functions f_sim, f_real; DRS gradient ∇f_sim
Ensure:  Final solution θ_T
1:  for t = 0 to T-1 do
2:     // DRS Part
3:     θ_sim ← θ_t
4:     for j = 1 to M do
5:        Make a step with the DRS gradient: θ_sim ← OPT_sim(θ_sim, ∇f_sim(θ_sim))
6:     end for
7:     Compute the simulation descent direction g_t = θ_t - θ_sim
8:     // Guided-ES Part
9:     Update low-dimensional guiding subspace U with g_t
10:     Define search covariance Σ = (α/n) I + ((1-α)/k) U Uᵀ
11:     for i = 1 to P do
12:        Sample perturbation εᵢ ~ N(0, σ²Σ)
13:        Compute antithetic pair of losses f_real(θ_t + εᵢ) and f_real(θ_t - εᵢ)
14:     end for
15:     Compute Guided-ES gradient estimate ĝ = β/(2σ²P) Σᵢ εᵢ [f_real(θ_t + εᵢ) - f_real(θ_t - εᵢ)]
16:     Update parameters θ_{t+1} = OPT_real(θ_t, ĝ)
17:  end for
18:  return θ_T
Algorithm 2 Evolutionary Strategies Guided by DRS for Training on a Real Robot

5.3 Environments

5.3.1 Pendulum

Figure 3: Pendulum robot. The goal is to swing at 180 degrees. The robot starts pointing downward with zero velocity and acceleration.

Cost Function

5.3.2 Mass-Spring

The environment is depicted in Figure 4; the objective function is defined as the difference between the center-of-mass positions at the first and last timesteps, projected onto the horizontal axis.
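This objective can be written in one line; the function name and the (T, 2) trajectory layout are illustrative assumptions.

```python
import numpy as np

def mass_spring_objective(com_positions):
    """Mass-Spring objective (sketch): rightward progress of the center of mass.

    com_positions: (T, 2) array of center-of-mass positions over the rollout;
    the value is the horizontal displacement between the last and first steps.
    """
    return float(com_positions[-1, 0] - com_positions[0, 0])
```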

Figure 4: Mass-Spring robot environment. The robot should advance to the right as far as possible in a fixed amount of time. The actuation is done by varying the lengths of the springs colored in red or blue. One simulation corresponds to 8 seconds of real-time. For videos check Differentiable Mass-Spring Simulator section at