They are known to be easy to parallelize, and their small number of hyperparameters makes them easy to tune. However, these methods have been observed to exhibit higher sample complexity than reinforcement learning algorithms, as well as a strong dependence on network initialization, which makes their adoption for training robots directly in the real world troublesome.
Recently, there has been rapid growth in work on Differentiable Robot Simulators (DRS) [6, 10, 8, 15, 17, 5]. Their promise is to drastically reduce the number of samples needed to find successful policies in comparison to reinforcement learning or black-box policy search methods. However, implementing a DRS is non-trivial and requires specific techniques (especially for collision handling [8, 15]) to make the resulting gradient useful for first-order optimization. Moreover, first-order optimization methods that rely on a DRS's gradient may get stuck in local optima, so some specific treatment may be needed.
Lately, multiple attempts have been made to merge Evolutionary Strategies (ES) and Reinforcement Learning (RL) algorithms [4, 11, 14]. Following this line of work, but instead of combining ES and RL, we aim to combine the two ends of the expected sample-complexity spectrum, namely DRS and Evolutionary Strategies, specifically in cases where the former is not useful for optimization with first-order methods. We propose to use the recently introduced Guided Evolutionary Strategies (Guided-ES) and to treat the DRS's gradient as a surrogate. Using this combination, we demonstrate that it is possible to reduce the sample complexity of Evolutionary Strategies by a factor of 3-5x when training directly on a real robot (Section 3.1). Furthermore, we show in simulation that even misleading gradients from a DRS can be utilized to speed up the convergence of Evolutionary Strategies (Section 3.2).
2 Evolutionary Strategies with Differentiable Robot Simulators
To unite DRS and Evolutionary Strategies, we propose to use the recently introduced Guided Evolutionary Strategies (Guided-ES) algorithm. This method can make use of any surrogate gradients that are correlated with the true gradient to accelerate the convergence of evolutionary strategies. The surrogate gradients can be corrupted or biased in any way; the only requirement is that they preserve a positive correlation with the true gradient. We presume that this is the case for gradients computed with a DRS and therefore propose to use them as the surrogate in Guided-ES.
The proposed approach is outlined in Algorithm 1. The key idea is to compute a surrogate gradient using a differentiable robot simulator. A history of the k most recent surrogates is then formed into a matrix of size n × k (where n is the dimension of the search space), and an orthonormal basis is extracted from it. This basis is used to define the covariance matrix from which the perturbations are generated. The hyperparameter α controls how strongly the perturbations are biased in the direction of the surrogate gradient.
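A minimal numpy sketch of this update (our reconstruction of the Guided-ES step, not the paper's Algorithm 1 verbatim; the function name and hyperparameter defaults are illustrative):

```python
import numpy as np

def guided_es_step(f, x, surrogate_history, sigma=0.1, alpha=0.5,
                   pairs=10, lr=0.05, rng=None):
    """One antithetic Guided-ES update on a cost f to be minimised.
    `surrogate_history` holds the k most recent surrogate gradients
    (e.g. obtained from a DRS)."""
    rng = np.random.default_rng() if rng is None else rng
    n = x.size
    # Orthonormal basis U (n x k) of the surrogate-gradient subspace.
    U, _ = np.linalg.qr(np.stack(surrogate_history, axis=1))
    k = U.shape[1]
    grad = np.zeros(n)
    for _ in range(pairs):
        # Perturbation from the covariance (alpha/n) I + ((1-alpha)/k) U U^T:
        # alpha = 1 recovers isotropic ES, alpha = 0 searches only in the
        # surrogate subspace.
        eps = sigma * (np.sqrt(alpha / n) * rng.standard_normal(n)
                       + np.sqrt((1 - alpha) / k) * U @ rng.standard_normal(k))
        grad += eps * (f(x + eps) - f(x - eps))
    grad /= 2 * sigma ** 2 * pairs
    return x - lr * grad
```

Note that even with an arbitrarily bad surrogate the estimator stays a descent direction in expectation, since the covariance is positive definite for α > 0; the surrogate only biases where samples are spent.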
We note that there is no need to resort to the proposed zeroth-order optimization if one has access to an exact gradient of the objective function. However, there are cases where the gradient obtained with a DRS is not effective when used with first-order optimization methods. In the following sections, we demonstrate two such cases and show how one can benefit from the proposed approach to improve the sample efficiency of evolutionary strategies.
3.1 Accelerated Learning on Real Robot
Evolutionary strategies and their variants have been observed to be on par with reinforcement learning algorithms (Salimans et al.), though in some cases they require a larger number of episodes for training. Still, algorithms of this class have been successfully applied to train robots directly in the real world, at the expense of higher experimental time. Here, we expect that incorporating information from a DRS into the training process should accelerate the convergence of evolutionary strategies.
We use the approach described in Section 2, where the surrogate gradient is taken as the difference between the current parameters and the parameters obtained after a fixed number of optimization steps in simulation with the DRS. The ascent direction obtained with the DRS does not exactly match the ascent direction of the real-world objective, but we expect the two to be at least positively correlated. A detailed algorithm can be found in the appendix (source code: https://github.com/vkurenkov/guided-es-by-differentiable-simulators).
Due to limited access to the experimental platform with physical robots, we consider only one problem, a swinging pendulum, where the goal is to sustain a constant swing up to 180 degrees. The cost function is defined as the sum over timesteps of the differences between the current energy and the target energy. The state space is represented by the last four measurements of position and velocity.
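Under this definition, the cost can be sketched as follows (an illustrative reconstruction assuming a point-mass pendulum; the constants m, l, g and the exact energy bookkeeping on the real robot may differ):

```python
import numpy as np

def swingup_cost(thetas, omegas, m=1.0, l=1.0, g=9.81):
    """Sum over timesteps of |E_t - E*|, where E* is the energy of the
    upright (180-degree) configuration. Zero potential energy is taken
    at the hanging position (theta = 0)."""
    energy = 0.5 * m * l ** 2 * omegas ** 2 + m * g * l * (1.0 - np.cos(thetas))
    target = 2.0 * m * g * l  # potential energy at theta = pi
    return float(np.sum(np.abs(energy - target)))
```

The cost vanishes on any trajectory that keeps the total energy at the upright level, which is exactly what a constant 180-degree swing does.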
We compare Guided-ES with DRS against Vanilla Evolutionary Strategies (Vanilla-ES) and against Covariance Matrix Adaptation Evolution Strategy (CMA-ES), the latter included as a method that adapts the covariance matrix during training. In Figure 1, we observe that the proposed method converges faster than both Vanilla-ES and CMA-ES; moreover, its convergence is robust to random seeds, in contrast to the other considered algorithms. We notice that Vanilla-ES and CMA-ES are highly dependent on network initialization (to the extent of non-convergence under our experimental budget), which is not the case for Guided-ES with DRS.
We find the results of this experiment encouraging, as they provide evidence in favor of leveraging DRS (and closed-loop transfer from simulation in general) for data-efficient evolutionary strategies. However, more sophisticated robots and problems should be probed further.
3.2 When DRS Gradients Are Misleading
To demonstrate that the convergence of Guided-ES with DRS is possible even when the gradients are not useful for first-order optimization methods, we rely on the Mass-Spring simulator depicted in Figure 4.
It was previously observed that a naive implementation of this simulator results in a gradient that cannot be used with stochastic gradient descent: the optimization process does not converge to a satisfactory solution, and a specific approach to collision handling (the time-of-impact fix) is necessary. In this setup, we want to probe how well Guided-ES performs when provided with gradients computed by the naive version of the simulator; we therefore remove the time-of-impact fix.
We observe in Figure 2 that the proposed approach is able to find a satisfactory solution even when guided by the ineffective gradient. Moreover, the convergence is on average faster than with simple evolutionary strategies. The variation between training iterations is quite high, but we believe it can be overcome with proper learning-rate scheduling. On the other hand, the in-sample variation (within a training iteration but over multiple seeds) is low, suggesting that a satisfactory result is found quickly regardless of network initialization, which is not the case for Vanilla-ES. The results of this experiment suggest that access to a differentiable robot simulator with inaccurate backward propagation is no reason to abandon such a simulator: it can still be used to reduce the sample complexity of evolutionary strategies, which are easier to tune than reinforcement learning algorithms.
4 Conclusion and Future Work
In this work, we proposed a natural way to combine Differentiable Robot Simulators and Evolutionary Strategies and demonstrated two cases where such a combination reduces the sample complexity of the latter. We find these results encouraging, especially for training robots directly in the real world. However, future work should include more robotic control problems to probe the viability of the proposed method in more intricate setups involving real and simulated robots.
The authors would like to thank Oleg Balakhnov and Sergei Savin for helpful discussions and help with the robot hardware.
References

- On the distance between two neural networks and the stability of learning. arXiv:2002.03432 [cs, math, stat]. Cited by: §5.1.
- (2018-02) Back to Basics: Benchmarking Canonical Evolution Strategies for Playing Atari. arXiv:1802.08842 [cs]. Cited by: §1.
- (1995-06) Nonlinear control of a swinging pendulum. Automatica 31 (6), pp. 851–862. Cited by: §3.1.
- (2018) Improving Exploration in Evolution Strategies for Deep Reinforcement Learning via a Population of Novelty-Seeking Agents. In Advances in Neural Information Processing Systems, Vol. 31. Cited by: §1.
- (2018) End-to-End Differentiable Physics for Learning and Control. In Advances in Neural Information Processing Systems, Vol. 31. Cited by: §1.
- A Differentiable Physics Engine for Deep Learning in Robotics. Frontiers in Neurorobotics 13, pp. 6. Cited by: §1.
- (2000-11) Evolutionary Robotics: The Biology, Intelligence, and Technology of Self-Organizing Machines. Intelligent Robotics and Autonomous Agents Series, A Bradford Book, Cambridge, MA, USA. Cited by: §3.1.
- (2020-11) ADD: Analytically Differentiable Dynamics for Multi-Body Systems with Frictional Contact. ACM Transactions on Graphics 39 (6), pp. 190:1–190:15. Cited by: §1.
- (2003-03) Reducing the Time Complexity of the Derandomized Evolution Strategy with Covariance Matrix Adaptation (CMA-ES). Evolutionary Computation 11 (1), pp. 1–18. Cited by: §3.1.
- (2020-02) DiffTaichi: Differentiable Programming for Physical Simulation. arXiv:1910.00935 [physics, stat]. Cited by: §1, §3.2.
- (2018) Evolution-Guided Policy Gradient in Reinforcement Learning. In Advances in Neural Information Processing Systems, Vol. 31. Cited by: §1.
- (2019-06) Guided evolutionary strategies: Augmenting random search with surrogate gradients. arXiv:1806.10230 [cs, stat]. Cited by: §1, §2.
- (2018-03) Simple random search provides a competitive approach to reinforcement learning. arXiv:1803.07055 [cs, math, stat]. Cited by: §1.
- (2019-02) CEM-RL: Combining evolutionary and gradient-based methods for policy search. arXiv:1810.01222 [cs, stat]. Cited by: §1.
- (2020-07) Scalable Differentiable Physics for Learning and Control. arXiv:2007.02168 [cs, stat]. Cited by: §1.
- (2017-09) Evolution Strategies as a Scalable Alternative to Reinforcement Learning. arXiv:1703.03864 [cs, stat]. Cited by: §1, §3.1.
- Learning to Simulate Complex Physics with Graph Networks. In Proceedings of the 37th International Conference on Machine Learning, pp. 8459–8468. Cited by: §1.
For the experiments on the real robot, we performed an extensive hyperparameter search for each algorithm in simulation and used the best set of hyperparameters for training on the real robot. For the experiments in Section 3.2, we also performed an extensive hyperparameter search and report the best curves for both algorithms.
We tried several optimization algorithms (SGD, Adam, RMSProp, AdaGrad, Fromage) and found Fromage to perform the most stably when DRS gradients are involved. We believe this is due to its inherent property of normalizing the gradients, which we observed to be a necessary additional step for all the other optimization algorithms to converge with DRS gradients.
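The normalization in question amounts to rescaling the gradient to unit norm before the update, sketched below (illustrative; this is the extra step we added for the other optimizers, not the Fromage update rule itself, which also rescales per layer):

```python
import numpy as np

def normalized_sgd_step(params, grad, lr=0.01, eps=1e-8):
    """Plain SGD on a gradient rescaled to unit norm, so the step size is
    set by lr alone and is insensitive to the (often extreme) magnitudes
    of DRS gradients."""
    return params - lr * grad / (np.linalg.norm(grad) + eps)
```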
5.2 Pseudocode for Accelerated Learning on Real Robot
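A compact sketch of the training loop (a reconstruction from Sections 2 and 3.1, not the released implementation; `cost_real` and `drs_step` are placeholder callables standing in for a real-robot episode and one first-order step on the simulated objective):

```python
import numpy as np

def train_guided_es(cost_real, drs_step, x0, iters=100, k=3, inner=10,
                    sigma=0.1, alpha=0.5, pairs=8, lr=0.05, seed=0):
    """Guided-ES whose surrogate gradient is the displacement produced by a
    fixed number of first-order optimization steps inside the DRS."""
    rng = np.random.default_rng(seed)
    x = x0.astype(float).copy()
    history = []
    for _ in range(iters):
        # Surrogate: current parameters minus the parameters reached after
        # `inner` optimization steps in simulation (Section 3.1).
        p = x.copy()
        for _ in range(inner):
            p = drs_step(p)
        history = (history + [x - p])[-k:]
        # Orthonormal basis of the recent-surrogate subspace.
        U, _ = np.linalg.qr(np.stack(history, axis=1))
        n, m = x.size, U.shape[1]
        grad = np.zeros(n)
        for _ in range(pairs):
            # Perturbations drawn from (alpha/n) I + ((1-alpha)/m) U U^T.
            eps = sigma * (np.sqrt(alpha / n) * rng.standard_normal(n)
                           + np.sqrt((1 - alpha) / m) * U @ rng.standard_normal(m))
            grad += eps * (cost_real(x + eps) - cost_real(x - eps))
        # Antithetic estimate: each pair costs two real-robot episodes.
        x = x - lr * grad / (2 * sigma ** 2 * pairs)
    return x
```

Because the real objective enters only through episode returns, a biased simulator shifts where perturbations are sampled but not which parameters the search ultimately prefers.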
The environment is depicted in Figure 4; the objective function is defined as the difference between the center-of-mass positions at the first and last timesteps, projected onto the horizontal axis.