1 Introduction
Evolutionary Strategies are a class of zeroth-order black-box optimization algorithms that have been successfully applied to various simulated robotics tasks [13, 16]. They are known to be easy to parallelize, and their small number of hyperparameters makes them easy to tune. However, these methods have been observed to exhibit higher sample complexity (in comparison to Reinforcement Learning algorithms) and a strong dependence on network initialization [2], making their adoption for training robots directly in the real world troublesome. Recently, there has been rapid growth in work on Differentiable Robot Simulators (DRS) [6, 10, 8, 15, 17, 5]. Their promise is to drastically reduce the number of samples needed to find successful policies in comparison to reinforcement learning or black-box policy search methods. However, implementing a DRS is non-trivial and requires specific techniques (especially for collision handling [8, 15]) to make the resulting gradient useful for first-order optimization. Moreover, first-order optimization methods that rely on a DRS's gradient may get stuck in local optima, and specific treatment may be needed.
Lately, multiple attempts have been made to merge Evolutionary Strategies (ES) and Reinforcement Learning (RL) algorithms [4, 11, 14]. Following this line of work, instead of combining ES and RL, we aim to combine the two ends of the expected sample complexity spectrum, namely DRS and Evolutionary Strategies, specifically in cases where the former is not useful for optimization with first-order methods. We propose to use the recently introduced Guided Evolutionary Strategies (GuidedES) [12] and to treat the DRS's gradient as a surrogate. Using this combination, we demonstrate that it is possible to reduce the sample complexity of Evolutionary Strategies by 3x-5x when training directly on a real robot (Section 3.1). Furthermore, we show in simulation that even misleading gradients from a DRS can be utilized to speed up the convergence of Evolutionary Strategies (Section 3.2).
2 Evolutionary Strategies with Differentiable Robot Simulators
To unite DRS and Evolutionary Strategies, we propose to use an algorithm introduced in [12], Guided Evolutionary Strategies. This method can make use of any surrogate gradients that are correlated with the true gradient to accelerate the convergence of evolutionary strategies. The surrogate gradients can be corrupted or biased in any way; the only requirement is that they preserve a positive correlation with the true gradient. We presume that this is the case for gradients computed with a DRS and therefore propose to use them as surrogates in GuidedES.
The proposed approach is outlined in Algorithm 1. The key idea is to compute a surrogate gradient using a differentiable robot simulator. A history of the k previous surrogates is then stacked into an n × k matrix (where n is the dimensionality of the search space), and an orthonormal basis U is extracted from it. This basis is used to define the covariance matrix of the search distribution, Σ = (α/n)I + ((1 − α)/k)UUᵀ, from which the perturbations are generated. The hyperparameter α controls how strongly the perturbations are biased in the direction of the surrogate gradients.
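Under these definitions, the sampling and gradient-estimation steps can be sketched as follows (an illustrative implementation assuming antithetic sampling as in [12]; the function names and default hyperparameters are ours, not those of Algorithm 1):

```python
import numpy as np

def sample_guided_perturbation(U, alpha, sigma, rng):
    """Sample from N(0, sigma^2 * Sigma) with
    Sigma = (alpha/n) I + ((1 - alpha)/k) U U^T,
    where U is an n x k orthonormal basis of recent surrogate gradients."""
    n, k = U.shape
    full = np.sqrt(alpha / n) * rng.standard_normal(n)                # isotropic part
    guided = np.sqrt((1.0 - alpha) / k) * U @ rng.standard_normal(k)  # subspace part
    return sigma * (full + guided)

def guided_es_gradient(f, theta, U, alpha=0.5, sigma=0.1, pairs=8, rng=None):
    """Antithetic GuidedES estimate of the gradient of a scalar objective f."""
    rng = rng or np.random.default_rng(0)
    grad = np.zeros_like(theta)
    for _ in range(pairs):
        eps = sample_guided_perturbation(U, alpha, sigma, rng)
        grad += eps * (f(theta + eps) - f(theta - eps))
    return grad / (2.0 * sigma**2 * pairs)
```

The basis U can be obtained from a matrix G of stacked surrogate gradients via `U, _ = np.linalg.qr(G)`; biasing half of the perturbation variance into this subspace is what distinguishes the method from isotropic (vanilla) ES sampling.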
We note that there is no need to rely on the proposed zeroth-order optimization if one has access to the exact gradient of the objective function. However, there are cases where the gradient obtained with a DRS is not effective when used with first-order optimization methods. In the following section, we demonstrate two of them and show how one can benefit from the proposed approach to improve the sample efficiency of evolutionary strategies.
3 Experiments
3.1 Accelerated Learning on Real Robot
Evolutionary strategies and their variants have been observed to be on par with reinforcement learning algorithms [16], but in some cases require a larger number of episodes for training. Still, algorithms of this class have been successfully applied to train robots directly in the real world at the expense of higher experimental time [7]. Here, we expect that incorporating information from a DRS into the training process should accelerate the convergence of evolutionary strategies.
We use the approach described in Section 2, where the surrogate gradient is taken as the difference between the current parameters and the parameters obtained after a fixed number of optimization steps in simulation with the DRS. The ascent direction obtained with the DRS does not exactly match the ascent direction of the objective function that depends on the real world, but we expect it to be at least positively correlated. A detailed algorithm can be found in the appendix.¹

¹ Source code: https://github.com/vkurenkov/guidedesbydifferentiablesimulators
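The surrogate construction described above can be sketched as follows. This is a minimal illustration, assuming a hypothetical callback `sim_grad` that returns the DRS gradient of the simulated cost at the given parameters:

```python
import numpy as np

def drs_surrogate(theta, sim_grad, n_steps=10, lr=1e-2):
    """Surrogate gradient as a parameter difference: run a fixed number
    of first-order steps on the simulated cost, then take the
    displacement of the parameters as the surrogate descent direction.

    `sim_grad(theta)` is a hypothetical callback returning the DRS
    gradient of the simulated cost; step count and learning rate are
    illustrative, not tuned values."""
    theta_sim = theta.copy()
    for _ in range(n_steps):
        theta_sim -= lr * sim_grad(theta_sim)  # plain gradient descent in simulation
    # Since each step subtracts the gradient, this difference accumulates
    # the simulator's gradient directions.
    return theta - theta_sim
```

As long as the simulated cost landscape resembles the real one, this difference stays positively correlated with the true gradient, which is all GuidedES requires.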
Due to limited access to the experimental platform with physical robots, we only consider one problem, a swinging pendulum [3], where the goal is to reach and maintain constant swinging with an amplitude of 180 degrees. The cost function is defined as the sum of the differences between the current energy and the target energy at each timestep. The state space is represented by the last four measurements of position and velocity.
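This cost can be sketched as follows, assuming the standard simple-pendulum energy E = ½ml²ω² + mgl(1 − cos θ) with the target energy taken at the 180-degree position; the mass and length values are illustrative, not the parameters of the real robot:

```python
import numpy as np

def pendulum_cost(thetas, omegas, m=1.0, l=1.0, g=9.81):
    """Sum over timesteps of |E(t) - E_target| for a simple pendulum.

    thetas, omegas: arrays of joint angles (rad, 0 = hanging down)
    and angular velocities over an episode. The target energy is the
    potential energy at the upright (180-degree) position."""
    energy = 0.5 * m * l**2 * omegas**2 + m * g * l * (1.0 - np.cos(thetas))
    target = 2.0 * m * g * l  # energy at 180 degrees with zero velocity
    return float(np.sum(np.abs(energy - target)))
```

A trajectory that holds the pendulum's energy at the target level incurs zero cost, which is exactly the energy-based swing-up criterion of [3].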
We compare GuidedES with DRS against Vanilla Evolutionary Strategies (VanillaES) [16], and also against Covariance Matrix Adaptation Evolution Strategy (CMA-ES) [9] to include a method that adapts the covariance matrix during training. In Figure 1, we observe that the proposed method converges faster than both VanillaES and CMA-ES; moreover, its convergence is robust to random seeds, in contrast to the other algorithms considered. We note that VanillaES and CMA-ES are highly dependent on network initialization (to the extent of non-convergence within our experimental budget), which is not the case for GuidedES with DRS.
We find the results of this experiment encouraging, as they provide evidence in favor of leveraging DRS (and closed-loop transfer from simulation in general) for data-efficient evolutionary strategies. However, more sophisticated robots and problems should be probed further.
3.2 When DRS Gradients Are Misleading
To demonstrate that GuidedES with DRS can converge even when the gradients are not useful for first-order optimization methods, we rely on the Mass-Spring simulator depicted in Figure 4. [10] observed that a naive implementation of this simulator results in a gradient that cannot be used with stochastic gradient descent: the optimization process does not converge to a satisfactory solution, and a specific approach to collision handling (the time-of-impact fix) is necessary. In this setup, we want to probe how well GuidedES performs when provided with gradients computed by a naive version of the simulator; we therefore remove the fix proposed by [10].

We observe in Figure 2 that the proposed approach is able to find a satisfactory solution even when guided by the ineffective gradient. Moreover, convergence is on average faster than with simple evolutionary strategies. However, the variation between training iterations is quite high, though we believe it can be overcome with proper learning rate scheduling. On the other hand, the in-sample variation (within a training iteration but across multiple seeds) is low, suggesting that a satisfactory result is found quickly regardless of network initialization (which is not the case for VanillaES). The results of this experiment suggest that if one has access to a differentiable robot simulator with inaccurate backward propagation, one should not abandon it: it may still be used to reduce the sample complexity of evolutionary strategies, which are easier to tune than reinforcement learning algorithms.
4 Conclusion and Future Work
In this work, we proposed a natural way to combine Differentiable Robot Simulators and Evolutionary Strategies, and demonstrated two cases where such a combination can be beneficial in reducing the sample complexity of the latter. We find these results encouraging, especially for those interested in training robots directly in the real world. However, future work should include more robotic control problems to probe the viability of the proposed method in more intricate setups involving real and simulated robots.
Acknowledgments
The authors would like to thank Oleg Balakhnov and Sergei Savin for helpful discussions and help with the robot hardware.
References

[1] (2021) On the distance between two neural networks and the stability of learning. arXiv:2002.03432.
[2] (2018) Back to Basics: Benchmarking Canonical Evolution Strategies for Playing Atari. arXiv:1802.08842.
[3] (1995) Nonlinear control of a swinging pendulum. Automatica 31 (6), pp. 851-862.
[4] (2018) Improving Exploration in Evolution Strategies for Deep Reinforcement Learning via a Population of Novelty-Seeking Agents. In Advances in Neural Information Processing Systems, Vol. 31.
[5] (2018) End-to-End Differentiable Physics for Learning and Control. In Advances in Neural Information Processing Systems, Vol. 31.
[6] (2019) A Differentiable Physics Engine for Deep Learning in Robotics. Frontiers in Neurorobotics 13, pp. 6.
[7] (2000) Evolutionary Robotics: The Biology, Intelligence, and Technology of Self-Organizing Machines. Intelligent Robotics and Autonomous Agents Series, A Bradford Book, Cambridge, MA, USA.
[8] (2020) ADD: Analytically Differentiable Dynamics for Multi-Body Systems with Frictional Contact. ACM Transactions on Graphics 39 (6), pp. 190:1-190:15.
[9] (2003) Reducing the Time Complexity of the Derandomized Evolution Strategy with Covariance Matrix Adaptation (CMA-ES). Evolutionary Computation 11 (1), pp. 1-18.
[10] (2020) DiffTaichi: Differentiable Programming for Physical Simulation. arXiv:1910.00935.
[11] (2018) Evolution-Guided Policy Gradient in Reinforcement Learning. In Advances in Neural Information Processing Systems, Vol. 31.
[12] (2019) Guided Evolutionary Strategies: Augmenting Random Search with Surrogate Gradients. arXiv:1806.10230.
[13] (2018) Simple Random Search Provides a Competitive Approach to Reinforcement Learning. arXiv:1803.07055.
[14] (2019) CEM-RL: Combining Evolutionary and Gradient-Based Methods for Policy Search. arXiv:1810.01222.
[15] (2020) Scalable Differentiable Physics for Learning and Control. arXiv:2007.02168.
[16] (2017) Evolution Strategies as a Scalable Alternative to Reinforcement Learning. arXiv:1703.03864.
[17] (2020) Learning to Simulate Complex Physics with Graph Networks. In Proceedings of the 37th International Conference on Machine Learning, pp. 8459-8468.
5 Appendix
5.1 Hyperparameters
For the experiments on the real robot, we did an extensive hyperparameter search for each algorithm in simulation and used the best set found for training on the real robot. For the experiments in Section 3.2, we also did an extensive hyperparameter search and report the best curves for both algorithms.
We tried several optimization algorithms (SGD, Adam, RMSProp, AdaGrad, Fromage) and found Fromage [1] to perform most stably when DRS gradients are involved. We believe this is due to its inherent normalization of the gradients, which we observed to be a necessary additional step for all other optimization algorithms to converge with the DRS gradient.

5.2 Pseudocode for Accelerated Learning on Real Robot
5.3 Environments
5.3.1 Pendulum
Cost Function. As described in Section 3.1, the cost is the sum over timesteps of the absolute difference between the pendulum's current energy E(t) and the target energy E* corresponding to swinging at 180 degrees: C = Σ_t |E(t) − E*|.
5.3.2 Mass-Spring
The environment is depicted in Figure 4; the objective function is defined as the difference between the center of mass positions at the last and first timesteps, projected onto the horizontal axis.
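A minimal sketch of this objective, assuming 2D particle positions and reading the projection as the horizontal axis (the axis symbol is missing in the original text):

```python
import numpy as np

def mass_spring_objective(positions):
    """Horizontal displacement of the center of mass over an episode.

    positions: array of shape (timesteps, n_particles, 2), holding the
    2D position of each mass point at each timestep."""
    com = positions.mean(axis=1)          # center of mass at each timestep
    return float(com[-1, 0] - com[0, 0])  # last minus first, x component
```

Maximizing this value rewards controllers that move the whole mass-spring structure forward, which is the locomotion objective used for this simulator in [10].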