1 Introduction
Recent developments in Deep Reinforcement Learning (DRL) have shown the potential to learn complex robotic controllers automatically, with minimal human intervention. However, due to the high sample complexity of DRL algorithms, directly training control policies on hardware remains largely impractical for agile tasks such as locomotion.
A promising direction to address this issue is transfer learning, which learns a model in a source environment and transfers it to a target environment of interest. In the context of learning robotic control policies, we can consider the real world the target environment and the computer simulation the source environment. Learning in a simulated environment provides a safe and efficient way to explore a large variety of situations that a real robot might encounter. However, due to the model discrepancy between physics simulation and the real-world environment, also known as the Reality Gap (Boeing & Bräunl, 2012; Koos et al., 2010), the trained policy usually fails in the target environment. Efforts have been made to analyze the cause of the Reality Gap (Neunert et al., 2017) and to develop more accurate computer simulation (Tan et al., 2018) to improve the performance of a policy when transferring it to real hardware. Orthogonal to improving the fidelity of the physics simulation, researchers have also attempted to cross the reality gap by training more capable policies that succeed in a large variety of simulated environments. Our method falls into the second category.
To develop a policy capable of performing in various environments with different governing dynamics, one can train either a robust policy or an adaptive policy. In both cases, the policy is trained in environments with randomized dynamics. A robust policy is trained under a range of dynamics without identifying the specific dynamic parameters. Such a policy can only perform well if the simulation is a good approximation of the real-world dynamics. In addition, for more agile motor skills, robust policies may appear over-conservative due to the uncertainty in the training environments. On the other hand, an adaptive policy learns to first identify, implicitly or explicitly, the dynamics of its environment, and then selects the best action according to the identified dynamics. Being able to act differently according to the dynamics allows an adaptive policy to achieve higher performance on a larger range of dynamic systems. However, when the target dynamics are notably different from the training dynamics, it may still produce suboptimal results, for two reasons. First, when a sequence of novel observations is presented, the learned identification model in an adaptive policy may produce inaccurate estimates. Second, even when the identification model is perfect, the corresponding action may not be optimal for the new situation.
In this work, we introduce a new method that enjoys the versatility of an adaptive policy while avoiding the challenges of system identification. Instead of relating observations in the target environment to similar experiences in the training environment, our method searches for the best policy directly based on task performance in the target environment.
Our algorithm can be divided into two stages. The first stage trains a family of policies, each optimized for a particular vector of dynamic parameters. The family of policies can be parameterized by the dynamic parameters in a continuous representation. Each member of the family, referred to as a strategy, is a policy associated with particular dynamic parameters. Using a locomotion controller as an example, a strategy associated with a low friction coefficient may exhibit a cautious walking motion, while a strategy associated with a high friction coefficient may result in a more aggressive running motion. In the second stage, we perform a search over the strategies in the target environment to find the one that achieves the highest task performance.

We evaluate our method on three examples that demonstrate transfer of a policy learned in one simulator, DART, to another simulator, MuJoCo. Due to the differences in their constraint solvers, these simulators can produce notably different simulation results. A more detailed description of the differences between DART and MuJoCo is provided in Appendix A. We also add latency to the MuJoCo environment to mimic a real-world scenario, which further increases the difficulty of the transfer. In addition, we use a quadruped robot simulated in Bullet to demonstrate that our method can overcome actuator modeling errors. Latency and actuator modeling have been found to be important for sim-to-real transfer of locomotion policies (Tan et al., 2018; Neunert et al., 2017). Finally, we transfer a policy learned for a robot composed of rigid bodies to a robot whose end-effector is deformable, demonstrating the possibility of using our method to transfer to problems that are challenging to model faithfully.
2 Related Work
While DRL has demonstrated its ability to learn control policies for complex and dynamic motor skills in simulation (Schulman et al., 2015, 2017; Peng et al., 2018a, 2017; Yu et al., 2018; Heess et al., 2017), very few learning algorithms have successfully transferred these policies to the real world. Researchers have proposed to address this issue by optimizing or learning a simulation model using data from the real world (Tan et al., 2018, 2016; Deisenroth & Rasmussen, 2011; Ha & Yamane, 2015; Abbeel & Ng, 2005). The main drawback of these methods is that, for highly agile and high-dimensional control problems, fitting an accurate dynamics model can be challenging and data-inefficient.
Complementary to learning an accurate simulation model, a different line of research in sim-to-real transfer is to learn policies that work under a large variety of simulated environments. One common approach is domain randomization. Training a robust policy with domain randomization has been shown to improve the ability to transfer a policy (Tan et al., 2018; Tobin et al., 2017; Rajeswaran et al., 2016; Pinto et al., 2017). Tobin et al. (2017) trained an object detector with randomized appearance and applied it in a real-world gripping task. Tan et al. (2018) showed that training a robust policy with randomized dynamic parameters is crucial for transferring quadruped locomotion to the real world. Designing the parameters and ranges of the domain to be randomized requires task-specific knowledge. If the range is set too wide, the policy may learn a conservative strategy or fail to learn the task, while a range that is too narrow may not provide enough variation for the policy to transfer to the real world.
A similar idea is to train an adaptive policy that takes the current and past observations as input. Such an adaptive policy is able to identify the dynamic parameters online, either implicitly (OpenAI et al., 2018; Peng et al., 2018b) or explicitly (Yu et al., 2017), and apply actions appropriate for different system dynamics. Recently, adaptive policies have been used for sim-to-real transfer, such as in-hand manipulation tasks (OpenAI et al., 2018) and non-prehensile manipulation tasks (Peng et al., 2018b). Instead of training one robust or adaptive policy, Zhang et al. (2018) trained multiple policies for a set of randomized environments and learned to combine them linearly in a separate set of environments. The main advantage of these methods is that they can be trained entirely in simulation and deployed in the real world without further fine-tuning. However, policies trained in simulation may not generalize well when the discrepancy between the target environment and the simulation is too large. Our method also uses dynamics randomization to train policies that exhibit different strategies for different dynamics; however, instead of relying on the simulation to learn an identification model for selecting the strategy, we directly optimize the strategy in the target environment.
3 Background
We formulate the motor skill learning problem as a Markov Decision Process (MDP), $(\mathcal{S}, \mathcal{A}, r, P, p_0, \gamma)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $r: \mathcal{S} \times \mathcal{A} \mapsto \mathbb{R}$ is the reward function, $P: \mathcal{S} \times \mathcal{A} \mapsto \mathcal{S}$ is the transition function, $p_0$ is the initial state distribution and $\gamma$ is the discount factor. The goal of reinforcement learning is to find a control policy $\pi: \mathcal{S} \mapsto \mathcal{A}$ that maximizes the expected accumulated reward $J(\pi) = \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{T} \gamma^{t} r(s_t, a_t)\right]$, where $\tau = (s_0, a_0, \dots, s_T)$ is a trajectory with $s_0 \sim p_0$ and $a_t \sim \pi(s_t)$. In practice, we usually only have access to an observation of the robot that contains partial information about the robot's state. In this case, we have a Partially Observable Markov Decision Process (POMDP) and the policy becomes $\pi: \mathcal{O} \mapsto \mathcal{A}$, where $\mathcal{O}$ is the observation space.

In the context of transfer learning, we can define a source MDP $\mathcal{M}_s$ and a target MDP $\mathcal{M}_t$, and the goal is to learn a policy for $\mathcal{M}_s$ such that it also works well on $\mathcal{M}_t$. In this work, the transition function $P$ is regarded as a parameterized space of transition functions $P_{\mu}$, where $\mu$ is a vector of physical parameters defining the dynamic model (e.g., friction coefficient). Transfer learning in this context learns a policy under the family $P_{\mu}$ and transfers it to a target transition function $P^{\star}$, where $P^{\star}$ is not necessarily in the parameterized space $\{P_{\mu}\}$.
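To make the notation concrete, the toy sketch below (hypothetical code, not the paper's implementation) shows an environment whose transition function depends on a dynamics vector $\mu$, while the agent only receives a partial observation that never includes $\mu$:

```python
class ParamEnv:
    """Toy POMDP whose transition function P_mu depends on a dynamics
    vector mu (a hypothetical stand-in for friction, mass, etc.)."""

    def __init__(self, mu):
        self.mu = mu            # physical parameters defining P_mu
        self.state = 0.0

    def reset(self):
        self.state = 0.0
        return self.observe()

    def observe(self):
        # Partial observation: the agent sees the state but never mu.
        return self.state

    def step(self, action):
        # The same action yields different next states under different mu.
        friction = self.mu[0]
        self.state += (1.0 - friction) * action
        reward = -abs(self.state - 1.0)   # toy task: drive the state to 1.0
        return self.observe(), reward
```

The same policy executed under two different $\mu$ vectors visits different states, which is precisely the discrepancy the reality gap introduces.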
4 Methods
We propose a new method for transferring a policy learned in a simulated environment to a target environment with unknown dynamics. Our algorithm consists of two stages: learning a family of policies and optimizing the strategy.
4.1 Learning a Family of Policies
The first stage of our method is to learn a family of policies, each for a particular dynamics $P_{\mu}$. One can potentially train each policy individually and interpolate them to cover the space of $\mu$ (Stulp et al., 2013; Da Silva et al., 2012). However, as the dimension of $\mu$ increases, the number of policies required for interpolation grows exponentially.

Since many of these policies are trained under similar dynamics, our method merges them into one neural network and trains the entire family of policies simultaneously. We follow the work by Yu et al. (2017), which trains a policy that takes as input not only the observation of the robot $o$, but also the physical parameters $\mu$. At the beginning of each rollout during training, we randomly pick a new set of physical parameters for the simulation and fix it throughout the rollout. After training the policy this way, we obtain a family of policies parameterized by the dynamics parameters $\mu$. Given a particular $\mu$, we define the corresponding policy as $\pi_{\mu}$. We call such an instantiated policy a strategy.

4.2 Optimizing Strategy
The second stage of our method is to search for the optimal strategy in the space of $\mu$ for the target environment. Previous work learns a mapping between the experiences under source dynamics and the corresponding $\mu$. When new experiences are generated in the target environment, this mapping identifies a $\mu$ based on similar experiences previously generated in the source environment. While using experience similarity as a metric to identify $\mu$ transfers well to a target environment that has the same dynamic parameter space (Yu et al., 2017), it does not generalize well when the dynamic parameter space is different.
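Both the identification-based approach above and our strategy search rely on the policy family trained in Section 4.1. In sketch form, that training loop resamples $\mu$ once per rollout and conditions the policy on it; `policy`, `env_factory` and the parameter `ranges` below are placeholders, not the authors' implementation:

```python
import random

def sample_dynamics(ranges):
    """Draw one mu uniformly from per-parameter ranges; it stays fixed
    for the whole rollout, as in Section 4.1."""
    return [random.uniform(lo, hi) for lo, hi in ranges]

def collect_rollout(env_factory, policy, ranges, horizon=1000):
    """One training rollout for the policy family pi(a | o, mu)."""
    mu = sample_dynamics(ranges)          # new dynamics for this rollout
    env = env_factory(mu)
    obs = env.reset()
    total_reward = 0.0
    for _ in range(horizon):
        action = policy(obs, mu)          # the policy is conditioned on mu
        obs, reward = env.step(action)
        total_reward += reward
    return total_reward
```

Fixing a particular value of `mu` in `policy(obs, mu)` instantiates one strategy $\pi_{\mu}$ from the trained family.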
Since our goal is to find a strategy that works well in the target environment, a more direct approach is to use the performance of the task, i.e. the accumulated reward, in the target environment as the metric to search for the strategy:
$$\mu^{*} = \arg\max_{\mu} \; \mathbb{E}\left[\sum_{t=0}^{T} \gamma^{t} r(s_t, a_t)\right], \quad a_t \sim \pi_{\mu}(o_t), \qquad (1)$$

where the expectation is taken over rollouts generated in the target environment.
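This search can be illustrated with a simplified evolution-strategy loop over $\mu$. The sketch below only adapts the mean and a scalar step size (the full method described later uses CMA-ES with covariance adaptation); it is illustrative, not the authors' implementation:

```python
import random

def optimize_strategy(fitness, mu0, sigma=0.5, popsize=8, iters=30):
    """Search the dynamics-parameter space for the best strategy.

    fitness(mu) should return the accumulated reward of rollouts of the
    strategy pi_mu in the target environment (the objective of Eq. 1).
    """
    mean = list(mu0)
    for _ in range(iters):
        # Sample candidate strategies from a Gaussian over mu.
        samples = [[m + sigma * random.gauss(0.0, 1.0) for m in mean]
                   for _ in range(popsize)]
        # Rank candidates by their return in the target environment.
        samples.sort(key=fitness, reverse=True)
        elites = samples[:popsize // 2]
        # Move the search distribution toward the best candidates.
        mean = [sum(e[i] for e in elites) / len(elites)
                for i in range(len(mean))]
        sigma *= 0.95                    # crude step-size decay
    return mean
```

Because each fitness evaluation costs rollouts in the target environment, the choice of optimizer directly determines the sample budget of the second stage.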
Solving Equation 1 can be done efficiently because the search space in Equation 1 is the space of dynamic parameters $\mu$, rather than the space of policies, which are represented as neural networks in our implementation. To further reduce the number of samples from the target environment needed for solving Equation 1, we investigated a number of algorithms, including Bayesian optimization, model-based methods and an evolutionary algorithm (CMA). A detailed description and comparison of these methods are provided in Appendix C. We chose Covariance Matrix Adaptation (CMA) (Hansen et al., 1995) because it reliably outperforms the other methods in terms of sample efficiency. At each iteration of CMA, a set of samples is drawn from a Gaussian distribution over the space of $\mu$. For each sample, we instantiate a strategy and use it to generate rollouts in the target environment. The fitness of each sample is determined by evaluating its rollouts with the objective in Equation 1. Based on the fitness values of the samples in the current iteration, the mean and covariance matrix of the Gaussian distribution are updated for the next iteration.

5 Experiments
To evaluate the ability of our method to overcome the reality gap, we train policies for four locomotion control tasks (hopper, walker2d, half cheetah, quadruped robot) and transfer each policy to environments with different dynamics. To mimic the reality gap seen in the real world, we use target environments that differ from the source environments in their contact modeling, latency or actuator modeling. In addition, we also test the ability of our method to generalize to discrepancies in body mass, terrain slope and end-effector materials. Figure 1 shows the source and target environments for all the tasks and summarizes the modeled reality gap in each task. During training, we choose different combinations of dynamic parameters to randomize and make sure they do not overlap with the variations in the testing environments. For clarity of exposition, we denote the dimension of the dynamic parameters randomized during training as $\dim(\mu)$. For all examples, we use Proximal Policy Optimization (PPO) (Schulman et al., 2017) to optimize the control policy. We run three trials with different random seeds for each experiment and report the mean and one standard deviation of the average return. A more detailed description of the experiment setup as well as the simulated reality gaps is provided in Appendix B.

5.1 Baseline Methods
We compare our method, Strategy Optimization with CMA-ES (SO-CMA), to two baseline methods: training a robust policy (Robust) and training an adaptive policy (Adapt). The robust policy is represented as a feed-forward neural network that takes as input the most recent observation from the robot. The policy needs to learn actions that work for all the training environments, but the dynamic parameters cannot be identified from its input. In contrast, an adaptive policy is given a history of observations as input. This allows the policy to potentially identify the environment being tested and adaptively choose actions based on the identified environment. There are many possible ways to train an adaptive policy; for example, one can use an LSTM network to represent the policy or feed a history of observations to a feed-forward network. We find that for the tasks we demonstrate, directly training an LSTM policy using PPO is much less efficient and reaches lower end performance than training a feed-forward network with history input. Therefore, in our experiments we use a feed-forward network with a history of observations to represent the adaptive policy. For fair comparison, we continue to train the baseline methods after transferring to the target environment, using the same amount of samples SO-CMA consumes in the target environment. We refer to this additional training step as "fine-tuning" and detail the process in Appendix B.4.

5.2 Hopper DART to MuJoCo
In the first example, we build a single-legged robot in DART similar to the Hopper environment simulated by MuJoCo in OpenAI Gym (Brockman et al., 2016). We investigate two questions in this example: 1) does SO-CMA work better than alternative methods in transferring to unknown environments? and 2) how does the choice of $\dim(\mu)$ affect the performance of policy transfer? To this end, we perform experiments with three settings of $\dim(\mu)$. For the smallest setting, we randomize the mass of the robot's foot and the restitution coefficient between the foot and the ground. For the intermediate setting, we additionally randomize the friction coefficient, the mass of the robot's torso and the joint strength of the robot. For the largest setting, we further include the masses of the remaining two body parts and the joint damping. The specific ranges of randomization are described in Appendix B.4.
We first evaluate how the performance of the different methods varies with the number of samples in the target environment. As shown in Figure 2, when $\dim(\mu)$ is low, none of the three methods transfer to the MuJoCo Hopper successfully. This is possibly because there is not enough variation in the dynamics to learn diverse strategies. With the intermediate $\dim(\mu)$, SO-CMA successfully transfers the policy to the MuJoCo Hopper with good performance, while the baseline methods are not able to adapt to the new environment with the same sample budget. We further increase $\dim(\mu)$, as shown in Figure 2 (c), and find that SO-CMA achieves similar end performance, while the baselines still do not transfer well to the target environment.
We further investigate whether SO-CMA can generalize to differences in joint limits in addition to the discrepancies between DART and MuJoCo. Specifically, we vary the magnitude of the ankle joint limit of the MuJoCo Hopper and run all the methods with the same sample budget. The results can be found in Figure 3. We see a similar trend: with low $\dim(\mu)$ the transfer is challenging, while with higher $\dim(\mu)$ SO-CMA achieves notably better transfer performance than the baseline methods.
5.3 Walker2d DART to MuJoCo with latency
In this example, we use the lower body of a biped robot constrained to a plane, following the Walker2d environment in OpenAI Gym. We find that with different initializations of the policy network, training can lead to drastically different gaits, e.g. hopping with both legs, running with one leg dragging the other, or normal running. Some of these gaits are more robust to environment changes than others, which makes analyzing the performance of transfer learning algorithms challenging. To make the policies more comparable, we use the symmetry loss from Yu et al. (2018), which leads all policies to learn a symmetric running gait. To mimic modeling errors seen on real robots, we add latency to the MuJoCo simulator. During training, we randomize the friction coefficient, the restitution coefficient and the damping of the six joints. Figure 4 (a) shows the transfer performance of the different methods with respect to the number of samples in the target environment.
In this example, training a robust policy and fine-tuning it in the target environment achieves performance competitive with SO-CMA, while the adaptive policy does not transfer well.
We further vary the mass of the robot's right foot in the MuJoCo Walker2d environment and compare the transfer performance of SO-CMA to the baselines. All methods use the same total number of samples in the target environment; the results can be found in Figure 4 (b). SO-CMA performs marginally better than the robust policy, and notably better than the adaptive policy.
5.4 HalfCheetah DART to MuJoCo with delay
In the third example, we train policies for the HalfCheetah environment from OpenAI Gym. We again test transfer from DART to MuJoCo. In addition, we add latency to the target environment. During training, we randomize the masses of all body parts, the friction coefficient and the restitution coefficient in the source environment. The performance with respect to the number of samples in the target environment can be found in Figure 5 (a). We additionally evaluate transfer to environments where the slope of the ground varies, as shown in Figure 5 (b). In both settings, SO-CMA works best among the three methods.
5.5 Quadruped robot with actuator modeling error
As demonstrated by Tan et al. (2018), when a robust policy is used, having an accurate actuator model is important for successfully transferring a policy from simulation to the real world for the quadruped robot Minitaur (Figure 1 (d)). Specifically, they found that when a linear torque-current relation is assumed in the simulated actuator dynamics, the policy learned in simulation transfers poorly to the real hardware. When the actuator dynamics are modeled more accurately, in their case using a nonlinear torque-current relation, the transfer performance is notably improved.
In our experiment, we investigate whether SO-CMA is able to overcome errors in the actuator model. We use the same simulation environment as Tan et al. (2018), which is simulated in Bullet (Coumans & Bai, 2016–2017). During training of the policy, we use a linear torque-current relation for the actuator model, and we transfer the learned policy to an environment with the more accurate nonlinear torque-current relation. We use the same dynamic parameters and corresponding ranges used by Tan et al. (2018) for dynamics randomization during training. The results show that SO-CMA can successfully transfer a policy trained with a crude actuator model to an environment with more realistic actuators (Figure 6 (a)).
5.6 Hopper rigid to deformable foot
Applying deep reinforcement learning to environments with deformable objects can be computationally inefficient (Clegg et al., 2018). Being able to transfer a policy trained in a purely rigid-body environment to an environment containing deformable objects can greatly improve the efficiency of learning. In our last example, we transfer a policy trained for the Hopper example with rigid objects only to a Hopper model with a deformable foot (Figure 1 (e)). The soft foot is modeled using the soft shape in DART, which provides an approximate but relatively efficient way of modeling deformable objects (Jain & Liu, 2011). We train policies in the rigid Hopper environment, randomizing the same set of dynamic parameters as in the DART-to-MuJoCo transfer example. We then transfer the learned policy to the soft Hopper environment, where the Hopper's foot is deformable. The results can be found in Figure 6 (b). SO-CMA successfully controls the robot to move forward without falling, while the baseline methods fail to do so.
6 Discussion and Conclusion
We have proposed a policy transfer algorithm where we first learn a family of policies simultaneously in a source environment, each exhibiting a different behavior, and then search directly for the policy in the family that performs best in the target environment. We show that our method can overcome large modeling errors, including those commonly seen on real robotic platforms, with a relatively small number of samples in the target environment. For all the examples in this work, we used the same reward function in the source and target environments. In practice, one may only have access to a sparse reward function in the target environment, e.g. the distance travelled before falling to the ground. Because it uses an evolutionary algorithm (CMA), our method naturally handles sparse rewards, so the performance gap between our method (SO-CMA) and the baseline methods will likely grow if a sparse reward is used. These results suggest that our method has the potential to transfer policies trained in simulation to real hardware.
There are a few interesting directions that merit further investigation. First, it would be interesting to explore other approaches for learning a family of policies that exhibit different behaviors. One such example is the method proposed by Eysenbach et al. (2018), where an agent learns diverse skills without a reward function in an unsupervised manner. Equipping our policy with memory is another interesting direction to investigate; the addition of memory would extend our method to target environments that vary over time. We investigated a few options for strategy optimization and found that CMA-ES works well for our examples. However, it would be desirable to further reduce the number of samples required in the target environment. One possible direction is to warm-start the optimization using models learned in simulation, such as the calibration model in Zhang et al. (2018) or the online system identification model in Yu et al. (2017).
References
 (1) Pycma. URL https://github.com/CMA-ES/pycma.
 Abbeel & Ng (2005) Pieter Abbeel and Andrew Y. Ng. Exploration and apprenticeship learning in reinforcement learning. In International Conference on Machine Learning, pp. 1–8, 2005.
 Boeing & Bräunl (2012) Adrian Boeing and Thomas Bräunl. Leveraging multiple simulators for crossing the reality gap. In Control Automation Robotics & Vision (ICARCV), 2012 12th International Conference on, pp. 1113–1119. IEEE, 2012.
 Brockman et al. (2016) Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. Openai gym. arXiv preprint arXiv:1606.01540, 2016.
 Clegg et al. (2018) Alexander Clegg, Wenhao Yu, Jie Tan, C. Karen Liu, and Greg Turk. Learning to dress: Synthesizing human dressing motion via deep reinforcement learning. ACM Transactions on Graphics (TOG), 37(6), 2018.
 Coumans & Bai (2016–2017) Erwin Coumans and Yunfei Bai. PyBullet, a Python module for physics simulation in robotics, games and machine learning, 2016–2017. URL http://pybullet.org.
 Da Silva et al. (2012) Bruno Da Silva, George Konidaris, and Andrew Barto. Learning parameterized skills. arXiv preprint arXiv:1206.6398, 2012.
 Deisenroth & Rasmussen (2011) Marc Deisenroth and Carl E Rasmussen. PILCO: A model-based and data-efficient approach to policy search. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 465–472, 2011.
 Dhariwal et al. (2017) Prafulla Dhariwal, Christopher Hesse, Matthias Plappert, Alec Radford, John Schulman, Szymon Sidor, and Yuhuai Wu. Openai baselines. https://github.com/openai/baselines, 2017.
 Eysenbach et al. (2018) Benjamin Eysenbach, Abhishek Gupta, Julian Ibarz, and Sergey Levine. Diversity is all you need: Learning skills without a reward function. arXiv preprint arXiv:1802.06070, 2018.
 Ha (2016) Sehoon Ha. Pydart2, 2016. URL https://github.com/sehoonha/pydart2.
 Ha & Yamane (2015) Sehoon Ha and Katsu Yamane. Reducing Hardware Experiments for Model Learning and Policy Optimization. IROS, 2015.
 Hansen et al. (1995) Nikolaus Hansen, Andreas Ostermeier, and Andreas Gawelczyk. On the adaptation of arbitrary normal mutation distributions in evolution strategies: The generating set adaptation. In ICGA, pp. 57–64, 1995.
 Heess et al. (2017) Nicolas Heess, Srinivasan Sriram, Jay Lemmon, Josh Merel, Greg Wayne, Yuval Tassa, Tom Erez, Ziyu Wang, Ali Eslami, Martin Riedmiller, et al. Emergence of locomotion behaviours in rich environments. arXiv preprint arXiv:1707.02286, 2017.
 Hochreiter & Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. Long shortterm memory. Neural computation, 9(8):1735–1780, 1997.
 Jain & Liu (2011) Sumit Jain and C Karen Liu. Controlling physicsbased characters using soft contacts. ACM Transactions on Graphics (TOG), 30(6):163, 2011.

 Koos et al. (2010) Sylvain Koos, Jean-Baptiste Mouret, and Stéphane Doncieux. Crossing the reality gap in evolutionary robotics by promoting transferable controllers. In Proceedings of the 12th Annual Conference on Genetic and Evolutionary Computation, pp. 119–126. ACM, 2010.
 Lee et al. (2018) Jeongseok Lee, Michael X Grey, Sehoon Ha, Tobias Kunz, Sumit Jain, Yuting Ye, Siddhartha S Srinivasa, Mike Stilman, and C Karen Liu. DART: Dynamic Animation and Robotics Toolkit. The Journal of Open Source Software, 3(22):500, 2018.
 Neunert et al. (2017) Michael Neunert, Thiago Boaventura, and Jonas Buchli. Why off-the-shelf physics simulators fail in evaluating feedback controller performance: a case study for quadrupedal robots. In Advances in Cooperative Robotics, pp. 464–472. World Scientific, 2017.
 OpenAI et al. (2018) OpenAI, M. Andrychowicz, B. Baker, M. Chociej, R. Jozefowicz, B. McGrew, J. Pachocki, A. Petron, M. Plappert, G. Powell, A. Ray, J. Schneider, S. Sidor, J. Tobin, P. Welinder, L. Weng, and W. Zaremba. Learning Dexterous In-Hand Manipulation. arXiv e-prints, August 2018.
 Peng et al. (2017) Xue Bin Peng, Glen Berseth, KangKang Yin, and Michiel Van De Panne. Deeploco: Dynamic locomotion skills using hierarchical deep reinforcement learning. ACM Transactions on Graphics (TOG), 36(4):41, 2017.
 Peng et al. (2018a) Xue Bin Peng, Pieter Abbeel, Sergey Levine, and Michiel van de Panne. Deepmimic: Exampleguided deep reinforcement learning of physicsbased character skills. arXiv preprint arXiv:1804.02717, 2018a.
 Peng et al. (2018b) Xue Bin Peng, Marcin Andrychowicz, Wojciech Zaremba, and Pieter Abbeel. Sim-to-real transfer of robotic control with dynamics randomization. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 1–8. IEEE, 2018b.
 Pinto et al. (2017) Lerrel Pinto, James Davidson, Rahul Sukthankar, and Abhinav Gupta. Robust adversarial reinforcement learning. arXiv preprint arXiv:1703.02702, 2017.
 Rajeswaran et al. (2016) Aravind Rajeswaran, Sarvjeet Ghotra, Balaraman Ravindran, and Sergey Levine. Epopt: Learning robust neural network policies using model ensembles. arXiv preprint arXiv:1610.01283, 2016.
 Schulman et al. (2015) John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International Conference on Machine Learning, pp. 1889–1897, 2015.
 Schulman et al. (2017) John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
 Stulp et al. (2013) Freek Stulp, Gennaro Raiola, Antoine Hoarau, Serena Ivaldi, and Olivier Sigaud. Learning compact parameterized skills with a single regression. parameters, 5:9, 2013.
 (29) Jie Tan, Kristin Siu, and C Karen Liu. Contact handling for articulated rigid bodies using LCP.
 Tan et al. (2016) Jie Tan, Zhaoming Xie, Byron Boots, and C Karen Liu. Simulationbased design of dynamic controllers for humanoid balancing. In Intelligent Robots and Systems (IROS), 2016 IEEE/RSJ International Conference on, pp. 2729–2736. IEEE, 2016.
 Tan et al. (2018) Jie Tan, Tingnan Zhang, Erwin Coumans, Atil Iscen, Yunfei Bai, Danijar Hafner, Steven Bohez, and Vincent Vanhoucke. Simtoreal: Learning agile locomotion for quadruped robots. In Proceedings of Robotics: Science and Systems, Pittsburgh, Pennsylvania, June 2018. doi: 10.15607/RSS.2018.XIV.010.
 Tobin et al. (2017) Josh Tobin, Rachel Fong, Alex Ray, Jonas Schneider, Wojciech Zaremba, and Pieter Abbeel. Domain randomization for transferring deep neural networks from simulation to the real world. In Intelligent Robots and Systems (IROS), 2017 IEEE/RSJ International Conference on, pp. 23–30. IEEE, 2017.
 Todorov et al. (2012) Emanuel Todorov, Tom Erez, and Yuval Tassa. Mujoco: A physics engine for modelbased control. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on, pp. 5026–5033. IEEE, 2012.
 Wulfmeier et al. (2017) Markus Wulfmeier, Ingmar Posner, and Pieter Abbeel. Mutual alignment transfer learning. arXiv preprint arXiv:1707.07907, 2017.
 Yu & Liu (2017) Wenhao Yu and C. Karen Liu. DartEnv, 2017. URL https://github.com/DartEnv/dart-env.
 Yu et al. (2017) Wenhao Yu, Jie Tan, C. Karen Liu, and Greg Turk. Preparing for the unknown: Learning a universal policy with online system identification. In Proceedings of Robotics: Science and Systems, Cambridge, Massachusetts, July 2017. doi: 10.15607/RSS.2017.XIII.048.
 Yu et al. (2018) Wenhao Yu, Greg Turk, and C. Karen Liu. Learning symmetric and low-energy locomotion. ACM Transactions on Graphics (Proc. SIGGRAPH 2018), 37(4), 2018.
 Zhang et al. (2018) Chao Zhang, Yang Yu, and ZhiHua Zhou. Learning environmental calibration actions for policy selfevolution. In IJCAI, pp. 3061–3067, 2018.
Appendix A Differences between DART and MuJoCo
DART (Lee et al., 2018) and MuJoCo (Todorov et al., 2012) are both physics-based simulators that compute how the state of a virtual character or robot evolves over time and interacts physically with other objects. Both have been used to transfer controllers learned for a simulated robot to real hardware (Tan et al., 2018; 2016), and there has been work on transferring policies between DART and MuJoCo (Wulfmeier et al., 2017). The two simulators are similar in many aspects; for example, both use generalized coordinates to represent the state of a robot. Despite these similarities, a few important differences make transferring a policy trained in one simulator to the other challenging. For the examples of DART-to-MuJoCo transfer presented in this paper, there are three major differences, described below:

Contact Handling
Contact modeling is important for robotic control applications, especially for locomotion tasks, where robots rely heavily on contacts between the end-effectors and the ground to move forward. In DART, contacts are handled by solving a linear complementarity problem (LCP) (Tan et al.), which ensures that in the next timestep the objects do not penetrate each other while satisfying the laws of physics. In MuJoCo, contact dynamics is modeled using a complementarity-free formulation, which means objects may penetrate each other; the resulting impulse grows with the penetration depth and eventually separates the penetrating objects.
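The difference between the two formulations can be caricatured in one dimension. The sketch below is not either engine's actual solver; the stiffness and damping constants are invented for illustration, and the hard-contact impulse ignores friction:

```python
import numpy as np

def mujoco_style_contact_force(depth, velocity, stiffness=1e4, damping=1e2):
    """Complementarity-free (soft) contact: a spring-damper restoring force
    that grows with penetration depth, so some interpenetration is allowed."""
    if depth <= 0.0:                       # no penetration, no force
        return 0.0
    return stiffness * depth - damping * velocity

def dart_style_contact_impulse(mass, normal_velocity):
    """LCP-style hard contact (1-D, frictionless sketch): the impulse that
    makes the normal velocity exactly non-negative in the next step, i.e.
    impulse >= 0, v_next >= 0, impulse * v_next = 0."""
    return max(0.0, -mass * normal_velocity)
```

Note that the hard-contact impulse is zero whenever the body is already separating, mirroring the complementarity condition, while the soft model instead tolerates a small, force-dependent penetration.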

Joint Limits
Similar to its contact solver, DART solves the joint limit constraints exactly so that joint limits are not violated in the next timestep, while MuJoCo uses a soft constraint formulation, meaning the character may temporarily violate its joint limits.
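A one-dimensional sketch of the two behaviors (the stiffness value is illustrative, not MuJoCo's actual solver parameter):

```python
def soft_limit_torque(q, q_min, q_max, stiffness=100.0):
    """MuJoCo-style soft joint limit: a corrective torque proportional to
    the violation, so the joint can briefly exceed its limit."""
    if q > q_max:
        return -stiffness * (q - q_max)
    if q < q_min:
        return stiffness * (q_min - q)
    return 0.0

def hard_limit_clamp(q, q_min, q_max):
    """DART-style hard limit (sketch): the constraint solver guarantees the
    joint position never leaves [q_min, q_max] in the next timestep."""
    return min(max(q, q_min), q_max)
```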

Armature
In MuJoCo, a diagonal matrix aI is added to the joint-space inertia matrix to help stabilize the simulation, where a is a scalar named Armature in MuJoCo and I is the identity matrix. This is not modeled in DART.
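In matrix form, the armature term is simply a scaled identity added to the inertia matrix; a minimal sketch:

```python
import numpy as np

def add_armature(M, armature):
    """MuJoCo's Armature term: M' = M + a * I, where M is the joint-space
    inertia matrix and a is the scalar armature value."""
    n = M.shape[0]
    return M + armature * np.eye(n)
```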
To illustrate how much difference these simulator characteristics can cause, we compare the Hopper example in DART and MuJoCo by simulating both with the same sequence of randomly generated actions from an identical initial state. We plot the linear position and velocity of the torso and foot of the robot in Figure 7. Due to the differences in the dynamics, the two simulators drive the robot to notably different states even though the initial state and control signals are identical.
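This comparison protocol can be sketched generically: feed an identical random action sequence to two simulator step functions from the same initial state and record how the states diverge. The step functions here are placeholders standing in for DART and MuJoCo:

```python
import numpy as np

def compare_simulators(step_a, step_b, init_state, horizon=200, act_dim=3, seed=0):
    """Replay one random action sequence through two step functions
    (state, action) -> state and record the state gap over time."""
    rng = np.random.default_rng(seed)
    actions = rng.uniform(-1.0, 1.0, (horizon, act_dim))
    s_a = np.array(init_state, dtype=float)
    s_b = np.array(init_state, dtype=float)
    gap = []
    for a in actions:
        s_a, s_b = step_a(s_a, a), step_b(s_b, a)
        gap.append(np.linalg.norm(s_a - s_b))
    return np.array(gap)
```

Two identical step functions produce a gap of exactly zero; even a small dynamics mismatch accumulates into a growing gap over the rollout.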
Table 1: Observation space, action space, and reward function for each environment (Hopper, Walker2d, HalfCheetah, and Quadruped).
Appendix B Experiment Details
B.1 Experiment Settings
We use Proximal Policy Optimization (PPO) as implemented in OpenAI Baselines (Dhariwal et al., 2017) to train all the policies in our experiments. For simulation in DART, we use DartEnv (Yu & Liu, 2017), which implements the continuous control benchmarks from OpenAI Gym using PyDart (Ha, 2016). For all of our examples, we represent the policy as a feedforward neural network with three hidden layers, each consisting of hidden nodes.
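The policy architecture can be sketched as follows. The hidden width of 64 and the tanh activations are assumptions for illustration, since the excerpt does not specify them:

```python
import numpy as np

def make_mlp_policy(obs_dim, act_dim, hidden=64, seed=0):
    """Minimal stand-in for the policy network: a feedforward net with
    three hidden tanh layers mapping observations to a mean action
    (PPO would add a learned Gaussian noise on top)."""
    rng = np.random.default_rng(seed)
    sizes = [obs_dim, hidden, hidden, hidden, act_dim]
    params = [(rng.standard_normal((m, n)) * 0.1, np.zeros(n))
              for m, n in zip(sizes[:-1], sizes[1:])]

    def policy(obs):
        x = np.asarray(obs, dtype=float)
        for i, (W, b) in enumerate(params):
            x = x @ W + b
            if i < len(params) - 1:   # nonlinearity on hidden layers only
                x = np.tanh(x)
        return x
    return policy
```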
B.2 Environment Details
The observation space, action space, and reward function used in all of our examples can be found in Table 1. For the Walker2d environment, we found that with the original settings in OpenAI Gym, the robot sometimes learns to hop forward, possibly because the ankle is too strong. We therefore reduce the torque limit of the ankle joint in both the DART and MuJoCo environments for the Walker2d problem from to . We found that with this modification, we can reliably learn locomotion gaits that are closer to a human running gait.
Below we list the dynamics randomization settings used in our experiments. Tables 2, 3, and 4 show the randomization ranges for the dynamic parameters in the different environments. For the quadruped example, we use the same settings as in Tan et al. (2018).
Table 2: Randomization ranges of the dynamic parameters (friction coefficient, restitution coefficient, mass in kg, joint damping, and joint torque scale).
Table 3: Randomization ranges of the dynamic parameters (friction coefficient, restitution coefficient, and joint damping).
Table 4: Randomization ranges of the dynamic parameters (friction coefficient, restitution coefficient, mass in kg, and joint torque scale).
B.3 Simulated Reality Gaps
To evaluate the ability of our method to overcome modeling errors, we designed six types of modeling errors. Each example shown in our experiments contains one or more of the modeling errors listed below.

DART to MuJoCo
For the Hopper, Walker2d, and HalfCheetah examples, we trained policies that transfer from the DART environment to the MuJoCo environment. As discussed in Appendix A, the major differences between DART and MuJoCo lie in contact handling, joint limits, and armature.

Latency
The second type of modeling error we tested is latency in the signals. Specifically, we model the latency between when an observation is sent out from the robot and when the action corresponding to this observation is executed on the robot. When a policy is trained without any delay, it is usually very challenging to transfer it to problems with added delay. The value of the delay is usually below ms, and we use ms and ms in our examples.
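A common way to model such latency is to buffer actions and execute stale ones. A minimal sketch, expressing the delay in control steps rather than milliseconds (steps = latency / control timestep):

```python
from collections import deque

class LatencyWrapper:
    """Simulate observation-to-actuation latency: the action executed now
    is the one computed from an observation `delay_steps` steps ago."""
    def __init__(self, env_step, delay_steps, default_action):
        self.env_step = env_step                       # callable: action -> observation
        self.buffer = deque([default_action] * delay_steps)

    def step(self, action):
        self.buffer.append(action)
        delayed = self.buffer.popleft()                # act on a stale action
        return self.env_step(delayed)
```

With a delay of two steps, the first two actions sent to the wrapper have no effect yet; the environment keeps executing the default action until the buffer fills.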

Actuator Modeling Error
As noted by Tan et al. (2018), error in actuator modeling is an important factor contributing to the reality gap. They addressed it by identifying a more accurate actuator model, fitting a piecewise linear function for the torque-current relation. We use their identified actuator model as the ground-truth target environment in our experiments and use the ideal linear torque-current relation in the source environments.
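The contrast between the two actuator models might look like the sketch below. The breakpoints of the piecewise-linear curve are invented for illustration, not the values identified by Tan et al. (2018):

```python
import numpy as np

def ideal_torque(current, k=1.0):
    """Source environment: ideal linear torque-current relation."""
    return k * current

def identified_torque(current, currents=None, torques=None):
    """Target environment: piecewise-linear torque-current curve fitted
    from data. The default breakpoints are hypothetical, showing torque
    saturating at high current."""
    if currents is None:
        currents = np.array([0.0, 2.0, 4.0, 8.0])
        torques = np.array([0.0, 1.8, 3.0, 3.5])
    return np.interp(current, currents, torques)
```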

Foot Mass
In the Walker2d example, we vary the mass of the robot's right foot to create a family of target environments for testing. The foot mass varies within a range of kg.

Terrain Slope
In the HalfCheetah example, we vary the slope of the ground to create a family of target environments for testing. This is implemented by rotating the gravity direction by the corresponding angle. The angle varies in the range of radians.
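Rotating the gravity vector by the slope angle can be sketched as follows, assuming the sagittal plane is x-z and the nominal gravity magnitude is 9.81 m/s^2:

```python
import numpy as np

def sloped_gravity(angle, g=9.81):
    """Emulate a ground slope of `angle` radians by rotating the gravity
    vector in the x-z plane instead of tilting the terrain mesh."""
    return np.array([g * np.sin(angle), 0.0, -g * np.cos(angle)])
```

A zero angle recovers the usual downward gravity; a positive angle adds a horizontal gravity component, which is equivalent to the robot walking uphill.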

Rigid to Deformable
The last type of modeling error we test is a deformable object in the target environment being modeled as a rigid object in the source environment. The deformable object is modeled using the soft shape object in DART. In our example, we created a deformable box of size around the foot of the Hopper. We set the stiffness of the deformable object to be and the damping to be . We refer readers to Jain & Liu (2011) for more details of the soft-body simulation.
B.4 Policy Training
To train policies in the source environment, we run PPO for iterations. In each iteration, we sample steps from the source environment to update the policy. For the rest of the hyper-parameters, we use the default values from OpenAI Baselines (Dhariwal et al., 2017). We use a large batch size in our experiments because the policy needs to be trained to work on different dynamic parameters.
For fine-tuning the Robust and Adaptive policies in the target environment, we use PPO with a batch size of , the default value used in OpenAI Baselines. We use a smaller batch size here for two reasons: 1) since the policy is trained to work on only one dynamics, we do not need as many samples to optimize the policy in general, and 2) the fine-tuning process has a limited sample budget, and a smaller batch size allows more policy updates within that budget.
B.5 Strategy Optimization with CMA-ES
We use a CMA-ES implementation in Python (PyC). At each iteration of CMA-ES, we generate samples from the latest Gaussian distribution, with the number of samples depending on the dimension of the dynamic parameters. To evaluate each sample, we run the policy with the corresponding dynamic parameters in the target environment for three trials and average the returns to obtain the fitness of this sample.
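The strategy-optimization loop can be sketched with a bare-bones evolution strategy standing in for full CMA-ES (no covariance adaptation; only the CMA-ES population-size convention 4 + floor(3 ln n) is kept). `fitness` stands for the averaged return of the policy run with the candidate dynamic parameters in the target environment:

```python
import numpy as np

def strategy_optimization(fitness, dim, iterations=150, popsize=None, seed=0):
    """Simplified (mu, lambda)-ES sketch of strategy optimization: sample
    candidate dynamic parameters from a Gaussian, keep the best half,
    recenter the mean on them, and shrink the step size."""
    rng = np.random.default_rng(seed)
    if popsize is None:
        popsize = 4 + int(3 * np.log(dim))   # CMA-ES default population size
    mean, sigma = np.zeros(dim), 1.0
    for _ in range(iterations):
        samples = mean + sigma * rng.standard_normal((popsize, dim))
        scores = np.array([fitness(s) for s in samples])
        elites = samples[np.argsort(scores)[-(popsize // 2):]]  # maximize
        mean = elites.mean(axis=0)
        sigma *= 0.97                         # simple step-size decay
    return mean
```

A real run would use the pycma package instead; this sketch only conveys the sample-evaluate-update structure of the loop.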
Appendix C Alternative Methods for Strategy Optimization
In addition to CMA-ES, we have also experimented with a few other options for finding the dynamic parameters for which the policy works best in the target environment. Here we show some experimental results for Strategy Optimization with Bayesian Optimization (SO-BO) and with Model-based Optimization (SO-MB).
C.1 Bayesian Optimization
Bayesian Optimization is a gradient-free optimization method known to work well for low-dimensional continuous problems where evaluating each sample can be expensive. The main idea is to incrementally build a Gaussian process (GP) model that estimates the loss for a given search parameter. At each iteration, a new sample is drawn by optimizing an acquisition function on the GP model, which trades off exploration (searching where the GP has high uncertainty) against exploitation (searching where the GP predicts low loss). The new sample is then evaluated and added to the training dataset for the GP.
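A one-dimensional sketch of this loop, using an RBF-kernel GP and a lower-confidence-bound acquisition; the kernel length scale, jitter, and exploration weight `beta` are arbitrary illustrative choices:

```python
import numpy as np

def rbf(a, b, length=0.3):
    """Squared-exponential kernel between two 1-D sample arrays."""
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / length) ** 2)

def bayes_opt(loss, bounds, iters=20, noise=1e-4, beta=2.0, seed=0):
    """Minimize `loss` on an interval: fit a GP to evaluated samples, then
    pick the next sample by minimizing mean - beta * std on a grid."""
    rng = np.random.default_rng(seed)
    lo, hi = bounds
    X = rng.uniform(lo, hi, 3)                 # a few random initial samples
    y = np.array([loss(x) for x in X])
    grid = np.linspace(lo, hi, 200)
    for _ in range(iters):
        K = rbf(X, X) + noise * np.eye(len(X))
        Ks = rbf(grid, X)
        mean = Ks @ np.linalg.solve(K, y)                      # GP posterior mean
        var = 1.0 - np.sum(Ks * np.linalg.solve(K, Ks.T).T, axis=1)
        acq = mean - beta * np.sqrt(np.maximum(var, 0.0))      # lower confidence bound
        x_next = grid[np.argmin(acq)]
        X = np.append(X, x_next)
        y = np.append(y, loss(x_next))
    return X[np.argmin(y)]
```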
We test Bayesian Optimization on the Hopper and Quadruped examples, as shown in Figure 8. Bayesian Optimization achieves performance comparable to CMA-ES and is thus a viable choice for our problem. However, SO-BO is in general noisier than CMA-ES and less computationally efficient due to the repeated refitting of GP models.
C.2 Model-based Optimization
Another possible way to perform strategy optimization is a model-based method: we learn the dynamics of the target environment using a generic model such as a neural network, a Gaussian process, or a linear function. Once we have learned a dynamics model, we can use it as an approximation of the target environment to optimize the dynamic parameters.
We first tried using feedforward neural networks to learn the dynamics and optimize the dynamic parameters. However, this method was not able to reliably find parameters that lead to good performance, possibly because any error in the predicted states quickly accumulates over time and leads to inaccurate long-horizon predictions. In addition, this method would not be able to handle problems where latency is involved, since the next state then depends on actions beyond the most recent input.
In the experiments presented here, we learn the dynamics of the target environment with a Long Short-Term Memory (LSTM) network (Hochreiter & Schmidhuber, 1997). Given a target environment, we first sample the dynamic parameters uniformly and collect experience using the corresponding policies until we have enough samples. We use these samples to fit an initial LSTM dynamics model. We then alternate between finding the dynamic parameters for which the policy achieves the best performance under the latest LSTM dynamics model, and updating the LSTM dynamics model using data generated from the policy with those parameters. This is repeated until we reach the sample budget.
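The alternation described above can be sketched as a generic skeleton. `fit_model` and `optimize_mu` are hypothetical callbacks introduced here for illustration; in the paper an LSTM plays the role of the model:

```python
import numpy as np

def model_based_strategy_opt(target_env_rollout, fit_model, optimize_mu,
                             mu_bounds, n_init=5, n_rounds=10, seed=0):
    """Skeleton of model-based strategy optimization:
      1. roll out policies for uniformly sampled dynamic parameters,
      2. fit a dynamics model to the collected trajectories,
      3. find the parameters that perform best under the learned model,
      4. collect new target-environment data with those parameters; repeat."""
    rng = np.random.default_rng(seed)
    lo, hi = mu_bounds
    data = [target_env_rollout(rng.uniform(lo, hi)) for _ in range(n_init)]
    best_mu = None
    for _ in range(n_rounds):
        model = fit_model(data)                  # e.g. train the LSTM
        best_mu = optimize_mu(model)             # search under the model
        data.append(target_env_rollout(best_mu)) # ground-truth feedback
    return best_mu
```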
We found that the LSTM notably outperformed feedforward networks when applied to strategy optimization. One result for Hopper DART-to-MuJoCo can be found in Figure 9. The model-based method with an LSTM is able to achieve performance similar to CMA-ES.
The model-based method provides more flexibility than CMA-ES and Bayesian Optimization. For example, if the target environment changes over time, it may be desirable for the dynamic parameters to also be time-varying. However, this leads to a high-dimensional search space, which may require significantly more samples for CMA-ES or Bayesian Optimization to solve. If we can learn a sufficiently accurate model from the data, we can use it to generate synthetic data for solving the problem.
However, there are two major drawbacks to the model-based method. First, learning the dynamics model requires access to the full state of the robot, which can be challenging to obtain in the real world; in contrast, CMA-ES and Bayesian Optimization only require the final return of a rollout. Second, the model-based method is significantly slower than the other methods due to the frequent training of the LSTM network.