## I Introduction

Human beings and animals have a natural ability to adapt their motor skills to novel situations. Robots in the real world also often encounter unexpected tasks and environments, such as manipulating unseen objects or walking on unstructured terrains. Many state-of-the-art robots still lack this capability to adapt, which prevents their deployment in the real world.

Recent advances in deep reinforcement learning (deep RL) shed light on developing effective motor skills in challenging situations [schulman2017proximal, lillicrap2015continuous, mnih2016asynchronous]. However, the policy found by deep RL is usually limited to a single scenario and may not work if the target environment changes notably. One common technique to overcome this generalization issue is to train a single policy that can handle a wide range of situations by exposing it to many random scenarios, so-called *domain randomization* (DR) [tobin2017domain]. DR is known to be effective for producing a single policy that is robust across different scenarios, which makes it suitable for problems such as sim-to-real transfer. However, DR trades optimality for robustness: the policies learned by DR are not optimal in any single situation. Another popular approach is *meta reinforcement learning* (meta-RL) [rakelly2019efficient, finn2017model], which aims to solve a new task within a few iterations by training adaptation over a distribution of tasks. However, existing meta-RL methods are mostly effective for adapting to different reward functions, and are in general less effective for challenging control problems where the dynamics change [yang2019norml].

In this work, we aim to develop a meta-RL algorithm that can quickly adapt the behavior of the trained policies to novel reward functions and dynamics that are not examined during training. We extend the idea of Strategy Optimization (SO) [yu2018policy] that trains a policy modulated by a latent variable to exhibit versatile behaviors. During the evaluation, the latent variable is adapted directly on the hardware using a sampling-based optimization method. The key idea behind our proposed method, Meta Strategy Optimization (MSO), is to expose the learning agent to the same Strategy Optimization process during both training and testing phases. This meta-training allows the agents to learn a better latent policy space that is suitable for fast adaptation to new situations.

We demonstrate our proposed algorithm on training locomotion policies for the Ghost Robotics Minitaur [kenneally2016design], a quadruped robot. Our algorithm can successfully train locomotion policies that can be applied to the real hardware by adjusting its simulation-acquired behavior (Figure 1). In addition, we design two adaptation tasks for the real robot, walking with a weakened leg and climbing a slope, and a set of additional tasks in a simulated environment. We show that MSO is extremely data efficient ( rollouts or seconds of data) to adapt the policies to novel situations in the target environment. We compare our method to two baseline methods: domain randomization [TanRSS18] and strategy optimization with a projected universal policy [yu2019sim]. Our results show that MSO outperforms both baselines in the simulated and the real environments.

## II Related Work

### II-A Sim-to-real transfer for legged locomotion

Recent developments in deep reinforcement learning (Deep RL) have enabled training locomotion policies for legged robots with high dimensional observation and action spaces and challenging dynamics [schulman2017proximal, lillicrap2015continuous, mnih2016asynchronous], which demonstrates an attractive path toward automatically acquiring motor skills for robots. However, the sample complexity and potential safety concerns prevent deep RL from being applied directly on hardware, while the discrepancies between computer simulation and the real world, also known as the Reality Gap [neunert2017off], make a simulation-trained policy unlikely to work on the real robot.

Researchers have proposed a variety of techniques to enable a policy trained in simulation to be transferred to the real robot [TanRSS18, yu2019sim, hwangbo2019learning, peng2018sim, bousmalis2018using, fang2018multi, learndexmanipulation18, hanna2017grounded]. One important strategy is to improve the computer simulation to better match the real robot dynamics [TanRSS18, hwangbo2019learning, hanna2017grounded]. For example, Tan et al. [TanRSS18] improved the actuator dynamics by identifying a nonlinear torque-current relation and demonstrated successful transfer of locomotion policies for a quadruped robot. In this work, we leverage the model parameters and the nonlinear actuator model identified by Tan et al. [TanRSS18] for the quadruped robot. However, improving the simulation model alone does not allow the policy to be transferred to notably different dynamics or tasks.

Another important technique for sim-to-real transfer is to train control policies that are robust to a range of simulated environments and sensor noises. Different techniques have been proposed to train robust policies, such as domain randomization [tobin2017domain, peng2018sim, learndexmanipulation18, yan2019data], adversarial perturbation [pinto2017robust], and ensemble models [Mordatch, Lowrey]. Though a policy trained with pure domain randomization may transfer to the real robot, this usually assumes that the training dynamics are not too far from the target dynamics. As shown in our experiments, domain randomization alone fails to transfer if the reality gap is large. In addition, without a mechanism to adjust the policy behavior, these policies cannot quickly adapt to cases where the reward function is changed.

### II-B Adapting control policies to novel tasks

To adapt to new reward functions or dynamics, it is necessary that the controller can modify its behavior according to real-world experience. Existing works in this line of research can be roughly divided into two categories: model-free and model-based adaptation methods.

In model-free adaptation methods, the control policy is directly adjusted according to experience from the target environment. One class of such methods is the gradient-based meta learning approach [finn2017model, houthooft2018evolved, rothfuss2018promp, yang2019norml], where the goal is to train policies that can be quickly adapted by gradient-based optimization methods during test time. Gradient-based meta learning methods have been demonstrated on adapting to novel reward functions and are universal in theory [finn2017meta]. However, they are in general less effective for adapting to novel dynamics. No-Reward Meta Learning (NoRML) [yang2019norml] addressed this issue by meta-learning an advantage function and an offset in addition to the policy parameters. NoRML has demonstrated effective adaptation to unseen dynamics in simulation. However, it has yet to be demonstrated on real robots.

In contrast to gradient-based methods, latent space based adaptation methods encode the training experience into a latent representation on which the policy is conditioned [rakelly2019efficient, YuRSS17, yu2018policy, duan2016rl, james2018task]. The latent input to the policy is then fine-tuned when a new environment is presented. Most methods in this class try to infer the latent input using observations from the target environment. For example, Yu et al. [YuRSS17] conditioned the policy on the physics parameters of the robot and trained a separate prediction model that estimates the physics parameters given the history of observations and actions. These methods can potentially adapt to changes in environments in an online fashion. However, when the dynamics change significantly, the inference model may produce non-optimal latent inputs. As a result, most works in this category have been demonstrated in simulated environments only.

Instead of training an inference model, researchers have also proposed methods that directly optimize the latent input to the policy in the target environment [yu2019sim, yu2018policy, cully2015robots]. As the latent space that the policy is conditioned on is usually low dimensional, it is possible to use sampling-based optimization methods such as CMA-ES [hansen1995adaptation] or Bayesian Optimization [mockus2012bayesian] to find the latent input that achieves the highest performance. Such methods have been successfully applied to learning locomotion policies for a biped robot [yu2019sim] and adapting to novel environments for a hexapod robot [cully2015robots]. Our method extends this line of research by matching the process of optimizing the latent input during training and testing. We demonstrate that by doing this, we learn a better latent space that is suitable for fast adaptation.

Model-based adaptation methods, on the other hand, adapt the dynamics model learned in the source domain and extract the control policy using methods such as model-predictive control (MPC) [nagabandi2018learning, tanaskovic2013adaptive, aswani2012extensions, manganiello2014optimization, lenz2015deepmpc]. These methods have the advantage of being data efficient. However, the learned dynamics model usually uses the full state of the robot, which requires additional instrumentation such as a motion capture system.

## III Background

We represent the problem of legged locomotion as a Markov Decision Process (MDP) $(\mathcal{S}, \mathcal{A}, P, r, \rho_0)$, where $\mathcal{S}$ is the state space of the robot, $\mathcal{A}$ is the action space, $P$ is the transition function, $r$ is the reward function and $\rho_0$ is the initial state distribution. The goal of reinforcement learning is to find a policy $\pi$, such that it maximizes the expected accumulated reward over time under the transition function $P$:

$$J(\pi) = \mathbb{E}\left[ \sum_{t=0}^{T} r(s_t, a_t) \right],$$

where $s_0 \sim \rho_0$, $a_t \sim \pi(\cdot \mid s_t)$ and $s_{t+1} \sim P(s_t, a_t)$. In deep reinforcement learning, the policy is usually parameterized by a neural network with weights $\theta$ and the policy is denoted as $\pi_\theta$.

Strategy Optimization (SO) [yu2018policy] extends standard policy learning by training a universal policy (UP) that is conditioned on the physics parameters $\mu$ of the simulated robot: $\pi_\theta(a \mid s, \mu)$ (Figure 2b). Under the assumption that we have access to the true physics parameters (e.g. in simulated environments), we can train a universal policy with any standard reinforcement learning algorithm by treating $\mu$ as part of the observations. The trained universal policy will change its behavior with respect to different physics parameters $\mu$; thus the policy evaluated with a particular physics parameter input can be treated as a strategy.
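To make the return objective concrete, the sketch below estimates $J(\pi)$ by Monte Carlo rollouts. The toy one-dimensional MDP and the helper names (`estimate_return`, `step_fn`, `reset_fn`) are hypothetical illustrations, not part of our Minitaur setup:

```python
import random

def estimate_return(policy, step_fn, reset_fn, horizon=100, episodes=10, seed=0):
    """Monte Carlo estimate of J(pi): average accumulated reward over rollouts."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(episodes):
        s = reset_fn(rng)
        ep_reward = 0.0
        for _ in range(horizon):
            a = policy(s)
            s, r, done = step_fn(s, a, rng)
            ep_reward += r
            if done:
                break
        total += ep_reward
    return total / episodes

# Toy 1-D point-mass MDP: state is position, action moves it, reward is progress.
def reset_fn(rng):
    return 0.0

def step_fn(s, a, rng):
    s_next = s + a
    return s_next, s_next - s, False

policy = lambda s: 0.1  # constant forward step
J = estimate_return(policy, step_fn, reset_fn, horizon=50, episodes=4)
# Deterministic toy: 0.1 progress per step for 50 steps -> J = 5.0
```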

In order to transfer the trained policy to the real world, SO solves the following optimization directly on the hardware:

$$\mu^* = \arg\max_{\mu} \; \hat{J}(\pi_\theta(\cdot \mid \cdot, \mu)), \tag{1}$$

where $\hat{J}$ denotes the performance of the strategy on the real robot. As the search space of $\mu$ is significantly smaller than the network weight space, it permits the use of sampling-based optimization methods such as CMA-ES [hansen1995adaptation] or Bayesian Optimization [mockus2012bayesian], which can better handle noisy objectives such as the one used in RL than gradient-based methods [varelas2019benchmarking]. To further reduce the search space during the transfer, Yu et al. proposed a projected universal policy (PUP, Figure 2c) [yu2019sim], which projects the physics parameters $\mu$ to a lower dimensional latent space of context variables $z$ (usually only a few dimensions). During the learning phase, PUP takes the robot observation and physics parameters as input, while during strategy optimization PUP directly optimizes the low-dimensional context variables $z$ instead of the physics parameters $\mu$.
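To illustrate what this sampling-based search over the latent input looks like, the sketch below uses a simple cross-entropy method as a stand-in for CMA-ES or Bayesian Optimization; the function names and the quadratic stand-in objective are hypothetical:

```python
import math
import random

def strategy_optimization(evaluate, dim=2, iters=10, pop=20, elite=5, seed=0):
    """Search the low-dimensional latent input z for the best strategy.
    A basic cross-entropy method stands in for CMA-ES / Bayesian optimization."""
    rng = random.Random(seed)
    mean, std = [0.0] * dim, [1.0] * dim
    best_z, best_f = None, -math.inf
    for _ in range(iters):
        samples = [[rng.gauss(mean[d], std[d]) for d in range(dim)]
                   for _ in range(pop)]
        scored = sorted(samples, key=evaluate, reverse=True)
        if evaluate(scored[0]) > best_f:
            best_f, best_z = evaluate(scored[0]), scored[0]
        elites = scored[:elite]
        # Refit the sampling distribution to the elite strategies.
        mean = [sum(z[d] for z in elites) / elite for d in range(dim)]
        std = [max(1e-3, (sum((z[d] - mean[d]) ** 2 for z in elites) / elite) ** 0.5)
               for d in range(dim)]
    return best_z, best_f

# Stand-in objective: episode return peaks at z = (0.5, -0.3).
episode_return = lambda z: -((z[0] - 0.5) ** 2 + (z[1] + 0.3) ** 2)
z_star, f_star = strategy_optimization(episode_return)
```

In practice the objective is the noisy episode return of the robot executing the strategy, and each evaluation costs one rollout, which is why sample-efficient optimizers such as Bayesian Optimization are preferred on hardware.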

Strategy optimization with projected universal policy (SO-PUP) has demonstrated successful sim-to-real transfer for biped locomotion problems. However, there are a few drawbacks to SO-PUP. First, the explicit representation of physics parameters in SO-PUP is not practical for high-dimensional environments or dynamics changes, such as a randomized terrain with thousands of height variables. Furthermore, SO-PUP acquires the latent space of context variables through a projection network that has never experienced the adaptation process. This mismatch between the training and testing phases implies that the latent space learned by SO-PUP may not favor fast adaptation.

## IV Meta Strategy Optimization

In this work, we present Meta Strategy Optimization (MSO), a meta-learning algorithm that learns a latent variable conditioned policy on a large variety of simulated environments and can quickly adapt the trained policy to novel rewards and dynamics with a few episodes of data from the target environment. The key idea behind MSO is to use the same adaptation process to obtain the latent input to the policy during both training and testing. Therefore, our policy directly takes context variables as inputs (Figure 2d).

We solve the following optimization problem during training in simulation:

$$\theta^* = \arg\max_{\theta} \; \mathbb{E}_{\mu \sim p(\mu)} \left[ \max_{z} J_\mu(\pi_\theta(\cdot \mid \cdot, z)) \right], \tag{2}$$

where $\theta$ is the weight of the policy network and $J_\mu(\pi_\theta(\cdot \mid \cdot, z))$ is the performance of the strategy $z$ when the physics parameters are $\mu$. Note that we refer to the physics parameters for clarity and consistency with previous works. However, one can easily extend $\mu$ to include parameters from other components of the MDP, such as the reward function.

Directly solving Equation 2 is challenging for two reasons. First, the objective involves a strategy optimization inside the expectation, which makes it difficult to compute the gradient with respect to the policy parameters $\theta$. Second, every single evaluation of the policy parameters involves performing SO to obtain the optimal strategy (Equation 3), which increases the computational cost significantly.

We propose a practical algorithm for solving Equation 2 by making the following assumption: the change in the optimal latent input found by SO is small if the change in the policy network weights is also small. As a result, we can approximately solve Equation 2 by interleaving the optimization of $\theta$ and $z$:

$$z_\mu^{(i)} = \arg\max_{z} \; J_\mu(\pi_{\theta^{(i)}}(\cdot \mid \cdot, z)), \tag{3}$$

$$\theta^{(i+1)} = \arg\max_{\theta} \; \mathbb{E}_{\mu \sim p(\mu)} \left[ J_\mu(\pi_\theta(\cdot \mid \cdot, z_\mu^{(i)})) \right], \tag{4}$$

where $i$ is the iteration number.

Algorithm 1 describes the MSO algorithm in more detail. For each iteration of policy learning, we first sample a set of tasks from the simulator and perform strategy optimization to obtain the current best strategies for these tasks. We then perform steps of policy updates with the fixed set of task-strategy pairs. In our experiments, we use and .
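The interleaving described in Algorithm 1 can be sketched as follows, assuming hypothetical stand-ins for the task sampler, the strategy optimizer (Equation 3), and the policy-update step (Equation 4); in our actual experiments these roles are played by domain-randomized sampling, Bayesian Optimization, and ARS respectively:

```python
import random

def meta_strategy_optimization(theta, sample_tasks, optimize_strategy, policy_update,
                               iterations=3, so_interval=1, inner_updates=2, seed=0):
    """MSO training loop: interleave strategy optimization (Eq. 3) with
    policy updates on the fixed task-strategy pairs (Eq. 4)."""
    rng = random.Random(seed)
    pairs = []
    for i in range(iterations):
        if i % so_interval == 0:
            tasks = sample_tasks(rng)  # sample physics parameters mu
            pairs = [(mu, optimize_strategy(theta, mu)) for mu in tasks]  # Eq. 3
        for _ in range(inner_updates):  # approximate Eq. 4 with a few updates
            theta = policy_update(theta, pairs)
    return theta

# Minimal scalar stand-ins: a scalar "policy weight", toy best latent per task,
# and an update that pulls theta toward the sampled tasks.
sample_tasks = lambda rng: [rng.uniform(-1, 1) for _ in range(4)]
optimize_strategy = lambda theta, mu: mu - theta
policy_update = lambda theta, pairs: theta + 0.1 * sum(
    mu - theta for mu, _ in pairs) / len(pairs)

theta_final = meta_strategy_optimization(0.0, sample_tasks,
                                         optimize_strategy, policy_update)
```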

By computing the latent variable using strategy optimization, MSO avoids the need to learn a projection from the physics parameters $\mu$ to the latent variable $z$ and thus can handle tasks with larger dimensions than SO-PUP. More importantly, by matching the process of obtaining the latent variable during training and testing, MSO can potentially learn a latent space that is more suitable for strategy optimization when adapting to novel scenarios.

## V Experiments

We aim to answer the following questions in our experiments: 1) Does MSO achieve better performance than the baseline methods DR [TanRSS18] and SO-PUP [yu2019sim] in adapting to new dynamics and rewards? 2) Does MSO train policies that can be successfully transferred to real robots and adapt to novel scenarios in the real world? 3) Is MSO sensitive to the specific choice of hyper-parameters? To answer these questions, we design a set of experiments in both simulation and the real world. Videos of our results can be seen in the supplementary video.

### V-a Experiment setup

We use Minitaur from Ghost Robotics [kenneally2016design] as the robot platform to evaluate our algorithm. Minitaur has eight direct-drive actuators, two on each leg. In this work, we use a Proportional-Derivative controller (P gain is and D gain is ) to track the desired motor positions, which is the output of the policy. Minitaur is equipped with motor encoders to read the motor angles and an IMU sensor to estimate the orientation and angular velocity of the robot body. The robot is controlled at a frequency of Hz.

We build a physics simulation of the Minitaur in PyBullet [pybullet], a Python module that extends the Bullet Physics Engine. Our simulator incorporates the actuator model [TanRSS18], but we do not perform a thorough system identification for its parameters. As shown in our experiments, a naïve domain randomization technique does not give us a transferable policy directly.

The observation space of the robot consists of the current motor angles and the roll and pitch of the base, as well as their time derivatives. We design a reward function that encourages the robot to move forward:

$$r_t = \min\left( (\mathbf{p}_t - \mathbf{p}_{t-1}) \cdot \mathbf{d}, \; \bar{v} \, \Delta t \right), \tag{5}$$

where $\mathbf{p}_t$ denotes the position of the robot base at timestep $t$, $\mathbf{d}$ is the desired moving direction, $\Delta t$ is the control timestep, and $\bar{v}$ is a velocity threshold for safety reasons. We use $\Delta t =$ s and $\bar{v} =$ m/s in our experiments. Each episode of simulation has a maximum horizon of steps (s). The episode is terminated early if the robot falls, determined by the roll and pitch angles of the base.
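A direct implementation of this reward might look like the sketch below; the timestep and velocity-cap values are placeholders, not the constants used in our experiments:

```python
def forward_reward(p_curr, p_prev, direction, dt=0.01, v_max=1.0):
    """Reward of Eq. 5: forward displacement along the desired direction,
    clipped at v_max * dt so the robot is not rewarded for unsafe speeds."""
    displacement = sum((c - p) * d for c, p, d in zip(p_curr, p_prev, direction))
    return min(displacement, v_max * dt)

# Moving 5 mm forward in one 10 ms step is under the cap...
r1 = forward_reward((0.005, 0.0), (0.0, 0.0), (1.0, 0.0))  # -> 0.005
# ...while a 5 cm jump is clipped to v_max * dt = 0.01.
r2 = forward_reward((0.05, 0.0), (0.0, 0.0), (1.0, 0.0))   # -> 0.01
```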

We use Augmented Random Search (ARS), a policy optimization algorithm, for training the locomotion policy in simulation [mania2018simple]. At each iteration, ARS samples random perturbations of the policy weights $\theta$ and estimates the policy gradient along the best performing perturbation directions using finite differences. We refer the readers to the original paper for more details. In our experiments, we sample perturbations for each iteration and use the top perturbations to update the policy weights. Although ARS has only been demonstrated for training linear policies, we find it also effective in training neural network policies. We choose ARS because it can better leverage large scale computational resources, though MSO can also be applied to other on-policy RL algorithms such as PPO [schulman2017proximal]. We use Bayesian Optimization to perform SO and limit the maximum episode number to during training.
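For reference, a minimal ARS-style update on a toy objective is sketched below. The perturbation scale, step size, and number of directions are illustrative, not the hyper-parameters from our experiments:

```python
import random

def ars_step(theta, evaluate, n_dirs=8, top_b=4, step=0.1, noise=0.05, rng=None):
    """One Augmented Random Search update: probe symmetric random perturbations
    of theta and ascend a finite-difference gradient estimate built from the
    best-performing directions."""
    rng = rng or random.Random(0)
    dim = len(theta)
    probes = []
    for _ in range(n_dirs):
        d = [rng.gauss(0, 1) for _ in range(dim)]
        r_plus = evaluate([t + noise * x for t, x in zip(theta, d)])
        r_minus = evaluate([t - noise * x for t, x in zip(theta, d)])
        probes.append((max(r_plus, r_minus), r_plus, r_minus, d))
    probes.sort(key=lambda p: p[0], reverse=True)  # keep best directions
    grad = [0.0] * dim
    for _, r_plus, r_minus, d in probes[:top_b]:
        for j in range(dim):
            grad[j] += (r_plus - r_minus) * d[j]
    return [t + step * g / top_b for t, g in zip(theta, grad)]

# Toy smooth objective maximized at theta = (1, -1); stands in for episode return.
f = lambda th: -((th[0] - 1.0) ** 2 + (th[1] + 1.0) ** 2)

theta = [0.0, 0.0]
rng = random.Random(1)
for _ in range(300):
    theta = ars_step(theta, f, rng=rng)
```

The full ARS algorithm additionally normalizes observations and scales the update by the standard deviation of the collected returns; those details are omitted here for brevity.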

We compare MSO to two baselines: domain randomization (DR) [TanRSS18] and strategy optimization with projected universal policy (SO-PUP) [yu2019sim]. We run ARS for iterations for all methods and use a two-dimensional latent space for MSO and SO-PUP. Table I shows the physics parameters and their corresponding ranges used during training. During our experiments on the hardware, we find that episodes are sufficient to achieve successful adaptation. Thus we choose episodes during testing, even though during training SO is allowed to use episodes. To reduce the influence of the stochastic learning process, we train five policies for each method. Each trained policy is then evaluated on sampled tasks from the designed task distributions (Section V-B) for all simulated adaptation experiments.

### V-B Adaptation tasks

We design the following tasks on the real robot to evaluate the performance of MSO:

1) Sim-to-real transfer. The first task is to transfer the policy trained in simulation to the real Minitaur robot. Although we use the nonlinear actuator model from Tan et al. [TanRSS18], the reality gap in our case is still large as we use a different version of Minitaur and we do not perform additional system identification.

2) Weakened motors. It is common for real robots to experience motor weakening, e.g. due to overheating. In this task, we test the ability of MSO to adapt to weakened motors by setting the P gain to for the two motors on the front right leg of Minitaur. Such strength reduction () is beyond the range that the policy has seen during training.

3) Climbing up a slope. In this task, we place the robot on a slope of about degrees constructed from a whiteboard and task the robot with climbing up the hill. This is a challenging task because during training in simulation the robot has only seen flat ground.

In addition, we design the following tasks in simulation for a more comprehensive analysis of the adaptation performance of MSO:

1) Extended randomization. In this task, we sample dynamics from the same set of parameters used in training (Table I), but with an extended, wider range. We also reject samples that lie within the training range to focus on generalization capability. This gives us a large space of testing dynamics that have not been seen during training.

2) Climbing up slopes. We also evaluate MSO for climbing up a hill in simulated environments. We randomize the angle of the slope in degrees during evaluation.

3) Motor offset. One common defect of actuators is a miscalibrated zero position. In this task, we evaluate the ability of MSO to adapt to such issues. Specifically, we add an offset sampled in degrees to the observed angles of the two motors on the front left leg.

4) Carrying an object. All tasks above involve adapting to changes in dynamics only. In this task, we design a scenario where both the dynamics and the reward change. Specifically, we ask the robot to carry a box of Kg while running forward. The new reward is how far the box is carried without falling to the ground. This task stresses the need to adapt the behavior of the policy, and a robust policy with a single behavior is unlikely to succeed.

For all simulated tasks except extended randomization, we also need to determine what values to use for the parameters randomized during training. As there is no single set of values that is representative of the robot, we also randomize these parameters using the same training ranges (Table I) for those tasks.

| parameter | lower bound | upper bound |
|---|---|---|
| mass | 60% | 160% |
| motor friction | 0.0 Nm | 0.2 Nm |
| inertia | 25% | 200% |
| motor strength | 50% | 150% |
| latency | 0 ms | 80 ms |
| battery voltage | 10 V | 18 V |
| contact friction | 0.2 | 1.25 |
| joint friction | 0.0 Nm | 0.2 Nm |
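Sampling a training task from the ranges in Table I can be sketched as follows; the dictionary keys and the `widen` option (a stand-in for the extended-randomization range, which additionally rejects in-range samples) are hypothetical naming, and percentage entries scale the robot's default values:

```python
import random

# Randomization ranges from Table I; *_scale entries are relative to defaults.
PARAM_RANGES = {
    "mass_scale": (0.60, 1.60),
    "motor_friction_nm": (0.0, 0.2),
    "inertia_scale": (0.25, 2.00),
    "motor_strength_scale": (0.50, 1.50),
    "latency_ms": (0.0, 80.0),
    "battery_voltage_v": (10.0, 18.0),
    "contact_friction": (0.2, 1.25),
    "joint_friction_nm": (0.0, 0.2),
}

def sample_physics_params(rng=None, widen=1.0):
    """Draw one set of physics parameters mu ~ p(mu); widen > 1 expands every
    range about its midpoint, loosely mimicking the extended-randomization task."""
    rng = rng or random.Random(0)
    params = {}
    for name, (lo, hi) in PARAM_RANGES.items():
        mid, half = (lo + hi) / 2.0, (hi - lo) / 2.0 * widen
        params[name] = rng.uniform(mid - half, mid + half)
    return params

mu = sample_physics_params(random.Random(42))
```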

### V-C Results on the real robot

We evaluate MSO on the real Minitaur robot for the three tasks described in Section V-B. For MSO and the baseline methods, we use the policy with the highest training performance among the five trials to deploy on the real hardware. For MSO and SO-PUP, we allow episodes for the adaptation and repeat the best performing strategy three times to obtain the final performance. For the sim-to-real task, we evaluate all three methods and report the results in Figure 3. We see that MSO is able to not only achieve better performance on average, but also obtain lower variance in performance.

For the weakened motors and slope climbing tasks, we compare MSO to DR. As seen in the supplementary video, when the front right leg is weakened, the robot lacks the strength to lift it up, and MSO finds a strategy that drags the front right leg forward without falling. On the other hand, DR still assumes full strength of the front right leg and relies on it to lift the base of the robot up, causing the robot to lose balance. Similarly, for the task of climbing up the hill, MSO is able to find a strategy that successfully takes the robot up the hill and beyond the slope, while DR leads to the robot falling backward, as it has only seen flat ground.

### V-D More analysis in simulation

We also evaluate our method in simulated adaptation tasks to provide a more comprehensive analysis of the performance of our algorithm. We first evaluate the training performance of MSO by testing it on the dynamics within the training range. As shown in Figure 4, MSO notably outperforms the other two methods.

We also report the performance of MSO and the baseline methods on the four adaptation tasks described in Section V-B: extended randomization, climbing up slopes, motor offset, and carrying an object. All results can be seen in Figure 4. For all adaptation tasks, MSO is able to outperform both SO-PUP and DR. Notably, for the task of climbing up a slope, MSO achieved a clear advantage over the baseline methods, while DR is not able to achieve a positive return. On the other hand, the difference between MSO and SO-PUP is smaller when an offset is added to the observed motor angles, while DR performs much worse. These results suggest that some tasks, such as climbing up the slope, are more sensitive to learning a good latent strategy space than others, such as adding a motor offset. MSO also works well for the task of carrying an object, where the policy needs to adapt to changes in both dynamics and reward. As seen in the supplementary video, MSO can successfully find a strategy that stabilizes the base of the robot to prevent the object from falling to the ground, while the baseline methods achieve worse performance.

### V-E Ablation study

| parameters | mean return (training) | mean return (extended) |
|---|---|---|
| =, =, = (nominal) | 2.95 | 1.95 |
| = | 2.91 | 1.85 |
| = | 2.36 | 1.51 |
| =, =, = (nominal) | 2.95 | 1.95 |
| = | 2.84 | 1.85 |
| = | 2.97 | 1.95 |
| =, =, = (nominal) | 2.95 | 1.95 |
| = | 2.70 | 1.78 |
| = | 3.01 | 1.94 |

Finally, we investigate how sensitive our algorithm is to different choices of hyper-parameters. In particular, we vary three key parameters for MSO: 1) : the number of episodes allowed in SO during training, 2) : the dimension of the latent space, and 3) : the number of iterations between each SO during training. Our nominal model uses , and for the three parameters. We vary one parameter at a time from the nominal setting and pick two values for each parameter being ablated. We evaluate all variations of MSO on the training performance and the extended randomization task. During testing, we allow episodes for adaptation for all variations. Table II shows the result of the ablation.

In general, our method is not very sensitive to different hyper-parameters. Interestingly, even when a single episode is allowed for SO during training, i.e. a random strategy is selected, the resulting policy can still outperform DR notably. This is possibly because training a policy in this setting is similar to training a set of DR policies with different random seeds, and during testing, the best performing one will be picked.

## VI Discussion and Conclusion

We have presented a learning algorithm for training locomotion policies that can quickly adapt to novel environments not seen during training. The key idea of our method, Meta Strategy Optimization (MSO), is a meta-learning process that learns a latent strategy space suitable for fast adaptation during training, and quickly searches for a good strategy to adapt to new rewards and dynamics during testing. We demonstrate MSO on a variety of simulated and real-world adaptation tasks, including walking on a slope, walking with weakened motors, and carrying objects. MSO can successfully adapt to the novel tasks in episodes and outperforms the baseline methods.

Though MSO can successfully transfer policies to environments that are notably different from the training environments, it assumes that the testing environment does not change significantly over time. This limitation restricts the type of tasks that MSO can be applied to. For example, if the robot needs to walk across a slippery surface and then a rough surface, it would need to change its strategy when the surface type changes. One possible future direction to address this issue is to adopt the idea of hierarchical RL [bacon2017option, liu2017learning] by treating the MSO-trained policy as a lower-level policy and training a higher-level policy that outputs the strategy. This would also enable the policy to adapt in an online fashion.