
Learning Domain Randomization Distributions for Transfer of Locomotion Policies

by   Melissa Mozifian, et al.

Domain randomization (DR) is a successful technique for learning robust policies for robot systems when the dynamics of the target robot system are unknown. The success of policies trained with domain randomization, however, is highly dependent on the correct selection of the randomization distribution. The majority of success stories typically use real world data to carefully select the DR distribution, or incorporate real world trajectories to better estimate appropriate randomization distributions. In this paper, we consider the problem of finding good domain randomization parameters for simulation, without prior access to data from the target system. We explore the use of gradient-based search methods to learn a domain randomization distribution with the following properties: 1) the trained policy should be successful in environments sampled from the domain randomization distribution; 2) the domain randomization distribution should be wide enough that experience similar to the target robot system is observed during training, while remaining narrow enough that training finite-capacity models stays practical. These two properties aim to ensure that the trajectories encountered in the target system are close to those observed during training, as existing methods in machine learning are better suited for interpolation than extrapolation. We show how adapting the domain randomization distribution while training context-conditioned policies results in improvements on jump-start and asymptotic performance when transferring a learned policy to the target environment.




1 Introduction

Deep Reinforcement Learning (Deep-RL) is a powerful technique for synthesizing locomotion controllers for robot systems. Inspired by successes in video games (Mnih et al., 2015) and board games (Silver et al., 2016), recent work has demonstrated the applicability of Deep-RL in robotics (Levine et al., 2016). Since the data requirements of Deep-RL make its direct application to real robot systems costly, or even infeasible, a large body of recent work has focused on training controllers in simulation and deploying them on a real robot system. This transfer is particularly challenging, but it is crucial for the real-world deployment of these systems.

Robot simulators provide a solution to the data requirements of Deep-RL. Except for simple robot systems in controlled environments, however, real robot experience may not correspond to the situations seen in simulation, an issue known as the reality gap (Jakobi et al., 1995). One way to address the reality gap is to perform system identification to tune the simulation parameters. This approach works if collecting data on the target system is not prohibitively expensive and the number of simulation parameters is small. The reality gap may still exist, however, due to a mis-specification of the simulation model.

Another method to shrink the reality gap is to train policies to maximize performance over a diverse set of simulation models, where the parameters of each model are sampled randomly, an approach known as domain randomization (DR). This aims to address the issue of model mis-specification by providing diverse simulated experience. Domain randomization has been demonstrated to effectively produce controllers that can be trained in simulation with high likelihood of successful outcomes on a real robot system after deployment (Andrychowicz et al., 2018) and fine-tuning with real world data (Chen et al., 2018).

While successful, an aspect that has not been addressed in depth is the selection of the domain randomization distribution. For vision-based components, DR should be tuned so that features learned in simulation do not depend strongly on the appearance of simulated environments. For the control components, the focus of this work, there is a dependency between optimal behaviour and the dynamics of the environment. In this case, the DR distribution should be selected carefully to ensure that the real robot experience is represented in the simulated experience sampled under DR. If real robot data is available, one could use gradient-free search (Chebotar et al., 2018) or Bayesian inference (Rajeswaran et al., 2016) to update the DR distribution after executing the learned policy on the target system. These methods are based on the assumption that there is a set of simulators from which real world experience can be synthesized.

In this work we propose to learn the parameters of the simulator distribution, such that the policy is trained over the most diverse set of simulator parameters in which it can plausibly succeed. By making the simulation distribution as wide as possible, we aim to encode the largest possible set of behaviours in a single policy of fixed capacity. As shown in our experiments, training on the widest distribution possible has two problems: our models usually have finite capacity, and a domain randomization distribution that is too varied slows down convergence, as shown in Figure 9.

Instead, we let the optimization process focus on environments where the task is feasible. We propose an algorithm that simultaneously learns the domain randomization distribution while optimizing the policy to maximize performance over the learned distribution. To operate over a wide range of possible simulator parameters, we train context-aware policies which take as input the current state of the environment alongside contextual information describing the sampled parameters of the simulator. This enables our policies to learn context-specific strategies which consider the current dynamics of the environment, rather than an average over all possible simulator parameters. When deployed on the target environment, we concurrently fine-tune the policy parameters while searching for the context that maximizes performance. We evaluate our method on a variety of control problems from the OpenAI Gym suite of benchmarks. We find that our method is able to improve on the performance of fixed domain randomization. Furthermore, we demonstrate our method's robustness to the initial simulator distribution parameters, showing that it reliably converges to similar domain randomization distributions across different experiments.

2 Related Work

(Packer et al., 2018) present an empirical study of generalization in Deep-RL, testing interpolation and extrapolation performance of state-of-the-art algorithms when varying simulation parameters in control tasks. The authors provide an experimental assessment of generalization under varying training and testing distributions. Our work extends this analysis to the case where the training distribution parameters are learned and change during policy training.

(Chebotar et al., 2018) propose training policies on a distribution of simulators, whose parameters are fit to real-world data. Their proposed algorithm switches back and forth between optimizing the policy under the DR distribution and updating the DR distribution by minimizing the discrepancy between simulated and real world trajectories. In contrast, we aim to learn policies that maximize performance over a diverse distribution of environments where the task is feasible, as a way of minimizing the interactions with the real robot system.

(Rajeswaran et al., 2016) propose a related approach for learning robust policies over a distribution of simulator models. The proposed approach, based on the $\epsilon$-percentile conditional value at risk (CVaR) (Tamar et al., 2015) objective, improves the policy performance on the small proportion of environments where the policy performs the worst. The authors propose an algorithm that updates the distribution of simulation models to maximize the likelihood of real-world trajectories, via Bayesian inference. The combination of worst-case performance optimization and Bayesian updates ensures that the resulting policy is robust to errors in the estimation of the simulation model parameters. Our method can be combined with the CVaR objective to encourage diversity of the learned DR distribution.

Related to learning the DR distribution, (Paul et al., 2018) propose using Bayesian Optimization (BO) to update the simulation model distribution. This is done by evaluating the improvement over the current policy when using a policy gradient algorithm with data sampled from the current simulator distribution. The parameters of the simulator distribution for the next iteration are selected to maximize said improvement.

Other related methods rely on policies that are conditioned on context: variables used to represent samples from the simulator distribution, either explicitly or implicitly. For example, (Chen et al., 2018) propose learning a policy conditioned on the hardware properties of the robot, encoded as a vector. These represent variations on the dynamics of the environment (i.e. friction, mass), drawn from a fixed simulator distribution. When explicit, the context is equal to the simulator parameters. When implicit, the mapping between context vectors and simulator environments is learned during training, using policy optimization. At test time, when the true context is unknown, the context vector that is fed as input to the policy is obtained by gradient descent on the task performance objective. Similarly, (Yu et al., 2018) propose training policies conditioned (explicitly) on simulator parameters, then optimizing the context vector alone to maximize performance at test time. Training is done by collecting trajectories on a fixed simulation model distribution. The authors argue that searching over context vectors is easier than searching over policy parameters; their method relies on population-based gradient-free search for optimizing the context vector to maximize task performance. Our method follows a similar approach, but we focus on learning the training distribution.

(Rakelly et al., 2019) also use context-conditioned policies, where the context is implicitly encoded into a vector. During the training phase, their proposed algorithm improves the performance of the policy while learning a probabilistic mapping from trajectory data to context vectors. At test time, the learned mapping is used for online inference of the context vector. This is similar in spirit to the Universal Policies with Online System Identification method (Yu et al., 2017), which instead uses deterministic context inference with an explicit context encoding. Again, these methods use a fixed DR distribution and could benefit from adapting it during training, as we propose in this work.

3 Problem Statement

We consider parametric Markov Decision Processes (MDPs) (Sutton & Barto, 2018). An MDP is defined by the tuple $(\mathcal{S}, \mathcal{A}, p, r, \gamma, p_0)$, where $\mathcal{S}$ is the set of possible states and $\mathcal{A}$ is the set of actions, $p: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \rightarrow [0, 1]$ encodes the state transition dynamics, $r: \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$ is the task-dependent reward function, $\gamma \in [0, 1)$ is a discount factor, and $p_0: \mathcal{S} \rightarrow [0, 1]$ is the initial state distribution. Let $\mathbf{s}_t$ and $\mathbf{a}_t$ be the state and action taken at time $t$. At the beginning of each episode, $\mathbf{s}_0 \sim p_0(\cdot)$. Trajectories $\tau$ are obtained by iteratively sampling actions $\mathbf{a}_t \sim \pi_\theta(\cdot \mid \mathbf{s}_t)$ using the current policy and evaluating next states according to the transition dynamics $\mathbf{s}_{t+1} \sim p(\cdot \mid \mathbf{s}_t, \mathbf{a}_t)$. Given an MDP $\mathcal{M}$, the goal is then to learn the policy parameters $\theta$ to maximize the expected sum of rewards $J(\theta) = \mathbb{E}_{\tau}\left[\sum_{t=0}^{T} \gamma^t r_t\right]$, where $r_t = r(\mathbf{s}_t, \mathbf{a}_t)$.
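The return inside the expectation above can be made concrete with a short sketch; `discounted_return` is an illustrative helper, not code from the paper:

```python
import numpy as np

# Illustrative helper: the discounted sum of rewards for one trajectory,
# i.e. the quantity inside the expectation J(theta) = E[sum_t gamma^t r_t].
def discounted_return(rewards, gamma=0.99):
    discounts = gamma ** np.arange(len(rewards))
    return float(np.dot(discounts, rewards))
```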

In our work, we aim to maximize performance over a distribution of MDPs, each described by a context vector $\mathbf{z}$ representing the variables that change over the distribution: changes in transition dynamics, rewards, initial state distribution, etc. Thus, our objective is to maximize $\mathbb{E}_{\mathbf{z} \sim p(\mathbf{z})}\left[J(\theta, \mathbf{z})\right]$, where $p(\mathbf{z})$ is the domain randomization distribution. Similar to (Yu et al., 2018; Chen et al., 2018; Rakelly et al., 2019), we condition the policy on the context vector, $\pi_\theta(\mathbf{a} \mid \mathbf{s}, \mathbf{z})$. In the experiments reported in this paper, we let $\mathbf{z}$ encode the parameters of the transition model in a physics-based simulator; e.g. mass, friction or damping.
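Conditioning the policy on the context can be sketched as follows; the linear parameterization and dimensions here are illustrative assumptions, not the paper's actual network architecture:

```python
import numpy as np

# Minimal sketch of a context-conditioned policy pi_theta(a | s, z): the
# context vector z is concatenated to the state before computing the action.
def context_policy(theta, state, z):
    x = np.concatenate([state, z])   # policy input: state plus context
    return np.tanh(theta @ x)        # bounded action vector

state_dim, ctx_dim, act_dim = 3, 1, 2
theta = np.zeros((act_dim, state_dim + ctx_dim))  # placeholder parameters
action = context_policy(theta, np.zeros(state_dim), np.array([0.05]))
```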

4 Proposed Method

In practice, making the context distribution $p(\mathbf{z})$ as wide as possible may be detrimental to the objective of maximizing performance. For instance, if the distribution has infinite support and wide variance, there may be many environments sampled from the context distribution for which the desired task is impossible (e.g. reaching a target state). Thus, sampling trajectories from a wide context distribution results in high variance in the directions of improvement, slowing progress on policy learning. On the other hand, if we make the context distribution too narrow, policy learning can progress more rapidly but may not generalize to the whole set of possible contexts.

We introduce the LSDR (Learning the Sweet-spot Distribution Range) algorithm for concurrently learning a domain randomization distribution and a robust policy that maximizes performance over it. Instead of directly sampling from $p(\mathbf{z})$, we use a surrogate distribution $p_\phi(\mathbf{z})$ with trainable parameters $\phi$. Our goal is to find appropriate parameters $\phi$ to optimize $\mathbb{E}_{\mathbf{z} \sim p(\mathbf{z})}\left[J(\theta, \mathbf{z})\right]$. LSDR proceeds by updating the policy with trajectories sampled from $p_\phi(\mathbf{z})$, and updating $\phi$ based on the performance of the policy $\pi_\theta$. To avoid the collapse of the learned distribution, we propose using a regularizer that encourages the distribution to be diverse. The idea is to sample more data from environments where improvement of the policy is possible, without collapsing to environments that are trivial to solve. We summarize our training and testing procedures in Algorithms 1 and 2. In our experiments, we use Proximal Policy Optimization (PPO) (Schulman et al., 2017) for the UpdatePolicy procedure in Algorithm 1.

Algorithm 1: Learning the policy and training distribution
  Require: testing distribution p(z), initial parameters phi of the learned distribution p_phi(z), initial policy pi_theta, buffer size B, total iterations N
  for i = 1 to N do
     Sample context z ~ p_phi(z)
     while buffer D contains fewer than B samples do
        Execute pi_theta(a | s, z) and append the transition to D
        if the state is terminal then sample a new context z ~ p_phi(z) end if
     end while
     theta <- UpdatePolicy(theta, D)
     phi <- UpdateDistribution(phi, p(z), pi_theta)
  end for

Algorithm 2: UpdateDistribution
  Require: learned distribution parameters phi, testing distribution p(z), policy pi_theta, total iterations M, total trajectory samples K
  for i = 1 to M do
     Sample contexts z_1, ..., z_K from p(z)
     Obtain Monte-Carlo estimates of J(theta, z_k) by executing pi_theta on environments with contexts z_k
     Update phi by gradient ascent on Eq. (1), using the estimator in Eq. (2)
  end for

Algorithm 3: Fine-tuning the policy at test-time
  Require: learned distribution parameters phi, policy pi_theta, buffer size B, total iterations N
  Initialize guess for context vector z
  for i = 1 to N do
     Collect B samples into D by executing policy pi_theta(a | s, z)
     Update theta and z to maximize task performance using D
  end for
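The distribution-update portion of the procedure can be sketched on a discretized one-dimensional context. The bin range, learning rates, and the toy stand-in for J(theta, z) below are illustrative assumptions, and the PPO policy update is elided:

```python
import numpy as np

rng = np.random.default_rng(0)

N_BINS = 100
support = np.linspace(0.01, 0.5, N_BINS)   # discretized context values z
phi = np.zeros(N_BINS)                     # logits of the learned p_phi(z)

def p_phi():
    e = np.exp(phi - phi.max())
    return e / e.sum()

def rollout_return(z):
    # Toy stand-in for a Monte-Carlo estimate of J(theta, z): pretend the
    # task is only solvable for small contexts (z < 0.2).
    return 1.0 if z < 0.2 else -1.0

lr, alpha = 0.1, 0.01
for epoch in range(1000):
    # (Policy update with trajectories sampled from p_phi(z) would go here.)
    # Distribution update: sample z from the wide *test* distribution p(z)
    # (uniform over bins), then apply the score-function gradient plus an
    # entropy regularizer, which for uniform p(z) plays the role of the
    # anti-collapse term in the objective.
    idx = rng.integers(N_BINS)
    ret = rollout_return(support[idx])
    p = p_phi()
    grad_logp = -p.copy()
    grad_logp[idx] += 1.0                  # d/dphi log p_phi(z_idx), softmax
    H = -np.sum(p * np.log(p + 1e-12))     # entropy of p_phi
    ent_grad = -p * (np.log(p + 1e-12) + H)
    phi += lr * ret * grad_logp + alpha * ent_grad

mass_solvable = p_phi()[support < 0.2].sum()  # mass on solvable contexts
```

Under this toy reward, the learned distribution shifts most of its probability mass onto the solvable region while the entropy term keeps it from collapsing to a single bin.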

4.1 Learning the Sweet-spot Distribution Range

The goal of our method is to find a training distribution to maximize the expected reward of the policy under the test distribution, while gradually reducing the sampling frequency of environments that make the task unsolvable¹. Such situations are common in physics-based simulations, where a bad selection of simulation parameters may lead to environments where the task is impossible due to physical limits or unstable simulations.

¹In this work, we consider a task solvable if there exists a policy that brings the environment to a set of desired goal states.

We start by assuming that the test distribution $p(\mathbf{z})$ has wide but bounded support, such that we get a distribution of solvable and unsolvable tasks. To update the training distribution, we use an objective of the following form:

$$\mathcal{L}(\phi) = \mathbb{E}_{\mathbf{z} \sim p_\phi(\mathbf{z})}\left[\mathcal{R}(\mathbf{z})\right] - \alpha \, D_{\mathrm{KL}}\left(p_\phi(\mathbf{z}) \,\|\, p(\mathbf{z})\right) \quad (1)$$

where the first term is designed to encourage improvement on environments that are more likely to be solvable, while the second term is a regularizer that keeps the distribution from collapsing. In our experiments, we set $\mathcal{R}(\mathbf{z}) = J(\theta, \mathbf{z})$, the performance of the current policy. Optimizing this objective encourages focusing on environments where the current policy performs the best. Other suitable objectives are the improvement over the previous policy, $J(\theta_i, \mathbf{z}) - J(\theta_{i-1}, \mathbf{z})$, or an estimate of the performance of the context-dependent optimal policy.

If we use the performance of the policy as a way of determining whether the task is solvable for a given context $\mathbf{z}$, then a trivial solution would be to make $p_\phi(\mathbf{z})$ concentrate on a few easy environments. The second term in Eq. (1) helps to avoid this issue by penalizing distributions that deviate too much from $p(\mathbf{z})$, which is assumed to be wide. When $p(\mathbf{z})$ is uniform, this is equivalent to maximizing the entropy of $p_\phi(\mathbf{z})$.

To estimate the gradient of Eq. (1) with respect to $\phi$, we use the log-derivative score function gradient estimator (Fu, 2006), resulting in the following Monte-Carlo update:

$$\phi \leftarrow \phi + \eta \left[ \frac{1}{N} \sum_{i=1}^{N} \hat{\mathcal{R}}(\mathbf{z}_i) \nabla_\phi \log p_\phi(\mathbf{z}_i) - \alpha \nabla_\phi D_{\mathrm{KL}}\left(p_\phi(\mathbf{z}) \,\|\, p(\mathbf{z})\right) \right] \quad (2)$$

where $\hat{\mathcal{R}}(\mathbf{z}_i)$ is a Monte-Carlo estimate of $J(\theta, \mathbf{z}_i)$. Updating $\phi$ with samples from the distribution we are learning has the problem that we never get information about the performance of the policy in low probability contexts under $p_\phi(\mathbf{z})$. This is problematic since, if a context $\mathbf{z}$ were assigned a low probability early in training, we would require a large number of samples to update its probability, even if the policy performs well on $\mathbf{z}$ during later stages of training. To address this issue, we use samples from $p(\mathbf{z})$ to evaluate the gradient of the first term. While changing the sampling distribution introduces bias, which could be corrected by using importance sampling, we find that both the second term in Eq. (1) and sampling from $p(\mathbf{z})$ are crucial to avoid the collapse of the learned distribution (see Fig. 4). To ensure that the two terms in Eq. (1) have similar scale, we standardize the evaluations of $\hat{\mathcal{R}}(\mathbf{z})$ with exponentially averaged batch statistics and set $\alpha$ to a fixed value.

5 Experiments

Figure 1: Illustrations of the 2D simulated robot models used in the experiments. The hopper (a) and half-cheetah (b) tasks present more challenging environments when varying dynamics.

We evaluate the impact of learning the DR distribution on two standard benchmark locomotion tasks: Hopper and Half-Cheetah from the MuJoCo tasks (illustrated in Fig. 1) in the OpenAI Gym suite (Brockman et al., 2016). We use an explicit encoding of the context vector $\mathbf{z}$, corresponding to the torso size, density, foot friction and joint damping of the environments. In this work, we focus on uni-dimensional domain randomization contexts, running experiments for each context variable independently². We selected $p(\mathbf{z})$ as a uniform distribution over ranges that include both solvable and unsolvable environments. We initialize $p_\phi(\mathbf{z})$ to be the same as $p(\mathbf{z})$. In these experiments, both distributions are implemented as discrete distributions with 100 bins. When sampling from these distributions, we first select a bin according to the discrete probabilities, then select a continuous context value uniformly at random from the range of the corresponding bin.

²To enable distribution learning with multi-dimensional contexts, we are exploring the use of parameterizations, different from the discrete distribution, that do not suffer from the curse of dimensionality.
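The bin-then-uniform sampling scheme described above can be sketched as follows; the support range (suggestive of torso size) and bin count are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Sample a continuous context from a binned distribution: first pick a bin
# according to the discrete probabilities, then draw uniformly within it.
def sample_context(probs, lo=0.01, hi=0.5):
    n = len(probs)
    edges = np.linspace(lo, hi, n + 1)   # n bins -> n + 1 edges
    k = rng.choice(n, p=probs)           # discrete bin selection
    return rng.uniform(edges[k], edges[k + 1])

probs = np.full(100, 1.0 / 100)          # initially uniform over 100 bins
z = sample_context(probs)
```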

We compare the test-time jump-start and asymptotic performance of policies learned with $p_\phi(\mathbf{z})$ (learned domain randomization) and $p(\mathbf{z})$ (fixed domain randomization). At test time, we sample (uniformly at random) a test set of 50 samples from the support of $p(\mathbf{z})$ and run policy search optimization, initializing the policy with the parameters obtained at training time. The questions we aim to answer with our experiments are: 1) does learning policies with wide DR distributions affect the performance of the policy in the environments where the task is solvable? 2) does learning the DR distribution converge to the actual ranges where the task is solvable? 3) is learning the DR distribution beneficial?

5.1 Results

Learned Distribution Ranges: Table 1 shows the ranges for $p(\mathbf{z})$ and the final equivalent ranges for the distributions found by our method. Figures 2 and 3 show the evolution of $p_\phi(\mathbf{z})$ during training, using our method. Each plot corresponds to a separate domain randomization experiment, where we randomized one simulator parameter while keeping the rest fixed. Initially, each of these distributions is uniform. As the agent improves over the training distribution, it becomes easier to discriminate between promising environments (where the task is solvable) and impossible ones, where rewards stay at low values. After enough training epochs, the distributions converge to their final form. For Hopper, the learned distributions correspond closely with the environments where we can find a policy using vanilla policy gradient methods from scratch. To determine the consistency of these results, we ran the Hopper torso size experiment 7 times and fitted the parameters of a uniform distribution to the resulting $p_\phi(\mathbf{z})$. The mean ranges (± one standard deviation) across the 7 experiments were consistent, which provides some evidence for the reproducibility of our method.

(a) Torso size
(b) Density
(c) Friction
(d) Damping
Figure 2: Evolution over time of the learned domain randomization distribution for Hopper. $p_\phi(\mathbf{z})$ is implemented as a discrete distribution with 100 bins. Each plot corresponds to a different experiment where we kept the other simulator parameters fixed at their default values. Lighter color corresponds to higher probabilities. The labels on the right correspond to the simulator parameter being varied.
(a) Torso size
(b) Density
(c) Friction
(d) Damping
Figure 3: Evolution over time of the learned domain randomization distribution for Half-Cheetah. Experimental details are the same as for the Hopper experiments.
Figure 4: Learned torso size distribution for Hopper. Panel (a) shows the distribution learned without the entropy regularizer, and panel (b) shows the distribution learned while sampling from the training distribution.
Environment / Parameter | Initial Train/Test Distribution | Converged Ranges
Hopper: Torso size, Joint damping
Half-Cheetah: Torso size, Joint damping
Table 1: Ranges of parameters for each environment at the beginning of training, and the equivalent ranges found by the algorithm, obtained by fitting a uniform distribution to the final learned distribution.

Learned vs Fixed Domain Randomization: We compare the jump-start and asymptotic performance between learning the domain randomization distribution and keeping it fixed. Our results compare our method with PPO as the policy optimizer (LSDR) against keeping the domain randomization distribution fixed (Fixed-DR). For both methods, we also compare whether a context-conditioned policy or a robust (context-free) policy generalizes better. We ran the same experiments for Hopper and Half-Cheetah.

Figures 5 and 6 depict learning curves when fine-tuning the policy at test time, for torso size randomization. All the methods start with the same random seed at training time. The policies are trained for a fixed number of epochs, with a fixed number of samples collected per epoch for the policy update. For the distribution update, we collect additional trajectories and run the gradient update several times (without re-sampling new trajectories). We report averages over 50 different environments (corresponding to samples from $p(\mathbf{z})$, one random seed per environment). For clarity of presentation, we report the comparison over a "reasonable" torso size range (where the locomotion task is feasible) and a "hard" range, where the policy fails to make the robot move forward. For Hopper, the reasonable torso size range is shown in Figure 5 and the hard range in Figure 7. For Half-Cheetah, the corresponding ranges are shown in Figures 6 and 8, respectively. From these results, it is clear that learning the domain randomization distribution improves on the jump-start and asymptotic performance over using fixed domain randomization, within the reasonable range. On the hard ranges, LSDR performs slightly worse. But in most of the contexts in this range the task is not actually solvable; i.e. the optimal policy found by vanilla PPO does not result in successful locomotion on the hard range³.

³Successful policies on Hopper obtain cumulative rewards above a threshold. For Half-Cheetah, the rewards are greater than 0 when the robot successfully moves forward.

Contextual policy: Figures 5 and 6 also compare the performance of a contextual policy to that of a non-contextual policy. Our results show that training a contextual policy boosts performance in both scenarios: when the domain randomization distribution is fixed, and when the distribution is being learned.

Figure 5: Comparison of test-time performance on Hopper between fixed vs learned domain randomization and context vs no context in the policy inputs. Tested with random seeds drawn from a fixed range.

Figure 6: Comparison of test-time performance on Half-Cheetah between fixed vs learned domain randomization and context vs no context in the policy inputs. Tested with random seeds drawn from a fixed range.

Figure 7: Comparison of test-time performance on Hopper between fixed vs learned domain randomization and context vs no context in the policy inputs. Tested with random seeds drawn from a fixed range.

Figure 8: Comparison of test-time performance on Half-Cheetah between fixed vs learned domain randomization and context vs no context in the policy inputs. Tested with random seeds drawn from a fixed range.

Using a different policy optimizer: We also experimented with EPOpt-PPO (Rajeswaran et al., 2016) as the policy optimizer in Algorithm 1. The motivation is to mitigate the bias towards environments with higher cumulative rewards early during training. EPOpt encourages improving the policy on the worst performing environments, at the expense of collecting more data per epoch. At each training epoch, we obtain samples from the training distribution and collect trajectories by executing the policy once on each of the corresponding environments. From the resulting trajectories, we use those with the lowest rewards to fill the buffer for a PPO policy update, discarding the rest. Figure 9 compares the effect of learning the domain randomization distribution vs using a fixed wide range in this setting. We found that learning the domain randomization distribution resulted in faster convergence to high reward policies over the evaluation range [0.01, 0.09], while also giving slightly better asymptotic performance. We believe this could be a consequence of lower variance in the policy gradient estimates, as the learned $p_\phi(\mathbf{z})$ has lower variance than $p(\mathbf{z})$. Interestingly, using EPOpt resulted in a distribution with a wider torso size range than vanilla PPO, demonstrating that optimizing worst-case performance does help in alleviating the bias towards high reward environments.
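The worst-trajectory selection step described above can be sketched as follows; the function name and epsilon fraction are illustrative, not taken from the EPOpt paper:

```python
import numpy as np

# Sketch of EPOpt-style trajectory selection: from a batch of rollouts,
# keep only the epsilon-fraction with the lowest returns, which the PPO
# update then uses (the remaining trajectories are discarded).
def worst_epsilon(trajectories, returns, eps=0.1):
    returns = np.asarray(returns)
    k = max(1, int(eps * len(returns)))       # at least one trajectory
    worst_idx = np.argsort(returns)[:k]       # indices of lowest returns
    return [trajectories[i] for i in worst_idx]
```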

Figure 9: Learning curves for torso randomization on the Hopper task, using EPOpt as the policy optimizer. Lines represent mean performance, while the shaded regions correspond to the maximum and minimum performance over the [0.01, 0.09] torso size range. Learning the domain randomization distribution results in faster convergence and higher asymptotic performance.

6 Discussion

By allowing the agent to learn a good representative distribution, we are able to learn to solve difficult control tasks that heavily rely on a good initial domain randomization range. Our main experimental validation of domain randomization distribution learning is in the domain of simulated robotic locomotion. As shown in our experiments, our method is not sensitive to the initial domain randomization distribution and is able to converge to a more diverse range, while staying within the feasible range.

In this work, we study uni-dimensional context distribution learning. Due to the curse of dimensionality, there are limitations to using a discrete distribution; we are currently experimenting with alternative distributions such as truncated normal distributions, approximations of discrete distributions, etc. Using multidimensional contexts should enable an agent trained in simulation to obtain experience that is closer to a real world robot, which is the goal of this work. An issue that requires further investigation is the fact that we use the same reward function over all environments, without considering the effect of the simulation parameters on the reward scale. For instance, in a challenging environment, the agent may obtain low rewards but still manage to produce a policy that successfully solves the task; e.g. successful forward locomotion in the Hopper task. A poorly constructed reward may not only lead to undesirable behavior, but may complicate distribution learning if the scale of the rewards for successful policies varies across contexts.