1 Introduction
Deep Reinforcement Learning (DeepRL) is a powerful technique for synthesizing locomotion controllers for robot systems. Inspired by successes in video games (Mnih et al., 2015) and board games (Silver et al., 2016), recent work has demonstrated the applicability of DeepRL in robotics (Levine et al., 2016). Since the data requirements of DeepRL make its direct application to real robot systems costly, or even infeasible, a large body of recent work has focused on training controllers in simulation and deploying them on a real robot system. This is particularly challenging, but it is crucial to realizing real-world development of these systems.
Robot simulators provide a solution to the data requirements of DeepRL. Except for simple robot systems in controlled environments, however, real robot experience may not correspond to the situations that were used in simulation, an issue known as the reality gap (Jakobi et al., 1995). One way to address the reality gap is to perform system identification to tune the simulation parameters. This approach works if collecting data on the target system is not prohibitively expensive and the number of simulation parameters is small. The reality gap may still exist, however, due to a misspecification of the simulation model.
Another method to shrink the reality gap is to train policies to maximize performance over a diverse set of simulation models, where the parameters of each model are sampled randomly, an approach known as domain randomization (DR). This aims to address the issue of model misspecification by providing diverse simulated experience. Domain randomization has been demonstrated to effectively produce controllers that can be trained in simulation with high likelihood of successful outcomes on a real robot system after deployment (Andrychowicz et al., 2018) and fine-tuning with real-world data (Chen et al., 2018).
While DR has been successful, one aspect that has not been addressed in depth is the selection of the domain randomization distribution. For vision-based components, DR should be tuned so that features learned in simulation do not depend strongly on the appearance of simulated environments. For the control components, the focus of this work, there is a dependency between optimal behaviour and the dynamics of the environment. In this case, the DR distribution should be selected carefully to ensure that the real robot experience is represented in the simulated experience sampled under DR. If real robot data is available, one could use gradient-free search (Chebotar et al., 2018) or Bayesian inference (Rajeswaran et al., 2016) to update the DR distribution after executing the learned policy on the target system. These methods are based on the assumption that there is a set of simulators from which real-world experience can be synthesized.
In this work we propose to learn the parameters of the simulator distribution, such that the policy is trained over the most diverse set of simulator parameters in which it can plausibly succeed. By making the simulation distribution as wide as possible, we aim to encode the largest set of behaviours that is possible in a single policy with fixed capacity. As shown in our experiments, training on the widest distribution possible has two problems: our models usually have finite capacity, and picking a domain randomization distribution that is too varied slows down convergence, as shown in Figure 9.
Instead, we let the optimization process focus on environments where the task is feasible. We propose an algorithm that simultaneously learns the domain randomization distribution while optimizing the policy to maximize performance over the learned distribution. To operate over a wide range of possible simulator parameters, we train context-aware policies which take as input the current state of the environment, alongside contextual information describing the sampled parameters of the simulator. This enables our policies to learn context-specific strategies which consider the current dynamics of the environment, rather than an average over all possible simulator parameters. When deployed on the target environment, we concurrently fine-tune the policy parameters while searching for the context that maximizes performance. We evaluate our method on a variety of control problems from the OpenAI Gym suite of benchmarks. We find that our method is able to improve on the performance of fixed domain randomization. Furthermore, we demonstrate our model’s robustness to initial simulator distribution parameters, showing that our method repeatably converges to similar domain randomization distributions across different experiments.
2 Related Work
Packer et al. (2018) present an empirical study of generalization in DeepRL, testing interpolation and extrapolation performance of state-of-the-art algorithms when varying simulation parameters in control tasks. The authors provide an experimental assessment of generalization under varying training and testing distributions. Our work extends these results to the case where the training distribution parameters are learned and change during policy training.
Chebotar et al. (2018) propose training policies on a distribution of simulators, whose parameters are fit to real-world data. Their proposed algorithm switches back and forth between optimizing the policy under the DR distribution and updating the DR distribution by minimizing the discrepancy between simulated and real-world trajectories. In contrast, we aim to learn policies that maximize performance over a diverse distribution of environments where the task is feasible, as a way of minimizing the interactions with the real robot system.
Rajeswaran et al. (2016) propose a related approach for learning robust policies over a distribution of simulator models. The proposed approach, based on the percentile conditional value at risk (CVaR) (Tamar et al., 2015) objective, improves the policy performance on the small proportion of environments where the policy performs the worst. The authors propose an algorithm that updates the distribution of simulation models to maximize the likelihood of real-world trajectories, via Bayesian inference. The combination of worst-case performance optimization and Bayesian updates ensures that the resulting policy is robust to errors in the estimation of the simulation model parameters. Our method can be combined with the CVaR objective to encourage diversity of the learned DR distribution.
Related to learning the DR distribution, Paul et al. (2018) propose using Bayesian Optimization (BO) to update the simulation model distribution. This is done by evaluating the improvement over the current policy by using a policy gradient algorithm with data sampled from the current simulator distribution. The parameters of the simulator distribution for the next iteration are selected to maximize said improvement.
Other related methods rely on policies that are conditioned on context: variables used to represent samples from the simulator distribution, either explicitly or implicitly. For example, Chen et al. (2018) propose learning a policy conditioned on the hardware properties of the robot, encoded as a vector. These represent variations in the dynamics of the environment (e.g. friction, mass), drawn from a fixed simulator distribution. When explicit, the context is equal to the simulator parameters. When implicit, the mapping between context vectors and simulator environments is learned during training, using policy optimization. At test time, when the true context is unknown, the context vector that is fed as input to the policy is obtained by gradient descent on the task performance objective. Similarly, Yu et al. (2018) propose training policies conditioned on simulator parameters (explicitly), then optimizing the context vector alone to maximize performance at test time. Training is done by collecting trajectories on a fixed simulation model distribution. The argument of the authors is that searching over context vectors is easier than searching over policy parameters. The proposed method relies on population-based gradient-free search for optimizing the context vector to maximize task performance. Our method follows a similar approach to these methods, but we focus on learning the training distribution.
Rakelly et al. (2019) also use context-conditioned policies, where the context is implicitly encoded into a vector. During the training phase, their proposed algorithm improves the performance of the policy while learning a probabilistic mapping from trajectory data to context vectors. At test time, the learned mapping is used for online inference of the context vector. This is similar in spirit to the Universal Policies with Online System Identification method (Yu et al., 2017), which instead uses deterministic context inference with an explicit context encoding. Again, these methods use a fixed DR distribution and could benefit from adapting it during training, as we propose in this work.
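The test-time context search used by several of the methods above can be illustrated with a small sketch. The code below is a generic cross-entropy-style, population-based search over a scalar context; it is not the specific optimizer used by Yu et al. (2018) or Chen et al. (2018). The function name `search_context`, the quadratic stand-in for task performance, and all hyperparameters are assumptions made for illustration.

```python
import random

def search_context(evaluate, low, high, rng, iters=20, pop=32, elite=8):
    """Gradient-free search for the context value that maximizes evaluate(z)."""
    mean, std = (low + high) / 2.0, (high - low) / 2.0
    for _ in range(iters):
        # Sample a population of candidate contexts, clipped to the valid range.
        cands = [min(high, max(low, rng.gauss(mean, std))) for _ in range(pop)]
        # Keep the best-performing candidates and refit the search distribution.
        cands.sort(key=evaluate, reverse=True)
        best = cands[:elite]
        mean = sum(best) / elite
        std = max(1e-3, (sum((c - mean) ** 2 for c in best) / elite) ** 0.5)
    return mean

rng = random.Random(0)
true_context = 0.3  # hypothetical unknown context of the target environment

def toy_performance(z):
    # Stand-in for running the context-conditioned policy with input z.
    return -(z - true_context) ** 2

estimate = search_context(toy_performance, 0.0, 1.0, rng)
```

In a real setting each call to `evaluate` would require one or more rollouts on the target system, so the population size and number of iterations directly determine the amount of target-domain data used.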
3 Problem Statement
We consider parametric Markov Decision Processes (MDPs) (Sutton & Barto, 2018). An MDP $\mathcal{M}$ is defined by the tuple $\langle \mathcal{S}, \mathcal{A}, p, r, \gamma, \rho_0 \rangle$, where $\mathcal{S}$ is the set of possible states and $\mathcal{A}$ is the set of actions, $p: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}$ encodes the state transition dynamics, $r: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is the task-dependent reward function, $\gamma$ is a discount factor, and $\rho_0: \mathcal{S} \to \mathbb{R}$ is the initial state distribution. Let $s_t$ and $a_t$ be the state and action taken at time $t$. At the beginning of each episode, $s_0 \sim \rho_0(\cdot)$. Trajectories $\tau$ are obtained by iteratively sampling actions using the current policy $\pi$ and evaluating next states according to the transition dynamics $s_{t+1} \sim p(s_{t+1} \mid s_t, a_t)$. Given an MDP $\mathcal{M}$, the goal is then to learn a policy $\pi$ that maximizes the expected sum of rewards $J_{\mathcal{M}}(\pi) = \mathbb{E}_{\tau}[\sum_t \gamma^t r_t]$, where $r_t = r(s_t, a_t)$.
In our work, we aim to maximize performance over a distribution of MDPs, each described by a context vector $z$ representing the variables that change over the distribution: changes in transition dynamics, rewards, initial state distribution, etc. Thus, our objective is to maximize $\mathbb{E}_{z \sim p(z)}[J_{\mathcal{M}_z}(\pi)]$, where $p(z)$ is the domain randomization distribution. Similar to (Yu et al., 2018; Chen et al., 2018; Rakelly et al., 2019), we condition the policy on the context vector, $\pi(a \mid s, z)$. In the experiments reported in this paper, we let $z$ encode the parameters of the transition model in a physically based simulator; e.g. mass, friction or damping.
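As a concrete reading of the objective above, the following sketch computes the discounted return of one trajectory and a Monte Carlo estimate of the expected return over sampled contexts. It is only an illustration: `toy_rollout` is a hypothetical stand-in for executing a policy in a simulator configured with a given context, and the helper names are not taken from the paper.

```python
import random

def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over one trajectory."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total

def expected_return_over_contexts(rollout, sample_context, n_contexts, gamma):
    """Monte Carlo estimate of the expected return over sampled contexts."""
    returns = []
    for _ in range(n_contexts):
        z = sample_context()
        returns.append(discounted_return(rollout(z), gamma))
    return sum(returns) / len(returns)

def toy_rollout(z, horizon=5):
    # Hypothetical simulator: the task only yields reward when z < 0.5.
    return [1.0 if z < 0.5 else 0.0] * horizon

rng = random.Random(0)
estimate = expected_return_over_contexts(
    toy_rollout, lambda: rng.uniform(0.0, 1.0), n_contexts=200, gamma=0.9)
```

With a wide context distribution, a large share of the sampled contexts contribute zero return in this toy setting, which is the situation the proposed method is designed to handle.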
4 Proposed Method
In practice, making the context distribution $p(z)$ as wide as possible may be detrimental to the objective of maximizing performance. For instance, if the distribution has infinite support and wide variance, there may be more environments sampled from the context distribution for which the desired task is impossible (e.g. reaching a target state). Thus, sampling trajectories from a wide context distribution results in high variance in the directions of improvement, slowing progress on policy learning. On the other hand, if we make the context distribution too narrow, policy learning can progress more rapidly but may not generalize to the whole set of possible contexts.
We introduce the LSDR (Learning the Sweetspot Distribution Range) algorithm for concurrently learning a domain randomization distribution and a robust policy that maximizes performance over it. Instead of directly sampling from $p(z)$, we use a surrogate distribution $p_\phi(z)$, with trainable parameters $\phi$. Our goal is to find appropriate parameters $\phi$ to optimize $p_\phi(z)$. LSDR proceeds by updating the policy with trajectories sampled from $p_\phi(z)$, and updating $\phi$ based on the performance of the policy $\pi$. To avoid the collapse of the learned distribution, we propose using a regularizer that encourages the distribution to be diverse. The idea is to sample more data from environments where improvement of the policy is possible, without collapsing to environments that are trivial to solve. We summarize our training and testing procedures in Algorithms 1 and 2. In our experiments, we use Proximal Policy Optimization (PPO) (Schulman et al., 2017) for the UpdatePolicy procedure in Algorithm 1.
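Algorithms 1 and 2 are not reproduced in this section. As a rough sketch of the alternation described above, and not the authors' implementation, the training loop could be organized as follows. Every function name and argument here is a hypothetical interface: the caller supplies the sampler, the rollout collector, the policy optimizer (PPO in the paper) and the distribution update.

```python
def lsdr_train(policy, dist_params, sample_context, collect_rollouts,
               update_policy, update_distribution, n_epochs, batch_size):
    """Alternate policy updates and distribution updates (hypothetical interface)."""
    for _ in range(n_epochs):
        # 1) Sample contexts from the current learned distribution.
        contexts = [sample_context(dist_params) for _ in range(batch_size)]
        # 2) Roll out the context-conditioned policy in the sampled environments.
        rollouts = collect_rollouts(policy, contexts)
        # 3) Policy improvement step on the collected data.
        policy = update_policy(policy, rollouts)
        # 4) Move the distribution using the updated policy's performance.
        dist_params = update_distribution(dist_params, policy)
    return policy, dist_params

# Toy usage with placeholder callables, only to show the call pattern.
final_policy, final_params = lsdr_train(
    policy=0,
    dist_params=0.0,
    sample_context=lambda params: params,
    collect_rollouts=lambda pol, ctxs: [(z, pol) for z in ctxs],
    update_policy=lambda pol, rollouts: pol + 1,
    update_distribution=lambda params, pol: params + 0.1 * pol,
    n_epochs=3,
    batch_size=4,
)
```

Passing the components as callables keeps the sketch independent of any particular simulator or optimizer; the regularized distribution update itself is the subject of Section 4.1.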
4.1 Learning the Sweetspot Distribution Range
The goal of our method is to find a training distribution $p_\phi(z)$ that maximizes the expected reward of the policy under the test distribution $p(z)$, while gradually reducing the sampling frequency of environments that make the task unsolvable. (In this work, we consider a task solvable if there exists a policy that brings the environment to a set of desired goal states.) Such a situation is common in physics-based simulations, where a bad selection of simulation parameters may lead to environments where the task is impossible due to physical limits or unstable simulations.
We start by assuming that the test distribution has wide but bounded support, such that we get a distribution of solvable and unsolvable tasks. To update the training distribution, we use an objective of the following form:
$$\arg\max_{\phi} \; \mathbb{E}_{z \sim p_\phi(z)}[L(z)] - \alpha D_{KL}(p_\phi(z) \,\|\, p(z)) \qquad (1)$$
where the first term is designed to encourage improvement on environments that are more likely to be solvable, while the second term is a regularizer, weighted by $\alpha$, that keeps the distribution from collapsing. In our experiments, we set $L(z) = J_{\mathcal{M}_z}(\pi)$. Optimizing this objective encourages focusing on environments where the current policy performs the best. Other suitable objectives are the improvement over the previous policy, or an estimate of the performance of the context-dependent optimal policy.
If we use the performance of the policy as a way of determining whether the task is solvable for a given context $z$, then a trivial solution would be to make $p_\phi(z)$ concentrate on a few easy environments. The second term in Eq. (1) helps to avoid this issue by penalizing distributions that deviate too much from $p(z)$, which is assumed to be wide. When $p(z)$ is uniform this is equivalent to maximizing the entropy of $p_\phi(z)$.
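To make the role of the regularizer concrete, the sketch below evaluates an objective of the form described above (expected performance minus a weighted KL penalty) for discrete distributions over a handful of bins. The function names, the four-bin example and the weight of 0.1 are illustrative assumptions, not values from the paper.

```python
import math

def kl_divergence(q, p):
    """KL(q || p) for discrete distributions given as lists of probabilities."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0.0)

def entropy(q):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(qi * math.log(qi) for qi in q if qi > 0.0)

def regularized_objective(q, p, performance, alpha):
    """Expected per-bin performance under q, minus alpha * KL(q || p)."""
    expected = sum(qi * li for qi, li in zip(q, performance))
    return expected - alpha * kl_divergence(q, p)

prior = [0.25, 0.25, 0.25, 0.25]      # wide (uniform) reference distribution
performance = [1.0, 1.0, 0.0, 0.0]    # hypothetical: only two bins are solvable

collapsed = [1.0, 0.0, 0.0, 0.0]      # all mass on one easy bin
spread = [0.5, 0.5, 0.0, 0.0]         # mass shared by both solvable bins

scores = {
    "uniform": regularized_objective(prior, prior, performance, alpha=0.1),
    "collapsed": regularized_objective(collapsed, prior, performance, alpha=0.1),
    "spread": regularized_objective(spread, prior, performance, alpha=0.1),
}
```

In this toy setting the distribution spread over both solvable bins scores higher than the collapsed one, and both score higher than the uniform prior. For a uniform prior over N bins, `kl_divergence(q, prior)` equals `math.log(N) - entropy(q)`, which is why the penalty acts as an entropy bonus.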
To estimate the gradient of Eq. (1) with respect to $\phi$, we use the log-derivative score function gradient estimator (Fu, 2006), resulting in the following Monte Carlo update:
$$\nabla_\phi \left( \mathbb{E}_{z \sim p_\phi(z)}[L(z)] - \alpha D_{KL}(p_\phi(z) \,\|\, p(z)) \right) \approx \frac{1}{N} \sum_{i=1}^{N} L(z_i) \nabla_\phi \log p_\phi(z_i) - \alpha \nabla_\phi D_{KL}(p_\phi(z) \,\|\, p(z)) \qquad (2)$$
where $z_i \sim p_\phi(z)$. Updating $\phi$ with samples from the distribution we are learning has the problem that we never get information about the performance of the policy in low-probability contexts under $p_\phi(z)$. This is problematic since if a context $z$ were assigned a low probability early in training, we would require a large number of samples to update its probability, even if the policy performs well on $z$ during later stages of training. To address this issue, we use samples from $p(z)$ to evaluate the gradient of the objective. While changing the sampling distribution introduces bias, which could be corrected by using importance sampling, we find that both the second term in Eq. (1) and sampling from $p(z)$ are crucial to avoid the collapse of the learned distribution (see Fig. 4). To ensure that the two terms in Eq. (1) have similar scale, we standardize the evaluations of $L(z)$ with exponentially averaged batch statistics and set $\alpha$ to the fixed value of .
5 Experiments
We evaluate the impact of learning the DR distribution on two standard benchmark locomotion tasks: Hopper and HalfCheetah from the MuJoCo tasks (illustrated in Fig. 1) in the OpenAI Gym suite (Brockman et al., 2016). We use an explicit encoding of the context vector $z$, corresponding to the torso size, density, foot friction and joint damping of the environments. In this work, we focus on unidimensional domain randomization contexts and run experiments for each context variable independently. (To enable distribution learning over multidimensional contexts, we are exploring the use of parameterizations, different from the discrete distribution, that do not suffer from the curse of dimensionality.) We selected the test distribution $p(z)$ as a uniform distribution over ranges that include both solvable and unsolvable environments. We initialize the training distribution $p_\phi(z)$ to be the same as $p(z)$. In these experiments, both distributions are implemented as discrete distributions with 100 bins. When sampling from this distribution, we first select a bin according to the discrete probabilities, then select a continuous context value uniformly at random from the range of the corresponding bin.
We compare the test-time jumpstart and asymptotic performance of policies learned with $p_\phi(z)$ (learned domain randomization) and $p(z)$ (fixed domain randomization). At test time, we sample (uniformly at random) a test set of 50 samples from the support of $p(z)$ and run policy search optimization, initializing the policy with the parameters obtained at training time. The questions we aim to answer with our experiments are: 1) does learning policies with wide DR distributions affect the performance of the policy in the environments where the task is solvable? 2) does learning the DR distribution converge to the actual ranges where the task is solvable? 3) is learning the DR distribution beneficial?
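The binned distribution described above, together with a score-function update in the spirit of Section 4.1, can be sketched as follows. This is a toy illustration under several assumptions that are not from the paper: the learned distribution is parameterized by softmax logits, there are 20 bins instead of 100, performance is a hypothetical indicator of solvability, scores are standardized per batch rather than with exponentially averaged statistics, and the weight `alpha`, step size `lr` and batch size are arbitrary.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_context(probs, low, high, rng):
    """Pick a bin by its probability, then a uniform value inside that bin."""
    n = len(probs)
    k = rng.choices(range(n), weights=probs)[0]
    width = (high - low) / n
    return k, low + (k + rng.random()) * width

def distribution_step(logits, prior, performance, low, high, rng,
                      alpha=1.0, lr=0.5, batch=64):
    """One ascent step on expected performance minus alpha * KL(learned || prior)."""
    q = softmax(logits)
    # Contexts are drawn from the fixed prior, not from the learned distribution.
    draws = [sample_context(prior, low, high, rng) for _ in range(batch)]
    scores = [performance(z) for _, z in draws]
    mean = sum(scores) / batch
    std = math.sqrt(sum((s - mean) ** 2 for s in scores) / batch) + 1e-8
    grad = [0.0] * len(logits)
    for (k, _), s in zip(draws, scores):
        adv = (s - mean) / std  # per-batch standardization (a simplification)
        for j in range(len(logits)):
            # d log softmax(logits)[k] / d logits[j] = 1[j == k] - q[j]
            grad[j] += adv * ((1.0 if j == k else 0.0) - q[j]) / batch
    kl = sum(qj * math.log(qj / pj) for qj, pj in zip(q, prior) if qj > 0.0)
    for j, (qj, pj) in enumerate(zip(q, prior)):
        if qj > 0.0:
            # Gradient of KL(q || prior) with respect to logit j.
            grad[j] -= alpha * qj * (math.log(qj / pj) - kl)
    return [v + lr * g for v, g in zip(logits, grad)]

rng = random.Random(0)
n_bins, low, high = 20, 0.0, 1.0
prior = [1.0 / n_bins] * n_bins
logits = [0.0] * n_bins  # the learned distribution starts equal to the prior

def toy_performance(z):
    # Hypothetical signal: the task is solvable only for z < 0.5.
    return 1.0 if z < 0.5 else 0.0

for _ in range(300):
    logits = distribution_step(logits, prior, toy_performance, low, high, rng)
learned = softmax(logits)
solvable_mass = sum(learned[: n_bins // 2])
```

In this toy run the learned distribution shifts its mass toward the solvable half of the range while remaining spread across the bins inside it, rather than collapsing onto a single bin.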
5.1 Results
Learned Distribution Ranges: Table 1 shows the ranges for $p(z)$ and the final equivalent ranges for the distributions found by our method. Figures 2 and 3 show the evolution of $p_\phi(z)$ during training, using our method. Each plot corresponds to a separate domain randomization experiment, where we randomized one different simulator parameter while keeping the rest fixed. Initially, each of these distributions is uniform. As the agent becomes better over the training distribution, it becomes easier to discriminate between promising environments (where the task is solvable) and impossible ones where rewards stay at low values. After around epochs, the distributions have converged to their final distributions. For Hopper, the learned distributions correspond closely to the environments where we can find a policy using vanilla policy gradient methods from scratch. To determine the consistency of these results, we ran the Hopper torso size experiment 7 times, and fitted the parameters of a uniform distribution to the resulting $p_\phi(z)$. The mean ranges (± one standard deviation) across the 7 experiments were , which provides some evidence for the reproducibility of our method.
Environment  Parameters  Initial Train/Test Distribution  Converged Ranges
Hopper  Torso size  
Density  
Friction  
Joint Damping  
HalfCheetah  Torso size  
Density  
Friction  
Joint Damping 
Learned vs Fixed Domain Randomization: We compare the jumpstart and asymptotic performance between learning the domain randomization distribution and keeping it fixed. Our results compare our method, using PPO as the policy optimizer (LSDR), against keeping the domain randomization distribution fixed (FixedDR). For these methods, we also compare whether training a context-conditioned policy or a robust policy is better at generalization. We ran the same experiments for Hopper and HalfCheetah.
Figures 5 and 6 depict learning curves when fine-tuning the policy at test time, for torso size randomization. All the methods start with the same random seed at training time. The policies are trained for epochs, where we collect samples per epoch for the policy update. For the distribution update, we collect additional trajectories and run the gradient update times (without resampling new trajectories). We report averages over 50 different environments (corresponding to samples from $p(z)$, one random seed per environment). For clarity of presentation, we report the comparison over a “reasonable” torso size range (where the locomotion task is feasible) and a “hard” range, where the policy fails to make the robot move forward. For Hopper, the reasonable torso size range corresponds to (Figure 5), while the hard range to (Figure 7). For HalfCheetah the torso size ranges are (Figure 6) and (Figure 8), respectively. From these results, it is clear that learning the domain randomization distribution improves on the jumpstart and asymptotic performance over using fixed domain randomization, within the reasonable range. On the hard ranges, LSDR performs slightly worse. However, in most of the contexts in this range the task is not actually solvable; i.e. the optimal policy found by vanilla PPO does not result in successful locomotion on the hard range. (Successful policies on Hopper obtain cumulative rewards of at least . For HalfCheetah, the rewards are greater than 0 when the robot successfully moves forward.)
Contextual policy: Figures 5 and 6 also compare the performance of a contextual policy to that of a non-contextual policy. Our results show that training a contextual policy boosts performance in both scenarios: when the domain randomization distribution is fixed, and when the distribution is being learned.
Using a different policy optimizer: We also experimented with using EPOptPPO (Rajeswaran et al., 2016) as the policy optimizer in Algorithm 1. The motivation for this is to mitigate the bias towards environments with higher cumulative rewards early during training. EPOpt encourages improving the policy on the worst-performing environments, at the expense of collecting more data per epoch. At each training epoch, we obtain samples from the training distribution and obtain trajectories by executing the policy once on each of the corresponding environments. From the resulting trajectories, we use the trajectories that resulted in the lowest rewards to fill the buffer for a PPO policy update, discarding the rest of the trajectories. Figure 9 compares the effect of learning the domain randomization distribution vs using a fixed wide range in this setting. We found that learning the domain randomization distribution resulted in faster convergence to high-reward policies over the evaluation range [0.01, 0.09], while resulting in a slightly better asymptotic performance. We believe this could be a consequence of lower variance in the policy gradient estimates, as the learned $p_\phi(z)$ has lower variance than $p(z)$. Interestingly, using EPOpt resulted in a distribution with a wider torso size range than vanilla PPO, from approximately to , demonstrating that optimizing worst-case performance does help in alleviating the bias towards high-reward environments.
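The trajectory-selection step described in this paragraph can be sketched in a few lines. This is a minimal illustration of keeping only the lowest-return rollouts before a policy update, assuming a hypothetical helper `select_worst_trajectories` and a `fraction` argument; the batch sizes and the fraction actually used are not recoverable from this section.

```python
def select_worst_trajectories(trajectories, returns, fraction):
    """Keep the fraction of trajectories with the lowest returns (EPOpt-style)."""
    n_keep = max(1, int(len(trajectories) * fraction))
    # Sort trajectory indices from lowest to highest return.
    order = sorted(range(len(trajectories)), key=lambda i: returns[i])
    return [trajectories[i] for i in order[:n_keep]]

# Toy usage: trajectories are represented by labels only.
rollouts = ["a", "b", "c", "d"]
rollout_returns = [3.0, 1.0, 4.0, 2.0]
buffer = select_worst_trajectories(rollouts, rollout_returns, fraction=0.5)
```

Only the selected rollouts would then be passed to the policy update, which is why this variant needs more simulated data per epoch than the plain PPO setting.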
6 Discussion
By allowing the agent to learn a good representative distribution, we are able to learn to solve difficult control tasks that heavily rely on a good initial domain randomization range. Our main experimental validation of domain randomization distribution learning is in the domain of simulated robotic locomotion. As shown in our experiments, our method is not sensitive to the initial domain randomization distribution and is able to converge to a more diverse range, while staying within the feasible range.
In this work, we study unidimensional context distribution learning. Due to the curse of dimensionality, there are limitations in using a discrete distribution; we are currently experimenting with alternative distributions such as truncated normal distributions, approximations of discrete distributions, etc. Using multidimensional contexts should enable an agent trained in simulation to obtain experience that is closer to that of a real-world robot, which is the goal of this work. An issue that requires further investigation is the fact that we use the same reward function over all environments, without considering the effect of the simulation parameters on the reward scale. For instance, in a challenging environment, the agent may obtain low rewards but still manage to produce a policy that successfully solves the task; e.g. successful forward locomotion in the Hopper task. A poorly constructed reward may not only lead to undesirable behavior, but may also complicate distribution learning if the scale of the rewards for successful policies varies across contexts.
References
 Andrychowicz et al. (2018) Andrychowicz, M., Baker, B., Chociej, M., Jozefowicz, R., McGrew, B., Pachocki, J., Petron, A., Plappert, M., Powell, G., Ray, A., et al. Learning dexterous in-hand manipulation. arXiv preprint arXiv:1808.00177, 2018.
 Brockman et al. (2016) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. OpenAI Gym, 2016.
 Chebotar et al. (2018) Chebotar, Y., Handa, A., Makoviychuk, V., Macklin, M., Issac, J., Ratliff, N., and Fox, D. Closing the sim-to-real loop: Adapting simulation randomization with real world experience. arXiv preprint arXiv:1810.05687, 2018.
 Chen et al. (2018) Chen, T., Murali, A., and Gupta, A. Hardware conditioned policies for multi-robot transfer learning. In Advances in Neural Information Processing Systems, pp. 9355–9366, 2018.
 Fu (2006) Fu, M. C. Gradient estimation. Handbooks in operations research and management science, 13:575–616, 2006.
 Jakobi et al. (1995) Jakobi, N., Husbands, P., and Harvey, I. Noise and the reality gap: The use of simulation in evolutionary robotics. In European Conference on Artificial Life, pp. 704–720. Springer, 1995.
 Levine et al. (2016) Levine, S., Finn, C., Darrell, T., and Abbeel, P. End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research, 17(1):1334–1373, 2016.
 Mnih et al. (2015) Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.
 Packer et al. (2018) Packer, C., Gao, K., Kos, J., Krähenbühl, P., Koltun, V., and Song, D. Assessing generalization in deep reinforcement learning. arXiv preprint arXiv:1810.12282, 2018.
 Paul et al. (2018) Paul, S., Osborne, M. A., and Whiteson, S. Contextual policy optimisation. CoRR, abs/1805.10662, 2018. URL http://arxiv.org/abs/1805.10662.
 Rajeswaran et al. (2016) Rajeswaran, A., Ghotra, S., Levine, S., and Ravindran, B. EPOpt: Learning robust neural network policies using model ensembles. CoRR, abs/1610.01283, 2016. URL http://arxiv.org/abs/1610.01283.
 Rakelly et al. (2019) Rakelly, K., Zhou, A., Quillen, D., Finn, C., and Levine, S. Efficient off-policy meta-reinforcement learning via probabilistic context variables. arXiv preprint arXiv:1903.08254, 2019.
 Schulman et al. (2017) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

 Silver et al. (2016) Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484, 2016.
 Sutton & Barto (2018) Sutton, R. and Barto, A. Reinforcement Learning: An Introduction. Adaptive Computation and Machine Learning series. MIT Press, 2018. ISBN 9780262039246. URL https://books.google.ca/books?id=6DKPtQEACAAJ.
 Tamar et al. (2015) Tamar, A., Glassner, Y., and Mannor, S. Optimizing the CVaR via sampling. In Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.
 Yu et al. (2017) Yu, W., Tan, J., Liu, C. K., and Turk, G. Preparing for the unknown: Learning a universal policy with online system identification. arXiv preprint arXiv:1702.02453, 2017.
 Yu et al. (2018) Yu, W., Liu, C. K., and Turk, G. Policy transfer with strategy optimization. CoRR, abs/1810.05751, 2018. URL http://arxiv.org/abs/1810.05751.