Active Domain Randomization

04/09/2019 · Bhairav Mehta, et al.

Domain randomization is a popular technique for improving domain transfer, often used in a zero-shot setting when the target domain is unknown or cannot easily be used for training. In this work, we empirically examine the effects of domain randomization on agent generalization. Our experiments show that domain randomization may lead to suboptimal, high-variance policies, which we attribute to the uniform sampling of environment parameters. We propose Active Domain Randomization, a novel algorithm that learns a parameter sampling strategy. Our method looks for the most informative environment variations within the given randomization ranges by leveraging the discrepancies of policy rollouts in randomized and reference environment instances. We find that training more frequently on these instances leads to better overall agent generalization. In addition, when domain randomization and policy transfer fail, Active Domain Randomization offers more insight into the deficiencies of both the chosen parameter ranges and the learned policy, allowing for more focused debugging. Our experiments across various physics-based simulated tasks and a real-robot task show that this enhancement leads to more robust, consistent policies.

1 Introduction

Recent trends in Deep Reinforcement Learning (DRL) exhibit a growing interest in zero-shot domain transfer, i.e. when a policy is learned in a source domain and is then tested without finetuning in an unseen target domain. Zero-shot transfer is particularly useful when the task in the target domain is inaccessible, complex, or expensive, such as gathering rollouts from a real-world robot. An ideal agent would learn to generalize across domains; it would accomplish the task without exploiting irrelevant features or deficiencies in the source domain (i.e., approximate physics in simulators), which may vary dramatically after transfer.

Figure 1: Agent generalization, expressed as performance across different engine strength settings in LunarLander. We compare the following approaches: Baseline, i.e. default environment dynamics; Uniform Domain Randomization (UDR); Active Domain Randomization (ADR, our approach which actively searches for difficult MDP instances to train on); and Oracle, i.e. a handpicked randomization range. For evaluation, we take each sampling strategy’s final policies and evaluate them across the full range of environment parameters (i.e. vary main engine strength, which affects the responsiveness and landing speed of the simulated lander). ADR learns a sampling strategy that allows for near-expert levels of generalization, while both Baseline and UDR fail to solve lower MES environments.
Figure 2: ADR shows benefits over UDR on a wide range of tasks, including in a sim2real reaching task. The 4 DoF simulated robot must learn an efficient policy to reach a virtual point (shown in pink), and the final policies are evaluated on the real robot. We show that ADR policies transfer more robustly during zero-shot transfer to the more difficult real-world robot environment.

One promising approach for zero-shot transfer has been domain randomization (Tobin et al., 2017). In Domain Randomization (DR), we uniformly randomize environment parameters of the simulation (e.g. friction, motor torque) across predefined ranges after every training episode. By randomizing everything that might vary in the target environment, the hope is that the agent will view the target domain as just another variation. However, recent works suggest that the sample complexity grows exponentially with the number of randomization parameters, even when dealing only with transfer between simulations (e.g. Figure 8 in Andrychowicz et al. (2018)). In addition, when domain randomization is used unsuccessfully, policy transfer fails as a black box: after a failed transfer, randomization ranges are tweaked heuristically via trial and error. Repeating this process iteratively, researchers are often left with arbitrary ranges that do (or do not) lead to policy convergence, without any insight into how those settings may affect the learned behavior.

In this work, we demonstrate that the strategy of uniformly sampling environment parameters is suboptimal and propose an alternative method, Active Domain Randomization. Active Domain Randomization (ADR) formulates domain randomization as a search for randomized environments that maximize utility for the agent policy. Concretely, we aim to find environments that currently pose difficulties for the agent policy and dedicate more training time to these troublesome parameter settings. We cast this active search as a Reinforcement Learning (RL) problem where the ADR sampling policy is parameterized with Stein Variational Policy Gradient (SVPG). ADR hones in on problematic regions of the randomization space by learning a discriminative reward computed from discrepancies in policy rollouts generated in randomized and reference environments.

We showcase ADR on a simple environment where the benefits of training on more challenging variations are apparent and interpretable (Figure 1), and demonstrate that ADR learns to preferentially select parameters from these regions while still adapting to the policy’s current deficiencies. We then apply ADR to more complex environments and real robot settings (Figure 2) and show that even with high-dimensional search spaces and unmodeled dynamics, policies trained with ADR exhibit superior generalization and lower overall variance than their Uniform Domain Randomization (UDR) counterparts.

The key contributions of our work can be summarized as:

  1. Our proposed ADR method learns an adaptive randomization strategy that finds problematic environments within the given randomization ranges. Across a wide variety of tasks, we find that training preferentially on these environments leads to better generalization.

  2. ADR can provide insight into which dimensions and parameter ranges are most influential before transfer, which can aid the tuning of randomization ranges before expensive experiments are undertaken.

2 Background

In this section, we briefly cover the basics of RL (used to train both the agent policy and the ADR sampling policy), domain randomization, and Stein Variational Policy Gradient (SVPG), which parameterizes the ADR sampling policy.

2.1 Reinforcement Learning

We consider an RL framework (Sutton & Barto, 2018) where a task $T$ is defined by a Markov Decision Process (MDP) consisting of a state space $\mathcal{S}$, action space $\mathcal{A}$, state transition function $\mathcal{P}$, reward function $r$, and discount factor $\gamma$. The goal for an agent trying to solve the task is to learn a policy $\pi_\theta$ with parameters $\theta$ that maximizes the expected total discounted reward. We define a rollout $\tau$ to be the sequence of states and actions executed by a policy in the environment.

2.2 Domain Randomization

Domain randomization (DR) is a technique for training policies entirely in simulation and transferring them in a zero-shot manner to the real world. DR requires a prescribed set of simulation parameters to randomize, as well as corresponding ranges to sample them from. A set of parameters $\xi$ is sampled from the randomization space $\Xi$, where each randomization parameter $\xi^{(i)}$ is bounded on a closed interval $[\xi^{(i)}_{low}, \xi^{(i)}_{high}]$.

When a configuration $\xi$ is passed to a non-differentiable simulator $S$, it generates an environment $E = S(\xi)$. At the start of each episode, the parameters are uniformly sampled from their ranges, and the environment generated from those values is used to train the agent policy $\pi_\theta$.

DR may perturb any or all elements of the task $T$'s underlying MDP (the effects of DR on the action space are usually implicit or are carried out on the simulation side), with the exception of $\mathcal{A}$ and $\gamma$, which are kept constant. DR therefore generates a set of MDPs that are superficially similar, but can vary greatly in difficulty depending on the character of the randomization. Upon transfer to the target domain, the expectation is that the agent policy has learned to generalize across MDPs, and sees the final domain as just another variation of parameters.

The most common instantiation of DR, which we refer to as Uniform Domain Randomization (UDR), is summarized in Algorithm 2 in Appendix B. UDR generates randomized environment instances by sampling each parameter uniformly from its range, $\xi^{(i)} \sim U[\xi^{(i)}_{low}, \xi^{(i)}_{high}]$. The agent policy $\pi_\theta$ is then trained on rollouts produced in the randomized environments $E$.
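As a concrete illustration, the following is a minimal sketch of a single UDR training episode. The simulator factory `make_env` and the `agent.act` / `agent.update` methods are hypothetical placeholders for this sketch, not the paper's released API.

```python
# A minimal sketch of one UDR episode (cf. Algorithm 2, Appendix B).
import numpy as np

def udr_episode(agent, make_env, bounds):
    """bounds: list of (low, high) closed intervals, one per randomized parameter."""
    # Uniformly sample a configuration xi from the randomization space.
    xi = np.array([np.random.uniform(lo, hi) for lo, hi in bounds])
    env = make_env(xi)                        # generate the randomized instance E = S(xi)
    obs, done, transitions = env.reset(), False, []
    while not done:
        action = agent.act(obs)
        obs_next, reward, done, _ = env.step(action)
        transitions.append((obs, action, reward, obs_next, done))
        obs = obs_next
    agent.update(transitions)                 # e.g. an off-policy DDPG update
    return transitions
```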

2.3 Stein Variational Policy Gradient

Sufficient exploration in high-dimensional state spaces has always been a difficult problem in RL. Recently, Liu et al. (2017) proposed SVPG, which learns an ensemble of policies in a maximum-entropy RL framework (Ziebart, 2010):

$$\max_{q} \; \mathbb{E}_{\theta \sim q}\left[J(\theta)\right] + \alpha H(q) \qquad (1)$$

where $q$ is a distribution over policy parameters $\theta$, $J(\theta)$ is the expected return of the policy with parameters $\theta$, and the entropy $H(q)$ is controlled by the temperature parameter $\alpha$. SVPG uses Stein Variational Gradient Descent (Liu & Wang, 2016) to iteratively update an ensemble of policies or particles $\{\theta_i\}_{i=1}^{n}$ with the update rule:

$$\theta_i \leftarrow \theta_i + \frac{\epsilon}{n}\sum_{j=1}^{n}\left[\nabla_{\theta_j} \frac{J(\theta_j)}{\alpha}\, k(\theta_j, \theta_i) + \nabla_{\theta_j} k(\theta_j, \theta_i)\right] \qquad (2)$$

with step size $\epsilon$ and positive definite kernel $k(\cdot, \cdot)$. This update rule balances exploitation (the first term moves particles towards high-reward regions) and exploration (the second term repulses similar policies).
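To make the update in Eq. (2) concrete, here is a small NumPy sketch of one SVPG step over a set of particles, assuming the per-particle policy-gradient estimates have already been computed (e.g. with A2C). The function names, the fixed RBF bandwidth, and the default temperature are illustrative assumptions, not the reference implementation.

```python
import numpy as np

def rbf_kernel(thetas, bandwidth=1.0):
    """Return K[j, i] = k(theta_j, theta_i) and grad_K[j, i] = d k(theta_j, theta_i) / d theta_j."""
    diffs = thetas[:, None, :] - thetas[None, :, :]        # (n, n, d): theta_j - theta_i
    sq_dists = np.sum(diffs ** 2, axis=-1)                  # (n, n)
    K = np.exp(-sq_dists / (2.0 * bandwidth ** 2))
    grad_K = -diffs / (bandwidth ** 2) * K[..., None]       # (n, n, d)
    return K, grad_K

def svpg_step(thetas, grad_J, step_size=1e-3, alpha=10.0):
    """One update of Eq. (2). thetas: (n, d) particle parameters; grad_J: (n, d) policy gradients."""
    n = thetas.shape[0]
    K, grad_K = rbf_kernel(thetas)
    drive = K.T @ (grad_J / alpha)     # sum_j k(theta_j, theta_i) * grad_j J(theta_j) / alpha
    repulse = grad_K.sum(axis=0)       # sum_j grad_{theta_j} k(theta_j, theta_i): repulsion term
    return thetas + (step_size / n) * (drive + repulse)
```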

3 Method

Drawing analogies with the Bayesian Optimization (BO) literature, one can think of the randomization space $\Xi$ as a search space. We aim to look for points (environment instances) that maximize utility, i.e. provide the most improvement to our agent policy when used for training. Traditionally, in BO, the search for where to evaluate an objective is handled by acquisition functions, which trade off exploitation of the objective with exploration in the uncertain regions of the space (Brochu et al., 2010). However, unlike the stationary objectives seen in BO, training the agent policy makes our optimization nonstationary: the environment with the highest utility at time $t$ is likely not the same as the maximum-utility environment at time $t+1$. With this dynamic objective, we need to actively search the space for the most fruitful training environments given the current state of the agent policy.

3.1 Motivating Experiment

However, as this nonstationary search adds its own complexity, it is important to investigate whether uniform sampling across the entire space is actually detrimental to agent performance. Concretely, we begin by investigating the validity of the following claim: uniform sampling of environment parameters does not generate equally useful MDPs. To test this hypothesis, we use LunarLander-v2, where the agent's task is to ground a lander in a designated zone, and the reward is based on the quality of the landing (fuel used, impact velocity, etc.). Parameterized by an 8D state vector and actuated by a 2D continuous action space, LunarLander-v2 has one main axis of randomization that we vary: the main engine strength (MES).

We aim to determine if certain environment instances (different values of the MES) are more informative, i.e. more efficient than others in terms of aiding generalization. We set the total range of variation for the MES to be [8, 20] (the default is 13, and anything lower than 7.5 makes the environment unsolvable when all other physics parameters are held constant) and find through empirical tests that lower engine strengths generate MDPs that are harder to solve. Under this assumption, we show the effects of focused domain randomization by editing the ranges that the MES is uniformly sampled from.

We train multiple agents, with the only difference between them being the randomization ranges for MES. The randomization ranges define what types of environments the agent is exposed to during training. Figure 1 shows the final generalization performance of each agent by sweeping across the entire randomization range of [8, 20] and rolling out the policy in the generated environments. We see that focusing on harder MDPs improves generalization over uniformly sampling the whole space, even when the evaluation environment is outside of the training distribution.
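As an illustration of this evaluation protocol, the sketch below sweeps a trained policy across MES values and records the average return. The `set_main_engine_strength` helper is hypothetical (the engine-power constant must be patched inside the Box2D environment), the classic gym step API is assumed, and the episode count is arbitrary.

```python
import gym
import numpy as np

def evaluate_generalization(policy, set_main_engine_strength,
                            mes_values=np.linspace(8.0, 20.0, 25), episodes=10):
    """Average return of `policy` for each main engine strength (MES) value."""
    returns = {}
    for mes in mes_values:
        env = gym.make("LunarLander-v2")
        set_main_engine_strength(env, mes)   # hypothetical helper, not a gym API
        scores = []
        for _ in range(episodes):
            obs, done, total = env.reset(), False, 0.0
            while not done:
                obs, reward, done, _ = env.step(policy(obs))   # classic gym API
                total += reward
            scores.append(total)
        returns[float(mes)] = float(np.mean(scores))
    return returns
```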

3.2 Active Domain Randomization

The experiment in the previous section shows that preferential training on more informative environments provides tangible benefits, but in general, finding these environments is difficult because:

  1. It is rare that such intuitively hard MDP instances or parameter ranges are known beforehand.

  2. DR is used mostly when the space of randomized parameters is high-dimensional or uninterpretable.

An ideal randomization scheme would find the most informative environment instances in the randomization space, rather than uniformly sampling from the entire space. While seemingly just an instantiation of the traditional BO problem, the nonstationarity of the objective (the environment utility) requires us to redefine the notion of an acquisition function while simultaneously dealing with BO’s deficiencies with higher-dimensional inputs (Wang et al., 2013).

1:  Input: Ξ: randomization space, S: simulator, ξ_ref: reference parameters
2:  Initialize π_θ: agent policy, μ_φ: SVPG particles, D_ψ: discriminator, E_ref ← S(ξ_ref): reference environment
3:  while not done do
4:     for each particle μ_φ do
5:        ξ_i ← rollout(μ_φ)  // propose randomization configurations
6:     end for
7:     for each ξ_i do
8:        // Generate, rollout in randomized env.
9:        E_i ← S(ξ_i)
10:       τ_i ← rollout(π_θ, E_i), τ_ref ← rollout(π_θ, E_ref)
11:       T_rand ← T_rand ∪ {τ_i}
12:       T_ref ← T_ref ∪ {τ_ref}
13:       for each gradient step do
14:          // Agent policy update
15:          Update π_θ on τ_i with an off-policy update:
16:          θ ← θ + η∇_θ J(π_θ)
17:       end for
18:    end for
19:    // Calculate reward for each proposed environment
20:    for each τ_i ∈ T_rand do
21:       Calculate reward r_i with associated τ_i and τ_ref using Eq. (3)
22:    end for
23:    // Update randomization sampling strategy
24:    for each particle μ_φ do
25:       Update particle using the rewards r_i and Eq. (2)
26:    end for
27:    // Update discriminator
28:    for each gradient step do
29:       Update D_ψ with T_rand and T_ref using SGD.
30:    end for
31:  end while
Algorithm 1 Active Domain Randomization

To this end, we propose ADR, summarized in Algorithm 1 and Figure 3. ADR provides a framework for manipulating a more general analog of an acquisition function, selecting the most informative MDPs for the agent within the randomization space. By formulating the search as an RL problem, ADR learns a sampling policy $\mu_\phi$ whose states are the proposed randomization configurations $\xi$ and whose actions are continuous changes to those parameters.

We learn a discriminator-based reward for $\mu_\phi$, similar to the one originally proposed in Eysenbach et al. (2018):

$$r_D = \log D_\psi(y \mid \tau_i) \qquad (3)$$

where $y$ is a boolean variable denoting the discriminator's prediction of which type of environment (a randomized environment $E_i$ or the reference environment $E_{ref}$) the trajectory $\tau_i$ was generated from. We assume that the reference environment $E_{ref}$ is provided with the original task definition.

Figure 3: Overview of our proposed framework: ADR proposes randomized environments (c) or simulation instances from a simulator (b) and rolls out an agent policy (d) in those instances. The discriminator (e) learns a reward (f) as a proxy for environment difficulty by distinguishing between rollouts in the reference environment (a) and randomized instances, which is used to train Stein Variational Policy Gradient (SVPG) particles (g). Enforced through the SVPG formulation, the particles propose a diverse set of environment dynamics, and try to find the environment parameters (h) that are currently causing the agent the most difficulty.

Intuitively, we reward the sampling policy $\mu_\phi$ for finding regions of the randomization space that produce environment instances where the same agent policy acts differently than in the reference environment. The agent policy sees and trains only on the randomized environments (as it would in traditional DR), using the environment's task-specific reward for its updates. As the agent improves on the proposed, problematic environments, it becomes more difficult for the discriminator to tell whether any given state transition was generated from the reference or a randomized environment. Thus, ADR can find the parts of the randomization space where the agent is currently performing poorly, and can actively update its sampling strategy throughout the training process.
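A minimal PyTorch-style sketch of how the reward in Eq. (3) can be computed for a single randomized trajectory is shown below. It assumes `discriminator` is a network with a sigmoid output giving the probability that a transition came from a randomized environment, and it averages per-transition log-probabilities as described in Appendix E; names and shapes are illustrative.

```python
import torch

def trajectory_reward(discriminator, trajectory, eps=1e-8):
    """trajectory: tensor of per-transition features, shape (T, feature_dim).
    Returns the ADR reward: mean log D(randomized | transition) over the trajectory."""
    with torch.no_grad():
        p_rand = discriminator(trajectory)          # (T, 1) sigmoid outputs in (0, 1)
        reward = torch.log(p_rand + eps).mean()     # log D_psi(y | tau), averaged over steps
    return reward.item()
```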

3.3 Architecture Walkthrough

In this section, we walk through the diagram shown in Figure 3. All line references refer to Algorithm 1.

3.3.1 SVPG Sampler

To encourage sufficient exploration in high-dimensional randomization spaces, we parameterize $\mu_\phi$ with SVPG. Since each particle proposes its own environment settings (lines 4-6, Fig. 3h), all of which are passed to the agent for training, the agent policy benefits from the same environment variety seen in UDR. However, unlike UDR, $\mu_\phi$ can use the learned reward to focus on problematic MDP instances while still being efficiently parallelizable.

3.3.2 Simulator

After receiving each particle's proposed parameter settings $\xi_i$, we generate the corresponding randomized environments $E_i = S(\xi_i)$ (line 9, Fig. 3b).
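The sketch below shows one simple way this mapping can look: each particle proposal is rescaled from a normalized box onto the closed intervals of the randomization space and handed to a simulator factory. The [0, 1] normalization and the `make_env` factory are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def generate_randomized_env(xi, bounds, make_env):
    """xi: particle proposal, assumed to lie in [0, 1]^d;
    bounds: list of (low, high) intervals, one per randomization parameter."""
    lows = np.array([lo for lo, _ in bounds])
    highs = np.array([hi for _, hi in bounds])
    params = lows + np.clip(xi, 0.0, 1.0) * (highs - lows)   # map to physical ranges
    return make_env(params)                                   # E_i = S(xi_i)
```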

3.3.3 Generating Trajectories

We proceed to train the agent policy $\pi_\theta$ on the randomized instances $E_i$, just as in UDR. We roll out $\pi_\theta$ on each randomized instance and store each trajectory $\tau_i$. For every randomized trajectory generated, we use the same policy to collect and store a reference trajectory $\tau_{ref}$ by rolling it out in the default environment $E_{ref}$ (lines 10-12, Fig. 3a, c). We store all trajectories (lines 11-12) because they are later used to score each parameter setting and to update the discriminator.

The agent policy is a black box: although in our experiments we train with Deep Deterministic Policy Gradients (DDPG) (Lillicrap et al., 2015), the policy can be trained with any other on-policy or off-policy algorithm by introducing only minor changes to Algorithm 1 (lines 13-17, Fig. 3d).

3.3.4 Scoring Environments

We now generate a score for each environment (lines 20-22) by passing each stored randomized trajectory $\tau_i$ through the discriminator $D_\psi$, which predicts the type of environment (reference or randomized) each trajectory was generated from. We use this score as a reward to update each SVPG particle using Equation (2) (lines 24-26, Fig. 3f).

After scoring each $\xi_i$ according to Equation (3), we use the randomized and reference trajectories to train the discriminator $D_\psi$ (lines 28-30, Fig. 3e).
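A compact sketch of this discriminator update (lines 28-30 of Algorithm 1) is given below: transitions from randomized rollouts are labelled 1 and transitions from reference rollouts 0, and the network is trained with a binary cross-entropy loss as described in Appendix E. The PyTorch usage is illustrative, not the released code.

```python
import torch
import torch.nn.functional as F

def discriminator_step(discriminator, optimizer, rand_batch, ref_batch):
    """rand_batch, ref_batch: tensors of per-transition features, shape (N, feature_dim)."""
    inputs = torch.cat([rand_batch, ref_batch], dim=0)
    labels = torch.cat([torch.ones(rand_batch.shape[0], 1),
                        torch.zeros(ref_batch.shape[0], 1)], dim=0)
    preds = discriminator(inputs)                  # sigmoid outputs in (0, 1)
    loss = F.binary_cross_entropy(preds, labels)   # randomized vs. reference classification
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```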

4 Results

4.1 Experiment Details

To test ADR, we experiment on OpenAI Gym environments (Brockman et al., 2016) across various tasks, both simulated and real.

  • LunarLander-v2 (https://gym.openai.com/envs/LunarLander-v2/), a 2 degrees of freedom (DoF) environment where the agent has to softly land a spacecraft, implemented in Box2D (detailed in Section 3.2);

  • Pusher-3DOF-v0 (originally developed for Haarnoja et al. (2018)), a 3 DoF arm that has to push a puck to a target, implemented in MuJoCo (Todorov et al., 2012); and

  • ErgoReacher-v0 (originally developed for Golemo et al. (2018)), a 4 DoF arm which has to touch a goal with its end effector, implemented in the Bullet Physics Engine (Coumans, 2015). For sim2real experiments, we recreate this environment setup on a real Poppy Ergo Jr. robot (Lapeyre, 2014), shown in Fig. 2.

In addition to the randomization of the main engine strength in LunarLander-v2, both other environments randomize various physics parameters that change the environments drastically. Pusher-3DOF-v0 has two axes of randomization that make the puck slide more or less when pushed. ErgoReacher-v0 randomizes the max torque and gain for each degree of freedom (joint), for a total of eight randomization parameters. We provide a detailed account of the randomizations used in Table 1 in Appendix C.

All simulated experiments are run with five seeds, each with five random resets, totaling 25 independent trials per evaluation point. All experimental results are plotted as means with one standard deviation shown. Detailed experiment information can be found in Appendix E.

4.2 Toy Experiments

To investigate whether ADR’s learned sampling strategy provides a tangible benefit for agent generalization, we start by comparing it against traditional DR (labeled as UDR) on LunarLander-v2 and vary only the main engine strength (MES). In Figure 1, we see that ADR approaches expert-levels of generalization whereas UDR fails to generalize on lower MES ranges.

From Figure 4, we see that ADR solves the reference environment (MES = 13) more consistently than UDR, never dipping below the Solved line once that level of performance is reached. Figure 4 also shows that ADR is the only agent, compared to both the baseline (trained only on an MES of 13) and the UDR agent (trained on environments with MES uniformly sampled from [8, 20]), that makes significant progress on the hard, low-MES environment instances.

Figure 5 explains the adaptability of ADR by showing generalization and sampling distribution at various stages of training. ADR starts by sampling approximately uniformly for the first 650K steps, but then finds a deficiency in the policy on higher ranges of the MES. As those areas become more frequently sampled between 650K-800K steps, the agent learns to solve all of the higher-MES environments, as shown by the generalization curve for 800K steps. As a result, the discriminator is no longer able to differentiate reference and randomized trajectories from the higher MES regions, and starts to reward environment instances generated in the lower end of the MES range, which improves generalization towards the completion of training.

Figure 4: Learning curves over time in LunarLander. Higher is better. (a) Performance on the default environment settings; (b) Performance on particularly difficult settings - our approach outperforms both the policy trained on a single simulator instance ("baseline") and the UDR approach.
Figure 5: Agent generalization, expressed as performance across different engine strength settings in LunarLander. (a) Change in performance during training; (b) Change in dynamics sampling during training. As training proceeds, ADR begins preferentially sampling the more challenging environmental instances.

4.3 Randomization in High Dimensions

If the intuitions that drive ADR are correct, we should see an increased benefit from a learned sampling strategy as the dimensionality of the randomization space grows, due to the increasing sparsity of informative environments when sampling uniformly. We first explore ADR's performance on Pusher3DOF-v0, an environment with a two-dimensional randomization space. Both randomization dimensions (puck damping, puck friction loss) affect whether or not the puck retains momentum and continues to slide after making contact with the agent's end effector. Lowering the values of these parameters simultaneously creates an intuitively harder environment, where the puck continues to slide after being hit. In the reference environment, the puck retains no momentum and must be continuously pushed in order to move. We qualitatively visualize the effect of these parameters on puck sliding in Figure 6.

Figure 6: Sampling behavior of ADR in Pusher3Dof. The environment dynamics are characterized by the friction and damping of the sliding puck. We identify dynamics settings which exhibit easier or harder-to-learn puck behavior (highlighted in cyan, purple, and pink, from easy to hard). (a) During training, the algorithm only had access to a limited, easier range of dynamics (black outline). We observe that our approach converges to the hardest settings within this limited range. (b) Performance measured by distance to target, lower is better. The results show the higher performance and lower variance of our approach, save for one exception.

From Figure 6, we see ADR's improved robustness to extrapolation, i.e. when the target domain lies outside the training region. We train two agents, one using ADR and one using UDR, and show them only the training regions encapsulated by the dark, outlined box in the top right of Figure 6. Qualitatively, only 25% of the training environments have dynamics which cause the puck to slide, and these are the hardest environments to solve in the training region. We see from the sampling histogram overlaid on Figure 6 that ADR prioritizes the single, harder purple region more than the light blue regions, allowing for better generalization to the unseen test domains, as shown in Figure 6. ADR outperforms UDR in all but one test region and produces policies with less variance than their UDR counterparts.

Figure 7: Learning curves over time in Pusher3Dof. Lower is better. (a) Performance on the default environment settings; (b) Performance on particularly difficult settings - our approach outperforms the policy trained with the UDR approach in terms of both performance and variance.

From the two panels of Figure 7, which show learning curves for UDR and ADR on the reference environment and on the hard environment (the pink square in Figure 6) respectively, we observe an interesting phenomenon: not only does ADR solve both environments more consistently (i.e. it does not rise back above the Solved line), but UDR also unlearns the good behaviors it acquired at the beginning of training. When training neural networks in both supervised and reinforcement learning settings, this phenomenon has been dubbed catastrophic forgetting (Kirkpatrick et al., 2016). ADR exhibits this slightly (leading to "hills" in the curve), but due to the adaptive nature of the algorithm, it is able to adjust quickly and retain better performance across all environments.

4.4 Randomization in Uninterpretable Dimensions

We further show the significance of ADR over UDR on ErgoReacher-v0, where the randomization space is eight-dimensional. It is now impossible to intuit which environments are hard due to the complex interactions between the eight randomization parameters (gains and maximum torques for each joint). For demonstration purposes, we test extrapolation by creating a held-out target environment with extremely low values for torque and gain, which cause certain states in the environment to lead to catastrophic failure: gravity pulls the robot end effector down, and the robot is not strong enough to pull itself back up. We show an example of an agent getting trapped in a catastrophic failure state in Figure 12, Appendix C.1.

To generalize effectively, the sampling policy should prioritize environments with lower torque and gain values in order for the agent to operate in such states precisely. However, since the hard evaluation environment is not seen during training, ADR must learn to prioritize the hardest environments that it can see, while still learning behaviors that can operate well across the entire training region.

In Figure 8, we see that when evaluated on the reference environment, the policy learned using ADR has much lower variance than the one learned using UDR and can solve the environment much more effectively. In addition, it generalizes better to the unseen target domain, as shown in Figure 8, again with much less variance in the learned agent policy.

Figure 8: Learning curves over time in ErgoReacher. Lower is better. (a) Performance on the default environment settings; (b) Performance on particularly difficult settings - our approach outperforms the policy trained with the UDR approach in terms of both performance and variance.

UDR’s high variance on ErgoReacher-v0 is indicative of some of its issues: by continuously training on a random mix of hard and easy MDP instances, both beneficial and detrimental agent behaviors can be learned and unlearned throughout training. As shown in ErgoReacher-v0, this mixing can lead to high-variance, inconsistent, and unpredictable behavior upon transfer. By focusing on those harder environments and allowing the definition of hard to adapt over time, ADR shows more consistent performance and better overall generalization than UDR in all environments tested.

4.5 Sim2Real Transfer Experiments

In this section, we present results of simulation-trained policies transferred zero-shot onto the real Poppy Ergo Jr. robot.

In sim2real (simulation to reality) transfer, many policies fail due to unmodeled dynamics within the simulators, as policies may have overfit to or exploited simulation-specific details of their training environments. While the deficiencies and high variance of UDR are clear even in simulated environments, one of the most impressive results of domain randomization was zero-shot transfer out of simulation onto robots. However, we find that the same issues of unpredictable performance apply to UDR-trained policies in the real world as well.

We take each method's (ADR and UDR) five independent simulation-trained policies from Section 4.4 and transfer them without fine-tuning onto the real robot. We roll out only the final policy on the robot, and show performance in Figure 9. To evaluate generalization, we alter the robot by changing the values of the torques (higher torque means the arm moves at higher speed and accelerates faster), and evaluate each of the policies with 25 random goals (125 independent evaluations per torque setting). As shown in Figure 9, ADR policies obtain performance better than or similar to UDR policies trained in the same conditions. More importantly, ADR policies are more consistent and display a lower spread across all environments, which is crucial when safely evaluating reinforcement learning policies on real-world robots.

Figure 9: Results of the ErgoReacher policies evaluated on the physical robot over various torque settings, measured by final distance to the target. Lower is better. Our approach performs equal to or better than UDR, while the spread of ADR is consistently smaller than that of UDR. A smaller spread points to more consistent performance, which is important when considering the potentially dangerous transfer onto real-world robots.

4.6 Interpretability

One of the secondary benefits of ADR is the insight it provides into incompatibilities between the task and the randomization ranges. We demonstrate the simple effects of this phenomenon in a one-dimensional LunarLander-v2, where we only randomize the main engine strength. Our initial experiments varied this parameter between 6 and 20, which led to ADR learning degenerate agent policies by proposing the lopsided blue distribution in Figure 10. Upon inspection of the simulation, we see that when the parameter has a value of less than approximately 8, the task becomes almost impossible to solve due to the other environment factors (in this case, the lander always hits the ground too fast, which it is penalized for).

After adjusting the parameter ranges to more sensible values, we see a better-behaved sampling distribution in pink, which still gives more preference to the harder environments in the lower engine strength range. Most importantly, ADR allows for analysis that is both focused (we know exactly which part of the simulation is causing trouble) and pre-transfer, i.e. done before a more expensive experiment such as a real-robot transfer has taken place. With UDR, agents would be trained just as often on these degenerate environments, leading to policies with potentially undefined behavior in these truly out-of-distribution simulations (or, as seen in Section 4.4, to policies that unlearn good behaviors).

Figure 10: Sampling frequency across engine strengths when varying the randomization ranges. The updated, red distribution shows a much milder unevenness in the distribution, while still learning to focus on the harder instances. This can be used for debugging the randomization ranges before transferring a learned policy onto a physical system.

5 Related Work

5.1 Dynamic and Adversarial Simulators

Simulators have played a crucial role in transferring learned policies onto real robots, and many different strategies have been proposed. Randomizing simulation parameters for better generalization or transfer performance is a well-established idea in evolutionary robotics (Zagal et al., 2004; Bongard & Lipson, 2004), but it has recently emerged as an effective way to perform zero-shot transfer of deep reinforcement learning policies on difficult tasks (Andrychowicz et al., 2018; Tobin et al., 2017; Peng et al., 2018; Sadeghi & Levine, 2016).

Learnable simulations are also an effective way to adapt a simulation to a particular target environment. Chebotar et al. (2018) and Ruiz et al. (2018) use RL to learn the parameters of a simulation that accurately describes the target domain, enabling effective transfer, but they require the target domain for reward calculation, which can lead to overfitting. In contrast, our approach requires no target domain, but rather only a reference domain (the default simulation parameters) and a general range for each parameter. ADR encourages diversity, and as a result gives the agent a wider variety of experience. In addition, unlike Chebotar et al. (2018), our method does not require carefully-tuned (co-)variances or task-specific cost functions. Concurrently, Khirodkar et al. (2018) also showed the advantages of learning adversarial simulations and the disadvantages of purely uniform randomization distributions in object detection tasks.

To improve policy robustness, Robust Adversarial Reinforcement Learning (RARL) (Pinto et al., 2017) jointly trains both an agent and an adversary that applies environment forces to disrupt the agent's task progress. ADR removes the zero-sum game dynamics, which have been known to decrease training stability (Mescheder et al., 2018). More importantly, our method's final outputs, the SVPG-based sampling strategy and the discriminator, are reusable and can be used to train new agents (as shown in Appendix A), whereas a trained RARL adversary would overpower any new agent and impede learning progress.

5.2 Active Learning and Informative Samples

Active learning methods in supervised learning try to construct a representative, sometimes time-varying, dataset from a large pool of unlabelled data by proposing elements to be labelled. The chosen samples are labelled by an oracle and sent back to the model for use. Similarly, ADR searches for the environments that may be most useful to the agent at any given time. Active learners, like the BO methods discussed in Section 3, often require an acquisition function (derived from a notion of model uncertainty) to choose the next sample. Since ADR handles this decision through the explore-exploit framework of RL and the temperature $\alpha$ in SVPG, it sidesteps the well-known scalability issues of both active learning and BO (Tong, 2001).

Recently, Toneva et al. (2018) showed that certain examples in popular computer vision datasets are harder for networks to learn, and that some examples generalize (or are forgotten) much more quickly than others. We explore the same phenomenon in the space of MDPs defined by our randomization ranges, and try to find the "examples" that cause our agent the most trouble. Unlike the active learning setting or Toneva et al. (2018), we have no oracle or supervisory loss signal in RL, and instead attempt to learn a proxy signal for ADR via a discriminator.

5.3 Generalization in Reinforcement Learning

Generalization in RL has long been one of the holy grails of the field, and recent work like Packer et al. (2018), Cobbe et al. (2018), and Farebrother et al. (2018) highlight the tendency of deep RL policies to overfit to details of the training environment. Our experiments exhibit the same phenomena, but our method improves upon the state of the art by explicitly searching for and varying the environment aspects that our agent policy may have overfit to. We find that our agents, when trained more frequently on these problematic samples, show better generalization while also improving interpretability in both the randomization ranges’ and agent policy’s weaknesses.

6 Conclusion

In this work, we highlight failure cases of traditional domain randomization, and propose active domain randomization (ADR), a general method capable of finding the most informative parts of the randomization parameter space for a reinforcement learning agent to train on. ADR does this by posing the search as a reinforcement learning problem, and optimizes for the most informative environments using a learned reward and multiple policies. We show on a wide variety of simulated environments that this method efficiently trains agents with better generalization than traditional domain randomization, extends well to high dimensional parameter spaces, and produces more robust policies when transferring to the real world.

Acknowledgements

The authors gratefully acknowledge the Natural Sciences and Engineering Research Council of Canada (NSERC), the Fonds de Recherche Nature et Technologies Quebec (FQRNT) and the Open Philanthropy Project for supporting this work. In addition, the authors would like to thank Kyle Kastner and members of the REAL Lab for their helpful comments.

References

Appendix A Bootstrapping Training of New Agents

Unlike DR, ADR's learned sampling strategy and discriminator can be reused to train new agents from scratch. To test the transferability of the sampling strategy, we first train an instance of ADR on LunarLander-v2 and then extract the SVPG particles and the discriminator. We then replace the agent policy with a randomly initialized network and once again train according to the details in Section 4.1. From Figure 11, it can be seen that the bootstrapped agent's generalization is even better than that of the agent learned with ADR from scratch. However, its training speed on the default environment (MES = 13) is relatively slower.

Figure 11: Generalization and default environment learning progression on LunarLander-v2 when using ADR to bootstrap a new policy. Higher is better.

Appendix B Uniform Domain Randomization

Here we review the algorithm for Uniform Domain Randomization (UDR), first proposed in (Tobin et al., 2017), shown in Algorithm 2.

1:  Input: Ξ: randomization space, S: simulator
2:  Initialize π_θ: agent policy
3:  for each episode do
4:     // Uniformly sample parameters
5:     for each randomization parameter i do
6:        ξ^(i) ∼ U[ξ^(i)_low, ξ^(i)_high]
7:     end for
8:     // Generate, rollout in randomized env.
9:     E ← S(ξ)
10:    τ ← rollout(π_θ, E)
11:    T ← T ∪ {τ}
12:    for each gradient step do
13:       // Agent policy update
14:       Update π_θ on τ with an off-policy update:
15:       θ ← θ + η∇_θ J(π_θ)
16:    end for
17:  end for
Algorithm 2 Uniform Sampling Domain Randomization

Appendix C Environment Details

Please see Table 1.

C.1 Catastrophic Failure States in ErgoReacher

In Figure 12, we show an example progression to a catastrophic failure state in the held-out, simulated target environment of ErgoReacher-v0, with extremely low torque and gain values.

Figure 12: An example progression (left to right) of an agent moving to a catastrophic failure state (Panel 4) in the hard ErgoReacher-v0 environment.
Environment | # Randomized Parameters | Types of Randomizations | Train Ranges | Test Ranges
LunarLander-v2 | 1 | Main Engine Strength | |
Pusher-3DOF-v0 | 2 | Puck Friction Loss & Puck Joint Damping | default | default
ErgoReacher-v0 | 8 | Joint Damping | default | default
 | | Joint Max Torque | default | default
Table 1: We summarize the environments used, as well as characteristics of the randomizations performed in each environment.

Appendix D Untruncated Plots for Lunar Lander

Figure 13: Generalization on LunarLander-v2 for an expert interval selection, ADR, and UDR. Higher is better.

All policies on Lunar Lander described in our paper receive a Solved score when the engine strengths are above 12, which is why truncated plots are shown in the main document. For clarity, we show the full, untruncated plot in Figure 13.

Appendix E Network Architectures and Experimental Hyperparameters

All experiments can be reproduced using our GitHub repository (https://github.com/montrealrobotics/active-domainrand).

All of our experiments use the same network architectures and experiment hyperparameters, except for the number of SVPG particles, which differs between LunarLander-v2 and the two robot environments. All other hyperparameters and network architectures remain constant, and we detail them below. All networks use the Adam optimizer (Kingma & Ba, 2014).

We run Algorithm 1 until 1 million agent timesteps are reached, i.e. the agent policy takes 1M steps in the randomized environments. We also cap each episode at a maximum number of timesteps according to the documentation associated with Brockman et al. (2016). In particular, LunarLander-v2 has an episode time limit of 1000 environment timesteps, whereas both Pusher-3DOF-v0 and ErgoReacher-v0 use an episode time limit of 100 timesteps.

For our agent policy, we use an implementation of DDPG (in particular, OurDDPG.py) from the GitHub repository associated with Fujimoto et al. (2018). The actor and critic both have two hidden layers of 400 and 300 neurons, respectively, and use ReLU activations. Our discriminator-based rewarder is a two-layer neural network with 128 neurons per layer. The hidden layers use tanh activations, and the network outputs a sigmoid for prediction.

The SVPG particles are parameterized by a two-layer actor-critic architecture, with 100 neurons in each layer of both networks. We use Advantage Actor-Critic (A2C) to calculate unbiased, low-variance gradient estimates. All of the hidden layers use tanh activations and are orthogonally initialized. The particles operate on a continuous, bounded randomization space; we set the maximum step length to 0.05, and every 50 timesteps we reset each particle and randomly re-initialize its state from a uniform distribution over the randomization space. We use a temperature-controlled update with an RBF kernel (with a median-based bandwidth heuristic) as described in Liu et al. (2017), and an A2C policy gradient estimator (Mnih et al., 2016), although both the kernel and the estimator could be substituted with alternative methods (Gangwani et al., 2019). To ensure diversity of environments throughout training, we always roll out the SVPG particles using non-deterministic samples.

For DDPG, we use a batch size of 1000. We let the policy run for 1000 steps before any updates, and clip the actions of the actor to the range prescribed by each environment.

Our discriminator-based reward generator is a network with two 128-neuron layers, trained with a binary cross-entropy loss (i.e. is this a randomized or a reference trajectory?). To calculate the reward for a trajectory from any environment, we split the trajectory into its constituent transitions, pass each tuple through the discriminator, and average the outputs, which is then set as the reward for the trajectory. Our batch size is 128 and, most importantly, as done in Eysenbach et al. (2018), we calculate the reward for examples before using those same examples to train the discriminator.