Robust Domain Randomization for Reinforcement Learning

by Reda Bahi Slaoui, et al.

Producing agents that can generalize to a wide range of environments is a significant challenge in reinforcement learning. One method for overcoming this issue is domain randomization, whereby at the start of each training episode some parameters of the environment are randomized so that the agent is exposed to many possible variations. However, domain randomization is highly inefficient and may lead to policies with high variance across domains. In this work, we formalize the domain randomization problem, and show that minimizing the policy's Lipschitz constant with respect to the randomization parameters leads to low variance in the learned policies. We propose a method where the agent only needs to be trained on one variation of the environment, and its learned state representations are regularized during training to minimize this constant. We conduct experiments that demonstrate that our technique leads to more efficient and robust learning than standard domain randomization, while achieving equal generalization scores.






1 Introduction

Deep Reinforcement Learning (RL) has proven very successful on complex high-dimensional problems ranging from games like Go (Silver et al., 2017) and Atari games (Mnih et al., 2015) to robot control tasks (Levine et al., 2016). However, one prominent issue is that of overfitting, illustrated in figure 1: agents trained on one domain fail to generalize to other domains that differ only in small ways from the original domain (Sutton, 1996; Cobbe et al., 2018; Zhang et al., 2018b; Packer et al., 2018; Zhang et al., 2018a; Witty et al., 2018; Farebrother et al., 2018). Good generalization is essential for problems such as robotics and autonomous vehicles, where the agent is often trained in a simulator and is then deployed in the real world where novel conditions will certainly be encountered. Transfer from such simulated training environments to the real world is known as crossing the reality gap, and is well known to be difficult, thus providing an important motivation for studying generalization.

To close the reality gap in reinforcement learning, prior work has studied both domain adaptation and domain randomization. Domain adaptation techniques aim to update the data distribution in simulation to match the real distribution through some form of canonical mapping or using regularization methods (James et al., 2018; Bousmalis et al., 2017; Gamrian and Goldberg, 2018). Alternatively, domain randomization (DR), in which the visual and physical properties of the training domains are randomized at the start of each episode during training, has also been shown to lead to improved generalization and transfer to the real world (Tobin et al., 2017; Sadeghi and Levine, 2016; Antonova et al., 2017; Peng et al., 2017; Mordatch et al., 2015; Rajeswaran et al., 2016; OpenAI, 2018). Domain randomization relies on the expectation that the agent will perceive the difference between the train and test domains as just another variation of the train domain. However, domain randomization has been empirically shown to often lead to suboptimal policies with high variance in performance over different randomizations (Mehta et al., 2019). This issue can cause the learned policy to underperform in any given target domain.

We propose a method for learning policies that are robust to changes in the randomization space, producing agents that ignore irrelevant aspects of the environment. Our work combines aspects from both domain adaptation and domain randomization, in that we maintain the notion of randomized environments but use a regularization method to achieve good generalization over the randomization space. Our contributions are the following:

  • We formalize the domain randomization problem, and show that the Lipschitz constant of the agent’s policy over the randomization parameters provides an upper bound on the agent’s robustness to variations in the environment.

  • We propose a method whereby the agent is only trained on one variation of the environment but its learned representations are regularized so that the Lipschitz constant is minimized.

  • We experimentally show that our method is more efficient and leads to lower-variance policies than standard domain randomization, while achieving equal or better returns and generalization ability.

This paper is structured as follows. We first review related work, formalize the domain randomization problem, and present our theoretical contributions. We then describe our regularization method, and illustrate its application to a toy gridworld problem. Finally, we compare our method with standard domain randomization in complex visual environments.

Figure 1: Illustration of the generalization challenge in reinforcement learning. In this visual cartpole domain, the agent must learn to keep the pole upright. However, changes in the background color can completely throw off a trained agent.

2 Related Work

2.1 Generalization in Deep Reinforcement Learning

Generalization to novel samples is well studied in supervised learning, where evaluating generalization through train/test splits is ubiquitous. However, evaluating for generalization to novel conditions through such train/test splits is not common practice in deep RL.

Zhang et al. (2018b) study overfitting of deep RL in discrete maze tasks. Testing environments are generated with the same maze configuration but different initial positions from training. Deep RL algorithms are shown to suffer from overfitting to training configurations and to memorize training scenarios. Packer et al. (2018) study performance under train-test domain shift by modifying environmental parameters such as robot mass and length to generate new domains. Farebrother et al. (2018) propose using different game modes of Atari 2600 games to measure generalization. They turn to supervised learning for inspiration, finding that both L2 regularization and dropout can help agents learn more generalizable features. These works all show that standard deep RL algorithms tend to overfit to the environment used during training, hence the urgent need for designing agents that can generalize better. Domain randomized training has been shown to be a promising way of addressing this challenge.

2.2 Domain Randomization

We distinguish between two types of domain randomization: visual randomization, in which the variability between domains should not affect the agent’s policy, and dynamics randomization, in which the agent should learn to adjust its behavior to achieve its goal. Visual domain randomization has been successfully used to directly transfer RL agents from simulation to the real world without requiring any real images (Tobin et al., 2017; Sadeghi and Levine, 2016; Kang et al., 2019). These approaches used low fidelity rendering and randomized scene properties such as lighting, textures, camera position, and colors, which led to improved generalization. Dynamics randomization has been successfully used to develop agents that are more robust to uncertainty in the system’s dynamics (Antonova et al., 2017; Peng et al., 2017; Mordatch et al., 2015; Rajeswaran et al., 2016; OpenAI, 2018). In this paper, we focus on the visual domain randomization setting.

The work most closely related to our proposed method combines domain randomization and domain adaptation techniques (James et al., 2018; Chebotar et al., 2018; Gamrian and Goldberg, 2018). The main idea of these approaches is to both randomize the simulated environment and penalize the gap between the trajectories in the simulations and the real world, either by adding a term to the loss, or by learning a mapping between the states of the simulation and the real world. This approach requires a large number of samples of real world trajectories, which can be expensive to collect.

Prior work has, however, noted the inefficiency of domain randomization. Mehta et al. (2019) show that domain randomization may lead to suboptimal policies that vary a lot between domains, due to uniform sampling of the environment’s parameters. They propose a method to guide domain randomization by predicting the most informative environment variations within the given randomization ranges. Zakharov et al. (2019) also guide the domain randomization procedure by training a DeceptionNet that learns which randomizations are actually useful to bridge the domain gap for image classification tasks.

2.3 Learning Domain-Invariant Features and Domain Adaptation

Learning domain-invariant features has emerged as a promising approach for taking advantage of the commonalities between domains. This is usually done by minimizing some measure of distance between the source and target domains. In the semi-supervised learning context, there is a large body of work relating to learning domain-invariant features. For instance, Bachman et al. (2014); Sajjadi et al. (2016); Coors et al. (2018); Miyato et al. (2018); Xie et al. (2019) enforce that predictions of their networks be similar for original and augmented data points, with the objective of reducing the required amount of labelled data for training. Our work extends such methods to reinforcement learning.

In the reinforcement learning context, several other papers have also explored this topic. Tzeng et al. (2015) and Gupta et al. (2017) add constraints to encourage networks to learn similar embeddings for samples from both a simulated and a target domain. Daftry et al. (2016) apply a similar approach to transfer policies for controlling aerial vehicles to different environments. Bousmalis et al. (2017) compare different domain adaptation methods in a robot grasping task, and show that they improve generalization. Wulfmeier et al. (2017) use an adversarial loss to train RL agents in such a way that similar policies are learned in both a simulated domain and the target domain. While promising, these methods are designed for cases when simulated and target domains are both known, and cannot straightforwardly be applied when the target domain is only known to be within a distribution of domains.

3 Problem Formulation

We consider Markov decision processes (MDP) defined by $(\mathcal{S}, \mathcal{A}, R, T, \gamma)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ the action space, $R$ the reward function, $T$ the transition dynamics, and $\gamma$ the discount factor. In reinforcement learning, an agent’s objective is to find a policy $\pi$ that maps states to distributions over actions such that the cumulative discounted reward yielded by its interactions with the environment is maximized.

3.1 Domain Randomization

Domain randomization requires a set of $N$ simulation parameters to randomize, and a randomization space $\Xi \subset \mathbb{R}^N$ from which these parameters are sampled. When a configuration $\xi \in \Xi$ is passed to a simulator, it generates a new MDP with a potentially new set of states and transitions indexed by $\xi$. At the start of each episode, the parameters are sampled from the chosen randomization space, and the generated MDP is used to train the agent during this episode. Denoting by $J(\pi, \xi)$ the cumulative returns of a policy $\pi$ in the MDP generated by $\xi$, the goal is thus to solve the optimization problem defined by $\max_{\pi} \mathbb{E}_{\xi \sim \Xi}[J(\pi, \xi)]$. When the randomization parameters affect the transitions in the MDP, we refer to this as dynamics randomization. When the randomization parameters affect only the states, we refer to this as visual randomization.
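The sampling procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the box-shaped uniform randomization space and the `make_env` and `run_episode` callables are hypothetical stand-ins for a simulator and a training step.

```python
import random


def sample_parameters(space, rng):
    """Draw one configuration uniformly from a box-shaped randomization space.

    `space` is a list of (low, high) ranges, one per randomized parameter
    (an assumption made for this sketch).
    """
    return tuple(rng.uniform(low, high) for low, high in space)


def train_with_domain_randomization(make_env, run_episode, space, n_episodes, seed=0):
    """Resample the environment parameters at the start of every episode."""
    rng = random.Random(seed)
    returns = []
    for _ in range(n_episodes):
        xi = sample_parameters(space, rng)  # new configuration for this episode
        env = make_env(xi)                  # simulator builds the MDP indexed by xi
        returns.append(run_episode(env))    # agent interacts with this MDP only
    # Monte Carlo estimate of the expected return over the randomization space
    return sum(returns) / len(returns)


# Toy usage: the "environment" is just its parameters, and the episode
# return is a made-up function of them, purely for illustration.
rgb_space = [(0.0, 1.0)] * 3
estimate = train_with_domain_randomization(
    make_env=lambda xi: xi,
    run_episode=lambda env: 1.0 - max(env),
    space=rgb_space,
    n_episodes=100,
)
print(estimate)
```

The average returned by the loop is a sample estimate of the expected return that domain randomization seeks to maximize.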

Domain randomization empirically produces policies with strongly varying performance over different regions of the randomization space, as demonstrated by Mehta et al. (2019) for the case of dynamics randomization. Our own experiments, which we discuss later in this paper, also corroborate this observation for visual randomization. This high variance can cause the learned policy to underperform in any given target domain.

To yield insight into the robustness of policies learned by domain randomization, we start by formalizing the notion of a randomized MDP. Although domain randomization may perturb any element of the underlying MDP such as rewards or transitions, in this work we only consider the case where we modify the state space $\mathcal{S}$. This is often used to close the visual gap between simulation and reality, for example by randomizing colors or textures during the training in simulation. Contrary to dynamics randomization, such randomizations do not change the transition or reward functions of the underlying MDP.

Definition 1

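Let $M = (\mathcal{S}, \mathcal{A}, R, T, \gamma)$ be an MDP. A randomizer function of $M$ is a mapping $\phi : \mathcal{S} \to \mathcal{S}'$ where $\mathcal{S}'$ is a new set of states. The Randomized MDP $M_\phi = (\mathcal{S}_\phi, \mathcal{A}_\phi, R_\phi, T_\phi, \gamma_\phi)$ is defined as, for $s, s' \in \mathcal{S}$, $a \in \mathcal{A}$:

$$\mathcal{S}_\phi = \phi(\mathcal{S}), \quad \mathcal{A}_\phi = \mathcal{A}, \quad R_\phi(\phi(s), a) = R(s, a), \quad T_\phi(\phi(s') \mid \phi(s), a) = T(s' \mid s, a), \quad \gamma_\phi = \gamma.$$

Given a policy $\pi$ on MDP $M$ and a randomization $M_\phi$, we also define the agent’s policy on $M_\phi$ as $\pi_\phi(\cdot \mid s) = \pi(\cdot \mid \phi(s))$.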
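As a concrete reading of this definition, the sketch below represents a randomizer as a function that tags each underlying state with a randomization parameter, and builds the induced policy by composing the agent's policy with that function. The names `make_randomizer`, `policy_on_randomized_mdp`, and `toy_policy` are hypothetical, and the toy policy is deliberately sensitive to the parameter in order to show how behavior can differ between randomized MDPs that share rewards and transitions.

```python
def make_randomizer(xi):
    """Return a randomizer that tags each underlying state with the parameter xi."""
    def phi(state):
        return (state, xi)
    return phi


def policy_on_randomized_mdp(policy, phi):
    """Induced policy on the randomized MDP: the agent acts on phi(s) instead of s."""
    def pi_phi(state):
        return policy(phi(state))
    return pi_phi


def toy_policy(observation):
    """Hypothetical policy over actions 'up'/'right' that (undesirably) reacts to xi."""
    _, xi = observation
    p_right = 0.5 + 0.1 * xi
    return {"up": 1.0 - p_right, "right": p_right}


pi_a = policy_on_randomized_mdp(toy_policy, make_randomizer(0.0))
pi_b = policy_on_randomized_mdp(toy_policy, make_randomizer(1.0))
print(pi_a((0, 0)))  # action distribution in the first randomized MDP
print(pi_b((0, 0)))  # same underlying state, different action distribution
```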

Despite all randomized MDPs sharing the same underlying rewards and transitions, the agent’s policy can vary between domains. For example, in policy-based algorithms (Williams, 1992), if there are several optimal policies then the agent may adopt different policies for different $\phi$. Furthermore, for value-based algorithms such as DQN (Mnih et al., 2015), two scenarios can lead to there being different policies for different $\phi$. First, the (unique) optimal Q-function may correspond to several possible policies. Second, imperfect function approximation can lead to different value estimates for different randomizations and thus to different policies. To compare the ways in which policies can differ between randomized domains, we introduce the notion of Lipschitz continuity of a policy over a set of randomizations.

Definition 2

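We assume the state space is equipped with a distance metric. A policy $\pi$ is Lipschitz continuous over a set of randomizations $\{\phi\}$ if for all randomizations $\phi_1$ and $\phi_2$ in $\{\phi\}$,

$$K_\pi = \sup_{\phi_1, \phi_2 \in \{\phi\}} \; \sup_{s \in \mathcal{S}} \frac{D_{TV}\left(\pi_{\phi_1}(\cdot \mid s) \,\|\, \pi_{\phi_2}(\cdot \mid s)\right)}{|\phi_1(s) - \phi_2(s)|}$$

is finite. Here, $D_{TV}(P \,\|\, Q)$ is the total variation distance between distributions (given by $\frac{1}{2} \sum_{a \in \mathcal{A}} |P(a) - Q(a)|$ when the action space is discrete).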
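For a discrete action space, the quantities in this definition are easy to estimate from samples. The sketch below is a hedged illustration: it computes the total variation distance between two action distributions and the largest observed ratio of policy distance to state distance for a single pair of randomizations. Because it only visits a finite sample of states and one pair, the result is a lower bound on the constant in the definition. The function names and the toy policy are assumptions made for this example.

```python
def total_variation(p, q):
    """Total variation distance between two discrete distributions (dicts action -> prob)."""
    actions = set(p) | set(q)
    return 0.5 * sum(abs(p.get(a, 0.0) - q.get(a, 0.0)) for a in actions)


def empirical_lipschitz_constant(policy, phi_1, phi_2, states, distance):
    """Largest ratio of policy distance to state distance over the sampled states."""
    ratios = []
    for s in states:
        o1, o2 = phi_1(s), phi_2(s)
        d = distance(o1, o2)
        if d > 0:  # skip states that both randomizations map to the same observation
            ratios.append(total_variation(policy(o1), policy(o2)) / d)
    return max(ratios, default=0.0)


# Toy usage: observations are (position, xi) pairs and the hypothetical
# policy shifts probability mass towards 'right' as xi grows.
def toy_policy(observation):
    _, xi = observation
    p_right = 0.5 + 0.1 * xi
    return {"up": 1.0 - p_right, "right": p_right}


k_hat = empirical_lipschitz_constant(
    policy=toy_policy,
    phi_1=lambda s: (s, 0.0),
    phi_2=lambda s: (s, 1.0),
    states=[(0, 0), (0, 1), (1, 1)],
    distance=lambda o1, o2: abs(o1[1] - o2[1]),
)
print(k_hat)
```

A policy that ignores the randomization parameter entirely would give an estimate of zero.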

The following inequality shows that this Lipschitz constant is crucial in quantifying the robustness of RL agents over a randomization space. The smaller the Lipschitz constant, the less a policy is affected by different randomization parameters. Informally, if a policy is Lipschitz continuous over randomized MDPs, then in the visual domain randomization context this implies for example that small changes in the background color in an environment will have a small impact on the policy.

Proposition 1

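We consider an MDP $M$ and a set of randomizations $\{\phi\}$ of this MDP. Let $\pi$ be a $K$-Lipschitz policy over $\{\phi\}$. Suppose the rewards are bounded by $r_{max}$ such that $\forall a \in \mathcal{A}, s \in \mathcal{S}, \; |R(s, a)| \le r_{max}$. Then for all $\phi_1$ and $\phi_2$ in $\{\phi\}$, the following inequalities hold:

$$|\eta_1 - \eta_2| \;\le\; 2 r_{max} \sum_{t} \gamma^t \min\left(1, (t+1)\, D_{TV}^{max}(\pi_{\phi_1} \,\|\, \pi_{\phi_2})\right) \;\le\; \frac{2 r_{max}}{(1-\gamma)^2} \, K \, \|\phi_1 - \phi_2\|_\infty \qquad (1)$$

where $\eta_i$ is the expected cumulative return of policy $\pi_{\phi_i}$ on MDP $M_{\phi_i}$, for $i \in \{1, 2\}$, and $\|\phi_1 - \phi_2\|_\infty = \sup_{s \in \mathcal{S}} |\phi_1(s) - \phi_2(s)|$.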

Proof. See appendix.

These inequalities show that the smaller the Lipschitz constant, the smaller the maximum variations of the policy over the randomization space can be. In the following, we present a regularization technique that produces low-variance policies over the randomization space by minimizing the Lipschitz constant of the policy.

4 Proposed regularization

We propose a simple regularization method to produce an agent with policies that vary little over randomized environments, despite being trained on only one environment. We start by choosing one variation of the environment on which to train an agent with a policy $\pi$ parameterized by $\theta$, and during training we minimize the loss

$$\mathcal{L}(\theta) = \mathcal{L}_{RL}(\theta) + \lambda \, \mathbb{E}_{s \sim \pi} \, \mathbb{E}_{\phi} \left[ \left\| f_\theta(s) - f_\theta(\phi(s)) \right\|_2^2 \right]$$

where $\lambda$ is a regularization parameter, $\mathcal{L}_{RL}$ is the loss corresponding to the chosen reinforcement learning algorithm, the first expectation is taken over the distribution of states visited by the current policy which we assume to be fixed when optimizing this loss, and $f_\theta$ is a feature extractor used by the agent’s policy. In our experiments, we choose the output of the last hidden layer of the value or policy network as our feature extractor; we note that other choices could also be made. Minimizing the second term in this loss function minimizes the Lipschitz constant as defined above over the states visited by the agent, and causes the agent to learn representations of states that ignore variations caused by the randomization.
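The structure of this loss can be illustrated with a small Monte Carlo sketch. It assumes, for the purpose of the example, that the penalty is the squared Euclidean distance between the features of a state in the training environment and the features of the same state under a sampled randomization. All names (`regularization_term`, `reference_randomizer`, `sample_randomizer`) are hypothetical, and the feature extractor is a plain Python function rather than a neural network layer.

```python
import random


def squared_distance(u, v):
    """Squared Euclidean distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def regularization_term(features, states, reference_randomizer, sample_randomizer,
                        rng, n_samples=4):
    """Average feature mismatch between the training domain and sampled randomizations."""
    total, count = 0.0, 0
    for s in states:  # states visited by the current policy
        reference = features(reference_randomizer(s))
        for _ in range(n_samples):
            phi = sample_randomizer(rng)  # draw one randomization
            total += squared_distance(reference, features(phi(s)))
            count += 1
    return total / count


def regularized_loss(rl_loss, penalty, lam):
    """Add the weighted penalty to the loss of the underlying RL algorithm."""
    return rl_loss + lam * penalty


# Toy usage: observations are (position, xi); one feature map leaks xi,
# the other ignores it and therefore incurs no penalty.
rng = random.Random(0)
states = [0.0, 1.0, 2.0]
reference = lambda s: (s, 0.0)
sampler = lambda r: (lambda s, xi=r.uniform(0.0, 1.0): (s, xi))
leaky = regularization_term(lambda o: [o[0], o[1]], states, reference, sampler, rng)
invariant = regularization_term(lambda o: [o[0]], states, reference, sampler, rng)
print(leaky, invariant)
```

Because the penalty only touches the feature extractor, it can be added to the loss of a value-based or policy-based algorithm without changing the rest of the training loop.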

Our method can be applied to many RL algorithms, since it involves simply adding an additional term to the learning loss. In the following, we experimentally demonstrate applications to both value-based and policy-based reinforcement learning algorithms.

5 Experiments

Implementation details for the following experiments can be found in the appendix, and our code is publicly available.

5.1 Illustration on a gridworld

We first conduct experiments on a simple gridworld to illustrate the theory described above.

   Agent                 Same path probability
   Randomized            86%
   Regularized (ours)    100%

Figure 2: Left: a simple gridworld, in which the agent must make its way to the goal while avoiding the fire. Center: empirical differences between regularized agents’ policies on two randomizations of the gridworld compared to our theoretical bound (the dashed line), shown for 20 training seeds per value of $\lambda$. Right: probability that different agents will choose the same path for different randomizations of this domain, obtained from 100 training seeds. Our regularization method leads to more consistent behavior.

The environment we use is the gridworld shown in figure 2, in which two optimal policies exist. The agent starts in the bottom left of the grid and must reach the goal while avoiding the fire. The agent can move either up or right, and in addition to the rewards shown in figure 2 receives -1 reward for invalid actions that would cause it to leave the grid. We set a time limit of 10 steps. We introduce randomization into this environment by describing the state of the agent as a tuple $(x, y, \xi)$, where $(x, y)$ is the agent’s position and $\xi$ is a randomization parameter with no impact on the underlying MDP. For this toy problem, we consider only two possible values for $\xi$. The agents we consider use the REINFORCE algorithm (Sutton et al., 2000) with a baseline (see appendix), and a multi-layer perceptron as the policy network.

First, we observe that even in a simple environment such as this one, a randomized agent regularly learns different paths for different randomizations (figure 2). An agent trained on only one value of $\xi$ and regularized with our technique, however, consistently learns the same path regardless of $\xi$. Although both agents easily solve the problem, the variance of the randomized agent’s policy can be problematic in more complex environments in which identifying similarities between domains and ignoring irrelevant differences is important.

Next, we compare the empirical difference between the policies learned by regularized agents on the two domains to the smallest of our theoretical bounds in equation 1, which in this simple environment can be directly calculated. Our results for different values of $\lambda$ are shown in figure 2. We observe that increasing $\lambda$ does lead to decreases in both the empirical difference in returns and the theoretical bound.

5.2 Visual Cartpole with DQN

We compare standard domain randomization to our regularization method on a more challenging visual environment, in terms of 1) training stability, 2) returns and variance of the learned policies, and 3) state representations learned by the agents.

5.2.1 Experimental Setting

We consider the visual Cartpole environment shown in figure 1, where the states consist of raw pixels of the images. The agent must keep a pole upright as long as possible on a cart that can move left or right. The episode terminates either after 200 time steps, if the cart leaves the track, or if the pole falls over. The randomization consists of changing the color of the background. Each randomized domain corresponds to a color $\xi = (r, g, b)$, where $0 \le r, g, b \le 1$. Our implementation of this environment is based on the OpenAI Gym (Brockman et al., 2016).

For training, we use the DQN algorithm with a CNN architecture similar to that used by Mnih et al. (2015). In principle, such a value-based algorithm should learn a unique value function independently of the randomization parameters we consider. This is in contrast to the policy-based algorithm used in our previous experiment, in which there is no unique optimal policy. However, as we will show, function approximation errors cause different value functions to be learned for different background colors.

We compare the performance of three agents. The Normal agent is trained on only one domain (with a white background). The Randomized agent is trained on a chosen randomization space $\Xi$. The Regularized agent is trained on a white background using our regularization method with respect to randomization space $\Xi$. The training of all three agents is done using the same hyperparameters, and over the same number of steps.

5.2.2 Performance during training

Figure 3: Training curves over the small (left) and big (right) randomization spaces. Shaded areas indicate the 95% confidence interval of the mean, obtained over 10 training seeds.

We first compare the performance of our agents during training. We train all three agents over two randomization spaces (environments with different background colors) of the following sizes:

  • Small: a portion of the unit cube smaller than one half.

  • Big: half the unit cube.

We obtain the training curves shown in figure 3. We find that the normal and regularized agents have similar training curves and are not affected by the size of the randomization space. However, the randomized agent learns more slowly on the small randomization space (left), and also achieves worse performance on the bigger randomization space (right). In high-dimensional problems, we would like to pick the randomization space to be as large as possible to increase the chances of transferring to the target domain. We find that standard domain randomization scales poorly with the size of the randomization space, whereas our regularization method is more robust to a larger randomization space.

5.2.3 Generalization and variance

Figure 4: Comparison of the average scores of different agents over different domains. The scores are calculated over a plane of the (r, g, b) cube within the randomization space, where g is fixed, and averaged over 1000 steps. The training domain for both the regularized and normal agents is located at the top right. The regularized agent learns more stable policies than the randomized agent over these domains.

We compare the returns of the policies learned by the different agents in different domains within the randomization space. We select a plane within the randomization space obtained by varying only the R and B channels but keeping G fixed. For each agent and a single training seed, we plot the scores obtained on this plane in figure 4. We see that despite having only been trained on one domain, the regularized agent achieves consistently high scores on the other domains. On the other hand, the randomized agent’s policy exhibits returns with high variance between domains, which indicates that different policies were learned for different domains. We also observe that the regularized agent has much smaller variance and generally higher scores than the randomized agent.
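The evaluation over a color plane can be sketched as follows. This is a minimal illustration with hypothetical helper names: `color_plane` builds a grid of background colors with the green channel held fixed, and `evaluate` stands in for running a trained agent on a given background and returning its score. The grid resolution and the fixed green value are placeholders, not the settings used in the experiments.

```python
def color_plane(n, g):
    """Return an n x n grid of (r, g, b) colors with the green channel held fixed.

    Rows vary the blue channel and columns the red channel, both over [0, 1].
    """
    if n < 2:
        raise ValueError("need at least two points per axis")
    steps = [i / (n - 1) for i in range(n)]
    return [[(r, g, b) for r in steps] for b in steps]


def score_plane(evaluate, n, g):
    """Evaluate an agent on every color of the plane; `evaluate` maps a color to a score."""
    return [[evaluate(color) for color in row] for row in color_plane(n, g)]


def spread(scores):
    """Smallest and largest score on the plane, a crude summary of variability."""
    flat = [x for row in scores for x in row]
    return min(flat), max(flat)


# Toy usage with a made-up scoring function that degrades away from white.
fake_agent = lambda color: 200.0 * min(color)
print(spread(score_plane(fake_agent, n=5, g=1.0)))
```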

5.2.4 Representations learned by the agents

Figure 5: Left: Visualization of the representations learned by the agents for pink and green background colors and for the same set of states. We observe that the randomized agent learns different representations for the two domains. Right: Standard deviation of estimated value functions over randomized domains, averaged over 10 training seeds.

To understand what causes this difference in behavior between the two agents, we study the representations learned by the agents by analyzing the activations of the final hidden layer. We consider the trained agents and a sample of states obtained by performing a greedy rollout on a white background (which is included in the randomization space). For each of these states, we calculate the representation corresponding to that state for another background color in the randomization space. We then visualize these representations using t-SNE plots, where each color corresponds to a domain. A representative example of such a plot is shown in figure 5. We see that the regularized agent learns a similar representation for both backgrounds, whereas the randomized agent clearly separates them. This result indicates that the regularized agent learns to ignore the background color, whereas the randomized agent is likely to learn a different policy for a different background color. Further experiments comparing the representations of both agents can be found in the appendix.

To further study the effect of our regularization method on the representations learned by the agents, we compare the variations in the estimated value function for both agents over the randomization space. Figure 5 shows the standard deviation of the estimated value function over different background colors, averaged over 10 training seeds and a sample of states obtained by the same procedure as described above. We observe that our regularization technique successfully reduces the variance of the value function over the randomization domain. This can be seen as a consequence of the fact that the representations learned by the agent vary less between domains, and also explains the lower variance of agents trained with our method.
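The statistic described here can be computed as in the sketch below: for each sampled state, take the standard deviation of the value estimate across background colors, then average over states. The function name, the use of the population standard deviation, and the toy value functions are assumptions made for illustration; averaging over training seeds would be an additional outer loop.

```python
import statistics


def value_std_over_domains(value_fn, states, domains):
    """Average over states of the standard deviation of V(s, xi) across domains xi."""
    per_state = [
        statistics.pstdev([value_fn(s, xi) for xi in domains])
        for s in states
    ]
    return statistics.mean(per_state)


# Toy usage: one value function leaks the domain parameter, the other ignores it.
states = [1.0, 2.0, 3.0]
domains = [0.0, 2.0]
leaky_value = lambda s, xi: s + xi
invariant_value = lambda s, xi: s
print(value_std_over_domains(leaky_value, states, domains))
print(value_std_over_domains(invariant_value, states, domains))
```

A value function whose estimates do not depend on the background color yields zero under this measure.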

5.3 Car Racing with PPO

Figure 6: Left: frames from the original and randomized CarRacing environment. Right: training curves of our agents, averaged over 5 seeds. Shaded areas indicate the 95% confidence interval of the mean.

To demonstrate the applicability of our regularization method to other domains and algorithms, we also perform experiments with the PPO algorithm (Schulman et al., 2017) on the CarRacing environment (Brockman et al., 2016), in which an agent must drive a car around a racetrack. An example state from this environment and a randomized version in which part of the background changes color are shown in figure 6. We train 4 agents on this domain: a normal agent on the original background, a randomized agent, and two regularized agents for two different values of $\lambda$. Randomization in this experiment occurs over the entire RGB cube.

   Agent    Return (original)    Return (all colors)

Table 1: Average returns on the original environment and its randomizations over all colors, with 95% confidence intervals calculated from 5 training seeds.

Training curves are shown in figure 6. Training curves for both regularized agents are very similar, so only one of them is shown here. We see that the randomized agent fails to learn a successful policy, whereas the other agents successfully learn. We also compare the generalization ability of the other agents to other background colors, with our results shown in table 1. These results confirm that our regularization leads to agents that are both successful in training and successfully generalize to a wide range of backgrounds. Moreover, a larger value of $\lambda$ yields higher generalization scores.

6 Conclusion

In this paper we studied domain randomization in deep reinforcement learning. We formalized the problem, illustrated the inefficiencies of standard domain randomization, and proposed a method that leads to robust, low-variance policies that are domain invariant. We conducted several experiments in different environments of differing complexities using both on-policy and off-policy algorithms to support our claims.


  • M. Anthony and P. L. Bartlett (2009) Neural network learning: theoretical foundations. cambridge university press. Cited by: Appendix C.
  • R. Antonova, S. Cruciani, C. Smith, and D. Kragic (2017) Reinforcement learning for pivoting task. CoRR abs/1703.00472. External Links: Link, 1703.00472 Cited by: §1, §2.2.
  • P. Bachman, O. Alsharif, and D. Precup (2014) Learning with pseudo-ensembles. In Advances in Neural Information Processing Systems, pp. 3365–3373. Cited by: §2.3.
  • P. L. Bartlett, D. J. Foster, and M. Telgarsky (2017) Spectrally-normalized margin bounds for neural networks. CoRR abs/1706.08498. External Links: Link, 1706.08498 Cited by: Appendix C.
  • P. L. Bartlett (1997) For valid generalization the size of the weights is more important than the size of the network. In Advances in neural information processing systems, pp. 134–140. Cited by: Appendix C.
  • K. Bousmalis, A. Irpan, P. Wohlhart, Y. Bai, M. Kelcey, M. Kalakrishnan, L. Downs, J. Ibarz, P. Pastor, K. Konolige, S. Levine, and V. Vanhoucke (2017) Using simulation and domain adaptation to improve efficiency of deep robotic grasping. CoRR abs/1709.07857. External Links: Link, 1709.07857 Cited by: §1, §2.3.
  • G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba (2016) OpenAI gym. External Links: arXiv:1606.01540 Cited by: §5.2.1, §5.3.
  • Y. Chebotar, A. Handa, V. Makoviychuk, M. Macklin, J. Issac, N. D. Ratliff, and D. Fox (2018) Closing the sim-to-real loop: adapting simulation randomization with real world experience. CoRR abs/1810.05687. External Links: Link, 1810.05687 Cited by: §2.2.
  • K. Cobbe, O. Klimov, C. Hesse, T. Kim, and J. Schulman (2018) Quantifying generalization in reinforcement learning. arXiv preprint arXiv:1812.02341. Cited by: §1.
  • B. Coors, A. Condurache, A. Mertins, and A. Geiger (2018) Learning transformation invariant representations with weak supervision.. In VISIGRAPP (5: VISAPP), pp. 64–72. Cited by: §2.3.
  • S. Daftry, J. A. Bagnell, and M. Hebert (2016) Learning transferable policies for monocular reactive MAV control. CoRR abs/1608.00627. External Links: Link, 1608.00627 Cited by: §2.3.
  • J. Farebrother, M. C. Machado, and M. Bowling (2018) Generalization and regularization in DQN. CoRR abs/1810.00123. External Links: Link, 1810.00123 Cited by: §1, §2.1.
  • S. Gamrian and Y. Goldberg (2018) Transfer learning for related reinforcement learning tasks via image-to-image translation. CoRR abs/1806.07377. External Links: Link, 1806.07377 Cited by: §1, §2.2.
  • A. Gupta, C. Devin, Y. Liu, P. Abbeel, and S. Levine (2017) Learning invariant feature spaces to transfer skills with reinforcement learning. CoRR abs/1703.02949. External Links: Link, 1703.02949 Cited by: §2.3.
  • S. James, P. Wohlhart, M. Kalakrishnan, D. Kalashnikov, A. Irpan, J. Ibarz, S. Levine, R. Hadsell, and K. Bousmalis (2018) Sim-to-real via sim-to-sim: data-efficient robotic grasping via randomized-to-canonical adaptation networks. CoRR abs/1812.07252. External Links: Link, 1812.07252 Cited by: §1, §2.2.
  • K. Kang, S. Belkhale, G. Kahn, P. Abbeel, and S. Levine (2019) Generalization through simulation: integrating simulated and real data into deep reinforcement learning for vision-based autonomous flight. CoRR abs/1902.03701. External Links: Link, 1902.03701 Cited by: §2.2.
  • S. Levine, C. Finn, T. Darrell, and P. Abbeel (2016) End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research 17 (1), pp. 1334–1373. Cited by: §1.
  • B. Mehta, M. Diaz, F. Golemo, C. J. Pal, and L. Paull (2019) Active domain randomization. CoRR abs/1904.04762. External Links: Link, 1904.04762 Cited by: §1, §2.2, §3.1.
  • T. Miyato, S. Maeda, M. Koyama, and S. Ishii (2018) Virtual adversarial training: a regularization method for supervised and semi-supervised learning. IEEE transactions on pattern analysis and machine intelligence 41 (8), pp. 1979–1993. Cited by: §2.3.
  • V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. (2015) Human-level control through deep reinforcement learning. Nature 518 (7540), pp. 529. Cited by: §B.1, §1, §3.1, §5.2.1.
  • I. Mordatch, K. Lowrey, and E. Todorov (2015) Ensemble-CIO: full-body dynamic motion planning that transfers to physical humanoids. 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 5307–5314. Cited by: §1, §2.2.
  • B. Neyshabur, R. Tomioka, and N. Srebro (2015) Norm-based capacity control in neural networks. In Conference on Learning Theory, pp. 1376–1401. Cited by: Appendix C.
  • A. M. Oberman and J. Calder (2018) Lipschitz regularized deep neural networks converge and generalize. CoRR abs/1808.09540. External Links: Link, 1808.09540 Cited by: Appendix C.
  • OpenAI (2018) Learning dexterous in-hand manipulation. CoRR abs/1808.00177. External Links: Link, 1808.00177 Cited by: §1, §2.2.
  • C. Packer, K. Gao, J. Kos, P. Krähenbühl, V. Koltun, and D. Song (2018) Assessing generalization in deep reinforcement learning. CoRR abs/1810.12282. External Links: Link, 1810.12282 Cited by: §1, §2.1.
  • X. B. Peng, M. Andrychowicz, W. Zaremba, and P. Abbeel (2017) Sim-to-real transfer of robotic control with dynamics randomization. CoRR abs/1710.06537. External Links: Link, 1710.06537 Cited by: §1, §2.2.
  • A. Rajeswaran, S. Ghotra, S. Levine, and B. Ravindran (2016) EPOpt: learning robust neural network policies using model ensembles. CoRR abs/1610.01283. External Links: Link, 1610.01283 Cited by: §1, §2.2.
  • F. Sadeghi and S. Levine (2016) CAD2RL: real single-image flight without a single real image. CoRR abs/1611.04201. External Links: Link, 1611.04201 Cited by: §1, §2.2.
  • M. Sajjadi, M. Javanmardi, and T. Tasdizen (2016) Regularization with stochastic transformations and perturbations for deep semi-supervised learning. In Advances in Neural Information Processing Systems, pp. 1163–1171. Cited by: §2.3.
  • J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347. Cited by: §5.3.
  • D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, et al. (2017) Mastering the game of Go without human knowledge. Nature 550 (7676), pp. 354. Cited by: §1.
  • R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour (2000) Policy gradient methods for reinforcement learning with function approximation. In Advances in neural information processing systems, pp. 1057–1063. Cited by: §5.1.
  • R. S. Sutton (1996) Generalization in reinforcement learning: successful examples using sparse coarse coding. In Advances in neural information processing systems, pp. 1038–1044. Cited by: §1.
  • J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, and P. Abbeel (2017) Domain randomization for transferring deep neural networks from simulation to the real world. CoRR abs/1703.06907. External Links: Link, 1703.06907 Cited by: §1, §2.2.
  • E. Tzeng, C. Devin, J. Hoffman, C. Finn, X. Peng, S. Levine, K. Saenko, and T. Darrell (2015) Towards adapting deep visuomotor representations from simulated to real environments. CoRR abs/1511.07111. External Links: Link, 1511.07111 Cited by: §2.3.
  • R. J. Williams (1992) Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning 8 (3-4), pp. 229–256. Cited by: §3.1.
  • S. Witty, J. K. Lee, E. Tosch, A. Atrey, M. L. Littman, and D. Jensen (2018) Measuring and characterizing generalization in deep reinforcement learning. CoRR abs/1812.02868. External Links: Link, 1812.02868 Cited by: §1.
  • M. Wulfmeier, I. Posner, and P. Abbeel (2017) Mutual alignment transfer learning. CoRR abs/1707.07907. External Links: Link, 1707.07907 Cited by: §2.3.
  • Q. Xie, Z. Dai, E. Hovy, M. Luong, and Q. V. Le (2019) Unsupervised data augmentation. arXiv preprint arXiv:1904.12848. Cited by: §2.3.
  • S. Zakharov, W. Kehl, and S. Ilic (2019) DeceptionNet: network-driven domain randomization. CoRR abs/1904.02750. External Links: Link, 1904.02750 Cited by: §2.2.
  • A. Zhang, N. Ballas, and J. Pineau (2018a) A dissection of overfitting and generalization in continuous reinforcement learning. CoRR abs/1806.07937. External Links: Link, 1806.07937 Cited by: §1.
  • C. Zhang, O. Vinyals, R. Munos, and S. Bengio (2018b) A study on overfitting in deep reinforcement learning. CoRR abs/1804.06893. External Links: Link, 1804.06893 Cited by: §1, §2.1.

Appendix A Proof of Proposition 1

The following proof applies to MDPs with a discrete action space. It generalizes straightforwardly to continuous action spaces by replacing sums over actions with integrals over actions.

The proof uses the following lemma:

Lemma 1

For two distributions and , we can bound the total variation distance of the joint distribution:

Proof of the Lemma.

We have that :
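As a numerical sanity check, the sketch below tests one standard form of such a joint bound on random discrete distributions: the total variation distance between two joint distributions is at most the distance between their marginals plus the largest distance between their conditionals. This exact statement is an assumption about the lemma's form, and all function names are illustrative.

```python
import random

def random_dist(n, rng):
    """Draw a random probability vector of length n."""
    weights = [rng.random() + 1e-3 for _ in range(n)]
    total = sum(weights)
    return [w / total for w in weights]

def tv(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def joint(marginal, conditional):
    """Flattened joint p(x, y) = p(x) * p(y | x)."""
    return [px * pyx for px, row in zip(marginal, conditional) for pyx in row]

def check_bound(n_x=4, n_y=3, trials=1000, seed=0):
    """Return the smallest slack (rhs - lhs) seen over random trials."""
    rng = random.Random(seed)
    worst_gap = float("inf")
    for _ in range(trials):
        p_x, q_x = random_dist(n_x, rng), random_dist(n_x, rng)
        p_yx = [random_dist(n_y, rng) for _ in range(n_x)]
        q_yx = [random_dist(n_y, rng) for _ in range(n_x)]
        lhs = tv(joint(p_x, p_yx), joint(q_x, q_yx))
        rhs = tv(p_x, q_x) + max(tv(a, b) for a, b in zip(p_yx, q_yx))
        worst_gap = min(worst_gap, rhs - lhs)
    return worst_gap

worst_gap = check_bound()
```

A non-negative `worst_gap` means the assumed inequality held on every sampled pair. This is a check, not a proof.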

Proof of the proposition.

Let be the probability of being in state at time , and executing action , for . Since both MDPs have the same reward function, we have by definition that , so we can write :


But and , Thus (Lemma 1) :

We still have to bound . For we have that :

Summing over we have that

But by marginalizing over actions : , and using the fact that , we have that

And using we have that :

Thus, by induction, and assuming :

Plugging this into inequality 3, we get

We also note that the total variation distance takes values between 0 and 1, so we have

Plugging this into inequality 3 leads to our first bound,

Our second, looser bound can now be achieved as follows,

Appendix B Experimental details

All code used for our experiments is available at

B.1 State preprocessing

For our implementation of the visual cartpole environment, each image consists of pixels with RGB channels. To include momentum information in our state description, we stack frames, so the shape of the state that is sent to the agent is .

In CarRacing, each state consists of pixels with RGB channels. We introduce frame skipping, as is often done for Atari games (Mnih et al., 2015), with a skip parameter of 5. This restricts the length of an episode to 200 action choices. We then stack 2 frames to include momentum information in the state description. The shape of the state that is sent to the agent is thus . We find that although these modifications prevent the agents from obtaining scores above 900, which are achievable in the original environment, training is considerably faster.
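The frame skipping and frame stacking described above can be sketched as a thin wrapper. This is a minimal illustration, not the code used in the experiments: the `reset_fn`/`step_fn` interface, the `ToyEnv` stand-in, its 8x8 frame size, and the choice to stack along the channel axis are all assumptions.

```python
from collections import deque
import numpy as np

class SkipAndStack:
    """Repeat each action `skip` times and stack the last `n_stack` frames."""

    def __init__(self, reset_fn, step_fn, skip=5, n_stack=2):
        self.reset_fn, self.step_fn = reset_fn, step_fn
        self.skip = skip
        self.frames = deque(maxlen=n_stack)

    def _state(self):
        # Assumed layout: frames concatenated along the channel axis.
        return np.concatenate(list(self.frames), axis=-1)

    def reset(self):
        frame = self.reset_fn()
        for _ in range(self.frames.maxlen):
            self.frames.append(frame)
        return self._state()

    def step(self, action):
        total_reward, done = 0.0, False
        for _ in range(self.skip):
            frame, reward, done = self.step_fn(action)
            total_reward += reward
            if done:
                break
        self.frames.append(frame)
        return self._state(), total_reward, done

class ToyEnv:
    """Hypothetical stand-in: 1000 raw frames per episode, reward 1 per frame."""

    def __init__(self, height=8, width=8, horizon=1000):
        self.shape, self.horizon, self.t = (height, width, 3), horizon, 0

    def reset(self):
        self.t = 0
        return np.zeros(self.shape, dtype=np.uint8)

    def step(self, action):
        self.t += 1
        frame = np.full(self.shape, self.t % 256, dtype=np.uint8)
        return frame, 1.0, self.t >= self.horizon

env = ToyEnv()
wrapped = SkipAndStack(env.reset, env.step, skip=5, n_stack=2)
state = wrapped.reset()
decisions, done = 0, False
while not done:
    state, reward, done = wrapped.step(0)
    decisions += 1
```

With a raw horizon of 1000 frames and a skip of 5, the toy episode ends after 200 action choices, which matches the episode length quoted above.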

B.2 Visual Cartpole

B.2.1 Extrapolation

Figure 7: Generalization scores, with 95% confidence intervals obtained over 10 training seeds. The normal agent is trained on white , corresponding to a distance to train. The other domains correspond to , for

Given that regularized agents are stronger in interpolation over their training domain, it is natural to ask how these agents perform in extrapolation to colors outside the range sampled during training. For this purpose, we consider randomized and regularized agents trained on , and test them on the set . None of these agents was ever exposed to during training.

Our results are shown in figure 7. We find that although the regularized agent consistently outperforms the randomized agent in interpolation, both agents fail to extrapolate well outside the training domain. Since we only regularize with respect to the training space, there is indeed no guarantee that our regularization method can produce an agent that extrapolates well. Since the objective of domain randomization is often to achieve good transfer to an a priori unknown target domain, this result suggests that it is important that the target domain lie within the randomization space, and that the randomization space be made as large as possible during training.

B.2.2 Further study of the representations learned by different agents

We perform further experiments to demonstrate that the randomized agent learns different representations for different domains, whereas the regularized agent learns similar representations. We consider agents trained on , the union of darker and lighter backgrounds. We then roll out each agent on a single episode of the domain with a white background and, for each state in this episode, calculate the representations learned by the agent for other background colors. We visualize these representations using the t-SNE plot shown in figure 8. We observe that the randomized agent clearly separates the two training domains, whereas the regularized agent learns similar representations for both domains.

Figure 8: t-SNE of the representations over of the Regularized (Left) and Randomized (Right) agents. Each color corresponds to a domain. The randomized agent learns very different representations for and .

We are interested in how robust our agents are to unseen values . To visualize this, we roll out both agents in domains having different background colors: , i.e., ranging from black to white, and collect their features over an episode. We then plot the t-SNEs of these features for both agents in figure 9, where each color corresponds to a domain.

Figure 9: t-SNE of the features of the Regularized (Left) and Randomized (Right) agents. Each color corresponds to a domain.

We observe once again that the regularized agent has much lower variance over unseen domains, whereas the randomized agent learns different features for different domains. This shows that the regularized agent is more robust to domain shifts than the randomized agent.
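The qualitative picture given by the t-SNE plots can be complemented by a scalar summary. The sketch below computes, for features of the same states rendered in several domains, the mean distance of each feature vector from its cross-domain average. This metric, the function name `cross_domain_spread`, and the synthetic features are illustrative assumptions, not quantities reported in our experiments.

```python
import numpy as np

def cross_domain_spread(features):
    """Mean distance of each feature from its state's cross-domain mean.

    features: array of shape (n_domains, n_states, feature_dim), where
    features[d, s] is the representation of state s rendered in domain d.
    """
    centroid = features.mean(axis=0, keepdims=True)
    return float(np.linalg.norm(features - centroid, axis=-1).mean())

# Synthetic features, for illustration only.
rng = np.random.default_rng(0)
base = rng.normal(size=(1, 50, 16))      # state-dependent part, shared by domains
offsets = rng.normal(size=(5, 1, 16))    # domain-dependent shift

# Nearly domain-invariant features: same vector in every domain plus small noise.
invariant = np.repeat(base, 5, axis=0) + 0.01 * rng.normal(size=(5, 50, 16))
# Domain-dependent features: each domain shifts every state's representation.
domain_dependent = base + offsets

spread_invariant = cross_domain_spread(invariant)
spread_dependent = cross_domain_spread(domain_dependent)
```

A spread close to zero corresponds to the overlapping clusters seen for the regularized agent, while a large spread corresponds to one separate cluster per domain.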

Appendix C Further related work: Lipschitz continuity and generalization in Deep Learning

Lipschitz-sensitive bounds on the generalization abilities of neural networks have a long history (Bartlett, 1997; Anthony and Bartlett, 2009; Neyshabur et al., 2015). Recently, Bartlett et al. (2017) proved a generalization bound in terms of the norms of each layer, which is proportional to the Lipschitz constant of the network. Oberman and Calder (2018) studied generalization through a general empirical risk minimization procedure with Lipschitz regularization, and provided generalization bounds. Similarly, we show that the Lipschitz constant of the network with respect to the randomization parameters plays an important role in achieving zero-shot transfer to a target domain.

Appendix D Algorithms

  Initialize replay memory to capacity
  Initialize action-value function with random weights
  Initialize the randomization space , and a reference MDP to train on.
  Initialize a regularization parameter
  Define a feature extractor
  for episode =  do
     Sample a randomizer function uniformly from .
     for  do
        With probability select a random action
        otherwise select
        Execute action in and observe reward and both the reference and randomized states :
        Store transition in
        Sample random minibatch of transitions from
        Perform a gradient descent step on
     end for
  end for
Algorithm 1 Deep Q-learning with our regularization method
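The gradient step of Algorithm 1 can be illustrated with the sketch below, which assumes that the objective is the usual squared temporal-difference error plus a weighted squared distance between the features of the reference state and of the randomized state. The exact form of the loss, the function name, and the argument layout are assumptions made for this illustration.

```python
import numpy as np

def regularized_dqn_loss(q_ref, q_next_target, actions, rewards, dones,
                         feat_ref, feat_rand, gamma=0.99, lam=1.0):
    """Assumed loss: TD error on reference states + lam * feature mismatch.

    q_ref:          (batch, n_actions) Q-values of the reference states
    q_next_target:  (batch, n_actions) target-network Q-values of next states
    feat_ref/rand:  (batch, dim) features of reference / randomized states
    """
    batch = np.arange(len(actions))
    # One-step TD targets; terminal transitions do not bootstrap.
    targets = rewards + gamma * (1.0 - dones) * q_next_target.max(axis=1)
    td_loss = np.mean((targets - q_ref[batch, actions]) ** 2)
    # Penalize features that change when only the randomization changes.
    feat_penalty = np.mean(np.sum((feat_ref - feat_rand) ** 2, axis=1))
    return td_loss + lam * feat_penalty, td_loss, feat_penalty

# Tiny hand-checkable batch.
total, td, penalty = regularized_dqn_loss(
    q_ref=np.array([[1.0, 2.0], [0.0, 1.0]]),
    q_next_target=np.array([[0.0, 1.0], [5.0, 5.0]]),
    actions=np.array([1, 0]),
    rewards=np.array([1.0, 0.0]),
    dones=np.array([0.0, 1.0]),
    feat_ref=np.array([[1.0, 0.0], [0.0, 0.0]]),
    feat_rand=np.array([[0.0, 0.0], [0.0, 0.0]]),
    gamma=0.5, lam=2.0,
)
```

In a real implementation the same expression would be written in an automatic differentiation framework so that the gradient flows through both the Q-values and the feature extractor.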
  Initialize policy network function with random weights , baseline
  Initialize the randomization space , and a reference MDP to train on.
  Initialize a regularization parameter
  Define a feature extractor
  for episode =  do
     Sample a randomizer function uniformly from .
     Collect a set of trajectories by executing on .
     for  in each trajectory  do
        Compute the return
        Estimate the advantage
     end for
     Perform a gradient descent step on
  end for
Algorithm 2 Policy Gradient with a baseline using our regularization method

Note that algorithm 2 can be straightforwardly adapted to several state-of-the-art policy gradient algorithms such as PPO.
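The return and advantage steps of Algorithm 2 can be sketched as follows. The discounted return and the baseline-subtracted advantage are standard; the way the feature penalty is added to the policy gradient surrogate is an assumption, and the function names are illustrative.

```python
import numpy as np

def discounted_returns(rewards, gamma):
    """Return G_t = r_t + gamma * G_{t+1} for every step of one trajectory."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def regularized_pg_loss(log_probs, returns, baselines,
                        feat_ref, feat_rand, lam=1.0):
    """Assumed loss: policy gradient surrogate + lam * feature mismatch."""
    advantages = returns - baselines
    pg_loss = -np.mean(log_probs * advantages)
    feat_penalty = np.mean(np.sum((feat_ref - feat_rand) ** 2, axis=1))
    return pg_loss + lam * feat_penalty

returns = discounted_returns([1.0, 1.0, 1.0], gamma=0.5)
loss = regularized_pg_loss(
    log_probs=np.array([-1.0, -1.0, -1.0]),
    returns=returns,
    baselines=np.zeros(3),
    feat_ref=np.ones((3, 4)),
    feat_rand=np.ones((3, 4)),   # identical features, so the penalty is zero
)
```

For an algorithm such as PPO, only the surrogate term would change; the feature penalty would be added in the same way.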

Appendix E A dynamics randomization experiment

Figure 10: Training curves of the agents on the Cartpole domain, averaged over seeds.

We also perform an experiment to demonstrate that learning domain-invariant features can be helpful not only for visual randomization, but also for some instances of dynamics randomization. We consider once again the Cartpole domain, where this time we randomize some physical dynamics of the environment. Specifically, we choose to randomize the pole length , and the gravity . The state for our reinforcement learning agent is the -dimensional vector , where is the position of the cart, its velocity, the angle between the pole and the horizontal plane, and the angular velocity. We train all three agents (Normal, Regularized, Randomized) using only two randomized domains: and . We use the DQN algorithm with a Multi-Layer Perceptron architecture for this experiment.
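To make the role of the two randomized parameters concrete, the sketch below implements one Euler step of the classic cart-pole dynamics, in the form used by the OpenAI Gym CartPole environment, with the pole length and gravity exposed as arguments. The two entries of `DOMAINS` are placeholder values chosen for illustration, not the domains used in this experiment. The sketch follows the Gym convention of measuring the pole angle from the upright position.

```python
import math

def cartpole_step(state, action, length=0.5, gravity=9.8,
                  masscart=1.0, masspole=0.1, force_mag=10.0, tau=0.02):
    """One Euler step of the classic cart-pole equations.

    state:  (x, x_dot, theta, theta_dot), theta measured from upright
    action: 1 pushes the cart right, anything else pushes it left
    length: half the pole length, as in the Gym implementation
    """
    x, x_dot, theta, theta_dot = state
    force = force_mag if action == 1 else -force_mag
    total_mass = masscart + masspole
    polemass_length = masspole * length
    cos_t, sin_t = math.cos(theta), math.sin(theta)

    temp = (force + polemass_length * theta_dot ** 2 * sin_t) / total_mass
    theta_acc = (gravity * sin_t - cos_t * temp) / (
        length * (4.0 / 3.0 - masspole * cos_t ** 2 / total_mass))
    x_acc = temp - polemass_length * theta_acc * cos_t / total_mass

    return (x + tau * x_dot, x_dot + tau * x_acc,
            theta + tau * theta_dot, theta_dot + tau * theta_acc)

# Placeholder domains (hypothetical values).
DOMAINS = [
    {"length": 0.5, "gravity": 9.8},
    {"length": 1.0, "gravity": 20.0},
]

start = (0.0, 0.0, 0.05, 0.0)
next_states = [cartpole_step(start, 1, **domain) for domain in DOMAINS]
upright = cartpole_step((0.0, 0.0, 0.0, 0.0), 1)
```

The same state and action lead to different next states in the two domains, which is why an agent that keys its features on the dynamics parameters can end up with one strategy per domain.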

We first compare the training speed of the agents. The training curves averaged over seeds are plotted in figure 10. We observe once again that the randomized agent is significantly slower than the regularized one, and is more unstable.

Figure 11: Generalization scores averaged over 5 training seeds and 4 test episodes per seed. Red dots correspond to training environments

Next, we examine the agents’ generalization ability. We test the agents on environments having values of pole length and gravity unseen during training. We plot their scores in figure 11. The randomized agent clearly specializes on the two different training domains, corresponding to the two clearly distinguishable regions where high scores are achieved, whereas the regularized agent achieves more consistent scores across domains. This result can be understood as follows. Although the different dynamics between the two domains lead to there being different sets of optimal policies, our regularization method forces the agent to only learn policies that do not depend on the specific values of the randomized dynamics parameters. These policies are therefore more likely to also work when those dynamics are different.

E.1 Representations learned by the agent

We analyze the representations learned by each agent in our dynamics randomization experiment. Once the agents are trained, we roll out their policies in both randomized environments with an -greedy strategy, where we use to reach a larger number of states of the MDP, over steps. We collect the representations (the activations of the last hidden layer) corresponding to the visited states. These features are -dimensional, so in order to visualize them, we use the t-SNE plots shown in figure 12. We emphasize that although this figure corresponds to a single training seed, the qualitative pattern of these results is reproducible across seeds.
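The collection procedure described above can be sketched as follows. The environment interface (`reset_fn`, `step_fn`), the Q-function `q_fn`, the feature extractor `feature_fn`, and the toy chain used to exercise them are hypothetical stand-ins, and the exploration rate and step count in the usage lines are placeholders rather than the values used in our experiment.

```python
import random

def epsilon_greedy(q_values, epsilon, rng):
    """Pick a uniformly random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def collect_features(reset_fn, step_fn, q_fn, feature_fn,
                     epsilon, n_steps, seed=0):
    """Roll out an epsilon-greedy policy and record one feature per visited state."""
    rng = random.Random(seed)
    state, features = reset_fn(), []
    for _ in range(n_steps):
        features.append(feature_fn(state))
        action = epsilon_greedy(q_fn(state), epsilon, rng)
        state, done = step_fn(state, action)
        if done:
            state = reset_fn()
    return features

# Toy chain environment: action 1 moves right, action 0 moves left,
# and the episode ends once the agent is three steps away from the origin.
features = collect_features(
    reset_fn=lambda: 0,
    step_fn=lambda s, a: (s + (1 if a == 1 else -1), abs(s) >= 3),
    q_fn=lambda s: [0.0, 1.0],
    feature_fn=lambda s: (s, s * s),
    epsilon=0.2,      # placeholder exploration rate
    n_steps=50,       # placeholder number of steps
)
```

The resulting list of feature vectors is what would then be passed to a dimensionality reduction method such as t-SNE for plotting.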

Figure 12: t-SNE of the representations learned by the regularized and randomized agents on the two training environments.

The randomized agent learns completely different representations for the two randomized environments. This explains its high variance during training, since it tries to learn a different strategy for each domain. On the other hand, our regularized agent has the same representation for both domains, which allows it to learn much faster, and to learn policies that are robust to changes in the environment’s dynamics.