Invariant Policy Optimization: Towards Stronger Generalization in Reinforcement Learning

06/01/2020 ∙ by Anoopkumar Sonar, et al.

A fundamental challenge in reinforcement learning is to learn policies that generalize beyond the operating domain experienced during training. In this paper, we approach this challenge through the following invariance principle: an agent must find a representation such that there exists an action-predictor built on top of this representation that is simultaneously optimal across all training domains. Intuitively, the resulting invariant policy enhances generalization by finding causes of successful actions. We propose a novel learning algorithm, Invariant Policy Optimization (IPO), that explicitly enforces this principle and learns an invariant policy during training. We compare our approach with standard policy gradient methods and demonstrate significant improvements in generalization performance on unseen domains for Linear Quadratic Regulator (LQR) problems and our own benchmark in the MiniGrid Gym environment.






1 Introduction

One of the fundamental challenges with state-of-the-art reinforcement learning (RL) approaches is their limited ability to generalize beyond the specific domains they were trained on. The problem of generalization is particularly acute in complex robotics applications. Deploying an RL policy on a robot outside of the laboratory requires learning a policy that can generalize to a wide range of operating domains — especially when the application is safety-critical. For example, autonomous vehicles must contend with unfamiliar obstacles, lighting, and road conditions when deployed at scale; robotic manipulators deployed in homes must deal with unfamiliar objects and environment geometries; and robots operating in close proximity to humans must be able to handle new patterns of human motion.

As a simple example, consider the problem shown in Figure 1. A robot is placed in a grid-world and must learn to navigate to a goal located in a different room. In order to do this, it must learn to first navigate to a key, use this key to open the door, and then navigate to the goal. During the training phase, the robot is presented with environments containing red and green keys. A policy trained using standard RL techniques demonstrates strong performance when deployed in test environments with key colors seen during training. However, its performance significantly degrades when deployed in test environments with different key colors (see Section 5.2 for a thorough exploration of this problem).

Learning policies capable of such generalization remains challenging for a number of reasons. Primarily, RL algorithms have a tendency to memorize solutions to training environments, thereby achieving high training rewards with a brittle policy that will not generalize to novel environments. Moreover, learned policies often fail to ignore distractors in their sensor observations (e.g., the key colors) and are highly sensitive to changes in such irrelevant factors. The goal of this paper is to address these challenges and learn policies that achieve strong generalization across new operating domains given a limited set of training domains.

Statement of Contributions. We approach the problem of generalizing across domains (formalized in Section 2) with the following principle: a policy will generalize well if it exploits invariances resulting from causal relationships present across domains (e.g., key color does not cause rewards). To embody this principle, we leverage a close connection between causality and invariance (Section 3) in an approach we refer to as Invariant Policy Optimization (IPO). The key idea is to learn a representation that makes the optimal policy built on top of this representation invariant across training domains. Effectively, this approach attempts to learn and exploit the causes of successful actions. We show that the resulting algorithm achieves significantly stronger generalization than traditional on-policy methods in two different scenarios (Section 5): a linear-quadratic output feedback problem with distracting observations and an instantiation of the colored-key problem.

1.1 Related Work

Figure 1: A depiction of the Colored-Key problem described in Section 1 on a grid. Each key color corresponds to a different operating domain. The agent (red triangle) must learn to use the key to open the door and reach the goal (green square). The agent is trained on domains with red and green keys. At test time, the learned policy is deployed on a domain with differently-colored keys (e.g., grey keys). Our results in Section 5 demonstrate that a policy trained with our algorithm generalizes to this novel testing domain significantly better than one trained using standard techniques.

Quantifying generalization. Finding policies that generalize beyond their training domain is an increasingly popular topic as reinforcement learning continues to mature, and a number of recent studies have attempted to quantify and understand the generalization challenge in RL. In [29], the authors quantify the effects of observational overfitting, where learned policies are sensitive to irrelevant observational features. Benchmark suites including Sonic [21] and Atari 2600 games [19] have also been proposed to quantify generalization. Recently, CoinRun [10] and the broader Procgen Benchmark [9] use procedural generation of environments at controllable levels of difficulty to demonstrate that effective generalization can require an extremely large number of training environments. Another manifestation of the generalization gap is the sim2real problem in robotics: agents trained in simulation overfit to this domain and fail to operate in hardware [34, 24, 32].

Regularization and domain randomization. The most common approach for improving the out-of-domain generalization of a learning algorithm is to add different forms of regularization. Popular ones borrowed from supervised learning include $\ell_2$ regularization, dropout [30], and batch normalization [14]; each of these has been shown to improve generalization on novel CoinRun levels [10]. While practical and easy to implement, these methods typically do not explicitly exploit any structure of the RL problem. Another approach is to constrain the agent’s policy to only depend on a set of learned task-relevant variables, which are found by introducing an information-theoretic regularizer [22]. This method has been shown to generalize to new domains that contain task-irrelevant features not present during training. However, the task-relevant variables are not guaranteed to exploit causal relationships in the environment, which is the focus of this paper. Data augmentation and domain randomization have also been shown to be particularly useful in crossing the sim2real barrier [24, 3]. These methods are complementary to the approach presented here and could potentially be used to generate a diverse set of training domains for our method.

Distributional robustness. The PAC-Bayes Control approach [17, 36] provides a way to make provable generalization guarantees under distributional shifts. This approach is particularly useful in safety-critical applications where it is important to quantify the impact of switching between a training domain and a test domain. Another approach that provides robustness guarantees is to train in a manner that allows adversarial perturbations to the underlying data distribution [28]. However, the challenge with both of these approaches is that they require an a priori bound on how much the test domain differs from the training domain (e.g., in terms of an $f$-divergence). In contrast, the recently proposed risk-extrapolation method [16] promotes out-of-distribution generalization by encouraging robustness of hypotheses over affine combinations of training risks. This method is shown to improve performance of RL agents when their state space is augmented with noisy copies of true system states.

Causality and invariance. Recently, the task of learning causal predictors has drawn interest in the supervised learning setting. An approach formalized in [25] attempts to find features that are causally linked to a target variable by exploiting the invariance of causal relationships [23, 26]. This approach was expanded upon in the invariant risk minimization (IRM) approach [5], which formulates the problem in terms of finding a representation such that the optimal classifier built on top of this representation is invariant across domains. This results in classifiers that ignore spurious correlations that may exist in any single domain. The formulation leads to a challenging bilevel optimization problem and is tackled via a regularizer that approximates its solution. More recently, [2] presented a game-theoretic reformulation of the IRM principle and proposed a new algorithm, known as IRM-Games, which offers better empirical results. Our approach adapts ideas from causality and invariance to RL settings by learning representations that invariantly predict actions. We provide more background on invariance, causality, and IRM in Section 3.

Causality in RL. Lastly, there are a number of recent methods that attempt to exploit causality in RL. For example, [11] observed that, in some instances, causal reasoning can emerge in agents trained via meta-learning. Other approaches explicitly attempt to learn causal graphs that describe the dynamics of the agent’s environment. Along these lines, [20] proposes a two-phase training process where interactions with the environment are first used to learn the causal graph of the environment, and then a policy that exploits this graph is trained. Finally, the IRM method has recently been applied to RL problems. In [37], the authors attempt to learn a causal Markov decision process (MDP) that is bisimilar to the full MDP present during training. This formulation requires learning a model for both the causal and full dynamics of the system, a mapping between the two, and a causal model of the rewards. Standard RL algorithms are then used in conjunction with these causal models to produce a final policy. This approach is distinct from the one in this paper, which does not seek to find complete dynamical models (causal or otherwise). Instead, we focus on identifying the causes of successful actions, which is a simpler problem.

2 Problem Formulation

We are interested in the problem of zero-shot generalization to environments that can be significantly different from environments seen during training. We formalize this as follows. Let $s_t = (s_t^{\text{agent}}, s_t^{\text{env}})$ denote the joint state of the agent and environment at time-step $t$. In the colored-keys example discussed in Section 1, $s_t^{\text{agent}}$ corresponds to the location of the agent at time-step $t$, while $s_t^{\text{env}}$ corresponds to locations of the obstacles, key, door, and goal. In our formulation, different environments correspond to different (initial) states $s_0^{\text{env}}$ of the environment (e.g., different configurations of obstacles, key, door, and goal). We denote the agent’s actions, observations, and rewards by $a_t \in \mathcal{A}$, $o_t \in \mathcal{O}$, and $r_t$ respectively.

During training, we assume access to multiple sets of environments from different training domains $\mathcal{D}_{\text{tr}}$. We assume that the action space $\mathcal{A}$ and observation space $\mathcal{O}$ are shared across all domains (state spaces need not be shared). Each domain $d$ corresponds to a partially observable Markov decision process (POMDP) [33] with dynamics mapping $p^d(s_{t+1} \mid s_t, a_t)$, observation mapping $p^d(o_t \mid s_t)$, and reward mapping $p^d(r_t \mid s_t, a_t)$. In the colored-keys example, domains differ (only) in terms of the observation mapping; in particular, each domain assigns a particular color to keys. Each domain $d$ also defines a distribution $\mathcal{P}^d$ over environments.

Our goal is to learn a policy that generalizes to domains $\mathcal{D}$ beyond the training domains $\mathcal{D}_{\text{tr}}$ (e.g., generalizing to domains with key colors not seen during training). More specifically, let $R^E(\pi)$ denote the expected cumulative reward (over either a finite or infinite horizon) when policy $\pi$ is executed in environment $E$, and let $R^d(\pi) := \mathbb{E}_{E \sim \mathcal{P}^d}[R^E(\pi)]$ denote the resulting expected reward on domain $d$. We would then like to maximize the worst-case rewards over all domains:

$$\max_{\pi} \; \min_{d \in \mathcal{D}} \; R^d(\pi). \tag{1}$$

Without further assumptions on the relationship between $\mathcal{D}$ and $\mathcal{D}_{\text{tr}}$, finding a policy that performs well on domains $\mathcal{D}$ may be impossible. We discuss this further in Section 4.
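As a small illustration of the worst-case objective, the sketch below estimates a per-domain reward by averaging returns over sampled environments and then takes the minimum over domains. The function names and the return values are hypothetical and chosen only for illustration; they are not taken from the experiments.

```python
import numpy as np

def domain_reward(env_returns):
    """Monte Carlo estimate of a domain's expected reward: the mean return
    over environments sampled from that domain."""
    return float(np.mean(env_returns))

def worst_case_reward(returns_by_domain):
    """Worst-case (minimum) per-domain reward of a single fixed policy."""
    return min(domain_reward(r) for r in returns_by_domain.values())

# Hypothetical returns of one policy; the numbers are illustrative only.
returns_by_domain = {
    "red": [0.9, 1.0, 0.8],
    "green": [0.9, 0.9, 0.9],
    "grey": [0.5, 0.7, 0.6],
}
worst = worst_case_reward(returns_by_domain)
worst_domain = min(returns_by_domain,
                   key=lambda d: domain_reward(returns_by_domain[d]))
```

In practice only the training domains are available when optimizing, which is why additional structure relating training and test domains is needed.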

3 Background: Invariance and Causality

In this section, we provide a brief exposition of causality and its relationship to invariance. We refer the reader to [26, 5, 23, 2] for a more thorough introduction.

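Definition 1 (Structural Causal Model [26]).

A structural causal model (SCM) $\mathcal{M}$ governing the random vector $X = (X_1, \dots, X_n)$ is a collection of $n$ assignments:

$$X_j \leftarrow f_j\big(\mathrm{Pa}(X_j), N_j\big), \quad j = 1, \dots, n, \tag{2}$$

where $\mathrm{Pa}(X_j) \subseteq \{X_1, \dots, X_n\} \setminus \{X_j\}$ are the parents of $X_j$, $N_j$ are independent noise variables, and each $f_j$ is any mapping from these variables to $X_j$. The graph of an SCM is obtained by associating one vertex for each $X_j$ and edges from each parent in $\mathrm{Pa}(X_j)$ to $X_j$. We assume acyclic causal graphs. We refer to the elements of $\mathrm{Pa}(X_j)$ as the direct causes of $X_j$.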

Figure 2: (a) Causal graph corresponding to the RL setting we consider. Here, $s_t$ is the state, $o_t$ is the observation, and $a_t$ is the action. The reward $r_t$ depends only on $\tilde{s}_t$ and $a_t$. (b) We seek a representation $\Phi$ such that there exists an action-predictor $\pi$ that is simultaneously optimal across domains.

Figure 2 shows the causal graph for the RL problem formulation in Section 2. Here, the reward $r_t$ depends on the action $a_t$ and a set of “reward-relevant” variables $\tilde{s}_t$. Thus, the parents (direct causes) of the reward are $\mathrm{Pa}(r_t) = \{\tilde{s}_t, a_t\}$. In our running colored-keys example, $\tilde{s}_t$ is purely a function of the agent’s state and the goal location.

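Definition 2 (Intervention [26]).

Consider an SCM $\mathcal{M}$. An intervention $e$ replaces one or more of the structural assignments to obtain a new SCM $\mathcal{M}^e$ with assignments:

$$X_j^e \leftarrow f_j^e\big(\mathrm{Pa}^e(X_j^e), N_j^e\big), \quad j = 1, \dots, n. \tag{3}$$

We say that the variables whose structural assignments we have replaced have been intervened on.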

Modularity principle [6, 26]. This principle establishes a close relationship between causality and invariance: a set of variables $\{X_1, \dots, X_k\}$ are the direct causes of $Y$ if and only if the conditional probability $p(Y \mid X_1, \dots, X_k)$ remains invariant for all interventions where $Y$ has not been intervened on. This is also related to the notion of “autonomy” and the principle of independent mechanisms [26, Ch. 2.1]. As an example, consider the reward $r_t$ to be the variable of interest in Figure 2. Then, $\{\tilde{s}_t, a_t\}$ are the direct causes of $r_t$ if and only if for all interventions where $r_t$ has not been intervened on, $p(r_t \mid \tilde{s}_t, a_t)$ remains invariant. Thus, in the context of the colored-keys example, $\tilde{s}_t$ does not contain any color-related information.
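The modularity principle can be checked numerically on a toy SCM. The sketch below is an illustrative assumption rather than anything from the paper: a chain $X_1 \to Y \to X_2$ with linear assignments, where each domain intervenes on the assignments of $X_1$ and $X_2$ (by changing their noise scales) but never on that of $Y$. A least-squares slope is used as a simple summary of each conditional relationship.

```python
import numpy as np

def sample_domain(rng, n, sigma_cause, sigma_effect):
    """Sample from a toy SCM x1 -> y -> x2; the two sigmas are the
    quantities that interventions change from domain to domain."""
    x1 = rng.normal(0.0, sigma_cause, n)        # direct cause of y (intervened on)
    y = 2.0 * x1 + rng.normal(0.0, 1.0, n)      # assignment for y: never intervened on
    x2 = y + rng.normal(0.0, sigma_effect, n)   # effect of y (intervened on)
    return x1, x2, y

def slope(x, y):
    """Least-squares slope (no intercept) of y regressed on x."""
    return float(np.dot(x, y) / np.dot(x, x))

rng = np.random.default_rng(0)
x1_a, x2_a, y_a = sample_domain(rng, 50_000, sigma_cause=1.0, sigma_effect=0.5)
x1_b, x2_b, y_b = sample_domain(rng, 50_000, sigma_cause=0.5, sigma_effect=3.0)

# Regressing on the direct cause gives (nearly) the same slope in both domains.
cause_slopes = (slope(x1_a, y_a), slope(x1_b, y_b))
# Regressing on the non-causal variable gives a slope that shifts with the domain.
effect_slopes = (slope(x2_a, y_a), slope(x2_b, y_b))
```

The predictor built on the direct cause is stable under these interventions, while the one built on the effect variable is not, which is the behavior an invariant representation is meant to exploit.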

Invariant Risk Minimization (IRM). This approach [5] exploits the modularity principle in the context of supervised learning. One assumes datasets $\{(x_i^d, y_i^d)\}_{i=1}^{n_d}$ from multiple training domains¹ $d \in \mathcal{D}_{\text{tr}}$ corresponding to different interventions on the data-generating process that do not intervene on the target variable $Y$. Here $x_i^d \in \mathcal{X}$ and $y_i^d \in \mathcal{Y}$. The goal is to learn a data representation $\Phi: \mathcal{X} \to \mathcal{H}$ that elicits an invariant predictor $w \circ \Phi$ across training domains, i.e., a representation $\Phi$ such that there exists a classifier $w: \mathcal{H} \to \mathcal{Y}$ that is simultaneously optimal for all training domains $d \in \mathcal{D}_{\text{tr}}$. Intuitively, the representation $\Phi$ should capture the direct causes of $Y$ and thus eliminate any features in $X$ that spuriously correlate with $Y$.

¹ We note that [5] uses the term “environment” instead of “domain”. However, we use “domain” since “environment” has a different meaning in RL contexts.

The optimization problem associated with IRM is a challenging bilevel one. The authors of [5] propose IRM-v1, where one fixes a “dummy” linear classifier $w = 1.0$ and learns a representation $\Phi$ that is approximately locally optimal in all training domains:

$$\min_{\Phi} \; \sum_{d \in \mathcal{D}_{\text{tr}}} \Big[ L^d(\Phi) + \lambda \, \big\| \nabla_{w \mid w = 1.0} \, L^d(w \cdot \Phi) \big\|^2 \Big], \tag{4}$$

where $L^d(w \cdot \Phi)$ is the loss incurred by $w \cdot \Phi$ on domain $d$, and $\lambda \geq 0$ is a penalty weight.
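To make the IRM-v1 penalty concrete, here is a minimal sketch for the special case of a linear representation and squared loss. This setup is an assumption chosen for illustration (the names `risk_and_penalty` and `irmv1_objective` are hypothetical): with prediction $w \cdot (X\phi)$, the gradient of the domain risk with respect to the scalar $w$ at $w = 1$ has a closed form.

```python
import numpy as np

def risk_and_penalty(phi, X, y):
    """Squared-loss risk of the predictor 1.0 * (X @ phi) on one domain and
    the squared gradient of that risk w.r.t. the dummy scalar w at w = 1."""
    z = X @ phi
    residual = z - y
    risk = float(np.mean(residual ** 2))
    grad_w = float(np.mean(2.0 * residual * z))   # d/dw mean((w*z - y)^2) at w = 1
    return risk, grad_w ** 2

def irmv1_objective(phi, domains, lam):
    """Sum over domains of risk + lam * penalty."""
    total = 0.0
    for X, y in domains:
        risk, penalty = risk_and_penalty(phi, X, y)
        total += risk + lam * penalty
    return total

# Tiny noise-free example: y equals the first feature, so phi = (1, 0) is optimal.
X = np.array([[1.0, 2.0], [2.0, -1.0], [3.0, 0.5]])
y = X[:, 0]
value_at_invariant = irmv1_objective(np.array([1.0, 0.0]), [(X, y)], lam=10.0)
value_at_spurious = irmv1_objective(np.array([0.0, 1.0]), [(X, y)], lam=10.0)
```

A representation for which the dummy classifier is already optimal in every domain incurs zero penalty; any other representation is pushed away by the gradient term.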

IRM Games. Inspired by IRM, the authors of [2] demonstrate that the set of invariant predictors corresponds to the set of pure Nash equilibria of a game played among $|\mathcal{D}_{\text{tr}}|$ players. Each player (corresponding to a training domain $d$) can choose its own classifier $w^d$ and is trying to maximize a utility function: $-L^d(w^{\text{av}} \circ \Phi)$, where $L^d$ denotes the loss on domain $d$ and $w^{\text{av}} := \frac{1}{|\mathcal{D}_{\text{tr}}|} \sum_{d \in \mathcal{D}_{\text{tr}}} w^d$. While finding Nash equilibria for continuous games is challenging in general, the game theory literature contains several heuristic schemes. In [2], the authors propose a strategy based on best response dynamics [7], where players take turns maximizing their utility functions. The resulting algorithm achieves similar or better empirical performance compared to IRM-v1, with significantly reduced variance.

4 Invariant Policy Optimization

We now describe our novel reinforcement learning algorithm, which we refer to as invariant policy optimization (IPO). The key insight behind this algorithm is to implement the following invariance principle: learn a representation $\Phi$ that maps observations $o_t$ to features $\Phi(o_t)$ in a manner that supports invariant action prediction (see Figure 2). More precisely, the goal is to learn a representation $\Phi$ such that there exists an “action-predictor” $\pi$ built on top of this representation that is simultaneously optimal across all training domains². We will refer to the resulting policy $\pi \circ \Phi$ as an invariant policy. This invariance principle can be formally embodied as the following optimization problem:

$$\max_{\Phi, \pi} \; \sum_{d \in \mathcal{D}_{\text{tr}}} R^d(\pi \circ \Phi) \quad \text{s.t.} \quad \pi \in \arg\max_{\bar{\pi}} \; R^d(\bar{\pi} \circ \Phi), \;\; \forall d \in \mathcal{D}_{\text{tr}}. \tag{5}$$

² For ease of exposition, we discuss the case of memoryless policies. However, it is straightforward to handle policies with memory (e.g., by augmenting observations with a memory state).

Here, $R^d$ is the reward associated with domain $d$, as defined in Section 2. Intuitively, given a set of training domains, IPO attempts to learn a representation that corresponds to the “causes of successful actions”. This interpretation elucidates the role of the different training domains; these must correspond to different interventions on the causal graph shown in Figure 2 that leave optimal actions unaffected. Assuming a diverse set of training domains, one learns a representation that eliminates features that spuriously correlate with good actions (i.e., actions that achieve high rewards). For example, in the colored-keys problem, such a representation corresponds to one that eliminates color from observations. By eliminating such features, an invariant policy generalizes well to novel domains corresponding to unseen interventions on the spurious/irrelevant features.

Our algorithmic approach for IPO is inspired by the game-theoretic formulation of [2] (see Section 3). We endow each domain $d$ with its own policy $\pi^d$ and define an overall ensemble policy $\pi^{\text{av}} := \frac{1}{|\mathcal{D}_{\text{tr}}|} \sum_{d \in \mathcal{D}_{\text{tr}}} \pi^d$. The optimization problem behind IPO then becomes:

$$\max_{\Phi, \pi^{\text{av}}} \; \sum_{d \in \mathcal{D}_{\text{tr}}} R^d(\pi^{\text{av}} \circ \Phi) \quad \text{s.t.} \quad \pi^d \in \arg\max_{\bar{\pi}^d} \; R^d\Big( \tfrac{1}{|\mathcal{D}_{\text{tr}}|} \Big[ \bar{\pi}^d + \sum_{q \neq d} \pi^q \Big] \circ \Phi \Big), \;\; \forall d \in \mathcal{D}_{\text{tr}}. \tag{6}$$

Next, we relate the optimization problem (6) to a game played between $|\mathcal{D}_{\text{tr}}|$ players. Each player corresponds to a domain $d$ and chooses a policy $\pi^d$ to maximize its own utility function $R^d(\pi^{\text{av}} \circ \Phi)$. Since Problem (6) is identical to the one in [2] for finding invariant representations (with policies playing the role of classifiers), the results from [2] carry over to our setting. In particular, under mild technical assumptions on the policies, the set of pure Nash equilibria of the game corresponds to the set of invariant policies. We refer the reader to [2] for details on the technical assumptions, but note that these are satisfied by a wide range of function classes (e.g., ReLU networks with arbitrary depth, linear functions, and functions in $L^p$ spaces).

While finding Nash equilibria for continuous games such as the one above is difficult in general, the game theory literature has developed several approximate approaches that demonstrate good performance in practice. Here, we adapt the strategy based on best response dynamics [7] proposed in [2] to our setting. The resulting IPO training procedure is presented in Algorithm 1. The for-loop in lines 8–11 implements the best-response dynamics; the players (corresponding to the different domains) take turns choosing $\pi^d$ in order to optimize their own objective $R^d(\pi^{\text{av}} \circ \Phi)$. We choose to implement the updates using proximal policy optimization (PPO) [27]. However, this choice is not fundamental and one may implement the updates using other policy gradient methods. Line 5 of the algorithm periodically updates the representation $\Phi$. However, as demonstrated in [2], simply choosing $\Phi = I$ is an effective approach and can (under certain conditions) recover invariant predictors. Finally, we note that Algorithm 1 can also accommodate actor-critic versions of PPO (or other policy optimization methods). In this version, each domain $d$ has both an actor $\pi^d$ and a critic $V^d$. In the policy-update steps, one updates both the actor and the critic using PPO.

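1:  for iter = 1, 2, … do
2:     if Fixed-$\Phi$ then
3:         $\Phi \leftarrow I$
4:     else
5:         Update $\Phi$ via an iteration of proximal policy optimization
6:     end if
7:     for $k = 1, \dots, K$ do   ▷ $K$ rounds of policy updates per representation update
8:         for $d \in \mathcal{D}_{\text{tr}}$ do
9:             Update $\pi^d$ via an iteration of proximal policy optimization on $R^d(\pi^{\text{av}} \circ \Phi)$
10:            $\pi^{\text{av}} \leftarrow \frac{1}{|\mathcal{D}_{\text{tr}}|} \sum_{q \in \mathcal{D}_{\text{tr}}} \pi^q$
11:         end for
12:     end for
13:  end for
Algorithm 1 Invariant Policy Optimization (IPO)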
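The control flow of Algorithm 1 with the fixed-representation option can be sketched on a toy problem. Everything below is an illustrative assumption rather than the paper's setup: a one-step task with scalar state $s$, observation $o = (s, c_d\, s)$ where $c_d$ is a domain-specific distractor gain, linear policies $a = k \cdot o$, and reward $-(a - s)^2$, so that the expected reward in domain $d$ is $-(k \cdot (1, c_d) - 1)^2$. Plain gradient ascent stands in for PPO.

```python
import numpy as np

def domain_reward_grad(k_av, c):
    """Gradient of the toy reward -(k_av . v - 1)^2 w.r.t. k_av, with v = (1, c)."""
    v = np.array([1.0, c])
    return -2.0 * (k_av @ v - 1.0) * v

def ipo_fixed_phi(distractor_gains, iters=200, inner_steps=1, lr=0.5):
    """Best-response-style loop: domains take turns improving their own
    policy while the reward is evaluated on the ensemble (average) policy."""
    n = len(distractor_gains)
    policies = [np.zeros(2) for _ in range(n)]      # one linear policy per domain
    for _ in range(iters):                          # outer loop (representation stays identity)
        for _ in range(inner_steps):                # rounds of policy updates
            for d, c in enumerate(distractor_gains):
                k_av = sum(policies) / n            # current ensemble policy
                # Chain rule: the ensemble depends on policy d with weight 1/n.
                policies[d] = policies[d] + lr * domain_reward_grad(k_av, c) / n
    return sum(policies) / n

k_final = ipo_fixed_phi([1.0, -1.0])
```

In this toy case the ensemble converges to the policy that puts weight 1 on the state and weight 0 on the distractor. The example only illustrates the turn-taking structure; it is too simple to show the generalization benefit reported in Section 5.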

5 Examples

5.1 Linear Quadratic Regulator with Distractors

We first apply our approach to the linear quadratic regulator (LQR) problem [4] modified to include high-dimensional “distractor” observations. There has been a growing interest in LQR as a simplified surrogate for deep RL problems [12, 35, 13, 1]. Here we consider the output-feedback control [31] problem proposed in [29] as a benchmark for assessing generalization with respect to changes in the observation model. The dynamics of the system are described by $x_{t+1} = A x_t + B u_t$, where $x_t \in \mathbb{R}^{n_x}$ is the state, $u_t \in \mathbb{R}^{n_u}$ is the control input, and $A$ and $B$ are fixed matrices. The agent receives a high-dimensional sensor observation $o_t = \begin{bmatrix} W_c \\ W_d \end{bmatrix} x_t$, where $W_c$ and $W_d$ are semi-orthogonal matrices. This ensures that the portion of the observation corresponding to $W_c$ contains full information about the state, while $W_d x_t$ is a high-dimensional “distractor”. The goal is to choose policies of the form $u_t = K o_t$ in order to minimize the infinite-horizon LQR cost $\sum_{t=0}^{\infty} \big( x_t^\top Q x_t + u_t^\top R u_t \big)$.

In this setting, a domain corresponds to a particular choice of $W_d$; all other system parameters ($A$, $B$, $Q$, $R$, $W_c$) are shared across domains and unknown to the agent. During training time, one learns a policy using a set of training domains. At test time, the learned policy is assessed on a new domain. In the case where there is a single domain (used for both training and test) and the observation reduces to the state itself ($o_t = x_t$), one can find the globally optimal policy via gradient descent (even though the corresponding optimization problem is non-convex) [13]. However, as demonstrated in [29], simple policy gradient using the combined costs of multiple training domains finds a policy that overfits to the training domains in the more general setting considered here. Intuitively, this is because the learned policy fails to ignore the distractors.
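The following sketch shows one way to set up an LQR problem with distractor observations and evaluate the cost of a linear output-feedback gain. The dimensions, the scaling of $A$ (chosen here so that the open loop is stable), the identity choices for $B$, $Q$, $R$, and $W_c$, and the helper names are all assumptions made for illustration; they are not the experimental configuration. The cost is computed for initial states with identity covariance by iterating the discrete Lyapunov recursion.

```python
import numpy as np

def semi_orthogonal(rng, rows, cols):
    """Random matrix with orthonormal columns (requires rows >= cols)."""
    q, _ = np.linalg.qr(rng.normal(size=(rows, cols)))
    return q

def lqr_cost(A, B, Q, R, K, W, iters=500):
    """Infinite-horizon cost of u_t = K o_t with o_t = W x_t, for x_0 with
    identity covariance. Iterates P <- Q + F^T R F + A_cl^T P A_cl with
    F = K W, which converges when A_cl = A + B F is stable."""
    F = K @ W
    A_cl = A + B @ F
    stage = Q + F.T @ R @ F
    P = np.zeros_like(Q)
    for _ in range(iters):
        P = stage + A_cl.T @ P @ A_cl
    return float(np.trace(P))

rng = np.random.default_rng(0)
n_x, n_distract = 3, 10                       # illustrative sizes only
A = 0.5 * semi_orthogonal(rng, n_x, n_x)      # scaled orthogonal matrix (assumption)
B, Q, R, W_c = (np.eye(n_x) for _ in range(4))

# A gain that reads only the informative block and ignores the distractors.
K = np.hstack([-A, np.zeros((n_x, n_distract))])

costs = []
for _ in range(2):                            # two domains = two choices of W_d
    W_d = semi_orthogonal(rng, n_distract, n_x)
    W = np.vstack([W_c, W_d])
    costs.append(lqr_cost(A, B, Q, R, K, W))
```

Because this gain places zero weight on the distractor block, its cost is identical in both domains. A gain with nonzero distractor weights would generally see its cost change whenever $W_d$ changes, which is the overfitting mechanism described above.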

For our numerical experiments, we choose and . The matrices and are random orthogonal matrices, is , and the $W_d$ are random semi-orthogonal matrices (different for each domain). For IPO, we employ Algorithm 1 with the Fixed-$\Phi$ option. We choose a policy that averages policies corresponding to the training domains. Instead of PPO, we simply use gradient descent to perform policy updates. Optimization hyperparameters are provided in the appendix.

We compare our approach with two baselines: (i) gradient descent on $K$ using the combined cost of the training domains, and (ii) gradient descent using an overparameterized class of policies with two layers (i.e., $K = K_1 K_2$ with an intermediate hidden dimension). Interestingly, [29] found that this form of overparameterization induces an implicit regularization towards “simpler” policies (i.e., ones that are less “dependent” on the distractors). Table 1 compares the generalization performance of the learned policies to new domains as we vary the number of training domains. Here, the distractors have a dimensionality of 1000. Consistent with [29], we find that overparameterization forms a strong baseline for this problem. However, IPO significantly outperforms both baselines. As expected, performance improves with increasing number of training domains and tends towards the performance achieved by an “oracle” policy that has access to the full state on the test domain. Table 2 assesses the impact of changing the dimensionality of the distractors. Here, we fix the number of training domains to two. IPO achieves lower costs than both baselines for distractor dimensions of 500 and above; at dimension 100, overparameterization attains the lowest cost (50.9 versus 56.5 for IPO).

| Number of training domains | 2 | 3 | 4 | 5 | 10 |
|---|---|---|---|---|---|
| Gradient descent | 97.7 ± 5.4 | 90.0 ± 9.2 | 82.6 ± 4.3 | 78.6 ± 5.4 | 68.8 ± 3.9 |
| Overparameterization | 86.2 ± 3.0 | 75.3 ± 4.4 | 69.4 ± 1.8 | 64.5 ± 1.9 | 51.4 ± 1.0 |
| IPO (ours) | 78.8 ± 3.5 | 64.8 ± 2.3 | 57.7 ± 1.0 | 52.3 ± 1.8 | 43.2 ± 1.1 |
| LQR oracle | 32.1 | 32.1 | 32.1 | 32.1 | 32.1 |

Table 1: LQR with distractors: comparison of IPO with two baselines (gradient descent and overparameterization) with varying number of training domains and distractor dimension 1000. IPO demonstrates stronger generalization (i.e., lower costs) compared to the two baselines. The reported mean and std. dev. are across 10 different seeds.

| Distractor dimension | 100 | 500 | 1000 | 1500 | 2000 |
|---|---|---|---|---|---|
| Gradient descent | 65.9 ± 3.9 | 88.0 ± 10.6 | 97.7 ± 5.4 | 106.8 ± 7.3 | 115.0 ± 2.7 |
| Overparameterization | 50.9 ± 1.4 | 72.5 ± 2.4 | 86.2 ± 3.0 | 97.4 ± 3.9 | 105.1 ± 3.8 |
| IPO (ours) | 56.5 ± 2.2 | 68.8 ± 2.7 | 78.8 ± 3.5 | 87.2 ± 3.0 | 93.0 ± 3.2 |
| LQR oracle | 32.1 | 32.1 | 32.1 | 32.1 | 32.1 |

Table 2: LQR with distractors: comparison of IPO with two baselines (gradient descent and overparameterization) with varying dimensionality of distractors and two training domains. IPO demonstrates stronger generalization (i.e., lower costs) than the two baselines for distractor dimensions of 500 and above. The reported mean and std. dev. are across 10 different seeds.

5.2 Colored-Key Domains

We now consider the colored-keys problem introduced in Section 1. In this example, a robot is placed in a grid-world that contains a goal (located in a room), a door, and a key (see Figure 1). The robot is presented with a reward if it reaches the goal. Using this sparse reward signal, it must learn to first navigate to the key, use this to open the door, and then navigate to the goal. In this setting, an environment corresponds to a particular configuration of the key, door, goal, and obstacles. Different domains correspond to different key colors.

We implement our approach on grid-worlds using MiniGrid [8]. Observations in MiniGrid are three-channel arrays; the three channels encode the object type (e.g., door), object color, and object state (e.g., open/closed) for a neighborhood around the robot. The robot receives a sparse reward of $1 - 0.9\,(t/T)$, where $t$ is the time taken to reach the goal and $T$ is the time-limit for completing the task. During training, the robot has access to environments from two domains corresponding to red and green keys. We use 48 training environments split evenly between these domains. At test-time, the robot is placed in environments with grey keys. This color choice is motivated by the fact that in MiniGrid, colors are encoded using integers (e.g., red: 0, green: 1), and grey corresponds to the color that is “furthest away” in terms of this encoding (grey: 5). For any given environment, the color of the key and the color of the door are the same. This ensures that the problem is always feasible, i.e., the robot will always be able to reach the goal if it learns the optimal policy. We implement IPO with the Fixed-$\Phi$ option and an actor-critic architecture; details and hyperparameters are provided in Appendix A.3. Table 3 reports the average rewards on 50 test environments from the training and test domains. We compare our approach to PPO [27] trained to maximize rewards combined across training environments. As the table illustrates, IPO achieves better generalization to the new domain and is also more consistent across training seeds.
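The domain shift in this example can be sketched directly on the observation encoding. The snippet below is a hedged illustration that does not call the MiniGrid library: it assumes observations are integer arrays of shape (height, width, 3) with channels (object type, color, state), uses only the color indices quoted above, and treats the key's object-type index and the 0.9 factor in the reward as assumptions that should be checked against the installed MiniGrid version.

```python
import numpy as np

# Color indices quoted in the text; other colors are omitted here.
COLOR_TO_IDX = {"red": 0, "green": 1, "grey": 5}

def sparse_reward(t, t_max):
    """Success reward that decays with the time taken; the 0.9 factor follows
    MiniGrid's usual convention and is an assumption here."""
    return 1.0 - 0.9 * (t / t_max)

def recolor(obs, object_type_idx, color):
    """Return a copy of an (H, W, 3) observation in which every cell whose
    type channel equals object_type_idx gets the given color index."""
    out = obs.copy()
    mask = out[..., 0] == object_type_idx
    out[mask, 1] = COLOR_TO_IDX[color]
    return out

KEY = 5  # hypothetical object-type index for "key"
obs = np.zeros((3, 3, 3), dtype=np.int64)
obs[1, 1] = (KEY, COLOR_TO_IDX["red"], 0)      # one red key in view
grey_obs = recolor(obs, KEY, "grey")           # same scene in the test domain
```

Only the color channel differs between `obs` and `grey_obs`, so a policy that ignores that channel for keys and doors would behave identically in the training and test domains.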

| Key color | Red (training) | Green (training) | Grey (testing) |
|---|---|---|---|
| PPO | 0.94 ± 0.004 | 0.94 ± 0.005 | 0.80 ± 0.12 |
| IPO (ours) | 0.94 ± 0.003 | 0.94 ± 0.003 | 0.85 ± 0.03 |

Table 3: Colored-key domains: comparison of the average reward on 50 test environments drawn from different domains. The reported mean and std. dev. are across 10 different seeds.

6 Discussion and Conclusions

We have considered the problem of learning policies with strong generalization beyond training domains. The key idea behind our Invariant Policy Optimization (IPO) approach is to learn representations that support invariant action prediction across different domains. We implemented the proposed techniques on: (i) linear quadratic regulator (LQR) problems with distractor observations, and (ii) an example where an agent must learn to navigate to a goal by opening a door using different colored keys in its environment. We compared our approach with standard policy gradient methods (e.g., PPO) and demonstrated significant improvements in generalization performance on unseen domains.

Future work. On the theoretical front, an important direction for future work is to provide rigorous guarantees on generalization to novel domains. One potential avenue is to combine the algorithmic techniques presented here with recent results on PAC-Bayes generalization theory applied to control and RL settings [17, 18]. On the algorithmic front, an interesting direction is to use domain randomization techniques to automatically generate new training domains that can be used to improve invariant policy learning (e.g., automatically generating domains with differently colored keys in the colored-keys example). Finally, a particularly promising future direction is to explore the application of IPO to robotics problems involving sim2real transfer, where one treats simulation and reality as different domains and learns an invariant policy across them.

The authors are grateful to Kartik Ahuja for helpful clarifications on the training procedure for IRM-Games. The authors would also like to thank Richard Song and Behnam Neyshabur for providing access to their code from [29] for the LQR example in Section 5.1.

This work is partially supported by the Office of Naval Research [Award Number: N00014-18-1-2873], the Google Faculty Research Award, the Amazon Research Award, and the National Science Foundation [IIS-1755038].


  • [1] N. Agarwal, B. Bullins, E. Hazan, S. M. Kakade, and K. Singh (2019) Online control with adversarial disturbances. arXiv preprint arXiv:1902.08721. Cited by: §5.1.
  • [2] K. Ahuja, K. Shanmugam, K. Varshney, and A. Dhurandhar (2020) Invariant risk minimization games. arXiv preprint arXiv:2002.04692. Cited by: §1.1, §3, §3, §4, §4.
  • [3] I. Akkaya, M. Andrychowicz, M. Chociej, M. Litwin, B. McGrew, A. Petron, A. Paino, M. Plappert, G. Powell, R. Ribas, et al. (2019) Solving rubik’s cube with a robot hand. arXiv preprint arXiv:1910.07113. Cited by: §1.1.
  • [4] B. Anderson and J. Moore (2007) Optimal control: linear quadratic methods. Courier Corporation. Cited by: §5.1.
  • [5] M. Arjovsky, L. Bottou, I. Gulrajani, and D. Lopez-Paz (2019) Invariant risk minimization. arXiv preprint arXiv:1907.02893. Cited by: §1.1, §3, §3, §3, footnote 1.
  • [6] E. Bareinboim, C. Brito, and J. Pearl (2012) Local characterizations of causal Bayesian networks. In Graph Structures for Knowledge Representation and Reasoning, pp. 1–17. Cited by: §3.
  • [7] E. Barron, R. Goebel, and R. Jensen (2010) Best response dynamics for continuous games. Proceedings of the American Mathematical Society 138 (3), pp. 1069–1083. Cited by: §3, §4.
  • [8] M. Chevalier-Boisvert, L. Willems, and S. Pal (2018) Minimalistic gridworld environment for OpenAI Gym. GitHub. Cited by: §A.3, §5.2.
  • [9] K. Cobbe, C. Hesse, J. Hilton, and J. Schulman (2019) Leveraging procedural generation to benchmark reinforcement learning. arXiv preprint arXiv:1912.01588. Cited by: §1.1.
  • [10] K. Cobbe, O. Klimov, C. Hesse, T. Kim, and J. Schulman (2019) Quantifying generalization in reinforcement learning. In Proceedings of the International Conference on Machine Learning, pp. 1282–1289. Cited by: §1.1, §1.1.
  • [11] I. Dasgupta, J. Wang, S. Chiappa, J. Mitrovic, P. Ortega, D. Raposo, E. Hughes, P. Battaglia, M. Botvinick, and Z. Kurth-Nelson (2019) Causal reasoning from meta-reinforcement learning. arXiv preprint arXiv:1901.08162. Cited by: §1.1.
  • [12] S. Dean, H. Mania, N. Matni, B. Recht, and S. Tu (2019) On the sample complexity of the linear quadratic regulator. Foundations of Computational Mathematics, pp. 1–47. Cited by: §5.1.
  • [13] M. Fazel, R. Ge, S. M. Kakade, and M. Mesbahi (2018) Global convergence of policy gradient methods for the linear quadratic regulator. arXiv preprint arXiv:1801.05039. Cited by: §5.1, §5.1.
  • [14] S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167. Cited by: §1.1.
  • [15] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §A.2.
  • [16] D. Krueger, E. Caballero, J. Jacobsen, A. Zhang, J. Binas, R. L. Priol, and A. Courville (2020) Out-of-distribution generalization via risk extrapolation (rex). arXiv preprint arXiv:2003.00688. Cited by: §1.1.
  • [17] A. Majumdar, A. Farid, and A. Sonar (2019) PAC-Bayes Control: learning policies that provably generalize to novel environments. arXiv preprint arXiv:1806.04225. Cited by: §1.1, §6.
  • [18] A. Majumdar and M. Goldstein (2018) PAC-Bayes Control: synthesizing controllers that provably generalize to novel environments. In Proceedings of the Conference on Robot Learning (CoRL), Cited by: §6.
  • [19] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller (2013) Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602. Cited by: §1.1.
  • [20] S. Nair, Y. Zhu, S. Savarese, and L. Fei-Fei (2019) Causal induction from visual observations for goal directed tasks. arXiv preprint arXiv:1910.01751. Cited by: §1.1.
  • [21] A. Nichol, V. Pfau, C. Hesse, O. Klimov, and J. Schulman (2018) Gotta learn fast: a new benchmark for generalization in reinforcement learning. arXiv preprint arXiv:1804.03720. Cited by: §1.1.
  • [22] V. Pacelli and A. Majumdar (2020) Learning task-driven control policies via information bottlenecks. arXiv preprint arXiv:2002.01428. Cited by: §1.1.
  • [23] J. Pearl (2009) Causality. Cambridge university press. Cited by: §1.1, §3.
  • [24] X. B. Peng, M. Andrychowicz, W. Zaremba, and P. Abbeel (2018) Sim-to-real transfer of robotic control with dynamics randomization. In IEEE International Conference on Robotics and Automation (ICRA), Cited by: §1.1, §1.1.
  • [25] J. Peters, P. Bühlmann, and N. Meinshausen (2016) Causal inference by using invariant prediction: identification and confidence intervals. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 78 (5), pp. 947–1012. Cited by: §1.1.
  • [26] J. Peters, D. Janzing, and B. Schölkopf (2017) Elements of causal inference: foundations and learning algorithms. MIT press. Cited by: §1.1, §3, §3, Definition 1, Definition 2.
  • [27] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347. Cited by: §4, §5.2.
  • [28] A. Sinha, H. Namkoong, and J. Duchi (2017) Certifying some distributional robustness with principled adversarial training. arXiv preprint arXiv:1710.10571. Cited by: §1.1.
  • [29] X. Song, Y. Jiang, Y. Du, and B. Neyshabur (2019) Observational overfitting in reinforcement learning. arXiv preprint arXiv:1912.02975. Cited by: §1.1, §5.1, §5.1, §5.1, §6.
  • [30] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov (2014) Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15 (1), pp. 1929–1958. Cited by: §1.1.
  • [31] V. L. Syrmos, C. T. Abdallah, P. Dorato, and K. Grigoriadis (1997) Static output feedback—a survey. Automatica 33 (2), pp. 125–137. Cited by: §5.1.
  • [32] J. Tan, T. Zhang, E. Coumans, A. Iscen, Y. Bai, D. Hafner, S. Bohez, and V. Vanhoucke (2018) Sim-to-real: learning agile locomotion for quadruped robots. arXiv preprint arXiv:1804.10332. Cited by: §1.1.
  • [33] S. Thrun, W. Burgard, and D. Fox (2005) Probabilistic robotics. MIT press. Cited by: §2.
  • [34] J. Tobin, W. Zaremba, and P. Abbeel (2017) Domain randomization and generative models for robotic grasping. arXiv preprint arXiv:1710.06425. Cited by: §1.1.
  • [35] S. Tu and B. Recht (2018) The gap between model-based and model-free methods on the linear quadratic regulator: an asymptotic viewpoint. arXiv preprint arXiv:1812.03565. Cited by: §5.1.
  • [36] S. Veer and A. Majumdar (2020) Probably approximately correct vision-based planning using motion primitives. arXiv preprint arXiv:2002.12852. Cited by: §1.1.
  • [37] A. Zhang, C. Lyle, S. Sodhani, A. Filos, M. Kwiatkowska, J. Pineau, Y. Gal, and D. Precup (2020) Invariant causal prediction for block mdps. arXiv preprint arXiv:2003.06016. Cited by: §1.1.

Appendix A Appendix

A.1 Computing platform

The examples presented in Section 5 are implemented on a desktop computer with six 3.50GHz Intel i7-7800X processors, 32GB RAM, and four Nvidia GeForce RTX 2080 GPUs.

A.2 Hyperparameters for the LQR example

We use the Adam optimizer [15] for our experiments with the learning rates shown in Table 4.

| Method               | Learning rate |
|----------------------|---------------|
| Gradient descent     | 0.001         |
| Overparameterization | 0.001         |
| IPO                  | 0.0005        |

Table 4: Learning rates for LQR problem.
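To make the role of these learning rates concrete, the following is a minimal sketch of the Adam update rule [15] in plain Python. The scalar objective f(x) = (x - 1)^2, the dictionary keys, the step count, and the default values of beta1, beta2, and eps are illustrative assumptions; they are not taken from the LQR experiments, which optimize a different cost.

```python
import math

# Learning rates from Table 4. The dictionary keys are illustrative names.
LQR_LEARNING_RATES = {
    "gradient_descent": 1e-3,
    "overparameterization": 1e-3,
    "ipo": 5e-4,
}


def adam_minimize(grad_fn, x0, lr, steps=100, beta1=0.9, beta2=0.999, eps=1e-8):
    """Run `steps` Adam updates on a scalar variable, given its gradient."""
    x, m, v = x0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad_fn(x)
        m = beta1 * m + (1 - beta1) * g        # first-moment estimate
        v = beta2 * v + (1 - beta2) * g * g    # second-moment estimate
        m_hat = m / (1 - beta1 ** t)           # bias correction
        v_hat = v / (1 - beta2 ** t)
        x -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return x


def toy_gradient(x):
    """Gradient of the assumed toy objective f(x) = (x - 1)^2."""
    return 2.0 * (x - 1.0)


# Same toy problem, one run per method-specific learning rate.
results = {
    name: adam_minimize(toy_gradient, x0=0.0, lr=lr)
    for name, lr in LQR_LEARNING_RATES.items()
}
```

Because Adam normalizes the gradient by its running magnitude, the very first update moves the variable by almost exactly the learning rate, so the smaller IPO rate translates directly into smaller early steps.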

A.3 Hyperparameters for Colored-Key Domains

We use the default actor-critic architecture (with no memory) used to train agents using PPO in MiniGrid [8]. This is shown in Figure 3. The hyperparameters for PPO are also the default ones used for MiniGrid (see Table 5). For IPO, the policy associated with each domain utilizes the same architecture shown in Figure 3. The parameters used for the policy-update step in IPO are shown in Table 5. These are identical to the ones used for PPO, with the exception of a lower learning rate.

Figure 3: Actor-critic architecture used for colored-keys example.

| Hyperparameter                         | PPO   | IPO    |
|----------------------------------------|-------|--------|
| # time-steps per rollout on environment | 128   | 128    |
| Epochs per rollout                     | 4     | 4      |
| Discount                               | 0.99  | 0.99   |
| GAE                                    | 0.95  | 0.95   |
| Batch size                             | 256   | 256    |
| Entropy bonus                          | 0.01  | 0.01   |
| PPO clip range                         | 0.2   | 0.2    |
| Learning rate                          | 0.001 | 0.0005 |
| Total time-steps                       | 120K  | 120K   |

Table 5: Hyperparameters for colored-keys example.
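As a rough illustration, the settings in Table 5 could be stored as a single configuration object from which the IPO variant is derived. The class name, field names, and helper function below are hypothetical and do not come from the MiniGrid training code. Only the values are from Table 5, with the first value column read as PPO and the second as IPO, and the GAE entry assumed to be the λ parameter.

```python
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class PolicyUpdateConfig:
    """Hyperparameters from Table 5 (field names are illustrative)."""
    steps_per_rollout: int = 128
    epochs_per_rollout: int = 4
    discount: float = 0.99
    gae_lambda: float = 0.95        # assumes the "GAE" row is the lambda parameter
    batch_size: int = 256
    entropy_bonus: float = 0.01
    clip_range: float = 0.2
    learning_rate: float = 1e-3
    total_timesteps: int = 120_000


# PPO uses the defaults; IPO's policy-update step only lowers the learning rate.
PPO_CONFIG = PolicyUpdateConfig()
IPO_CONFIG = replace(PPO_CONFIG, learning_rate=5e-4)


def differing_fields(a, b):
    """Return {field: (value_in_a, value_in_b)} for fields that differ."""
    da, db = asdict(a), asdict(b)
    return {key: (da[key], db[key]) for key in da if da[key] != db[key]}


print(differing_fields(PPO_CONFIG, IPO_CONFIG))
```

Deriving the IPO configuration from the PPO one keeps the two in sync and makes the single intended difference, the lower learning rate, explicit.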