One of the fundamental challenges with state-of-the-art reinforcement learning (RL) approaches is their limited ability to generalize beyond the specific domains they were trained on. The problem of generalization is particularly acute in complex robotics applications. Deploying an RL policy on a robot outside of the laboratory requires learning a policy that can generalize to a wide range of operating domains — especially when the application is safety-critical. For example, autonomous vehicles must contend with unfamiliar obstacles, lighting, and road conditions when deployed at scale; robotic manipulators deployed in homes must deal with unfamiliar objects and environment geometries; and robots operating in close proximity to humans must be able to handle new patterns of human motion.
As a simple example, consider the problem shown in Figure 1. A robot is placed in a grid-world and must learn to navigate to a goal located in a different room. In order to do this, it must learn to first navigate to a key, use this key to open the door, and then navigate to the goal. During the training phase, the robot is presented with environments containing red and green keys. A policy trained using standard RL techniques demonstrates strong performance when deployed in test environments with key colors seen during training. However, its performance significantly degrades when deployed in test environments with different key colors (see Section 5.2 for a thorough exploration of this problem).
Learning policies capable of such generalization remains challenging for a number of reasons. Primarily, RL algorithms have a tendency to memorize solutions to training environments, thereby achieving high training rewards with a brittle policy that will not generalize to novel environments. Moreover, learned policies often fail to ignore distractors in their sensor observations (e.g., the key colors) and are highly sensitive to changes in such irrelevant factors. The goal of this paper is to address these challenges and learn policies that achieve strong generalization across new operating domains given a limited set of training domains.
Statement of Contributions. We approach the problem of generalizing across domains (formalized in Section 2) with the following principle: a policy will generalize well if it exploits invariances resulting from causal relationships present across domains (e.g., key color does not cause rewards). To embody this principle, we leverage a close connection between causality and invariance (Section 3) in an approach we refer to as Invariant Policy Optimization (IPO). The key idea is to learn a representation such that the optimal policy built on top of this representation is invariant across training domains. Effectively, this approach attempts to learn and exploit the causes of successful actions. We demonstrate that the resulting algorithm achieves significantly stronger generalization compared to traditional on-policy methods in two different scenarios (Section 5): a linear-quadratic output feedback problem with distracting observations and an instantiation of the colored-key problem.
1.1 Related Work
Quantifying generalization. The problem of finding policies that generalize beyond their training domain has become increasingly prominent as reinforcement learning continues to mature, and a number of recent studies have attempted to quantify and understand the generalization challenge in RL. One line of work quantifies the effects of observational overfitting, where learned policies are sensitive to irrelevant observational features. Benchmark suites based on Sonic and Atari 2600 games have also been proposed to quantify generalization. Recently, CoinRun and the broader Procgen Benchmark used procedural generation of environments at controllable levels of difficulty to demonstrate that effective generalization can require an extremely large number of training environments. Another manifestation of the generalization gap is the sim2real problem in robotics: agents trained in simulation overfit to this domain and fail to operate on hardware [34, 24, 32].
Regularization and domain randomization.
The most common approach for improving the out-of-domain generalization of a learning algorithm is to add some form of regularization. Popular techniques borrowed from supervised learning include $\ell_2$ regularization, dropout, and batch normalization; each of these has been shown to improve generalization on novel CoinRun levels. While practical and easy to implement, these methods typically do not explicitly exploit any structure of the RL problem. Another approach is to constrain the agent's policy to depend only on a set of learned task-relevant variables, which are found by introducing an information-theoretic regularizer. This method has been shown to generalize to new domains that contain task-irrelevant features not present during training. However, the task-relevant variables are not guaranteed to exploit causal relationships in the environment, which is the focus of this paper. Data augmentation and domain randomization have also proven particularly useful in crossing the sim2real barrier [24, 3]. These methods are complementary to the approach presented here and could potentially be used to generate a diverse set of training domains for our method.
Distributional robustness. The PAC-Bayes Control approach [17, 36] provides a way to make provable generalization guarantees under distributional shifts. This approach is particularly useful in safety-critical applications where it is important to quantify the impact of switching between a training domain and a test domain. Another approach that provides robustness guarantees is to train against adversarial perturbations of the underlying data distribution. However, the challenge with both of these approaches is that they require an a priori bound on how much the test domain differs from the training domain (e.g., in terms of an $f$-divergence). In contrast, the recently proposed risk-extrapolation method promotes out-of-distribution generalization by encouraging robustness of hypotheses over affine combinations of training risks. This method has been shown to improve the performance of RL agents when their state space is augmented with noisy copies of the true system states.
Causality and invariance. Recently, the task of learning causal predictors has drawn interest in the supervised learning setting. One approach attempts to find features that are causally linked to a target variable by exploiting the invariance of causal relationships [23, 26]. This line of work was expanded upon by invariant risk minimization (IRM), which formulates the problem as finding a representation such that the optimal classifier built on top of this representation is invariant across domains. This results in classifiers that ignore spurious correlations that may exist in any single domain. The formulation leads to a challenging bilevel optimization problem and is tackled via a regularizer that approximates its solution. More recently, a game-theoretic reformulation of the IRM principle was proposed along with a new algorithm, known as IRM-Games, which offers better empirical results. Our approach adapts ideas from causality and invariance to RL settings by learning representations that invariantly predict actions. We provide more background on invariance, causality, and IRM in Section 3.
Causality in RL. Lastly, a number of recent methods attempt to exploit causality in RL. For example, it has been observed that, in some instances, causal reasoning can emerge in agents trained via meta-learning. Other approaches explicitly attempt to learn causal graphs that describe the dynamics of the agent's environment. Along these lines, one method proposes a two-phase training process in which interactions with the environment are first used to learn the causal graph of the environment, and a policy that exploits this graph is then trained. Finally, the IRM method has recently been applied to RL problems by learning a causal Markov decision process (MDP) that is bisimilar to the full MDP present during training. This formulation requires learning a model for both the causal and full dynamics of the system, a mapping between the two, and a causal model of the rewards. Standard RL algorithms are then used in conjunction with these causal models to produce a final policy. This approach is distinct from the one in this paper, which does not seek to find complete dynamical models (causal or otherwise). Instead, we focus on identifying the causes of successful actions, which is a simpler problem.
2 Problem Formulation
We are interested in the problem of zero-shot generalization to environments that can be significantly different from the environments seen during training. We formalize this as follows. Let $s_t = (s_t^{\text{agent}}, s_t^{\text{env}})$ denote the joint state of the agent and environment at time-step $t$. In the colored-keys example discussed in Section 1, $s_t^{\text{agent}}$ corresponds to the location of the agent at time-step $t$, while $s_t^{\text{env}}$ corresponds to the locations of the obstacles, key, door, and goal. In our formulation, different environments correspond to different (initial) states of the environment (e.g., different configurations of obstacles, key, door, and goal). We denote the agent's actions, observations, and rewards by $a_t$, $o_t$, and $r_t$ respectively.
During training, we assume access to multiple sets of environments drawn from different training domains $\mathcal{D}_{\text{train}} \subseteq \mathcal{D}$. We assume that the action space $\mathcal{A}$ and observation space $\mathcal{O}$ are shared across all domains (state spaces need not be shared). Each domain $d \in \mathcal{D}$ corresponds to a partially observable Markov decision process (POMDP) with its own dynamics mapping, observation mapping, and reward mapping. In the colored-keys example, domains differ (only) in terms of the observation mapping; in particular, each domain assigns a particular color to keys. Each domain also defines a distribution over environments.
Our goal is to learn a policy $\pi$ that generalizes to domains beyond the training domains (e.g., generalizing to domains with key colors not seen during training). More specifically, let $R_E(\pi)$ denote the expected cumulative reward (over either a finite or infinite horizon) when policy $\pi$ is executed in environment $E$. We would then like to maximize the worst-case rewards over all domains:

$$\max_\pi \; \min_{d \in \mathcal{D}} \; \mathbb{E}_{E \sim d} \left[ R_E(\pi) \right].$$
Without further assumptions on the relationship between $\mathcal{D}_{\text{train}}$ and $\mathcal{D}$, finding a policy that performs well on all domains in $\mathcal{D}$ may be impossible. We discuss this further in Section 4.
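The worst-case objective above can be made concrete with a small sketch. The reward table here is purely illustrative (the numbers and policies are not from the paper); it shows why the max-min criterion prefers a policy that is merely consistent over one that excels on some domains but fails on others.

```python
import numpy as np

# Hypothetical per-domain reward table: rewards[p, d] is the expected
# cumulative reward of candidate policy p when deployed in domain d.
# (Illustrative numbers only.)
rewards = np.array([
    [9.0, 8.5, 1.0],   # policy 0: great on domains 0-1, fails on domain 2
    [7.0, 7.2, 6.8],   # policy 1: moderate but consistent everywhere
])

# Worst-case reward of each policy over all domains.
worst_case = rewards.min(axis=1)

# The max-min objective prefers the policy whose *worst* domain is best.
best = int(worst_case.argmax())
assert best == 1  # the consistent policy wins under the worst-case criterion
```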
3 Background: Invariance and Causality
Definition 1 (Structural Causal Model).
A structural causal model (SCM) $\mathcal{C}$ governing the random vector $X = (X_1, \ldots, X_n)$ is a collection of assignments:

$$X_i := f_i(\mathrm{Pa}(X_i), N_i), \quad i = 1, \ldots, n,$$

where $\mathrm{Pa}(X_i) \subseteq \{X_1, \ldots, X_n\} \setminus \{X_i\}$ are the parents of $X_i$, the $N_i$ are independent noise variables, and each $f_i$ is any mapping from these variables to $X_i$. The graph of an SCM is obtained by associating one vertex with each $X_i$ and edges from each parent in $\mathrm{Pa}(X_i)$ to $X_i$. We assume acyclic causal graphs. We refer to the elements of $\mathrm{Pa}(X_i)$ as the direct causes of $X_i$.
Figure 2 shows the causal graph for the RL problem formulation in Section 2. Here, the reward $r_t$ depends on the action $a_t$ and a set of "reward-relevant" variables $s_t^r$. Thus, $\mathrm{Pa}(r_t) = \{a_t, s_t^r\}$. In our running colored-keys example, $s_t^r$ is purely a function of the agent's state and the goal location.
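A toy SCM consistent with this causal graph can be written directly as code; the variable names and mechanisms below are our own illustrative choices, not the paper's. Each line is one structural assignment $X_i := f_i(\mathrm{Pa}(X_i), N_i)$, and the key color is deliberately a non-parent of the reward.

```python
import random

# Toy SCM mirroring the causal graph of Figure 2 (illustrative only).
def sample_scm():
    s_r   = random.randint(0, 4)             # reward-relevant state (e.g., distance to goal)
    color = random.choice(["red", "green"])  # spurious factor; NOT a parent of r
    a     = random.randint(0, 1)             # action
    n_r   = random.gauss(0.0, 0.1)           # independent noise variable
    r     = (1.0 if (a == 1 and s_r == 0) else 0.0) + n_r  # r := f(s_r, a) + n_r
    return {"s_r": s_r, "color": color, "a": a, "r": r}

random.seed(0)
x = sample_scm()
assert set(x) == {"s_r", "color", "a", "r"}
```

Note that `color` appears nowhere in the assignment for `r`; this is exactly the structure that the invariance arguments below rely on.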
Definition 2 (Intervention).
Consider an SCM $\mathcal{C}$. An intervention $e$ replaces one or more of the structural assignments to obtain a new SCM $\mathcal{C}^e$ with assignments:

$$X_i := f_i^e(\mathrm{Pa}^e(X_i), N_i^e), \quad i = 1, \ldots, n.$$

We say that the variables whose structural assignments we have replaced have been intervened on.
Causal relationships are modular: the assignment $X_i := f_i(\mathrm{Pa}(X_i), N_i)$ captures the direct causes of $X_i$ if and only if the conditional probability $P(X_i \mid \mathrm{Pa}(X_i))$ remains invariant for all interventions where $X_i$ has not been intervened on. This is also related to the notion of "autonomy" and the principle of independent mechanisms [26, Ch. 2.1]. As an example, consider the reward $r_t$ to be the variable of interest in Figure 2. Then, $\{a_t, s_t^r\}$ are the direct causes of $r_t$ if and only if, for all interventions where $r_t$ has not been intervened on, $P(r_t \mid a_t, s_t^r)$ remains invariant. Thus, in the context of the colored-keys example, $s_t^r$ does not contain any color-related information.
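This invariance can be checked empirically on a toy example. In the sketch below (illustrative names and mechanisms, not the paper's), the training domain spuriously couples a non-causal color variable to the reward-relevant state; an intervention that forces the color breaks the spurious conditional while leaving the causal conditional untouched.

```python
import random
random.seed(1)

# An intervention on a non-cause (color) changes P(r | color) but
# leaves P(r | direct causes) invariant. Illustrative example only.
def sample(color_rule):
    s_r = random.randint(0, 1)          # reward-relevant state
    color = color_rule(s_r)             # structural assignment for color
    r = s_r                             # r := f(s_r): color is not a parent of r
    return s_r, color, r

def p_r_given(data, cond):
    sel = [r for s_r, c, r in data if cond(s_r, c)]
    return sum(sel) / len(sel)

train = [sample(lambda s: "red" if s else "green") for _ in range(40000)]
test  = [sample(lambda s: "red") for _ in range(40000)]  # intervention: color := red

# The spurious conditional changes across the intervention...
assert p_r_given(train, lambda s, c: c == "red") == 1.0
assert abs(p_r_given(test, lambda s, c: c == "red") - 0.5) < 0.02
# ...while the causal conditional stays invariant.
assert p_r_given(train, lambda s, c: s == 1) == p_r_given(test, lambda s, c: s == 1) == 1.0
```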
Invariant Risk Minimization (IRM). This approach exploits the modularity principle in the context of supervised learning. One assumes datasets $\{(x_i^e, y_i^e)\}$ from multiple training domains $e \in \mathcal{E}_{\text{tr}}$ (we note that the IRM literature uses the term "environment" instead of "domain"; however, we use "domain" since "environment" has a different meaning in RL contexts), corresponding to different interventions on the data-generating process that do not intervene on the target variable $Y$. Here $x_i^e \in \mathcal{X}$ and $y_i^e \in \mathcal{Y}$. The goal is to learn a data representation $\Phi$ that elicits an invariant predictor across training domains, i.e., a representation such that there exists a classifier $w$ built on top of $\Phi$ that is simultaneously optimal for all training domains. Intuitively, the representation should capture the direct causes of $Y$ and thus eliminate any features in $X$ that spuriously correlate with $Y$.
The optimization problem associated with IRM is a challenging bilevel one. The authors of IRM propose IRM-v1, in which one fixes a "dummy" linear classifier $w = 1.0$ and learns a representation $\Phi$ that is approximately locally optimal in all training domains:

$$\min_\Phi \; \sum_{e \in \mathcal{E}_{\text{tr}}} R^e(\Phi) + \lambda \, \big\| \nabla_{w \mid w = 1.0} \, R^e(w \cdot \Phi) \big\|^2,$$

where $R^e(\Phi)$ is the loss incurred by $\Phi$ on domain $e$.
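For intuition, the IRM-v1 objective can be evaluated in closed form for a linear representation and squared loss; for the scalar dummy classifier the gradient penalty reduces to a simple expectation. The sketch below (our own toy construction, not the paper's experiments) shows that a representation picking the causal feature scores lower than one picking a spuriously correlated feature.

```python
import numpy as np

rng = np.random.default_rng(0)

# IRM-v1 for a linear representation phi and the fixed scalar "dummy"
# classifier w = 1.0; squared loss. Illustrative sketch only.
def risk_and_penalty(phi, X, y):
    z = X @ phi                         # scalar feature from the representation
    resid = z - y                       # prediction of the dummy classifier is w*z = z
    risk = np.mean(resid ** 2)
    grad_w = 2.0 * np.mean(resid * z)   # d/dw of mean((w*z - y)^2) evaluated at w = 1
    return risk, grad_w ** 2

def irm_v1_objective(phi, domains, lam=100.0):
    return sum(risk + lam * pen
               for X, y in domains
               for risk, pen in [risk_and_penalty(phi, X, y)])

# Two toy domains: feature 0 is causal (y = x0); feature 1 is spurious,
# with a correlation whose sign flips across domains.
def make_domain(spurious_scale, n=2000):
    x0 = rng.normal(size=n)
    y = x0
    x1 = y * spurious_scale + rng.normal(size=n) * 0.1
    return np.stack([x0, x1], axis=1), y

domains = [make_domain(1.0), make_domain(-1.0)]
causal   = np.array([1.0, 0.0])
spurious = np.array([0.0, 0.5])
assert irm_v1_objective(causal, domains) < irm_v1_objective(spurious, domains)
```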
IRM Games. Inspired by IRM, the authors of IRM-Games demonstrate that the set of invariant predictors corresponds to the set of pure Nash equilibria of a game played among $|\mathcal{E}_{\text{tr}}|$ players. Each player (corresponding to a training domain $e$) chooses its own classifier $w^e$ and tries to maximize a utility function $u^e$, defined in terms of the loss incurred on domain $e$ by the ensemble of the players' classifiers composed with $\Phi$. Since computing Nash equilibria exactly is difficult in general, the authors propose a strategy based on best response dynamics, where players take turns maximizing their utility functions. The resulting algorithm achieves similar or better empirical performance as compared to IRM-v1, with significantly reduced variance.
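Best response dynamics itself is simple to illustrate on a continuous game where each player's best response is available in closed form; the two-player quadratic game below is our own minimal example, unrelated to the paper's specific utilities.

```python
# Best-response dynamics on a toy two-player continuous game with
# utilities u1 = -(x - y/2)^2 and u2 = -(y - x/2)^2 (illustrative).
# Each player's best response is available in closed form.
x, y = 5.0, -3.0
for _ in range(50):
    x = y / 2.0   # player 1 best-responds to player 2's current choice
    y = x / 2.0   # player 2 best-responds to player 1's new choice

# The iterates converge to the unique Nash equilibrium (0, 0).
assert abs(x) < 1e-6 and abs(y) < 1e-6
```

In this game each round shrinks the iterates by a factor of four, so turn-taking converges quickly; in general, convergence of best response dynamics is not guaranteed and is an empirical matter, as the IRM-Games authors note.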
4 Invariant Policy Optimization
We now describe our novel reinforcement learning algorithm, which we refer to as invariant policy optimization (IPO). The key insight behind this algorithm is to implement the following invariance principle: learn a representation $\Phi$ that maps observations $o_t$ to the reward-relevant variables $s_t^r$ in a manner that supports invariant action prediction (see Figure 2). More precisely, the goal is to learn a representation such that there exists an "action-predictor" $\pi$ built on top of this representation that is simultaneously optimal across all training domains (for ease of exposition, we discuss the case of memoryless policies; however, it is straightforward to handle policies with memory, e.g., by augmenting observations with a memory state). We will refer to the resulting policy as an invariant policy. This invariance principle can be formally embodied as the following optimization problem:

$$\max_{\Phi, \pi} \; \sum_{d \in \mathcal{D}_{\text{train}}} R_d(\pi \circ \Phi) \quad \text{s.t.} \quad \pi \in \arg\max_{\bar{\pi}} R_d(\bar{\pi} \circ \Phi), \;\; \forall d \in \mathcal{D}_{\text{train}}.$$
Here, $R_d(\pi) = \mathbb{E}_{E \sim d}[R_E(\pi)]$ is the expected reward associated with domain $d$, with $R_E$ as defined in Section 2. Intuitively, given a set of training domains, IPO attempts to learn a representation that captures the "causes of successful actions". This interpretation elucidates the role of the different training domains; these must correspond to different interventions on the causal graph shown in Figure 2 that leave optimal actions unaffected. Assuming a diverse set of training domains, one learns a representation that eliminates features that spuriously correlate with good actions (i.e., actions that achieve high rewards). For example, in the colored-keys problem, such a representation corresponds to one that eliminates color from the observations. By eliminating such features, an invariant policy generalizes well to novel domains corresponding to unseen interventions on the spurious/irrelevant features.
Our algorithmic approach for IPO is inspired by the game-theoretic formulation of IRM-Games (see Section 3). We endow each domain $d$ with its own policy $\pi^d$ and define an overall ensemble policy $\pi$ that combines them (e.g., by averaging). The optimization problem behind IPO then becomes:

$$\max_{\Phi,\, \pi^1, \ldots, \pi^N} \; \sum_{d \in \mathcal{D}_{\text{train}}} R_d(\pi \circ \Phi) \quad \text{s.t.} \quad \pi^d \in \arg\max_{\bar{\pi}^d} R_d\!\left(\pi[\bar{\pi}^d] \circ \Phi\right), \;\; \forall d \in \mathcal{D}_{\text{train}}, \quad (6)$$

where $N = |\mathcal{D}_{\text{train}}|$ and $\pi[\bar{\pi}^d]$ denotes the ensemble policy with the $d$-th component replaced by $\bar{\pi}^d$.
Next, we relate the optimization problem (6) to a game played between $|\mathcal{D}_{\text{train}}|$ players. Each player corresponds to a domain $d$ and chooses a policy $\pi^d$ to maximize its own utility function $u^d$. Since Problem (6) is identical in form to the IRM-Games problem for finding invariant representations (with policies playing the role of classifiers), the results of that work carry over to our setting. In particular, under mild technical assumptions on the policies, the set of pure Nash equilibria of the game corresponds to the set of invariant policies. We refer the reader to the IRM-Games paper for details on the technical assumptions, but note that these are satisfied by a wide range of function classes (e.g., linear functions and ReLU networks with arbitrary depth).
While finding Nash equilibria for continuous games such as the one above is difficult in general, the game theory literature has developed several approximate approaches that perform well in practice. Here, we adapt the best-response-dynamics strategy used by IRM-Games to our setting. The resulting IPO training procedure is presented in Algorithm 1. The for-loop in lines 8–11 implements the best-response dynamics: the players (corresponding to the different domains) take turns choosing $\pi^d$ in order to optimize their own objective $R_d$. We choose to implement the updates using proximal policy optimization (PPO). However, this choice is not fundamental, and one may implement the updates using other policy gradient methods. Line 5 of the algorithm periodically updates the representation $\Phi$. However, as demonstrated for IRM-Games, simply fixing $\Phi = I$ (the identity map) is an effective approach and can (under certain conditions) recover invariant predictors. Finally, we note that Algorithm 1 can also accommodate actor-critic versions of PPO (or other policy optimization methods). In this version, each domain $d$ has both an actor $\pi^d$ and a critic $V^d$. In the policy-update steps, one updates both the actor and the critic using PPO.
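The structure of the best-response loop can be sketched in a few lines. The per-domain "PPO update" below is a hypothetical stand-in: each player $d$ maximizes a toy reward $R_d(\pi) = -(\pi - \text{target}_d)^2$ of the averaged ensemble with respect to its own policy head, holding the other heads fixed. This is only the loop structure of Algorithm 1 under the Fixed-$\Phi$ option, not the paper's implementation.

```python
# Structural sketch of the best-response loop (Fixed-Phi option).
def train_ipo(targets, rounds=10):
    heads = [0.0 for _ in targets]            # one scalar policy head per domain
    for _ in range(rounds):
        for d, target in enumerate(targets):  # players take turns
            others = sum(heads) - heads[d]
            # Exact best response: pick head d so the averaged ensemble
            # hits this domain's optimum, other heads held fixed.
            heads[d] = len(heads) * target - others
    return sum(heads) / len(heads)            # averaged ensemble policy

# When an invariant optimum exists (all domains agree on the best
# action), best-response dynamics recovers it.
assert abs(train_ipo([1.0, 1.0, 1.0]) - 1.0) < 1e-9
```

In the actual algorithm, the exact best response is replaced by a few PPO gradient steps on each domain's rollouts, since closed-form responses are unavailable for neural network policies.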
5 Experiments

5.1 Linear Quadratic Regulator with Distractors
We first apply our approach to the linear quadratic regulator (LQR) problem modified to include high-dimensional "distractor" observations. There has been growing interest in LQR as a simplified surrogate for deep RL problems [12, 35, 13, 1]. Here we consider the output-feedback control problem proposed as a benchmark for assessing generalization with respect to changes in the observation model. The dynamics of the system are described by $x_{t+1} = A x_t + B u_t$, where $x_t$ is the state, $u_t$ is the control input, and $A$ and $B$ are fixed matrices. The agent receives a high-dimensional sensor observation $o_t = [W_c; W_d]\, x_t$, where $W_c$ and $W_d$ are semi-orthogonal matrices. This ensures that the portion of the observation corresponding to $W_c$ contains full information about the state, while $W_d x_t$ is a high-dimensional "distractor". The goal is to choose policies of the form $u_t = K o_t$ in order to minimize the infinite-horizon LQR cost $\mathbb{E} \sum_t \left( x_t^\top Q x_t + u_t^\top R u_t \right)$.
In this setting, a domain corresponds to a particular choice of $W_d$; all other system parameters ($A$, $B$, $W_c$, $Q$, $R$) are shared across domains and unknown to the agent. At training time, one learns a policy using multiple training domains. At test time, the learned policy is assessed on a new domain. In the case where there is a single domain (used for both training and test), one can find the globally optimal policy via gradient descent (even though the corresponding optimization problem is non-convex). However, as demonstrated in prior work, simple policy gradient using the combined costs of multiple training domains finds a policy that overfits to the training domains in the more general setting considered here. Intuitively, this is because the learned policy fails to ignore the distractors.
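The observation model above is easy to instantiate numerically. The sketch below uses our own illustrative dimensions (the paper's values are not reproduced here) and builds semi-orthogonal matrices from the QR decomposition of random Gaussian matrices; it verifies that the informative block alone recovers the state while the distractor block adds no information.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions for the distractor-LQR observation model.
n, k = 4, 8                                   # state dim, distractor dim

def semi_orthogonal(rows, cols):
    # QR of a random Gaussian matrix yields orthonormal columns (rows >= cols).
    q, _ = np.linalg.qr(rng.normal(size=(rows, cols)))
    return q

W_c = semi_orthogonal(n, n)                    # square orthogonal block: informative
W_d = semi_orthogonal(k, n)                    # domain-specific distractor block

x = rng.normal(size=n)                         # system state
o = np.concatenate([W_c @ x, W_d @ x])         # high-dimensional observation

# The informative block alone recovers the state exactly.
assert np.allclose(W_c.T @ o[:n], x)
# The distractor map is semi-orthogonal: W_d^T W_d = I.
assert np.allclose(W_d.T @ W_d, np.eye(n))
```

Sampling a fresh `W_d` per domain is what makes the distractor portion of the observation domain-specific, so a policy that weights those coordinates will not transfer to a new domain.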
For our numerical experiments, the matrices $A$ and $B$ are random orthogonal matrices, and the $W_d$ are random semi-orthogonal matrices (different for each domain). For IPO, we employ Algorithm 1 with the Fixed-$\Phi$ option. We choose a policy that averages the policies corresponding to the training domains. Instead of PPO, we simply use gradient descent to perform policy updates. Optimization hyperparameters are provided in Appendix A.2.
We compare our approach with two baselines: (i) gradient descent on a single policy using the combined cost of the training domains, and (ii) gradient descent using an overparameterized class of policies with two layers. Interestingly, prior work found that this form of overparameterization induces an implicit regularization towards "simpler" policies (i.e., ones that are less "dependent" on the distractors). Table 1 compares the generalization performance of the learned policies on new domains as we vary the number of training domains, with the dimensionality of the distractors held fixed. Consistent with prior findings, we find that overparameterization forms a strong baseline for this problem. However, IPO significantly outperforms both baselines. As expected, performance improves with an increasing number of training domains and tends towards the performance achieved by an "oracle" policy that has access to the full state on the test domain. Table 2 assesses the impact of changing the dimensionality of the distractors. Here, we fix the number of training domains to two. Again, we observe that IPO demonstrates significantly improved performance (i.e., lower costs).
Table 1: Generalization performance (LQR cost on the test domain) as the number of training domains varies over 2, 3, 4, 5, and 10.
5.2 Colored-Key Domains
We now consider the colored-keys problem introduced in Section 1. In this example, a robot is placed in a grid-world that contains a goal (located in a room), a door, and a key (see Figure 1). The robot receives a reward if it reaches the goal. Using this sparse reward signal, it must learn to first navigate to the key, use the key to open the door, and then navigate to the goal. In this setting, an environment corresponds to a particular configuration of the key, door, goal, and obstacles. Different domains correspond to different key colors.
We implement our approach on grid-worlds using MiniGrid. Observations in MiniGrid correspond to $7 \times 7 \times 3$ arrays of values; the three channels encode the object type (e.g., door), object color, and object state (e.g., open/closed) for a neighborhood around the robot. The robot receives a sparse reward of $1 - 0.9\,(t/T)$, where $t$ is the time taken to reach the goal and $T$ is the time-limit for completing the task. During training, the robot has access to environments from two domains corresponding to red and green keys. We use 48 training environments split evenly between these domains. At test time, the robot is placed in environments with grey keys. This color choice is motivated by the fact that in MiniGrid, colors are encoded using integers (e.g., red: 0, green: 1), and grey corresponds to the color that is "furthest away" in terms of this encoding (grey: 5). For any given environment, the color of the key and the color of the door are the same. This ensures that the problem is always feasible, i.e., the robot will always be able to reach the goal if it learns the optimal policy. We implement IPO with the Fixed-$\Phi$ option and an actor-critic architecture; details and hyperparameters are provided in Appendix A.3. Table 3 reports the average rewards on 50 test environments from the training and test domains. We compare our approach to PPO trained to maximize rewards combined across training environments. As the table illustrates, IPO achieves better generalization to the new domain and is also more consistent across training seeds.
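The sparse reward schedule here follows MiniGrid's default goal reward, $1 - 0.9\,(t/T)$ for step count $t$ and time-limit $T$; we assume the paper uses it unchanged. A minimal sketch:

```python
# MiniGrid's default sparse reward: given only upon reaching the goal,
# and larger when the goal is reached faster.
def goal_reward(t, T):
    assert 0 <= t <= T
    return 1.0 - 0.9 * (t / T)

assert goal_reward(0, 128) == 1.0                 # instant success
assert abs(goal_reward(128, 128) - 0.1) < 1e-12   # success at the deadline
assert goal_reward(32, 128) > goal_reward(64, 128)  # faster is better
```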
Table 3: Average rewards for key colors red (training), green (training), and grey (testing).
6 Discussion and Conclusions
We have considered the problem of learning policies with strong generalization beyond training domains. The key idea behind our Invariant Policy Optimization (IPO) approach is to learn representations that support invariant action prediction across different domains. We implemented the proposed techniques on: (i) linear quadratic regulator (LQR) problems with distractor observations, and (ii) an example where an agent must learn to navigate to a goal by opening a door using different colored keys in its environment. We compared our approach with standard policy gradient methods (e.g., PPO) and demonstrated significant improvements in generalization performance on unseen domains.
Future work. On the theoretical front, an important direction for future work is to provide rigorous guarantees on generalization to novel domains. One potential avenue is to combine the algorithmic techniques presented here with recent results on PAC-Bayes generalization theory applied to control and RL settings [17, 18]. On the algorithmic front, an interesting direction is to use domain randomization techniques to automatically generate new training domains that can be used to improve invariant policy learning (e.g., automatically generating domains with different colored keys in the colored-keys example). Finally, a particularly promising direction is to apply IPO to robotics problems involving sim2real transfer, treating simulation and reality as different domains across which to learn an invariant policy.
The authors are grateful to Kartik Ahuja for helpful clarifications on the training procedure for IRM-Games. The authors would also like to thank Richard Song and Behnam Neyshabur for providing access to their code for the LQR example in Section 5.1.
This work is partially supported by the Office of Naval Research [Award Number: N00014-18-1-2873], the Google Faculty Research Award, the Amazon Research Award, and the National Science Foundation [IIS-1755038].
-  (2019) Online control with adversarial disturbances. arXiv preprint arXiv:1902.08721. Cited by: §5.1.
-  (2020) Invariant risk minimization games. arXiv preprint arXiv:2002.04692. Cited by: §1.1, §3, §3, §4, §4.
-  (2019) Solving Rubik's Cube with a robot hand. arXiv preprint arXiv:1910.07113. Cited by: §1.1.
-  (2007) Optimal control: linear quadratic methods. Courier Corporation. Cited by: §5.1.
-  (2019) Invariant risk minimization. arXiv preprint arXiv:1907.02893. Cited by: §1.1, §3, §3, §3, footnote 1.
-  Local characterizations of causal Bayesian networks. In Graph Structures for Knowledge Representation and Reasoning, pp. 1–17. Cited by: §3.
-  (2010) Best response dynamics for continuous games. Proceedings of the American Mathematical Society 138 (3), pp. 1069–1083. Cited by: §3, §4.
-  (2018) Minimalistic gridworld environment for OpenAI Gym. GitHub. Note: https://github.com/maximecb/gym-minigrid Cited by: §A.3, §5.2.
-  (2019) Leveraging procedural generation to benchmark reinforcement learning. arXiv preprint arXiv:1912.01588. Cited by: §1.1.
-  (2019) Quantifying generalization in reinforcement learning. In Proceedings of the International Conference on Machine Learning, pp. 1282–1289. Cited by: §1.1, §1.1.
-  (2019) Causal reasoning from meta-reinforcement learning. arXiv preprint arXiv:1901.08162. Cited by: §1.1.
-  (2019) On the sample complexity of the linear quadratic regulator. Foundations of Computational Mathematics, pp. 1–47. Cited by: §5.1.
-  (2018) Global convergence of policy gradient methods for the linear quadratic regulator. arXiv preprint arXiv:1801.05039. Cited by: §5.1, §5.1.
-  (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167. Cited by: §1.1.
-  (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §A.2.
-  (2020) Out-of-distribution generalization via risk extrapolation (REx). arXiv preprint arXiv:2003.00688. Cited by: §1.1.
-  (2019) PAC-Bayes Control: learning policies that provably generalize to novel environments. arXiv preprint arXiv:1806.04225. Cited by: §1.1, §6.
-  (2018) PAC-Bayes Control: synthesizing controllers that provably generalize to novel environments. In Proceedings of the Conference on Robot Learning (CoRL), Cited by: §6.
-  (2013) Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602. Cited by: §1.1.
-  (2019) Causal induction from visual observations for goal directed tasks. arXiv preprint arXiv:1910.01751. Cited by: §1.1.
-  (2018) Gotta learn fast: a new benchmark for generalization in reinforcement learning. arXiv preprint arXiv:1804.03720. Cited by: §1.1.
-  (2020) Learning task-driven control policies via information bottlenecks. arXiv preprint arXiv:2002.01428. Cited by: §1.1.
-  (2009) Causality. Cambridge university press. Cited by: §1.1, §3.
-  (2018) Sim-to-real transfer of robotic control with dynamics randomization. In IEEE International Conference on Robotics and Automation (ICRA), Cited by: §1.1, §1.1.
-  (2016) Causal inference by using invariant prediction: identification and confidence intervals. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 78 (5), pp. 947–1012. Cited by: §1.1.
-  (2017) Elements of causal inference: foundations and learning algorithms. MIT press. Cited by: §1.1, §3, §3, Definition 1, Definition 2.
-  (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347. Cited by: §4, §5.2.
-  (2017) Certifying some distributional robustness with principled adversarial training. arXiv preprint arXiv:1710.10571. Cited by: §1.1.
-  (2019) Observational overfitting in reinforcement learning. arXiv preprint arXiv:1912.02975. Cited by: §1.1, §5.1, §5.1, §5.1, §6.
-  (2014) Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15 (1), pp. 1929–1958. Cited by: §1.1.
-  (1997) Static output feedback—a survey. Automatica 33 (2), pp. 125–137. Cited by: §5.1.
-  (2018) Sim-to-real: learning agile locomotion for quadruped robots. arXiv preprint arXiv:1804.10332. Cited by: §1.1.
-  (2005) Probabilistic robotics. MIT press. Cited by: §2.
-  (2017) Domain randomization and generative models for robotic grasping. arXiv preprint arXiv:1710.06425. Cited by: §1.1.
-  (2018) The gap between model-based and model-free methods on the linear quadratic regulator: an asymptotic viewpoint. arXiv preprint arXiv:1812.03565. Cited by: §5.1.
-  (2020) Probably approximately correct vision-based planning using motion primitives. arXiv preprint arXiv:2002.12852. Cited by: §1.1.
-  (2020) Invariant causal prediction for block MDPs. arXiv preprint arXiv:2003.06016. Cited by: §1.1.
Appendix A Appendix
A.1 Computing platform
The examples presented in Section 5 are implemented on a desktop computer with six 3.50GHz Intel i7-7800X processors, 32GB RAM, and four Nvidia GeForce RTX 2080 GPUs.
A.2 Hyperparameters for the LQR example
A.3 Hyperparameters for Colored-Key Domains
We use the default actor-critic architecture (with no memory) used to train agents using PPO in MiniGrid . This is shown in Figure 3. The hyperparameters for PPO are also the default ones used for MiniGrid (see Table 5). For IPO, the policy associated with each domain utilizes the same architecture shown in Figure 3. The parameters used for the policy-update step in IPO are shown in Table 5. These are identical to the ones used for PPO, with the exception of a lower learning rate.
Table 5 (excerpt):

| Hyperparameter | PPO | IPO |
| --- | --- | --- |
| # time-steps per rollout per environment | 128 | 128 |
| Epochs per rollout | 4 | 4 |
| PPO clip range | 0.2 | 0.2 |