1 Introduction
The canonical reinforcement learning (RL) problem assumes an agent interacting with a single MDP with a fixed observation space and dynamics structure. This assumption is difficult to ensure in practice, where state spaces are often large and infeasible to explore entirely during training. However, there is often latent structure that can be leveraged to achieve good generalization. As an example, a robot’s sensors may be moved, or the lighting conditions in a room may change, but the physical dynamics of the environment remain the same. These are examples of environment-specific characteristics that current RL algorithms often overfit to. In the worst case, some training environments may contain spurious correlations that will not be present at test time, causing catastrophic failures in generalization (Zhang et al., 2018a; Song et al., 2020). To develop algorithms that are robust to these sorts of changes, we must consider problem settings that allow for multiple environments with a shared dynamics structure.
Recent prior works (Amit and Meir, 2018; Yin et al., 2019) have developed generalization bounds for the multi-task problem, but they depend on the number of tasks seen at training time, which can be prohibitively expensive given how sample inefficient RL is even in the single-task regime. To obtain stronger generalization results, we propose to consider a problem which we refer to as ‘multi-environment’ RL: like multi-task RL, the agent seeks to maximize return on a set of environments, of which only a subset can be trained on. We make the assumption that there exists some latent causal structure that is shared among all of the environments, and that the sources of variability between environments do not affect reward. This family of environments is called a Block MDP (Du et al., 2019), in which the observations may change, but the latent states, dynamics, and reward function are the same. A formal definition of this type of MDP will be presented in Section 3.
Though the setting we consider is a subset of the multi-task RL problem, we show in this work that the added assumption of shared structure allows for much stronger generalization results than have been obtained by multi-task approaches. Naive application of generalization bounds to the multi-task reinforcement learning setting yields very loose guarantees, because the learner is typically given access to only a few tasks relative to the number of samples from each task. Indeed, Cobbe et al. (2018); Zhang et al. (2018b) find that agents trained using standard methods require many thousands of environments before succeeding at ‘generalizing’ to new environments.
The main contribution of this paper is to use tools from causal inference to address generalization in the Block MDP setting, proposing a new method based on the invariant causal prediction literature. In certain linear function approximation settings, we demonstrate that this method will, with high probability, learn an optimal state abstraction that generalizes across all environments using many fewer training environments than would be necessary for standard PAC bounds. We replace this PAC requirement with requirements from causal inference on the types of environments seen at training time. We then draw a connection between bisimulation and the minimal causal set of variables found by our algorithm, providing bounds on the model error and sample complexity of the method. We further show that using analogous invariant prediction methods in the nonlinear function approximation setting can yield improved generalization performance over multi-task and single-task baselines. We relate this method to previous work on learning representations of MDPs (Gelada et al., 2019; Luo et al., 2019) and develop multi-task generalization bounds for such representations.
2 Background
2.1 State Abstractions and Bisimulation
State abstractions have been studied as a way to distinguish relevant from irrelevant information (Li et al., 2006) in order to create a more compact representation for easier decision making and planning. Bertsekas and Castanon (1989); Roy (2006) provide bounds for approximation errors for various aggregation methods, and Li et al. (2006) discuss the merits of abstraction discovery as a way to solve related MDPs.
Bisimulation relations are a type of state abstraction that offers a mathematically precise definition of what it means for two environments to ‘share the same structure’ (Larsen and Skou, 1989; Givan et al., 2003). We say that two states are bisimilar if they share the same expected reward and equivalent distributions over the next bisimilar states. For example, if a robot is given the task of washing the dishes in a kitchen, changing the wallpaper in the kitchen doesn’t change anything relevant to the task. One could then define a bisimulation relation that equates observations based on the locations and soil levels of dishes in the room and ignores the wallpaper. These relations can be used to simplify the state space for tasks like policy transfer (Castro and Precup, 2010), and are intimately tied to state abstraction. For example, the model-irrelevance abstraction described by Li et al. (2006) is precisely characterized as a bisimulation relation.
Definition 1 (Bisimulation Relations (Givan et al., 2003)).
Given an MDP M = (S, A, R, T, γ), an equivalence relation B between states is a bisimulation relation if, for all states s1, s2 ∈ S that are equivalent under B (i.e. s1 ≡_B s2), the following conditions hold for all actions a ∈ A:

R(s1, a) = R(s2, a),
T(G | s1, a) = T(G | s2, a)  for all G ∈ S/B,

where S/B denotes the partition of S under the relation B, the set of all groups G of equivalent states, and where T(G | s, a) = Σ_{s' ∈ G} T(s' | s, a).
Whereas this definition was originally designed for the single-MDP setting to find bisimilar states within one MDP, we are now trying to find bisimilar states across different MDPs, or different experimental conditions. One can intuitively think of this carrying over by imagining all experimental conditions mapped to a single super-MDP with state space S = ∪_e S_e, where we give up the irreducibility assumption, i.e. we can no longer reach every state s ∈ S from any other state s' ∈ S. Specifically, we say that two MDPs M1 and M2 are bisimilar if there exist bisimulation relations B1 and B2 such that S1/B1 is isomorphic to S2/B2. Bisimilar MDPs are therefore MDPs which are behaviourally the same.
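To make Definition 1 concrete, both conditions can be checked directly on a small finite MDP. The sketch below is our own illustration (the function and data-structure names are not from the paper); it verifies whether a candidate partition is a bisimulation:

```python
import itertools

def is_bisimulation(actions, R, T, partition):
    """Check Definition 1 on a finite MDP. R[s][a] is the reward, T[s][a]
    a dict {next_state: probability}, and partition a list of sets of
    states proposed as equivalence classes."""
    def block_prob(s, a, block):
        return sum(p for s2, p in T[s][a].items() if s2 in block)
    for block in partition:
        for s1, s2 in itertools.combinations(sorted(block), 2):
            for a in actions:
                if R[s1][a] != R[s2][a]:
                    return False  # equivalent states must share rewards
                if any(abs(block_prob(s1, a, b) - block_prob(s2, a, b)) > 1e-9
                       for b in partition):
                    return False  # and block-transition probabilities
    return True
```

For instance, two states that yield the same reward and transition to the same block are grouped together, while grouping states with different rewards fails the check.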
2.2 Causal Inference Using Invariant Prediction
Peters et al. (2016) first introduced an algorithm, Invariant Causal Prediction (ICP), to find the causal feature set, the minimal set of features which are causal predictors of a target variable, by exploiting the fact that causal models have an invariance property (Pearl, 2009; Schölkopf et al., 2012). Arjovsky et al. (2019) extend this work by proposing invariant risk minimization (IRM, see Equation 1), augmenting empirical risk minimization to learn a data representation free of spurious correlations. They assume there exists some partition of the training data into experiments e ∈ E, and that the model’s predictions take the form Y^e = w^T Φ(X^e). IRM aims to learn a representation Φ for which the optimal linear classifier w is invariant across all e ∈ E, where optimality is defined as minimizing the empirical risk R^e. We can then expect this representation and classifier to have low risk in new experiments e' ∉ E, which have the same causal structure as the training set:

min_{Φ, w} Σ_{e ∈ E} R^e(w ∘ Φ)  subject to  w ∈ argmin_{w'} R^e(w' ∘ Φ) for all e ∈ E.   (1)
The IRM objective in Equation 1 can be thought of as a constrained optimization problem, where the objective is to learn a set of features for which the optimal classifier in each environment is the same. Conditioned on the environments corresponding to different interventions on the data-generating process, this is hypothesized to yield features that only depend on variables that bear a causal relationship to the predicted value. Because the constrained optimization problem is not generally feasible to optimize, Arjovsky et al. (2019) propose a penalized optimization problem with a schedule on the penalty term as a tractable alternative.
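The penalized form can be sketched for the least-squares case with a scalar ‘dummy’ classifier fixed at w = 1.0, following the IRMv1 construction. This is a minimal NumPy sketch under that assumption, with our own function names, not the authors’ implementation:

```python
import numpy as np

def irm_penalty(f, y):
    """Squared gradient of the per-environment squared-error risk with
    respect to a scalar dummy classifier w, evaluated at w = 1.0."""
    grad = np.mean(2.0 * (1.0 * f - y) * f)  # d/dw E[(w*f - y)^2] at w = 1
    return grad ** 2

def irm_objective(envs, lam):
    """Sum of per-environment empirical risks plus the lam-weighted
    invariance penalty; envs is a list of (features, targets) pairs."""
    risk = sum(np.mean((f - y) ** 2) for f, y in envs)
    penalty = sum(irm_penalty(f, y) for f, y in envs)
    return risk + lam * penalty
```

A representation whose optimal classifier is already shared across environments incurs zero penalty, so increasing lam pushes the learner toward invariant features.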
3 Problem Setup
We consider a family of environments M_E = {M_e}_{e ∈ E}, where E is some index set. For simplicity of notation, we drop the subscript when referring to the union over all environments. Our goal is to use a subset of these environments to learn a representation which enables generalization of a learned policy to every environment in the family. We denote the number of training environments as N. We assume that the environments share some structure, and consider different degrees to which this structure may be shared.
3.1 The Block MDP
Block MDPs (Du et al., 2019) are described by a tuple (S, A, X, p, q, R) with a finite, unobservable state space S, finite action space A, and possibly infinite, but observable, space X. Here p denotes the latent transition distribution p(s' | s, a) for s, s' ∈ S, a ∈ A; q is the (possibly stochastic) emission function that gives the observations from the latent state, q(x | s) for x ∈ X, s ∈ S; and R is the reward function. A graphical model of the interactions between the various variables can be found in Figure 1.
Assumption 1 (Block structure (Du et al., 2019)).
Each observation x uniquely determines its generating state s. That is, the observation space X can be partitioned into disjoint blocks X_s, each containing the support of the conditional distribution q(· | s).
This assumption gives us the Markov property in X. We translate the block MDP to our multi-environment setting as follows. If a family of environments satisfies the block MDP assumption, then each e ∈ E corresponds to an emission function q_e, with S, A, p, and R shared for all e. We move the potential randomness from q_e into an auxiliary variable η drawn from some probability space Ω, and write q_e(s, η). Further, we require that if an observation lies in the range of both q_e(s, ·) and q_{e'}(s', ·), then s = s'. The objective is to learn a useful state abstraction to promote generalization across the different emission functions q_e, given that only a subset is provided for training. Song et al. (2020) also describe a similar POMDP setting with an additional observation function, but one in which information can be lost. We note that this problem can be made arbitrarily difficult if each q_e has a disjoint range, but we will focus on settings where the q_e overlap in structured ways – for example, where q_e is the concatenation of the state and noise variables: q_e(s, η) = (s, η).
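The structured-overlap case is easy to picture: the observation simply concatenates the latent state with environment-specific noise, so the generating state can be recovered by slicing. The following toy sketch (our own illustration, with hypothetical names) shows why Assumption 1 holds in this case:

```python
import numpy as np

def emission(state, noise_dim, rng):
    """q_e(s, eta) = (s, eta): concatenate the latent state with
    environment-specific noise; the state block is left untouched."""
    eta = rng.normal(size=noise_dim)
    return np.concatenate([state, eta])

def generating_state(observation, state_dim):
    """The block structure lets us recover s exactly from x."""
    return observation[:state_dim]
```

Every observation produced this way determines its generating state uniquely, regardless of which environment’s noise distribution produced the tail.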
3.2 Relaxations
Spurious correlations. Our initial presentation of the block MDP assumes that the noise variable η is sampled randomly at every time step, which prevents multi-timestep correlations (Figure 1 in black, solid lines). We therefore also consider a more realistic relaxed block MDP, where spurious variables may have different transition dynamics across the different environments so long as these correlations do not affect the expected reward (Figure 1, now including black dashed lines). This is equivalent to augmenting each block MDP in our family with a noise variable η_e, such that the observation is given by x = (s, η_e), with η_e evolving according to an environment-specific transition distribution.
We note that this setting still satisfies Assumption 1.
Realizability. Though our analysis will require Assumption 1, we claim that this is a reasonable requirement as it makes the learning problem realizable. Relaxing Assumption 1 means that the value function learning problem may become ill-posed, as the same observation can map to entirely different states in the latent MDP with different values, making our environment partially observable (a POMDP, Figure 1 with grey lines). We provide a lower bound on the value approximation error attainable in this setting in the appendix (Proposition 2).
3.3 Assumptions on causal structure
State abstraction and causal inference both aim to eliminate spurious features in a learning algorithm’s input. However, these two approaches are applied to drastically different types of problems. Though we demonstrate that causal inference methods can be applied to reinforcement learning, this will require some assumption on how causal mechanisms are observed. Definitions of the notation used in this section are deferred to the appendix, though they are standard in the causal inference literature.
The key assumption we make is that the variables in the environment state at time t can only affect the values of the state at time t+1, and can only affect the reward at time t+1. This assumption allows us to consider the state and action at time t as the only candidate causal parents of the state at time t+1 and of the reward at time t+1. This assumption is crucial to the Markov behaviour of the Markov decision process. We refer the reader to Figure 2 for a demonstration of how causal graphical models can be translated to this setting.

Assumption 2 (Temporal Causal Mechanisms).
Let x^1 and x^2 be components of the observation x. Then when no intervention is performed on the environment, we have the following independence:

x^1_{t+1} ⊥ x^2_{t+1} | x_t, a_t.   (2)
Assumption 3 (Environment Interventions).
Each environment e ∈ E corresponds to an intervention (Eberhardt and Scheines, 2007) on variables of the observation x that are not causal ancestors of the reward.
This assumption allows us to use tools from causal inference to identify candidate model-irrelevance state abstractions that may hold across an entire family of MDPs, rather than only the ones observed, based on using the state at one timestep to predict values at the next timestep. In the setting of Assumption 3, we can reconstruct the block MDP emission function by concatenating the spurious variables in the observation to the latent state. We discuss some constraints on interventions necessary to satisfy the block MDP assumption in the appendix.
4 Connecting State Abstractions to Causal Feature Sets
Invariant causal prediction aims to identify a set S of causal variables such that a linear predictor with support on S will attain consistent performance over all environments. In other words, ICP removes irrelevant variables from the input, just as state abstractions remove irrelevant information from the environment’s observations. An attractive property of the block MDP setting is that it is easy to show that there exists a model-irrelevance state abstraction for all MDPs in the family – namely, the function mapping each observation x to its generating latent state s. The formalization and proof of this statement are deferred to the appendix (see Theorem 4).
We consider whether, under Assumptions 1–3, such a state abstraction can be obtained by ICP. Intuitively, one would expect the causal variables to have nice properties as a state abstraction. The following result confirms this to be the case: a state abstraction that selects the set of causal variables from the observation space of a block MDP will be a model-irrelevance abstraction for every environment e ∈ E.
Theorem 1.
Consider a family of MDPs M_E satisfying Assumptions 1–3. Let S_R be the set of variables such that the reward R(x, a) is a function only of [x]_{S_R} (x restricted to the indices in S_R), and let AN(S_R) denote the ancestors of S_R in the (fully observable) causal graph corresponding to the transition dynamics of M_E. Then the state abstraction φ(x) = [x]_{AN(S_R)} is a model-irrelevance abstraction for every e ∈ E.
An important detail in the previous result is that the model-irrelevance state abstraction incorporates not just the parents of the reward, but also its ancestors. This is because in RL, we seek to model the return rather than the reward alone, which requires a state abstraction that can capture multi-timestep interactions. We provide an illustration of this in Figure 2. As a concrete example, we note that in the popular benchmark CartPole, only the cart position and pole angle are necessary to predict the reward; however, predicting the return also requires their respective velocities.
Learning a minimal φ in the setting of Theorem 1 from a single training environment may not always be possible. However, applying invariant causal prediction methods in the multi-environment setting will yield the minimal causal set of variables when the training environment interventions satisfy certain conditions necessary for the identifiability of the causal variables (Peters et al., 2016).
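The ancestor-closure step in Theorem 1 is a simple graph traversal. A sketch of it, using our own function name and a CartPole-style parent map that is purely illustrative:

```python
def causal_feature_set(parents, reward_parents):
    """Return the ancestors (including themselves) of the reward's parent
    variables in a causal DAG given as {variable: set of its parents}."""
    keep, stack = set(), list(reward_parents)
    while stack:
        v = stack.pop()
        if v not in keep:
            keep.add(v)
            stack.extend(parents.get(v, ()))
    return keep
```

On a CartPole-like graph where position and angle cause the reward but are themselves driven by the velocities, the closure pulls the velocities in as well, while purely spurious variables are left out.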
5 Block MDP Generalization Bounds
We continue to relax the assumptions needed to learn a causal representation and look to the nonlinear setting. As a reminder, the goal of this work is to produce representations that will generalize from the training environments to a novel test environment. However, normal PAC generalization bounds require a much larger number of environments than one could expect to obtain in the reinforcement learning setting. The appeal of an invariant representation is that it may allow for theoretical guarantees on learning the right state abstraction with many fewer training environments, as discussed by Peters et al. (2016). If the learned state abstraction is close to capturing the true base MDP, then the model error in the test environment can be bounded by a function of the distance of the test environment’s abstract state distribution to the training environments’. Though the requirements given in the following Theorem 2 are difficult to guarantee in practice, the result will hold for any arbitrary learned state abstraction.
Theorem 2 (Model error bound).
Consider an MDP M̄, with M denoting a coarser bisimulation of M̄. Let φ denote the mapping from states of M̄ to states of M. Suppose that the dynamics of M are L_T-Lipschitz w.r.t. φ(x), and that T is some approximate transition model satisfying E_{x ∼ η} ‖T(φ(x)) − φ(x')‖ ≤ δ for some δ > 0, where the expectation is over transitions x → x' from a state distribution η. Let W_1 denote the 1-Wasserstein distance. Then, for a test state distribution η',

E_{x ∼ η'} ‖T(φ(x)) − φ(x')‖ ≤ δ + 2 L_T W_1(η, η').   (3)

Proof found in Appendix B.
Instead of assuming access to a bisimilar MDP M̄, we can provide discrepancy bounds for an MDP M' produced by a learned state representation φ, dynamics function f, and reward function R', using the distance in dynamics and reward of M' to the underlying MDP M. We first define these distances,

ε_R := sup_{x ∈ X, a ∈ A} | R(x, a) − R'(φ(x), a) |,
ε_T := sup_{x ∈ X, a ∈ A} W_1( φ T(· | x, a), f(· | φ(x), a) ).   (4)

Theorem 3.
Let M be a block MDP and M' the learned invariant MDP with a mapping φ : X → S'. For any L_V-Lipschitz-valued policy π, the value difference of that policy is bounded by

| V^π_M(x) − V^π_{M'}(φ(x)) | ≤ (ε_R + γ L_V ε_T) / (1 − γ),   (5)

where V^π_M is the value function for π in M and V^π_{M'} is the value function for π in M'.
Proof found in Appendix B. This gives us a bound on generalization performance that depends on the supremum of the dynamics and reward errors; estimating these is a regression problem whose difficulty depends on n, the number of samples we have in aggregate over all training environments, rather than on the number of training environments N. Recent generalization bounds for deep neural networks based on Rademacher complexity (Bartlett et al., 2017a; Arora et al., 2018) scale with a factor of 1/√m, where m is the number of samples. We can take m = n in our setting, giving generalization bounds for the block MDP setting that scale with the number of samples in aggregate over all environments – an improvement over previous multi-task bounds that depend on the number of training environments N.

6 Methods
Given these theoretical results, we propose two methods to learn invariant representations in the block MDP setting. Both methods take inspiration from invariant causal prediction. The first is a direct application of linear ICP to select the causal variables in the state in the setting where variables are given; this corresponds to direct feature selection, which with high probability returns the minimal causal feature set. The second is a gradient-based approach akin to the IRM objective, with no assumption of a linear causal relationship, which learns a causal invariant representation. Like the IRM goal (Equation 1), we aim to learn an invariant state abstraction from stochastic observations across different interventions, and impose an additional invariance constraint.

6.1 Variable Selection for Linear Predictors
The following algorithm (Algorithm 1) returns a model-irrelevance state abstraction. We require the presence of a replay buffer D, in which transitions are stored and tagged with the environment from which they came. The algorithm then applies ICP iteratively to find all causal ancestors of the reward. This approach inherits many nice properties from ICP – under suitable identifiability conditions, it returns the exact causal variable set to a specified degree of confidence.
It also inherits inconvenient properties: the ICP algorithm is exponential in the number of variables, so this method is not efficient for high-dimensional observation spaces. We are also restricted to considering linear relationships of the observation to the reward and next state. Further, because we take the union over iterative applications of ICP, the confidence parameter used in each call must be adjusted accordingly; given d observation variables, dividing the confidence parameter by d gives a conservative (union) bound.
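The outer loop of Algorithm 1 can be sketched with the ICP call abstracted as an oracle `icp(target, alpha)` that returns the estimated direct causes of `target`. The names and the exact confidence split below are our own simplification; the real algorithm runs linear ICP on replay-buffer data:

```python
def find_causal_ancestors(num_variables, icp, alpha, reward="reward"):
    """Iteratively apply an ICP oracle to collect all causal ancestors of
    the reward, splitting the confidence level conservatively across the
    at most num_variables + 1 calls (union bound)."""
    alpha_each = alpha / (num_variables + 1)
    ancestors, frontier = set(), {reward}
    while frontier:
        target = frontier.pop()
        for parent in icp(target, alpha_each):
            if parent not in ancestors:
                ancestors.add(parent)
                frontier.add(parent)
    return ancestors
```

With a perfect oracle, the returned set is exactly the ancestor closure of the reward’s parents, matching the abstraction of Theorem 1.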
6.2 Learning a Model-irrelevance State Abstraction
We design an objective to learn a dynamics-preserving state abstraction φ, or model-irrelevance abstraction (Li et al., 2006), where the similarity of the model is bounded by the model error in the environment setting shown in Figure 1. This requires disentangling the state space into a minimal representation that causes reward, s, and everything else, η. Our algorithm proceeds as follows.
We assume the existence of an invariant state embedding, whose mapping function we denote by φ. We also assume an invariant dynamics model f_s, a task-specific dynamics model f_e, and an invariant reward model, all operating in the embedding space. To incorporate a meaningful objective and ground the learned representation, we need a decoder φ^{-1}. We assume N training environments are given. The overall dynamics and reward objectives combine the prediction errors of the invariant and task-specific dynamics models with the reward prediction error, under data collected from behavioral policies for each experimental setting.
Of course, this does not guarantee that the representation learned by φ will be minimal, so we incorporate additional regularization as an incentive. We train a task classifier on the shared latent representation with cross-entropy loss, and employ an adversarial loss (Tzeng et al., 2017) on φ to maximize the entropy of the classifier output, ensuring that task-specific information does not pass through to φ.
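The entropy-maximization incentive can be rendered numerically; this is our own minimal NumPy sketch of the two opposing losses, not the authors’ training code:

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def classifier_loss(logits, env_ids):
    """Cross-entropy the environment classifier minimizes."""
    p = softmax(logits)
    return -np.mean(np.log(p[np.arange(len(env_ids)), env_ids] + 1e-12))

def encoder_adversarial_loss(logits):
    """The encoder minimizes negative entropy of the classifier output,
    pushing the classifier towards a uniform prediction so that no
    environment information leaks into the shared representation."""
    p = softmax(logits)
    return np.mean(np.sum(p * np.log(p + 1e-12), axis=-1))
```

The encoder’s loss is smallest when the classifier is maximally confused (uniform output), which is exactly the point at which the shared representation carries no environment identity.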
This gives us a final objective (Equation 6) combining the dynamics, reward, and decoding losses with the classifier cross-entropy loss and the adversarial entropy term, where α_1 and α_2 are hyperparameters weighting the two regularization terms and H denotes entropy (Algorithm 2).

7 Results
We evaluate both linear and nonlinear versions of MISA in corresponding Block MDP settings with both linear and nonlinear dynamics. First, we examine model error in environments with low-dimensional (Section 7.1.1) and high-dimensional (Section 7.1.2) observations, and demonstrate the ability of MISA to zero-shot generalize to unseen environments. We next look to imitation learning in a rich observation setting (Section 7.2) and show that nonlinear MISA generalizes to new camera angles. Finally, we explore end-to-end reinforcement learning in the low-dimensional observation setting with correlated noise (Section 7.3) and again show generalization capabilities where single-task and multi-task methods fail.

7.1 Model Learning
7.1.1 Linear Setting
We first evaluate the linear MISA algorithm in Algorithm 1. To empirically evaluate whether eliminating spurious variables from a representation is necessary to guarantee generalization, we consider a simple family of MDPs whose state contains a causal component and a spurious noise component, with a transition structure in which only the causal component affects reward. We train on 3 environments with soft interventions on each noise variable. We then run the linear MISA algorithm on batch data from these 3 environments to get a state abstraction φ, and train two linear predictors: one on φ(x) and one on the full state x. We then evaluate the generalization performance on novel environments that correspond to different hard interventions on the value of the noise variable. We observe that the predictor trained on φ(x) attains zero generalization error because it zeros out the spurious variable automatically. However, any nonzero weight on the spurious variable in the least-squares predictor will lead to arbitrarily large generalization error, which is precisely what we observe in Figure 3.
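A toy reproduction of this effect (our own construction, not the paper’s exact environment family) fits one least-squares predictor on the full observation and one restricted to the causal variable, then applies a hard intervention to the spurious variable at test time:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
s = rng.normal(size=n)                     # causal state variable
eta = s + 0.01 * rng.normal(size=n)        # spurious, tightly correlated with s
y = 2.0 * s + 0.5 * rng.normal(size=n)     # reward depends on s only

# Full-observation least squares spreads weight onto the spurious variable.
X = np.column_stack([s, eta])
w_full, *_ = np.linalg.lstsq(X, y, rcond=None)
w_causal = np.linalg.lstsq(s[:, None], y, rcond=None)[0]

# Hard intervention at test time: eta is clamped to a constant.
s_test = rng.normal(size=n)
X_test = np.column_stack([s_test, np.full(n, 10.0)])
err_full = np.mean((X_test @ w_full - 2.0 * s_test) ** 2)
err_causal = np.mean((s_test * w_causal[0] - 2.0 * s_test) ** 2)
```

The restricted predictor is unaffected by the intervention, while the full predictor’s nonzero weight on the spurious coordinate is amplified by the clamped value.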
7.1.2 Rich Observation Setting
We next test the gradient-based MISA method (Algorithm 2) in a setting with nonlinear dynamics and rich observations. Instead of having access to observation variables and selecting the minimal causal feature set, we are tasked with learning the invariant causal representation. We randomly initialize the background colors of two training environments from Deepmind Control (Tassa et al., 2018), and randomly initialize another two backgrounds for evaluation. The orange line in Figure 4 shows performance on the evaluation environments in comparison to three baselines. In the first, we train our method on a single environment and test on another (MISA – 1 env); without more than a single experiment to observe at training time, there is no way to disentangle what is causal in terms of dynamics and what is not. In the second baseline, we combine data from the two environments and train a model over all data (Baseline – 1 decoder); here the error is tempered by seeing variance across the two environments at training time, but this baseline is not as effective as MISA with two environments at disentangling what is invariant, and therefore causal with respect to dynamics, and what is not. The third is another invariance-based method using a gradient penalty, IRM (Arjovsky et al., 2019); its loss starts much higher and decreases very slowly, and we find it very brittle to tune in practice. Implementation details are found in Section C.1.

7.2 Imitation Learning
In this setup, we first train an expert policy using the proprioceptive state of Cheetah Run from Deepmind Control (Tassa et al., 2018). We then use this policy to collect a dataset for imitation learning in each of two training environments, altering the camera angle between environments when rendering image observations (Figure 5). We report the generalization performance as the test error when predicting actions in Figure 6. While test error does increase with our method, MISA, it grows significantly more slowly than for the single-task and multi-task baselines.
7.3 Reinforcement Learning
We return to the proprioceptive state in the cartpole_swingup environment in Deepmind Control (Tassa et al., 2018) to show that we can learn MISA while training a policy. We use Soft Actor-Critic (Haarnoja et al., 2018) with an additional linear encoder, and add spurious correlated dimensions which are a multiplicative factor of the original state space, together with an environment identifier appended to the observation. This multiplicative factor varies across environments; we train on two environments with different factors and test on a third, unseen factor. Like Arjovsky et al. (2019), we also add Gaussian noise to the true state dimensions to make the task harder. This incentivizes the agent to attend to the spuriously correlated dimensions instead, which carry no noise. In Figure 7 we see the generalization gap drastically improve with our method in comparison to training SAC on data aggregated over all environments and to IRM (Arjovsky et al., 2019) implemented on the critic loss. Implementation details and more information about Soft Actor-Critic can be found in Section C.2.
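The observation construction for this experiment can be sketched as follows (a hypothetical rendering of the setup; variable names and the exact composition are ours):

```python
import numpy as np

def make_observation(state, factor, env_id, noise_scale, rng):
    """Observation for one training environment: the true state with added
    Gaussian noise, a noise-free spurious copy scaled by an environment-
    specific multiplicative factor, and the environment identifier."""
    noisy_state = state + noise_scale * rng.normal(size=state.shape)
    spurious = factor * state          # correlated with state, no noise
    return np.concatenate([noisy_state, spurious, [float(env_id)]])
```

Because the spurious block is a clean copy of the state, a non-invariant learner is tempted to rely on it, and then fails when the multiplicative factor changes at test time.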
8 Related Work
8.1 Prior Work on Generalization Bounds
Generalization bounds provide guarantees on the test set error attained by an algorithm. Most of these bounds are probabilistic and targeted at the supervised setting, falling into the PAC (Probably Approximately Correct) framework. PAC bounds give probabilistic guarantees on a model’s true error as a function of its train set error and the complexity of the function class encoded by the model. Many measures of hypothesis class complexity exist: the Vapnik–Chervonenkis (VC) dimension (Vapnik and Chervonenkis, 1971), the Lipschitz constant and classification margin of a neural network (Bartlett et al., 2017b), and second-order properties of the loss landscape (Neyshabur et al., 2019) are just a few of many.
Analogous techniques can be applied to Bayesian methods, giving rise to PAC-Bayes bounds (McAllester, 1999). This family of bounds can be optimized to yield non-vacuous bounds on the test error of overparametrized neural networks (Dziugaite and Roy, 2017), and has demonstrated strong empirical correlation with model generalization (Jiang et al., 2020). More recently, Amit and Meir (2018); Yin et al. (2019) introduce PAC-Bayes bounds for the multi-task setting dependent on the number of tasks seen at training time.
Strehl et al. (2006) extend the PAC framework to reinforcement learning, defining a new class of bounds called PAC-MDP. An algorithm is PAC-MDP if, for any ε and δ, its sample complexity is less than some polynomial in the relevant quantities (|S|, |A|, 1/ε, 1/δ, 1/(1−γ)) with probability at least 1 − δ. The authors provide a PAC-MDP algorithm for model-free Q-learning. Lattimore and Hutter (2012) offer lower and upper bounds on the sample complexity of learning near-optimal behavior in MDPs by modifying the Upper Confidence RL (UCRL) algorithm (Jaksch et al., 2010).
8.2 Multi-Task Reinforcement Learning
Teh et al. (2017) and Borsa et al. (2016) handle multi-task reinforcement learning with a shared “distilled” policy (Teh et al., 2017) and a shared state-action representation (Borsa et al., 2016) to capture common or invariant behavior across all tasks. No assumptions are made about how these tasks relate to each other, other than a shared state and action space.
D’Eramo et al. (2020) show the benefits of learning a shared representation in multi-task settings with an approximate value iteration bound, and Brunskill and Li (2013) demonstrate a PAC-MDP algorithm with improved sample efficiency bounds through transfer across similar tasks. Again, none of these works looks to the multi-environment setting to explicitly leverage environment structure. Barreto et al. (2017) exploit successor features for transfer, under the assumption that the dynamics across tasks are the same but the reward changes; however, they do not handle the setting where states are latent and observations change.
9 Discussion
This work has demonstrated that, under certain assumptions, causal inference methods can be used in reinforcement learning to learn an invariant causal representation that generalizes across environments with a shared causal structure. We have provided a framework for defining families of environments, as well as methods – for both the low-dimensional linear value function approximation setting and the deep RL setting – that leverage invariant prediction to extract a causal representation of the state. We have further provided error bounds and identifiability results for these representations. We see this paper as a first step towards the more significant problem of learning useful representations for generalization across a broader class of environments. Potential applications include third-person imitation learning, sim2real transfer, and, relatedly, the use of privileged information in one task (the simulation) as grounding for generalization to new observation spaces (Salter et al., 2019).
10 Acknowledgements
MK has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No. 834115). The authors would also like to thank Marlos Machado for helpful feedback in the writing process.
References
Meta-learning by adjusting priors based on extended PAC-Bayes theory. In Proceedings of the 35th International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol. 80, pp. 205–214.
Invariant Risk Minimization. arXiv e-prints, arXiv:1907.02893.
Stronger generalization bounds for deep nets via a compression approach. In Proceedings of the 35th International Conference on Machine Learning (ICML 2018), pp. 390–418.
Layer normalization. arXiv e-prints.
Successor features for transfer in reinforcement learning. In Advances in Neural Information Processing Systems 30, pp. 4055–4065.
Spectrally-normalized margin bounds for neural networks. In Advances in Neural Information Processing Systems 30, pp. 6240–6249.
Adaptive aggregation for infinite horizon dynamic programming. IEEE Transactions on Automatic Control 34, pp. 589–598.
Learning shared representations in multi-task reinforcement learning. CoRR abs/1603.02041.
Sample complexity of multi-task reinforcement learning. In Uncertainty in Artificial Intelligence: Proceedings of the 29th Conference (UAI 2013).
Using bisimulation for policy transfer in MDPs. In Twenty-Fourth AAAI Conference on Artificial Intelligence.
Quantifying generalization in reinforcement learning. CoRR abs/1812.02341.
Sharing knowledge in multi-task deep reinforcement learning. In International Conference on Learning Representations.
Provably efficient RL with rich observations via latent state decoding. CoRR abs/1901.09018.
Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data.
Interventions and causal inference. Philosophy of Science 74 (5), pp. 981–995.
DeepMDP: learning continuous latent space models for representation learning. In Proceedings of the 36th International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol. 97, pp. 2170–2179.
Equivalence notions and model minimization in Markov decision processes. Artificial Intelligence 147, pp. 163–223.
Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Proceedings of the 35th International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol. 80, pp. 1861–1870.
Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research 11, pp. 1563–1600.
Fantastic generalization measures and where to find them. In International Conference on Learning Representations.
Bisimulation through probabilistic testing (preliminary report). In Proceedings of the 16th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (POPL ’89), pp. 344–352.
PAC bounds for discounted MDPs. In International Conference on Algorithmic Learning Theory, pp. 320–334.
Towards a unified theory of state abstraction for MDPs.
Algorithmic framework for model-based deep reinforcement learning with theoretical guarantees. In International Conference on Learning Representations.
PAC-Bayesian model averaging. In COLT, Vol. 99, pp. 164–170.
The role of over-parametrization in generalization of neural networks. In International Conference on Learning Representations.
Causality: Models, Reasoning and Inference. 2nd edition, Cambridge University Press, New York, NY, USA.
Causal inference using invariant prediction: identification and confidence intervals. Journal of the Royal Statistical Society, Series B (with discussion) 78 (5), pp. 947–1012.
Performance loss bounds for approximate value iteration with state aggregation. Mathematics of Operations Research 31 (2), pp. 234–244.
Attention privileged reinforcement learning for domain transfer. arXiv:1911.08363.
On causal and anticausal learning. In Proceedings of the 29th International Conference on Machine Learning (ICML ’12), pp. 459–466.
Observational overfitting in reinforcement learning. In International Conference on Learning Representations.
PAC model-free reinforcement learning. In Proceedings of the 23rd International Conference on Machine Learning, pp. 881–888.
DeepMind control suite. Technical report, DeepMind. https://arxiv.org/abs/1801.00690
Distral: robust multitask reinforcement learning. In Advances in Neural Information Processing Systems 30, pp. 4496–4506.
Adversarial discriminative domain adaptation. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2962–2971.
On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications 16 (2), pp. 264–280.
Soft actor-critic (SAC) implementation in PyTorch. GitHub. https://github.com/denisyarats/pytorch_sac
Meta-learning without memorization. arXiv preprint arXiv:1912.03820.
Natural environment benchmarks for reinforcement learning. CoRR abs/1811.06032.
A study on overfitting in deep reinforcement learning. CoRR abs/1804.06893.
Appendix A Notation
We provide a summary of key notation used throughout the paper here.
Appendix B Proofs
Technical notes and assumptions. For the block MDP assumption to be satisfied, we require that the interventions defining each environment occur only outside of the causal ancestors of the reward. Otherwise, different environments would have different latent state dynamics, which violates our assumption that the environments are obtained by a noisy emission function from the shared latent state space. Although ICP will still find the correct causal variables in this setting, this state abstraction will no longer be a model irrelevance state abstraction over the union of the environments.
Theorem 1.
Consider a family of MDPs $\mathcal{M}_{\mathcal{E}}$, with environments indexed by $e \in \mathcal{E}$. Let $\mathcal{M}_{\mathcal{E}}$ satisfy Assumptions 1–3. Let $S$ be the set of variables such that the reward $R$ is a function only of $[x]_S$ ($x$ restricted to the indices in $S$). Then let $AN(R)$ denote $S$ together with the ancestors of $S$ in the (fully observable) causal graph corresponding to the transition dynamics of $\mathcal{M}_{\mathcal{E}}$. Then the state abstraction $\phi(x) = [x]_{AN(R)}$ is a model-irrelevance abstraction for every $e \in \mathcal{E}$.
Proof.
To prove that $\phi$ is a model-irrelevance abstraction, we must first show that $\phi(x_1) = \phi(x_2) \implies R(x_1) = R(x_2)$ for any $x_1, x_2 \in \mathcal{X}$. For this, we note that $S \subseteq AN(R)$ and, because by definition $R$ is a function only of $[x]_S$, we have that $[x_1]_{AN(R)} = [x_2]_{AN(R)} \implies [x_1]_S = [x_2]_S$. Therefore,

(7)  $\phi(x_1) = \phi(x_2) \implies [x_1]_S = [x_2]_S \implies R(x_1) = R(x_2).$

To show that $\phi$ is a MISA, we must also show that for any $x_1, x_2$ such that $\phi(x_1) = \phi(x_2)$, and for any next state $x'$, the distribution over next-state equivalence classes will be equal for $x_1$ and $x_2$.

For this, it suffices to observe that $AN(R)$ is closed under taking parents in the causal graph, and that by construction the environments only contain interventions on variables outside of this causal set. Specifically, we observe that the probability of seeing any particular equivalence class $[x']_{AN(R)}$ after state $x$ is a function only of $[x]_{AN(R)}$.

This allows us to define a natural decomposition of the transition function as follows:

$P(x' \mid x) = P([x']_{AN(R)} \mid [x]_{AN(R)}) \, P([x']_{AN(R)^{c}} \mid x).$

We further observe that, since the components of $x'$ are independent given $x$, $P(x' \mid x) = \prod_i P(x'_i \mid x)$. We now return to the property we want to show:

$P(\phi(x') \mid x_1) = \sum_{x'' \in \phi^{-1}(\phi(x'))} P(x'' \mid x_1) = P([x']_{AN(R)} \mid [x_1]_{AN(R)}),$

and because $[x_1]_{AN(R)} = [x_2]_{AN(R)}$, we have

$P([x']_{AN(R)} \mid [x_1]_{AN(R)}) = P([x']_{AN(R)} \mid [x_2]_{AN(R)}),$

for which we can apply the previous chain of equalities backward to obtain

$P([x']_{AN(R)} \mid [x_2]_{AN(R)}) = P(\phi(x') \mid x_2).$
∎
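To make the invariance established above concrete, the following is a minimal numerical sketch of our own construction (the specific dynamics, coefficients, and noise scales are illustrative assumptions, not taken from the paper's experiments): the observation is a pair $(s, n)$ where only $s$ is a causal ancestor of the reward, and each environment intervenes only on the scale of the distractor $n$. The abstract trajectory over $s$ is then identical in every environment.

```python
import random

def step(s, noise_scale, rng):
    # Shared latent (causal) dynamics: identical in every environment.
    s_next = 0.9 * s + 0.05
    # Environment-specific distractor: an intervention outside the
    # causal ancestors of the reward, so it never feeds back into s.
    n_next = noise_scale * rng.gauss(0, 1)
    return s_next, n_next

trajectories = {}
for scale in (0.1, 10.0):  # two environments differing only in the distractor scale
    rng = random.Random(0)
    s, traj = 1.0, []
    for _ in range(5):
        s, _n = step(s, scale, rng)
        traj.append(s)
    trajectories[scale] = traj

# The abstract (causal) trajectories coincide across environments,
# illustrating that phi(x) = s is a model-irrelevance abstraction here.
print(trajectories[0.1] == trajectories[10.0])
```

A full check of the MISA condition would compare distributions over equivalence classes rather than a single deterministic rollout, but the sketch shows the mechanism: interventions confined to non-ancestors of the reward leave the abstract dynamics untouched.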
Proposition 1 (Identifiability and Uniqueness of Causal State Abstraction).
In the setting of the previous theorem, assume that the transition dynamics and reward are linear functions of the current state. If the training environment set $\mathcal{E}_{\text{train}}$ satisfies any of the conditions of Theorem 2 of Peters et al. (2016) with respect to each variable in $AN(R)$, then the causal feature set $AN(R)$ is identifiable. Conversely, if the training environments do not contain sufficient interventions, then there may exist a proper subset $\Phi \subsetneq AN(R)$ such that $\phi_\Phi$ is a model irrelevance abstraction over $\mathcal{E}_{\text{train}}$, but not over $\mathcal{E}$ globally.
Proof.
The proof of the first statement follows immediately from iterative application of the identifiability result of Peters et al. (2016) to each variable in the causal feature set.

For the converse, we consider a simple counterexample in which one variable $x_i \in AN(R)$ is constant in every training environment, with value $x_i = c$. Then, letting $\Phi = AN(R) \setminus \{i\}$, we observe that $\phi_\Phi(x) = [x]_\Phi$ is also a model-irrelevance state abstraction over $\mathcal{E}_{\text{train}}$. First, since $x_i$ is constant across the training environments, $[x]_\Phi$ determines $[x]_{AN(R)}$ there, and so $\phi_\Phi(x_1) = \phi_\Phi(x_2) \implies R(x_1) = R(x_2)$ for any $x_1, x_2$ observed during training.

Finally, we must show that the distribution over next-state equivalence classes agrees for states with the same abstraction. Again starting from the result of Theorem 1 we have

$P(\phi_\Phi(x') \mid x_1) = P([x']_\Phi \mid [x_1]_\Phi, x_i = c),$

and because $[x_1]_\Phi = [x_2]_\Phi$ (with $x_i = c$ in both), we have

$P([x']_\Phi \mid [x_1]_\Phi, x_i = c) = P([x']_\Phi \mid [x_2]_\Phi, x_i = c),$

for which we can apply the previous chain of equalities backward to obtain $P(\phi_\Phi(x') \mid x_2)$.

However, if one of the test environments contains the intervention $\operatorname{do}(x_i = c')$ for some $c' \neq c$, then the distribution over next states in the new environment will violate the conditions for a model-irrelevance abstraction. ∎
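The identifiability half of the proposition can be illustrated with a small sketch in the spirit of invariant prediction (Peters et al., 2016). This is a deliberately simplified coefficient-invariance comparison, not the full ICP procedure with statistical tests, and the data-generating process below is a hypothetical example of our own: $x_1$ is a causal parent of the target $y$, while $x_2$ is an anti-causal child whose noise scale differs across environments.

```python
import random
import statistics

rng = random.Random(0)

def make_env(noise_scale, n=5000):
    """One environment: x1 -> y is the invariant mechanism, while
    x2 is a child of y whose noise scale is intervened on per environment."""
    rows = []
    for _ in range(n):
        x1 = rng.gauss(0, 1)
        y = 2.0 * x1 + rng.gauss(0, 0.1)    # invariant across environments
        x2 = y + rng.gauss(0, noise_scale)  # environment-dependent mechanism
        rows.append((x1, x2, y))
    return rows

def ols_slope(xs, ys):
    """Closed-form simple least-squares slope of y on x."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

envs = [make_env(0.5), make_env(3.0)]
slopes = {}
for i, name in ((0, "x1"), (1, "x2")):
    slopes[name] = [ols_slope([r[i] for r in e], [r[2] for r in e]) for e in envs]
    print(name, slopes[name])
# x1's regression coefficient is close to 2.0 in both environments (invariant),
# while x2's coefficient shifts with the intervention, so x2 is rejected
# as a member of the causal feature set.
```

With insufficiently varied environments (e.g. both environments using the same noise scale for $x_2$), the two candidate predictors would be indistinguishable by this criterion, which mirrors the converse direction of the proposition.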
Theorem 2.
Consider an MDP $M$, with $\bar{M}$ denoting a coarser bisimulation of $M$. Let $\phi$ denote the mapping from states of $M$ to states of $\bar{M}$. Suppose that the dynamics of $M$ are Lipschitz w.r.t. $\phi(x)$ and that $T$ is some approximate transition model whose expected error relative to the true abstract dynamics is at most $\delta$, for some $\delta > 0$. Let $W_1$ denote the 1-Wasserstein distance. Then
(8) 
We will use shorthand notation for the distributions of state embeddings induced by the behaviour policy under the true and approximate transition models, respectively.
Proof.
Let $\gamma$ be a coupling over the two distributions being compared such that the expected distance between its coordinates attains (or approximates to arbitrary precision) the 1-Wasserstein distance between them; the bound then follows from the Lipschitz assumption on the dynamics together with the model-error bound.
∎
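For reference, the proof invokes the standard coupling characterization of the 1-Wasserstein distance (a textbook definition, not specific to this paper):

```latex
W_1(\mu, \nu) \;=\; \inf_{\gamma \in \Gamma(\mu, \nu)} \;
\mathbb{E}_{(u, v) \sim \gamma}\big[\, \lVert u - v \rVert \,\big],
```

where $\Gamma(\mu, \nu)$ denotes the set of joint distributions whose marginals are $\mu$ and $\nu$. Choosing a (near-)optimal coupling $\gamma$ is what allows the expected distance under $\gamma$ to be bounded term by term using the Lipschitz and model-error assumptions.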
Theorem 4 (Existence of modelirrelevance state abstractions).
Let $\mathcal{M}_{\mathcal{E}}$ denote some family of bisimilar MDPs with joint state space $\mathcal{X} = \bigcup_{e \in \mathcal{E}} \mathcal{X}_e$. Let the mapping from states in $\mathcal{X}_e$ to the underlying abstract MDP be denoted by $\phi_e$. Then if the states in $\mathcal{X}$ satisfy $\phi_e(x) = \phi_{e'}(x)$ whenever $x \in \mathcal{X}_e \cap \mathcal{X}_{e'}$, then $\phi := \bigcup_e \phi_e$ is a model-irrelevance state abstraction for $\mathcal{M}_{\mathcal{E}}$.
Proof.
First, note that $\phi$ is well-defined, because each $\phi_e$ agrees with the rest on the value of all states appearing in multiple tasks. Then $\phi$ will be a model-irrelevance abstraction for every MDP $\mathcal{M}_e$ because on $\mathcal{X}_e$ it agrees with $\phi_e$, which is itself a model-irrelevance abstraction. ∎
Theorem 3.
Let $\mathcal{M}$ be our block MDP and $\bar{\mathcal{M}}$ the learned invariant MDP, with a mapping $\phi$ from observations of $\mathcal{M}$ to states of $\bar{\mathcal{M}}$. For any Lipschitz-valued policy $\pi$, the value difference is bounded by
(9) 