1 Introduction
Learning control from images is important for many real-world applications. While deep reinforcement learning (RL) has enjoyed many successes in simulated tasks, learning control from real vision is more complex, especially outdoors, where images reveal detailed scenes of a complex and unstructured world. Furthermore, while many RL algorithms can eventually learn control from real images given unlimited data, data-efficiency is often a necessity in real trials, which are expensive and constrained to real-time.
Prior methods for data-efficient learning of simulated visual tasks typically use representation learning. Representation learning summarizes images by encoding them into smaller vectored representations better suited for RL. For example, sequential autoencoders aim to learn lossless representations of streaming observations (sufficient to reconstruct current observations and predict future observations) from which various RL algorithms can be trained (Hafner et al., 2018; Lee et al., 2019; Yarats et al., 2019). However, such methods are task-agnostic: the models represent all dynamic elements they observe in the world, whether they are relevant to the task or not. We argue such representations can easily “distract” RL algorithms with irrelevant information in the case of real images. The issue of distraction is less evident in popular simulated MuJoCo and Atari tasks, since any change in observation space is likely task-relevant, and thus worth representing. By contrast, the images that autonomous cars observe contain predominantly task-irrelevant information, like cloud shapes and architectural details, as illustrated in Figure 1.
Rather than learning control-agnostic representations that focus on accurate reconstruction of clouds and buildings, we would prefer a more compressed representation from a lossy encoder, which only retains state information relevant to our task. If we would like to learn representations that capture only task-relevant elements of the state and are invariant to task-irrelevant information, intuitively we can utilize the reward signal to determine task-relevance. As cumulative rewards are our objective, state elements are relevant not only if they influence the current reward, but also if they influence state elements in the future that in turn influence future rewards. This recursive relationship can be distilled into a recursive task-aware notion of state abstraction: an ideal representation is one that is predictive of reward, and also predictive of itself in the future.
We propose learning such an invariant representation using the bisimulation metric, where the distance between two observation encodings corresponds to how “behaviourally different” (Ferns and Precup, 2014) the two observations are. Our main contribution is a practical representation learning method based on the bisimulation metric suitable for downstream control, which we call deep bisimulation for control (DBC). We additionally provide theoretical analysis that proves value bounds between the optimal value function of the true MDP and the optimal value function of the MDP constructed by the learned representation. Empirical evaluations demonstrate that our non-reconstructive approach using bisimulation is substantially more robust to task-irrelevant distractors when compared to prior approaches that use reconstruction losses or contrastive losses. Our initial experiments insert natural videos into the background of MuJoCo control tasks as complex distraction. Our second setup is a high-fidelity highway driving task using CARLA (Dosovitskiy et al., 2017), showing that our representations can be trained effectively even on highly realistic images with many distractions, such as trees, clouds, buildings, and shadows. For example videos see https://sites.google.com/view/deepbisim4control.
2 Related Work
Our work builds on the extensive prior research on bisimulation, a form of MDP state aggregation.
Reconstruction-based Representations. Early works on deep reinforcement learning from images (Lange and Riedmiller, 2010; Lange et al., 2012) used a two-step learning process where first an autoencoder was trained using reconstruction loss to learn a low-dimensional representation, and subsequently a controller was learned using this representation. This allows effective leveraging of large, unlabeled datasets for learning representations for control. In practice, however, there is no guarantee that the learned representation will capture useful information for the control task, and significant expert knowledge and tricks are often necessary for these approaches to work. In model-based RL, one solution to this problem has been to jointly train the encoder and the dynamics model end-to-end (Watter et al., 2015; Wahlström et al., 2015), which proved effective in learning useful task-oriented representations. Hafner et al. (2018) and Lee et al. (2019) learn latent state models using a reconstruction loss, but these approaches suffer from the difficulty of learning accurate long-term predictions and often still require significant manual tuning. Gelada et al. (2019) also propose a latent dynamics model-based method and connect their approach to bisimulation metrics, using a reconstruction loss in Atari. They show that distance in the DeepMDP representation upper bounds the bisimulation distance, whereas our objective directly learns a representation where distance in latent space is the bisimulation metric. Further, their results rely on the assumption that the learned representation is Lipschitz, whereas we show that, by directly learning a bisimilarity-based representation, we guarantee a representation that generates a Lipschitz MDP. We show experimentally that our non-reconstructive DBC method is substantially more robust to complex distractors.
Contrastive-based Representations. Contrastive losses are a self-supervised approach to learning useful representations by enforcing similarity constraints between data (van den Oord et al., 2018; Chen et al., 2020). Similarity functions can be provided as domain knowledge in the form of heuristic data augmentation, where we maximize similarity between augmentations of the same data point (Laskin et al., 2020) or nearby image patches (Hénaff et al., 2019), and minimize similarity between different data points. In the absence of this domain knowledge, contrastive representations can be trained by predicting the future (van den Oord et al., 2018). We compare to such an approach in our experiments, and show that DBC is substantially more robust. While contrastive losses do not require reconstruction, they do not inherently have a mechanism to determine downstream task relevance without manual engineering, and when trained only for prediction, they aim to capture all predictable features in the observation, which performs poorly on real images for the same reasons world models do. A better method would be to incorporate knowledge of the downstream task into the similarity function in a data-driven way, so that images that are very different pixel-wise (e.g. lighting or texture changes) can also be grouped as similar w.r.t. downstream objectives.
Bisimulation. Various forms of state abstractions have been defined in Markov decision processes (MDPs) to group states into clusters whilst preserving some property (e.g. the optimal value, or all values, or all action values from each state) (Li et al., 2006). The strictest form, which generally preserves the most properties, is bisimulation (Larsen and Skou, 1989). Bisimulation only groups states that are indistinguishable w.r.t. reward sequences output given any action sequence tested. A related concept is bisimulation metrics (Ferns and Precup, 2014), which measure how “behaviorally similar” states are. Ferns et al. (2011) define the bisimulation metric with respect to continuous MDPs, and propose a Monte Carlo algorithm for learning it using an exact computation of the Wasserstein distance between empirically measured transition distributions. However, this method does not scale well to large state spaces. Taylor et al. (2009) relate MDP homomorphisms to lax probabilistic bisimulation, and define a lax bisimulation metric. They then compute a value bound based on this metric for MDP homomorphisms, where approximately equivalent state-action pairs are aggregated. Most recently, Castro (2020) proposes an algorithm for computing on-policy bisimulation metrics, but does so directly, without learning a representation. That work focuses on deterministic settings and the policy evaluation problem. We believe our work is the first to propose a gradient-based method for directly learning a representation space with the properties of bisimulation metrics and to show that it works in the policy optimization setting.
3 Preliminaries
We start by introducing notation and outlining realistic assumptions about underlying structure in the environment. Then, we review state abstractions and metrics for state similarity.
We assume the underlying environment is a Markov decision process (MDP), described by the tuple $\mathcal{M} = (\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ the action space, $\mathcal{P}(s'|s,a)$ the probability of transitioning from state $s \in \mathcal{S}$ to state $s' \in \mathcal{S}$, and $\gamma \in [0,1)$ a discount factor. An “agent” chooses actions $a \in \mathcal{A}$ according to a policy function $a \sim \pi(s)$, which updates the system state $s' \sim \mathcal{P}(s,a)$, yielding a reward $r = \mathcal{R}(s) \in \mathbb{R}$. The agent’s goal is to maximize the expected cumulative discounted rewards by learning a good policy: $\max_{\pi} \mathbb{E}_{\mathcal{P}}\left[\sum_{t=0}^{\infty} \gamma^t \mathcal{R}(s_t)\right]$.
Bisimulation is a form of state abstraction that groups states $s_i$ and $s_j$ that are “behaviorally equivalent” (Li et al., 2006). For any action sequence $a_{0:\infty}$, the probabilistic sequences of rewards from $s_i$ and $s_j$ are identical. A more compact definition has a recursive form: two states are bisimilar if they share both the same immediate reward and equivalent distributions over the next bisimilar states (Larsen and Skou, 1989; Givan et al., 2003).
Definition 1 (Bisimulation Relations (Givan et al., 2003)).
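Given an MDP $\mathcal{M}$, an equivalence relation $B$ between states is a bisimulation relation if, for all states $s_i, s_j \in \mathcal{S}$ that are equivalent under $B$ (denoted $s_i \equiv_B s_j$) the following conditions hold:
(1) $\mathcal{R}(s_i, a) = \mathcal{R}(s_j, a) \quad \forall a \in \mathcal{A}$,
(2) $\mathcal{P}(G \mid s_i, a) = \mathcal{P}(G \mid s_j, a) \quad \forall a \in \mathcal{A}, \ \forall G \in \mathcal{S}_B$,
where $\mathcal{S}_B$ is the partition of $\mathcal{S}$ under the relation $B$ (the set of all groups $G$ of equivalent states), and $\mathcal{P}(G \mid s, a) = \sum_{s' \in G} \mathcal{P}(s' \mid s, a)$.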
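As an illustration of Definition 1, the two conditions can be checked directly on a small finite MDP. The sketch below is not part of our method: `is_bisimulation` and the three-state toy MDP are hypothetical and exist only to make the definition concrete. States `A1` and `A2` stand for two observations that differ only in a distractor, so they share rewards and group-level transition probabilities.

```python
from itertools import combinations

def is_bisimulation(partition, actions, reward, trans, tol=1e-9):
    """Check the two conditions of Definition 1 for a candidate partition.

    partition: list of sets of states (the groups G)
    reward:    dict mapping (state, action) -> float
    trans:     dict mapping (state, action) -> {next_state: probability}
    """
    def group_prob(s, a, group):
        # P(G | s, a): total probability of landing anywhere inside the group
        return sum(p for s_next, p in trans[(s, a)].items() if s_next in group)

    for block in partition:
        for s_i, s_j in combinations(sorted(block), 2):
            for a in actions:
                # Condition 1: equal immediate rewards
                if abs(reward[(s_i, a)] - reward[(s_j, a)]) > tol:
                    return False
                # Condition 2: equal transition probability into every group
                for group in partition:
                    if abs(group_prob(s_i, a, group) - group_prob(s_j, a, group)) > tol:
                        return False
    return True

# Hypothetical toy MDP with a single action.
actions = ["go"]
reward = {("A1", "go"): 0.0, ("A2", "go"): 0.0, ("B", "go"): 1.0}
trans = {
    ("A1", "go"): {"B": 1.0},
    ("A2", "go"): {"B": 1.0},
    ("B", "go"): {"A1": 0.5, "A2": 0.5},
}

good = is_bisimulation([{"A1", "A2"}, {"B"}], actions, reward, trans)
bad = is_bisimulation([{"A1", "B"}, {"A2"}], actions, reward, trans)
```

Grouping `A1` with `A2` satisfies both conditions, whereas grouping `A1` with `B` already fails the reward condition.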
Exact partitioning with bisimulation relations is generally impractical in continuous state spaces, as the relation is highly sensitive to infinitesimal changes in the reward function or dynamics. For this reason, bisimulation metrics (Ferns et al., 2011; Ferns and Precup, 2014; Castro, 2020) soften the concept of state partitions, and instead define a pseudometric space $(\mathcal{S}, d)$, where a distance function $d: \mathcal{S} \times \mathcal{S} \mapsto \mathbb{R}_{\geq 0}$ measures the “behavioral similarity” between two states. (Note that $d$ is a pseudometric, meaning the distance between two different states can be zero, corresponding to behavioral equivalence.) Defining a distance between states requires defining both a distance between rewards (to soften Equation 1), and a distance between state distributions (to soften Equation 2). Prior works use the Wasserstein metric for the latter, originally used in the context of bisimulation metrics by van Breugel and Worrell (2001). The $p$-th Wasserstein metric is defined between two probability distributions $\mathcal{P}_i$ and $\mathcal{P}_j$ as $W_p(\mathcal{P}_i, \mathcal{P}_j; d) = \left(\inf_{\gamma' \in \Gamma(\mathcal{P}_i, \mathcal{P}_j)} \int_{\mathcal{S} \times \mathcal{S}} d(s_i, s_j)^p \, \mathrm{d}\gamma'(s_i, s_j)\right)^{1/p}$, where $\Gamma(\mathcal{P}_i, \mathcal{P}_j)$ is the set of all couplings of $\mathcal{P}_i$ and $\mathcal{P}_j$. This is known as the “earth mover” distance, denoting the cost of transporting mass from one distribution to another (Villani, 2003). Finally, the bisimulation metric is the reward difference added to the Wasserstein distance between transition distributions:
Definition 2 (Bisimulation Metric).
From Theorem 2.6 in Ferns et al. (2011) with $c \in [0, 1)$:
(3) $d(s_i, s_j) = \max_{a \in \mathcal{A}} \ (1 - c) \cdot |\mathcal{R}_{s_i}^{a} - \mathcal{R}_{s_j}^{a}| + c \cdot W_1(\mathcal{P}_{s_i}^{a}, \mathcal{P}_{s_j}^{a}; d)$
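To build intuition for this recursive definition, the metric can be computed by fixed-point iteration in the special case of a finite MDP with deterministic transitions. In that case each transition distribution is a point mass, so the Wasserstein term reduces to the current distance between the two successor states. The sketch below relies on that simplifying assumption; the function name `bisim_metric_deterministic` and the toy MDP are hypothetical, and this is not the algorithm we use for pixel observations.

```python
def bisim_metric_deterministic(states, actions, reward, next_state, c=0.9, iters=200):
    """Iterate d <- max_a (1 - c)|R_i - R_j| + c * d(next_i, next_j), starting from d = 0.

    reward:     dict mapping (state, action) -> float
    next_state: dict mapping (state, action) -> state (deterministic dynamics)
    """
    d = {(s, t): 0.0 for s in states for t in states}
    for _ in range(iters):
        # Synchronous update: the right-hand side reads the previous d.
        d = {
            (s, t): max(
                (1 - c) * abs(reward[(s, a)] - reward[(t, a)])
                + c * d[(next_state[(s, a)], next_state[(t, a)])]
                for a in actions
            )
            for s in states
            for t in states
        }
    return d

# Hypothetical toy MDP: A1 and A2 behave identically, B is rewarding, C is not.
states = ["A1", "A2", "B", "C"]
actions = ["go"]
reward = {("A1", "go"): 0.0, ("A2", "go"): 0.0, ("B", "go"): 1.0, ("C", "go"): 0.0}
next_state = {("A1", "go"): "B", ("A2", "go"): "B", ("B", "go"): "B", ("C", "go"): "C"}

d = bisim_metric_deterministic(states, actions, reward, next_state, c=0.9)
```

Here `d[("A1", "A2")]` stays at zero because the two states are bisimilar, while `d[("A1", "C")]` is positive even though both states have the same immediate reward, because their successors earn different rewards.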
4 Learning Representations for Control with Bisimulation Metrics
We propose Deep Bisimulation for Control (DBC), a data-efficient approach to learn control policies from unstructured, high-dimensional observations. In contrast to prior work on bisimulation, which typically aims to learn a distance function of the form $d: \mathcal{S} \times \mathcal{S} \mapsto \mathbb{R}_{\geq 0}$ between observations, our aim is instead to learn representations under which distances correspond to bisimulation metrics, and then use these representations to improve reinforcement learning. Our goal is to learn encoders that capture representations of states that are suitable for control, while discarding any information that is irrelevant for control. Any representation that relies on reconstruction of the observation cannot do this, as these irrelevant details are still important for reconstruction. We hypothesize that bisimulation metrics can acquire this type of representation, without any reconstruction.
Bisimulation metrics are a useful form of state abstraction, but prior methods to train distance functions either do not scale to pixel observations (Ferns et al., 2011) (due to the max operator in Equation 3), or were only designed for the (fixed) policy evaluation setting (Castro, 2020). By contrast, we learn improved representations for policy inputs, as the (non-fixed) policy improves. Our bisimulation metric can be learned with a gradient-based algorithm, and we prove it converges to a fixed point in Theorem 1 under some assumptions. To train our encoder $\phi$ towards our desired relation $d(s_i, s_j) := \|\phi(s_i) - \phi(s_j)\|_1$, we draw batches of observation pairs, and minimise the mean square error between the on-policy bisimulation metric and the $\ell_1$ distance in the latent space:
(4) $J(\phi) = \Big( \|z_i - z_j\|_1 - |\hat{\mathcal{R}}(\bar{z}_i) - \hat{\mathcal{R}}(\bar{z}_j)| - \gamma \cdot W_2\big(\hat{\mathcal{P}}(\cdot \mid \bar{z}_i, \bar{\pi}(\bar{z}_i)), \ \hat{\mathcal{P}}(\cdot \mid \bar{z}_j, \bar{\pi}(\bar{z}_j))\big) \Big)^2$
where $z_i = \phi(s_i)$, $z_j = \phi(s_j)$, $\bar{z}$ denotes $\phi(s)$ with stop gradients, and $\bar{\pi}$ is the mean policy output. Equation 4 uses both a reward model $\hat{\mathcal{R}}$ and dynamics model $\hat{\mathcal{P}}$, which have their own training steps in Algorithm 1. The full model architecture and training is illustrated by Figure 2. Our reward model is a deterministic network, and our dynamics model $\hat{\mathcal{P}}$ outputs a Gaussian distribution. For this reason, we use the 2-Wasserstein metric $W_2$ in Equation 4, as opposed to the 1-Wasserstein in Equation 3, since the $W_2$ metric has a convenient closed form: $W_2(\mathcal{N}(\mu_i, \Sigma_i), \mathcal{N}(\mu_j, \Sigma_j))^2 = \|\mu_i - \mu_j\|_2^2 + \|\Sigma_i^{1/2} - \Sigma_j^{1/2}\|_{\mathcal{F}}^2$, where $\|\cdot\|_{\mathcal{F}}$ is the Frobenius norm. For all other distances we continue using the $\ell_1$ norm.
Incorporating control. We combine our representation learning approach (Algorithm 1) with the soft actor-critic (SAC) algorithm (Haarnoja et al., 2018) to devise a practical reinforcement learning method. We modified SAC slightly in Algorithm 2 to allow the value function to backprop to our encoder, which can improve performance further (Yarats et al., 2019; Rakelly et al., 2019). In principle, though, our method could be combined with any RL algorithm, including the model-free DQN (Mnih et al., 2015) or model-based PETS (Chua et al., 2018). Implementation details and hyperparameter values of DBC are summarized in the appendix, Table 2. We train DBC by iteratively updating four components in turn: a dynamics model $\hat{\mathcal{P}}$, reward model $\hat{\mathcal{R}}$, encoder $\phi$ with Equation 4, and policy $\pi$ (in this case, with SAC). A single loss function would be less stable, and require balancing components. The inputs of each loss function $J(\cdot)$ in Algorithm 1 indicate which components are updated. After each training step, the policy $\pi$ is used to step in the environment, the data is collected in a replay buffer $\mathcal{D}$, and a batch is randomly selected to repeat training.
5 Generalization Bounds and Links to Causal Inference
While DBC enables representation learning without pixel reconstruction, it leaves open the question of how good the resulting representations really are. In this section, we present theoretical analysis that bounds the suboptimality of a value function trained on the representation learned via DBC. First, we show that our bisimulation metric converges to a fixed point, starting from the initialized policy $\pi_0$ and converging to an optimal policy $\pi^*$.
Theorem 1.
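Let $\mathfrak{met}$ be the space of bounded pseudometrics on $\mathcal{S}$ and $\pi$ a policy that is continuously improving. Define $\mathcal{F}: \mathfrak{met} \mapsto \mathfrak{met}$ by
(5) $\mathcal{F}(d, \pi)(s_i, s_j) = (1 - c) \, |r_{s_i}^{\pi} - r_{s_j}^{\pi}| + c \, W(d)(\mathcal{P}_{s_i}^{\pi}, \mathcal{P}_{s_j}^{\pi})$
Then $\mathcal{F}$ has a least fixed point $\tilde{d}$ which is a $\pi^*$-bisimulation metric.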
Proof in appendix. As evidenced by Definition 2, the bisimulation metric has no direct dependence on the observation space. Pixels can change, but bisimilarity will stay the same. Instead, bisimilarity is grounded in a recursion of future transition probabilities and rewards, which is closely related to the optimal value function. In fact, the bisimulation metric gives tight bounds on the optimal value function with discount factor $\gamma$. We show this using the property that the optimal value function is Lipschitz with respect to the bisimulation metric; see Theorem 5 in the Appendix (Ferns et al., 2004). This result also implies that the closer two states are in terms of $\tilde{d}$, the more likely they are to share the same optimal actions. This leads us to a generalization bound on the optimal value function of an MDP constructed from a representation space using bisimulation metrics. We can construct a partition of this space for some $\epsilon > 0$, giving us $n$ partitions where $\frac{1}{n} < (1 - c)\epsilon$. We denote $\phi$ as the encoder that maps from the original state space $\mathcal{S}$ to each $\epsilon$-cluster.
Theorem 2 (Value bound based on bisimulation metrics).
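Given an MDP $\bar{\mathcal{M}}$ constructed by aggregating states in an $\epsilon$-neighborhood, and an encoder $\phi$ that maps from states in the original MDP $\mathcal{M}$ to these clusters, the optimal value functions for the two MDPs are bounded as
(6) $|V^*(s) - V^*(\phi(s))| \leq \frac{2\epsilon}{(1 - \gamma)(1 - c)}$
Proof in appendix. As $\epsilon \to 0$ the optimal value function of the aggregated MDP converges to the original. Further, by defining a learning error for $\phi$, $\mathcal{L} := \sup_{s_i, s_j \in \mathcal{S}} \big| \, \|\phi(s_i) - \phi(s_j)\| - \tilde{d}(s_i, s_j) \big|$, we can update the bound in Theorem 2 to incorporate $\mathcal{L}$: $|V^*(s) - V^*(\phi(s))| \leq \frac{2\epsilon + 2\mathcal{L}}{(1 - \gamma)(1 - c)}$.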
MDP dynamics have a strong connection to causal inference and causal graphs, which are directed acyclic graphs (Jonsson and Barto, 2006; Schölkopf, 2019; Zhang et al., 2020). Specifically, the state and action at time $t$ causally affect the next state at time $t+1$. In this work, we care about the components of the state space that causally affect current and future reward. Deep bisimulation for control representations connect to causal feature sets, or the minimal feature set needed to predict a target variable (Zhang et al., 2020).
Theorem 3 (Connections to causal feature sets (Thm 1 in Zhang et al. (2020))).
If we partition observations using the bisimulation metric, those clusters (a bisimulation partition) correspond to the causal feature set of the observation space with respect to current and future reward.
This connection tells us that these features are the minimal sufficient statistic of the current and future reward, and therefore consist of (and only consist of) the causal ancestors of the reward variable $r$.
Definition 3 (Causal Ancestors).
In a causal graph where nodes correspond to variables and directed edges between a parent node $P$ and child node $C$ are causal relationships, the causal ancestors $AN(C)$ of a node are all nodes in the path from $C$ to a root node.
If there are interventions on distractor variables, or variables that control the rendering function and therefore the rendered observation but do not affect the reward, the causal feature set will be robust to these interventions, and correctly predict current and future reward in the linear function approximation setting (Zhang et al., 2020). As an example, in the context of autonomous driving, an intervention can be a change in weather, or a change from day to night which affects the observation space but not the dynamics or reward. Finally, we show that a representation based on the bisimulation metric generalizes to other reward functions with the same causal ancestors, with an example causal graph in Figure 3.
Theorem 4 (Task Generalization).
Given an encoder $\phi: \mathcal{O} \mapsto \mathcal{Z}$ that maps observations to a latent bisimulation metric representation where $\|\phi(s_i) - \phi(s_j)\| := \tilde{d}(s_i, s_j)$, $\mathcal{Z}$ encodes information about all the causal ancestors of the reward $AN(R)$.
Proof in appendix. This result shows that the learned representation will generalize to unseen reward functions, as long as the new reward function has a subset of the same causal ancestors. As an example, a representation learned for a robot to walk will likely generalize to learning to run, because the reward function depends on forward velocity and all the factors that contribute to forward velocity. However, that representation will not generalize to picking up objects, as those objects will be ignored by the learned representation, since they are not likely to be causal ancestors of a reward function designed for walking. Theorem 4 shows that the learned representation will be robust to spurious correlations, or changes in factors that are not in $AN(R)$. This complements Theorem 5, that the representation is a minimal sufficient statistic of the optimal value function, improving generalization over non-minimal representations. We show empirical validation of these findings in Section 6.2.
6 Experiments
Our central hypothesis is that our non-reconstructive bisimulation-based representation learning approach should be substantially more robust to task-irrelevant distractors. To that end, we evaluate our method in a clean setting without distractors, as well as a much more difficult setting with distractors. We compare against several baselines. The first is Stochastic Latent Actor-Critic (SLAC, Lee et al. (2019)), a state-of-the-art method for pixel observations on DeepMind Control that learns a dynamics model with a reconstruction loss. The second is DeepMDP (Gelada et al., 2019), a recent method that also learns a latent representation space using a latent dynamics model, reward model, and distributional Q learning, but which needed a reconstruction loss to scale up to Atari. Finally, we compare against two methods that use the same architecture as ours but replace our bisimulation loss with (1) a reconstruction loss (Reconstruction) and (2) contrastive predictive coding (Oord et al., 2018) (Contrastive) to ground the dynamics model and learn a latent representation.
6.1 Control with Background Distraction
In this section, we benchmark deep bisimulation for control and the previously described baselines on the DeepMind Control (DMC) suite (Tassa et al., 2018) in two settings and nine environments: finger_spin, cheetah_run, and walker_walk (Figure 4), with additional environments in the appendix.
Figure 4: Results comparing our DBC method to baselines on 10 seeds with 1 standard error shaded in the default setting. The grid location of each graph corresponds to the grid location of each observation.
Default Setting. Here, the pixel observations have simple backgrounds as shown in Figure 4 (top row) with training curves for our DBC and baselines. We see SLAC, a recent state-of-the-art model-based representation learning method that uses reconstruction, generally performs best.
Natural Video Setting. Next, we incorporate natural video from the Kinetics dataset (Kay et al., 2017) as background (Zhang et al., 2018), shown in Figure 4 (bottom row). The results confirm our hypothesis: although a number of prior methods can learn effectively in the absence of complex distractors, when distractors are introduced, our non-reconstructive bisimulation-based method attains substantially better results.
To visualize the representation learned with our bisimulation metric loss function in Equation 4, we use a t-SNE plot (Figure 5). We see that even when the background looks drastically different, our encoder learns to ignore irrelevant information and maps observations with similar robot configurations near each other. On the far left of Figure 5, we took 10 nearby points in the t-SNE plot and averaged the observations. We see that the agent is quite crisp, which means neighboring points encode the agent in similar positions, but the backgrounds are very different, and so are blurry when averaged.
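The neighbor-averaging step can be sketched in a few lines. This is an illustrative stand-in rather than the code behind Figure 5: `average_of_neighbors` is a hypothetical helper, images are flattened lists of pixel intensities, and neighbors are selected by squared Euclidean distance in whatever embedding is passed in (for example, 2-D t-SNE coordinates).

```python
def average_of_neighbors(embeddings, images, query, k):
    """Pixel-wise mean of the k images whose embeddings lie closest to `query`."""
    def sq_dist(point):
        return sum((p - q) ** 2 for p, q in zip(point, query))

    # Indices of all points, sorted from nearest to farthest from the query.
    order = sorted(range(len(embeddings)), key=lambda i: sq_dist(embeddings[i]))
    nearest = [images[i] for i in order[:k]]
    n_pixels = len(nearest[0])
    return [sum(img[p] for img in nearest) / k for p in range(n_pixels)]

# Hypothetical two-pixel "images": [agent pixel, background pixel].
embeddings = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0)]
images = [[1.0, 0.0], [1.0, 1.0], [0.0, 0.5]]

avg = average_of_neighbors(embeddings, images, query=(0.0, 0.0), k=2)
```

In this toy case the two nearest observations agree on the agent pixel, which stays at full intensity after averaging, while their differing background pixels blend to an intermediate value.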
6.2 Generalization Experiments
We test generalization of our learned representation in two ways. First, we show that the learned representation space can generalize to different types of distractors, by training with simple distractors and testing on the natural video setting. Second, we show that our learned representation can be useful for reward functions other than those it was trained for.
Generalizing over backgrounds. In the first experiment, we train on the simple distractors setting and evaluate on natural video. Figure 6 shows an example of the simple distractors setting and performance during training time of two experiments, blue being the zero-shot transfer to the natural video setting, and orange the baseline which trains on natural video. This result empirically validates that the representations learned by our method effectively learn to ignore the background, regardless of what the background contains or how dynamic it is.
Generalizing over reward functions. We evaluate (Figure 6) the generalization capabilities of the learned representation by training SAC with new reward functions walker_stand and walker_run using the fixed representation learned from walker_walk. This is empirical evidence that confirms Theorem 4: if the new reward functions are causally dependent on a subset of the same factors that determine the original reward function, then our representation should be sufficient.
6.3 Comparison with other Bisimulation Encoders
Even though the purpose of bisimulation metrics by Castro (2020) is learning distances $d$, not representation spaces $\mathcal{Z}$, it nevertheless implements $d$ with function approximation: $d(s_i, s_j) = \psi\big(\phi(s_i), \phi(s_j)\big)$, by encoding observations with $\phi$ before computing distances with $\psi$, trained as:
(7) $J(\phi, \psi) = \Big( \psi\big(\phi(s_i), \phi(s_j)\big) - |r_i - r_j| - \gamma \, \hat{\psi}\big(\hat{\phi}(\mathcal{P}(s_i, \pi(s_i))), \ \hat{\phi}(\mathcal{P}(s_j, \pi(s_j)))\big) \Big)^2$
where $\hat{\phi}$ and $\hat{\psi}$ are target networks. A natural question is: how does the encoder $\phi$ above perform in control tasks? We combine $\phi$ above with our policy in Algorithm 2 and use the same network $\psi$ (single hidden layer 729 wide). Figure 7 shows representations from Castro (2020) can learn control, but our method learns faster. Further, our method is simpler: comparing Equation 7 to Equation 4, our method uses the distance between the encodings instead of introducing an additional network $\psi$.
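To make the contrast concrete, the per-pair objective of our method needs no distance network: it compares a plain latent distance against a target built from the reward difference and a closed-form Wasserstein term. The sketch below is a minimal, framework-free illustration under stated assumptions: the function names are hypothetical, the dynamics model is assumed to output diagonal Gaussians given as (mean, standard deviation) pairs, the latent distance is taken to be $\ell_1$, and the stop-gradient on the target is only indicated in a comment because plain Python has no autograd.

```python
import math

def w2_diag_gaussians(mu_i, sigma_i, mu_j, sigma_j):
    """2-Wasserstein distance between two diagonal Gaussians.

    For diagonal covariances, the square-root covariance is the vector of
    standard deviations, so the Frobenius term is a sum of squared differences.
    """
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_i, mu_j))
    std_term = sum((a - b) ** 2 for a, b in zip(sigma_i, sigma_j))
    return math.sqrt(mean_term + std_term)

def dbc_pair_loss(z_i, z_j, r_i, r_j, next_i, next_j, gamma=0.99):
    """Squared error between the latent L1 distance and the bisimulation target.

    next_i, next_j: (mean, std) of the predicted next-latent Gaussians.
    In a real implementation the target would be computed without gradients.
    """
    z_dist = sum(abs(a - b) for a, b in zip(z_i, z_j))
    target = abs(r_i - r_j) + gamma * w2_diag_gaussians(*next_i, *next_j)
    return (z_dist - target) ** 2

# Hypothetical numbers for a single pair of encoded observations.
w2 = w2_diag_gaussians([0.0, 0.0], [1.0, 1.0], [3.0, 4.0], [1.0, 1.0])
loss = dbc_pair_loss(
    z_i=[0.0, 0.0], z_j=[1.0, 1.0],
    r_i=0.0, r_j=1.0,
    next_i=([0.0, 0.0], [1.0, 1.0]),
    next_j=([3.0, 4.0], [1.0, 1.0]),
    gamma=0.5,
)
```

The only learned quantities entering the distance are the encodings themselves, which is the simplification relative to training a separate network on top of the encoder.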
6.4 Autonomous Driving with Visual Redundancy
To evaluate DBC on tasks with more realistic observations, we construct a highway driving scenario with photo-realistic visual observations using the CARLA simulator (Dosovitskiy et al., 2017) shown in Figure 8. The agent’s goal is to drive as far as possible down the figure-8 highway of CARLA’s Town04 in 1000 time-steps without colliding with the 20 other moving vehicles or barriers. Our objective function rewards highway progression and penalises collisions:
$r_t = \mathbf{v}_{\text{ego}}^{\top} \hat{\mathbf{u}}_{\text{highway}} \cdot \Delta t - \lambda_i \cdot \text{impulse} - \lambda_s \cdot |\text{steer}|$
where $\mathbf{v}_{\text{ego}}$ is the velocity vector of the ego vehicle, projected onto the highway’s unit vector $\hat{\mathbf{u}}_{\text{highway}}$, and multiplied by time discretization $\Delta t$ to measure highway progression in meters. Collisions result in impulses $\text{impulse} \in \mathbb{R}^{+}$, measured in Newton-seconds. We found a steering penalty $\text{steer}$ helped, and used weights $\lambda_i$ and $\lambda_s$ on the impulse and steering terms. While more specialized objectives exist like lane-keeping, this experiment’s purpose is to compare representations with observations more characteristic of real robotic tasks. We use five cameras on the vehicle’s roof, each with 60 degree views. By concatenating the images together, our vehicle has a 300 degree view, observed as pixels. Code and install instructions in appendix.
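The structure of this objective (progress along the highway minus weighted collision and steering penalties) can be sketched as a small function. This is an illustration only, not our CARLA code: the name `highway_reward`, its arguments, and the example weights are hypothetical, and the highway direction is normalized inside the function so that any direction vector can be passed.

```python
import math

def highway_reward(velocity, highway_dir, dt, impulse, steer, w_impulse, w_steer):
    """Progress along the highway minus collision and steering penalties.

    velocity:    ego velocity vector (m/s)
    highway_dir: direction of the highway (normalized here to a unit vector)
    dt:          time discretization (s)
    impulse:     collision impulse magnitude (N*s), zero if no collision
    steer:       steering command; only its magnitude is penalized
    """
    norm = math.sqrt(sum(u * u for u in highway_dir))
    # Velocity projected onto the highway direction, times dt, gives meters of progress.
    progress = sum(v * u for v, u in zip(velocity, highway_dir)) / norm * dt
    return progress - w_impulse * impulse - w_steer * abs(steer)

# Hypothetical weights and inputs, chosen only for illustration.
clean = highway_reward((10.0, 0.0), (1.0, 0.0), 0.05, 0.0, 0.0, 1e-4, 1.0)
crash = highway_reward((10.0, 0.0), (1.0, 0.0), 0.05, 1000.0, -0.1, 1e-4, 1.0)
```

With these illustrative values, driving straight at 10 m/s for one step yields half a meter of progress, and a collision or a steering command reduces the reward accordingly.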
Real-world control systems such as robotics and autonomous vehicles must contend with a huge variety of task-irrelevant information, such as irrelevant objects (e.g. clouds) and irrelevant details (e.g. obstacle color).
Results in Figure 9 compare the same baselines as before, except for SLAC, which is easily distracted (Figure 4). Instead we used SAC, which does not explicitly learn a representation, but performs surprisingly well from raw images. DeepMDP performs well too, perhaps given its similarity to bisimulation. However, the Reconstruction and Contrastive methods again perform poorly with complex images. More intuitive metrics are in Section 6.4, and Figure 10 provides insight into the representation space as a t-SNE with corresponding observations. Training took 12 hours using an NVIDIA Quadro GP100.
7 Discussion
This paper presents Deep Bisimulation for Control: a new representation learning method that considers downstream control. Observations are encoded into representations that are invariant to different task-irrelevant details in the observation. We show this is important when learning control from outdoor images, or more generally from images with background “distractions”. In contrast to other bisimulation methods, we show performance gains when distances in representation space match the bisimulation distance between observations.
Future work: Our latent dynamics model was only used for training our encoder in Equation 4, but could also be used for multi-step planning in latent space. An ensemble of models could also help handle uncertainty better, and give robustness to distributional shift between training observations and test observations (McAllister et al., 2019).
Acknowledgements: We thank Audrey Durand for an insightful weekend of discussion that led to one of the results in this paper. This work was supported by the Office of Naval Research, the DARPA Assured Autonomy Program, and ARL DCIST CRA W911NF1720181.
References

Scalable methods for computing state similarity in deterministic Markov decision processes.
In
Association for the Advancement of Artificial Intelligence (AAAI)
, Cited by: Appendix A, Appendix A, §2, §3, §4, §6.3, §6.3.  A Simple Framework for Contrastive Learning of Visual Representations. arXiv:2002.05709 [cs, stat]. Note: arXiv: 2002.05709 External Links: Link Cited by: §2.
 Deep reinforcement learning in a handful of trials using probabilistic dynamics models. In Neural Information Processing Systems (NeurIPS), pp. 4754–4765. Cited by: §4.
 CARLA: an open urban driving simulator. arXiv preprint arXiv:1711.03938. Cited by: §1, §6.4.
 Metrics for finite Markov decision processes. In Uncertainty in Artificial Intelligence (UAI), pp. 162–169. External Links: ISBN 0974903906, Link Cited by: Appendix A, Appendix A, Appendix A, §5.
 Bisimulation metrics for continuous Markov decision processes. Society for Industrial and Applied Mathematics 40 (6), pp. 1662–1714. External Links: ISSN 00975397, Link, Document Cited by: §2, §3, §4, Definition 2.
 Bisimulation metrics are optimal value functions.. In Uncertainty in Artificial Intelligence (UAI), pp. 210–219. Cited by: §1, §2, §3.

DeepMDP: learning continuous latent space models for representation learning.
In
International Conference on Machine Learning (ICML)
, K. Chaudhuri and R. Salakhutdinov (Eds.), Vol. 97, pp. 2170–2179. Cited by: §2, §6.  Equivalence notions and model minimization in Markov decision processes. Artificial Intelligence 147, pp. 163–223. Cited by: §3, Definition 1.
 Soft actorcritic: offpolicy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290. Cited by: Appendix C, §4.
 Learning latent dynamics for planning from pixels. arXiv preprint arXiv:1811.04551. Cited by: §1, §2.
 Dataefficient image recognition with contrastive predictive coding. arXiv preprint arXiv:1905.09272. Cited by: §2.
 Causal graph based decomposition of factored MDPs. J. Mach. Learn. Res. 7, pp. 2259–2301. External Links: ISSN 15324435 Cited by: §5.
 The kinetics human action video dataset. Computing Research Repository (CoRR). External Links: Link, 1705.06950 Cited by: §6.1.

Autonomous reinforcement learning on raw visual input data in a real world application.
In
International Joint Conference on Neural Networks (IJCNN)
, pp. 1–8. External Links: Document Cited by: §2.  Deep autoencoder neural networks in reinforcement learning. In International Joint Conference on Neural Networks (IJCNN), pp. 1–8. Cited by: §2.
 Bisimulation through probabilistic testing (preliminary report). In Symposium on Principles of Programming Languages, pp. 344–352. External Links: ISBN 0897912942, Link, Document Cited by: §2, §3.
 CURL: contrastive unsupervised representations for reinforcement learning. Note: arXiv:2003.06417 Cited by: §2.
 Stochastic latent actorcritic: deep reinforcement learning with a latent variable model. arXiv preprint arXiv:1907.00953. Cited by: §1, §2, §6.
 Towards a unified theory of state abstraction for MDPs. In ISAIM, Cited by: §2, §3.
 Robustness to outofdistribution inputs via taskaware generative uncertainty. In International Conference on Robotics and Automation, Cited by: §7.
 Humanlevel control through deep reinforcement learning. Nature 518 (7540), pp. 529–533. External Links: ISSN 00280836, Link Cited by: §4.
 Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748. Cited by: §6.
 Efficient off-policy meta-reinforcement learning via probabilistic context variables. arXiv preprint arXiv:1903.08254. Cited by: §4.
 Causality for machine learning. External Links: 1911.10500 Cited by: §5.
 DeepMind control suite. Technical report Vol. abs/1504.04804, DeepMind. Note: https://arxiv.org/abs/1801.00690 External Links: Link Cited by: Appendix C, §6.1.
 Bounding performance loss in approximate MDP homomorphisms. In Neural Information Processing (NeurIPS), pp. 1649–1656. External Links: Link Cited by: §2.
 Towards quantitative verification of probabilistic transition systems. In Automata, Languages and Programming, F. Orejas, P. G. Spirakis, and J. van Leeuwen (Eds.), pp. 421–432. External Links: ISBN 9783540482246, Document Cited by: §3.
 Representation learning with contrastive predictive coding. ArXiv abs/1807.03748. Cited by: §2.
 Topics in optimal transportation. American Mathematical Society. Cited by: §3.
 From pixels to torques: policy learning with deep dynamical models. arXiv preprint arXiv:1502.02251. Cited by: §2.
 Embed to control: a locally linear latent dynamics model for control from raw images. In Neural Information Processing Systems (NeurIPS), pp. 2728–2736. Cited by: §2.

 Soft actor-critic (SAC) implementation in PyTorch. GitHub. Note: https://github.com/denisyarats/pytorch_sac Cited by: Appendix C.
 Improving sample efficiency in model-free reinforcement learning from images. arXiv preprint arXiv:1910.01741. Cited by: Appendix C, §1, §4.
 Invariant causal prediction for block MDPs. In International Conference on Machine Learning (ICML), Cited by: §5, §5, Theorem 3.
 Natural environment benchmarks for reinforcement learning. Computing Research Repository (CoRR) abs/1811.06032. External Links: Link, 1811.06032 Cited by: §6.1.
Appendix A Additional Theorems and Proofs
Theorem 1.
Let be the space of bounded pseudometrics on and a policy that is continuously improving. Define by
(8) 
Then has a least fixed point which is a bisimulation metric.
Proof.
Ideally, to prove this theorem we show that is monotonically increasing and continuous, and apply the Fixed Point Theorem to show the existence of a fixed point that converges to. Unfortunately, we can show that under as monotonically converges to is not also monotonic, unlike the original bisimulation metric setting [Ferns et al., 2004] and the policy evaluation setting [Castro, 2020]. We start the iterates from bottom , denoted as . In Ferns et al. [2004] the can be thought of as learning a policy between every pair of states to maximize their distance, and therefore this distance can only stay the same or grow over iterations of . In Castro [2020], is fixed, and under a deterministic MDP it can also be shown that the distance between states will only expand, not contract, as increases. In the policy iteration setting, however, with starting from initialization and getting updated:
(9) 
there is no guarantee that the distance between two states under policy iterations and distance metric iterations for , which is required for monotonicity.
Instead, we show that using the policy improvement theorem which gives us
(10) 
will converge to a fixed point using the Fixed Point Theorem, and taking the result by Castro [2020] that has a fixed point for every , we can show that a fixed point bisimulation metric will be found with policy iteration. ∎
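The fixed-point iteration used in this argument can be made concrete on a toy problem. The sketch below is a minimal illustration and not the paper's implementation. It assumes a fixed policy and deterministic transitions, so that the Wasserstein term reduces to the distance between successor states, and it assumes an operator of the form F(d)(s_i, s_j) = (1 - c)|r_i - r_j| + c d(s_i', s_j'). It does not model the policy-iteration part of the proof. The four-state MDP, the value c = 0.5, and all function names are hypothetical.

```python
# Illustrative sketch only: a 4-state deterministic MDP under a FIXED policy,
# so the Wasserstein term collapses to d(next(s_i), next(s_j)).
# The states, rewards, and the value of c are made up for this example.


def bisim_operator(d, rewards, next_state, c):
    """Apply one step of F(d)(i, j) = (1 - c)|r_i - r_j| + c * d(i', j')."""
    n = len(rewards)
    return [
        [
            (1.0 - c) * abs(rewards[i] - rewards[j])
            + c * d[next_state[i]][next_state[j]]
            for j in range(n)
        ]
        for i in range(n)
    ]


def fixed_point(rewards, next_state, c, tol=1e-10, max_iters=10_000):
    """Iterate F from the all-zero pseudometric until the update is below tol."""
    n = len(rewards)
    d = [[0.0] * n for _ in range(n)]
    for iteration in range(1, max_iters + 1):
        d_new = bisim_operator(d, rewards, next_state, c)
        gap = max(abs(d_new[i][j] - d[i][j]) for i in range(n) for j in range(n))
        d = d_new
        if gap < tol:
            return d, iteration
    raise RuntimeError("no convergence within max_iters")


rewards = [0.0, 1.0, 0.0, 1.0]  # reward collected in each state
next_state = [1, 0, 3, 2]       # deterministic successor under the fixed policy
d_star, n_iters = fixed_point(rewards, next_state, c=0.5)
```

In this toy MDP, states 0 and 2 receive identical reward sequences and end at distance 0, while states 0 and 1 differ in reward at every step and approach distance 1. Because c < 1 the assumed operator is a contraction in the sup norm, which is why the loop terminates.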
Theorem 5 ( is Lipschitz with respect to ).
Let be the optimal value function for a given discount factor . If , then is Lipschitz continuous with respect to with Lipschitz constant , where is the bisimilarity metric.
(11) 
See Theorem 5.1 in Ferns et al. [2004] for proof.
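To see why a condition relating the metric's transition weight to the discount factor is natural, it may help to work through a special case. The derivation below is a sketch under assumptions that are not taken from this appendix: a fixed policy π with deterministic transitions, reward sequences r_t^{(i)} and r_t^{(j)} obtained by starting from s_i and s_j, a bounded metric d with weights (1 - c) and c on the reward and transition terms, and c ≥ γ. The full result for the optimal value function is the one proved in Ferns et al. [2004].

```latex
% Assumed recursion in the deterministic, fixed-policy case:
%   d(s_i, s_j) = (1 - c)\,|r^{(i)}_0 - r^{(j)}_0| + c\, d(s_i', s_j')
% Unrolling it (the remainder c^T d(\cdot,\cdot) vanishes since d is bounded and c < 1):
\begin{align*}
d(s_i, s_j) &= (1 - c) \sum_{t \ge 0} c^{t}\, \bigl| r^{(i)}_t - r^{(j)}_t \bigr| , \\
\bigl| V^{\pi}(s_i) - V^{\pi}(s_j) \bigr|
  &= \Bigl| \sum_{t \ge 0} \gamma^{t} \bigl( r^{(i)}_t - r^{(j)}_t \bigr) \Bigr|
  \le \sum_{t \ge 0} \gamma^{t}\, \bigl| r^{(i)}_t - r^{(j)}_t \bigr| \\
  &\le \sum_{t \ge 0} c^{t}\, \bigl| r^{(i)}_t - r^{(j)}_t \bigr|
  = \frac{1}{1 - c}\, d(s_i, s_j) .
\end{align*}
```

The step from γ^t to c^t is the only place where c ≥ γ is used; with c < γ the value difference could outgrow the metric.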
Theorem 2.
Given a new aggregated MDP constructed by aggregating states in an neighborhood, and an encoder that maps from states in the original MDP to these clusters, the optimal value functions for the two MDPs are bounded as
(12) 
Proof.
From Theorem 5.1 in Ferns et al. [2004] we have:
where is the average distance between a state and all other states in its equivalence class under the bisimulation metric . By specifying a neighborhood for each cluster of states we can replace :
∎
Theorem 4.
Given an encoder that maps observations to a latent bisimulation metric representation where , encodes information about all the causal ancestors of the reward .
Proof.
We assume an MDP with a state space that can be factorized into variables with 1-step causal transition dynamics described by a causal graph (example in Figure 11). We break the proof up into two parts: 1) show that if a factor changes, the bisimulation distance between the original state and the new state is 0, and 2) show that if a factor changes, the bisimulation distance can be .
1) If , an intervention on that factor does not affect current or future reward.
If does not affect future reward, then states and will have the same future reward conditioned on all future actions. This gives us
2) If there is an intervention on then current and/or future reward can change. If current reward changes, then we already have , giving us . If only future reward changes, then those future states will have nonzero bisimilarity, and , giving us . ∎
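The two cases of the proof can be checked numerically on a small factored MDP. The following is a minimal sketch under stated assumptions, not the paper's construction: the state is a pair (x, z) in which x is a causal ancestor of the reward and z is a distractor with its own independent dynamics; the policy is fixed and transitions are deterministic; and the distance is the fixed point of an assumed operator (1 - c)|r_i - r_j| + c d(s_i', s_j'). All names and numbers are hypothetical.

```python
from itertools import product

C = 0.5  # weight on the transition term; illustrative value

# Hypothetical factored state (x, z): x is task-relevant, z is a distractor.
STATES = list(product(range(3), range(2)))


def reward(state):
    x, _z = state
    return 1.0 if x == 2 else 0.0  # the reward depends on x only


def step(state):
    x, z = state
    return (min(x + 1, 2), 1 - z)  # x drives itself; z flips independently


def bisimulation_distance(tol=1e-12, max_iters=1000):
    """Iterate the assumed operator from the all-zero pseudometric."""
    d = {(s, t): 0.0 for s in STATES for t in STATES}
    for _ in range(max_iters):
        d_new = {
            (s, t): (1 - C) * abs(reward(s) - reward(t)) + C * d[(step(s), step(t))]
            for s in STATES
            for t in STATES
        }
        if max(abs(d_new[k] - d[k]) for k in d) < tol:
            return d_new
        d = d_new
    raise RuntimeError("no convergence")


d = bisimulation_distance()
same_x = d[((0, 0), (0, 1))]  # only the distractor differs
diff_x = d[((0, 0), (1, 0))]  # a causal ancestor of the reward differs
```

Changing only z leaves the distance at zero (case 1), whereas changing x gives a positive distance even though the current rewards match, because the rewards one step later differ (case 2).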
Appendix B Additional Results
In Figure 12 we show performance on the default setting on 9 different environments from DMC. Figures 13 and 14 give performance on the simple distractors and natural video settings for all 9 environments.
Appendix C Implementation Details
We use the same encoder architecture as in Yarats et al. [2019], which is almost identical to the encoder architecture in Tassa et al. [2018], with two more convolutional layers added to the convnet trunk. The encoder has kernels of size with channels for all the convolutional layers and stride set to everywhere, except for the first convolutional layer, which has stride , and the layers are interleaved with ReLU activations. Finally, we add a tanh nonlinearity to the dimensional output of the fully-connected layer.
For the reconstruction method, the decoder consists of a fully-connected layer followed by four deconvolutional layers. We use ReLU activations after each layer, except the final deconvolutional layer, which produces the pixel representation. Each deconvolutional layer has kernels of size with channels and stride , except for the last layer, where the stride is .
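When implementing a convolutional trunk like the one described above, the input width of the fully-connected layer follows from the kernel sizes and strides. The helper below is a small sketch of that arithmetic. Every number in it (an 84-pixel input, four layers, kernel size 3, first stride 2, other strides 1, 32 channels) is a placeholder chosen for illustration and should not be read as the configuration used in the paper; the function names are hypothetical.

```python
def conv_output_size(size, kernel, stride, padding=0):
    """Spatial size after one convolution (floor convention, no dilation)."""
    return (size + 2 * padding - kernel) // stride + 1


def trunk_output_size(input_size, num_layers, kernel, first_stride, other_stride):
    """Spatial size after a stack of unpadded convolutions with a shared kernel size."""
    size = input_size
    for layer in range(num_layers):
        stride = first_stride if layer == 0 else other_stride
        size = conv_output_size(size, kernel, stride)
    return size


# Placeholder values for illustration only (not taken from the paper).
side = trunk_output_size(
    input_size=84, num_layers=4, kernel=3, first_stride=2, other_stride=1
)
channels = 32
flat_features = channels * side * side  # input width of the fully-connected layer
```

With these placeholder values the spatial side shrinks from 84 to 41, 39, 37, and finally 35, so the fully-connected layer would receive 32 × 35 × 35 features.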
The dynamics and reward models are both MLPs with two hidden layers with 200 neurons each and ReLU activations.
Soft Actor-Critic (SAC) [Haarnoja et al., 2018] is an off-policy actor-critic method that uses the maximum entropy framework for soft policy iteration. At each iteration, SAC performs soft policy evaluation and improvement steps. The policy evaluation step fits a parametric soft Q-function using transitions sampled from the replay buffer by minimizing the soft Bellman residual,
The target value function is approximated via a Monte Carlo estimate of the following expectation,
where is the target soft Q-function parameterized by a weight vector obtained from an exponential moving average of the Q-function weights to stabilize training. The policy improvement step then attempts to project a parametric policy by minimizing the KL divergence between the policy and a Boltzmann distribution induced by the Q-function, producing the following objective,
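For reference, the three quantities described in this passage are commonly written as follows in the SAC literature. This is a sketch in assumed notation rather than a recovery of the paper's own typesetting: D denotes the replay buffer, γ the discount, α the temperature, Q̄ the target soft Q-function, and Z(s_t) the partition function that normalizes the Boltzmann distribution.

```latex
% Soft Bellman residual minimized in the policy evaluation step
J(Q) = \mathbb{E}_{(s_t, a_t, r_t, s_{t+1}) \sim \mathcal{D}}
  \Bigl[ \bigl( Q(s_t, a_t) - r_t - \gamma\, \bar{V}(s_{t+1}) \bigr)^{2} \Bigr]

% Target value function, estimated with sampled actions
\bar{V}(s_t) = \mathbb{E}_{a_t \sim \pi}
  \bigl[ \bar{Q}(s_t, a_t) - \alpha \log \pi(a_t \mid s_t) \bigr]

% Policy improvement objective
J(\pi) = \mathbb{E}_{s_t \sim \mathcal{D}}
  \Bigl[ D_{\mathrm{KL}} \Bigl( \pi(\cdot \mid s_t) \,\Big\|\,
  \frac{\exp\{ \tfrac{1}{\alpha} Q(s_t, \cdot) \}}{Z(s_t)} \Bigr) \Bigr]
```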
We modify the Soft Actor-Critic PyTorch implementation by Yarats and Kostrikov [2020] and augment it with a shared encoder between the actor and critic, the general model and task-specific models. The forward models are multilayer perceptrons with ReLU nonlinearities and two hidden layers of 200 neurons each. The encoder is a linear layer that maps to a 50-dimensional hidden representation. The hyperparameters used for the RL experiments are in Table 2.
Parameter name  Value 
Replay buffer capacity  
Batch size  
Discount  
Optimizer  Adam 
Critic learning rate  
Critic target update frequency  
Critic Q-function soft-update rate  0.005 
Critic encoder soft-update rate  0.005 
Actor learning rate  
Actor update frequency  
Actor log stddev bounds  
Encoder learning rate  
Temperature learning rate  
Temperature Adam’s  
Init temperature 
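The soft-update rates in Table 2 control how quickly the target weights track the online weights through an exponential moving average. The snippet below is a minimal sketch of that update on plain Python lists. The weight values and the helper name `soft_update` are made up for illustration, and the online weights are held fixed here, whereas in training they change at every step.

```python
def soft_update(target, online, tau):
    """Move each target weight a fraction tau toward the matching online weight."""
    return [(1.0 - tau) * t + tau * o for t, o in zip(target, online)]


TAU = 0.005            # soft-update rate listed in Table 2
online = [1.0, -2.0]   # made-up online weights, held fixed for this demo
target = [0.0, 0.0]    # made-up initial target weights

for _ in range(1000):
    target = soft_update(target, online, TAU)
```

After n updates toward a fixed online weight w, a target that started at zero equals w(1 - (1 - τ)^n), so with τ = 0.005 the target covers roughly 99% of the gap after 1000 updates. A small rate such as this keeps the bootstrapping target slowly moving, which is the stabilizing effect mentioned in the text.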