1 Introduction
In reinforcement learning (RL), it is typical to model the environment as a Markov Decision Process (MDP). However, for many practical tasks, the state representations of these MDPs include a large amount of redundant information and task-irrelevant noise. For example, image observations from the Arcade Learning Environment (Bellemare et al., 2013) consist of 33,600-dimensional pixel arrays, yet it is intuitively clear that there exist lower-dimensional approximate representations for all games. Consider Pong: observing only the positions and velocities of the three objects in the frame is enough to play. Converting each frame into such a simplified state before learning a policy facilitates the learning process by reducing the redundant and irrelevant information presented to the agent. Representation learning techniques for reinforcement learning seek to improve the learning efficiency of existing RL algorithms by doing exactly this: learning a mapping from states to simplified states.
Prior work on representation learning, such as state aggregation with bisimulation metrics (Givan et al., 2003; Ferns et al., 2004, 2011) or feature discovery algorithms (Comanici & Precup, 2011; Mahadevan & Maggioni, 2007; Bellemare et al., 2019), has resulted in algorithms with good theoretical properties; however, these algorithms do not scale to large problems or are not easily combined with deep learning. On the other hand, many recently proposed approaches to representation learning via deep learning have strong empirical results on complex domains, but lack formal guarantees (Jaderberg et al., 2016; van den Oord et al., 2018; Fedus et al., 2019). In this work, we propose an approach to representation learning that unifies the desirable aspects of both of these categories: a deep-learning-friendly approach with theoretical guarantees.
We describe the DeepMDP, a latent space model of an MDP which has been trained to minimize two tractable losses: predicting the rewards and predicting the distribution of next latent states. DeepMDPs can be viewed as a formalization of recent works which use neural networks to learn latent space models of the environment (Ha & Schmidhuber, 2018; Oh et al., 2017; Hafner et al., 2018; Francois-Lavet et al., 2018), because the value functions in the DeepMDP are guaranteed to be good approximations of value functions in the original task MDP. To provide this guarantee, careful consideration of the metric between distributions is necessary. A novel analysis of Maximum Mean Discrepancy (MMD) metrics (Gretton et al., 2012) defined via a function norm allows us to provide such guarantees; this includes the Total Variation, the Wasserstein and the Energy metrics. These results represent a promising first step towards principled latent-space model-based RL algorithms.
From the perspective of representation learning, the state of a DeepMDP can be interpreted as a representation of the original MDP’s state. When the Wasserstein metric is used for the latent transition loss, analysis reveals a profound theoretical connection between DeepMDPs and bisimulation. These results provide a theoretically grounded approach to representation learning that is scalable and compatible with modern deep networks.
In Section 2, we review key concepts and formally define the DeepMDP. We start by studying the model-quality and representation-quality results of DeepMDPs (using the Wasserstein metric) in Sections 3 and 4. In Section 5, we investigate the connection between DeepMDPs using the Wasserstein metric and bisimulation. Section 6 generalizes only our model-based guarantees to metrics other than the Wasserstein; this limitation emphasizes the special role that the Wasserstein metric plays in learning good representations. Finally, in Section 8 we consider a synthetic environment with high-dimensional observations and show that a DeepMDP learns to recover its underlying low-dimensional latent structure. We then demonstrate that learning a DeepMDP as an auxiliary task to model-free RL in the Atari 2600 environment leads to significant improvement in performance when compared to a baseline model-free method.
2 Background
2.1 Markov Decision Processes
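Define a Markov Decision Process (MDP) in standard fashion: $\mathcal{M} = \langle \mathcal{S}, \mathcal{A}, \mathcal{R}, \mathcal{P}, \gamma \rangle$ (Puterman, 1994). For simplicity of notation we will assume that $\mathcal{S}$ and $\mathcal{A}$ are discrete spaces unless otherwise stated. A policy $\pi$ defines a distribution over actions conditioned on the state, $\pi(a|s)$. Denote by $\Pi$ the set of all stationary policies. The value function of a policy $\pi \in \Pi$ at a state $s$ is the expected sum of future discounted rewards obtained by running the policy from that state. $V^\pi : \mathcal{S} \to \mathbb{R}$ is defined as:
$$V^\pi(s) = \mathbb{E}_{a_t \sim \pi(\cdot|s_t),\, s_{t+1} \sim \mathcal{P}(\cdot|s_t,a_t)} \left[ \sum_{t=0}^{\infty} \gamma^t \mathcal{R}(s_t,a_t) \,\middle|\, s_0 = s \right]$$
The action value function is similarly defined:
$$Q^\pi(s,a) = \mathbb{E}_{a_t \sim \pi(\cdot|s_t),\, s_{t+1} \sim \mathcal{P}(\cdot|s_t,a_t)} \left[ \sum_{t=0}^{\infty} \gamma^t \mathcal{R}(s_t,a_t) \,\middle|\, s_0 = s, a_0 = a \right]$$
We denote by $\mathcal{P}_\pi$ the action-independent transition function induced by running a policy $\pi$, $\mathcal{P}_\pi(s'|s) = \sum_{a \in \mathcal{A}} \mathcal{P}(s'|s,a)\pi(a|s)$. Similarly, $\mathcal{R}_\pi(s) = \sum_{a \in \mathcal{A}} \mathcal{R}(s,a)\pi(a|s)$. We denote $\pi^*$ as the optimal policy in $\mathcal{M}$; i.e., the policy which maximizes expected future reward. We denote the optimal state and action value functions with respect to $\pi^*$ as $V^*, Q^*$. We denote the stationary distribution of a policy $\pi$ in $\mathcal{M}$ by $\xi_\pi$; i.e.,
$$\xi_\pi(s) = \sum_{\dot{s} \in \mathcal{S},\, \dot{a} \in \mathcal{A}} \mathcal{P}(s|\dot{s},\dot{a})\,\pi(\dot{a}|\dot{s})\,\xi_\pi(\dot{s})$$
We overload notation by also denoting the state-action stationary distribution as $\xi_\pi(s,a) = \xi_\pi(s)\pi(a|s)$. Although only non-terminating MDPs have stationary distributions, a state distribution for terminating MDPs with similar properties exists (Gelada & Bellemare, 2019).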
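The definitions above can be made concrete with a minimal sketch of iterative policy evaluation on a tabular MDP. The data layout (`P[s][a]` as a dictionary of next-state probabilities, `R[s][a]` as a float) and the function name are assumptions made for this illustration, not part of the paper.

```python
def policy_evaluation(P, R, pi, gamma, tol=1e-10):
    """Iteratively compute V^pi and Q^pi for a small tabular MDP.

    Hypothetical layout: P[s][a] is a dict {next_state: probability},
    R[s][a] is a float, pi[s][a] is the probability of action a in state s.
    """
    n_states = len(P)
    V = [0.0] * n_states
    while True:
        # Q(s, a) = R(s, a) + gamma * E_{s' ~ P(.|s, a)} V(s')
        Q = [[R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a].items())
              for a in range(len(P[s]))] for s in range(n_states)]
        # V(s) = sum_a pi(a|s) Q(s, a)
        V_new = [sum(pi[s][a] * Q[s][a] for a in range(len(P[s])))
                 for s in range(n_states)]
        if max(abs(x - y) for x, y in zip(V, V_new)) < tol:
            return V_new, Q
        V = V_new


# Two states, one action: state 0 moves to state 1 with reward 0,
# state 1 loops on itself with reward 1.
P = [[{1: 1.0}], [{1: 1.0}]]
R = [[0.0], [1.0]]
pi = [[1.0], [1.0]]
V, Q = policy_evaluation(P, R, pi, gamma=0.5)
# V is approximately [1.0, 2.0]: V(1) = 1 / (1 - 0.5), V(0) = 0.5 * V(1).
```

The loop is a direct fixed-point iteration on the Bellman equation for a fixed policy; it converges because the update is a contraction for discount factors below one.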
2.2 Latent Space Models
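For some MDP $\mathcal{M}$, let $\bar{\mathcal{M}} = \langle \bar{\mathcal{S}}, \mathcal{A}, \bar{\mathcal{R}}, \bar{\mathcal{P}}, \gamma \rangle$ be an MDP where $\bar{\mathcal{S}}$ is a continuous space with metric $d_{\bar{\mathcal{S}}}$ and a shared action space $\mathcal{A}$ between $\mathcal{M}$ and $\bar{\mathcal{M}}$. Furthermore, let $\phi : \mathcal{S} \to \bar{\mathcal{S}}$ be an embedding function which connects the state spaces of these two MDPs. We refer to $(\bar{\mathcal{M}}, \phi)$ as a latent space model of $\mathcal{M}$.
Since $\bar{\mathcal{M}}$ is, by definition, an MDP, value functions can be defined in the standard way. We use $\bar{V}^{\bar\pi}, \bar{Q}^{\bar\pi}$ to denote the value functions of a policy $\bar\pi \in \bar\Pi$, where $\bar\Pi$ is the set of policies defined on the state space $\bar{\mathcal{S}}$. The transition and reward functions, $\bar{\mathcal{P}}_{\bar\pi}$ and $\bar{\mathcal{R}}_{\bar\pi}$, of a policy $\bar\pi$ are also defined in the standard manner. We use $\bar\pi^*$ to denote the optimal policy in $\bar{\mathcal{M}}$. The corresponding optimal state and action value functions are then $\bar{V}^*, \bar{Q}^*$. For ease of notation, when $s \in \mathcal{S}$, we use $\bar\pi(\cdot|s) := \bar\pi(\cdot|\phi(s))$ to denote first using $\phi$ to map $s$ to the state space $\bar{\mathcal{S}}$ of $\bar{\mathcal{M}}$ and subsequently using $\bar\pi$ to generate the probability distribution over actions.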
Although similar definitions of latent space models have been previously studied (Francois-Lavet et al., 2018; Zhang et al., 2018; Ha & Schmidhuber, 2018; Oh et al., 2017; Hafner et al., 2018; Kaiser et al., 2019; Silver et al., 2017), the parametrizations and training objectives used to learn such models have varied widely. For example, Ha & Schmidhuber (2018), Hafner et al. (2018) and Kaiser et al. (2019) use pixel prediction losses to learn the latent representation, while Oh et al. (2017) choose instead to optimize the model to predict next latent states with the same value function as the sampled next states.
In this work, we study the minimization of loss functions defined with respect to rewards and transitions in the latent space:
$$L_{\bar{\mathcal{R}}}(s,a) = \left| \mathcal{R}(s,a) - \bar{\mathcal{R}}(\phi(s),a) \right| \tag{1}$$
$$L_{\bar{\mathcal{P}}}(s,a) = \mathcal{D}\left( \phi\mathcal{P}(\cdot|s,a),\, \bar{\mathcal{P}}(\cdot|\phi(s),a) \right) \tag{2}$$
where we use the shorthand notation $\phi\mathcal{P}(\cdot|s,a)$ to denote the probability distribution over $\bar{\mathcal{S}}$ of first sampling $s' \sim \mathcal{P}(\cdot|s,a)$ and then embedding $\bar{s}' = \phi(s')$, and where $\mathcal{D}$ is a metric between probability distributions. To provide guarantees, $\mathcal{D}$ in Equation 2 needs to be chosen carefully. For the majority of this work, we focus on the Wasserstein metric; in Section 6, we generalize some of the results to alternative metrics from the Maximum Mean Discrepancy family. Francois-Lavet et al. (2018) and Chung et al. (2019) have considered similar latent losses, but to the best of our knowledge ours is the first theoretical analysis of these models. See Figure 1 for an illustration of how the latent space losses are constructed.
We use the term DeepMDP to refer to a parameterized latent space model trained via the minimization of losses consisting of $L_{\bar{\mathcal{R}}}$ and $L_{\bar{\mathcal{P}}}$ (sometimes referred to as DeepMDP losses). In Section 3, we derive theoretical guarantees of DeepMDPs when minimizing $L_{\bar{\mathcal{R}}}$ and $L_{\bar{\mathcal{P}}}$ over the whole state space. However, our principal objective is to learn DeepMDPs parameterized by deep networks, which requires DeepMDP losses in the form of expectations; we show in Section 4 that similar theoretical guarantees can be obtained in this setting.
2.3 Wasserstein Metric
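Initially studied in the optimal transport literature (Villani, 2008), the Wasserstein-1 metric (which we simply refer to as the Wasserstein) between two distributions $P$ and $Q$, defined on a space $\chi$ with metric $d$, corresponds to the minimum cost of transforming $P$ into $Q$, where the cost of moving a particle at point $x$ to point $y$ comes from the underlying metric $d(x,y)$.
Definition 1.
The Wasserstein-1 metric $W$ between distributions $P$ and $Q$ on a metric space $\langle \chi, d \rangle$ is:
$$W_d(P,Q) = \inf_{\lambda \in \Gamma(P,Q)} \int_{\chi \times \chi} d(x,y)\,\lambda(x,y)\,dx\,dy$$
where $\Gamma(P,Q)$ denotes the set of all couplings of $P$ and $Q$.
When there is no ambiguity on what the underlying metric $d$ is, we will simply write $W$. The Monge-Kantorovich duality (Müller, 1997) shows that the Wasserstein has a dual form:
$$W_d(P,Q) = \sup_{f \in \mathcal{F}_d} \left| \mathbb{E}_{x \sim P} f(x) - \mathbb{E}_{y \sim Q} f(y) \right| \tag{3}$$
where $\mathcal{F}_d$ is the set of 1-Lipschitz functions under the metric $d$, $\mathcal{F}_d = \{ f : |f(x) - f(y)| \leq d(x,y) \}$.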
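As a small illustration of the optimal-transport view, the sketch below computes the Wasserstein-1 distance between two discrete distributions on the real line. It relies on the standard fact that, in one dimension with $d(x,y) = |x - y|$, the distance equals the integral of the absolute difference between the two cumulative distribution functions. The function name and the dictionary layout are assumptions of this example.

```python
def wasserstein_1d(p, q):
    """W1 between two discrete distributions on the real line.

    p and q are dicts {point: probability}. With d(x, y) = |x - y|,
    W1 equals the integral of |F_p(x) - F_q(x)| over x.
    """
    points = sorted(set(p) | set(q))
    total, cdf_gap = 0.0, 0.0
    for left, right in zip(points, points[1:]):
        # Running difference between the two CDFs on [left, right).
        cdf_gap += p.get(left, 0.0) - q.get(left, 0.0)
        total += abs(cdf_gap) * (right - left)
    return total


# Moving half the mass from 0 to 1 and half from 2 to 1 costs 0.5 + 0.5.
cost = wasserstein_1d({0.0: 0.5, 2.0: 0.5}, {1.0: 1.0})  # 1.0
```

For two point masses the result is simply the distance between the two points, which is the special case used later when transitions are assumed deterministic.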
2.4 Lipschitz Norm of Value Functions
The degree to which a value function of $\bar{\mathcal{M}}$, $\bar{V}^{\bar\pi}$, approximates the value function $V^{\bar\pi}$ of $\mathcal{M}$ will depend on the Lipschitz norm of $\bar{V}^{\bar\pi}$. In this section we define and provide conditions for value functions to be Lipschitz. [Footnote 1: Another benefit of MDP smoothness is improved learning dynamics. Pirotta et al. (2015) suggest that the smaller the Lipschitz constant of an MDP, the faster it is to converge to a near-optimal policy.] Note that we study the Lipschitz properties of DeepMDPs $\bar{\mathcal{M}}$ (instead of an MDP $\mathcal{M}$) because in this work, only the Lipschitz properties of DeepMDPs are relevant; the reader should note that these results follow for any continuous MDP with a metric state space.
We say a policy $\bar\pi \in \bar\Pi$ is Lipschitz-valued if its value function is Lipschitz, i.e. it has Lipschitz $\bar{V}^{\bar\pi}$ and $\bar{Q}^{\bar\pi}$ functions.
Definition 2.
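Let $\bar{\mathcal{M}}$ be a DeepMDP with a metric $d_{\bar{\mathcal{S}}}$. A policy $\bar\pi \in \bar\Pi$ is $K_{\bar V}$-Lipschitz-valued if for all $\bar{s}_1, \bar{s}_2 \in \bar{\mathcal{S}}$:
$$\left| \bar{V}^{\bar\pi}(\bar{s}_1) - \bar{V}^{\bar\pi}(\bar{s}_2) \right| \leq K_{\bar V}\, d_{\bar{\mathcal{S}}}(\bar{s}_1, \bar{s}_2)$$
and if for all $a \in \mathcal{A}$:
$$\left| \bar{Q}^{\bar\pi}(\bar{s}_1, a) - \bar{Q}^{\bar\pi}(\bar{s}_2, a) \right| \leq K_{\bar V}\, d_{\bar{\mathcal{S}}}(\bar{s}_1, \bar{s}_2)$$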
Several works have studied Lipschitz norm constraints on the transition and reward functions (Hinderer, 2005; Asadi et al., 2018) to provide conditions for value functions to be Lipschitz. Closely following their formulation, we define Lipschitz DeepMDPs as follows:
Definition 3.
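Let $\bar{\mathcal{M}}$ be a DeepMDP with a metric $d_{\bar{\mathcal{S}}}$. We say $\bar{\mathcal{M}}$ is $(K_{\bar{\mathcal{R}}}, K_{\bar{\mathcal{P}}})$-Lipschitz if, for all $\bar{s}_1, \bar{s}_2 \in \bar{\mathcal{S}}$ and $a \in \mathcal{A}$:
$$\left| \bar{\mathcal{R}}(\bar{s}_1, a) - \bar{\mathcal{R}}(\bar{s}_2, a) \right| \leq K_{\bar{\mathcal{R}}}\, d_{\bar{\mathcal{S}}}(\bar{s}_1, \bar{s}_2)$$
$$W\left( \bar{\mathcal{P}}(\cdot|\bar{s}_1, a),\, \bar{\mathcal{P}}(\cdot|\bar{s}_2, a) \right) \leq K_{\bar{\mathcal{P}}}\, d_{\bar{\mathcal{S}}}(\bar{s}_1, \bar{s}_2)$$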
From here onwards, we will restrict our attention to the set of Lipschitz DeepMDPs for which the constant $K_{\bar{\mathcal{P}}}$ is sufficiently small, formalized in the following assumption:
Assumption 1.
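The Lipschitz constant $K_{\bar{\mathcal{P}}}$ of the transition function $\bar{\mathcal{P}}$ is strictly smaller than $\frac{1}{\gamma}$.
From a practical standpoint, Assumption 1 is relatively strong, but simplifies our analysis by ensuring that close states cannot have future trajectories that are “divergent.” An MDP might still not exhibit divergent behaviour even when $K_{\bar{\mathcal{P}}} \geq \frac{1}{\gamma}$. In particular, when episodes terminate after a finite amount of time, Assumption 1 becomes unnecessary. We leave as future work how to improve on this assumption.
We describe a small set of Lipschitz-valued policies. For any policy $\bar\pi \in \bar\Pi$, we refer to the Lipschitz norm of its transition function $\bar{\mathcal{P}}_{\bar\pi}$ as $K_{\bar{\mathcal{P}}_{\bar\pi}}$, i.e. $W\left( \bar{\mathcal{P}}_{\bar\pi}(\cdot|\bar{s}_1),\, \bar{\mathcal{P}}_{\bar\pi}(\cdot|\bar{s}_2) \right) \leq K_{\bar{\mathcal{P}}_{\bar\pi}}\, d_{\bar{\mathcal{S}}}(\bar{s}_1, \bar{s}_2)$ for all $\bar{s}_1, \bar{s}_2 \in \bar{\mathcal{S}}$. Similarly, we denote the Lipschitz norm of the reward function $\bar{\mathcal{R}}_{\bar\pi}$ as $K_{\bar{\mathcal{R}}_{\bar\pi}}$.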
Lemma 1.
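Let $\bar{\mathcal{M}}$ be $(K_{\bar{\mathcal{R}}}, K_{\bar{\mathcal{P}}})$-Lipschitz. Then,
1. The optimal policy $\bar\pi^*$ is $\frac{K_{\bar{\mathcal{R}}}}{1 - \gamma K_{\bar{\mathcal{P}}}}$-Lipschitz-valued.
2. All policies with $K_{\bar{\mathcal{P}}_{\bar\pi}} < \frac{1}{\gamma}$ are $\frac{K_{\bar{\mathcal{R}}_{\bar\pi}}}{1 - \gamma K_{\bar{\mathcal{P}}_{\bar\pi}}}$-Lipschitz-valued.
3. All constant policies (i.e. $\bar\pi(a|\bar{s}_1) = \bar\pi(a|\bar{s}_2)$ for all $a \in \mathcal{A}$ and $\bar{s}_1, \bar{s}_2 \in \bar{\mathcal{S}}$) are $\frac{K_{\bar{\mathcal{R}}}}{1 - \gamma K_{\bar{\mathcal{P}}}}$-Lipschitz-valued.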
Proof.
See Appendix A for all proofs. ∎
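An informal sketch may help to see why a discount-dependent constant appears and why the transition constant must stay below $1/\gamma$. The sketch assumes that the value function already has some finite Lipschitz constant $K_V$ and uses the shorthand $K_R$, $K_P$ for the reward and transition constants of the latent model; these symbols are introduced here for illustration, and the rigorous argument is the one given in the appendix.

```latex
% Heuristic fixed-point argument (assumes a finite constant K_V exists).
\begin{align*}
|\bar{Q}(\bar{s}_1,a) - \bar{Q}(\bar{s}_2,a)|
  &\le |\bar{\mathcal{R}}(\bar{s}_1,a) - \bar{\mathcal{R}}(\bar{s}_2,a)|
     + \gamma \left| \mathbb{E}_{\bar{s}' \sim \bar{\mathcal{P}}(\cdot|\bar{s}_1,a)} \bar{V}(\bar{s}')
                   - \mathbb{E}_{\bar{s}' \sim \bar{\mathcal{P}}(\cdot|\bar{s}_2,a)} \bar{V}(\bar{s}') \right| \\
  &\le K_R\, d(\bar{s}_1,\bar{s}_2)
     + \gamma K_V\, W\!\left(\bar{\mathcal{P}}(\cdot|\bar{s}_1,a), \bar{\mathcal{P}}(\cdot|\bar{s}_2,a)\right) \\
  &\le (K_R + \gamma K_V K_P)\, d(\bar{s}_1,\bar{s}_2).
\end{align*}
% If K_V is the smallest valid constant, then K_V <= K_R + gamma K_P K_V, so
\[
K_V \le \frac{K_R}{1 - \gamma K_P} \qquad \text{whenever } \gamma K_P < 1 .
\]
```

The second inequality uses the dual form of the Wasserstein applied to the function $\bar{V}/K_V$, which is 1-Lipschitz by assumption. The step from $\bar{Q}$ back to $\bar{V}$ is immediate for the optimal policy (a maximum over actions) and for constant policies (a fixed average over actions), which is consistent with the cases singled out in the lemma.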
A more general framework for understanding Lipschitz value functions is still lacking. Little prior work studying classes of Lipschitzvalued policies exists in the literature and we believe that this is an important direction for future research.
3 Global DeepMDP Bounds
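We now present our first main contributions: concrete DeepMDP losses, and several bounds which provide us with useful guarantees when these losses are minimized. We refer to these losses as the global DeepMDP losses, to emphasize their dependence on the whole state and action space. [Footnote 2: The $\infty$ notation is a reference to the $\ell_\infty$ norm.]
$$L^{\infty}_{\bar{\mathcal{R}}} = \sup_{s \in \mathcal{S},\, a \in \mathcal{A}} \left| \mathcal{R}(s,a) - \bar{\mathcal{R}}(\phi(s),a) \right| \tag{4}$$
$$L^{\infty}_{\bar{\mathcal{P}}} = \sup_{s \in \mathcal{S},\, a \in \mathcal{A}} W\left( \phi\mathcal{P}(\cdot|s,a),\, \bar{\mathcal{P}}(\cdot|\phi(s),a) \right) \tag{5}$$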
3.1 Value Difference Bound
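We start by bounding the difference of the value functions $Q^{\bar\pi}$ and $\bar{Q}^{\bar\pi}$ for any policy $\bar\pi \in \bar\Pi$. Note that $Q^{\bar\pi}(s,a)$ is computed using $\mathcal{P}$ and $\mathcal{R}$ on $\mathcal{S}$ while $\bar{Q}^{\bar\pi}(\phi(s),a)$ is computed using $\bar{\mathcal{P}}$ and $\bar{\mathcal{R}}$ on $\bar{\mathcal{S}}$.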
Lemma 2.
Let $\mathcal{M}$ and $\bar{\mathcal{M}}$ be an MDP and DeepMDP respectively, with an embedding function $\phi$ and global loss functions $L^{\infty}_{\bar{\mathcal{R}}}$ and $L^{\infty}_{\bar{\mathcal{P}}}$. For any $K_{\bar V}$-Lipschitz-valued policy $\bar\pi \in \bar\Pi$ the value difference can be bounded by
$$\left| Q^{\bar\pi}(s,a) - \bar{Q}^{\bar\pi}(\phi(s),a) \right| \leq \frac{L^{\infty}_{\bar{\mathcal{R}}} + \gamma K_{\bar V} L^{\infty}_{\bar{\mathcal{P}}}}{1 - \gamma}$$
The previous result holds for all policies $\bar\Pi \subseteq \Pi$, a subset of all possible policies $\Pi$. The reader might ask whether this is an interesting set of policies to consider; in Section 5, we answer in the affirmative by characterizing this set via a connection with bisimulation.
A bound similar to Lemma 2 can be found in Asadi et al. (2018), who study non-latent transition models using the Wasserstein metric when there is access to an exact reward function. We also note that our results are arguably simpler, since we do not require the treatment of MDP transitions in terms of distributions over a set of deterministic components.
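To illustrate how a value difference bound of this kind arises, here is a short derivation sketch. It assumes bounded rewards (so that the worst-case gap $\Delta$ is finite) and writes $L_R^\infty$, $L_P^\infty$ for the global reward and transition losses and $K_V$ for the Lipschitz constant of the latent value function; it is meant as a reading aid and does not replace the proof in Appendix A.

```latex
% Let \Delta = \sup_{s,a} |Q^{\bar\pi}(s,a) - \bar{Q}^{\bar\pi}(\phi(s),a)|.
\begin{align*}
|Q^{\bar\pi}(s,a) - \bar{Q}^{\bar\pi}(\phi(s),a)|
  &\le |\mathcal{R}(s,a) - \bar{\mathcal{R}}(\phi(s),a)|
     + \gamma \left| \mathbb{E}_{s' \sim \mathcal{P}(\cdot|s,a)} V^{\bar\pi}(s')
                   - \mathbb{E}_{\bar{s}' \sim \bar{\mathcal{P}}(\cdot|\phi(s),a)} \bar{V}^{\bar\pi}(\bar{s}') \right| \\
  &\le L_R^\infty
     + \gamma \left| \mathbb{E}_{s' \sim \mathcal{P}(\cdot|s,a)} \left[ V^{\bar\pi}(s') - \bar{V}^{\bar\pi}(\phi(s')) \right] \right| \\
  &\quad + \gamma \left| \mathbb{E}_{\bar{s}' \sim \phi\mathcal{P}(\cdot|s,a)} \bar{V}^{\bar\pi}(\bar{s}')
                   - \mathbb{E}_{\bar{s}' \sim \bar{\mathcal{P}}(\cdot|\phi(s),a)} \bar{V}^{\bar\pi}(\bar{s}') \right| \\
  &\le L_R^\infty + \gamma \Delta + \gamma K_V L_P^\infty .
\end{align*}
% Taking the supremum over (s,a) on the left-hand side and rearranging:
\[
\Delta \le \frac{L_R^\infty + \gamma K_V L_P^\infty}{1 - \gamma} .
\]
```

The middle term is bounded by $\Delta$ because the policy acts identically on $s'$ and $\phi(s')$, so the state-value gap is an average of action-value gaps. The last term is bounded through the dual form of the Wasserstein, since $\bar{V}^{\bar\pi}/K_V$ is 1-Lipschitz.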
3.2 Representation Quality Bound
When a representation is used to predict the value of a policy in $\mathcal{M}$, a clear failure case is when two states with different values are collapsed to the same representation. The following result demonstrates that when the global DeepMDP losses $L^{\infty}_{\bar{\mathcal{R}}} = 0$ and $L^{\infty}_{\bar{\mathcal{P}}} = 0$, this failure case can never occur for the embedding function $\phi$.
Theorem 1.
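Let $\mathcal{M}$ and $\bar{\mathcal{M}}$ be an MDP and DeepMDP respectively, let $d_{\bar{\mathcal{S}}}$ be a metric in $\bar{\mathcal{S}}$, $\phi$ be an embedding function and $L^{\infty}_{\bar{\mathcal{R}}}$ and $L^{\infty}_{\bar{\mathcal{P}}}$ be the global loss functions. For any $K_{\bar V}$-Lipschitz-valued policy $\bar\pi \in \bar\Pi$ the representation $\phi$ guarantees that for all $s_1, s_2 \in \mathcal{S}$ and $a \in \mathcal{A}$,
$$\left| Q^{\bar\pi}(s_1,a) - Q^{\bar\pi}(s_2,a) \right| \leq K_{\bar V}\, d_{\bar{\mathcal{S}}}(\phi(s_1), \phi(s_2)) + 2\, \frac{L^{\infty}_{\bar{\mathcal{R}}} + \gamma K_{\bar V} L^{\infty}_{\bar{\mathcal{P}}}}{1 - \gamma}$$
This result justifies learning a DeepMDP and using the embedding function $\phi$ as a representation to predict values. A similar connection between the quality of representations and model-based objectives in the linear setting was made by Parr et al. (2008).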
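The structure of such a representation bound can be sketched with a triangle inequality. The sketch assumes a value difference bound $B$ between the original and latent action-value functions (as in Section 3.1) and a latent value function that is $K_V$-Lipschitz; both symbols are shorthand introduced for this illustration.

```latex
% B denotes any uniform bound on |Q^{\bar\pi}(s,a) - \bar{Q}^{\bar\pi}(\phi(s),a)|.
\begin{align*}
|Q^{\bar\pi}(s_1,a) - Q^{\bar\pi}(s_2,a)|
  &\le |Q^{\bar\pi}(s_1,a) - \bar{Q}^{\bar\pi}(\phi(s_1),a)| \\
  &\quad + |\bar{Q}^{\bar\pi}(\phi(s_1),a) - \bar{Q}^{\bar\pi}(\phi(s_2),a)| \\
  &\quad + |\bar{Q}^{\bar\pi}(\phi(s_2),a) - Q^{\bar\pi}(s_2,a)| \\
  &\le K_V\, d(\phi(s_1),\phi(s_2)) + 2B .
\end{align*}
```

Read this way, the guarantee has two ingredients: the latent value function cannot change quickly between nearby embeddings, and the latent model is accurate at both states. If the losses vanish and two states share an embedding, the right-hand side is zero, so states with different values cannot be collapsed.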
3.3 Suboptimality Bound
4 Local DeepMDP Bounds
In large-scale tasks, data from many regions of the state space is often unavailable, making it infeasible to measure, let alone optimize, the global losses. [Footnote 3: Challenging exploration environments like Montezuma’s Revenge are a prime example.] Further, when the capacity of a model is limited, or when sample efficiency is a concern, it might not even be desirable to precisely learn a model of the whole state space. Interestingly, we can still provide similar guarantees based on the DeepMDP losses, as measured under an expectation over a state-action distribution, denoted here as $\xi$. We refer to these as the losses local to $\xi$. Taking $L^{\xi}_{\bar{\mathcal{R}}}$, $L^{\xi}_{\bar{\mathcal{P}}}$ to be the reward and transition losses under $\xi$, respectively, we have the following local DeepMDP losses:
$$L^{\xi}_{\bar{\mathcal{R}}} = \mathbb{E}_{s,a \sim \xi} \left| \mathcal{R}(s,a) - \bar{\mathcal{R}}(\phi(s),a) \right| \tag{6}$$
$$L^{\xi}_{\bar{\mathcal{P}}} = \mathbb{E}_{s,a \sim \xi} \left[ W\left( \phi\mathcal{P}(\cdot|s,a),\, \bar{\mathcal{P}}(\cdot|\phi(s),a) \right) \right] \tag{7}$$
Losses of this form are compatible with the stochastic gradient descent methods used by neural networks. Thus, study of the local losses allows us to bridge the gap between theory and practice.
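The difference between the global and the local losses can be seen on a toy example. The sketch below assumes deterministic transitions and a scalar latent space, so that the Wasserstein term reduces to an absolute difference between two latent points; all function and argument names are hypothetical and chosen for this illustration only.

```python
def deepmdp_losses(states, actions, R, step, phi, R_bar, step_bar, xi):
    """Global (sup) and local (expected) losses, deterministic toy case.

    step(s, a) and step_bar(z, a) are deterministic transition functions,
    phi maps states to scalar latents, and xi[(s, a)] is a state-action
    distribution. With deterministic transitions the Wasserstein term
    reduces to |phi(s') - step_bar(phi(s), a)|.
    """
    pairs = [(s, a) for s in states for a in actions]
    reward_loss = {sa: abs(R(*sa) - R_bar(phi(sa[0]), sa[1])) for sa in pairs}
    trans_loss = {sa: abs(phi(step(*sa)) - step_bar(phi(sa[0]), sa[1]))
                  for sa in pairs}
    # Global losses: worst case over every state-action pair.
    global_losses = (max(reward_loss.values()), max(trans_loss.values()))
    # Local losses: expectation under the distribution xi.
    local_losses = (sum(xi[sa] * reward_loss[sa] for sa in pairs),
                    sum(xi[sa] * trans_loss[sa] for sa in pairs))
    return global_losses, local_losses


# A three-state chain whose latent model forgets that the chain stops at 2.
global_l, local_l = deepmdp_losses(
    states=[0, 1, 2], actions=[0],
    R=lambda s, a: float(s == 2), step=lambda s, a: min(s + 1, 2),
    phi=float, R_bar=lambda z, a: float(z == 2.0),
    step_bar=lambda z, a: z + 1.0,
    xi={(0, 0): 0.5, (1, 0): 0.25, (2, 0): 0.25})
# global_l == (0.0, 1.0) and local_l == (0.0, 0.25)
```

The single badly modelled pair dominates the global transition loss, while the local loss weights it by how often it is visited under the chosen distribution.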
4.1 Value Difference Bound
We provide a value function bound for the local case, analogous to Lemma 2.
Lemma 3.
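Let $\mathcal{M}$ and $\bar{\mathcal{M}}$ be an MDP and DeepMDP respectively, with an embedding function $\phi$. For any $K_{\bar V}$-Lipschitz-valued policy $\bar\pi \in \bar\Pi$, the expected value function difference can be bounded using the local loss functions $L^{\xi_{\bar\pi}}_{\bar{\mathcal{R}}}$ and $L^{\xi_{\bar\pi}}_{\bar{\mathcal{P}}}$ measured under $\xi_{\bar\pi}$, the stationary state-action distribution of $\bar\pi$.
$$\mathbb{E}_{s,a \sim \xi_{\bar\pi}} \left| Q^{\bar\pi}(s,a) - \bar{Q}^{\bar\pi}(\phi(s),a) \right| \leq \frac{L^{\xi_{\bar\pi}}_{\bar{\mathcal{R}}} + \gamma K_{\bar V} L^{\xi_{\bar\pi}}_{\bar{\mathcal{P}}}}{1 - \gamma}$$
The provided bound guarantees that for any policy $\bar\pi \in \bar\Pi$ which visits state-action pairs $(s,a)$ where $L_{\bar{\mathcal{R}}}(s,a)$ and $L_{\bar{\mathcal{P}}}(s,a)$ are small, the DeepMDP will provide accurate value functions for any states likely to be seen under the policy. [Footnote 4: The value functions might be inaccurate in states that the policy rarely visits.]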
4.2 Representation Quality Bound
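We can also extend the local value difference bound to provide a local bound on how well the representation $\phi$ can be used to predict the value function of a policy $\bar\pi \in \bar\Pi$, analogous to Theorem 1.
Theorem 2.
Let $\mathcal{M}$ and $\bar{\mathcal{M}}$ be an MDP and DeepMDP respectively, let $d_{\bar{\mathcal{S}}}$ be the metric in $\bar{\mathcal{S}}$ and $\phi$ be the embedding function. Let $\bar\pi \in \bar\Pi$ be any $K_{\bar V}$-Lipschitz-valued policy with stationary distribution $\xi_{\bar\pi}$, and let $L^{\xi_{\bar\pi}}_{\bar{\mathcal{R}}}$ and $L^{\xi_{\bar\pi}}_{\bar{\mathcal{P}}}$ be the local loss functions. For any two states $s_1, s_2 \in \mathcal{S}$, the representation $\phi$ is such that,
$$\left| V^{\bar\pi}(s_1) - V^{\bar\pi}(s_2) \right| \leq K_{\bar V}\, d_{\bar{\mathcal{S}}}(\phi(s_1), \phi(s_2)) + \frac{L^{\xi_{\bar\pi}}_{\bar{\mathcal{R}}} + \gamma K_{\bar V} L^{\xi_{\bar\pi}}_{\bar{\mathcal{P}}}}{1 - \gamma} \left( \frac{1}{\xi_{\bar\pi}(s_1)} + \frac{1}{\xi_{\bar\pi}(s_2)} \right)$$
Thus, the representation quality argument given in Section 3.2 holds for any two states $s_1$ and $s_2$ which are visited often by a policy $\bar\pi$.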
5 Bisimulation
5.1 Bisimulation Relations
Bisimulation relations in the context of RL (Givan et al., 2003) are a formalization of behavioural equivalence between states.
Definition 4 (Givan et al. (2003)).
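Given an MDP $\mathcal{M}$, an equivalence relation $B$ between states is a bisimulation relation if for all states $s_1, s_2 \in \mathcal{S}$ that are equivalent under $B$ (i.e. $s_1 B s_2$), the following conditions hold for all actions $a \in \mathcal{A}$:
$$\mathcal{R}(s_1,a) = \mathcal{R}(s_2,a)$$
$$\mathcal{P}(G|s_1,a) = \mathcal{P}(G|s_2,a), \quad \forall G \in \mathcal{S}/B$$
where $\mathcal{S}/B$ denotes the partition of $\mathcal{S}$ under the relation $B$, the set of all groups $G$ of equivalent states, and where $\mathcal{P}(G|s,a) = \sum_{s' \in G} \mathcal{P}(s'|s,a)$.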
Note that bisimulation relations are not unique. For example, the equality relation $=$ is always a bisimulation relation. Of particular interest is the maximal bisimulation relation $\sim$, which defines the partition $\mathcal{S}/\!\sim$ with the fewest elements (or equivalently, the relation that generates the largest possible groups of states). We will say that two states are bisimilar if they are equivalent under $\sim$. Essentially, two states are bisimilar if (1) they have the same immediate reward for all actions and (2) both of their distributions over next states contain states which themselves are bisimilar. Figure 2 gives an example of states that are bisimilar in the Atari 2600 game Asteroids. An important property of bisimulation relations is that any two bisimilar states $s_1, s_2$ must have the same optimal value function, $Q^*(s_1,a) = Q^*(s_2,a)$ for all $a \in \mathcal{A}$. Bisimulation relations were first introduced for state aggregation (Givan et al., 2003), which is a form of representation learning, since merging behaviourally equivalent states does not result in the loss of information necessary for solving the MDP.
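For a small tabular MDP, the coarsest bisimulation partition can be found by repeatedly splitting groups of states, as in the following minimal sketch. The data layout and function name are assumptions of this example, and probabilities are compared with exact float equality, which is adequate for toy inputs only.

```python
def maximal_bisimulation(P, R, n_actions):
    """Partition states into bisimulation classes by iterative refinement.

    P[s][a] is a dict {next_state: probability}; R[s][a] is a float.
    Returns a list mapping each state to a block id.
    """
    n_states = len(P)
    block = [0] * n_states  # start with every state in a single block
    while True:
        signatures = []
        for s in range(n_states):
            sig = []
            for a in range(n_actions):
                # Probability mass sent to each block of the current partition.
                mass = {}
                for s2, p in P[s][a].items():
                    if p > 0:
                        mass[block[s2]] = mass.get(block[s2], 0.0) + p
                sig.append((R[s][a], tuple(sorted(mass.items()))))
            signatures.append(tuple(sig))
        # States stay together only if rewards and block masses agree.
        ids = {}
        new_block = [ids.setdefault(sig, len(ids)) for sig in signatures]
        if len(ids) == len(set(block)):
            return new_block  # no block was split: partition is stable
        block = new_block


# States 0 and 1 both lead to state 2; state 2 leads to the absorbing state 3.
P = [[{2: 1.0}], [{2: 1.0}], [{3: 1.0}], [{3: 1.0}]]
R = [[0.0], [0.0], [1.0], [0.0]]
blocks = maximal_bisimulation(P, R, n_actions=1)
# States 0 and 1 share a block; states 2 and 3 each get their own.
```

Each pass can only split existing blocks, so the number of blocks is non-decreasing and the loop stops as soon as a pass leaves it unchanged.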
5.2 Bisimulation Metrics
A drawback of bisimulation relations is their all-or-nothing nature. Two states that are nearly identical, but differ slightly in their reward or transition functions, are treated as though they were just as unrelated as two states with nothing in common. Relying on the optimal transport perspective of the Wasserstein, Ferns et al. (2004) introduced bisimulation metrics, which are pseudometrics that quantify the behavioural similarity of two discrete states.
A pseudometric $d$ satisfies all the properties of a metric except identity of indiscernibles, $d(x,y) = 0 \Leftrightarrow x = y$. A pseudometric can be used to define an equivalence relation by saying that two points are equivalent if they have zero distance; this is called the kernel of the pseudometric. Note that pseudometrics must obey the triangle inequality, which ensures that the kernel is transitive. Without any changes to its definition, the Wasserstein metric can be extended to spaces $\langle \chi, d \rangle$, where $d$ is a pseudometric. Intuitively, the usage of a pseudometric in the Wasserstein can be interpreted as allowing different points $x_1 \neq x_2$ in $\chi$ to be equivalent under the pseudometric (i.e. $d(x_1, x_2) = 0$). Thus, there is no need for transportation from one to the other.
Ferns et al. (2011) proposed an extension of bisimulation metrics, based on Banach fixed points, which allows the metric to be defined for MDPs with both discrete and continuous state spaces.
Definition 5 (Ferns et al. (2011)).
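Let $\mathcal{M}$ be an MDP and denote by $\mathbb{Z}$ the space of pseudometrics on the space $\mathcal{S}$ s.t. $d(s_1, s_2) \in [0, \infty)$ for $d \in \mathbb{Z}$. Define the operator $F : \mathbb{Z} \to \mathbb{Z}$ to be:
$$F d(s_1, s_2) = \max_{a} \; (1 - \gamma) \left| \mathcal{R}(s_1,a) - \mathcal{R}(s_2,a) \right| + \gamma\, W_d\left( \mathcal{P}(\cdot|s_1,a),\, \mathcal{P}(\cdot|s_2,a) \right)$$
Then:
1. The operator $F$ is a contraction with a unique fixed point denoted by $\tilde{d}$.
2. The kernel of $\tilde{d}$ is the maximal bisimulation relation $\sim$ (i.e. $\tilde{d}(s_1, s_2) = 0 \Leftrightarrow s_1 \sim s_2$).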
A useful property of bisimulation metrics is that the optimal value function difference between any two states can be upper bounded by the bisimulation metric between the two states.
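A fixed-point iteration of this kind is easy to sketch for a deterministic tabular MDP, where the Wasserstein distance between two next-state distributions collapses to the current distance between the two successor states. The weighting of the reward term by $(1 - \gamma)$ and of the transition term by $\gamma$ is an assumption of this sketch, as are the function name and data layout.

```python
def bisimulation_metric(next_state, R, gamma, n_iters=200):
    """Fixed-point iteration for a bisimulation metric, deterministic case.

    next_state[s][a] is the successor state and R[s][a] the reward.
    With deterministic transitions, the Wasserstein distance between the
    two next-state distributions is just d(next1, next2).
    """
    n = len(R)
    n_actions = len(R[0])
    d = [[0.0] * n for _ in range(n)]
    for _ in range(n_iters):
        # Assumed operator: (1 - gamma) * reward gap + gamma * successor distance.
        d = [[max((1 - gamma) * abs(R[s1][a] - R[s2][a])
                  + gamma * d[next_state[s1][a]][next_state[s2][a]]
                  for a in range(n_actions))
              for s2 in range(n)] for s1 in range(n)]
    return d


# Same chain as before: 0 -> 2, 1 -> 2, 2 -> 3, 3 -> 3, reward 1 only in state 2.
next_state = [[2], [2], [3], [3]]
R = [[0.0], [0.0], [1.0], [0.0]]
d = bisimulation_metric(next_state, R, gamma=0.5)
# d[0][1] is 0.0 (bisimilar states), d[2][3] is 0.5, d[0][3] is 0.25.
```

Unlike the all-or-nothing relation, the resulting distances grade how soon and by how much the behaviour of two states diverges.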
Bisimulation metrics have been used for state aggregation (Ferns et al., 2004; Ruan et al., 2015), feature discovery (Comanici & Precup, 2011) and transfer learning between MDPs (Castro & Precup, 2010), but due to their high computational cost and poor compatibility with deep networks they have not been successfully applied to large-scale settings.
5.3 Connection with DeepMDPs
The representation learned by global DeepMDP losses with the Wasserstein metric can be connected to bisimulation metrics.
Theorem 3.
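Let $\mathcal{M}$ be an MDP and $\bar{\mathcal{M}}$ be a $(K_{\bar{\mathcal{R}}}, K_{\bar{\mathcal{P}}})$-Lipschitz DeepMDP with metric $d_{\bar{\mathcal{S}}}$. Let $\phi$ be the embedding function and $L^{\infty}_{\bar{\mathcal{R}}}$ and $L^{\infty}_{\bar{\mathcal{P}}}$ be the global DeepMDP losses. The bisimulation distance in $\mathcal{M}$, $\tilde{d}$, can be upper-bounded by the distance in the embedding and the losses in the following way:
$$\tilde{d}(s_1, s_2) \leq \frac{(1 - \gamma) K_{\bar{\mathcal{R}}}}{1 - \gamma K_{\bar{\mathcal{P}}}}\, d_{\bar{\mathcal{S}}}(\phi(s_1), \phi(s_2)) + 2 \left( L^{\infty}_{\bar{\mathcal{R}}} + \gamma L^{\infty}_{\bar{\mathcal{P}}} \frac{K_{\bar{\mathcal{R}}}}{1 - \gamma K_{\bar{\mathcal{P}}}} \right)$$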
This result provides a similar bound to Theorem 1, except that instead of bounding the value difference the bisimulation distance is bounded. We speculate that similar results should be possible based on local DeepMDP losses, but they would require a generalization of bisimulation metrics to the local setting.
5.4 Characterizing
In order to better understand the set of policies $\bar\Pi$ (which appears in the bounds of Sections 3 and 4), we first consider the set of bisimilar policies, which contains all policies that act the same way on states that are bisimilar. Although this set excludes many policies in $\Pi$, we argue that it is adequately expressive, since any policy that acts differently on states that are bisimilar is fundamentally uninteresting. [Footnote 5: For control, searching over these policies increases the size of the search space with no benefits on the optimality of the solution.]
We show a connection between deep policies and bisimilar policies by proving that the set of Lipschitz-deep policies approximately contains the set of Lipschitz-bisimilar policies, defined as follows:
The following theorem proves that minimizing the global DeepMDP losses ensures that for any , there is a deep policy which is close to , where the constant .
Theorem 4.
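Let $\mathcal{M}$ be an MDP and $\bar{\mathcal{M}}$ be a $(K_{\bar{\mathcal{R}}}, K_{\bar{\mathcal{P}}})$-Lipschitz DeepMDP, with an embedding function $\phi$ and global loss functions $L^{\infty}_{\bar{\mathcal{R}}}$ and $L^{\infty}_{\bar{\mathcal{P}}}$. Denote by and the sets of Lipschitz-bisimilar and Lipschitz-deep policies. Then for any there exists a which is close to in the sense that, for all and ,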
6 Beyond the Wasserstein
Interestingly, value difference bounds (Lemmas 2 and 3) can be derived for many different choices of probability metric $\mathcal{D}$ (in the DeepMDP transition loss function, Equation 2). Here, we generalize the result to a family of Maximum Mean Discrepancy (MMD) metrics (Gretton et al., 2012) defined via a function norm, which we denote as Norm Maximum Mean Discrepancy (Norm-MMD) metrics. The role of the Lipschitz norm in the value difference bounds is a consequence of using the Wasserstein; when we switch from the Wasserstein to another metric, it is replaced by a different term. We interpret these terms as different forms of smoothness of the value functions in $\bar{\mathcal{M}}$.
By choosing a metric whose associated smoothness corresponds well to the environment, we can potentially improve the tightness of the bounds. For example, in environments with highly non-Lipschitz dynamics, it may be impossible to learn an accurate DeepMDP whose deep value function has a small Lipschitz norm. Instead, the associated smoothness of another metric might be more appropriate. Another reason to consider other metrics is computational; the Wasserstein has high computational cost and suffers from biased stochastic gradient estimates (Bińkowski et al., 2018; Bellemare et al., 2017b), so minimizing a simpler measure, such as the KL divergence, may be more convenient.
6.1 Norm Maximum Mean Discrepancy Metrics
MMD metrics (Gretton et al., 2012) are a family of probability metrics, each generated via a class of functions. They have also been studied by Müller (1997) under the name of Integral Probability Metrics.
Definition 6 (Gretton et al. (2012) Definition 2).
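Let $P$ and $Q$ be distributions on a measurable space $\chi$ and let $\mathcal{F}$ be a class of functions $f : \chi \to \mathbb{R}$. The Maximum Mean Discrepancy $\mathcal{D}_{\mathcal{F}}$ is
$$\mathcal{D}_{\mathcal{F}}(P, Q) = \sup_{f \in \mathcal{F}} \left| \mathbb{E}_{x \sim P} f(x) - \mathbb{E}_{y \sim Q} f(y) \right|$$
When $P = Q$ it is obvious that $\mathcal{D}_{\mathcal{F}}(P, Q) = 0$ regardless of the function class $\mathcal{F}$. But the choice of the class of functions leads to MMD metrics with different behaviours and properties. Of interest to us are function classes generated via function seminorms. [Footnote 6: A seminorm $\|\cdot\|_{\mathcal{D}}$ is a norm except that $\|f\|_{\mathcal{D}} = 0$ does not imply $f = 0$.] Concretely, we define a Norm-MMD metric $\mathcal{D}$ to be an MMD metric generated from a function class $\mathcal{F}_{\mathcal{D}}$ of the following form:
$$\mathcal{F}_{\mathcal{D}} = \{ f : \|f\|_{\mathcal{D}} \leq 1 \}$$
where $\|\cdot\|_{\mathcal{D}}$ is the associated function seminorm of $\mathcal{D}$. We will see that the family of Norm-MMDs is well suited for the task of latent space modeling. Their key property is the following: let $\mathcal{D}$ be a Norm-MMD, then for any function $f$ s.t. $\|f\|_{\mathcal{D}} \leq K$,
$$\left| \mathbb{E}_{x \sim P} f(x) - \mathbb{E}_{y \sim Q} f(y) \right| \leq K \cdot \mathcal{D}(P, Q) \tag{8}$$
We now discuss three particularly interesting examples of Norm-MMD metrics.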
Total Variation: Defined as , the Total Variation is one of the most widely studied metrics. Pinsker’s inequality (Borwein & Lewis, 2005, p.63) bounds the TV with the Kullback–Leibler (KL) divergence. The Total Variation is also the Norm-MMD generated from the set of functions with absolute value bounded by (Müller, 1997). Thus, the function norm .
Wasserstein metric: The interpretation of the Wasserstein as an MMD metric is clear from its dual form (Equation 3), where the function class is the set of 1-Lipschitz functions,
$$\mathcal{F}_W = \{ f : |f(x) - f(y)| \leq d(x, y),\; \forall x, y \in \chi \}$$
The norm associated with the Wasserstein metric is therefore the Lipschitz norm, which in turn is the $L_\infty$ norm of $f'$ (the derivative of $f$). Thus, $\|f\|_W = \|f'\|_\infty$.
Energy distance: The energy distance $E$ was first developed to compare distributions in high dimensions via a two-sample test (Székely & Rizzo, 2004; Gretton et al., 2012). It is defined as:
$$E(P, Q) = 2\, \mathbb{E}_{x \sim P,\, y \sim Q} \|x - y\| - \mathbb{E}_{x, x' \sim P} \|x - x'\| - \mathbb{E}_{y, y' \sim Q} \|y - y'\|$$
where $x, x' \sim P$ denotes two independent samples of the distribution $P$. Sejdinovic et al. (2013) showed the connection between the energy distance and MMD metrics. Similarly to the Wasserstein, the Energy distance’s associated seminorm is: .
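To make the contrast between these metrics tangible, the sketch below evaluates the Total Variation and the energy distance for discrete distributions on the real line. The Total Variation is computed here as half the L1 distance between the probability vectors, which is one common normalization and an assumption of this example; function names and the dictionary layout are likewise illustrative.

```python
def total_variation(p, q):
    """Half the L1 distance between two discrete distributions (dicts)."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in support)


def energy_distance(p, q):
    """Energy distance 2 E|x - y| - E|x - x'| - E|y - y'| on the real line."""
    def expected_gap(u, v):
        # Expected absolute difference between independent draws from u and v.
        return sum(pu * pv * abs(x - y)
                   for x, pu in u.items() for y, pv in v.items())
    return 2 * expected_gap(p, q) - expected_gap(p, p) - expected_gap(q, q)


near = ({0.0: 1.0}, {1.0: 1.0})
far = ({0.0: 1.0}, {100.0: 1.0})
# Total Variation cannot tell the two cases apart (both give 1.0) ...
tv_near, tv_far = total_variation(*near), total_variation(*far)
# ... whereas the energy distance grows with the separation (2.0 vs 200.0).
en_near, en_far = energy_distance(*near), energy_distance(*far)
```

The Total Variation ignores the geometry of the underlying space, while the energy distance (like the Wasserstein) is sensitive to how far apart the supports are.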
6.2 Value Function Smoothness
In the context of value functions, we interpret the function seminorms associated with Norm-MMD metrics as different forms of smoothness.
Definition 7.
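Let $\bar{\mathcal{M}}$ be a DeepMDP and let $\mathcal{D}$ be a Norm-MMD with associated norm $\|\cdot\|_{\mathcal{D}}$. We say that a policy $\bar\pi \in \bar\Pi$ is $K_{\bar V}$-smooth-valued if:
$$\left\| \bar{V}^{\bar\pi} \right\|_{\mathcal{D}} \leq K_{\bar V}$$
and if for all $a \in \mathcal{A}$:
$$\left\| \bar{Q}^{\bar\pi}(\cdot, a) \right\|_{\mathcal{D}} \leq K_{\bar V}$$
For a value function $\bar{V}$, $\|\bar{V}\|_{TV}$ is the maximum absolute value of $\bar{V}$. Both $\|\bar{V}\|_W$ and $\|\bar{V}\|_E$ depend on the derivative of $\bar{V}$, but while $\|\bar{V}\|_W$ is governed by the point of maximal change, $\|\bar{V}\|_E$ instead measures the amount of change over the whole state space $\bar{\mathcal{S}}$. Thus, a value function with a small region of high derivative (and thus, large $\|\bar{V}\|_W$) can still have small $\|\bar{V}\|_E$. In Figure 3 we provide an intuitive visualization of these three forms of smoothness in the game of Pong.
One advantage of the Total Variation is that it requires minimal assumptions on the DeepMDP. If the reward function is bounded, i.e. $|\bar{\mathcal{R}}(\bar{s}, a)| \leq K_{\bar{\mathcal{R}}}$, then all policies are $\frac{K_{\bar{\mathcal{R}}}}{1 - \gamma}$-smooth-valued. We leave it to future work to study value function smoothness more generally for different Norm-MMD metrics and their associated norms.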
6.3 Generalized Value Difference Bounds
The global and local value difference results (Lemmas 2 and 3), as well as the suboptimality result Lemma 1, can easily be derived when $\mathcal{D}$ is any Norm-MMD metric. Due to the repetitiveness of these results, we do not include them in the main paper; refer to Appendix A.6 for the full statements and proofs. We leave it to future work to characterize the set of policies $\bar\Pi$ when general (i.e. non-Wasserstein) Norm-MMD metrics are used.
7 Related Work in Representation Learning
State aggregation methods (Abel et al., 2017; Li et al., 2006; Singh et al., 1995; Givan et al., 2003; Jiang et al., 2015; Ruan et al., 2015) attempt to reduce the dimensionality of the state space by joining states together, taking the perspective that a good representation is one that reduces the total number of states without sacrificing any necessary information. Other representation learning approaches take the perspective that an optimal representation contains features that allow for the linear parametrization of the optimal value function (Comanici & Precup, 2011; Mahadevan & Maggioni, 2007). Recently, Bellemare et al. (2019); Dadashi et al. (2019) approached the representation learning problem from the perspective that a good representation is one that allows the prediction via a linear map of any value function in the value function space. In contrast, we have argued that a good representation (1) allows for the parametrization of a large set of interesting policies and (2) allows for the good approximation of the value function of these policies.
Concurrently, a suite of methods combining model-free deep reinforcement learning with auxiliary tasks has shown large benefits on a wide variety of domains (Jaderberg et al., 2016; van den Oord et al., 2018; Mirowski et al., 2017). Distributional RL (Bellemare et al., 2017a), which was not initially introduced as a representation learning technique, has been shown by Lyle et al. (2019) to only play an auxiliary task role. Similarly, Fedus et al. (2019) studied different discounting techniques by learning the spectrum of value functions for different discount values $\gamma$, and incidentally found that to be a highly useful auxiliary task. Although successful in practice, these auxiliary task methods currently lack strong theoretical justification. Our approach also proposes to minimize losses as an auxiliary task for representation learning, for a specific choice of losses: the DeepMDP losses. We have formally justified this choice of losses by providing theoretical guarantees on representation quality.
Figure 4: Given a state in our DonutWorld environment (first row), we plot a heatmap of the distance between that latent state and each other latent state, for both autoencoder representations (second row) and DeepMDP representations (third row). More-similar latent states are represented by lighter colors.
8 Empirical Evaluation
Our results depend on minimizing losses in expectation, which is the main requirement for deep networks to be applicable. Still, two main obstacles arise when turning these theoretical results into practical algorithms:
(1) Minimization of the Wasserstein. Arjovsky et al. (2017) first proposed the use of the Wasserstein distance for Generative Adversarial Networks (GANs) via its dual formulation (see Equation 3). Their approach consists of training a network, constrained to be 1-Lipschitz, to attain the supremum of the dual. Once this supremum is attained, the Wasserstein can be minimized by differentiating through the network. Quantile regression has been proposed as an alternative solution to the minimization of the Wasserstein (Dabney et al., 2018a,b), and has been shown to perform well for Distributional RL. Note, however, that stochastic gradient estimates of the Wasserstein distance have been found to be biased by Bellemare et al. (2017b) and Bińkowski et al. (2018). In our experiments, we circumvent these issues by assuming that both $\mathcal{P}$ and $\bar{\mathcal{P}}$ are deterministic. This reduces the Wasserstein distance $W\left( \phi\mathcal{P}(\cdot|s,a),\, \bar{\mathcal{P}}(\cdot|\phi(s),a) \right)$ to $d_{\bar{\mathcal{S}}}\left( \phi(\mathcal{P}(s,a)),\, \bar{\mathcal{P}}(\phi(s),a) \right)$, where $\mathcal{P}(s,a)$ and $\bar{\mathcal{P}}(\bar{s},a)$ denote the deterministic transition functions.
(2) Control of the Lipschitz constants $K_{\bar{\mathcal{R}}}$ and $K_{\bar{\mathcal{P}}}$. We also turn to the field of Wasserstein GANs for approaches to constrain deep networks to be Lipschitz. Originally, Arjovsky et al. (2017) used a projection step to constrain the discriminator function to be 1-Lipschitz. Gulrajani et al. (2017a) proposed using a gradient penalty, and showed improved learning dynamics. Lipschitz continuity has also been proposed as a regularization method by Gouk et al. (2018), who provided an approach to compute an upper bound to the Lipschitz constant of neural nets. In our experiments, we follow Gulrajani et al. (2017a) and utilize the gradient penalty.
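The gradient penalty idea can be illustrated on a scalar function of one variable. The sketch below follows the WGAN-GP formulation, penalizing the squared deviation of the gradient norm from a target of 1, and estimates derivatives with central finite differences; how the penalty is adapted to the reward and transition networks is not specified in this section, so the target value, the function name and the use of finite differences are all assumptions of this example.

```python
def gradient_penalty(f, points, target=1.0, h=1e-5):
    """Mean squared deviation of |f'(x)| from `target` at the given points.

    f is a scalar function of one variable; derivatives are estimated with
    central finite differences. In practice an autodiff framework would be
    used instead.
    """
    penalties = []
    for x in points:
        slope = (f(x + h) - f(x - h)) / (2 * h)
        penalties.append((abs(slope) - target) ** 2)
    return sum(penalties) / len(points)


samples = [0.0, 1.0, 2.0]
small = gradient_penalty(lambda x: x, samples)        # close to 0: slope is 1
large = gradient_penalty(lambda x: 3.0 * x, samples)  # close to 4: (3 - 1)^2
```

Adding such a term to the training loss discourages steep functions softly, rather than enforcing a hard Lipschitz constraint through projection.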
8.1 DonutWorld Experiments
In order to evaluate whether we can learn effective representations, we study the representations learned by DeepMDPs in a simple synthetic environment we call DonutWorld. DonutWorld consists of an agent rewarded for running clockwise around a fixed track. Staying in the center of the track results in faster movement. Observations are given in terms of 32x32 greyscale pixel arrays, but there is a simple 2D latent state space (the x-y coordinates of the agent). We investigate whether the x-y coordinates are correctly recovered when learning a two-dimensional representation.
This task epitomizes the low-dimensional dynamics, high-dimensional observations structure typical of Atari 2600 games, while being sufficiently simple to experiment with. We implement the DeepMDP training procedure using Tensorflow and compare it to a simple autoencoder baseline. See Appendix B for a full environment specification, experimental setup, and additional experiments. Code for replicating all experiments is included in the supplementary material.
In order to investigate whether the learned representations correspond well to reality, we plot a heatmap of closeness of representation for various states. Figure 4(a) shows that the DeepMDP representations effectively recover the underlying state of the agent, i.e. its 2D position, from the high-dimensional pixel observations. In contrast, the autoencoder representations are less meaningful, even when the autoencoder solves the task near-perfectly.
In Figure 4(b), we modify the environment: rather than a single track, the environment now has four identical tracks. The agent starts in one uniformly at random and cannot move between tracks. The DeepMDP hidden state correctly merges all states with indistinguishable value functions, learning a deep state representation which is almost completely invariant to which track the agent is in.
The DeepMDP training loss can be difficult to optimize, as illustrated in Figure 5. This is due to the tendency of the transition and reward losses to compete with one another. If the deep state representation is uniformly zero, the transition loss will be zero as well; this is an easily discovered local optimum, and gradient descent tends to arrive at this point early on in training. Of course, an informationless representation results in a large reward loss. As training progresses, the algorithm incurs a small amount of transition loss in return for a large decrease in reward loss, resulting in a net decrease in loss.
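The competition between the two losses can be seen in a toy calculation. The sketch below uses a hypothetical three-state chain with hand-picked rewards (it is not DonutWorld) and compares a collapsed encoder, which maps every state to zero, with a slightly informative encoder whose latent transition model has not yet learned anything. All encoders and latent models here are assumptions chosen to make the arithmetic easy to follow.

```python
# Hypothetical three-state chain: state s moves deterministically to (s + 1) % 3.
STATES = [0, 1, 2]
REWARDS = {0: 0.0, 1: 1.0, 2: 2.0}


def next_state(s):
    return (s + 1) % 3


def mean_losses(encoder, latent_reward, latent_transition):
    """Return (reward_loss, transition_loss) averaged over all states."""
    reward_loss = 0.0
    transition_loss = 0.0
    for s in STATES:
        z = encoder(s)
        reward_loss += abs(REWARDS[s] - latent_reward(z))
        transition_loss += abs(encoder(next_state(s)) - latent_transition(z))
    return reward_loss / len(STATES), transition_loss / len(STATES)


# Collapsed encoder: every state maps to 0, so the reward model can only be a constant.
collapsed = mean_losses(
    encoder=lambda s: 0.0,
    latent_reward=lambda z: 1.0,       # constant guess (the mean reward)
    latent_transition=lambda z: z,     # predicting "stay at 0" is exact
)

# Slightly informative encoder with a transition model that still predicts "stay".
informative = mean_losses(
    encoder=lambda s: 0.1 * s,
    latent_reward=lambda z: 10.0 * z,  # recovers the reward from the latent
    latent_transition=lambda z: z,
)

print("collapsed   (reward, transition):", collapsed)
print("informative (reward, transition):", informative)
```

The collapsed encoder pays nothing in transition loss but cannot predict rewards, whereas the informative encoder accepts a small transition loss and removes the reward loss, giving a lower total.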
In DonutWorld, which has very simple dynamics, gradient descent is able to discover a good representation after only a few thousand iterations. However, in complex environments such as Atari, it is often much more difficult to discover representations that allow us to escape the low-information local minima. Using architectures with good inductive biases can help to combat this, as shown in Section 8.3. This issue also motivates the use of auxiliary losses (such as value approximation losses or reconstruction losses), which may help guide the optimizer towards good solutions; see Appendix C.5.
8.2 Atari 2600 Experiments
In this section, we demonstrate practical benefits of approximately learning a DeepMDP in the Arcade Learning Environment (Bellemare et al., 2013). Our results on representation similarity indicate that learning a DeepMDP is a principled method for learning a high-quality representation. Therefore, we minimize DeepMDP losses as an auxiliary task alongside model-free reinforcement learning, learning a single representation which is shared between both tasks. Our implementations of the proposed algorithms are based on Dopamine (Castro et al., 2018).
We adopt the distributional Q-learning approach to model-free RL; specifically, we use as a baseline the C51 agent (Bellemare et al., 2017a), which estimates probability masses on a discrete support and minimizes the KL divergence between the estimated distribution and a target distribution. C51 encodes the input frames using a convolutional neural network, outputting a dense vector representation. The C51 Q-function is a feed-forward neural network which maps this representation to an estimate of the return distribution's logits.
To incorporate learning a DeepMDP as an auxiliary learning objective, we define a deep reward function and deep transition function. These are each implemented as a feed-forward neural network, which uses the representation to estimate the immediate reward and the next-state representation, respectively. The overall objective function is a simple linear combination of the standard C51 loss and the Wasserstein distance-based approximations to the local DeepMDP loss given by Equations 6 and 7. For experimental details, see Appendix C.
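As a rough illustration of how such an objective can be assembled, the sketch below computes a KL divergence between two categorical distributions on a shared support, which is the form taken by the C51 loss, and adds weighted reward and transition terms. The helper names, the four-atom support, the loss values, and the weights are hypothetical placeholders, not the settings used in the experiments.

```python
import math


def softmax(logits):
    """Convert logits to probabilities in a numerically stable way."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(target_probs, predicted_probs, eps=1e-12):
    """KL(target || predicted) for two distributions on the same discrete support."""
    return sum(t * (math.log(t + eps) - math.log(p + eps))
               for t, p in zip(target_probs, predicted_probs) if t > 0.0)


def combined_loss(c51_loss, reward_loss, transition_loss,
                  reward_weight=1.0, transition_weight=1.0):
    """Linear combination of the model-free loss and the two auxiliary losses."""
    return c51_loss + reward_weight * reward_loss + transition_weight * transition_loss


predicted = softmax([0.0, 0.0, 0.0, 0.0])   # uniform over four atoms
target = [0.0, 0.5, 0.5, 0.0]               # hypothetical projected target
c51 = kl_divergence(target, predicted)      # approximately log(2) for these inputs
total = combined_loss(c51, reward_loss=0.2, transition_loss=0.1,
                      reward_weight=1.0, transition_weight=0.5)
```

Because the representation is shared, gradients from all three terms flow into the same encoder, which is what makes the relative weighting matter.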
By optimizing the encoder to jointly minimize both C51 and DeepMDP losses, we hope to learn meaningful representations that form the basis for learning good value functions. In the following subsections, we aim to answer the following questions: (1) What deep transition model architecture is conducive to learning a DeepMDP on Atari? (2) How does the learning of a DeepMDP affect the overall performance of C51 on Atari 2600 games? (3) How do the DeepMDP objectives compare with similar representation-learning approaches?
8.3 Transition Model Architecture
We compare the performance achieved by using different architectures for the DeepMDP transition model (see Figure 7). We experiment with a single fully connected layer, two fully connected layers, and a single convolutional layer (see Appendix C for more details). We find that using a convolutional transition model leads to the best DeepMDP performance, and we use this transition model architecture for the rest of the experiments in this paper. Note how the performance of the agent is highly dependent on the architecture. We hypothesize that the inductive bias provided via the model has a large effect on the learned DeepMDPs. Further exploring model architectures which provide inductive biases is a promising avenue to develop better auxiliary tasks. In particular, we believe that exploring attention (Vaswani et al., 2017; Bahdanau et al., 2014) and relational inductive biases (Watters et al., 2017; Battaglia et al., 2016) could be useful in visual domains like Atari 2600.
8.4 DeepMDPs as an Auxiliary Task
8.5 Comparison to Alternative Objectives
We empirically compare the effect of the DeepMDP auxiliary objectives on the performance of a C51 agent to a variety of alternatives. In the experiments in this section, we replace the deep transition loss suggested by the DeepMDP bounds with each of the following:
(1) Observation Reconstruction: We train a state decoder to reconstruct observations from the state representation. This framework is similar to that of Ha & Schmidhuber (2018), who learn a latent space representation of the environment with an autoencoder and use it to train an RL agent.
(2) Next Observation Prediction: We train a transition model to predict next observations from the current state representation. This framework is similar to model-based RL algorithms which predict future observations (Xu et al., 2018).
(3) Next Logits Prediction: We train a transition model to predict next-state representations such that the Q-function correctly predicts the logits of the next state-action pair, where the action is the one associated with the max Q-value of the next state. This can be understood as a distributional analogue of the Value Prediction Network, VPN, (Oh et al., 2017). Note that this auxiliary loss is used to update only the parameters of the representation encoder and the transition model, not the Q-function.
Our experiments demonstrate that the deep transition loss suggested by the DeepMDP bounds (i.e. predicting the next state’s representation) outperforms all three ablations (see Figure 8). Accurately modeling Atari 2600 frames, whether through observation reconstruction or next observation prediction, forces the representation to encode irrelevant information with respect to the underlying task. VPN-style losses have been shown to be helpful when using the learned predictive model for planning (Oh et al., 2017); however, we find that with a distributional RL agent, using this as an auxiliary task tends to hurt performance.
9 Discussion on Model-Based RL
We have focused on the implications of DeepMDPs for representation learning, but our results also provide a principled basis for model-based RL, in latent space or otherwise. Although DeepMDPs are latent space models, by letting the embedding function be the identity function, all the provided results immediately apply to the standard model-based RL setting, where the model predicts states instead of latent states. In fact, our results serve as a theoretical justification for common practices already found in the model-based deep RL literature. For example, Chua et al. (2018); Doerr et al. (2018); Hafner et al. (2018); Buesing et al. (2018); Feinberg et al. (2018); Buckman et al. (2018) train models to predict a reward and a distribution over next states, minimizing the negative log-probability of the true next state. The negative log-probability of the next state can be viewed, up to an additive constant that does not depend on the model, as a one-sample estimate of the KL between the model’s state distribution and the next state distribution. Due to Pinsker’s inequality (which bounds the TV with the KL), and the suitability of TV as a metric (Section 6), this procedure can be interpreted as training a DeepMDP. Thus, the learned model will obey our local value difference bounds (Lemma 8) and suboptimality bounds (Theorem 6), which provide theoretical guarantees for the model.
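Pinsker's inequality states that the total variation distance between two distributions is at most the square root of half their KL divergence. The following sketch checks this numerically on two small categorical distributions; the distributions are arbitrary hand-picked examples that stand in for an environment's next-state distribution and a learned model, and serve only as an illustration.

```python
import math


def total_variation(p, q):
    """Total variation distance: half the L1 distance between the distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def kl(p, q):
    """KL(p || q) in nats; assumes q is positive wherever p is positive."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0.0)


true_next_state = [0.7, 0.2, 0.1]   # hypothetical environment distribution
model_next_state = [0.5, 0.3, 0.2]  # hypothetical learned model

tv = total_variation(true_next_state, model_next_state)
pinsker_bound = math.sqrt(0.5 * kl(true_next_state, model_next_state))
print(f"TV = {tv:.4f}, Pinsker bound = {pinsker_bound:.4f}")
```

Driving the KL towards zero therefore also drives the TV towards zero, which is the step that connects likelihood-based model training to the TV-based guarantees.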
Further, the suitability of Norm-MMD metrics for learning models presents a promising new research avenue for model-based RL: to break away from the KL and explore the vast family of Norm Maximum Mean Discrepancy metrics.
10 Conclusions
We introduce the concept of a DeepMDP: a parameterized latent space model trained via the minimization of tractable losses. Theoretical analysis provides guarantees on the quality of the value functions of the learned model when the latent transition loss is any member of the large family of Norm Maximum Mean Discrepancy metrics. When the Wasserstein metric is used, a novel connection to bisimulation metrics guarantees the set of parametrizable policies is highly expressive. Further, it is guaranteed that two states with different values for any of those policies will never be collapsed under the representation. Together, these findings suggest that learning a DeepMDP with the Wasserstein metric is a theoretically sound approach to representation learning. Our results are corroborated by strong performance on large-scale Atari 2600 experiments, demonstrating that minimizing the DeepMDP losses can be a beneficial auxiliary task in model-free RL.
Using the transition and reward models of the DeepMDP for model-based RL (e.g. planning, exploration) is a promising future research direction. Additionally, extending DeepMDPs to accommodate different action spaces or time scales from the original MDPs could be a promising path towards learning hierarchical models of the environment.
Acknowledgements
The authors would like to thank Philip Amortila and Robert Dadashi for invaluable feedback on the theoretical results; Pablo Samuel Castro, Doina Precup, Nicolas Le Roux, Sasha Vezhnevets, Simon Osindero, Arthur Gretton, Adrien Ali Taiga, Fabian Pedregosa and Shane Gu for useful discussions and feedback.
Changes From ICML 2019 Proceedings
This document represents an updated version of our work relative to the version published in ICML 2019. The major addition was the inclusion of the generalization to Norm-MMD metrics and associated math in Section 6. Lemma 1 also underwent minor changes to its statements and proofs. Additionally, some sections were partially rewritten, especially the discussion on bisimulation (Section 5), which was significantly expanded.
References
 Abel et al. (2017) Abel, D., Hershkowitz, D. E., and Littman, M. L. Near optimal behavior via approximate state abstraction. arXiv preprint arXiv:1701.04113, 2017.
 Arjovsky et al. (2017) Arjovsky, M., Chintala, S., and Bottou, L. Wasserstein generative adversarial networks. In ICML, 2017.
 Asadi et al. (2018) Asadi, K., Misra, D., and Littman, M. L. Lipschitz continuity in model-based reinforcement learning. arXiv preprint arXiv:1804.07193, 2018.
 Bahdanau et al. (2014) Bahdanau, D., Cho, K., and Bengio, Y. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
 Battaglia et al. (2016) Battaglia, P. W., Pascanu, R., Lai, M., Rezende, D. J., and Kavukcuoglu, K. Interaction networks for learning about objects, relations and physics. In NIPS, 2016.

 Bellemare et al. (2013) Bellemare, M. G., Naddaf, Y., Veness, J., and Bowling, M. The Arcade Learning Environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, June 2013.
 Bellemare et al. (2017a) Bellemare, M. G., Dabney, W., and Munos, R. A distributional perspective on reinforcement learning. In Proceedings of the International Conference on Machine Learning, 2017a.
 Bellemare et al. (2017b) Bellemare, M. G., Danihelka, I., Dabney, W., Mohamed, S., Lakshminarayanan, B., Hoyer, S., and Munos, R. The cramer distance as a solution to biased wasserstein gradients. arXiv preprint arXiv:1705.10743, 2017b.
 Bellemare et al. (2019) Bellemare, M. G., Dabney, W., Dadashi, R., Taiga, A. A., Castro, P. S., Roux, N. L., Schuurmans, D., Lattimore, T., and Lyle, C. A geometric perspective on optimal representations for reinforcement learning. CoRR, abs/1901.11530, 2019.
 Bińkowski et al. (2018) Bińkowski, M., Sutherland, D. J., Arbel, M., and Gretton, A. Demystifying MMD GANs. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=r1lUOzWCW.
 Borwein & Lewis (2005) Borwein, J. and Lewis, A. S. Convex Analysis and Nonlinear Optimization. Springer, 2005.
 Buckman et al. (2018) Buckman, J., Hafner, D., Tucker, G., Brevdo, E., and Lee, H. Sample-efficient reinforcement learning with stochastic ensemble value expansion. In NeurIPS, 2018.
 Buesing et al. (2018) Buesing, L., Weber, T., Racaniere, S., Eslami, S., Rezende, D., Reichert, D. P., Viola, F., Besse, F., Gregor, K., Hassabis, D., et al. Learning and querying fast generative models for reinforcement learning. arXiv preprint arXiv:1802.03006, 2018.
 Castro & Precup (2010) Castro, P. and Precup, D. Using bisimulation for policy transfer in MDPs. Proceedings of the 9th International Conference on Autonomous Agents and Multiagent Systems (AAMAS-2010), 2010.
 Castro et al. (2018) Castro, P. S., Moitra, S., Gelada, C., Kumar, S., and Bellemare, M. G. Dopamine: A research framework for deep reinforcement learning. arXiv, 2018.
 Chua et al. (2018) Chua, K., Calandra, R., McAllister, R., and Levine, S. Deep reinforcement learning in a handful of trials using probabilistic dynamics models. In Advances in Neural Information Processing Systems, pp. 4754–4765, 2018.
 Chung et al. (2019) Chung, W., Nath, S., Joseph, A. G., and White, M. Two-timescale networks for nonlinear value function approximation. In International Conference on Learning Representations, 2019.

 Comanici & Precup (2011) Comanici, G. and Precup, D. Basis function discovery using spectral clustering and bisimulation metrics. In AAMAS, 2011.
 Dabney et al. (2018a) Dabney, W., Ostrovski, G., Silver, D., and Munos, R. Implicit quantile networks for distributional reinforcement learning. In ICML, 2018a.
 Dabney et al. (2018b) Dabney, W., Rowland, M., Bellemare, M. G., and Munos, R. Distributional reinforcement learning with quantile regression. In AAAI, 2018b.
 Dadashi et al. (2019) Dadashi, R., Taiga, A. A., Roux, N. L., Schuurmans, D., and Bellemare, M. G. The value function polytope in reinforcement learning. CoRR, abs/1901.11524, 2019.
 Doerr et al. (2018) Doerr, A., Daniel, C., Schiegg, M., Nguyen-Tuong, D., Schaal, S., Toussaint, M., and Trimpe, S. Probabilistic recurrent state-space models. arXiv preprint arXiv:1801.10395, 2018.
 Fedus et al. (2019) Fedus, W., Gelada, C., Bengio, Y., Bellemare, M. G., and Larochelle, H. Hyperbolic discounting and learning over multiple horizons. ArXiv, abs/1902.06865, 2019.
 Feinberg et al. (2018) Feinberg, V., Wan, A., Stoica, I., Jordan, M. I., Gonzalez, J. E., and Levine, S. Model-based value estimation for efficient model-free reinforcement learning. arXiv preprint arXiv:1803.00101, 2018.
 Ferns et al. (2004) Ferns, N., Panangaden, P., and Precup, D. Metrics for finite markov decision processes. Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, UAI’04:162–169, 2004.
 Ferns et al. (2011) Ferns, N., Panangaden, P., and Precup, D. Bisimulation metrics for continuous markov decision processes. SIAM Journal on Computing, 40(6):1662–1714, 2011.
 Francois-Lavet et al. (2018) Francois-Lavet, V., Bengio, Y., Precup, D., and Pineau, J. Combined reinforcement learning via abstract representations. arXiv preprint arXiv:1809.04506, 2018.
 Gelada & Bellemare (2019) Gelada, C. and Bellemare, M. G. Off-policy deep reinforcement learning by bootstrapping the covariate shift. CoRR, abs/1901.09455, 2019.
 Givan et al. (2003) Givan, R., Dean, T., and Greig, M. Equivalence notions and model minimization in markov decision processes. Artificial Intelligence, 147(1-2):163–223, 2003.
 Gouk et al. (2018) Gouk, H., Frank, E., Pfahringer, B., and Cree, M. J. Regularisation of neural networks by enforcing lipschitz continuity. CoRR, abs/1804.04368, 2018.
 Gretton et al. (2012) Gretton, A., Borgwardt, K. M., Rasch, M. J., Schölkopf, B., and Smola, A. J. A kernel two-sample test. Journal of Machine Learning Research, 13:723–773, 2012.
 Gulrajani et al. (2017a) Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., and Courville, A. C. Improved training of wasserstein gans. In NIPS, 2017a.
 Gulrajani et al. (2017b) Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., and Courville, A. C. Improved training of wasserstein gans. In Advances in Neural Information Processing Systems, pp. 5767–5777, 2017b.
 Ha & Schmidhuber (2018) Ha, D. and Schmidhuber, J. Recurrent world models facilitate policy evolution. In Advances in Neural Information Processing Systems, pp. 2455–2467, 2018.
 Hafner et al. (2018) Hafner, D., Lillicrap, T., Fischer, I., Villegas, R., Ha, D., Lee, H., and Davidson, J. Learning latent dynamics for planning from pixels. arXiv preprint arXiv:1811.04551, 2018.
 Hinderer (2005) Hinderer, K. Lipschitz continuity of value functions in markovian decision processes. Math. Meth. of OR, 62:3–22, 2005.
 Jaderberg et al. (2016) Jaderberg, M., Mnih, V., Czarnecki, W. M., Schaul, T., Leibo, J. Z., Silver, D., and Kavukcuoglu, K. Reinforcement learning with unsupervised auxiliary tasks. arXiv preprint arXiv:1611.05397, 2016.
 Jiang et al. (2015) Jiang, N., Kulesza, A., and Singh, S. Abstraction selection in model-based reinforcement learning. In International Conference on Machine Learning, pp. 179–188, 2015.
 Kaiser et al. (2019) Kaiser, L., Babaeizadeh, M., Milos, P., Osinski, B., Campbell, R. H., Czechowski, K., Erhan, D., Finn, C., Kozakowski, P., Levine, S., Sepassi, R., Tucker, G., and Michalewski, H. Model-based reinforcement learning for atari. CoRR, abs/1903.00374, 2019.
 Li et al. (2006) Li, L., Walsh, T. J., and Littman, M. L. Towards a unified theory of state abstraction for mdps. In ISAIM, 2006.
 Lyle et al. (2019) Lyle, C., Castro, P. S., and Bellemare, M. G. A comparative analysis of expected and distributional reinforcement learning. CoRR, abs/1901.11084, 2019.
 Mahadevan & Maggioni (2007) Mahadevan, S. and Maggioni, M. Proto-value functions: A laplacian framework for learning representation and control in markov decision processes. Journal of Machine Learning Research, 8:2169–2231, 2007.
 Mirowski et al. (2017) Mirowski, P. W., Pascanu, R., Viola, F., Soyer, H., Ballard, A. J., Banino, A., Denil, M., Goroshin, R., Sifre, L., Kavukcuoglu, K., Kumaran, D., and Hadsell, R. Learning to navigate in complex environments. CoRR, abs/1611.03673, 2017.
 Mnih et al. (2015) Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M. A., Fidjeland, A., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., and Hassabis, D. Human-level control through deep reinforcement learning. Nature, 518:529–533, 2015.
 Mueller (1997) Mueller, A. Integral probability metrics and their generating classes of functions. 1997.
 Müller (1997) Müller, A. Integral probability metrics and their generating classes of functions. Advances in Applied Probability, 29(2):429–443, 1997.
 Oh et al. (2017) Oh, J., Singh, S., and Lee, H. Value prediction network. In Advances in Neural Information Processing Systems, pp. 6118–6128, 2017.

 Parr et al. (2008) Parr, R., Li, L., Taylor, G., Painter-Wakefield, C., and Littman, M. L. An analysis of linear models, linear value-function approximation, and feature selection for reinforcement learning. In ICML, 2008.
 Pirotta et al. (2015) Pirotta, M., Restelli, M., and Bascetta, L. Policy gradient in lipschitz markov decision processes. Machine Learning, 100(2-3):255–283, 2015.
 Puterman (1994) Puterman, M. L. Markov decision processes: Discrete stochastic dynamic programming. 1994.
 Ruan et al. (2015) Ruan, S. S., Comanici, G., Panangaden, P., and Precup, D. Representation discovery for mdps using bisimulation metrics. In AAAI, 2015.
 Sejdinovic et al. (2013) Sejdinovic, D., Sriperumbudur, B. K., Gretton, A., and Fukumizu, K. Equivalence of distance-based and RKHS-based statistics in hypothesis testing. CoRR, abs/1207.6076, 2013.
 Silver et al. (2017) Silver, D., van Hasselt, H. P., Hessel, M., Schaul, T., Guez, A., Harley, T., Dulac-Arnold, G., Reichert, D. P., Rabinowitz, N. C., Barreto, A., and Degris, T. The predictron: End-to-end learning and planning. In ICML, 2017.
 Singh et al. (1995) Singh, S. P., Jaakkola, T., and Jordan, M. I. Reinforcement learning with soft state aggregation. In Advances in neural information processing systems, pp. 361–368, 1995.
 Székely & Rizzo (2004) Székely, G. J. and Rizzo, M. L. Testing for equal distributions in high dimension. 2004.
 van den Oord et al. (2018) van den Oord, A., Li, Y., and Vinyals, O. Representation learning with contrastive predictive coding. CoRR, abs/1807.03748, 2018.
 Vaswani et al. (2017) Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. Attention is all you need. In NIPS, 2017.
 Villani (2008) Villani, C. Optimal Transport: Old and New. Springer Science & Business Media, 2008.
 Watters et al. (2017) Watters, N., Tacchetti, A., Weber, T., Pascanu, R., Battaglia, P. W., and Zoran, D. Visual interaction networks. CoRR, abs/1706.01433, 2017.
 Xu et al. (2018) Xu, H., Li, Y., Tian, Y., Darrell, T., and Ma, T. Algorithmic framework for model-based reinforcement learning with theoretical guarantees. arXiv preprint arXiv:1807.03858, 2018.
 Zhang et al. (2018) Zhang, M., Vikram, S., Smith, L., Abbeel, P., Johnson, M. J., and Levine, S. Solar: Deep structured latent representations for model-based reinforcement learning. arXiv preprint arXiv:1808.09105, 2018.
Appendix A Proofs
A.1 Lipschitz MDP
See Lemma 1.
Proof.
Start by proving 1. By induction we will show that a sequence of Q values converging to are all Lipschitz, and that as , their Lipschitz norm goes to . Let be the base case. Define . It is a well known result that the sequence converges to . Now let be the Lipschitz norm of . Clearly . Then,