DeepMDP: Learning Continuous Latent Space Models for Representation Learning

by Carles Gelada, et al.

Many reinforcement learning (RL) tasks provide the agent with high-dimensional observations that can be simplified into low-dimensional continuous states. To formalize this process, we introduce the concept of a DeepMDP, a parameterized latent space model that is trained via the minimization of two tractable losses: prediction of rewards and prediction of the distribution over next latent states. We show that the optimization of these objectives guarantees (1) the quality of the latent space as a representation of the state space and (2) the quality of the DeepMDP as a model of the environment. We connect these results to prior work in the bisimulation literature, and explore the use of a variety of metrics. Our theoretical findings are substantiated by the experimental result that a trained DeepMDP recovers the latent structure underlying high-dimensional observations on a synthetic environment. Finally, we show that learning a DeepMDP as an auxiliary task in the Atari 2600 domain leads to large performance improvements over model-free RL.






1 Introduction

In reinforcement learning (RL), it is typical to model the environment as a Markov Decision Process (MDP). However, for many practical tasks, the state representations of these MDPs include a large amount of redundant information and task-irrelevant noise. For example, image observations from the Arcade Learning Environment (Bellemare et al., 2013) consist of 33,600-dimensional pixel arrays, yet it is intuitively clear that there exist lower-dimensional approximate representations for all games. Consider Pong; observing only the positions and velocities of the three objects in the frame is enough to play. Converting each frame into such a simplified state before learning a policy facilitates the learning process by reducing the redundant and irrelevant information presented to the agent. Representation learning techniques for reinforcement learning seek to improve the learning efficiency of existing RL algorithms by doing exactly this: learning a mapping from states to simplified states.

Prior work on representation learning, such as state aggregation with bisimulation metrics (Givan et al., 2003; Ferns et al., 2004, 2011) or feature discovery algorithms (Comanici & Precup, 2011; Mahadevan & Maggioni, 2007; Bellemare et al., 2019), has resulted in algorithms with good theoretical properties; however, these algorithms do not scale to large problems or are not easily combined with deep learning. On the other hand, many recently proposed approaches to representation learning via deep learning have strong empirical results on complex domains, but lack formal guarantees (Jaderberg et al., 2016; van den Oord et al., 2018; Fedus et al., 2019). In this work, we propose an approach to representation learning that unifies the desirable aspects of both of these categories: a deep-learning-friendly approach with theoretical guarantees.

We describe the DeepMDP, a latent space model of an MDP which has been trained to minimize two tractable losses: predicting the rewards and predicting the distribution of next latent states. DeepMDPs can be viewed as a formalization of recent works which use neural networks to learn latent space models of the environment (Ha & Schmidhuber, 2018; Oh et al., 2017; Hafner et al., 2018; Francois-Lavet et al., 2018), because the value functions in the DeepMDP are guaranteed to be good approximations of value functions in the original task MDP. To provide this guarantee, careful consideration of the metric between distributions is necessary. A novel analysis of Maximum Mean Discrepancy (MMD) metrics (Gretton et al., 2012) defined via a function norm allows us to provide such guarantees; this includes the Total Variation, the Wasserstein and Energy metrics. These results represent a promising first step towards principled latent-space model-based RL algorithms.

Figure 1: Diagram of the latent space losses. Circles denote a distribution.

From the perspective of representation learning, the state of a DeepMDP can be interpreted as a representation of the original MDP's state. When the Wasserstein metric is used for the latent transition loss, analysis reveals a profound theoretical connection between DeepMDPs and bisimulation. These results provide a theoretically grounded approach to representation learning that is scalable and compatible with modern deep networks.

In Section 2, we review key concepts and formally define the DeepMDP. We start by studying the model-quality and representation-quality results of DeepMDPs (using the Wasserstein metric) in Sections 3 and 4. In Section 5, we investigate the connection between bisimulation and DeepMDPs trained with the Wasserstein metric. Section 6 generalizes only our model-based guarantees to metrics other than the Wasserstein; this limitation emphasizes the special role that the Wasserstein metric plays in learning good representations. Finally, in Section 8 we consider a synthetic environment with high-dimensional observations and show that a DeepMDP learns to recover its underlying low-dimensional latent structure. We then demonstrate that learning a DeepMDP as an auxiliary task to model-free RL in the Atari 2600 environment leads to significant improvement in performance when compared to a baseline model-free method.

2 Background

2.1 Markov Decision Processes

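Define a Markov Decision Process (MDP) in standard fashion: $\mathcal{M} = \langle \mathcal{S}, \mathcal{A}, \mathcal{R}, \mathcal{P}, \gamma \rangle$ (Puterman, 1994). For simplicity of notation we will assume that $\mathcal{S}$ and $\mathcal{A}$ are discrete spaces unless otherwise stated. A policy $\pi$ defines a distribution over actions conditioned on the state, $\pi(a|s)$. Denote by $\Pi$ the set of all stationary policies. The value function of a policy $\pi \in \Pi$ at a state $s$ is the expected sum of future discounted rewards obtained by running the policy from that state. $V^\pi : \mathcal{S} \to \mathbb{R}$ is defined as:

$$V^\pi(s) = \mathbb{E}_{a_t \sim \pi(\cdot|s_t),\ s_{t+1} \sim \mathcal{P}(\cdot|s_t,a_t)} \left[ \sum_{t=0}^{\infty} \gamma^t \mathcal{R}(s_t,a_t) \,\middle|\, s_0 = s \right]$$

The action value function is similarly defined:

$$Q^\pi(s,a) = \mathbb{E}_{a_t \sim \pi(\cdot|s_t),\ s_{t+1} \sim \mathcal{P}(\cdot|s_t,a_t)} \left[ \sum_{t=0}^{\infty} \gamma^t \mathcal{R}(s_t,a_t) \,\middle|\, s_0 = s,\ a_0 = a \right]$$

We denote by $\mathcal{P}_\pi$ the action-independent transition function induced by running a policy $\pi$, $\mathcal{P}_\pi(s'|s) = \sum_{a} \mathcal{P}(s'|s,a)\,\pi(a|s)$. Similarly $\mathcal{R}_\pi(s) = \sum_{a} \mathcal{R}(s,a)\,\pi(a|s)$. We denote $\pi^*$ as the optimal policy in $\mathcal{M}$; i.e., the policy which maximizes expected future reward. We denote the optimal state and action value functions with respect to $\pi^*$ as $V^*, Q^*$. We denote the stationary distribution of a policy $\pi$ in $\mathcal{M}$ by $\xi_\pi$; i.e.,

$$\xi_\pi(s) = \sum_{\dot{s} \in \mathcal{S},\ \dot{a} \in \mathcal{A}} \mathcal{P}(s|\dot{s},\dot{a})\,\pi(\dot{a}|\dot{s})\,\xi_\pi(\dot{s})$$

We overload notation by also denoting the state-action stationary distribution as $\xi_\pi(s,a) = \xi_\pi(s)\,\pi(a|s)$. Although only non-terminating MDPs have stationary distributions, a state distribution for terminating MDPs with similar properties exists (Gelada & Bellemare, 2019).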

2.2 Latent Space Models

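For some MDP $\mathcal{M}$, let $\bar{\mathcal{M}} = \langle \bar{\mathcal{S}}, \mathcal{A}, \bar{\mathcal{R}}, \bar{\mathcal{P}}, \gamma \rangle$ be an MDP where $\bar{\mathcal{S}}$ is a continuous space with metric $d_{\bar{\mathcal{S}}}$ and a shared action space $\mathcal{A}$ between $\mathcal{M}$ and $\bar{\mathcal{M}}$. Furthermore, let $\phi : \mathcal{S} \to \bar{\mathcal{S}}$ be an embedding function which connects the state spaces of these two MDPs. We refer to $(\bar{\mathcal{M}}, \phi)$ as a latent space model of $\mathcal{M}$.

Since $\bar{\mathcal{M}}$ is, by definition, an MDP, value functions can be defined in the standard way. We use $\bar{V}^{\bar\pi}, \bar{Q}^{\bar\pi}$ to denote the value functions of a policy $\bar\pi \in \bar\Pi$, where $\bar\Pi$ is the set of policies defined on the state space $\bar{\mathcal{S}}$. The transition and reward functions, $\bar{\mathcal{P}}_{\bar\pi}$ and $\bar{\mathcal{R}}_{\bar\pi}$, of a policy $\bar\pi$ are also defined in the standard manner. We use $\bar\pi^*$ to denote the optimal policy in $\bar{\mathcal{M}}$. The corresponding optimal state and action value functions are then $\bar{V}^*, \bar{Q}^*$. For ease of notation, when $s \in \mathcal{S}$, we use $\bar\pi(\cdot|s) := \bar\pi(\cdot|\phi(s))$ to denote first using $\phi$ to map $s$ to the state space $\bar{\mathcal{S}}$ of $\bar{\mathcal{M}}$ and subsequently using $\bar\pi$ to generate the probability distribution over actions.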

Although similar definitions of latent space models have been previously studied (Francois-Lavet et al., 2018; Zhang et al., 2018; Ha & Schmidhuber, 2018; Oh et al., 2017; Hafner et al., 2018; Kaiser et al., 2019; Silver et al., 2017), the parametrizations and training objectives used to learn such models have varied widely. For example, Ha & Schmidhuber (2018), Hafner et al. (2018) and Kaiser et al. (2019) use pixel prediction losses to learn the latent representation, while Oh et al. (2017) choose instead to optimize the model to predict next latent states with the same value function as the sampled next states.

In this work, we study the minimization of loss functions defined with respect to rewards and transitions in the latent space:

$$L_{\bar{\mathcal{R}}}(s,a) = \left| \mathcal{R}(s,a) - \bar{\mathcal{R}}(\phi(s),a) \right| \tag{1}$$

$$L_{\bar{\mathcal{P}}}(s,a) = D\big( \phi\mathcal{P}(\cdot|s,a),\ \bar{\mathcal{P}}(\cdot|\phi(s),a) \big) \tag{2}$$

where we use the shorthand notation $\phi\mathcal{P}(\cdot|s,a)$ to denote the probability distribution over $\bar{\mathcal{S}}$ of first sampling $s' \sim \mathcal{P}(\cdot|s,a)$ and then embedding $\bar{s}' = \phi(s')$, and where $D$ is a metric between probability distributions. To provide guarantees, $D$ in Equation 2 needs to be chosen carefully. For the majority of this work, we focus on the Wasserstein metric; in Section 6, we generalize some of the results to alternative metrics from the Maximum Mean Discrepancy family. Francois-Lavet et al. (2018) and Chung et al. (2019) have considered similar latent losses, but to the best of our knowledge ours is the first theoretical analysis of these models. See Figure 1 for an illustration of how the latent space losses are constructed.

We use the term DeepMDP to refer to a parameterized latent space model trained via the minimization of losses consisting of $L_{\bar{\mathcal{R}}}$ and $L_{\bar{\mathcal{P}}}$ (sometimes referred to as DeepMDP losses). In Section 3, we derive theoretical guarantees of DeepMDPs when minimizing $L_{\bar{\mathcal{R}}}$ and $L_{\bar{\mathcal{P}}}$ over the whole state space. However, our principal objective is to learn DeepMDPs parameterized by deep networks, which requires DeepMDP losses in the form of expectations; we show in Section 4 that similar theoretical guarantees can be obtained in this setting.

2.3 Wasserstein Metric

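Initially studied in the optimal transport literature (Villani, 2008), the Wasserstein-1 (which we simply refer to as the Wasserstein) metric between two distributions $P$ and $Q$, defined on a space $\chi$ with metric $d$, corresponds to the minimum cost of transforming $P$ into $Q$, where the cost of moving a particle at point $x$ to point $y$ comes from the underlying metric $d(x,y)$.

Definition 1.

The Wasserstein-1 metric $W_d$ between distributions $P$ and $Q$ on a metric space $\langle \chi, d \rangle$ is:

$$W_d(P,Q) = \inf_{\lambda \in \Gamma(P,Q)} \int_{\chi \times \chi} d(x,y)\,\lambda(x,y)\,dx\,dy$$

where $\Gamma(P,Q)$ denotes the set of all couplings of $P$ and $Q$.

When there is no ambiguity on what the underlying metric $d$ is, we will simply write $W$. The Monge-Kantorovich duality (Müller, 1997) shows that the Wasserstein has a dual form:

$$W_d(P,Q) = \sup_{f \in \mathcal{F}_d} \left| \mathbb{E}_{x \sim P} f(x) - \mathbb{E}_{y \sim Q} f(y) \right| \tag{3}$$

where $\mathcal{F}_d$ is the set of 1-Lipschitz functions under the metric $d$, $\mathcal{F}_d = \{ f : |f(x) - f(y)| \le d(x,y) \}$.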
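As a concrete illustration of the primal and dual views, the following is a minimal sketch for the special case of two equal-size empirical distributions on the real line with $d(x,y) = |x - y|$. In that case the optimal coupling is known to match sorted samples, so the Wasserstein-1 distance is the mean absolute difference of the sorted values. The function names `wasserstein_1d` and `dual_gap` are illustrative and do not come from the paper.

```python
def wasserstein_1d(xs, ys):
    """Wasserstein-1 distance between two equal-size empirical
    distributions on the real line, using d(x, y) = |x - y|.

    In one dimension the optimal coupling pairs the i-th smallest
    sample of one distribution with the i-th smallest of the other.
    """
    if len(xs) != len(ys) or not xs:
        raise ValueError("expected two non-empty samples of equal size")
    pairs = zip(sorted(xs), sorted(ys))
    return sum(abs(x - y) for x, y in pairs) / len(xs)


def dual_gap(xs, ys, f):
    """|E_P f - E_Q f| for a candidate test function f.

    If f is 1-Lipschitz, this value is a lower bound on the
    Wasserstein-1 distance (dual form).
    """
    mean_f_x = sum(f(x) for x in xs) / len(xs)
    mean_f_y = sum(f(y) for y in ys) / len(ys)
    return abs(mean_f_x - mean_f_y)


# Q is P shifted by one unit, so every particle moves a distance of 1.
p_samples = [0.0, 1.0]
q_samples = [1.0, 2.0]
primal = wasserstein_1d(p_samples, q_samples)
# The identity function is 1-Lipschitz and attains the supremum here.
dual = dual_gap(p_samples, q_samples, lambda x: x)
```

In this example the identity function attains the supremum, so the primal and dual values coincide. For other pairs of distributions a given 1-Lipschitz function generally yields only a lower bound.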

2.4 Lipschitz Norm of Value Functions

The degree to which a value function of $\bar{\mathcal{M}}$, $\bar{V}^{\bar\pi}$, approximates the value function $V^{\bar\pi}$ of $\mathcal{M}$ will depend on the Lipschitz norm of $\bar{V}^{\bar\pi}$. In this section we define and provide conditions for value functions to be Lipschitz. (Another benefit of MDP smoothness is improved learning dynamics: Pirotta et al. (2015) suggest that the smaller the Lipschitz constant of an MDP, the faster it is to converge to a near-optimal policy.) Note that we study the Lipschitz properties of DeepMDPs $\bar{\mathcal{M}}$ (instead of an MDP $\mathcal{M}$) because in this work, only the Lipschitz properties of DeepMDPs are relevant; the reader should note that these results follow for any continuous MDP with a metric state space.

We say a policy $\bar\pi \in \bar\Pi$ is Lipschitz-valued if its value function is Lipschitz, i.e. it has Lipschitz $\bar{V}^{\bar\pi}$ and $\bar{Q}^{\bar\pi}$ functions.

Definition 2.

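Let $\bar{\mathcal{M}}$ be a DeepMDP with a metric $d_{\bar{\mathcal{S}}}$. A policy $\bar\pi \in \bar\Pi$ is $K_{\bar V}$-Lipschitz-valued if for all $\bar{s}_1, \bar{s}_2 \in \bar{\mathcal{S}}$:

$$\left| \bar{V}^{\bar\pi}(\bar{s}_1) - \bar{V}^{\bar\pi}(\bar{s}_2) \right| \le K_{\bar V}\, d_{\bar{\mathcal{S}}}(\bar{s}_1, \bar{s}_2)$$

and if for all $a \in \mathcal{A}$:

$$\left| \bar{Q}^{\bar\pi}(\bar{s}_1, a) - \bar{Q}^{\bar\pi}(\bar{s}_2, a) \right| \le K_{\bar V}\, d_{\bar{\mathcal{S}}}(\bar{s}_1, \bar{s}_2)$$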

Several works have studied Lipschitz norm constraints on the transition and reward functions (Hinderer, 2005; Asadi et al., 2018) to provide conditions for value functions to be Lipschitz. Closely following their formulation, we define Lipschitz DeepMDPs as follows:

Definition 3.

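Let $\bar{\mathcal{M}}$ be a DeepMDP with a metric $d_{\bar{\mathcal{S}}}$. We say $\bar{\mathcal{M}}$ is $(K_{\bar{\mathcal{R}}}, K_{\bar{\mathcal{P}}})$-Lipschitz if, for all $\bar{s}_1, \bar{s}_2 \in \bar{\mathcal{S}}$ and $a \in \mathcal{A}$:

$$\left| \bar{\mathcal{R}}(\bar{s}_1, a) - \bar{\mathcal{R}}(\bar{s}_2, a) \right| \le K_{\bar{\mathcal{R}}}\, d_{\bar{\mathcal{S}}}(\bar{s}_1, \bar{s}_2)$$

$$W\big( \bar{\mathcal{P}}(\cdot|\bar{s}_1, a),\ \bar{\mathcal{P}}(\cdot|\bar{s}_2, a) \big) \le K_{\bar{\mathcal{P}}}\, d_{\bar{\mathcal{S}}}(\bar{s}_1, \bar{s}_2)$$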

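From here onwards, we will restrict our attention to the set of Lipschitz DeepMDPs for which the constant $K_{\bar{\mathcal{P}}}$ is sufficiently small, formalized in the following assumption:

Assumption 1.

The Lipschitz constant $K_{\bar{\mathcal{P}}}$ of the transition function $\bar{\mathcal{P}}$ is strictly smaller than $\frac{1}{\gamma}$.

From a practical standpoint, Assumption 1 is relatively strong, but simplifies our analysis by ensuring that close states cannot have future trajectories that are "divergent." An MDP might still not exhibit divergent behaviour even when $K_{\bar{\mathcal{P}}} \ge \frac{1}{\gamma}$. In particular, when episodes terminate after a finite amount of time, Assumption 1 becomes unnecessary. We leave as future work how to improve on this assumption.

We describe a small set of Lipschitz-valued policies. For any policy $\bar\pi \in \bar\Pi$, we refer to the Lipschitz norm of its transition function $\bar{\mathcal{P}}_{\bar\pi}$ as $K_{\bar{\mathcal{P}}_{\bar\pi}}$, i.e. $W\big( \bar{\mathcal{P}}_{\bar\pi}(\cdot|\bar{s}_1),\ \bar{\mathcal{P}}_{\bar\pi}(\cdot|\bar{s}_2) \big) \le K_{\bar{\mathcal{P}}_{\bar\pi}}\, d_{\bar{\mathcal{S}}}(\bar{s}_1, \bar{s}_2)$ for all $\bar{s}_1, \bar{s}_2 \in \bar{\mathcal{S}}$. Similarly, we denote the Lipschitz norm of the reward function $\bar{\mathcal{R}}_{\bar\pi}$ as $K_{\bar{\mathcal{R}}_{\bar\pi}}$.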

Lemma 1.

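Let $\bar{\mathcal{M}}$ be $(K_{\bar{\mathcal{R}}}, K_{\bar{\mathcal{P}}})$-Lipschitz. Then,

  1. The optimal policy $\bar\pi^*$ is $\frac{K_{\bar{\mathcal{R}}}}{1 - \gamma K_{\bar{\mathcal{P}}}}$-Lipschitz-valued.

  2. All policies with $K_{\bar{\mathcal{P}}_{\bar\pi}} < \frac{1}{\gamma}$ are $\frac{K_{\bar{\mathcal{R}}_{\bar\pi}}}{1 - \gamma K_{\bar{\mathcal{P}}_{\bar\pi}}}$-Lipschitz-valued.

  3. All constant policies (i.e. $\bar\pi(a|\bar{s}_1) = \bar\pi(a|\bar{s}_2)$ for all $a \in \mathcal{A}$ and $\bar{s}_1, \bar{s}_2 \in \bar{\mathcal{S}}$) are $\frac{K_{\bar{\mathcal{R}}}}{1 - \gamma K_{\bar{\mathcal{P}}}}$-Lipschitz-valued.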


See Appendix A for all proofs. ∎

A more general framework for understanding Lipschitz value functions is still lacking. Little prior work studying classes of Lipschitz-valued policies exists in the literature and we believe that this is an important direction for future research.

3 Global DeepMDP Bounds

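We now present our first main contributions: concrete DeepMDP losses, and several bounds which provide us with useful guarantees when these losses are minimized. We refer to these losses as the global DeepMDP losses, to emphasize their dependence on the whole state and action space (the $\infty$ superscript is a reference to the $\ell_\infty$ norm):

$$L^\infty_{\bar{\mathcal{R}}} = \sup_{s \in \mathcal{S},\ a \in \mathcal{A}} \left| \mathcal{R}(s,a) - \bar{\mathcal{R}}(\phi(s),a) \right|$$

$$L^\infty_{\bar{\mathcal{P}}} = \sup_{s \in \mathcal{S},\ a \in \mathcal{A}} W\big( \phi\mathcal{P}(\cdot|s,a),\ \bar{\mathcal{P}}(\cdot|\phi(s),a) \big)$$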


3.1 Value Difference Bound

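We start by bounding the difference of the value functions $Q^{\bar\pi}$ and $\bar{Q}^{\bar\pi}$ for any policy $\bar\pi \in \bar\Pi$. Note that $Q^{\bar\pi}(s,a)$ is computed using $\mathcal{P}$ and $\mathcal{R}$ on $\mathcal{S}$ while $\bar{Q}^{\bar\pi}(\phi(s),a)$ is computed using $\bar{\mathcal{P}}$ and $\bar{\mathcal{R}}$ on $\bar{\mathcal{S}}$.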

Lemma 2.

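Let $\mathcal{M}$ and $\bar{\mathcal{M}}$ be an MDP and DeepMDP respectively, with an embedding function $\phi$ and global loss functions $L^\infty_{\bar{\mathcal{R}}}$ and $L^\infty_{\bar{\mathcal{P}}}$. For any $K_{\bar V}$-Lipschitz-valued policy $\bar\pi \in \bar\Pi$ the value difference can be bounded by

$$\left| Q^{\bar\pi}(s,a) - \bar{Q}^{\bar\pi}(\phi(s),a) \right| \le \frac{L^\infty_{\bar{\mathcal{R}}} + \gamma K_{\bar V} L^\infty_{\bar{\mathcal{P}}}}{1 - \gamma}$$

The previous result holds for all policies $\bar\Pi \subseteq \Pi$, a subset of all possible policies $\Pi$. The reader might ask whether this is an interesting set of policies to consider; in Section 5, we answer with a fat “yes” by characterizing this set via a connection with bisimulation.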

A bound similar to Lemma 2 can be found in Asadi et al. (2018), who study non-latent transition models using the Wasserstein metric when there is access to an exact reward function. We also note that our results are arguably simpler, since we do not require the treatment of MDP transitions in terms of distributions over a set of deterministic components.
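To give a feel for the magnitudes involved, the sketch below evaluates the bound numerically. It assumes the Lemma 2 bound has the form $(L^\infty_{\bar{\mathcal{R}}} + \gamma K_{\bar V} L^\infty_{\bar{\mathcal{P}}}) / (1 - \gamma)$ and that the Lipschitz constant of the optimal policy from Lemma 1 is $K_{\bar{\mathcal{R}}} / (1 - \gamma K_{\bar{\mathcal{P}}})$. The function names and the example numbers are hypothetical.

```python
def lipschitz_value_constant(k_reward, k_transition, gamma):
    """Lipschitz constant K_R / (1 - gamma * K_P) of the value function.

    Requires gamma * K_P < 1, which is Assumption 1 (K_P < 1 / gamma).
    """
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma must lie in [0, 1)")
    if gamma * k_transition >= 1.0:
        raise ValueError("Assumption 1 violated: need K_P < 1 / gamma")
    return k_reward / (1.0 - gamma * k_transition)


def value_difference_bound(loss_reward, loss_transition, k_value, gamma):
    """Upper bound (L_R + gamma * K_V * L_P) / (1 - gamma) on the gap
    between the true and latent action values."""
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma must lie in [0, 1)")
    return (loss_reward + gamma * k_value * loss_transition) / (1.0 - gamma)


# Hypothetical numbers, chosen only to illustrate the arithmetic.
gamma = 0.9
k_value = lipschitz_value_constant(k_reward=1.0, k_transition=0.5, gamma=gamma)
bound = value_difference_bound(
    loss_reward=0.01, loss_transition=0.02, k_value=k_value, gamma=gamma
)
```

The $1/(1-\gamma)$ factor means that for discounts close to 1, even small per-step losses can translate into a loose bound, and the transition loss is further amplified by the Lipschitz constant of the value function.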

3.2 Representation Quality Bound

When a representation is used to predict the value of a policy in $\mathcal{M}$, a clear failure case is when two states with different values are collapsed to the same representation. The following result demonstrates that when the global DeepMDP losses $L^\infty_{\bar{\mathcal{R}}} = 0$ and $L^\infty_{\bar{\mathcal{P}}} = 0$, this failure case can never occur for the embedding function $\phi$.

Theorem 1.

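Let $\mathcal{M}$ and $\bar{\mathcal{M}}$ be an MDP and DeepMDP respectively, let $d_{\bar{\mathcal{S}}}$ be a metric in $\bar{\mathcal{S}}$, $\phi$ be an embedding function and $L^\infty_{\bar{\mathcal{R}}}$ and $L^\infty_{\bar{\mathcal{P}}}$ be the global loss functions. For any $K_{\bar V}$-Lipschitz-valued policy $\bar\pi \in \bar\Pi$ the representation $\phi$ guarantees that for all $s_1, s_2 \in \mathcal{S}$ and $a \in \mathcal{A}$,

$$\left| Q^{\bar\pi}(s_1,a) - Q^{\bar\pi}(s_2,a) \right| \le K_{\bar V}\, d_{\bar{\mathcal{S}}}(\phi(s_1), \phi(s_2)) + 2\, \frac{L^\infty_{\bar{\mathcal{R}}} + \gamma K_{\bar V} L^\infty_{\bar{\mathcal{P}}}}{1 - \gamma}$$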

This result justifies learning a DeepMDP and using the embedding function as a representation to predict values. A similar connection between the quality of representations and model based objectives in the linear setting was made by Parr et al. (2008).

3.3 Suboptimality Bound

For completeness, we also bound the performance loss of running the optimal policy of $\bar{\mathcal{M}}$ in $\mathcal{M}$, compared to the optimal policy $\pi^*$. See Theorem 5 in Appendix A.

4 Local DeepMDP Bounds

In large-scale tasks, data from many regions of the state space is often unavailable (challenging exploration environments like Montezuma's Revenge are a prime example), making it infeasible to measure, let alone optimize, the global losses. Further, when the capacity of a model is limited, or when sample efficiency is a concern, it might not even be desirable to precisely learn a model of the whole state space. Interestingly, we can still provide similar guarantees based on the DeepMDP losses, as measured under an expectation over a state-action distribution, denoted here as $\xi$. We refer to these as the losses local to $\xi$. Taking $L^\xi_{\bar{\mathcal{R}}}$, $L^\xi_{\bar{\mathcal{P}}}$ to be the reward and transition losses under $\xi$, respectively, we have the following local DeepMDP losses:

$$L^\xi_{\bar{\mathcal{R}}} = \mathbb{E}_{s,a \sim \xi} \left| \mathcal{R}(s,a) - \bar{\mathcal{R}}(\phi(s),a) \right|$$

$$L^\xi_{\bar{\mathcal{P}}} = \mathbb{E}_{s,a \sim \xi} \left[ W\big( \phi\mathcal{P}(\cdot|s,a),\ \bar{\mathcal{P}}(\cdot|\phi(s),a) \big) \right]$$

Losses of this form are compatible with the stochastic gradient descent methods used to train neural networks. Thus, study of the local losses allows us to bridge the gap between theory and practice.

4.1 Value Difference Bound

We provide a value function bound for the local case, analogous to Lemma 2.

Lemma 3.

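Let $\mathcal{M}$ and $\bar{\mathcal{M}}$ be an MDP and DeepMDP respectively, with an embedding function $\phi$. For any $K_{\bar V}$-Lipschitz-valued policy $\bar\pi \in \bar\Pi$, the expected value function difference can be bounded using the local loss functions $L^{\xi_{\bar\pi}}_{\bar{\mathcal{R}}}$ and $L^{\xi_{\bar\pi}}_{\bar{\mathcal{P}}}$ measured under $\xi_{\bar\pi}$, the stationary state-action distribution of $\bar\pi$:

$$\mathbb{E}_{s,a \sim \xi_{\bar\pi}} \left| Q^{\bar\pi}(s,a) - \bar{Q}^{\bar\pi}(\phi(s),a) \right| \le \frac{L^{\xi_{\bar\pi}}_{\bar{\mathcal{R}}} + \gamma K_{\bar V} L^{\xi_{\bar\pi}}_{\bar{\mathcal{P}}}}{1 - \gamma}$$

The provided bound guarantees that for any policy $\bar\pi \in \bar\Pi$ which visits state-action pairs where $L_{\bar{\mathcal{R}}}(s,a)$ and $L_{\bar{\mathcal{P}}}(s,a)$ are small, the DeepMDP will provide accurate value functions for any states likely to be seen under the policy. (The value functions might be inaccurate in states that the policy rarely visits.)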

4.2 Representation Quality Bound

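We can also extend the local value difference bound to provide a local bound on how well the representation $\phi$ can be used to predict the value function of a policy $\bar\pi \in \bar\Pi$, analogous to Theorem 1.

Theorem 2.

Let $\mathcal{M}$ and $\bar{\mathcal{M}}$ be an MDP and DeepMDP respectively, let $d_{\bar{\mathcal{S}}}$ be the metric in $\bar{\mathcal{S}}$ and $\phi$ be the embedding function. Let $\bar\pi \in \bar\Pi$ be any $K_{\bar V}$-Lipschitz-valued policy with stationary distribution $\xi_{\bar\pi}$, and let $L^{\xi_{\bar\pi}}_{\bar{\mathcal{R}}}$ and $L^{\xi_{\bar\pi}}_{\bar{\mathcal{P}}}$ be the local loss functions. For any two states $s_1, s_2 \in \mathcal{S}$, the representation $\phi$ is such that,

$$\left| V^{\bar\pi}(s_1) - V^{\bar\pi}(s_2) \right| \le K_{\bar V}\, d_{\bar{\mathcal{S}}}(\phi(s_1), \phi(s_2)) + \frac{L^{\xi_{\bar\pi}}_{\bar{\mathcal{R}}} + \gamma K_{\bar V} L^{\xi_{\bar\pi}}_{\bar{\mathcal{P}}}}{1 - \gamma} \left( \frac{1}{\xi_{\bar\pi}(s_1)} + \frac{1}{\xi_{\bar\pi}(s_2)} \right)$$

Thus, the representation quality argument given in Section 3.2 holds for any two states $s_1$ and $s_2$ which are visited often by a policy $\bar\pi$.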

5 Bisimulation

Figure 2: A pair of bisimilar states. In the game of Asteroids, the colors of the asteroids can vary randomly, but this in no way impacts gameplay.

5.1 Bisimulation Relations

Bisimulation relations in the context of RL (Givan et al., 2003) are a formalization of behavioural equivalence between states.

Definition 4 (Givan et al. (2003)).

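Given an MDP $\mathcal{M}$, an equivalence relation $B$ between states is a bisimulation relation if for all states $s_1, s_2 \in \mathcal{S}$ that are equivalent under $B$ (i.e. $s_1 B s_2$), the following conditions hold for all actions $a \in \mathcal{A}$:

$$\mathcal{R}(s_1, a) = \mathcal{R}(s_2, a)$$

$$\mathcal{P}(G|s_1, a) = \mathcal{P}(G|s_2, a), \quad \forall G \in \mathcal{S}/B$$

where $\mathcal{S}/B$ denotes the partition of $\mathcal{S}$ under the relation $B$, the set of all groups $G$ of equivalent states, and where $\mathcal{P}(G|s,a) = \sum_{s' \in G} \mathcal{P}(s'|s,a)$.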

Note that bisimulation relations are not unique. For example, the equality relation $=$ is always a bisimulation relation. Of particular interest is the maximal bisimulation relation $\sim$, which defines the partition $\mathcal{S}/\!\sim$ with the fewest elements (or equivalently, the relation that generates the largest possible groups of states). We will say that two states are bisimilar if they are equivalent under $\sim$. Essentially, two states are bisimilar if (1) they have the same immediate reward for all actions and (2) both of their distributions over next-states contain states which themselves are bisimilar. Figure 2 gives an example of states that are bisimilar in the Atari 2600 game Asteroids. An important property of bisimulation relations is that any two bisimilar states $s_1, s_2$ must have the same optimal value function, $Q^*(s_1, a) = Q^*(s_2, a)$ for all $a \in \mathcal{A}$. Bisimulation relations were first introduced for state aggregation (Givan et al., 2003), which is a form of representation learning, since merging behaviourally equivalent states does not result in the loss of information necessary for solving the MDP.
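For a small finite MDP, the maximal bisimulation relation can be computed by partition refinement: start with all states in one block and repeatedly split blocks whose members disagree on immediate rewards or on the probability mass sent to each current block. The sketch below is an illustration of that idea, not the procedure used in the paper; the data layout (`R[s][a]` for rewards, `P[s][a]` as a dictionary from next states to probabilities) and the toy MDP are assumptions made for the example.

```python
def maximal_bisimulation(states, actions, R, P, ndigits=9):
    """Return a dict mapping each state to the id of its bisimulation class.

    R[s][a] is the reward, P[s][a] maps next states to probabilities.
    Values are rounded to `ndigits` so float noise does not split blocks.
    """
    block_of = {s: 0 for s in states}  # coarsest partition: one block
    while True:
        n_blocks = len(set(block_of.values()))
        new_ids = {}
        new_block_of = {}
        for s in states:
            # Signature: current block, plus per-action reward and the
            # probability mass sent to each current block.
            signature = [block_of[s]]
            for a in actions:
                mass = {}
                for s_next, prob in P[s][a].items():
                    b = block_of[s_next]
                    mass[b] = mass.get(b, 0.0) + prob
                mass_items = tuple(sorted(
                    (b, round(m, ndigits)) for b, m in mass.items()
                    if round(m, ndigits) != 0
                ))
                signature.append((round(R[s][a], ndigits), mass_items))
            key = tuple(signature)
            new_block_of[s] = new_ids.setdefault(key, len(new_ids))
        if len(new_ids) == n_blocks:  # no block was split: stable partition
            return new_block_of
        block_of = new_block_of


# Toy MDP (hypothetical): s0 and s1 differ only in an irrelevant detail.
states = ["s0", "s1", "s2", "s3"]
actions = ["a"]
R = {"s0": {"a": 0.0}, "s1": {"a": 0.0}, "s2": {"a": 1.0}, "s3": {"a": 0.0}}
P = {
    "s0": {"a": {"s2": 1.0}},
    "s1": {"a": {"s2": 1.0}},
    "s2": {"a": {"s3": 1.0}},
    "s3": {"a": {"s3": 1.0}},
}
classes = maximal_bisimulation(states, actions, R, P)
```

In this toy example `s0` and `s1` end up in the same class, while `s2` and `s3` each form their own class. Note that `s3` is separated from `s0` even though both have zero reward, because their successors are not bisimilar.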

5.2 Bisimulation Metrics

A drawback of bisimulation relations is their all-or-nothing nature. Two states that are nearly identical, but differ slightly in their reward or transition functions, are treated as though they were just as unrelated as two states with nothing in common. Relying on the optimal transport perspective of the Wasserstein, Ferns et al. (2004) introduced bisimulation metrics, which are pseudometrics that quantify the behavioural similarity of two discrete states.

A pseudometric $d$ satisfies all the properties of a metric except identity of indiscernibles, $d(x,y) = 0 \Leftrightarrow x = y$. A pseudometric can be used to define an equivalence relation by saying that two points are equivalent if they have zero distance; this is called the kernel of the pseudometric. Note that pseudometrics must obey the triangle inequality, which ensures the kernel satisfies the transitive property. Without any changes to its definition, the Wasserstein metric can be extended to spaces $\langle \chi, d \rangle$, where $d$ is a pseudometric. Intuitively, the usage of a pseudometric in the Wasserstein can be interpreted as allowing different points $x_1 \ne x_2$ in $\chi$ to be equivalent under the pseudometric (i.e. $d(x_1, x_2) = 0$). Thus, there is no need for transportation from one to the other.

Ferns et al. (2011) proposed an extension of bisimulation metrics based on Banach fixed points, which allows the metric to be defined for MDPs with both discrete and continuous state spaces.

Definition 5 (Ferns et al. (2011)).

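Let $\mathcal{M}$ be an MDP and denote by $Z$ the space of pseudometrics on the space $\mathcal{S}$ s.t. $d(s_1, s_2) \in [0, \infty)$ for $d \in Z$. Define the operator $F : Z \to Z$ to be:

$$F d(s_1, s_2) = \max_{a} \ (1 - \gamma) \left| \mathcal{R}(s_1, a) - \mathcal{R}(s_2, a) \right| + \gamma\, W_d\big( \mathcal{P}(\cdot|s_1, a),\ \mathcal{P}(\cdot|s_2, a) \big)$$

  1. The operator $F$ is a contraction with a unique fixed point denoted by $\tilde{d}$.

  2. The kernel of $\tilde{d}$ is the maximal bisimulation relation $\sim$ (i.e. $\tilde{d}(s_1, s_2) = 0 \Leftrightarrow s_1 \sim s_2$).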

A useful property of bisimulation metrics is that the optimal value function difference between any two states can be upper bounded by the bisimulation metric between the two states.
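The fixed-point characterization suggests a simple iterative computation. The sketch below handles only the special case of a finite MDP with deterministic transitions, where the Wasserstein distance between two next-state distributions collapses to the current distance between the two next states. It assumes the operator weights the reward difference by $(1 - \gamma)$ and the transition term by $\gamma$; the data layout and toy MDP are hypothetical.

```python
def bisimulation_metric_deterministic(states, actions, R, T, gamma,
                                      tol=1e-10, max_iters=10_000):
    """Iterate d <- F(d) until convergence for a deterministic finite MDP.

    R[s][a] is the reward and T[s][a] the unique next state.  With
    deterministic transitions, the Wasserstein term reduces to
    d(T[s1][a], T[s2][a]).
    """
    d = {(s1, s2): 0.0 for s1 in states for s2 in states}
    for _ in range(max_iters):
        new_d = {}
        for s1 in states:
            for s2 in states:
                new_d[(s1, s2)] = max(
                    (1.0 - gamma) * abs(R[s1][a] - R[s2][a])
                    + gamma * d[(T[s1][a], T[s2][a])]
                    for a in actions
                )
        change = max(abs(new_d[key] - d[key]) for key in d)
        d = new_d
        if change < tol:
            break
    return d


# Toy deterministic MDP (hypothetical): s0 and s1 are bisimilar.
states = ["s0", "s1", "s2", "s3"]
actions = ["a"]
R = {"s0": {"a": 0.0}, "s1": {"a": 0.0}, "s2": {"a": 1.0}, "s3": {"a": 0.0}}
T = {"s0": {"a": "s2"}, "s1": {"a": "s2"}, "s2": {"a": "s3"}, "s3": {"a": "s3"}}
metric = bisimulation_metric_deterministic(states, actions, R, T, gamma=0.9)
```

Unlike the all-or-nothing relation, the resulting pseudometric assigns zero distance to the bisimilar pair and graded, nonzero distances to the remaining pairs. Handling stochastic transitions would require solving an optimal transport problem for every state pair at every iteration, which is one source of the high computational cost mentioned in the text.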

Bisimulation metrics have been used for state aggregation (Ferns et al., 2004; Ruan et al., 2015), feature discovery (Comanici & Precup, 2011) and transfer learning between MDPs (Castro & Precup, 2010), but due to their high computational cost and poor compatibility with deep networks they have not been successfully applied to large scale settings.

5.3 Connection with DeepMDPs

The representation learned by global DeepMDP losses with the Wasserstein metric can be connected to bisimulation metrics.

Theorem 3.

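Let $\mathcal{M}$ be an MDP and $\bar{\mathcal{M}}$ be a $(K_{\bar{\mathcal{R}}}, K_{\bar{\mathcal{P}}})$-Lipschitz DeepMDP with metric $d_{\bar{\mathcal{S}}}$. Let $\phi$ be the embedding function and $L^\infty_{\bar{\mathcal{R}}}$ and $L^\infty_{\bar{\mathcal{P}}}$ be the global DeepMDP losses. The bisimulation distance in $\mathcal{M}$, $\tilde{d} : \mathcal{S} \times \mathcal{S} \to \mathbb{R}^+$, can be upper bounded by the distance in the embedding and the losses in the following way:

$$\tilde{d}(s_1, s_2) \le \frac{(1 - \gamma) K_{\bar{\mathcal{R}}}}{1 - \gamma K_{\bar{\mathcal{P}}}}\, d_{\bar{\mathcal{S}}}(\phi(s_1), \phi(s_2)) + 2 \left( L^\infty_{\bar{\mathcal{R}}} + \gamma L^\infty_{\bar{\mathcal{P}}} \frac{K_{\bar{\mathcal{R}}}}{1 - \gamma K_{\bar{\mathcal{P}}}} \right)$$

This result provides a similar bound to Theorem 1, except that instead of bounding the value difference, the bisimulation distance is bounded. We speculate that similar results should be possible based on local DeepMDP losses, but they would require a generalization of bisimulation metrics to the local setting.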

5.4 Characterizing $\bar\Pi$

In order to better understand the set of policies $\bar\Pi$ (which appears in the bounds of Sections 3 and 4), we first consider the set of bisimilar policies, defined as $\tilde\Pi = \{ \pi : \forall s_1, s_2 \in \mathcal{S},\ s_1 \sim s_2 \implies \pi(a|s_1) = \pi(a|s_2)\ \forall a \}$, which contains all policies that act the same way on states that are bisimilar. Although this set excludes many policies in $\Pi$, we argue that it is adequately expressive, since any policy that acts differently on states that are bisimilar is fundamentally uninteresting. (For control, searching over these policies increases the size of the search space with no benefits on the optimality of the solution.)

We show a connection between deep policies and bisimilar policies by proving that the set of Lipschitz-deep policies, , approximately contains the set of Lipschitz-bisimilar policies, , defined as follows:

The following theorem proves that minimizing the global DeepMDP losses ensures that for any , there is a deep policy which is close to , where the constant .

Theorem 4.

Let be an MDP and be a (, )-Lipschitz DeepMDP, with an embedding function and global loss functions and . Denote by and the sets of Lipschitz-bisimilar and Lipschitz-deep policies. Then for any there exists a which is close to in the sense that, for all and ,

Figure 3: Visualization of the way in which different smoothness properties on the value function are derived. The left compares two near-identical frames of Pong, (a) and (b), whose only difference is the position of the player’s paddle. The plots on the right show the optimal value of the state (top) and the derivative of the optimal value (bottom) as a function of the position of the player’s paddle, assuming all other features of the state are kept constant. The associated smoothness of each Norm-MMD metric is shown visually. (Note that this is for illustrative purposes only, and was not actually computed from the real game. The curve in the value function represents noisy dynamics, such as those induced by “sticky actions” (Mnih et al., 2015); if the environment were deterministic, the optimal value would be a step function.)

6 Beyond the Wasserstein

Interestingly, value difference bounds (Lemmas 2 and 3) can be derived for many different choices of probability metric $D$ (in the DeepMDP transition loss function, Equation 2). Here, we generalize the result to a family of Maximum Mean Discrepancy (MMD) metrics (Gretton et al., 2012) defined via a function norm that we denote as Norm Maximum Mean Discrepancy (Norm-MMD) metrics. The role of the Lipschitz norm in the value difference bounds is a consequence of using the Wasserstein; when we switch from the Wasserstein to another metric, it is replaced by a different term. We interpret these terms as different forms of smoothness of the value functions in $\bar{\mathcal{M}}$.

By choosing a metric whose associated smoothness corresponds well to the environment, we can potentially improve the tightness of the bounds. For example, in environments with highly non-Lipschitz dynamics, it may be impossible to learn an accurate DeepMDP whose deep value function has a small Lipschitz norm. Instead, the associated smoothness of another metric might be more appropriate. Another reason to consider other metrics is computational; the Wasserstein has high computational cost and suffers from biased stochastic gradient estimates (Bińkowski et al., 2018; Bellemare et al., 2017b), so minimizing a simpler quantity, such as the KL divergence, may be more convenient.

6.1 Norm Maximum Mean Discrepancy Metrics

MMD metrics (Gretton et al., 2012) are a family of probability metrics, each generated via a class of functions. They have also been studied by Müller (1997) under the name of Integral Probability Metrics.

Definition 6 (Gretton et al. (2012) Definition 2).

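Let $P$ and $Q$ be distributions on a measurable space $\chi$ and let $\mathcal{F}_D$ be a class of functions $f : \chi \to \mathbb{R}$. The Maximum Mean Discrepancy $D$ is

$$D(P, Q) = \sup_{f \in \mathcal{F}_D} \left| \mathbb{E}_{x \sim P} f(x) - \mathbb{E}_{y \sim Q} f(y) \right|$$

When $P = Q$ it's obvious that $D(P, Q) = 0$ regardless of the function class $\mathcal{F}_D$. But the choice of the class of functions leads to MMD metrics with different behaviours and properties. Of interest to us are function classes generated via function seminorms (a seminorm $\|\cdot\|_D$ is a norm except that $\|f\|_D = 0$ does not imply $f = 0$). Concretely, we define a Norm-MMD metric $D$ to be an MMD metric generated from a function class $\mathcal{F}_D$ of the following form:

$$\mathcal{F}_D = \{ f : \|f\|_D \le 1 \}$$

where $\|\cdot\|_D$ is the associated function seminorm of $D$. We will see that the family of Norm-MMDs are well suited for the task of latent space modeling. Their key property is the following: let $D$ be a Norm-MMD, then for any function $f$ s.t. $\|f\|_D \le K$,

$$\left| \mathbb{E}_{x \sim P} f(x) - \mathbb{E}_{y \sim Q} f(y) \right| \le K \cdot D(P, Q)$$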


We now discuss three particularly interesting examples of Norm-MMD metrics.

Total Variation: Defined as $TV(P, Q) = \frac{1}{2} \int_\chi |P(x) - Q(x)|\,dx$, the Total Variation is one of the most widely-studied metrics. Pinsker's inequality (Borwein & Lewis, 2005, p.63) bounds the TV with the Kullback–Leibler (KL) divergence. The Total Variation is also the Norm-MMD generated from the set of functions with absolute value bounded by $1$ (Müller, 1997). Thus, the function norm $\|f\|_{TV} = \|f\|_\infty$.

Wasserstein metric: The interpretation of the Wasserstein as an MMD metric is clear from its dual form (Equation 3), where the function class $\mathcal{F}_W$ is the set of 1-Lipschitz functions,

$$\mathcal{F}_W = \{ f : |f(x) - f(y)| \le d(x, y),\ \forall x, y \in \chi \}$$

The norm associated with the Wasserstein metric is therefore the Lipschitz norm, which in turn is the $\ell_\infty$ norm of $f'$ (the derivative of $f$). Thus, $\|f\|_W = \|f'\|_\infty$.

Energy distance: The energy distance $E$ was first developed to compare distributions in high dimensions via a two sample test (Székely & Rizzo, 2004; Gretton et al., 2012). It is defined as:

$$E(P, Q) = 2\, \mathbb{E}_{x \sim P,\, y \sim Q} \|x - y\| - \mathbb{E}_{x, x' \sim P} \|x - x'\| - \mathbb{E}_{y, y' \sim Q} \|y - y'\|$$

where $x, x' \sim P$ denotes two independent samples of the distribution $P$. Sejdinovic et al. (2013) showed the connection between the energy distance and MMD metrics. Similarly to the Wasserstein, the Energy distance's associated seminorm $\|f\|_E$ is defined through the derivative of $f$.
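To make the difference between these metrics concrete, the sketch below computes the Total Variation and the energy distance for discrete distributions on the real line, represented as dictionaries from support points to probabilities. It uses the standard definitions ($\frac{1}{2}\sum_x |P(x) - Q(x)|$ for the Total Variation and $2\,\mathbb{E}|X - Y| - \mathbb{E}|X - X'| - \mathbb{E}|Y - Y'|$ for the energy distance); the function names and example distributions are illustrative only.

```python
def total_variation(p, q):
    """Total Variation between two discrete distributions given as
    {point: probability} dictionaries."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in support)


def expected_distance(p, q):
    """E|X - Y| for independent X ~ p and Y ~ q on the real line."""
    return sum(px * qy * abs(x - y)
               for x, px in p.items()
               for y, qy in q.items())


def energy_distance(p, q):
    """Energy distance 2 E|X - Y| - E|X - X'| - E|Y - Y'|."""
    return (2.0 * expected_distance(p, q)
            - expected_distance(p, p)
            - expected_distance(q, q))


# Two point masses that are close, and two that are far apart.
origin = {0.0: 1.0}
near = {1.0: 1.0}
far = {10.0: 1.0}

tv_near, tv_far = total_variation(origin, near), total_variation(origin, far)
en_near, en_far = energy_distance(origin, near), energy_distance(origin, far)
```

The Total Variation is the same for the near and far pairs because it ignores the geometry of the underlying space, whereas the energy distance, like the Wasserstein, grows with how far the mass has to move. This is one way to see why the smoothness notion attached to each metric differs.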

6.2 Value Function Smoothness

In the context of value functions, we interpret the function seminorms associated with Norm-MMD metrics as different forms of smoothness.

Definition 7.

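Let $\bar{\mathcal{M}}$ be a DeepMDP and let $D$ be a Norm-MMD with associated norm $\|\cdot\|_D$. We say that a policy $\bar\pi \in \bar\Pi$ is $K_{\bar V}$-smooth-valued if:

$$\left\| \bar{V}^{\bar\pi} \right\|_D \le K_{\bar V}$$

and if for all $a \in \mathcal{A}$:

$$\left\| \bar{Q}^{\bar\pi}(\cdot, a) \right\|_D \le K_{\bar V}$$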

For a value function $\bar{V}$, $\|\bar{V}\|_{TV}$ is the maximum absolute value of $\bar{V}$. Both $\|\bar{V}\|_W$ and $\|\bar{V}\|_E$ depend on the derivative of $\bar{V}$, but while $\|\bar{V}\|_W$ is governed by the point of maximal change, $\|\bar{V}\|_E$ instead measures the amount of change over the whole state space $\bar{\mathcal{S}}$. Thus, a value function with a small region of high derivative (and thus, large $\|\bar{V}\|_W$) can still have small $\|\bar{V}\|_E$. In Figure 3 we provide an intuitive visualization of these three forms of smoothness in the game of Pong.

One advantage of the Total Variation is that it requires minimal assumptions on the DeepMDP. If the reward function is bounded, i.e. $|\bar{\mathcal{R}}(\bar{s}, a)| \le K_{\bar{\mathcal{R}}}$, then all policies $\bar\pi \in \bar\Pi$ are $\frac{K_{\bar{\mathcal{R}}}}{1 - \gamma}$-smooth-valued. We leave it to future work to study value function smoothness more generally for different Norm-MMD metrics and their associated norms.

6.3 Generalized Value Difference Bounds

The global and local value difference results (Lemmas 2 and 3), as well as the suboptimality result (Theorem 5), can easily be derived when $D$ is any Norm-MMD metric. Due to the repetitiveness of these results, we don't include them in the main paper; refer to Appendix A.6 for the full statements and proofs. We leave it to future work to characterize the set of policies $\bar\Pi$ when general (i.e. non-Wasserstein) Norm-MMD metrics are used.

The fact that the representation quality results (Theorems 1 and 2) and the connection with bisimulation (Theorems 3 and 4) don’t generalize to Norm-MMD metrics emphasizes the special role the Wasserstein metric plays for representation learning.

7 Related Work in Representation Learning

State aggregation methods (Abel et al., 2017; Li et al., 2006; Singh et al., 1995; Givan et al., 2003; Jiang et al., 2015; Ruan et al., 2015) attempt to reduce the dimensionality of the state space by joining states together, taking the perspective that a good representation is one that reduces the total number of states without sacrificing any necessary information. Other representation learning approaches take the perspective that an optimal representation contains features that allow for the linear parametrization of the optimal value function (Comanici & Precup, 2011; Mahadevan & Maggioni, 2007). Recently, Bellemare et al. (2019); Dadashi et al. (2019) approached the representation learning problem from the perspective that a good representation is one that allows the prediction via a linear map of any value function in the value function space. In contrast, we have argued that a good representation (1) allows for the parametrization of a large set of interesting policies and (2) allows for the good approximation of the value function of these policies.

Concurrently, a suite of methods combining model-free deep reinforcement learning with auxiliary tasks has shown large benefits on a wide variety of domains (Jaderberg et al., 2016; van den Oord et al., 2018; Mirowski et al., 2017). Distributional RL (Bellemare et al., 2017a), which was not initially introduced as a representation learning technique, has been shown by Lyle et al. (2019) to only play an auxiliary task role. Similarly, Fedus et al. (2019) studied different discounting techniques by learning the spectrum of value functions for different discount values $\gamma$, and incidentally found that to be a highly useful auxiliary task. Although successful in practice, these auxiliary task methods currently lack strong theoretical justification. Our approach also proposes to minimize losses as an auxiliary task for representation learning, for a specific choice of losses: the DeepMDP losses. We have formally justified this choice of losses by providing theoretical guarantees on representation quality.

Figure 4: Given a state in our DonutWorld environment (first row), we plot a heatmap of the distance between that latent state and each other latent state, for both autoencoder representations (second row) and DeepMDP representations (third row). More-similar latent states are represented by lighter colors. (a) One-track DonutWorld. (b) Four-track DonutWorld.

8 Empirical Evaluation

Our results depend on minimizing losses in expectation, which is the main requirement for deep networks to be applicable. Still, two main obstacles arise when turning these theoretical results into practical algorithms:

(1) Minimization of the Wasserstein. Arjovsky et al. (2017) first proposed the use of the Wasserstein distance for Generative Adversarial Networks (GANs) via its dual formulation (see Equation 3). Their approach consists of training a network, constrained to be 1-Lipschitz, to attain the supremum of the dual. Once this supremum is attained, the Wasserstein can be minimized by differentiating through the network. Quantile regression has been proposed as an alternative solution to the minimization of the Wasserstein (Dabney et al., 2018b; Dabney et al., 2018a), and has been shown to perform well for Distributional RL. The reader might note that stochastic minimization of the Wasserstein distance has been found to be biased by Bellemare et al. (2017b) and Bińkowski et al. (2018). In our experiments, we circumvent these issues by assuming that both $\mathcal{P}$ and $\bar{\mathcal{P}}$ are deterministic. This reduces the Wasserstein distance to $\|\phi(\mathcal{P}(s,a)) - \bar{\mathcal{P}}(\phi(s),a)\|$, where $\mathcal{P}(s,a)$ and $\bar{\mathcal{P}}(\bar{s},a)$ denote the deterministic transition functions.

(2) Controlling the Lipschitz constants $K_{\bar{\mathcal{R}}}$ and $K_{\bar{\mathcal{P}}}$. We also turn to the field of Wasserstein GANs for approaches to constrain deep networks to be Lipschitz. Originally, Arjovsky et al. (2017) used a projection step to constrain the discriminator function to be 1-Lipschitz. Gulrajani et al. (2017a) proposed using a gradient penalty, and showed improved learning dynamics. Lipschitz continuity has also been proposed as a regularization method by Gouk et al. (2018), who provided an approach to compute an upper bound to the Lipschitz constant of neural nets. In our experiments, we follow Gulrajani et al. (2017a) and utilize the gradient penalty.
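The two ingredients above can be sketched on a toy problem. This is only an illustration under stated assumptions, not the authors' implementation: the functions `encode`, `env_step`, `latent_step`, and `latent_reward` are hypothetical linear maps, the finite-difference gradient stands in for automatic differentiation, and the two-sided penalty form is borrowed from the WGAN gradient-penalty literature rather than confirmed as the exact form used in the experiments.

```python
import math

def encode(obs):
    # Hypothetical encoder: 3-d observation -> 2-d latent state.
    return [obs[0] + obs[2], obs[1]]

def env_step(obs, action):
    # Hypothetical deterministic environment transition.
    return [obs[0] + action, obs[1], obs[2]]

def latent_step(latent, action, weight=1.0):
    # Hypothetical deterministic latent transition model.
    return [weight * latent[0] + action, weight * latent[1]]

def latent_reward(latent):
    # Hypothetical scalar latent reward model; its gradient norm is 5.
    return 3.0 * latent[0] - 4.0 * latent[1]

def transition_loss(obs, action, weight=1.0):
    # For two point masses, the Wasserstein distance is the distance
    # between the points, so no dual critic has to be trained.
    target = encode(env_step(obs, action))
    predicted = latent_step(encode(obs), action, weight)
    return math.dist(target, predicted)

def gradient_penalty(f, point, target_norm=1.0, eps=1e-5):
    # Estimate the gradient of the scalar function f at `point` with
    # central finite differences, then penalize the squared deviation
    # of its norm from `target_norm`.
    grad = []
    for i in range(len(point)):
        hi = list(point)
        lo = list(point)
        hi[i] += eps
        lo[i] -= eps
        grad.append((f(hi) - f(lo)) / (2.0 * eps))
    norm = math.sqrt(sum(g * g for g in grad))
    return (norm - target_norm) ** 2
```

With `weight=1.0` the toy latent model matches the encoded environment dynamics exactly and the transition loss vanishes; any other weight yields a positive loss. The penalty grows as the gradient norm of `latent_reward` moves away from the target, which is how such a term discourages large Lipschitz constants when added to a training loss.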

Figure 5: Due to the competition between reward and transition losses, the optimization procedure spends significant time in local minima early on in training. It eventually learns a good representation, which it then optimizes further. (Note that the curves use different scaling on the y-axis.)
Figure 6: We compare the DeepMDP agent versus the C51 agent on the 60 games from the ALE (3 seeds each). For each game, the percentage performance improvement of DeepMDP over C51 is recorded.

8.1 DonutWorld Experiments

In order to evaluate whether we can learn effective representations, we study the representations learned by DeepMDPs in a simple synthetic environment we call DonutWorld. DonutWorld consists of an agent rewarded for running clockwise around a fixed track. Staying in the center of the track results in faster movement. Observations are given in terms of 32x32 greyscale pixel arrays, but there is a simple 2D latent state space (the x-y coordinates of the agent). We investigate whether the x-y coordinates are correctly recovered when learning a two-dimensional representation.

This task epitomizes the low-dimensional dynamics, high-dimensional observations structure typical of Atari 2600 games, while being sufficiently simple to experiment with. We implement the DeepMDP training procedure using TensorFlow and compare it to a simple autoencoder baseline. See Appendix B for a full environment specification, experimental setup, and additional experiments. Code for replicating all experiments is included in the supplementary material.

In order to investigate whether the learned representations correspond well to reality, we plot a heatmap of closeness of representation for various states. Figure 4(a) shows that the DeepMDP representations effectively recover the underlying state of the agent, i.e. its 2D position, from the high-dimensional pixel observations. In contrast, the autoencoder representations are less meaningful, even when the autoencoder solves the task near-perfectly.
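The quantity visualized in such a heatmap can be sketched as follows. This is a minimal illustration, assuming the latent states are available as a grid of 2D points indexed by the agent's true position; the function name `latent_distance_map` and the use of Euclidean distance are assumptions, not details taken from the paper.

```python
import math

def latent_distance_map(reference, latents):
    # Distance from the reference latent state to every latent state in
    # a 2-D grid; smaller values mean more-similar representations.
    return [[math.dist(reference, z) for z in row] for row in latents]

# Hypothetical 2x2 grid of learned 2-d latent states.
grid = [[(0.0, 0.0), (3.0, 4.0)],
        [(1.0, 0.0), (0.0, 2.0)]]
distances = latent_distance_map((0.0, 0.0), grid)
```

Rendering `distances` with a colormap in which small values are light would give a plot of the kind described for Figure 4.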

In Figure 4(b), we modify the environment: rather than a single track, the environment now has four identical tracks. The agent starts in one uniformly at random and cannot move between tracks. The DeepMDP hidden state correctly merges all states with indistinguishable value functions, learning a deep state representation which is almost completely invariant to which track the agent is in.

The DeepMDP training loss can be difficult to optimize, as illustrated in Figure 5. This is due to the tendency of the transition and reward losses to compete with one another. If the deep state representation is uniformly zero, the transition loss will be zero as well; this is an easily-discovered local optimum, and gradient descent tends to arrive at this point early on in training. Of course, an informationless representation results in a large reward loss. As training progresses, the algorithm incurs a small amount of transition loss in return for a large decrease in reward loss, resulting in a net decrease in loss.
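The trade-off described above can be reproduced on a toy chain. The sketch below is an assumption-laden illustration rather than the paper's experiment: the environment (states 0 to 3, next state `s + 1`, reward equal to `s`), the scalar latent space, and all function names are hypothetical.

```python
def mean(values):
    values = list(values)
    return sum(values) / len(values)

# Toy chain (an assumption): tuples of (state, reward, next_state).
TRANSITIONS = [(s, float(s), s + 1) for s in range(4)]

def deepmdp_losses(encode, reward_model, latent_step, transitions=TRANSITIONS):
    # Mean absolute error of the latent reward and latent transition models.
    reward_loss = mean(
        abs(reward_model(encode(s)) - r) for s, r, _ in transitions)
    transition_loss = mean(
        abs(latent_step(encode(s)) - encode(s_next))
        for s, _, s_next in transitions)
    return reward_loss, transition_loss

# Collapsed representation: every state maps to zero, so the latent
# transition is trivially perfect, but the reward model can only
# predict the mean reward.
collapsed = deepmdp_losses(lambda s: 0.0, lambda z: 1.5, lambda z: z)

# Informative representation with a slightly imperfect latent model:
# a small transition loss buys a large drop in reward loss.
informative = deepmdp_losses(lambda s: float(s), lambda z: z, lambda z: z + 0.9)
```

Here the collapsed encoder attains zero transition loss at the cost of a reward loss of 1.0, while the informative encoder pays a transition loss of about 0.1 and drives the reward loss to zero, giving a lower total.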

In DonutWorld, which has very simple dynamics, gradient descent is able to discover a good representation after only a few thousand iterations. However, in complex environments such as Atari, it is often much more difficult to discover representations that allow us to escape the low-information local minima. Using architectures with good inductive biases can help to combat this, as shown in Section 8.3. This issue also motivates the use of auxiliary losses (such as value approximation losses or reconstruction losses), which may help guide the optimizer towards good solutions; see Appendix C.5.

8.2 Atari 2600 Experiments

Figure 7: Performance of C51 with model-based auxiliary objectives. Three types of transition models are used for predicting next latent states: a single convolutional layer (convolutional), a single fully-connected layer (one-layer), and a two-layer fully-connected network (two-layer).
Figure 8: Using various auxiliary tasks in the Arcade Learning Environment. We compare predicting the next state’s representation (Next Latent State, recommended by theoretical bounds on DeepMDPs) with reconstructing the current observation (Observation), predicting the next observation (Next Observation), and predicting the next C51 logits (Next Logits). Training curves for a baseline C51 agent are also shown.

In this section, we demonstrate practical benefits of approximately learning a DeepMDP in the Arcade Learning Environment (Bellemare et al., 2013). Our results on representation-similarity indicate that learning a DeepMDP is a principled method for learning a high-quality representation. Therefore, we minimize DeepMDP losses as an auxiliary task alongside model-free reinforcement learning, learning a single representation which is shared between both tasks. Our implementations of the proposed algorithms are based on Dopamine (Castro et al., 2018).

We adopt the Distributional Q-learning approach to model-free RL; specifically, we use as a baseline the C51 agent (Bellemare et al., 2017a), which estimates probability masses on a discrete support and minimizes the KL divergence between the estimated distribution and a target distribution. C51 encodes the input frames using a convolutional neural network $\phi$, outputting a dense vector representation $\bar{s} = \phi(s)$. The C51 Q-function is a feed-forward neural network which maps $\bar{s}$ to an estimate of the return distribution’s logits.

To incorporate learning a DeepMDP as an auxiliary learning objective, we define a deep reward function and deep transition function. These are each implemented as a feed-forward neural network, which uses $\bar{s}$ to estimate the immediate reward and the next-state representation, respectively. The overall objective function is a simple linear combination of the standard C51 loss and the Wasserstein distance-based approximations to the local DeepMDP loss given by Equations 6 and 7. For experimental details, see Appendix C.
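The linear combination of losses can be written as a short sketch. The class `LossWeights`, the function `total_loss`, and the default coefficients are hypothetical and are not taken from the paper; the three scalar inputs are assumed to have been computed elsewhere from a shared encoder.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LossWeights:
    # Hypothetical coefficients, not values reported by the authors.
    reward: float = 1.0
    transition: float = 1.0

def total_loss(c51_loss, reward_loss, transition_loss, weights=LossWeights()):
    # Linear combination of the model-free loss and the two DeepMDP
    # auxiliary losses; all three share the same encoder parameters.
    return (c51_loss
            + weights.reward * reward_loss
            + weights.transition * transition_loss)
```

Because the encoder is shared, gradients of the auxiliary terms shape the same representation that the Q-function consumes, which is the sense in which the DeepMDP losses act as an auxiliary task.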

By optimizing $\phi$ to jointly minimize both C51 and DeepMDP losses, we hope to learn meaningful $\bar{s}$ that form the basis for learning good value functions. In the following subsections, we aim to answer the following questions: (1) What deep transition model architecture is conducive to learning a DeepMDP on Atari? (2) How does the learning of a DeepMDP affect the overall performance of C51 on Atari 2600 games? (3) How do the DeepMDP objectives compare with similar representation-learning approaches?

8.3 Transition Model Architecture

We compare the performance achieved by using different architectures for the DeepMDP transition model (see Figure 7). We experiment with a single fully-connected layer, two fully-connected layers, and a single convolutional layer (see Appendix C for more details). We find that using a convolutional transition model leads to the best DeepMDP performance, and we use this transition model architecture for the rest of the experiments in this paper. Note how the performance of the agent is highly dependent on the architecture. We hypothesize that the inductive bias provided via the model has a large effect on the learned DeepMDPs. Further exploring model architectures which provide inductive biases is a promising avenue to develop better auxiliary tasks. In particular, we believe that exploring attention (Vaswani et al., 2017; Bahdanau et al., 2014) and relational inductive biases (Watters et al., 2017; Battaglia et al., 2016) could be useful in visual domains like Atari 2600.

8.4 DeepMDPs as an Auxiliary Task

We show that when using the best performing DeepMDP architecture described in Appendix C.2, we obtain nearly consistent performance improvements over C51 on the suite of 60 Atari 2600 games (see Figure 6).

8.5 Comparison to Alternative Objectives

We empirically compare the effect of the DeepMDP auxiliary objectives on the performance of a C51 agent to a variety of alternatives. In the experiments in this section, we replace the deep transition loss suggested by the DeepMDP bounds with each of the following:

(1) Observation Reconstruction: We train a state decoder to reconstruct observations $s$ from $\bar{s}$. This framework is similar to Ha & Schmidhuber (2018), who learn a latent space representation of the environment with an auto-encoder, and use it to train an RL agent.

(2) Next Observation Prediction: We train a transition model to predict next observations $s'$ from the current state representation $\bar{s}$. This framework is similar to model-based RL algorithms which predict future observations (Xu et al., 2018).

(3) Next Logits Prediction: We train a transition model to predict next-state representations such that the Q-function correctly predicts the logits of $(s', a')$, where $a'$ is the action associated with the max Q-value of $s'$. This can be understood as a distributional analogue of the Value Prediction Network (VPN) of Oh et al. (2017). Note that this auxiliary loss is used to update only the parameters of the representation encoder and the transition model, not the Q-function.

Our experiments demonstrate that the deep transition loss suggested by the DeepMDP bounds (i.e. predicting the next state’s representation) outperforms all three ablations (see Figure 8). Accurately modeling Atari 2600 frames, whether through observation reconstruction or next observation prediction, forces the representation to encode irrelevant information with respect to the underlying task. VPN-style losses have been shown to be helpful when using the learned predictive model for planning (Oh et al., 2017); however, we find that with a distributional RL agent, using this as an auxiliary task tends to hurt performance.

9 Discussion on Model-Based RL

We have focused on the implications of DeepMDPs for representation learning, but our results also provide a principled basis for model-based RL, in latent space or otherwise. Although DeepMDPs are latent space models, by letting $\phi$ be the identity function, all the provided results immediately apply to the standard model-based RL setting, where the model predicts states instead of latent states. In fact, our results serve as a theoretical justification for common practices already found in the model-based deep RL literature. For example, Chua et al. (2018); Doerr et al. (2018); Hafner et al. (2018); Buesing et al. (2018); Feinberg et al. (2018); Buckman et al. (2018) train models to predict a reward and a distribution over next states, minimizing the negative log-probability of the true next state. The negative log-probability of the next state can be viewed as a one-sample estimate of the KL between the model’s state distribution and the next state distribution. Due to Pinsker’s inequality (which bounds the TV with the KL), and the suitability of TV as a metric (Section 6), this procedure can be interpreted as training a DeepMDP. Thus, the learned model will obey our local value difference bounds (Lemma 8) and suboptimality bounds (Theorem 6), which provide theoretical guarantees for the model.
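Pinsker's inequality, $\mathrm{TV}(p, q) \le \sqrt{\tfrac{1}{2}\mathrm{KL}(p \,\|\, q)}$ with the KL measured in nats, can be checked numerically on a small discrete example. The two distributions below are arbitrary illustrative choices, and the function names are hypothetical.

```python
import math

def total_variation(p, q):
    # Total variation distance between two discrete distributions.
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def kl_divergence(p, q):
    # KL(p || q) in nats; terms with p_i == 0 contribute zero.
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0.0)

def pinsker_bound(p, q):
    # Upper bound on the total variation implied by the KL divergence.
    return math.sqrt(0.5 * kl_divergence(p, q))

# Hypothetical true and model next-state distributions over two states.
true_next = [0.5, 0.5]
model_next = [0.9, 0.1]

tv = total_variation(true_next, model_next)
bound = pinsker_bound(true_next, model_next)
```

Since the bound is monotone in the KL, driving the KL toward zero (for example by minimizing the negative log-probability of observed next states) also drives the TV toward zero, which is the step the argument above relies on.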

Further, the suitability of Norm-MMD metrics for learning models presents a promising new research avenue for model-based RL: to break away from the KL and explore the vast family of Norm Maximum Mean Discrepancy metrics.

10 Conclusions

We introduce the concept of a DeepMDP: a parameterized latent space model trained via the minimization of tractable losses. Theoretical analysis provides guarantees on the quality of the value functions of the learned model when the latent transition loss is any member of the large family of Norm Maximum Mean Discrepancy metrics. When the Wasserstein metric is used, a novel connection to bisimulation metrics guarantees the set of parametrizable policies is highly expressive. Further, it is guaranteed that two states with different values for any of those policies will never be collapsed under the representation. Together, these findings suggest that learning a DeepMDP with the Wasserstein metric is a theoretically sound approach to representation learning. Our results are corroborated by strong performance on large-scale Atari 2600 experiments, demonstrating that minimizing the DeepMDP losses can be a beneficial auxiliary task in model-free RL.

Using the transition and reward models of the DeepMDP for model-based RL (e.g. planning, exploration) is a promising future research direction. Additionally, extending DeepMDPs to accommodate different action spaces or time scales from the original MDPs could be a promising path towards learning hierarchical models of the environment.


The authors would like to thank Philip Amortila and Robert Dadashi for invaluable feedback on the theoretical results; Pablo Samuel Castro, Doina Precup, Nicolas Le Roux, Sasha Vezhnevets, Simon Osindero, Arthur Gretton, Adrien Ali Taiga, Fabian Pedregosa and Shane Gu for useful discussions and feedback.

Changes From ICML 2019 Proceedings

This document represents an updated version of our work relative to the version published in ICML 2019. The major addition was the inclusion of the generalization to Norm-MMD metrics and associated math in Section 6. Lemma 1 also underwent minor changes to its statements and proofs. Additionally, some sections were partially rewritten, especially the discussion on bisimulation (Section 5), which was significantly expanded.


Appendix A Proofs

A.1 Lipschitz MDP

See 1


Start by proving 1. By induction we will show that a sequence of Q values converging to are all Lipschitz, and that as , their Lipschitz norm goes to . Let be the base case. Define . It is a well known result that the sequence converges to . Now let be the Lipschitz norm of . Clearly . Then,