A Game Theoretic Framework for Model Based Reinforcement Learning

04/16/2020 ∙ by Aravind Rajeswaran, et al. ∙ University of Washington

Model-based reinforcement learning (MBRL) has recently gained immense interest due to its potential for sample efficiency and ability to incorporate off-policy data. However, designing stable and efficient MBRL algorithms using rich function approximators has remained challenging. To help expose the practical challenges in MBRL and simplify algorithm design from the lens of abstraction, we develop a new framework that casts MBRL as a game between: (1) a policy player, which attempts to maximize rewards under the learned model; (2) a model player, which attempts to fit the real-world data collected by the policy player. For algorithm development, we construct a Stackelberg game between the two players, and show that it can be solved with approximate bi-level optimization. This gives rise to two natural families of algorithms for MBRL based on which player is chosen as the leader in the Stackelberg game. Together, they encapsulate, unify, and generalize many previous MBRL algorithms. Furthermore, our framework is consistent with and provides a clear basis for heuristics known to be important in practice from prior works. Finally, through experiments we validate that our proposed algorithms are highly sample efficient, match the asymptotic performance of model-free policy gradient, and scale gracefully to high-dimensional tasks like dexterous hand manipulation.




1 Introduction

Reinforcement learning (RL) is the setting where an agent must learn a highly rewarding decision making policy through interactions with an unknown world [1]. Model-based RL (MBRL) refers to a class of approaches that explicitly build a model of the world to aid policy search. They can incorporate historical off-policy data and generic priors like knowledge of physics, making them highly sample efficient. In addition, the learned models can also be re-purposed to solve new tasks. Accompanied by advances in deep learning, there has been a recent surge of interest in MBRL with rich function approximators. However, a clear algorithmic framework to understand MBRL and unify insights from recent works has been lacking. To bridge this gap, and to facilitate the design of stable and efficient algorithms, we develop a new framework for MBRL that casts it as a two-player game.

Classical frameworks for MBRL, adaptive control [2], and dynamic programming [3], are often confined to simple linear models or tabular representations. They also rely on building global models through ideas like persistent excitation [4] or tabular generative models [5]. Such settings and assumptions are often limiting for modern applications. To obtain a globally accurate model, we need the ability to collect data from all parts of the state space [6], which is often impossible. Furthermore, learning globally accurate models may be unnecessary, unsafe, and inefficient. For example, to make an autonomous car drive on the road, we should not require accurate models in situations where it tumbles and crashes in different ways. This motivates a class of incremental MBRL methods that interleave policy and model learning to gradually construct and refine models in the task-relevant parts of the state space. This is in sharp contrast to a two-stage approach of first building a model of the world, and subsequently planning in it.

Despite growing interest in incremental MBRL, a clear algorithmic framework has been lacking. A unifying framework can connect insights from different approaches and help simplify the algorithm design process from the lens of abstraction. As an example, distribution or domain shift is known to be a major challenge for incremental MBRL. When improving the policy using the learned model, the policy will attempt to shift the distribution over visited states. The learned model may be inaccurate for this modified distribution, resulting in a greatly biased policy update. A variety of approaches have been developed to mitigate this issue. One class of approaches [7, 8, 9], inspired by trust region methods, makes conservative changes to the policy to limit the distribution shift between successive iterates. In sharp contrast, an alternate set of approaches does not constrain the policy updates in any way, but instead relies on data aggregation to mitigate distribution shift [10, 11, 12]. Our game-theoretic framework for MBRL reveals that these two seemingly disparate approaches are essentially dual approaches for solving the same game.

Our Contributions: We list the major contributions of our work below.

  • We develop a framework that casts MBRL as a game between: (a) a policy player, which maximizes rewards in the learned model; and (b) a model player, which minimizes prediction error on data collected by the policy player. Theoretically, we establish that at equilibrium: (1) the model can accurately simulate the policy and predict its performance; and (2) the policy is near-optimal.

  • Developing learning algorithms for general continuous games is well known to be challenging. Direct extensions of workhorses from learning (e.g. SGD) can be unstable in game settings due to non-stationarity [13, 14]. These instabilities mirror the aforementioned challenge of distribution shift in MBRL. In order to derive stable algorithms, we set up a Stackelberg game [15] between the two players, which can be solved efficiently through (approximate) bi-level optimization [16]. Stackelberg games and closely related ideas of bi-level optimization and min-max games have been used to understand settings like meta-learning [17], GANs [13, 14, 18], human-robot interaction [19, 20], and primal-dual RL [21]. In this work, we show how such games can be useful for MBRL.

  • Stackelberg games are asymmetric games where players make decisions in a pre-specified order. The leader plays first and subsequently the follower. Due to the asymmetric nature, the MBRL game can take two Stackelberg forms based on the choice of the leader. This gives rise to two natural families of algorithms (which we name PAL and MAL) for solving the MBRL game. These two algorithmic families have complementary strengths and we provide intuitions on when to prefer which. Together, they encompass, generalize, and unify a large number of existing MBRL algorithms. Furthermore, our formulation is consistent with and provides explanations for commonly used robustification heuristics like model ensembles and entropy regularization.

  • Finally, we develop practical versions for the above algorithm families, and show that they enable sample efficient learning on a suite of continuous control tasks. In particular, our algorithms outperform prior model-based and model-free algorithms in sample efficiency; match the asymptotic performance of model-free policy gradient algorithms; and scale gracefully to high-dimensional tasks like dexterous hand manipulation.

2 Background and Notations

We consider a world that can be represented as an infinite horizon MDP characterized by the tuple $\mathcal{M} = \{\mathcal{S}, \mathcal{A}, \mathcal{R}, P, \gamma, \rho\}$. Per usual notation, $\mathcal{S}$ and $\mathcal{A}$ represent the continuous state and action spaces. $P(s' | s, a)$ describes the transition dynamics. $\mathcal{R}$, $\gamma$, and $\rho$ represent the reward function, discount factor, and initial state distribution respectively. A policy $\pi$ is a mapping from states to a probability distribution over actions, i.e. $\pi: \mathcal{S} \to \Delta(\mathcal{A})$, and in practice we typically consider parameterized policies. The goal is to optimize the objective:

$$\max_{\pi} \; J(\pi, \mathcal{M}) := \mathbb{E}_{\mathcal{M}, \pi} \left[ \sum_{t=0}^{\infty} \gamma^t \, r(s_t, a_t) \right]$$

where the expectation is over all the randomness due to the MDP $\mathcal{M}$ and policy $\pi$. Model-free methods solve this optimization using collected samples by either directly estimating the gradient (direct policy search) or through learning of value functions (e.g. Q-learning, actor-critic). Model-based methods, in contrast, construct an explicit model of the world to aid policy optimization.
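As a minimal sketch of how the objective above is estimated from samples, the snippet below averages truncated discounted returns over rollouts. The names `discounted_return`, `estimate_performance`, and the constant-reward rollout are illustrative assumptions, not part of the original work.

```python
import random


def discounted_return(rewards, gamma):
    """Compute sum_t gamma^t * r_t for one finite rollout (a truncation of the infinite sum)."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total


def estimate_performance(rollout_fn, gamma, num_rollouts, rng):
    """Monte Carlo estimate of J: the average discounted return over sampled rollouts."""
    returns = [discounted_return(rollout_fn(rng), gamma) for _ in range(num_rollouts)]
    return sum(returns) / len(returns)


def constant_reward_rollout(rng, horizon=200):
    """Hypothetical stand-in for executing a policy: reward 1.0 at every step."""
    return [1.0] * horizon


if __name__ == "__main__":
    rng = random.Random(0)
    print(estimate_performance(constant_reward_rollout, 0.9, 5, rng))
```

With a discount of 0.9 the effective horizon is roughly 1/(1 - 0.9) = 10 steps, which is why truncating rollouts at a few hundred steps changes the estimate very little.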

2.1 Model-Based Reinforcement Learning

In MBRL, we learn an approximate model of the world as another MDP: $\hat{\mathcal{M}} = \{\mathcal{S}, \mathcal{A}, \mathcal{R}, \hat{P}, \gamma, \rho\}$. The model has the same state-action space, reward function, discount, and initial state distribution. We parameterize the transition dynamics of the model $\hat{P}$ (as a neural network) and learn the parameters so that it approximates the transition dynamics of the world $P$. For simplicity, we assume that the reward function and initial state distribution are known. This is a benign assumption for many applications in control, robotics, and operations research. If required, these quantities can also be learned from data, and are typically easier to learn than $P$. Enormous quantities of experience can be cheaply generated by simulating the model, without interacting with the world, and can be used for policy optimization. Thus, model-based methods tend to be sample efficient.

Idealized Global Model Setting To motivate the practical issues, we first consider the idealized setting of an approximate global model. This corresponds to the case where $\hat{\mathcal{M}}$ is sufficiently expressive and approximates $\mathcal{M}$ everywhere. Lemma 1 relates $J(\pi, \hat{\mathcal{M}})$, the performance of a policy in the model, with its performance in the world, $J(\pi, \mathcal{M})$. We use $D_{TV}$ to denote total variation distance.

Lemma 1.

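(Simulation Lemma) Suppose $\hat{\mathcal{M}}$ is such that $D_{TV}\big(P(\cdot | s, a), \hat{P}(\cdot | s, a)\big) \leq \epsilon_M$ for all $(s, a)$. Then, for any policy $\pi$, we have

$$\left| J(\pi, \hat{\mathcal{M}}) - J(\pi, \mathcal{M}) \right| \leq \mathcal{O}\left( \frac{\epsilon_M}{(1-\gamma)^2} \right)$$

The proof is provided in the appendix. Using the model, we can solve the policy optimization problem, $\max_{\pi} J(\pi, \hat{\mathcal{M}})$, using any RL algorithm without real-world samples. Since Lemma 1 provides a uniform bound applicable to all policies, we can expect good performance from the resulting policy in the environment, up to small additive factors which can be reduced by improving model quality.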
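The simulation lemma can be checked numerically on a small tabular example. The sketch below is illustrative only: the 3-state chain, the perturbed model, and the helper names are assumptions made here, and the explicit constant $\gamma \, \epsilon \, R_{max} / (1-\gamma)^2$ is one standard form of the bound for rewards in $[0, R_{max}]$, not a constant taken from the paper.

```python
def evaluate_policy(P, r, gamma, iters=2000):
    """Iterative policy evaluation: V <- r + gamma * P V, repeated until numerically converged."""
    n = len(r)
    V = [0.0] * n
    for _ in range(iters):
        V = [r[s] + gamma * sum(P[s][t] * V[t] for t in range(n)) for s in range(n)]
    return V


def performance(P, r, rho, gamma):
    """J = E_{s0 ~ rho}[V(s0)] for the Markov chain induced by a fixed policy."""
    V = evaluate_policy(P, r, gamma)
    return sum(p * v for p, v in zip(rho, V))


def max_total_variation(P, P_hat):
    """Largest TV distance between corresponding rows (TV is half the L1 distance)."""
    return max(
        0.5 * sum(abs(a - b) for a, b in zip(row, row_hat))
        for row, row_hat in zip(P, P_hat)
    )


# Hypothetical world and model; the policy is already folded into the transition matrices.
P_world = [[0.90, 0.10, 0.00], [0.00, 0.80, 0.20], [0.10, 0.00, 0.90]]
P_model = [[0.85, 0.15, 0.00], [0.05, 0.80, 0.15], [0.10, 0.05, 0.85]]
rewards = [0.0, 0.5, 1.0]  # rewards lie in [0, r_max]
rho = [1.0, 0.0, 0.0]      # always start in state 0
gamma = 0.9
r_max = 1.0

eps = max_total_variation(P_world, P_model)
gap = abs(performance(P_model, rewards, rho, gamma) - performance(P_world, rewards, rho, gamma))
bound = gamma * eps * r_max / (1.0 - gamma) ** 2

print(f"eps={eps:.3f}  gap={gap:.4f}  bound={bound:.4f}")
```

The bound is a worst case over all policies and all models within the error budget, so the observed gap for a specific pair is usually far smaller than the bound.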

Beyond global models A global modeling approach as above is often impractical. To obtain a globally accurate model, we need the ability to collect data from all parts of the state space [6, 22, 23], which is often impractical. More importantly, learning globally accurate models may be unnecessary, unsafe, and inefficient. For example, to make a robot walk, we should not require accurate models in situations where it falls and crashes in different ways. This motivates the need for incremental approaches to MBRL, where models are gradually constructed and refined in the task-relevant parts of the state space. To formalize this intuition, we consider the below notion of model quality.

Definition 1.

(Model approximation loss) Given a model $\hat{\mathcal{M}}$ and a state-action distribution $\mu$, the quality of the model is given by

$$\ell(\hat{\mathcal{M}}, \mu) = \mathbb{E}_{(s, a) \sim \mu} \left[ D_{KL}\big( P(\cdot | s, a), \hat{P}(\cdot | s, a) \big) \right]$$

We use $D_{KL}$ to refer to the KL divergence, which can be optimized using samples from $\mathcal{M}$, and is closely related to $D_{TV}$ through Pinsker’s inequality. In the case of isotropic Gaussian distributions, as typically considered in continuous control applications, $D_{KL}$ reduces to the familiar $\ell_2$ loss. Importantly, the loss is intimately tied to the sampling distribution $\mu$. In general, models that are accurate in some parts of the state space need not generalize/transfer to other parts. As a result, a more conservative policy learning procedure is required, in contrast to the global model case.
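The reduction of the KL divergence to a squared-error loss for isotropic Gaussians with a shared variance can be verified directly. The following is a small sketch; the function names and the numbers are illustrative assumptions.

```python
import math


def kl_diag_gaussians(mean_p, var_p, mean_q, var_q):
    """KL(p || q) for Gaussians with diagonal covariances, given per-dimension variances."""
    kl = 0.0
    for mp, vp, mq, vq in zip(mean_p, var_p, mean_q, var_q):
        kl += 0.5 * (math.log(vq / vp) + (vp + (mp - mq) ** 2) / vq - 1.0)
    return kl


def scaled_squared_error(mean_p, mean_q, sigma2):
    """Squared Euclidean distance between the means, scaled by 1 / (2 * sigma^2)."""
    return sum((a - b) ** 2 for a, b in zip(mean_p, mean_q)) / (2.0 * sigma2)


# Hypothetical next-state predictions of the world and of the model, same variance.
world_mean, model_mean, sigma2 = [1.0, 2.0], [1.5, 1.0], 0.25
kl = kl_diag_gaussians(world_mean, [sigma2, sigma2], model_mean, [sigma2, sigma2])
mse = scaled_squared_error(world_mean, model_mean, sigma2)
print(kl, mse)
```

When the variances match, the log-ratio and trace terms cancel, leaving only the scaled squared distance between the predicted means.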

3 Model Based RL as a Two Player Game

In order to capture the interactions between model learning and policy optimization, we formulate MBRL as the following two-player general sum game (we refer to this as the MBRL game):

$$\underbrace{\max_{\pi} \; J(\pi, \hat{\mathcal{M}})}_{\text{policy player}}, \qquad \underbrace{\min_{\hat{\mathcal{M}}} \; \ell(\hat{\mathcal{M}}, \mu_{\mathcal{M}}^{\pi})}_{\text{model player}}$$

We use $\mu_{\mathcal{M}}^{\pi}$ to denote the average state visitation distribution when executing the policy $\pi$ in the world $\mathcal{M}$. The policy player maximizes performance in the learned model, while the model player minimizes prediction error under the policy player’s induced state distribution. This is a game since the players can only pick their own parameters while their payoffs depend on the parameters of both players.

The above formulation separates MBRL into the constituent components of policy optimization (planning) and generative model learning. At the same time, it exposes that the two components are closely intertwined and must be considered together in order to succeed in MBRL. We discuss algorithms for solving the game in Section 4, and first focus on the equilibrium properties of the MBRL game. Theorem 1 presents an informal version of our theoretical result; a more formal version of the theorem and proof is provided in Appendix A. Our results establish that at (approximate) Nash equilibrium of the MBRL game: (1) the model can accurately simulate and predict the performance of the policy; (2) the policy is near-optimal.

Theorem 1.

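(Global performance of approximate equilibrium pair; informal) Suppose we have a pair of policy and model, $(\pi, \hat{\mathcal{M}})$, such that simultaneously

$$\ell(\hat{\mathcal{M}}, \mu_{\mathcal{M}}^{\pi}) \leq \epsilon_M \quad \text{and} \quad J(\pi, \hat{\mathcal{M}}) \geq \sup_{\pi'} J(\pi', \hat{\mathcal{M}}) - \epsilon_\pi.$$

Let $\pi^*$ be an optimal policy and denote the corresponding performance as $J^* := J(\pi^*, \mathcal{M})$. Then, the performance gap is bounded by

$$J^* - J(\pi, \mathcal{M}) \leq \mathcal{O}\left( \epsilon_\pi + \frac{\sqrt{\epsilon_M}}{(1-\gamma)^2} + \frac{1}{1-\gamma} D_{TV}\big( \mu_{\mathcal{M}}^{\pi^*}, \mu_{\hat{\mathcal{M}}}^{\pi^*} \big) \right)$$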

A few remarks are in order about the above result and its implications.

  1. The first two terms are related to sub-optimality in policy optimization (planning) and model learning, and can be made small with more compute and data, assuming sufficient capacity.

  2. There may be multiple Nash equilibria for the MBRL game, and the third domain adaptation or transfer learning term in the bound captures the quality of an equilibrium. We refer to it as the domain adaptation term since the model is trained under the distribution of $\pi$, i.e. $\mu_{\mathcal{M}}^{\pi}$, but evaluated under the distribution of $\pi^*$, i.e. $\mu_{\mathcal{M}}^{\pi^*}$. If the model can accurately simulate $\pi^*$, we can expect to find it in the planning phase, since it would obtain high rewards. This domain adaptation term is a consequence of the exploration problem, and is unavoidable if we desire globally optimal policies. Indeed, even purely model-free algorithms suffer from an analogous divergence term [9, 24, 25]. However, Theorem 1 also applies to locally optimal policies for which we may expect better model transfer.

  3. There are multiple avenues to minimize the impact of the domain adaptation term. One approach is to consider a wide initial state distribution [9, 26]. This ensures the model we learn is applicable for a wider set of states and can thereby simulate a larger collection of policies. However, in some applications, the initial state distribution may not be under our control. In such a case, we may draw upon advances in the domain adaptation literature [27, 28, 29] to learn state-action representations better suited for transfer across different policies.

4 Algorithms

So far, we have established how MBRL can be viewed as a game that couples policy and model learning. We now turn to developing algorithms to solve the MBRL game. Unlike common deep learning settings (e.g. supervised learning), there are no standard workhorses for continuous games. Direct extensions of optimization workhorses (e.g. SGD) are unstable for games due to non-stationarity [13, 14, 18, 30]. We first review some of these extensions before presenting our final algorithms.

4.1 Independent simultaneous learners

We first consider a class of algorithms where each player individually optimizes its own objective using gradient based methods. Thus, each player treats the setting as a (stochastic) optimization problem, unaware of potential drifts in its objective due to the two-player nature. These algorithms are sometimes called independent learners, simultaneous learners, or naive learners [14, 31].

Gradient Descent Ascent (GDA) In GDA, each player performs an improvement step holding the parameters of the other player fixed. The resulting updates are given below.

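$$\pi_{k+1} = \pi_k + \alpha_k \nabla_{\pi} J(\pi_k, \hat{\mathcal{M}}_k) \quad \text{(conservative policy step)} \tag{6}$$
$$\hat{\mathcal{M}}_{k+1} = \hat{\mathcal{M}}_k - \beta_k \nabla_{\hat{\mathcal{M}}} \ell(\hat{\mathcal{M}}_k, \mu_{\mathcal{M}}^{\pi_k}) \quad \text{(conservative model step)} \tag{7}$$

Note that both policy and model players update their parameters simultaneously from iteration $k$ to $k+1$. For simplicity, we show vanilla gradient based optimization in the above equations. In practice, this can be replaced with alternatives like momentum [32] or Adam [33] for model learning; and NPG [34, 26], TRPO [35], PPO [36] etc. for policy optimization.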

GDA is a conceptually simple and intuitive algorithm. Variants of GDA have been used to solve min-max games arising in deep learning such as GANs. However, for certain problems, it can exhibit poor convergence and require very small learning rates [30, 18, 13, 14] or domain-specific heuristics. Furthermore, it makes sub-optimal use of data, since it is desirable to take multiple policy improvement steps to fully reap the benefits of model learning. The following algorithm addresses this drawback.

Best Response (BR) In BR, each player fixes the parameters of the other player and computes the best response – the parameters that optimize their objective. To approximate the best response, we can take a large number of gradient steps.

$$\pi_{k+1} = \arg\max_{\pi} J(\pi, \hat{\mathcal{M}}_k) \quad \text{(aggressive policy step)} \tag{8}$$
$$\hat{\mathcal{M}}_{k+1} = \arg\min_{\hat{\mathcal{M}}} \ell(\hat{\mathcal{M}}, \mu_{\mathcal{M}}^{\pi_k}) \quad \text{(aggressive model step)} \tag{9}$$

Again, both players simultaneously update their parameters. It is known from a large body of work in online learning that aggressive changes can destabilize learning in non-stationary settings [37, 38]. Large changes to the policy can dramatically alter the sampling distribution, which renders the model inaccurate and introduces bias into policy optimization. In Section 5 we experimentally study the performance of GDA and BR on a suite of control tasks. The experimental results corroborate the drawbacks suggested above, indicating the need for better algorithms to solve the MBRL game.

4.2 Stackelberg formulation and algorithms

To enable stable and sample efficient learning, we require algorithms that take the game structure into account. While good workhorses like SGD are lacking for general games, one of the exceptions is the Stackelberg game [15], which admits stable gradient based algorithms. Stackelberg games are asymmetric games where we impose a specific playing order. It is a generalization of min-max games and is closely related to bi-level optimization. We cast the MBRL game in the Stackelberg form, and derive gradient based algorithms to solve the resulting game.

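First, we briefly review continuous Stackelberg games. Consider a two player game with players $A$ and $B$. Let $\theta_A$, $\theta_B$ be their parameters, and $\mathcal{L}_A(\theta_A, \theta_B)$, $\mathcal{L}_B(\theta_A, \theta_B)$ be their losses. Each player would like their losses minimized. With player $A$ as the leader, the Stackelberg game corresponds to the following nested optimization:

$$\min_{\theta_A} \; \mathcal{L}_A\big(\theta_A, \theta_B^*(\theta_A)\big) \tag{10}$$
$$\text{subject to} \quad \theta_B^*(\theta_A) = \arg\min_{\theta_B} \mathcal{L}_B(\theta_A, \theta_B) \tag{11}$$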

Since the follower chooses the best response, the follower’s parameters are implicitly a function of the leader’s parameters. The leader is aware of this, and can utilize this information when updating its parameters. The Stackelberg formulation has a number of appealing properties.

  • Algorithm design based on optimization: From the leader’s viewpoint, the Stackelberg formulation transforms a game with complex interactions into a more familiar albeit complex optimization problem. Gradient based workhorses exist for optimization, unlike general games.

  • Notion of stability and progress: In general games, there exists no single function that can be used to check if an iterative algorithm makes progress towards the equilibrium. This makes algorithm design and diagnosis difficult. By reducing the game to a nested optimization, the outer level objective can be used to effectively track progress.

For simplicity of exposition, we assume that the best-response is unique for the follower. We later remark on the possibility of multiple minimizers. To solve the nested optimization, it suffices to focus on $\theta_A$ since the follower parameters $\theta_B^*(\theta_A)$ are implicitly a function of $\theta_A$. We can iteratively optimize $\theta_A$ as: $\theta_A \leftarrow \theta_A - \alpha_A \frac{d \mathcal{L}_A(\theta_A, \theta_B^*(\theta_A))}{d \theta_A}$, where the gradient is described in Eq. 12. Thus, the key to solving a Stackelberg game is to make one player (the follower) learn very quickly to play the best response to a slow (stable) learning player (the leader).

$$\frac{d \mathcal{L}_A(\theta_A, \theta_B^*(\theta_A))}{d \theta_A} = \frac{\partial \mathcal{L}_A(\theta_A, \theta_B)}{\partial \theta_A}\bigg|_{\theta_B = \theta_B^*} + \left( \frac{d \theta_B^*}{d \theta_A} \right)^{\top} \frac{\partial \mathcal{L}_A(\theta_A, \theta_B)}{\partial \theta_B}\bigg|_{\theta_B = \theta_B^*} \tag{12}$$

The implicit Jacobian term can be obtained using the implicit function theorem [39, 17] as:

$$\frac{d \theta_B^*}{d \theta_A} = - \left( \frac{\partial^2 \mathcal{L}_B}{\partial \theta_B \, \partial \theta_B^{\top}} \right)^{-1} \left( \frac{\partial^2 \mathcal{L}_B}{\partial \theta_B \, \partial \theta_A^{\top}} \right) \bigg|_{\theta_B = \theta_B^*} \tag{13}$$

Thus, in principle, we can compute the gradient with respect to the leader parameters and solve the nested optimization (to at least a local minimizer). To develop a practical algorithm based on these ideas, we use a few relaxations and approximations. First, it may be hard to compute the exact best response in the inner level with an iterative optimization algorithm. Thus, we use a large number of gradient steps to approximate the best response. Secondly, the implicit Jacobian term may be computationally expensive and difficult to obtain. In practice, this term can often be dropped (i.e. approximated as $0$) without suffering significant performance degradation, leading to a “first-order” approximation of the gradient. Such an approximation has proven effective in applications like meta-learning [40, 41] and GANs [42, 43, 44]. This also resembles two timescale algorithms previously studied for actor-critic algorithms [45]. Finally, since the Stackelberg game is asymmetric, we can cast the MBRL game in two forms based on which player we choose as the leader.
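To make the leader's gradient concrete, the sketch below works through a scalar quadratic game in which every quantity is available in closed form. The losses, the step size, and the function names are assumptions chosen for illustration; the implicit-function-theorem gradient is compared against a finite-difference estimate of the composite objective.

```python
def loss_leader(a, b):
    """Leader loss L_A(a, b); a hypothetical quadratic chosen for illustration."""
    return (a - 1.0) ** 2 + (b - 2.0) ** 2


def best_response_follower(a):
    """Follower loss is L_B(a, b) = 0.5 * (b - 3a)^2, which is minimized at b = 3a."""
    return 3.0 * a


def stackelberg_gradient(a):
    """Total derivative of L_A(a, b*(a)) using the implicit function theorem."""
    b = best_response_follower(a)
    dLA_da = 2.0 * (a - 1.0)         # direct partial derivative w.r.t. leader parameter
    dLA_db = 2.0 * (b - 2.0)         # partial derivative w.r.t. follower parameter
    d2LB_dbdb = 1.0                  # second derivative of L_B in b
    d2LB_dbda = -3.0                 # mixed second derivative of L_B
    db_da = -d2LB_dbda / d2LB_dbdb   # implicit Jacobian of the best response
    return dLA_da + db_da * dLA_db


def finite_difference_gradient(a, h=1e-6):
    """Central difference of the composite objective a -> L_A(a, b*(a))."""
    upper = loss_leader(a + h, best_response_follower(a + h))
    lower = loss_leader(a - h, best_response_follower(a - h))
    return (upper - lower) / (2.0 * h)


# Leader gradient descent; the follower is assumed to best-respond exactly at every step.
a = 0.0
for _ in range(100):
    a -= 0.04 * stackelberg_gradient(a)

print(a, stackelberg_gradient(0.3), finite_difference_gradient(0.3))
```

In this toy problem the composite objective is $(a - 1)^2 + (3a - 2)^2$, whose minimizer is $a = 0.7$, so the leader iterate can be checked against a known answer.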

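Policy As Leader (PAL): Choosing the policy player as leader results in the following optimization:

$$\max_{\pi} \; J(\pi, \hat{\mathcal{M}}^{\pi}) \quad \text{subject to} \quad \hat{\mathcal{M}}^{\pi} \in \arg\min_{\hat{\mathcal{M}}} \ell(\hat{\mathcal{M}}, \mu_{\mathcal{M}}^{\pi}) \tag{14}$$

We solve this nested optimization using the first order gradient approximation, resulting in updates:

$$\hat{\mathcal{M}}_{k+1} = \arg\min_{\hat{\mathcal{M}}} \ell(\hat{\mathcal{M}}, \mu_{\mathcal{M}}^{\pi_k}) \quad \text{(aggressive model step)} \tag{15}$$
$$\pi_{k+1} = \pi_k + \alpha_k \nabla_{\pi} J(\pi_k, \hat{\mathcal{M}}_{k+1}) \quad \text{(conservative policy step)} \tag{16}$$

We first aggressively improve the model to minimize the loss under the current visitation distribution. Subsequently we take a conservative policy step to enable stable optimization. The algorithmic template is described further in Algorithm 1. Note that the PAL updates are different from GDA even if a single gradient step is used to approximate the $\arg\min$. In PAL, the model is first updated using the current visitation distribution from $\hat{\mathcal{M}}_k$ to $\hat{\mathcal{M}}_{k+1}$. The policy subsequently uses $\hat{\mathcal{M}}_{k+1}$ for improvement. In contrast, GDA uses $\hat{\mathcal{M}}_k$ for improving the policy. Finally, suppose we find an $\epsilon_\pi$-approximate solution to the PAL optimization (Eq. 14). Since the model (follower) is optimal for the policy by construction, we inherit the guarantees of Theorem 1.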

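1:  Require: Initial policy $\pi_0$, initial model $\hat{\mathcal{M}}_0$, data buffer $\mathcal{D}$
2:  for $k = 0, 1, 2, \ldots$ forever do
3:     Collect data-set $\mathcal{D}_k$ by executing $\pi_k$ in the world
4:     Build local (policy-specific) dynamics model: $\hat{\mathcal{M}}_{k+1} = \arg\min_{\hat{\mathcal{M}}} \ell(\hat{\mathcal{M}}, \mathcal{D}_k)$
5:     Conservatively improve policy: $\pi_{k+1} = \pi_k + \alpha \nabla_{\pi} J(\pi_k, \hat{\mathcal{M}}_{k+1})$  // NPG, TRPO, PPO etc.
6:  end for
Algorithm 1 Policy as Leader (PAL) meta-algorithm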
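The PAL template can be illustrated on a deliberately tiny problem. The sketch below is not the implementation used in the experiments: the one-step "world" (a sine map), the linear model class, the scalar policy, the clipped gradient step standing in for NPG/TRPO/PPO, and all names are assumptions made for illustration.

```python
import math
import random


def world_step(action):
    """Hypothetical real world: a nonlinear map from action to next state."""
    return math.sin(action)


def collect_data(theta, rng, num_samples=20, noise=0.1):
    """Execute the policy (mean action theta plus exploration noise) in the world."""
    data = []
    for _ in range(num_samples):
        action = theta + rng.gauss(0.0, noise)
        data.append((action, world_step(action)))
    return data


def fit_linear_model(data):
    """Least-squares fit of next_state ~ w0 + w1 * action (the aggressive model step)."""
    n = len(data)
    mean_a = sum(a for a, _ in data) / n
    mean_s = sum(s for _, s in data) / n
    var_a = sum((a - mean_a) ** 2 for a, _ in data)
    cov_as = sum((a - mean_a) * (s - mean_s) for a, s in data)
    w1 = cov_as / var_a
    return mean_s - w1 * mean_a, w1


def policy_gradient_in_model(theta, model, goal):
    """Gradient of the model return J = -(w0 + w1 * theta - goal)^2 w.r.t. theta."""
    w0, w1 = model
    return -2.0 * (w0 + w1 * theta - goal) * w1


def policy_as_leader(goal=0.5, iterations=50, step_size=0.3, max_step=0.1, seed=0):
    rng = random.Random(seed)
    theta = 0.0
    for _ in range(iterations):
        data = collect_data(theta, rng)   # on-policy data from the current policy only
        model = fit_linear_model(data)    # local model, refit from scratch each round
        step = step_size * policy_gradient_in_model(theta, model, goal)
        theta += max(-max_step, min(max_step, step))  # conservative (clipped) policy step
    return theta


theta_pal = policy_as_leader()
print(theta_pal, world_step(theta_pal))
```

The linear model is only accurate near the actions the current policy visits, which is why the policy step is clipped: a small step keeps the next policy inside the region where the freshly fitted model can be trusted.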

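Model as Leader (MAL): Conversely, choosing the model as the leader results in the optimization

$$\min_{\hat{\mathcal{M}}} \; \ell(\hat{\mathcal{M}}, \mu_{\mathcal{M}}^{\pi_{\hat{\mathcal{M}}}}) \quad \text{subject to} \quad \pi_{\hat{\mathcal{M}}} \in \arg\max_{\pi} J(\pi, \hat{\mathcal{M}}) \tag{17}$$

Similar to the PAL formulation, using the first order approximation to the bi-level gradient results in:

$$\pi_{k+1} = \arg\max_{\pi} J(\pi, \hat{\mathcal{M}}_k) \quad \text{(aggressive policy step)} \tag{18}$$
$$\hat{\mathcal{M}}_{k+1} = \hat{\mathcal{M}}_k - \beta_k \nabla_{\hat{\mathcal{M}}} \ell(\hat{\mathcal{M}}_k, \mu_{\mathcal{M}}^{\pi_{k+1}}) \quad \text{(conservative model step)} \tag{19}$$

We first optimize a policy for the current model using RL or other planning techniques (e.g. MPC [46]). Subsequently, we conservatively improve the model using the data collected with the optimized policy. In practice, instead of a single conservative model improvement step, we aggregate all the historical data and perform a few epochs of training. This has an effect similar to conservative model improvement in a follow the regularized leader interpretation [37, 10, 47]. The algorithmic template is described in Algorithm 2. Similar to the PAL case, we again inherit the guarantees from Theorem 1.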

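1:  Require: Initial policy $\pi_0$, initial model $\hat{\mathcal{M}}_0$, data buffer $\mathcal{D}$
2:  for $k = 0, 1, 2, \ldots$ forever do
3:     Aggressively optimize policy: $\pi_{k+1} = \arg\max_{\pi} J(\pi, \hat{\mathcal{M}}_k)$  // RL, MPC, planning etc.
4:     Collect data-set $\mathcal{D}_{k+1}$ by executing $\pi_{k+1}$ in the world
5:     Conservatively improve the model: $\hat{\mathcal{M}}_{k+1} = \hat{\mathcal{M}}_k - \beta \nabla_{\hat{\mathcal{M}}} \ell(\hat{\mathcal{M}}_k, \mathcal{D}_{k+1})$  // can use dataset aggregation, natural gradient, mirror descent etc.
6:  end for
Algorithm 2 Model as Leader (MAL) meta-algorithm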
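A matching toy sketch of the MAL template is given below, under the same illustrative assumptions as before (a one-step sine "world", a linear model class, a scalar policy; none of these come from the paper). The aggressive policy step is an exact maximizer within the model, and the conservative model step is approximated by an exact least-squares refit on all aggregated data rather than a few epochs of gradient training.

```python
import math
import random


def world_step(action):
    """Hypothetical real world: a nonlinear map from action to next state."""
    return math.sin(action)


def collect_data(theta, rng, num_samples=20, noise=0.1):
    """Execute the policy (mean action theta plus exploration noise) in the world."""
    data = []
    for _ in range(num_samples):
        action = theta + rng.gauss(0.0, noise)
        data.append((action, world_step(action)))
    return data


def fit_linear_model(data):
    """Least-squares fit of next_state ~ w0 + w1 * action."""
    n = len(data)
    mean_a = sum(a for a, _ in data) / n
    mean_s = sum(s for _, s in data) / n
    var_a = sum((a - mean_a) ** 2 for a, _ in data)
    cov_as = sum((a - mean_a) * (s - mean_s) for a, s in data)
    w1 = cov_as / var_a
    return mean_s - w1 * mean_a, w1


def optimize_policy_in_model(model, goal):
    """Aggressive policy step: exact maximizer of -(w0 + w1 * theta - goal)^2."""
    w0, w1 = model
    return (goal - w0) / w1


def model_as_leader(goal=0.5, iterations=10, seed=0):
    rng = random.Random(seed)
    model = (0.0, 1.0)  # initial model guess: next_state = action
    buffer = []         # aggregated data from all past policies
    theta = 0.0
    for _ in range(iterations):
        theta = optimize_policy_in_model(model, goal)  # plan fully in the current model
        buffer.extend(collect_data(theta, rng))        # aggregate new real-world data
        model = fit_linear_model(buffer)               # refit on the whole buffer
    return theta


theta_mal = model_as_leader()
print(theta_mal, world_step(theta_mal))
```

Because the buffer keeps growing, each new batch moves the fitted model less than the previous one did, which is the sense in which aggregation acts as a conservative model update here.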

On distributionally robust models and policies Finally, we illustrate how the Stackelberg framework is consistent with commonly used robustification heuristics. We now consider the case where there could be multiple best responses to the leader in Eq. 10. For instance, in PAL, there could be multiple models that achieve low error for the policy. Similarly, in MAL, there could be multiple policies that achieve high rewards for the specified model. In such cases, the standard notion of Stackelberg equilibrium is to optimize under the worst case realization [18], which results in:

$$\min_{\theta_A} \; \max_{\theta_B \in R(\theta_A)} \mathcal{L}_A(\theta_A, \theta_B), \quad \text{where} \quad R(\theta_A) = \arg\min_{\theta_B} \mathcal{L}_B(\theta_A, \theta_B) \text{ is the best response set.}$$

In PAL, model ensemble approaches correspond to approximating the best response set with a finite collection (ensemble) of models. Algorithms inspired by robust or risk-averse control [48, 49, 50] explicitly improve against the adversarial choice in the ensemble, consistent with the Stackelberg setting. Similarly, in the MAL formulation, entropy regularization [51, 23] and disagreement based reward bonuses [22, 52] lead to adversarial best response by encouraging the policy to visit parts of the state space where the model is likely to be inaccurate. Thus far, these ideas (e.g. model ensembles) have largely been viewed as important heuristics. Our Stackelberg MBRL game formulation is consistent with and provides a principled foundation for these important findings, leading to a unified framework.
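As a rough sketch of optimizing against the adversarial member of a finite ensemble, the snippet below picks the scalar policy parameter that maximizes the worst-case model return over a grid. The two linear models, the grid search, and the function names are illustrative assumptions and do not reproduce any specific published method.

```python
def model_return(theta, model, goal):
    """Return predicted by one linear model (w0, w1): negative squared distance to the goal."""
    w0, w1 = model
    return -(w0 + w1 * theta - goal) ** 2


def worst_case_return(theta, ensemble, goal):
    """Pessimistic objective: the lowest return over all models in the ensemble."""
    return min(model_return(theta, model, goal) for model in ensemble)


def robust_policy(ensemble, goal, grid_size=1000):
    """Grid search over theta in [0, 1] for the best worst-case return."""
    candidates = [i / grid_size for i in range(grid_size + 1)]
    return max(candidates, key=lambda theta: worst_case_return(theta, ensemble, goal))


# Hypothetical ensemble: two models that fit the data similarly but disagree on the slope.
ensemble = [(0.0, 1.0), (0.0, 0.8)]
theta_robust = robust_policy(ensemble, goal=0.5)
print(theta_robust, worst_case_return(theta_robust, ensemble, 0.5))
```

The selected parameter sits between the optima of the individual models (0.5 and 0.625), at the point where the two models' errors balance.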

5 Experiments


Figure 1: (a) Reacher task with a 7DOF arm. (b) In-hand manipulation task with a 24DOF dexterous hand. (c) DClaw-Turn task with a 3 fingered “claw”. (d) DKitty-Orient task with a quadrupedal robot. In all the tasks, the desired goal locations and/or orientations are randomized for every episode. This forces the learning of generalizable policies that can be successful for many goal specifications, and we measure the success rate in our experiments.

In our experimental evaluation, we aim to primarily answer the following questions:

  1. Do independent learning algorithms (GDA and BR) learn slowly or suffer from instabilities?

  2. Do the Stackelberg-style algorithms (PAL and MAL) enable stable and sample efficient learning?

  3. Do MAL and PAL exhibit different learning characteristics and strengths? Can we characterize the situations where PAL might be better than MAL, and vice versa?

Task Suite We study the behavior of algorithms on a suite of continuous control tasks consisting of: DClaw-Turn, DKitty-Orient, 7DOF-Reacher, and InHand-Pen. The tasks are illustrated in Figure 1 and further details are provided in Appendix B.1. The DClaw and DKitty tasks use physically accurate models of robots [53, 54]. The Reacher task is a representative whole arm manipulation task, while the in-hand dexterous manipulation task [55] serves as a representative high-dimensional control task. In addition, we also present results with our algorithms in the OpenAI gym tasks in Appendix B.2.

Algorithm Details For all the algorithms of interest (GDA, BR, PAL, MAL), we represent the policy as well as the dynamics model with fully connected neural networks. We instantiate all of these algorithm families with model-based natural policy gradient. Details about the implementation are provided in Appendix B. We use ensembles of dynamics models and entropy regularization to encourage robustness.

Figure 2: Comparison of the learning algorithms. Note that the x-axis is scaled with samples. All results are averaged over 5 random seeds. We observe that PAL and MAL exhibit highly stable and sample efficient learning, leading to near 100% success in the equivalent of a few hours of experience in the real world. GDA exhibits slow learning due to sub-optimal use of data. In contrast, BR, being too aggressive and suffering from distribution mismatch, is unable to make effective learning progress. For the ROBEL tasks, we also include published results for SAC, a representative off-policy algorithm. The performance of SAC is better than GDA, requiring approx. 0.3 million samples for 95+% success rate, in contrast to 0.5 million samples for GDA.

Comparison of learning algorithms We first study the performance of Stackelberg-style algorithms (PAL, MAL) and compare against the performance of independent algorithms (GDA and BR). Our results, summarized in Figure 2, suggest that PAL and MAL can learn all the tasks efficiently. We observe near monotonic improvement, suggesting that the Stackelberg formulation enables stable learning. We also observe that PAL learns faster than MAL for the tasks we study. While GDA eventually achieves near-100% success rate, it is considerably slower due to the conservative nature of updates for both the policy and model. Furthermore, the performance fluctuates rapidly during the course of learning, since it does not correspond to stable optimization of any objective. Finally, we observe that BR is unable to make consistent progress. As suggested earlier in Section 4, BR makes rapid changes to both model and policy, which exacerbates the challenge of distribution mismatch.

As a point of comparison, we also plot results of SAC [51], a leading model-free algorithm for the ROBEL tasks (results taken from Ahn et al. [54]). Although SAC is able to solve these tasks, its sample efficiency is comparable to GDA, and substantially slower than PAL and MAL. To compare against other model-based algorithms, we turn to published results from prior work on OpenAI gym tasks. In Figure 3, we show that PAL and MAL significantly outperform prior algorithms. In particular, PAL and MAL are 10 times as efficient as other model-based and model-free methods. PAL is also twice as efficient as MBPO [56], a state of the art hybrid model-based and model-free algorithm. Further details about this comparison are provided in Appendix B.2.

Overall our results indicate that PAL and MAL: (a) are substantially more sample efficient than prior model-based and model-free algorithms; (b) achieve the asymptotic performance of their model-free counterparts; (c) can scale to high-dimensional tasks with complex dynamics like dexterous manipulation; (d) can scale to tasks requiring extended rollout horizons (e.g. the OpenAI gym tasks).

Figure 3: Comparison of results on the OpenAI gym benchmark tasks. Results for the baselines are reproduced from Janner et al. [56]. We observe that PAL and MAL show near-monotonic improvement, and substantially outperform the baselines.

Figure 4: PAL vs MAL in non-stationary learning environments. X axis is the number of samples used and Y axis is the distance between end effector and goal, averaged over the trajectory (lower is better). The left plot corresponds to a perturbation of the dynamics introduced at an intermediate point of training, while the right plot corresponds to a change in the goal distribution introduced halfway through training. We observe that PAL can quickly recover from changes to dynamics, while MAL can quickly recover from changes to goal distribution.

Choosing between PAL and MAL Finally, we turn to studying relative strengths of PAL and MAL. For this, we consider two variations of the 7DOF reacher task (from Figure 1) corresponding to environment perturbations at an intermediate point of training. In the first case, we perturb the dynamics by changing the length of the forearm. In the second case, halfway through the training, we change the goal distribution to a different region of 3D space. Training curves are presented in Figure 4. Note that there is a performance drop at the time of introducing the perturbation.

For the first case of dynamics perturbation, we observe that PAL recovers faster. Since PAL learns the model aggressively using recent data, it can forget old inconsistent data and improve the policy using an accurate model. In contrast, MAL adapts the model conservatively, taking longer to forget old inconsistent data, ultimately biasing and slowing the policy learning. In the second experiment, the dynamics is stationary but the goal distribution changes midway. Note that the policy does not generalize zero-shot to the new goal distribution, and requires additional learning or fine-tuning. Since MAL learns a more broadly accurate model, it quickly adapts to the new goal distribution. In contrast, PAL conservatively changes the policy and takes longer to adapt to the new goal distribution.

Thus, in summary, we find that PAL is better suited for situations where the dynamics of the world can drift over time. In contrast, MAL is better suited for situations where the task or goal distribution can change over time, and related settings like multi-task learning.

6 Related Work

MBRL and the closely related fields of adaptive control and system identification have a long and rich history (see [4, 2, 57] for an overview). Early works in MBRL primarily focused on tabular reinforcement learning in a known generative model setting [5, 24]. However, this setting assumes access to a highly exploratory policy to collect data, which is often not available in practice. Subsequent works like E3 [58] and R-MAX [59] attempt to lift this limitation, but rely heavily on tabular representations which are inadequate for modern applications like robotics. Coupled with advances in deep learning, there has been a surge of interest in incremental MBRL algorithms with rich function approximation. They generally fall into two sets of approaches, as we outline below.

The first set of approaches are largely inspired by trust region methods, and are similar to the PAL family from our work. A highly accurate “local” model is constructed around the visitation distribution of the current policy, and subsequently used to conservatively improve the policy. The trust region is intended to ensure that the model is accurate for all policies within it, thereby enabling monotonic performance improvement. GPS [7, 60], DPI [61], and related approaches [62, 63] learn a time varying linear model and perform a KL-constrained policy improvement step. Such a model representation is convenient for an iLQG [64] based policy update, but might be restrictive for complex dynamics beyond trajectory-centric RL. To remove these limitations, recent works have started to consider neural network representations for both the policy and dynamics model. However, somewhat surprisingly, a clean version from the PAL family has not been studied with neural network models [65]. The motivations presented by Xu et al. [66] and Kurutach et al. [67] resemble PAL, however their practical implementations do not strongly enforce the conservative nature of the policy update.

An alternate set of MBRL approaches take a view similar to MAL. Models are updated conservatively through data aggregation, while policies are aggressively optimized. Ross et al. [10] explicitly studied the role of data aggregation in MBRL. They presented an agnostic online learning view of MBRL and showed that data aggregation can lead to a no-regret algorithm for learning the model, even with aggressive policy optimization. Subsequent works have used data aggregation and proposed additional components to enhance efficiency and stability, such as the use of model predictive control for fast/aggressive policy improvement [68, 69, 12] and uncertainty quantification through Bayesian models like Gaussian processes [70] and ensembles of dynamics models [50, 11, 12]. We refer readers to Wang et al. [65] for an overview of recent MBRL advances.

We emphasize that while algorithm instances in the PAL and MAL families have been studied in the past, an overarching framework around them has been lacking. Our descriptions of the PAL and MAL families generalize and unify core insights from prior work and simplify them from the lens of abstraction. Furthermore, the game theoretic formulation enables us to form a connection between the PAL and MAL frameworks. We also note that the PAL and MAL families have similarities to multiple timescale algorithms [45, 71, 72] studied for actor-critic temporal difference learning. These ideas have also been extended to study min-max games like GANs [43]. However, they have not been extended to study model-based RL.

We presented a model-based setting where the model is used to directly improve the policy through rollout based optimization. However, models can be utilized in other ways too. Dyna [73] and MBPO [56] use a learned model to provide additional learning targets for an actor-critic algorithm through short-horizon synthetic trajectories. MBVE [74], STEVE [75], and doubly-robust methods [76, 77, 78] use model-based rollouts to obtain more favorable bias-variance trade-offs for off-policy evaluation. Some of these works have noted that long horizon rollouts can exacerbate model bias. However, in our experiments, we were able to successfully perform rollouts of hundreds of steps. This is likely due to our practical implementation closely following the Stackelberg setting, which was explicitly designed to mitigate distribution shift and enable effective simulation. It is straightforward to extend PAL and MAL to a hybrid model-based and model-free algorithm. Similarly, approaches that bootstrap from the model's own predictions can improve multi-step simulation [79, 80]. We leave exploration of these directions for future work.

7 Summary and Conclusion

In this work, we developed a new framework for MBRL that casts it as a two player game between a policy player and a model player. We established that at equilibrium: (1) the model accurately simulates the policy and predicts its performance; (2) the policy is near-optimal. We derived sub-optimality bounds and made a connection to domain adaptation to characterize the quality of an equilibrium.

In order to solve the MBRL game, we constructed the Stackelberg version of the game. This has two advantages: (1) effective gradient based workhorses to solve the Stackelberg optimization problem; (2) an effective objective function to track learning progress towards equilibrium. General continuous games possess neither of these characteristics. The Stackelberg game can take two forms based on which player we choose as the leader, resulting in two natural algorithm families, which we named PAL and MAL. Together they encompass, generalize, and unify a large collection of prior MBRL works. This greatly simplifies MBRL, and particularly algorithm design, from the lens of abstraction.

We developed practical versions of PAL and MAL using model-based natural policy gradient. We demonstrated stable and sample efficient learning on a suite of control tasks, including state of the art results on OpenAI gym benchmarks. These results suggest that our practical variants of PAL and MAL: (a) are substantially more sample efficient than prior approaches; (b) achieve the same asymptotic results as model-free counterparts; (c) can scale to high-dimensional tasks with complex dynamics like dexterous manipulation; (d) can scale to tasks requiring rollouts of hundreds of timesteps.

More broadly, our work adds to a growing body of recent work which suggests that MBRL can be stable, sample efficient, and more adaptable (for example to new tasks). For future work, we hope to study alternate ways to solve the Stackelberg optimization, such as using the full implicit gradient term and unrolled optimization. Finally, although we presented our game theoretic framework in the context of MBRL, it is more broadly applicable to any surrogate based optimization, including actor-critic methods. It would make for interesting future work to study broader extensions and implications.


We thank Emo Todorov, Sham Kakade, Sergey Levine, and Drew Bagnell for valuable feedback and discussions. We thank Michael Ahn and Michael Janner for sharing the baseline learning curves. The work was done by Aravind Rajeswaran during internship(s) at Google Brain, MTV.


Appendix A Theory

We provide the formal statements and proofs for theoretical results in the paper.

A.1 Performance with Global Models

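Lemma 1 restated. (Simulation lemma) Suppose we have a model $\hat{M}$ such that

$$D_{TV}\big(P_M(\cdot|s,a),\, P_{\hat{M}}(\cdot|s,a)\big) \le \epsilon_M \quad \forall (s,a),$$

and the reward function is such that $|R(s)| \le R_{\max} \ \forall s$. Then, we have

$$\big| J(\pi, M) - J(\pi, \hat{M}) \big| \le \frac{2 \gamma\, \epsilon_M R_{\max}}{(1-\gamma)^2} \quad \forall \pi.$$

Proof. Let $V^\pi(s, M)$ and $V^\pi(s, \hat{M})$ denote the value of policy $\pi$ starting from an arbitrary state $s$ in $M$ and $\hat{M}$ respectively. For simplicity of notation, we also define

$$P^\pi_M(s'|s) := \mathbb{E}_{a \sim \pi(\cdot|s)}\big[P_M(s'|s,a)\big], \qquad P^\pi_{\hat{M}}(s'|s) := \mathbb{E}_{a \sim \pi(\cdot|s)}\big[P_{\hat{M}}(s'|s,a)\big].$$

Before the proof, we note the following useful observations.

  1. Since $D_{TV}\big(P_M(\cdot|s,a), P_{\hat{M}}(\cdot|s,a)\big) \le \epsilon_M \ \forall (s,a)$, the inequality also holds for an average over actions, i.e. $D_{TV}\big(P^\pi_M(\cdot|s), P^\pi_{\hat{M}}(\cdot|s)\big) \le \epsilon_M \ \forall s, \pi$.

  2. Since the rewards are bounded, we can achieve a maximum reward of $R_{\max}$ in each time step. Using a geometric summation with discounting $\gamma$, we have $\max_s |V^\pi(s, M)| \le \frac{R_{\max}}{1-\gamma}$, and similarly for $\hat{M}$.

  3. Let $f(x)$ be a real-valued function with bounded range, i.e. $|f(x)| \le f_{\max} \ \forall x$. Let $P_1(x)$ and $P_2(x)$ be two probability distributions (densities) over the space $\mathcal{X}$. Then, we have $\big| \mathbb{E}_{x \sim P_1}[f(x)] - \mathbb{E}_{x \sim P_2}[f(x)] \big| \le 2 f_{\max}\, D_{TV}(P_1, P_2)$.

Using the above observations, we have the following inequalities for any state $s$:

$$
\begin{aligned}
\big| V^\pi(s, M) - V^\pi(s, \hat{M}) \big|
&= \gamma\, \Big| \mathbb{E}_{s' \sim P^\pi_M(\cdot|s)}\big[V^\pi(s', M)\big] - \mathbb{E}_{s' \sim P^\pi_{\hat{M}}(\cdot|s)}\big[V^\pi(s', \hat{M})\big] \Big| \\
&\le \gamma\, \Big| \mathbb{E}_{s' \sim P^\pi_M(\cdot|s)}\big[V^\pi(s', M)\big] - \mathbb{E}_{s' \sim P^\pi_{\hat{M}}(\cdot|s)}\big[V^\pi(s', M)\big] \Big| + \gamma\, \Big| \mathbb{E}_{s' \sim P^\pi_{\hat{M}}(\cdot|s)}\big[V^\pi(s', M) - V^\pi(s', \hat{M})\big] \Big| \\
&\le \frac{2 \gamma\, \epsilon_M R_{\max}}{1-\gamma} + \gamma \max_{s'} \big| V^\pi(s', M) - V^\pi(s', \hat{M}) \big| .
\end{aligned}
$$

Since the above bound holds for all states, we have that

$$\max_s \big| V^\pi(s, M) - V^\pi(s, \hat{M}) \big| \le \frac{2 \gamma\, \epsilon_M R_{\max}}{1-\gamma} + \gamma \max_{s} \big| V^\pi(s, M) - V^\pi(s, \hat{M}) \big| .$$

Stated alternatively, the above inequality implies

$$\max_s \big| V^\pi(s, M) - V^\pi(s, \hat{M}) \big| \le \frac{2 \gamma\, \epsilon_M R_{\max}}{(1-\gamma)^2} .$$

Finally, note that the performance criteria $J(\pi, M)$ and $J(\pi, \hat{M})$ are simply the average of the value function over the initial state distribution. Since the above inequality holds for all states, it also holds for the average over the initial state distribution. ∎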

We note that the above simulation lemma (or closely related forms) has been proposed and proved several times in prior literature (e.g. see [58, 81]). We present the proof largely for completeness and also to motivate the proof techniques we will use for our main theoretical result (Theorem 1).
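As a quick numerical sanity check of the simulation lemma, the sketch below compares the value of a fixed policy in a random tabular chain and in a perturbed copy of it. The policy is folded into the transition matrices (so each matrix is the policy-induced Markov chain), rewards depend on the state only, and the bound is taken in the common form 2γ ε R_max / (1−γ)²; that constant, the mixing construction of the model, and all names are assumptions of this sketch.

```python
import numpy as np

def random_chain(n, rng):
    """Random row-stochastic matrix: a policy-induced Markov chain."""
    P = rng.random((n, n))
    return P / P.sum(axis=1, keepdims=True)

def policy_value(P, r, gamma):
    """Solve the Bellman equation V = r + gamma * P V exactly."""
    n = len(r)
    return np.linalg.solve(np.eye(n) - gamma * P, r)

rng = np.random.default_rng(0)
n, gamma, r_max = 6, 0.9, 1.0

P_true = random_chain(n, rng)
# Hypothetical learned model: the true chain mixed with a random chain.
P_model = 0.95 * P_true + 0.05 * random_chain(n, rng)
r = rng.uniform(-r_max, r_max, size=n)

# Worst-case (over states) total variation distance between the chains.
eps = 0.5 * np.abs(P_true - P_model).sum(axis=1).max()

V_true = policy_value(P_true, r, gamma)
V_model = policy_value(P_model, r, gamma)

gap = np.abs(V_true - V_model).max()
bound = 2 * gamma * eps * r_max / (1 - gamma) ** 2
```

Since the gap holds for every start state, it also holds after averaging over any initial state distribution, which is the final step of the proof above. The bound is typically loose in such random instances; the check only confirms it is not violated.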

A.2 Performance with Task-Driven Local Models

In this section, we relax the global model requirement and consider the case where we have more local models, as well as the case of a policy-model equilibrium pair. We first provide a lemma that characterizes error amplification in local simulation.

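Lemma 2. (Error amplification in local simulation) Let $P_1(s'|s)$ and $P_2(s'|s)$ be two Markov chains with the same initial state distribution. Let $P_1^t(s)$ and $P_2^t(s)$ be the marginal distributions over states at time $t$ when following $P_1$ and $P_2$ respectively. Suppose

$$\mathbb{E}_{s \sim P_1^t}\Big[ D_{TV}\big(P_1(\cdot|s),\, P_2(\cdot|s)\big) \Big] \le \epsilon \quad \forall t,$$

then, the marginal distributions are bounded as:

$$D_{TV}\big(P_1^t,\, P_2^t\big) \le \epsilon\, t \quad \forall t.$$

Proof. Let us fix a state $s$, and let $\bar{s}$ denote a “dummy” state variable. Then,

$$
\begin{aligned}
\big| P_1^{t}(s) - P_2^{t}(s) \big|
&= \Big| \sum_{\bar{s}} P_1(s|\bar{s})\, P_1^{t-1}(\bar{s}) - P_2(s|\bar{s})\, P_2^{t-1}(\bar{s}) \Big| \\
&\le \sum_{\bar{s}} P_1^{t-1}(\bar{s})\, \big| P_1(s|\bar{s}) - P_2(s|\bar{s}) \big| + \sum_{\bar{s}} P_2(s|\bar{s})\, \big| P_1^{t-1}(\bar{s}) - P_2^{t-1}(\bar{s}) \big| .
\end{aligned}
$$

Using the above inequality, we have

$$
\begin{aligned}
2\, D_{TV}\big(P_1^{t}, P_2^{t}\big) = \sum_s \big| P_1^{t}(s) - P_2^{t}(s) \big|
&\le 2\, \mathbb{E}_{\bar{s} \sim P_1^{t-1}}\Big[ D_{TV}\big(P_1(\cdot|\bar{s}), P_2(\cdot|\bar{s})\big) \Big] + 2\, D_{TV}\big(P_1^{t-1}, P_2^{t-1}\big) \\
&\le 2\epsilon + 2\, D_{TV}\big(P_1^{t-1}, P_2^{t-1}\big) \le 2\, \epsilon\, t ,
\end{aligned}
$$

where the last step uses the previous inequality recursively till $t = 0$, where the Markov chains have the same (initial) state distribution. ∎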
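The linear growth of the error can be checked numerically. The sketch below propagates the state marginals of two nearby random Markov chains from a shared initial distribution and compares their total variation distance at each step against ε·t. For simplicity ε is taken as the worst-case per-state total variation distance, which is stronger than (and implies) a bound in expectation; the chain construction and names are assumptions of this sketch.

```python
import numpy as np

def random_chain(n, rng):
    """Random row-stochastic transition matrix."""
    P = rng.random((n, n))
    return P / P.sum(axis=1, keepdims=True)

def tv(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * np.abs(p - q).sum()

rng = np.random.default_rng(1)
n, horizon = 5, 20

P1 = random_chain(n, rng)
P2 = 0.9 * P1 + 0.1 * random_chain(n, rng)  # a nearby second chain

# Both chains start from the same initial state distribution.
mu1 = mu2 = np.full(n, 1.0 / n)

# Worst-case one-step error over states (implies the bound in expectation).
eps = (0.5 * np.abs(P1 - P2).sum(axis=1)).max()

marginal_tv = []
for t in range(horizon + 1):
    marginal_tv.append(tv(mu1, mu2))   # distance between time-t marginals
    mu1, mu2 = mu1 @ P1, mu2 @ P2      # advance both chains one step

marginal_tv = np.array(marginal_tv)
linear_bound = eps * np.arange(horizon + 1)
```

In practice the marginal distance often saturates well below the linear bound because both chains mix; the lemma only guarantees that it never exceeds ε·t.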

The above lemma considers the error between two Markov chains. Note that fixing a policy in an MDP results in Markov chain transition dynamics. Thus, fixing the policy $\pi$, we can use the above lemma to compare the resulting Markov chains in $M$ and $\hat{M}$. Consider the following definitions:

$$\mu^\pi_M(s) = \frac{1}{T} \sum_{t=0}^{T-1} P(s_t = s \mid \pi, M), \qquad \tilde{\mu}^\pi_M(s) = (1-\gamma) \sum_{t=0}^{\infty} \gamma^t\, P(s_t = s \mid \pi, M).$$

The first distribution is the average state visitation distribution when executing $\pi$ in $M$, and $T$ is the episode duration (could tend to $\infty$ in the non-episodic case). The second distribution is the discounted state visitation distribution when executing $\pi$ in $M$. Let $\mu^\pi_{\hat{M}}$ and $\tilde{\mu}^\pi_{\hat{M}}$ be their analogues in $\hat{M}$. When learning the dynamics model, we would minimize the prediction error under $\mu^\pi_M$, while $J(\pi, M)$ is dependent on rewards under $\tilde{\mu}^\pi_M$. Let $\mu^t_M$ be the marginal distribution at time $t$ when following $\pi$ in $M$. Let $\mu^t_{\hat{M}}$ be analogously defined when following $\pi$ in $\hat{M}$. Using these definitions, we first characterize the difference in performance of the same policy under $M$ and $\hat{M}$.
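A minimal sketch of how the two visitation distributions differ, for a small policy-induced Markov chain. The per-time-step marginals are computed first, then combined with uniform weights (average visitation) or with weights proportional to γ^t (discounted visitation). The two-state chain, the finite truncation of the discounted sum, and its renormalization by the truncated weight total are assumptions of this sketch.

```python
import numpy as np

def state_marginals(P, mu0, horizon):
    """Row t holds the state marginal at time t, for t = 0..horizon-1."""
    out = np.empty((horizon, len(mu0)))
    mu = np.asarray(mu0, dtype=float)
    for t in range(horizon):
        out[t] = mu
        mu = mu @ P
    return out

def average_visitation(marginals):
    """Uniform average of the per-step marginals over the episode."""
    return marginals.mean(axis=0)

def discounted_visitation(marginals, gamma):
    """Marginals weighted by gamma^t, renormalized over the truncated horizon."""
    weights = gamma ** np.arange(len(marginals))
    return (weights[:, None] * marginals).sum(axis=0) / weights.sum()

# Hypothetical two-state chain induced by some fixed policy.
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
mu0 = np.array([1.0, 0.0])  # always start in state 0

marginals = state_marginals(P, mu0, horizon=200)
mu_avg = average_visitation(marginals)
mu_disc = discounted_visitation(marginals, gamma=0.9)
```

Because the chain starts in state 0 and discounting emphasizes early time steps, the discounted distribution places more mass on state 0 than the average distribution does. This mismatch is why a model fit under one distribution has to be related to performance under the other.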

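Lemma 3. (Performance difference due to model error) Let $M$ and $\hat{M}$ be two different MDPs differing only in their transition dynamics: $P_M$ and $P_{\hat{M}}$. Let the absolute value of rewards be bounded by $R_{\max}$. Fix a policy $\pi$ for both $M$ and $\hat{M}$, and let $\mu^t_M$ and $\mu^t_{\hat{M}}$ be the resulting marginal state distributions at time $t$. If the MDPs are such that

$$\mathbb{E}_{s \sim \mu^t_M}\Big[ D_{TV}\big(P^\pi_M(\cdot|s),\, P^\pi_{\hat{M}}(\cdot|s)\big) \Big] \le \epsilon_M \quad \forall t,$$

where $P^\pi_M(\cdot|s)$ and $P^\pi_{\hat{M}}(\cdot|s)$ are the transition distributions averaged over actions drawn from $\pi$, then, the performance difference is bounded as:

$$\big| J(\pi, M) - J(\pi, \hat{M}) \big| \le \frac{2 \gamma\, \epsilon_M R_{\max}}{(1-\gamma)^2} .$$

Proof. Recall that the performance of a policy can be written as:

$$J(\pi, M) = \frac{1}{1-\gamma}\, \mathbb{E}_{s \sim \tilde{\mu}^\pi_M}\big[R(s)\big] = \mathbb{E}\Big[ \sum_{t=0}^{\infty} \gamma^t R(s_t) \Big],$$

where the randomness for the second term is due to $M$ and $\pi$. We can analogously write $J(\pi, \hat{M})$ as well. Thus, the performance difference can be bounded as:

$$\big| J(\pi, M) - J(\pi, \hat{M}) \big| = \frac{1}{1-\gamma} \Big| \mathbb{E}_{s \sim \tilde{\mu}^\pi_M}\big[R(s)\big] - \mathbb{E}_{s \sim \tilde{\mu}^\pi_{\hat{M}}}\big[R(s)\big] \Big| \le \frac{2 R_{\max}}{1-\gamma}\, D_{TV}\big(\tilde{\mu}^\pi_M,\, \tilde{\mu}^\pi_{\hat{M}}\big).$$

Also recall that we have

$$\tilde{\mu}^\pi_M = (1-\gamma) \sum_{t=0}^{\infty} \gamma^t \mu^t_M .$$

We can bound the discounted state visitation distribution as

$$D_{TV}\big(\tilde{\mu}^\pi_M,\, \tilde{\mu}^\pi_{\hat{M}}\big) \le (1-\gamma) \sum_{t=0}^{\infty} \gamma^t\, D_{TV}\big(\mu^t_M,\, \mu^t_{\hat{M}}\big) \le (1-\gamma) \sum_{t=0}^{\infty} \gamma^t\, \epsilon_M\, t ,$$

where the last inequality uses Lemma 2. Notice that the final summation is an arithmetico–geometric series. When simplified, this results in

$$D_{TV}\big(\tilde{\mu}^\pi_M,\, \tilde{\mu}^\pi_{\hat{M}}\big) \le \frac{\gamma\, \epsilon_M}{1-\gamma} .$$

Using this bound for the performance difference yields the desired result. ∎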
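For reference, the arithmetico–geometric sum used in the proof can be evaluated in closed form. The short derivation below assumes a per-step bound of the form $\epsilon t$ (as in Lemma 2) and a discounted visitation distribution normalized by $(1-\gamma)$; the symbol $S$ is introduced only for this sketch.

```latex
\begin{align*}
S &= \sum_{t=0}^{\infty} t\,\gamma^{t}, \qquad 0 \le \gamma < 1, \\
S - \gamma S &= \sum_{t=1}^{\infty} t\,\gamma^{t} - \sum_{t=1}^{\infty} (t-1)\,\gamma^{t}
  = \sum_{t=1}^{\infty} \gamma^{t} = \frac{\gamma}{1-\gamma}
\quad\Longrightarrow\quad
S = \frac{\gamma}{(1-\gamma)^{2}}, \\
(1-\gamma)\sum_{t=0}^{\infty} \gamma^{t}\,(\epsilon\, t) &= (1-\gamma)\,\epsilon\, S = \frac{\gamma\,\epsilon}{1-\gamma}.
\end{align*}
```

Multiplying this by a factor of order $R_{\max}/(1-\gamma)$ from the reward bound gives the $1/(1-\gamma)^2$ dependence on the horizon.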

Remarks: The performance difference (due to model error) lemma we present is distinct from the performance difference lemma of [9]. Specifically, our lemma bounds the performance difference between the same policy in two different models. In contrast, the lemma from [9] characterizes the performance difference between two different policies in the same model.

Finally, we study the global performance guarantee when we have a policy-model pair close to equilibrium.

Theorem 1 restated. (Global performance of equilibrium pair) Suppose we have policy-model pair such that the following conditions hold simultaneously:

Let be an optimal policy such that . Then, the performance gap is bounded by:


We first simplify the performance difference, and subsequently bound the different terms. Let to be an optimal policy in the model, so that . We can decompose the performance difference due to various contributions as:

Let us first consider Term-II, which is related to the sub-optimality in the planning problem. Notice that we have:

The first difference is since is the optimal policy in the model, and the second term is small due to the approximate equilibrium condition.
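One way to write the decomposition consistent with the description of the three terms is sketched below. The notation is assumed rather than taken from the source: $M$ is the world, $\hat{M}$ the learned model, $\pi^*$ an optimal policy in $M$, $\hat{\pi}^*$ an optimal policy in $\hat{M}$, and $\epsilon_\pi$ the planning sub-optimality allowed by the approximate equilibrium condition.

```latex
\begin{align*}
J(\pi^*, M) - J(\pi, M)
 &= \underbrace{J(\pi^*, M) - J(\pi^*, \hat{M})}_{\text{Term-I}}
  + \underbrace{J(\pi^*, \hat{M}) - J(\pi, \hat{M})}_{\text{Term-II}}
  + \underbrace{J(\pi, \hat{M}) - J(\pi, M)}_{\text{Term-III}}, \\
\text{Term-II}
 &= \big[J(\pi^*, \hat{M}) - J(\hat{\pi}^*, \hat{M})\big]
  + \big[J(\hat{\pi}^*, \hat{M}) - J(\pi, \hat{M})\big]
  \le 0 + \epsilon_\pi .
\end{align*}
```

The sum telescopes, so the identity holds for any choice of $\hat{M}$. Term-I compares $\pi^*$ across the two models, Term-II is a pure planning gap inside the model, and Term-III compares $\pi$ across the two models.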

For Term-III, we will draw upon the model error performance difference lemma (Lemma 3). Note that the equilibrium condition of low error along with Pinsker’s inequality implies

Using this and Lemma 3, we have

Finally, Term-I is a transfer learning term that measures the error of (which has low error under ) under the distribution of . The performance difference can be written as