1 Introduction
Reinforcement learning (RL) is the setting where an agent must learn a highly rewarding decision-making policy through interactions with an unknown world [1]. Model-based RL (MBRL) refers to a class of approaches that explicitly build a model of the world to aid policy search. These approaches can incorporate historical off-policy data and generic priors like knowledge of physics, making them highly sample efficient. In addition, the learned models can be repurposed to solve new tasks. Accompanied by advances in deep learning, there has been a recent surge of interest in MBRL with rich function approximators. However, a clear algorithmic framework to understand MBRL and unify insights from recent works has been lacking. To bridge this gap, and to facilitate the design of stable and efficient algorithms, we develop a new framework for MBRL that casts it as a two-player game.
Classical frameworks for MBRL, such as adaptive control [2] and dynamic programming [3], are often confined to simple linear models or tabular representations. They also rely on building global models through ideas like persistent excitation [4] or tabular generative models [5]. Such settings and assumptions are often limiting for modern applications. To obtain a globally accurate model, we need the ability to collect data from all parts of the state space [6], which is often impossible. Furthermore, learning globally accurate models may be unnecessary, unsafe, and inefficient. For example, to make an autonomous car drive on the road, we should not require accurate models of situations where it tumbles and crashes in different ways. This motivates a class of incremental MBRL methods that interleave policy and model learning to gradually construct and refine models in the task-relevant parts of the state space. This is in sharp contrast to a two-stage approach of first building a model of the world, and subsequently planning in it.
Despite growing interest in incremental MBRL, a clear algorithmic framework has been lacking. A unifying framework can connect insights from different approaches and help simplify the algorithm design process from the lens of abstraction. As an example, distribution (or domain) shift is known to be a major challenge for incremental MBRL. When improving the policy using the learned model, the policy will attempt to shift the distribution over visited states. The learned model may be inaccurate for this modified distribution, resulting in a greatly biased policy update. A variety of approaches have been developed to mitigate this issue. One class of approaches [7, 8, 9], inspired by trust region methods, makes conservative changes to the policy to constrain the distribution shift between successive iterates. In sharp contrast, an alternate set of approaches does not constrain the policy updates in any way, but instead relies on data aggregation to mitigate distribution shift [10, 11, 12]. Our game-theoretic framework for MBRL reveals that these two seemingly disparate approaches are essentially dual approaches for solving the same game.
Our Contributions: We list the major contributions of our work below.


We develop a framework that casts MBRL as a game between: (a) a policy player, which maximizes rewards in the learned model; and (b) a model player, which minimizes prediction error on data collected by the policy player. Theoretically, we establish that at equilibrium: (1) the model can accurately simulate the policy and predict its performance; and (2) the policy is near-optimal.

Developing learning algorithms for general continuous games is well known to be challenging. Direct extensions of standard learning workhorses (e.g. SGD) can be unstable in game settings due to non-stationarity [13, 14]. These instabilities mirror the aforementioned challenge of distribution shift in MBRL. In order to derive stable algorithms, we set up a Stackelberg game [15] between the two players, which can be solved efficiently through (approximate) bilevel optimization [16]. Stackelberg games and the closely related ideas of bilevel optimization and min-max games have been used to understand settings like meta-learning [17], GANs [13, 14, 18], human-robot interaction [19, 20], and primal-dual RL [21]. In this work, we show how such games can be useful for MBRL.

Stackelberg games are asymmetric games where players make decisions in a prespecified order. The leader plays first and subsequently the follower. Due to the asymmetric nature, the MBRL game can take two Stackelberg forms based on the choice of the leader. This gives rise to two natural families of algorithms (which we name PAL and MAL) for solving the MBRL game. These two algorithmic families have complementary strengths and we provide intuitions on when to prefer which. Together, they encompass, generalize, and unify a large number of existing MBRL algorithms. Furthermore, our formulation is consistent with and provides explanations for commonly used robustification heuristics like model ensembles and entropy regularization.

Finally, we develop practical versions of the above algorithm families, and show that they enable sample efficient learning on a suite of continuous control tasks. In particular, our algorithms outperform prior model-based and model-free algorithms in sample efficiency; match the asymptotic performance of model-free policy gradient algorithms; and scale gracefully to high-dimensional tasks like dexterous hand manipulation.
2 Background and Notations
We consider a world that can be represented as an infinite horizon MDP characterized by the tuple $\mathcal{M} = \{S, A, R, T, \rho_0, \gamma\}$. Per usual notation, $S$ and $A$ represent the continuous state and action spaces, and $T(s' \mid s, a)$ describes the transition dynamics. $R(s, a)$, $\gamma \in [0, 1)$, and $\rho_0$
represent the reward function, discount factor, and initial state distribution respectively. A policy $\pi$ is a mapping from states to a probability distribution over actions, i.e. $\pi : S \to P(A)$, and in practice we typically consider parameterized policies $\pi_\theta$. The goal is to optimize the objective:

$$J(\pi, \mathcal{M}) := \mathbb{E}_{\pi, \mathcal{M}} \left[ \sum_{t=0}^{\infty} \gamma^t R(s_t, a_t) \right], \qquad (1)$$

where the expectation is over all the randomness due to the MDP $\mathcal{M}$ and policy $\pi$. Model-free methods solve this optimization using collected samples by either directly estimating the gradient (direct policy search) or through learning of value functions (e.g. Q-learning, actor-critic). Model-based methods, in contrast, construct an explicit model of the world to aid policy optimization.
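To make objective (1) concrete, a Monte Carlo estimate of the discounted return can be computed by rolling out a policy; the sketch below uses a hypothetical constant-reward environment purely as a sanity check of the estimator:

```python
# Monte Carlo estimate of the discounted-return objective J(pi).
# The constant-reward "environment" and trivial policy below are
# hypothetical stand-ins used only to sanity-check the estimator.
gamma = 0.99

def rollout_return(env_step, policy, s0, horizon=500):
    """Discounted return of a single rollout, truncated at `horizon` steps."""
    s, ret, discount = s0, 0.0, 1.0
    for _ in range(horizon):
        a = policy(s)
        s, r = env_step(s, a)
        ret += discount * r
        discount *= gamma
    return ret

env_step = lambda s, a: (s, 1.0)   # reward 1 at every step
policy = lambda s: 0.0
G = rollout_return(env_step, policy, s0=0.0)
# for constant reward, the truncated return equals the geometric
# series (1 - gamma^horizon) / (1 - gamma)
```

In practice the expectation in (1) is approximated by averaging such returns over many sampled rollouts.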
2.1 Model-Based Reinforcement Learning
In MBRL, we learn an approximate model of the world as another MDP: $\hat{\mathcal{M}} = \{S, A, R, \hat{T}, \rho_0, \gamma\}$. The model has the same state-action space, reward function, discount, and initial state distribution. We parameterize the transition dynamics of the model $\hat{T}_\phi$ (as a neural network) and learn the parameters $\phi$ so that it approximates the transition dynamics of the world, $T$. For simplicity, we assume that the reward function and initial state distribution are known. This is a benign assumption for many applications in control, robotics, and operations research. If required, these quantities can also be learned from data, and are typically easier to learn than $T$. Enormous quantities of experience can be cheaply generated by simulating the model, without interacting with the world, and can be used for policy optimization. Thus, model-based methods tend to be sample efficient.

Idealized Global Model Setting To motivate the practical issues, we first consider the idealized setting of an approximate global model. This corresponds to the case where $\hat{T}$ is sufficiently expressive and approximates $T$ everywhere. Lemma 1 relates $J(\pi, \hat{\mathcal{M}})$, the performance of a policy in the model, with its performance in the world, $J(\pi, \mathcal{M})$. We use $D_{TV}$ to denote total variation distance.
Lemma 1.
(Simulation Lemma) Suppose $\hat{\mathcal{M}}$ is such that $D_{TV}\big(T(\cdot \mid s, a), \hat{T}(\cdot \mid s, a)\big) \le \epsilon_{TV} \;\; \forall (s, a)$. Then, for any policy $\pi$, we have

$$\big| J(\pi, \mathcal{M}) - J(\pi, \hat{\mathcal{M}}) \big| \le \frac{2\, R_{\max}\, \gamma\, \epsilon_{TV}}{(1 - \gamma)^2}. \qquad (2)$$
The proof is provided in the appendix. Using the model, we can solve the policy optimization problem, $\hat{\pi} = \arg\max_\pi J(\pi, \hat{\mathcal{M}})$, using any RL algorithm without real-world samples. Since Lemma 1 provides a uniform bound applicable to all policies, we can expect good performance from $\hat{\pi}$ in the environment, up to small additive factors that can be reduced by improving model quality.
Beyond global models A global modeling approach as above is often impractical. To obtain a globally accurate model, we need the ability to collect data from all parts of the state space [6, 22, 23], which is often impossible. More importantly, learning globally accurate models may be unnecessary, unsafe, and inefficient. For example, to make a robot walk, we should not require accurate models in situations where it falls and crashes in different ways. This motivates the need for incremental approaches to MBRL, where models are gradually constructed and refined in the task-relevant parts of the state space. To formalize this intuition, we consider the below notion of model quality.
Definition 1.
(Model approximation loss) Given a model $\hat{\mathcal{M}}$ and a state-action distribution $\mu(s, a)$, the quality of the model is given by

$$\ell(\hat{\mathcal{M}}, \mu) = \mathbb{E}_{(s, a) \sim \mu} \Big[ D_{KL}\big( T(\cdot \mid s, a),\, \hat{T}(\cdot \mid s, a) \big) \Big]. \qquad (3)$$

We use $D_{KL}$ to refer to the KL divergence, which can be optimized using sampled transitions from the world, and is closely related to $D_{TV}$ through Pinsker's inequality. In the case of isotropic Gaussian distributions, as typically considered in continuous control applications, $\ell(\hat{\mathcal{M}}, \mu)$ reduces to the familiar $\ell_2$ loss. Importantly, the loss is intimately tied to the sampling distribution $\mu$. In general, models that are accurate in some parts of the state space need not generalize/transfer to other parts. As a result, a more conservative policy learning procedure is required, in contrast to the global model case.

3 Model-Based RL as a Two-Player Game
In order to capture the interactions between model learning and policy optimization, we formulate MBRL as the following two-player general sum game (we refer to this as the MBRL game):

$$\text{(policy player)} \;\; \max_\theta \; J\big(\pi_\theta, \hat{\mathcal{M}}_\phi\big) \qquad\qquad \text{(model player)} \;\; \min_\phi \; \ell\big(\hat{\mathcal{M}}_\phi, \mu^{\pi_\theta}\big) \qquad (4)$$

We use $\mu^{\pi_\theta}$ to denote the average state visitation distribution when executing the policy $\pi_\theta$ in the world. The policy player maximizes performance in the learned model, while the model player minimizes prediction error under the policy player's induced state distribution. This is a game since the players can only pick their own parameters while their payoffs depend on the parameters of both players.
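For intuition on the model player's loss, a small sketch of the isotropic Gaussian case from Section 2.1 (scalar states, fixed noise scale; all numbers hypothetical) confirms that the likelihood-based objective and the squared-error loss share the same minimizer:

```python
import math

# Sketch of the model player's loss in the isotropic Gaussian case
# (scalar states and a fixed noise scale; all numbers hypothetical).
def gaussian_nll(pred, obs, sigma=1.0):
    """Negative log-likelihood of an observed next state under the
    model's Gaussian prediction N(pred, sigma^2)."""
    return 0.5 * (obs - pred) ** 2 / sigma ** 2 \
        + 0.5 * math.log(2 * math.pi * sigma ** 2)

# hypothetical (predicted, observed) next-state pairs
samples = [(0.0, 0.1), (1.0, 2.0)]
nll = [gaussian_nll(p, o) for p, o in samples]
l2 = [0.5 * (o - p) ** 2 for p, o in samples]
# with sigma fixed, the NLL and the squared-error loss differ by the
# same constant on every sample, so their minimizers coincide
gaps = [n - e for n, e in zip(nll, l2)]
```

This is why, in continuous control practice, the model player's update often reduces to regression of predicted next states onto observed transitions.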
The above formulation separates MBRL into the constituent components of policy optimization (planning) and generative model learning. At the same time, it exposes that the two components are closely intertwined and must be considered together in order to succeed in MBRL. We discuss algorithms for solving the game in Section 4, and first focus on the equilibrium properties of the MBRL game. Theorem 1 presents an informal version of our theoretical result; a more formal version of the theorem and proof is provided in Appendix A. Our results establish that at an (approximate) Nash equilibrium of the MBRL game: (1) the model can accurately simulate and predict the performance of the policy; (2) the policy is near-optimal.
Theorem 1.
(Global performance of approximate equilibrium pair; informal) Suppose we have a pair of policy and model, $(\pi, \hat{\mathcal{M}})$, such that simultaneously $J(\pi, \hat{\mathcal{M}}) \ge \sup_{\pi'} J(\pi', \hat{\mathcal{M}}) - \epsilon_\pi$ and $\ell(\hat{\mathcal{M}}, \mu^{\pi}) \le \epsilon_{\hat{\mathcal{M}}}$.
Let $\pi^*$ be an optimal policy and denote its corresponding performance as $J(\pi^*, \mathcal{M})$. Then, the performance gap is bounded by

$$J(\pi^*, \mathcal{M}) - J(\pi, \mathcal{M}) \le O\!\left( \epsilon_\pi + \frac{R_{\max}\, \gamma}{(1 - \gamma)^2} \sqrt{\epsilon_{\hat{\mathcal{M}}}} + \frac{R_{\max}\, \gamma}{(1 - \gamma)^2} \sqrt{\ell\big(\hat{\mathcal{M}}, \mu^{\pi^*}\big)} \right). \qquad (5)$$
A few remarks are in order about the above result and its implications.


The first two terms are related to suboptimality in policy optimization (planning) and model learning, and can be made small with more compute and data, assuming sufficient capacity.

There may be multiple Nash equilibria for the MBRL game, and the third, domain adaptation (or transfer learning) term in the bound captures the quality of an equilibrium. We refer to it as the domain adaptation term since the model is trained under the distribution of $\pi$, i.e. $\mu^{\pi}$, but evaluated under the distribution of $\pi^*$, i.e. $\mu^{\pi^*}$. If the model can accurately simulate $\pi^*$, we can expect to find it in the planning phase, since it would obtain high rewards. This domain adaptation term is a consequence of the exploration problem, and is unavoidable if we desire globally optimal policies. Indeed, even purely model-free algorithms suffer from an analogous divergence term [9, 24, 25]. However, Theorem 1 also applies to locally optimal policies, for which we may expect better model transfer.

There are multiple avenues to minimize the impact of the domain adaptation term. One approach is to consider a wide initial state distribution [9, 26]. This ensures the model we learn is applicable for a wider set of states, and can thereby simulate a larger collection of policies. However, in some applications, the initial state distribution may not be under our control. In such a case, we may draw upon advances in the domain adaptation literature [27, 28, 29] to learn state-action representations better suited for transfer across different policies.
4 Algorithms
So far, we have established how MBRL can be viewed as a game that couples policy and model learning. We now turn to developing algorithms to solve the MBRL game. Unlike common deep learning settings (e.g. supervised learning), there are no standard workhorses for continuous games. Direct extensions of optimization workhorses (e.g. SGD) are unstable for games due to non-stationarity [13, 14, 18, 30]. We first review some of these extensions before presenting our final algorithms.

4.1 Independent simultaneous learners
We first consider a class of algorithms where each player individually optimizes their own objective using gradient-based methods. Thus, each player treats the setting as a (stochastic) optimization problem, unaware of potential drift in their objective due to the two-player nature. These algorithms are sometimes called independent learners, simultaneous learners, or naive learners [14, 31].
Gradient Descent Ascent (GDA) In GDA, each player performs an improvement step holding the parameters of the other player fixed. The resulting updates are given below.

$$\theta_{k+1} = \theta_k + \alpha\, \nabla_\theta J\big(\pi_\theta, \hat{\mathcal{M}}_{\phi_k}\big)\big|_{\theta = \theta_k} \qquad \text{(conservative policy step)} \qquad (6)$$
$$\phi_{k+1} = \phi_k - \beta\, \nabla_\phi \ell\big(\hat{\mathcal{M}}_\phi, \mu^{\pi_{\theta_k}}\big)\big|_{\phi = \phi_k} \qquad \text{(conservative model step)} \qquad (7)$$
Note that both the policy and model players update their parameters simultaneously, from iteration $k$ to $k+1$. For simplicity, we show vanilla gradient-based optimization in the above equations. In practice, this can be replaced with alternatives like momentum [32] or Adam [33] for model learning; and NPG [34, 26], TRPO [35], PPO [36] etc. for policy optimization.
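The updates can be illustrated with a minimal numeric sketch; the scalar players and quadratic payoffs below are hypothetical stand-ins for the actual objectives, and the model loss's dependence on the policy's visitation distribution is suppressed for brevity:

```python
# Minimal numeric sketch of the GDA updates (6)-(7) with scalar
# players and hypothetical quadratic payoffs. The model loss's
# dependence on the policy's visitation distribution is suppressed.
w_star = 2.0          # "world" dynamics parameter (hypothetical)
p, m = 0.0, 0.0       # policy and model parameters
alpha = beta = 0.1    # conservative step sizes
for _ in range(300):
    grad_p = -2.0 * (p - m)        # d/dp of J(p, m) = -(p - m)^2
    grad_m = 2.0 * (m - w_star)    # d/dm of loss(m) = (m - w_star)^2
    # both players step simultaneously from the previous iterate
    p, m = p + alpha * grad_p, m - beta * grad_m
# small, simultaneous steps let both players converge: m -> w_star
# (an accurate model) and p -> m (an optimal policy for the model)
```

In the practical algorithm, the toy gradients are replaced by the RL and model-fitting gradients above, but the simultaneous, conservative structure is the same.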
GDA is a conceptually simple and intuitive algorithm. Variants of GDA have been used to solve min-max games arising in deep learning, such as GANs. However, for certain problems, it can exhibit poor convergence and require very small learning rates [30, 18, 13, 14] or domain-specific heuristics. Furthermore, it makes suboptimal use of data, since it is desirable to take multiple policy improvement steps to fully reap the benefits of model learning. The following algorithm addresses this drawback.
Best Response (BR) In BR, each player fixes the parameters of the other player and computes the best response – the parameters that optimize their objective. To approximate the best response, we can take a large number of gradient steps.

$$\theta_{k+1} = \arg\max_\theta J\big(\pi_\theta, \hat{\mathcal{M}}_{\phi_k}\big) \qquad \text{(aggressive policy step)} \qquad (8)$$
$$\phi_{k+1} = \arg\min_\phi \ell\big(\hat{\mathcal{M}}_\phi, \mu^{\pi_{\theta_k}}\big) \qquad \text{(aggressive model step)} \qquad (9)$$
Again, both players simultaneously update their parameters. It is known from a large body of work in online learning that aggressive changes can destabilize learning in non-stationary settings [37, 38]. Large changes to the policy can dramatically alter the sampling distribution, which renders the model incompetent and introduces bias into policy optimization. In Section 5, we experimentally study the performance of GDA and BR on a suite of control tasks. The experimental results corroborate the drawbacks suggested above, suggesting the need for better algorithms to solve the MBRL game.
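The failure mode can be seen even in a stylized scalar example, where simultaneous best responses cycle forever instead of converging (the best-response maps below are purely illustrative, not derived from the MBRL objectives):

```python
# Stylized example of why simultaneous best responses can cycle.
# Hypothetical best-response maps: the model refits to the previous
# policy (m <- p), while the policy over-exploits the stale model
# (p <- -m). Neither map is from the paper; they only illustrate
# the instability of aggressive simultaneous updates.
p, m = 1.0, -1.0
history = []
for _ in range(8):
    p, m = -m, p          # both players best-respond simultaneously
    history.append((p, m))
# the iterates cycle with period 4 and never reach an equilibrium
```

Each player is optimal against a stale opponent, so neither the policy nor the model ever settles, mirroring the distribution-mismatch instability observed for BR.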
4.2 Stackelberg formulation and algorithms
To enable stable and sample efficient learning, we require algorithms that take the game structure into account. While good workhorses like SGD are lacking for general games, one of the exceptions is the Stackelberg game [15], which admits stable gradient-based algorithms. Stackelberg games are asymmetric games where we impose a specific playing order. It is a generalization of min-max games and is closely related to bilevel optimization. We cast the MBRL game in the Stackelberg form, and derive gradient-based algorithms to solve the resulting game.
First, we briefly review continuous Stackelberg games. Consider a two-player game with players $A$ and $B$. Let $\theta_A$, $\theta_B$ be their parameters, and $L_A(\theta_A, \theta_B)$, $L_B(\theta_A, \theta_B)$ be their losses. Each player would like their loss minimized. With player $A$ as the leader, the Stackelberg game corresponds to the following nested optimization:

$$\min_{\theta_A} \; L_A\big(\theta_A, \theta_B^*(\theta_A)\big) \qquad (10)$$
$$\text{subject to} \quad \theta_B^*(\theta_A) = \arg\min_{\theta_B} L_B(\theta_A, \theta_B) \qquad (11)$$
Since the follower chooses the best response, the follower’s parameters are implicitly a function of the leader’s parameters. The leader is aware of this, and can utilize this information when updating its parameters. The Stackelberg formulation has a number of appealing properties.


Algorithm design based on optimization: From the leader's viewpoint, the Stackelberg formulation transforms a game with complex interactions into a more familiar, albeit still challenging, optimization problem. Gradient-based workhorses exist for optimization, unlike general games.

Notion of stability and progress: In general games, there exists no single function that can be used to check if an iterative algorithm makes progress towards the equilibrium. This makes algorithm design and diagnosis difficult. By reducing the game to a nested optimization, the outer level objective can be used to effectively track progress.
For simplicity of exposition, we assume that the best response is unique for the follower. We later remark on the possibility of multiple minimizers. To solve the nested optimization, it suffices to focus on $\theta_A$, since the follower parameters $\theta_B^*(\theta_A)$ are implicitly a function of $\theta_A$. We can iteratively optimize $\theta_A$ as: $\theta_A \leftarrow \theta_A - \alpha\, \nabla_{\theta_A} L_A\big(\theta_A, \theta_B^*(\theta_A)\big)$, where the gradient is described in Eq. 12. Thus, the key to solving the Stackelberg game is to make one player (the follower) learn very quickly to play the best response to a slow, stable learning player (the leader).

$$\nabla_{\theta_A} L_A\big(\theta_A, \theta_B^*(\theta_A)\big) = \frac{\partial L_A}{\partial \theta_A} + \left( \frac{d\, \theta_B^*(\theta_A)}{d\, \theta_A} \right)^{\!\top} \frac{\partial L_A}{\partial \theta_B} \Bigg|_{\theta_B = \theta_B^*(\theta_A)} \qquad (12)$$

The implicit Jacobian term can be obtained using the implicit function theorem [39, 17] as:

$$\frac{d\, \theta_B^*(\theta_A)}{d\, \theta_A} = - \left( \frac{\partial^2 L_B}{\partial \theta_B\, \partial \theta_B} \right)^{-1} \frac{\partial^2 L_B}{\partial \theta_B\, \partial \theta_A} \Bigg|_{\theta_B = \theta_B^*(\theta_A)} \qquad (13)$$
Thus, in principle, we can compute the gradient with respect to the leader parameters and solve the nested optimization (to at least a local minimizer). To develop a practical algorithm based on these ideas, we use a few relaxations and approximations. First, it may be hard to compute the exact best response in the inner level with an iterative optimization algorithm. Thus, we use a large number of gradient steps to approximate the best response. Secondly, the implicit Jacobian term may be computationally expensive and difficult to obtain. In practice, this term can often be dropped (i.e. approximated as $0$) without suffering significant performance degradation, leading to a "first-order" approximation of the gradient. Such an approximation has proven effective in applications like meta-learning [40, 41] and GANs [42, 43, 44]. This also resembles two-timescale algorithms previously studied for actor-critic algorithms [45]. Finally, since the Stackelberg game is asymmetric, we can cast the MBRL game in two forms based on which player we choose as the leader.
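The implicit gradient computation can be checked on a tiny quadratic bilevel problem where the follower's best response is available in closed form (the coefficients are arbitrary, chosen only for illustration):

```python
# Quadratic bilevel toy problem with hypothetical coefficients:
#   leader:   min_x  f(x, y*(x)) = (y*(x) - b)^2 + c * x^2
#   follower: y*(x) = argmin_y (y - a*x)^2   =>   y*(x) = a * x
a, b, c = 3.0, 1.0, 0.5
x = 2.0
y_star = a * x

df_dx = 2.0 * c * x            # explicit dependence of f on x
df_dy = 2.0 * (y_star - b)     # dependence of f on the follower
# implicit function theorem: dy*/dx = -(d2g/dy2)^-1 * d2g/dydx = a
dy_dx = a

exact = df_dx + df_dy * dy_dx  # full leader gradient (cf. Eq. 12)
first_order = df_dx            # "first-order" approx drops the implicit term

# finite-difference check of the full gradient
F = lambda z: (a * z - b) ** 2 + c * z ** 2
eps = 1e-6
fd = (F(x + eps) - F(x - eps)) / (2 * eps)
```

In this toy the implicit term dominates the leader gradient, which highlights why the first-order approximation must be paired with a fast, (approximately) best-responding follower to remain effective.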
Policy As Leader (PAL): Choosing the policy player as the leader results in the following optimization:

$$\max_\theta \; J\big(\pi_\theta, \hat{\mathcal{M}}_{\phi^*(\theta)}\big) \quad \text{subject to} \quad \phi^*(\theta) = \arg\min_\phi \ell\big(\hat{\mathcal{M}}_\phi, \mu^{\pi_\theta}\big) \qquad (14)$$
We solve this nested optimization using the first-order gradient approximation, resulting in the updates:

$$\phi_{k+1} = \arg\min_\phi \ell\big(\hat{\mathcal{M}}_\phi, \mu^{\pi_{\theta_k}}\big) \qquad \text{(aggressive model step)} \qquad (15)$$
$$\theta_{k+1} = \theta_k + \alpha\, \nabla_\theta J\big(\pi_\theta, \hat{\mathcal{M}}_{\phi_{k+1}}\big)\big|_{\theta = \theta_k} \qquad \text{(conservative policy step)} \qquad (16)$$
We first aggressively improve the model to minimize the loss under the current visitation distribution. Subsequently, we take a conservative policy step to enable stable optimization. The algorithmic template is described further in Algorithm 1. Note that the PAL updates are different from GDA even if a single gradient step is used to approximate the $\arg\min$. In PAL, the model is first updated using the current visitation distribution, from $\phi_k$ to $\phi_{k+1}$. The policy subsequently uses $\hat{\mathcal{M}}_{\phi_{k+1}}$ for improvement. In contrast, GDA uses $\hat{\mathcal{M}}_{\phi_k}$ for improving the policy. Finally, suppose we find an $\epsilon_\pi$-approximate solution to the PAL optimization (Eq. 14). Since the model (follower) is optimal for the policy by construction, we inherit the guarantees of Theorem 1.
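On hypothetical scalar quadratic objectives, the PAL template of Eqs. (15)-(16) can be sketched as follows (an illustrative toy, not the practical implementation):

```python
# Sketch of the PAL template (15)-(16) on hypothetical scalar
# quadratic objectives: the model player fits the world parameter
# w_star, and the policy player's model reward peaks at p = m.
w_star = 2.0
p, m = 0.0, 0.0
alpha = 0.1                            # conservative policy step size
for _ in range(100):
    # aggressive model step: many inner gradient steps approximate
    # the best response argmin_m (m - w_star)^2
    for _ in range(50):
        m -= 0.1 * 2.0 * (m - w_star)
    # conservative policy step against the freshly updated model
    p += alpha * (-2.0 * (p - m))      # ascent on J(p, m) = -(p - m)^2
```

Because each policy step is taken against a model that has just been refit, the policy never improves against a stale model, which is the key structural difference from GDA.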
Model as Leader (MAL): Conversely, choosing the model player as the leader results in the optimization:

$$\min_\phi \; \ell\big(\hat{\mathcal{M}}_\phi, \mu^{\pi_{\theta^*(\phi)}}\big) \quad \text{subject to} \quad \theta^*(\phi) = \arg\max_\theta J\big(\pi_\theta, \hat{\mathcal{M}}_\phi\big) \qquad (17)$$
Similar to the PAL formulation, using the first-order approximation to the bilevel gradient results in:

$$\theta_{k+1} = \arg\max_\theta J\big(\pi_\theta, \hat{\mathcal{M}}_{\phi_k}\big) \qquad \text{(aggressive policy step)} \qquad (18)$$
$$\phi_{k+1} = \phi_k - \beta\, \nabla_\phi \ell\big(\hat{\mathcal{M}}_\phi, \mu^{\pi_{\theta_{k+1}}}\big)\big|_{\phi = \phi_k} \qquad \text{(conservative model step)} \qquad (19)$$
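A toy numeric instantiation of these updates, with a hypothetical scalar linear world and data aggregation standing in for the conservative model step:

```python
import random

# Sketch of the MAL template (18)-(19): aggressive policy
# optimization against the current model, then a conservative model
# update implemented as refitting on ALL aggregated data.
# Hypothetical linear world: s' = w_star * s + small noise.
random.seed(0)
w_star = 2.0
m = 0.0                      # model's slope estimate
data_s, data_s_next = [], []
for _ in range(10):
    # aggressive policy step: in this toy, the policy optimized
    # against the model simply visits states near m
    p = m
    # collect transitions in the world under the optimized policy
    for _ in range(64):
        s = random.gauss(p, 1.0)
        data_s.append(s)
        data_s_next.append(w_star * s + 0.01 * random.gauss(0.0, 1.0))
    # conservative model step: least-squares refit on aggregated data
    m = sum(s * sn for s, sn in zip(data_s, data_s_next)) \
        / sum(s * s for s in data_s)
# despite the shifting visitation distribution, aggregation keeps
# the model estimate anchored near the true dynamics
```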
We first optimize a policy for the current model using RL or other planning techniques (e.g. MPC [46]). Subsequently, we conservatively improve the model using the data collected with the optimized policy. In practice, instead of a single conservative model improvement step, we aggregate all the historical data and perform a few epochs of training. This has an effect similar to conservative model improvement, in a follow-the-regularized-leader interpretation [37, 10, 47]. The algorithmic template is described in Algorithm 2. Similar to the PAL case, we again inherit the guarantees from Theorem 1.

On distributionally robust models and policies Finally, we illustrate how the Stackelberg framework is consistent with commonly used robustification heuristics. We now consider the case where there could be multiple best responses to the leader in Eq. 10. For instance, in PAL, there could be multiple models that achieve low error for the policy. Similarly, in MAL, there could be multiple policies that achieve high rewards for the specified model. In such cases, the standard notion of Stackelberg equilibrium is to optimize under the worst case realization [18], which results in:
$$\text{(PAL)} \quad \max_\theta \; \min_{\phi \in \Phi^*(\theta)} J\big(\pi_\theta, \hat{\mathcal{M}}_\phi\big), \quad \Phi^*(\theta) = \arg\min_\phi \ell\big(\hat{\mathcal{M}}_\phi, \mu^{\pi_\theta}\big) \qquad (20)$$
$$\text{(MAL)} \quad \min_\phi \; \max_{\theta \in \Theta^*(\phi)} \ell\big(\hat{\mathcal{M}}_\phi, \mu^{\pi_\theta}\big), \quad \Theta^*(\phi) = \arg\max_\theta J\big(\pi_\theta, \hat{\mathcal{M}}_\phi\big) \qquad (21)$$
In PAL, model ensemble approaches correspond to approximating the best-response set with a finite collection (ensemble) of models. Algorithms inspired by robust or risk-averse control [48, 49, 50] explicitly improve against the adversarial choice in the ensemble, consistent with the Stackelberg setting. Similarly, in the MAL formulation, entropy regularization [51, 23] and disagreement-based reward bonuses [22, 52] lead to adversarial best responses by encouraging the policy to visit parts of the state space where the model is likely to be inaccurate. Thus far, these ideas (e.g. model ensembles) have largely been viewed as important heuristics. Our Stackelberg MBRL game formulation is consistent with and provides a principled foundation for these important findings, leading to a unified framework.
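As a small illustration of the worst-case criterion, an ensemble can approximate the best-response set and a candidate policy can be scored under the most adversarial member (the ensemble values and the return surrogate below are hypothetical):

```python
# Worst-case scoring against a model ensemble: the best-response set
# is approximated by a finite ensemble, and a candidate policy is
# evaluated under the most pessimistic member. The ensemble values
# and the return surrogate below are hypothetical.
def policy_return(p, model_param):
    # toy surrogate: reward peaks when the policy matches the model
    return -(p - model_param) ** 2

ensemble = [1.8, 2.0, 2.3]       # e.g. bootstrapped model fits

def robust_score(p):
    return min(policy_return(p, m) for m in ensemble)

# grid search over candidate policies in [0, 4]
candidates = [i / 100.0 for i in range(401)]
best = max(candidates, key=robust_score)
# the robust optimum hedges between the extreme ensemble members
# rather than committing to any single model
```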
5 Experiments
In our experimental evaluation, we aim to answer the following questions:


Do independent learning algorithms (GDA and BR) learn slowly or suffer from instabilities?

Do the Stackelberg-style algorithms (PAL and MAL) enable stable and sample efficient learning?

Do MAL and PAL exhibit different learning characteristics and strengths? Can we characterize the situations where PAL might be better than MAL, and vice versa?
Task Suite We study the behavior of algorithms on a suite of continuous control tasks consisting of: DClawTurn, DKittyOrient, 7DOFReacher, and InHandPen. The tasks are illustrated in Figure 1 and further details are provided in Appendix B.1. The DClaw and DKitty tasks use physically accurate models of robots [53, 54]. The Reacher task is a representative whole-arm manipulation task, while the in-hand dexterous manipulation task [55] serves as a representative high-dimensional control task. In addition, we also present results with our algorithms on the OpenAI gym tasks in Appendix B.2.
Algorithm Details For all the algorithms of interest (GDA, BR, PAL, MAL), we represent both the policy and the dynamics model with fully connected neural networks. We instantiate all of these algorithm families with model-based natural policy gradient. Details about the implementation are provided in Appendix B. We use ensembles of dynamics models and entropy regularization to encourage robustness.
Comparison of learning algorithms We first study the performance of the Stackelberg-style algorithms (PAL, MAL) and compare against the performance of the independent algorithms (GDA and BR). Our results, summarized in Figure 2, suggest that PAL and MAL can learn all the tasks efficiently. We observe near-monotonic improvement, suggesting that the Stackelberg formulation enables stable learning. We also observe that PAL learns faster than MAL for the tasks we study. While GDA eventually achieves a near-100% success rate, it is considerably slower due to the conservative nature of updates for both the policy and model. Furthermore, its performance fluctuates rapidly during the course of learning, since GDA does not correspond to stable optimization of any single objective. Finally, we observe that BR is unable to make consistent progress. As suggested earlier in Section 4, BR makes rapid changes to both model and policy, which exacerbates the challenge of distribution mismatch.
As a point of comparison, we also plot the results of SAC [51], a leading model-free algorithm, for the ROBEL tasks (results taken from Ahn et al. [54]). Although SAC is able to solve these tasks, its sample efficiency is comparable to GDA, and substantially slower than PAL and MAL. To compare against other model-based algorithms, we turn to published results from prior work on OpenAI gym tasks. In Figure 3, we show that PAL and MAL significantly outperform prior algorithms. In particular, PAL and MAL are 10 times as efficient as other model-based and model-free methods. PAL is also twice as efficient as MBPO [56], a state-of-the-art hybrid model-based and model-free algorithm. Further details about this comparison are provided in Appendix B.2.
Overall, our results indicate that PAL and MAL: (a) are substantially more sample efficient than prior model-based and model-free algorithms; (b) achieve the asymptotic performance of their model-free counterparts; (c) can scale to high-dimensional tasks with complex dynamics like dexterous manipulation; (d) can scale to tasks requiring extended rollout horizons (e.g. the OpenAI gym tasks).
Choosing between PAL and MAL Finally, we turn to studying relative strengths of PAL and MAL. For this, we consider two variations of the 7DOF reacher task (from Figure 1) corresponding to environment perturbations at an intermediate point of training. In the first case, we perturb the dynamics by changing the length of the forearm. In the second case, halfway through the training, we change the goal distribution to a different region of 3D space. Training curves are presented in Figure 4. Note that there is a performance drop at the time of introducing the perturbation.
For the first case of dynamics perturbation, we observe that PAL recovers faster. Since PAL learns the model aggressively using recent data, it can forget old inconsistent data and improve the policy using an accurate model. In contrast, MAL adapts the model conservatively, taking longer to forget old inconsistent data, ultimately biasing and slowing the policy learning. In the second experiment, the dynamics is stationary but the goal distribution changes midway. Note that the policy does not generalize zeroshot to the new goal distribution, and requires additional learning or finetuning. Since MAL learns a more broadly accurate model, it quickly adapts to the new goal distribution. In contrast, PAL conservatively changes the policy and takes longer to adapt to the new goal distribution.
Thus, in summary, we find that PAL is better suited for situations where the dynamics of the world can drift over time. In contrast, MAL is better suited for situations where the task or goal distribution can change over time, and related settings like multi-task learning.
6 Related Work
MBRL and the closely related fields of adaptive control and system identification have a long and rich history (see [4, 2, 57] for overview). Early works in MBRL primarily focused on tabular reinforcement learning in a known generative model setting [5, 24]. However, this setting assumes access to a highly exploratory policy to collect data, which is often not available in practice. Subsequent works like E3 [58] and RMAX [59] attempt to lift this limitation, but rely heavily on tabular representations which are inadequate for modern applications like robotics. Coupled with advances in deep learning, there has been a surge of interest in incremental MBRL algorithms with rich function approximation. They generally fall into two sets of approaches, as we outline below.
The first set of approaches is largely inspired by trust region methods, and is similar to the PAL family from our work. A highly accurate "local" model is constructed around the visitation distribution of the current policy, and subsequently used to conservatively improve the policy. The trust region is intended to ensure that the model is accurate for all policies within it, thereby enabling monotonic performance improvement. GPS [7, 60], DPI [61], and related approaches [62, 63] learn a time-varying linear model and perform a KL-constrained policy improvement step. Such a model representation is convenient for an iLQG [64] based policy update, but might be restrictive for complex dynamics beyond trajectory-centric RL. To remove these limitations, recent works have started to consider neural network representations for both the policy and dynamics model. However, somewhat surprisingly, a clean version from the PAL family has not been studied with neural network models [65]. The motivations presented by Xu et al. [66] and Kurutach et al. [67] resemble PAL; however, their practical implementations do not strongly enforce the conservative nature of the policy update.
An alternate set of MBRL approaches takes a view similar to MAL. Models are updated conservatively through data aggregation, while policies are aggressively optimized. Ross et al. [10] explicitly studied the role of data aggregation in MBRL. They presented an agnostic online learning view of MBRL and showed that data aggregation can lead to a no-regret algorithm for learning the model, even with aggressive policy optimization. Subsequent works have used data aggregation and proposed additional components to enhance efficiency and stability, such as the use of model predictive control for fast/aggressive policy improvement [68, 69, 12] and uncertainty quantification through Bayesian models like Gaussian processes [70] and ensembles of dynamics models [50, 11, 12]. We refer readers to Wang et al. [65] for an overview of recent MBRL advances.
We emphasize that while algorithm instances in the PAL and MAL families have been studied in the past, an overarching framework around them has been lacking. Our descriptions of the PAL and MAL families generalize and unify core insights from prior work and simplify them from the lens of abstraction. Furthermore, the game-theoretic formulation enables us to form a connection between the PAL and MAL frameworks. We also note that the PAL and MAL families have similarities to multiple-timescale algorithms [45, 71, 72] studied for actor-critic temporal difference learning. These ideas have also been extended to study min-max games like GANs [43]. However, they have not been extended to study model-based RL.
We presented a modelbased setting where the model is used to directly improve the policy through rollout based optimization. However, models can be utilized in other ways too. Dyna [73] and MBPO [56] use a learned model to provide additional learning targets for an actorcritic algorithm through shorthorizon synthetic trajectories. MBVE [74], STEVE [75], and doublyrobust methods [76, 77, 78]
use modelbased rollouts to obtain more favorable biasvariance tradeoffs for offpolicy evaluation. Some of these works have noted that long horizon rollouts can exacerbate model bias. However, in our experiments, we were able to successfully perform rollouts of hundreds of steps. This is likely due to our practical implementation closely following the Stackelberg setting, which was explicitly designed to mitigate distribution shift and enable effective simulation. It is straightforward to extend PAL and MAL to a hybrid modelbased and modelfree algorithm. Similarly, approaches that bootstrap from model’s own predictions can improve multistep simulation
[79, 80]. We leave exploration of these directions for future work.

7 Summary and Conclusion
In this work, we developed a new framework for MBRL that casts it as a two-player game between a policy player and a model player. We established that at equilibrium: (1) the model accurately simulates the policy and predicts its performance; (2) the policy is near-optimal. We derived sub-optimality bounds and made a connection to domain adaptation to characterize the quality of an equilibrium.
In order to solve the MBRL game, we constructed the Stackelberg version of the game. This has two advantages: (1) effective gradient-based workhorses exist to solve the Stackelberg optimization problem; (2) an effective objective function is available to track learning progress towards equilibrium. General continuous games possess neither of these characteristics. The Stackelberg game can take two forms based on which player we choose as the leader, resulting in two natural algorithm families, which we named PAL and MAL. Together they encompass, generalize, and unify a large collection of prior MBRL works. This greatly simplifies MBRL, and particularly algorithm design, from the lens of abstraction.
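The two-timescale leader-follower structure described above can be illustrated with a minimal sketch on a toy quadratic game. This is our illustrative example, not the paper's algorithm: the leader moves slowly on its objective while the follower approximately best-responds with fast inner updates, and the leader's update ignores the implicit gradient term, as in the practical algorithms.

```python
import numpy as np

def stackelberg(lr_leader=0.01, lr_follower=0.5, steps=2000):
    """Two-timescale gradient play on a toy Stackelberg game.

    Leader loss:   f(x, y) = x^2 + (x - y)^2  (minimized over x)
    Follower loss: g(x, y) = (y - x)^2        (minimized over y)
    """
    x, y = 1.0, 1.0
    for _ in range(steps):
        # Follower: aggressive inner loop, approximate best response to current x.
        for _ in range(20):
            y -= lr_follower * 2 * (y - x)
        # Leader: slow first-order update, ignoring the implicit d(y*)/dx term.
        x -= lr_leader * (2 * x + 2 * (x - y))
    return x, y

x_eq, y_eq = stackelberg()
```

At equilibrium the follower tracks the leader (y ≈ x) and the leader settles at its optimum x ≈ 0, mirroring the model-follows-policy (PAL) or policy-follows-model (MAL) dynamics, depending on which player is cast as leader.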
We developed practical versions of PAL and MAL using model-based natural policy gradient. We demonstrated stable and sample-efficient learning on a suite of control tasks, including state-of-the-art results on OpenAI Gym benchmarks. These results suggest that our practical variants of PAL and MAL: (a) are substantially more sample efficient than prior approaches; (b) achieve the same asymptotic performance as model-free counterparts; (c) can scale to high-dimensional tasks with complex dynamics like dexterous manipulation; (d) can scale to tasks requiring rollouts of hundreds of timesteps.
More broadly, our work adds to a growing body of recent work which suggests that MBRL can be stable, sample efficient, and more adaptable (for example, to new tasks). For future work, we hope to study alternate ways to solve the Stackelberg optimization, such as using the full implicit gradient term and unrolled optimization. Finally, although we presented our game-theoretic framework in the context of MBRL, it is more broadly applicable to any surrogate-based optimization, including actor-critic methods. It would make for interesting future work to study broader extensions and implications.
Acknowledgements
We thank Emo Todorov, Sham Kakade, Sergey Levine, and Drew Bagnell for valuable feedback and discussions. We thank Michael Ahn and Michael Janner for sharing the baseline learning curves. The work was done by Aravind Rajeswaran during internship(s) at Google Brain, MTV.
References
 [1] Richard Sutton and Andrew Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.
 [2] Karl Johan Åström and Richard M. Murray. Feedback systems: An introduction for scientists and engineers. 2004.
 [3] Martin L. Puterman. Markov decision processes: Discrete stochastic dynamic programming. In Wiley Series in Probability and Statistics, 1994.
 [4] Kumpati S. Narendra and Anuradha M. Annaswamy. Persistent excitation in adaptive systems. International Journal of Control, 1987.
 [5] Michael Kearns and Satinder P. Singh. Finite-sample convergence rates for Q-learning and indirect algorithms. In NIPS, 1998.
 [6] Alekh Agarwal, Sham M. Kakade, and Lin F. Yang. On the optimality of sparse model-based planning for Markov decision processes. ArXiv, abs/1906.03804, 2019.
 [7] Sergey Levine and Pieter Abbeel. Learning neural network policies with guided policy search under unknown dynamics. In NIPS, 2014.
 [8] Wen Sun, Geoffrey J. Gordon, Byron Boots, and J. Andrew Bagnell. Dual policy iteration. In NeurIPS, 2018.
 [9] Sham M. Kakade and John Langford. Approximately optimal approximate reinforcement learning. In ICML, 2002.
 [10] Stéphane Ross and J. Andrew Bagnell. Agnostic system identification for model-based reinforcement learning. In ICML, 2012.
 [11] Kurtland Chua, Roberto Calandra, Rowan McAllister, and Sergey Levine. Deep reinforcement learning in a handful of trials using probabilistic dynamics models. In NeurIPS, 2018.
 [12] Anusha Nagabandi, Kurt Konolige, Sergey Levine, and Vikash Kumar. Deep dynamics models for learning dexterous manipulation. ArXiv, abs/1909.11652, 2019.
 [13] Florian Schäfer and Anima Anandkumar. Competitive gradient descent. In NeurIPS, 2019.
 [14] Yuanhao Wang, Guodong Zhang, and Jimmy Ba. On solving minimax optimization locally: A follow-the-ridge approach. ArXiv, abs/1910.07512, 2019.
 [15] Heinrich von Stackelberg. Market structure and equilibrium. 1934.
 [16] Benoît Colson, Patrice Marcotte, and Gilles Savard. An overview of bilevel optimization. Annals of Operations Research, 153:235–256, 2007.
 [17] Aravind Rajeswaran, Chelsea Finn, Sham M. Kakade, and Sergey Levine. Meta-learning with implicit gradients. In NeurIPS, 2019.
 [18] Tanner Fiez, Benjamin Chasnov, and Lillian J. Ratliff. Convergence of learning dynamics in Stackelberg games. ArXiv, abs/1906.01217, 2019.
 [19] Stefanos Nikolaidis, Swaprava Nath, Ariel D. Procaccia, and Siddhartha S. Srinivasa. Game-theoretic modeling of human adaptation in human-robot collaboration. 2017 12th ACM/IEEE International Conference on Human-Robot Interaction (HRI), pages 323–331, 2017.
 [20] Dorsa Sadigh, Nick Landolfi, S. Shankar Sastry, Sanjit A. Seshia, and Anca D. Dragan. Planning for cars that coordinate with people: leveraging effects on human actions for planning and active information gathering over human internal state. Autonomous Robots, 42:1405–1426, 2018.
 [21] Ofir Nachum and Bo Dai. Reinforcement learning via Fenchel-Rockafellar duality. ArXiv, abs/2001.01866, 2020.
 [22] Deepak Pathak, Pulkit Agrawal, Alexei A. Efros, and Trevor Darrell. Curiosity-driven exploration by self-supervised prediction. In ICML, 2017.
 [23] Elad Hazan, Sham M. Kakade, Karan Singh, and Abby Van Soest. Provably efficient maximum entropy exploration. In ICML, 2018.
 [24] Alekh Agarwal, Sham M. Kakade, Jason D. Lee, and Gaurav Mahajan. Optimality and approximation with policy gradient methods in Markov decision processes. ArXiv, abs/1908.00261, 2019.

 [25] Rémi Munos and Csaba Szepesvári. Finite-time bounds for fitted value iteration. Journal of Machine Learning Research, 2008.
 [26] Aravind Rajeswaran, Kendall Lowrey, Emanuel Todorov, and Sham Kakade. Towards Generalization and Simplicity in Continuous Control. In NIPS, 2017.
 [27] Shai Ben-David, John Blitzer, Koby Crammer, and Fernando C. Pereira. Analysis of representations for domain adaptation. In NIPS, 2006.
 [28] Baochen Sun, Jiashi Feng, and Kate Saenko. Return of frustratingly easy domain adaptation. In AAAI, 2015.

 [29] Eric Tzeng, Judy Hoffman, Kate Saenko, and Trevor Darrell. Adversarial discriminative domain adaptation. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2962–2971, 2017.
 [30] Chi Jin, Praneeth Netrapalli, and Michael I. Jordan. What is local optimality in nonconvex-nonconcave minimax optimization? ArXiv, abs/1902.00618, 2019.
 [31] Jakob N. Foerster, Richard Y. Chen, Maruan Al-Shedivat, Shimon Whiteson, Pieter Abbeel, and Igor Mordatch. Learning with opponent-learning awareness. In AAMAS, 2017.
 [32] Ilya Sutskever, James Martens, George E. Dahl, and Geoffrey E. Hinton. On the importance of initialization and momentum in deep learning. In ICML, 2013.
 [33] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.
 [34] Sham M Kakade. A natural policy gradient. In NIPS, 2002.
 [35] John Schulman, Sergey Levine, Philipp Moritz, Michael Jordan, and Pieter Abbeel. Trust region policy optimization. In ICML, 2015.
 [36] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. ArXiv, abs/1707.06347, 2017.
 [37] Shai Shalev-Shwartz. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 2012.
 [38] Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, learning, and games. 2006.
 [39] Steven G. Krantz and Harold R. Parks. The implicit function theorem: History, theory, and applications. 2002.
 [40] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In ICML, 2017.
 [41] Alex Nichol, Joshua Achiam, and John Schulman. On first-order meta-learning algorithms. ArXiv, abs/1803.02999, 2018.
 [42] Ian J. Goodfellow, Jean PougetAbadie, Mehdi Mirza, Bing Xu, David WardeFarley, Sherjil Ozair, Aaron C. Courville, and Yoshua Bengio. Generative adversarial networks. ArXiv, abs/1406.2661, 2014.
 [43] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In NIPS, 2017.
 [44] Luke Metz, Ben Poole, David Pfau, and Jascha Sohl-Dickstein. Unrolled generative adversarial networks. ArXiv, abs/1611.02163, 2017.
 [45] Vijaymohan Konda and Vivek S. Borkar. Actor-critic type learning algorithms for Markov decision processes. SIAM J. Control and Optimization, 38:94–123, 1999.
 [46] Yuval Tassa, Tom Erez, and Emanuel Todorov. Synthesis and stabilization of complex behaviors through online trajectory optimization. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on, pages 4906–4913. IEEE, 2012.
 [47] H. Brendan McMahan. Follow-the-regularized-leader and mirror descent: Equivalence theorems and L1 regularization. In AISTATS, 2011.
 [48] Kemin Zhou, John C. Doyle, and Keith Glover. Robust and Optimal Control. Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 1996.
 [49] Javier García and Fernando Fernández. A comprehensive survey on safe reinforcement learning. J. Mach. Learn. Res., 16:1437–1480, 2015.
 [50] Aravind Rajeswaran, Sarvjeet Ghotra, Balaraman Ravindran, and Sergey Levine. EPOpt: Learning robust neural network policies using model ensembles. In ICLR, 2016.
 [51] Tuomas Haarnoja, Aurick Zhou, Kristian Hartikainen, George Tucker, Sehoon Ha, Jie Tan, Vikash Kumar, Henry Zhu, Abhishek Gupta, Pieter Abbeel, and Sergey Levine. Soft actor-critic algorithms and applications. ArXiv, abs/1812.05905, 2018.
 [52] Deepak Pathak, Dhiraj Gandhi, and Abhinav Gupta. Self-supervised exploration via disagreement. ArXiv, abs/1906.04161, 2019.
 [53] Henry Zhu, Abhishek Gupta, Aravind Rajeswaran, Sergey Levine, and Vikash Kumar. Dexterous manipulation with deep reinforcement learning: Efficient, general, and low-cost. 2019 International Conference on Robotics and Automation (ICRA), pages 3651–3657, 2019.
 [54] Michael Ahn, Henry Zhu, Kristian Hartikainen, Hugo Ponte, Abhishek Gupta, Sergey Levine, and Vikash Kumar. ROBEL: RObotics BEnchmarks for Learning with low-cost robots. In Conference on Robot Learning (CoRL), 2019.
 [55] Aravind Rajeswaran, Vikash Kumar, Abhishek Gupta, Giulia Vezzani, John Schulman, Emanuel Todorov, and Sergey Levine. Learning Complex Dexterous Manipulation with Deep Reinforcement Learning and Demonstrations. In Proceedings of Robotics: Science and Systems (RSS), 2018.
 [56] Michael Janner, Justin Fu, Marvin Zhang, and Sergey Levine. When to trust your model: Modelbased policy optimization. ArXiv, abs/1906.08253, 2019.
 [57] Lennart Ljung. System identification: Theory for the user. 1987.
 [58] Michael Kearns and Satinder P. Singh. Near-optimal reinforcement learning in polynomial time. Machine Learning, 49:209–232, 2002.
 [59] Ronen I. Brafman and Moshe Tennenholtz. R-max – a general polynomial time algorithm for near-optimal reinforcement learning. J. Mach. Learn. Res., 3:213–231, 2002.
 [60] Igor Mordatch and Emanuel Todorov. Combining the benefits of function approximation and trajectory optimization. In RSS, 2014.
 [61] Wen Sun, Geoffrey J. Gordon, Byron Boots, and J. Andrew Bagnell. Dual policy iteration. CoRR, abs/1805.10755, 2018.
 [62] Vikash Kumar, Emanuel Todorov, and Sergey Levine. Optimal control with learned local models: Application to dexterous manipulation. 2016 IEEE International Conference on Robotics and Automation (ICRA), pages 378–383, 2016.
 [63] Yevgen Chebotar, Mrinal Kalakrishnan, Ali Yahya, Adrian Li, Stefan Schaal, and Sergey Levine. Path integral guided policy search. 2017 IEEE International Conference on Robotics and Automation (ICRA), pages 3381–3388, 2017.
 [64] Emanuel Todorov and Weiwei Li. A generalized iterative LQG method for locally-optimal feedback control of constrained nonlinear stochastic systems. In ACC, 2005.
 [65] Tingwu Wang, Xuchan Bao, Ignasi Clavera, Jerrick Hoang, Yeming Wen, Eric Langlois, S. Zhang, Guodong Zhang, Pieter Abbeel, and Jimmy Ba. Benchmarking modelbased reinforcement learning. ArXiv, abs/1907.02057, 2019.
 [66] Huazhe Xu, Yuanzhi Li, Yuandong Tian, Trevor Darrell, and Tengyu Ma. Algorithmic framework for model-based reinforcement learning with theoretical guarantees. ArXiv, abs/1807.03858, 2018.
 [67] Thanard Kurutach, Ignasi Clavera, Yan Duan, Aviv Tamar, and Pieter Abbeel. Model-ensemble trust-region policy optimization. ArXiv, abs/1802.10592, 2018.
 [68] Grady Williams, Nolan Wagener, Brian Goldfain, Paul Drews, James M. Rehg, Byron Boots, and Evangelos Theodorou. Information theoretic MPC for model-based reinforcement learning. 2017 IEEE International Conference on Robotics and Automation (ICRA), pages 1714–1721, 2017.
 [69] Kendall Lowrey, Aravind Rajeswaran, Sham Kakade, Emanuel Todorov, and Igor Mordatch. Plan Online, Learn Offline: Efficient Learning and Exploration via Model-Based Control. In International Conference on Learning Representations (ICLR), 2019.
 [70] Marc Peter Deisenroth and Carl E. Rasmussen. PILCO: A model-based and data-efficient approach to policy search. In ICML, 2011.
 [71] Vijay R. Konda and John N. Tsitsiklis. Convergence rate of linear two-time-scale stochastic approximation. The Annals of Applied Probability, 2004.
 [72] Prasenjit Karmakar and Shalabh Bhatnagar. Two time-scale stochastic approximation with controlled Markov noise and off-policy temporal-difference learning. Mathematics of Operations Research, 43:130–151, 2015.
 [73] Richard S. Sutton. Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In ICML, 1990.
 [74] Vladimir Feinberg, Alvin Wan, Ion Stoica, Michael I. Jordan, Joseph Gonzalez, and Sergey Levine. Model-based value estimation for efficient model-free reinforcement learning. CoRR, abs/1803.00101, 2018.
 [75] Jacob Buckman, Danijar Hafner, George Tucker, Eugene Brevdo, and Honglak Lee. Sample-efficient reinforcement learning with stochastic ensemble value expansion. ArXiv, abs/1807.01675, 2018.
 [76] Nan Jiang and Lihong Li. Doubly robust off-policy value evaluation for reinforcement learning. In ICML, 2016.
 [77] Philip S. Thomas and Emma Brunskill. Data-efficient off-policy policy evaluation for reinforcement learning. ArXiv, abs/1604.00923, 2016.
 [78] Mehrdad Farajtabar, Yinlam Chow, and Mohammad Ghavamzadeh. More robust doubly robust off-policy evaluation. In ICML, 2018.
 [79] Arun Venkatraman, Martial Hebert, and J. Andrew Bagnell. Improving multi-step prediction of learned time series models. In AAAI, 2015.
 [80] Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. Scheduled sampling for sequence prediction with recurrent neural networks. ArXiv, abs/1506.03099, 2015.
 [81] Sham M. Kakade, Michael Kearns, and John Langford. Exploration in metric state spaces. In ICML, 2003.
 [82] Emanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: A physics engine for model-based control. In IROS, 2012.
 [83] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym, 2016.
 [84] Ashvin Nair, Bob McGrew, Marcin Andrychowicz, Wojciech Zaremba, and Pieter Abbeel. Overcoming exploration in reinforcement learning with demonstrations. CoRR, abs/1709.10089, 2017.
 [85] Evan Greensmith, Peter L. Bartlett, and Jonathan Baxter. Variance reduction techniques for gradient estimates in reinforcement learning. JMLR, 2001.
 [86] Cathy Wu, Aravind Rajeswaran, Yan Duan, Vikash Kumar, Alexandre M. Bayen, Sham M. Kakade, Igor Mordatch, and Pieter Abbeel. Variance reduction for policy gradient with action-dependent factorized baselines. In ICLR, 2018.
 [87] John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation. In ICLR, 2016.
Appendix A Theory
We provide the formal statements and proofs for theoretical results in the paper.
A.1 Performance with Global Models
Lemma 1 restated. (Simulation lemma) Suppose we have a model $\hat{\mathcal{M}}$ such that $D_{TV}\big(P(\cdot|s,a), \hat{P}(\cdot|s,a)\big) \leq \epsilon_{\mathcal{M}}$ for all $(s,a)$, and the reward function is such that $|r(s,a)| \leq R_{\max}$. Then, for any policy $\pi$, we have
$$ \big| J(\pi, \mathcal{M}) - J(\pi, \hat{\mathcal{M}}) \big| \leq \frac{2\, \gamma\, \epsilon_{\mathcal{M}}\, R_{\max}}{(1-\gamma)^2}. $$
Proof.
Let $V^{\pi}_{\mathcal{M}}(s)$ and $V^{\pi}_{\hat{\mathcal{M}}}(s)$ denote the value of policy $\pi$ starting from an arbitrary state $s$ in $\mathcal{M}$ and $\hat{\mathcal{M}}$ respectively. For simplicity of notation, we also define the policy-averaged transition kernels
$$ P^{\pi}(s'|s) = \mathbb{E}_{a \sim \pi(\cdot|s)}\big[ P(s'|s,a) \big], \qquad \hat{P}^{\pi}(s'|s) = \mathbb{E}_{a \sim \pi(\cdot|s)}\big[ \hat{P}(s'|s,a) \big]. $$
Before the proof, we note the following useful observations.

1. Since $D_{TV}\big(P(\cdot|s,a), \hat{P}(\cdot|s,a)\big) \leq \epsilon_{\mathcal{M}}$ for all $(s,a)$, the inequality also holds for an average over actions, i.e. $D_{TV}\big(P^{\pi}(\cdot|s), \hat{P}^{\pi}(\cdot|s)\big) \leq \epsilon_{\mathcal{M}}$ for all $s$, by convexity of the total variation distance.

2. Since the rewards are bounded, we can achieve a maximum reward of $R_{\max}$ in each time step. Using a geometric summation with discounting $\gamma$, we have $|V^{\pi}_{\mathcal{M}}(s)| \leq \frac{R_{\max}}{1-\gamma}$ for all $s$, and likewise in $\hat{\mathcal{M}}$.

3. Let $f$ be a real-valued function with bounded range, i.e. $|f(x)| \leq C$ for all $x$. Let $P$ and $Q$ be two probability distributions (densities) over the space $\mathcal{X}$. Then, we have
$$ \big| \mathbb{E}_{x \sim P}[f(x)] - \mathbb{E}_{x \sim Q}[f(x)] \big| \leq 2\, C\, D_{TV}(P, Q). $$

Using the above observations, and noting that the reward terms in the Bellman equations cancel since the reward function is shared, we have the following inequalities:
$$ \big| V^{\pi}_{\mathcal{M}}(s) - V^{\pi}_{\hat{\mathcal{M}}}(s) \big| = \gamma\, \Big| \mathbb{E}_{s' \sim P^{\pi}(\cdot|s)}\big[V^{\pi}_{\mathcal{M}}(s')\big] - \mathbb{E}_{s' \sim \hat{P}^{\pi}(\cdot|s)}\big[V^{\pi}_{\hat{\mathcal{M}}}(s')\big] \Big| $$
$$ \leq \gamma\, \Big| \mathbb{E}_{s' \sim P^{\pi}(\cdot|s)}\big[V^{\pi}_{\mathcal{M}}(s')\big] - \mathbb{E}_{s' \sim \hat{P}^{\pi}(\cdot|s)}\big[V^{\pi}_{\mathcal{M}}(s')\big] \Big| + \gamma\, \mathbb{E}_{s' \sim \hat{P}^{\pi}(\cdot|s)}\Big[ \big| V^{\pi}_{\mathcal{M}}(s') - V^{\pi}_{\hat{\mathcal{M}}}(s') \big| \Big] $$
$$ \leq \frac{2\, \gamma\, \epsilon_{\mathcal{M}}\, R_{\max}}{1-\gamma} + \gamma\, \max_{s'} \big| V^{\pi}_{\mathcal{M}}(s') - V^{\pi}_{\hat{\mathcal{M}}}(s') \big|. $$
Since the above bound holds for all states, we have that
$$ \max_{s} \big| V^{\pi}_{\mathcal{M}}(s) - V^{\pi}_{\hat{\mathcal{M}}}(s) \big| \leq \frac{2\, \gamma\, \epsilon_{\mathcal{M}}\, R_{\max}}{1-\gamma} + \gamma\, \max_{s} \big| V^{\pi}_{\mathcal{M}}(s) - V^{\pi}_{\hat{\mathcal{M}}}(s) \big|. $$
Stated alternatively, the above inequality implies
$$ \max_{s} \big| V^{\pi}_{\mathcal{M}}(s) - V^{\pi}_{\hat{\mathcal{M}}}(s) \big| \leq \frac{2\, \gamma\, \epsilon_{\mathcal{M}}\, R_{\max}}{(1-\gamma)^2}. $$
Finally, note that the performance criteria $J(\pi, \mathcal{M})$ and $J(\pi, \hat{\mathcal{M}})$ are simply the average of the value function over the initial state distribution. Since the above inequality holds for all states, it also holds for the average over the initial state distribution. ∎
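The simulation lemma can be sanity-checked numerically. The sketch below (our illustrative construction, not from the paper) evaluates a fixed policy exactly in a true and a perturbed tabular MDP and compares the performance gap against the standard bound $2\gamma \epsilon_{\mathcal{M}} R_{\max} / (1-\gamma)^2$, with $\epsilon_{\mathcal{M}}$ the worst-case total variation error of the model.

```python
import numpy as np

rng = np.random.default_rng(1)
S, A, gamma, R_max = 5, 2, 0.9, 1.0

P = rng.dirichlet(np.ones(S), size=(S, A))            # true dynamics P(s'|s,a)
P_hat = P + 0.05 * rng.dirichlet(np.ones(S), size=(S, A))
P_hat /= P_hat.sum(axis=2, keepdims=True)             # perturbed, renormalized model

R = rng.uniform(-R_max, R_max, size=(S, A))           # bounded rewards |r| <= R_max
pi = rng.dirichlet(np.ones(A), size=S)                # fixed stochastic policy

eps = 0.5 * np.abs(P - P_hat).sum(axis=2).max()       # max_{s,a} D_TV(P, P_hat)

def J(P_dyn):
    """Exact policy evaluation: V = (I - gamma * P_pi)^(-1) r_pi."""
    P_pi = np.einsum('sa,sax->sx', pi, P_dyn)         # policy-averaged transitions
    r_pi = (pi * R).sum(axis=1)
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
    return V.mean()                                   # uniform initial distribution

gap = abs(J(P) - J(P_hat))
bound = 2 * gamma * eps * R_max / (1 - gamma) ** 2
```

Since the bound holds for every MDP instance, the observed gap never exceeds it; in practice the gap is typically far smaller, because the worst-case recursion is loose for random instances.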
A.2 Performance with Task-Driven Local Models
In this section, we relax the global model requirement and consider the case where we have only local models, as well as the case of a policy-model equilibrium pair. We first provide a lemma that characterizes error amplification in local simulation.
Lemma 2.
(Error amplification in local simulation) Let $P_1(s'|s)$ and $P_2(s'|s)$ be two Markov chains with the same initial state distribution. Let $\mu_t^{(1)}$ and $\mu_t^{(2)}$ be the marginal distributions over states at time $t$ when following $P_1$ and $P_2$ respectively. Suppose
$$ \mathbb{E}_{s \sim \mu_t^{(1)}}\Big[ D_{TV}\big(P_1(\cdot|s), P_2(\cdot|s)\big) \Big] \leq \epsilon \quad \forall t, $$
then, the marginal distributions are bounded as:
$$ D_{TV}\big(\mu_t^{(1)}, \mu_t^{(2)}\big) \leq \epsilon\, t. $$
Proof.
Let us fix a state $s'$, and let $s$ denote a "dummy" state variable. Then,
$$ \mu_{t+1}^{(1)}(s') - \mu_{t+1}^{(2)}(s') = \sum_{s} \Big( \big(P_1(s'|s) - P_2(s'|s)\big)\, \mu_t^{(1)}(s) + P_2(s'|s)\, \big(\mu_t^{(1)}(s) - \mu_t^{(2)}(s)\big) \Big). $$
Using the above decomposition, we have
$$ 2\, D_{TV}\big(\mu_{t+1}^{(1)}, \mu_{t+1}^{(2)}\big) = \sum_{s'} \big| \mu_{t+1}^{(1)}(s') - \mu_{t+1}^{(2)}(s') \big| \leq 2\epsilon + 2\, D_{TV}\big(\mu_t^{(1)}, \mu_t^{(2)}\big) \leq \cdots \leq 2\epsilon\, (t+1), $$
where the last step uses the previous inequality recursively till $t = 0$, where the Markov chains have the same (initial) state distribution. ∎
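The linear error growth in Lemma 2 can be verified numerically by propagating the exact marginals of two nearby Markov chains. The sketch below is our illustrative check; it uses the stronger worst-case per-step TV bound for $\epsilon$, which implies the expectation condition in the lemma.

```python
import numpy as np

rng = np.random.default_rng(2)
S = 6

P1 = rng.dirichlet(np.ones(S), size=S)                # chain 1: rows are P1(.|s)
P2 = P1 + 0.02 * rng.dirichlet(np.ones(S), size=S)    # nearby chain 2
P2 /= P2.sum(axis=1, keepdims=True)

eps = 0.5 * np.abs(P1 - P2).sum(axis=1).max()         # worst-case per-step D_TV
mu1 = mu2 = np.full(S, 1.0 / S)                       # shared initial distribution

violations = 0
for t in range(1, 101):
    mu1, mu2 = mu1 @ P1, mu2 @ P2                     # exact marginal propagation
    tv = 0.5 * np.abs(mu1 - mu2).sum()
    if tv > eps * t + 1e-12:                          # check D_TV(mu_t) <= eps * t
        violations += 1
```

In practice the observed divergence is usually much slower than $\epsilon t$, since the bound does not exploit mixing of the chains.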
The above lemma considers the error between two Markov chains. Note that fixing a policy $\pi$ in an MDP results in a Markov chain over states. Thus, fixing the policy, we can use the above lemma to compare the resulting Markov chains in $\mathcal{M}$ and $\hat{\mathcal{M}}$. Consider the following definitions:
$$ \mu^{\pi}_{\mathcal{M}}(s) = \frac{1}{T} \sum_{t=0}^{T-1} \mu_t(s), \qquad \nu^{\pi}_{\mathcal{M}}(s) = (1-\gamma) \sum_{t=0}^{\infty} \gamma^t\, \mu_t(s). $$
The first distribution is the average state visitation distribution when executing $\pi$ in $\mathcal{M}$, and $T$ is the episode duration (could tend to $\infty$ in the non-episodic case). The second distribution is the discounted state visitation distribution when executing $\pi$ in $\mathcal{M}$. Let $\mu^{\pi}_{\hat{\mathcal{M}}}$ and $\nu^{\pi}_{\hat{\mathcal{M}}}$ be their analogues in $\hat{\mathcal{M}}$. When learning the dynamics model, we would minimize the prediction error under $\mu^{\pi}_{\mathcal{M}}$, while $J(\pi, \hat{\mathcal{M}})$ is dependent on rewards under $\nu^{\pi}_{\hat{\mathcal{M}}}$. Let $\mu_t$ be the marginal distribution at time $t$ when following $\pi$ in $\mathcal{M}$. Let $\hat{\mu}_t$ be analogously defined when following $\pi$ in $\hat{\mathcal{M}}$. Using these definitions, we first characterize the difference in performance of the same policy under $\mathcal{M}$ and $\hat{\mathcal{M}}$.
Lemma 3.
(Performance difference due to model error) Let $\mathcal{M}_1$ and $\mathcal{M}_2$ be two different MDPs differing only in their transition dynamics – $P_1(\cdot|s,a)$ and $P_2(\cdot|s,a)$. Let the absolute value of rewards be bounded by $R_{\max}$. Fix a policy $\pi$ for both $\mathcal{M}_1$ and $\mathcal{M}_2$, and let $\mu_t^{(1)}$ and $\mu_t^{(2)}$ be the resulting marginal state distributions at time $t$. If the MDPs are such that
$$ \mathbb{E}_{s \sim \mu_t^{(1)}}\Big[ D_{TV}\big(P_1^{\pi}(\cdot|s), P_2^{\pi}(\cdot|s)\big) \Big] \leq \epsilon \quad \forall t, $$
then, the performance difference is bounded as:
$$ \big| J(\pi, \mathcal{M}_1) - J(\pi, \mathcal{M}_2) \big| \leq \frac{2\, \gamma\, \epsilon\, R_{\max}}{(1-\gamma)^2}. $$
Proof.
Recall that the performance of a policy can be written as:
$$ J(\pi, \mathcal{M}_1) = \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^t\, r(s_t, a_t) \right] = \frac{1}{1-\gamma}\, \mathbb{E}_{s \sim \nu^{(1)}}\big[ r^{\pi}(s) \big], $$
where the randomness for the second term is due to the discounted state visitation $\nu^{(1)}$ of $\pi$ in $\mathcal{M}_1$, and $\pi$. We can analogously write $J(\pi, \mathcal{M}_2)$ as well. Thus, using the bounded rewards, the performance difference can be bounded as:
$$ \big| J(\pi, \mathcal{M}_1) - J(\pi, \mathcal{M}_2) \big| \leq \frac{2\, R_{\max}}{1-\gamma}\, D_{TV}\big(\nu^{(1)}, \nu^{(2)}\big). $$
Also recall that we have
$$ \nu^{(i)}(s) = (1-\gamma) \sum_{t=0}^{\infty} \gamma^t\, \mu_t^{(i)}(s), \qquad i \in \{1, 2\}. $$
We can bound the discounted state visitation distribution as
$$ D_{TV}\big(\nu^{(1)}, \nu^{(2)}\big) \leq (1-\gamma) \sum_{t=0}^{\infty} \gamma^t\, D_{TV}\big(\mu_t^{(1)}, \mu_t^{(2)}\big) \leq (1-\gamma)\, \epsilon \sum_{t=0}^{\infty} \gamma^t\, t, $$
where the last inequality uses Lemma 2. Notice that the final summation is an arithmetico–geometric series. When simplified, this results in
$$ D_{TV}\big(\nu^{(1)}, \nu^{(2)}\big) \leq (1-\gamma)\, \epsilon \cdot \frac{\gamma}{(1-\gamma)^2} = \frac{\gamma\, \epsilon}{1-\gamma}. $$
Using this bound for the performance difference yields the desired result. ∎
Remarks: The performance difference (due to model error) lemma we present is quite distinct from the performance difference lemma of [9]. Specifically, our lemma bounds the performance difference of the same policy in two different models. In contrast, the lemma from [9] characterizes the performance difference between two different policies in the same model.
Finally, we study the global performance guarantee when we have a policymodel pair close to equilibrium.
Theorem 1 restated. (Global performance of equilibrium pair) Suppose we have a policy-model pair $(\pi, \hat{\mathcal{M}})$ such that the following conditions hold simultaneously:
$$ \mathbb{E}_{s \sim \mu_t^{\pi}}\Big[ D_{KL}\big( P(\cdot|s, \pi), \hat{P}(\cdot|s, \pi) \big) \Big] \leq 2\, \epsilon_{\mathcal{M}}^2 \quad \forall t, \qquad J(\pi, \hat{\mathcal{M}}) \geq \sup_{\pi'} J(\pi', \hat{\mathcal{M}}) - \epsilon_{\pi}. $$
Let $\pi^*$ be an optimal policy such that $J(\pi^*, \mathcal{M}) = \sup_{\pi'} J(\pi', \mathcal{M})$. Then, the performance gap is bounded by:
$$ J(\pi^*, \mathcal{M}) - J(\pi, \mathcal{M}) \leq \epsilon_{\pi} + \frac{2\, \gamma\, \epsilon_{\mathcal{M}}\, R_{\max}}{(1-\gamma)^2} + \underbrace{\big( J(\pi^*, \mathcal{M}) - J(\pi^*, \hat{\mathcal{M}}) \big)}_{\text{transfer (domain adaptation) term}}. $$
Proof.
We first simplify the performance difference, and subsequently bound the different terms. Let $\hat{\pi}^*$ be an optimal policy in the model, so that $J(\hat{\pi}^*, \hat{\mathcal{M}}) = \sup_{\pi'} J(\pi', \hat{\mathcal{M}})$. We can decompose the performance difference due to various contributions as:
$$ J(\pi^*, \mathcal{M}) - J(\pi, \mathcal{M}) = \underbrace{J(\pi^*, \mathcal{M}) - J(\pi^*, \hat{\mathcal{M}})}_{\text{Term I}} + \underbrace{J(\pi^*, \hat{\mathcal{M}}) - J(\pi, \hat{\mathcal{M}})}_{\text{Term II}} + \underbrace{J(\pi, \hat{\mathcal{M}}) - J(\pi, \mathcal{M})}_{\text{Term III}}. $$
Let us first consider Term II, which is related to the suboptimality in the planning problem. Notice that we have:
$$ J(\pi^*, \hat{\mathcal{M}}) - J(\pi, \hat{\mathcal{M}}) = \big( J(\pi^*, \hat{\mathcal{M}}) - J(\hat{\pi}^*, \hat{\mathcal{M}}) \big) + \big( J(\hat{\pi}^*, \hat{\mathcal{M}}) - J(\pi, \hat{\mathcal{M}}) \big) \leq 0 + \epsilon_{\pi}. $$
The first difference is $\leq 0$ since $\hat{\pi}^*$ is the optimal policy in the model, and the second term is small due to the approximate equilibrium condition.
For Term III, we will draw upon the model error performance difference lemma (Lemma 3). Note that the equilibrium condition of low KL error, along with Pinsker's inequality and Jensen's inequality, implies
$$ \mathbb{E}_{s \sim \mu_t^{\pi}}\Big[ D_{TV}\big( P(\cdot|s, \pi), \hat{P}(\cdot|s, \pi) \big) \Big] \leq \epsilon_{\mathcal{M}} \quad \forall t. $$
Using this and Lemma 3, we have
$$ \big| J(\pi, \hat{\mathcal{M}}) - J(\pi, \mathcal{M}) \big| \leq \frac{2\, \gamma\, \epsilon_{\mathcal{M}}\, R_{\max}}{(1-\gamma)^2}. $$
Finally, Term I is a transfer learning term that measures the error of $\hat{\mathcal{M}}$ (which has low error under the distribution of $\pi$) under the distribution of $\pi^*$. The performance difference can be written as