1 Introduction
The model-based approach to reinforcement learning (RL) offers a unique framework for addressing three important artificial intelligence (AI) problems: understanding the dynamics of the environment through interaction, using the acquired knowledge for planning, and performing sequential decision making. One promise of model-based RL is to enable sample-efficient learning
(Sutton, 1990; Deisenroth and Rasmussen, 2011; Levine and Abbeel, 2014). This advantage is well-understood in theory in settings such as value-function estimation (Azar et al., 2012), exploration in discounted and finite-horizon Markov decision processes (MDPs)
(Szita and Szepesvári, 2010; Dann and Brunskill, 2015), and exploration in contextual decision processes (Sun et al., 2018).

Planning has long been a fundamental research area in AI (Hart et al., 1972; Russell and Norvig, 2016). In the context of model-based RL, Sutton and Barto (2018) define planning as any process that goes from a model to a value function or a policy. They describe two notions of planning, namely decision-time planning and background planning. In decision-time planning, the agent utilizes the model during action selection by performing tree search (Kearns and Singh, 2002; Kocsis and Szepesvári, 2006; Silver et al., 2016). An agent is said to be performing background planning if it utilizes the model to update its value or policy network. Examples of background planning include the Dyna architecture (Sutton, 1990; Sutton et al., 2008) and model-based policy search (Abbeel et al., 2006; Deisenroth and Rasmussen, 2011; Kurutach et al., 2018). We call an algorithm model-based if it performs either background or decision-time planning.
A key aspect distinguishing model-based RL from traditional planning is that the model is learned from experience. As such, the model may be imperfect due to ineffective generalization (Abbeel et al., 2006; Nagabandi et al., 2018), inadequate exploration (Brafman and Tennenholtz, 2002; Shyam et al., 2018), overfitting (Asadi et al., 2018b), or irreducible errors in unrealizable settings (Shalev-Shwartz and Ben-David, 2014; Talvitie, 2014). More generally, a common view across various scientific disciplines is that all models are wrong, though some are still useful (Box, 1976; Wit et al., 2012).
Previous work explored learning models that look a single step ahead. To predict multiple steps ahead, the starting point of each step is set to the end point of the previous step. Unfortunately, when the model is wrong, this procedure can interfere with successful planning (Talvitie, 2014; Venkatraman et al., 2015; Asadi et al., 2018b). A significant reason for this failure is that the model may produce a “fake” input, meaning an input that cannot possibly occur in the domain, which is then fed back to the unprepared one-step model. Notice that two sources of error coexist: the model is imperfect, and the model gets an inaccurate input in all but the first step. The interplay between the two errors leads to what is referred to as the compounding-error problem (Asadi et al., 2018b). To mitigate the problem, Talvitie (2014) and Venkatraman et al. (2015) provided an approach, called hallucination, that prepares the model for the fake inputs generated by itself. In contrast, our approach to the problem is to avoid feeding such fake inputs altogether by using a multi-step model. Though multi-step algorithms are popular in the model-free setting (Sutton and Barto, 2018; Asis et al., 2017; Singh and Sutton, 1996; Precup, 2000), their extension to the model-based setting remains less explored.
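To make the compounding-error problem concrete, the sketch below iterates an imperfect one-step model on its own predictions. The dynamics and the biased "learned" model are toy stand-ins invented for illustration, not the paper's experimental setup; the point is that after the first step the model receives an input it never produced correct predictions for, so the error grows with depth.

```python
import numpy as np

def true_step(s):
    # True (deterministic) toy dynamics: a mild nonlinear map.
    return np.tanh(1.2 * s)

def learned_step(s):
    # Imperfect one-step model: the true dynamics plus a fixed bias.
    return np.tanh(1.2 * s) + 0.05

def rollout_error(s0, horizon):
    """Compare a rollout that feeds the model its own (slightly wrong)
    predictions against the true trajectory, step by step."""
    s_true, s_model = s0, s0
    errors = []
    for _ in range(horizon):
        s_true = true_step(s_true)
        s_model = learned_step(s_model)  # input is already off after step 1
        errors.append(abs(s_model - s_true))
    return errors

errs = rollout_error(0.1, horizon=10)
# The first-step error equals the model bias; deeper errors compound.
```

Even though the per-step model error is a constant 0.05, the rollout error several steps in is many times larger, which is the failure mode the multi-step model is designed to avoid.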
Our main contribution is to propose a Multi-step Model for Model-based RL that directly predicts the outcome of executing a sequence of actions. Learning the multi-step model allows us to avoid feeding fake inputs to the model. We further introduce a novel rollout procedure in which the original first state of the rollout serves as the starting point across all rollout steps. Our theory shows that, relative to the one-step model, learning the multi-step model is more effective for value estimation. To this end, we study the hardness of learning the multi-step model through the lens of Rademacher complexity (Bartlett and Mendelson, 2002). Finally, we empirically evaluate the multi-step model and show its advantage relative to the one-step model in the context of background planning and decision-time planning.
2 Background and Notation
To formulate the reinforcement-learning problem, we use finite-horizon Markov decision processes (MDPs) with continuous states and discrete actions. See Puterman (2014) for a thorough treatment of MDPs, and Sutton and Barto (2018) for an introduction to reinforcement learning.
2.1 Lipschitz Continuity
Following previous work (Berkenkamp et al., 2017; Asadi et al., 2018b; Luo et al., 2018; Gelada et al., 2019) we make assumptions on the smoothness of models, characterized below.
Definition 1.
Given metric spaces $(X, d_X)$ and $(Y, d_Y)$, a function $f: X \to Y$ is Lipschitz if
$$K(f) := \sup_{x_1 \neq x_2} \frac{d_Y\big(f(x_1), f(x_2)\big)}{d_X(x_1, x_2)}$$
is finite. Similarly, $f: X \times A \to Y$ is uniformly Lipschitz in $A$ if the quantity below is finite:
$$\sup_{a \in A}\ \sup_{x_1 \neq x_2} \frac{d_Y\big(f(x_1, a), f(x_2, a)\big)}{d_X(x_1, x_2)}.$$
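As a small numerical illustration of the Lipschitz definition above (a toy example, not from the paper): sampling pairs of points gives a lower bound on the Lipschitz constant, since the supremum is over all pairs. For $\sin$, whose true constant is 1, the sampled ratio stays at or below 1.

```python
import numpy as np

rng = np.random.default_rng(3)
f = np.sin

# Sample 5000 random pairs and evaluate the slope ratio for each pair.
xs = rng.uniform(-3, 3, size=(5000, 2))
ratios = np.abs(f(xs[:, 0]) - f(xs[:, 1])) / np.abs(xs[:, 0] - xs[:, 1])

# A lower bound on the Lipschitz constant K(f); for sin, K(f) = 1.
k_hat = ratios.max()
```

Each ratio equals |cos| at some intermediate point by the mean value theorem, so no sampled ratio can exceed 1, and with enough samples the maximum approaches it.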
Table 1: Overloaded transition-model notation. For the policy, the definition is the probability of taking an action in a state (no agent approximation is needed); for the transition function, the definition is the MDP state reached after taking an action, or a sequence of actions, in a state, together with the agent's learned approximation.
2.2 Rademacher Complexity
We use Rademacher complexity for our sample-complexity analysis. We define this measure below; for details, see Bartlett and Mendelson [2002] or Mohri et al. [2012]. See also Jiang et al. (2015) and Lehnert et al. (2018) for previous applications of Rademacher complexity in reinforcement learning.
Definition 2.
Consider a function $f: X \to \mathbb{R}$ and a set of such functions $F$. Given points $x_1, \dots, x_n \in X$, the Rademacher complexity of this set, $\mathrm{Rad}(F)$, is defined as:
$$\mathrm{Rad}(F) := \mathbb{E}_{\sigma}\Big[\sup_{f \in F} \frac{1}{n} \sum_{i=1}^{n} \sigma_i f(x_i)\Big],$$
where the $\sigma_i$, referred to as Rademacher random variables, are drawn uniformly at random from $\{-1, +1\}$.

The Rademacher variables can be thought of as independent and identically distributed noise. Under this view, the average $\frac{1}{n}\sum_{i} \sigma_i f(x_i)$ quantifies the extent to which $f$ matches the noise. We have a high Rademacher complexity for a complicated hypothesis space that can accurately match noise. Conversely, a simple hypothesis space has a low Rademacher complexity.
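The definition lends itself to a Monte-Carlo estimate: draw many Rademacher sign vectors and average the per-draw supremum. The sketch below uses two toy hypothesis classes invented for illustration (constant functions versus threshold functions containing them), showing that the richer class matches the noise better.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=50)  # a fixed sample of n = 50 points

def empirical_rademacher(funcs, x, n_draws=2000):
    """Monte-Carlo estimate of E_sigma[ sup_f (1/n) sum_i sigma_i f(x_i) ]."""
    n = len(x)
    values = np.stack([f(x) for f in funcs])          # |F| x n
    sigmas = rng.choice([-1.0, 1.0], size=(n_draws, n))
    # For each sigma draw, the best correlation any f achieves with the noise.
    sups = (sigmas @ values.T / n).max(axis=1)
    return float(sups.mean())

# Simple class: the two constant functions {-1, +1}.
simple = [lambda z: np.ones_like(z), lambda z: -np.ones_like(z)]
# Richer class: threshold functions (which includes both constants,
# via thresholds outside the data range).
rich = [lambda z, t=t: np.where(z > t, 1.0, -1.0)
        for t in np.linspace(-1.1, 1.1, 45)]

r_simple = empirical_rademacher(simple, x)
r_rich = empirical_rademacher(rich, x)
# The richer class can match more sign patterns of the noise.
```

Since every function in the simple class also lies in the rich class, the estimate for the rich class is (in expectation) at least as large, matching the intuition in the text.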
2.3 Transition-Model Notation
Our theoretical results focus on MDPs with deterministic transitions. This common assumption (Abbeel et al., 2006; Talvitie, 2017) simplifies notation and analysis. Note that a deterministic environment, like nearly all Atari games (Bellemare et al., 2013), can still be quite complex and challenging for model-based reinforcement learning (Azizzadenesheli et al., 2018). We introduce an extension of our results to a stochastic setting in the Appendix.
We use an overloaded notation for the transition model (see Table 1). Notice that our definitions support stochastic multi-step transitions, since these transitions are policy dependent and we allow for stochastic policies.
3 A Multi-step Model for Model-based Reinforcement Learning
Our multi-step model is an extension of the one-step model—rather than only predicting a single step ahead, it learns to predict multiple steps ahead, with a separate function for each horizon. By the multi-step model, we mean the set of these functions.
This model differs from the few examples of multi-step models studied in prior work: Sutton (1995) as well as van Seijen and Sutton (2015) considered multi-step models that do not take actions as input, but are implicitly conditioned on the current policy. Similarly, option models are multi-step models that are conditioned on one specific policy and a termination condition (Precup and Sutton, 1998; Sutton et al., 1999; Silver and Ciosek, 2012). Finally, Silver et al. (2017) introduced a multi-step model that directly predicts future values, but the model is defined only for prediction tasks.
We now introduce a new rollout procedure using the multi-step model. By a multi-step rollout we mean sampling the next action using the agent’s fixed policy, then computing the next state using the agent’s model, and iterating this procedure for the remaining steps. The new procedure obviates the need for the model to consume its own output.
To this end, we derive an approximate expression for:
Key to our approach is to rewrite this quantity in terms of predictions conditioned on action sequences, as shown below:
Observe that, given the model introduced above, the first term is readily available—we only need to focus on the remaining quantity. Intuitively, we need to compute the probability of taking a particular sequence of actions over the next several steps starting from the current state. This probability is determined by the states observed along the way, and can be written as follows:
We can compute this quantity given the model's predictions for the earlier horizons. Continuing for the remaining steps:
which we can compute given the first functions of the multi-step model.
Finally, to compute a rollout, we sample an action sequence by sampling from the policy at each step:
Notice that, in the above rollout, we have used the first state as the starting point of every single rollout step. Crucially, we do not feed the intermediate state predictions back to the model as input. We hypothesize that this approach can combat the compounding-error problem by removing one source of error, namely feeding the model a noisy input, which is otherwise present in rollouts using the one-step model. We illustrate the procedure in Figure 1 to juxtapose this new rollout procedure with the standard rollout performed using the one-step model. In the rest of the paper, we test this hypothesis in theory and practice.
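The mechanics of this rollout can be sketched as follows. The dynamics, policy, and the `predict(i, s0, actions)` interface are toy stand-ins invented for illustration (the paper's models are learned); the essential point is that every prediction is conditioned on the original state and the sampled action sequence, never on an intermediate prediction.

```python
def true_step(s, a):
    # Toy deterministic dynamics used to build a stand-in multi-step model.
    return 0.9 * s + a

def predict(i, s0, actions):
    """Stand-in for the i-th multi-step function: the state i+1 steps
    after s0 under the given action sequence."""
    s = s0
    for a in actions[: i + 1]:
        s = true_step(s, a)
    return s

def multi_step_rollout(predict, policy, s0, horizon):
    """Each step queries the model FROM s0 with the growing action
    sequence; intermediate predictions are never fed back as inputs."""
    actions, states = [], []
    s = s0
    for i in range(horizon):
        actions.append(policy(s))    # sample the next action
        s = predict(i, s0, actions)  # always predict starting from s0
        states.append(s)
    return states

policy = lambda s: 0.1  # a fixed toy policy
traj = multi_step_rollout(predict, policy, s0=1.0, horizon=5)
```

With these toy dynamics and s0 = 1.0, every state along the rollout stays at the fixed point 1.0, which makes the bookkeeping easy to check by hand.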
4 Value-Function Error Bounds
We argued above that a multi-step model can better deal with the compounding-error problem. We now formalize this claim in the context of policy evaluation. Specifically, we show a bound on the value-function estimation error of a fixed policy in terms of the error in the agent’s model, while highlighting similar bounds from prior work (Ross et al., 2011; Talvitie, 2017; Asadi et al., 2018b). All proofs can be found in the Appendix. Note also that in all expectations below, actions and states are distributed according to the agent’s fixed policy and its stationary distribution, respectively.
Theorem 1.
Define the multi-step value function as in Lemma 3 of the Appendix; then
Moving to the multi-step case, we have the following result:
Theorem 2.
Note that, by leveraging the multi-step model in Theorem 2, we removed a factor from each summand of the bound in Theorem 1. This result suggests that the multi-step model is more conducive to value-function learning as long as the multi-step generalization error grows slowly with the horizon. In the next section, we show that this property holds under weak assumptions, concluding that the bound from Theorem 2 is an improvement over the bound from Theorem 1.
5 Analysis of Generalization Error
We now study the sample complexity of learning multi-step dynamics. To formulate the problem, consider a dataset of sampled transitions. Consider a scalar-valued function of the state, where the state space has a fixed dimension, and a set of such functions. Further, consider a function derived from these and the corresponding set of such derived functions.
For each horizon, we learn a function such that:
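A minimal sketch of this per-horizon training, under assumptions that are ours rather than the paper's: toy linear dynamics and plain least squares in place of learned neural networks. Each horizon gets its own regressor mapping the initial state and action sequence directly to the resulting state, so no regressor ever consumes another regressor's output.

```python
import numpy as np

rng = np.random.default_rng(1)

def gen_trajectory(T=6):
    """Toy trajectories from linear dynamics s' = 0.8 s + 0.5 a
    (assumed purely for illustration)."""
    s = rng.normal()
    states, actions = [s], []
    for _ in range(T):
        a = rng.normal()
        s = 0.8 * s + 0.5 * a
        actions.append(a)
        states.append(s)
    return states, actions

trajs = [gen_trajectory() for _ in range(200)]

def fit_horizon(i):
    """Least-squares fit of the i-step predictor: (s0, a_1..a_i) -> s_i."""
    X, y = [], []
    for states, actions in trajs:
        X.append([states[0]] + actions[:i])
        y.append(states[i])
    w, *_ = np.linalg.lstsq(np.asarray(X), np.asarray(y), rcond=None)
    return w

models = [fit_horizon(i) for i in range(1, 4)]
# E.g. the 2-step predictor should recover s2 = 0.64 s0 + 0.4 a1 + 0.5 a2.
```

Because the toy data is noise-free and linear, each regressor recovers the composed dynamics exactly, e.g. the two-step weights are the one-step weights composed with themselves.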
Lemma 1.
For any function in this set, with the given training error and with probability at least $1 - \delta$:
The second and third terms of the bound do not depend on the horizon. Therefore, to understand how the generalization error grows with the horizon, we examine how the Rademacher complexity of the hypothesis space depends on it.
Lemma 2.
For the hypothesis space shown in Figure 7 (see Appendix):
In our experiments, we represent action sequences with one-hot vectors. We finally get the desired result:
Theorem 3.
Define the constant and the constant . Then:

with the one-step model,

with the multi-step model,
Note the reduction by a multiplicative factor in the coefficient of the error term; this factor is typically larger than one.
6 Discussion
As training the multi-step model involves training a separate function per horizon, the computational complexity of learning is correspondingly higher than that of the one-step model. However, the two rollout procedures shown in Figure 1 require equal computation, so in cases where planning is the bottleneck, the overall computational complexity of the two approaches should be similar. Specifically, in our experiments we observed that the overall running time under the multi-step model was always less than double the running time under the one-step model.
Previous work has identified various ways of improving one-step models: quantifying model uncertainty during planning (Deisenroth and Rasmussen, 2011; Gal et al., 2018), hallucinating training examples (Talvitie, 2014; Azizzadenesheli et al., 2018; Venkatraman et al., 2015; Oh et al., 2015; Bengio et al., 2015), using ensembles (Kurutach et al., 2018; Shyam et al., 2018), or model regularization (Jiang et al., 2015; Asadi et al., 2018b). These methods are independent of the multi-step model idea; future work will explore the benefits of combining these advances with the multi-step model.
7 Empirical Results
The goal of our experiments is to investigate whether the multi-step model can perform better than the one-step model in several model-based scenarios. We specifically set up experiments in background planning and decision-time planning to test this hypothesis. Note also that we provide code for all experiments in our supplementary material.
7.1 Background Planning
As mentioned before, one use case for models is to enable fast value-function estimation. We compare the utility of the one-step model and the multi-step model for this purpose. For this experiment, we used the all-action variant of the actor-critic algorithm, in which the value function (or the critic) is used to estimate the policy gradient (Sutton et al., 2000; Allen et al., 2017). Note that it is standard to learn the value function model-free:
Mnih et al. (2016) generalize this objective to a multi-step target. Crucially, in the model-free case, we compute this target using the single trajectory observed during environmental interaction. However, because the policy is stochastic, the target is a random variable with some variance. To reduce variance, we can use a learned model to generate an arbitrary number of rollouts (5 in our experiments), compute the target
for each rollout, and average them. We compared the effectiveness of both the one-step and the multi-step models in generating useful rollouts for learning. To ensure a meaningful comparison, we perform the same number of value-function updates for each algorithm. We used three standard RL domains from OpenAI Gym (Brockman et al., 2016), namely Cart Pole, Acrobot, and Lunar Lander. Results are summarized in Figures 2 and 3.

To better understand the advantage of the multi-step model, we show per-episode transition error under the two models. Figure 8 (see Appendix) clearly indicates that the multi-step model is more accurate for longer horizons, consistent with the theoretical results presented earlier. Note that in this experiment we did not use the model for action selection; we simply queried the policy network and sampled an action from the distribution it provides given a state input.
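The variance-reduction step can be sketched as follows. The rollout generator and value function below are toy stand-ins (the reward noise and zero value function are our assumptions for illustration); the paper's choice of 5 model rollouts per target is kept.

```python
import numpy as np

rng = np.random.default_rng(2)

def n_step_target(rewards, v_end, gamma=0.99):
    """Discounted n-step return: r_1 + g r_2 + ... + g^{n-1} r_n + g^n V(s_n)."""
    g, total = 1.0, 0.0
    for r in rewards:
        total += g * r
        g *= gamma
    return total + g * v_end

def averaged_target(model_rollout, value_fn, s, n=3, k=5):
    """Average the n-step target over k model-generated rollouts
    (k = 5 as in the experiments) to reduce the variance coming
    from a stochastic policy."""
    targets = []
    for _ in range(k):
        rewards, s_end = model_rollout(s, n)
        targets.append(n_step_target(rewards, value_fn(s_end)))
    return float(np.mean(targets))

# Toy stand-ins: a rollout with noisy rewards and a zero value function.
toy_rollout = lambda s, n: ([1.0 + rng.normal(0, 0.1) for _ in range(n)], s)
target = averaged_target(toy_rollout, lambda s: 0.0, s=0.0)
```

Averaging over k rollouts shrinks the standard deviation of the target by roughly a factor of sqrt(k), at the cost of k model queries.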
7.2 Decision-time Planning
We now use the model for action selection. A common strategy is to greedily select the action that maximizes the learned action-value function; we call this the model-free strategy hereafter. Our goal is to compare the effectiveness of this model-free strategy with its model-based counterparts, using either the one-step or the multi-step model.
A key choice in decision-time planning is the strategy used to construct the tree. One approach is to expand the tree for each action in each observed state (Azizzadenesheli et al., 2018). The main problem with this strategy is that the number of nodes grows exponentially. Alternatively, using a learned action-value function, at each state we can expand only the most promising action. Clearly, given the same amount of computation, the second strategy can benefit from performing deeper lookaheads. The two strategies are illustrated in Figure 4 (left).
Note that, because the model is trained from experience, it is only accurate up to a certain depth. Therefore, when we reach a specified planning horizon, we simply use the learned action-value function as an estimate of the future sum of rewards from the leaf node. While this estimate can be erroneous, we found it necessary; otherwise the agent becomes myopic, in the sense that it only considers the short-term effects of its actions.
The second question is how to determine the best action given the built tree. One possibility is to add all rewards along the path to the value of the leaf node, and choose the action that maximizes this number. As shown in Figure 4 (right), another idea is to use an ensemble, where the final value of the action is computed as the mean of the different estimates along the rollout. This idea is based on the notion that, in machine learning, averaging many estimates can often lead to a better estimate than any individual one (Schapire, 2003; Caruana et al., 2004).

The two tree-expansion strategies and the two action-value estimation strategies together constitute four possible combinations. To find the most effective combination, we first performed an experiment in Lunar Lander where, given differently pre-trained action-value functions, we computed the improvement that the model-based policy offers relative to the model-free policy. We trained these functions using the DQN algorithm (Mnih et al., 2013) and stored weights every 100 episodes, giving us 20 snapshots. The models were also trained using the same amount of data that each snapshot was trained on. We then tested the four strategies (no learning was performed during testing). For each episode, we took the frozen network of that episode and compared the performance of the different policies given the snapshot and the trained models. Here, by performance we mean the average episode return over 20 episodes.
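Combining the greedy (most-promising-action) expansion with ensemble value estimation can be sketched as below. The dynamics, reward, and Q-function are toy stand-ins invented for illustration, not the paper's Lunar Lander setup; the structure mirrors the text: expand only the greedy action at each depth, form one bootstrapped estimate per depth, and score each first action by the mean of those estimates.

```python
import numpy as np

def plan_action(s0, q_fn, step_model, reward_fn, actions, depth=3, gamma=0.99):
    """Score each first action by rolling forward greedily under q_fn and
    averaging the per-depth bootstrapped value estimates (the ensemble)."""
    scores = {}
    for a0 in actions:
        s, ret, g = s0, 0.0, 1.0
        estimates = []
        a = a0
        for _ in range(depth):
            r = reward_fn(s, a)
            s = step_model(s, a)
            ret += g * r
            g *= gamma
            # One estimate per depth: rewards so far + bootstrapped leaf value.
            estimates.append(ret + g * max(q_fn(s, b) for b in actions))
            a = max(actions, key=lambda b: q_fn(s, b))  # greedy expansion
        scores[a0] = float(np.mean(estimates))
    return max(scores, key=scores.get)

# Toy problem: the state is a number, actions move it, reward favors 0.
step = lambda s, a: s + a
reward = lambda s, a: -abs(s + a)
q = lambda s, a: -abs(s + a)   # a rough, hypothetical action-value estimate
best = plan_action(2.0, q, step, reward, actions=(-1, 1))
```

From state 2.0, the planner prefers the action that moves toward 0, since every per-depth estimate along that branch dominates the alternative.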
Results, averaged over 200 runs, are presented in Figure 5 (left), and show an advantage for the ensemble and optimal-action combination (labeled optimal ensemble). Note that, in all four cases, the model used for tree search was the one-step model, so this served as an experiment to find the best combination under that model. We then performed the same experiment with the multi-step model, as shown in Figure 5 (right), using the best combination (i.e., optimal-action expansion with ensemble value computation). We clearly see that the multi-step model is more useful in this scenario as well.
We further investigated whether this superiority in action selection can actually accelerate DQN training as well. In this scenario, we ran DQN under different policies, namely model-free, model-based with the one-step model, and model-based with the multi-step model. In all cases, we chose a random action with a small probability for exploration. See Figure 6, which again shows the benefit of the multi-step model for decision-time planning in model-based RL.
8 Conclusion
We introduced an approach to multi-step model-based RL and provided results that suggest its promise. We introduced the multi-step model along with a new rollout procedure that exploits it. The combination proved useful from both a theoretical and a practical point of view. Together, our results make a strong case for multi-step model-based RL.
9 Future Work
An important avenue for future work is to better explore methods appropriate for stochastic environments. (See the Appendix for an extension in this direction.) Another thread to explore is an ensemble method enabled by the multi-step model. Specifically, we can show that a multi-step prediction can be estimated in exponentially many ways, and combining these estimates can lead to a better final estimate (see Appendix).
References
 Abbeel et al. [2006] Pieter Abbeel, Morgan Quigley, and Andrew Y Ng. Using inaccurate models in reinforcement learning. In ICML, 2006.
 Allen et al. [2017] Cameron Allen, Kavosh Asadi, Melrose Roderick, Abdelrahman Mohamed, George Konidaris, and Michael Littman. Mean actor critic. arXiv preprint arXiv:1709.00503, 2017.
 Asadi et al. [2018a] Kavosh Asadi, Evan Cater, Dipendra Misra, and Michael L. Littman. Equivalence between Wasserstein and value-aware model-based reinforcement learning. In ICML Workshop on Prediction and Generative Modeling in Reinforcement Learning, 2018.
 Asadi et al. [2018b] Kavosh Asadi, Dipendra Misra, and Michael L. Littman. Lipschitz continuity in model-based reinforcement learning. In ICML, 2018.
 Asis et al. [2017] Kristopher De Asis, J. Fernando Hernandez-Garcia, G. Zacharias Holland, and Richard S. Sutton. Multi-step reinforcement learning: A unifying algorithm. CoRR, abs/1703.01327, 2017.
 Azar et al. [2012] Mohammad Gheshlaghi Azar, Rémi Munos, and Hilbert Kappen. On the sample complexity of reinforcement learning with a generative model. In International Conference on Machine Learning, 2012.
 Azizzadenesheli et al. [2018] Kamyar Azizzadenesheli, Brandon Yang, Weitang Liu, Emma Brunskill, Zachary C. Lipton, and Animashree Anandkumar. Sample-efficient deep RL with generative adversarial tree search. CoRR, abs/1806.05780, 2018.
 Bartlett and Mendelson [2002] Peter L Bartlett and Shahar Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. JMLR, 2002.
 Bellemare et al. [2013] Marc G Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environment: An evaluation platform for general agents. JAIR, 2013.

 Bengio et al. [2015] Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. Scheduled sampling for sequence prediction with recurrent neural networks. In Advances in Neural Information Processing Systems, pages 1171–1179, 2015.
 Berkenkamp et al. [2017] Felix Berkenkamp, Matteo Turchetta, Angela Schoellig, and Andreas Krause. Safe model-based reinforcement learning with stability guarantees. In Advances in Neural Information Processing Systems, pages 908–918, 2017.
 Box [1976] George EP Box. Science and statistics. Journal of the American Statistical Association, 1976.
 Brafman and Tennenholtz [2002] Ronen I Brafman and Moshe Tennenholtz. R-max - a general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research, 3(Oct):213–231, 2002.
 Brockman et al. [2016] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. Openai gym. arXiv preprint arXiv:1606.01540, 2016.
 Caruana et al. [2004] Rich Caruana, Alexandru Niculescu-Mizil, Geoff Crew, and Alex Ksikes. Ensemble selection from libraries of models. In ICML, 2004.
 Dann and Brunskill [2015] Christoph Dann and Emma Brunskill. Sample complexity of episodic fixed-horizon reinforcement learning. In Advances in Neural Information Processing Systems, pages 2818–2826, 2015.
 Deisenroth and Rasmussen [2011] Marc Peter Deisenroth and Carl Edward Rasmussen. PILCO: A model-based and data-efficient approach to policy search. In ICML, 2011.
 Dempster et al. [1977] Arthur P Dempster, Nan M Laird, and Donald B Rubin. Maximum likelihood from incomplete data via the em algorithm. Journal of the Royal Statistical Society, 1977.
 Gal et al. [2018] Yarin Gal, Rowan McAllister, and Carl Edward Rasmussen. Improving PILCO with Bayesian neural network dynamics models. In ICML, 2018.
 Gelada et al. [2019] Carles Gelada, Saurabh Kumar, Jacob Buckman, Ofir Nachum, and Marc G. Bellemare. DeepMDP: Learning continuous latent space models for representation learning. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 2170–2179, Long Beach, California, USA, 09–15 Jun 2019. PMLR.

 Hart et al. [1972] Peter E Hart, Nils J Nilsson, and Bertram Raphael. Correction to a formal basis for the heuristic determination of minimum cost paths. ACM SIGART Bulletin, pages 28–29, 1972.
 Jiang et al. [2015] Nan Jiang, Alex Kulesza, Satinder Singh, and Richard Lewis. The dependence of effective planning horizon on model accuracy. In Proceedings of the 2015 International Conference on Autonomous Agents and Multiagent Systems, pages 1181–1189, 2015.
 Kearns and Singh [2002] Michael J. Kearns and Satinder P. Singh. Near-optimal reinforcement learning in polynomial time. Machine Learning, 2002.
 Kocsis and Szepesvári [2006] Levente Kocsis and Csaba Szepesvári. Bandit based Monte-Carlo planning. In ECML, 2006.
 Kurutach et al. [2018] Thanard Kurutach, Ignasi Clavera, Yan Duan, Aviv Tamar, and Pieter Abbeel. Model-ensemble trust-region policy optimization. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, Conference Track Proceedings, 2018.
 Ledoux and Talagrand [2013] Michel Ledoux and Michel Talagrand. Probability in Banach Spaces: isoperimetry and processes. Springer Science & Business Media, 2013.
 Lehnert et al. [2018] Lucas Lehnert, Romain Laroche, and Harm van Seijen. On value function representation of long horizon problems. In ThirtySecond AAAI Conference on Artificial Intelligence, 2018.
 Levine and Abbeel [2014] Sergey Levine and Pieter Abbeel. Learning neural network policies with guided policy search under unknown dynamics. In Advances in Neural Information Processing Systems, pages 1071–1079, 2014.
 Luo et al. [2018] Yuping Luo, Huazhe Xu, Yuanzhi Li, Yuandong Tian, Trevor Darrell, and Tengyu Ma. Algorithmic framework for model-based deep reinforcement learning with theoretical guarantees. arXiv preprint arXiv:1807.03858, 2018.
 Machado et al. [2018] Marlos C Machado, Marc G Bellemare, Erik Talvitie, Joel Veness, Matthew Hausknecht, and Michael Bowling. Revisiting the arcade learning environment: Evaluation protocols and open problems for general agents. Journal of Artificial Intelligence Research, 61:523–562, 2018.
 Maurer [2016] Andreas Maurer. A vectorcontraction inequality for rademacher complexities. In International Conference on Algorithmic Learning Theory, pages 3–17. Springer, 2016.
 Mnih et al. [2013] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
 Mnih et al. [2016] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, et al. Asynchronous methods for deep reinforcement learning. In ICML, 2016.
 Moerland et al. [2017] Thomas M Moerland, Joost Broekens, and Catholijn M Jonker. Learning multimodal transition dynamics for model-based reinforcement learning. arXiv preprint arXiv:1705.00470, 2017.
 Mohri et al. [2012] Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of Machine Learning. MIT Press, 2012.
 Moore and Atkeson [1993] Andrew W Moore and Christopher G Atkeson. Prioritized sweeping: Reinforcement learning with less data and less time. Machine learning, 13(1), 1993.
 Nagabandi et al. [2018] Anusha Nagabandi, Gregory Kahn, Ronald S Fearing, and Sergey Levine. Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 7559–7566. IEEE, 2018.
 Oh et al. [2015] Junhyuk Oh, Xiaoxiao Guo, Honglak Lee, Richard L Lewis, and Satinder Singh. Action-conditional video prediction using deep networks in Atari games. In Advances in Neural Information Processing Systems, pages 2863–2871, 2015.
 Precup and Sutton [1998] Doina Precup and Richard S Sutton. Multi-time models for temporally abstract planning. In Advances in Neural Information Processing Systems, pages 1050–1056, 1998.
 Precup [2000] Doina Precup. Eligibility traces for off-policy policy evaluation. Computer Science Department Faculty Publication Series, page 80, 2000.
 Puterman [2014] Martin L Puterman. Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons, 2014.

 Ross et al. [2011] Stéphane Ross, Geoffrey Gordon, and Drew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In AISTATS, 2011.
 Rummery and Niranjan [1994] GA Rummery and M Niranjan. Online Q-learning using connectionist systems. Technical Report CUED/F-INFENG/TR 166, University of Cambridge, Dept. of Engineering, 1994.
 Russell and Norvig [2016] Stuart J Russell and Peter Norvig. Artificial intelligence: A modern approach. Pearson Education Limited, 2016.
 Schapire [2003] Robert E Schapire. The boosting approach to machine learning: An overview. In Nonlinear estimation and classification, pages 149–171. Springer, 2003.
 Shalev-Shwartz and Ben-David [2014] Shai Shalev-Shwartz and Shai Ben-David. Understanding machine learning: From theory to algorithms. Cambridge University Press, 2014.
 Shyam et al. [2018] Pranav Shyam, Wojciech Jaskowski, and Faustino Gomez. Model-based active exploration. CoRR, abs/1810.12162, 2018.
 Silver and Ciosek [2012] David Silver and Kamil Ciosek. Compositional planning using optimal option models. In Proceedings of the 29th International Conference on Machine Learning, pages 1267–1274. Omnipress, 2012.
 Silver et al. [2016] David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, et al. Mastering the game of go with deep neural networks and tree search. Nature, 2016.
 Silver et al. [2017] David Silver, Hado van Hasselt, Matteo Hessel, Tom Schaul, Arthur Guez, Tim Harley, Gabriel Dulac-Arnold, David Reichert, Neil Rabinowitz, Andre Barreto, et al. The predictron: End-to-end learning and planning. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 3191–3199. JMLR.org, 2017.
 Singh and Sutton [1996] Satinder P Singh and Richard S Sutton. Reinforcement learning with replacing eligibility traces. Machine Learning, 22(1-3):123–158, 1996.
 Sun et al. [2018] Wen Sun, Nan Jiang, Akshay Krishnamurthy, Alekh Agarwal, and John Langford. Modelbased reinforcement learning in contextual decision processes. arXiv preprint arXiv:1811.08540, 2018.
 Sutton and Barto [2018] Richard S Sutton and Andrew G Barto. Reinforcement Learning: An Introduction. MIT Press, 2018.
 Sutton et al. [1999] Richard S Sutton, Doina Precup, and Satinder Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1-2):181–211, 1999.
 Sutton et al. [2000] Richard S Sutton, David A McAllester, Satinder P Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, 2000.
 Sutton et al. [2008] Richard S. Sutton, Csaba Szepesvári, Alborz Geramifard, and Michael H. Bowling. Dyna-style planning with linear function approximation and prioritized sweeping. In UAI, 2008.
 Sutton [1990] Richard S Sutton. Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In Machine Learning Proceedings 1990, pages 216–224. Elsevier, 1990.
 Sutton [1995] Richard S Sutton. TD models: Modeling the world at a mixture of time scales. In Machine Learning Proceedings 1995. Elsevier, 1995.
 Szita and Szepesvári [2010] István Szita and Csaba Szepesvári. Model-based reinforcement learning with nearly tight exploration complexity bounds. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 1031–1038, 2010.
 Talvitie [2014] Erik Talvitie. Model regularization for stable sample rollouts. In UAI, 2014.
 Talvitie [2017] Erik Talvitie. Selfcorrecting models for modelbased reinforcement learning. In AAAI, 2017.
 van Seijen and Sutton [2015] Harm van Seijen and Rich Sutton. A deeper look at planning as learning from replay. In ICML, 2015.
 Venkatraman et al. [2015] Arun Venkatraman, Martial Hebert, and J Andrew Bagnell. Improving multistep prediction of learned time series models. In AAAI, 2015.
 Wit et al. [2012] Ernst Wit, Edwin van den Heuvel, and JanWillem Romeijn. ‘All models are wrong…’: An introduction to model uncertainty. Statistica Neerlandica, 66(3), 2012.
10 Appendix
10.1 Proofs
Lemma 2 (restated).
Proof.
First note that:
So we instead focus on the corresponding set of scalar-valued functions. Moreover, to represent the model, we take each function to be a neural network with bounded weights, as characterized in Figure 7.
Now note that , so by Theorem 4.12 in Ledoux and Talagrand [2013]:
and so, using the Cauchy-Schwarz inequality:
Due to Jensen’s inequality for concave functions:
∎
Lemma 3.
(Asadi et al. [2018a]) Define the multi-step value function:
and assume a Lipschitz reward function with constant . Then:
Theorem 1 (restated).
Proof.
We reach the desired result by expanding the first term repeatedly. ∎
Theorem 2 (restated).
Proof.
∎
Lemma 1 (restated).
Proof.
We make heavy use of techniques provided by Bartlett and Mendelson [2002]. First note that, due to the definition of the hypothesis space, we clearly have:
We define a shorthand for the right-hand side of the above bound:
We can bound this quantity in terms of the Rademacher complexity: