The model-based approach to reinforcement learning (RL) offers a unique framework for addressing three important artificial intelligence (AI) problems: understanding the dynamics of the environment through interaction, using the acquired knowledge for planning, and performing sequential decision making. One promise of model-based RL is to enable sample-efficient learning (Sutton, 1990; Deisenroth and Rasmussen, 2011; Levine and Abbeel, 2014). This advantage is well-understood in theory in settings such as value-function estimation (Azar et al., 2012), exploration in discounted and finite-horizon Markov decision processes (MDPs) (Szita and Szepesvári, 2010; Dann and Brunskill, 2015), and exploration in contextual decision processes (Sun et al., 2018).
Planning has long been a fundamental research area in AI (Hart et al., 1972; Russell and Norvig, 2016). In the context of model-based RL, Sutton and Barto (2018) define planning as any process that goes from a model to a value function or a policy. They describe two notions of planning, namely decision-time planning and background planning. In decision-time planning, the agent utilizes the model during action selection by performing tree search (Kearns and Singh, 2002; Kocsis and Szepesvári, 2006; Silver et al., 2016). An agent is said to be performing background planning if it utilizes the model to update its value or policy network. Examples of background planning include the Dyna Architecture (Sutton, 1990; Sutton et al., 2008), or model-based policy-search (Abbeel et al., 2006; Deisenroth and Rasmussen, 2011; Kurutach et al., 2018). We call an algorithm model-based if it performs either background or decision-time planning.
A key aspect distinguishing model-based RL from traditional planning is that the model is learned from experience. As such, the model may be imperfect due to ineffective generalization (Abbeel et al., 2006; Nagabandi et al., 2018), inadequate exploration (Brafman and Tennenholtz, 2002; Shyam et al., 2018), overfitting (Asadi et al., 2018b), or irreducible errors in unrealizable settings (Shalev-Shwartz and Ben-David, 2014; Talvitie, 2014). More generally, a common view across various scientific disciplines is that all models are wrong, though some are still useful (Box, 1976; Wit et al., 2012).
Previous work explored learning models that look a single step ahead. To predict multiple steps ahead, the starting point of each step is set to the end point of the previous step. Unfortunately, when the model is wrong, this procedure can interfere with successful planning (Talvitie, 2014; Venkatraman et al., 2015; Asadi et al., 2018b). A significant reason for this failure is that the model may produce a “fake” input, meaning an input that cannot possibly occur in the domain, which is then fed back to the unprepared one-step model. Notice that two sources of error co-exist: the model is imperfect, and the model gets an inaccurate input in all but the first step. The interplay between the two errors leads to what is referred to as the compounding-error problem (Asadi et al., 2018b). To mitigate the problem, Talvitie (2014) and Venkatraman et al. (2015) provided an approach, called hallucination, that prepares the model for the fake inputs it generates itself. In contrast, our approach to the problem is to avoid feeding such fake inputs altogether by using a multi-step model. Though multi-step algorithms are popular in the model-free setting (Sutton and Barto, 2018; Asis et al., 2017; Singh and Sutton, 1996; Precup, 2000), their extension to the model-based setting remains less explored.
Our main contribution is to propose a Multi-step Model for Model-based RL (or simply $M^3$) that directly predicts the outcome of executing a sequence of actions. Learning the multi-step model allows us to avoid feeding fake inputs to the model. We further introduce a novel rollout procedure in which the original first state of the rollout is the starting point across all rollout steps. Our theory shows that, relative to the one-step model, learning the multi-step model is more effective for value estimation. To this end, we study the hardness of learning the multi-step model through the lens of Rademacher complexity (Bartlett and Mendelson, 2002). Finally, we empirically evaluate the multi-step model and show its advantage relative to the one-step model in the context of background planning and decision-time planning.
2 Background and Notation
To formulate the reinforcement-learning problem, we use finite-horizon Markov decision processes (MDPs) with continuous states and discrete actions. See Puterman (2014) for a thorough treatment of MDPs, and Sutton and Barto (2018) for an introduction to reinforcement learning.
2.1 Lipschitz Continuity
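Given metric spaces $(M_1, d_1)$ and $(M_2, d_2)$, a function $f: M_1 \to M_2$ is Lipschitz if its Lipschitz constant,
$$K_{d_1,d_2}(f) := \sup_{s_1 \neq s_2 \in M_1} \frac{d_2\big(f(s_1), f(s_2)\big)}{d_1(s_1, s_2)},$$
is finite. Similarly, $f: M_1 \times \mathcal{A} \to M_2$ is uniformly Lipschitz in $\mathcal{A}$ if the quantity below is finite:
$$K^{\mathcal{A}}_{d_1,d_2}(f) := \sup_{a \in \mathcal{A}} \, \sup_{s_1 \neq s_2 \in M_1} \frac{d_2\big(f(s_1, a), f(s_2, a)\big)}{d_1(s_1, s_2)}.$$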
[Table 1: Transition-model notation. Legible entries: "probability of taking … in …" (N/A in one column) and "MDP state after taking … in …".]
2.2 Rademacher Complexity
We use Rademacher complexity for sample complexity analysis. We define this measure here; for details see Bartlett and Mendelson (2002) or Mohri et al. (2012). Also, see Jiang et al. (2015) and Lehnert et al. (2018) for previous applications of Rademacher complexity in reinforcement learning.
Consider a function $f: \mathcal{X} \to \mathbb{R}$, and a set of such functions $\mathcal{F}$. The Rademacher complexity of this set, $\mathrm{Rad}(\mathcal{F})$, is defined as:
$$\mathrm{Rad}(\mathcal{F}) := \mathbb{E}_{x_j, \sigma_j}\Big[\sup_{f \in \mathcal{F}} \frac{1}{m} \sum_{j=1}^{m} \sigma_j f(x_j)\Big],$$
where $\sigma_j$, referred to as Rademacher random variables, are drawn uniformly at random from $\{-1, +1\}$.
The Rademacher variables can be thought of as independent and identically distributed noise. Under this view, the average $\frac{1}{m}\sum_{j=1}^{m} \sigma_j f(x_j)$ quantifies the extent to which $f$ matches the noise. A complicated hypothesis space that can accurately match noise has a high Rademacher complexity. Conversely, a simple hypothesis space has a low Rademacher complexity.
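To make the noise-matching intuition concrete, the following is a minimal sketch that estimates the empirical Rademacher complexity of a finite function class by Monte Carlo. It conditions on a fixed sample (the definition above also averages over the sample), and the function name `empirical_rademacher`, its parameters, and the two toy classes are illustrative assumptions rather than anything used in the paper.

```python
import itertools
import numpy as np

def empirical_rademacher(outputs, num_draws=2000, seed=0):
    """Monte Carlo estimate of the empirical Rademacher complexity.

    outputs: array of shape (num_functions, m); row i holds the values
    of the i-th function on a fixed sample x_1, ..., x_m.
    """
    rng = np.random.default_rng(seed)
    outputs = np.asarray(outputs, dtype=float)
    m = outputs.shape[1]
    # Each row of sigma is one draw of m Rademacher variables.
    sigma = rng.choice([-1.0, 1.0], size=(num_draws, m))
    # correlations[d, i] = (1/m) * sum_j sigma[d, j] * f_i(x_j)
    correlations = sigma @ outputs.T / m
    # Supremum over the class, then average over noise draws.
    return float(correlations.max(axis=1).mean())

# A class with a single function cannot track the noise.
simple_class = np.ones((1, 4))
# A class realizing every sign pattern on 4 points matches any noise draw.
rich_class = np.array(list(itertools.product([-1.0, 1.0], repeat=4)))

simple_value = empirical_rademacher(simple_class)
rich_value = empirical_rademacher(rich_class)
```

In this toy setting the single-function class yields an estimate near zero, while the class that realizes every sign pattern yields exactly one, which mirrors the simple-versus-complicated contrast drawn in the text.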
2.3 Transition-Model Notation
Our theoretical results focus on MDPs with deterministic transitions. This common assumption (Abbeel et al., 2006; Talvitie, 2017) simplifies notation and analysis. Note that a deterministic environment, like nearly all Atari games (Bellemare et al., 2013), can still be quite complex and challenging for model-based reinforcement learning (Azizzadenesheli et al., 2018). We introduce an extension of our results to a stochastic setting in the Appendix.
We use an overloaded notation for a transition model (see Table 1). Notice that our definitions support stochastic $n$-step transitions since these transitions are policy dependent, and that we allow for stochastic policies.
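3 $M^3$ – A Multi-step Model for Model-based Reinforcement Learning
$M^3$ is an extension of the one-step model: rather than only predicting a single step ahead, it learns to predict $n$ steps ahead, for each $n \in \{1, \dots, H\}$, using $H$ different functions:
$$\hat{T}_n(s, a_1, \dots, a_n), \quad n \in \{1, \dots, H\},$$
where $a_i \in \mathcal{A}$. Finally, by $M^3$, we mean the set of these $H$ functions:
$$M^3 := \{\hat{T}_n \mid n \in \{1, \dots, H\}\}.$$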
This model differs from the few examples of multi-step models studied in prior work: Sutton (1995) as well as van Seijen and Sutton (2015) considered multi-step models that do not take actions as input, but are implicitly conditioned on the current policy. Similarly, option models are multi-step models that are conditioned on one specific policy and a termination condition (Precup and Sutton, 1998; Sutton et al., 1999; Silver and Ciosek, 2012). Finally, Silver et al. (2017) introduced a multi-step model that directly predicts next values, but the model is defined for prediction tasks.
We now introduce a new rollout procedure using the multi-step model. Note that by an $n$-step rollout we mean sampling the next action using the agent’s fixed policy $\pi$, then computing the next state using the agent’s model, and then iterating this procedure $n-1$ more times. The new procedure obviates the need for the model to receive its own output as input.
To this end, we derive an approximate expression for:
Key to our approach is to rewrite in terms of predictions conditioned on action sequences as shown below:
Observe that given the model introduced above, is actually available—we only need to focus on the quantity . Intuitively, we need to compute the probability of taking a sequence of actions of length in the next steps starting from . This probability is clearly determined by the states observed in the next steps, and could be written as follows:
We can compute if we have . Continuing for steps:
which we can compute given the first functions of , namely for .
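One way to write out the factorization described in this derivation is sketched below. The notation is an assumption made for illustration: $\hat{T}_i(s, a_{1:i})$ denotes the $i$-step model's deterministic prediction from state $s$ under the action sequence $a_{1:i} = (a_1, \dots, a_i)$, and $\hat{T}_n(s' \mid s, \pi)$ denotes the induced distribution over the state reached after $n$ steps of the policy $\pi$. The expression is approximate because the policy is evaluated at predicted states rather than true ones.

```latex
\begin{align*}
\hat{T}_n(s' \mid s, \pi)
  &= \sum_{a_{1:n}} \Pr(a_{1:n} \mid s, \pi)\,
     \mathbb{1}\big[s' = \hat{T}_n(s, a_{1:n})\big], \\
\Pr(a_{1:n} \mid s, \pi)
  &\approx \pi\big(a_n \mid \hat{T}_{n-1}(s, a_{1:n-1})\big)\,
     \Pr(a_{1:n-1} \mid s, \pi) \\
  &= \pi(a_1 \mid s) \prod_{i=2}^{n}
     \pi\big(a_i \mid \hat{T}_{i-1}(s, a_{1:i-1})\big).
\end{align*}
```

Under this reading, every model call on the right-hand side takes the original state $s$ as input, and only the first $n-1$ functions are needed to evaluate the action-sequence probability.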
Finally, to compute a rollout, we sample from by sampling from the policy at each step:
Notice that, in the above rollout with $M^3$, we have used the first state as the starting point of every single rollout step. Crucially, we do not feed the intermediate state predictions to the model as input. We hypothesize that this approach can combat the compounding-error problem by removing one source of error, namely feeding the model a noisy input, which is otherwise present in the rollout using the one-step model. Figure 1 illustrates the new rollout procedure alongside the standard rollout procedure performed using the one-step model. In the rest of the paper, we test this hypothesis in theory and practice.
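The contrast between the two rollout procedures can be sketched in code. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the callable-based model interface, and the toy dynamics (true transition $s' = s + a$, with every model call adding the same hypothetical bias of 0.1) are all invented for the example. In particular, the toy assumes the multi-step model's error does not grow with the number of steps, which need not hold in practice.

```python
import numpy as np

def one_step_rollout(one_step_model, policy, s0, horizon, rng):
    """Standard rollout: each predicted state is fed back into the model."""
    states, s = [], s0
    for _ in range(horizon):
        a = policy(s, rng)
        s = one_step_model(s, a)  # the model's own output becomes its next input
        states.append(s)
    return states

def multi_step_rollout(multi_step_models, policy, s0, horizon, rng):
    """Rollout where every prediction is made from the true first state s0.

    multi_step_models[i] maps (s0, (a_1, ..., a_{i+1})) to the predicted
    state i + 1 steps ahead.
    """
    states, actions, s = [], [], s0
    for i in range(horizon):
        a = policy(s, rng)  # the policy still sees the latest predicted state
        actions.append(a)
        s = multi_step_models[i](s0, tuple(actions))  # model input is always s0
        states.append(s)
    return states

# Toy setup (hypothetical): true dynamics are s' = s + a, and every model
# call adds the same bias to its prediction.
BIAS = 0.1
always_right = lambda s, rng: 1.0
biased_one_step = lambda s, a: s + a + BIAS
biased_multi_step = [lambda s0, acts: s0 + sum(acts) + BIAS for _ in range(3)]

rng = np.random.default_rng(0)
true_states = np.array([1.0, 2.0, 3.0])
one_step_error = np.abs(
    np.array(one_step_rollout(biased_one_step, always_right, 0.0, 3, rng)) - true_states
)
multi_step_error = np.abs(
    np.array(multi_step_rollout(biased_multi_step, always_right, 0.0, 3, rng)) - true_states
)
```

In this toy example the one-step rollout accumulates the bias at every step because it consumes its own predictions, whereas the multi-step rollout incurs the bias only once per prediction because its input is always the true first state.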
4 Value-Function Error Bounds
We argued above that a multi-step model can better deal with the compounding-error problem. We now formalize this claim in the context of policy evaluation. Specifically, we show a bound on value-function estimation error of a fixed policy in terms of the error in the agent’s model, while highlighting similar bounds from prior work (Ross et al., 2011; Talvitie, 2017; Asadi et al., 2018b). All proofs can be found in the Appendix. Note also that in all expectations below actions and states are distributed according to the agent’s fixed policy and its stationary distribution, respectively.
Define the -step value function , then
Moving to the multi-step case with , we have the following result:
Note that, by leveraging $M^3$ in Theorem 2, we removed the factor from each summand of the bound in Theorem 1. This result suggests that $M^3$ is more conducive to value-function learning as long as the $n$-step generalization error grows slowly with $n$. In the next section, we show that this property holds under weak assumptions, concluding that the bound from Theorem 2 is an improvement over the bound from Theorem 1.
5 Analysis of Generalization Error
We now study the sample complexity of learning -step dynamics. To formulate the problem, consider a dataset . Consider a scalar-valued function , where is the dimension of the state space, and a set of such functions . Consider a function and a set of functions
For each we learn a function such that:
For any function , , training error , and probability at least :
The second and third terms of the bound are not functions of $n$. Therefore, to understand how generalization error grows as a function of $n$, we look at how the Rademacher complexity of the hypothesis space depends on $n$.
For the hypothesis space shown in Figure 7 (see Appendix):
In our experiments, we represent action sequences with -hot vector so . We finally get the desired result:
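As an illustration of how an action sequence might be represented as a multi-hot vector, the sketch below concatenates one one-hot vector per action. The concatenated layout and the function name `encode_action_sequence` are assumptions made for the example; the text does not specify the exact layout. Under this assumption, a sequence of $n$ actions has exactly $n$ ones, so its Euclidean norm is $\sqrt{n}$.

```python
import numpy as np

def encode_action_sequence(actions, num_actions):
    """Concatenate one one-hot vector per action in the sequence."""
    encoding = np.zeros((len(actions), num_actions))
    # Set a single 1 in each row, at the index of that step's action.
    encoding[np.arange(len(actions)), actions] = 1.0
    return encoding.ravel()

# A sequence of three actions drawn from a set of three discrete actions.
encoded = encode_action_sequence([0, 2, 1], num_actions=3)
norm = float(np.linalg.norm(encoded))
```

The norm of the input grows only as the square root of the sequence length under this encoding, which is the kind of mild dependence on the number of steps that the analysis in this section is concerned with.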
Define the constant and the constant . Then:
with the one-step model,
with the multi-step model,
Note the reduction of factor in the coefficient of , which is typically larger than .
6 Discussion
As training the multi-step model involves training $H$ different functions, the computational complexity of learning $M^3$ is $H$ times that of learning the one-step model. However, the two rollout procedures shown in Figure 1 require equal computation. So in cases where planning is the bottleneck, the overall computational complexity of the two cases should be similar. Specifically, in experiments we observed that the overall running time under the multi-step model was always less than double the running time under the one-step model.
Previous work has identified various ways of improving one-step models: quantifying model uncertainty during planning (Deisenroth and Rasmussen, 2011; Gal et al., 2018), hallucinating training examples (Talvitie, 2014; Azizzadenesheli et al., 2018; Venkatraman et al., 2015; Oh et al., 2015; Bengio et al., 2015), using ensembles (Kurutach et al., 2018; Shyam et al., 2018), or model regularization (Jiang et al., 2015; Asadi et al., 2018b). These methods are independent of the multi-step model idea; future work will explore the benefits of combining these advances with $M^3$.
7 Empirical Results
The goal of our experiments is to investigate if the multi-step model can perform better than the one-step model in several model-based scenarios. We specifically set up experiments in background planning and decision-time planning to test this hypothesis. Note also that we provide code for all experiments in our supplementary material.
7.1 Background Planning
As mentioned before, one use case for models is to enable fast value-function estimation. We compare the utility of the one-step model and the multi-step model for this purpose. For this experiment, we used the all-action variant of the actor-critic algorithm, in which the value function (or the critic) is used to estimate the policy gradient (Sutton et al., 2000; Allen et al., 2017). Note that it is standard to learn the value function model-free, by regressing the value estimate toward a one-step bootstrapped target. Mnih et al. (2016) generalize this objective to a multi-step target. Crucially, in the model-free case, we compute this target using the single trajectory observed during environmental interaction. However, because the policy is stochastic, the target is a random variable with some variance. To reduce variance, we can use a learned model to generate an arbitrary number of rollouts (5 in our experiments), compute the target for each rollout, and average them. We compared the effectiveness of both the one-step and the multi-step models in generating useful rollouts for learning. To ensure a meaningful comparison, we perform the same number of value-function updates for each algorithm. We used three standard RL domains from OpenAI Gym (Brockman et al., 2016), namely Cart Pole, Acrobot, and Lunar Lander. Results are summarized in Figures 2 and 3.
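The averaging of multi-step targets over model-generated rollouts can be sketched as follows. This is a minimal illustration under assumptions: the target is taken to be the sum of rewards along a rollout plus a bootstrapped value at the final state, the discount factor `gamma` is a hypothetical parameter (set it to 1 for an undiscounted finite-horizon setting), and the function names are invented for the example.

```python
import numpy as np

def n_step_target(rewards, bootstrap_value, gamma):
    """Discounted sum of rewards along one rollout plus a bootstrapped tail."""
    rewards = np.asarray(rewards, dtype=float)
    discounts = gamma ** np.arange(len(rewards))
    return float(discounts @ rewards + gamma ** len(rewards) * bootstrap_value)

def averaged_target(rollouts, gamma):
    """Average the multi-step target over several model-generated rollouts.

    rollouts: iterable of (rewards, bootstrap_value) pairs, one per rollout.
    """
    return float(np.mean([n_step_target(r, v, gamma) for r, v in rollouts]))

# Two hypothetical rollouts from the same start state.
single = n_step_target([1.0, 1.0], bootstrap_value=10.0, gamma=0.5)
averaged = averaged_target([([1.0, 1.0], 10.0), ([0.0, 0.0], 2.0)], gamma=0.5)
```

Averaging over several rollouts trades extra model queries for a lower-variance regression target, which is only helpful to the extent that the model's multi-step predictions are accurate.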
To better understand the advantage of the multi-step model, we show per-episode transition error under the two models. Figure 8 (see Appendix) clearly indicates that the multi-step model is more accurate for longer horizons. This is consistent with the theoretical results presented earlier. Note that in this experiment we did not use the model for action selection, and simply queried the policy network and sampled an action from the distribution provided by the network given a state input.
7.2 Decision-time Planning
We now use the model for action selection. A common action-selection strategy is to choose the action that maximizes a learned action-value function, $\arg\max_a \hat{Q}(s, a)$, hereafter called the model-free strategy. Our goal is to compare the model-free strategy with its model-based counterparts, and to compare the effectiveness of the one-step model with that of the multi-step model in this scenario.
A key choice in decision-time planning is the strategy used to construct the tree. One approach is to expand the tree for each action in each observed state (Azizzadenesheli et al., 2018). The main problem with this strategy is that the number of nodes grows exponentially with depth. Alternatively, using a learned action-value function $\hat{Q}$, at each state we can expand only the most promising action, $\arg\max_a \hat{Q}(s, a)$. Clearly, given the same amount of computation, the second strategy can benefit from performing deeper lookaheads. The two strategies are illustrated in Figure 4 (left).
Note that because the model is trained from experience, it is still only accurate up to a certain depth. Therefore, when we reach a specified planning horizon, $H$, we simply use the learned action-value function as an estimate of the future sum of rewards from the leaf node. While this estimate can be erroneous, we observed that it is necessary to include it, because otherwise the agent will be myopic in the sense that it only looks at the short-term effects of its actions.
The second question is how to determine the best action given the built tree. One possibility is to add all rewards to the value of the leaf node, and go with the action that maximizes this number. As shown in Figure 4 (right), another idea is to use an ensemble where the final value of the action is computed using the mean of the different estimates along the rollout. This idea is based on the notion that in machine learning averaging many estimates can often lead to a better estimate than the individual ones (Schapire, 2003; Caruana et al., 2004).
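The two action-value estimation strategies can be sketched as follows. This is an illustration under assumptions rather than the authors' implementation: the estimate at depth $k$ is taken to be the sum of the first $k$ predicted rewards plus the learned action value at the state predicted at depth $k$, the optional discount `gamma` is a hypothetical parameter, and the function names are invented for the example.

```python
import numpy as np

def rollout_estimates(rewards, leaf_values, gamma=1.0):
    """Return one action-value estimate per depth of a single rollout.

    rewards[k - 1] is the predicted reward at step k, and leaf_values[k - 1]
    is the learned action value at the state predicted at depth k.
    """
    rewards = np.asarray(rewards, dtype=float)
    leaf_values = np.asarray(leaf_values, dtype=float)
    depths = np.arange(1, len(rewards) + 1)
    discounted_rewards = rewards * gamma ** (depths - 1)
    # Estimate at depth k: rewards up to step k plus the bootstrapped value.
    return np.cumsum(discounted_rewards) + gamma ** depths * leaf_values

def leaf_estimate(rewards, leaf_values, gamma=1.0):
    """Use only the deepest estimate: all rewards plus the leaf value."""
    return float(rollout_estimates(rewards, leaf_values, gamma)[-1])

def ensemble_estimate(rewards, leaf_values, gamma=1.0):
    """Average the estimates obtained at every depth along the rollout."""
    return float(rollout_estimates(rewards, leaf_values, gamma).mean())

# A hypothetical three-step rollout for one candidate action.
rewards = [1.0, 2.0, 3.0]
leaf_values = [10.0, 8.0, 4.0]
per_depth = rollout_estimates(rewards, leaf_values)
```

For action selection, either estimate would be computed for each candidate first action and the maximizing action chosen; the ensemble variant hedges against a single erroneous leaf value by mixing shallow and deep estimates.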
The two tree-expansion strategies and the two action-value estimation strategies together constitute four possible combinations. To find the most effective combination, we first performed an experiment in the Lunar Lander setting where, given different pretrained $Q$ functions, we computed the improvement that the model-based policy offers relative to the model-free policy. We trained these $Q$ functions using the DQN algorithm (Mnih et al., 2013) and stored weights every 100 episodes, giving us 20 snapshots of $Q$. The models were also trained using the same amount of data that a particular $Q$ was trained on. We then tested the four strategies (no learning was performed during testing). For each snapshot, we took the frozen $Q$ network, and compared the performance of different policies given $Q$ and the trained models. In this case, by performance we mean average episode return over 20 episodes.
Results, averaged over 200 runs, are presented in Figure 5 (left), and show an advantage for the ensemble and optimal-action combination (labeled optimal ensemble). Note that, in all four cases, the model used for tree search was the one-step model, and so this served as an experiment to find the best combination under this model. We then performed the same experiment with the multi-step model as shown in Figure 5 (right) using the best combination (i.e., optimal-action expansion with ensemble value computation). We clearly see that $M^3$ is more useful in this scenario as well.
We further investigated whether the superiority in terms of action selection can actually accelerate DQN training as well. In this scenario, we ran DQN under different policies, namely model-free, model-based with the one-step model, and model-based with $M^3$. In all cases, we chose a random action with probability $\epsilon$ for exploration. See Figure 6, which again shows the benefit of the multi-step model for decision-time planning in model-based RL.
8 Conclusion
We introduced an approach to multi-step model-based RL and provided results that suggest its promise. We introduced the $M^3$ model along with a new rollout procedure that exploits it. The combination proved useful from both a theoretical and a practical point of view. Together, our results make a strong case for multi-step model-based RL.
9 Future Work
An important avenue for future work is to better explore methods appropriate to stochastic environments. (See Appendix for an extension in this direction.) Another thread to explore is an ensemble method enabled by the $M^3$ model. Specifically, we can show that an $n$-step prediction can be estimated in exponentially many ways, and combining these estimates can lead to a better final estimate (see Appendix).
- Abbeel et al.  Pieter Abbeel, Morgan Quigley, and Andrew Y Ng. Using inaccurate models in reinforcement learning. In ICML, 2006.
- Allen et al.  Cameron Allen, Kavosh Asadi, Melrose Roderick, Abdel-rahman Mohamed, George Konidaris, and Michael Littman. Mean actor critic. arXiv preprint arXiv:1709.00503, 2017.
- Asadi et al. [2018a] Kavosh Asadi, Evan Cater, Dipendra Misra, and Michael L. Littman. Equivalence between Wasserstein and value-aware model-based reinforcement learning. In ICML workshop on Prediction and Generative Modeling in Reinforcement Learning, 2018.
- Asadi et al. [2018b] Kavosh Asadi, Dipendra Misra, and Michael L. Littman. Lipschitz continuity in model-based reinforcement learning. In ICML, 2018.
- Asis et al.  Kristopher De Asis, J. Fernando Hernandez-Garcia, G. Zacharias Holland, and Richard S. Sutton. Multi-step reinforcement learning: A unifying algorithm. CoRR, abs/1703.01327, 2017.
- Azar et al.  Mohammad Gheshlaghi Azar, Rémi Munos, and Hilbert Kappen. On the sample complexity of reinforcement learning with a generative model. In International Conference on Machine Learning, 2012.
- Azizzadenesheli et al.  Kamyar Azizzadenesheli, Brandon Yang, Weitang Liu, Emma Brunskill, Zachary C. Lipton, and Animashree Anandkumar. Sample-efficient deep RL with generative adversarial tree search. CoRR, abs/1806.05780, 2018.
- Bartlett and Mendelson  Peter L Bartlett and Shahar Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. JMLR, 2002.
- Bellemare et al.  Marc G Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environment: An evaluation platform for general agents. JAIR, 2013.
- Bengio et al.  Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. Scheduled sampling for sequence prediction with recurrent neural networks. In Advances in Neural Information Processing Systems, pages 1171–1179, 2015.
- Berkenkamp et al.  Felix Berkenkamp, Matteo Turchetta, Angela Schoellig, and Andreas Krause. Safe model-based reinforcement learning with stability guarantees. In Advances in neural information processing systems, pages 908–918, 2017.
- Box  George EP Box. Science and statistics. Journal of the American Statistical Association, 1976.
- Brafman and Tennenholtz  Ronen I Brafman and Moshe Tennenholtz. R-max-a general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research, 3(Oct):213–231, 2002.
- Brockman et al.  Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.
- Caruana et al.  Rich Caruana, Alexandru Niculescu-Mizil, Geoff Crew, and Alex Ksikes. Ensemble selection from libraries of models. In ICML, 2004.
- Dann and Brunskill  Christoph Dann and Emma Brunskill. Sample complexity of episodic fixed-horizon reinforcement learning. In Advances in Neural Information Processing Systems, pages 2818–2826, 2015.
- Deisenroth and Rasmussen  Marc Peter Deisenroth and Carl Edward Rasmussen. PILCO: A model-based and data-efficient approach to policy search. In ICML, 2011.
- Dempster et al.  Arthur P Dempster, Nan M Laird, and Donald B Rubin. Maximum likelihood from incomplete data via the em algorithm. Journal of the Royal Statistical Society, 1977.
- Gal et al.  Yarin Gal, Rowan McAllister, and Carl Edward Rasmussen. Improving pilco with bayesian neural network dynamics models. In ICML, 2018.
- Gelada et al.  Carles Gelada, Saurabh Kumar, Jacob Buckman, Ofir Nachum, and Marc G. Bellemare. DeepMDP: Learning continuous latent space models for representation learning. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 2170–2179, Long Beach, California, USA, 09–15 Jun 2019. PMLR.
- Hart et al.  Peter E Hart, Nils J Nilsson, and Bertram Raphael. Correction to a formal basis for the heuristic determination of minimum cost paths. ACM SIGART Bulletin, pages 28–29, 1972.
- Jiang et al.  Nan Jiang, Alex Kulesza, Satinder Singh, and Richard Lewis. The dependence of effective planning horizon on model accuracy. In Proceedings of the 2015 International Conference on Autonomous Agents and Multiagent Systems, pages 1181–1189. International Foundation for Autonomous Agents and Multiagent Systems, 2015.
- Kearns and Singh  Michael J. Kearns and Satinder P. Singh. Near-optimal reinforcement learning in polynomial time. Machine Learning, 2002.
- Kocsis and Szepesvári  Levente Kocsis and Csaba Szepesvári. Bandit based Monte-Carlo planning. In ECML, 2006.
- Kurutach et al.  Thanard Kurutach, Ignasi Clavera, Yan Duan, Aviv Tamar, and Pieter Abbeel. Model-ensemble trust-region policy optimization. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings, 2018.
- Ledoux and Talagrand  Michel Ledoux and Michel Talagrand. Probability in Banach Spaces: isoperimetry and processes. Springer Science & Business Media, 2013.
- Lehnert et al.  Lucas Lehnert, Romain Laroche, and Harm van Seijen. On value function representation of long horizon problems. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
- Levine and Abbeel  Sergey Levine and Pieter Abbeel. Learning neural network policies with guided policy search under unknown dynamics. In Advances in Neural Information Processing Systems, pages 1071–1079, 2014.
- Luo et al.  Yuping Luo, Huazhe Xu, Yuanzhi Li, Yuandong Tian, Trevor Darrell, and Tengyu Ma. Algorithmic framework for model-based deep reinforcement learning with theoretical guarantees. arXiv preprint arXiv:1807.03858, 2018.
- Machado et al.  Marlos C Machado, Marc G Bellemare, Erik Talvitie, Joel Veness, Matthew Hausknecht, and Michael Bowling. Revisiting the arcade learning environment: Evaluation protocols and open problems for general agents. Journal of Artificial Intelligence Research, 61:523–562, 2018.
- Maurer  Andreas Maurer. A vector-contraction inequality for rademacher complexities. In International Conference on Algorithmic Learning Theory, pages 3–17. Springer, 2016.
- Mnih et al.  Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
- Mnih et al.  Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, et al. Asynchronous methods for deep reinforcement learning. In ICML, 2016.
- Moerland et al.  Thomas M Moerland, Joost Broekens, and Catholijn M Jonker. Learning multimodal transition dynamics for model-based reinforcement learning. arXiv preprint arXiv:1705.00470, 2017.
- Mohri et al.  Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of Machine Learning. MIT Press, 2012.
- Moore and Atkeson  Andrew W Moore and Christopher G Atkeson. Prioritized sweeping: Reinforcement learning with less data and less time. Machine learning, 13(1), 1993.
- Nagabandi et al.  Anusha Nagabandi, Gregory Kahn, Ronald S Fearing, and Sergey Levine. Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 7559–7566. IEEE, 2018.
- Oh et al.  Junhyuk Oh, Xiaoxiao Guo, Honglak Lee, Richard L Lewis, and Satinder Singh. Action-conditional video prediction using deep networks in atari games. In Advances in neural information processing systems, pages 2863–2871, 2015.
- Precup and Sutton  Doina Precup and Richard S Sutton. Multi-time models for temporally abstract planning. In Advances in neural information processing systems, pages 1050–1056, 1998.
- Precup  Doina Precup. Eligibility traces for off-policy policy evaluation. Computer Science Department Faculty Publication Series, page 80, 2000.
- Puterman  Martin L Puterman. Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons, 2014.
- Ross et al.  Stéphane Ross, Geoffrey Gordon, and Drew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In AISTATS, 2011.
- Rummery and Niranjan  GA Rummery and M Niranjan. On-line Q-learning using connectionist systems. Technical Report CUED/F-INFENG/TR 166, U. of Cambridge, Dept. of Engineering, 1994.
- Russell and Norvig  Stuart J Russell and Peter Norvig. Artificial intelligence: a modern approach. Malaysia; Pearson Education Limited, 2016.
- Schapire  Robert E Schapire. The boosting approach to machine learning: An overview. In Nonlinear estimation and classification, pages 149–171. Springer, 2003.
- Shalev-Shwartz and Ben-David  Shai Shalev-Shwartz and Shai Ben-David. Understanding machine learning: From theory to algorithms. Cambridge university press, 2014.
- Shyam et al.  Pranav Shyam, Wojciech Jaskowski, and Faustino Gomez. Model-based active exploration. CoRR, abs/1810.12162, 2018.
- Silver and Ciosek  David Silver and Kamil Ciosek. Compositional planning using optimal option models. In Proceedings of the 29th International Conference on Machine Learning, pages 1267–1274. Omnipress, 2012.
- Silver et al.  David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, et al. Mastering the game of go with deep neural networks and tree search. Nature, 2016.
- Silver et al.  David Silver, Hado van Hasselt, Matteo Hessel, Tom Schaul, Arthur Guez, Tim Harley, Gabriel Dulac-Arnold, David Reichert, Neil Rabinowitz, Andre Barreto, et al. The predictron: End-to-end learning and planning. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 3191–3199. JMLR. org, 2017.
- Singh and Sutton  Satinder P Singh and Richard S Sutton. Reinforcement learning with replacing eligibility traces. Machine learning, 22(1-3):123–158, 1996.
- Sun et al.  Wen Sun, Nan Jiang, Akshay Krishnamurthy, Alekh Agarwal, and John Langford. Model-based reinforcement learning in contextual decision processes. arXiv preprint arXiv:1811.08540, 2018.
- Sutton and Barto  Richard S Sutton and Andrew G Barto. Reinforcement Learning: An Introduction. MIT Press, 2018.
- Sutton et al.  Richard S Sutton, Doina Precup, and Satinder Singh. Between mdps and semi-mdps: A framework for temporal abstraction in reinforcement learning. Artificial intelligence, 112(1-2):181–211, 1999.
- Sutton et al.  Richard S Sutton, David A McAllester, Satinder P Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Neurips, 2000.
- Sutton et al.  Richard S. Sutton, Csaba Szepesvári, Alborz Geramifard, and Michael H. Bowling. Dyna-style planning with linear function approximation and prioritized sweeping. In UAI, 2008.
- Sutton  Richard S Sutton. Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In Machine Learning Proceedings 1990, pages 216–224. Elsevier, 1990.
- Sutton  Richard S Sutton. TD models: Modeling the world at a mixture of time scales. In Machine Learning Proceedings 1995. Elsevier, 1995.
- Szita and Szepesvári  István Szita and Csaba Szepesvári. Model-based reinforcement learning with nearly tight exploration complexity bounds. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 1031–1038, 2010.
- Talvitie  Erik Talvitie. Model regularization for stable sample rollouts. In UAI, 2014.
- Talvitie  Erik Talvitie. Self-correcting models for model-based reinforcement learning. In AAAI, 2017.
- van Seijen and Sutton  Harm van Seijen and Rich Sutton. A deeper look at planning as learning from replay. In ICML, 2015.
- Venkatraman et al.  Arun Venkatraman, Martial Hebert, and J Andrew Bagnell. Improving multi-step prediction of learned time series models. In AAAI, 2015.
- Wit et al.  Ernst Wit, Edwin van den Heuvel, and Jan-Willem Romeijn. ‘All models are wrong…’: An introduction to model uncertainty. Statistica Neerlandica, 66(3), 2012.
First note that:
So we rather focus on the set of scalar-valued functions . Moreover, to represent the model, we set to be neural network with bounded weights as characterized in Figure 7.
Now note that , so by Theorem 4.12 in Ledoux and Talagrand (2013):
and , so using the Cauchy-Schwarz inequality:
Due to Jensen’s inequality for concave function :
Asadi et al. [2018a] Define the -step value function:
and assume a Lipschitz reward function with constant . Then:
We reach the desired result by expanding the first term for more times. ∎