1 Introduction
Reinforcement learning (Sutton and Barto, 1998)
has become the framework for studying AI agents that learn, plan, and act in uncertain and sequential environments. This success is partly due to simple and general reinforcement learning algorithms that, when equipped with neural networks, can solve complex domains such as games with massive state spaces
(Mnih et al., 2015; Silver et al., 2016). Games usually provide a simulator that can be used to experience a large amount of environmental interaction. A good simulator is unavailable in many applications, so it is imperative to learn a good behavior from fewer environmental interactions.

When environmental interaction is expensive but computation is cheap, the model-based approach to reinforcement learning offers a solution by summarizing past experience using a model, informally thought of as an internal simulator of the environment (Sutton and Barto, 1998). An agent can then use the learned model in different capacities. For example, a model can be used to perform tree search (Kocsis and Szepesvári, 2006; Silver et al., 2016), to update the value-function estimate (Sutton, 1990; Parr et al., 2008; Sutton et al., 2008), to update an explicitly represented policy (Abbeel et al., 2006; Deisenroth and Rasmussen, 2011), or to make option-conditional predictions (Sutton et al., 1999; Sorg and Singh, 2010).

Regardless of how a model is utilized, the effectiveness of planning depends on the accuracy of the model. A widely accepted view is that all models are imperfect, but some are still useful (Box, 1976). Generally speaking, modeling errors can be due to overfitting or underfitting, or simply due to the fact that the setting is agnostic, so some irreducible error exists. In reinforcement learning, for example, partial observability and non-Markovian dynamics can yield an agnostic setting.
In the model-based setting it is common to learn a one-step model of the environment. When performing rollouts with a one-step model, the output of the model at each step is used as the input at the subsequent step. It has been observed, however, that in this case even small modeling errors, which are unavoidable as argued above, can severely degrade multi-step predictions (Talvitie, 2017; Venkatraman et al., 2015; Asadi et al., 2018b). Intuitively, the problem is that the model is never trained to perform reasonably on its own output, so the rollout process can get derailed, drifting out of the state space after only a few steps.
Here we study a direct approach to computing multi-step predictions. We propose a simple multi-step model that learns to predict the outcome of an action sequence of variable length. We show that, in terms of prediction accuracy, our proposed model outperforms a one-step model when the prediction horizon is large. We further show that the model can provide the outcome of running a policy, rather than just a fixed sequence of actions. To demonstrate the effectiveness of the multi-step model, we use it for value-function optimization in the context of actor-critic reinforcement learning. Moreover, we report preliminary results on Atari Breakout showing that the multi-step model outperforms the one-step model in terms of frame prediction after multiple timesteps.
2 Background and Notation
We briefly present the background and the notation used to articulate our results. For a complete background on reinforcement learning see Sutton and Barto (1998), and for a complete background on MDPs see Puterman (2014).
We focus on the reinforcement learning problem, in which an agent interacts with an environment to maximize long-term reward. The problem is formulated as a Markov Decision Process, or simply an MDP. The tuple
⟨S, A, R, T, γ⟩ represents an MDP, where we assume a continuous state-space S and a discrete action-space A. Our goal is to find a policy π that achieves high discounted cumulative reward. Here R and T represent the reward and transition dynamics of the MDP. More specifically, the reward after taking an action a in a state s and moving to a next state s′ is denoted by R(s, a, s′). For transition dynamics we use an overloaded notation. In the simplest case, the transition function takes as input a state, a next state, and an action:

T(s′ | s, a) = Pr(S_{t+1} = s′ | S_t = s, A_t = a).
Similarly, we define a transition function conditioned on a state and a sequence of actions:

T(s′ | s, a_1, …, a_n) = Pr(S_{t+n} = s′ | S_t = s, A_t = a_1, …, A_{t+n−1} = a_n),
as well as one conditioned on a state and a policy:

T(s′ | s, π, n) = Pr(S_{t+n} = s′ | S_t = s, A_{t+i} ∼ π(· | S_{t+i}) for 0 ≤ i < n).
It is useful to compute the long-term goodness of taking an action a in a state s. Referred to as the value function, this is defined as follows:

q_π(s, a) = E[ Σ_{t=0}^{∞} γ^t R_t | S_0 = s, A_0 = a, A_t ∼ π(· | S_t) for t ≥ 1 ].
3 Policy-Conditional Prediction
In this section we study the following basic question: how can we learn to predict the outcome of executing a policy π from a state s given a horizon n? More specifically, our aim is to compute (or sample from) the distribution T(s′ | s, π, n) defined above.
We explain the standard approach in the next section, and introduce our approach in the subsequent section.
3.1 Single-step Model
The common way of making the n-step prediction is to break the problem into n single-step prediction problems as follows:

T(s′ | s, π, n) = ∫ T(s″ | s, π, 1) T(s′ | s″, π, n−1) ds″.

The one-step probability distribution can further be written as:

T(s″ | s, π, 1) = Σ_{a ∈ A} π(a | s) T(s″ | s, a).
Note that in a continuous state-space it is easier to work with a deterministic model than with the full distribution, so, following prior work (Parr et al., 2008; Sutton et al., 2008), we use a deterministic approximation f that learns to output the expected next state by minimizing the mean squared error over observed transitions:

min_f Σ_{(s, a, s′)} || f(s, a) − s′ ||², so that f(s, a) ≈ E[S_{t+1} | S_t = s, A_t = a].
We represent this model, as well as all other parameterized functions, using deep neural networks (LeCun et al., 2015). This approximation can be poor in general, but it is theoretically justified even when the environment is stochastic (Parr et al., 2008; Sutton et al., 2008).
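As a concrete illustration of this training scheme, the sketch below fits a deterministic one-step model by stochastic gradient descent on the squared error. A linear model per discrete action stands in for the deep networks used in the paper; the class and method names (`OneStepModel`, `fit_step`) are our own and purely illustrative.

```python
import numpy as np

# A minimal sketch of a deterministic one-step model f(s, a) -> s',
# trained to predict the expected next state by minimizing mean squared
# error. A linear model per action stands in for a neural network.
class OneStepModel:
    def __init__(self, state_dim, num_actions, lr=0.1):
        # One weight matrix and bias per discrete action: s' ≈ W[a] @ s + b[a]
        self.W = np.zeros((num_actions, state_dim, state_dim))
        self.b = np.zeros((num_actions, state_dim))
        self.lr = lr

    def predict(self, s, a):
        return self.W[a] @ s + self.b[a]

    def fit_step(self, s, a, s_next):
        # One gradient step on the squared error ||f(s, a) - s'||^2.
        err = self.predict(s, a) - s_next        # prediction residual
        self.W[a] -= self.lr * np.outer(err, s)  # gradient of 0.5*||err||^2
        self.b[a] -= self.lr * err
```

With deterministic linear dynamics the model recovers the true transition matrix; with stochastic dynamics the same loss drives it toward the expected next state, matching the justification cited above.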
Finally, using this model, we can easily sample a state n steps into the future:

ŝ_{t+1} = f(s_t, a_t),  ŝ_{t+i+1} = f(ŝ_{t+i}, a_{t+i}) for 1 ≤ i < n,  with a_{t+i} ∼ π(· | ŝ_{t+i}).
Figure 1 illustrates the rollout process. The downside of this approach is that the model's output is provided as its own input n−1 times, yet the model is never trained on its own output. In practice, this causes the compounding-error problem.
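A tiny numerical sketch of this failure mode, under an assumed scalar dynamics s′ = 0.9s and a hypothetical learned model with a 2% one-step error, shows how feeding the model its own output lets the error grow with the horizon:

```python
import numpy as np

# One-step rollout as in Figure 1: the model's output is fed back in as
# input at every step, so a small one-step error compounds over the
# horizon. `f_hat` is a hypothetical learned model; `f_true` is the
# assumed true dynamics.
def rollout_one_step(f_hat, policy, s, n):
    for _ in range(n):
        a = policy(s)
        s = f_hat(s, a)   # the model's output becomes its next input
    return s

f_true = lambda s, a: 0.9 * s   # assumed true dynamics
f_hat = lambda s, a: 0.92 * s   # learned model with a small error
policy = lambda s: 0            # single-action policy, for illustration

s0 = np.array([1.0])
err_1 = abs(rollout_one_step(f_hat, policy, s0, 1)[0] - 0.9 ** 1)
err_20 = abs(rollout_one_step(f_hat, policy, s0, 20)[0] - 0.9 ** 20)
```

Here the 1-step error is 0.02, while the 20-step error is more than three times larger, despite the model being queried with the same small per-step inaccuracy throughout.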
3.2 Multi-step Model
We now articulate an alternative approach to computing multi-step predictions. As we show later, it is easy to use experience to learn models conditioned on action sequences, so the key to our approach is to rewrite the policy-conditional prediction in terms of predictions conditioned on action sequences:

T(s′ | s_t, π, n) = Σ_{a_t, …, a_{t+n−1}} Pr(a_t, …, a_{t+n−1} | s_t, π) T(s′ | s_t, a_t, …, a_{t+n−1}),

where the probability of each action sequence is approximated using deterministic predictions of the intermediate states. This approximation is perfect if the setting is deterministic. Now note that T(s′ | s_t, a_t, …, a_{t+n−1}) is conditioned on an action sequence, so we only need to worry about the other term, namely Pr(a_t, …, a_{t+n−1} | s_t, π). We can rewrite this as follows:

Pr(a_t, …, a_{t+n−1} | s_t, π) = π(a_t | s_t) Pr(a_{t+1}, …, a_{t+n−1} | s_t, a_t, π) ≈ π(a_t | s_t) π(a_{t+1} | f_1(s_t, a_t)) Pr(a_{t+2}, …, a_{t+n−1} | s_t, a_t, a_{t+1}, π).
Continuing for more steps, we get:

Pr(a_t, …, a_{t+n−1} | s_t, π) ≈ ∏_{i=0}^{n−1} π(a_{t+i} | f_i(s_t, a_t, …, a_{t+i−1})), where f_0(s_t) = s_t.
Putting it together, we have:

T(s′ | s_t, π, n) ≈ Σ_{a_t, …, a_{t+n−1}} [ ∏_{i=0}^{n−1} π(a_{t+i} | f_i(s_t, a_t, …, a_{t+i−1})) ] T(s′ | s_t, a_t, …, a_{t+n−1}).
In simple terms, the multi-step model consists of n functions f_1, …, f_n, where each f_i takes as input an action sequence of length i along with a state and predicts the state after i steps. Finally, we can easily sample from the model as follows:

a_t ∼ π(· | s_t), ŝ_{t+1} = f_1(s_t, a_t), a_{t+1} ∼ π(· | ŝ_{t+1}), ŝ_{t+2} = f_2(s_t, a_t, a_{t+1}), …, ŝ_{t+n} = f_n(s_t, a_t, …, a_{t+n−1}).
We illustrate the rollout process in Figure 2. Note the distinction between the two rollout scenarios: unlike the one-step case, we never feed the output of a transition function to another transition function, thereby avoiding the compounding-error problem.
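The contrast with the one-step rollout can be sketched as follows: every lookahead queries f_i on the original state, so no model output is ever re-used as a model input. The `models` list and `policy` here are illustrative stand-ins for the learned f_1, …, f_n and the actor.

```python
import numpy as np

# Multi-step rollout as in Figure 2: the i-step prediction comes from
# f_i(s0, a_1..a_i), always conditioned on the ORIGINAL state, so
# transition functions are never composed with one another.
def rollout_multi_step(models, policy, s0, n):
    actions, s = [], s0
    for i in range(n):
        a = policy(s)               # act on the current prediction
        actions.append(a)
        s = models[i](s0, actions)  # models[i] plays the role of f_{i+1}
    return s
```

If each f_i is exact, the n-step sample matches the true n-step outcome regardless of n, whereas a composed one-step model would accumulate error at every step.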
4 Experiments
In this section we discuss preliminary empirical results comparing the two models. We evaluate the models in terms of prediction accuracy and usefulness for planning.
4.1 Control Domains
We ran our first experiments on three control domains, namely Cart Pole, Acrobot, and Lunar Lander. Public implementations of these domains are available on the web (Brockman et al., 2016). We used the all-actions version of the actor-critic algorithm (Barto et al., 1983; Sutton et al., 2000; Asadi et al., 2017) as the reinforcement learner, where updates to the actor and the critic were performed at the end of each episode. We report all hyperparameters in the Appendix.
Our first goal is to compare the accuracy of the two models. To this end, we performed one episode of environmental interaction using the policy network, then measured the accuracy of the models on the (just terminated) episode. An episode of experience was used for model training only after the models' prediction accuracy had been computed on that episode.
In order to measure prediction accuracy for a given horizon n and a given state in the episode, we computed the squared difference between the model's prediction and the state observed n steps later. We then averaged over all states within the episode. These quantities were computed separately for the two models. We present our findings in Figure 3. The multi-step model generally performed better than the one-step model, except when the horizon was short in the Acrobot problem.
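The evaluation above can be sketched as a small routine; `predict_h` stands for any h-step predictor (a composed one-step model or a direct multi-step model), and all names are illustrative:

```python
import numpy as np

# Average h-step squared prediction error over one episode: compare the
# prediction made from each state against the state actually observed
# h steps later, then average over all starting states in the episode.
def episode_prediction_error(states, actions, predict_h, h):
    errors = []
    for t in range(len(states) - h):
        pred = predict_h(states[t], actions[t:t + h])
        errors.append(np.sum((pred - states[t + h]) ** 2))
    return np.mean(errors)
```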
Although low squared error may indicate a good model, what we actually care about is the effectiveness of a model during planning. To this end, we used the models for planning, specifically for value-function optimization. Recall that, given a batch of experience, many reinforcement learning algorithms perform value-function optimization using some variant of the following update rule:

θ ← θ + α [ y − q̂(s, a; θ) ] ∇_θ q̂(s, a; θ), where y is a target estimate of the value of (s, a).
For example, TD(0) uses the target y = r + γ q̂(s′, a′; θ). Alternatively, using the multi-step model, we can compute an n-step target y as shown in Algorithm 1. We can then compare the effectiveness of value-function optimization using the two models, as well as against model-free baselines. Note that, to ensure a fair comparison, we performed a fixed number of updates to the critic network and varied only the type of update. Results are presented in Figure 4.
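In the spirit of Algorithm 1, a target built with the multi-step model can be sketched as follows: accumulate model-predicted rewards along a policy rollout in which every state prediction is conditioned on the original state, then bootstrap with the critic. All components (`models`, `reward_fn`, `q_fn`) are hypothetical stand-ins for the learned transition functions, reward model, and critic.

```python
import numpy as np

# n-step target using the multi-step model: rewards come from predicted
# transitions, each conditioned on the original state s0, and the critic
# supplies the bootstrapped tail value.
def multi_step_target(s0, policy, models, reward_fn, q_fn, n, gamma):
    target, actions, s = 0.0, [], s0
    for i in range(n):
        a = policy(s)
        actions.append(a)
        s_next = models[i](s0, actions)  # f_{i+1}(s0, a_1..a_{i+1})
        target += (gamma ** i) * reward_fn(s, a, s_next)
        s = s_next
    # Bootstrap: gamma^n times the critic's value at the final prediction.
    target += (gamma ** n) * q_fn(s, policy(s))
    return target
```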
When planning with a one-step model, it was difficult even to outperform model-free learners, unless we carefully chose the prediction horizon. On the other hand, planning with the multi-step model was clearly more effective than the model-free baselines, and it outperformed the one-step model except in the Acrobot domain with small horizons. We also observed an inverted-U shape with respect to the prediction horizon, reflecting a trade-off: with a short prediction horizon there is little value in lookahead, while with a very large prediction horizon lookahead can be misleading due to large model error. An intermediate value therefore works best. Jiang et al. (2015) provided theoretical results that confirm this empirical observation.
4.2 Atari Breakout
Our desire is to use the multi-step model on larger domains. To this end, we trained one-step and multi-step models on Atari Breakout using its public implementation (Brockman et al., 2016). Our dataset consisted of episodes from a trained DQN with an epsilon-greedy policy. Since our goal is to use the model for planning in future work, here we focus only on comparing multi-step prediction error.
The architectures of our one-step and multi-step models were mainly based on Oh et al. (2015): an encoder and a decoder that predicts a frame n steps ahead. For the multi-step model, the sequence of actions was fed into n distinct linear layers and then integrated into the network through a multiplicative gate, as in Oh et al. (2015). We trained the one-step and multi-step models by minimizing mean squared error on a dataset of 100 episodes of environmental interaction. We then tested the models on a held-out dataset.
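The action-conditioning step can be sketched as below. This is a simplified stand-in for the multiplicative gating of Oh et al. (2015), not our exact architecture: each action in the length-n sequence passes through its own linear layer, the embeddings are combined, and the result gates the latent frame encoding elementwise. All names and shapes are illustrative.

```python
import numpy as np

# Simplified multi-step action gating: n per-position linear layers
# embed the action sequence; the combined embedding multiplies the
# latent frame encoding elementwise before decoding.
def one_hot(a, num_actions):
    v = np.zeros(num_actions)
    v[a] = 1.0
    return v

class ActionGatedCore:
    def __init__(self, num_actions, n, latent, seed=0):
        rng = np.random.default_rng(seed)
        # One distinct linear layer per position in the action sequence.
        self.layers = [0.1 * rng.normal(size=(latent, num_actions))
                       for _ in range(n)]

    def forward(self, z, actions):
        # z: latent encoding of the input frame; actions: length-n sequence.
        emb = sum(W @ one_hot(a, W.shape[1])
                  for W, a in zip(self.layers, actions))
        return z * emb   # multiplicative interaction with the encoding
```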
As shown in Figure 5, the one-step model performs well for the first 3 timesteps, but then becomes inaccurate very quickly. Rollouts using the one-step model did not accurately capture the ball and paddle locations, as seen in Figure 6. The multi-step model, on the other hand, accurately predicted the location of the paddle (and sometimes the ball). The multi-step models' error is much lower than that of the one-step model, although it increases slightly with the horizon. This result, though preliminary, is evidence that the multi-step model can be more effective for planning in Atari.
5 Related Work
There have been numerous previous studies of approximate model-based reinforcement learning. Linear models were one of the earliest classes of approximate models studied in the literature (Parr et al., 2008; Sutton et al., 2008). The focus of these papers was learning a single-step linear model and characterizing the solution found when planning with it. Their fundamental result was that the value function found by the planner is the same as the value function found by linear temporal-difference learning. Consistent with our findings, these papers reported little empirical benefit from one-step models.
Deisenroth and Rasmussen (2011) argued that naive planning without explicitly incorporating model uncertainty can be a problem. They proposed a model-based policy-search method based on Gaussian processes that yields a closed-form policy-gradient update. This line of research has recently become more popular (Gal et al., 2016), but note that in these studies planning was performed using a one-step model. We hypothesize that our multi-step model can also benefit from incorporating uncertainty during planning.
The compounding-error problem has been observed in multiple prior studies (Venkatraman et al., 2015; Talvitie, 2014, 2017; Asadi et al., 2018b). In the context of time series, Venkatraman et al. (2015) presented an algorithm that improves multi-step predictions by training the model on its own outputs. Talvitie (2014) presented a similar algorithm for reinforcement learning, referred to as hallucination, and showed that in some cases this training scheme outperforms one-step methods without the hallucination technique. This idea was later investigated theoretically (Talvitie, 2017), and it was shown that pathological cases may arise if the policy is stochastic.
One can also combine model-based and model-free approaches (Sutton, 1990; Yao et al., 2009; Asadi, 2015; Nagabandi et al., 2018), with the motivation of getting the benefits of both. Interestingly, it has recently been discovered that decision making in the brain is also implemented by two processes that closely resemble model-free (habitual) and model-based (goal-based) learning. Two existing theories are that the two processes compete (Daw et al., 2005) or cooperate (Gershman et al., 2014) to make the final decision.
Another interesting line of work is to learn models that, though imperfect, are tailored to the specific planning algorithm that is going to use them. For example, a model can be wrong yet still very useful for computing the fixed point of the value function (Farahmand et al., 2017; Pires and Szepesvári, 2016). This idea is closely related to minimizing the Wasserstein metric as the loss function (Asadi et al., 2018a). Two recent papers (Silver et al., 2017; Oh et al., 2017) proposed a similar idea, namely learning abstract state representations useful for predicting the values of states observed multiple timesteps ahead.

A few prior works have considered multi-step models. Perhaps the first attempt was by Sutton (1995), in which tabular multi-step models were studied. Later, Sutton et al. (1999) proposed options, closed-loop policies with start and termination conditions, and discussed the idea of learning option models in the tabular case. The idea of learning option models was later extended to the linear case (Sorg and Singh, 2010). The main limitation here is that a model is learned per option, so when a new option is considered, a new model must be learned from scratch. Finally, Vanseijen and Sutton (2015) articulated a multi-step model in the linear case and showed connections to temporal-difference learning with eligibility traces. However, that model is only valid for the current policy, which is a limitation during planning.
In terms of empirical results, successful attempts to use one-step approximate models are rare. It is indeed possible to learn reasonable one-step models of Atari games (Oh et al., 2015), as well as of other challenging domains (Feinberg et al., 2018), but these models are usually not very useful for multi-step rollouts. Machado et al. (2018) noted that the main issue is the compounding-error problem associated with one-step models. This is consistent with our findings in the preliminary Atari experiment, and so our multi-step model can provide a promising path to successful planning on Atari games by avoiding the compounding-error problem.
We have noticed that the idea of learning multi-step models conditioned on action sequences is being explored in independent and concurrent work.¹ That work focuses on a cross-entropy method that, inspired by model predictive control, finds the best action sequence by optimizing over a candidate list of action sequences provided to the agent. In contrast, we used the multi-step model for value-function optimization in actor-critic, where an explicit policy representation is used during planning.

¹This work is an anonymous submission at ICLR 2019.
6 Conclusion and Future Work
We presented a simple approach to multi-step model-based reinforcement learning as an alternative to one-step model learning. We found the multi-step model to be more useful than the one-step model on the domains considered in this paper. We believe that this model is an important step towards model-based reinforcement learning in the approximate setting.
Consistent with previous work, we found that composing imperfect one-step models can in general be catastrophic. Moreover, we found that multi-step rollouts are necessary to get the real benefits of planning. Based on our results, we believe that one-step models offer limited merit, and that composition of one-step models should in general be avoided.
The experiments reported in this work are still preliminary. That said, we believe that the reported results make a strong case for the multistep model.
References
 Abbeel et al. (2006) Abbeel, P., Quigley, M., and Ng, A. Y. (2006). Using inaccurate models in reinforcement learning. In Machine Learning, Proceedings of the Twenty-Third International Conference (ICML 2006), Pittsburgh, Pennsylvania, USA, June 25-29, 2006, pages 1–8.
 Asadi (2015) Asadi, K. (2015). Strengths, weaknesses, and combinations of model-based and model-free reinforcement learning. Master's thesis, University of Alberta.
 Asadi et al. (2017) Asadi, K., Allen, C., Roderick, M., Mohamed, A.-r., Konidaris, G., and Littman, M. (2017). Mean actor critic. arXiv preprint arXiv:1709.00503.
 Asadi et al. (2018a) Asadi, K., Cater, E., Misra, D., and Littman, M. L. (2018a). Equivalence between Wasserstein and value-aware model-based reinforcement learning. arXiv preprint arXiv:1806.01265.
 Asadi et al. (2018b) Asadi, K., Misra, D., and Littman, M. L. (2018b). Lipschitz continuity in model-based reinforcement learning. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, pages 264–273.
 Barto et al. (1983) Barto, A. G., Sutton, R. S., and Anderson, C. W. (1983). Neuronlike adaptive elements that can solve difficult learning control problems. IEEE transactions on systems, man, and cybernetics, pages 834–846.
 Box (1976) Box, G. E. (1976). Science and statistics. Journal of the American Statistical Association, 71(356):791–799.
 Brockman et al. (2016) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. (2016). OpenAI Gym.
 Daw et al. (2005) Daw, N. D., Niv, Y., and Dayan, P. (2005). Uncertainty-based competition between prefrontal and dorsolateral striatal systems for behavioral control. Nature Neuroscience, 8(12):1704.
 Deisenroth and Rasmussen (2011) Deisenroth, M. P. and Rasmussen, C. E. (2011). PILCO: A model-based and data-efficient approach to policy search. In Proceedings of the 28th International Conference on Machine Learning, ICML 2011, Bellevue, Washington, USA, June 28 - July 2, 2011, pages 465–472.
 Farahmand et al. (2017) Farahmand, A.-m., Barreto, A., and Nikovski, D. (2017). Value-aware loss function for model-based reinforcement learning. In Artificial Intelligence and Statistics, pages 1486–1494.
 Feinberg et al. (2018) Feinberg, V., Wan, A., Stoica, I., Jordan, M. I., Gonzalez, J. E., and Levine, S. (2018). Model-based value estimation for efficient model-free reinforcement learning. arXiv preprint arXiv:1803.00101.
 Gal et al. (2016) Gal, Y., McAllister, R., and Rasmussen, C. E. (2016). Improving PILCO with Bayesian neural network dynamics models. In Data-Efficient Machine Learning Workshop, ICML.
 Gershman et al. (2014) Gershman, S. J., Markman, A. B., and Otto, A. R. (2014). Retrospective revaluation in sequential decision making: A tale of two systems. Journal of Experimental Psychology: General, 143(1):182.
 Jiang et al. (2015) Jiang, N., Kulesza, A., Singh, S., and Lewis, R. (2015). The dependence of effective planning horizon on model accuracy. In Proceedings of the 2015 International Conference on Autonomous Agents and Multiagent Systems, pages 1181–1189. International Foundation for Autonomous Agents and Multiagent Systems.
 Kocsis and Szepesvári (2006) Kocsis, L. and Szepesvári, C. (2006). Bandit based Monte-Carlo planning. In Machine Learning: ECML 2006, 17th European Conference on Machine Learning, Berlin, Germany, September 18-22, 2006, Proceedings, pages 282–293.
 LeCun et al. (2015) LeCun, Y., Bengio, Y., and Hinton, G. (2015). Deep learning. Nature, 521(7553):436.
 Machado et al. (2018) Machado, M. C., Bellemare, M. G., Talvitie, E., Veness, J., Hausknecht, M., and Bowling, M. (2018). Revisiting the arcade learning environment: Evaluation protocols and open problems for general agents. Journal of Artificial Intelligence Research, 61:523–562.
 Mnih et al. (2015) Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M. A., Fidjeland, A., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., and Hassabis, D. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540):529–533.
 Nagabandi et al. (2018) Nagabandi, A., Kahn, G., Fearing, R. S., and Levine, S. (2018). Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 7559–7566. IEEE.
 Oh et al. (2015) Oh, J., Guo, X., Lee, H., Lewis, R. L., and Singh, S. (2015). Action-conditional video prediction using deep networks in Atari games. In Advances in Neural Information Processing Systems, pages 2863–2871.
 Oh et al. (2017) Oh, J., Singh, S., and Lee, H. (2017). Value prediction network. In Advances in Neural Information Processing Systems, pages 6118–6128.

 Parr et al. (2008) Parr, R., Li, L., Taylor, G., Painter-Wakefield, C., and Littman, M. L. (2008). An analysis of linear models, linear value-function approximation, and feature selection for reinforcement learning. In Machine Learning, Proceedings of the Twenty-Fifth International Conference (ICML 2008), Helsinki, Finland, June 5-9, 2008, pages 752–759.
 Pires and Szepesvári (2016) Pires, B. Á. and Szepesvári, C. (2016). Policy error bounds for model-based reinforcement learning with factored linear models. In Conference on Learning Theory, pages 121–151.
 Puterman (2014) Puterman, M. L. (2014). Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons.
 Silver et al. (2016) Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., van den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., Dieleman, S., Grewe, D., Nham, J., Kalchbrenner, N., Sutskever, I., Lillicrap, T. P., Leach, M., Kavukcuoglu, K., Graepel, T., and Hassabis, D. (2016). Mastering the game of go with deep neural networks and tree search. Nature, 529(7587):484–489.
 Silver et al. (2017) Silver, D., van Hasselt, H., Hessel, M., Schaul, T., Guez, A., Harley, T., Dulac-Arnold, G., Reichert, D. P., Rabinowitz, N. C., Barreto, A., and Degris, T. (2017). The predictron: End-to-end learning and planning. In ICML, volume 70 of Proceedings of Machine Learning Research, pages 3191–3199. PMLR.
 Sorg and Singh (2010) Sorg, J. and Singh, S. (2010). Linear options. In Proceedings of the 9th International Conference on Autonomous Agents and Multiagent Systems: Volume 1, pages 31–38. International Foundation for Autonomous Agents and Multiagent Systems.
 Sutton (1990) Sutton, R. S. (1990). Integrated modeling and control based on reinforcement learning. In Advances in Neural Information Processing Systems 3, [NIPS Conference, Denver, Colorado, USA, November 26-29, 1990], pages 471–478.
 Sutton (1995) Sutton, R. S. (1995). TD models: Modeling the world at a mixture of time scales. In Machine Learning Proceedings 1995, pages 531–539. Elsevier.
 Sutton and Barto (1998) Sutton, R. S. and Barto, A. G. (1998). Reinforcement Learning: An Introduction. Adaptive Computation and Machine Learning. MIT Press.
 Sutton et al. (2000) Sutton, R. S., McAllester, D. A., Singh, S. P., and Mansour, Y. (2000). Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, pages 1057–1063.
 Sutton et al. (1999) Sutton, R. S., Precup, D., and Singh, S. (1999). Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1-2):181–211.
 Sutton et al. (2008) Sutton, R. S., Szepesvári, C., Geramifard, A., and Bowling, M. H. (2008). Dyna-style planning with linear function approximation and prioritized sweeping. In UAI 2008, Proceedings of the 24th Conference in Uncertainty in Artificial Intelligence, Helsinki, Finland, July 9-12, 2008, pages 528–536.
 Talvitie (2014) Talvitie, E. (2014). Model regularization for stable sample rollouts. In UAI, pages 780–789. AUAI Press.
 Talvitie (2017) Talvitie, E. (2017). Self-correcting models for model-based reinforcement learning. In AAAI, pages 2597–2603. AAAI Press.
 Vanseijen and Sutton (2015) Vanseijen, H. and Sutton, R. (2015). A deeper look at planning as learning from replay. In International conference on machine learning, pages 2314–2322.
 Venkatraman et al. (2015) Venkatraman, A., Hebert, M., and Bagnell, J. A. (2015). Improving multi-step prediction of learned time series models. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, AAAI'15, pages 3024–3030.
 Yao et al. (2009) Yao, H., Bhatnagar, S., Diao, D., Sutton, R. S., and Szepesvári, C. (2009). Multi-step Dyna planning for policy evaluation and control. In Advances in Neural Information Processing Systems, pages 2187–2195.
7 Appendix
Here we report hyperparameters to help readers reproduce our results. Upon publication of the paper, we will release our code as well.
                                     |  Cart Pole                 |  Acrobot                        |  Lunar Lander
γ (discount rate)                    |  0.9999                    |  0.9999                         |  0.9999
critic network: hidden layers        |  1                         |  1                              |  1
                                     |  64                        |  64                             |  64
                                     |  [0.1, 0.05, 0.025, 0.01]  |  [0.005, 0.0025, 0.001, 0.0005] |  [0.025, 0.01, 0.005]
                                     |  20                        |  20                             |  20
critic network: batch size           |  32                        |  32                             |  32
actor network: hidden layers         |  1                         |  1                              |  1
                                     |  64                        |  64                             |  64
actor network: step size             |  0.005                     |  0.0005                         |  0.001
                                     |  1                         |  1                              |  1
actor network: batch size            |  length of episode         |  length of episode              |  length of episode
transition functions: hidden layers  |  1                         |  1                              |  1
                                     |  64                        |  64                             |  64
                                     |  0.001                     |  0.01                           |  0.001
                                     |  100                       |  20                             |  100
transition functions: batch size     |  128                       |  1024                           |  128
reward model: hidden layers          |  1                         |  1                              |  1
                                     |  128                       |  128                            |  128
reward model: step size              |  0.1                       |  0.1                            |  0.00025
                                     |  10                        |  10                             |  10
reward model: batch size             |  128                       |  128                            |  128
maximum buffer size                  |  8000                      |  5000                           |  20000
target network update frequency      |  1 episode                 |  1 episode                      |  1 episode
batch size  4 
down sampling residual blocks  4 
up sample residual blocks  4 
residual unit size  128 
latent size  2048 
final conv size  32 
Adam learning rate  0.0001