Reinforcement learning (Sutton and Barto, 1998) has become the framework for studying AI agents that learn, plan, and act in uncertain and sequential environments. This success is partly due to simple and general reinforcement learning algorithms that, when equipped with neural networks, can solve complex domains such as games with massive state spaces (Mnih et al., 2015; Silver et al., 2016). Games usually provide a simulator that can be used to generate a large amount of environmental interaction. In many applications, however, a good simulator is unavailable, so it is imperative to learn good behavior from fewer environmental interactions.
When environmental interaction is expensive but computation is cheap, the model-based approach to reinforcement learning offers a solution by summarizing past experience using a model, informally thought of as an internal simulator of the environment (Sutton and Barto, 1998). An agent can then use the learned model in different capacities. For example, a model can be used to perform tree search (Kocsis and Szepesvári, 2006; Silver et al., 2016), to update the value-function estimate (Sutton, 1990; Parr et al., 2008; Sutton et al., 2008), to update an explicitly represented policy (Abbeel et al., 2006; Deisenroth and Rasmussen, 2011), or to make option-conditional predictions (Sutton et al., 1999; Sorg and Singh, 2010).
Regardless of how a model is utilized, the effectiveness of planning depends on the accuracy of the model. A widely accepted view is that all models are imperfect, but some are still useful (Box, 1976). Generally speaking, modeling errors can be due to overfitting or underfitting, or simply due to the fact that the setting is agnostic, so that some irreducible error exists. In reinforcement learning, for example, partial observability and non-Markovian dynamics can yield an agnostic setting.
In the model-based setting it is common to learn a one-step model of the environment. When performing rollouts using a one-step model, the output of the model is used as the input in the subsequent step. It has been observed, however, that in this case even small modeling errors, which are unavoidable as argued above, can severely degrade multi-step predictions (Talvitie, 2017; Venkatraman et al., 2015; Asadi et al., 2018b). Intuitively, the problem is that the model is not trained to perform reasonably on its own output. The rollout process can then get derailed by moving out of the state space after a few steps.
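To make the compounding-error effect concrete, the following minimal sketch rolls out a one-step model whose per-step error is small. The scalar linear dynamics, the coefficient values, and all function names are illustrative assumptions and are not taken from the experiments in this paper.

```python
def rollout_one_step(step_fn, s0, horizon):
    """Repeatedly feed the function's output back in as its next input."""
    states = [s0]
    for _ in range(horizon):
        states.append(step_fn(states[-1]))
    return states


# Hypothetical scalar environment: s_{t+1} = 1.10 * s_t.
def true_step(s):
    return 1.10 * s


# Hypothetical learned model with a small error in the coefficient.
def model_step(s):
    return 1.12 * s


horizon = 20
true_states = rollout_one_step(true_step, 1.0, horizon)
model_states = rollout_one_step(model_step, 1.0, horizon)

# Absolute prediction error at each horizon h = 0, ..., 20.
errors = [abs(m - t) for m, t in zip(model_states, true_states)]
for h in (1, 5, 10, 20):
    print(f"h={h:2d}  error={errors[h]:.4f}")
```

In this toy setting the one-step error is small, yet the error grows with every additional step because each prediction is built on the previous one.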
Here we study a direct approach to computing multi-step predictions. We propose a simple multi-step model that learns to predict the outcome of an action sequence with variable length. We show that in terms of prediction accuracy our proposed model outperforms a one-step model when the prediction horizon is large. We further show that the model can provide the outcome of running a policy rather than just a fixed sequence of actions. To show the effectiveness of the multi-step model, we use it for value-function optimization in the context of actor-critic reinforcement learning. Moreover, we report preliminary results on Atari Breakout showing that the multi-step model outperforms the one-step model in terms of frame prediction after multiple timesteps.
2 Background and Notation
We briefly present the background and the notation used to articulate our results. For a complete background on reinforcement learning see Sutton and Barto (1998), and for a complete background on MDPs see Puterman (2014).
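We focus on the reinforcement learning problem in which an agent interacts with an environment to maximize long-term reward. The problem is formulated as a Markov Decision Process, or simply an MDP. The tuple $\langle \mathcal{S}, \mathcal{A}, R, T, \gamma \rangle$ represents an MDP, where we assume a continuous state-space $\mathcal{S}$ and a discrete action-space $\mathcal{A}$. Our goal is to find a policy $\pi : \mathcal{S} \to \Pr(\mathcal{A})$ that achieves high $\gamma$-discounted cumulative reward.
Here $R$ and $T$ represent the reward and transition dynamics of the MDP. More specifically, $R : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}$ is the reward after taking an action $a$ in a state $s$ and moving to a next state $s'$. This is denoted by $R(s, a, s')$. For transition dynamics we use an overloaded notation. In the simplest case, the transition function takes as input a state, a next state, and an action:
$$T(s, s', a) := \Pr(s_{t+1} = s' \mid s_t = s, a_t = a).$$
Similarly, we define a transition function conditioned on a state and a sequence of actions:
$$T_n(s, s', a_1, \dots, a_n) := \Pr(s_{t+n} = s' \mid s_t = s, a_t = a_1, \dots, a_{t+n-1} = a_n),$$
as well as conditioned on a state and a policy:
$$T_n(s, s', \pi) := \Pr(s_{t+n} = s' \mid s_t = s, \pi).$$
It is useful to compute the long-term goodness of taking an action $a$ in a state $s$. Referred to as the value function, this is defined as follows:
$$Q^{\pi}(s, a) := \mathbb{E}\Big[\sum_{i=0}^{\infty} \gamma^{i} R(s_{t+i}, a_{t+i}, s_{t+i+1}) \,\Big|\, s_t = s, a_t = a, \pi\Big].$$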
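Since the value function is an expected discounted sum of rewards, it can be estimated by averaging the discounted returns of sampled trajectories. The sketch below illustrates this under simplifying assumptions: the function names and the reward sequences are hypothetical, and trajectories are truncated to a finite length.

```python
def discounted_return(rewards, gamma):
    """Return sum_i gamma**i * rewards[i], accumulated backwards."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


def monte_carlo_q(sampled_reward_sequences, gamma):
    """Average the discounted returns of several sampled trajectories that
    all start with the same state-action pair."""
    returns = [discounted_return(rs, gamma) for rs in sampled_reward_sequences]
    return sum(returns) / len(returns)


# Hypothetical reward sequences observed after taking action a in state s.
samples = [[1.0, 1.0, 1.0], [1.0, 0.0, 0.0]]
q_estimate = monte_carlo_q(samples, gamma=0.5)
print(q_estimate)
```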
3 Policy-Conditional Prediction
In this section we study the following basic question: how can we learn to predict the outcome of executing a policy $\pi$ in a state $s$ given a horizon $H$? More specifically, our aim is to compute (or sample from) the following distribution:
$$T_H(s, s', \pi) = \Pr(s_{t+H} = s' \mid s_t = s, \pi).$$
We explain the standard approach in the next section, and introduce our approach in the subsequent section.
3.1 Single-step Model
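The common way of making the $H$-step prediction is to break the problem into $H$ single-step prediction problems as follows:
$$T_H(s, s', \pi) = \int_{s_1} \cdots \int_{s_{H-1}} T_1(s, s_1, \pi) \, T_1(s_1, s_2, \pi) \cdots T_1(s_{H-1}, s', \pi) \, ds_1 \cdots ds_{H-1}.$$
The one-step probability distribution could further be written as:
$$T_1(s, s', \pi) = \sum_{a \in \mathcal{A}} \pi(a \mid s) \, T(s, s', a).$$
Note that in a continuous state-space it is easier to work with a deterministic model than with the full distribution, so following prior work (Parr et al., 2008; Sutton et al., 2008) we use a deterministic approximation $\widehat{T}$ that learns to output the expected next state by minimizing mean squared error:
$$\widehat{T} = \arg\min_{f} \; \mathbb{E}_{(s, a, s')}\big[\lVert f(s, a) - s' \rVert_2^2\big], \quad \text{so that} \quad \widehat{T}(s, a) \approx \mathbb{E}[s' \mid s, a].$$
We represent this model, as well as all parameterized functions, using deep neural networks (LeCun et al., 2015). This approximation can be poor in general, but is theoretically justified even when the environment is stochastic (Parr et al., 2008; Sutton et al., 2008).
Finally, using this model, we can easily sample a state $H$ steps into the future:
$$a_h \sim \pi(\cdot \mid \hat{s}_{h-1}), \qquad \hat{s}_h = \widehat{T}(\hat{s}_{h-1}, a_h), \qquad h = 1, \dots, H, \quad \hat{s}_0 = s.$$
Figure 1 illustrates the rollout process. The downside of this approach is that the model’s output is provided as input $H-1$ times, yet the model is not trained on its own output. In practice, this causes the compounding error problem.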
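The one-step rollout described above can be sketched as follows. This is a minimal illustration under stated assumptions: `toy_model` and `toy_policy` are hypothetical stand-ins for a learned deterministic one-step model and a policy network, and the one-dimensional state is chosen only for readability.

```python
import random


def rollout_one_step_model(model, policy, state, horizon, rng):
    """Sample a state `horizon` steps ahead by composing a one-step model.

    After the first step, the model only ever sees its own predictions.
    """
    for _ in range(horizon):
        action = policy(state, rng)   # a ~ pi(. | current predicted state)
        state = model(state, action)  # the prediction becomes the next input
    return state


# Hypothetical 1-D example: two actions move the state left or right.
def toy_model(state, action):
    return state + (1.0 if action == 1 else -1.0)


def toy_policy(state, rng):
    # Move right with probability 0.8, regardless of the state.
    return 1 if rng.random() < 0.8 else 0


rng = random.Random(0)
predicted = rollout_one_step_model(toy_model, toy_policy, 0.0, horizon=5, rng=rng)
print(predicted)
```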
3.2 Multi-step Model
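We now articulate an alternative approach to computing multi-step predictions. As we show later, it is quite easy to use experience to learn models conditioned on action sequences. The key to our approach is therefore to rewrite the policy-conditional prediction in terms of predictions conditioned on action sequences:
$$T_H(s, s', \pi) \approx \sum_{a_1, \dots, a_H} \Pr(a_1, \dots, a_H \mid s, \pi) \, T_H(s, s', a_1, \dots, a_H).$$
This approximation is perfect if the setting is deterministic. Now note that $T_H(s, s', a_1, \dots, a_H)$ is conditioned on an action sequence, so we only need to worry about the other term, namely $\Pr(a_1, \dots, a_H \mid s, \pi)$. We can rewrite this as follows:
$$\Pr(a_1, \dots, a_H \mid s, \pi) = \pi(a_1 \mid s) \, \Pr(a_2, \dots, a_H \mid s, a_1, \pi).$$
Continuing for more steps we get:
$$\Pr(a_1, \dots, a_H \mid s, \pi) = \pi(a_1 \mid s) \prod_{h=2}^{H} \Pr(a_h \mid s, a_1, \dots, a_{h-1}, \pi) \approx \pi(a_1 \mid s) \prod_{h=2}^{H} \pi\big(a_h \mid \widehat{T}_{h-1}(s, a_1, \dots, a_{h-1})\big).$$
Putting it together, we have:
$$T_H(s, s', \pi) \approx \sum_{a_1, \dots, a_H} \pi(a_1 \mid s) \prod_{h=2}^{H} \pi\big(a_h \mid \widehat{T}_{h-1}(s, a_1, \dots, a_{h-1})\big) \, T_H(s, s', a_1, \dots, a_H).$$
In simple terms, the multi-step model consists of $H$ functions $\widehat{T}_1, \dots, \widehat{T}_H$, where each function $\widehat{T}_h$ takes as input an action sequence of length $h$ along with a state, and predicts the state after $h$ steps. Finally, we can easily sample from the model as follows:
$$a_h \sim \pi(\cdot \mid \hat{s}_{h-1}), \qquad \hat{s}_h = \widehat{T}_h(s, a_1, \dots, a_h), \qquad h = 1, \dots, H, \quad \hat{s}_0 = s.$$
We illustrate the rollout process in Figure 2. Note the distinction between the two rollout scenarios. Unlike the one-step case, we never feed the output of a transition function to another transition function, thereby avoiding the compounding error problem.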
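As a rough illustration of the multi-step idea, the sketch below shows one way to extract training tuples for the $h$-step function from a recorded episode, and one way to roll out a policy with a set of such functions. All names are hypothetical, and `toy_multistep` is an exact hand-written stand-in for what would in practice be learned networks.

```python
import random


def make_multistep_dataset(states, actions, h):
    """Build training tuples (s_t, (a_t, ..., a_{t+h-1}), s_{t+h}) from one
    episode, where len(states) == len(actions) + 1."""
    data = []
    for t in range(len(actions) - h + 1):
        data.append((states[t], tuple(actions[t:t + h]), states[t + h]))
    return data


def rollout_multistep(models, policy, s0, horizon, rng):
    """Sample a state `horizon` steps ahead with a multi-step model.

    models[h - 1] maps (s0, first h actions) to a predicted state h steps
    ahead. Every model call takes the real start state s0 as input; model
    outputs are only passed to the policy to pick the next action.
    """
    actions = []
    state = s0
    for h in range(1, horizon + 1):
        actions.append(policy(state, rng))
        state = models[h - 1](s0, tuple(actions))
    return state


# Hypothetical 1-D example in which action 1 moves right and 0 moves left.
def toy_multistep(s, action_seq):
    return s + sum(1.0 if a == 1 else -1.0 for a in action_seq)


def always_right(state, rng):
    return 1


dataset = make_multistep_dataset([0, 1, 2, 1], [1, 1, 0], h=2)
final_state = rollout_multistep([toy_multistep] * 3, always_right, 0.0, 3,
                                random.Random(0))
print(dataset, final_state)
```

Note that the policy still consumes predicted states, but no transition function ever receives another transition function's output as its state input.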
4 Empirical Results
In this section we discuss preliminary empirical results comparing the two models. We evaluate the models in terms of prediction accuracy and usefulness for planning.
4.1 Control Domains
We ran our first experiments on three control domains, namely Cart Pole, Acrobot, and Lunar Lander. Public implementations of these domains are available on the web (Brockman et al., 2016). We used the all-actions version of the actor-critic algorithm (Barto et al., 1983; Sutton et al., 2000; Asadi et al., 2017) as the reinforcement learner, where updates to the actor and critic were performed at the end of each episode. We report all hyper-parameters in the Appendix.
Our first goal is to compare the accuracy of the two models. To this end, we performed one episode of environmental interaction using the policy network, then measured the accuracy of the models on the (just terminated) episode. An episode of experience was used for model training only after the model’s prediction accuracy was computed for the episode.
In order to measure prediction accuracy for a given horizon $H$ and a given state $s_t$ in the episode, we computed the squared difference between the model’s prediction and the state observed after $H$ steps, $s_{t+H}$. We then took the average over all states within an episode. These quantities were computed separately for both models. We present our findings in Figure 3. The multi-step model generally performed better than the one-step model, other than when the horizon was short in the Acrobot problem.
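The accuracy measure described above can be sketched as follows. The interface is an assumption made for illustration: `predict(t)` is a hypothetical callback that returns a model's prediction of the state observed `horizon` steps after time `t`, so the same function can score either type of model.

```python
def mean_squared_horizon_error(states, predict, horizon):
    """Average squared H-step prediction error over one episode.

    `states` is the list of observed state vectors; `predict(t)` returns the
    model's prediction of states[t + horizon] made from states[t].
    """
    if len(states) <= horizon:
        raise ValueError("episode is too short for this horizon")
    errors = []
    for t in range(len(states) - horizon):
        target = states[t + horizon]
        guess = predict(t)
        errors.append(sum((g - y) ** 2 for g, y in zip(guess, target)))
    return sum(errors) / len(errors)


# Hypothetical episode with 1-D states and a model that overshoots by 0.5.
episode = [[0.0], [1.0], [2.0], [3.0]]


def biased_predict(t):
    return [episode[t][0] + 2.5]


mse = mean_squared_horizon_error(episode, biased_predict, horizon=2)
print(mse)
```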
Although low squared error may be an indication of a good model, what we actually care about is the effectiveness of a model during planning. To this end, we used the models for planning, specifically for value-function optimization. Recall that, given a batch of experience $\langle s_t, a_t, r_t, s_{t+1} \rangle$, many reinforcement learning algorithms perform value-function optimization using some variant of the following update rule:
$$\theta \leftarrow \theta + \alpha \big(G_t - \widehat{Q}(s_t, a_t; \theta)\big) \nabla_{\theta} \widehat{Q}(s_t, a_t; \theta).$$
For example, TD(0) uses the target $G_t = r_t + \gamma \widehat{Q}(s_{t+1}, a_{t+1}; \theta)$. Alternatively, using the multi-step model we can compute $G_t$ as shown in Algorithm 1. We can then compare the effectiveness of value-function optimization using the two models as well as using model-free baselines. Note that to ensure a fair comparison, we performed a fixed number of updates to the critic network, and just varied the type of updates. Results are presented in Figure 4.
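The role of the target in such an update can be illustrated with a small sketch. It assumes a linear value estimate and a one-step bootstrapped target; it does not reproduce Algorithm 1, and the function names and numbers are hypothetical. A model-based variant would change only how the target is computed, leaving the update itself unchanged.

```python
def td0_target(reward, next_value, gamma, terminal=False):
    """One-step bootstrapped target: r + gamma * V(s'), or r at episode end."""
    return reward if terminal else reward + gamma * next_value


def value_update(weights, features, target, step_size):
    """Move a linear value estimate w . phi toward `target`."""
    estimate = sum(w * x for w, x in zip(weights, features))
    td_error = target - estimate
    return [w + step_size * td_error * x for w, x in zip(weights, features)]


# Hypothetical transition: reward 1, bootstrapped next value 2, gamma 0.9.
target = td0_target(reward=1.0, next_value=2.0, gamma=0.9)
new_weights = value_update([0.0, 0.0], [1.0, 0.0], target, step_size=0.5)
print(target, new_weights)
```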
When planning with a one-step model, it was difficult to even outperform model-free learners unless we carefully chose the prediction horizon. On the other hand, planning with the multi-step model was clearly more effective than the model-free baselines, and outperformed the one-step model except in the Acrobot domain and for small values of the horizon. We also observed an inverted-U shape with respect to the prediction horizon, which reflects a trade-off: with a short prediction horizon there is little value in look-aheads, and with a very large prediction horizon look-aheads can be misleading due to large error. Therefore, an intermediate value works best. Jiang et al. (2015) provided theoretical results that support this empirical observation.
4.2 Atari Breakout
We would like to use the multi-step model on larger domains. To this end, we trained one-step and multi-step models on Atari Breakout using its public implementation (Brockman et al., 2016). Our dataset consisted of episodes from a trained DQN with an epsilon-greedy policy. We intend to use the model for planning in future work, so here we focus only on comparing multi-step prediction error.
The architectures of our one-step and multi-step models were mainly based on Oh et al. (2015). Each consisted of an encoder and a decoder that predicts a frame n steps ahead. For the multi-step model, the sequence of actions was fed into n distinct linear layers, then integrated into the network through a multiplication gate as in Oh et al. (2015). We trained the one-step and multi-step models by minimizing mean squared error on a dataset of 100 episodes of environmental interaction. We then tested the models on a held-out dataset.
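A multiplicative action gate of the kind mentioned above can be sketched in a few lines. The sketch makes several assumptions that the text does not specify: actions are one-hot encoded, each position in the action sequence has its own linear layer, and the per-action factors are combined by elementwise multiplication. The weights and helper names are hypothetical.

```python
def linear(weight, vec):
    """Matrix-vector product with `weight` given as a list of rows."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weight]


def one_hot(index, size):
    return [1.0 if i == index else 0.0 for i in range(size)]


def gate(encoding, actions, action_weights, num_actions):
    """Multiply the state encoding elementwise by one factor per action.

    action_weights[i] is the linear layer reserved for the i-th action in
    the sequence (an assumed way of combining several actions).
    """
    gated = list(encoding)
    for action, weight in zip(actions, action_weights):
        factor = linear(weight, one_hot(action, num_actions))
        gated = [g * f for g, f in zip(gated, factor)]
    return gated


# Hypothetical 2-D encoding, two actions, and a length-2 action sequence.
w_first = [[1.0, 2.0], [3.0, 4.0]]
w_second = [[0.5, 1.0], [1.0, 0.5]]
gated = gate([1.0, 2.0], actions=(1, 0), action_weights=[w_first, w_second],
             num_actions=2)
print(gated)
```

In a full model the gated encoding would then be passed to the decoder that produces the predicted frame.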
As shown in Figure 5, we observed that the one-step model performs well for the first 3 time steps, but then becomes inaccurate very quickly. Rollouts using the one-step model did not accurately capture the ball and paddle locations, as seen in Figure 6. On the other hand, the multi-step model accurately predicted the location of the paddle (and sometimes the ball). The multi-step model’s error is much lower than that of the one-step model, although the error increases slightly as the horizon increases. This result, though preliminary, is evidence that the multi-step model can be more effective for planning in Atari.
5 Related Work
There have been numerous previous studies concerning approximate model-based reinforcement learning. Linear models were one of the earliest classes of approximate models studied in the literature (Parr et al., 2008; Sutton et al., 2008). The focus of these papers was learning a single-step linear model and the solution found when planning with the model. Their fundamental result was to show that the value function found by the planner is the same as the value function found by linear temporal difference learning. Consistent with our findings, these papers reported little empirical benefit from one-step models.
Deisenroth and Rasmussen (2011) argued that naive planning without explicitly incorporating model uncertainty can be a problem. They proposed a model-based policy-search method based on Gaussian processes to come up with a closed-form policy-gradient update. This line of research has recently become more popular (Gal et al., 2016), but note that in these studies planning was performed using a one-step model. We hypothesize that our multi-step model can also benefit from incorporating uncertainty during planning.
The compounding error problem has been observed in multiple prior studies (Venkatraman et al., 2015; Talvitie, 2014, 2017; Asadi et al., 2018b). In the context of time series, an algorithm was presented that improves multi-step predictions by training the model on its own outputs (Venkatraman et al., 2015). Talvitie (2014) presented a similar algorithm for reinforcement learning, referred to as hallucination, and showed that in some cases the new training scheme outperforms one-step methods without the hallucination technique. Later on this idea was theoretically investigated (Talvitie, 2017), and it was shown that pathological cases may arise if the policy is stochastic.
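The general idea of training a model on its own outputs can be sketched as a data-generation step: roll the current model along a recorded episode and pair each input the model actually encounters with the true next state. This is a simplified illustration in the spirit of the schemes cited above, not the exact algorithm of any of those papers; the function names and the biased toy model are hypothetical.

```python
def self_generated_pairs(model, states, actions):
    """Pair each model-visited input with the true next state.

    `states` holds the observed states of one episode and `actions` the
    recorded actions, with len(states) == len(actions) + 1.
    """
    pairs = []
    predicted = states[0]  # the first input is a real state
    for t, action in enumerate(actions):
        pairs.append(((predicted, action), states[t + 1]))
        predicted = model(predicted, action)  # later inputs are predictions
    return pairs


# Hypothetical model that overshoots the true dynamics s' = s + a by 0.5.
def biased_model(state, action):
    return state + action + 0.5


pairs = self_generated_pairs(biased_model, [0.0, 1.0, 2.0, 3.0], [1, 1, 1])
print(pairs)
```

Training on such pairs asks the model to map its own drifted predictions back to the states that were really observed.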
One can combine model-based and model-free approaches (Sutton, 1990; Yao et al., 2009; Asadi, 2015; Nagabandi et al., 2018). The motivation here is to get the benefits of both approaches. Interestingly, it has recently been discovered that decision making in the brain is also implemented using two processes that closely resemble model-free (habitual) and model-based (goal-based) learning. Two existing theories are that the two processes compete (Daw et al., 2005) or cooperate (Gershman et al., 2014) to make the final decisions.
Another interesting line of work is to learn models that, though imperfect, are tailored to the specific planning algorithm that is going to use them. For example, the model can be wrong, but it can still be very useful in terms of computing the fixed point of the value function (Farahmand et al., 2017; Pires and Szepesvári, 2016). This idea is closely related to minimizing the Wasserstein metric as the loss function (Asadi et al., 2018a). Two recent papers (Silver et al., 2017; Oh et al., 2017) proposed a similar idea, namely to learn abstract state representations useful to predict values of states observed multiple timesteps ahead.
A few prior works have considered multi-step models. Perhaps the first attempt was presented by Sutton (1995), in which tabular multi-step models were studied. Later on, Sutton et al. (1999) proposed options, closed-loop policies with start and termination conditions. They then discussed the idea of learning option models in the tabular case. The idea of learning option models was later extended to the linear case (Sorg and Singh, 2010). The main limitation here is that a model is learned per option, so when a new option is considered, a new model must be learned from scratch. Finally, Vanseijen and Sutton (2015) articulated a multi-step model in the linear case and showed connections to temporal difference learning with eligibility traces. However, the model is only valid for the current policy, which is a limitation during planning.
In terms of empirical results, successful attempts to use one-step approximate models are rare. Indeed it is possible to learn reasonable one-step models of Atari games (Oh et al., 2015), as well as other challenging domains (Feinberg et al., 2018), but these models are usually not very useful for multi-step rollouts. Machado et al. (2018) noted that the main issue is the compounding error problem associated with one-step models. This is consistent with our findings on the preliminary Atari experiment, and so, our multi-step model can provide a promising path to successful planning on Atari games by avoiding the compounding error problem.
We have noticed that the idea of learning multi-step models conditioned on action sequences is being explored in an independent and concurrent work (an anonymous submission at ICLR 2019). The work focuses on a cross-entropy method that, inspired by model predictive control, learns best action-sequences by optimizing from a candidate list of action sequences provided to the agent. In contrast, we used the multi-step model for value-function optimization for actor-critic in which an explicit policy representation was used during planning.
6 Conclusion and Future Work
We presented a simple approach to multi-step model-based reinforcement learning as an alternative to one-step model learning. We found the multi-step model to be more useful than the one-step model on the domains considered in this paper. We believe that this model is an important step towards model-based reinforcement learning in the approximate setting.
Consistent with previous work, we found that composition of imperfect one-step models can in general be catastrophic. Moreover, we found that multi-step rollouts are necessary in order to get the real benefits of planning. Based on our results, we believe that one-step models offer limited merits, and that composition of one-step models should in general be avoided.
The experiments reported in this work are still preliminary. That said, we believe that the reported results make a strong case for the multi-step model.
References
- Abbeel et al. (2006) Abbeel, P., Quigley, M., and Ng, A. Y. (2006). Using inaccurate models in reinforcement learning. In Machine Learning, Proceedings of the Twenty-Third International Conference (ICML 2006), Pittsburgh, Pennsylvania, USA, June 25-29, 2006, pages 1–8.
- Asadi (2015) Asadi, K. (2015). Strengths, weaknesses, and combinations of model-based and model-free reinforcement learning. Master’s thesis, University of Alberta.
- Asadi et al. (2017) Asadi, K., Allen, C., Roderick, M., Mohamed, A.-r., Konidaris, G., and Littman, M. (2017). Mean actor critic. arXiv preprint arXiv:1709.00503.
- Asadi et al. (2018a) Asadi, K., Cater, E., Misra, D., and Littman, M. L. (2018a). Equivalence between Wasserstein and value-aware model-based reinforcement learning. arXiv preprint arXiv:1806.01265.
- Asadi et al. (2018b) Asadi, K., Misra, D., and Littman, M. L. (2018b). Lipschitz continuity in model-based reinforcement learning. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, pages 264–273.
- Barto et al. (1983) Barto, A. G., Sutton, R. S., and Anderson, C. W. (1983). Neuronlike adaptive elements that can solve difficult learning control problems. IEEE transactions on systems, man, and cybernetics, pages 834–846.
- Box (1976) Box, G. E. (1976). Science and statistics. Journal of the American Statistical Association, 71(356):791–799.
- Brockman et al. (2016) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. (2016). OpenAI Gym.
- Daw et al. (2005) Daw, N. D., Niv, Y., and Dayan, P. (2005). Uncertainty-based competition between prefrontal and dorsolateral striatal systems for behavioral control. Nature neuroscience, 8(12):1704.
- Deisenroth and Rasmussen (2011) Deisenroth, M. P. and Rasmussen, C. E. (2011). PILCO: A model-based and data-efficient approach to policy search. In Proceedings of the 28th International Conference on Machine Learning, ICML 2011, Bellevue, Washington, USA, June 28 - July 2, 2011, pages 465–472.
- Farahmand et al. (2017) Farahmand, A.-m., Barreto, A., and Nikovski, D. (2017). Value-aware loss function for model-based reinforcement learning. In Artificial Intelligence and Statistics, pages 1486–1494.
- Feinberg et al. (2018) Feinberg, V., Wan, A., Stoica, I., Jordan, M. I., Gonzalez, J. E., and Levine, S. (2018). Model-based value estimation for efficient model-free reinforcement learning. arXiv preprint arXiv:1803.00101.
- Gal et al. (2016) Gal, Y., McAllister, R., and Rasmussen, C. E. (2016). Improving PILCO with Bayesian neural network dynamics models. In Data-Efficient Machine Learning workshop, ICML.
- Gershman et al. (2014) Gershman, S. J., Markman, A. B., and Otto, A. R. (2014). Retrospective revaluation in sequential decision making: A tale of two systems. Journal of Experimental Psychology: General, 143(1):182.
- Jiang et al. (2015) Jiang, N., Kulesza, A., Singh, S., and Lewis, R. (2015). The dependence of effective planning horizon on model accuracy. In Proceedings of the 2015 International Conference on Autonomous Agents and Multiagent Systems, pages 1181–1189. International Foundation for Autonomous Agents and Multiagent Systems.
- Kocsis and Szepesvári (2006) Kocsis, L. and Szepesvári, C. (2006). Bandit based Monte-Carlo planning. In Machine Learning: ECML 2006, 17th European Conference on Machine Learning, Berlin, Germany, September 18-22, 2006, Proceedings, pages 282–293.
- LeCun et al. (2015) LeCun, Y., Bengio, Y., and Hinton, G. (2015). Deep learning. Nature, 521(7553):436.
- Machado et al. (2018) Machado, M. C., Bellemare, M. G., Talvitie, E., Veness, J., Hausknecht, M., and Bowling, M. (2018). Revisiting the arcade learning environment: Evaluation protocols and open problems for general agents. Journal of Artificial Intelligence Research, 61:523–562.
- Mnih et al. (2015) Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M. A., Fidjeland, A., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., and Hassabis, D. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540):529–533.
- Nagabandi et al. (2018) Nagabandi, A., Kahn, G., Fearing, R. S., and Levine, S. (2018). Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 7559–7566. IEEE.
- Oh et al. (2015) Oh, J., Guo, X., Lee, H., Lewis, R. L., and Singh, S. (2015). Action-conditional video prediction using deep networks in Atari games. In Advances in Neural Information Processing Systems, pages 2863–2871.
- Oh et al. (2017) Oh, J., Singh, S., and Lee, H. (2017). Value prediction network. In Advances in Neural Information Processing Systems, pages 6118–6128.
- Parr et al. (2008) Parr, R., Li, L., Taylor, G., Painter-Wakefield, C., and Littman, M. L. (2008). An analysis of linear models, linear value-function approximation, and feature selection for reinforcement learning. In Machine Learning, Proceedings of the Twenty-Fifth International Conference (ICML 2008), Helsinki, Finland, June 5-9, 2008, pages 752–759.
- Pires and Szepesvári (2016) Pires, B. Á. and Szepesvári, C. (2016). Policy error bounds for model-based reinforcement learning with factored linear models. In Conference on Learning Theory, pages 121–151.
- Puterman (2014) Puterman, M. L. (2014). Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons.
- Silver et al. (2016) Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., van den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., Dieleman, S., Grewe, D., Nham, J., Kalchbrenner, N., Sutskever, I., Lillicrap, T. P., Leach, M., Kavukcuoglu, K., Graepel, T., and Hassabis, D. (2016). Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489.
- Silver et al. (2017) Silver, D., van Hasselt, H., Hessel, M., Schaul, T., Guez, A., Harley, T., Dulac-Arnold, G., Reichert, D. P., Rabinowitz, N. C., Barreto, A., and Degris, T. (2017). The predictron: End-to-end learning and planning. In ICML, volume 70 of Proceedings of Machine Learning Research, pages 3191–3199. PMLR.
- Sorg and Singh (2010) Sorg, J. and Singh, S. (2010). Linear options. In Proceedings of the 9th International Conference on Autonomous Agents and Multiagent Systems: Volume 1, pages 31–38. International Foundation for Autonomous Agents and Multiagent Systems.
- Sutton (1990) Sutton, R. S. (1990). Integrated modeling and control based on reinforcement learning. In Advances in Neural Information Processing Systems 3, [NIPS Conference, Denver, Colorado, USA, November 26-29, 1990], pages 471–478.
- Sutton (1995) Sutton, R. S. (1995). TD models: Modeling the world at a mixture of time scales. In Machine Learning Proceedings 1995, pages 531–539. Elsevier.
- Sutton and Barto (1998) Sutton, R. S. and Barto, A. G. (1998). Reinforcement learning - an introduction. Adaptive computation and machine learning. MIT Press.
- Sutton et al. (2000) Sutton, R. S., McAllester, D. A., Singh, S. P., and Mansour, Y. (2000). Policy gradient methods for reinforcement learning with function approximation. In Advances in neural information processing systems, pages 1057–1063.
- Sutton et al. (1999) Sutton, R. S., Precup, D., and Singh, S. (1999). Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1-2):181–211.
- Sutton et al. (2008) Sutton, R. S., Szepesvári, C., Geramifard, A., and Bowling, M. H. (2008). Dyna-style planning with linear function approximation and prioritized sweeping. In UAI 2008, Proceedings of the 24th Conference in Uncertainty in Artificial Intelligence, Helsinki, Finland, July 9-12, 2008, pages 528–536.
- Talvitie (2014) Talvitie, E. (2014). Model regularization for stable sample rollouts. In UAI, pages 780–789. AUAI Press.
- Talvitie (2017) Talvitie, E. (2017). Self-correcting models for model-based reinforcement learning. In AAAI, pages 2597–2603. AAAI Press.
- Vanseijen and Sutton (2015) Vanseijen, H. and Sutton, R. (2015). A deeper look at planning as learning from replay. In International conference on machine learning, pages 2314–2322.
- Venkatraman et al. (2015) Venkatraman, A., Hebert, M., and Bagnell, J. A. (2015). Improving multi-step prediction of learned time series models. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, AAAI’15, pages 3024–3030.
- Yao et al. (2009) Yao, H., Bhatnagar, S., Diao, D., Sutton, R. S., and Szepesvári, C. (2009). Multi-step Dyna planning for policy evaluation and control. In Advances in Neural Information Processing Systems, pages 2187–2195.
Appendix
Here we report hyperparameters to help readers reproduce our results. Upon publication of the paper, we will also release our code.
| Hyperparameter | Cart Pole | Acrobot | Lunar Lander |
|---|---|---|---|
| critic network: hidden layers | 1 | 1 | 1 |
| critic network: step size | [0.1, 0.05, 0.025, 0.01] | [0.005, 0.0025, 0.001, 0.0005] | [0.025, 0.01, 0.005] |
| critic network: batch size | 32 | 32 | 32 |
| actor network: hidden layers | 1 | 1 | 1 |
| actor network: step size | 0.005 | 0.0005 | 0.001 |
| actor network: batch size | length of episode | length of episode | length of episode |
| transition functions: hidden layers | 1 | 1 | 1 |
| transition functions: batch size | 128 | 1024 | 128 |
| reward model: hidden layers | 1 | 1 | 1 |
| reward model: step size | 0.1 | 0.1 | 0.00025 |
| reward model: batch size | 128 | 128 | 128 |
| maximum buffer size | 8000 | 5000 | 20000 |
| target network update frequency | 1 episode | 1 episode | 1 episode |
| Hyperparameter | Value |
|---|---|
| down sampling residual blocks | 4 |
| up sample residual blocks | 4 |
| residual unit size | 128 |
| final conv size | 32 |
| adam learning rate | 0.0001 |