Planning agents find actions at each decision point by considering future scenarios from their current state against a model of their world (Lavalle, 1998; Kocsis & Szepesvári, 2006; Stentz, 1995; van den Berg et al., 2006). Though typically slower at decision time than model-free agents, agents which use planning can be configured and tuned with explicit constraints. Planning-based methods can also reduce the compounding of errors in sequential decisions by directly testing the long-term consequences of action choices, balancing exploitation and exploration, and generally limiting issues with long-term credit assignment.
Model-free reinforcement learning approaches are often sample inefficient, requiring millions of steps to jointly learn environment features and a control policy. Agents which employ decision-time planning techniques, on the other hand, do not explicitly require any training prior to decision time. However, to perform well, planning-based agents need a very accurate future model of their environment for evaluating actions. A perfect model of the future to perform forward planning is usually not possible outside of computer games or simulations. In this paper, we demonstrate how we can leverage recent improvements in generative modeling to create powerful dynamics models that can be used for forward planning.
In this paper we discuss an approach for learning conditional models of an environment in an unsupervised manner, and demonstrate the utility of this model for decision-time planning in a dynamic environment. Autoregressive models have shown great results in generating raw images, video, and audio (van den Oord et al., 2016a, b; Kalchbrenner et al., 2016), but have generally been considered too slow for use in decision-making agents (Buesing et al., 2018). However, van den Oord et al. (2017b) show that these autoregressive models can be used as a generative prior over the latent space of discrete encoder/decoder models. Operating over these concise latent representations of the data instead of pixel space greatly reduces the time needed for generation, making these models feasible for use in decision-making agents.
Learning accurate models of the environment has long been a goal in model-based reinforcement learning and unsupervised learning. Recent work has shown the power of learning action-conditional models for training decision-making agents with perceptual models (Ha & Schmidhuber, 2018; Schmidhuber, 2015; Buesing et al., 2018; Oh et al., 2015; Graves, 2013) and of combining planning with environment models (Silver et al., 2016b; Zhang et al., 2018; Pascanu et al., 2017; Guez et al., 2018; Anthony et al., 2017).
For real-world agents, semantic information is often more relevant than perceptual input for task performance and planning (Luc et al., 2017). Our experimentation over semantic space shows that, for our task, VQ-VAE reconstructions greatly outperform those of a VAE (Kingma & Welling, 2013). Instead of assuming normally distributed priors and posteriors as in a typical VAE architecture, VQ-VAEs learn categorical distributions in the latent space, where samples from the distributions are indexes into an embedding table. van den Oord et al. (2017b) demonstrate the benefits of learning action-conditional and action-independent forward predictions over the VQ-VAE latent space. We build upon this work by combining it with a classical planning method in order to navigate in an environment with numerous dynamic obstacles and a moving target.
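As a rough illustration of the discrete bottleneck described above, the following sketch maps encoder output vectors to indexes in an embedding table by nearest-neighbor lookup. The function names, the toy codebook, and the two-dimensional vectors are assumptions chosen for readability; they are not the settings used in our experiments.

```python
from typing import List, Sequence, Tuple


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(
    vectors: Sequence[Sequence[float]],
    codebook: Sequence[Sequence[float]],
) -> Tuple[List[int], List[List[float]]]:
    """Replace each vector by its nearest codebook entry.

    Returns the chosen indexes (the discrete latent code) and the
    corresponding embedding vectors (what a decoder would consume).
    """
    indexes = [
        min(range(len(codebook)), key=lambda k: squared_distance(v, codebook[k]))
        for v in vectors
    ]
    quantized = [list(codebook[k]) for k in indexes]
    return indexes, quantized


# Toy example: a 3-entry embedding table (hypothetical values).
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]
encoder_out = [[0.1, -0.2], [0.9, 1.2], [-0.8, 1.7]]
indexes, quantized = quantize(encoder_out, codebook)
print(indexes)
```

Only the integer indexes need to be modeled by the autoregressive prior, which is what makes generation in the latent space cheaper than generation in pixel space.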
We test our forward model with a powerful anytime planning method, Monte-Carlo Tree Search (MCTS) (Kocsis & Szepesvári, 2006). Given an accurate representation of the future and sufficient time to compute, MCTS performs well (Pepels et al., 2014), even when faced with large state or action spaces. MCTS works by rolling out many sequences of actions over possible future scenarios to acquire an approximate (Monte Carlo) estimate of the value of taking a specific action from a particular state. For a full overview of MCTS and its many variants, please refer to Browne & Powley (2012). MCTS has been used in a wide variety of search and planning problems where a model of the world is available for querying (Silver et al., 2016a; Guo et al., 2014a; Bellemare et al., 2012; Lipovetzky et al., 2015; Guo et al., 2014b). The performance of MCTS is critically dependent on having an accurate forward model of the environment, making it an ideal fit for testing our autoregressive conditional generative forward model.
We consider a fully-observable task in which an agent must navigate to a dynamic goal location without contact with moving obstacles. At each time step, the agent receives an observation and must execute an action. In our experiments, the observation is an image constituting the full view of an action-independent, two-dimensional environment. The action space consists of actions, where each action moves the agent a fixed amount in a specific direction, diagonals included. We learn a conditional forward model of this environment as described in Section 3.2 and query it at decision time for action selection with MCTS.
Our problem is similar to those faced by autonomous underwater vehicles (AUVs) navigating in a busy harbor while trying to avoid traveling underneath passing ships (Arvind et al., 2013). In order to successfully accomplish this task, the robot needs reliable dynamics models of the obstacles (ships) and goals in the environment so it can plan effectively against a realistic estimate of future states.
3.1 Environment Description
We introduce a navigation environment (depicted in the first column of Figure 1) which consists of a configurable world with dynamic obstacles and a moving goal. Movement about the environment is continuous, but collision and goal checking are quantized to the nearest pixel. In each episode the size agent and size goal are initialized to a random location, and the goal is given a random vector direction and a fixed velocity. The agent must then reach the moving goal within a limited number of steps without colliding with an obstacle. At each timestep the agent has the choice of actions. These actions indicate one of equally spaced angles and a constant speed. In these experiments, we test two agents, one at pixels per timestep ( goal agent) and one agent at pixels per timestep ( goal agent). The goal moves about the environment at a fixed random angle and fixed speed of pixels per timestep. The goal also reflects off of world boundaries, making good modeling of goal dynamics important to success.
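The agent actions and goal dynamics described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the 8-action example, the unit speeds, and the 11-pixel world are placeholders rather than the values used in our experiments, and the reflection rule assumes the per-step displacement is smaller than the world size.

```python
import math
from typing import Tuple


def action_to_velocity(action: int, n_actions: int, speed: float) -> Tuple[float, float]:
    """Map a discrete action to one of n_actions equally spaced headings."""
    angle = 2.0 * math.pi * action / n_actions
    return speed * math.cos(angle), speed * math.sin(angle)


def reflect_step(pos: float, vel: float, size: int) -> Tuple[float, float]:
    """Advance one coordinate by vel, bouncing off the walls at 0 and size - 1."""
    new_pos = pos + vel
    upper = size - 1
    if new_pos < 0:
        new_pos, vel = -new_pos, -vel
    elif new_pos > upper:
        new_pos, vel = 2 * upper - new_pos, -vel
    return new_pos, vel


def step_goal(x, y, vx, vy, width, height):
    """Move the goal one timestep, reflecting independently on each axis."""
    x, vx = reflect_step(x, vx, width)
    y, vy = reflect_step(y, vy, height)
    return x, y, vx, vy


# Hypothetical example: 8 headings, goal near the right wall of an 11 x 11 world.
east = action_to_velocity(0, 8, 1.0)
north = action_to_velocity(2, 8, 1.0)
bounced = step_goal(9.0, 5.0, 2.0, 0.0, 11, 11)
print(east, north, bounced)
```

Because the goal velocity flips sign at a boundary, a forward model that only extrapolates the most recent displacement will mispredict the goal after a bounce, which is why the goal dynamics need to be modeled explicitly.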
The environment is divided into obstacle lanes which span the environment horizontally. At the beginning of each episode, the lanes are randomly assigned to carry of classes of obstacles and a direction of movement (left to right or right to left). Each obstacle class is parameterized by a color and a distribution which describes average obstacle speed and length. Obstacles maintain a constant speed after entering the environment, pass through the edges of the environment, and are deleted after their entire body exits the observable space. The number of obstacles introduced into the environment at each timestep is controlled by a Poisson distribution, configured by the level parameter. For the results reported in this paper we set the level to , although there is support for a variety of difficulty settings. At each time step, the observation consists of the agent’s current location and the full quantized pixel space including the goal and obstacles.
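To make the obstacle arrival process concrete, the sketch below draws the number of new obstacles per timestep from a Poisson distribution. It assumes, for illustration only, that the level parameter is used directly as the Poisson rate; the function name, the rate of 2.0, and the seed are hypothetical. Sampling uses Knuth's multiplication method so that only the standard library is needed.

```python
import math
import random


def sample_poisson(rate: float, rng: random.Random) -> int:
    """Draw one Poisson-distributed count using Knuth's multiplication method."""
    threshold = math.exp(-rate)
    count = 0
    product = rng.random()
    # Keep multiplying uniform draws until the product falls below exp(-rate).
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


# Hypothetical usage: number of obstacles spawned at each of 20000 timesteps.
rng = random.Random(0)
level = 2.0  # assumed to act as the Poisson rate; not the paper's setting
draws = [sample_poisson(level, rng) for _ in range(20000)]
mean_spawns = sum(draws) / len(draws)
print(round(mean_spawns, 2))
```

A higher rate yields denser traffic on average while still allowing occasional empty or crowded timesteps, which is the source of the difficulty variation mentioned above.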
An agent receives a reward of for entering the same pixel-space as the goal and a reward for entering the same pixel-space as an obstacle. Both events cause the episode to end. The agent has a limited number of actions before the game times out, resulting in a reward of . This step limit is dependent on the speed of the agent and the size of the grid. For these experiments, the agent has steps and the agent has steps before the game ends.
A key component which makes our approach computationally feasible is that the environments of concern are not action conditional, meaning dynamics in the world continue regardless of what actions are chosen. This means that generated future frames can be shared across all rollouts in MCTS, greatly reducing the overall sample cost for the autoregressive model. Combined with the speed improvements from generating in a compressed space given by VQ-VAE, forward generation can be accomplished in reasonable time. It is also possible to take a similar approach in action-conditional spaces, but this would increase the number of needed generations from the model during MCTS rollout by a large amount.
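The saving from sharing generated frames can be sketched with a small cache keyed by rollout depth. This is an illustration under assumptions: `SharedFutureCache`, `predict_next`, and the integer stand-in for a frame are hypothetical names and simplifications, and a real forward model would return a sampled latent frame rather than an integer.

```python
from typing import Any, Callable, List


class SharedFutureCache:
    """Generate each future frame once and reuse it across all rollouts.

    This is valid only when the environment dynamics do not depend on the
    agent's actions, so the frame at depth d is the same for every rollout.
    """

    def __init__(self, initial_frame: Any, predict_next: Callable[[Any], Any]):
        self._frames: List[Any] = [initial_frame]
        self._predict_next = predict_next
        self.model_calls = 0

    def frame_at(self, depth: int) -> Any:
        """Return the frame `depth` steps ahead, generating it if needed."""
        while len(self._frames) <= depth:
            self._frames.append(self._predict_next(self._frames[-1]))
            self.model_calls += 1
        return self._frames[depth]


# Toy forward model: a "frame" is just an integer that increments each step.
cache = SharedFutureCache(initial_frame=0, predict_next=lambda frame: frame + 1)
for _rollout in range(50):
    for depth in range(1, 6):
        cache.frame_at(depth)
print(cache.model_calls)
```

In this toy setting, 50 rollouts of length 5 trigger 5 model calls instead of 250. In an action-conditional environment the cache key would have to include the action sequence, and most of this reuse would be lost.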
3.2 Model Description
We utilize a two-phase training procedure on the agent-independent environment described in the previous section. First we learn a compact, discrete representation (denoted ) of individual pixel-space frames with a VQ-VAE model (van den Oord et al., 2017b) with discretized logistic mixture likelihood (Salimans et al., 2017) for the reconstruction loss. In the second stage, an autoregressive generative model, a conditional gated PixelCNN (van den Oord et al., 2016a), is trained to predict one-step-ahead representations of sequential frames when conditioned on previous representations. To introduce Markovian conditions, the conditional gated PixelCNN is fed a spatial conditioning map of past encodings, in addition to the current step. The resulting PixelCNN learns a model corresponding to , where each dimension of is conditioned on all valid dimensions relative to the current position via autoregressive masking, and also conditioned on the previous frames by a spatial conditioning map (van den Oord et al., 2016a) which is fed as input. Combined with the previously trained VQ-VAE decoder, this results in a model which generates frame ahead, given previous frames. It is possible to generate an arbitrary number of frames forward given an initial frames by chaining step generations, though we expect results to degrade as forward trajectory lengths increase.
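The chaining of one-step predictions described above can be sketched as a simple loop in which each generated frame is appended to the conditioning history. The function name `generate_forward`, the `context` argument, and the linear-extrapolation stand-in for the learned model are assumptions made for illustration; they do not reflect the conditioning window used in our experiments.

```python
from typing import Callable, List, Sequence


def generate_forward(
    history: Sequence[float],
    predict_next: Callable[[Sequence[float]], float],
    n_steps: int,
    context: int,
) -> List[float]:
    """Chain one-step predictions, feeding each output back in as input."""
    frames = list(history)
    if len(frames) < context:
        raise ValueError("history must contain at least `context` frames")
    generated = []
    for _ in range(n_steps):
        next_frame = predict_next(frames[-context:])
        generated.append(next_frame)
        frames.append(next_frame)
    return generated


def toy_model(last_frames: Sequence[float]) -> float:
    """Stand-in for the learned model: extrapolate the last displacement."""
    return 2 * last_frames[-1] - last_frames[-2]


future = generate_forward([0.0, 1.0], toy_model, n_steps=3, context=2)
print(future)
```

Since every prediction after the first is conditioned on earlier predictions rather than on observed frames, any error made early in the chain is carried forward, which is the degradation we expect for long trajectories.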
column describes the number of steps completed on average by an agent, calculated only from episodes in which the agent avoided dying (smaller is better), along with the standard deviation. When tested on the same episodes, a random agent reached the goal once at X speed and never at X speed.
3.3 MCTS Planning
Our MCTS agent is characterized by rollout length, number of rollouts, and temperature. We vary rollout length from to , but hold the number of rollouts to and temperature to for all experiments. We also use a goal-oriented prior for node selection as described by prior work using PUCT MCTS (Rosin, 2011; Silver et al., 2017). This prior biases tree expansion during rollouts such that actions in the direction of the predicted goal are more likely to be chosen. Adding goal information to the state has been found to improve agents in other scenarios (Sukhbaatar et al., 2017), and we found that this simple prior greatly improved performance compared to a uniform prior, resulting in shorter average rollout lengths.
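A goal-oriented prior of this kind can be combined with PUCT selection roughly as follows. The specific prior (a softmax over the cosine between each action heading and the direction to the predicted goal), the `prior_temperature` and `c_puct` parameters, and the use of sqrt(N + 1) so that the prior already matters before any visits are all assumptions for this sketch; the text above does not specify these details.

```python
import math
from typing import List, Sequence, Tuple


def goal_prior(
    agent_xy: Tuple[float, float],
    goal_xy: Tuple[float, float],
    n_actions: int,
    prior_temperature: float = 1.0,
) -> List[float]:
    """Softmax prior that favors headings pointing toward the goal."""
    goal_angle = math.atan2(goal_xy[1] - agent_xy[1], goal_xy[0] - agent_xy[0])
    logits = [
        math.cos(2.0 * math.pi * a / n_actions - goal_angle) / prior_temperature
        for a in range(n_actions)
    ]
    top = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(value - top) for value in logits]
    total = sum(weights)
    return [w / total for w in weights]


def puct_select(
    q_values: Sequence[float],
    visit_counts: Sequence[int],
    prior: Sequence[float],
    c_puct: float = 1.0,
) -> int:
    """Pick the child maximizing Q(a) + c * P(a) * sqrt(N + 1) / (1 + N(a))."""
    total_visits = sum(visit_counts)

    def score(a: int) -> float:
        bonus = c_puct * prior[a] * math.sqrt(total_visits + 1) / (1 + visit_counts[a])
        return q_values[a] + bonus

    return max(range(len(prior)), key=score)


# Hypothetical example: 8 headings, goal directly along the positive y axis.
prior = goal_prior((0.0, 0.0), (0.0, 5.0), n_actions=8)
first_choice = puct_select([0.0] * 8, [0] * 8, prior)
print(first_choice)
```

With no visits and equal value estimates, the first expanded action is the one heading toward the goal; as visit counts grow, the exploration bonus shrinks and the value estimates dominate.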
The VQ-VAE encoder consists of strided convolutional layers with a kernel size and sizes of , , , . The first layers have strides of and the last layer has a stride of . This configuration compresses an input size of down to a space of . For learning the vector quantization codebook, we set K=, resulting in a compression of in bits over each frame, considering there are pixel-values used in the input image (requiring bits to encode minimally). The VQ-VAE decoder inverts this process using transposed convolutions with stride values that mirror the encoder settings in reverse order.
Training was performed for epochs with a minibatch size of over example frames which were generated from running the environment. We use an Adam optimizer (Kingma & Ba, 2014) with the learning rate set to , and the discretized mixture of logistics loss (Salimans et al., 2017). From the trained VQ-VAE model, we generate a new dataset consisting of ordered values given by our model over previously unseen episodes which are each frames long. The PixelCNN (van den Oord et al., 2016a) is trained over these generated representations for epochs with a batch size of . We employ categorical cross-entropy loss and the Adam optimizer (learning rate is set to ) for predicting the discrete "label" of each dimension. We condition each prediction on a spatial map consisting of the previous frame’s representations (van den Oord et al., 2016a).
Our experiments (see Table 2) demonstrate the feasibility of using conditional autoregressive models for forward planning. Example playout gifs can be found in the code repository at https://github.com/johannah/trajectories. We compare agents using our forward model to an agent which has access to an oracle of the environment. The oracle agent is used as an upper bound on performance: although this perfect representation of the future environment is not available in realistic tasks, it is the theoretical best we can expect a generative model to achieve. In all of the compared models, we first use a midpoint "average" estimate from the discretized mixture of logistics distribution, but in those denoted by sampled, we also sample an additional or times from the model and take the pixel-wise max of the predicted obstacle values. We find this results in a more conservative, but noisier, estimate of the obstacle locations. We take the median location of goal estimates over all of the samples to set the directional MCTS prior.
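The two aggregation steps used for the sampled agents can be sketched as below. The tiny 2 x 3 binary obstacle maps, the goal coordinates, and the choice of a coordinate-wise median are assumptions for illustration; the function names are hypothetical.

```python
from statistics import median
from typing import List, Sequence, Tuple


def pixelwise_max(obstacle_maps: Sequence[Sequence[Sequence[int]]]) -> List[List[int]]:
    """Mark a pixel as occupied if any sampled prediction marks it occupied."""
    rows = len(obstacle_maps[0])
    cols = len(obstacle_maps[0][0])
    return [
        [max(sample[r][c] for sample in obstacle_maps) for c in range(cols)]
        for r in range(rows)
    ]


def median_goal(goal_estimates: Sequence[Tuple[int, int]]) -> Tuple[float, float]:
    """Coordinate-wise median of the goal locations found in each sample."""
    xs = [g[0] for g in goal_estimates]
    ys = [g[1] for g in goal_estimates]
    return median(xs), median(ys)


# Three hypothetical sampled predictions of the same future frame.
samples = [
    [[0, 1, 0], [0, 0, 0]],
    [[0, 0, 1], [0, 0, 0]],
    [[0, 1, 0], [1, 0, 0]],
]
conservative_map = pixelwise_max(samples)
goal_xy = median_goal([(4, 7), (5, 7), (9, 6)])
print(conservative_map, goal_xy)
```

Taking the maximum means a single sample that predicts an obstacle is enough to block a pixel, which trades extra false positives for fewer false negatives, while the median keeps one outlying goal estimate from redirecting the prior.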
Errors in the forward predictions (see Figure 1) can cause the agents to make catastrophic decisions, resulting in lower performance when compared to the oracle. False negatives in particular (shown in red in Figure 1) result in the agent mistaking an obstacle for free space. Some of these mistakes are unavoidable as we step farther from the given state, because we can only model obstacles that are in the scene at the current time step. This characteristic limits how far forward in time we can usefully model and is a phenomenon also discussed by Luc et al. (2017).
Perhaps unsurprisingly, our results show that the faster () agent had an easier time reaching the goal before running out of time. Agents which utilize longer rollouts were likely hampered by our decision to hold the number of rollouts constant over all of our experiments. Overall, longer rollouts were more likely to die off in their future states and thus often failed to come up with aggressive paths.
Each future timestep prediction with our VQ-VAE + PixelCNN takes approximately seconds on a TitanX-Pascal GPU. An average action decision with our best performing agent ( Samples with step rollouts) takes approximately seconds. Beyond using VQ-VAE to reduce the input space to PixelCNN, no other methods for improving the speed of autoregressive generation were employed. Recent publications in this area (van den Oord et al., 2017a; Kalchbrenner et al., 2018; Ramachandran et al., 2017) show massive improvements in generation speed for autoregressive models and are directly applicable to this work.
We show that the two-stage pipeline of VQ-VAE (van den Oord et al., 2017b) combined with a PixelCNN prior conditioned on previous frames captures important semantic structure in a dynamic, goal-oriented environment. The resulting samples are usable for model-based planning with MCTS over generated future states. Our agent avoids moving obstacles and reliably intercepts a non-stationary goal in the dynamic test environment introduced in this work, demonstrating the efficacy of this approach for planning in dynamic environments.
- Anthony et al. (2017) Anthony, Thomas, Tian, Zheng, and Barber, David. Thinking fast and slow with deep learning and tree search. CoRR, abs/1705.08439, 2017.
- Arvind et al. (2013) Arvind, Pereira, Jonathan, Binney, Geoffrey, Hollinger, and Gaurav, Sukhatme. Risk‐aware path planning for autonomous underwater vehicles using predictive ocean models. Journal of Field Robotics, 30(5):741–762, 2013. doi: 10.1002/rob.21472.
- Bellemare et al. (2012) Bellemare, Marc G., Naddaf, Yavar, Veness, Joel, and Bowling, Michael. The arcade learning environment: An evaluation platform for general agents. CoRR, abs/1207.4708, 2012.
- Browne & Powley (2012) Browne, Cb and Powley, Edward. A survey of monte carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in Games, 4(1):1–49, 2012. ISSN 1943-068X. doi: 10.1109/TCIAIG.2012.2186810.
- Buesing et al. (2018) Buesing, Lars, Weber, Theophane, Racanière, Sébastien, Eslami, S. M. Ali, Rezende, Danilo Jimenez, Reichert, David P., Viola, Fabio, Besse, Frederic, Gregor, Karol, Hassabis, Demis, and Wierstra, Daan. Learning and querying fast generative models for reinforcement learning. CoRR, abs/1802.03006, 2018.
- Graves (2013) Graves, Alex. Generating sequences with recurrent neural networks. CoRR, abs/1308.0850, 2013.
- Guez et al. (2018) Guez, Arthur, Weber, Théophane, Antonoglou, Ioannis, Simonyan, Karen, Vinyals, Oriol, Wierstra, Daan, Munos, Rémi, and Silver, David. Learning to search with mctsnets. CoRR, abs/1802.04697, 2018.
- Guo et al. (2014a) Guo, Xiaoxiao, Singh, Satinder, Lee, Honglak, Lewis, Richard L, and Wang, Xiaoshi. Deep learning for real-time atari game play using offline monte-carlo tree search planning. In Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N. D., and Weinberger, K. Q. (eds.), Advances in Neural Information Processing Systems 27, pp. 3338–3346. Curran Associates, Inc., 2014a.
- Guo et al. (2014b) Guo, Xiaoxiao, Singh, Satinder, Lee, Honglak, Lewis, Richard L, and Wang, Xiaoshi. Deep learning for real-time atari game play using offline monte-carlo tree search planning. In Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N. D., and Weinberger, K. Q. (eds.), Advances in Neural Information Processing Systems 27, pp. 3338–3346. Curran Associates, Inc., 2014b.
- Ha & Schmidhuber (2018) Ha, David and Schmidhuber, Jürgen. World models. CoRR, abs/1803.10122, 2018.
- Kalchbrenner et al. (2016) Kalchbrenner, Nal, van den Oord, Aäron, Simonyan, Karen, Danihelka, Ivo, Vinyals, Oriol, Graves, Alex, and Kavukcuoglu, Koray. Video pixel networks. CoRR, abs/1610.00527, 2016.
- Kalchbrenner et al. (2018) Kalchbrenner, Nal, Elsen, Erich, Simonyan, Karen, Noury, Seb, Casagrande, Norman, Lockhart, Edward, Stimberg, Florian, van den Oord, Aäron, Dieleman, Sander, and Kavukcuoglu, Koray. Efficient neural audio synthesis. CoRR, abs/1802.08435, 2018.
- Kingma & Ba (2014) Kingma, Diederik P and Ba, Jimmy. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- Kingma & Welling (2013) Kingma, Diederik P. and Welling, Max. Auto-encoding variational bayes. CoRR, abs/1312.6114, 2013.
- Kocsis & Szepesvári (2006) Kocsis, Levente and Szepesvári, Csaba. Bandit based monte-carlo planning. In Proceedings of the 17th European Conference on Machine Learning, ECML’06, pp. 282–293, Berlin, Heidelberg, 2006. Springer-Verlag. ISBN 3-540-45375-X, 978-3-540-45375-8. doi: 10.1007/11871842_29.
- Lavalle (1998) Lavalle, Steven M. Rapidly-exploring random trees: A new tool for path planning. Technical report, 1998.
- Lipovetzky et al. (2015) Lipovetzky, Nir, Ramirez, Miquel, and Geffner, Hector. Classical planning with simulators: Results on the atari video games. In Proceedings of the 24th International Conference on Artificial Intelligence, IJCAI’15, pp. 1610–1616. AAAI Press, 2015. ISBN 978-1-57735-738-4.
- Luc et al. (2017) Luc, Pauline, Neverova, Natalia, Couprie, Camille, Verbeek, Jacob, and LeCun, Yann. Predicting deeper into the future of semantic segmentation. ICCV, 2017.
- Oh et al. (2015) Oh, Junhyuk, Guo, Xiaoxiao, Lee, Honglak, Lewis, Richard, and Singh, Satinder. Action-conditional video prediction using deep networks in atari games. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 2, NIPS’15, pp. 2863–2871, Cambridge, MA, USA, 2015. MIT Press.
- Pascanu et al. (2017) Pascanu, Razvan, Li, Yujia, Vinyals, Oriol, Heess, Nicolas, Buesing, Lars, Racanière, Sébastien, Reichert, David P., Weber, Theophane, Wierstra, Daan, and Battaglia, Peter. Learning model-based planning from scratch. CoRR, abs/1707.06170, 2017.
- Pepels et al. (2014) Pepels, T., Winands, M. H. M., and Lanctot, M. Real-time monte carlo tree search in ms pac-man. IEEE Transactions on Computational Intelligence and AI in Games, 6(3):245–257, Sept 2014. ISSN 1943-068X. doi: 10.1109/TCIAIG.2013.2291577.
- Ramachandran et al. (2017) Ramachandran, Prajit, Paine, Tom Le, Khorrami, Pooya, Babaeizadeh, Mohammad, Chang, Shiyu, Zhang, Yang, Hasegawa-Johnson, Mark A, Campbell, Roy H, and Huang, Thomas S. Fast generation for convolutional autoregressive models. arXiv preprint arXiv:1704.06001, 2017.
- Rosin (2011) Rosin, Christopher D. Multi-armed bandits with episode context. Annals of Mathematics and Artificial Intelligence, 61(3):203–230, Mar 2011. ISSN 1573-7470. doi: 10.1007/s10472-011-9258-6.
- Salimans et al. (2017) Salimans, Tim, Karpathy, Andrej, Chen, Xi, and Kingma, Diederik P. Pixelcnn++: A pixelcnn implementation with discretized logistic mixture likelihood and other modifications. In ICLR, 2017.
- Schmidhuber (2015) Schmidhuber, Jürgen. On learning to think: Algorithmic information theory for novel combinations of reinforcement learning controllers and recurrent neural world models. CoRR, abs/1511.09249, 2015.
- Silver et al. (2016a) Silver, David, Huang, Aja, Maddison, Chris J., Guez, Arthur, Sifre, Laurent, van den Driessche, George, Schrittwieser, Julian, Antonoglou, Ioannis, Panneershelvam, Veda, Lanctot, Marc, Dieleman, Sander, Grewe, Dominik, Nham, John, Kalchbrenner, Nal, Sutskever, Ilya, Lillicrap, Timothy, Leach, Madeleine, Kavukcuoglu, Koray, Graepel, Thore, and Hassabis, Demis. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, January 2016a. doi: 10.1038/nature16961.
- Silver et al. (2016b) Silver, David, van Hasselt, Hado, Hessel, Matteo, Schaul, Tom, Guez, Arthur, Harley, Tim, Dulac-Arnold, Gabriel, Reichert, David P., Rabinowitz, Neil C., Barreto, André, and Degris, Thomas. The predictron: End-to-end learning and planning. CoRR, abs/1612.08810, 2016b.
- Silver et al. (2017) Silver, David, Schrittwieser, Julian, Simonyan, Karen, Antonoglou, Ioannis, Huang, Aja, Guez, Arthur, Hubert, Thomas, Baker, Lucas, Lai, Matthew, Bolton, Adrian, Chen, Yutian, Lillicrap, Timothy, Hui, Fan, Sifre, Laurent, van den Driessche, George, Graepel, Thore, and Hassabis, Demis. Mastering the game of go without human knowledge. Nature, 550:354–, October 2017.
- Stentz (1995) Stentz, Anthony. The focussed d* algorithm for real-time replanning. In Proceedings of the 14th International Joint Conference on Artificial Intelligence - Volume 2, IJCAI’95, pp. 1652–1659, San Francisco, CA, USA, 1995. Morgan Kaufmann Publishers Inc. ISBN 1-55860-363-8.
- Sukhbaatar et al. (2017) Sukhbaatar, Sainbayar, Lin, Zeming, Kostrikov, Ilya, Synnaeve, Gabriel, Szlam, Arthur, and Fergus, Rob. Intrinsic motivation and automatic curricula via asymmetric self-play. arXiv preprint arXiv:1703.05407, 2017.
- van den Berg et al. (2006) van den Berg, J., Ferguson, D., and Kuffner, J. Anytime path planning and replanning in dynamic environments. In Proceedings 2006 IEEE International Conference on Robotics and Automation, 2006. ICRA 2006., pp. 2366–2371, May 2006. doi: 10.1109/ROBOT.2006.1642056.
- van den Oord et al. (2016a) van den Oord, Aaron, Kalchbrenner, Nal, Espeholt, Lasse, Kavukcuoglu, Koray, Vinyals, Oriol, and Graves, Alex. Conditional image generation with pixelcnn decoders. In Lee, D. D., Sugiyama, M., Luxburg, U. V., Guyon, I., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 29, pp. 4790–4798. Curran Associates, Inc., 2016a.
- van den Oord et al. (2017a) van den Oord, Aäron, Li, Yazhe, Babuschkin, Igor, Simonyan, Karen, Vinyals, Oriol, Kavukcuoglu, Koray, van den Driessche, George, Lockhart, Edward, Cobo, Luis C., Stimberg, Florian, Casagrande, Norman, Grewe, Dominik, Noury, Seb, Dieleman, Sander, Elsen, Erich, Kalchbrenner, Nal, Zen, Heiga, Graves, Alex, King, Helen, Walters, Tom, Belov, Dan, and Hassabis, Demis. Parallel wavenet: Fast high-fidelity speech synthesis. CoRR, abs/1711.10433, 2017a.
- van den Oord et al. (2017b) van den Oord, Aaron, Vinyals, Oriol, and Kavukcuoglu, Koray. Neural discrete representation learning. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 30, pp. 6306–6315. Curran Associates, Inc., 2017b.
- van den Oord et al. (2016b) van den Oord, Aäron, Dieleman, Sander, Zen, Heiga, Simonyan, Karen, Vinyals, Oriol, Graves, Alexander, Kalchbrenner, Nal, Senior, Andrew, and Kavukcuoglu, Koray. Wavenet: A generative model for raw audio. In Arxiv, 2016b.
- Zhang et al. (2018) Zhang, Amy, Lerer, Adam, Sukhbaatar, Sainbayar, Fergus, Rob, and Szlam, Arthur. Composable planning with attributes. CoRR, abs/1803.00512, 2018.