Planning in Dynamic Environments with Conditional Autoregressive Models

11/25/2018 ∙ by Johanna Hansen, et al.

We demonstrate the use of conditional autoregressive generative models (van den Oord et al., 2016a) over a discrete latent space (van den Oord et al., 2017b) for forward planning with MCTS. In order to test this method, we introduce a new environment featuring varying difficulty levels, along with moving goals and obstacles. The combination of high-quality frame generation and classical planning approaches nearly matches true environment performance for our task, demonstrating the usefulness of this method for model-based planning in dynamic environments.

1 Introduction

Planning agents find actions at each decision point by considering future scenarios from their current state against a model of their world (Lavalle, 1998; Kocsis & Szepesvári, 2006; Stentz, 1995; van den Berg et al., 2006). Though typically slower at decision time than model-free agents, agents which use planning can be configured and tuned with explicit constraints. Planning-based methods can also reduce the compounding of errors in sequential decisions by directly testing the long-term consequences of action choices, balancing exploitation and exploration, and generally limiting issues with long-term credit assignment.

Model-free reinforcement learning approaches are often sample-inefficient, requiring millions of steps to jointly learn environment features and a control policy. Agents which employ decision-time planning techniques, on the other hand, do not explicitly require any training prior to decision time. However, to perform well, planning-based agents need a very accurate model of their environment's future for evaluating actions. A perfect model of the future for forward planning is usually unavailable outside of computer games or simulations. In this paper, we demonstrate how we can leverage recent improvements in generative modeling to create powerful dynamics models that can be used for forward planning.

Specifically, we discuss an approach for learning conditional models of an environment in an unsupervised manner, and demonstrate the utility of this model for decision-time planning in a dynamic environment. Autoregressive models have shown great results in generating raw images, video, and audio (van den Oord et al., 2016a, b; Kalchbrenner et al., 2016), but have generally been considered too slow for use in decision-making agents (Buesing et al., 2018). However, van den Oord et al. (2017b) show that these autoregressive models can be used as a generative prior over the latent space of discrete encoder/decoder models. Operating over these concise latent representations of the data instead of pixel space greatly reduces the time needed for generation, making these models feasible for use in decision-making agents.

2 Background

Learning accurate models of the environment has long been a goal in model-based reinforcement learning and unsupervised learning. Recent work has shown the power of learning action-conditional models for training decision-making agents with perceptual models (Ha & Schmidhuber, 2018; Schmidhuber, 2015; Buesing et al., 2018; Oh et al., 2015; Graves, 2013) and of combining planning with environment models (Silver et al., 2016b; Zhang et al., 2018; Pascanu et al., 2017; Guez et al., 2018; Anthony et al., 2017).

For real-world agents, semantic information is often more relevant than perceptual input for task performance and planning (Luc et al., 2017). Our experimentation over semantic space shows that, for our task, a VQ-VAE model greatly outperforms VAE (Kingma & Welling, 2013) reconstructions. Instead of assuming normally distributed priors and posteriors as in a typical VAE architecture, VQ-VAEs learn categorical distributions in the latent space, where samples from the distributions are indexes into an embedding table. Van den Oord et al. (2017b) demonstrate the benefits of learning action-conditional and action-independent forward predictions over the VQ-VAE latent space. We build upon this work by combining it with a classical planning method in order to navigate in an environment with numerous dynamic obstacles and a moving target.

We test our forward model with a powerful anytime planning method, Monte-Carlo Tree Search (MCTS) (Kocsis & Szepesvári, 2006). Given an accurate representation of the future and sufficient time to compute, MCTS performs well (Pepels et al., 2014), even when faced with large state or action spaces. MCTS works by rolling out many action sequences over possible future scenarios to acquire an approximate (Monte Carlo) estimate of the value of taking a specific action from a particular state. For a full overview of MCTS and its many variants, please refer to (Browne & Powley, 2012). MCTS has been used in a wide variety of search and planning problems where a model of the world is available for querying (Silver et al., 2016a; Guo et al., 2014a; Bellemare et al., 2012; Lipovetzky et al., 2015; Guo et al., 2014b). The performance of MCTS is critically dependent on having an accurate forward model of the environment, making it an ideal fit for testing our autoregressive conditional generative forward model.
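As a concrete illustration, the following minimal sketch shows the Monte Carlo value estimate underlying such rollouts; the model.step(state, action) interface and all names are hypothetical stand-ins for whatever forward model is available for querying, not our released code.

    import random

    def rollout_value(model, state, actions, depth, gamma=1.0):
        # Roll out uniformly random actions against the forward model for
        # `depth` steps and accumulate the (discounted) reward encountered.
        total, discount = 0.0, 1.0
        for _ in range(depth):
            action = random.choice(actions)
            state, reward, done = model.step(state, action)
            total += discount * reward
            discount *= gamma
            if done:
                break
        return total

The value of taking a specific action from a state is then approximated by averaging rollout_value over many rollouts whose first step is that action.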

3 Experiments

We consider a fully-observable task in which an agent must navigate to a dynamic goal location without contacting moving obstacles. At each time step, the agent receives an observation and must execute an action. In our experiments, the observation is an image constituting the full view of an action-independent, two-dimensional environment. The action space is a discrete set of actions, where each action moves the agent a fixed amount in a specific direction, diagonals included. We learn a conditional forward model of this environment as described in Section 3.2 and query it at decision time for action selection with MCTS.

Our problem is similar to those faced by autonomous underwater vehicles (AUVs) navigating in a busy harbor while trying to avoid traveling underneath passing ships (Arvind et al., 2013). In order to successfully accomplish this task, the robot needs reliable dynamics models of the obstacles (ships) and goals in the environment so it can plan effectively against a realistic estimate of future states.

Figure 1: This figure illustrates forward rollout steps by the oracle (left column), our sampled model (middle column), and the error in the model (right column). The number of steps from the given state is indicated in the title of each oracle plot. In the first two columns, free space is violet, moving obstacles are cyan, and the goal is yellow. In the third column, we illustrate obstacle error in the model as follows: false negatives (predicted free space where there should be an obstacle) are red and false positives (predicted obstacle where there was free space) are blue. The true goal is plotted in yellow and the predicted goal is plotted in orange (a perfect goal prediction therefore appears only in orange).

3.1 Environment Description

We introduce a navigation environment (depicted in the first column of Figure 1) which consists of a configurable world with dynamic obstacles and a moving goal. Movement about the environment is continuous, but collision and goal checking is quantized to the nearest pixel. In each episode the agent and the goal are initialized to random locations, and the goal is given a random direction vector and a fixed velocity. The agent must then reach the moving goal within a limited number of steps without colliding with an obstacle. At each timestep the agent chooses from a discrete set of actions, each indicating one of several equally spaced angles and a constant speed. In these experiments we test two agents: one moving at twice the goal's speed (the 2X agent) and one moving at the same speed as the goal (the 1X agent). The goal moves about the environment at a fixed random angle and a fixed speed, and reflects off of world boundaries, making good modeling of goal dynamics important to success.

The environment is divided into obstacle lanes which span the environment horizontally. At the beginning of each episode, each lane is randomly assigned one of several classes of obstacles and a direction of movement (left to right or right to left). Each obstacle class is parameterized by a color and a distribution which describes average obstacle speed and length. Obstacles maintain a constant speed after entering the environment, pass through the edges of the environment, and are deleted after their entire body exits the observable space. The number of obstacles introduced into the environment at each timestep is controlled by a Poisson distribution, configured by a level parameter. For the results reported in this paper we use a single level setting, though the environment supports a variety of difficulty levels. At each time step, the observation consists of the agent's current location and the full quantized pixel space including the goal and obstacles.

An agent receives a positive reward for entering the same pixel-space as the goal and a negative reward for entering the same pixel-space as an obstacle. Both events cause the episode to end. The agent also has a limited number of actions before the game times out and the episode ends. This step limit is dependent on the speed of the agent and the size of the grid; for these experiments, the 2X agent and the 1X agent are each given a fixed step budget before the game ends.

A key property which makes our approach computationally feasible is that the environments of concern are not action-conditional, meaning dynamics in the world continue regardless of which actions are chosen. This means that generated future frames can be shared across all rollouts in MCTS, greatly reducing the overall sample cost for the autoregressive model. Combined with the speed improvements from generating in the compressed space given by VQ-VAE, forward generation can be accomplished in reasonable time. It is also possible to take a similar approach in action-conditional spaces, but this would greatly increase the number of generations needed from the model during MCTS rollouts.
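A minimal sketch of this frame sharing is given below, assuming a hypothetical forward_model(frames) callable that produces the next frame from a history of frames; all names are illustrative rather than taken from the released code.

    # Because obstacle and goal dynamics ignore the agent's actions, one
    # sequence of generated future frames can be cached and shared by every
    # MCTS rollout instead of being regenerated per rollout.
    class SharedFutureCache:
        def __init__(self, forward_model, conditioning_frames):
            self.model = forward_model
            self.frames = list(conditioning_frames)  # observed history
            self.offset = len(conditioning_frames)

        def frame_at(self, t):
            # Return the frame t steps past the last observation, generating
            # each intermediate frame exactly once and caching it.
            while len(self.frames) < self.offset + t:
                self.frames.append(self.model(self.frames))
            return self.frames[self.offset + t - 1]

Every rollout that needs the predicted frame at a given depth then calls frame_at with that depth, so the expensive autoregressive generation is paid only once per future timestep.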

3.2 Model Description

We utilize a two-phase training procedure on the agent-independent environment described in the previous section. First, we learn a compact, discrete latent representation of individual pixel-space frames with a VQ-VAE model (van den Oord et al., 2017b), using a discretized logistic mixture likelihood (Salimans et al., 2017) for the reconstruction loss. In the second stage, an autoregressive generative model, a conditional gated PixelCNN (van den Oord et al., 2016a), is trained to predict one-step-ahead latent representations of sequential frames when conditioned on previous representations. To introduce Markovian conditions, the conditional gated PixelCNN is fed a spatial conditioning map of past encodings in addition to the current step. The resulting PixelCNN learns a model of the current latent frame in which each latent dimension is conditioned on all valid dimensions relative to the current position via autoregressive masking, and also on the previous frames via a spatial conditioning map (van den Oord et al., 2016a) fed as input. Combined with the previously trained VQ-VAE decoder, this results in a model which generates one frame ahead, given the previous frames. It is possible to generate an arbitrary number of frames forward from an initial set of conditioning frames by chaining one-step generations, though we expect results to degrade as forward trajectory length increases.
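The sketch below illustrates this chained one-step generation under stated assumptions: encode, sample_next_latent, and decode are hypothetical stand-ins for the trained VQ-VAE encoder, the conditional PixelCNN sampler, and the VQ-VAE decoder, respectively.

    def rollout_latents(encode, sample_next_latent, decode, observed_frames, n_steps):
        # Autoregressively predict n_steps future frames in latent space,
        # then decode each predicted latent map back to pixel space.
        latents = [encode(f) for f in observed_frames]   # conditioning history
        predicted_frames = []
        for _ in range(n_steps):
            z_next = sample_next_latent(latents)          # conditioned on past latents
            latents.append(z_next)
            predicted_frames.append(decode(z_next))
        return predicted_frames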


Rollout Steps          1                     3                     5                     10
Technique         G   T   D    S        G   T   D    S        G   T   D    S        G   T   D    S
2X Oracle       100   0   0   34±17   100   0   0   36±18   100   0   0   45±28   100   4   0   68±49
2X Mid           78   0  22   33±17    88   0  12   40±18    91   2   7   65±40    52  25  23  111±67
2X 5 Samples     84   0  16   34±17     –   1   5   46±27    89   5   6   75±51    55  23  22  112±70
2X 10 Samples    85   0  15   35±18    88   0  12   46±26    89   9   2   76±56    55  31  14  124±68
1X Oracle        72  25   3  187±151   67  32   1  209±154   60  40   0  224±64    66  34   0  216±156
1X 5 Samples     31   3  55     –      46  21  33  196±153   41   3  27  259±155   39  46  15  294±143
Table 1: Performance comparison over 100 episodes. This table compares agents using MCTS for forward planning with varying models (the oracle and ours with varying levels of sampling from the generative model), rollout lengths (1, 3, 5, and 10), and agent speeds (2X agents are twice as fast as the goal and 1X agents are the same speed as the goal). All agents were tested over the same set of 100 random episodes, with MCTS performing a fixed number of rollouts at each decision point. The values in columns G, T, and D count the games in which the described agent reached the goal (G), ran out of time before reaching the goal (T), or died (D) by running into an obstacle. The S column reports the average number of steps completed by an agent, calculated only from episodes in which the agent avoided dying (smaller is better), along with the standard deviation. When tested on the same episodes, a random agent reached the goal only once at one speed setting and never at the other.

3.3 MCTS Planning

Our MCTS agent is characterized by rollout length, number of rollouts, and temperature. We vary rollout length (1, 3, 5, or 10 steps), but hold the number of rollouts and the temperature fixed for all experiments. We also use a goal-oriented prior for node selection, as described in prior work using PUCT MCTS (Rosin, 2011; Silver et al., 2017). This prior biases tree expansion during rollouts such that actions in the direction of the predicted goal are more likely to be chosen. Adding goal information to the state has been found to improve agents in other scenarios (Sukhbaatar et al., 2017), and we found that this simple prior greatly improved performance compared to a uniform prior, resulting in shorter average rollout lengths.
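The following sketch shows one way such goal-oriented PUCT selection could look; the cosine-style weighting of the prior and all names are illustrative, not the exact formulation used in our implementation.

    import math

    def goal_prior(action_dirs, goal_dir):
        # Return a normalized prior that favors actions aligned with goal_dir.
        # action_dirs and goal_dir are unit (dx, dy) vectors.
        scores = [max(1e-3, 1.0 + ax * goal_dir[0] + ay * goal_dir[1])
                  for ax, ay in action_dirs]
        total = sum(scores)
        return [s / total for s in scores]

    def puct_select(q_values, visit_counts, priors, c_puct=1.0):
        # Pick the child maximizing Q + c_puct * P * sqrt(N_parent) / (1 + N_child).
        n_parent = sum(visit_counts)
        scores = [q + c_puct * p * math.sqrt(n_parent) / (1 + n)
                  for q, n, p in zip(q_values, visit_counts, priors)]
        return scores.index(max(scores))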

3.4 Training

The VQ-VAE encoder consists of a stack of strided convolutional layers, with the final layer using a different stride than the earlier layers. This configuration compresses the input frame down to a much smaller discrete latent space. For the vector quantization codebook we learn K embedding vectors, yielding a large compression in bits over each frame relative to the number of pixel values used in the input image. The VQ-VAE decoder inverts this process using transpose convolutions with stride values that mirror the encoder settings in reverse order.
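An illustrative-only encoder with this overall structure is sketched below; the kernel sizes, channel widths, and strides shown are placeholder values, not the settings used in the paper.

    import torch.nn as nn

    class EncoderSketch(nn.Module):
        # Placeholder architecture: several stride-2 convolutions followed by a
        # final layer with a different stride, producing a spatial map of codes
        # that the VQ layer snaps to its nearest codebook entries.
        def __init__(self, in_channels=1, hidden=64, latent_channels=32):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv2d(in_channels, hidden, kernel_size=4, stride=2, padding=1),
                nn.ReLU(),
                nn.Conv2d(hidden, hidden, kernel_size=4, stride=2, padding=1),
                nn.ReLU(),
                nn.Conv2d(hidden, hidden, kernel_size=4, stride=2, padding=1),
                nn.ReLU(),
                nn.Conv2d(hidden, latent_channels, kernel_size=4, stride=1, padding=1),
            )

        def forward(self, x):
            return self.net(x)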

Training of the VQ-VAE was performed over example frames generated by running the environment, using the Adam optimizer (Kingma & Ba, 2014) and the discretized mixture of logistics loss (Salimans et al., 2017). From the trained VQ-VAE model, we generate a new dataset consisting of ordered latent codes produced by our model over previously unseen episodes. The PixelCNN (van den Oord et al., 2016a) is then trained over these generated latent codes. We employ a categorical cross-entropy loss and the Adam optimizer for predicting the discrete "label" of each latent dimension, conditioning each prediction on a spatial map consisting of the previous frames' latent codes (van den Oord et al., 2016a).
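The sketch below outlines one possible form of this second training stage, assuming a hypothetical vqvae_encode that returns a grid of discrete code indices and a pixelcnn that outputs per-position logits over the codebook entries; it is not the exact training code.

    import torch
    import torch.nn.functional as F

    def pixelcnn_train_step(pixelcnn, optimizer, vqvae_encode, frame_seq):
        # Encode a short sequence of frames to discrete code grids, condition on
        # the earlier frames, and predict the codes of the latest frame.
        codes = [vqvae_encode(f) for f in frame_seq]          # list of (H, W) long tensors
        condition = torch.stack(codes[:-1], dim=0).float()    # spatial conditioning map
        target = codes[-1]                                     # (H, W) code indices
        logits = pixelcnn(target, condition)                   # (K, H, W) logits
        loss = F.cross_entropy(logits.unsqueeze(0), target.unsqueeze(0))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()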

4 Performance

Our experiments (see Table 1) demonstrate the feasibility of using conditional autoregressive models for forward planning. Example playout GIFs can be found in the code repository at https://github.com/johannah/trajectories. We compare agents using our forward model to an agent which has access to an oracle of the environment. The oracle agent serves as an upper bound on performance: although this perfect representation of the future environment is not available in realistic tasks, it is the theoretical best we can expect a generative model to do. In all of the compared models, we first use a midpoint "average" estimate from the discretized mixture of logistics distribution; in those denoted as sampled, we additionally sample 5 or 10 times from the model and take the pixel-wise max of the predicted obstacle values. We find this results in a more conservative, but noisier, estimate of the obstacle locations. We take the median goal location over all of the samples to set the directional MCTS prior.
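A small sketch of this sampling procedure, with sample_predicted_frame and locate_goal as illustrative stand-ins for sampling a predicted obstacle map and extracting a goal location from it, is shown below.

    import numpy as np

    def conservative_prediction(sample_predicted_frame, locate_goal, n_samples=5):
        # Draw several samples of the predicted frame, take a pixel-wise max so
        # that any sample flagging an obstacle marks that pixel as occupied, and
        # use the median of the sampled goal locations for the MCTS prior.
        frames = [sample_predicted_frame() for _ in range(n_samples)]
        obstacle_map = np.max(np.stack(frames), axis=0)
        goal_xy = np.median([locate_goal(f) for f in frames], axis=0)
        return obstacle_map, goal_xy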

Errors in the forward predictions (see Figure 1) can cause the agents to make catastrophic decisions, resulting in lower performance than the oracle. False negatives in particular (shown in red in Figure 1) cause the agent to mistake an obstacle for free space. Some of these mistakes are unavoidable as we step farther from the given state, since we can only model obstacles that are in the scene at the current time step. This characteristic limits how far forward in time we can usefully model and is a phenomenon also discussed in Luc et al. (2017).

Perhaps unsurprisingly, our results show that the faster (2X) agent had an easier time reaching the goal before running out of time. Agents which utilize longer rollouts were likely hampered by our decision to hold the number of rollouts constant over all of our experiments. Overall, agents with longer rollouts were more likely to encounter deaths in their simulated future states and thus often failed to commit to aggressive paths.

Each future timestep prediction with our VQ-VAE + PixelCNN takes approximately seconds on a TitanX-Pascal GPU, and an average action decision with our best-performing agent (the sampled model with multi-step rollouts) takes approximately seconds. Beyond using VQ-VAE to reduce the input space to the PixelCNN, no other methods for improving the speed of autoregressive generation were employed. Recent publications in this area (van den Oord et al., 2017a; Kalchbrenner et al., 2018; Ramachandran et al., 2017) show massive improvements in generation speed for autoregressive models and are directly applicable to this work.

5 Conclusion

We show that the two-stage pipeline of a VQ-VAE (van den Oord et al., 2017b) combined with a PixelCNN prior conditioned on previous frames captures important semantic structure in a dynamic, goal-oriented environment. The resulting samples are usable for model-based planning with MCTS over generated future states. Our agent avoids moving obstacles and reliably intercepts a non-stationary goal in the dynamic test environment introduced in this work, demonstrating the efficacy of this approach for planning in dynamic environments.

References