Task-Agnostic Dynamics Priors for Deep Reinforcement Learning

Yilun Du et al., 05/13/2019

While model-based deep reinforcement learning (RL) holds great promise for sample efficiency and generalization, learning an accurate dynamics model is often challenging and requires substantial interaction with the environment. A wide variety of domains have dynamics that share common foundations like the laws of classical mechanics, which are rarely exploited by existing algorithms. In fact, humans continuously acquire and use such dynamics priors to easily adapt to operating in new environments. In this work, we propose an approach to learn task-agnostic dynamics priors from videos and incorporate them into an RL agent. Our method involves pre-training a frame predictor on task-agnostic physics videos to initialize dynamics models (and fine-tune them) for unseen target environments. Our frame prediction architecture, SpatialNet, is designed specifically to capture localized physical phenomena and interactions. Our approach allows for both faster policy learning and convergence to better policies, outperforming competitive approaches on several different environments. We also demonstrate that incorporating this prior allows for more effective transfer between environments.


1 Introduction

Recent advances in deep reinforcement learning (RL) have largely relied on model-free approaches, demonstrating strong performance on a variety of domains (Silver et al., 2016; Mnih et al., 2013; Kempka et al., 2016; Zhang et al., 2018c). However, model-free techniques do not have good sample efficiency (Sutton, 1990) and are difficult to adapt to new tasks or domains (Nichol et al., 2018). A key reason for this is that a single value function is used to represent both the agent’s policy and its knowledge of environment dynamics, which can result in heavy overfitting to a particular task (Zhang et al., 2018b). On the other hand, model-based RL allows for decoupling the dynamics model from the policy, enabling better generalization and transfer across tasks (Zhang et al., 2018a). The challenge with model-based RL, however, lies in estimating an accurate dynamics model of the environment while simultaneously using it to learn a policy, often leading to sub-optimal policies and slower learning. One way to alleviate this problem is to initialize dynamics models with universal, task-agnostic priors that allow for more efficient and stable model-based learning.

Figure 1: Two different environments with object dynamics that obey the common laws of physics (top: PhysWorld, bottom: Atari Pong). Agents that have a knowledge of general physics will be able to adapt quickly to either environment.

For example, consider learning dynamics models for the two different scenarios shown in Figure 1 (top and bottom). Both environments contain a variety of objects moving with different velocities and rotations. Current approaches require a large number of samples to learn a robust transition model of either world. For instance, in the first environment, inferring that the orange circle is a freely moving object requires observing the circle moving in a variety of different directions, and understanding the laws governing elastic collisions between two bodies (e.g. the circle and the grey rectangle) requires observing several instances of collisions at various angles and velocities. On the other hand, humans have reliable priors that allow them to understand the dynamics of new environments quickly (Dubey et al., 2018) – one such prior is an understanding of the physical laws of motion. In this work, we demonstrate that learning a task-agnostic dynamics prior (e.g. concepts like velocity, acceleration or elasticity) allows for accurate and more efficient estimation of the dynamics of new environments, resulting in better control policies.

In order to obtain a prior for physical dynamics, we perform unsupervised learning over raw videos containing moving objects. Specifically, we train a dynamics model to predict the next frame given the previous k frames, over a wide variety of scenarios with moving objects. The parameters of the model implicitly capture general laws of physics, which are useful in predicting entity movements. We initialize the dynamics model of the environment with these pre-trained parameters and fine-tune them using transitions from the specific task, while simultaneously learning a policy for the task. The dynamics model is used to predict future frames up to a finite horizon, which are then used as additional input into a policy network, similar to the approach of (Weber et al., 2017). Importantly, our frame prediction model is not action-conditional, unlike most prior work that employs such models in reinforcement learning (Oh et al., 2015; Weber et al., 2017).

Learning a good future frame model is challenging mainly for two reasons: a) the large dimensionality of the output space with arbitrary moving objects and interactions, and b) the partial observability in environments (Mathieu et al., 2015). Prior approaches (Oh et al., 2015) suffered from error compounding since they encode the entire image into a single vector before decoding the output, thereby missing out on fine-grained spatial information. Others, like the ConvLSTM (Xingjian et al., 2015), are better at capturing spatio-temporal interactions but suffer from poor generalization due to the use of additive update equations. To overcome these issues, we propose a new architecture (SpatialNet) that consists of a convolutional encoder, a spatial memory block, and a convolutional decoder that better captures localized dynamics. The spatial memory module operates by performing convolution operations over a temporal 3-dimensional state representation that keeps spatial information intact. This allows the network, which includes residual connections, to capture the localized physics of objects, such as directional movements and collisions, in a fine-grained manner, as well as to efficiently keep track of static background information. This results in lower prediction error, better generalization and invariance to the size of inputs.

We evaluate our approach on three different RL scenarios. First, we consider PhysWorld, a suite of randomized 2D physics-focused games, where learning object movement is crucial to a successful policy. Next, we consider PhysShooter3D, a 3D environment with rigid body dynamics and partial observations. Finally, we also evaluate on a stochastic variant of the popular ALE framework consisting of Atari games (Machado et al., 2017a). In all scenarios, we first demonstrate the value of learning a task-agnostic prior for model dynamics – for instance, our agent achieves up to 130% higher performance on the shooting game PhysShooter and 56.5% higher on the Atari game Asteroids, compared to the most competitive baseline. Further, we show that the dynamics model fine-tuned on these tasks transfers better to new tasks. For instance, our model achieves a relative score improvement of 26.9% on transfer from PhysForage to PhysShooter (both games from PhysWorld), significantly higher than the 5.4% improvement obtained by a policy-transfer baseline.

2 Related Work

There are two main lines of work that are closely related to this paper. The first is that of learning and using generic video prediction models for reinforcement learning (Oh et al., 2015; Finn et al., 2016; Weber et al., 2017). The key idea is to train a model to predict future frames on the target task and hallucinate additional trajectories that can help an agent learn faster. The second direction is to incorporate physics priors into parameterized dynamics models for future state prediction (Nguyen-Tuong and Peters, 2010; Kansky et al., 2017). The former path requires only pixel inputs but does not generalize well across tasks. The latter has the potential to generalize but requires manual specification of priors. Our work aims to combine the best of both worlds – learn a frame prediction model that is task-agnostic and captures an effective notion of physics to serve as a useful prior.

Video prediction models. Our frame prediction model is closest in spirit to the Convolutional LSTM (ConvLSTM) model, which has been applied to several domains (Xingjian et al., 2015; Zhu et al., 2017; Ke et al., 2017). Similar architectures that incorporate differentiable memory modules (Patraucean et al., 2015) or relational intermediates (Watters et al., 2017) have been proposed, with applications to deep RL (Parisotto and Salakhutdinov, 2017). While the ConvLSTM model is reasonably effective at predicting future frames, the additive LSTM update equations are not well suited to capturing localized physical interactions. While the model can theoretically learn to ignore unnecessary operations, optimizing the parameters effectively is difficult because of the lack of a proper inductive bias in the architecture. Our architecture is simpler and more natural at capturing physical dynamics and entity movements – this allows for better generalization, as we demonstrate in our experiments.

Several recent methods have also combined policy learning with future frame prediction in different ways. Action-conditioned frame prediction has been used to simulate trajectories for policy learning (Oh et al., 2015; Finn et al., 2016; Weber et al., 2017). Predicted frames have also been used to incentivize exploration in agents, via hashing (Yin et al., 2017) or using the prediction error to provide intrinsic rewards (Pathak et al., 2017). The main departure of our work from these papers is that we learn a frame prediction model that is not conditioned on actions, and from videos not related to a task, which allows us to employ the model on a variety of tasks.

Parameterized physics models. Several recent papers have explored the idea of incorporating physics priors into learning dynamics models of environments (Nguyen-Tuong and Peters, 2010; Cutler et al., 2014; Cutler and How, 2015; Scholz et al., 2014; Kansky et al., 2017; Battaglia et al., 2016; Mrowca et al., 2018; Xie et al., 2016). More recent work trained an object-oriented dynamics predictor by segmenting input frames into sets of objects (Zhu et al., 2018). While all these approaches demonstrate the importance of relevant priors for sample-efficient model learning, they all require some form of manual parameterization. In contrast, we learn physics priors in the form of the parameters of a predictive neural network, using only raw videos.

Decoupling dynamics from policy. Our work also relates to previous approaches on decoupling the agent’s knowledge of the environment dynamics from its task-oriented policy. Successor representations (Dayan, 1993) decompose the agent’s value function into a feature-based state representation and a reward projection operator, resulting in better exploration of the state space (Kulkarni et al., 2016; Barreto et al., 2017; Machado et al., 2017b). While these state abstractions help with exploration, such representations do not explicitly capture dynamics models of the environment. More recent work has proposed approaches to learn separate models for dynamics and rewards and use it to perform online planning (Zhang et al., 2018a) or learn independently controllable factors in the environment (Thomas et al., 2017). However, these assume access to task-specific transitions, while we learn a prior from task-independent videos and demonstrate its usefulness in learning different environment dynamics.

3 Framework

Our goal is to demonstrate that acquiring task-agnostic dynamics priors from raw videos helps agents learn faster in new environments. To this end, our approach consists of two phases:

  1. Pre-training a dynamics predictor: We first train a suitable neural network architecture to predict pixels in the next frame given the previous frames of a video. In this work, we use videos of objects moving according to classical mechanics, without any extra annotations.

  2. Reinforcement learning: We use the pre-trained frame predictor from the previous phase to initialize the dynamics model for an RL agent. This dynamics model is used to predict a few frames into the future, which is used as additional context for the control policy. The dynamics model is also simultaneously fine-tuned using trajectories from the environment.

We first describe how we use the frame prediction model for reinforcement learning, and then discuss different options for a frame predictor, including our new architecture, SpatialNet.

3.1 Reinforcement Learning with Dynamics Predictors

There are several ways one can incorporate a dynamics model into a reinforcement learning setup. One approach is to use the model to generate synthetic trajectories and use them in addition to observed transitions while training a policy (Oh et al., 2015; Feinberg et al., 2018). Another option is to perform rollouts from the current step using the model and then use the predicted states as additional context input to the policy (Weber et al., 2017). Our method is similar to the latter – we use our learned dynamics model to predict future frames and concatenate these frames along with the current frame to form the input to our policy network. There are two differences however – (1) we predict future state observations without conditioning on the actions of the agent and without rewards since our dynamics model is task agnostic, and (2) we do not use a global encoding for future frames, but instead stack the frames and use convolution operations to extract local dynamic information.

Formally, consider a standard Markov Decision Process (MDP) setup represented by the tuple $(\mathcal{S}, \mathcal{A}, \mathcal{T}, \mathcal{R})$, where $\mathcal{S}$ is the set of all possible state configurations, $\mathcal{A}$ is the set of actions available to the agent, $\mathcal{T}$ is the transition distribution, and $\mathcal{R}$ is the reward function. Assuming our dynamics model to be $f_\theta$, and given the current state $s_t$, we first apply our prediction model iteratively to obtain future state predictions:

$$\hat{s}_{t+1} = f_\theta(s_t), \quad \hat{s}_{t+2} = f_\theta(\hat{s}_{t+1}), \quad \ldots, \quad \hat{s}_{t+k} = f_\theta(\hat{s}_{t+k-1}).$$

We then train a policy network $\pi_\phi$ to output actions using all these predicted states as input in addition to the current state:

$$a_t \sim \pi_\phi(a \mid s_t, \hat{s}_{t+1}, \ldots, \hat{s}_{t+k}). \qquad (1)$$

For the policy network, we follow the architecture described in (Mnih et al., 2015) and use the Proximal Policy Optimization (PPO) (Schulman et al., 2017) algorithm for learning from rewards obtained in the task. We call this agent an Intuitive Physics Agent (IPA) since it first learns an intuitive prior of physical interactions.
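As a concrete illustration, the policy input can be assembled with a short rollout loop like the sketch below. This is PyTorch-style pseudocode: the dynamics model interface, tensor names and the value of k are illustrative assumptions, and no gradients flow through the predictor here, matching the training setup described next.

```python
import torch

@torch.no_grad()
def build_policy_input(dynamics, s_t, h, k=3):
    """Roll the action-free dynamics model forward k steps from the current
    frame s_t and stack the predictions channel-wise with s_t to form the
    IPA policy observation. The (pred, hidden) interface of `dynamics` is
    an assumption for this sketch."""
    frames, pred = [s_t], s_t
    for _ in range(k):
        pred, h = dynamics(pred, h)      # \hat{s}_{t+i} = f(\hat{s}_{t+i-1})
        frames.append(pred)
    # (B, (k+1)*C, H, W): current frame plus k predicted future frames
    return torch.cat(frames, dim=1), h
```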

We update policy parameters by using the PPO loss:

$$L^{CLIP}(\phi) = \hat{\mathbb{E}}_t\left[\min\left(r_t(\phi)\,\hat{A}_t,\; \mathrm{clip}\big(r_t(\phi),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_t\right)\right],$$

where $r_t(\phi) = \pi_\phi(a_t \mid s_t)\,/\,\pi_{\phi_{\mathrm{old}}}(a_t \mid s_t)$ and the advantage, $\hat{A}_t$, is computed using the value function $V(s_t)$. Simultaneously, we also update the parameters of the dynamics model using transitions from the environment with a pixel prediction loss (described in Section 3.2). However, policy gradients are not back-propagated to the dynamics predictor.
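The sketch below illustrates how the two updates can be kept decoupled in code: the policy is updated with the PPO clipped-surrogate loss, while the dynamics model is fine-tuned only with the pixel MSE on observed transitions. The loss coefficients, batch keys and `policy.evaluate` interface are assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def ipa_update(policy, dynamics, batch, pol_opt, dyn_opt, clip_eps=0.2):
    """One IPA optimization step: a PPO update for the policy/value network
    and a separate pixel-MSE update for the dynamics model."""
    # --- PPO update (policy gradients never touch the dynamics model) ---
    logp, value, entropy = policy.evaluate(batch["obs"], batch["actions"])
    ratio = torch.exp(logp - batch["old_logp"])
    adv = batch["advantages"]
    pg_loss = -torch.min(ratio * adv,
                         torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv).mean()
    vf_loss = F.mse_loss(value, batch["returns"])
    pol_opt.zero_grad()
    (pg_loss + 0.5 * vf_loss - 0.01 * entropy.mean()).backward()
    pol_opt.step()

    # --- Dynamics fine-tuning on observed transitions (pixel prediction) ---
    pred, _ = dynamics(batch["frame_t"], batch["hidden"])
    dyn_loss = F.mse_loss(pred, batch["frame_t1"])
    dyn_opt.zero_grad()
    dyn_loss.backward()
    dyn_opt.step()
```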

3.2 Dynamics Prediction

Prior work has investigated a variety of frame prediction models. LSTM-based recurrent networks (Oh et al., 2015) are not ideal for this task since they encode the entire scene into a single latent vector, thereby losing the localized spatio-temporal correlations that are important for making accurate physical predictions. On the other hand, the ConvLSTM (Xingjian et al., 2015) architecture captures localized spatio-temporal correlations, but is not able to accurately maintain the global dynamics of entities due to its LSTM state updates and limited separation of stationary and non-stationary objects (as also seen in our experiments in Section 4.1).

Predicting the physical behavior of an entity requires a model that can perform two crucial operations – 1) isolation of the dynamics of each entity, and 2) accurate modeling of localized spaces and interactions around the entity. In order to satisfy both desiderata, we propose a new architecture, SpatialNet, which uses a spatial memory that explicitly encodes dynamics that are updated with object movement through convolutions. This allows us to implicitly capture and maintain localized physics, such as entity velocities and collisions between entities, in our frame prediction model and results in significantly lower long term prediction error.

Figure 2: Overview of the SpatialNet architecture. SpatialNet takes an RGB image as input and passes it through an encoder ($E$), consisting of two residual blocks, to form an input encoding $x_t$. $x_t$ is processed by a spatial memory module ($M$) to obtain an output representation $o_t$, which is used by the decoder ($D$) to predict the next frame. The spatial memory stores meta-information about each entity and its locality. See Section 3 for more details.

SpatialNet Architecture SpatialNet is conceptually simple and consists of three modules (Figure 2). The first module is a standard convolutional encoder $E$ that converts an input image $i_t$ into a 3D representational map $x_t$. The second module is a spatial memory block, $M$, that converts $x_t$ and the hidden state $h_{t-1}$ from the previous timestep into an output representation $o_t$ and a new hidden state $h_t$. Finally, we have a convolutional decoder $D$ that predicts the next frame $\hat{i}_{t+1}$ from $o_t$. Both the encoder and decoder modules ($E$ and $D$) use two convolutional layers each, with residual connections.

We implement the spatial memory block as a set of 2D convolution operations. The module takes in a previous hidden state $h_{t-1}$ and input $x_t$ at timestep $t$, both of shape $C \times N \times N$, where $C$ is the number of channels and $N$ is the dimensionality of the grid. We then perform the following operations:

$$\tilde{h}_t = \sigma\big(W_{e_2} * \sigma(W_{e_1} * [h_{t-1}; x_t])\big), \qquad h_t = \sigma(W_p * \tilde{h}_t), \qquad o_t = \sigma\big(W_o * [h_t; x_t]\big), \qquad (2)$$

where $*$ denotes a convolution, $[\cdot\,;\cdot]$ denotes concatenation, $W_{e_1}$, $W_{e_2}$, $W_p$ and $W_o$ are convolutional kernels, and $\sigma$ is a non-linearity (we use ELU (Clevert et al., 2015)). The module first encodes a combination of $h_{t-1}$ and $x_t$ into a proposal state $\tilde{h}_t$ using the two convolutions $W_{e_1}$ and $W_{e_2}$. $W_p$ acts like a dynamics simulator and generates a new hidden state $h_t$, which captures the localized predictions for the next state around each entity. Finally, $W_o$ uses $h_t$ and $x_t$ to produce $o_t$, encoding information about the entire frame to be rendered by subsequent decoding.

Intuitively, the SpatialNet architecture biases the module towards storing relevant physics information about each entity in a block of pixels at the entity’s corresponding location. This information is sequentially updated through the convolutions, while static information such as background texture is passed directly through the input encoding (see Figure A5 of the appendix). We note that our spatial memory is not action-conditional, which allows us to learn from task-independent videos as well as generalize better to new environments. Given training videos $\{i_1, \ldots, i_T\}$, we learn the parameters of the model using a standard MSE-based loss function, $\mathcal{L} = \sum_t \lVert \hat{i}_{t+1} - i_{t+1} \rVert_2^2$, where $\hat{i}_{t+1}$ is the model's prediction of frame $i_{t+1}$.

SpatialNet is inspired by the ConvLSTM model but differs in that, while ConvLSTM performs an additive state update (of the form $c_t = f_t \circ c_{t-1} + i_t \circ \tilde{c}_t$), SpatialNet uses convolutions to update the hidden state (Eqn. 2). This allows SpatialNet to better simulate moving objects and physical interactions. Another difference is that SpatialNet has residual connections, which provide a more straightforward inductive bias towards maintaining both static and dynamic information across states.
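The following PyTorch sketch shows one way to realize the spatial memory block of Eqn. (2) together with a minimal encoder/decoder. Kernel sizes, channel widths and the exact placement of the residual connection are assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialMemory(nn.Module):
    """Sketch of the spatial memory block of Eqn. (2): all state is a
    C x N x N grid and is updated purely with convolutions."""
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2
        self.w_e1 = nn.Conv2d(2 * channels, channels, kernel_size, padding=pad)
        self.w_e2 = nn.Conv2d(channels, channels, kernel_size, padding=pad)
        self.w_p = nn.Conv2d(channels, channels, kernel_size, padding=pad)
        self.w_o = nn.Conv2d(2 * channels, channels, kernel_size, padding=pad)

    def forward(self, x_t, h_prev):
        # proposal state from the concatenated previous state and input encoding
        h_tilde = F.elu(self.w_e2(F.elu(self.w_e1(torch.cat([h_prev, x_t], dim=1)))))
        h_t = F.elu(self.w_p(h_tilde))                       # "dynamics simulator"
        o_t = F.elu(self.w_o(torch.cat([h_t, x_t], dim=1)))  # frame-level output
        return o_t, h_t

class SpatialNetSketch(nn.Module):
    """Encoder -> SpatialMemory -> Decoder with a residual skip from the
    input encoding (layer sizes are assumptions)."""
    def __init__(self, in_ch=3, hid=64):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(in_ch, hid, 3, padding=1), nn.ELU(),
                                 nn.Conv2d(hid, hid, 3, padding=1), nn.ELU())
        self.mem = SpatialMemory(hid)
        self.dec = nn.Sequential(nn.Conv2d(hid, hid, 3, padding=1), nn.ELU(),
                                 nn.Conv2d(hid, in_ch, 3, padding=1))

    def forward(self, frame, h_prev):
        x_t = self.enc(frame)
        o_t, h_t = self.mem(x_t, h_prev)
        return self.dec(o_t + x_t), h_t   # next-frame prediction, new hidden state
```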

Ego-dynamics One important feature of our dynamics predictor is that it is not conditioned on the action(s) of the agent, i.e. it does not account for ego-dynamics. We make this choice in order to keep the dynamics prediction model task-agnostic. As we demonstrate in our experiments (Section 4.3), this makes our approach generalize well to a variety of different tasks and learn faster in transfer experiments.

4 Experiments

We perform two empirical studies to evaluate our hypothesis. First, we evaluate various frame prediction models, including our proposed SpatialNet, in terms of their capacity to predict future states and model physical interactions (Sections 4.1 and  4.2). Then, we investigate the use of these dynamics predictors for policy learning in different environments (Section 4.3).

Physics video dataset

In order to train a prediction model specifically for physical interaction, we generate a new video dataset, PhysVideos, using a 2-D physics engine (Pymunk). Each video in the dataset has 84 x 84 frames with 4-8 different shapes (such as squares or circles) moving inside a room with up to 3 randomly generated interior walls (see Figure 1 (top)). Objects are initialized with random positions and velocities, a friction coefficient of 0.9 and an elasticity of 0.95, resulting in diverse object movements across each trajectory. Being able to predict the future in this type of environment requires 2-dimensional physics reasoning, such as inferring velocity from past movement, anticipating changes in momentum due to collisions, and predicting rotations of each object. We generate 5000 different trajectories in total – 4500 for training the dynamics predictor and 500 for testing – with each trajectory having a length of 125 steps. See the supplementary material for sample trajectories.
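A minimal sketch of generating one such trajectory with Pymunk is shown below. It uses only circles, placeholder velocity ranges and no interior walls, and omits rasterization to image frames, so it illustrates the setup rather than reproducing the exact dataset.

```python
import random
import pymunk

def sample_trajectory(steps=125, n_objects=6, size=84):
    """Generate one PhysVideos-style trajectory; rendering to RGB frames
    (e.g. with pygame) is omitted here."""
    space = pymunk.Space()
    space.gravity = (0, 0)

    # Bounding walls of the room (elastic, so objects bounce off them).
    corners = [(0, 0), (size, 0), (size, size), (0, size)]
    for a, b in zip(corners, corners[1:] + corners[:1]):
        wall = pymunk.Segment(space.static_body, a, b, 1.0)
        wall.elasticity = 0.95
        space.add(wall)

    # Randomly placed circles with random initial velocities.
    for _ in range(n_objects):
        body = pymunk.Body(mass=1.0, moment=pymunk.moment_for_circle(1.0, 0, 5))
        body.position = (random.uniform(10, size - 10), random.uniform(10, size - 10))
        body.velocity = (random.uniform(-50, 50), random.uniform(-50, 50))
        shape = pymunk.Circle(body, 5)
        shape.elasticity = 0.95
        shape.friction = 0.9
        space.add(body, shape)

    positions = []
    for _ in range(steps):
        space.step(1.0 / 60.0)
        positions.append([tuple(b.position) for b in space.bodies
                          if b.body_type == pymunk.Body.DYNAMIC])
    return positions
```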

4.1 Frame Prediction

In this section, we evaluate various frame prediction models on their accuracy across different horizons. We report results on the 500 trajectories from the test set of PhysVideos.

Baselines We compare our model, SpatialNet, with the following baselines:

  1. RCNet: the model of (Oh et al., 2015) modified to work without action-conditioning, i.e. the next frame is predicted from the previous frames alone.

  2. ConvLSTM (Xingjian et al., 2015): this model replaces all the inner operations of an LSTM with convolutions. We use a kernel size of 5 and the same encoders and decoders as in SpatialNet.

  3. ConvLSTM + Residual: a modified version of ConvLSTM with added residual connections from input to output of the LSTM cell.

We train all prediction models using a mean squared error (MSE) loss, optimized with Adam (Kingma and Ba, 2015).
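The sketch below illustrates this training and evaluation protocol: teacher-forced 1-step training with MSE, followed by multi-step rollouts in which the model consumes its own predictions (as in the 5- and 10-step columns of Table 1). The zero-initialized hidden state and its channel width match the SpatialNet sketch in Section 3.2 and are assumptions for the other models.

```python
import torch
import torch.nn.functional as F

def train_one_epoch(model, loader, opt, device="cpu"):
    """Teacher-forced 1-step training. Each batch is a clip tensor of
    shape (B, T, C, H, W)."""
    for clip in loader:
        clip = clip.to(device)
        h = torch.zeros(clip.size(0), 64, clip.size(3), clip.size(4), device=device)
        loss = 0.0
        for t in range(clip.size(1) - 1):
            pred, h = model(clip[:, t], h)      # predict frame t+1 from frame t
            loss = loss + F.mse_loss(pred, clip[:, t + 1])
        opt.zero_grad()
        loss.backward()
        opt.step()

@torch.no_grad()
def rollout_mse(model, clip, horizon=10):
    """Multi-step evaluation: after the first frame, the model is fed its
    own predictions for `horizon` steps."""
    h = torch.zeros(clip.size(0), 64, clip.size(3), clip.size(4))
    pred, errors = clip[:, 0], []
    for t in range(1, horizon + 1):
        pred, h = model(pred, h)
        errors.append(F.mse_loss(pred, clip[:, t]).item())
    return errors
```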

Model 1 step 5 step 10 step Objects Lost
RCNet (Oh et al., 2015) 0.0061 0.0140 0.0268 1.0
ConvLSTM (Xingjian et al., 2015) 0.0026 0.0303 0.0503 0.4
ConvLSTM + Residual 0.0026 0.0141 0.0210 0.45
SpatialNet 0.0024 0.0114 0.0176 0.13
Table 1: MSE for multi-step prediction on PhysVideos (lower is better). All models were trained with 1 step prediction loss. SpatialNet suffers least from compound errors during prediction, and is able to maintain objects and dynamics more consistently (Figure 3). Number of objects lost (after 20 steps) was determined manually by evaluating 15 random videos in the test set.
Figure 3: Visualization of multi-step predictions of SpatialNet, RCNet, and ConvLSTM variants, along with ground truth (GT). After 20 steps of self prediction, SpatialNet maintains the internal wall and all seven objects in the scene while RCNet (Oh et al., 2015) loses the internal wall and 3 of the moving objects. ConvLSTM loses shape information and has less accurate dynamics prediction. SpatialNet is most consistent in obeying physical laws.

Results From Table 1, we see that SpatialNet achieves a lower test MSE compared to all the baselines, especially for multi-step predictions. This suggests that SpatialNet encourages better dynamic generalization compared to RCNet and ConvLSTM. We can also observe from Figure 3, that SpatialNet is able to accurately maintain the number of objects in the video even after 20 steps, while the baselines suffer from merging of objects (RCNet) or loss of shape information (ConvLSTM). Further, SpatialNet is also able to maintain background details such as walls that are quickly lost in RCNet, as the spatial memory structure allows the input to easily remember fixed background information. We also find that the spatial memory’s overall structure allows it to be very resistant to input noise as well as better generalize to unseen environments – please see the supplementary material for detailed analyses.

4.2 Predicting physical parameters

To further probe the representations learned by the frame prediction models, we test their ability to predict physical properties of environments (e.g. elasticity or drag) from videos. We train a 2-layer classification model on top of the hidden state representations produced by SpatialNet/ConvLSTM to predict one of 3 values for elasticity/drag – low, medium or high. Only the classification layers are trained, while the rest of the parameters are kept fixed (except in the full-train setting).
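A minimal sketch of such a probe is shown below; the global average pooling over the memory grid and the hidden width are assumptions.

```python
import torch
import torch.nn as nn

class PhysicsProbe(nn.Module):
    """2-layer classifier over frozen hidden states, predicting one of three
    bins (low/medium/high) for drag or elasticity."""
    def __init__(self, channels=64, hidden=128, n_classes=3):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(channels, hidden), nn.ReLU(),
                                 nn.Linear(hidden, n_classes))

    def forward(self, h_t):                     # h_t: (B, C, N, N) memory state
        return self.net(h_t.mean(dim=(2, 3)))   # global average pool over the grid

# Only the probe is trained; the frame predictor stays frozen:
# for p in predictor.parameters():
#     p.requires_grad_(False)
```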

From Table 2, we see that randomly initialized parameters or SpatialNet trained on Atari Pong don’t do well, indicating that they don’t capture physics. SpatialNet trained on PhysVideos gets an accuracy of around 69% on drag prediction (close to the fully trained model accuracy of 78%). This shows that the pre-training indeed helps the model acquire priors over physical dynamics. Further, the low numbers of the model trained on Atari Pong indicate that task-specific frame prediction may not generalize well.

Model Drag Elasticity
SpatialNet (random init) 35.8 43.8
SpatialNet (PT on Atari Pong) 35.0 33.6
ConvLSTM (PT on PhysVideos) 57.2 53.2
SpatialNet (PT on PhysVideos) 69.8 56.9
SpatialNet (full train) 78.5 67.8
Table 2: Accuracies on predicting drag and elasticity from video frames (PT = pre-training)

4.3 Reinforcement Learning

In this section, we describe the use of SpatialNet to accelerate reinforcement learning. We first train SpatialNet on the physics video dataset described in the previous section. Then, we use the pre-trained SpatialNet model as a future frame predictor for a reinforcement learning agent. We perform empirical evaluations on three different platforms: a suite of 2D games (PhysWorld), a 3D partially observable environment (PhysShooter3D), and a stochastic version of Atari games (Machado et al., 2017a). We demonstrate that IPA with SpatialNet pre-training outperforms existing approaches on all platforms. The IPA architecture also allows for effective decoupling of environment dynamics from agent policy, which results in better transfer performance across tasks.

Experimental setup

In our experiments, we use SpatialNet to predict the next k future frames (we find k = 3 to work well). We then stack the current frame with the k predicted frames and use this as input to a model-free policy. We use the Adam optimizer to train the model predictions, and the same set of hyper-parameters for training all policy agents as those used in (Schulman et al., 2017). For our policy network, we use the architecture described in (Mnih et al., 2015). We report numbers averaged over 3 different random seeds.

Baselines

We compare our agent (IPA) with a number of different baselines:

  1. PPO: A standard implementation of Proximal Policy Gradient (PPO) (Schulman et al., 2017), which is model-free and uses the current frame with the last k frames to output an action. The number of frames provided to PPO is the same as that provided to IPA.

  2. PPO + VF: PPO with value function expansion (Feinberg et al., 2018), which uses a dynamics predictor to obtain a more consistent estimate of the current state’s value.

  3. I2A: Imagination Augmented Agent (Weber et al., 2017), which uses a combination of past frames and a recurrent encoding of future rollouts (here, future frames predicted by SpatialNet) as input to the policy.

  4. ISP: A variant of IPA that uses the hidden layer of SpatialNet directly as input to a policy network.

  5. JISP : ISP with auxiliary frame prediction loss.

  6. Other frame predictors: Finally, we also consider baselines where we augment our agent, IPA, with future frames predicted by RCNet (Oh et al., 2015) and ConvLSTM (Xingjian et al., 2015).

PhysWorld

PhysGoal PhysForage PhysShooter
PPO 17.9 (0.8) 44.2 (5.4) 23.2 (1.2)
PPO + VF 19.2 (2.4) 40.4 (5.4) 26.1 (2.9)
I2A + SpatialNet (action-cond) 4.2 (0.4) 23.7 (3.1) 16.5 (1.8)
I2A + SpatialNet 16.4 (6.2) 20.8 (2.0) 19.3 (0.7)
IPA + RCNet 20.7 (3.1) 46.3 (23.4) 31.7 (1.0)
IPA + ConvLSTM 21.6 (2.1) 39.5 (7.0) 29.1 (1.6)
ISP 15.2 (1.2) 45.3 (5.5) 18.6 (1.1)
JISP 18.2 (5.5) 124.3 (27.1) 28.6 (1.5)
IPA + SpatialNet (Blink) 24.6 (2.8) 48.5 (5.3) 31.0 (1.9)
IPA + SpatialNet (PhysVideos) 30.8 (5.2) 50.6 (11.5) 42.3 (2.9)
Table 3:

Average scores (with standard deviation) obtained in PhysWorld environments by various agents after 10 million frames of training. Scores are rewards over 100 episodes, averaged over runs with 3 different random seeds. IPA + SpatialNet consistently outperforms the other approaches. RCNet, SpatialNet, ConvLSTM are pretrained on PhysVideos. PPO+VF = PPO with Value Function Expansion. SpatialNet (Blink) refers to a model trained on videos with blinking objects. We add 500K additional frames to the PPO baselines to account for the frames used in pre-training for the other models.

We first consider PhysWorld, a new collection of three physics-centric 2D games that we created. These games require an agent to accurately predict object movements and rotations in order to perform well. All three tasks have an environment consisting of around 10 randomly moving boxes and circles as well as up to three internal impassable walls. PhysGoal is a navigation task to reach goals while avoiding objects, PhysForage is an object gathering task, and PhysShooter requires a stationary agent to shoot selected moving objects while avoiding collisions. The objects in each of these environments have different colors and sizes from those used to train the dynamics predictor in Section 4. We provide a detailed description of each task in the supplementary material. We emphasize that the main parameters (such as object velocities and rotations) in the PhysWorld games are fully randomized for each episode. To obtain good performance, agents need a good understanding of general physics and cannot simply memorize frames.

Figure 4: Training curves on PhysWorld and MSE curve (bottom-right) for predicting future frames in PhysShooter.

Results: We detail the performance of our approach compared to the baselines in Table 3 and show learning curves in Figure 4. Quantitatively, we find that our approach, IPA + SpatialNet (PhysVideos), obtains significant gains over most baselines in all three PhysWorld tasks. IPA with RCNet or ConvLSTM provides smaller benefits, due to slower learning than with SpatialNet. PPO with value expansions (PPO+VF) also provides slight gains, but significantly less than the gains conferred by IPA. I2A leads to no gains in performance, since it generates a global encoding of an image, destroying local dynamics information of objects. Both ISP and JISP perform worse than IPA except on PhysForage, where JISP performs better, likely due to increased policy capacity compared to IPA (i.e. more parameters). We observe that SpatialNet trained on videos with blinking objects does not provide as much of a benefit, indicating that our full model is learning aspects of dynamics beyond just object appearances.

IPA encourages the policy to take into account the future physics of objects, a bias crucial for good performance on each of the PhysWorld environments. Qualitatively, we observe that in all three environments, IPA agents navigate to goals and collect objects with more confidence, even when obstacles are nearby. In PhysShooter, IPA agents are much better at hitting objects further away on the map, which requires anticipating multiple time-steps before a collision. Figure 4 demonstrates how having a good prior results in faster learning of the environment dynamics of PhysShooter.

Figure 4 shows the relative training rates of policies under PPO and IPA. In PhysShooter we see immediate benefits in using a physics model, as knowledge of future physics is crucial when the agent only gets one action approximately every 4-5 frames. In PhysGoal and PhysForage, we see long-term benefits in knowing future physics, as this knowledge allows the agents to collect points more efficiently.

PhysShooter3D

Additionally, we evaluate on PhysShooter3D, a 3D physics game which we construct using Bullet (Coumans, 2010). We add gravity to the world and generate moving projectiles that follow bouncing parabolic trajectories. We then render 2D images from a particular viewpoint, causing moving objects to be partially or fully occluded at times. With these additional factors, learning dynamics is even more challenging. The game requires a stationary agent to fire bullets at selected 3D projectiles without itself being hit by any projectile. We found that IPA + SpatialNet obtained a higher score than PPO, while IPA using ground-truth future frames obtained the highest score. This demonstrates that IPA generalizes well to partially observed settings, with room for improvement through better frame prediction.

Stochastic Atari Games

In addition to PhysWorld and PhysShooter3D, we also investigate the performance of IPA on a stochastic version of the Arcade Learning Environment (ALE) (Bellemare et al., 2013), obtained by adding sticky actions, where the agent repeats its last action with probability 0.5 at each step. This stochasticity was shown to be the most challenging type of randomization to add to ALE (Hausknecht and Stone, 2015; Machado et al., 2017a). We evaluate performance on all Atari games, a subset of which are shown in Table 4. All Atari experiments are run with 5 different seeds.

We emphasize that this is an out-of-domain evaluation – we use the prior trained on PhysVideos to initialize the dynamics predictor for Atari, which contains a significantly different pixel space. Further, not all Atari games are reliant on understanding physics and we do not expect our approach to provide significant gains on those environments.

Results: From Table 4, we observe that IPA outperforms PPO in 8 out of the 10 tasks shown (results on all Atari games are in the supplementary material) – these are all games that contain physical interactions between objects and benefit from our prior. In several games like Enduro, Breakout, Frostbite, FishingDerby and Assault, IPA provides benefits later in training, after the agent has figured out a good initial policy. In others like Asteroids and DemonAttack, IPA shows immediate boosts in training performance, resulting in faster policy learning. On Pong, where IPA performed worse than PPO, we found that the agents learned to place the paddle at one particular location where, without further paddle movement, the ball would bounce and score points. Similarly, on IceHockey, where PPO outperformed IPA, we found that agents can learn a repetitive strategy to prolong the game indefinitely, removing the need for tracking dynamics information. In such situations, there is no added advantage to predicting dynamics, explaining the reduced scores of IPA. We provide additional qualitative results, including frame predictions, in the supplementary material.

Environment PPO I2A IPA
Assault 2932 (153) 3249.7 (378) 2968.4 (124)
Asteroids 1321 (233.5) 1340 (351) 2098 (102)
Breakout 19.7 (0.9) 18.7 (0.0) 23.4 (1.0)
DemonAttack 5510 (412) 5492 (233) 6793 (558)
Enduro 376.7 (10.5) 380 (8.0) 398.6 (23.0)
FishingDerby 6.7 (10.1) 12.1 (4.0) 9.3 (3.0)
Frostbite 1342 (2154) 1649 (2100) 1701 (2485)
IceHockey -5.9 (0.3) -6.3 (0.0) -6.1 (0.0)
Pong 6.6 (14.1) -1.4 (15.0) 2.2 (13.0)
Tennis -6.3 (2.1) -8 (4.0) -3.8 (1.0)
Table 4: Scores (and standard deviation) obtained on Stochastic Atari Environments with sticky actions (actions repeated with 50% probability at each step). Scores are average performance over 100 episodes after 10M training frames, over 5 different random seeds with included standard deviations.
Source env Agent Model transfer Policy transfer Reward
None PPO - - 23.2
None IPA - - 35.42
PhysVideos IPA + SpatialNet Y - 42.27
PhysGoal PPO - Y 25.42
IPA + SpatialNet (Fix) Y N 26.30
IPA + SpatialNet (FT) Y N 42.83
IPA + SpatialNet (FT) Y Y 42.44
PhysForage PPO - Y 24.47
IPA + SpatialNet (Fix) Y N 30.30
IPA + SpatialNet (FT) Y N 53.66
IPA + SpatialNet (FT) Y Y 40.40
Table 5: Effects of model initialization and transfer on training policies in PhysShooter. Topmost section shows baseline PPO, random initialization of dynamics for IPA, and pre-trained IPA using PhysVideos. The bottom two sections demonstrate results while transferring different models from two other games – direct policy (PPO), transfer dynamics model and fix it (Fix), transfer dynamics and finetune (FT), and transfer both dynamics+policy and finetune. IPA allows decoupling of policy transfer from model transfer, allowing better transfer in cases of environment similarity but task dissimilarity. Scores obtained on the PhysWorld environments after training for 10M frames and evaluated by taking average rewards of the last 100 training episodes.

4.4 Transfer and Generalization

We now present some empirical results in the transfer scenario and provide some analysis of our model. Table 5 shows the impact of initializing IPA with different pre-trained dynamics models on the PhysShooter environment. We find that initializing SpatialNet with random parameters does not perform very well, while using a SpatialNet pretrained on PhysVideos provides better performance (see Figure 4 for MSE errors). Moreover, we observe that transferring a SpatialNet model fine-tuned on a different task like PhysForage or PhysGoal results in even greater performance improvements. Interestingly, transferring just the dynamics model in IPA results in larger performance gains than transferring both model and policy. For instance, transferring the model from PhysForage results in a score of 53.66, while transferring both model and policy gets a lower score of 40.40. The former is a 27% increase compared to using just PhysVideos (42.27), while the latter results in a lower score. This provides further evidence that decoupling model learning from policy learning allows for better generalization.

5 Conclusion

We have proposed a new approach to model-based reinforcement learning by learning task-agnostic dynamics priors. First, we pre-train a frame prediction model (SpatialNet) on raw videos of a variety of objects in motion. We then use this network to initialize a dynamics model for an RL agent, which makes use of predicted frames as additional context for its policy. Through several experiments on three different domains, we demonstrate that our approach outperforms model-free techniques and approaches that learn environment dynamics from scratch. We also demonstrate the generalizability of our dynamics predictor through transfer learning experiments.

Acknowledgements

We would like to thank Alexander Botev, John Schulman, Tejas Kulkarni, Bowen Baker and the OpenAI team for providing helpful comments and suggestions.

References

Appendix A.1 Additional Dynamic Prediction Experiments

Noise magnitude RCNet ConvLSTM SpatialNet (ours)
0 0.0061 0.0026 0.0024
0.1 0.0078 0.0030 0.0026
0.5 0.0268 0.0072 0.0062
Table 6: MSE loss on the physics prediction dataset for single-step prediction with test inputs corrupted by Gaussian noise of the indicated magnitude (models trained with no corruption). Due to its local nature, SpatialNet suffers less from errors in its inputs and is able to maintain object counts and dynamics more consistently even under domain shift.

a.1.1 Sensitivity to Corruption of Inputs

We investigate the effects of noisy observations in the input domain at test time on both SpatialNet and RCNet, by adding different amounts of Gaussian random noise to input images (Table 6). We find that SpatialNet is more resistant to noise addition. SpatialNet predictions are primarily local, preventing compounding of error from corrupted pixels elsewhere in the image whereas RCNet compresses all pixels into a latent space, where small errors can easily escalate.

a.1.2 Qualitative visualizations of Generalization Predictions

We provide visualizations of video prediction on each of the generalization datasets in Figure A1 and Figure A2.

Figure A1: Predictions of SpatialNet and RCNet on a test dataset with objects half the size and twice the movement speed of those trained on. All shown frames are one-step predictions. SpatialNet is able to accurately generalize to smaller, faster objects, while RCNet is unable to generate the shapes of the smaller objects and suffers from background degradation, and ConvLSTM is unable to maintain shapes and dynamics.
Figure A2: Predictions of SpatialNet on input images of 168 x 168 when SpatialNet was trained on 84 x 84 images. Predictions shown are 1-step future predictions. SpatialNet is able to maintain physical consistency at larger input sizes.

a.1.3 Dataset Generalization.

We test generalization by evaluating on two unseen datasets. For the first, we create a test set where objects are half the size of those in the training set and are initialized randomly with approximately twice the starting velocity. In this new regime, we found that RCNet had an MSE of 0.0115 and ConvLSTM an MSE of 0.0067, while SpatialNet had an MSE of 0.0039. We find RCNet is unable to maintain the shapes of the smaller objects, sometimes omitting them, while ConvLSTM maintains shape but is unable to adapt to the new dynamics, as seen in Figure A1. In contrast, SpatialNet's local structure allows it to generate the new shapes, and its separation of dynamics allows better generalization. In the second dataset, we explore input size invariance. We create a second test dataset consisting of 16-32 random circles and squares and input images of size 168x168x3 (the density of objects per area is conserved). On this dataset, we obtained an MSE of 0.0042, compared to 0.0060 for ConvLSTM, which is comparable to the MSE of 0.0024 on the original test dataset; the qualitative images in Figure A2 show that the spatial memory's local structure allows it to easily generalize to different input image sizes.
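This input-size invariance follows from the model being fully convolutional, as the toy example below illustrates with a stand-in convolutional stack rather than the full model.

```python
import torch
import torch.nn as nn

# A fully convolutional predictor has no parameters tied to the input size,
# so weights trained on 84x84 frames can be applied directly to 168x168 frames.
net = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ELU(),
                    nn.Conv2d(64, 3, 3, padding=1))
print(net(torch.randn(1, 3, 84, 84)).shape)    # torch.Size([1, 3, 84, 84])
print(net(torch.randn(1, 3, 168, 168)).shape)  # torch.Size([1, 3, 168, 168])
```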

Appendix A.2 PhysWorld

We provide a description of the three game environments in PhysWorld:

PhysGoal: In this environment, an agent has to navigate to a large red goal. Each successful navigation (+1 reward) respawns the red goal at a random location while collision with balls or boxes terminates the episode (-1 reward).

PhysForage: Here, an agent has to collect moving balls while avoiding moving boxes. Each collected ball (+1 reward) randomly respawns at a new location with a new velocity. Collision with boxes leads to termination of the episode (-1 reward).

PhysShooter: In PhysShooter, the agent is stationary and has to choose an angle to shoot bullets. Each bullet travels through the environment until it hits a square (+1 reward) or circle (-1 reward) or leaves the screen. If a moving ball or box hits the agent (-1 reward), the episode is terminated. After firing a bullet, the agent cannot fire again until the bullet disappears.

Examples of agents playing the PhysWorld environments are given in Figure  A3.

Figure A3: Example agent game-play in each of the PhysWorld environments. In PhysGoal, the dark blue agent attempts to reach a red goal while avoiding moving objects. In PhysForage, the dark blue agent attempts to gather light blue circles while avoiding squares. In PhysShooter, the dark blue agent is immobile and fires green bullets at squares while avoiding circles. In PhysShooter3D, the grey agent fires turquoise bullets at purple spheres while avoiding blue spheres.

a.2.1 SpatialNet Predictions

Figure A4: Future image prediction on PhysGoal (left) and PhysShooter (right). First image is current observation, the next three are predicted. SpatialNet is able to predict future dynamics of boxes and balls and anticipate agent movement (PhysGoal) and agent shooting (PhysShooter).

Figure A4 shows qualitative predictions of the next 3 frames by SpatialNet on the PhysWorld environments, with the first frame being the current observation. In PhysGoal, SpatialNet is able to infer the movement of the obstacles, the dark blue agent, and the red goal after agent collection. In PhysForage, SpatialNet is able to infer the movement of obstacles as well as the gathering of a circle. In PhysShooter, SpatialNet is able to anticipate a collision of the bullet with a moving obstacle and further infer the shooting of a green bullet by the agent.

Figure A5: Visualization of SpatialNet hidden state on PhysVideos (left), PhysGoal (middle) and Atari DemonAttack (right). Hidden state has high activations for moving objects while background objects such as walls (left), red goals (middle) and platforms (right) are not attended to as much.

a.2.2 Visualization of Spatial Memory

We visualize the values of the spatial memory hidden state while predicting future frames on PhysVideos, PhysGoal and the Atari environment DemonAttack in Figure A5. To visualize, we take the mean across the channels of each grid pixel in the spatial memory hidden state. We find a strong correspondence between high-activation regions in the spatial memory and the ground-truth locations of the dynamic objects. We further find that static background, such as walls, goals and platforms, appears to be passed along in the input features.

Appendix A.3 Additional Atari experiments

Figure A6: Plots of policy performance trained with either PPO or IPA on all Atari environments over 5 different seeds. IPA sometimes leads to slower learning early in training due to the rapidly changing 3 predicted future frames. However, later in training, IPA provides performance gains in many environments by giving policies access to future trajectories.
Environment PPO IPA
Alien 1668.6 224.3 1485.5 281.0
Amidar 855.9 98.6 725.5 135.0
Assault 2939.2 153.2 2968.4 124.0
Asterix 2920.8 287.3 2334.0 184.0
Asteroids 1321.0 233.5 2098.4 102.0
Atlantis 323205.4 277643.2 289369.8 239469.0
BankHeist 310.4 44.0 334.3 29.0
BattleZone 26828.0 8472.0 16526.7 6986.0
BeamRider 553.1 28.4 1630.3 400.0
Bowling 46.6 5.2 64.3 13.0
Boxing 54.3 2.5 8.9 20.0
Breakout 19.7 0.9 23.4 1.0
Centipede 6043.7 990.6 6032.5 199.0
ChopperCommand 6549.4 1779.1 4112.0 1024.0
CrazyClimber 36893.2 463.9 38499.0 1221.0
DemonAttack 5510.9 412.5 6793.6 558.0
DoubleDunk -4.0 0.5 -3.8 0.0
Enduro 376.7 10.5 398.6 23.0
FishingDerby 6.7 10.1 9.3 3.0
Freeway 29.2 3.6 31.2 1.0
Frostbite 1342.5 2154.5 1701.1 2485.0
Gopher 904.0 42.3 941.1 56.0
Gravitar 574.9 36.2 627.2 25.0
IceHockey -5.9 0.3 -6.1 0.0
Jamesbond 598.9 112.1 454.3 34.0
Kangaroo 2842.4 2461.2 1373.0 445.0
Krull 5178.9 205.1 5219.3 129.0
KungFuMaster 13831.6 4483.6 13358.5 4352.0
MontezumaRevenge 0.0 0.0 129.7 122.0
MsPacman 1990.1 227.9 2097.3 259.0
NameThisGame 5406.4 278.0 5131.3 427.0
Pitfall -0.1 0.3 0.0 0.0
Pong 6.6 14.1 2.2 13.0
PrivateEye 95.6 5.4 99.6 0.0
Qbert 6981.0 548.0 6331.4 769.0
Riverraid 3411.0 201.9 3612.4 130.0
RoadRunner 19329.6 8472.6 20041.8 4906.0
Robotank 11.9 1.8 14.9 3.0
Seaquest 1426.0 43.5 1408.7 51.0
SpaceInvaders 902.4 66.0 1132.6 101.0
StarGunner 3450.0 801.5 5778.5 1584.0
Tennis -6.5 2.1 -3.8 1.0
TimePilot 4281.8 126.6 4580.0 314.0
Tutankham 128.5 12.3 118.2 35.0
UpNDown 15872.3 3995.3 16913.7 6344.0
Venture 930.2 137.9 946.7 167.0
VideoPinball 18878.1 1251.7 13981.2 2136.0
WizardOfWor 3835.6 404.7 4629.8 662.0
Zaxxon 7197.4 220.6 7271.0 264.0
Table 7: Scores obtained on Stochastic Atari Environments with sticky actions (actions repeated with 50% probability at each step). Scores are average performance over 100 episodes after 10M training frames, over 5 different random seeds.

We provide plots of training curves on all Atari environments in Figure A6 and quantitative numbers in Table 7.

Predictions on Atari

Environment MSE PD MSE DN Percent Advantage
Assault 0.00477 0.00522 9.4%
Asteroids 0.002506 0.002518 4.7%
Breakout 0.000417 0.000423 1.4%
DemonAttack 0.00433 0.00562 29.8%
Enduro 0.00576 0.00411 -28.7%
FishingDerby 0.00183 0.00192 4.9%
Frostbite 0.000965 0.00107 10.8%
IceHockey 0.000614 0.0013 111.7%
Pong 0.00636 0.00584 -8.2%
Tennis 0.00142 0.00132 -7.1%
Table 8: MSE on Stochastic Atari Environments (an action is repeated with probability 0.5, giving geometrically distributed repeat lengths) at 1 million training frames. MSE PD is from a model initialized with the physics dataset, while MSE DN is from a model trained from scratch. We evaluate the percentage advantage of initializing with the physics dataset as compared to training from scratch, and observe an average 12.9% decrease in MSE error using the physics-dataset initialization. The most negatively affected environment, Enduro, involves a 3D landscape, for which initializing from a model trained on a 2D physics dataset may be detrimental.

We also investigate the benefits (in terms of MSE) of initializing SpatialNet pretrained on the physics dataset compared to training from scratch in Table 8. We evaluate the MSE error at 1 million frames and find that initializing with the physics dataset provides a 12.9% decrease in MSE error. We find that pretraining helps on 7 of the 10 Atari environments, with the most negatively impacted environment being Enduro, a 3D racing environment for which the environmental prior encoded by the physics dataset may be detrimental. More significant gains in transfer may be achievable by using a large online database of 2D YouTube videos that covers an even greater diversity of games.

SpatialNet Predictions We further visualize qualitative results of SpatialNet during Atari training in Figure A7. In general, across the Atari suite, we found that SpatialNet is able to accurately model both the environment and the agent's behavior. In the figure, we see that SpatialNet accurately predicts agent movement and ice block movement in Frostbite. On DemonAttack, SpatialNet is able to infer the falling of bullets. On Asteroids, SpatialNet is able to infer the movement of asteroids. Finally, on FishingDerby, SpatialNet is able to predict the right player capturing a fish and also predict that the left player is likely to catch a fish (indicated by the blurriness of the rod). We note that blurriness in the predicted output may in fact be beneficial to the policy, since the policy can learn to interpret uncertain predictions in its input.

Figure A7: Visualization of model future state prediction on 4 games in Atari (Frostbite - upper left, DemonAttack - lower left, Asteroids - upper right, FishingDerby - lower right). SpatialNet is able to predict falling of bullets, the catching of fish, movement of asteroids, and the movement of tiles/future agent movement in different environments. First frame visualized is ground truth observation, next 3 frames are model future frame predictions.