Recent advances in deep reinforcement learning (RL) have largely relied on model-free approaches, demonstrating strong performance on a variety of domains (Silver et al., 2016; Mnih et al., 2013; Kempka et al., 2016; Zhang et al., 2018c). However, model-free techniques have poor sample efficiency (Sutton, 1990) and are difficult to adapt to new tasks or domains (Nichol et al., 2018). A key reason for this is that a single value function is used to represent both the agent’s policy and its knowledge of environment dynamics, which can result in heavy overfitting to a particular task (Zhang et al., 2018b). On the other hand, model-based RL allows for decoupling the dynamics model from the policy, enabling better generalization and transfer across tasks (Zhang et al., 2018a). The challenge with model-based RL, however, lies in estimating an accurate dynamics model of the environment while simultaneously using it to learn a policy, often leading to sub-optimal policies and slower learning. One way to alleviate this problem is to initialize dynamics models with universal task-agnostic priors that allow for more efficient and stable model-based learning.
For example, consider learning dynamics models for the two different scenarios shown in Figure 1 (top and bottom). Both environments contain a variety of objects moving with different velocities and rotations. Current approaches require a large number of samples to learn a robust transition model of either world. For instance, in the first environment, inferring that the orange circle is a freely moving object will require observing the circle moving in a variety of different directions. Similarly, understanding the laws governing elastic collisions between two bodies (e.g. the circle and the grey rectangle) requires observing several instances of collisions at various angles and velocities. On the other hand, humans have reliable priors that allow for understanding dynamics of new environments quickly (Dubey et al., 2018) – one such prior is an understanding of physical laws of motion. In this work, we demonstrate that learning a task-agnostic dynamics prior (e.g. concepts like velocity, acceleration or elasticity) allows for accurate and more efficient estimation of the dynamics of new environments, resulting in better control policies.
In order to obtain a prior for physical dynamics, we perform unsupervised learning over raw videos containing moving objects. Specifically, we train a dynamics model to predict the next frame given the previous k frames, over a wide variety of scenarios with moving objects. The parameters of the model implicitly capture general laws of physics, which are useful in predicting entity movements. We initialize the dynamics model of the environment with these pre-trained parameters and fine-tune them using transitions from the specific task, while simultaneously learning a policy for the task. The dynamics model is used to predict future frames up to a finite horizon, which are then used as additional input to a policy network, similar to the approach of Weber et al. (2017). Importantly, our frame prediction model is not action-conditional, unlike most prior work that employs such models in reinforcement learning (Oh et al., 2015; Weber et al., 2017).
Learning a good future frame model is challenging mainly for two reasons: a) the large dimensionality of the output space with arbitrary moving objects and interactions, and b) the partial observability in environments (Mathieu et al., 2015). Prior approaches (Oh et al., 2015) suffered from error compounding since they encode the entire image into a single vector before decoding the output, thereby missing out on fine-grained spatial information. Others like the ConvLSTM (Xingjian et al., 2015) are better at capturing spatio-temporal interactions but suffer from poor generalization due to the use of additive update equations. To overcome these issues, we propose a new architecture (SpatialNet) that consists of a convolutional encoder, a spatial memory block, and a convolutional decoder that better captures localized dynamics. The spatial memory module operates by performing convolution operations over a temporal 3-dimensional state representation that keeps spatial information intact. This allows the network, which includes residual connections, to capture localized physics of objects such as directional movements and collisions in a fine-grained manner as well as efficiently keep track of static background information. This results in lower prediction error, better generalization and invariance to the size of inputs.
We evaluate our approach on three different RL scenarios. First, we consider PhysWorld, a suite of randomized 2D physics-focused games, where learning object movement is crucial to a successful policy. Next we consider PhysShooter3D, a 3D environment with rigid body dynamics and partial observations. Finally, we also evaluate on a stochastic variant of the popular ALE framework consisting of Atari games (Machado et al., 2017a). In all scenarios, we first demonstrate the value of learning a task-agnostic prior for model dynamics – for instance, our agent achieves up to 130% higher performance on a shooting game, PhysShooter, and 56.5% higher on the Atari game of Asteroids, compared to the most competitive baseline. Further, we also show that the dynamics model fine-tuned on these tasks transfers better to new tasks. For instance, our model achieves a relative score improvement of 26.9% on transfer from PhysForage to PhysShooter (both games from PhysWorld), significantly higher than a score improvement of 5.4% using a policy-transfer baseline.
2 Related Work
There are two main lines of work that are closely related to this paper. The first is that of learning and using generic video prediction models for reinforcement learning (Oh et al., 2015; Finn et al., 2016; Weber et al., 2017). The key idea is to train a model to predict future frames on the target task and hallucinate additional trajectories that can help an agent learn faster. The second direction is to incorporate physics priors into parameterized dynamics models for future state prediction (Nguyen-Tuong and Peters, 2010; Kansky et al., 2017). The former path requires only pixel inputs but does not generalize well across tasks. The latter has the potential to generalize but requires manual specification of priors. Our work aims to combine the best of both worlds – learn a frame prediction model that is task-agnostic and captures an effective notion of physics to serve as a useful prior.
Video prediction models. Our frame prediction model is closest in spirit to the Convolutional LSTM (ConvLSTM) model, which has been applied to several domains (Xingjian et al., 2015; Zhu et al., 2017; Ke et al., 2017). Similar architectures that incorporate differentiable memory modules (Patraucean et al., 2015) or relational intermediates (Watters et al., 2017) have been proposed, with applications to deep RL (Parisotto and Salakhutdinov, 2017). While the ConvLSTM model is reasonably effective at predicting future frames, the additive LSTM update equations are not well suited to capture localized physical interactions. (While the model theoretically can learn to ignore unnecessary operations, optimizing the parameters effectively is difficult because of a lack of proper inductive bias in the architecture.) Our architecture is simpler and more natural at capturing physical dynamics and entity movements – this allows for better generalization as we demonstrate in our experiments.
Several recent methods have also combined policy learning with future frame prediction in different ways. Action-conditioned frame prediction has been used to simulate trajectories for policy learning (Oh et al., 2015; Finn et al., 2016; Weber et al., 2017). Predicted frames have also been used to incentivize exploration in agents, via hashing (Yin et al., 2017) or using the prediction error to provide intrinsic rewards (Pathak et al., 2017). The main departure of our work from these papers is that we learn a frame prediction model that is not conditioned on actions, and from videos not related to a task, which allows us to employ the model on a variety of tasks.
Parameterized physics models. Several recent papers have explored the idea of incorporating physics priors into learning dynamics models of environments (Nguyen-Tuong and Peters, 2010; Cutler et al., 2014; Cutler and How, 2015; Scholz et al., 2014; Kansky et al., 2017; Battaglia et al., 2016; Mrowca et al., 2018; Xie et al., 2016). More recent work trained an object-oriented dynamics predictor by segmenting input frames into sets of objects (Zhu et al., 2018). While all these approaches demonstrate the importance of having relevant priors for sample-efficient model learning, they all require some form of manual parameterization. In contrast, we learn physics priors in the form of the parameters of a predictive neural network, only using raw videos.
Decoupling dynamics from policy. Our work also relates to previous approaches to decoupling the agent’s knowledge of the environment dynamics from its task-oriented policy. Successor representations (Dayan, 1993) decompose the agent’s value function into a feature-based state representation and a reward projection operator, resulting in better exploration of the state space (Kulkarni et al., 2016; Barreto et al., 2017; Machado et al., 2017b). While these state abstractions help with exploration, such representations do not explicitly capture dynamics models of the environment. More recent work has proposed approaches to learn separate models for dynamics and rewards and use them to perform online planning (Zhang et al., 2018a) or learn independently controllable factors in the environment (Thomas et al., 2017). However, these assume access to task-specific transitions, while we learn a prior from task-independent videos and demonstrate its usefulness in learning different environment dynamics.
Our goal is to demonstrate that acquiring task-agnostic dynamics priors from raw videos helps agents learn faster in new environments. To this end, our approach consists of two phases:
Pre-training a dynamics predictor: We first train a suitable neural network architecture to predict pixels in the next frame given the previous frames of a video. In this work, we use videos of objects moving according to classical mechanics, without any extra annotations.
Reinforcement learning: We use the pre-trained frame predictor from the previous phase to initialize the dynamics model for an RL agent. This dynamics model is used to predict a few frames into the future, which are used as additional context for the control policy. The dynamics model is also simultaneously fine-tuned using trajectories from the environment.
We first describe how we use the frame prediction model for reinforcement learning, and then discuss different options for a frame predictor, including our new architecture, SpatialNet.
3.1 Reinforcement Learning with Dynamics Predictors
There are several ways one can incorporate a dynamics model into a reinforcement learning setup. One approach is to use the model to generate synthetic trajectories and use them in addition to observed transitions while training a policy (Oh et al., 2015; Feinberg et al., 2018). Another option is to perform rollouts from the current step using the model and then use the predicted states as additional context input to the policy (Weber et al., 2017). Our method is similar to the latter – we use our learned dynamics model to predict future frames and concatenate these frames along with the current frame to form the input to our policy network. There are two differences however – (1) we predict future state observations without conditioning on the actions of the agent and without rewards since our dynamics model is task agnostic, and (2) we do not use a global encoding for future frames, but instead stack the frames and use convolution operations to extract local dynamic information.
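Formally, consider a standard Markov Decision Process (MDP) setup represented by the tuple $\langle \mathcal{S}, \mathcal{A}, \mathcal{T}, \mathcal{R} \rangle$, where $\mathcal{S}$ is the set of all possible state configurations, $\mathcal{A}$ is the set of actions available to the agent, $\mathcal{T}$ is the transition distribution, and $\mathcal{R}$ is the reward function. Assuming our dynamics model to be $\hat{\mathcal{T}}$, and given the current state $s_t$, we first apply our prediction model iteratively to obtain future state predictions:
$$\hat{s}_{t+1} = \hat{\mathcal{T}}(s_t), \qquad \hat{s}_{t+i} = \hat{\mathcal{T}}(\hat{s}_{t+i-1}) \quad \text{for } i = 2, \dots, k$$
We then train a policy network to output actions using all these predicted states as input in addition to the current state:
$$a_t \sim \pi_\theta(\,\cdot \mid s_t, \hat{s}_{t+1}, \dots, \hat{s}_{t+k})$$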
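As an illustration of the rollout-and-stack step described above, the following minimal Python sketch applies an action-free predictor iteratively and stacks the current frame with the predicted frames. The names `rollout_and_stack` and `predict_next`, and the toy `constant_velocity` predictor, are hypothetical stand-ins chosen for illustration; they are not SpatialNet or the implementation used in the experiments.

```python
from typing import Callable, List, Sequence

Frame = List[List[float]]  # one grayscale frame as rows of pixel values


def rollout_and_stack(
    predict_next: Callable[[Sequence[Frame]], Frame],
    history: Sequence[Frame],
    k: int = 3,
) -> List[Frame]:
    """Return [current frame, k predicted frames] as a channel stack.

    predict_next is a hypothetical, action-free dynamics model: it maps the
    frames seen so far to a prediction of the next frame.
    """
    frames = list(history)
    predicted = []
    for _ in range(k):
        nxt = predict_next(frames)  # no action is passed to the model
        predicted.append(nxt)
        frames.append(nxt)          # feed the prediction back in
    return [history[-1]] + predicted


def constant_velocity(frames: Sequence[Frame]) -> Frame:
    """Toy stand-in predictor: extrapolate every pixel linearly."""
    prev, cur = frames[-2], frames[-1]
    return [[2 * c - p for p, c in zip(rp, rc)] for rp, rc in zip(prev, cur)]


# Two observed 1x2 frames whose pixel values grow by 1 per step.
history = [[[0.0, 1.0]], [[1.0, 2.0]]]
policy_input = rollout_and_stack(constant_velocity, history, k=3)
```

The stacked frames would then be passed to a convolutional policy network as separate input channels, so that local motion cues remain spatially aligned with the current frame.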
For the policy network, we follow the architecture described in (Mnih et al., 2015) and use the Proximal Policy Optimization (PPO) (Schulman et al., 2017) algorithm for learning from rewards obtained in the task. We call this agent an Intuitive Physics Agent (IPA) since it first learns an intuitive prior of physical interactions.
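We update policy parameters $\theta$ by using the PPO loss:
$$L^{\text{PPO}}(\theta) = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \text{clip}\left(r_t(\theta), 1-\epsilon, 1+\epsilon\right)\hat{A}_t\right)\right]$$
where $r_t(\theta) = \pi_\theta(a_t \mid x_t) / \pi_{\theta_{\text{old}}}(a_t \mid x_t)$, $x_t$ denotes the policy input (the current state together with the predicted future states), and the advantage, $\hat{A}_t$, is computed using the value function $V(x_t)$. Simultaneously, we also update the parameters of the dynamics model using the transitions from the environment with a pixel prediction loss (described in Section 3.2). However, policy gradients are not back-propagated to the dynamics predictor.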
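For concreteness, the clipped surrogate objective of PPO (Schulman et al., 2017) can be sketched in a few lines of Python. This is a minimal illustration over a batch of scalar log-probabilities, not the training code used here; the function name `ppo_clip_loss` and the default `clip_eps=0.2` are assumptions made for the example.

```python
import math
from typing import Sequence


def ppo_clip_loss(
    new_logp: Sequence[float],
    old_logp: Sequence[float],
    advantages: Sequence[float],
    clip_eps: float = 0.2,  # assumed clipping range, not taken from the text
) -> float:
    """Negative clipped surrogate objective, averaged over a batch."""
    total = 0.0
    for lp_new, lp_old, adv in zip(new_logp, old_logp, advantages):
        ratio = math.exp(lp_new - lp_old)  # pi_new(a|x) / pi_old(a|x)
        clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
        # Pessimistic bound: take the smaller of the two surrogate terms.
        total += min(ratio * adv, clipped * adv)
    return -total / len(advantages)


# A ratio of 2 with positive advantage is clipped to 1 + clip_eps.
loss_clipped = ppo_clip_loss([math.log(2.0)], [0.0], [1.0])
# With negative advantage the unclipped (more pessimistic) term is kept.
loss_unclipped = ppo_clip_loss([math.log(2.0)], [0.0], [-1.0])
```

In this setup only the policy parameters receive gradients from this loss; the dynamics predictor is trained separately with its pixel prediction loss.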
3.2 Dynamics Prediction
Prior work has investigated a variety of frame prediction models. LSTM-based recurrent networks (Oh et al., 2015) are not ideal for this task since they encode the entire scene into a single latent vector, thereby losing the localized spatio-temporal correlations that are important for making accurate physical predictions. On the other hand, the ConvLSTM (Xingjian et al., 2015) architecture has localized spatio-temporal correlations, but is not able to accurately maintain global dynamics of entities due to LSTM state updates and limited separation of stationary and non-stationary objects (as also seen in our experiments in Section 4.1).
Predicting the physical behavior of an entity requires a model that can perform two crucial operations – 1) isolation of the dynamics of each entity, and 2) accurate modeling of localized spaces and interactions around the entity. In order to satisfy both desiderata, we propose a new architecture, SpatialNet, which uses a spatial memory that explicitly encodes dynamics that are updated with object movement through convolutions. This allows us to implicitly capture and maintain localized physics, such as entity velocities and collisions between entities, in our frame prediction model and results in significantly lower long term prediction error.
SpatialNet Architecture SpatialNet is conceptually simple and consists of three modules (Figure 2). The first module is a standard convolutional encoder $E$ that converts an input image $x_t$ into a 3D representational map $z_t$. The second module is a spatial memory block, $M$, that converts $z_t$ and the hidden state $h_{t-1}$ from the previous timestep into an output representation $o_t$ and new hidden state $h_t$. Finally, we have a convolutional decoder $D$ that predicts the next frame $\hat{x}_{t+1}$ from $o_t$. Both the encoder and decoder modules ($E$ and $D$) use two convolutional layers each with residual connections.
We implement the spatial memory block as a 2D convolution operation. The module takes in a previous hidden state $h_{t-1}$ and input $z_t$ at timestep $t$, both of shape $c \times n \times n$ where $c$ is the number of channels and $n$ is the dimensionality of the grid. We then perform the following operations:
$$i_t = f(C_e * [z_t \oplus h_{t-1}]), \qquad u_t = f(C_u * i_t), \qquad h_t = f(C_{dyn} * u_t), \qquad o_t = f(C_d * [z_t \oplus h_t])$$
where $*$ denotes a convolution, $\oplus$ denotes concatenation, $C_e$, $C_u$, $C_{dyn}$, $C_d$ are convolutional kernels and $f$ is a non-linearity (we use ELU (Clevert et al., 2015)). The module first encodes a combination of $z_t$ and $h_{t-1}$ into a proposal state $u_t$, using two convolutions $C_e$, $C_u$. $C_{dyn}$ acts like a dynamics simulator and generates a new hidden state $h_t$, which captures the localized predictions for the next state around each entity. Finally, $C_d$ uses $z_t$ and $h_t$ to produce $o_t$, encoding information about the entire frame to be rendered by subsequent decoding.
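The data flow of the spatial memory block can be sketched in plain Python as follows. This is a toy illustration under stated assumptions: the kernel names (`k_enc`, `k_prop`, `k_dyn`, `k_out`) are hypothetical, the final convolution is assumed to combine the input encoding with the new hidden state, and a naive same-padded convolution on nested lists stands in for an optimized layer.

```python
import math
from typing import List, Tuple

Tensor = List[List[List[float]]]        # [channels][height][width]
Kernel = List[List[List[List[float]]]]  # [out_ch][in_ch][ky][kx], square and odd


def elu(x: float) -> float:
    return x if x > 0 else math.exp(x) - 1.0


def conv2d(x: Tensor, kernel: Kernel) -> Tensor:
    """Same-padded 2D convolution (cross-correlation form) followed by ELU."""
    height, width = len(x[0]), len(x[0][0])
    pad = len(kernel[0][0]) // 2
    out = []
    for k_out_ch in kernel:
        plane = [[0.0] * width for _ in range(height)]
        for i in range(height):
            for j in range(width):
                acc = 0.0
                for c, k_in in enumerate(k_out_ch):
                    for di, row in enumerate(k_in):
                        for dj, w in enumerate(row):
                            y, xx = i + di - pad, j + dj - pad
                            if 0 <= y < height and 0 <= xx < width:
                                acc += w * x[c][y][xx]
                plane[i][j] = elu(acc)
        out.append(plane)
    return out


def spatial_memory_step(
    z: Tensor, h_prev: Tensor,
    k_enc: Kernel, k_prop: Kernel, k_dyn: Kernel, k_out: Kernel,
) -> Tuple[Tensor, Tensor]:
    i_t = conv2d(z + h_prev, k_enc)  # list concat = channel concatenation
    u_t = conv2d(i_t, k_prop)        # proposal state
    h_t = conv2d(u_t, k_dyn)         # new hidden state (local dynamics)
    o_t = conv2d(z + h_t, k_out)     # output passed on to the decoder
    return o_t, h_t


def const_kernel(c_out: int, c_in: int, size: int, value: float) -> Kernel:
    return [[[[value] * size for _ in range(size)]
             for _ in range(c_in)] for _ in range(c_out)]


z = [[[1.0, 0.0], [0.0, 0.0]]]   # one channel, 2x2 grid
h0 = [[[0.0, 0.0], [0.0, 0.0]]]  # empty hidden state
o1, h1 = spatial_memory_step(
    z, h0,
    const_kernel(1, 2, 3, 0.1), const_kernel(1, 1, 3, 0.1),
    const_kernel(1, 1, 3, 0.1), const_kernel(1, 2, 3, 0.1),
)
```

Because every operation is a convolution over the grid, the hidden state keeps the same spatial layout as the input encoding, and the same kernels can be applied to inputs of a different size.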
Intuitively, the SpatialNet architecture biases the module towards storing relevant physics information about each entity in a block of pixels at the entity’s corresponding location. This information is sequentially updated through the convolutions, while static information such as background texture is passed directly through the input encoding (see Figure 5 of appendix). We note that our spatial memory is not action-conditional, which allows us to learn from task-independent videos, as well as generalize better to new environments. Given training videos, we learn the parameters of the model using a standard MSE-based loss function.
SpatialNet is inspired by the ConvLSTM model but differs from it in that, while ConvLSTM performs an additive state update operation, SpatialNet uses convolutions to update the hidden state (Eqn. 2). This allows SpatialNet to better simulate moving objects and physical interactions. Another difference is that SpatialNet has residual connections, which provide a more straightforward inductive bias towards maintaining both static and dynamic information across states.
Ego-dynamics One important feature of our dynamics predictor is that it is not conditioned on the action(s) of the agent, i.e. it does not account for ego-dynamics. We make this choice in order to make the dynamics prediction model task-agnostic. As we demonstrate in our experiments (Section 4.3), this makes our approach generalize well to a variety of different tasks, and learn faster in transfer experiments.
We perform two empirical studies to evaluate our hypothesis. First, we evaluate various frame prediction models, including our proposed SpatialNet, in terms of their capacity to predict future states and model physical interactions (Sections 4.1 and 4.2). Then, we investigate the use of these dynamics predictors for policy learning in different environments (Section 4.3).
Physics video dataset
In order to train a prediction model specifically for physical interaction, we generate a new video dataset, PhysVideos, using a 2-D physics engine (Pymunk). Each video in the dataset has fixed-size frames with 4-8 different shapes (such as squares or circles) moving inside a room with up to 3 randomly generated interior walls (see Figure 1 (top)). Objects are initialized with random positions and velocities, a friction coefficient of 0.9 and elasticity of 0.95, resulting in diverse object movements across each trajectory. Being able to predict the future in this type of environment requires 2-dimensional physics reasoning, such as inferring velocity from past movement, anticipating changes in momentum due to collisions, and predicting rotations of each object. We generate 5000 different trajectories in total – 4500 for training a dynamics predictor and 500 for testing – with each trajectory having a length of 125 steps. See supplementary material for sample trajectories.
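To give a feel for the kind of trajectory such a dataset contains, the toy Python sketch below simulates a single point object bouncing inside a square room with an elasticity of 0.95 for 125 steps. It is a deliberately simplified stand-in written in pure Python, not the Pymunk-based generator: friction, rotation, interior walls, object-object collisions and rendering are all omitted, and the function name `simulate_ball`, the time step and the velocity range are assumptions made for the example.

```python
import random
from typing import List, Tuple


def simulate_ball(
    steps: int = 125,          # trajectory length, as in the dataset
    room: float = 1.0,         # side length of the square room (assumed)
    elasticity: float = 0.95,  # fraction of speed kept after a bounce
    dt: float = 0.02,          # integration time step (assumed)
    seed: int = 0,
) -> List[Tuple[float, float]]:
    """Return the (x, y) positions of one ball bouncing off the room walls."""
    rng = random.Random(seed)
    x, y = rng.uniform(0.0, room), rng.uniform(0.0, room)
    vx, vy = rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)
    trajectory = []
    for _ in range(steps):
        x, y = x + vx * dt, y + vy * dt
        if x < 0.0 or x > room:  # hit a vertical wall
            x = min(max(x, 0.0), room)
            vx = -elasticity * vx
        if y < 0.0 or y > room:  # hit a horizontal wall
            y = min(max(y, 0.0), room)
            vy = -elasticity * vy
        trajectory.append((x, y))
    return trajectory


trajectory = simulate_ball()
```

Rendering each position to an image and repeating the simulation with different seeds would yield a set of videos from which velocity and collision behaviour must be inferred purely from pixels.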
4.1 Frame Prediction
In this section, we evaluate various frame prediction models on their accuracy across different horizons. We report results on the 500 trajectories from the test set of PhysVideos.
Baselines We compare our model, SpatialNet, with the following baselines:
RCNet: the model of (Oh et al., 2015) modified to work without action-conditioning.
ConvLSTM (Xingjian et al., 2015): this model replaces all the inner operations of an LSTM with convolutions. We use a kernel size of 5 and the same encoders and decoders as in SpatialNet.
ConvLSTM + Residual: a modified version of ConvLSTM with added residual connections from input to output of the LSTM cell.
We train all prediction models using mean squared error (MSE) loss. We use the Adam optimizer (Kingma and Ba, 2015) in our experiments.
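The multi-step evaluation can be illustrated with a short Python sketch that rolls a predictor forward in closed loop and records the MSE at selected horizons. The function names and the exact protocol (feeding predictions back as inputs and comparing the n-th prediction with the n-th ground-truth frame) are assumptions made for illustration; the toy predictor simply repeats the last frame.

```python
from typing import Callable, Dict, List, Sequence

Frame = List[float]  # a flattened frame of pixel intensities


def mse(a: Frame, b: Frame) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def multi_step_mse(
    predict_next: Callable[[Sequence[Frame]], Frame],
    video: Sequence[Frame],
    context: int,
    horizons: Sequence[int] = (1, 5, 10),
) -> Dict[int, float]:
    """MSE of closed-loop predictions at the requested horizons."""
    frames = list(video[:context])  # ground-truth frames used as context
    errors = {}
    for step in range(1, max(horizons) + 1):
        pred = predict_next(frames)
        frames.append(pred)         # predictions are fed back as input
        if step in horizons:
            errors[step] = mse(pred, video[context + step - 1])
    return errors


def repeat_last(frames: Sequence[Frame]) -> Frame:
    """Toy baseline: predict that nothing moves."""
    return list(frames[-1])


# A one-pixel "video" whose intensity increases by 1 every frame.
video = [[float(t)] for t in range(15)]
errors = multi_step_mse(repeat_last, video, context=2)
```

A static baseline like this one accumulates error quickly as the horizon grows, which is the behaviour multi-step MSE is meant to expose.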
|Model||1 step||5 step||10 step||Objects Lost|
|RCNet (Oh et al., 2015)||0.0061||0.0140||0.0268||1.0|
|ConvLSTM (Xingjian et al., 2015)||0.0026||0.0303||0.0503||0.4|
|ConvLSTM + Residual||0.0026||0.0141||0.0210||0.45|
Results From Table 1, we see that SpatialNet achieves a lower test MSE compared to all the baselines, especially for multi-step predictions. This suggests that SpatialNet encourages better dynamic generalization compared to RCNet and ConvLSTM. We can also observe from Figure 3 that SpatialNet is able to accurately maintain the number of objects in the video even after 20 steps, while the baselines suffer from merging of objects (RCNet) or loss of shape information (ConvLSTM). Further, SpatialNet is also able to maintain background details such as walls that are quickly lost in RCNet, as the spatial memory structure allows the model to easily retain fixed background information from the input. We also find that the spatial memory’s overall structure makes it very resistant to input noise and helps it generalize better to unseen environments – please see the supplementary material for detailed analyses.
4.2 Predicting physical parameters
To further probe the representations learned by the frame prediction models, we test their ability to predict physical properties of environments (e.g. elasticity or drag) from videos. We train a 2-layer classification model on top of the hidden state representations produced by SpatialNet/ConvLSTM to predict one of 3 values for elasticity/drag – low, medium or high. Only the classification layers are trained, while the rest of the parameters are kept fixed (except for full train).
From Table 2, we see that randomly initialized parameters or SpatialNet trained on Atari Pong don’t do well, indicating that they don’t capture physics. SpatialNet trained on PhysVideos gets an accuracy of around 69% on drag prediction (close to the fully trained model accuracy of 78%). This shows that the pre-training indeed helps the model acquire priors over physical dynamics. Further, the low numbers of the model trained on Atari Pong indicate that task-specific frame prediction may not generalize well.
|Model||Drag||Elasticity|
|SpatialNet (random init)||35.8||43.8|
|SpatialNet (PT on Atari Pong)||35.0||33.6|
|ConvLSTM (PT on PhysVideos)||57.2||53.2|
|SpatialNet (PT on PhysVideos)||69.8||56.9|
|SpatialNet (full train)||78.5||67.8|
4.3 Reinforcement Learning
In this section, we describe the use of SpatialNet to accelerate reinforcement learning. We first train SpatialNet on the physics video dataset described in the previous section. Then, we use the pre-trained SpatialNet model as a future frame predictor for a reinforcement learning agent. We perform empirical evaluations on three different platforms - a suite of 2D games (PhysWorld), a 3D partially observable environment, and a stochastic version of Atari games (Machado et al., 2017a). We demonstrate that IPA with SpatialNet pre-training outperforms existing approaches in all platforms. The IPA architecture also allows for effective decoupling of environment dynamics from agent policy, which results in better transfer performance across tasks.
In our experiments, we use SpatialNet to predict the next k future frames (we find k=3 to work well in our experiments). We then stack the current frame with the k predicted frames and use this as input to a model-free policy. We use the Adam optimizer to train model predictions and the same set of hyper-parameters for training all policy agents as those used in (Schulman et al., 2017). For our policy network, we use the architecture described in (Mnih et al., 2015). We report numbers averaged over 3 different random seeds.
We compare our agent (IPA) with a number of different baselines:
PPO: A standard implementation of Proximal Policy Optimization (PPO) (Schulman et al., 2017), which is model-free and uses the current frame with the last k frames to output an action. The number of frames provided to PPO is the same as that provided to IPA.
PPO + VF: PPO with value function expansion (Feinberg et al., 2018), which uses a dynamics predictor to obtain a more consistent estimate of the current state’s value.
I2A: Imagination Augmented Agent (Weber et al., 2017) uses a combination of past frames and a recurrent encoding of future rollouts as input to the policy. (Rollouts are future frames predicted by SpatialNet.)
ISP: A variant of IPA that uses the hidden layer of SpatialNet directly as input to a policy network.
JISP: ISP with auxiliary frame prediction loss.
|Agent||PhysGoal||PhysForage||PhysShooter|
|PPO||17.9 (0.8)||44.2 (5.4)||23.2 (1.2)|
|PPO + VF||19.2 (2.4)||40.4 (5.4)||26.1 (2.9)|
|I2A + SpatialNet (action-cond)||4.2 (0.4)||23.7 (3.1)||16.5 (1.8)|
|I2A + SpatialNet||16.4 (6.2)||20.8 (2.0)||19.3 (0.7)|
|IPA + RCNet||20.7 (3.1)||46.3 (23.4)||31.7 (1.0)|
|IPA + ConvLSTM||21.6 (2.1)||39.5 (7.0)||29.1 (1.6)|
|ISP||15.2 (1.2)||45.3 (5.5)||18.6 (1.1)|
|JISP||18.2 (5.5)||124.3 (27.1)||28.6 (1.5)|
|IPA + SpatialNet (Blink)||24.6 (2.8)||48.5 (5.3)||31.0 (1.9)|
|IPA + SpatialNet (PhysVideos)||30.8 (5.2)||50.6 (11.5)||42.3 (2.9)|
Table 3: Average scores (with standard deviation) obtained in PhysWorld environments by various agents after 10 million frames of training. Scores are rewards over 100 episodes, averaged over runs with 3 different random seeds. IPA + SpatialNet consistently outperforms the other approaches. RCNet, SpatialNet, ConvLSTM are pretrained on PhysVideos. PPO+VF = PPO with Value Function Expansion. SpatialNet (Blink) refers to a model trained on videos with blinking objects. We add 500K additional frames to the PPO baselines to account for the frames used in pre-training for the other models.
We first consider PhysWorld, a new collection of three physics-centric 2D games that we created. These games require an agent to accurately predict object movements and rotations in order to perform well. All three tasks have an environment consisting of around 10 randomly moving boxes and circles as well as up to three internal impassable walls. PhysGoal is a navigation task to reach goals while avoiding objects, PhysForage is an object gathering task, and PhysShooter requires a stationary agent to shoot selected moving objects while preventing collisions. The objects in each of these environments have different colors and sizes than those used to train the dynamics predictor in Section 4. We provide a detailed description of each task in the supplementary material. We emphasize that the main parameters (like object velocities, rotations, etc.) in the PhysWorld games are fully randomized for each episode. To obtain good performance, agents need a good understanding of general physics and cannot just memorize frames.
Results: We detail the performance of our approach compared to the baselines in Table 3 and show learning curves in Figure 4. Quantitatively, we find that our approach, IPA + SpatialNet (PhysVideos), obtains significant gains over most baselines in all three PhysWorld tasks. We find that IPA with RCNet or ConvLSTM provides smaller benefits, due to slower learning than SpatialNet. PPO with value expansion (PPO+VF) also provides slight gains, but significantly less than the gains conferred by IPA. I2A leads to no gains in performance, since it generates a global encoding of an image, destroying local dynamics information of objects. Both ISP and JISP perform worse than IPA except on PhysForage. On PhysForage, we find that JISP performs better, likely due to increased policy capacity compared to IPA (i.e. more parameters). We observe that SpatialNet trained on videos with blinking objects does not provide as much of a benefit, pointing to the fact that our full model is learning some aspects of dynamics beyond just object appearances.
IPA encourages the policy to take into account the future physics of objects, a bias crucial for good performance on each of the PhysWorld environments. Qualitatively, we observe that in all three environments, IPA agents navigate to goals and collect objects with more confidence, even if there are obstacles nearby. In PhysShooter, IPA agents are much better at hitting objects further away on the map, which requires multiple time-steps before collisions. Figure 4 demonstrates how having a good prior results in faster learning of the environment dynamics of PhysShooter.
Figure 4 shows the relative training rates of policies under PPO and IPA. In PhysShooter we see immediate benefits in using a physics model, as physics knowledge of the future is crucial when the agent only gets one action approximately every 4-5 frames. In PhysGoal and PhysForage, we see long term benefits in knowing future physics as this knowledge allows the agents to more efficiently collect points.
Additionally, we also evaluate on PhysShooter3D, a 3D physics game which we construct using Bullet (Coumans, 2010). We add gravity to the world and generate moving projectiles that follow bouncing parabolic trajectories. We then render 2D images from a particular viewpoint, causing moving objects to be partially or fully occluded at times. With these additional factors, learning dynamics is even more challenging. The game requires a stationary agent to fire bullets at selected 3D projectiles without itself being hit by any projectiles. We found that PPO obtained a score of while IPA + SpatialNet obtained and IPA using Ground truth frames obtained . This demonstrates that IPA generalizes well to partially observed settings, with still room for improvement by performing better frame prediction.
Stochastic Atari Games
In addition to PhysWorld and PhysShooter3D, we also investigate the performance of IPA on a stochastic version of the Arcade Learning Environment (ALE) (Bellemare et al., 2013), by adding sticky actions, where an agent repeats its last action with a fixed probability. This stochasticity was shown to be the most challenging type of randomization to add to ALE (Hausknecht and Stone, 2015; Machado et al., 2017a). We evaluate performance on all Atari games, a subset of which are shown in Table 4. All Atari experiments are run with 5 different seeds.
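The sticky-action mechanism can be sketched as a thin wrapper around an environment's step function. This is a minimal illustration, not the ALE implementation: the class name `StickyActions`, the `step_fn` callable and the repeat probability of 0.25 used in the usage lines are assumptions chosen for the example.

```python
import random
from typing import Any, Callable, Optional


class StickyActions:
    """With probability repeat_prob, execute the previous action again."""

    def __init__(
        self,
        step_fn: Callable[[int], Any],  # hypothetical environment step
        repeat_prob: float,
        seed: Optional[int] = None,
    ) -> None:
        self.step_fn = step_fn
        self.repeat_prob = repeat_prob
        self.rng = random.Random(seed)
        self.last_action: Optional[int] = None

    def step(self, action: int) -> Any:
        # Ignore the requested action and repeat the last executed one.
        if self.last_action is not None and self.rng.random() < self.repeat_prob:
            action = self.last_action
        self.last_action = action
        return self.step_fn(action)


# Toy environment that simply echoes the action it actually executed.
env = StickyActions(step_fn=lambda a: a, repeat_prob=0.25, seed=0)
executed = [env.step(a) for a in (0, 1, 2, 3)]

always_sticky = StickyActions(step_fn=lambda a: a, repeat_prob=1.0)
never_sticky = StickyActions(step_fn=lambda a: a, repeat_prob=0.0)
```

Because the agent cannot be sure which action will be executed, memorizing a fixed action sequence no longer works, which is why this form of stochasticity is used for evaluation.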
We emphasize that this is an out-of-domain evaluation – we use the prior trained on PhysVideos to initialize the dynamics predictor for Atari, which contains a significantly different pixel space. Further, not all Atari games are reliant on understanding physics and we do not expect our approach to provide significant gains on those environments.
Results: From Table 4, we observe that IPA outperforms PPO in 8 out of the 10 different tasks (results on all Atari games are in the supplementary material) – these are all games that contain physical interactions between objects and benefit from our prior. In several games like Enduro, Breakout, Frostbite, FishingDerby and Assault, IPA provides benefits later on in training after the agent has figured out a good initial policy. In others like Asteroids and DemonAttack, IPA shows immediate boosts in training performance, resulting in faster policy learning. On Pong, where IPA performed worse than PPO, we found that the agents learned to place paddles at one particular location where, without paddle movement, the ball would bounce and score points. Similarly, on Ice Hockey, where PPO outperformed IPA, we found that agents can learn a repetitive strategy to prolong the game indefinitely, removing the need for tracking dynamics information. In such situations, there is no added advantage to predicting dynamics, explaining the reduced scores of IPA. We provide additional qualitative results, including frame predictions, in the supplementary material.
|Assault||2932 (153)||3249.7 (378)||2968.4 (124)|
|Asteroids||1321 (233.5)||1340 (351)||2098 (102)|
|Breakout||19.7 (0.9)||18.7 (0.0)||23.4 (1.0)|
|DemonAttack||5510 (412)||5492 (233)||6793 (558)|
|Enduro||376.7 (10.5)||380 (8.0)||398.6 (23.0)|
|FishingDerby||6.7 (10.1)||12.1 (4.0)||9.3 (3.0)|
|Frostbite||1342 (2154)||1649 (2100)||1701 (2485)|
|IceHockey||-5.9 (0.3)||-6.3 (0.0)||-6.1 (0.0)|
|Pong||6.6 (14.1)||-1.4 (15.0)||2.2 (13.0)|
|Tennis||-6.3 (2.1)||-8 (4.0)||-3.8 (1.0)|
|PhysVideos||IPA + SpatialNet||Y||-||42.27|
|IPA + SpatialNet (Fix)||Y||N||26.30|
|IPA + SpatialNet (FT)||Y||N||42.83|
|IPA + SpatialNet (FT)||Y||Y||42.44|
|IPA + SpatialNet (Fix)||Y||N||30.30|
|IPA + SpatialNet (FT)||Y||N||53.66|
|IPA + SpatialNet (FT)||Y||Y||40.40|
4.4 Transfer and Generalization
We now present some empirical results under the transfer scenario and provide some analysis of our model. Table 5 shows the impact of initializing IPA with different pre-trained dynamics models on the PhysShooter environment. We find that initializing SpatialNet with random parameters does not perform very well, whereas using a SpatialNet pretrained on PhysVideos provides better performance (see Figure 4 for MSE errors). Moreover, we observe that transferring a SpatialNet model fine-tuned on a different task such as PhysForage or PhysGoal results in even greater performance improvements. Interestingly, we note that transferring just the dynamics model in IPA results in larger performance gains than transferring both model and policy. For instance, transferring the model from PhysForage results in a score of 53.66, while transferring both model and policy gets a lower score of 40.40. The former is a 27% increase compared to using just PhysVideos (42.27), while the latter results in a lower score. This provides further evidence that decoupling model learning from policy learning allows for better generalization.
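The relative changes discussed in this comparison can be checked with a short calculation. The sketch below is only an illustration: it assumes that the Table 5 scores 42.27 (PhysVideos prior only), 53.66 (model transferred from PhysForage) and 40.40 (model and policy transferred) are the ones being compared, and the helper name `relative_change` is ours.

```python
def relative_change(new_score: float, baseline: float) -> float:
    """Percentage change of new_score relative to baseline."""
    return 100.0 * (new_score - baseline) / baseline


# Scores read from Table 5 (assumed row correspondence, see the text above).
physvideos_only = 42.27
model_transfer = 53.66
model_and_policy_transfer = 40.40

gain_model = relative_change(model_transfer, physvideos_only)
gain_both = relative_change(model_and_policy_transfer, physvideos_only)

print(f"model only:     {gain_model:+.1f}%")
print(f"model + policy: {gain_both:+.1f}%")
```

Under these assumptions, transferring only the model gives an improvement of roughly 27% over the PhysVideos baseline, while transferring model and policy together falls slightly below it.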
We have proposed a new approach to model-based reinforcement learning by learning task-agnostic dynamics priors. First, we pre-train a frame prediction model (SpatialNet) on raw videos of a variety of objects in motion. We then use this network to initialize a dynamics model for an RL agent, which makes use of predicted frames as additional context for its policy. Through several experiments on three different domains, we demonstrate that our approach outperforms model-free techniques and approaches that learn environment dynamics from scratch. We also demonstrate the generalizability of our dynamics predictor through transfer learning experiments.
We would like to thank Alexander Botev, John Schulman, Tejas Kulkarni, Bowen Baker and the OpenAI team for providing helpful comments and suggestions.
- Barreto et al.  André Barreto, Will Dabney, Rémi Munos, Jonathan J Hunt, Tom Schaul, Hado P van Hasselt, and David Silver. Successor features for transfer in reinforcement learning. In Advances in neural information processing systems, pages 4055–4065, 2017.
- Battaglia et al.  Peter W. Battaglia, Razvan Pascanu, Matthew Lai, Danilo Rezende, and Koray Kavukcuoglu. Interaction networks for learning about objects, relations and physics. In NIPS, 2016.
- Bellemare et al.  Marc G Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.
- Clevert et al.  Djork-Arné Clevert, Thomas Unterthiner, and Sepp Hochreiter. Fast and accurate deep network learning by exponential linear units (elus). arXiv preprint arXiv:1511.07289, 2015.
- Coumans  Erwin Coumans. Bullet physics engine. Open Source Software: http://bulletphysics.org, 2010.
- Cutler and How  Mark Cutler and Jonathan P How. Efficient reinforcement learning for robots using informative simulated priors. 2015.
- Cutler et al.  Mark Cutler, Thomas J Walsh, and Jonathan P How. Reinforcement learning with multi-fidelity simulators. In Robotics and Automation (ICRA), 2014 IEEE International Conference on, pages 3888–3895. IEEE, 2014.
- Dayan  Peter Dayan. Improving generalization for temporal difference learning: The successor representation. Neural Comput., 5(4):613–624, 1993.
- Dubey et al.  Rachit Dubey, Pulkit Agrawal, Deepak Pathak, Thomas L Griffiths, and Alexei A Efros. Investigating human priors for playing video games. ICML, 2018.
- Feinberg et al.  Vladimir Feinberg, Alvin Wan, Ion Stoica, Michael I Jordan, Joseph E Gonzalez, and Sergey Levine. Model-based value estimation for efficient model-free reinforcement learning. arXiv preprint arXiv:1803.00101, 2018.
- Finn et al.  Chelsea Finn, Ian Goodfellow, and Sergey Levine. Unsupervised learning for physical interaction through video prediction. In NIPS, 2016.
- Hausknecht and Stone  Matthew J Hausknecht and Peter Stone. The impact of determinism on learning atari 2600 games. 2015.
- Kansky et al.  Ken Kansky, Tom Silver, David A Mély, Mohamed Eldawy, Miguel Lázaro-Gredilla, Xinghua Lou, Nimrod Dorfman, Szymon Sidor, Scott Phoenix, and Dileep George. Schema networks: Zero-shot transfer with a generative causal model of intuitive physics. In ICML, 2017.
- Ke et al.  Jintao Ke, Hongyu Zheng, Hai Yang, and Xiqun Chen. Short-term forecasting of passenger demand under on-demand ride services: A spatio-temporal deep learning approach. CoRR, abs/1706.06279, 2017.
- Kempka et al.  Michał Kempka, Marek Wydmuch, Grzegorz Runc, Jakub Toczek, and Wojciech Jaśkowski. Vizdoom: A doom-based ai research platform for visual reinforcement learning. In Computational Intelligence and Games (CIG), 2016 IEEE Conference on, pages 1–8. IEEE, 2016.
- Kingma and Ba  Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
- Kulkarni et al.  Tejas D Kulkarni, Ardavan Saeedi, Simanta Gautam, and Samuel J Gershman. Deep successor reinforcement learning. arXiv:1606.02396, 2016.
- Machado et al. [2017a] Marlos C. Machado, Marc G. Bellemare, Erik Talvitie, Joel Veness, Matthew J. Hausknecht, and Michael Bowling. Revisiting the arcade learning environment: Evaluation protocols and open problems for general agents. CoRR, abs/1709.06009, 2017a. URL http://arxiv.org/abs/1709.06009.
- Machado et al. [2017b] Marlos C Machado, Clemens Rosenbaum, Xiaoxiao Guo, Miao Liu, Gerald Tesauro, and Murray Campbell. Eigenoption discovery through the deep successor representation. arXiv preprint arXiv:1710.11089, 2017b.
- Mathieu et al.  Michael Mathieu, Camille Couprie, and Yann LeCun. Deep multi-scale video prediction beyond mean square error. arXiv preprint arXiv:1511.05440, 2015.
- Mnih et al.  Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing atari with deep reinforcement learning. In NIPS Workshop, 2013.
- Mnih et al.  Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nat., 518(7540):529–533, 2015.
- Mrowca et al.  Damian Mrowca, Chengxu Zhuang, Elias Wang, Nick Haber, Li F Fei-Fei, Josh Tenenbaum, and Daniel L Yamins. Flexible neural representation for physics prediction. In Advances in Neural Information Processing Systems, pages 8799–8810, 2018.
- Nguyen-Tuong and Peters  Duy Nguyen-Tuong and Jan Peters. Using model knowledge for learning inverse dynamics. In ICRA, pages 2677–2682, 2010.
- Nichol et al.  Alex Nichol, Vicki Pfau, Christopher Hesse, Oleg Klimov, and John Schulman. Gotta learn fast: A new benchmark for generalization in rl. arXiv preprint arXiv:1804.03720, 2018.
- Oh et al.  Junhyuk Oh, Xiaoxiao Guo, Honglak Lee, Richard L Lewis, and Satinder Singh. Action-conditional video prediction using deep networks in atari games. In NIPS, 2015.
- Parisotto and Salakhutdinov  Emilio Parisotto and Ruslan Salakhutdinov. Neural map: Structured memory for deep reinforcement learning. CoRR, abs/1702.08360, 2017.
- Pathak et al.  Deepak Pathak, Pulkit Agrawal, Alexei A Efros, and Trevor Darrell. Curiosity-driven exploration by self-supervised prediction. In ICML, 2017.
- Patraucean et al.  Viorica Patraucean, Ankur Handa, and Roberto Cipolla. Spatio-temporal video autoencoder with differentiable memory. CoRR, abs/1511.06309, 2015.
-  Pymunk. Pymunk. http://www.pymunk.org/en/latest/. Accessed: 2018-09-26.
- Scholz et al.  Jonathan Scholz, Martin Levihn, Charles Isbell, and David Wingate. A physics-based model prior for object-oriented mdps. In International Conference on Machine Learning, pages 1089–1097, 2014.
- Schulman et al.  John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv:1707.06347, 2017.
- Silver et al.  David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel, and Demis Hassabis. Mastering the game of go with deep neural networks and tree search. Nat., 529(7587):484–489, 2016.
- Sutton  Richard S Sutton. Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In ICML, 1990.
- Thomas et al.  Valentin Thomas, Jules Pondard, Emmanuel Bengio, Marc Sarfati, Philippe Beaudoin, Marie-Jean Meurs, Joelle Pineau, Doina Precup, and Yoshua Bengio. Independently controllable factors. arXiv preprint arXiv:1708.01289, 2017.
- Watters et al.  Nicholas Watters, Andrea Tacchetti, Theophane Weber, Razvan Pascanu, Peter Battaglia, and Daniel Zoran. Visual interaction networks. In NIPS, 2017.
- Weber et al.  Théophane Weber, Sébastien Racanière, David P Reichert, Lars Buesing, Arthur Guez, Danilo Jimenez Rezende, Adria Puigdomènech Badia, Oriol Vinyals, Nicolas Heess, Yujia Li, et al. Imagination-augmented agents for deep reinforcement learning. arXiv preprint arXiv:1707.06203, 2017.
- Xie et al.  Chris Xie, Sachin Patil, Teodor Moldovan, Sergey Levine, and Pieter Abbeel. Model-based reinforcement learning with parametrized physical models and optimism-driven exploration. In 2016 IEEE International Conference on Robotics and Automation (ICRA), pages 504–511. IEEE, 2016.
- Xingjian et al.  SHI Xingjian, Zhourong Chen, Hao Wang, Dit-Yan Yeung, Wai-Kin Wong, and Wang-chun Woo. Convolutional lstm network: A machine learning approach for precipitation nowcasting. In Advances in neural information processing systems, pages 802–810, 2015.
- Yin et al.  Haiyan Yin, Jianda Chen, and Sinno Jialin Pan. Hashing over predicted future frames for informed exploration of deep reinforcement learning. arXiv preprint arXiv:1707.00524, 2017.
- Zhang et al. [2018a] Amy Zhang, Harsh Satija, and Joelle Pineau. Decoupling dynamics and reward for transfer learning. arXiv preprint arXiv:1804.10689, 2018a.
- Zhang et al. [2018b] Chiyuan Zhang, Oriol Vinyals, Remi Munos, and Samy Bengio. A study on overfitting in deep reinforcement learning. arXiv:1804.06893, 2018b.
- Zhang et al. [2018c] Susan Zhang, Michael Petrov, Pachoki Jacob, Henrique Pondé, Brooke Chan, Filip Wolski, Szymon Sidor, Rafał Józefowicz, Przemysław Dębiak, David Farhi, Greg Brockman, Jonathan Raiman, Jie Tang, Christy Dennison, Paul Christiano, Shariq Hashme, Larissa Schiavo, Ilya Sutskever, Eric Sigler, Jonas Schneider, John Schulman, Christopher Hesse, Jack Clark, Quirin Fischer, Diane Yoon, Christopher Berner, Scott Gray, Alec Radford, and David Luan. Openai five, 2018c.
- Zhu et al.  Guangming Zhu, Liang Zhang, Peiyi Shen, and Juan Song. Multimodal gesture recognition using 3-d convolution and convolutional lstm. IEEE Access, 5:4517–4524, 2017.
- Zhu et al.  Guangxiang Zhu, Zhiao Huang, and Chongjie Zhang. Object-oriented dynamics predictor. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 9826–9837. Curran Associates, Inc., 2018. URL http://papers.nips.cc/paper/8187-object-oriented-dynamics-predictor.pdf.
Appendix A.1 Additional Dynamic Prediction Experiments
A.1.1 Sensitivity to Corruption of Inputs
We investigate the effects of noisy observations at test time on both SpatialNet and RCNet by adding different amounts of Gaussian random noise to the input images (Table 6). We find that SpatialNet is more resistant to added noise. SpatialNet predictions are primarily local, which prevents compounding of error from corrupted pixels elsewhere in the image, whereas RCNet compresses all pixels into a latent space, where small errors can easily escalate.
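A minimal sketch of such a corruption test is given below. It is an illustration under stated assumptions rather than our evaluation code: images are nested lists of pixel values in [0, 1], noisy pixels are clipped back to that range, and `identity_model` is a hypothetical stand-in for a trained predictor such as SpatialNet or RCNet.

```python
import random


def add_gaussian_noise(image, sigma, rng):
    """Return a copy of image with N(0, sigma^2) noise added, clipped to [0, 1]."""
    return [[min(1.0, max(0.0, px + rng.gauss(0.0, sigma))) for px in row]
            for row in image]


def mse(a, b):
    """Mean squared error between two equally sized images."""
    total = sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    count = sum(len(row) for row in a)
    return total / count


def noise_sweep(predict, frame, target, sigmas, seed=0):
    """MSE of predict(noisy frame) against the clean target, per noise level."""
    rng = random.Random(seed)
    return {s: mse(predict(add_gaussian_noise(frame, s, rng)), target)
            for s in sigmas}


# Hypothetical stand-in for a trained predictor: a static scene in which
# the next frame equals the current frame.
identity_model = lambda frame: frame

frame = [[0.0, 0.5, 1.0], [0.2, 0.4, 0.6]]
results = noise_sweep(identity_model, frame, frame, sigmas=[0.0, 0.05, 0.1])
for sigma, err in results.items():
    print(f"sigma={sigma:.2f}  mse={err:.5f}")
```

Replacing `identity_model` with each trained predictor and comparing the resulting curves across noise levels gives the kind of comparison summarized in Table 6.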
A.1.2 Qualitative Visualizations of Generalization Predictions
A.1.3 Dataset Generalization
We test generalization by evaluating on two unseen datasets. For the first, we create a test set where objects are half the size of those in the training set and are initialized randomly with approximately twice the starting velocity. In this new regime, RCNet had an MSE of 0.0115 and ConvLSTM an MSE of 0.0067, while SpatialNet had an MSE of 0.0039. We find that RCNet is unable to maintain the shapes of the smaller objects, sometimes omitting them, while ConvLSTM maintains shape but is unable to adapt to the new dynamics, as seen in Figure A1. In contrast, SpatialNet's local structure allows it to generate new shapes, and its dynamic separation allows better generalization. In the second dataset, we explore input size invariance. We create a second test set consisting of 16-32 random circles and squares with input images of size 168x168x3 (the density of objects per area is conserved). On this dataset, SpatialNet obtained an MSE of 0.0042 compared to 0.0060 for ConvLSTM, which is comparable to the MSE of 0.0024 on the original test dataset. The qualitative images in Figure A2 show that the spatial memory's local structure allows it to generalize easily to different input image sizes.
Appendix A.2 PhysWorld
We provide a description of the three game environments in PhysWorld:
PhysGoal: In this environment, an agent has to navigate to a large red goal. Each successful navigation (+1 reward) respawns the red goal at a random location while collision with balls or boxes terminates the episode (-1 reward).
PhysForage: Here, an agent has to collect moving balls while avoiding moving boxes. Each collected ball (+1 reward) will randomly respawn at a new location with a new velocity. Collision with a box leads to termination of the episode (-1 reward).
PhysShooter: In PhysShooter, the agent is stationary and has to choose an angle to shoot bullets. Each bullet travels through the environment until it hits a square (+1 reward) or circle (-1 reward) or leaves the screen. If a moving ball or box hits the agent (-1 reward), the episode is terminated. After firing a bullet, the agent cannot fire again until the bullet disappears.
Examples of agents playing the PhysWorld environments are given in Figure A3.
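The reward structure of the three environments can be summarized as a small lookup table. The sketch below is a transcription of the descriptions above, not the environment code: the event names and the `resolve_step` helper are hypothetical, a bullet leaving the screen is assumed to give zero reward, and a bullet hitting a circle is assumed not to end the episode since the text does not say otherwise.

```python
# (reward, terminates_episode) per event, transcribed from the descriptions.
# Event names are hypothetical labels, not identifiers from the actual code.
PHYSWORLD_RULES = {
    "PhysGoal": {
        "reach_goal": (1, False),        # goal respawns at a random location
        "hit_ball_or_box": (-1, True),
    },
    "PhysForage": {
        "collect_ball": (1, False),      # ball respawns with a new velocity
        "hit_box": (-1, True),
    },
    "PhysShooter": {
        "bullet_hits_square": (1, False),
        "bullet_hits_circle": (-1, False),   # assumed non-terminal
        "bullet_leaves_screen": (0, False),  # assumed zero reward
        "agent_hit": (-1, True),
    },
}


def resolve_step(env_name, events):
    """Sum the rewards of all events in one step and report termination."""
    rules = PHYSWORLD_RULES[env_name]
    reward = sum(rules[event][0] for event in events)
    done = any(rules[event][1] for event in events)
    return reward, done


print(resolve_step("PhysShooter", ["bullet_hits_square"]))
print(resolve_step("PhysGoal", ["hit_ball_or_box"]))
```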
A.2.1 SpatialNet Predictions
Figure A4 shows qualitative predictions of the next 3 frames by SpatialNet on each of the PhysWorld environments, with the first frame being the current observation. In PhysGoal, SpatialNet is able to infer the movement of the obstacles, the dark blue agent, and the red goal after agent collection. In PhysForage, SpatialNet is able to infer the movement of obstacles as well as the collection of a circle. In PhysShooter, SpatialNet is able to anticipate a collision of the bullet with a moving obstacle and further infer the shooting of a green bullet by the agent.
A.2.2 Visualization of Spatial Memory
We provide visualizations of the values of the spatial memory hidden state while predicting future frames. We visualize the spatial memory on PhysVideos, PhysGoal and the Atari environment DemonAttack in Figure A5. To visualize, we take the mean across the channels of each grid pixel in the spatial memory hidden state. We find a strong correspondence between high-activation regions in the spatial memory and the dynamic objects in the associated ground-truth labels. We further find that static background elements, such as walls, goals and platforms, appear to be passed along in the input features.
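The channel-averaging step can be sketched as follows. This is an illustration only: the hidden state is assumed to be stored as an H x W x C nested list, and the rescaling to [0, 1] for display is our addition, not something stated in the text.

```python
def channel_mean_map(hidden):
    """Collapse an H x W x C hidden state to an H x W map by averaging channels."""
    return [[sum(cell) / len(cell) for cell in row] for row in hidden]


def normalize(heatmap):
    """Rescale to [0, 1] for display; a constant map becomes all zeros."""
    flat = [v for row in heatmap for v in row]
    lo, hi = min(flat), max(flat)
    span = hi - lo
    return [[(v - lo) / span if span else 0.0 for v in row] for row in heatmap]


# Toy 2 x 2 hidden state with 2 channels per grid cell.
hidden = [
    [[0.0, 0.0], [0.2, 0.4]],
    [[1.0, 3.0], [0.0, 0.0]],
]
heatmap = normalize(channel_mean_map(hidden))
for row in heatmap:
    print(row)
```

Cells with the largest mean activation map to values near 1, which is where dynamic objects would be expected to appear in a figure such as Figure A5.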
Appendix A.3 Additional Atari experiments
|Game||PPO||IPA|
|Alien||1668.6 (224.3)||1485.5 (281.0)|
|Amidar||855.9 (98.6)||725.5 (135.0)|
|Assault||2939.2 (153.2)||2968.4 (124.0)|
|Asterix||2920.8 (287.3)||2334.0 (184.0)|
|Asteroids||1321.0 (233.5)||2098.4 (102.0)|
|Atlantis||323205.4 (277643.2)||289369.8 (239469.0)|
|BankHeist||310.4 (44.0)||334.3 (29.0)|
|BattleZone||26828.0 (8472.0)||16526.7 (6986.0)|
|BeamRider||553.1 (28.4)||1630.3 (400.0)|
|Bowling||46.6 (5.2)||64.3 (13.0)|
|Boxing||54.3 (2.5)||8.9 (20.0)|
|Breakout||19.7 (0.9)||23.4 (1.0)|
|Centipede||6043.7 (990.6)||6032.5 (199.0)|
|ChopperCommand||6549.4 (1779.1)||4112.0 (1024.0)|
|CrazyClimber||36893.2 (463.9)||38499.0 (1221.0)|
|DemonAttack||5510.9 (412.5)||6793.6 (558.0)|
|DoubleDunk||-4.0 (0.5)||-3.8 (0.0)|
|Enduro||376.7 (10.5)||398.6 (23.0)|
|FishingDerby||6.7 (10.1)||9.3 (3.0)|
|Freeway||29.2 (3.6)||31.2 (1.0)|
|Frostbite||1342.5 (2154.5)||1701.1 (2485.0)|
|Gopher||904.0 (42.3)||941.1 (56.0)|
|Gravitar||574.9 (36.2)||627.2 (25.0)|
|IceHockey||-5.9 (0.3)||-6.1 (0.0)|
|Jamesbond||598.9 (112.1)||454.3 (34.0)|
|Kangaroo||2842.4 (2461.2)||1373.0 (445.0)|
|Krull||5178.9 (205.1)||5219.3 (129.0)|
|KungFuMaster||13831.6 (4483.6)||13358.5 (4352.0)|
|MontezumaRevenge||0.0 (0.0)||129.7 (122.0)|
|MsPacman||1990.1 (227.9)||2097.3 (259.0)|
|NameThisGame||5406.4 (278.0)||5131.3 (427.0)|
|Pitfall||-0.1 (0.3)||0.0 (0.0)|
|Pong||6.6 (14.1)||2.2 (13.0)|
|PrivateEye||95.6 (5.4)||99.6 (0.0)|
|Qbert||6981.0 (548.0)||6331.4 (769.0)|
|Riverraid||3411.0 (201.9)||3612.4 (130.0)|
|RoadRunner||19329.6 (8472.6)||20041.8 (4906.0)|
|Robotank||11.9 (1.8)||14.9 (3.0)|
|Seaquest||1426.0 (43.5)||1408.7 (51.0)|
|SpaceInvaders||902.4 (66.0)||1132.6 (101.0)|
|StarGunner||3450.0 (801.5)||5778.5 (1584.0)|
|Tennis||-6.5 (2.1)||-3.8 (1.0)|
|TimePilot||4281.8 (126.6)||4580.0 (314.0)|
|Tutankham||128.5 (12.3)||118.2 (35.0)|
|UpNDown||15872.3 (3995.3)||16913.7 (6344.0)|
|Venture||930.2 (137.9)||946.7 (167.0)|
|VideoPinball||18878.1 (1251.7)||13981.2 (2136.0)|
|WizardOfWor||3835.6 (404.7)||4629.8 (662.0)|
|Zaxxon||7197.4 (220.6)||7271.0 (264.0)|
Predictions on Atari
|Environment||MSE PD||MSE DN||Percent Advantage|
MSE on stochastic Atari environments (an action is repeated with a geometric distribution with p=0.5) at 1 million training frames. MSE PD is obtained with a model initialized from the physics dataset, while MSE DN is obtained with a model trained from scratch. We evaluate the percentage advantage of initializing from the physics dataset compared to training from scratch. On average, we observe a 12.9% decrease in MSE when initializing from a model pretrained on the physics dataset. The most negatively affected environment, Enduro, involves a 3D landscape, for which initializing from a model trained on the physics dataset may be detrimental.
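One way to read the stochastic setting in the caption is sketched below. This is an assumed interpretation, not the exact protocol: the number of consecutive steps for which an action is applied is drawn from a geometric distribution on {1, 2, ...} with stopping probability p, and the names `sample_repeat_count` and `expand_actions` are hypothetical.

```python
import random


def sample_repeat_count(rng, p=0.5):
    """Draw how many consecutive steps an action is applied: Geometric(p) on {1, 2, ...}."""
    count = 1
    while rng.random() >= p:  # keep repeating with probability 1 - p
        count += 1
    return count


def expand_actions(actions, rng, p=0.5):
    """Expand a list of chosen actions into the sequence actually executed."""
    executed = []
    for action in actions:
        executed.extend([action] * sample_repeat_count(rng, p))
    return executed


rng = random.Random(0)
print(expand_actions(["LEFT", "FIRE", "RIGHT"], rng))
```

With p=0.5, each action is applied for two steps on average under this reading.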
We also investigate the benefits (in terms of MSE) of initializing SpatialNet from a model pretrained on the physics dataset compared to training from scratch in Figure 8. We evaluate the MSE at 1 million frames and find that initializing with the physics dataset provides a 12.9% decrease in MSE. We find that pretraining helps on 7 of the 10 Atari environments, with the most negatively impacted environment being Enduro, a 3D racecar environment in which the environmental prior encoded by the physics dataset may be detrimental. More significant gains in transfer may be achievable by using a large online database of 2D YouTube videos that covers an even greater diversity of games.
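The percentage advantage can be computed as sketched below. The definition is an assumption on our part (the relative reduction in MSE with respect to training from scratch, averaged over environments), and the MSE values in the example are placeholders for illustration, not numbers from our experiments.

```python
def percent_advantage(mse_pretrained: float, mse_scratch: float) -> float:
    """Relative MSE reduction (in %) from pretraining; negative means pretraining hurt."""
    return 100.0 * (mse_scratch - mse_pretrained) / mse_scratch


def mean_percent_advantage(results: dict) -> float:
    """Average the per-environment percentage advantage.

    results maps an environment name to (MSE PD, MSE DN).
    """
    per_env = [percent_advantage(pd, dn) for pd, dn in results.values()]
    return sum(per_env) / len(per_env)


# Placeholder MSE values for illustration only; these are NOT the paper's numbers.
example = {"env_a": (0.008, 0.010), "env_b": (0.012, 0.010)}

for name, (pd, dn) in example.items():
    print(f"{name}: {percent_advantage(pd, dn):+.1f}%")
print(f"mean: {mean_percent_advantage(example):+.1f}%")
```

Under this definition, a positive value means that the pretrained initialization reaches a lower MSE than training from scratch at the same number of frames.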
SpatialNet Predictions: We further visualize qualitative results of SpatialNet trained on Atari in Figure A7. In general, across the Atari suite, we found that SpatialNet is able to accurately model both the environment and the agent's behavior. In the figure, we see that SpatialNet is able to accurately predict agent movement and ice block movement in Frostbite. On DemonAttack, SpatialNet is able to infer the falling of bullets. On Asteroids, SpatialNet is able to infer the movement of asteroids. Finally, on FishingDerby, SpatialNet is able to predict the right player capturing a fish and also predict that the left player is likely to catch a fish (indicated by the blurriness of the rod). We note that any blurriness in the predicted output may in fact be beneficial to the policy, since a policy can learn to interpret the input.