A model capable of accurate multi-step prediction over long horizons has the potential to reduce the sample complexity of reinforcement learning as compared to model-free methods. For example, such a model would allow an agent to plan by estimating the effect of some sequence of actions. Alternatively, the model can be used in a Dyna (Sutton, 1990) style algorithm to generate rollouts starting from observations sampled from a replay buffer. A fundamental trade-off exists when learning a model suitable for model-based reinforcement learning. Complex dynamics necessitate a complex model, which is prone to overfitting without large amounts of data. However, for model-based RL, we want the model to learn a good representation of the dynamics quickly to reduce sample complexity. If learning the model is as difficult as learning a model-free policy, we have gained nothing (unless the model can be re-used for other tasks).
In this work we develop a predictive model using the principles of predictive coding (Rao & Ballard, 1999), which was originally developed to explain endstopping in receptive fields of the visual cortex. An action conditional predictive coding model maps from a history of prediction errors and actions to the predicted observation, where the prediction error is computed from the measured and predicted observations. When the trained model is run open loop to make predictions, the error feedback signal is set to zero, which is equivalent to using the predicted observation in place of the measured observation to compute the error feedback.
In contrast, most action conditional models reported in the literature implement a mapping from a history of observations and actions to a predicted observation. The model can then be used to predict a trajectory open loop, given some initial observation and a sequence of actions, by feeding the predicted observation back into the model as its next input; in model-based RL, the actions would be output by some policy.
Our model borrows from the PredNet architecture (Lotter et al., 2016), which applies modern deep learning techniques to predictive coding, but is modified to make the model action conditional, and implements two key innovations. To understand the first improvement, first consider that a predictive coding model (PCM) does not have access to the observations, but only to the prediction error. Consequently, it typically takes several steps of closed loop interaction with the environment for a trained PCM to begin making accurate open loop predictions, which can be a problem in some applications. Our solution is to learn a mapping from the initial observation in an episode to the model’s recurrent network hidden state. This effectively bootstraps the hidden state, allowing the model to immediately begin making accurate predictions from the onset of an episode.
The second improvement concerns the forward pass through the model’s network during training. If the network is unrolled for T steps to implement backpropagation through time, we input the prediction error only on the first step, and set it to zero for steps 2 through T. This forces the network to learn a representation that is good for multi-step prediction, even when using a conventional single step cost function. Open loop prediction of trajectories is accomplished by initializing the recurrent network’s hidden state using the learned mapping from the initial observation at the start of the episode, and setting the prediction error to zero for all subsequent steps.
We demonstrate through a series of experiments that even without our enhancements, action conditional predictive coding allows accurate open-loop prediction of high dimensional trajectories over long horizons, and that our refinements further enhance this accuracy. (Note that here we are referring to the dimension of the configuration space that the agent is operating in, not the dimension of the measurement space. For example, an agent operating in a simulated Atari environment has a low dimensional configuration space but a high dimensional measurement space.) We attribute the high performance of predictive coding to the fact that a predictive coding model must make predictions using only a history of prediction errors and actions (as opposed to observations and actions). Since the model must integrate a history of prediction errors, minimizing the loss requires the model to learn important temporal dependencies.
In these experiments we use two versions of a Mars powered descent phase environment. The first version is a simple 3 degree of freedom (3-DOF) environment that is used to compare a standard PCM to a model with our enhancements. Although only 3-DOF, the environment is realistic with respect to the lander capability, dynamics, and deployment ellipse. The model is then tested in a 6-DOF version of the Mars landing environment. In one experiment we demonstrate the model’s ability to predict the lander’s trajectory open loop over an entire episode, whereas in the second experiment the model must predict simulated Doppler radar altimeter readings open loop during the descent, with the simulated readings generated using a digital terrain map of the Uzboi Vallis region on Mars. We then couple the model with an agent implementing proximal policy optimization (PPO) (Schulman et al., 2017) and demonstrate a reduction in sample complexity in the 3-DOF landing task using a Dyna style algorithm.
To our knowledge, this is the first demonstration of a sample efficiency improvement of PPO using concurrent training of policy and model, as previous work has focused on accelerating Q-learning. Although (Nagabandi et al., 2018) used TRPO, the training had two distinct phases of model-based and model-free learning, with model predictive control used in the first phase. In addition, most previous work in model-based acceleration has assumed that a ground-truth reward function is known, e.g., see (Nagabandi et al., 2018), (Gu et al., 2016), and (Feinberg et al., 2018). In contrast, our model learns a reward function using a separate reward head in the model network.
Our goal is for the model to learn a representation that is effective for making long-term open loop predictions of high dimensional trajectories. In this work the model will learn to predict future observations in a reinforcement learning setting where an agent interacts with an environment and learns to accomplish some task, with the learning driven by a scalar reward signal. The agent will instantiate a policy that maps observations to actions, and the model learns a representation by observing the sequence of observations and actions induced by the episodic interaction between the agent and environment. Ideally, after a limited amount of experience observing this interaction, the trained model will have the ability to accurately predict future observations from some sequence of on-policy actions, operating open loop. This ability to ”imagine” the long-term consequences of some sequence of actions will then be a powerful tool for planning in model-based reinforcement learning, or alternatively, could be used to accelerate model-free algorithms by augmenting the rollouts collected via interaction with the environment with simulated rollouts from the model.
We will evaluate our model using two criteria. The first is the ability of the model to generate accurate extended horizon trajectory predictions given some sequence of actions, with the policy that is generating actions having access to the ground truth observation. The second criterion is whether the model can significantly accelerate proximal policy optimization (PPO), a model-free policy gradient algorithm with a baseline.
3 Related Work
Recent work in developing predictive models includes (Finn & Levine, 2017), where a model learns to predict future video frames by observing sequences of observations and actions. The model is then used to generate robotic trajectories using model predictive control (MPC), choosing the trajectory that ends with an image that best matches a user specified goal image. The action conditional architecture of (Oh et al., 2015) has proven successful in open loop prediction of long sequences of rendered frames from simulated Atari games. Predictive coding (Lotter et al., 2016) has been applied to predicting images from objects that are sequentially rotated, and has been used to predict steering angles from video frames captured from a car mounted camera. (Wang et al., 2017) extends the PredNet architecture described in (Lotter et al., 2016).
As for using predictive models to accelerate reinforcement learning, in (Nagabandi et al., 2018), the authors use an action conditional model to predict the future states of agents operating in various OpenAI Gym environments with high dimensional state spaces, and use the model in a model predictive control algorithm that quickly learns the tasks at a relatively low level of performance. The model is then used to create a dataset of trajectories to pre-train a TRPO policy (Schulman et al., 2015), which then achieves a high level of performance on the task through continued model-free policy optimization. The authors report a 3 to 5 times improvement in sample efficiency using the combined model-free and model-based algorithm. The approach assumes that a ground truth reward function is known.
In (Gu et al., 2016), the authors develop a normalized advantage function that gives rise to a Q-learning architecture suitable for continuous high dimensional action spaces. They then improve the algorithm’s sample efficiency by adding ”imagination rollouts” to the replay buffer, which are created using a time varying linear model. Importantly, the authors state that their architecture had no success using neural network based models to improve sample efficiency, and assume a ground-truth reward function is available.
(Feinberg et al., 2018) accelerates learning a state-action value function by using a neural network based model to generate synthetic rollouts up to a horizon of H steps. The rewards beyond H steps are replaced by the current estimate of the value of the penultimate observation in the synthetic rollouts, as in an H-step temporal difference estimate. The accelerated Q function learning is then used to improve the sample efficiency of the DDPG algorithm (Lillicrap et al., 2015). Again, the algorithm assumes that the ground-truth reward function is available.
4 Learning to predict over long horizons
Although a model can easily learn a representation mapping observations and actions at step t to a predicted next observation at step t+1 (one-step predictions), these representations typically do not capture the underlying dynamics of the system. The equations of motion encoding the laws of physics typically constrain states in configuration space that are close to each other in time to be numerically close in value, and the observations that are functions of these states (such as a sequence of video frames) will be similar for two consecutive frames. Consequently, the model’s network will tend to learn a trivial mapping that does not capture the underlying dynamics. When such a network is run forward open loop to make multi-step predictions (as required when using such a model for planning), the prediction accuracy falls off rapidly as a function of the number of steps.
4.1 Using Differences in Observations as Training Targets
Some environments (such as many of the OpenAI MuJoCo environments) use dynamical systems where the system dynamics are dominated by rigid body rotations. For example, in the half-cheetah environment the translational motion is constrained to position and velocity along a line, which is only a small part of the observation space. In these dynamical systems, having an action conditional model’s network use the difference between consecutive observations (deltas) as a training target can lead to a representation with improved multi-step prediction performance (see for example (Nagabandi et al., 2018)). Concretely, the training target is the difference between the next observation and the current observation, for all steps in the rollouts.
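The delta-target construction described above can be sketched as follows. This is a minimal illustration, not the implementation of (Nagabandi et al., 2018): the function names `delta_targets` and `reconstruct_next` and the toy episode are assumptions made for this example.

```python
import numpy as np

def delta_targets(observations):
    """Split one episode into model inputs and delta training targets.

    observations: array of shape (T + 1, n) holding consecutive
    observations from a single episode (deltas must not span
    episode boundaries).
    """
    obs = np.asarray(observations, dtype=float)
    inputs = obs[:-1]                # observation at step t
    targets = obs[1:] - obs[:-1]     # observation at t+1 minus observation at t
    return inputs, targets

def reconstruct_next(obs_t, predicted_delta):
    """Turn a predicted delta back into a predicted next observation."""
    return obs_t + predicted_delta

# Toy episode (hypothetical values) with two features per observation.
episode = np.array([[0.0, 1.0], [0.1, 1.0], [0.2, 0.9], [0.29, 0.8]])
inputs, targets = delta_targets(episode)
```

At prediction time the model output is added back to the current observation, so a model that predicts the deltas exactly reproduces the next observations exactly.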
In our work applying RL to aerospace problems, we have found that using deltas as targets fails for environments where translational motion plays a more important role in the system dynamics. We postulate that the reason using deltas as targets works well for rotational dynamics is that the mapping between torques and rotational velocities using Euler’s rotational equations of motion has sufficient complexity to ensure that the model does not learn an identity mapping between components of the current observation and components of the deltas. On the other hand, the differential equations governing position, velocity, and commanded thrust are such that using deltas actually makes matters worse, as the delta velocity component of the training target is approximately proportional to the force (the action input) and the delta position component of the training target is proportional to the velocity input to the model.
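As a rough illustration of this argument, consider a sketch under assumed point-mass dynamics with constant mass $m$, gravitational acceleration $\mathbf{g}$, and a forward Euler step of size $\Delta t$. The symbols $\mathbf{r}$ (position), $\mathbf{v}$ (velocity), and $\mathbf{F}$ (commanded force) are introduced here for illustration only and are not the paper's notation. The delta targets for the translational states then reduce to

$$
\Delta\mathbf{v}_t = \mathbf{v}_{t+1}-\mathbf{v}_t \approx \left(\frac{\mathbf{F}_t}{m}+\mathbf{g}\right)\Delta t,
\qquad
\Delta\mathbf{r}_t = \mathbf{r}_{t+1}-\mathbf{r}_t \approx \mathbf{v}_t\,\Delta t .
$$

Under these assumptions each target component is an affine function of a single model input (the commanded force or the current velocity), so a network can fit the one-step targets without representing how the state evolves over many steps.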
4.2 k-step prediction loss functions
Another common approach to improving multi-step prediction is to use a K-step prediction loss function (Oh et al., 2015). Consider training data accumulated over a number of episodes, each consisting of a number of steps. Then the K-step loss function is as given in Equation 1a, with the more commonly used 1-step loss given in 1b. Minimizing the K-step loss function requires the model to learn a representation that is good at multi-step predictions. However, this approach also increases the size of the data used for a model update by a factor of K, as each sample from the rollouts is used to produce a K-step open loop prediction. Moreover, (Oh et al., 2015) required curriculum learning (Bengio et al., 2009), with the value of K increased once training at a given value of K converged.
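To make the idea concrete, the following sketch computes a K-step squared-error loss for one episode in the general spirit of (Oh et al., 2015). It is not a reproduction of Equations 1a and 1b: the normalization, the model interface `step_fn`, and the toy dynamics are all assumptions made for this example.

```python
import numpy as np

def k_step_loss(step_fn, obs, actions, K):
    """Average squared error of K-step open loop predictions.

    step_fn(o, a) -> predicted next observation (hypothetical interface).
    obs: (T + 1, n) observations from one episode; actions: (T, m).
    Setting K=1 recovers the usual 1-step loss.
    """
    T = len(actions)
    total, count = 0.0, 0
    for t in range(T - K + 1):           # every start index with K steps left
        o_hat = obs[t]
        for k in range(K):               # roll forward on the model's own output
            o_hat = step_fn(o_hat, actions[t + k])
            total += float(np.sum((o_hat - obs[t + k + 1]) ** 2))
            count += 1
    return total / (2.0 * count)

# Toy dynamics (hypothetical): the next observation is observation plus action.
obs = np.arange(5, dtype=float).reshape(5, 1)
actions = np.ones((4, 1))

def perfect_model(o, a):
    return o + a

def biased_model(o, a):
    return o + a + 0.1                   # constant one-step bias

loss_1 = k_step_loss(biased_model, obs, actions, K=1)
loss_2 = k_step_loss(biased_model, obs, actions, K=2)
```

In this toy case the biased model's error compounds with each open loop step, so the K=2 loss is larger than the K=1 loss even though the one-step bias is the same; the inner loop also shows why each start index costs K model evaluations.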
4.3 Open Loop Forward Pass
Recently, we have found a new method to improve multi-step prediction performance in predictive models that use one or more recurrent layers. First, we will review a common approach to implementing the forward pass in a network with one or more recurrent layers that allows parallel computation over a sequence where we want to preserve the temporal dependencies. Prior to the recurrent layer(s), the network unrolls the output of the previous layer, reshaping the data from a flat batch of feature vectors into a batch of sequences, each T steps long, where T is the number of steps we unroll the network. We then implement a loop from step 1 to step T, where at step one, we input the hidden state from the rollouts to the recurrent layer, but for all subsequent steps up to T, we let the state evolve according to the current parameterization of the recurrent layer. After the recurrent layer (or layers), the recurrent layer output is then reshaped back to its original flat layout.
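The reshape-and-loop pattern reviewed above might look like the sketch below. It assumes a plain tanh recurrent cell in place of a GRU to keep the example short, and it assumes the flat batch is ordered so that each group of T consecutive rows belongs to one sequence; the function name and weight names are hypothetical.

```python
import numpy as np

def unrolled_recurrent_pass(x_flat, h0, W_x, W_h, T):
    """Forward pass through one recurrent layer unrolled for T steps.

    x_flat: (B * T, n) output of the previous layer, B sequences of T rows.
    h0:     (B, d) hidden state stored in the rollouts at each sequence start.
    Returns the recurrent outputs in the original flat layout, (B * T, d).
    """
    B = x_flat.shape[0] // T
    x_seq = x_flat.reshape(B, T, -1)       # flat batch -> batch of sequences
    h = h0                                 # stored state is used at step 1 only
    outputs = []
    for t in range(T):
        # From step 2 onward the state evolves under the current weights.
        h = np.tanh(x_seq[:, t] @ W_x + h @ W_h)
        outputs.append(h)
    return np.stack(outputs, axis=1).reshape(B * T, -1)

rng = np.random.default_rng(0)
B, T, n, d = 2, 3, 4, 5
x = rng.normal(size=(B * T, n))
h0 = rng.normal(size=(B, d))
W_x = rng.normal(size=(n, d))
W_h = rng.normal(size=(d, d))
out = unrolled_recurrent_pass(x, h0, W_x, W_h, T)
```

All B sequences advance together at each step, which is what allows the loop over T to be computed in parallel across the batch.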
Now consider a modification to this approach where we unroll the entire network from the network’s inputs to the final recurrent layer, as opposed to just unrolling the recurrent layers. For this example, we will assume a predictive coding architecture. At the first step of the loop over the T steps the network is unrolled, we input the prediction error and action from the rollouts to the network’s first layer. However, on subsequent timesteps, we set the prediction error to zero, making the remainder of the unrolling open loop, in that there is no feedback from the prediction error captured in the rollouts. Consequently, to minimize the 1-step cost function, the network must learn a representation that is useful for multi-step prediction. This approach is also applicable to action conditional models. Here we need to unroll the entire network so that we can feed back the predicted observation as the observation for steps 2 through T.
This approach has the advantage over a k-step prediction loss function in that it does not increase the training set size by a factor of K. Moreover, in contrast to (Oh et al., 2015), this method does not require curriculum learning.
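A minimal sketch of the open loop forward pass for a predictive coding model is given below. It is an illustration under stated assumptions rather than the authors' code: a tanh recurrent cell stands in for the GRU, the network is reduced to a single recurrent layer plus a linear output, and all names are hypothetical.

```python
import numpy as np

def open_loop_training_pass(err_flat, act_flat, h0, params, T):
    """Unroll the whole network for T steps, using the recorded
    prediction error only on the first step of each sequence."""
    W_e, W_a, W_h, W_o = params
    B = err_flat.shape[0] // T
    err = err_flat.reshape(B, T, -1)
    act = act_flat.reshape(B, T, -1)
    h = h0
    preds = []
    for t in range(T):
        # Step 1 sees the error from the rollouts; later steps see zeros.
        e_t = err[:, t] if t == 0 else np.zeros_like(err[:, 0])
        h = np.tanh(e_t @ W_e + act[:, t] @ W_a + h @ W_h)
        preds.append(h @ W_o)              # predicted observation
    return np.stack(preds, axis=1).reshape(B * T, -1)

rng = np.random.default_rng(1)
B, T, n, m, d = 2, 4, 3, 2, 6
params = (rng.normal(size=(n, d)), rng.normal(size=(m, d)),
          rng.normal(size=(d, d)), rng.normal(size=(d, n)))
err = rng.normal(size=(B * T, n))
act = rng.normal(size=(B * T, m))
h0 = np.zeros((B, d))
base = open_loop_training_pass(err, act, h0, params, T)
```

The predictions for every step are still compared against the recorded observations with the usual 1-step loss, so the batch is no larger than in ordinary training; only the inputs for steps 2 through T change.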
5 Predictive Coding Model used in Experiments
Our model’s operation over a single episode is listed in Algorithm 1, in which a policy and our model both operate in the same environment. The model consists of four major components. A representation network maps the previous prediction error and current action to a representation (Eq. 2a), and consists of a fully connected layer followed by a recurrent layer implemented as a gated recurrent unit (GRU) (Chung et al., 2015) that maintains a hidden state.
A prediction network consists of two fully connected layers, and maps the representation to a predicted observation (Eq. 2b). In an application using high dimensional observations, the prediction network would contain several deconvolutional layers mapping from the representation to a predicted image.
The recurrent state initialization network learns a mapping from an observation at the start of an episode to an initialization for the hidden state in the representation network (Eq. 2c). This allows the trained model to immediately produce accurate predictions.
Finally, the value prediction network maps the representation to an estimate of the sum of discounted rewards that would be received by starting in that representation and following the policy (Eq. 2d). Since in theory a value function could be implemented in an external network using the representation as input, the primary purpose of the value network is to solve the ”vanishing bullet” problem when the model is used with visual observations. The vanishing bullet problem, coined for the Atari space invaders environment, occurs when a very important visual feature consists of relatively few pixels. In this case, a model using an L2 loss between predicted and actual observations might not include the ”bullets” fired by the space invaders in its predictions, leading to rather poor performance if the model were to be used for planning.
Note that the value prediction head can be replaced by a reward prediction head by simply changing the target from the sum of discounted rewards to rewards. In Section 6.3, we use the reward prediction head in our experiment where we use the model to accelerate learning. In applications using images as observations, predicting rewards should have a similar effect on mitigating the ”vanishing bullet” problem, except possibly in sparse reward settings. The network layers are detailed in Table 1, with layer sizes expressed in terms of the observation dimension and the sum of the observation and action dimensions.
Our model is trained end-to-end using two L2 loss functions, one for the difference between the predicted and actual observation and another for the difference between the predicted value and the empirical sum of discounted rewards. Similar to the model used in (Lotter et al., 2016), this model can be treated as a layer and stacked, providing higher levels of abstraction for higher layers. This stacking can enhance performance when observations consist of images, but we found it unnecessary in this work.
Table 1: Network layers (columns: layer, in dim, out dim, type, act).
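The four components described in this section, and an open loop rollout over one episode, can be sketched as follows. This is only an illustration of how the pieces fit together: the layer sizes, the random weights, the tanh recurrent cell used in place of the GRU, and the names `TinyPCM` and `open_loop_rollout` are all assumptions and do not reproduce Algorithm 1 or Eq. 2a through 2d.

```python
import numpy as np

class TinyPCM:
    """Illustrative stand-in for the four model components."""

    def __init__(self, obs_dim, act_dim, hid_dim, seed=0):
        rng = np.random.default_rng(seed)

        def weights(n_in, n_out):
            return rng.normal(0.0, 0.1, size=(n_in, n_out))

        self.W_in = weights(obs_dim + act_dim, hid_dim)  # representation: fully connected
        self.W_h = weights(hid_dim, hid_dim)             # representation: recurrent cell
        self.W_p1 = weights(hid_dim, hid_dim)            # prediction network, layer 1
        self.W_p2 = weights(hid_dim, obs_dim)            # prediction network, layer 2
        self.W_init = weights(obs_dim, hid_dim)          # hidden state initialization
        self.W_v = weights(hid_dim, 1)                   # value (or reward) head

    def init_hidden(self, obs0):
        return np.tanh(obs0 @ self.W_init)

    def step(self, err, act, h):
        x = np.concatenate([err, act], axis=-1)
        h = np.tanh(np.tanh(x @ self.W_in) + h @ self.W_h)
        obs_hat = np.tanh(h @ self.W_p1) @ self.W_p2
        value = (h @ self.W_v)[..., 0]
        return obs_hat, value, h

def open_loop_rollout(model, obs0, actions):
    """Predict a whole episode from the first observation and the actions."""
    h = model.init_hidden(obs0)
    err = np.zeros_like(obs0)      # no feedback from measured observations
    preds, values = [], []
    for a in actions:
        obs_hat, v, h = model.step(err, a, h)
        preds.append(obs_hat)
        values.append(v)
    return np.stack(preds), np.stack(values)

model = TinyPCM(obs_dim=6, act_dim=3, hid_dim=8)
preds, values = open_loop_rollout(model, np.ones(6), np.zeros((5, 3)))
```

In closed loop operation the `err` argument would instead carry the difference between the measured and predicted observation from the previous step.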
6.1 Comparison of standard and enhanced PCM
Here we compare the performance of a standard predictive coding model and the model described in Section 5 using a simple 3-DOF Mars landing environment, with initial lander conditions in the target-centered reference frame as shown in Table 2. The agent’s goal is to achieve a soft pinpoint landing with a terminal glideslope of greater than 5; a complete description of the environment can be found in (Gaudet et al., 2018). For this comparison we use a policy implementing the DR/DV guidance law (D’Souza & D’Souza, 1997), which maps position and velocity to a commanded thrust vector. This is similar to using a pre-trained policy as good trajectories are generated from the onset.
The model is updated using 30 episode rollouts and trained for 30,000 episodes. During the forward pass, the networks are unrolled 60 steps. After training, each model is evaluated on a 3600 episode test where the model is run forward open loop for 1, 10, 30, and 60 steps. This is the same process used to collect rollouts for model and policy training, but we reshape the rollouts to allow running multi-step predictions in parallel on the full set of rollouts, and collect prediction accuracy statistics at each step. Note that in most of these cases, the model state will have evolved since the start of an episode, which is why the average accuracy for the model without hidden state initialization remains reasonable. The prediction accuracy is measured as the absolute value of the prediction error (position and velocity) and the value estimate accuracy is measured using explained variance. Note that model predictions are scaled such that each element of the prediction has zero mean and is divided by three times the standard deviation, and the error is calculated over the samples and features (which have equal scales). So a prediction error of 0.01 is 1% of the range of values encountered over all elements of the predicted position and velocity vectors; if the maximum altitude was, say, 2400 m, then a mean absolute prediction error of 1% would correspond to 24 m. To ensure a fair comparison, the code is identical in each case, except that the appropriate network is instantiated in the model.
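The scaling and accuracy measures described above can be sketched as follows. The helper names are hypothetical, and the standard deviation of 800 m is a made-up value chosen only so that three standard deviations match the 2400 m used in the example in the text.

```python
import numpy as np

def scale(x, mean, std):
    """Zero-mean the data and divide by three standard deviations."""
    return (x - mean) / (3.0 * std)

def mean_abs_error(pred_scaled, true_scaled):
    """Mean absolute error over all samples and features."""
    return float(np.mean(np.abs(pred_scaled - true_scaled)))

def scaled_error_to_physical(err_scaled, std):
    """Convert an error in scaled units back to physical units."""
    return err_scaled * 3.0 * std

def explained_variance(y_pred, y_true):
    """1 - Var(y_true - y_pred) / Var(y_true); can be negative."""
    var_true = np.var(y_true)
    if var_true == 0:
        return float("nan")
    return float(1.0 - np.var(y_true - y_pred) / var_true)

# A scaled error of 0.01 with an assumed standard deviation of 800 m.
altitude_error_m = scaled_error_to_physical(0.01, 800.0)

returns = np.array([1.0, 2.0, 3.0, 4.0])
ev_perfect = explained_variance(returns, returns)
ev_constant = explained_variance(np.full(4, 2.5), returns)
```

A value estimate that matches the returns exactly has an explained variance of one, while a constant estimate explains none of the variance; estimates worse than a constant give negative values.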
Table 2: Initial lander conditions (columns: velocity (m/s), position (m)).
Tables 3 and 4 show prediction error performance for three model architectures. An explained variance of ’-’ indicates less than zero. PCM is a standard predictive coding model, i.e., without the mapping between initial observations and the hidden state and without zeroing the error feedback in the forward pass. The maximum prediction error for this model is high (close to the maximum range of each state variable), and occurs when the multi-step error checking begins at the start of an episode, so the model never gets any error feedback. When the error checking begins mid-episode, performance improves, and consequently the mean error remains moderate. PCM-I adds the network layers that learn a mapping between an episode’s initial observation and the recurrent layer’s hidden state. Here we see that the mean prediction error improves, and although not shown, the maximum prediction error is much reduced. PCM-I-OLT is our PCM model as described in Section 5, where the error feedback is set to zero during the forward pass in training. We see that running the forward pass open loop has a significant impact on multi-step prediction performance. We also include two standard action conditional models as comparison baselines. ACM is a standard action conditional model and ACM-OLT is modified to feed back the predicted observation as the observation during the training forward pass.
Note that even without the open loop forward pass, the predictive coding model excels at multi-step prediction. We hypothesize that the performance is due to the model taking a prediction error rather than an observation as an input. This requires the model to integrate multiple steps of previous errors in order to make good predictions, thereby forcing the model to learn a representation capturing temporal dependencies. The model learns reasonably quickly, with the model’s performance as a function of training steps given in Table 5. These statistics are captured during training, and importantly, are measured on the current rollouts before the model trains on these rollouts, and are consequently a good measure of generalization.
6.2 6-DOF Mars Landing
Here we test the model’s ability to predict a lander’s 6-DOF state during a simulated powered descent phase. To make this more interesting, we let the model and agent learn concurrently. The agent uses PPO to learn a policy to generate a thrust command for each of the lander’s four thrusters, which are pointed downwards in the body frame. Rather than use a separate value function baseline, the policy uses the model’s value estimate. This is a difficult task, as to achieve a given inertial frame thrust vector, the policy needs to figure out how to properly rotate the lander so that the thrusters are properly oriented, but this also affects the lander’s translational motion. The goal is for the lander to achieve a soft pinpoint landing with the velocity vector directed predominately downward, an upright attitude, and negligible rotational velocity. The initial conditions are similar to that given in Table 2, except that the lander’s attitude and rotational velocity are perturbed from nominal. A full description of the environment can be found in (Gaudet et al., 2018). In the first experiment, the policy and model get access to the ground truth state (position, velocity, quaternion attitude representation, and rotational velocity).
In a variation on this experiment, the model only gets access to simulated Doppler radar altimeter returns, using a digital terrain map (DTM) of the Mars surface in the vicinity of Uzboi Vallis. We assume a configuration similar to the Mars Science Laboratory lander, where there are four altimeters with fixed orientation in the body frame, each pointing downwards and outwards at a 22 degree angle from the body frame vertical axis. Since the DTM spans 40 square km, we had to cut some corners simulating the altimeter readings in order to reduce computation time enough to be practical for the large number of episodes (300,000) required to solve this problem using RL. This results in low accuracy, particularly at lower elevations, further complicating this task. Also, due to the relatively high pitch and roll limits we allow the lander, the altimeter beams occasionally completely miss the DTM, and return a max range reading.
Training updates use 120 episode rollouts. Learning a good policy in 6-DOF typically takes around 300,000 episodes, which is over 80M steps of interaction with the environment. Similarly, we find that the model takes longer to learn a good representation. There are two factors that can contribute to the increased model learning time. First, the dynamics are more complex, and the model’s network is larger. Second, since it takes a long time for the policy to converge, exploration causes the model to be presented with a wide range of actions for each observation during each training epoch. Table 6 shows the model’s convergence during training as a function of training steps. We look at this for both a pre-trained policy and concurrent learning to establish the primary factor behind the slower model convergence. Although model convergence is slower in the case of concurrent learning, the difference is not extreme; consequently we attribute the slower convergence in the 6-DOF case primarily to the more complicated dynamics. It may be possible to speed model convergence by either increasing the number of model training epochs, training on a larger set of rollouts using a replay buffer, or both.
Table 7 tabulates the model’s performance for both the case where the model has access to the ground truth lander state and the case where the model only has access to simulated Doppler radar returns. For the latter case, the observation consists of four simulated altitudes and closing velocities, one for each altimeter. These observations do not satisfy the Markov property in that there are many lander states in configuration space that can give identical readings. However, the recurrent network layer allows the model to make reasonably accurate predictions.
Figure 1 illustrates an entire episode of PCM open loop predictions. Here the policy gets access to the ground truth observation on all steps, but the model only gets access to the ground truth observation on the first step of the episode, and for the remaining steps, the PCM’s prediction error input is set to zero, removing any feedback from actual observations. The ”Value” plot shows the estimated value (sum of discounted rewards) of the observation at the current step as predicted by the model’s value head. The prediction error is small enough that it is barely discernible from the plot. We use a quaternion representation for attitude. Note that the model’s network was unrolled only 60 steps during training, but this sufficed to allow good prediction out to the end of a 300 step episode.
6.3 Accelerated RL using a Predictive Model
In this experiment we demonstrate an improvement in sample efficiency obtained using PPO in conjunction with our model. The model, policy, and value function learn concurrently in the 3-DOF Mars environment. After each update of the model, policy, and value function from the collected rollouts, observations are sampled from those rollouts. These are fed through the policy and model in parallel for k steps to create a set of simulated rollouts as shown in Algorithm 2. Note that these segments do not result in a full episode, and consequently, when we update the policy using the simulated rollouts, we need to take care discounting the simulated rewards to generate the empirical return for updating the policy. Concretely, we need to use the correct k-step temporal difference return, and use this to compute the advantages; this is implemented in lines 12 through 17 of the algorithm.
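One common way to form returns for such truncated segments is to bootstrap from a value estimate of the observation that follows the last simulated step. The sketch below shows that computation; it is an assumption about the general technique and is not a transcription of Algorithm 2, and the function names and toy numbers are hypothetical.

```python
import numpy as np

def truncated_returns(rewards, bootstrap_value, gamma):
    """Discounted returns for a k-step segment that does not end the episode.

    rewards: (k,) simulated rewards for the segment.
    bootstrap_value: value estimate for the observation after the last
    step (use 0.0 if the segment ended in a terminal state).
    """
    k = len(rewards)
    returns = np.empty(k)
    running = bootstrap_value
    for t in reversed(range(k)):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def segment_advantages(rewards, values, bootstrap_value, gamma):
    """Advantage = truncated return minus the value baseline at each step."""
    return truncated_returns(rewards, bootstrap_value, gamma) - np.asarray(values)

# Toy segment with k = 3 (hypothetical numbers).
rets = truncated_returns(np.array([1.0, 1.0, 1.0]), bootstrap_value=10.0, gamma=0.5)
advs = segment_advantages(np.array([1.0, 1.0, 1.0]), np.array([1.0, 1.0, 1.0]),
                          bootstrap_value=10.0, gamma=0.5)
```

Without the bootstrap term, the return of a truncated segment would be treated as if the episode ended after k steps, which would bias the advantages used in the policy update.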
Figure 2 plots the norm of the lander terminal position and velocity as a function of training episode for a policy optimized using PPO and PPO with Dyna over the first 5000 episodes. The simulated rollouts were generated using 20,000 observations sampled from the previous rollouts with k=10, resulting in simulated rollouts containing 200,000 tuples of observations, actions, and advantages. We find that PPO enhanced with Dyna converges roughly twice as fast.
We have presented a novel predictive coding model architecture capable of generating accurate trajectory predictions over long horizons. Through a series of experiments, we have shown that our enhanced predictive coding model outperforms a standard implementation for long-horizon predictions, and performs well predicting both the agent’s ground truth state and simulated altimeter readings open loop over an entire episode during a 6-DOF Mars powered descent phase. The ability to generate long horizon open loop trajectory predictions is extremely useful for both model-based reinforcement learning and model predictive control. We demonstrated the ability of the model to reduce the sample complexity of proximal policy optimization for a Mars 3-DOF powered descent phase task. Future work will extend the model to predicting observations in higher dimensional spaces consistent with flash LIDAR and electro-optical sensors.
- Bengio et al. (2009) Bengio, Y., Louradour, J., Collobert, R., and Weston, J. Curriculum learning. In Proceedings of the 26th annual international conference on machine learning, pp. 41–48. ACM, 2009.
- Chung et al. (2015) Chung, J., Gulcehre, C., Cho, K., and Bengio, Y. Gated feedback recurrent neural networks. In International Conference on Machine Learning, pp. 2067–2075, 2015.
- D’Souza & D’Souza (1997) D’Souza, C. and D’Souza, C. An optimal guidance law for planetary landing. In Guidance, Navigation, and Control Conference, pp. 3709, 1997.
- Feinberg et al. (2018) Feinberg, V., Wan, A., Stoica, I., Jordan, M. I., Gonzalez, J. E., and Levine, S. Model-based value estimation for efficient model-free reinforcement learning. arXiv preprint arXiv:1803.00101, 2018.
- Finn & Levine (2017) Finn, C. and Levine, S. Deep visual foresight for planning robot motion. In Robotics and Automation (ICRA), 2017 IEEE International Conference on, pp. 2786–2793. IEEE, 2017.
- Gaudet et al. (2018) Gaudet, B., Linares, R., and Furfaro, R. Deep reinforcement learning for six degree-of-freedom planetary powered descent and landing. arXiv preprint arXiv:1810.08719, 2018.
- Gu et al. (2016) Gu, S., Lillicrap, T., Sutskever, I., and Levine, S. Continuous deep q-learning with model-based acceleration. In International Conference on Machine Learning, pp. 2829–2838, 2016.
- Lillicrap et al. (2015) Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
- Lotter et al. (2016) Lotter, W., Kreiman, G., and Cox, D. Deep predictive coding networks for video prediction and unsupervised learning. arXiv preprint arXiv:1605.08104, 2016.
- Nagabandi et al. (2018) Nagabandi, A., Kahn, G., Fearing, R. S., and Levine, S. Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 7559–7566. IEEE, 2018.
- Oh et al. (2015) Oh, J., Guo, X., Lee, H., Lewis, R. L., and Singh, S. Action-conditional video prediction using deep networks in atari games. In Advances in neural information processing systems, pp. 2863–2871, 2015.
- Rao & Ballard (1999) Rao, R. P. and Ballard, D. H. Predictive coding in the visual cortex: a functional interpretation of some extra-classical receptive-field effects. Nature neuroscience, 2(1):79, 1999.
- Schulman et al. (2015) Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. Trust region policy optimization. In International Conference on Machine Learning, pp. 1889–1897, 2015.
- Schulman et al. (2017) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
- Sutton (1990) Sutton, R. Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In Proceedings of the International Machine Learning Conference, pp. 212–218, 1990.
- Wang et al. (2017) Wang, Y., Long, M., Wang, J., Gao, Z., and Yu, P. S. Predrnn: Recurrent neural networks for predictive learning using spatiotemporal lstms. In Advances in Neural Information Processing Systems, pp. 879–888, 2017.