Contains implementations of various deep RL algorithms and papers including action conditional video prediction | Python | Tensorflow | Open AI gym
Motivated by vision-based reinforcement learning (RL) problems, in particular Atari games from the recent benchmark Arcade Learning Environment (ALE), we consider spatio-temporal prediction problems where future (image-)frames are dependent on control variables or actions as well as previous frames. While not composed of natural scenes, frames in Atari games are high-dimensional in size, can involve tens of objects with one or more objects being controlled by the actions directly and many other objects being influenced indirectly, can involve entry and departure of objects, and can involve deep partial observability. We propose and evaluate two deep neural network architectures that consist of encoding, action-conditional transformation, and decoding layers based on convolutional neural networks and recurrent neural networks. Experimental results show that the proposed architectures are able to generate visually-realistic frames that are also useful for control over approximately 100-step action-conditional futures in some games. To the best of our knowledge, this paper is the first to make and evaluate long-term predictions on high-dimensional video conditioned by control inputs.
Over the years, deep learning approaches (see [5, 26] for a survey) have shown great success in many visual perception problems (e.g., [16, 7, 32, 9]). However, modeling videos (building a generative model) is still a very challenging problem because it often involves high-dimensional natural-scene data with complex temporal dynamics. Thus, recent studies have mostly focused on modeling simple video data, such as bouncing balls or small patches, where the next frame is highly-predictable given the previous frames [29, 20, 19]. In many applications, however, future frames depend not only on previous frames but also on control or action variables. For example, the first-person-view in a vehicle is affected by wheel-steering and acceleration. The camera observation of a robot is similarly dependent on its movement and changes of its camera angle. More generally, in vision-based reinforcement learning (RL) problems, learning to predict future images conditioned on actions amounts to learning a model of the dynamics of the agent-environment interaction, an essential component of model-based approaches to RL. In this paper, we focus on Atari games from the Arcade Learning Environment (ALE) as a source of challenging action-conditional video modeling problems. While not composed of natural scenes, frames in Atari games are high-dimensional, can involve tens of objects with one or more objects being controlled by the actions directly and many other objects being influenced indirectly, can involve entry and departure of objects, and can involve deep partial observability. To the best of our knowledge, this paper is the first to make and evaluate long-term predictions on high-dimensional images conditioned by control inputs.
This paper proposes, evaluates, and contrasts two spatio-temporal prediction architectures based on deep networks that incorporate action variables (see Figure 1). Our experimental results show that our architectures are able to generate realistic frames over 100-step action-conditional futures without diverging in some Atari games. We show that the representations learned by our architectures 1) approximately capture natural similarity among actions, and 2) discover which objects are directly controlled by the agent’s actions and which are only indirectly influenced or not controlled. We evaluated the usefulness of our architectures for control in two ways: 1) by replacing emulator frames with predicted frames in a previously-learned model-free controller (DQN; DeepMind’s state-of-the-art Deep Q-Network for Atari games), and 2) by using the predicted frames to drive an exploration strategy that is more informed than random exploration, in order to improve a model-free controller (also DQN).
The problem of video prediction has led to a variety of architectures in deep learning. A recurrent temporal restricted Boltzmann machine (RTRBM) was proposed to learn temporal correlations from sequential data by introducing recurrent connections in RBM. A structured RTRBM (sRTRBM) scaled up RTRBM by learning dependency structures between observations and hidden variables from data. More recently, Michalski et al. proposed a higher-order gated autoencoder that defines multiplicative interactions between consecutive frames and mapping units, and showed that the temporal prediction problem can be viewed as learning and inferring higher-order interactions between consecutive images. Srivastava et al. applied a sequence-to-sequence learning framework to a video domain, and showed that long short-term memory (LSTM) networks are capable of generating video of bouncing handwritten digits. In contrast to these previous studies, this paper tackles problems where control variables affect temporal dynamics, and in addition scales up spatio-temporal prediction to larger-size images.
Atari 2600 games provide challenging environments for RL because of high-dimensional visual observations, partial observability, and delayed rewards. Approaches that combine deep learning and RL have made significant advances [21, 22, 11]. Specifically, DQN  combined Q-learning  with a convolutional neural network (CNN) and achieved state-of-the-art performance on many Atari games. Guo et al.  used the ALE-emulator for making action-conditional predictions with slow UCT , a Monte-Carlo tree search method, to generate training data for a fast-acting CNN, which outperformed DQN on several domains. Throughout this paper we will use DQN to refer to the architecture used in  (a more recent work  used a deeper CNN with more data to produce the currently best-performing Atari game players).
The idea of building a predictive model for vision-based RL problems was introduced by Schmidhuber and Huber . They proposed a neural network that predicts the attention region given the previous frame and an attention-guiding action. More recently, Lenz et al.  proposed a recurrent neural network with multiplicative interactions that predicts the physical coordinate of a robot. Compared to this previous work, our work is evaluated on much higher-dimensional data with complex dependencies among observations. There have been a few attempts to learn from ALE data a transition-model that makes predictions of future frames. One line of work [3, 4] divides game images into patches and applies a Bayesian framework to predict patch-based observations. However, this approach assumes that neighboring patches are enough to predict the center patch, which is not true in Atari games because of many complex interactions. The evaluation in this prior work is 1-step prediction loss; in contrast, here we make and evaluate long-term predictions both for quality of pixels generated and for usefulness to control.
The goal of our architectures is to learn a function $f: \mathbf{x}_{1:t}, \mathbf{a}_t \rightarrow \mathbf{x}_{t+1}$, where $\mathbf{x}_t$ and $\mathbf{a}_t$ are the frame and action variables at time $t$, and $\mathbf{x}_{1:t}$ are the frames from time $1$ to time $t$. Figure 1 shows our two architectures that are each composed of encoding layers that extract spatio-temporal features from the input frames (§3.1), action-conditional transformation layers that transform the encoded features into a prediction of the next frame in high-level feature space by introducing action variables as additional input (§3.2) and finally decoding layers that map the predicted high-level features into pixels (§3.3). Our contributions are in the novel action-conditional deep convolutional architectures for high-dimensional, long-term prediction as well as in the novel use of the architectures in vision-based RL domains.
Feedforward encoding takes a fixed history of previous frames as an input, which is concatenated through channels (Figure 1(a)), and stacked convolution layers extract spatio-temporal features directly from the concatenated frames. The encoded feature vector $\mathbf{h}^{enc}_t \in \mathbb{R}^n$ at time $t$ is:

$$\mathbf{h}^{enc}_t = \mathrm{CNN}\left(\mathbf{x}_{t-m+1:t}\right), \qquad (1)$$

where $\mathbf{x}_{t-m+1:t} \in \mathbb{R}^{(m \times c) \times h \times w}$ denotes $m$ frames of $h \times w$ pixel images with $c$ color channels. CNN is a mapping from raw pixels to a high-level feature vector using multiple convolution layers and a fully-connected layer at the end, each of which is followed by a non-linearity. This encoding can be viewed as early-fusion (other types of fusions, e.g., late-fusion or 3D convolution, can also be applied to this architecture).
Recurrent encoding takes one frame as an input for each time-step and extracts spatio-temporal features using an RNN in which the temporal dynamics is modeled by the recurrent layer on top of the high-level feature vector extracted by convolution layers (Figure 1(b)). In this paper, LSTM without peephole connection is used for the recurrent layer as follows:

$$[\mathbf{h}^{enc}_t, \mathbf{c}_t] = \mathrm{LSTM}\left(\mathrm{CNN}(\mathbf{x}_t), \mathbf{h}^{enc}_{t-1}, \mathbf{c}_{t-1}\right), \qquad (2)$$

where $\mathbf{c}_t \in \mathbb{R}^n$ is a memory cell that retains information from a deep history of inputs. Intuitively, $\mathrm{CNN}(\mathbf{x}_t)$ is given as input to the LSTM so that the LSTM captures temporal correlations from high-level spatial features.
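We use multiplicative interactions between the encoded feature vector and the control variables:

$$h^{dec}_{t,i} = \sum_{j,l} W_{ijl}\, h^{enc}_{t,j}\, a_{t,l} + b_i, \qquad (3)$$

where $\mathbf{h}^{enc}_t \in \mathbb{R}^n$ is an encoded feature, $\mathbf{h}^{dec}_t \in \mathbb{R}^n$ is an action-transformed feature, $\mathbf{a}_t \in \mathbb{R}^a$ is the action-vector at time $t$, $\mathbf{W} \in \mathbb{R}^{n \times n \times a}$ is a 3-way tensor weight, and $\mathbf{b} \in \mathbb{R}^n$ is a bias. When the action $\mathbf{a}$ is represented using a one-hot vector, using a 3-way tensor is equivalent to using different weight matrices for each action. This enables the architecture to model different transformations for different actions. The advantages of multiplicative interactions have been explored in image and text processing [33, 30, 18]. In practice the 3-way tensor is not scalable because of its large number of parameters. Thus, we approximate the tensor by factorizing into three matrices as follows:

$$\mathbf{h}^{dec}_t = \mathbf{W}^{dec}\left(\mathbf{W}^{enc}\mathbf{h}^{enc}_t \odot \mathbf{W}^{a}\mathbf{a}_t\right) + \mathbf{b}, \qquad (4)$$

where $\mathbf{W}^{dec} \in \mathbb{R}^{n \times f}$, $\mathbf{W}^{enc} \in \mathbb{R}^{f \times n}$, $\mathbf{W}^{a} \in \mathbb{R}^{f \times a}$, $\mathbf{b} \in \mathbb{R}^n$, and $f$ is the number of factors. Unlike the 3-way tensor, the above factorization shares the weights between different actions by mapping them to the size-$f$ factors. This sharing may be desirable relative to the 3-way tensor when there are common temporal dynamics in the data across different actions (discussed further in §4.3).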
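As a small illustration of the factored interaction, the sketch below computes an action-transformed feature as a decoding matrix applied to the element-wise product of an encoded-feature projection and an action projection, and checks that a one-hot action selects one effective weight matrix. The names `W_dec`, `W_enc`, `W_act`, the tiny sizes, and the random weights are assumptions chosen for readability, not values from the paper.

```python
import random

def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def factored_transform(W_dec, W_enc, W_act, b, h_enc, a):
    """Compute W_dec @ ((W_enc @ h_enc) * (W_act @ a)) + b, with * element-wise."""
    enc_factors = matvec(W_enc, h_enc)  # f values from the encoded feature
    act_factors = matvec(W_act, a)      # f values from the action vector
    gated = [e * g for e, g in zip(enc_factors, act_factors)]
    return [y + bi for y, bi in zip(matvec(W_dec, gated), b)]

def per_action_matrix(W_dec, W_enc, W_act, action_index):
    """Effective n x n matrix applied to h_enc when the action is one-hot."""
    f = len(W_enc)
    gate = [W_act[k][action_index] for k in range(f)]
    n_out, n_in = len(W_dec), len(W_enc[0])
    return [[sum(W_dec[i][k] * gate[k] * W_enc[k][j] for k in range(f))
             for j in range(n_in)] for i in range(n_out)]

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
n, f, num_actions = 3, 4, 2          # hypothetical toy sizes
W_dec = random_matrix(n, f, rng)
W_enc = random_matrix(f, n, rng)
W_act = random_matrix(f, num_actions, rng)
b = [0.1] * n
h_enc = [0.5, -1.0, 2.0]
a = [0.0, 1.0]                       # one-hot vector selecting action 1

h_dec = factored_transform(W_dec, W_enc, W_act, b, h_enc, a)
M = per_action_matrix(W_dec, W_enc, W_act, 1)
h_dec_via_matrix = [y + bi for y, bi in zip(matvec(M, h_enc), b)]
```

Both routes give the same vector, which shows that the factorization still applies a different linear map per action while sharing `W_dec` and `W_enc` across all actions.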
It has been recently shown that a CNN is capable of generating an image effectively using upsampling followed by convolution with stride of 1. Similarly, we use the “inverse” operation of convolution, called deconvolution, which maps a $1 \times 1$ spatial region of the input to $d \times d$ using deconvolution kernels. The effect of $s \times s$ upsampling can be achieved without explicitly upsampling the feature map by using a stride of $s$. We found that this operation is more efficient than upsampling followed by convolution because of the smaller number of convolutions with larger stride.
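The relationship between strided deconvolution and upsampling followed by convolution can be checked in one dimension. The sketch below assumes that "upsampling" means inserting zeros between neighbouring inputs and that the following convolution is a stride-1 "full" convolution. The function names are illustrative only.

```python
def deconv1d(x, kernel, stride):
    """Strided deconvolution: each input value adds a scaled copy of the kernel."""
    out = [0.0] * ((len(x) - 1) * stride + len(kernel))
    for i, v in enumerate(x):
        for j, w in enumerate(kernel):
            out[i * stride + j] += v * w
    return out

def upsample_zeros(x, stride):
    """Insert (stride - 1) zeros between neighbouring input values."""
    out = []
    for i, v in enumerate(x):
        out.append(v)
        if i < len(x) - 1:
            out.extend([0.0] * (stride - 1))
    return out

def full_conv1d(x, kernel):
    """Stride-1 full convolution with output length len(x) + len(kernel) - 1."""
    out = []
    for n in range(len(x) + len(kernel) - 1):
        total = 0.0
        for j, w in enumerate(kernel):
            if 0 <= n - j < len(x):
                total += w * x[n - j]
        out.append(total)
    return out

x = [1.0, 2.0, 3.0]
kernel = [1.0, 0.5]
direct = deconv1d(x, kernel, stride=2)
two_stage = full_conv1d(upsample_zeros(x, 2), kernel)
```

Here `direct` and `two_stage` hold the same six values, but the direct form never multiplies by the inserted zeros, which is one way to read the efficiency remark above.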
In the proposed architecture, the transformed feature vector $\mathbf{h}^{dec}_t$ is decoded into pixels as follows:

$$\hat{\mathbf{x}}_{t+1} = \mathrm{Deconv}\left(\mathrm{Reshape}\left(\mathbf{h}^{dec}_t\right)\right), \qquad (5)$$

where Reshape is a fully-connected layer where hidden units form a 3D feature map, and Deconv consists of multiple deconvolution layers, each of which is followed by a non-linearity except for the last deconvolution layer.
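It is almost inevitable for a predictive model to make noisy predictions of high-dimensional images. When the model is trained on a 1-step prediction objective, small prediction errors can compound through time. To alleviate this effect, we use a multi-step prediction objective. More specifically, given the training data $D = \left\{\left((\mathbf{x}^{(i)}_1, \mathbf{a}^{(i)}_1), \ldots, (\mathbf{x}^{(i)}_{T_i}, \mathbf{a}^{(i)}_{T_i})\right)\right\}_{i=1}^{N}$, the model is trained to minimize the average squared error over $K$-step predictions as follows:

$$\mathcal{L}_K(\theta) = \frac{1}{2K}\sum_i \sum_t \sum_{k=1}^{K} \left\| \hat{\mathbf{x}}^{(i)}_{t+k} - \mathbf{x}^{(i)}_{t+k} \right\|^2, \qquad (6)$$

where $\hat{\mathbf{x}}^{(i)}_{t+k}$ is a $k$-step future prediction. Intuitively, the network is repeatedly unrolled through $K$ time steps by using its prediction as an input for the next time-step.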
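The unrolling behind the multi-step objective can be sketched with a toy predictor. The code below is a minimal illustration for a single starting time step, assuming a hypothetical `predict(history, action)` callable and a 1/(2K) scaling of the summed squared error. The full objective would also sum over sequences and starting positions.

```python
def multi_step_loss(predict, frames, actions, start, K, history_len):
    """Scaled squared error over K-step predictions from one start index.

    frames[t] and actions[t] are the frame and action at time t. The model
    first sees `history_len` true frames ending at `start`, then consumes
    its own predictions as input for the following steps.
    """
    history = list(frames[start - history_len + 1 : start + 1])
    total = 0.0
    for k in range(1, K + 1):
        pred = predict(history, actions[start + k - 1])
        target = frames[start + k]
        total += sum((p - x) ** 2 for p, x in zip(pred, target))
        history = history[1:] + [pred]  # feed the prediction back in
    return total / (2 * K)

# Toy world (an assumption for illustration): one-pixel frames, x[t+1] = x[t] + a[t].
frames = [[0.0], [1.0], [2.0], [3.0], [4.0]]
actions = [1.0, 1.0, 1.0, 1.0]

def perfect(history, action):
    return [history[-1][0] + action]

def biased(history, action):
    return [history[-1][0] + action + 0.1]  # constant per-step error

loss_perfect = multi_step_loss(perfect, frames, actions, start=1, K=3, history_len=2)
loss_biased = multi_step_loss(biased, frames, actions, start=1, K=3, history_len=2)
```

With the biased predictor the per-step error grows from 0.1 to 0.2 to 0.3 as predictions are fed back, which is the compounding effect that a 1-step objective does not penalize.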
The model is trained in multiple phases based on increasing $K$ as suggested by Michalski et al. In other words, the model is trained to predict short-term future frames and fine-tuned to predict longer-term future frames after the previous phase converges. We found that this curriculum learning
In the experiments that follow, we have the following goals for our two architectures: 1) to evaluate the predicted frames in two ways: qualitatively evaluating the generated video, and quantitatively evaluating the pixel-based squared error; 2) to evaluate the usefulness of predicted frames for control in two ways: by replacing the emulator’s frames with predicted frames for use by DQN, and by using the predictions to improve exploration in DQN; and 3) to analyze the representations learned by our architectures. We begin by describing the data, model architecture, and baselines.
We used our replication of DQN to generate game-play video datasets using an $\epsilon$-greedy policy with $\epsilon = 0.3$, i.e., DQN is forced to choose a random action with 30% probability. For each game, the dataset consists of about training frames and test frames with actions chosen by DQN. Following DQN, actions are chosen once every 4 frames, which reduces the video from 60fps to 15fps. The number of actions available in games varies from to , and they are represented as one-hot vectors. We used full-resolution RGB images () and preprocessed the images by subtracting mean pixel values and dividing each pixel value by .
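The data-collection policy and the action encoding described above can be sketched as follows. This is a minimal illustration with hypothetical Q-values for three actions; it is not the DQN replication itself.

```python
import random

def epsilon_greedy(q_values, epsilon, rng):
    """Pick a uniformly random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda i: q_values[i])

def one_hot(index, num_actions):
    """Encode an action index as a one-hot vector."""
    return [1.0 if i == index else 0.0 for i in range(num_actions)]

rng = random.Random(0)
q_values = [0.1, 0.9, 0.3]  # hypothetical Q-values; action 1 is greedy
picks = [epsilon_greedy(q_values, 0.3, rng) for _ in range(10000)]
greedy_share = picks.count(1) / len(picks)
encoded = one_hot(picks[0], len(q_values))
```

Because a random draw can also land on the greedy action, the greedy action is expected in about 0.7 + 0.3/3 = 0.8 of the steps for three actions.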
Across all game domains, we use the same network architecture as follows. The encoding layers consist of convolution layers and one fully-connected layer with hidden units. The convolution layers use , , , and filters with stride of 2. Every layer is followed by a rectified linear function. In the recurrent encoding network, an LSTM layer with hidden units is added on top of the fully-connected layer. The number of factors in the transformation layer is . The decoding layers consist of one fully-connected layer with hidden units followed by deconvolution layers. The deconvolution layers use , , , and filters with stride of 2. For the feedforward encoding network, the last 4 frames are given as an input for each time-step. The recurrent encoding network takes one frame for each time-step, but it is unrolled through the last frames to initialize the LSTM hidden units before making a prediction. Our implementation is based on the Caffe toolbox.
We use the curriculum learning scheme above with three phases of increasing prediction step objectives of 1, 3, and 5 steps, and learning rates of , , and , respectively. RMSProp [34, 10] is used with momentum of , (squared) gradient momentum of , and min squared gradient of . The batch size for each training phase is , , and for the feedforward encoding network and , , and for the recurrent encoding network, respectively. When the recurrent encoding network is trained on the 1-step prediction objective, the network is unrolled through steps and predicts the last frames by taking ground-truth images as input. Gradients are clipped at before the non-linearity of each gate of the LSTM as suggested by .
The first baseline is a multi-layer perceptron (MLP) that takes the last frame as input and has 4 hidden layers with 400, 2048, 2048, and 400 units. The action input is concatenated to the second hidden layer. This baseline uses approximately the same number of parameters as the recurrent encoding model. The second baseline, no-action feedforward (or naFf), is the same as the feedforward encoding model (Figure 1(a)) except that the transformation layer consists of one fully-connected layer that does not get the action as input.
The prediction videos of our models and baselines are available in the supplementary material and at the following website: https://sites.google.com/a/umich.edu/junhyuk-oh/action-conditional-video-prediction. As seen in the videos, the proposed models make qualitatively reasonable predictions over – steps depending on the game. In all games, the MLP baseline quickly diverges, and the naFf baseline fails to predict the controlled object. An example of long-term predictions is illustrated in Figure 2. We observed that both of our models predict complex local translations well such as the movement of vehicles and the controlled object. They can predict interactions between objects such as collision of two objects. Since our architectures effectively extract hierarchical features using CNN, they are able to make a prediction that requires a global context. For example, in Figure 2, the model predicts the sudden change of the location of the controlled object (from the top to the bottom) at 257-step.
However, both of our models have difficulty in accurately predicting small objects, such as bullets in Space Invaders. The reason is that the squared error signal is small when the model fails to predict small objects during training. Another difficulty is in handling stochasticity. In Seaquest, e.g., new objects appear from the left side or right side randomly, and so are hard to predict. Although our models do generate new objects with reasonable shapes and movements (e.g., after appearing they move as in the true frames), the generated frames do not necessarily match the ground-truth.
Mean squared error over 100-step predictions is reported in Figure 3. Our predictive models outperform the two baselines for all domains. However, the gap between our predictive models and naFf baseline is not large except for Seaquest. This is due to the fact that the object controlled by the action occupies only a small part of the image.
We hypothesize that feedforward encoding can model more precise spatial transformations because its convolutional filters can learn temporal correlations directly from pixels in the concatenated frames. In contrast, convolutional filters in recurrent encoding can learn only spatial features from the one-frame input, and the temporal context has to be captured by the recurrent layer on top of the high-level CNN features without localized information. On the other hand, recurrent encoding is potentially better for modeling arbitrarily long-term dependencies, whereas feedforward encoding is not suitable for long-term dependencies because it requires more memory and parameters as more frames are concatenated into the input.
As evidence, in Figure 4(a) we show a case where feedforward encoding is better at predicting the precise movement of the controlled object, while recurrent encoding makes a 1-2 pixel translation error. This small error leads to entirely different predicted frames after a few steps. Since the feedforward and recurrent architectures are identical except for the encoding part, we conjecture that this result is due to the failure of precise spatio-temporal encoding in recurrent encoding. On the other hand, recurrent encoding is better at predicting when the enemies move in Space Invaders (Figure 4(b)). This is due to the fact that the enemies move after 9 steps, which is hard for feedforward encoding to predict because it takes only the last four frames as input. We observed similar results showing that feedforward encoding cannot handle long-term dependencies in other games.
To evaluate how useful the predictions are for playing the games, we implement an evaluation method that uses the predictive model to replace the game emulator. More specifically, a DQN controller that takes the last four frames is first pre-trained using real frames and then used to play the games based on an $\epsilon$-greedy policy where the input frames are generated by our predictive model instead of the game emulator. To evaluate how the depth of predictions influences the quality of control, we re-initialize the predictions using the true last frames after every $n$ steps of prediction, for several values of $n$. Note that the DQN controller never takes a true frame, just the outputs of our predictive models.
The results are shown in Figure 5. Unsurprisingly, replacing real frames with predicted frames reduces the score. However, in all the games using the model to repeatedly predict only a few time steps yields a score very close to that of using real frames. Our two architectures produce much better scores than the two baselines for deep predictions than would be suggested based on the much smaller differences in squared error. The likely cause of this is that our models are better able to predict the movement of the controlled object relative to the baselines even though such an ability may not always lead to better squared error. In three out of the five games the score remains much better than the score of random play even when using 100 steps of prediction.
| Model | Seaquest | S. Invaders | Freeway | QBert | Ms Pacman |
| --- | --- | --- | --- | --- | --- |
| DQN - Random exploration | 13119 (538) | 698 (20) | 30.9 (0.2) | 3876 (106) | 2281 (53) |
| DQN - Informed exploration | 13265 (577) | 681 (23) | 32.2 (0.2) | 8238 (498) | 2522 (57) |

Table 1: Average game score of DQN over 100 plays with standard error (in parentheses). The first row and the second row show the performance of our DQN replication with different exploration strategies.
To learn control in an RL domain, exploration of actions and states is necessary because without it the agent can get stuck in a bad sub-optimal policy. In DQN, the CNN-based agent was trained using an $\epsilon$-greedy policy in which the agent chooses either a greedy action or a random action by flipping a coin with probability $\epsilon$. Such random exploration is a basic strategy that produces sufficient exploration, but can be slower than more informed exploration strategies. Thus, we propose an informed exploration strategy that follows the $\epsilon$-greedy policy, but chooses exploratory actions that lead to a frame that has been visited least often (in the last $d$ time steps), rather than random actions. Implementing this strategy requires a predictive model because the next frame for each possible action has to be considered.
The method works as follows. The most recent $d$ frames are stored in a trajectory memory, denoted $D = \{\mathbf{x}^{(i)}\}_{i=1}^{d}$. The predictive model is used to get the next frame $\mathbf{x}^{(a)}$ for every action $a$. We estimate the visit-frequency for every predicted frame by summing the similarity between the predicted frame and the $d$ most recent frames stored in the trajectory memory using a Gaussian kernel as follows:

$$n_D(\mathbf{x}^{(a)}) = \sum_{i=1}^{d} k(\mathbf{x}^{(a)}, \mathbf{x}^{(i)}); \qquad k(\mathbf{x}, \mathbf{y}) = \exp\left(-\sum_j \min\left(\max\left((x_j - y_j)^2 - \delta, 0\right), 1\right) / \sigma\right),$$

where $\delta$ is a threshold, and $\sigma$ is a kernel bandwidth. The trajectory memory size is 200 for QBert and 20 for the other games, for Freeway and 50 for the others, and for all games. For computational efficiency, we trained a new feedforward encoding network on gray-scaled images as they are used as input for DQN. The details of the network architecture are provided in the supplementary material. Table 1 summarizes the results. The informed exploration improves DQN’s performance using our predictive model in three of five games, with the most significant improvement in QBert. Figure 7 shows how the informed exploration strategy improves the initial experience of DQN.
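The selection of the least-visited predicted frame can be sketched in a few lines. The kernel below is an assumption about the exact form: per-pixel squared differences are reduced by the threshold `delta`, clipped to [0, 1], summed, and scaled by the bandwidth `sigma`. The toy frames and parameter values are hypothetical.

```python
import math

def kernel(x, y, delta, sigma):
    """Similarity in (0, 1]; equals 1 when all pixels differ by at most sqrt(delta)."""
    dist = sum(min(max((xj - yj) ** 2 - delta, 0.0), 1.0) for xj, yj in zip(x, y))
    return math.exp(-dist / sigma)

def visit_frequency(pred_frame, memory, delta, sigma):
    """Sum of similarities between a predicted frame and the stored recent frames."""
    return sum(kernel(pred_frame, past, delta, sigma) for past in memory)

def informed_action(predicted_frames, memory, delta, sigma):
    """Index of the action whose predicted next frame has been visited least."""
    scores = [visit_frequency(f, memory, delta, sigma) for f in predicted_frames]
    return min(range(len(scores)), key=lambda a: scores[a])

# Toy two-pixel frames: the agent has mostly been at [0, 0] and once at [1, 0].
memory = [[0.0, 0.0], [0.0, 0.0], [1.0, 0.0]]
predicted_frames = [[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]]  # one prediction per action
choice = informed_action(predicted_frames, memory, delta=0.0, sigma=1.0)
```

In this toy case the third action leads to a frame unlike anything in memory, so it gets the lowest visit-frequency and is the one chosen for exploration.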
In the factored multiplicative interactions, every action is linearly transformed to $f$ factors ($\mathbf{W}^{a}\mathbf{a}$ in Equation 4). In Figure 7 we present the cosine similarity between every pair of action-factors after training in Seaquest. ‘N’ and ‘F’ correspond to ‘no-operation’ and ‘fire’. Arrows correspond to movements with (black) or without (white) ‘fire’. There are positive correlations between actions that have the same movement directions (e.g., ‘up’ and ‘up+fire’), and negative correlations between actions that have opposing directions. These relationships are reasonable and were discovered automatically while learning to make good predictions.
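The similarity analysis amounts to comparing the factor vectors that one-hot actions are mapped to, i.e. the columns of the action weight matrix. The sketch below uses a made-up 3-factor matrix `W_act` for three hypothetical actions; the numbers are not learned weights.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def action_factor(W_act, action_index):
    """Column of W_act: the factors that a one-hot action is mapped to."""
    return [row[action_index] for row in W_act]

# Rows are factors; columns are the actions 'up', 'up+fire', 'down' (hypothetical values).
W_act = [
    [1.0, 0.9, -1.0],
    [0.5, 0.6, -0.4],
    [0.0, 0.2, 0.1],
]
up, up_fire, down = (action_factor(W_act, i) for i in range(3))
sim_same_direction = cosine_similarity(up, up_fire)
sim_opposite_direction = cosine_similarity(up, down)
```

With weights like these, actions sharing a movement direction come out positively correlated and opposing directions negatively correlated, which mirrors the pattern described above.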
Distinguishing Controlled and Uncontrolled Objects is itself a hard and interesting problem. Bellemare et al.  proposed a framework to learn contingent regions of an image affected by agent action, suggesting that contingency awareness is useful for model-free agents. We show that our architectures implicitly learn contingent regions as they learn to predict the entire image.
In our architectures, a factor ($f_i = \mathbf{W}^{a}_{i,:}\,\mathbf{a}$) with higher variance measured over all possible actions, $\mathrm{Var}(f_i) = \mathbb{E}_{\mathbf{a}}\left[(f_i - \mathbb{E}_{\mathbf{a}}[f_i])^2\right]$, is more likely to transform an image differently depending on actions, and so we assume such factors are responsible for transforming the parts of the image related to actions. We therefore collected the high variance (referred to as “highvar”) factors from the model trained on Seaquest (around 40% of factors), and collected the remaining factors into a low variance (“lowvar”) subset. Given an image and an action, we did two controlled forward propagations: giving only highvar factors (by setting the other factors to zeros) and vice versa. The results are visualized as ‘Action’ and ‘Non-Action’ in Figure 8. Interestingly, given only highvar-factors (Action), the model predicts sharply the movement of the object controlled by actions, while the other parts are mean pixel values. In contrast, given only lowvar-factors (Non-Action), the model predicts the movement of the other objects and the background (e.g., oxygen), and the controlled object stays at its previous location. This result implies that our model learns to distinguish between controlled objects and uncontrolled objects and transform them using disentangled representations (see [25, 24, 37] for related work on disentangling factors of variation).
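The highvar/lowvar split can be sketched as a variance computation over the rows of the action weight matrix followed by masking. In the code below, the matrix `W_act`, the variance `threshold`, and the factor activations are hypothetical. The paper describes selecting a high-variance subset but the cut-off rule used here is an assumption.

```python
def factor_variances(W_act):
    """Variance of each factor's value across one-hot actions (rows are factors)."""
    variances = []
    for row in W_act:
        mean = sum(row) / len(row)
        variances.append(sum((w - mean) ** 2 for w in row) / len(row))
    return variances

def split_factors(W_act, threshold):
    """Return (highvar, lowvar) factor indices relative to a variance threshold."""
    var = factor_variances(W_act)
    highvar = [k for k, v in enumerate(var) if v > threshold]
    lowvar = [k for k, v in enumerate(var) if v <= threshold]
    return highvar, lowvar

def mask_factors(factors, keep):
    """Zero out every factor activation whose index is not in `keep`."""
    keep = set(keep)
    return [v if k in keep else 0.0 for k, v in enumerate(factors)]

# Four factors, three actions (made-up weights).
W_act = [
    [1.0, -1.0, 0.5],    # varies strongly with the action
    [0.3, 0.3, 0.3],     # identical for every action
    [2.0, 0.0, -2.0],    # varies strongly with the action
    [0.1, 0.12, 0.11],   # nearly constant
]
highvar, lowvar = split_factors(W_act, threshold=0.1)
factors = [5.0, 6.0, 7.0, 8.0]                   # hypothetical factor activations
action_only = mask_factors(factors, highvar)     # "Action" forward pass input
non_action_only = mask_factors(factors, lowvar)  # "Non-Action" forward pass input
```

Feeding `action_only` or `non_action_only` to the decoder in place of the full factor vector corresponds to the two controlled forward propagations described above.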
This paper introduced two different novel deep architectures that predict future frames that are dependent on actions and showed qualitatively and quantitatively that they are able to predict visually-realistic and useful-for-control frames over 100-step futures on several Atari game domains. To our knowledge, this is the first paper to show good deep predictions in Atari games. Since our architectures were domain independent we expect that they will generalize to many vision-based RL problems. In future work we will learn models that predict future reward in addition to predicting future frames and evaluate the performance of our architectures in model-based RL.
This work was supported by NSF grant IIS-1526059, Bosch Research, and ONR grant N00014-13-1-0762. Any opinions, findings, conclusions, or recommendations expressed here are those of the authors and do not necessarily reflect the views of the sponsors.
Journal of Artificial Intelligence Research, 47:253–279, 2013.
Foundations and Trends in Machine Learning, 2(1):1–127, 2009.
The network architectures of the proposed models and the baselines are illustrated in Figure 9.
The weight of LSTM is initialized from a uniform distribution of. The weight of the fully-connected layer from the encoded feature to the factored layer and from the action to the factored layer are initialized from a uniform distribution of and respectively.
The total number of iterations is , , and for each training phase (1-step, 3-step, and 5-step). The learning rate is multiplied by after every iterations.
’ indicates element-wise multiplication. The text in each (de-)convolution layer describes the number of filters, the size of the kernel, padding (height and width), and stride.
The entire DQN algorithm with informed exploration is described in Algorithm 1.
A feedforward encoding network (illustrated in Figure 10) trained on down-sampled and gray-scaled images is used for computational efficiency. We trained the model on 1-step prediction objective with learning rate of and batch size of . The pixel values are subtracted by mean pixel values and divided by 128. RMSProp is used with momentum of , (squared) gradient momentum of , and min squared gradient of .
Figure 11 visualizes the difference between random exploration and informed exploration in two games. In Freeway, where the agent gets rewards by reaching the top lane, the agent moves only around the bottom area in the random exploration, which results in steps to get the first reward. On the other hand, the agent moves around all locations in the informed exploration and receives the first reward in steps. The similar result is found in Ms Pacman.
| Model | Seaquest | S. Invaders | Freeway | QBert | Ms Pacman |
| --- | --- | --- | --- | --- | --- |
| DQN (Nature) | 5286 | 1976 | 30.3 | 10596 | 2311 |
| DQN (NIPS) | 1705 | 581 | - | 1952 | - |
| Our replication of | 13119 (538) | 698 (20) | 30.9 (0.2) | 3876 (106) | 2281 (53) |
| I.E (Prediction) | 13265 (577) | 681 (23) | 32.2 (0.2) | 8238 (498) | 2522 (57) |
| I.E (Emulator) | 13002 (498) | 708 (17) | 32.2 (0.2) | 7969 (496) | 2702 (92) |