Practical Issues of Action-conditioned Next Image Prediction

02/08/2018 ∙ by Donglai Zhu, et al. ∙ HUAWEI Technologies Co., Ltd.

The problem of action-conditioned image prediction is to predict the expected next frame given the current camera frame the robot observes and an action selected by the robot. We provide the first comparison of two recent popular models, especially for image prediction on cars. Our major finding is that action tiling encoding is the most important factor leading to the remarkable performance of the CDNA model. We present a light-weight model with action tiling encoding that has the same single-decoder feedforward architecture as (Oh et al., 2015). On a real driving dataset, the CDNA model achieves 0.3986 × 10^-3 MSE and 0.9846 Structural SIMilarity (SSIM) with a network size of about 12.6 million parameters. With a small network of fewer than 1 million parameters, our new model achieves performance comparable to CDNA, at 0.3613 × 10^-3 MSE and 0.9633 SSIM. Our model requires less memory, is more computationally efficient, and is well suited for use inside self-driving vehicles.




1 Introduction

In autonomous driving and robotics, system engineers often choose a state representation according to the application. For different control problems, the representation is usually different. It is also common that for the same control problem, different algorithms use different state representations. In particular, we are interested in models that reflect how a vehicle’s projection of the world changes according to its action.

In the classical approach for autonomous driving, engineers extract perception and localization results from sensors, summarizing the geometric relationship of the vehicle with its environment. Based on the geometric representation of the world, a controller is built. This approach is so far the most popular and widely adopted in industry. However, it also has weaknesses. Driving software following this approach has a heavy pipeline stack, from perception and localization to planning modules. The message passing between perception, localization and controller modules leads to a complicated software architecture. On the control side, it is common to use heuristics and hand-tuned controllers that only know how to respond to a low-dimensional state representation. This specific design for autonomous driving architecture results in significant software development effort, requiring intensive testing for corner cases and tuning of hyperparameters. By following this classical approach, building safety has been largely reduced to software practice.

The imitation learning approach, starting from multi-layer perceptrons (MLPs) (Pomerleau, 1989) to the more recent convolutional neural networks (Bojarski et al., 2016), regresses the steering angle given the camera view. This approach leads to a simpler architecture for autonomous driving. However, compared to the classical approach, it lacks understanding of the decision process of driving. The effectiveness of this approach remains to be seen in the next few years.

The affordance approach (Chen et al., 2015) predicts relevant geometric features (called “affordances”) from images. Based on the predicted features, a controller can be developed. This approach bears some similarity to Pavlovian control, in which animals map predictions of events into their behaviors (Modayil and Sutton, 2014).

A key difference among the three approaches is the state representation on which the controller is built. The classical approach represents the world as geometric features either manually extracted or learned using machine learning. The end-to-end imitation learning approach uses raw images as the world representation. The affordance approach can be viewed as an intermediate approach between the two: it uses predicted geometric information from images as the state representation.

These are exciting advances on the control side of autonomous vehicles. An autonomous vehicle needs to drive safely, but it also needs to be understood by human beings. It has to keep the passengers on board well informed during planning and decision making so that they can trust the system and avoid unnecessary nervousness. In addition, system designers need to understand the decision making process of the software so that it can be improved. However, visualization of the internal planning and decision making process of autonomous driving software has so far received little attention. Given the many problems that self-driving cars are facing and their rich choices of state representation, it is important to model the effects of car actions in a representation-agnostic way. Models that estimate how the environment state transitions in response to a robot’s actions turn out to be fundamental for visualizing the planning process of self-driving cars.

Besides visualization, building action models is essential to bring reinforcement learning to autonomous driving. Reinforcement learning has achieved remarkable success in Atari games (Mnih et al., 2015) and Go (Silver et al., 2016). Using reinforcement learning to develop driving software has the potential to make cars safer and save human lives. The lack of an accurate action model is one of the major gaps between reinforcement learning and autonomous driving: a perfect simulator or action model, such as in games, is not available in driving. A perfect simulator means that ground truth interaction samples can be collected efficiently, cheaply and without limit. AlphaGo achieves master-level play through playing millions of games against itself, state-of-the-art Go programs and human professionals (Silver et al., 2016). Recently, AlphaGo Zero even beat its previous version by learning through self-play without the use of human knowledge (Silver et al., 2017). For cars to be able to gain driving skills off-line like AlphaGo, we need models that accurately predict transitions between states conditioned on actions. Previous works in robotics and autonomous driving have focused on modeling vehicle dynamics (Omar et al., 1998; Yim and Oh, 2004; Ng et al., 2003; Bakker et al., 1987; Kong et al., 2015; Rajamani, 2011; Levinson et al., 2011; Urmson et al., 2007).

Learning action models for driving is challenging. By contrast, state transition in the game of Go is deterministic and noise-free: given a board and a legal move, the next board is fully determined. For autonomous driving, the next state in response to the car’s action is highly stochastic and noisy. In addition, the action of driving (e.g., steering angle and throttle) is continuous, whereas most model learning algorithms in reinforcement learning only handle discrete actions (e.g., Sutton et al. (2008); Yao and Szepesvári (2012); Grünewälder et al. (2012); Oh et al. (2015)). Recently, CommaAI proposed a simulator model that is learned from real driving video (Santana and Hotz, 2016). Lotter et al. (2016) used predictive coding inspired by models of the visual cortex to predict the next image in a video sequence without using the robot’s actions as input.

In this paper we focus on action models for image prediction, which is a general form of action-conditioned state prediction for a specific world representation. We provide the first comparison of two recent popular models: the Convolutional Dynamic Neural Advection (CDNA) model (Finn et al., 2016) and the feedforward model (Oh et al., 2015), especially for image prediction on cars. On the Comma AI driving dataset, the CDNA model achieves a Structural SIMilarity (SSIM) of 0.9836 with respect to the ground truth images, while the feedforward model achieves an SSIM of 0.8312 (Table 1). To explore the reason for the performance gap between the two models, we conducted a diagnosis experiment. Our experiment shows that action tiling encoding is the most important factor leading to the remarkable performance of CDNA, regardless of its complicated network architecture. We present a light-weight model with action tiling encoding that has the same single-decoder feedforward architecture as (Oh et al., 2015). Our model achieves results comparable to CDNA and requires much less memory and computation. It is especially advantageous on self-driving cars, where deploying high-end computation chips such as GPUs raises cooling, reliability, and cost issues.

2 Action Models

Markov Motion Prediction. If a problem is a Markov Decision Process (MDP), we can characterize it by a tuple $(\mathcal{S}, \mathcal{A}, P)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, and $P$ is a transition kernel. The reward in the MDP is not within the scope of this paper and is thus left out. The problem being Markovian means that the next state after an action depends only on the current state. In particular, the next state follows the conditional distribution $P(\cdot \mid s, a)$ after taking action $a$ at state $s$. Assume a state representation mapping $\phi$ on $\mathcal{S}$, where $\phi(s)$ is a tensor with a generic shape. We interact with the MDP and collect a dataset of samples $\{(\phi_t, a_t, \phi_{t+1})\}$, where $\phi_t = \phi(s_t)$. The tuple $(\phi_t, a_t, \phi_{t+1})$ means that $\phi_{t+1}$ is observed immediately after taking action $a_t$ when observing $\phi_t$. For general interest, we define a model $f$ that predicts the expected next state representation, in particular, a model that minimizes the following Mean-Squared Error (MSE):

$$ f^{*} = \arg\min_{f} \sum_{t} \left\| f(\phi_t, a_t) - \phi_{t+1} \right\|^2 $$

This model uses only one-step information to predict the next time step. For Markovian problems, this model suffices.
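The one-step objective can be illustrated with a small NumPy sketch. Everything here is an assumption made for illustration: the function name `one_step_mse`, the toy 8×8 frames, the scalar action, and the use of a per-pixel mean rather than a sum. It is not the training code used in the experiments.

```python
import numpy as np

def one_step_mse(model, dataset):
    """Average squared error of one-step predictions over (frame, action, next_frame) tuples."""
    errors = [np.mean((model(frame, action) - next_frame) ** 2)
              for frame, action, next_frame in dataset]
    return float(np.mean(errors))

# Hypothetical toy data: 8x8 "frames" that brighten by the (scalar) action.
rng = np.random.default_rng(0)
frames = [rng.random((8, 8)) for _ in range(5)]
actions = [0.1 * i for i in range(5)]
dataset = [(frame, action, frame + action) for frame, action in zip(frames, actions)]

def copy_last(frame, action):
    """Baseline that ignores the action and repeats the current frame."""
    return frame

def additive(frame, action):
    """Model that happens to match the toy dynamics exactly."""
    return frame + action

copy_error = one_step_mse(copy_last, dataset)
additive_error = one_step_mse(additive, dataset)
print(copy_error, additive_error)
```

In this toy setting the action-aware model reaches zero error, while the action-agnostic baseline pays the mean squared action magnitude.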

Figure 1: Our single-decoder feedforward model with action tiling (SDF-tiling-16) for action-conditioned next image prediction.

Non-Markov Motion. For autonomous driving, however, predicting the next observation of the car is not a Markovian problem. Pedestrians and other vehicles can move in unpredictable ways, although their behaviors are relatively predictable over a very short time. We thus need to extend the model to track the non-Markov factors:

$$ f^{*} = \arg\min_{f} \sum_{t} \left\| f(h_t, a_t) - \phi_{t+1} \right\|^2 $$

where $a_t$ is the action at time step $t$, $\phi_{t+1}$ is the next state representation, and $h_t$ is a representation of information up to time step $t$. In practice, $h_t$ can be a history window of observations up to the current time step, or a representation recursively constructed from past observations, such as in an LSTM.
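As a rough sketch of the history-window option, the class below keeps the most recent frames in a fixed-length buffer and stacks them into one array that could be fed to a predictor. The class name `HistoryWindow`, the zero-padding of the initial window, and the stacking axis are assumptions for illustration only.

```python
from collections import deque
import numpy as np

class HistoryWindow:
    """Fixed-length buffer of the most recent frames, oldest first."""

    def __init__(self, window_size, frame_shape):
        # Assumption: before enough frames arrive, the window is padded with zeros.
        initial = [np.zeros(frame_shape) for _ in range(window_size)]
        self.frames = deque(initial, maxlen=window_size)

    def push(self, frame):
        """Add the newest frame (dropping the oldest) and return the stacked window."""
        self.frames.append(np.asarray(frame, dtype=float))
        return self.stacked()

    def stacked(self):
        """Return the window as an array of shape (window_size, *frame_shape)."""
        return np.stack(list(self.frames), axis=0)

window = HistoryWindow(window_size=3, frame_shape=(2, 2))
for value in (1.0, 2.0, 3.0, 4.0):
    stacked = window.push(np.full((2, 2), value))
print(stacked.shape)  # (3, 2, 2)
```

A recurrent representation such as an LSTM would replace the explicit buffer with a learned state that is updated at every step.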

The model in equation 1 takes various forms in the literature, e.g., physics models for motion and contact (Brubaker et al., 2009), video prediction models for robot arm operation (Finn et al., 2016), and patch movement in images for car trajectory prediction (Walker et al., 2014). In control, it is usually in the form of a set of equations that describe the system dynamics (Coates et al., 2008). In self-driving cars, especially in motion planning, a specific form of this model (referred to as the Kinematics model (Paden et al., 2016)) is used to predict the car’s motion in the short term. In reinforcement learning, Markovian models and discrete actions are often considered (Yao and Szepesvári, 2012; Grünewälder et al., 2012; Oh et al., 2015). Because all these models capture the effects of actions regardless of their application context, we simply call them action models. Action models are specific forms of generative models (Goodfellow et al., 2014). Starting from a given state, one can generate a trajectory of states by applying a sequence of actions. Model-based reinforcement learning algorithms such as Monte Carlo Tree Search (MCTS) rely on action models to generate sample trajectories in order to learn good policies (Silver et al., 2017).
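Generating a trajectory from an action model amounts to feeding each prediction back in as the next input. The sketch below shows this loop with a toy stand-in for a learned model; the names `rollout` and `toy_model` are hypothetical, and a real image model would also need to update its history representation at each step.

```python
def rollout(model, state, actions):
    """Apply `model` repeatedly, feeding each predicted state back in as the next input."""
    trajectory = []
    for action in actions:
        state = model(state, action)
        trajectory.append(state)
    return trajectory

def toy_model(position, displacement):
    """Toy stand-in for a learned action model: the state is a position, the action a displacement."""
    return position + displacement

trajectory = rollout(toy_model, 0.0, [1.0, 2.0, 3.0])
print(trajectory)  # [1.0, 3.0, 6.0]
```

Because each step consumes the previous prediction, errors of a learned model can compound over long rollouts.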

This paper focuses on building action models for predicting the next frame. In particular, we focus on the case where the state representation is a car camera frame.

3 Action-Conditioned Image Prediction

The SDF Model. The model we consider is the feedforward model for gray-scale images (Oh et al., 2015). The model has a single-decoder feedforward (SDF) architecture. Images are encoded using convolution layers and reshaped into a feature vector. The action is encoded into a feature vector and combined with the image feature vector into a single vector. The resulting vector is reshaped and decoded into the next frame through de-convolution layers. The merit of this model is that its architecture is simple and easy to understand.

The CDNA Model. In a recent work, Finn et al. (2016) proposed an image prediction model called Convolutional Dynamic Neural Advection (CDNA) that predicts the next frame as a mask-weighted sum of kernel-transformed images. The masks and the kernels are learned in two separate decoders that split from a shared representation after a number of convolution layers.
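The compositing step can be sketched in NumPy as follows. This is a simplified illustration under stated assumptions, not the CDNA implementation: the kernels and mask logits are given by hand instead of being produced by decoders, masks are normalized with a per-pixel softmax, and the helper names `apply_kernel` and `composite` are hypothetical.

```python
import numpy as np

def apply_kernel(image, kernel):
    """Cross-correlate a 2D image with a small odd-sized kernel, using edge padding."""
    kh, kw = kernel.shape
    padded = np.pad(image, ((kh // 2, kh // 2), (kw // 2, kw // 2)), mode="edge")
    out = np.zeros_like(image, dtype=float)
    for i in range(kh):
        for j in range(kw):
            out += kernel[i, j] * padded[i:i + image.shape[0], j:j + image.shape[1]]
    return out

def composite(prev_frame, kernels, mask_logits):
    """Blend kernel-transformed copies of prev_frame with per-pixel softmax masks."""
    transformed = np.stack([apply_kernel(prev_frame, k) for k in kernels])
    masks = np.exp(mask_logits - mask_logits.max(axis=0, keepdims=True))
    masks /= masks.sum(axis=0, keepdims=True)  # masks sum to one at every pixel
    return (masks * transformed).sum(axis=0)

frame = np.arange(16, dtype=float).reshape(4, 4)
identity = np.zeros((3, 3))
identity[1, 1] = 1.0       # leaves the frame unchanged
shift_right = np.zeros((3, 3))
shift_right[1, 0] = 1.0    # moves content one pixel to the right
logits = np.zeros((2, 4, 4))  # equal mask weight for both transformed images
blended = composite(frame, [identity, shift_right], logits)
print(blended.shape)  # (4, 4)
```

Each kernel describes one candidate motion, and the masks decide, pixel by pixel, which motion applies where.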

Another difference between CDNA and SDF is that CDNA represents the history using LSTM, whereas SDF uses a history window of observations. Using LSTM significantly increases the network size.

The SDF model was shown to predict video frames accurately in Atari games. The CDNA model was shown to predict accurately for robot arm motion, human motion and agent motion in simulators. We performed the first comparison of the two models, especially for image prediction in driving, giving new insights into the models.

We used the Comma AI dataset for evaluation. The training set contains driving logs of images and actions from January 2016 to May 2016. In total there are image-action pairs. Images have been resized into

. An action is composed of acceleration, steering angle and brake. Action controls are normalized with mean and standard deviation calculated from the dataset. The test set is driving logs in June 2016, totaling

image-action pairs. We used the Mean-Squared Error (MSE) and SSIM (Wang et al., 2004) for measuring the predicted next images by different models, averaged on the test set.
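The preprocessing and metrics described above can be sketched as follows. The helper names are hypothetical, and the SSIM function is a simplified single-window version computed over the whole image; the metric of Wang et al. (2004) averages the same expression over local windows, so the numbers from this sketch would not match the values reported in Table 1.

```python
import numpy as np

def normalize_actions(actions):
    """Standardize each action column (e.g. acceleration, steering angle, brake).

    Assumes no column is constant, so every standard deviation is non-zero.
    """
    actions = np.asarray(actions, dtype=float)
    mean = actions.mean(axis=0)
    std = actions.std(axis=0)
    return (actions - mean) / std, mean, std

def mse(pred, target):
    """Mean squared error over all pixels."""
    diff = np.asarray(pred, dtype=float) - np.asarray(target, dtype=float)
    return float(np.mean(diff ** 2))

def global_ssim(x, y, data_range=1.0):
    """Single-window SSIM over the whole image (a simplification of the windowed metric)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x = np.mean((x - mu_x) ** 2)
    var_y = np.mean((y - mu_y) ** 2)
    cov_xy = np.mean((x - mu_x) * (y - mu_y))
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return float(numerator / denominator)

# Hypothetical toy values, not taken from the dataset.
raw_actions = np.array([[0.0, 1.0, 0.0], [1.0, 3.0, 2.0], [2.0, 5.0, 4.0]])
normalized, action_mean, action_std = normalize_actions(raw_actions)
image = np.random.default_rng(0).random((16, 16))
print(mse(image, image), global_ssim(image, image))
```

The mean and standard deviation would be computed once on the training set and reused for the test set.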

Model                              MSE (×10^-4)   SSIM
CDNA (Finn et al., 2016)           3.986          0.9836
SDF (Oh et al., 2015)              23.670         0.8312
SDF-recurrent (Oh et al., 2015)    72.600         0.6498
Copy-last-frame                    79.20          0.6671
CDNA-no-current-image              6.362          0.9778
CDNA-no-skip-connection            4.933          0.9798
SDF-tiling                         7.050          0.9184
SDF-tiling-16                      3.613          0.9633

Table 1: Image prediction results on Comma AI dataset.

Figure 2: Training errors of SDF and SDF-tiling.

In our experiment, the CDNA model performed significantly better than the SDF model, as shown in Table 1. We trained the models on the same training dataset and tuned their parameters carefully. In the table, CDNA is the original model of Finn et al. (2016). CDNA achieves a remarkable SSIM of 0.9836. The predicted images are very sharp, as shown in Figure 3(a). The SDF model is the same model as in Oh et al. (2015). It achieved an SSIM of 0.8312 and an average MSE of 2.367 × 10^-3, which is much inferior to the CDNA model. We also ran SDF-recurrent, the recurrent model of Oh et al. (2015), which was less accurate than the SDF model in our experiment.

(a) CDNA sample predictions.
(b) SDF sample predictions.
(c) SDF-tiling-16 sample predictions.
Figure 3: Sample image predictions in the test set. In each of the five plots, the left column is the sequence of ground truth images (running from top to bottom), and the right column is the sequence of predicted next images. In each row of a plot, the two images are at the same time step. The first plot is a moment when the car is passing the stop line. The second plot is a moment when a car is crossing in front at night. The third plot is driving in lane at night with a car in front. The fourth shows a moment of switching lanes on a highway in daylight. The fifth shows following a van in front while closing the distance.

Four factors may have contributed to the outstanding performance of the CDNA model in our experiment:

  • The CDNA model employs two separate decoding branches.

  • The model has two skip connections from earlier layers to the mask generation layer.

  • The current image is used as one of the inputs to the kernel layer.

  • The action is encoded using tiling, instead of the vector encoding used in the SDF model.

Figure 4: Prediction at a sample time step of our model. The top plot shows the current image, the predicted next image and the ground truth. The bottom plot shows the basis images in the last layer for generating the prediction. The basis images are presented in the descending order of their weight (from the last layer of our model). Thus the weights of these basis images decrease in each row from left to right. The top-left basis image has the largest weight.

To study the individual effects of these factors, we conducted a diagnosis experiment.

Effects of Current Image and Skip Connections. CDNA-no-current-image is the CDNA model without the current image connection used in generating one of the transformed images (using its corresponding kernel); it achieves an SSIM of 0.9778. CDNA-no-skip-connection is the CDNA model without the skip connections; it achieves an SSIM of 0.9798. This shows that leaving out the current image connection to the kernel generation layer or removing the skip connections has little effect on the performance of the CDNA model: relative to the original CDNA (SSIM of 0.9836 in Table 1), SSIM degrades by only 0.0058 and 0.0038, respectively.

Effect of Action Tiling. To find out whether action tiling was effective, we designed a single-decoder feedforward model with action tiling encoding (called SDF-tiling), shown in Figure 1. Our SDF-tiling model performed much better than SDF, with an SSIM of 0.9184 and an average MSE of 0.7050 × 10^-3.

This shows that tiling encoding of the action is much more effective than (dense) vector encoding. The reason is that the shapes in the convolutional features (e.g., in Figure 1 the convolution layer with shape before merging with actions) are distorted less, because action tiling provides a uniform weighting for the shapes. The vector encoding of actions introduces a non-uniform weighting of the convolution features, which adds noise to the shapes. In addition, the SDF model has four fully connected layers between the convolution encoding layers and the de-convolution decoding layers (see Figure 9(a) of Oh et al. (2015)). This introduces extra noise into the 2D shapes in the output of the convolution encoding layers. Altogether, the noise from the action features and the fully connected layers is propagated to the de-convolution layers and causes blurry predicted images, as shown in Figure 3(b).
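A minimal NumPy sketch of action tiling is given below. It assumes that tiling means broadcasting each action component over the spatial grid and concatenating the resulting constant planes to the feature map as extra channels; the exact merge used in Figure 1 may differ, and the function name `tile_action` and the shapes are hypothetical.

```python
import numpy as np

def tile_action(feature_map, action):
    """Append one constant channel per action component to a (C, H, W) feature map."""
    _, height, width = feature_map.shape
    action = np.asarray(action, dtype=feature_map.dtype)
    # Every spatial location receives the same action values (uniform weighting).
    tiled = np.broadcast_to(action[:, None, None], (action.size, height, width))
    return np.concatenate([feature_map, tiled], axis=0)

features = np.zeros((4, 5, 6))            # hypothetical convolutional features
action = [0.2, -1.0, 0.0]                 # e.g. normalized acceleration, steering angle, brake
merged = tile_action(features, action)
print(merged.shape)  # (7, 5, 6)
```

Because the spatial layout of the feature map is untouched, later convolution layers still see the 2D structure of the scene alongside the action.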

Window size. Both SDF and SDF-tiling use a window size parameter . Objects on the road move much faster than in games, so we need a longer history to track them. We increased the window size to , and the performance of SDF-tiling on the test set improved to an MSE of 0.3613 × 10^-3 (better than CDNA) and an SSIM of 0.9633 (SDF-tiling-16 in Table 1).

Network Size. The following table shows the model sizes.

Model                              # parameters
CDNA (Finn et al., 2016)           12,661,803
SDF (Oh et al., 2015)              37,237,825
SDF-recurrent (Oh et al., 2015)    70,800,449
SDF-tiling                         958,400
SDF-tiling-16                      986,048
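The table does not break the counts down by layer. As a rough illustration of why a model with fully connected layers between encoder and decoder (as the SDF model has) can be much larger than a mostly convolutional one, the sketch below counts parameters for one layer of each kind. The layer sizes are made up for illustration and are not taken from any of the models above; attributing the size gap to this effect is our reading, not a claim made in the experiments.

```python
def conv2d_params(in_channels, out_channels, kernel_size):
    """Weights plus biases of a 2D convolution with a square kernel."""
    return kernel_size * kernel_size * in_channels * out_channels + out_channels

def dense_params(in_features, out_features):
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features

# Hypothetical layer sizes, chosen only for illustration.
conv_count = conv2d_params(in_channels=64, out_channels=64, kernel_size=3)
dense_count = dense_params(in_features=64 * 10 * 10, out_features=2048)
print(conv_count)   # 36928
print(dense_count)  # 13109248
```

A convolution shares its weights across all spatial positions, while a fully connected layer on a flattened feature map needs one weight per input-output pair.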

Basis Image Learned. The last layer of SDF-tiling is a linear operation that combines the basis images into the final prediction with trainable weights $w_k$:

$$ \hat{\phi}_{t+1} = \sum_{k=1}^{K} w_k B_k $$

where $\hat{\phi}_{t+1}$ is the predicted next frame and $\{B_k\}_{k=1}^{K}$ is the collection of the $K$ basis images (the output of the second-to-last layer). In Table 1, SDF-tiling-16 used the maximum value for $K$ according to matrix low-rank approximation.
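The linear combination of basis images can be sketched in NumPy as below. The array shapes, the absence of a bias term, and the helper names are assumptions for illustration; the ordering helper mirrors how Figure 4 presents basis images by decreasing weight.

```python
import numpy as np

def combine_basis(basis_images, weights):
    """Weighted sum of K basis images of shape (K, H, W) with weights of shape (K,)."""
    basis_images = np.asarray(basis_images, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return np.tensordot(weights, basis_images, axes=1)

def order_by_weight(weights):
    """Indices of the basis images sorted by decreasing weight."""
    return np.argsort(-np.asarray(weights, dtype=float))

# Hypothetical toy basis: three constant 2x3 images.
basis = np.stack([np.full((2, 3), 1.0), np.full((2, 3), 2.0), np.full((2, 3), 4.0)])
weights = np.array([0.5, 1.0, 0.25])
prediction = combine_basis(basis, weights)  # every pixel is 0.5*1 + 1.0*2 + 0.25*4 = 3.5
order = order_by_weight(weights)
print(prediction.shape, order)
```

Using fewer basis images restricts the prediction to a lower-dimensional span, which is the low-rank view referred to above.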

In practice, using a smaller number of basis images may generalize well too. To test this, we ran an SDF-tiling with equal to . We also reduced the depth of the other two de-convolution layers to . The resulting model still gave a decent performance, with an MSE of (better than CDNA) and an SSIM of on the test set. Figure 4 shows the learned basis images for a sampled time step for this model. This model has only million parameters.

Parameters. The best learning rates for training the above models all happen to be . All the models were trained for epochs.

The training errors of the SDF and SDF-tiling models are shown in Figure 2. The MSE curve of SDF is nearly flat after 40 epochs. Using a learning rate of leads to divergence for SDF.

Our SDF-tiling model used a kernel size of , the Adam optimizer, and ReLU activation.

4 Conclusion

In this paper, we studied the problem of action-conditioned next image prediction. The problem is particularly important for developing model-based reinforcement learning algorithms that can drive a car from raw image observations. We compared two recent popular models for next image prediction, especially for predicting the next camera view in driving. We found that the CDNA model, originally demonstrated for robot arm operation (Finn et al., 2016), performed extremely well, while the feedforward and recurrent models developed in the Atari games setting (Oh et al., 2015) failed to give comparable performance. We ran diagnosis experiments and found that action tiling encoding is the most important factor for accurate next image prediction. Our proposed model combines the best of CDNA and the feedforward model, achieving 0.3613 × 10^-3 MSE (better than CDNA) and 0.9633 SSIM with a small network of fewer than one million parameters.


  • Bakker et al. (1987) Egbert Bakker, Lars Nyborg, and Hans B Pacejka. Tyre modelling for use in vehicle dynamics studies. Technical report, SAE Technical Paper, 1987.
  • Bojarski et al. (2016) Mariusz Bojarski, Davide Del Testa, Daniel Dworakowski, Bernhard Firner, Beat Flepp, Prasoon Goyal, Lawrence D. Jackel, Mathew Monfort, Urs Muller, Jiakai Zhang, Xin Zhang, Jake Zhao, and Karol Zieba. End to end learning for self-driving cars. CoRR, abs/1604.07316, 2016.
  • Brubaker et al. (2009) Marcus Brubaker, Leonid Sigal, and David Fleet. Estimating contact dynamics. In Proceedings of the IEEE International Conference on Computer Vision, pages 2389–2396, 2009.
  • Chen et al. (2015) Chenyi Chen, Ari Seff, Alain L. Kornhauser, and Jianxiong Xiao. DeepDriving: Learning affordance for direct perception in autonomous driving. In ICCV, 2015.
  • Coates et al. (2008) Adam Coates, Pieter Abbeel, and Andrew Y. Ng. Learning for control from multiple demonstrations. In Proceedings of the 25th International Conference on Machine Learning, ICML ’08, pages 144–151, New York, NY, USA, 2008. ACM. ISBN 978-1-60558-205-4. doi: 10.1145/1390156.1390175.
  • Finn et al. (2016) Chelsea Finn, Ian J. Goodfellow, and Sergey Levine. Unsupervised learning for physical interaction through video prediction. CoRR, abs/1605.07157, 2016.
  • Goodfellow et al. (2014) Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative Adversarial Networks. June 2014.
  • Grünewälder et al. (2012) Steffen Grünewälder, Guy Lever, Luca Baldassarre, Massimiliano Pontil, and Arthur Gretton. Modelling transition dynamics in MDPs with RKHS embeddings. In ICML, pages 1603–1610, USA, 2012. Omnipress. ISBN 978-1-4503-1285-1.
  • Kong et al. (2015) Jason Kong, Mark Pfeiffer, Georg Schildbach, and Francesco Borrelli. Kinematic and dynamic vehicle models for autonomous driving control design. In Intelligent Vehicles Symposium (IV), 2015 IEEE, pages 1094–1099. IEEE, 2015.
  • Levinson et al. (2011) Jesse Levinson, Jake Askeland, Jan Becker, Jennifer Dolson, David Held, Soeren Kammel, J Zico Kolter, Dirk Langer, Oliver Pink, Vaughan Pratt, et al. Towards fully autonomous driving: Systems and algorithms. In Intelligent Vehicles Symposium (IV), 2011 IEEE, pages 163–168. IEEE, 2011.
  • Lotter et al. (2016) Bill Lotter, Gabriel Kreiman, and David Cox. Deep predictive coding networks for video prediction and unsupervised learning. In ICLR, 2016.
  • Mnih et al. (2015) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 02 2015.
  • Modayil and Sutton (2014) Joseph Modayil and Richard S. Sutton. Prediction driven behavior: Learning predictions that drive fixed responses. In AAAI Workshop on AI and Robotics, 2014.
  • Ng et al. (2003) Andrew Y Ng, H Jin Kim, Michael I Jordan, Shankar Sastry, and Shiv Ballianda. Autonomous helicopter flight via reinforcement learning. In NIPS, volume 16, 2003.
  • Oh et al. (2015) Junhyuk Oh, Xiaoxiao Guo, Honglak Lee, Richard L Lewis, and Satinder Singh. Action-Conditional Video Prediction using Deep Networks in Atari Games. In C Cortes, N D Lawrence, D D Lee, M Sugiyama, and R Garnett, editors, NIPS 28, pages 2845–2853. Curran Associates, Inc., 2015.
  • Omar et al. (1998) T Omar, A Eskandarian, and N Bedewi. Vehicle crash modelling using recurrent neural networks. Mathematical and Computer Modelling, 28(9):31–42, 1998.
  • Paden et al. (2016) Brian Paden, Michal Cáp, Sze Zheng Yong, Dmitry S. Yershov, and Emilio Frazzoli. A survey of motion planning and control techniques for self-driving urban vehicles. CoRR, abs/1604.07446, 2016.
  • Pomerleau (1989) Dean A. Pomerleau. Alvinn, an autonomous land vehicle in a neural network. Technical report, Carnegie Mellon University, 1989.
  • Rajamani (2011) Rajesh Rajamani. Vehicle dynamics and control. Springer Science & Business Media, 2011.
  • Santana and Hotz (2016) Eder Santana and George Hotz. Learning a driving simulator, 2016.
  • Silver et al. (2016) David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Vedavyas Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy P. Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel, and Demis Hassabis. Mastering the game of go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016. doi: 10.1038/nature16961.
  • Silver et al. (2017) David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, Yutian Chen, Timothy Lillicrap, Fan Hui, Laurent Sifre, George van den Driessche, Thore Graepel, and Demis Hassabis. Mastering the game of go without human knowledge. Nature, 550:354–359, 10 2017.
  • Sutton et al. (2008) Richard Sutton, Csaba Szepesvari, Alborz Geramifard, and Michael Bowling. Dyna-style planning with linear function approximation and prioritized sweeping. In UAI, pages 528–536, 2008.
  • Urmson et al. (2007) Chris Urmson, J Andrew Bagnell, Christopher R Baker, Martial Hebert, Alonzo Kelly, Raj Rajkumar, Paul E Rybski, Sebastian Scherer, Reid Simmons, Sanjiv Singh, et al. Tartan racing: A multi-modal approach to the darpa urban challenge. 2007.
  • Walker et al. (2014) Jacob Walker, Abhinav Gupta, and Martial Hebert. Patch to the future: Unsupervised visual prediction. In CVPR, 2014.
  • Wang et al. (2004) Zhou Wang, Alan C. Bovik, Hamid R. Sheikh, and Eero P. Simoncelli. Image quality assessment: From error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004.
  • Yao and Szepesvári (2012) Hengshuai Yao and Csaba Szepesvári. Approximate policy iteration with linear action models. In AAAI, pages 1212–1217, 2012.
  • Yim and Oh (2004) Young Uk Yim and Se-Young Oh. Modeling of vehicle dynamics from real vehicle measurements using a neural network with two-stage hybrid learning for accurate long-term prediction. IEEE Transactions on Vehicular Technology, 53(4):1076–1084, 2004.