Consider the images shown in Figure 1. Given the girl in front of the cake, we humans can easily predict that her head will move downward to extinguish the candle. The man with the discus is in a position to twist his body strongly to the right, and the squatting man on the bottom has nowhere to move but up. Humans have an amazing ability to not only recognize what is present in the image but also predict what is going to happen next. Prediction is an important component of visual understanding and cognition. In order for computers to react to their environment, simple activity detection is not always sufficient. For successful interactions, robots need to predict the future and plan accordingly.
Figure 1: (a) Input Image; (b) Prediction.
There has been some recent work focused on this task. The most common approach to this prediction problem is a planning-based, agent-centric one: an object  or a patch  is modeled as an agent that performs actions based on its current state and the goal state. Each action is chosen based on its compatibility with the environment and how it helps the agent move closer to the goal state. Priors on actions are modeled via transition matrices. Such an approach has been shown to produce impressive results: predicting trajectories of humans in parking lots  or hallucinating car movements on streets . There are two main problems with this approach. First, the predictions are sparse, and the motion is still modeled as a trajectory. Second, and more importantly, these approaches have only been shown to perform in restrictive domains such as parking lots or streets.
In this paper, we take the next step towards generalized prediction: a framework that can be learned from tens of thousands of realistic videos. This framework works in indoor and outdoor environments, and it can account for one or multiple agents, whether the agent is an animal, a human, or even a car. Specifically, this framework addresses the task of motion prediction: given a static image, we predict the dense expected optical flow as if this image were part of a video. This optical flow represents how and where each and every pixel in the image is going to move in the future. Motion prediction, however, is more than identifying active agents; it is also highly dependent on context. For example, someone's entire body may move up or down if they are jump-roping, but most of the body will be stationary if they are playing the flute. Instead of modeling agents and their context separately under restrictive assumptions, we use a learning-based approach for motion prediction. Specifically, we train a deep network that can incorporate all of this contextual information to make accurate predictions of future motion in a wide variety of scenes. We train our model on thousands of realistic videos from the UCF-101  and HMDB-51  datasets.
Contributions: Our paper makes three contributions. First, we present a CNN model for motion prediction: given a static image, our CNN predicts expected motion in terms of optical flow. Our CNN-based model is agent-free and makes almost no assumptions about the underlying scene; we therefore show experimental results on a diverse set of scenes. Second, our CNN model achieves state-of-the-art performance on prediction compared to contemporary approaches. Finally, we present a proof-of-concept extension of the CNN model that makes long-range predictions about future motion. Our preliminary results indicate that this new CNN model may indeed be promising even for the task of long-range prediction.
Prediction has caught the interest of the vision community in recent years. Most research in this area has looked at different aspects of the problem. The first aspect of interest is the output space of prediction. Some of the initial work in this area focused on predicting the trajectory for a given input image . Others have looked at more semantic forms of prediction, that is, predicting the action class of what is going to happen next [5, 14]. However, one issue with semantic prediction is that it tells us nothing about the future action beyond its category. One of our goals in prediction is to go beyond classification and predict the spatial layout of future actions. For example, in the case of agents such as humans, the output space of prediction can be the trajectories themselves . More recently, approaches have argued for much richer forms of prediction, even in terms of pixels  or the features of the next frame [7, 19].
The other aspect of research in visual prediction concerns selecting the right approach for prediction. There have been two classes of approaches to temporal prediction. The first is data-driven and non-parametric; non-parametric approaches make no assumptions about the underlying scene. For example,  simply retrieves videos visually similar to the static scene, allowing a warping  of the matched action into the scene. The other end of the spectrum is parametric, domain-specific approaches. Here, assumptions are made about the active elements in the scene, whether they are cars or people. Once the assumption is made, a model is developed to predict agent behavior. This includes forecasting pedestrian trajectories , human-human interactions [7, 14], human expressions through SOSVM , and human-object interaction through graphical models [11, 3].
Some recent work in this area has taken a more hybrid approach. For example, Walker et al.  build a data-derived dictionary of rigid objects given a video domain and then make long-term motion and appearance predictions using a transition and context model. Recent approaches such as  and  have even trained convolutional neural networks to predict one future frame in a clip  or the motion of handwritten characters .
We make multiple advances over previous work in this paper. First, our self-supervised method can generalize across a large number of diverse domains. While  does not explicitly require video labels, it is still domain-dependent, requiring a human-given distinction between videos inside and outside the domain. In addition,  focused only on bird's-eye domains where scene depth was limited or nonexistent, while our method is able to generalize to scenes with perspective.
The approach of  also uses self-supervised methods to train a Structured Random Forest for motion prediction. However, the authors learn a model only from the simple KTH dataset. We show that our method is able to learn from a set of videos that is far more diverse across scenes and actions. In addition, we demonstrate that much better generalization can be obtained compared to the nearest-neighbor approach of Yuen et al. .
Convolutional Neural Networks: We show in this paper that a convolutional neural network can be trained for the task of motion prediction in terms of optical flow. Current work on CNNs has largely focused on recognition tasks, both in images and video [12, 9, 4, 23, 29, 20, 22]. There has been some initial work combining CNNs with recurrent models for prediction. For example,  uses an LSTM  to predict the immediate next frame given a video input, and  uses a recurrent architecture to predict the motion of handwritten characters from a video. In contrast, our approach predicts motion for each and every pixel from a static image, for any generic scene.
Figure 2: (a) Input Image; (b) Prediction; (c) Ground Truth.
Our goal is to learn a mapping between the input RGB image and the output space which corresponds to the predicted motion of each and every pixel in terms of optical flow. We propose to use CNNs as the underlying learning algorithm for this task. However, there are a few questions that need to be answered: what is a good output space, and what is a good loss function? Should we model optical flow prediction as a regression or a classification problem? What is a good architecture to solve this problem? We now discuss these issues below in detail.
3.1 Regression as Classification
Intuitively, motion estimation can be posed as a regression problem, since the output space is continuous. Indeed, this is exactly the approach used in , where the authors used structured random forests to regress the magnitude and direction of the optical flow. However, such an approach has one drawback: the output tends to be smoothed toward the mean. Interestingly, in the related regression problem of surface normal prediction, researchers have proposed reformulating structured regression as a classification problem [26, 17]. Specifically, they quantize the surface normal vectors into a codebook of clusters, and the output space becomes predicting cluster membership. In our work, we take a similar approach: we quantize optical flow vectors into 40 clusters by k-means. We can then treat the problem in a manner similar to semantic segmentation, classifying each region of the image as a particular cluster of optical flow. We use a softmax loss layer at the output for computing gradients.
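The quantization step above can be sketched in a few lines of NumPy. This is a minimal stand-in, not the paper's implementation: the function names are illustrative, and a production system would use a library k-means.

```python
import numpy as np

def build_flow_codebook(flow_vectors, k=40, iters=20, seed=0):
    """Plain k-means over 2-D optical flow vectors (k=40 is the
    codebook size assumed in the text above)."""
    rng = np.random.default_rng(seed)
    centers = flow_vectors[rng.choice(len(flow_vectors), k, replace=False)]
    for _ in range(iters):
        # assign each flow vector to its nearest cluster center
        d = np.linalg.norm(flow_vectors[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # recompute centers, leaving empty clusters where they are
        for c in range(k):
            if np.any(labels == c):
                centers[c] = flow_vectors[labels == c].mean(axis=0)
    return centers

def quantize_flow(flow, centers):
    """Map an (H, W, 2) flow field to an (H, W) map of cluster ids."""
    h, w, _ = flow.shape
    v = flow.reshape(-1, 2)
    d = np.linalg.norm(v[:, None, :] - centers[None, :, :], axis=2)
    return d.argmin(axis=1).reshape(h, w)
```

The quantized map can then be used directly as a per-pixel classification target.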
However, at test time, we create a soft output by considering the underlying distribution over all the clusters, taking a probability-weighted sum over all the classes at a given pixel for the final output. Transforming the problem into classification also leads directly to a discrete probability distribution over vector directions and magnitudes. As the problem of motion prediction can be ambiguous depending on the image (see Figure 3), we can utilize this probability distribution over directions to measure how informative our predictions are. We may be unsure whether the man in Figure 3 is sitting down or standing up given only the image, but we can be quite sure he will not turn right or left. In the same way, our network can rank upward- and downward-facing clusters much higher than other directions. Even if the ground truth is upward and the highest-ranked cluster is downward, it may be that the second-highest cluster is also upward. Because the receptive fields are shared by the top-layer neurons, the output tends toward a globally coherent movement. A discrete probability distribution, obtained through classification, allows an easier understanding of how well our network may be performing.
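The soft test-time output described above reduces to a probability-weighted sum of codebook vectors. A minimal sketch, assuming an (H, W, C) softmax output and a (C, 2) codebook (shapes are our assumption, not the paper's stated layout):

```python
import numpy as np

def expected_flow(probs, centers):
    """Soft test-time output: per-pixel expectation of the flow
    codebook under the predicted class distribution.

    probs:   (H, W, C) softmax probabilities per pixel
    centers: (C, 2) codebook of quantized flow vectors
    returns: (H, W, 2) expected flow field
    """
    return probs @ centers  # sums probability-weighted cluster vectors
```

A one-hot distribution recovers the corresponding codebook vector exactly; a uniform distribution recovers the codebook mean.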
3.2 Network Design
Our model is similar to the standard seven-layer AlexNet architecture of . To simplify the description, we denote a convolutional layer as C(k, s), which indicates that there are k kernels, each having size s×s. During convolution, we set all the strides to 1 except for the first layer, which uses a stride of 4. We denote the local response normalization layer as LRN and the max-pooling layer as MP; the stride for pooling is 2, and we set the pooling operator size to 3×3. Finally, F(n) denotes a fully connected layer with n neurons. Our network architecture can then be described as: C(96,11) → LRN → MP → C(256,5) → LRN → MP → C(384,3) → C(384,3) → C(256,3) → MP → F(4096) → F(4096). We used a modified version of the popular Caffe toolbox  for our implementation. For computational simplicity, we use 200×200 windows as input. We used a learning rate of 0.0001 and a step size of 50,000 iterations; other network parameters were set to their defaults, with the exception that we used Xavier initialization of parameters. Instead of using the default softmax output, we used a spatial softmax loss function from  to classify every region in the image. This leads to an M×N×C softmax layer, where M is the number of rows, N is the number of columns, and C is the number of clusters in our codebook. We used M = 20, N = 20, and C = 40, for a softmax layer of 16,000 neurons. Our softmax loss is spatial, summing over all the individual region losses. Let I represent the image and Y the ground-truth optical flow labels represented as quantized clusters. Then our spatial loss function is:

$$L(I, Y) = -\sum_{i=1}^{M \times N} \sum_{c=1}^{C} \mathbb{1}(y_i = c)\, \log F_{i,c}(I),$$

where $F_{i,c}(I)$ represents the probability that the $i$th pixel will move according to cluster $c$, and $\mathbb{1}(\cdot)$ is an indicator function.
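For concreteness, the spatial loss can be sketched in NumPy. This is a reference implementation for checking the definition, not the Caffe layer itself:

```python
import numpy as np

def spatial_softmax_loss(scores, labels):
    """Spatial classification loss: an independent softmax
    cross-entropy at every output region, summed over the image.

    scores: (M, N, C) raw network outputs per region
    labels: (M, N) integer ground-truth cluster per region
    """
    # numerically stable log-softmax over the cluster dimension
    s = scores - scores.max(axis=2, keepdims=True)
    log_probs = s - np.log(np.exp(s).sum(axis=2, keepdims=True))
    m, n = labels.shape
    # pick out the log-probability of the true cluster at each region
    picked = log_probs[np.arange(m)[:, None], np.arange(n)[None, :], labels]
    return -picked.sum()
```

With uniform scores, each of the M·N regions contributes log C, which gives a quick sanity check.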
Data Augmentation: For many deep networks, datasets which are insufficiently diverse or too small will lead to overfitting.  and  show that training directly on datasets such as UCF-101 for action classification leads to overfitting, as there are only on the order of tens of thousands of videos in the dataset. However, our problem of single-frame prediction is different from this task. We find that we are able to build a generalizable representation for prediction by training our model over 350,000 frames from the UCF-101 dataset as well as over 150,000 frames from the HMDB-51 dataset. We benefit additionally from data augmentation: we randomly flip each image and use randomly cropped windows, and for each input we mirror or flip the respective flow labels to match. In this way we avoid spatial biases (such as humans always appearing in the middle of the image) and train a general model on a far smaller set of videos than is needed for recognition tasks.
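One subtlety when mirroring a flow label is that the horizontal flow component must be negated along with the flip. A minimal sketch (the HWC array layout is our assumption):

```python
import numpy as np

def horizontal_flip(image, flow):
    """Mirror an image and its dense flow label together. The flow's
    horizontal component is negated, since pixels moving left in the
    original move right in the mirrored frame."""
    image_f = image[:, ::-1].copy()
    flow_f = flow[:, ::-1].copy()
    flow_f[..., 0] = -flow_f[..., 0]  # negate the x component
    return image_f, flow_f
```

Forgetting the negation silently corrupts half of the training labels, so this is worth asserting in any data pipeline.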
Labeling: We automatically label our training dataset with an optical flow algorithm; we chose the publicly available implementation of DeepFlow  to compute optical flow. The UCF-101 and HMDB-51 datasets use realistic, sometimes low-quality videos from a wide variety of sources, and they often suffer from compression artifacts. We therefore make our labels less noisy by taking the average optical flow over the five future frames for each image. The videos in these datasets are also unstabilized.  showed that action recognition can be greatly improved with camera stabilization. In order to further denoise our labels, we wish to focus on the motion of objects inside the image, not the camera motion. We thus use the stabilization portion of the implementation of  to automatically stabilize videos using an estimated homography.
Figure 4: (a) Input Image; (b) baseline; (c) Ours; (d) Ground Truth.
For our experiments, we focused mainly on two datasets, UCF-101 and HMDB-51, which have been popular for action recognition. For both datasets, we compared against baselines using 3-fold cross-validation with the splits specified by the dataset organizers. We also evaluated our method on the KTH dataset , using the exact same configuration as  with DeepFlow. Because the KTH dataset is very small for a CNN, we finetuned our UCF-101-trained network on its training data. For training, we subsampled frames by a factor of 5; for testing, we sampled 26,000 frames per split. For our comparison with AlexNet finetuning, we used a split which incorporated a larger portion of the training data; we will release this split publicly. We used three baselines for evaluation. First, we used the technique of , an SRF approach to motion prediction: we took their publicly available implementation and trained a model with their default parameters (because of the much larger size of our datasets, we had to sample SIFT patches less densely). Second, we used a Nearest-Neighbor baseline using both fc7 features from the pre-trained AlexNet network and pooled-5 features. Finally, we compare unsupervised training from scratch with finetuning from the supervised AlexNet network.
Table 1: Single-image evaluation using the 3-fold split on UCF-101. Ours-HMDB represents our network trained only on HMDB data. The Canny suffix denotes evaluation over pixels on Canny edges, and the NZ suffix denotes evaluation over moving pixels according to the ground truth. NN represents a nearest-neighbor approach. Dir and Orient denote the direction and orientation metrics, respectively. For EPE, lower is better; for the other metrics, higher is better. With the exception of Orient-NZ against both NN features, all differences against our model are significant at the 5% level under a paired t-test.
Table 4: Pretrained vs. from scratch. We compare finetuning from ImageNet features to a randomly initialized network on UCF-101. Orient denotes the orientation metric; NZ and Canny denote evaluation over non-zero (moving) and Canny-edge pixels, respectively.
4.1 Evaluation Metrics
Because of the complexity and the sometimes high level of label ambiguity in motion prediction, we use a variety of metrics to evaluate our method and the baselines. Following , we use traditional End-Point Error (EPE), measuring the Euclidean distance between the estimated optical flow vector and the ground truth vector. In addition, given a predicted vector $x$ and a ground-truth vector $y$, we also measure direction similarity using the cosine similarity, $\frac{x \cdot y}{\|x\|\,\|y\|}$, and orientation similarity (angle taken on the half-circle), $\frac{|x \cdot y|}{\|x\|\,\|y\|}$. The orientation similarity measures how parallel the predicted optical flow vector is to the ground truth optical flow vector. Some motions may be strictly left-right or up-down, but the exact direction may be ambiguous; this measure accounts for that situation.
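These three metrics follow directly from their definitions; a NumPy sketch (the small eps guard against zero-magnitude vectors is our addition, not part of the original metrics):

```python
import numpy as np

def epe(pred, gt):
    """End-Point Error: Euclidean distance between flow vectors."""
    return np.linalg.norm(pred - gt, axis=-1)

def direction_similarity(pred, gt, eps=1e-8):
    """Cosine similarity between predicted and ground-truth flow."""
    dot = (pred * gt).sum(axis=-1)
    norms = np.linalg.norm(pred, axis=-1) * np.linalg.norm(gt, axis=-1)
    return dot / (norms + eps)

def orientation_similarity(pred, gt, eps=1e-8):
    """Absolute cosine (angle on the half-circle), so exactly opposite
    vectors still count as parallel."""
    return np.abs(direction_similarity(pred, gt, eps))
```

For example, a prediction pointing exactly opposite the ground truth scores -1 on direction but 1 on orientation.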
We choose these metrics as established by earlier work. However, we also add metrics to account for the level of ambiguity in many of the test images. As  notes, EPE is a poor metric in cases where motion is small and may reasonably proceed in more than one direction. We thus additionally look at the underlying distribution of the predicted classes to understand how well the algorithm accounts for this ambiguity. For instance, if we are shown an image as in Figure 3, it is unknown whether the man will move up or down; it is certainly the case, however, that he will not move right or left. Given the probability distribution over the quantized flow clusters, we check whether the ground truth is within the top-N most probable clusters. For the implementation of , we create an estimated probability distribution by quantizing the regression output from all the trees and then, for each pixel, bin-counting the clusters over the trees. For Nearest-Neighbor, we take the top-N matched frames and use the matched clusters in each pixel as our top-N ranking. We evaluate over the mean rank of all pixels in the image.

Following , we also evaluate over the Canny edges. Because of the simplicity of the datasets in , Canny edges were a good approximation for measuring the error on the pixels of moving objects in the scene. However, our data includes highly cluttered scenes with multiple non-moving objects. In addition, we find that our network is very effective at distinguishing moving from non-moving elements in the scene, and that the difference between the overall pixel mean and the Canny edges is very small across all metrics and baselines. Thus, we also evaluate over the moving pixels according to the ground truth. Moving pixels here include all clusters in our codebook except the vector of smallest magnitude.
While unfortunately this metric depends on the choice of codebook, we find that the greatest variation in performance and ambiguity lies in predicting the direction and magnitude of the active elements in the scene.
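The top-N evaluation described above can be sketched as follows; this is a simplified per-image version, and the function name and array shapes are illustrative:

```python
import numpy as np

def mean_topn_hit(probs, labels, n=5):
    """Fraction of pixels whose ground-truth cluster is among the
    n most probable clusters predicted for that pixel.

    probs:  (H, W, C) predicted distribution per pixel
    labels: (H, W) ground-truth cluster ids
    """
    # rank clusters per pixel, largest probability first
    order = np.argsort(-probs, axis=2)
    topn = order[..., :n]
    hits = (topn == labels[..., None]).any(axis=2)
    return hits.mean()
```

Averaging this score over all test images gives a metric that rewards a model for placing probability mass on plausible motions even when the single most likely cluster is wrong.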
4.2 Qualitative Results
Figure 4 shows some of our qualitative results. For single-frame prediction, our network is able to predict motion in many different contexts. We find that while  is able to make reasonable predictions on KTH, its qualitative performance collapses once the complexity and size of the dataset increase. Although most of our data consists of human actions, our model can generalize beyond simply detecting general motion on humans. Our method successfully predicts the falling of the ocean wave in the second row, and it predicts the motion of the entire horse in the first row. Furthermore, our network can specialize its motion prediction to the action being performed. For the man playing guitar and the man writing on the wall, the arm is the most salient part to be moved; for the man walking the dog and the man doing a pushup, the entire body will move according to the action.
4.3 Quantitative Results
UCF-101 and HMDB: We show in Tables 1 and 2 that our method outperforms both the Nearest-Neighbor and SRF-based baselines by a large margin on most metrics, and this holds for both datasets. Interestingly, the SRF-based approach comes close to ours on End-Point Error on all datasets but is heavily outperformed on all other metrics. This is largely a product of the End-Point-Error metric itself, as we find that the SRF tends to output the mean (optical flow with very small magnitude). This is consistent with the results found in , where actions with low, bidirectional motion can result in higher EPE than predicting no motion at all. When we account for this ambiguity in motion with the top-N metric, however, the difference in performance is large. The most dramatic differences appear over the non-zero pixels. This is because most pixels in the image are not going to move, so an algorithm that outputs small or zero motion over the entire image will appear to perform artificially well without accounting for the moving objects.
KTH: For KTH, in Table 3,  is close to our method in EPE and orientation, but its Top-N performance suffers greatly because it often outputs vectors of correct direction but incorrect magnitude. In absolute terms, our method performs well on this simple dataset, with the network predicting the correct cluster in the large majority of cases.
Cross-dataset: As both the UCF-101 and HMDB datasets are curated by humans, it is important to determine how well our method generalizes beyond the structure of a particular dataset. In Table 1 we show that training on HMDB (Ours-HMDB) and testing on UCF-101 leads to only a small drop in performance. Likewise, training on UCF-101 (Ours-UCF101) and testing on HMDB (Table 2) shows little performance loss.
Pretraining: We train our representation in a self-supervised manner, using no semantic information. However, do human labels help? We compared finetuning from supervised features pretrained on ImageNet to a randomly initialized network trained only on self-supervised data. The pretrained network has been exposed to far more diverse data and has been trained on explicitly semantic information. However, we find in Table 4 that the pretrained network yields only a very small improvement in performance.
Stabilization: How robust is the network to camera motion? We explicitly stabilized the camera in our training data so that the network would focus on moving objects rather than camera motion itself. In Table 5 we compare networks trained on data with and without stabilization. Testing on stabilized data, we find that even without camera stabilization the difference in performance is small.
5 Multi-Frame Prediction
Figure 5: (a) Input Image; (b) Frame 1; (c) Frame 2; (d) Frame 3; (e) Frame 4; (f) Frame 5.
Until now we have described an architecture for predicting optical flow given a static image as input. However, it would be interesting to predict not just the next frame but several seconds into the future. How should we design such a network?
We present a proof-of-concept network to predict six future frames. In order to predict multiple frames into the future, we take our pre-trained single-frame network and feed the output of its seventh feature layer into a "temporally deep" network, using the implementation of . This network architecture is the same as an unrolled recurrent neural network, with some important differences. At a high level, our network is similar to the unfactored architecture of , with each sequence step having access to the image features and the previous hidden states in order to predict the next state. We replace the LSTM module with a fully connected layer, as in an RNN. However, we also do not use a true recurrent network: the weights for each sequence layer are not shared, and each step has access to all of the past hidden states. We used 2,000 hidden states in our network, and we predict at most six future steps. We attempted to use recurrent architectures with the publicly available LSTM implementation from ; however, in our experiments they always regressed to a mean trajectory across the data. Our fully connected network has a much higher number of parameters than an RNN, which highlights the inherent difficulty of this task. Due to the much larger size of the state space, we do not predict optical flow for each and every pixel. Instead, we use k-means to create a codebook of 1,000 possible optical flow frames, and we predict one of these 1,000 classes at each time step. This can be thought of as analogous to a sequential prediction problem such as caption generation: instead of a sequence of words, our "words" are clusters of optical flow frames, and our "sentence" is an entire trajectory. We used a fixed number of steps, six, in our experiments, with each frame representing the average optical flow over one-sixth of a second.
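The unrolled, non-recurrent design described above can be illustrated with a toy NumPy sketch. The shapes, the tanh nonlinearity, and all names here are illustrative assumptions, not the paper's exact implementation; the point is only that each step owns its own weight matrix and sees the image features plus every past hidden state.

```python
import numpy as np

def unrolled_predict(img_feat, weights, n_steps=6):
    """Unrolled sequence model with unshared per-step weights.

    img_feat: (d,) image feature vector (e.g. a fc7 output)
    weights:  list of n_steps matrices; step t has shape
              (hidden, d + t * hidden), since it also consumes
              all t previous hidden states
    returns:  list of per-step hidden state vectors
    """
    hidden = []
    for t in range(n_steps):
        # concatenate image features with every past hidden state
        inp = np.concatenate([img_feat] + hidden)
        h = np.tanh(weights[t] @ inp)
        hidden.append(h)
    return hidden
```

Each hidden state would then feed a classifier over the 1,000-entry codebook of flow frames; with unshared weights, the parameter count grows with the number of unrolled steps, unlike a true RNN.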
6 Conclusion
In this paper we have presented an approach to generalized prediction in static scenes. By using an optical flow algorithm to label the data, we can train this model on a large number of unlabeled videos. Furthermore, our framework utilizes the success of deep networks to outperform contemporary approaches to motion prediction. We find that our network successfully predicts motion based on the context of the scene and the stage of the action taking place. Future work includes incorporating this motion model into predicting semantic action labels in images and video. Another possible direction is to utilize the predicted optical flow to predict in raw pixel space, synthesizing a video from a single image.
Acknowledgements: We thank Xiaolong Wang for many helpful discussions. We thank the NVIDIA Corporation for the donation of Tesla K40 GPUs for this research. In addition, this work was supported by NSF grant IIS1227495.
-  S. Baker, D. Scharstein, J. Lewis, S. Roth, M. J. Black, and R. Szeliski. A database and evaluation methodology for optical flow. IJCV, 92(1):1–31, 2011.
-  J. Donahue, L. A. Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell. Long-term recurrent convolutional networks for visual recognition and description. arXiv preprint arXiv:1411.4389, 2014.
-  D. Fouhey and C. L. Zitnick. Predicting object dynamics in scenes. In CVPR, 2014.
-  R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.
-  M. Hoai and F. De la Torre. Max-margin early event detectors. IJCV, 107(2):191–202, 2014.
-  S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
-  D.-A. Huang and K. M. Kitani. Action-reaction: Forecasting the dynamics of human interaction. In ECCV. 2014.
-  Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.
-  A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks. In CVPR, 2014.
-  K. Kitani, B. Ziebart, D. Bagnell, and M. Hebert. Activity forecasting. In ECCV, 2012.
-  H. S. Koppula and A. Saxena. Anticipating human activities using object affordances for reactive robotic response. In RSS, 2013.
-  A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, pages 1097–1105, 2012.
-  H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre. HMDB: a large video database for human motion recognition. In ICCV, 2011.
-  T. Lan, T.-C. Chen, and S. Savarese. A hierarchical representation for future action prediction. In ECCV. 2014.
-  C. Schüldt, I. Laptev, and B. Caputo. Recognizing human actions: A local SVM approach. In ICPR, 2004.
-  C. Liu, J. Yuen, and A. Torralba. Sift flow: Dense correspondence across scenes and its applications. PAMI, 2011.
-  Ľ. Ladický, B. Zeisl, and M. Pollefeys. Discriminatively trained dense surface normal estimation. In ECCV, 2014.
-  S. L. Pintea, J. C. van Gemert, and A. W. Smeulders. Déjà vu: Motion prediction in static images. In ECCV. 2014.
-  M. Ranzato, A. Szlam, J. Bruna, M. Mathieu, R. Collobert, and S. Chopra. Video (language) modeling: a baseline for generative models of natural videos. arXiv preprint arXiv:1412.6604, 2014.
-  K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. In NIPS, 2014.
-  K. Soomro, A. R. Zamir, and M. Shah. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012.
-  N. Srivastava, E. Mansimov, and R. Salakhutdinov. Unsupervised learning of video representations using lstms. arXiv preprint arXiv:1502.04681, 2015.
-  C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. arXiv preprint arXiv:1409.4842, 2014.
-  J. Walker, A. Gupta, and M. Hebert. Patch to the future: Unsupervised visual prediction. In CVPR, 2014.
-  H. Wang and C. Schmid. Action recognition with improved trajectories. In ICCV, Sydney, Australia, 2013.
-  X. Wang, D. F. Fouhey, and A. Gupta. Designing deep networks for surface normal estimation. arXiv preprint arXiv:1411.4958, 2014.
-  P. Weinzaepfel, J. Revaud, Z. Harchaoui, and C. Schmid. DeepFlow: Large displacement optical flow with deep matching. In ICCV, 2013.
-  J. Yuen and A. Torralba. A data-driven approach for event prediction. In ECCV, 2010.
-  N. Zhang, M. Paluri, M. Ranzato, T. Darrell, and L. Bourdev. Panda: Pose aligned networks for deep attribute modeling. In CVPR, 2014.