Video combines structured spatial and temporal information in high dimensions. The strong spatio-temporal dependencies among consecutive frames greatly increase the difficulty of modeling. For instance, it is challenging to effectively separate moving objects from the background and to predict a plausible future movement of the former chao2017forecasting ; henaff2017prediction ; villegas2017decomposing ; vondrick2017generating ; walker2016uncertain ; xue2016visual. Although video is large in size and complex to model, video prediction is a task that can leverage the extensive video data available online without the need for human labeling. Learning a good video predictor is an essential step toward understanding spatio-temporal modeling. The same concepts can also be applied to other tasks, such as weather forecasting, traffic-flow prediction, and disease control wang2017deepsd ; wang2018will ; wang2016etcps.
The recurrent neural network (RNN) is a widely used framework for spatio-temporal modeling. In most existing work, motion is estimated by subtracting two consecutive frames, and the background is encoded by a convolutional neural network (CNN) cai2017deep ; patraucean2015spatio ; villegas2017decomposing ; villegas2017learning. The CNN ensures spatial consistency, while temporal consistency is handled by the recurrent units, which encourage motion to progress smoothly through time. However, the information in two consecutive frames is usually insufficient to learn the dynamics. Using 3D convolution to generate future frames can avoid this problem vondrick2016generating ; vondrick2017generating, although videos generated by 3D convolution usually lack sharpness.
We propose a Leaked Motion Video Predictor (LMVP) for robust future-frame prediction. LMVP generates the prediction in an adversarial framework: we use a generative network to predict the next video frames, and a discriminative network to judge the generated video clips. For the motion part, we propose to learn the dynamics by introducing a motion guider that connects the generator and the discriminator. The motion guider learns motion features by training on real video clips, and guides the prediction process by providing possible motion features. At the same time, in contrast to estimating motion by subtracting two consecutive frames, we allow the discriminator to leak high-level extracted dynamic features to the motion guider to further help the prediction. Such dynamic features give the generator more informative guidance about the dynamics. The spatial dependencies of video are imposed by a convolutional filter network conditioned on the current frame. This idea is inspired by a conventional signal processing technique known as adaptive filtering, which can increase the flexibility of the neural network jia2016dynamic. We assume that each pixel of the predicted frame is a nonlinear function of the neighboring pixels in the current frame, where the nonlinear function is implemented in LMVP as a deep neural network.
The video is represented as a sequence of frames, described by the total number of frames, the frame height, the frame width, and the number of channels. Given an initial segment of the frames, the task is to predict the frames that follow. Real and predicted frames at each time step are denoted separately. The model framework is given in Figure 1. It mainly contains a generator, a motion guider, and a discriminator. The discriminator distinguishes between real and predicted video clips. The motion guider learns the temporal dependencies in the video through the features leaked from the discriminator, and the generator uses the output of the motion guider to predict the next frame based on the current one.
2.1 Leaked Features from the Discriminator as Motion Signals
The discriminator (shown at the top of Figure 1) is designed as both a discriminator and a motion feature extractor. Its bottom layers form a feature extractor, followed by several convolutional and fully connected layers that classify real/fake samples. Given an input video clip, the discriminator output is obtained by applying these classification layers to the extracted feature. The motion feature extracted by the discriminator is the input to the motion guider.
The feature extractor is implemented as a convolutional network. Its output is expected to capture motion from the input clip. The difference between the features extracted from two consecutive clips is treated as the dynamic motion feature between two consecutive frames. In contrast to the direct subtraction of two consecutive frames villegas2017decomposing, our dynamic motion feature is extracted from two consecutive video clips of a given length. Since previous video frames are also included, it can still give a reasonable output even if the model fails at the previous time step. The discriminator loss is referred to as Equation (1).
2.2 Learning and teaching game of the motion guider
To utilize the motion information leaked from the discriminator, we introduce a motion guider module, which is inspired by the LeakGAN model guo2017long for text generation. The structure of the motion guider is displayed in the green dotted box in Figure 1. It has a recurrent structure that takes the extracted motion feature as input at each time step and outputs a predicted motion feature. Specifically, the motion guider plays two roles in the model: learner and teacher.
As a learner, the motion guider learns the motion in video from features that the discriminator leaks on real video. At each time step, it receives the leaked information extracted by the feature extractor and predicts the dynamic motion feature between two consecutive time steps, by forcing its prediction to be close to the observed dynamic motion feature. The learner loss function, defined over the parameters of the motion guider, is referred to as Equation (2).
Note that only real video samples are used to update the motion guider. The superscript of the learner loss stands for “Learner”.
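The learner step can be illustrated with a minimal sketch. It is not the authors' implementation: `clip_feature` is a hypothetical stand-in for the convolutional feature extractor, frames are flat lists of pixel values, and the squared Euclidean distance is an assumed choice, since the exact form of the learner loss is not shown here.

```python
def clip_feature(clip):
    """Hypothetical stand-in for the convolutional feature extractor:
    returns the mean intensity of each frame in the clip."""
    return [sum(frame) / len(frame) for frame in clip]


def dynamic_motion_feature(frames, t, k):
    """Difference between the features of two consecutive clips of length k,
    one ending at frame t and one ending at frame t - 1."""
    if t < k:
        raise ValueError("need at least k frames before time step t")
    current = clip_feature(frames[t - k + 1 : t + 1])
    previous = clip_feature(frames[t - k : t])
    return [c - p for c, p in zip(current, previous)]


def learner_loss(predicted, observed):
    """Assumed distance: squared Euclidean distance between the motion
    feature predicted by the guider and the observed one."""
    return sum((p - o) ** 2 for p, o in zip(predicted, observed))


# Four tiny "frames" of two pixels each, taken from a real clip.
frames = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [4.0, 4.0]]
observed = dynamic_motion_feature(frames, t=3, k=2)
loss = learner_loss([0.0, 0.0], observed)
```

Because each feature summarizes a whole clip rather than a single frame, one badly predicted frame changes only part of the clip feature, which is the robustness argument made in the text.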
As a teacher, the motion guider serves as a guide by providing predicted dynamic motion features to the generator. During this step, the motion guider is fixed while the generator is updated under its guidance. Given the leaked features of the predicted data at a given time step, the output of the motion guider serves as an input to the generator to predict the next frame. This is detailed in the next section.
Since the dynamic motion feature is extracted from a video clip instead of a single frame, the motion guider is robust against failed predictions at the previous time step, i.e., even if the previously predicted frame diverges from the ground truth.
2.3 Generating the next frame under guidance from the motion guider
The structure of the generator is shown at the bottom of Figure 1. It contains a spatial feature encoder, a temporal feature encoder, and a filter network. The spatial feature encoder is designed to learn the static background structure, while the temporal features are computed from the dynamic motion feature. Note that only predicted samples have their motion guider output flow back to the generator. The spatial and temporal features are concatenated and fed into the filter network. The next frame is predicted by applying the adaptive filter generated by the filter network to the current frame. This technique is also known as visual transformation vondrick2017generating.
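The filtering step can be sketched as follows, assuming a single-channel frame stored as nested lists and one small kernel per output pixel. In LMVP the kernels would come from the filter network; here they are supplied by hand, and the helper names are hypothetical.

```python
def apply_adaptive_filters(frame, filters, radius=1):
    """Apply a different (2*radius+1) x (2*radius+1) kernel at every pixel.

    frame:   H x W nested list of pixel values.
    filters: H x W nested list; filters[i][j] is the kernel for pixel (i, j).
    Pixels outside the frame are treated as zero.
    """
    height, width = len(frame), len(frame[0])
    output = [[0.0] * width for _ in range(height)]
    for i in range(height):
        for j in range(width):
            acc = 0.0
            for di in range(-radius, radius + 1):
                for dj in range(-radius, radius + 1):
                    y, x = i + di, j + dj
                    if 0 <= y < height and 0 <= x < width:
                        acc += filters[i][j][di + radius][dj + radius] * frame[y][x]
            output[i][j] = acc
    return output


def constant_filters(kernel, height, width):
    """Helper for the example: use the same kernel at every pixel."""
    return [[kernel for _ in range(width)] for _ in range(height)]


IDENTITY = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]     # copy the current pixel
SHIFT_RIGHT = [[0, 0, 0], [1, 0, 0], [0, 0, 0]]  # copy the left neighbour

frame = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
unchanged = apply_adaptive_filters(frame, constant_filters(IDENTITY, 2, 3))
shifted = apply_adaptive_filters(frame, constant_filters(SHIFT_RIGHT, 2, 3))
```

A kernel that copies a neighbouring pixel moves content by one pixel, so position-dependent kernels can move an object while leaving the background in place. This sketch is linear given the kernels; the nonlinearity described in the text comes from the network that generates them.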
As mentioned in Section 2.2, the motion guider is updated during the learning step. When generating the next frame in the teaching step, the motion guider is fixed and outputs the motion guide. Specifically, to ensure that the generator produces the next frames following the guidance of the motion guider, the dynamic motion feature between the generated video clips at two consecutive time steps should be close to the motion feature predicted by the motion guider. The generator is then updated by minimizing the teacher loss, referred to as Equation (3).
This loss is defined over all parameters of the generator. The gradient is taken with respect to the generator parameters only, while the predicted output and the leaked motion feature of a given time step are treated as inputs. The superscript of the teacher loss indicates that the motion guider acts as a “Teacher”.
The total loss function for the generator is referred to as Equation (4).
It includes the reconstruction loss of the generator, which comprises a pixel-wise cross-entropy/MSE loss and the gradient difference loss (GDL) mathieu2015deep. The whole model is updated iteratively, one component at a time. A pseudo-algorithm is given in Alg. 1 in the Appendix. The discriminator loss (1) is first evaluated and the discriminator is updated. Then the motion guider parameters are updated using only real samples. Finally, the generator parameters are updated using loss function (4). In practice, Adam kingma2014adam is used to perform the gradient descent optimization.
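A minimal sketch of how the generator objective could be assembled is given below. The MSE and GDL terms follow their standard definitions; the way the terms are combined, the weight `lam` on the teacher term, and the function names are assumptions made for illustration, since the exact form of Equation (4) is not shown here.

```python
def mse(pred, target):
    """Pixel-wise mean squared error between two H x W frames."""
    height, width = len(pred), len(pred[0])
    total = sum(
        (pred[i][j] - target[i][j]) ** 2
        for i in range(height)
        for j in range(width)
    )
    return total / (height * width)


def gradient_difference_loss(pred, target, alpha=1.0):
    """GDL: penalize mismatch between the absolute image gradients of the
    prediction and the target, vertically and horizontally."""
    height, width = len(pred), len(pred[0])
    total = 0.0
    for i in range(height):
        for j in range(width):
            if i > 0:  # vertical gradient
                g_pred = abs(pred[i][j] - pred[i - 1][j])
                g_true = abs(target[i][j] - target[i - 1][j])
                total += abs(g_true - g_pred) ** alpha
            if j > 0:  # horizontal gradient
                g_pred = abs(pred[i][j] - pred[i][j - 1])
                g_true = abs(target[i][j] - target[i][j - 1])
                total += abs(g_true - g_pred) ** alpha
    return total


def generator_loss(pred, target, teacher_term, lam=1.0):
    """Assumed combination: reconstruction (MSE + GDL) plus a weighted
    teacher term. `lam` is a hypothetical weight, not taken from the paper."""
    reconstruction = mse(pred, target) + gradient_difference_loss(pred, target)
    return reconstruction + lam * teacher_term


target = [[0.0, 1.0], [0.0, 1.0]]   # sharp vertical edge
blurry = [[0.5, 0.5], [0.5, 0.5]]   # blurred prediction with no edge
```

The blurry prediction has a moderate MSE but a large GDL because it has lost the edge, which is why a gradient term is commonly added to counter blurring.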
Moving MNIST: Each video in the Moving MNIST dataset srivastava2015unsupervised contains two handwritten digits bouncing inside a patch. Given the initial frames, the task is to predict the motion of the digits in the following frames. We follow the same training and testing procedure as srivastava2015unsupervised. Evaluation metrics include Binary Cross Entropy (BCE), Peak Signal to Noise Ratio (PSNR), and Structural Similarity Index Measure (SSIM) wang2004image between the ground truth and the prediction. Small values of BCE or large values of SSIM and PSNR indicate good prediction results. In this task, we need to keep the shape of each digit the same across time (spatial consistency) while giving the digits reasonable movements (temporal consistency).
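For reference, two of these metrics can be computed as in the sketch below, which uses the standard definitions of BCE and PSNR on flattened frames with values in [0, 1]. Averaging BCE per pixel is an assumption; reported numbers are often summed per frame instead, so the scale may differ from the tables.

```python
import math


def binary_cross_entropy(pred, target, eps=1e-7):
    """Mean per-pixel binary cross entropy between two flattened frames."""
    total = 0.0
    for p, y in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total -= y * math.log(p) + (1.0 - y) * math.log(1.0 - p)
    return total / len(pred)


def psnr(pred, target, max_value=1.0):
    """Peak signal-to-noise ratio in decibels; infinite for identical frames."""
    mse = sum((p - y) ** 2 for p, y in zip(pred, target)) / len(pred)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)


ground_truth = [1.0, 0.0, 1.0, 0.0]
good = [0.9, 0.1, 0.9, 0.1]
poor = [0.6, 0.4, 0.6, 0.4]

# A better prediction yields a lower BCE and a higher PSNR.
bce_good, bce_poor = binary_cross_entropy(good, ground_truth), binary_cross_entropy(poor, ground_truth)
psnr_good, psnr_poor = psnr(good, ground_truth), psnr(poor, ground_truth)
```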
Table 1 compares LMVP with baseline models. The BCE of LMVP, measured per pixel over the predicted frames, is better than that of state-of-the-art models xingjian2015convolutional ; srivastava2015unsupervised ; jia2016dynamic ; villegas2017decomposing. The predictions from LMVP and DFN jia2016dynamic are shown in Figure 4. The input is given in the first row, followed by the ground truth of the output, and then the results from the DFN model and our LMVP model. To show that our model performs consistently well further into the future, Figure 5 in the Appendix compares SSIM and PSNR over time steps. LMVP achieves higher SSIM and PSNR scores than the other baseline models at all time steps.
Highway Drive: The dataset contains videos collected from a car-mounted camera while driving on a highway. The videos contain rich temporal dynamics, including both the self-motion of the car and the motion of other objects in the scene lotter2016deep. Following the setting used in jia2016dynamic, we split the frames of the video into a training set and a test set. The task is to predict three frames in the future given the past three.
The prediction results are compared in Table 2, and two samples from the test set are shown in Figure 3. In the predictions of DFN, the rail of the guidepost becomes curved. In the predictions of LMVP, however, the rail stays straight in the first and second predicted frames. To aid the visual comparison, this region is highlighted with a red circle.
We have proposed the Leaked Motion Video Predictor (LMVP) to handle spatio-temporal consistency in video prediction. For the dynamics in video, the motion guider learns motion features from real data and guides the prediction. Since the motion guider learns features from video sequences, it is more robust than approaches that use only single frames as input. For the structure of the background, the adaptive filter generates input-aware filters when predicting the next frame, ensuring spatial consistency. Further, a discriminator is adopted to improve the prediction result. On both synthetic and real datasets, LMVP shows superior results over state-of-the-art approaches.
- (1) H. Cai, C. Bai, Y.-W. Tai, and C.-K. Tang. Deep video generation, prediction and completion of human action sequences. arXiv preprint arXiv:1711.08682, 2017.
- (2) Y.-W. Chao, J. Yang, B. Price, S. Cohen, and J. Deng. Forecasting human dynamics from static images. In IEEE CVPR, 2017.
- (3) J. Guo, S. Lu, H. Cai, W. Zhang, Y. Yu, and J. Wang. Long text generation via adversarial training with leaked information. AAAI, 2018.
- (4) M. Henaff, J. Zhao, and Y. LeCun. Prediction under uncertainty with error-encoding networks. arXiv preprint arXiv:1711.04994, 2017.
- (5) X. Jia, B. De Brabandere, T. Tuytelaars, and L. V. Gool. Dynamic filter networks. In NIPS, 2016.
- (6) D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- (7) W. Lotter, G. Kreiman, and D. Cox. Deep predictive coding networks for video prediction and unsupervised learning. ICLR, 2017.
- (8) M. Mathieu, C. Couprie, and Y. LeCun. Deep multi-scale video prediction beyond mean square error. ICLR, 2016.
- (9) V. Patraucean, A. Handa, and R. Cipolla. Spatio-temporal video autoencoder with differentiable memory. ICLR, 2016.
- (10) N. Srivastava, E. Mansimov, and R. Salakhudinov. Unsupervised learning of video representations using lstms. In ICML, 2015.
- (11) R. Villegas, J. Yang, S. Hong, X. Lin, and H. Lee. Decomposing motion and content for natural video sequence prediction. ICLR, 2017.
- (12) R. Villegas, J. Yang, Y. Zou, S. Sohn, X. Lin, and H. Lee. Learning to generate long-term future via hierarchical prediction. In ICML, 2017.
- (13) C. Vondrick, H. Pirsiavash, and A. Torralba. Generating videos with scene dynamics. In NIPS, 2016.
- (14) C. Vondrick and A. Torralba. Generating the future with adversarial transformers. In CVPR, 2017.
- (15) J. Walker, C. Doersch, A. Gupta, and M. Hebert. An uncertain future: Forecasting from static images using variational autoencoders. In ECCV, 2016.
- (16) D. Wang, W. Cao, J. Li, and J. Ye. Deepsd: supply-demand prediction for online car-hailing services using deep neural networks. In 2017 IEEE 33rd International Conference on Data Engineering (ICDE). IEEE, 2017.
- (17) D. Wang, W. Cao, M. Xu, and J. Li. Etcps: An effective and scalable traffic condition prediction system. In International Conference on Database Systems for Advanced Applications. Springer, 2016.
- (18) D. Wang, J. Zhang, W. Cao, J. Li, and Y. Zheng. When will you arrive? estimating travel time based on deep neural networks. AAAI, 2018.
- (19) Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing, 2004.
- (20) S. Xingjian, Z. Chen, H. Wang, D.-Y. Yeung, W.-K. Wong, and W.-c. Woo. Convolutional lstm network: A machine learning approach for precipitation nowcasting. In NIPS, 2015.
- (21) T. Xue, J. Wu, K. Bouman, and B. Freeman. Visual dynamics: Probabilistic future frame synthesis via cross convolutional networks. In NIPS, 2016.
Appendix A Model training
The model is first pre-trained by alternating between two updates. In each iteration, we first update the discriminator by minimizing the loss in Equation (1); then the remaining trainable components are jointly updated by minimizing the loss in Equation (4). We found empirically that this pre-training stabilizes the generation process and helps the discriminator learn useful leaked information.
In the main algorithm loop, the discriminator, the motion guider, and the generator are trained iteratively. The algorithm outline is given in Algorithm 1. First, the discriminator is updated according to the discriminator loss while the motion guider and the generator are kept fixed. Second, the learner loss defined in Equation (2) is evaluated to update the motion guider while the discriminator and the generator remain unchanged. The third step is to update the generator by minimizing the loss in Equation (4). Note that, in both pre-training and the main algorithm loop, all initial hidden states in the recurrent architecture are set to zero. Parameters are updated with Adam kingma2014adam.
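The order of updates in the main loop can be sketched as a plain schedule. The update functions are hypothetical placeholders that would each perform one optimizer step on a single component and return its loss; the labels "D", "M", and "G" are shorthand chosen here for the discriminator, the motion guider, and the generator.

```python
def train_main_loop(num_iterations, update_discriminator,
                    update_motion_guider, update_generator):
    """Run the three-step alternating schedule and record each loss.

    Each callback is assumed to update only its own component and to
    leave the other two unchanged.
    """
    history = []
    for _ in range(num_iterations):
        # Step 1: discriminator loss, Equation (1).
        history.append(("D", update_discriminator()))
        # Step 2: learner loss, Equation (2), computed on real clips only.
        history.append(("M", update_motion_guider()))
        # Step 3: generator loss, Equation (4), with the motion guider fixed.
        history.append(("G", update_generator()))
    return history


# Dummy callbacks standing in for real optimizer steps.
history = train_main_loop(
    num_iterations=2,
    update_discriminator=lambda: 0.7,
    update_motion_guider=lambda: 0.3,
    update_generator=lambda: 1.2,
)
order = [name for name, _ in history]
```

Keeping the updates separate means the motion guider is never trained on generated clips, so the guidance it gives the generator reflects motion learned from real video.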
Appendix B Experiment Result on Moving MNIST Dataset
The predictions generated by an LMVP model and a DFN model jia2016dynamic are displayed in Figure 4. The frames in the first row are the input sequences and the ground truth sequences. Frames generated by the DFN model and by our LMVP model are shown in the second and third rows, respectively. Visually, the predictions of LMVP are better than those of DFN. This is confirmed quantitatively in Table 3.
To show this more clearly, Figure 5 displays the prediction quality of DFN and our model at different time steps. LMVP achieves higher SSIM and PSNR scores than the other baseline models at all time steps.