Modeling the visual dynamics of the real world is crucial for intelligent agents in a wide range of domains, such as computer vision, robotics, and graphics. For example, in model-based reinforcement learning, agents can “foresee” future frames and plan accordingly to maximize their expected rewards [1, 2, 3, 4]. As shown in Figure 1, consider a sequence of frames in which a car passes by. We humans can easily imagine what the next frame will look like. This is because we first identify the foreground and predict how it will move in the short-term future. Then, we simply crop it out, distort it a bit, and paste it where we think it should be. Finally, we refill the rest of the cropping hole using priors acquired through daily experience. This mental process is both effective and efficient, and it is desirable to have our computational models predict the future in a similar way.
As illustrated in Figure 2, we summarize previous approaches by classifying how they generate new pixels. In pure pixel-based models [4, 5, 6, 7, 8, 9, 10, 11, 12], every pixel is generated from scratch given a history buffer. In this process, motions are implicitly modeled and propagated by convolutional or recurrent neural blocks [13, 14]. Supervised by pixel-wise losses in the visual space, predictions are usually blurry, since blur avoids large penalties. Adversarial and feature-alignment priors alleviate this problem but are known to be hard to train in practice. Besides, since they need to infer the latent physical principles of how pixels get propagated, pure pixel-based models require a prohibitive amount of computation and data. On the other hand, pure motion-based models [17, 18] encode propagation priors from low-level computer vision, in which pixel movements are represented by coordinate shifts, or more formally, appearance flow from sources to predicted targets. The main merit of such methods is that, by injecting a warping operator, models know how to computationally propagate pixels and only need to focus on predicting future dynamics in the flow representation. Also, since all pixels are copy-pasted from previous frames, temporal consistency is automatically ensured. Yet, as shown in Figure 1d, motion-based methods fail in occluded regions where the flow is ill-defined. Motivated by these limitations, Finn et al.  and Hao et al.  explored different ways to compose pixel- and motion-based predictions through a jointly learned linear mask. These methods work well in practice, but the results are still “smoothed” due to the linear fusion.
In this work, we present a new approach towards both accurate and realistic video prediction using a confidence-aware spatial-temporal context encoder. Our insight is that motion propagation and generation could and should be disentangled so that each component can maximize its utility. Different from traditional learned linear masks, we design a new warping operator for computed hard masks, so that disentangled modules can focus on non-occluded regions for flow prediction and occluded regions for novel scene generation. For better image quality and temporal consistency, we additionally introduce a fusion decoder and a segmentation loss when training our generator. By going beyond linear fusion techniques, our disentangled fusion model is capable of generating both accurate and realistic predictions.
We evaluate our approach on both the standard CalTech Pedestrian dataset and the more challenging KITTI Flow dataset, which has larger motions and occlusions. Our approach achieves state-of-the-art performance on both datasets. Ablation studies further demonstrate the effectiveness of each component of our method.
2 Related Work
Photo-realistic Image Synthesis Pixel accuracy, or more desirably realism, is the constant pursuit of high-quality image synthesis [16, 20, 21, 22, 23, 24]. Recent progress towards photo-realistic image synthesis prominently features Generative Adversarial Networks (GANs) . Conditioned on categorical labels , textual descriptions , or segmentations , high-fidelity image generation can be achieved. The closest work to ours is Dense Pose Transfer , which hallucinates new human pose images by warping the original image with a predefined dense pose and in-painting the ambiguous parts. In our case, however, the model needs to predict future motion and then synthesize based on both spatial and temporal information.
Video Prediction There is a vast body of research on video prediction [1, 5, 4, 29, 30]. Fueled by high-capacity models for image synthesis, recent approaches [11, 12, 6, 31, 32] are mostly pixel-based, in which every pixel is generated from scratch. They show that an encoder-decoder network can produce reasonable image predictions, but suffer from blurry effects, especially on unseen novel scenes. On the contrary, motion-based methods [17, 18] excel at predicting sharp results, yet fail in areas where motion predictions are erroneous or ill-defined. SDC-Net  proposes an interesting architectural design in which motion is modeled both by convolutional kernels, as in , and by vectors, as in optical flow. Our closest previous work is , which composes pixel- and motion-based predictions through a jointly learned linear mask. Our proposed approach differs from it in two aspects: (1) motion propagation and generation are trained separately to focus on non-occluded and occluded regions, respectively; (2) our aggregation is done by a non-linear hard mask computed from inferred flow information.
Disentangling Motion and Content Since videos are essentially composed of pixel motions and contents, such as foreground objects that are themselves invariant to motion, it is natural to take them apart, i.e., to preserve content semantics while propagating them along the temporal axis. [9, 10, 11] are three representative works that attack video prediction in this direction. Though similar in their disentangling nature, our approach does not separate motion from content information; instead, we closely relate them together using a warping operator. What we disentangle are the propagation results and the novel scenes that are agnostic to motion values.
Spatial Context Encoding Pixels are not isolated. On the contrary, each pixel often serves as the context for its nearby pixels [34, 35, 36, 37]. Given an image where certain pixels are masked out, spatial context encoders refill the missing pixels by taking their neighbors into account. In our case, the “mask” is the occluded area where motion predictions are erroneous. In particular, we employ partial convolutions  in our generator’s encoding blocks. Different from previous image in-painting works, our approach implicitly combines temporal context with spatial context. Since unconditional context encoders tend to remove the foreground in their predictions, we further introduce a fusion decoder and a segmentation loss to improve the visual quality.
Video prediction aims to synthesize future frames given a stack of history frames. For ease of exposition, we focus here on a -in -out prediction task: given an input video sequence denoted as , the model aims to predict the frame , which is expected to be both accurate and visually sharp. In this sense, we intentionally build our model in a way that goes beyond the reconstruction loss, which is widely known to encourage blurry results. Concretely, as illustrated in Figure 2d, our model is disentangled into two orthogonal yet complementary modules  and  so that each module can maximize its utility.
Given the history frames , our propagation module learns to predict the flow field for the pixel correspondence between the last input frame and the target . Since it neglects the existence of occlusions, we define an occlusion-aware warper , and propagate the input frame into a motion-dependent prediction . At the same time, the warper computes a confidence map that encodes where artifacts exist. Finally, based on the motion-dependent prediction and the computed occlusion mask, our generation module learns to in-paint low-confidence patches using the dataset prior. By explicitly attributing future dynamics to motion and a motion-agnostic novel scene, our model is able to predict high-fidelity future frames.
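The composition at the heart of this pipeline can be sketched in a few lines of NumPy (a minimal illustration; the function and array names are ours, not the paper's):

```python
import numpy as np

def compose(warped, generated, mask):
    """Hard-mask composition: keep propagated pixels where the warper
    is confident (mask == 1) and take generated pixels where it is
    not (mask == 0). Unlike a learned linear mask, the mask is binary."""
    return mask * warped + (1.0 - mask) * generated
```

Because the mask is binary rather than a learned soft weight, each output pixel comes entirely from either the propagation module or the generation module, avoiding the "smoothed" look of linear fusion.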
3.2 Motion Propagation
Our module computes the correlation of appearances between each pair of frames from the history buffer and predicts future motion dynamics as optical flow. We choose it over other motion representations, such as frame differences  or sparse trajectories , because it provides richer information about motion occlusions than pixels themselves.
As illustrated in Figure 3a, our flow prediction module is an encoder-decoder network with skip connections. The output of is a 2-dimensional flow field that aims to propagate the last frame into the propagated target frame . Formally, let be a Cartesian grid over the target frame, and we have
By assuming local linearity, we can next define a standard backward warping operator , and sample the future frame from the last given frame as
is a bilinear sampler that generates the new image by first mapping the regular grid to the transformed grid and then bilinearly interpolating between the produced sub-pixels.
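A minimal NumPy sketch of such a backward warping operator for a single-channel image (illustrative only; the border clamping and variable names are our assumptions):

```python
import numpy as np

def backward_warp(img, flow):
    """Bilinearly sample `img` at locations shifted by a backward flow.

    img:  (H, W) single-channel image.
    flow: (H, W, 2) per-pixel (dy, dx) offsets; output pixel (y, x) is
          sampled from source location (y + dy, x + dx).
    """
    H, W = img.shape
    ys, xs = np.mgrid[0:H, 0:W].astype(np.float64)   # Cartesian grid
    sy = np.clip(ys + flow[..., 0], 0, H - 1)        # clamp to image
    sx = np.clip(xs + flow[..., 1], 0, W - 1)
    y0, x0 = np.floor(sy).astype(int), np.floor(sx).astype(int)
    y1, x1 = np.minimum(y0 + 1, H - 1), np.minimum(x0 + 1, W - 1)
    wy, wx = sy - y0, sx - x0
    # bilinear interpolation between the four neighbouring pixels
    return ((1 - wy) * (1 - wx) * img[y0, x0] +
            (1 - wy) * wx       * img[y0, x1] +
            wy       * (1 - wx) * img[y1, x0] +
            wy       * wx       * img[y1, x1])
```

A zero flow field returns the input unchanged; a constant flow shifts the sampling grid uniformly.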
However, this method erroneously introduces a “ghosting” effect in occluded regions (see Figure 1d and Figure 4). Our insight is that low-confidence predictions in occluded regions should be excluded from the motion propagation results to avoid unnecessary errors. Similar ideas have been explored previously in [1, 19], but in those contexts, occlusion masks are linearly learned to compose pixels from different sources. We argue that these regions can be explicitly computed from the backward flows, and hence can be directly masked out. We demonstrate the effectiveness of our computed mask over the learned mask in Section 4.
Specifically, we augment our predefined warping operator to retrieve the occluded regions by examining how the Cartesian grid of the last input frame changes during motion propagation. For each propagated frame , we maintain a look-up table to record how many sub-pixels move to each coordinate in  after propagation. That is, we have
where  is the hard binary mask in which zeros stand for occlusions and ones for valid pixels;  is an element-wise indicator function. Empirically, we set a hard threshold at  to identify occluded regions where at least two sub-pixels collide.
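The occlusion computation above can be sketched as follows, assuming the flow is stored as per-pixel (dy, dx) offsets and using nearest-integer binning for the look-up table (both are our simplifying assumptions):

```python
import numpy as np

def occlusion_mask(flow):
    """Binary mask from a flow field: 0 = occluded, 1 = valid.

    A location is flagged as occluded when at least two grid points
    collide on the same (rounded) coordinate after propagation,
    mirroring the hard collision threshold of two sub-pixels.
    """
    H, W, _ = flow.shape
    ys, xs = np.mgrid[0:H, 0:W].astype(np.float64)
    ty = np.clip(np.rint(ys + flow[..., 0]), 0, H - 1).astype(int)
    tx = np.clip(np.rint(xs + flow[..., 1]), 0, W - 1).astype(int)
    count = np.zeros((H, W))
    np.add.at(count, (ty, tx), 1.0)        # look-up table of arrivals
    return (count < 2).astype(np.float64)  # 0 where >= 2 pixels collide
```

With a zero flow every location receives exactly one pixel and the whole mask is valid; when two grid points land on the same coordinate, that coordinate is marked occluded.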
where  denotes the structural similarity index and  is the trade-off weight between the loss terms, which is fixed at  through cross-validation.
3.3 Spatial-temporal Context Encoding
Given the propagated frame  and the computed occlusion map , we can now formulate our second modeling stage as spatial-temporal context encoding, in which missing pixels are not directly determined by motion but are subject to their spatial contexts propagated through time.
Our generator module adopts generally the same network architecture as its propagation counterpart while substituting the standard convolution blocks with other building blocks for context encoding. Illustrated in Figure 3b, our encoder takes the previous propagated frame and its occlusion map as inputs, producing a latent feature representation. The decoder then takes this feature representation and synthesizes the missing content.
Specifically, we design all encoder blocks as partial convolution operators  to mask out invalid pixels and re-normalize features within clean receptive fields only. That is, to compute the feature and the binary occlusion mask at the th layer, we have
where  is a normal convolution operator and  is element-wise multiplication. Each location in the mask is considered valid if there exists a valid pixel in its receptive field. We design our encoder such that the receptive field of the bottleneck is larger than the maximal area of the occlusion masks, so that, after pushing an image through the encoder, all input pixels are valid and we obtain a dense, clean feature map at the bottleneck.
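For illustration, a single-channel, stride-1 partial convolution in this spirit might look like the following (a naive loop-based sketch for clarity, not an efficient implementation):

```python
import numpy as np

def partial_conv2d(x, mask, kernel):
    """Single-channel partial convolution (stride 1, zero padding).

    Only valid pixels (mask == 1) contribute; responses are
    re-normalised by the fraction of valid pixels inside each
    receptive field. The updated mask marks a location valid if any
    valid pixel fell within its window.
    """
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    xp = np.pad(x * mask, ((ph, ph), (pw, pw)))
    mp = np.pad(mask, ((ph, ph), (pw, pw)))
    H, W = x.shape
    out = np.zeros_like(x)
    new_mask = np.zeros_like(mask)
    for i in range(H):
        for j in range(W):
            win_x = xp[i:i + kh, j:j + kw]
            win_m = mp[i:i + kh, j:j + kw]
            valid = win_m.sum()
            if valid > 0:
                # re-normalise by the valid fraction of the window
                out[i, j] = (win_x * kernel).sum() * (kh * kw / valid)
                new_mask[i, j] = 1.0
    return out, new_mask
```

Note that for a constant valid image and an averaging kernel, the re-normalisation makes the response exact even at the zero-padded borders.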
Next, in the decoder, low-resolution but clean feature maps are upsampled and linked with the earlier high-resolution but masked features by skip connections. However, this raises a fusion issue when aggregating the features. Consider an occlusion mask on an encoder feature that the decoder needs to refill: previously, in , feature maps and masks are concatenated channel-wise and handled by new partial convolutions in the decoder. We find this can be improved by directly refilling the occluded encoder features with the upsampled clean decoder features. Concretely, our decoder computes the feature of the reversed th layer, with respect to its encoder counterpart, by
where  denotes the decoder feature map paired with the encoder feature,  is a bilinear upsampler, and  is a channel-wise concatenation operator. This decoding fusion is repeated at each layer, from the bottleneck up to our final output .
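One decoder fusion step can be sketched as below, using nearest-neighbour upsampling as a stand-in for the bilinear upsampler (the names and the fixed 2x upsampling factor are our assumptions):

```python
import numpy as np

def fuse_skip(enc_feat, enc_mask, dec_feat_lowres):
    """Fusion decoding step: upsample the clean low-resolution decoder
    feature and use it to refill occluded entries of the encoder skip
    feature, then concatenate the two along the channel axis."""
    # nearest-neighbour 2x upsampling stands in for bilinear here
    up = dec_feat_lowres.repeat(2, axis=0).repeat(2, axis=1)
    filled = enc_mask * enc_feat + (1 - enc_mask) * up
    return np.concatenate([filled[None], up[None]], axis=0)
```

The refilled skip feature is dense everywhere, so subsequent decoder convolutions no longer need to track a mask.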
To train our generator , we design all of our losses to be temporally independent so that it can focus on visual quality. In general, our loss consists of a pixel reconstruction loss
perceptual and style losses in VGG’s latent spaces as in 
a total-variation loss to encourage similar textures across occlusion boundaries
and an extra semantic loss to enforce layout consistency, distilled from a pretrained segmentation network , since unconditional image in-painting tends to remove foreground objects,
where  is our attentive weight for masked regions;  is the masked pixel loss inherited from the motion propagation training;  denotes the element-wise difference matrix between the prediction and the target on the th feature space of VGG ;  denotes the Jacobian matrix of the composed image ;  is the cross-entropy function; and  is the pseudo ground-truth segmentation class. We empirically find the segmentation consistency loss crucial for our generator .
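To make two of these terms concrete, here is a hedged NumPy sketch of the attentively weighted pixel loss and the total-variation term (the weight `alpha` and the 0.1 TV coefficient are illustrative placeholders, not the paper's tuned values):

```python
import numpy as np

def tv_loss(img):
    """Anisotropic total variation: L1 norm of spatial differences."""
    return (np.abs(np.diff(img, axis=0)).sum() +
            np.abs(np.diff(img, axis=1)).sum())

def generator_loss(pred, target, mask, alpha=5.0):
    """Sketch of the pixel + TV terms; `alpha` up-weights occluded
    (mask == 0) regions, standing in for the attentive weight."""
    w = np.where(mask > 0, 1.0, alpha)
    l_pix = (w * np.abs(pred - target)).mean()
    return l_pix + 0.1 * tv_loss(pred)  # 0.1 is an illustrative weight
```

The perceptual, style, and segmentation terms would be added analogously from VGG features and segmentation logits.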
The overall training objective could be formulated as
where all ’s are the hyper-parameters that control the training schedule. We set through a coarse grid search.
Since flow estimation/prediction is known to be hard to learn and sensitive to data biases, we first train our motion propagation module and generation module separately. After gradients become stable, we connect the two components together and fine-tune the whole network in an end-to-end fashion.
[Figure: qualitative comparison panels, including (c) PredNet, (d) ContextVP, (e) FGVP (Ours)]
We conduct experiments on the CalTech Pedestrian dataset in Section 4.1 and the KITTI Flow dataset in Section 4.2. We also conduct ablation studies in Section 4.3 to clarify the effectiveness of the proposed modules. We show that our FGVP produces much sharper predictions than competing techniques, which leads to a significant improvement in overall visual quality.
We adopt the traditional Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index Measure (SSIM)  metrics to measure pixel/patch-wise accuracy. However, it is well known that these metrics disagree with human perception [7, 43, 44, 45] because they tend to encourage blurriness over naturalness. Therefore, we also measure the realism of the predictions using the Learned Perceptual Image Patch Similarity (LPIPS) proposed by . (The metric computes distances in the AlexNet  feature space, conv1-5, pre-trained on ImageNet ; compared to other deep feature metrics such as the Inception Score, LPIPS matches human perceptual judgments better.) Quantitatively, higher PSNR/SSIM scores and smaller LPIPS distances indicate better performance.
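For reference, PSNR is straightforward to compute from the mean squared error (a small sketch; `max_val` assumes intensities normalized to [0, 1]):

```python
import numpy as np

def psnr(pred, target, max_val=1.0):
    """Peak Signal-to-Noise Ratio in dB; higher means closer pixel-wise."""
    mse = np.mean((pred - target) ** 2)
    return np.inf if mse == 0 else 10.0 * np.log10(max_val ** 2 / mse)
```

A uniform error of 0.1 on a [0, 1] image yields an MSE of 0.01 and hence a PSNR of 20 dB.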
Baselines We consider strong baselines from the three main families of video prediction methods previously introduced: (1) pixel-based methods, including Beyond MSE , PredNet , SVP-LP , and ContextVP ; (2) motion-based methods, including DVF ; (3) fusions of pixel- and motion-based methods, including Dual Motion GAN  and CtrlGen .
4.1 CalTech Pedestrian Dataset
We begin our experiments on the standard CalTech Pedestrian dataset , which was captured from moving vehicles and contains both ego and object motion in real-life scenarios, including rigid and non-rigid scene changes.
Setup The conventional experimental setup on this dataset, proposed by Lotter et al. , is to first train a model on the training split of KITTI Raw  and then directly test it on the testing set of CalTech Pedestrian. Frames are center-cropped and down-sampled to  pixels. Every  consecutive frames are sampled as a training clip, in which the first  frames are fed into the model as input and the th frame is used as the prediction target. As a result, the training, validation, and testing sets consist of , , and  clips, respectively.
[Figure: qualitative comparison panels, including (a) DVF, (b) CtrlGen, (c) FGVP (Ours)]
Analysis We compare our model against previous state-of-the-art methods on this dataset. The next-frame prediction results are shown in Table 1. Our model achieves PSNR and SSIM scores comparable with ContextVP . Meanwhile, our method predicts non-stretching textures in occluded regions, which leads to smaller perceptual dissimilarity as measured by LPIPS. As shown in Figure 5, our model is robust for both pixel propagation and novel scene inference.
Beyond these empirical improvements, we find that, in terms of the LPIPS metric, all the evaluated state-of-the-art methods do no better than the most naive baseline: repeating the last input frame as the prediction. This suggests that the CalTech Pedestrian dataset consists of small motions that are barely noticeable to human perception, which motivates us to work on a more challenging dataset, where learners can benefit from more inductive biases and thus become more robust.
4.2 KITTI Flow Dataset
We next move to a more challenging dataset, KITTI Flow . It was originally designed as a benchmark for optical flow estimation and features higher resolution (samples are downsampled and center-cropped to  pixels to avoid optical distortions around the lens edges), larger motions, and more occlusions compared to the raw dataset.
Setup The dataset contains  examples for training,  for validation, and  for testing. We apply data augmentation techniques such as random cropping and random horizontal flipping for all models. In addition, we sample video clips of  frames (-in -out) from the dataset using a sliding window, which amounts to  clips for training and  clips for testing.
We choose the strong baseline methods that have published their codebases online; we also include a weak baseline that trivially repeats the last input frame as its prediction. It should be noted that PredNet  and SVP-LP  are originally designed to infer from  past frames, but here we configure them to take in only  frames.
Analysis As demonstrated in Figure 6, our proposed model again produces more visually appealing predictions than the baselines. In contrast to the pixel-based methods, motion-based methods suffer less from blurriness but display distorted and stretched shapes under quick scene changes, which cause inaccurate flow predictions. Our model, instead, predicts better flow and thus alleviates such undesirable artifacts in large-motion areas. Occluded areas are masked out during motion propagation and refilled by generation, so they are free of ghosting effects; our generator learns a scene prior to hallucinate what is missing given the contextual information. Our quantitative improvements are shown in Table 2. As the resolution increases, previous pixel-based methods (PredNet, SVP-LP) suffer from a steeper learning curve and more uncertainty in the visual space, resulting in a noticeable drop in their performance. Though achieving better pixel/patch accuracy, they underperform even the weakest repeating baseline in terms of perceptual similarity. Our FGVP achieves the best results on all metrics, especially LPIPS, showing around  improvement over the second-best result from DVF. It should be noted that our method is mainly supervised by “realistic” losses (perceptual, style, segmentation), but still surpasses the other baselines on pixel/patch-wise accuracy metrics. This shows that our method can effectively learn dataset priors and flow prediction given the same amount of data, which we attribute to our design of decoupling propagation and generation into two modules so that each can concentrate on learning its own objective.
Figure 7 compares our models with various baselines on multi-frame prediction on the KITTI Flow dataset. Given  frames, all networks are trained to predict the next frame and then tested by recursively producing  frames. Our method shows consistent performance gains on all metrics through time. DVF performs similarly to our model for short-term prediction as measured by PSNR but decays quickly after  steps, because their method is sensitive to propagated errors and has no remedy mechanism. Our model, however, can mask out undesirable regions and generate new pixels instead.
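The recursive evaluation protocol above can be sketched generically (the `model` callable is a placeholder for any one-step predictor):

```python
def rollout(model, frames, n_future):
    """Recursive multi-frame prediction: each new prediction is fed
    back into the history buffer via a sliding window."""
    history = list(frames)
    preds = []
    for _ in range(n_future):
        nxt = model(history)           # predict one frame ahead
        preds.append(nxt)
        history = history[1:] + [nxt]  # slide the input window
    return preds
```

Because predictions are re-fed as inputs, any error in an early step compounds in later ones, which is why masking out low-confidence regions matters for long-horizon prediction.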
To better understand our design choices and their effectiveness, we conduct ablation studies on our motion propagator and generator, shown in Table 3.
In the upper half of the table, we evaluate the performance gap between our motion propagator, which predicts future flows given only an input sequence, and an oracle flow estimator (PWC ) that exploits the target frames. Note that all occlusion artifacts are masked out by the warping operator, so we only evaluate the prediction results caused by moving pixels. The performance of our motion propagator closely follows the oracle, demonstrating its effectiveness in predicting future motion dynamics.
In the bottom half, we build three groups of comparison experiments by removing the segmentation loss, removing the perceptual and style losses, or replacing our fusion decoder with normal partial convolutions as in . All generators are trained using the same oracle flow model used in the motion ablation studies. Removing the perceptual and style losses does not hurt PSNR, but leads to large degradation in the structural and perceptual metrics. On the other hand, removing the segmentation loss or our fusion decoding blocks results in performance drops across all metrics. These observations demonstrate that each of our designs is beneficial for training the generation module.
In this work, we present a method for video prediction that disentangles motion prediction and novel scene prediction. We predict the optical flow to warp the last input frame into the propagated target frame. For occluded regions, to which the warping operator assigns low confidence, our model uses a spatial-temporal context encoder to hallucinate appropriate content. We systematically evaluate our approach on both the traditional CalTech Pedestrian dataset and the more challenging KITTI Flow dataset, which has larger motions and occlusions. Our approach yields both accurate and realistic predictions, achieving state-of-the-art performance on both datasets.
-  C. Finn, I. Goodfellow, and S. Levine, “Unsupervised learning for physical interaction through video prediction,” in Advances in neural information processing systems, pp. 64–72, 2016.
-  T. Weber, S. Racanière, D. P. Reichert, L. Buesing, A. Guez, D. J. Rezende, A. P. Badia, O. Vinyals, N. Heess, Y. Li, et al., “Imagination-augmented agents for deep reinforcement learning,” arXiv preprint arXiv:1707.06203, 2017.
-  X. Wang, W. Xiong, H. Wang, and W. Y. Wang, “Look before you leap: Bridging model-free and model-based reinforcement learning for planned-ahead vision-and-language navigation,” arXiv preprint arXiv:1803.07729, 2018.
-  J. Oh, X. Guo, H. Lee, R. L. Lewis, and S. Singh, “Action-conditional video prediction using deep networks in atari games,” in Advances in neural information processing systems, pp. 2863–2871, 2015.
-  M. Mathieu, C. Couprie, and Y. LeCun, “Deep multi-scale video prediction beyond mean square error,” arXiv preprint arXiv:1511.05440, 2015.
-  W. Lotter, G. Kreiman, and D. Cox, “Deep predictive coding networks for video prediction and unsupervised learning,” arXiv preprint arXiv:1605.08104, 2016.
-  N. Wichers, R. Villegas, D. Erhan, and H. Lee, “Hierarchical long-term video prediction without supervision,” arXiv preprint arXiv:1806.04768, 2018.
-  R. Villegas, J. Yang, Y. Zou, S. Sohn, X. Lin, and H. Lee, “Learning to generate long-term future via hierarchical prediction,” arXiv preprint arXiv:1704.05831, 2017.
-  S. Tulyakov, M.-Y. Liu, X. Yang, and J. Kautz, “Mocogan: Decomposing motion and content for video generation,” arXiv preprint arXiv:1707.04993, 2017.
-  R. Villegas, J. Yang, S. Hong, X. Lin, and H. Lee, “Decomposing motion and content for natural video sequence prediction,” arXiv preprint arXiv:1706.08033, 2017.
-  E. Denton and R. Fergus, “Stochastic video generation with a learned prior,” arXiv preprint arXiv:1802.07687, 2018.
-  W. Byeon, Q. Wang, R. K. Srivastava, and P. Koumoutsakos, “Contextvp: Fully context-aware video prediction,” arXiv preprint arXiv:1710.08518, 2017.
-  F. A. Gers, J. Schmidhuber, and F. Cummins, “Learning to forget: Continual prediction with lstm,” 1999.
-  S. Xingjian, Z. Chen, H. Wang, D.-Y. Yeung, W.-K. Wong, and W.-c. Woo, “Convolutional lstm network: A machine learning approach for precipitation nowcasting,” in Advances in neural information processing systems, pp. 802–810, 2015.
-  I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Advances in neural information processing systems, pp. 2672–2680, 2014.
-  J. Johnson, A. Alahi, and L. Fei-Fei, “Perceptual losses for real-time style transfer and super-resolution,” in European Conference on Computer Vision, pp. 694–711, Springer, 2016.
-  S. L. Pintea, J. C. van Gemert, and A. W. Smeulders, “Déja vu,” in European Conference on Computer Vision, pp. 172–187, Springer, 2014.
-  Z. Liu, R. A. Yeh, X. Tang, Y. Liu, and A. Agarwala, “Video frame synthesis using deep voxel flow.,” in ICCV, pp. 4473–4481, 2017.
-  Z. Hao, X. Huang, and S. Belongie, “Controllable video generation with sparse trajectories,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7854–7863, 2018.
-  L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, “Semantic image segmentation with deep convolutional nets and fully connected crfs,” arXiv preprint arXiv:1412.7062, 2014.
-  A. Dosovitskiy and T. Brox, “Generating images with perceptual similarity metrics based on deep networks,” in Advances in Neural Information Processing Systems, pp. 658–666, 2016.
-  Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image quality assessment: from error visibility to structural similarity,” IEEE transactions on image processing, vol. 13, no. 4, pp. 600–612, 2004.
-  C. Li and M. Wand, “Combining markov random fields and convolutional neural networks for image synthesis,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2479–2486, 2016.
-  S. Xie, X. Huang, and Z. Tu, “Top-down learning for structured labeling with convolutional pseudoprior,” in European Conference on Computer Vision, pp. 302–317, Springer, 2016.
-  A. Brock, J. Donahue, and K. Simonyan, “Large scale gan training for high fidelity natural image synthesis,” arXiv preprint arXiv:1809.11096, 2018.
-  S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee, “Generative adversarial text to image synthesis,” arXiv preprint arXiv:1605.05396, 2016.
-  T.-C. Wang, M.-Y. Liu, J.-Y. Zhu, A. Tao, J. Kautz, and B. Catanzaro, “High-resolution image synthesis and semantic manipulation with conditional gans,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
-  N. Neverova, R. A. Güler, and I. Kokkinos, “Dense pose transfer,” arXiv preprint arXiv:1809.01995, 2018.
-  C. Vondrick, H. Pirsiavash, and A. Torralba, “Generating videos with scene dynamics,” in Advances In Neural Information Processing Systems, pp. 613–621, 2016.
-  A. X. Lee, R. Zhang, F. Ebert, P. Abbeel, C. Finn, and S. Levine, “Stochastic adversarial video prediction,” arXiv preprint arXiv:1804.01523, 2018.
-  M. Ranzato, A. Szlam, J. Bruna, M. Mathieu, R. Collobert, and S. Chopra, “Video (language) modeling: a baseline for generative models of natural videos,” arXiv preprint arXiv:1412.6604, 2014.
-  M. Saito, E. Matsumoto, and S. Saito, “Temporal generative adversarial nets with singular value clipping,” in IEEE International Conference on Computer Vision (ICCV), vol. 2, p. 5, 2017.
-  F. A. Reda, G. Liu, K. J. Shih, R. Kirby, J. Barker, D. Tarjan, A. Tao, and B. Catanzaro, “Sdc-net: Video prediction using spatially-displaced convolution,” arXiv preprint arXiv:1811.00684, 2018.
-  M. Bertalmio, G. Sapiro, V. Caselles, and C. Ballester, “Image inpainting,” in Proceedings of the 27th annual conference on Computer graphics and interactive techniques, pp. 417–424, ACM Press/Addison-Wesley Publishing Co., 2000.
-  C. Barnes, E. Shechtman, A. Finkelstein, and D. B. Goldman, “Patchmatch: A randomized correspondence algorithm for structural image editing,” ACM Transactions on Graphics (ToG), vol. 28, no. 3, p. 24, 2009.
-  D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A. Efros, “Context encoders: Feature learning by inpainting,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2536–2544, 2016.
-  H. Zhang, K. Dana, J. Shi, Z. Zhang, X. Wang, A. Tyagi, and A. Agrawal, “Context encoding for semantic segmentation,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
-  G. Liu, F. A. Reda, K. J. Shih, T.-C. Wang, A. Tao, and B. Catanzaro, “Image inpainting for irregular holes using partial convolutions,” arXiv preprint arXiv:1804.07723, 2018.
-  D.-A. Huang, V. Ramanathan, D. Mahajan, L. Torresani, M. Paluri, L. Fei-Fei, and J. C. Niebles, “What makes a video a video: Analyzing temporal information in video understanding models and datasets,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7366–7375, 2018.
-  S. Meister, J. Hur, and S. Roth, “Unflow: Unsupervised learning of optical flow with a bidirectional census loss,” arXiv preprint arXiv:1711.07837, 2017.
-  Z. Yin and J. Shi, “Geonet: Unsupervised learning of dense depth, optical flow and camera pose,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), vol. 2, 2018.
-  K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
-  N. Srivastava, E. Mansimov, and R. Salakhudinov, “Unsupervised learning of video representations using lstms,” in International conference on machine learning, pp. 843–852, 2015.
-  N. Kalchbrenner, A. v. d. Oord, K. Simonyan, I. Danihelka, O. Vinyals, A. Graves, and K. Kavukcuoglu, “Video pixel networks,” arXiv preprint arXiv:1610.00527, 2016.
-  J. Walker, K. Marino, A. Gupta, and M. Hebert, “The pose knows: Video forecasting by generating pose futures,” in Computer Vision (ICCV), 2017 IEEE International Conference on, pp. 3352–3361, IEEE, 2017.
-  A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, pp. 1097–1105, 2012.
-  J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pp. 248–255, Ieee, 2009.
-  R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang, “The unreasonable effectiveness of deep features as a perceptual metric,” arXiv preprint, 2018.
-  X. Liang, L. Lee, W. Dai, and E. P. Xing, “Dual motion gan for future-flow embedded video prediction,” in IEEE International Conference on Computer Vision (ICCV), vol. 1, 2017.
-  P. Dollar, C. Wojek, B. Schiele, and P. Perona, “Pedestrian detection: An evaluation of the state of the art,” IEEE transactions on pattern analysis and machine intelligence, vol. 34, no. 4, pp. 743–761, 2012.
-  M. Menze, C. Heipke, and A. Geiger, “Joint 3d estimation of vehicles and scene flow,” in ISPRS Workshop on Image Sequence Analysis (ISA), 2015.
-  D. Sun, X. Yang, M.-Y. Liu, and J. Kautz, “Pwc-net: Cnns for optical flow using pyramid, warping, and cost volume,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
-  J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-to-image translation using cycle-consistent adversarial networks,” arXiv preprint, 2017.
-  S. R. Bulò, L. Porzi, and P. Kontschieder, “In-place activated batchnorm for memory-optimized training of dnns,” arXiv preprint arXiv:1712.02616, 2017.
-  H. Jiang, D. Sun, V. Jampani, M.-H. Yang, E. Learned-Miller, and J. Kautz, “Super slomo: High quality estimation of multiple intermediate frames for video interpolation,” arXiv preprint arXiv:1712.00080, 2017.
-  F. Yu, D. Wang, E. Shelhamer, and T. Darrell, “Deep layer aggregation,” arXiv preprint arXiv:1707.06484, 2017.
-  G. Neuhold, T. Ollmann, S. R. Bulò, and P. Kontschieder, “The mapillary vistas dataset for semantic understanding of street scenes.,” in ICCV, pp. 5000–5009, 2017.
-  M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, “The cityscapes dataset for semantic urban scene understanding,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3213–3223, 2016.
Appendix A Implementation Details
General architectural parameters We adapt our architectures from Zhu et al. (https://github.com/junyanz/pytorch-CycleGAN-and-pix2pix) and Johnson et al. (https://github.com/jcjohnson/fast-neural-style). For all experiments described in the main paper, we use five blocks for the encoder and five blocks for the decoder. Below, we follow the naming convention used in their GitHub repositories to describe our general architectural parameters.
Let cMsN-K denote an M×M Convolution-Batchnorm-Activation layer with stride N and K filters. We use Inplace-ABN to reduce memory consumption. Further, we define an encoder basic block eM-K by cascading cMs1-K with a downsampling convolution block cMs2-K, where ReLU is used (all ReLU units are approximated by LeakyReLUs with a small negative slope to be compatible with Inplace-ABN). The basic decoder block dM-K consists of a nearest-neighbor upsampling layer followed by two cMs1-K layers, in which the activation layers are LeakyReLUs with the same small negative slope.
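As a concrete reference, the block naming convention above can be sketched in PyTorch. This is a minimal sketch, not our exact implementation: the LeakyReLU slope of 0.01 is an assumed value, and standard BatchNorm stands in for Inplace-ABN.

```python
import torch
import torch.nn as nn


def cMsN_K(c_in, k, stride, filters, act=True):
    """cMsN-K: an M x M Convolution-Batchnorm-Activation layer with
    stride N and K filters. Slope 0.01 is an assumed placeholder, and
    plain BatchNorm is used here instead of Inplace-ABN."""
    layers = [nn.Conv2d(c_in, filters, k, stride=stride, padding=k // 2),
              nn.BatchNorm2d(filters)]
    if act:
        layers.append(nn.LeakyReLU(0.01))
    return nn.Sequential(*layers)


def eM_K(c_in, k, filters):
    """eM-K: cMs1-K cascaded with a downsampling cMs2-K block."""
    return nn.Sequential(cMsN_K(c_in, k, 1, filters),
                         cMsN_K(filters, k, 2, filters))


def dM_K(c_in, k, filters):
    """dM-K: nearest-neighbor upsampling followed by two cMs1-K layers."""
    return nn.Sequential(nn.Upsample(scale_factor=2, mode="nearest"),
                         cMsN_K(c_in, k, 1, filters),
                         cMsN_K(filters, k, 1, filters))
```

For example, `eM_K(3, 7, 64)` corresponds to e7-64 and halves the spatial resolution, while a dM-K block doubles it.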
Motion propagation module Our motion propagation module is defined as:
e7-64, e5-128, e5-256, e3-512, e3-512, d3-512, d3-512, d3-256, d3-128, d3-2,
where the last output layer has no activation, i.e., the flow prediction network regresses unconstrained displacement values for each coordinate. As raised in prior work, we also empirically confirmed that large kernel sizes in the first several layers help training converge.
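The layer sequence above can be assembled as a plain encoder–decoder. The sketch below is illustrative only: the input channel count depends on how many past frames are stacked (the argument is ours), the LeakyReLU slope is an assumed placeholder, and any skip connections the original implementation may use are omitted.

```python
import torch
import torch.nn as nn


def conv(c_in, c_out, k, stride=1, act=True):
    # cMsN-K building block; slope 0.01 and plain BatchNorm are assumptions
    layers = [nn.Conv2d(c_in, c_out, k, stride=stride, padding=k // 2),
              nn.BatchNorm2d(c_out)]
    if act:
        layers.append(nn.LeakyReLU(0.01))
    return nn.Sequential(*layers)


def enc(c_in, c_out, k):
    # eM-K: a stride-1 conv followed by a stride-2 downsampling conv
    return nn.Sequential(conv(c_in, c_out, k), conv(c_out, c_out, k, stride=2))


def dec(c_in, c_out, k, final=False):
    # dM-K: nearest-neighbor upsample followed by two stride-1 convs;
    # the very last layer is a bare conv (no activation) for flow regression
    last = (nn.Conv2d(c_out, c_out, k, padding=k // 2) if final
            else conv(c_out, c_out, k))
    return nn.Sequential(nn.Upsample(scale_factor=2, mode="nearest"),
                         conv(c_in, c_out, k), last)


def motion_propagation(in_ch):
    # e7-64, e5-128, e5-256, e3-512, e3-512,
    # d3-512, d3-512, d3-256, d3-128, d3-2
    return nn.Sequential(
        enc(in_ch, 64, 7), enc(64, 128, 5), enc(128, 256, 5),
        enc(256, 512, 3), enc(512, 512, 3),
        dec(512, 512, 3), dec(512, 512, 3), dec(512, 256, 3),
        dec(256, 128, 3), dec(128, 2, 3, final=True),
    )
```

The five encoder blocks reduce resolution by 32x and the five decoder blocks restore it, so the 2-channel flow output matches the input resolution.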
Generation module Our generation module uses the same architectural parameters as the motion propagation module. The only differences are that: (1) we replace the normal convolution operators with partial convolution operators in the eM-K blocks and with fusion convolution operators in the dM-K blocks; (2) we replace d3-2 with d3-3, where a Tanh activation is used to bound the output values between -1 and 1.
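For reference, the partial convolution operator of Liu et al. (cited above) convolves only the valid pixels and renormalizes by the local mask coverage, then shrinks the hole in the mask. Below is a minimal single-mask-channel sketch; the class and argument names are ours, not from the original implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PartialConv2d(nn.Module):
    """Minimal partial convolution sketch: mask out hole pixels,
    rescale by local mask coverage, and update the validity mask."""

    def __init__(self, c_in, c_out, k, stride=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, stride=stride, padding=k // 2)
        # all-ones window used to count valid pixels under each kernel
        self.register_buffer("window", torch.ones(1, 1, k, k))

    def forward(self, x, mask):
        # mask: (N, 1, H, W), 1 for valid pixels, 0 inside holes
        out = self.conv(x * mask)
        with torch.no_grad():
            coverage = F.conv2d(mask, self.window,
                                stride=self.conv.stride,
                                padding=self.conv.padding)
        valid = coverage > 0
        bias = self.conv.bias.view(1, -1, 1, 1)
        # renormalize by the fraction of valid pixels under the kernel
        scale = self.window.numel() / coverage.clamp(min=1.0)
        out = torch.where(valid, (out - bias) * scale + bias,
                          torch.zeros_like(out))
        return out, valid.float()  # updated mask: the hole shrinks
```

Stacking such layers progressively fills the hole, which is why they replace the normal convolutions in the encoder blocks of the generator.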
Appendix B Training Details
Here we specify additional training details to supplement those described in the main paper. To train the motion propagation module, we start from an initial learning rate and decay it at the halfway point of the training epochs, then decay it once more later in training. The generator is trained with the same decay strategy. The motion propagation and generation modules are trained separately on the Caltech Pedestrian dataset and on the KITTI Flow dataset. Apart from the motion masks generated by the propagation module, we augment the generator training process with extra masks obtained by random walks. Our segmentation extractor for the segmentation loss is a DLA34 model trained on the Mapillary Vistas dataset with the Cityscapes labels.
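The two-step decay schedule can be expressed with a standard PyTorch scheduler. The initial learning rate, decay factor, and milestone epochs below are illustrative placeholders, not the values used in our experiments.

```python
import torch

model = torch.nn.Conv2d(3, 2, 3)  # stand-in module for illustration
opt = torch.optim.Adam(model.parameters(), lr=1e-4)  # placeholder initial rate
# Decay once at the halfway point and once more later in training;
# milestones and gamma here are assumed placeholders.
sched = torch.optim.lr_scheduler.MultiStepLR(opt, milestones=[50, 75], gamma=0.1)

for epoch in range(100):
    # ... run one training epoch, calling opt.step() per batch ...
    sched.step()  # advance the schedule once per epoch
```

With these placeholder values, the learning rate drops to 1e-5 at epoch 50 and 1e-6 at epoch 75.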
Appendix C More Qualitative Results
More qualitative results are shown in Figures 8 and 9. To better assess the visual quality and temporal coherence of our proposed method, please see our anonymous video website at https://sites.google.com/view/fgvp.
Panel labels in Figures 8 and 9: (c) PredNet, (d) ContextVP, (e) FGVP (Ours).