Disentangling Propagation and Generation for Video Prediction

12/02/2018 ∙ by Hang Gao, et al.

Learning to predict future video frames is a challenging task. Recent approaches for natural scenes directly predict pixels by inferring appearance flow and using flow-guided warping. Such models excel when motion estimates are accurate, but the motion may be ambiguous or erroneous in many real scenes. When scene motion exposes new regions of the scene, motion-based prediction yields poor results. However, learning to predict novel pixels directly can also require a prohibitive amount of training. In this work, we present a confidence-aware spatial-temporal context encoder for video prediction called Flow-Grounded Video Prediction (FGVP), in which motion propagation and novel pixel generation are first disentangled and then fused according to a computed flow uncertainty map. For regions where motion-based prediction shows low confidence, our model uses a conditional context encoder to hallucinate appropriate content. We test our methods on the standard CalTech Pedestrian dataset and the more challenging KITTI Flow dataset, which has larger motions and more occlusions. Our methods produce sharper and more natural predictions than previous works, achieving state-of-the-art performance on both datasets.




1 Introduction

Figure 1: Video prediction by decoupling pixel motion from the motion-agnostic novel scene. Consider a sequence of input frames in which a car passes by. Here we show a next-frame prediction case, where we aim to warp the last input frame (a), given a backward flow prediction (b), to the target frame (c); however, direct motion propagation introduces ghosting effects in the occluded region (d); our model identifies the occlusion based on flow information (e), and in-paints the low-confidence region where novel pixels are independent of the motion (f).
Figure 2: Overview of previous works and our proposed method. In a -in -out setup, video prediction frameworks typically feature a motion propagation module and/or a generation module ; (a): pixel-based methods directly generate every pixel from scratch, and motions are learned implicitly; (b): motion-based methods are equipped with a warping operator so that they can focus on pixel displacement; (c): linear fusion methods locate occluded regions by a jointly learned linear mask and naively fuse pixel- and motion-based predictions together; (d): we propose to disentangle and so that each module only needs to focus on where it should look; instead of being jointly inferred by a network, occlusion is computed using the predicted flow. We use a hard mask when aggregating propagation and generation to avoid blurriness.

Modeling visual dynamics of the real world is crucial for intelligent agents in a wide range of domains, such as computer vision, robotics, and graphics. For example, in model-based reinforcement learning, agents can “foresee” future frames and plan accordingly to maximize their expected rewards [1, 2, 3, 4]. As shown in Figure 1, consider a sequence of frames in which a car passes by. We humans can easily imagine what the next frame will look like. This is because we can first identify the foreground and predict how it will move in the short-term future. Then, we just crop it out, distort it a bit, and paste it where we think it should be. Finally, we refill the hole left by the crop using the prior we have acquired through daily experience. This mental process is both effective and efficient. It is desirable to have our computational models predict the future in a similar way.

As illustrated in Figure 2, we summarize previous approaches by classifying how they generate new pixels. In pure pixel-based models [4, 5, 6, 7, 8, 9, 10, 11, 12], every pixel is generated from scratch given a history buffer. Within such a process, motions are implicitly modeled and propagated by convolutional or recurrent neural blocks [13, 14]. Because these models are supervised by a pixel-wise loss in visual space, their predictions are usually blurry, since blurring avoids large penalties. Adversarial [15] and feature-alignment [16] priors alleviate this problem but are known to be hard to train in practice. Besides, since they potentially need to infer the latent physical principles of how pixels get propagated, pure pixel-based models require a prohibitive amount of computation and data. On the other hand, pure motion-based models [17, 18] are encoded with propagation priors from low-level computer vision, in which pixel movements are represented by coordinate shifts, or more formally, appearance flow from sources to predicted targets. The main merit of such methods is that, by injecting a warping operator, models know how to computationally propagate pixels and only need to focus on predicting future dynamics in the flow representation. Also, since all pixels are copied from previous frames, temporal consistency is automatically ensured. Yet as shown in Figure 1d, motion-based methods fail in occluded regions where the flow is ill-defined. Motivated by these limitations, Finn et al. [1] and Hao et al. [19] explored different ways to compose pixel- and motion-based predictions through a jointly learned linear mask. These methods work well in practice, but results are still “smoothed” due to linear fusion.

In this work, we present a new approach towards both accurate and realistic video prediction using a confidence-aware spatial-temporal context encoder. Our insight is that motion propagation and generation could and should be disentangled so that each component can maximize its utility. Different from traditional learned linear masks, we design a new warping operator for computed hard masks, so that disentangled modules can focus on non-occluded regions for flow prediction and occluded regions for novel scene generation. For better image quality and temporal consistency, we additionally introduce a fusion decoder and a segmentation loss when training our generator. By going beyond linear fusion techniques, our disentangled fusion model is capable of generating both accurate and realistic predictions.

We evaluate our approach on both the standard CalTech Pedestrian dataset and the more challenging KITTI Flow dataset of larger motions and occlusions. Our approach achieves the state-of-the-art performance on both datasets. Further ablation studies also demonstrate the effectiveness of each component proposed in our method.

2 Related Works

Photo-realistic Image Synthesis Pixel accuracy, or more preferably realism, is the constant pursuit of high-quality image synthesis [16, 20, 21, 22, 23, 24]. Recent developments towards photo-realistic image synthesis constantly feature Generative Adversarial Networks (GANs) [15]. Conditioned on categorical labels [25], textual descriptions [26] or segmentations [27], high-fidelity image generation quality can be achieved. The closest work to ours is Dense Pose Transfer [28], which hallucinates new human pose images by warping the original image with a predefined dense pose and in-painting the ambiguous parts. In our case, however, the model needs to predict future motion and further synthesize based on both spatial and temporal information.

Video Prediction  There is a vast body of research in video prediction [1, 5, 4, 29, 30]. Fueled by high-capacity models for image synthesis, recent approaches [11, 12, 6, 31, 32] are mostly pixel-based, in which every pixel is generated from scratch. They show that an encoder-decoder network can produce reasonable image predictions while suffering from blurry effects, especially for unseen novel scenes. On the contrary, motion-based methods [17, 18] excel in predicting sharp results, yet fail in areas where motion predictions are erroneous or ill-defined. SDC-Net [33] proposes an interesting architectural design in which motion is modeled by both convolutional kernels as in [1] and vectors as optical flow. The previous work closest to ours is [19], which composes pixel- and motion-based predictions through a jointly learned linear mask. However, our proposed approach differs from it in two aspects: (1) motion propagation and generation are separately trained to focus on non-occluded and occluded regions only; (2) our aggregation is done by a non-linear hard mask computed from inferred flow information.

Disentangling Motion and Content  Since videos are essentially composed of pixel motions and contents, such as foreground objects that are invariant to motions themselves, it is natural to take them apart, i.e. preserving content semantics while propagating them through the temporal axis. [9, 10, 11] are the three representative works in such direction to attack video prediction. Though similar in the disentangling nature, our approach does not discriminate motion from the content information — instead, we closely relate them together using a warping operator. What we disentangle are the propagation results and the novel scenes that are agnostic to motion values.

Spatial Context Encoding  Pixels are not isolated. On the contrary, there are many cases where each pixel is referred to as the context for its nearby pixels  [34, 35, 36, 37]. Given an image where certain pixels are masked out, spatial context encoders are required to refill the missing pixels by taking account of their neighbors. In our case, the “mask” is the occluded area where motion predictions are erroneous. Particularly, we employ partial convolutions [38] in our generator’s encoding blocks. Different from previous image in-painting works, our approach implicitly combines the temporal context with the spatial context. Since unconditional context encoders tend to remove foreground in their predictions, we further introduce a fusion decoder and a segmentation loss to improve the visual quality.

3 Approach

3.1 Overview

Video prediction aims to synthesize future frames given a stack of history frames. For ease of exposition, we here focus on a -in -out prediction task: given an input video sequence denoted as , the model aims to predict the frame which is expected to be both accurate and visually sharp. In this sense, we intentionally build our model to go beyond a reconstruction loss, which is widely known to encourage blurry results. Concretely, as illustrated in Figure 2d, our model is disentangled into two orthogonal yet complementary modules and so that each module can maximize its utility.

Given the history frames , our propagation module learns to predict the flow field for the pixel correspondence between the last input frame and the target . Since it neglects the existence of occlusions, we define an occlusion-aware warper , and propagate the input frame into a motion-dependent prediction . At the same time, the warper computes a confidence map that encodes where the artifacts exist. Finally, based on motion-dependent prediction and computed occlusion mask, our generation module learns to in-paint low confidence patches from the dataset prior. By explicitly attributing future dynamics to motion and motion-agnostic novel scene, our model is able to predict high-fidelity future frames.

3.2 Motion Propagation

Our module computes the correlation of appearances between each pair of frames from the history buffer and predicts future motion dynamics using optical flow. We choose it over other motion representations, such as frame differences [39] or sparse trajectories [19], because it provides richer information about motion occlusions over pixels themselves.

As illustrated in Figure 3a, our flow prediction module is an encoder-decoder network with skip connections. The output of is a 2-dimensional flow field that aims to propagate the last frame into the propagated target frame . Formally, let be a Cartesian grid over the target frame, and we have


By assuming local linearity, we can next define a standard backward warping operator , and sample the future frame from the last given frame as



is a bilinear sampler that generates the new image by first mapping the regular grid to the transformed grid and then bilinear interpolating between produced sub-pixels.
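
The backward warping step described above can be illustrated with a small NumPy sketch. This is only an illustration under stated assumptions, not the authors' implementation: the function name `backward_warp`, the `(dx, dy)` channel order of the flow, and the clamping of out-of-range samples to the image border are all choices made here for the example.

```python
import numpy as np

def backward_warp(image, flow):
    """Sample `image` at grid + flow with bilinear interpolation.

    image: (H, W) or (H, W, C) array, the last input frame.
    flow:  (H, W, 2) backward flow on the target grid, stored as (dx, dy)
           (assumed layout for this sketch).
    """
    H, W = flow.shape[:2]
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    # Transformed grid: where each target pixel reads from in the source frame.
    sx = np.clip(xs + flow[..., 0], 0, W - 1)
    sy = np.clip(ys + flow[..., 1], 0, H - 1)
    # Integer corners surrounding each sub-pixel sample location.
    x0 = np.floor(sx).astype(int)
    y0 = np.floor(sy).astype(int)
    x1 = np.minimum(x0 + 1, W - 1)
    y1 = np.minimum(y0 + 1, H - 1)
    wx = sx - x0
    wy = sy - y0
    if image.ndim == 3:
        # Broadcast the interpolation weights over the channel axis.
        wx = wx[..., None]
        wy = wy[..., None]
    top = image[y0, x0] * (1 - wx) + image[y0, x1] * wx
    bottom = image[y1, x0] * (1 - wx) + image[y1, x1] * wx
    return top * (1 - wy) + bottom * wy
```

With a zero flow the function returns the input frame unchanged, and a constant flow of half a pixel averages each pixel with its neighbor, which is the sub-pixel behavior the bilinear sampler is meant to provide.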

Figure 3: Conceptual illustrations of our and . (a): Our propagation module is a standard encoder-decoder fully convolutional network with skip connections, which takes in history frames and predicts the backward flow from to ; (b): taking in the computed occlusion mask and the warped image , our generation module in-paints for the final prediction ; it replaces standard convolutional blocks with partial convolutions (P-Conv) and fusion blocks (F-Conv) in its encoder and decoder, respectively.

However, this method erroneously introduces “ghosting” effect in occluded regions (see Figure 1d and Figure 4). Our insight is that the low-confidence predictions on occluded regions should be excluded from motion propagation results to avoid unnecessary errors. It should be noted that similar ideas have been previously explored in [1, 19]; but in their contexts, occlusion masks are linearly learned to compose pixels from different sources. We argue that these regions can be explicitly computed based on backward flows, and hence can be directly masked out. We will later demonstrate the effectiveness of our computed mask over the learned mask in Section 4.

Specifically, we augment our predefined warping operator to retrieve the occluded regions by examining how the Cartesian grid from the last input frame changes during motion propagation. For each propagated frame , we maintain a look-up table to record how many sub-pixels will move to each coordinate in after the propagation. That is, we have

Figure 4: Ghosting effect caused by warping. Consider a foreground object on the background. (a): the frame with a backward flow prediction shown as an arrow; (b): the target frame ; (c): the propagated frame shows ghosting effects on the occluded region (shown as shaded); (d): we exclude occluded region by a computed mask .

where is the hard binary mask in which zeros stand for occlusion and ones for valid pixels; is an element-wise indicator function. Empirically, we here set a hard threshold at to identify occluded regions where at least two sub-pixels collide.
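
The look-up table idea can be sketched as follows. This is a hedged illustration rather than the paper's exact operator: the original equation and threshold value are not recoverable from the text, so the function name `occlusion_mask`, the nearest-integer rounding of shifted coordinates, the direction in which the grid is shifted, and the default `threshold=2` (chosen to match "at least two sub-pixels collide") are all assumptions.

```python
import numpy as np

def occlusion_mask(flow, threshold=2):
    """Count how many shifted grid points land on each coordinate.

    flow: (H, W, 2) flow stored as (dx, dy) (assumed layout).
    Returns (mask, counts): mask is 1 for valid pixels and 0 where at
    least `threshold` points collide; counts is the look-up table.
    """
    H, W = flow.shape[:2]
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    # Shift the Cartesian grid by the flow and snap to the nearest pixel.
    tx = np.rint(xs + flow[..., 0]).astype(int)
    ty = np.rint(ys + flow[..., 1]).astype(int)
    # Points that leave the frame are simply not counted.
    inside = (tx >= 0) & (tx < W) & (ty >= 0) & (ty < H)
    counts = np.zeros((H, W), dtype=int)
    # np.add.at accumulates correctly when several points share a target.
    np.add.at(counts, (ty[inside], tx[inside]), 1)
    mask = (counts < threshold).astype(np.float32)
    return mask, counts
```

Under this sketch, a zero flow yields a count of one everywhere and therefore an all-valid mask, while any coordinate that receives two or more points is marked as occluded.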

Similar to previous works on unsupervised flow estimation [40, 41], we adopt a masked pixel loss and a smooth loss that penalizes abnormal shifts in the flow’s magnitude,


where denotes the structural similarity index and is our trade-off weight between the loss terms, which is fixed at through cross-validation.

3.3 Spatial-temporal Context Encoding

Given the propagated frame and the computed occlusion map , we can now formulate our second modeling stage as spatial-temporal context encoding, in which missing pixels are not directly determined by motions but subject to their spatial contexts propagated through time.

Our generator module adopts generally the same network architecture as its propagation counterpart while substituting the standard convolution blocks with other building blocks for context encoding. Illustrated in Figure 3b, our encoder takes the previous propagated frame and its occlusion map as inputs, producing a latent feature representation. The decoder then takes this feature representation and synthesizes the missing content.

Specifically, we design all encoder blocks as partial convolution operators [38] to mask out invalid pixels and re-normalize features within clean receptive fields only. That is, to compute the feature and the binary occlusion mask at the th layer, we have


where is a normal convolution operator, is the element-wise multiplication. For each location in the mask, it will be considered as valid if there exists a valid pixel in its receptive field. We design our encoder in a way that the receptive field of the bottleneck is bigger than the maximal area of the occlusion masks so that, by pushing an image through the encoder, all input pixels will be valid and we will have a dense, clean feature map in our bottleneck.
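
A minimal single-channel sketch of a partial convolution in the style of [38] may help make the masking and re-normalization concrete. It is written with explicit loops for readability and is not the authors' code; the function name `partial_conv2d`, the absence of padding and stride, and the scalar bias are assumptions made for this example.

```python
import numpy as np

def partial_conv2d(x, mask, kernel, bias=0.0):
    """Single-channel partial convolution without padding.

    x:      (H, W) input feature.
    mask:   (H, W) binary mask, 1 = valid, 0 = occluded.
    kernel: (kh, kw) convolution weights.
    Returns the re-normalized output and the updated mask.
    """
    kh, kw = kernel.shape
    out_h = x.shape[0] - kh + 1
    out_w = x.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    new_mask = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            m = mask[i:i + kh, j:j + kw]
            valid = m.sum()
            if valid > 0:
                # Convolve over valid pixels only, then rescale by the
                # ratio of window size to the number of valid pixels.
                patch = x[i:i + kh, j:j + kw] * m
                out[i, j] = (patch * kernel).sum() * (kh * kw / valid) + bias
                # A location becomes valid once its window sees a valid pixel.
                new_mask[i, j] = 1.0
    return out, new_mask
```

Because each layer marks a location as valid as soon as one valid pixel falls in its window, stacking enough such layers shrinks the hole until the mask is all ones, which is the property the bottleneck design relies on.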

Next, for the decoder, low-resolution but clean feature maps are upsampled and linked with previous high-resolution but masked features by skip connections. However, this raises a fusion issue when aggregating the features. Suppose we have an occlusion mask on the encoder’s feature that we need to refill with the decoder. Previously, in [38], feature maps and masks are concatenated channel-wise and handled by new partial convolutions in the decoder. We find this could be improved by directly refilling the occluded encoder features with the upsampled clean decoder features. Concretely, our decoder computes the feature of the reversed-th layer, with respect to its encoder counterpart, by


where denotes the decoder feature map in pair with the encoder, is a bi-linear upsampler, and is a channel-wise concatenation operator. This decoding fusion is repeated at each layer from the bottleneck up to our final output .
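
The refilling idea can be sketched in a few lines. The exact fusion equation is not recoverable from the text, so the following is only one plausible reading: the function names, the requirement that encoder and decoder features have the same channel count, the concatenation of the refilled feature with the upsampled decoder feature, and the use of nearest-neighbor upsampling as a stand-in for the bilinear upsampler are all assumptions of this example.

```python
import numpy as np

def upsample2x(feat):
    """Nearest-neighbor 2x upsampling (stand-in for a bilinear upsampler)."""
    return feat.repeat(2, axis=0).repeat(2, axis=1)

def fuse(enc_feat, enc_mask, dec_feat):
    """Refill occluded encoder features with upsampled decoder features.

    enc_feat: (H, W, C) masked high-resolution encoder feature.
    enc_mask: (H, W) binary mask, 1 = valid, 0 = occluded.
    dec_feat: (H/2, W/2, C) clean low-resolution decoder feature.
    """
    up = upsample2x(dec_feat)
    m = enc_mask[..., None]
    # Keep encoder features where they are valid, borrow from the decoder elsewhere.
    refilled = m * enc_feat + (1.0 - m) * up
    # Channel-wise concatenation, to be followed by a convolution in a real network.
    return np.concatenate([refilled, up], axis=-1)
```

The point of the design is that no masked (invalid) activation reaches the decoder convolution: every occluded location already carries a clean value before the features are mixed.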

To train our generator , we design all of our losses to be temporally independent so that it can focus on visual quality. In general, our loss terms consist of a pixel reconstruction loss


perceptual and style losses in VGG’s latent spaces as in [42]


a total-variation loss to encourage similar texture at occlusion boundaries


and an extra semantic loss to enforce layout consistency, distilled by a pretrained segmentation network , since unconditional image in-painting tends to remove the foreground objects,


where is our attentive weight for masked regions; is the masked pixel loss inherited from the previous motion propagation training; denotes the element-wise difference matrix of the prediction and target on the -th feature space of VGG ; denotes the Jacobian matrix of the composed image ; is the cross-entropy function; and is the pseudo ground-truth segmentation class. We empirically find the segmentation consistency loss crucial for our generator .
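
As a small illustration of the smoothness term, a common anisotropic total-variation penalty sums the absolute differences between neighboring pixels. This is a generic sketch, not the paper's exact loss: the function name `total_variation`, the L1 form, and the lack of any restriction to the occlusion boundary are assumptions here.

```python
import numpy as np

def total_variation(img):
    """Anisotropic total variation of an (H, W) or (H, W, C) image."""
    img = np.asarray(img, dtype=np.float64)
    # Differences between vertically and horizontally adjacent pixels.
    dh = np.abs(img[1:, :] - img[:-1, :]).sum()
    dw = np.abs(img[:, 1:] - img[:, :-1]).sum()
    return dh + dw
```

A constant image has zero penalty, and the penalty grows with every sharp transition, which is why such a term discourages visible seams where in-painted content meets propagated content.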

3.4 Training

The overall training objective could be formulated as


where all ’s are the hyper-parameters that control the training schedule. We set through a coarse grid search.

Since flow estimation/prediction is known to be hard to learn and sensitive to data biases, we first train our motion propagation module and generation module separately. After gradients become stable, we connect the two components together and fine-tune the whole network in an end-to-end fashion.

4 Experiments

(a)  (b)  (c) PredNet (d) ContextVP (e) FGVP (Ours) (f) 
Figure 5: Qualitative comparisons for Next-Frame Prediction on CalTech Pedestrian dataset. Given past frames, all models are required to predict the future frame in the next step. Our FGVP produces much sharper results compared to previous state-of-the-art methods on this dataset. Remarkably, it is also capable of inferring semantically sensible appearance for occluded regions. Results are best viewed in color with zoom.

We conduct experiments on the CalTech Pedestrian dataset in Section 4.1 and the KITTI Flow dataset in Section 4.2. We also conduct an ablation study to clarify the effectiveness of the proposed modules in Section 4.3. We will show that our FGVP produces much sharper predictions than competing techniques, which leads to a significant improvement in general visual quality.

Metrics   We adopt the traditional Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index Measure (SSIM) [22] metrics to measure the “pixel/patch-wise accuracy”. However, it is well known that these metrics disagree with human perception [7, 43, 44, 45] because they tend to favor blurriness over naturalness. Therefore we also measure the realism of the given predictions by Learned Perceptual Image Patch Similarity (LPIPS) proposed by [48]. (Footnote 1: The metric computes distance in AlexNet [46] feature space (conv1-5, pre-trained on ImageNet). Compared to other deep feature metrics like Inception Score, LPIPS matches better with human perceptual judgments.) Quantitatively, higher PSNR/SSIM scores and smaller LPIPS distances suggest better performance.
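
For reference, PSNR follows the standard definition 10 · log10(MAX² / MSE). The sketch below assumes images scaled to the range [0, 1] by default; the function name `psnr` and the `max_val` parameter are choices made for this example, not details taken from the paper's evaluation code.

```python
import numpy as np

def psnr(pred, target, max_val=1.0):
    """Peak Signal-to-Noise Ratio in dB between two same-shaped images."""
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    mse = np.mean((pred - target) ** 2)
    if mse == 0:
        # Identical images: the ratio is unbounded.
        return float("inf")
    return 10.0 * np.log10(max_val ** 2 / mse)
```

Since PSNR depends only on the mean squared error, a prediction that averages over several plausible futures can score well while looking blurry, which is the motivation for reporting LPIPS alongside it.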

Baselines   We consider strong baselines from the three main families of video prediction methods previously introduced: (1) Pixel-based methods, including Beyond MSE [5], PredNet [6], SVP-LP [11] and ContextVP [12]; (2) Motion-based methods, including DVF [18]; (3) Fusion of the pixel- and motion-based methods, including Dual Motion GAN [49] and CtrlGen [19].

4.1 CalTech Pedestrian Dataset

BeyondMSE [5] P
PredNet [6] P
ContextVP [12] P
DVF [18] M
Dual Motion GAN [49] F
CtrlGen [19] F
FGVP (Ours) F
Table 1: Next-Frame Prediction results on CalTech Pedestrian. All models are trained on KITTI Raw dataset. The best results under each metric are marked in bold.

We begin our experiments on the standard CalTech Pedestrian dataset [50], which is captured from moving vehicles and contains both ego and object motions in real-life scenarios, including rigid and non-rigid scene changes.

Setup   The conventional experimental setup on this dataset is to first train a model on the training split of KITTI Raw [51] proposed by Lotter et al. [6] and then directly test it on the testing set of CalTech Pedestrian. Frames are center-cropped and down-sampled to pixels. Every consecutive frames are divided and sampled as a training clip in which the first frames are fed into the model as the input, and the th frame is used as the prediction target. As a result, the training, validation and testing sets consist of , , and clips.

(a) DVF (b) CtrlGen (c) FGVP (Ours) (d) 
Figure 6: Qualitative comparisons for Next-Frame Prediction on the more challenging KITTI Flow dataset. All models are given 4 frames as input and required to predict the next frame. Frames indicate large motions, scene changes or occlusions. Our method is robust to these cases and consistent in the performance. Results are best viewed in color with zoom.
PredNet [6] P
SVP-LP [11] P
DVF [18] M
CtrlGen [19] F
FGVP (Ours) F
Table 2: Next-Frame Prediction results on KITTI Flow. All models are trained to predict next frame given a history buffer of two frames. All evaluation results of the previous methods are obtained by their published codebases. The best results under each metric are marked in bold.

Analysis   We compare our model against previous state-of-the-art methods on this dataset. The results of Next-Frame Prediction are shown in Table 1. Our model achieves PSNR and SSIM scores comparable with ContextVP [12]. Meanwhile, our method can predict non-stretching textures in occluded regions, which leads to smaller perceptual dissimilarity as measured by LPIPS. As shown in Figure 5, our model is robust for both pixel propagation and novel scene inference.

Apart from empirical improvements, we find that, in terms of LPIPS metric, all the evaluated state-of-the-art methods do no better than the most naive baseline — repeating the last input frame as the prediction. This suggests that the CalTech Pedestrian dataset consists of small motions that are not obvious for human perception. This motivates us to work on a more challenging dataset so that learners can benefit from more inductive biases and thus be more robust.

4.2 KITTI Flow Dataset

We next move to a more challenging dataset, KITTI Flow [51]. It was originally designed as a benchmark for optical flow estimation and features higher resolution, larger motions, and more occlusions compared to the raw dataset. (Footnote 2: Samples are downsampled and center-cropped to pixels to avoid optical distortions around lens edges.)

Setup   The dataset contains examples for training, for validation and for testing. We apply data augmentation techniques such as random cropping and random horizontal flipping for all the models. In addition, we sample video clips of frames (-in -out) from the dataset using a sliding window. This amounts to clips for training and clips for testing.

(a) PSNR
(b) SSIM
Figure 7: Quantitative results for Multi-Frame Prediction on KITTI Flow dataset. All the models take in frames as input and recursively predict next frames. Best viewed in color with zoom.

We choose the strong baseline methods that have published their codebases online. We also include a weak baseline that trivially repeats the last input frame as its prediction. It should be noted that PredNet [6] and SVP-LP [11] are originally designed to infer on past frames, but here we configure them to take in only frames.

Analysis   As demonstrated in Figure 6, our proposed model again produces more visually appealing predictions than our baselines. In contrast to the pixel-based methods, all demonstrated methods suffer less from blurriness but display distortion and stretching of shapes due to quick scene changes, which cause inaccurate flow prediction. Our model, instead, can predict better flow so as to alleviate undesirable artifacts in large-motion areas. Occluded areas are masked by motion propagation and refilled by generation so that they are free of ghosting effects. Our generator learns a scene prior to hallucinate what is missing given the contextual information. Our quantitative improvements are shown in Table 2. As resolution increases, previous pixel-based methods (PredNet, SVP-LP) suffer from a steeper learning curve and more uncertainty in the visual space, resulting in a noticeable drop in their performance. Though achieving better pixel/patch accuracy, they underperform the weakest repeating baseline in terms of perceptual similarity. Our FGVP achieves the best results in all metrics, especially LPIPS, showing around improvement over the second best result from DVF. It should be noted that our method is mainly supervised by “realistic” losses (perceptual, style, segmentation), but still surpasses other baselines in pixel/patch-wise accuracy metrics. This shows that our method can effectively learn dataset priors and flow prediction given the same amount of data. We attribute this to our design of decoupling propagation and generation into two modules so that each module can concentrate on learning its own objective.

Figure 7 compares the Multi-Frame Prediction results of our model with various baselines on the KITTI Flow dataset. Given frames, all networks are trained to predict the next frame and then tested by recursively producing frames. Our method shows consistent performance gains on all metrics through time. DVF performs similarly to our model for short-term prediction measured by PSNR but quickly decays after steps. This is because their method is sensitive to propagated error, as it has no remedy mechanism. Our model, however, can mask out undesirable regions and generate new pixels instead.

4.3 Ablation

To better understand our design choices and their effectiveness, we conduct ablation studies on our motion propagator and generator, shown in Table 3.

On the upper half of the table, we evaluate the performance gap between our motion propagator, which predicts future flows given an input sequence, and an oracle flow estimator (PWC [52]) that exploits target frames. It should be noted that all occlusion artifacts are masked by a warping operator so that we only evaluate the prediction results caused by moving pixels. The performance of our motion propagator closely follows the oracle, which indicates that it is effective at predicting future dynamics caused by motion.

On the bottom half, we build three groups of comparison experiments by removing the segmentation loss, removing the perceptual and style losses, or replacing our fusion decoder with normal partial convolutions as in [38]. All generators are trained using the same oracle model used in the motion ablation studies. Removing the perceptual and style losses does not hurt PSNR, but leads to large degradation in structural and perceptual metrics. On the other hand, removing the segmentation loss or our fusion decoding blocks results in performance drops in all metrics. These observations demonstrate that our individual designs are beneficial for training our generation module.

w/ mask
w/o p+s
w/o seg
w/o fus
Table 3: Ablation study on each component in our model. See texts for more information.

5 Conclusion

In this work, we present a method for video prediction by disentangling motion prediction and novel scene prediction. We predict the optical flow to warp our last input frame into the propagated target frame. For occluded regions where the warping operator assigns low-confidence, our model uses a spatial-temporal context encoder to hallucinate appropriate content. We systematically evaluate our approach on both traditional CalTech Pedestrian dataset and more challenging KITTI Flow dataset of larger motions and occlusions. Our approach can yield both accurate and realistic predictions, achieving the state-of-the-art performance on both datasets.


Appendix A Implementation Details

General architectural parameters   We adapt our architectures from Zhu et al. [53] (https://github.com/junyanz/pytorch-CycleGAN-and-pix2pix) and Johnson et al. [16] (https://github.com/jcjohnson/fast-neural-style). For all experiments described in the main paper, we use blocks for the encoder and blocks for the decoder. Below, we follow the naming convention used in their Github repositories to describe our general architectural parameters.

Let cMsN-K denote an M×M Convolution-Batchnorm-Activation layer with stride N and K filters. We use Inplace-ABN [54] to reduce the memory consumption. Further, let us define a basic encoder block eM-K by cascading cMs1-K with another downsampling convolution block cMs2-K, where ReLU is used. (Footnote 5: All ReLU units are approximated by LeakyReLUs of slope to be compatible with Inplace-ABN.) The basic decoder block dM-K consists of a nearest-neighbor upsample layer followed by two cMs1-K layers in which activation layers are chosen as LeakyReLUs of slope .

Motion propagation module   Our motion propagation module could be defined as:

e7-64, e5-128, e5-256, e3-512, e3-512, d3-512, d3-512, d3-256, d3-128, d3-2,
where the last output layer has no activation, i.e., the flow prediction network regresses unconstrained displacement values for each coordinate. As suggested by [55], we also empirically confirmed that large kernel sizes in the first several layers help the training to converge.
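
To make the block notation concrete, the following sketch parses the specification string above into a list of blocks and checks that the downsampling and upsampling stages balance. The helper names `parse_spec` and `net_resolution_level` are hypothetical and serve only to illustrate the naming convention; they are not part of the authors' code.

```python
import re

# Block specification of the motion propagation module, as listed above.
SPEC = "e7-64, e5-128, e5-256, e3-512, e3-512, d3-512, d3-512, d3-256, d3-128, d3-2"

def parse_spec(spec):
    """Turn 'eM-K' / 'dM-K' tokens into dicts with kind, kernel size and filters."""
    blocks = []
    for token in spec.split(","):
        m = re.fullmatch(r"([ed])(\d+)-(\d+)", token.strip())
        if m is None:
            raise ValueError(f"bad block token: {token!r}")
        kind = "encoder" if m.group(1) == "e" else "decoder"
        blocks.append({"kind": kind, "kernel": int(m.group(2)), "filters": int(m.group(3))})
    return blocks

def net_resolution_level(blocks):
    """Each encoder block halves the resolution (-1); each decoder block doubles it (+1)."""
    return sum(-1 if b["kind"] == "encoder" else 1 for b in blocks)

blocks = parse_spec(SPEC)
print(len(blocks), net_resolution_level(blocks), blocks[-1]["filters"])
```

A net level of zero means the output flow field has the same spatial resolution as the input frames, and the two filters of the final block correspond to the two components of the flow.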

Generation module   Our generator module uses the same architectural parameters as the motion propagation module. The only differences are that: (1) we replace the normal convolution operators with partial convolution operators in the eM-K blocks and fusion convolution operators in the dM-K blocks; (2) we replace d3-2 with d3-3, where a Tanh activation is used to bound the output value between -1 and 1.

Appendix B Training Details

Here we specify more training details to supplement what we have described in the main paper. To train the motion propagation module, we start from the learning rate at and decay it by at half of the training epochs, then repeat it again at the of the training epochs. The generator is trained from and scheduled with the same decay strategy. We train our motion propagation and generation modules for epochs and epochs on the CalTech Pedestrian dataset. For the KITTI Flow dataset, they are trained for and epochs, respectively. Apart from the motion masks generated by the propagation module, we augment the generator training process with extra masks obtained by random walks. Our segmentation extractor for the segmentation loss is a DLA34 model [56] trained on the Mapillary Vistas dataset [57] with the Cityscapes labels [58].

Appendix C More Qualitative Results

More qualitative results are shown in Figure 8 and 9. To better assess the visual quality and temporal coherence of our proposed method, please check out our anonymous video website at https://sites.google.com/view/fgvp.

(a)  (b)  (c) PredNet [6] (d) ContextVP [12] (e) FGVP (Ours) (f) 
Figure 8: More qualitative comparisons for -in -out Next-Frame Prediction on CalTech Pedestrian dataset.
(a) DVF [18] (b) CtrlGen [19] (c) FGVP (Ours) (d) 
Figure 9: More qualitative comparisons for -in -out Next-Frame Prediction on KITTI Flow dataset.