Video prediction, the task of generating high-fidelity future frames by observing a sequence of past frames, is paramount in numerous computer vision applications such as video object segmentation [xu2019spatiotemporal], robotics [finn2016unsupervised, liu2018future], and autonomous driving [byeon2018contextvp, gui2018few, cho2018multi].
Early works approached video prediction by directly predicting raw pixel intensities from the observations with various deep learning architectures, including 2D/3D convolutional neural networks (CNNs) [mathieu2015deep, vondrick2016generating], recurrent neural networks (RNNs) [ranzato2014video, wang2017predrnn, srivastava2015unsupervised, xingjian2015convolutional], and generative adversarial networks (GANs) [mathieu2015deep, vondrick2016generating, kwon2019predicting, liang2017dual]. These methods have improved the perceptual quality by encoding the spatial contents, but they often overlook the complex motion variations over time. Another approach is to explicitly model the motion dynamics by predicting a dense motion field (e.g., optical flow) [patraucean2015spatio, liu2017video, liang2017dual, reda2018sdc, liu2018future, li2018flow]. The dense motion field ensures temporal consistency between frames, but occlusion and large motion may limit precise motion estimation [gao2019disentangling].
Humans perceive an entire image, recognize the regional information such as moving objects [rensink2000dynamic], and can predict the next scene. Likewise, recent approaches in video prediction have attempted to decompose high-dimensional videos into various factors, such as pose/content [hsieh2018learning], motion/content [villegas2017decomposing, tulyakov2018mocogan], motion context [Lee2020long], multi-frequency [jin2020exploring], and object parts [xu2019unsupervised]. The decomposed factors are easier to predict since they address the prediction on lower-dimensional representations.
In this paper, we propose a video prediction framework that captures the global context and local motion patterns in the scene with two streams: local filter memory networks (LFMN) and global context propagation networks (GCPN). First, we present memory-based local filters that address the motion dynamics of objects over the frames. By leveraging the memory module [weston2014memory], LFMN can generate filter kernels that contain the prototypical motion of moving objects in the scene. Namely, the filter kernels with memory facilitate learning the pixel motions (Fig. 1) and predicting long-term future frames. Second, to capture the global context that lies over the consecutive frames, GCPN iteratively propagates the context from reference pixels to the entire image by computing the pairwise similarity in a non-local manner [wang2018non]. Through the propagation steps, GCPN aggregates the regions that have similar appearances to predict future frames consistent with the global context (Fig. 1). Finally, we integrate the global and local information by filtering the global feature from GCPN with memory-based kernels containing motion patterns from LFMN (Fig. 4).
The proposed method can preserve the global context and refine the movements of dynamic objects simultaneously, resulting in more accurate video prediction. In the experiments, we validate our method on the Caltech pedestrian [dollar2009pedestrian] and UCF101 [soomro2012ucf101] datasets. Our method shows state-of-the-art performance and improves the prediction on large-motion sequences and in multi-step prediction.
2 Related Work
Early studies in video prediction adopt recurrent neural networks (RNNs) to consider the temporal information between frames [ranzato2014video]. To achieve better performance in long-term prediction, Long Short-Term Memory (LSTM) networks [srivastava2015unsupervised] and ConvLSTM [xingjian2015convolutional] were proposed. A recent study [wang2019memory] exploits two cascaded adjacent recurrent states for spatio-temporal dynamics. Motivated by the recent success of 3D convolutions, a lightweight model uses a two-way autoencoder incorporating 3D convolutions [yu2019efficient], and another model integrates 3D convolutions into RNNs [wang2018eidetic]. Several recent methods have adopted GANs since they generate sharper results in image generation [kwon2019predicting, liang2017dual, mathieu2015deep, vondrick2016generating, tulyakov2018mocogan]. Since these methods focus on generating future frames at a global level through direct pixel synthesis, they show limited ability in capturing moving objects and large motions.
Another line of research computes pixel-wise motion information from consecutive frames and then explicitly learns it or exploits it as input [patraucean2015spatio, liu2017video, liang2017dual, reda2018sdc, liu2018future, li2018flow, gao2019disentangling, xu2019unsupervised]. The estimated flow information is partially utilized to predict the future movement of an object [xu2019unsupervised]. Moreover, motion information can be extended to the spatio-temporal domain [liu2017video] or to multi-step flow prediction [li2018flow] for video prediction. However, these methods estimate the next frames with local receptive fields and do not fully consider the global context. Moreover, some studies [pottorff2019video, xu2019unsupervised, tulyakov2018mocogan, Lee2020long, park2021vid] are evaluated only on relatively simple datasets such as Moving MNIST [srivastava2015unsupervised], KTH action [schuldt2004recognizing], and Human 3.6M [ionescu2013human3]. Unlike previous works, we conduct experiments on more challenging datasets, including Caltech and UCF101, and exploit two complementary attributes by incorporating global contextual information and local motion dynamics.
Future prediction errors of incomplete models can be attributed to two factors [wang2018eidetic, oprea2020review]: (i) systematic errors owing to a lack of modeling ability for deterministic variations, and (ii) the probabilistic and inherent uncertainty about the future [xu2018video, franceschi2020stochastic, wang2020probabilistic, guen2020disentangling]. Our work focuses on the first factor.
To address the long-term dependencies in video prediction, various methods have employed RNNs or LSTMs as their backbone models [atkeson1995memory, graves2013generating, hochreiter1997long]. However, their capacity is not large enough to accurately recover the features from past information. To overcome this problem, memory networks [weston2014memory] introduce an external memory component that can be read and written for prediction. The external memory can be used to address various vision problems, including video object segmentation [oh2019video, seong2020kernelized], image generation [zhu2019dm], object tracking [yang2018learning], and anomaly detection [gong2019memorizing, park2020learning]. In our work, we extend the idea of memory networks to make it suitable for our task, video prediction. Specifically, the memory in the local networks is dynamically recorded and updated with new prototypical motions of moving objects, which helps in predicting motion dynamics in videos.
3 Proposed Method
3.1 Problem Statement and Overview
Let $I_t$ denote the $t$-th frame of a video, and let $\{I_{t-T+1}, \dots, I_t\}$ be the sequence of the last $T$ frames. The goal of video prediction is to generate the next frames from this input sequence. The major challenge in video prediction is to handle the complex evolution of pixels by composing two key attributes of the scene, i.e., the context of the content and the motion dynamics.
To this end, as shown in Fig. 2, we devise a new framework that takes advantage of two complementary components estimated from two sub-networks: local filter memory networks (LFMN) and global context propagation networks (GCPN). Following previous works [reda2018sdc, kwon2019predicting, liu2018future], our networks take the concatenated frames as inputs and feed them into the encoder to obtain a high-dimensional embedded representation. LFMN takes the encoded representation to generate adaptive filter kernels that contain the prototypes of moving objects. GCPN transforms the encoded representation to capture the contextual information by iteratively propagating all the points in a non-local manner [wang2018non]. To incorporate global context and motion dynamics effectively, the intermediate features from GCPN are convolved with the filters generated by LFMN. Finally, the filtered features are fed into the decoder to predict the next frame.
3.2 Local Filter Memory Networks
To consider motion dynamics in video sequences, we introduce LFMN, which captures and memorizes the prototypical motion of moving objects from the encoded representation $\mathbf{z}$. For instance, as shown in Fig. 3, the objects in a sequence can move in different directions. We thus aim to learn to update and address kernel parameters that encode prototypical motion patterns and transform the global representation to be robust to dynamic changes.
We design the memory as a matrix $\mathbf{M} \in \mathbb{R}^{N \times C}$ containing $N$ real-valued memory vectors of fixed channel dimension $C$. The prototypical items of the encoded future-relevant local motion features are written into the memory module through updates. Given the encoded representation $\mathbf{z}$, the memory networks retrieve the memory items most similar to $\mathbf{z}$. The memory addressing operator [weston2014memory] takes the encoding $\mathbf{z}$ as a query to obtain the addressing weights. We compute the similarity between the encoded representation $\mathbf{z}$ and each memory item $\mathbf{m}_i$, and obtain the weight $w_i$ for each memory vector by applying a softmax operation:

$$w_i = \frac{\exp\big(d(\mathbf{z}, \mathbf{m}_i)\big)}{\sum_{j=1}^{N} \exp\big(d(\mathbf{z}, \mathbf{m}_j)\big)},$$

where $d(\cdot, \cdot)$ denotes a similarity measurement. Similar to [santoro2016one], we define $d$ as the cosine similarity:

$$d(\mathbf{z}, \mathbf{m}_i) = \frac{\mathbf{z}\,\mathbf{m}_i^{\top}}{\|\mathbf{z}\|\,\|\mathbf{m}_i\|}.$$

Given an encoded representation $\mathbf{z}$, the memory networks obtain an aggregated memory feature $\hat{\mathbf{z}}$ using the soft addressing weights:

$$\hat{\mathbf{z}} = \sum_{i=1}^{N} w_i\,\mathbf{m}_i.$$

The feature $\hat{\mathbf{z}}$ is a combination of the memory items most similar to $\mathbf{z}$. This allows our model to capture various local motion patterns using memory items. Finally, we generate the future-relevant local motion filter kernels using a filter generating network $\mathcal{G}$ such that

$$\mathbf{F} = \mathcal{G}(\hat{\mathbf{z}}),$$

where $\mathbf{F}$ denotes the generated local motion filters with kernel size $k$. This generates a filter kernel for each pixel, conditioned on the input video frames. While DFN [jia2016dynamic] is limited to predicting the next frame from a short input sequence, our LFMN captures long-term dynamics thanks to the memory. Fig. 3 visualizes the attention maps activated by memory items. The second row shows a prototype pattern (orange arrows) that is activated by vehicles moving to the left. The third row shows that another prototype pattern is activated by objects moving to the right. We can observe that each memory item addresses a particular movement pattern of objects.
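The soft memory read described in this section can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes cosine similarity as the similarity measure, treats the query and memory items as flat vectors, and omits the memory update and the filter generating network. The names `cosine_similarity`, `softmax`, and `read_memory` are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (assumed similarity measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-12)  # small epsilon avoids division by zero

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def read_memory(query, memory):
    """Soft-address the memory with a query; return (weights, aggregated feature)."""
    weights = softmax([cosine_similarity(query, item) for item in memory])
    channels = len(query)
    aggregated = [
        sum(w * item[c] for w, item in zip(weights, memory))
        for c in range(channels)
    ]
    return weights, aggregated

# Toy memory with three prototypical items and a query closest to the first one.
memory = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
query = [0.9, 0.1, 0.0]
weights, aggregated = read_memory(query, memory)
```

In this toy setting the first memory item receives the largest weight, so the aggregated feature is dominated by the prototype closest to the query while still mixing in the others.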
3.3 Global Context Propagation Networks
The objective of GCPN is to capture context over a spatially wide range by propagating neighboring observations. To capture the content from distant query locations within the image, GCPN is built upon non-local networks [wang2018non], which calculate the pairwise relationship regardless of the pixel position. However, non-local networks are weak at capturing contextual relationships [cao2019gcnet] and thus may not always capture similar content information effectively. Instead of directly using non-local networks, we construct a propagation step that aggregates more relevant elements along with the most discriminative parts.
Algorithm 1 summarizes GCPN. Given the encoded feature, we compute an affinity matrix representing the relationship between all points. Unlike [wang2018non], we update the affinity matrix iteratively to propagate the future-relevant context information. We multiply the updated affinity matrix with a non-local block consisting of linear transformation matrices (e.g., 1×1 convolutions). At each step, the feature is computed by aggregating the non-local neighboring representations [wang2018non], and the global context is propagated through GCPN by recomputing the affinity matrix via a self-attention operation.
At the final propagation step, the globally enhanced feature is calculated by summing the initial feature and the propagated feature with a weight matrix. This procedure allows GCPN to consider all positions for each location at each step and to produce non-locally enhanced feature representations by propagating the future-relevant contextual information.
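The iterative propagation can be illustrated with a small sketch. It is only one plausible reading of the description and not the authors' Algorithm 1: it assumes the affinity is a row-wise softmax over dot-product similarities, that the affinity is recomputed from the current features at every step, and that all learned linear transformations (the 1×1 convolutions and the final weight matrix) are replaced by the identity. The function names are hypothetical.

```python
import math

def row_softmax(matrix):
    """Apply a numerically stable softmax to every row of a matrix."""
    out = []
    for row in matrix:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def affinity(features):
    """Pairwise dot-product similarity between all positions, normalized per row."""
    scores = [
        [sum(a * b for a, b in zip(fi, fj)) for fj in features]
        for fi in features
    ]
    return row_softmax(scores)

def propagate(features, steps):
    """Aggregate non-local neighbors for `steps` iterations, then add the initial feature."""
    current = features
    n, c = len(features), len(features[0])
    for _ in range(steps):
        a = affinity(current)  # assumption: affinity is recomputed at each step
        current = [
            [sum(a[i][j] * current[j][ch] for j in range(n)) for ch in range(c)]
            for i in range(n)
        ]
    # Residual sum of the initial and propagated features (identity weight assumed).
    return [[x + y for x, y in zip(f0, fs)] for f0, fs in zip(features, current)]

# Four positions with two channels: positions 0/1 and 2/3 look alike.
features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
enhanced = propagate(features, steps=2)
```

Each position is a flattened pixel location here, so the affinity matrix has one row per pixel; positions with similar appearance reinforce one another as the number of steps grows.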
To incorporate global and local information, we merge the output features of GCPN with the filter kernels generated by LFMN through a convolutional filtering operation: the enhanced feature from GCPN is convolved with the filter kernels predicted by LFMN. In this process, an adaptive filter that contains motion dynamics information is applied at every pixel. The resulting features are fed into the decoder to estimate the next future frame.
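Filtering a feature map with a different kernel at every pixel can be sketched as follows. This is a simplified illustration, assuming a single-channel feature map, zero padding at the borders, and kernels stored as flat lists of length k*k; the paper does not specify these details here, and `local_filter` is a hypothetical name.

```python
def local_filter(feature, kernels, k):
    """Filter an H x W map with a distinct k x k kernel at every pixel (zero padding)."""
    height, width = len(feature), len(feature[0])
    r = k // 2
    out = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            kernel = kernels[y][x]  # flat list of k*k weights for this pixel
            acc = 0.0
            for dy in range(-r, r + 1):
                for dx in range(-r, r + 1):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < height and 0 <= xx < width:
                        acc += kernel[(dy + r) * k + (dx + r)] * feature[yy][xx]
            out[y][x] = acc
    return out

# A kernel that copies the left neighbor moves content one pixel to the right.
feature = [[0.0, 0.0, 0.0],
           [1.0, 0.0, 0.0],
           [0.0, 0.0, 0.0]]
shift_right = [0, 0, 0,
               1, 0, 0,
               0, 0, 0]
kernels = [[shift_right for _ in range(3)] for _ in range(3)]
shifted = local_filter(feature, kernels, k=3)
```

Because every pixel has its own kernel, different regions can be displaced differently, which is how per-pixel filters can encode local motion.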
3.4 Loss Function
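To train the proposed model, we minimize an overall objective function $\mathcal{L}$ that includes a reconstruction loss $\mathcal{L}_{rec}$ and a gradient loss $\mathcal{L}_{grad}$:

$$\mathcal{L} = \mathcal{L}_{rec} + \lambda\,\mathcal{L}_{grad},$$

where $\lambda$ is a weighting factor. The reconstruction loss $\mathcal{L}_{rec}$ measures the difference between an estimated future frame $\hat{I}_{t+1}$ and its corresponding ground-truth frame $I_{t+1}$:

$$\mathcal{L}_{rec} = \big\| \hat{I}_{t+1} - I_{t+1} \big\|_1,$$

where $\|\cdot\|_1$ is the $\ell_1$ norm, which does not over-penalize the error and thus enables sharper predicted images [reda2018sdc, niklaus2017video]. We also use the gradient loss $\mathcal{L}_{grad}$, similar to [mathieu2015deep, liu2018future], which computes the differences between the image gradients of the prediction and those of the ground truth and thereby encourages the preservation of image edges:

$$\mathcal{L}_{grad} = \sum_{i,j} \Big( \big|\, |\hat{I}_{i,j} - \hat{I}_{i-1,j}| - |I_{i,j} - I_{i-1,j}| \,\big| + \big|\, |\hat{I}_{i,j} - \hat{I}_{i,j-1}| - |I_{i,j} - I_{i,j-1}| \,\big| \Big),$$

where $\hat{I}_{i,j}$ and $I_{i,j}$ are the pixel elements from the estimated future frame and its corresponding ground truth, respectively, $(i, j)$ denotes the coordinates of the pixel along the width and height, and $|\cdot|$ indicates the absolute value function.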
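The combined objective can be sketched in plain Python for a single-channel image. This is a minimal illustration under assumptions: both terms are averaged over the number of pixels (the normalization is not stated in the text), gradients are taken as differences between adjacent pixels, and the default weight of 0.01 follows the value reported in the implementation details. The function names are hypothetical.

```python
def l1_loss(pred, target):
    """Mean absolute difference between two H x W images."""
    h, w = len(pred), len(pred[0])
    total = sum(abs(pred[i][j] - target[i][j]) for i in range(h) for j in range(w))
    return total / (h * w)

def gradient_loss(pred, target):
    """Mean absolute difference between absolute image gradients along both axes."""
    h, w = len(pred), len(pred[0])
    total = 0.0
    for i in range(h):
        for j in range(w):
            if i > 0:  # gradient along the first axis
                total += abs(abs(pred[i][j] - pred[i - 1][j])
                             - abs(target[i][j] - target[i - 1][j]))
            if j > 0:  # gradient along the second axis
                total += abs(abs(pred[i][j] - pred[i][j - 1])
                             - abs(target[i][j] - target[i][j - 1]))
    return total / (h * w)

def total_loss(pred, target, lam=0.01):
    """Reconstruction loss plus weighted gradient loss."""
    return l1_loss(pred, target) + lam * gradient_loss(pred, target)

# A sharp vertical edge and a blurred prediction of it.
target = [[0.0, 1.0], [0.0, 1.0]]
blurry = [[0.25, 0.75], [0.25, 0.75]]
loss = total_loss(blurry, target)
```

The blurred prediction is penalized twice: once for its pixel errors and once because its edge is weaker than the edge in the ground truth.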
3.5 Implementation Details
All models were trained end-to-end using PyTorch [paszke2017automatic] on an NVIDIA TITAN RTX, taking about 2 days. The network is trained using Adam [kingma2014adam] with an initial learning rate of 0.0002 and a batch size of 16 for 60 epochs. During training, the learning rate is reduced using cosine annealing [loshchilov2016sgdr] and the memory is randomly initialized. At the testing phase, we read the memory items based on the input query. All training video frames were normalized to the range [-1, 1]. We set the height, width, number of channels, and number of memory items to 64, 64, 64, and 20, respectively. When fewer past frames are available than the network requires, we copy the first frame as many times as needed to obtain the input frames for predicting the future frames. We achieved the best results by setting the generated kernel size to 5. We used a grid search on the validation set of the KITTI dataset to set the loss weighting factor ($\lambda$ = 0.01). Details of the network architecture are provided in the supplementary material.
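The learning-rate schedule and the input normalization can be sketched as follows. This is an illustration under assumptions: the schedule is updated once per epoch without warm restarts and decays to a minimum of zero, and frames are 8-bit images; none of these details are stated in the text. In PyTorch, the corresponding scheduler is `torch.optim.lr_scheduler.CosineAnnealingLR`; the helper names below are hypothetical.

```python
import math

def cosine_annealed_lr(epoch, total_epochs=60, base_lr=2e-4, min_lr=0.0):
    """Learning rate at a given epoch under cosine annealing without restarts."""
    cosine = 0.5 * (1.0 + math.cos(math.pi * epoch / total_epochs))
    return min_lr + (base_lr - min_lr) * cosine

def normalize_frame(pixels):
    """Map 8-bit intensities in [0, 255] to the range [-1, 1]."""
    return [p / 127.5 - 1.0 for p in pixels]

# Learning rate at the start, middle, and end of the 60-epoch schedule.
schedule = [cosine_annealed_lr(e) for e in (0, 30, 60)]
```

The rate starts at the initial value of 0.0002, reaches half of it midway through training, and decays to the assumed minimum at the final epoch.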
4.1 Experimental Settings
In the experiments, we consider representative baselines that are most relevant to our method: PredNet [lotter2016deep], BeyondMSE [mathieu2015deep], Dual-GAN [liang2017dual], MCnet [villegas2017decomposing], SDC-Net [reda2018sdc], Liu et al. [liu2018future], CtrlGen [hao2018controllable], DPG [gao2019disentangling], Kwon et al. [kwon2019predicting], Jin et al. [jin2020exploring], and CrevNet [yu2019efficient]. We used the pre-trained models provided by the authors for visual comparison. We additionally obtained the results from ContextVP [byeon2018contextvp].
For quantitative comparison, we employ the evaluation metrics most widely used for video prediction: Mean Squared Error (MSE), Structural Similarity Index Measure (SSIM), and Peak Signal-to-Noise Ratio (PSNR). Since these metrics mostly focus on pixel-level image quality, we additionally measure Learned Perceptual Image Patch Similarity (LPIPS) [zhang2018perceptual] as an evaluation metric for perceptual dissimilarity. Higher values of SSIM/PSNR and lower values of MSE/LPIPS indicate better quality.
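For reference, MSE and PSNR can be computed as in the following sketch. It assumes images flattened into lists of intensities in [0, 1], so the peak value is 1.0; the evaluation code used for the reported numbers may differ in value range and averaging. SSIM and LPIPS are omitted because they require windowed statistics and a pretrained network, respectively.

```python
import math

def mse(pred, target):
    """Mean squared error between two flattened images."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def psnr(pred, target, max_value=1.0):
    """Peak signal-to-noise ratio in dB; infinite for identical images."""
    error = mse(pred, target)
    if error == 0:
        return float("inf")
    return 10.0 * math.log10(max_value ** 2 / error)

# Every pixel is off by 0.1, so the MSE is 0.01 and the PSNR is 20 dB.
pred = [0.5, 0.5, 0.5, 0.5]
target = [0.6, 0.4, 0.6, 0.4]
score = psnr(pred, target)
```

Since PSNR is a logarithmic function of MSE, a gain of about 3 dB corresponds to halving the mean squared error.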
We evaluate our method on two datasets: the Caltech pedestrian [dollar2009pedestrian] and UCF101 [soomro2012ucf101] datasets. The Caltech pedestrian dataset consists of videos taken at various driving locations using vehicle-mounted cameras. To validate the generalization performance, we followed the experimental protocols of [byeon2018contextvp, kwon2019predicting, lotter2016deep], where the KITTI dataset with 41K images is used for training and the Caltech pedestrian dataset is used for testing. The KITTI dataset contains image sequences of driving scenes. Both datasets contain dynamic scenes since they were recorded from moving vehicles.
In addition, we use the UCF101 dataset, which is generally used for action recognition and contains videos focusing on human activities. This dataset mainly contains moving objects in various environments. Since it contains a large number of videos (13K videos), we use 10% of the videos, similar to previous works [kwon2019predicting, mathieu2015deep, byeon2018contextvp].
4.2 Ablation study
We compare the performance of our network trained with and without GCPN in Table 1 (left). The results show improved performance with GCPN, as it aggregates non-local neighboring features to consider the global context effectively. As shown in the third and fourth rows of Table 1 (left), GCPN yields a greater PSNR improvement than LFMN, as GCPN improves entire images while LFMN focuses on partial details. The proposed model, GCPN combined with LFMN, thus shows the best performance thanks to the benefits of both networks.
Table 1: Ablation results. Left: contribution of LFMN and GCPN in terms of PSNR, SSIM, and LPIPS. Right: effect of the number of memory items (0, 5, 10, 20, 30).
We compare the proposed model trained with and without LFMN, as shown in Table 1 (left). Compared to the baseline, a basic U-Net without LFMN and GCPN, our model with LFMN yields improved prediction performance on both the Caltech and UCF101 datasets. Fig. 4 shows examples validating that LFMN can record the local motion patterns and generate adaptive filter kernels by combining the prototypical elements of the encoded future-relevant features. The figure shows the output features of GCPN convolved with and without the filter kernels from LFMN. The third row shows the integration of the adaptive filter kernels with the output features from GCPN. With LFMN, the feature response around the moving objects becomes higher while that around the static regions decreases. The results show that the output features convolved with the filter kernels capture moving objects effectively.
Table 1 (right) shows the evaluation when varying the number of memory items from 0 to 30 to verify their effect. A model with more memory items generally shows better prediction results, since more diverse prototypical motions of moving objects can be recorded in the memory. The effect of LFMN starts to converge once the number of memory items becomes sufficiently large.
To demonstrate the changes at each step in GCPN, we visualize the feature maps on the Caltech dataset in Fig. 5, which shows the input sequence (left) and the 4th and 16th feature maps of GCPN (right). At each step, GCPN aggregates non-local neighbors, propagating the global spatial context. In addition, in terms of PSNR/SSIM, the result improves from 29.7/0.921 to 30.1/0.927 when the propagation step is set to 2. These results confirm that GCPN improves performance because it aggregates non-local neighboring features to consider the global context effectively.
|# of frames||Metric||2||4||6||8||10|
|Kwon et al. [kwon2019predicting]||PSNR||29.17||29.22||29.01||28.94||29.01|
In addition, as in [kwon2019predicting], we present an experiment to evaluate the effect of the number of input frames when predicting the next frame. Table 2 shows the quantitative results according to the number of input images. Similar to [kwon2019predicting], we achieve good performance when using the inputs from 2 to 6 frames, and the performance starts to decrease when using more inputs.
4.3 Comparison with state-of-the-art methods
4.3.1 Next-frame prediction
To evaluate the performance of the proposed method, we compare the accuracy of the next-frame predictions with several state-of-the-art video prediction methods. Sample results on the Caltech pedestrian [dollar2009pedestrian] dataset are shown in the first row of Fig. 6. Each model is trained only on the KITTI [geiger2013vision] dataset and evaluated on the Caltech pedestrian dataset without any additional fine-tuning. Our method uses 4 past frames as input, while PredNet [lotter2016deep] and ContextVP [byeon2018contextvp] use 10 past frames as input yet produce blurry results. MCnet [villegas2017decomposing] also shows blurry results and unnatural deformations on the highlighted car in the first row. ContextVP [byeon2018contextvp] shows many artifacts in moving parts of objects, such as the backside of a car.
|Method||Caltech PSNR||Caltech SSIM||Caltech LPIPS||UCF101 PSNR||UCF101 SSIM||UCF101 LPIPS|
|Kwon et al. [kwon2019predicting]||29.2||0.919||1.61||35.0||0.94||1.37|
|Jin et al. [jin2020exploring]||29.1||0.927||-||-||-||-|
To capture the motion dynamics, MCnet [villegas2017decomposing] takes the difference between two consecutive frames. Because it mainly focuses on short-term information, its performance in estimating dynamic motion is limited. In contrast, our method mitigates these problems by considering both the global contextual information and the local motion dynamics with the memory networks. The second row shows the results on the UCF101 [soomro2012ucf101] dataset, which contains human actions taken in the wild and exhibits various challenges. The results show that our method produces sharp and visually pleasing predictions compared to the state-of-the-art methods. More qualitative results are provided in the supplementary materials.
Table 3 shows a quantitative comparison with several state-of-the-art methods on both datasets. We also report the results obtained by copying the last frame, a trivial baseline that uses the most recent past frame as the prediction. The last two rows of Table 3 show that the gradient loss improves the performance by preserving the image edges. Our method significantly outperforms the baselines with decomposition (MCnet [villegas2017decomposing], DPG [gao2019disentangling], CtrlGen [hao2018controllable]) and those without disentanglement. For DPG [gao2019disentangling] and CtrlGen [hao2018controllable], training with estimated optical flows may lead to erroneous supervision signals. In contrast, our method outperforms the state-of-the-art methods thanks to GCPN, LFMN, and the integration of the two networks.
Table 4: Running time comparison of PredNet [lotter2016deep], MCnet [villegas2017decomposing], and ours.
Table 4 compares the running time of our method with that of existing methods. We follow the original settings of all the released code. On the Caltech dataset, GCPN and LFMN take 0.161s and 0.112s on average, respectively. On average, our full method takes about 0.451s.
4.3.2 Multi-step prediction
To validate the temporal and spatial consistency in the long-term future, we present an experiment on multi-step prediction [lotter2016deep, hao2018controllable, kwon2019predicting, gao2019disentangling, villegas2017decomposing]. The experimental scheme is as follows.
For example, a network that requires 4 input frames takes the first 4 consecutive frames as input to predict the 5th frame. The 6th frame is then predicted by using the 2nd to 4th frames and the predicted 5th frame as input. We continue to predict the next future frame from the previously predicted results. Fig. 7 shows the quantitative results for multi-step prediction. All the models take 4 frames as input, except PredNet [lotter2016deep], and recursively predict the next 15 frames. Our method consistently outperforms recent approaches on all metrics over time. In terms of LPIPS, while Kwon et al. [kwon2019predicting] and DPG [gao2019disentangling] degrade significantly after 6 steps, the performance of our method does not drop quickly because its earlier predictions are more accurate. The proposed method achieves outstanding results in anticipating far-future frames compared to state-of-the-art methods with and without explicit motion estimation. Due to the page limit, the qualitative comparisons with state-of-the-art methods are provided in the supplementary materials.
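The recursive scheme can be written as a short sliding-window loop. In this sketch, `predict_next` stands for any trained next-frame predictor; the toy `linear_extrapolation` function and the scalar "frames" are placeholders chosen only so the example runs without a network, and all names are hypothetical.

```python
def rollout(initial_frames, predict_next, num_steps):
    """Recursively predict num_steps frames, feeding each prediction back as input."""
    window = list(initial_frames)
    predictions = []
    for _ in range(num_steps):
        next_frame = predict_next(window)
        predictions.append(next_frame)
        # Drop the oldest frame and append the prediction to keep the window size fixed.
        window = window[1:] + [next_frame]
    return predictions

def linear_extrapolation(frames):
    """Toy stand-in for a trained network: continue the last observed change."""
    return 2 * frames[-1] - frames[-2]

# Four observed "frames" (scalars here) rolled out for 15 future steps.
predictions = rollout([1, 2, 3, 4], linear_extrapolation, num_steps=15)
```

After four steps the input window consists entirely of predicted frames, which is why errors can accumulate and why per-step metrics are reported over time.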
In this paper, we presented a video prediction framework that includes GCPN and LFMN, two complementary convolutional neural networks, to capture global contextual information and local motion patterns of objects, respectively. GCPN considers the global context by iteratively aggregating the non-local neighboring representations. LFMN generates adaptive filter kernels using memory items that hold the prototypical motion of moving objects. The integration of the two networks preserves the global context and refines the movements of dynamic objects simultaneously. Our in-depth analysis shows that the proposed method significantly outperforms state-of-the-art methods on next-frame prediction as well as multi-step prediction. In future work, we will investigate the applicability of our model to other applications such as video semantic segmentation and anomaly detection.
This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIP) (NRF-2021R1A2C2006703) and was supported by the Yonsei University Research Fund of 2021 (2021-22-0001).