Exploring Spatial-Temporal Multi-Frequency Analysis for High-Fidelity and Temporal-Consistency Video Prediction

by   Beibei Jin, et al.
Capital Normal University

Video prediction is a pixel-wise dense prediction task that infers future frames from past frames. Missing appearance details and motion blur remain two major problems for current predictive models, leading to image distortion and temporal inconsistency. In this paper, we point out the necessity of exploring multi-frequency analysis to deal with these two problems. Inspired by the frequency band decomposition characteristic of the Human Visual System (HVS), we propose a video prediction network based on multi-level wavelet analysis to deal with spatial and temporal information in a unified manner. Specifically, the multi-level spatial discrete wavelet transform decomposes each video frame into anisotropic sub-bands with multiple frequencies, helping to enrich structural information and preserve fine details. On the other hand, the multi-level temporal discrete wavelet transform, which operates on the time axis, decomposes the frame sequence into sub-band groups of different frequencies to accurately capture multi-frequency motions under a fixed frame rate. Extensive experiments on diverse datasets demonstrate that our model shows significant improvements in fidelity and temporal consistency over state-of-the-art works.




1 Introduction

Unsupervised video prediction, as a fundamental vision problem, has attracted increasing attention from the research community and AI companies. It aims at predicting upcoming future frames based on the observation of previous frames. This looking-ahead ability has broad application prospects in video surveillance [11], robotic systems [12] and autonomous vehicles [47]. However, building an accurate predictive model remains very challenging because it requires mastering not only the visual abstraction model of different objects but also the evolution of various motions over time.

A variety of recent deep learning methods [22, 46, 36, 3, 39, 38, 43, 21] have brought about great development on the video prediction task since the pioneering work of [35]. However, there still exists a clear gap between their predictions and the ground-truth (GT), as shown in Figure 1. The predictions of the compared methods suffer from deficient retention of high-frequency details and insufficient use of motion information, which results in distortion and temporal inconsistency. We detail the reasons mainly in the following two aspects:

Figure 1: A comparison of long-term prediction on a KTH [34] motion sequence. Our model generates predictions with higher fidelity and temporal consistency than the state-of-the-art methods, SAVP [22] and VarNet [19]. In the other two methods' predictions, the person gradually blurs to distortion and runs out of the image too fast or too slowly, which is inconsistent with the ground truth.
Figure 2: Discrete Wavelet Transform (DWT) on the time axis can capture the different motion frequencies of the slower car and the faster truck. (a) is a video sequence of length six. DWT of (a) on the time axis results in the sub-bands in (b). (c) shows the heat maps of the right three sub-bands in (b), which clearly show the difference between their movements.
Figure 3: (A): Discrete Wavelet Transform in Spatial dimension (DWT-S) decomposes an image into one low frequency sub-band (LL) and three high frequency sub-bands (LH, HL, HH) of different directions (horizontal, vertical, diagonal). (B): A visualization example of (A). (C): Discrete Wavelet Transform in Temporal dimension (DWT-T) decomposes an image sequence into low frequency sub-bands and high frequency sub-bands on the time axis. (D): A visualization example of (C). The sub-bands are visualized in heatmap style.

Loss of details. Down-sampling is commonly adopted to enlarge the receptive field and extract global information, resulting in inevitable loss of high-frequency details. However, video prediction is a pixel-wise dense prediction problem. Sharp predictions would not be made without the assistance of fine details. Although dilated convolution can be employed to avoid using down-sampling, it has the problem of grid effect and is not friendly to small objects, which hinders the application to video prediction.

Insufficient exploitation of temporal motions. Dynamic scenes are composed of motions occurring at more than one temporal frequency. In Figure 2, we can observe the slower temporal motion of the smaller car on the left and the faster temporal motion of the bigger truck on the right. They have different moving frequencies. However, previous methods usually process the frames one by one at a fixed frame rate. Although Recurrent Neural Networks (RNNs) are used to memorize dynamic dependencies, they cannot distinguish motions at different frequencies or analyze the time-frequency characteristics of temporal information.

Therefore, it is necessary to introduce multi-frequency analysis into the video prediction task. Biological studies [16, 4] have shown that the Human Visual System (HVS) exhibits multi-channel characteristics for spatial and temporal frequency information. The retinal images are decomposed into different frequency bands with approximately equal bandwidth on a logarithmic scale for processing [29], which include a low frequency band and multiple high frequency bands. Besides the spatial dimension, there is also a similar frequency band decomposition in the temporal dimension. These characteristics enable the HVS to process visual content with better discrimination of detailed information and motion information. Wavelet analysis [6, 1] is a spatial-scale (temporal-frequency) analysis method with the characteristic of multi-resolution (frequency) analysis. It can represent the local characteristics of spatial (temporal) frequency signals well, which is very similar to the HVS.

Discrete Wavelet Transform (DWT) is a common wavelet analysis method for image processing. As shown in Figure 3(B), the Discrete Wavelet Transform in Spatial dimension (DWT-S, Figure 3(A)) decomposes an image into one low frequency sub-band and three anisotropic high frequency sub-bands of different directions (horizontal, vertical, diagonal). Figure 3(D) shows how the Discrete Wavelet Transform in Temporal dimension (DWT-T, Figure 3(C)) decomposes a video sequence of length four into two high-frequency sub-bands and two low-frequency sub-bands on the time axis. The frequency on the time axis here can be viewed as how fast the pixels change with time, which is related to temporal motions. Inspired by the characteristics of the HVS and the wavelet transform, we propose to explore multi-frequency analysis for high-fidelity and temporal-consistency video prediction. The main contributions are summarized as follows:

  • To the best of our knowledge, we are the first to propose a video prediction framework based on multi-frequency analysis that is trainable in an end-to-end manner.

  • To strengthen the spatial details, we develop a multi-level Spatial Wavelet Analysis Module (S-WAM) to decompose each frame into one low-frequency approximation sub-band and three high-frequency anisotropic detail sub-bands. The high-frequency sub-bands represent the boundary details well and are in favor of sharpening the prediction details. Besides, multi-level decomposition forms a spatial frequency pyramid, helping to extract objects’ features with multi scales.

  • To fully exploit the multi-frequency temporal motions of objects in dynamic scenes, we employ a multi-level Temporal Wavelet Analysis Module (T-WAM) to decompose the buffered video sequence into sub-bands with different frequencies on the time axis, promoting the description of multi-frequency motions and helping to comprehensively capture dynamic representations.

  • Both quantitative and qualitative experiments on diverse datasets demonstrate a significant performance boost over the state of the art. Ablation studies show the generalization ability of our model and evaluate its sub-modules.

2 Related Work

2.1 Video Generation and Video Prediction

Video generation aims to synthesize photo-realistic image sequences without the need to guarantee the fidelity of the results. It focuses on modeling the uncertainty of the dynamic development of video to produce results that may be inconsistent with the ground truth but are still reasonable. In contrast, video prediction performs deterministic image generation. It needs not only to focus on the per-frame visual quality, but also to master the internal temporal features to determine the most reliable development trend, the one closest to the ground truth.

Stochastic Video Generation. Stochastic video generation models focus on handling the inherent uncertainty in predicting the future. They seek to generate multiple possible futures by incorporating stochastic models. Probabilistic latent variable models such as Variational Auto-Encoders (VAEs) [20, 33] and Variational Recurrent Neural Networks (VRNNs) [7] are the most commonly used structures. [2] developed a stochastic variational video prediction (SV2P) method that predicted a different possible future for each sample of its latent variables, which was the first to provide effective stochastic multi-frame generation for real-world videos. SVG [8] proposed a generation model that combined deterministic prediction of the next frame with stochastic latent variables, introducing a per-step latent variable model (SVG-FP) and a variant with a learned prior (SVG-LP). SAVP [22] proposed a stochastic generation model combining VAEs and GANs. [5] extended the VRNN formulation by proposing a hierarchical variant that used multiple levels of latents per timestep.

High-fidelity Video Prediction. High-fidelity video prediction models aim to produce naturalistic image sequences as close to the ground truth as possible. The main consideration is to minimize the reconstruction error between the true future frame and the generated future frame. Such models can be classified as direct prediction models [35, 46, 43, 21, 3, 39, 30, 38, 18, 25] and transformation-based prediction models [49, 40, 37, 32]. Direct prediction models predict pixel values of future frames directly. In general, they use a combination of feed-forward and recurrent neural networks to encode spatial and temporal features, and then decode them into the prediction with a corresponding decoding network. Generative adversarial networks (GANs) are often employed to make the predicted frames more realistic. For example, [24] developed a dual motion Generative Adversarial Net (GAN) architecture to explicitly enforce future-frame predictions to be consistent with the pixel-wise flows. [21] trained a single generator that predicts both future and past frames, enforcing the consistency of bi-directional prediction with a retrospective prediction scheme. They also employed two discriminators, one to identify fake frames and one to distinguish image sequences containing fake frames from real sequences. Transformation-based prediction models aim at modeling the source of variability and operate in the space of transformations between frames. They focus on learning the transformation kernels between frames, which are applied to the previous frames to synthesize the future frames indirectly.

Here, the latent variables used in stochastic video generation models are not considered in our model. Such models learn and sample from a space of possible futures to generate the subsequent frames. Although reasonable results can be generated by sampling different latent variables, there is no guarantee of consistency with the ground truth. Moreover, the quality of the generated results varies from sample to sample, which is uncontrollable. This limits the application of such models in practical tasks requiring a high degree of certainty, such as autonomous driving. We focus on high-fidelity video prediction, aiming to construct a model that predicts realistic future frame sequences as close to the ground truth as possible. To overcome the challenges of lack of details and motion blur, we propose to explore multi-frequency analysis based video prediction by incorporating the wavelet transform into a generative adversarial network.

Figure 4: The pipeline architecture of our network. Note that the diagram takes the next frame prediction as an example. Multi-frame prediction can be done by feeding the predicted frame into the encoder network.

2.2 Wavelet Transform

Wavelet Transform (WT) has been widely applied in image compression [6], image reconstruction [17] and many other fields. In the Wavelet Transform, a scalable modulation window is moved along the signal, calculating the spectrum at each position, and the process is then repeated multiple times with a slightly shorter (or longer) window. The result is a collection of time-frequency representations of the signal with different resolutions (frequencies). In image processing, the Discrete Wavelet Transform (DWT) is often used. A fast implementation using filter banks is proposed in [28]. The filter bank implementation of wavelets can be interpreted as computing the wavelet coefficients of a discrete set of child wavelets for a given mother wavelet. Following [28], we illustrate the process of DWT on the space axes of an image and DWT on the time axis of a video sequence in Figure 3. Multi-level DWT can be done by repeating a similar process on the sub-band images. The multi-resolution (frequency) analysis of DWT is consistent with the Human Visual System (HVS), which provides a biological basis for our approach. We refer readers to [28] to learn more about the Discrete Wavelet Transform (DWT).
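
As a rough illustration of the filter-bank view, the sketch below implements a single-level 2D Haar DWT in NumPy and repeats it on the low-frequency sub-band to obtain a multi-level decomposition. The choice of the Haar wavelet, the function names (`haar_dwt2`, `haar_idwt2`, `haar_dwt2_multilevel`) and the LH/HL naming convention are assumptions made for illustration only; the text does not state which mother wavelet is used.

```python
import numpy as np


def haar_dwt2(image):
    """Single-level 2D Haar DWT of an (H, W) array with even H and W."""
    image = np.asarray(image, dtype=np.float64)
    a = image[0::2, 0::2]  # top-left pixel of every 2x2 block
    b = image[0::2, 1::2]  # top-right pixel
    c = image[1::2, 0::2]  # bottom-left pixel
    d = image[1::2, 1::2]  # bottom-right pixel
    ll = (a + b + c + d) / 2.0  # scaled local average: low-frequency approximation
    lh = (a + b - c - d) / 2.0  # top row minus bottom row: responds to horizontal edges
    hl = (a - b + c - d) / 2.0  # left column minus right column: responds to vertical edges
    hh = (a - b - c + d) / 2.0  # diagonal detail
    return ll, lh, hl, hh


def haar_idwt2(ll, lh, hl, hh):
    """Inverse of haar_dwt2: rebuild the (2H, 2W) image from four sub-bands."""
    h, w = ll.shape
    image = np.empty((2 * h, 2 * w), dtype=np.float64)
    image[0::2, 0::2] = (ll + lh + hl + hh) / 2.0
    image[0::2, 1::2] = (ll + lh - hl - hh) / 2.0
    image[1::2, 0::2] = (ll - lh + hl - hh) / 2.0
    image[1::2, 1::2] = (ll - lh - hl + hh) / 2.0
    return image


def haar_dwt2_multilevel(image, levels):
    """Repeat the DWT on the LL sub-band; returns the final LL and per-level details."""
    ll = np.asarray(image, dtype=np.float64)
    details = []
    for _ in range(levels):
        ll, lh, hl, hh = haar_dwt2(ll)
        details.append((lh, hl, hh))
    return ll, details
```

Each level halves the spatial resolution of the sub-bands, so an 8x8 input decomposed over three levels yields detail sub-bands of size 4x4, 2x2 and 1x1. Because this Haar variant is orthonormal, the sub-bands together carry the same energy as the input and the inverse transform recovers it exactly.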

3 Method

3.1 Problem Statement

We aim to synthesize future frames of high fidelity and temporal consistency by observing several beginning frames. Let $X = \{x_1, \dots, x_m\}$ be the input of length $m$. $x_t \in \mathbb{R}^{H \times W \times C}$ represents the $t$-th frame, where $t \in \{1, \dots, m\}$. $H$, $W$ and $C$ are the height, width and channel number. Let $Y = \{y_1, \dots, y_n\}$ represent the ground truth of the future frame sequence of length $n$ and $\hat{Y} = \{\hat{y}_1, \dots, \hat{y}_n\}$ represent the prediction of $Y$. The goal is to minimize the reconstruction error between $\hat{Y}$ and $Y$. For the sake of clarity, we will introduce the network in detail by taking next-frame prediction as an example.
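
The relation between next-frame prediction and multi-frame prediction can be sketched as an autoregressive loop, in which each predicted frame is appended to the history before the next step. This is only a minimal sketch: `rollout` and `copy_last_frame` are hypothetical names, the predictor is a trivial placeholder rather than the network described here, and feeding the whole growing history back in is an assumption (a recurrent model would normally carry its state instead).

```python
import numpy as np


def rollout(predict_next, observed, horizon):
    """Predict `horizon` future frames by feeding each prediction back as input.

    predict_next: callable mapping a (T, H, W, C) array to one (H, W, C) frame.
    observed:     (m, H, W, C) array of observed frames.
    """
    history = [frame for frame in observed]
    predictions = []
    for _ in range(horizon):
        next_frame = predict_next(np.stack(history))
        predictions.append(next_frame)
        history.append(next_frame)  # the prediction becomes part of the input
    return np.stack(predictions)


def copy_last_frame(frames):
    """Placeholder predictor (hypothetical): repeats the most recent frame."""
    return frames[-1]


observed = np.linspace(0.0, 1.0, 10 * 8 * 8).reshape(10, 8, 8, 1)
future = rollout(copy_last_frame, observed, horizon=20)
print(future.shape)
```

Any error in an early prediction is carried into later steps by this loop, which is one reason long-horizon results tend to degrade.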

3.2 Network Architecture

We adopt a generative adversarial network as the model structure. The generator G and the discriminator D are trained with competing goals: G aims to predict frames that can fool D, while D aims to distinguish whether the input samples are real (from the training dataset) or fake (from G).

Figure 4 shows the overall block diagram of the generator G for predicting the frame at the next time step. It follows an encoder-decoder architecture. The encoder aims to transform the input sequence into a hidden feature tensor, while the decoder is in charge of decoding the feature tensor to generate the prediction of the next frame. The encoder consists of three parts: the stem CNN-LSTM, cascaded Spatial Wavelet Analysis Modules (S-WAMs) and the Temporal Wavelet Analysis Module (T-WAM). The decoder is composed of deconvolution and up-sampling layers.

The stem encoder is a CNN-LSTM structure. At each time step, the current frame is passed through the stem network to extract multi-scale spatial information under different receptive fields. As video prediction is a pixel-wise visual task, to pursue a better expression of appearance features, we base the design of our stem CNN on the Residual-in-Residual Dense Block (RRDB) proposed by [41]. It is a combination of a multi-level residual network and dense connections. We make one modification: a down-sampling layer is added to each RRDB unit to reduce the size of the feature maps (the modified unit is referred to as m_RRDB below).

To preserve more high-frequency spatial details in prediction, and considering the multi-resolution analysis of the wavelet transform, we propose a Spatial Wavelet Analysis Module (S-WAM) to enhance the representation of high-frequency information. As illustrated in Figure 4, S-WAM consists of two stages. First, the input is decomposed into one low-frequency sub-band and three high-frequency detail sub-bands by the Discrete Wavelet Transform on Spatial dimension (DWT-S). Second, the sub-bands are fed into a shallow CNN for further feature extraction and to obtain the same number of channels as the corresponding m_RRDB unit. We cascade three S-WAMs to do multi-level wavelet analysis. The output of each level of S-WAM is added to the corresponding feature tensors of the m_RRDB unit. The cascaded S-WAMs compensate the stem network with details under multiple frequencies, which promotes the prediction of fine details.

On the other side, to model the temporal multi-frequency motions in video sequences, we design a multi-level Temporal Wavelet Analysis Module (T-WAM) that decomposes the sequence into sub-bands of different frequencies on the time axis. As shown in Figure 3(C), a one-level DWT of a sequence of length $T$ on the temporal axis decomposes it into $T/2$ low-frequency sub-bands and $T/2$ high-frequency sub-bands. In our experiments, we apply multi-level DWT on the temporal dimension (DWT-T) to the input sequence until the number of low-frequency or high-frequency sub-bands equals two. We take three levels of DWT-T as an example in Figure 4. Then we concatenate those sub-bands as the input of a CNN to extract features and adjust the size of the feature maps. The output is fused with the historical information from the LSTM cell to strengthen the model's ability to distinguish multi-frequency motions. The fused feature tensors from the encoder network are fed to the decoder network to generate the prediction of the next frame. We construct a discriminator network as in [30] and train it to classify an input containing the real next frame into class 1 and an input containing the predicted frame into class 0.
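
The temporal decomposition can be sketched in the same spirit: a one-level Haar DWT along the time axis turns a sequence into half as many low-frequency and high-frequency sub-bands, and the step is repeated on the low-frequency part until two sub-bands remain. The Haar wavelet, the function names and the rule of stopping early when the remaining length is odd are assumptions of this sketch; how sequence lengths that are not powers of two are handled is not stated in the text.

```python
import numpy as np


def haar_dwt_t(frames):
    """One-level Haar DWT along axis 0 of a (T, H, W) array with even T."""
    even, odd = frames[0::2], frames[1::2]
    low = (even + odd) / np.sqrt(2.0)   # temporal average: slow changes
    high = (even - odd) / np.sqrt(2.0)  # temporal difference: fast changes
    return low, high


def multilevel_dwt_t(frames, min_bands=2):
    """Repeat the temporal DWT on the low-frequency part until `min_bands` remain."""
    low = np.asarray(frames, dtype=np.float64)
    highs = []
    # Assumption: stop early if the remaining length cannot be halved evenly.
    while low.shape[0] > min_bands and low.shape[0] % 2 == 0:
        low, high = haar_dwt_t(low)
        highs.append(high)
    return low, highs


# A single bright pixel moving one column per frame over 8 frames.
moving = np.zeros((8, 1, 8))
for t in range(8):
    moving[t, 0, t] = 1.0

low, highs = multilevel_dwt_t(moving)
print(low.shape, [h.shape for h in highs])
```

For a static scene every high-frequency sub-band is zero, whereas moving content leaves non-zero responses there, which is the sense in which temporal frequency reflects how fast pixels change.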

3.3 Loss Function

We adopt a multi-module loss which consists of the image domain loss and the adversarial loss.

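Image Domain Loss. We combine the $L_2$ loss with the Gradient Difference Loss (GDL) [30] as the image domain loss. With $y$ denoting the ground-truth next frame and $\hat{y}$ its prediction:

$$\mathcal{L}_{img}(y, \hat{y}) = \mathcal{L}_{2}(y, \hat{y}) + \mathcal{L}_{gdl}(y, \hat{y}) \tag{1}$$

We define the $L_2$ loss as:

$$\mathcal{L}_{2}(y, \hat{y}) = \| y - \hat{y} \|_2^2 \tag{2}$$

And the GDL loss is given by:

$$\mathcal{L}_{gdl}(y, \hat{y}) = \sum_{i,j} \Big| \, |y_{i,j} - y_{i-1,j}| - |\hat{y}_{i,j} - \hat{y}_{i-1,j}| \, \Big|^{\alpha} + \Big| \, |y_{i,j-1} - y_{i,j}| - |\hat{y}_{i,j-1} - \hat{y}_{i,j}| \, \Big|^{\alpha} \tag{3}$$

where $\alpha$ is an integer greater than or equal to $1$, and $|\cdot|$ is the absolute value function.

Adversarial Loss. Adversarial training involves a generator G and a discriminator D, where D learns to distinguish whether the frame sequence is from the real dataset or produced by G. The two networks are trained alternately, both improving until D can no longer discriminate the frame sequence generated by G. In our model, the prediction model is regarded as the generator. We formulate the adversarial loss on the discriminator D as:

$$\mathcal{L}_{adv}^{D} = -\log D([X, y]) - \log\big(1 - D([X, \hat{y}])\big) \tag{4}$$

and the adversarial loss for the generator G as:

$$\mathcal{L}_{adv}^{G} = -\log D([X, \hat{y}]) \tag{5}$$

where $X$ is the observed input sequence and $[\cdot, \cdot]$ denotes concatenation along the time axis.

Hence, we combine the losses previously defined for our generator model with different weights:

$$\mathcal{L}_{G} = \lambda_1 \mathcal{L}_{img} + \lambda_2 \mathcal{L}_{adv}^{G} \tag{6}$$

where $\lambda_1$ and $\lambda_2$ are hyper-parameters to trade off between these distinct losses.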
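
A minimal NumPy sketch of the generator objective follows. It assumes a squared-error reconstruction term, the Gradient Difference Loss of [30] with exponent `alpha`, and a non-saturating adversarial term on the discriminator's score for the predicted sequence. The function names and the default weights `lambda_img` and `lambda_adv` are placeholders for illustration, not values taken from the paper.

```python
import numpy as np


def l2_loss(pred, target):
    """Sum of squared pixel differences."""
    return np.sum((pred - target) ** 2)


def gdl_loss(pred, target, alpha=1):
    """Gradient Difference Loss: compares neighbouring-pixel differences."""
    # Absolute differences between vertically adjacent pixels.
    dy_pred = np.abs(pred[1:, :] - pred[:-1, :])
    dy_true = np.abs(target[1:, :] - target[:-1, :])
    # Absolute differences between horizontally adjacent pixels.
    dx_pred = np.abs(pred[:, 1:] - pred[:, :-1])
    dx_true = np.abs(target[:, 1:] - target[:, :-1])
    return (np.sum(np.abs(dy_true - dy_pred) ** alpha)
            + np.sum(np.abs(dx_true - dx_pred) ** alpha))


def generator_loss(pred, target, d_fake, lambda_img=1.0, lambda_adv=0.01, alpha=1):
    """Weighted sum of the image-domain loss and the adversarial loss.

    d_fake is the discriminator's probability that the predicted sequence is
    real; the weights are placeholder values (assumption).
    """
    image_loss = l2_loss(pred, target) + gdl_loss(pred, target, alpha)
    adversarial = -np.log(d_fake + 1e-12)  # small constant avoids log(0)
    return lambda_img * image_loss + lambda_adv * adversarial


# A vertical edge that a blank prediction misses entirely.
target = np.array([[0.0, 1.0], [0.0, 1.0]])
blank = np.zeros_like(target)
print(l2_loss(blank, target), gdl_loss(blank, target))
```

The gradient term penalizes a prediction whose edges are weaker than those of the target even when the per-pixel error is small, which is why it is commonly paired with a pixel-wise loss to counter blur.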

Figure 5: Quantitative comparison of different prediction models on BAIR datasets. Higher values for both PSNR and SSIM indicate better performance.

4 Experiments

In this section, we will first introduce the experiment setup, and then show the quantitative and qualitative evaluation on diverse datasets. Besides, we do ablation studies to show our model’s generalization capability and evaluation of sub-modules.

4.1 Experiment Setup

  • Datasets. We perform experiments on diverse datasets widely used to evaluate video prediction models. The KTH dataset [34] contains 6 types of actions from 25 persons. We use persons 1-16 for training and 17-25 for testing. Models are trained to predict the next 10 frames based on the observation of the previous 10 frames. The prediction range at test time is extended to 20 or 40 frames. The hyper-parameters in the loss function (Eq. 6) on the KTH dataset are: and . The BAIR dataset [10] consists of a randomly moving robotic arm that pushes objects on a table. This dataset is particularly challenging due to the high stochasticity of the arm movements and the diversity of the background. We follow the setup in [22] and train the models to predict 28 future frames. The hyper-parameters in the loss function (Eq. 6) on the BAIR dataset are: and . In addition, following the experimental settings in [24], we validate the generalization ability of our models on the car-mounted camera datasets (train: KITTI dataset [14], test: Caltech Pedestrian dataset [9]). The hyper-parameters are: and .

  • Metrics. Quantitative evaluation of the accuracy on the testing datasets is performed based on the Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index Measure (SSIM) metrics [45]. Higher values indicate better results. To measure the realism of the predicted results, we employ the Learned Perceptual Image Patch Similarity (LPIPS) metric [48]. Lower values for LPIPS indicate better results.

  • Baselines. We consider representative baselines from two categories: stochastic video generation models [22, 2, 8, 5] and deterministic video prediction models [38, 31, 44, 42, 19, 23, 43, 3, 13, 21, 15]. All the experiment results of the baselines are from the cited papers or reproduced by the pre-trained models the authors report online.
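
Of the metrics listed above, PSNR is simple enough to state directly: it is the log-ratio of the squared peak pixel value to the mean squared error. The sketch below uses the standard definition; the function name `psnr` and the default peak value of 1.0 (frames scaled to [0, 1]) are assumptions.

```python
import numpy as np


def psnr(pred, target, max_val=1.0):
    """Peak Signal-to-Noise Ratio in dB between two frames of equal shape."""
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    mse = np.mean((pred - target) ** 2)
    if mse == 0:
        return float("inf")  # identical frames
    return 10.0 * np.log10(max_val ** 2 / mse)


target = np.zeros((4, 4))
pred = np.full((4, 4), 0.1)  # uniform error of 0.1 gives an MSE of 0.01
print(psnr(pred, target))
```

When a whole predicted sequence is evaluated, the per-frame values are typically averaged over the predicted frames, as the table captions in this section describe.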

KTH                        10 -> 20 frames           10 -> 40 frames
Method                     PSNR    SSIM    LPIPS     PSNR    SSIM    LPIPS
MCNET [38]                 25.95   0.804   -         23.89   0.73    -
fRNN [31]                  26.12   0.771   -         23.77   0.678   -
PredRNN [44]               27.55   0.839   -         24.16   0.703   -
PredRNN++ [42]             28.47   0.865   -         25.21   0.741   -
VarNet [19]                28.48   0.843   -         25.37   0.739   -
E3D-LSTM [43]              29.31   0.879   -         27.24   0.810   -
MSNET [23]                 27.08   0.876   -         -       -       -
SAVP [22]                  25.38   0.746   9.37      23.97   0.701   13.26
SAVP-VAE [22]              27.77   0.852   8.36*     26.18   0.811   11.33*
SV2P time-invariant [2]    27.56   0.826   17.92     25.92   0.778   25.21
SV2P time-variant [2]      27.79   0.838   15.04     26.12   0.789   22.48
Ours                       29.85*  0.893*  11.81     27.56*  0.851*  14.13
Ours (no S-WAM)            29.13   0.872   12.33     26.42   0.805   16.06
Ours (no T-WAM)            28.57   0.839   15.16     26.08   0.782   17.45
Table 1: Average results for predicting 20 time steps and 40 time steps from 10 observed time steps on the KTH dataset. The metrics are averaged over the predicted frames. The best results under each metric are marked with an asterisk (*).
BAIR
Method                     PSNR    SSIM    LPIPS
SAVP [22]                  18.42   0.789   6.34
SAVP-VAE [22]              19.09   0.815   6.22
SV2P time-invariant [2]    20.36   0.817   9.14
SVG-LP [8]                 17.72   0.815   6.03
Improved VRNN [5]          -       0.822   5.50*
Ours                       21.02*  0.844*  9.36
Table 2: Quantitative evaluation of different methods on the BAIR dataset. The metrics are averaged over the predicted frames. The best results under each metric are marked with an asterisk (*).
Figure 6: The prediction visualization of future 40 time steps based on the 10 frames on the KTH dataset.
Figure 7: The prediction visualization comparison of future 28 time steps on the BAIR action free dataset. Our model predicts more consistent results to the ground truth.
Figure 8: Visualization examples on the KITTI dataset (the first group) and the Caltech Pedestrian dataset (the second group). We show the predictions of future time steps based on the observed frames.

4.2 Quantitative Evaluation

The results of methods [38, 31, 44, 42, 19, 43, 23, 5] are reported in the reference papers [43, 19, 23, 5]. For the models [22, 2, 8], we generate the results by running the pre-trained models the authors reported online.

Table 1 reports the quantitative comparison on the KTH dataset. Our model achieves the best PSNR and SSIM when predicting both 20 and 40 future frames, which indicates that our results are more consistent with the ground truth. However, on LPIPS, SAVP and its variant SAVP-VAE perform better than our model. We attribute this to the latent variables in stochastic generation methods, which focus more on the visual quality of the generated results and less on consistency with the ground truth. Our model, in contrast, focuses on fidelity and temporal consistency with the original sequences, which is in line with our original intention.

Figure 5 illustrates the per-frame quantitative comparison over future time steps on the BAIR dataset. We also report the average results in Table 2. Consistent with the results on the KTH dataset, we obtain the best PSNR and SSIM among the reported methods, while the Improved VRNN [5] achieves the best LPIPS. Because of the high stochasticity of the BAIR dataset, it is challenging to maintain fidelity and temporal consistency while also producing visually appealing results.

4.3 Qualitative Evaluation

To show the prediction results more intuitively, we report visualization examples on the KTH and BAIR datasets in Figure 6 and Figure 7.

The first row is the ground truth, in which the initial frames represent the input sequence. We can see that our model makes more accurate predictions while maintaining more details of the arms in the handclapping example in the first group of Figure 6. Meanwhile, we predict a walking sequence that is more consistent with the ground truth in the second group of Figure 6, while for other methods, the person in the image walks out of the scene too quickly (VarNet) or too slowly (SAVP and SV2P time-invariant).

For the two groups of predictions on the BAIR dataset, our results are also the most consistent with the ground truth. Though the stochastic generation methods seem to generate clearer results, their trajectories are very different from those of the real sequence. This again confirms our belief that introducing more stochasticity into models sacrifices fidelity. From the experimental results above, we can see that the multi-frequency analysis of the discrete wavelet transform does help models to retain more detail information as well as temporal motion information.

Method                  PSNR    SSIM    LPIPS   #param
PredNet [27]            27.6    0.905   7.47    6.9M
ContextVP [3]           28.7    0.921   6.03    8.6M
DVF [26]                26.2    0.897   5.57    8.9M
Dual Motion GAN [24]    -       0.899   -       -
CtrlGen [15]            26.5    0.900   6.38    -
DPG [13]                28.2    0.923   5.04*   -
Cycle GAN [21]          29.2*   0.830   -       -
Ours                    29.1    0.927*  5.89    7.6M
Table 3: Evaluation of next-frame prediction on the Caltech Pedestrian dataset after training on the KITTI dataset. All models are trained by observing 10 frames. The results of the compared methods are from [3, 21, 13]. The best results under each metric are marked with an asterisk (*).

4.4 Ablation Study

To verify the validity of our model more fully, we conduct the following ablation studies on driving scene datasets (the KITTI dataset and the Caltech Pedestrian dataset) and on the KTH dataset.

Evaluation of generalization ability. Consistent with the way previous works evaluate the generalization ability of predictive models, we test our model on the Caltech Pedestrian dataset after training on the KITTI dataset. Table 3 shows the comparison with other models: our model obtains the best SSIM and a PSNR within 0.1 dB of the best reported value. In addition, Figure 8 shows visualization examples on the KITTI dataset (the first group) and the Caltech Pedestrian dataset (the second group). From Figure 8, we can see that our model clearly predicts the evolution of the driving lines and the cars. Meanwhile, the results remain consistent with the ground truth, which verifies the good generalization ability of the model. Besides, we also report the number of model parameters in Table 3. Compared to ContextVP [3] and DVF [26], our model achieves higher PSNR and SSIM with fewer parameters.

Evaluation of sub-modules. To assess the impact of each sub-module, we do ablation studies on the KTH dataset in the absence of S-WAM or T-WAM respectively. As shown in Table 1, both sub-modules, S-WAM and T-WAM, contribute to improving the prediction. To be specific, the model without S-WAM performs better than the model without T-WAM on all metrics, that is, removing T-WAM hurts more. We attribute this to the vital importance of temporal motion information, especially for long-term prediction. Improving the expression of multi-frequency motion information in the model is the basis for making predictions with high fidelity and temporal consistency.

5 Conclusion

This paper discusses the issues of missing details and ignored temporal multi-scale motions in current video prediction models, which lead to blurry results. To address these issues, inspired by the mechanism of the Human Visual System (HVS), we explore a video prediction network based on multi-frequency analysis, which integrates the spatial-temporal wavelet transform with a generative adversarial network. Specifically, the Spatial Wavelet Analysis Module (S-WAM) is proposed to preserve more detail information through multi-level decomposition of each frame. The Temporal Wavelet Analysis Module (T-WAM) is proposed to exploit the temporal motions through multi-level decomposition of video sequences on the time axis. Extensive experiments on diverse datasets demonstrate the superiority of the proposed method over state-of-the-art methods. In the future, we will validate our method on more video datasets.