
From Single to Multiple: Leveraging Multi-level Prediction Spaces for Video Forecasting

Although video forecasting has been widely explored in recent years, mainstream existing work still restricts models to a single prediction space and neglects ways to leverage multiple prediction spaces. This work fills this gap. For the first time, we deeply study numerous strategies to perform video forecasting in multiple prediction spaces and fuse their results together to boost performance. Prediction in the pixel space usually lacks the ability to preserve the semantic and structural content of the video, whereas prediction in a high-level feature space is prone to errors in the reduction and recovery processes. Therefore, we build a recurrent connection between different feature spaces and incorporate their generations in the upsampling process. Rather surprisingly, this simple idea yields a much more significant performance boost than PhyDNet (MAE improved by 32.1% on the Moving MNIST dataset and 21.4% on the KTH dataset). Evaluations on four datasets demonstrate the generalization ability and effectiveness of our approach. We show that our model significantly reduces troublesome distortions and blurry artifacts and brings remarkable improvements to accuracy in long-term video prediction. The code will be released soon.





1 Introduction

Video prediction is a crucial problem in computer vision that aims to anticipate future frames conditioned on an observed video clip. It is a challenging task, especially for long-term prediction, due to the drastic evolution of the pixel space and the complicated uncertainty of the future. Tremendous accomplishments have been made through the introduction of Long Short-Term Memory (LSTM) models and their variants to address this task; indeed, the current trend is mainly to design sophisticated architectures that exploit spatial-temporal clues. However, none of the existing approaches has paid particular attention to the prediction space used in the video prediction task.

In this paper, we first carry out an in-depth investigation of existing work from the neglected prediction-space perspective and reveal that a simple but effective strategy of utilizing multiple prediction spaces can surprisingly bring greater gains over recent leading work than expected. For instance, we show that our model based on four prediction spaces achieves 17.2 MSE and 47.7 MAE on the established Moving MNIST-2 benchmark, a clear improvement over PhyDNet [7]. We present an example of the comparison results in Figure 1, from which we can observe that our approach correctly predicts the movement of the two digits.

The feature space used as the input and output of the recurrent prediction process is referred to as the prediction space. According to the prediction space used, existing research can be broadly categorized into two types. The first type of approach conducts prediction directly in the pixel space [41, 36, 34, 37, 35, 31, 39]: recurrent models directly predict the evolution of pixels and recursively generate future frames. Despite their advantages in modelling sequence data, these methods present two main limitations: first, they are limited to pixel-to-pixel prediction, and as a result they have difficulties learning to retain the semantic and structural information in videos, e.g. shape and motion; second, errors in recurrent models tend to accumulate, so that small discrepancies at the beginning are amplified into serious compound errors and distorted structures in the long term.

Another group of trending works tackles these limitations by first predicting the video in a high-level feature space, then translating it back into pixel-level frames [7, 33, 32, 42, 4]. As mentioned in [23], researchers have proposed different frameworks that first predict the instance segmentation [2], motion [33], human pose [42], object trajectory [40], or key points [14] in the video, then translate them back into the pixel space. In this way, the video prediction task is simplified and the content of the video is better preserved, leading to lower pixel-level error through more accurate content. However, the reduction and recovery processes between the pixel-level space and the high-level feature space usually introduce intractable errors.

Although numerous techniques have been introduced so far, most existing work limits the prediction task to one single feature space. To the best of our knowledge, performing video prediction simultaneously in multiple prediction spaces has never been investigated in the literature. We believe that the different levels of feature spaces can be complementary to each other.

In light of this, our key insight is to conduct the prediction task in multiple feature spaces and recover the final frame by assembling all the prediction results. In this paper, we systematically investigate the prediction performance in different feature spaces and propose an architecture with Recurrent Connections. Specifically, the input video is downsampled through convolutional layers to obtain different feature spaces. An LSTM-based module is then employed to predict the feature sequences at different scales, and the final frames are synthesised by fusing all the results together.

Our main contributions can be summarized as follows:

  1. To our knowledge, we are the first to consider the previously overlooked multi-prediction-space perspective and to propose a novel recurrent connection to address the video prediction problem.

  2. The experimental results show that our model introduces considerable improvements over state-of-the-art works on various challenging video datasets. Furthermore, the generalization of our idea is validated by applying the same strategy to different architectures.

  3. We thoroughly analyze various ways to build the recurrent connection and to fuse the prediction results from multiple spaces. A series of comprehensive experiments provides a variety of insights into the utilization of multiple prediction spaces.

2 Related Works

We review current approaches to the task of video prediction from the perspective of the prediction space being used. They usually adopt one of two types: the pixel space or a high-level feature space.

Pixel Space. Many recent works [17, 41, 28, 37, 39] attempted to directly predict future pixel intensities without any explicit modeling of the scene dynamics. Ranzato et al. [25] constructed a Recurrent Neural Network (RNN) model to predict the next frames. Inspired by architectures used for language processing, Srivastava et al. [29] introduced the sequence-to-sequence model to video prediction. Since this kind of model can only capture temporal variations, Shi et al. [41] integrated the convolutional operator into recurrent state transition functions and proposed the Convolutional LSTM to jointly model spatial and temporal variations in a unified network structure. However, in their approach, the LSTM cells in different layers are independent of each other, so information from the top layer cannot flow into the bottom layer at the next time step. To address this issue, [17, 6, 13, 28, 36, 34, 37] further extended the convolutional LSTM model and investigated spatio-temporal prediction. All of these models focus on modifying the LSTM cells to capture more spatial and temporal clues.

Furthermore, various other strategies have been explored. Under the assumption that video sequences are symmetric, Kwon et al. [16] implemented a retrospective prediction network by training a generator for both forward and backward predictions. Hu et al. [9] presented a novel cycle-consistency loss incorporated within their framework for more accurate predictions. Several other studies [21, 8, 22] further explored the benefits of forward and backward prediction. Another typical way of predicting video frames is to use 3D convolutions. Wang et al. [35] presented a gate-controlled self-attention module that effectively manages historical memory records across multiple time steps.

Overall, the performance of these models still degrades significantly on long-term prediction tasks. The high dimensionality of the pixel space causes the prediction error to grow exponentially as the prediction horizon lengthens.
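This compounding of error can be illustrated with a toy rollout: a stand-in predictor whose single-step relative error is only about 1% (the `noisy_identity` function below is a hypothetical illustration, not a model from the paper) drifts far from the ground truth over a long recursive horizon.

```python
import numpy as np

# Hypothetical stand-in predictor: returns its input corrupted by ~1%
# multiplicative noise, mimicking a small single-step prediction error.
def noisy_identity(frame, rng, eps=0.01):
    return frame * (1.0 + eps * rng.standard_normal(frame.shape))

rng = np.random.default_rng(0)
truth = np.ones((8, 8))          # a static ground-truth "video"
frame = truth.copy()
for _ in range(100):             # 100-step recursive rollout
    frame = noisy_identity(frame, rng)

one_step = float(np.mean(np.abs(noisy_identity(truth, rng) - truth)))
drift = float(np.mean(np.abs(frame - truth)))
# After 100 recursive steps the accumulated error is roughly an order
# of magnitude larger than the single-step error.
```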

High-Level Feature Space. To deal with the curse of dimensionality, researchers have proposed various approaches that first learn high-level representations in the latent space of videos via an encoder network, and then perform prediction in the learned high-level space.

Parsing with predictive feature Learning (PEARL) [11] is the first systematic predictive learning model for video scene parsing. Concurrently, Luc et al. [18] extended the msCNN model of [19] to the novel task of predicting the semantic segmentation of future frames. However, both of these models are not end-to-end and do not explicitly capture the temporal continuity across frames. To address this limitation, Jin et al. [12] first proposed a model for jointly predicting motion flow and scene parsing. An advantage of flow-based representations is that they implicitly draw temporal correlations from the input data, thus producing temporally coherent per-pixel segmentation. Aside from segmentation, other high-level spaces such as human pose and keypoints represent promising ways of approaching this problem. Minderer et al. [20] modelled dynamics in the keypoint coordinate space, achieving stable learning and avoiding the compounding of errors in pixel space. Tang et al. [30] proposed a pose-guided approach for appearance-preserving video prediction by combining global and local information using Generative Adversarial Networks (GANs). However, a major drawback of these semantic feature spaces is that they are not generic for all video data and usually require laborious ground-truth labelling. Yu et al. [44] proposed the Conditionally Reversible Network (CrevNet), which theoretically ensures no information loss during the feature extraction process, thereby also guaranteeing a higher level of efficiency.

Figure 2: The workflow of the three types of approaches. Our model is presented in the middle image. The network architecture will repeat the same structure of the first two layers when the prediction spaces go deeper.

Some two-stream based works factorize the video data into content and motion and predict motion separately. For instance, [33, 3] proposed MCnet and the Disentangled-representation Net (DRNET) to explicitly separate scene dynamics from visual appearance. These methods usually assume that the visual appearance is static and only predict the dynamics in motion spaces; as a consequence, they are also limited to one prediction space according to our definition. Lastly, high-level feature spaces can be learned by encoder models, in which case they are usually no longer semantically explainable [32, 4]. PhyDNet [7] has been proposed to disentangle PDE dynamics and obtain a latent feature space by utilizing physical knowledge. However, finding a proper high-level feature space is very challenging, and the dimensionality reduction of the feature space usually comes down to a trade-off between simplicity and quality.

To date, all the aforementioned approaches attempt to improve performance with sophisticated architectures while still limiting their models to one prediction space. In contrast, in this paper we thoroughly investigate different strategies to construct models that incorporate multiple prediction spaces.

3 Methodology

In this section, we formulate the mathematical description of the video prediction problem and the corresponding architectures. In order to model video data through multiple prediction spaces, there are three important questions to answer: (1) How do we obtain the multiple prediction spaces? (2) How do we conduct the prediction process in these spaces? (3) How do we connect the different prediction spaces? Hereafter, we address these three questions through extensive studies and finally propose a full architecture for this problem.

3.1 The Prediction Workflow

Given an input image sequence $X_1, X_2, \dots, X_t$, our goal is to predict the future frame sequence $\hat{X}_{t+1}, \hat{X}_{t+2}, \dots, \hat{X}_{t+K}$. The most common approach to the video prediction problem is to generate frames recursively, i.e., to train a model that takes the current frame $X_t$ as input and outputs the next frame $\hat{X}_{t+1}$. As previously mentioned, prediction methods can be divided into three main categories; we illustrate their workflows in Figure 2 and present their mathematical formulations below:

Pixel Space Prediction. The typical prediction method in pixel space can be written as:

$\hat{X}_{t+1} = \mathcal{F}(X_t),$

where $X_t$ is the input image at time $t$, $\hat{X}_{t+1}$ is the output image, and $\mathcal{F}$ is usually an RNN-based network.
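The recursive generation scheme can be sketched as follows; the `shift_right` predictor below is a toy stand-in for the RNN-based network, used only to make the rollout loop concrete.

```python
import numpy as np

def rollout(predict, frames, horizon):
    """Autoregressively generate `horizon` future frames: each output
    is fed back as the input of the next step."""
    outputs = []
    current = frames[-1]
    for _ in range(horizon):
        current = predict(current)
        outputs.append(current)
    return outputs

# Toy stand-in predictor: shift the frame one pixel to the right,
# mimicking a digit drifting across the grid.
shift_right = lambda f: np.roll(f, 1, axis=1)

context = [np.eye(4)]                     # one observed 4x4 "frame"
future = rollout(shift_right, context, horizon=3)
```

Because each output is fed back as the next input, any per-step error compounds over the horizon, which is exactly the weakness discussed above.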

High-Level Feature Space Prediction. Encoders and decoders are employed to translate the input images to the high-level feature space and back. The corresponding prediction method can be written as:

$H_t = \mathcal{E}(X_t), \quad \hat{H}_{t+1} = \mathcal{F}(H_t), \quad \hat{X}_{t+1} = \mathcal{D}(\hat{H}_{t+1}),$

where $H_t$ corresponds to the high-level feature map obtained by the encoder $\mathcal{E}$, $\hat{H}_{t+1}$ is the high-level feature prediction, and the output image $\hat{X}_{t+1}$ is obtained through the decoder network $\mathcal{D}$.
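A minimal sketch of this encode-predict-decode pipeline, with 2x2 average pooling standing in for the encoder, nearest-neighbour upsampling for the decoder, and an identity map in place of the recurrent predictor (all toy choices, not the paper's modules), also makes the reduction/recovery error visible:

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(x):
    """Toy encoder E: 2x2 average pooling (stands in for the learned
    downsampling network)."""
    return x.reshape(x.shape[0] // 2, 2, x.shape[1] // 2, 2).mean(axis=(1, 3))

def decode(h):
    """Toy decoder D: nearest-neighbour upsampling back to pixel size."""
    return np.repeat(np.repeat(h, 2, axis=0), 2, axis=1)

predict = lambda h: h        # identity stand-in for the recurrent model F

x_t = rng.random((8, 8))
x_next = decode(predict(encode(x_t)))

# The round trip preserves the shape but not the exact content: even
# with a perfect predictor, the reduction/recovery step loses detail.
reconstruction_error = float(np.mean(np.abs(x_next - x_t)))
```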

Multi Feature Space Prediction. In our approach, multiple feature spaces are extracted for prediction, since we believe their incorporation can capture clues of natural videos in a complementary manner. For simplicity, the equations for a two-prediction-space scenario can be written as:

$H_t = \mathcal{E}(X_t), \quad \hat{H}_{t+1} = \mathcal{F}_1(H_t), \quad \hat{X}_{t+1} = \mathcal{G}\big(\mathcal{F}_0(X_t), \mathcal{D}(\hat{H}_{t+1})\big),$

where $\mathcal{F}_0$ and $\mathcal{F}_1$ denote the recurrent predictors in the pixel space and the feature space respectively, and $\mathcal{G}$ is the fusion function.
As illustrated in Figure 2, the two pathways of the prediction process in the different feature spaces are fused together. The precise connection strategy between the two prediction spaces is presented in Section 3.3.
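One fused prediction step can be sketched under the same toy stand-ins as before (pooling/upsampling between spaces and a per-space shift predictor; these are illustrative choices, not the paper's learned modules):

```python
import numpy as np

def down(x):   # 2x2 average pooling into the coarser prediction space
    return x.reshape(x.shape[0] // 2, 2, x.shape[1] // 2, 2).mean(axis=(1, 3))

def up(h):     # nearest-neighbour upsampling back to pixel resolution
    return np.repeat(np.repeat(h, 2, axis=0), 2, axis=1)

shift = lambda f: np.roll(f, 1, axis=1)   # toy per-space predictor

def two_space_step(x_t, fuse=lambda a, b: 0.5 * (a + b)):
    pred_pixel = shift(x_t)               # prediction in the pixel space
    pred_feat = shift(down(x_t))          # prediction in the coarse space
    return fuse(pred_pixel, up(pred_feat))

x = np.zeros((4, 4))
x[0, 0] = 1.0
y = two_space_step(x)
# The fused frame keeps the sharp pixel-space prediction plus a more
# diffuse contribution upsampled from the coarse space.
```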

3.2 The Prediction Spaces

| Model | MSE | MAE | SSIM | Params |
|---|---|---|---|---|
| Layer 1 | 56.3 | 150.7 | 0.816 | 71.2M |
| Layer 2 | 34.8 | 98.1 | 0.907 | 70.8M |
| Layer 3 | 25.9 | 75.5 | 0.941 | 72.0M |
| Layer 4 | 25.7 | 76.1 | 0.942 | 70.7M |
| All layers | 18.6 | 57.4 | 0.960 | 69.7M |

Table 1: The comparison results of using each single layer, and all layers, as the prediction space.

The most common way to obtain different feature spaces is to apply convolutions at different scales. Usually, high-level feature space-based methods spend a lot of effort designing a special prediction space to enhance performance. However, we discovered through our experiments that even simple feature spaces extracted by typical convolutions yield much greater improvements when used together.

2D convolutional layers are employed to generate multi-scale features; more detailed information on these layers can be found in the implementation details section. In order to study the characteristics of each feature space, we trained a model that adopts ConvLSTM [27] to predict the future sequences in each single feature space, and we report their performance and accuracy results here.
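The multi-scale extraction can be sketched with repeated 2x2 average pooling standing in for the strided 2D convolutional layers (the real model learns these filters; the pooling here is only illustrative):

```python
import numpy as np

def pyramid(frame, levels=4):
    """Build `levels` prediction spaces by repeated 2x2 average pooling."""
    spaces = [frame]
    for _ in range(levels - 1):
        f = spaces[-1]
        spaces.append(f.reshape(f.shape[0] // 2, 2,
                                f.shape[1] // 2, 2).mean(axis=(1, 3)))
    return spaces

spaces = pyramid(np.ones((64, 64)))
# Spatial resolutions: 64x64, 32x32, 16x16, 8x8 -- one per prediction space.
```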

Figure 3: The comparison results of using each layer as prediction space.

Discussion: The comparison results presented in Figure 3 and Table 1 show that the deeper feature spaces are able to preserve the content better, without distortions and blurry artefacts. For example, the digits four and eight are still readable in the final predicted frame, whereas the digits are unreadable from the first frame onwards when the prediction is conducted directly in pixel space. These observations validate the superiority of higher-level representations and show that the deeper the feature space, the higher the accuracy. Most importantly, the model that incorporates the prediction results of all feature spaces achieves the best performance. These phenomena verify that the comprehensive information of video frames in different prediction spaces can boost video forecasting.

3.3 The Recurrent Connection

The key issue in realising our idea is how to connect the different prediction processes so as to fuse the generations from multiple feature spaces. Therefore, a novel Recurrent Connection strategy is designed in this paper to address this problem. Assume the input feature at time $t$ and layer $l$ is denoted as $H_t^{l}$, and the predicted feature at the next timestep and layer is denoted as $\hat{H}_{t+1}^{l+1}$. There is consequently a spatial-temporal gap between these two features, so our core technique models the transition from $H_t^{l}$ to $\hat{H}_{t+1}^{l+1}$. As shown in Figure 2, pixel space prediction methods model this transition with a recurrent cell such as ConvLSTM. In our model, a recurrent connection strategy connects the prediction features every two layers and constructs a new way to model the transition from $H_t^{l}$ to $\hat{H}_{t+1}^{l+1}$.

The detailed propagation of sequential data in our Recurrent Connection can be described as:

$\tilde{H}_t^{l} = \mathrm{LSTM}^{l}(H_t^{l}), \quad H_t^{l+1} = \mathrm{DownSample}(\tilde{H}_t^{l}),$
$\hat{H}_{t+1}^{l+1} = \mathrm{LSTM}^{l+1}(H_t^{l+1}), \quad \hat{H}_{t+1}^{l} = \mathcal{G}\big(\tilde{H}_t^{l}, \mathrm{UpSample}(\hat{H}_{t+1}^{l+1})\big).$

In these equations, $\tilde{H}_t^{l}$ and $\hat{H}_{t+1}^{l}$ correspond to the propagation in layer $l$, while $H_t^{l+1}$ and $\hat{H}_{t+1}^{l+1}$ represent the propagation in layer $l+1$; $\mathcal{G}$ is the fusion function. The feature $\hat{H}_{t+1}^{l}$ is obtained from the Recurrent Connection between layer $l$ and layer $l+1$. The last layer of the architecture performs the transition of features with typical recurrent cells. Two LSTM cells are employed before and after the DownSample and UpSample processes to strengthen the ability to capture spatio-temporal clues. Three types of RNN blocks for sequence prediction are employed in our experiments, namely ConvLSTM [27], ConvGRU [28] and PredRNN [36]. After consideration of efficiency and performance, we use ConvLSTM in our final model.

Fusion Strategy. For every input frame, one prediction is generated in the current prediction space and another is obtained by upsampling from the lower-level prediction space. It is thus important to investigate how the low-level prediction feature map should be fused with the high-level one. Among the various fusion functions proposed in the literature, we chose the four most widely used types [5], namely Sum Fusion, Concatenation Fusion, Max Fusion and Attention Fusion. Comparison results are reported in Section 4.
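The four fusion families can be sketched elementwise as follows; the fixed weights in `concat_fusion` and the sigmoid gate in `attention_fusion` are simplified stand-ins for the learned 1x1 convolution and attention weights:

```python
import numpy as np

def sum_fusion(a, b):
    return a + b

def max_fusion(a, b):
    return np.maximum(a, b)

def concat_fusion(a, b, w_a=0.5, w_b=0.5):
    # Concatenation followed by a 1x1 convolution reduces, per output
    # channel, to a learned weighted sum; w_a and w_b stand in for the
    # learned weights.
    return w_a * a + w_b * b

def attention_fusion(a, b):
    # Minimal gating variant: a sigmoid gate derived from both inputs
    # decides, elementwise, how much of each prediction to keep.
    gate = 1.0 / (1.0 + np.exp(-(a - b)))
    return gate * a + (1.0 - gate) * b
```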

4 Experiments

In this section, we first introduce the datasets used and the implementation details of our model. We then verify the performance of our model by comparing it with five competitive models: ConvLSTM [27], PredRNN [36], MIM [37], CrevNet [44] and PhyDNet [7].

Implementation details. Our model is trained using loss and the ADAM [15] optimizer with an initial learning rate of . The kernel size of all convolutional layers and ConvLSTMs is set to . The RNN path is composed of a 1-layer ConvLSTM with the same number of filters as its input channels. All experiments are implemented in PyTorch and conducted on NVIDIA V100 Tensor Core GPUs.

Evaluation metrics. We use three evaluation metrics common in previous video prediction work: Mean Squared Error (MSE), Mean Absolute Error (MAE) and Structural Similarity (SSIM [38]). These metrics are averaged over each frame of the output sequence. Lower values of MSE and MAE, and higher values of SSIM, indicate better performance.
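For reference, the three metrics can be computed as below; the SSIM here uses a single global window rather than the usual sliding Gaussian windows, so it is a simplified variant of the metric in [38]:

```python
import numpy as np

def mse(x, y):
    return float(np.mean((x - y) ** 2))

def mae(x, y):
    return float(np.mean(np.abs(x - y)))

def ssim_global(x, y, data_range=1.0):
    """SSIM computed over a single global window (the standard metric
    averages over local sliding windows; this keeps the formula visible)."""
    c1, c2 = (0.01 * data_range) ** 2, (0.03 * data_range) ** 2
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))
```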

4.1 Datasets

Moving MNIST [29] is a standard benchmark in video prediction containing 2 randomly moving digits bouncing inside a 64×64 grid. Each sequence contains 20 frames: 10 for the input sequence and 10 for the prediction outcome. We generate Moving MNIST sequences following the method in [36] for training and use the fixed test set of 10,000 sequences provided by [29] for evaluation.

KTH [26] includes 6 kinds of human actions (walking, jogging, running, boxing, hand waving and hand clapping) performed in different scenarios. The video frames used in our experiment are subsampled from the original videos at 1 Hz and resized to 128×128. All video frames are divided into a training set (subjects 1-16) and a test set (subjects 17-25). Following the common setup for this benchmark, we generate 10 frames from the last 10 observations. In our experiments, we use 172,372 sequences for training and 99,375 sequences for testing.

Human 3.6M [10] contains 17 kinds of human actions, including 3.6 million poses and corresponding images. The original images in the Human 3.6M dataset are 1000×1000×3. In our experiments, all images are centered, cropped to 800×800×3, and resized to 128×128×3. Following [37], subjects S1, S5, S6, S7 and S8 are used for training, and S9 and S11 for testing. We predict 4 future frames given the previous 4 input frames.

Radar Echo dataset is generated from historical radar maps of local weather with an interval of 6 minutes. Each frame is a 700×900×1 grid image covering 700×900 square kilometers. To effectively train all the models for comparison, the images are first zero-padded to 900×900×1 and then resized to 128×128×1 for training and testing. We predict 10 radar maps at a time interval of 6 minutes, covering the next hour. In our experiments, we use 9,600 radar sequences for training and 2,400 for testing.

| Model | MSE (MNIST-2) | MAE (MNIST-2) | SSIM (MNIST-2) | MSE (MNIST-3) | MAE (MNIST-3) | SSIM (MNIST-3) |
|---|---|---|---|---|---|---|
| ConvLSTM [27] | 103.3 | 182.9 | 0.707 | 127.3 | - | 0.695 |
| PredRNN [36] | 56.8 | 126.1 | 0.867 | 83.1 | - | 0.822 |
| MIM [37] | 44.2 | 101.1 | 0.910 | 49.8* | 138.2* | 0.879* |
| CrevNet [43] | 22.3 | - | 0.949 | 40.6 | - | 0.916 |
| PhyDNet [7] | 24.4 | 70.3 | 0.947 | 50.8* | 136.4* | 0.888* |
| Ours | 17.2 | 47.7 | 0.967 | 29.8 | 80.1 | 0.944 |

Table 2: The comparison results of our model with other state-of-the-art models on the Moving MNIST-2 and Moving MNIST-3 datasets. * corresponds to results obtained by running the online code from the authors.

4.2 Comparison with the State-of-the-Art work

We report here the evaluation of our proposed strategies against leading research works in video prediction. Our final model uses four layers of feature spaces with ConvLSTM, together with the top-performing fusion function.

Quantitative. As shown in Tables 2 and 3, our proposed model achieves significant improvements over all five baselines on all datasets and metrics; this evidence validates the effectiveness of using multiple prediction spaces. On Moving MNIST with 3 handwritten digits, the MSE and MAE are almost halved compared to the competitive PhyDNet model. It is worth noting that a gap this large had not been achieved by previously proposed approaches. Moreover, our model significantly outperforms the state-of-the-art methods on the other three real-world video benchmarks. On the challenging Radar Echo Dataset, our model yields the best results to date (76.7 MSE, 319.3 MAE and 0.942 SSIM). A large and persistent performance boost is also reported in terms of SSIM, indicating that our model has a superior ability to preserve image quality during prediction. In addition, we report the CSI scores [41], specific to the weather forecasting task, in Figure 5. Our model consistently outperforms the compared methods over the future 10 frames (1 hour), which indicates its superior performance on the precipitation nowcasting task.

Figure 4: Prediction examples on the KTH dataset of our model and the State-of-the-Art works. The MIM [37] model predicts the future frames on pixel space and the PhyDNet [7] model conducts the prediction on high-level feature space.

Qualitative. The qualitative comparisons of our model and the state-of-the-art methods are illustrated in Figures 4, 6 and 7.

We can observe from Figure 4 that the proposed model can clearly predict the pose of the humans and their positions, while the other two models produce blurry and incorrect results.

For the Radar Echo Dataset, different colors represent different radar dBZ values, and high dBZ values are more important for weather nowcasting. As shown in Figure 6, our model successfully preserves the high dBZ values (the areas in the red blocks) in the predicted frames, and our predictions are much sharper than those of the other models, which is rather useful for more accurate weather forecasting.

Furthermore, the prediction results on human action videos presented in Figure 7 show that, compared with the state-of-the-art method, our multi-prediction-space method is surprisingly more advantageous: the human legs are still distinct and accurate in the predicted future.

All the visualization results validate the remarkable advantages of the idea to leverage both the high-level semantic information and low-level pixel information by using the recurrent connection mechanism. More visualization comparisons on all the datasets are provided in our Demo and Supplementary materials.

| Model | KTH MSE | KTH MAE | KTH SSIM | Radar MSE | Radar MAE | Radar SSIM | H3.6M MSE | H3.6M MAE | H3.6M SSIM |
|---|---|---|---|---|---|---|---|---|---|
| PredRNN [36] | 180.3 | 1985.1 | 0.831 | 92.5 | 365.3 | 0.737 | 481.4 | 1895.2 | 0.781 |
| MIM [37] | 122.2 | 1347.3 | 0.840 | 92.9 | 402.6 | 0.725 | 429.9 | 1782.8 | 0.790 |
| PhyDNet [7] | 91.2 | 1061.1 | 0.902 | 41.5 | 280.1 | 0.859 | 264.6 | 1129.2 | 0.929 |
| Ours | 75.9 | 834.0 | 0.924 | 39.7 | 241.7 | 0.872 | 230.1 | 947.9 | 0.942 |

Table 3: The comparison results of our model with other state-of-the-art models on the KTH, Radar Echo and Human 3.6M datasets. Our model is the best-performing method across all datasets.
Figure 5: Frame-wise comparisons of the next 10 generated radar maps. Higher CSI curves indicate better forecasting results.

5 Ablation Study

Extensive ablation studies are carried out in this section to thoroughly investigate the best way to design the whole architecture, including the analysis of each component of our model. In the following, we address and deeply explore the three questions discussed in Section 3.

Figure 6: Prediction examples on the Radar Echo Dataset of our model and the State-of-the-Art works.

5.1 The Number of Feature Layers

As shown in Table 4 (a), we evaluated four different baselines on the Moving MNIST dataset to study the relationship between the number of prediction spaces and accuracy. The Model-k baseline denotes the model with k prediction spaces. It can be clearly observed that utilizing more prediction spaces leads to more favourable performance. While Model-1 performs prediction only in the single pixel space, involving a second feature space brings significant improvements, with drops of 31.9 and 73.0 in terms of MSE and MAE respectively. From the results of Model-3 and Model-4, we can see that the errors decrease more slowly, as further improvements become harder at that level. When the number of prediction spaces is increased from one to four, we see a remarkable drop of 42.6 in MSE and 104.5 in MAE. Concurrently with the continuous improvements in MSE and MAE, the quality of the predicted frames also increases significantly according to the SSIM metric, as can also be seen from the illustration in Figure 8. Based on all these observations, we can confidently conclude that increasing the number of prediction spaces introduces further performance gains for the video forecasting task.

Figure 7: Prediction examples on the human action videos of our model and the State-of-the-Art works.

5.2 The Fusion Strategy

The other question we investigate is how the different fusion functions contribute to the final results. Here we fix the number of prediction spaces and compare the four most commonly used fusion strategies [5], namely sum fusion, concatenation fusion, max fusion and attention fusion.

From Table 4 (b), the max fusion function achieves the top performance of the four in terms of MSE, surpassing the concatenation fusion function by a gap of 1.9. However, the concatenation fusion function achieves the best performance in terms of MAE, with a slightly lower error of 47.7. From the last column of Table 4 (b), we can see that the SSIM values of the four fusion functions are almost the same. This evaluation therefore suggests that the different fusion strategies achieve very similar results, and that a more complex fusion function such as attention does not lead to significantly better results.

Figure 8: The comparison results of using different number of prediction spaces.

5.3 The Recurrent Connection Module

Additionally, we carefully studied how the design of the recurrent connections affects the performance of the final model. To this end, we selected competitive recurrent models for spatial-temporal data prediction, i.e. ConvLSTM [27], ConvGRU [1] and ST-LSTM [36].

The following observations can be made from Table 4 (c). First, the ConvLSTM module outperforms the other two modules by a small gap: 1.0 MSE less than ST-LSTM and 4.4 MSE less than ConvGRU. The same conclusion can be drawn from the MAE comparison in the second column. Second, the differences between the three baselines are quite small compared with the much sharper difference of 31.9 MSE obtained by adding new prediction spaces; these results clearly underline the dominant importance of the proposed Single-To-Multiple idea.

(a)
| Model | MSE | MAE | SSIM |
|---|---|---|---|
| Model-1 | 59.8 | 152.2 | 0.816 |
| Model-2 | 27.9 | 79.2 | 0.935 |
| Model-3 | 19.2 | 58.7 | 0.958 |
| Model-4 | 17.2 | 47.7 | 0.967 |

(b)
| Fusion | MSE | MAE | SSIM |
|---|---|---|---|
| Sum | 15.4 | 48.6 | 0.967 |
| Attention | 16.1 | 49.6 | 0.966 |
| Concatenate | 17.2 | 47.7 | 0.967 |
| Max | 15.3 | 48.3 | 0.967 |

(c)
| RNN cell | MSE | MAE | SSIM |
|---|---|---|---|
| Ours with ConvGRU | 21.6 | 63.7 | 0.953 |
| Ours with ST-LSTM | 18.2 | 48.8 | 0.966 |
| Ours with ConvLSTM | 17.2 | 47.7 | 0.967 |

Table 4: Quantitative ablation results of our model. Part (a) shows the influence of the number of prediction spaces. Part (b) shows the sensitivity of our model to the different fusion strategies. Part (c) shows the influence of different RNN cells in our recurrent connection module.

5.4 Generalization of our idea

Finally, a natural question arises: is this idea also applicable to other types of high-level feature spaces? Motivated by this, we evaluated our idea on another base module, PhyDNet [7], a typical encoder-predictor-decoder framework. In PhyDNet, the feature maps from the layers of the encoder are propagated directly to the corresponding layers of the decoder, similarly to the U-Net model; however, no prediction process takes place along these links at the different scales of feature maps. We therefore integrated our idea by revising this propagation so that multiple prediction processes are introduced into the model. The comparison results in terms of MSE, MAE and SSIM presented in Table 5 show that considerable gains can also be achieved by applying this idea to the feature spaces of PhyDNet. We can therefore conclude that the key idea of this paper generalizes well to diverse types of feature spaces and achieves significant improvements.

| Model | MSE | MAE | SSIM |
|---|---|---|---|
| PhyDNet [7] | 33.2* | 90.6* | 0.925* |
| PhyDNet+Ours | 26.5 | 75.5 | 0.942 |

Table 5: Generalization to other feature spaces. *Results obtained by running the online code from the authors, which differ from the scores reported in the original paper [7]. We tried to contact the authors but received no response before submission.

6 Conclusion

In this paper, to the best of our knowledge, we are the first to systematically investigate the largely neglected yet significant strategy of introducing multiple prediction spaces for the task of video forecasting, unlike the more common approach of using a single prediction space. The results obtained through our extensive experiments on four datasets indicate that our proposed idea drastically boosts performance over the state-of-the-art methods. We conducted a series of ablation studies on the various ways to factorize the feature spaces, fuse the feature maps from different prediction spaces, and build the Recurrent Connection. Furthermore, evaluations on other prediction spaces demonstrate the generalization of our idea. Finally, we also found that the prediction results from the pixel space and the high-level feature spaces are complementary and can be fused together to further improve performance. This study is a first step towards enhancing our understanding of this topic, and we believe that the in-depth component analyses will provide important insights that encourage future research in the video forecasting domain.