In video prediction, the predictor has to model both scene contents and motion. In recent years, deep learning approaches have become the first choice for this task. Although a deep network that learns all aspects of the task by itself is appealing, the history of deep learning shows that an appropriate network structure is key for learning from limited data. For example, typical properties of images are reflected in the structure of hierarchical convolutional networks. Video prediction is challenging because local translations have highly non-linear effects in the spatial domain. Estimating motion and using the estimated motion for prediction is much easier in the frequency domain. Several previous works tried to learn image relations by separating content and transformation. The features learned by these architectures are Gabor-like filters, which decompose the signal according to spatial frequency and phase. In the Relational Auto-Encoder (RAE)
, for example, the paired filter responses are multiplied element-wise to estimate transformations between two consecutive frames. We argue that, instead of element-wise multiplication of linear filter responses, we can compute the transformation by calculating phase differences in the frequency domain. The estimated phase difference can then easily be used for prediction in frequency space, and the predicted frequency representation can be linearly transformed back into the spatial domain. We show the effectiveness of our proposed Frequency Domain Transformer Network (FDTN) approach on three synthetic datasets.
The code and datasets of this paper are publicly available at https://github.com/AIS-Bonn/FreqNet.
2 Related Work
Although many approaches to the video prediction task have been explored, the most successful ones utilize deep learning models. Cricri et al. proposed to add recurrent lateral connections to Ladder Networks to capture the temporal dynamics of video. These recurrent connections, as well as the lateral shortcuts, relieve the deeper layers from modeling spatial detail. The resulting Video Ladder Network (VLN) achieves results competitive with Video Pixel Networks, the state of the art on the Moving MNIST dataset, using far fewer parameters.
Another well-known model is PGP, which is based on a gated auto-encoder and the bilinear transformation model of RAE. PGP assumes that two temporally consecutive frames can be described as a linear transformation of each other. In the PGP model, a bilinear model lets the hidden layer of mapping units encode the transformation. These transformation encodings are then used to predict the next frame. Conv-PGP reduces the number of parameters significantly by utilizing convolutional layers.
Image registration is a fundamental task in image processing that estimates the relative transformation between two similar images. A well-known Fourier-domain method for image registration is phase correlation, which calculates the relative translational offset between two similar images. Reddy et al. demonstrated that rotation and scaling differences between two images can be estimated by converting them to log-polar coordinates. Foroosh et al. extended the method to subpixel transformations. Sarvaiya et al. proposed an extended version of phase correlation that is more robust and can handle larger scale changes. The phase correlation method inspired the design of FDTN.
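As background, classic phase correlation can be sketched in a few lines of NumPy. This is a minimal illustration for integer, periodic shifts only; the cited works additionally handle rotation, scaling, and subpixel offsets:

```python
import numpy as np

def phase_correlation(a, b, eps=1e-8):
    """Estimate the circular shift taking image a to image b.

    The normalized cross-power spectrum has (near) unit magnitude and a
    phase equal to the phase difference of the two images; its inverse
    FFT peaks at the translation offset.
    """
    Fa, Fb = np.fft.fft2(a), np.fft.fft2(b)
    R = Fb * np.conj(Fa)
    R = R / (np.abs(R) + eps)
    corr = np.real(np.fft.ifft2(R))
    return np.unravel_index(np.argmax(corr), corr.shape)
```

Because the magnitudes are normalized away, only the phase difference, i.e. the translation, determines the location of the correlation peak.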
3 Frequency Domain Transformer Networks (FDTN)
If we assume periodic boundary conditions, it is possible to formulate the translation between two consecutive frames as element-wise differences of the phases of their complex frequency domain representation. We can then use this transformation to predict the next frame in the frequency domain by simple phase addition. The last step is converting the predicted frame to the spatial domain. Fig. 1 gives an overview of our proposed architecture. By using a Transform network in the frequency domain, we relax the periodic boundary assumption.
In the first step, we calculate the Fast Fourier Transform (FFT) of two seed frames. To obtain the translation between two consecutive frames, we calculate their element-wise phase difference in the frequency domain:
$$T_t = \frac{\mathcal{F}(x_t)}{|\mathcal{F}(x_t)| + \epsilon} \odot \overline{\left(\frac{\mathcal{F}(x_{t-1})}{|\mathcal{F}(x_{t-1})| + \epsilon}\right)} \qquad (1)$$

$\mathcal{F}(x_t)$ is a vector in the complex plane of the Fourier domain for frame $x_t$, and $\overline{(\cdot)}$ denotes complex conjugation. $T_t$ encodes the transformation in the Fourier domain and has the same spatial shape as each frame, with one complex value per frequency. Note that a small positive constant $\epsilon$ is added for numerical stability.
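A minimal NumPy sketch of this phase-difference computation (illustrative only, assuming periodic boundaries; the trained model operates on this code rather than raw pixel values):

```python
import numpy as np

def transformation_code(prev, cur, eps=1e-8):
    """Eq. 1: normalize both spectra to (near) unit magnitude and
    multiply by the conjugate of the previous one, so that the phases
    subtract element-wise."""
    Fp, Fc = np.fft.fft2(prev), np.fft.fft2(cur)
    return (Fc / (np.abs(Fc) + eps)) * np.conj(Fp / (np.abs(Fp) + eps))

# For a pattern translating with constant velocity, applying the code to
# the current spectrum extrapolates the motion by one more step.
frame0 = np.zeros((16, 16)); frame0[3:7, 5:9] = 1.0
frame1 = np.roll(frame0, 2, axis=0)            # moved 2 px down
T = transformation_code(frame0, frame1)
frame2 = np.real(np.fft.ifft2(T * np.fft.fft2(frame1)))
```

Under the periodic-boundary assumption, the extrapolated frame is exact up to the small perturbation introduced by $\epsilon$.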
It is possible to encode higher-order transformations like acceleration by applying Eq. 1 again to calculate the difference of differences. It is also possible to filter the noise in the estimated phase difference by utilizing multiple observations.
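The difference-of-differences idea can be sketched for constant acceleration (an illustrative toy, again assuming periodic boundaries and integer shifts):

```python
import numpy as np

def transformation_code(prev, cur, eps=1e-8):
    """Eq. 1: element-wise phase difference of two frames."""
    Fp, Fc = np.fft.fft2(prev), np.fft.fft2(cur)
    return (Fc / (np.abs(Fc) + eps)) * np.conj(Fp / (np.abs(Fp) + eps))

# Pattern accelerating to the right: shifts 0, 1, 3 (velocity 1, then 2).
base = np.zeros((16, 16)); base[4:8, 4:8] = 1.0
f0, f1, f2 = base, np.roll(base, 1, axis=1), np.roll(base, 3, axis=1)

T1 = transformation_code(f0, f1)   # velocity between f0 and f1
T2 = transformation_code(f1, f2)   # velocity between f1 and f2
T_acc = T2 * np.conj(T1)           # difference of differences: acceleration

# Extrapolate: the next velocity is T2 * T_acc (3 px), giving frame f3.
f3 = np.real(np.fft.ifft2(T2 * T_acc * np.fft.fft2(f2)))
```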
We pass the transformation representation to the “Transform Model”, a feed-forward network, to account for changes of the transformation over time. This model adjusts the estimated phase difference so that it is suitable for predicting the next frame. We then use the transformed phase difference to predict the next frame in the Fourier domain, rotating each element by the constructed rotation matrix:
$$\hat{\mathcal{F}}(x_{t+1}) = \begin{pmatrix} \cos\Delta\phi & -\sin\Delta\phi \\ \sin\Delta\phi & \cos\Delta\phi \end{pmatrix} \begin{pmatrix} \operatorname{Re}\,\mathcal{F}(x_t) \\ \operatorname{Im}\,\mathcal{F}(x_t) \end{pmatrix} \qquad (2)$$

where $\Delta\phi$ is the transformed phase difference and $\hat{\mathcal{F}}(x_{t+1})$ is the prediction of the next frame in the Fourier domain. We can then obtain the predicted frame in the spatial domain using the inverse FFT.
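The per-element rotation in Eq. 2 is equivalent to a complex multiplication by $e^{i\Delta\phi}$; a small NumPy check (illustrative):

```python
import numpy as np

def phase_add(F, dphi):
    """Rotate each Fourier coefficient of F by the angle dphi, written
    out as a 2x2 rotation acting on (Re, Im)."""
    re = np.cos(dphi) * F.real - np.sin(dphi) * F.imag
    im = np.sin(dphi) * F.real + np.cos(dphi) * F.imag
    return re + 1j * im
```

Both forms predict the same next-frame spectrum; the explicit rotation form maps directly onto real-valued network tensors that hold the real and imaginary parts separately.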
Although the inverse FFT already yields the predicted frame, numerical imprecision can make the result blurry over long prediction horizons. This can be mitigated by the “Refine Model”, another feed-forward network designed to reconstruct detail in the spatial domain.
4 Experimental Results
We used three different datasets to evaluate our proposed architecture. Moving Morse Code is a simple one-dimensional dataset containing patterns chosen randomly from 36 different Morse codes; the patterns move with a random constant velocity. Moving MNIST contains ten frames with one MNIST digit moving inside a 40×40 frame. Digits are chosen randomly from the training and test sets and placed at a random position with a random velocity. The Bouncing Ball dataset contains ten frames with one round object moving inside a 40×40 frame. Balls are positioned randomly with a random velocity. Note that the ball can move with subpixel velocity.
We used two trainable models in our computational graph, each with a different purpose. The “Transform Model” is designed to change the transformations between frames. In Moving MNIST and Bouncing Ball, this model is responsible for changing the motion of the digits and balls. We propose two Transform Model versions: FDTN(FC) and FDTN(Conv). FDTN(FC) utilizes two fully connected layers with sigmoid activations. FDTN(Conv) exploits the structure of the data by using a three-layer convolutional network with ReLU activations. The convolutional version is more efficient than the fully connected version and has fewer parameters. The only issue with convolutional layers is that, due to their location-invariant nature, they cannot model location-dependent features. To address this issue, we used location-dependent convolutional layers, proposed by Azizi et al. Fig. 4(f) shows that if we eliminate the Transform Model, we cannot predict the changes of transformations.
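One simple way to make a convolution location-dependent is a learned per-location bias added to its output. The following is only a minimal illustration of the idea, not necessarily the exact formulation of Azizi et al.:

```python
import numpy as np

def conv2d_valid(x, kernel):
    """Plain 'valid' 2-D cross-correlation via a sliding-window view."""
    win = np.lib.stride_tricks.sliding_window_view(x, kernel.shape)
    return np.einsum('ijkl,kl->ij', win, kernel)

def location_dependent_conv(x, kernel, bias_map):
    """Convolution plus a bias that varies per output location, breaking
    the translation invariance of the plain convolution."""
    return conv2d_valid(x, kernel) + bias_map
```

Because `bias_map` is indexed by output position, the layer can respond differently near the frame border than in the center, which plain convolutions cannot.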
To mirror the velocity at the border in the two-dimensional datasets, we can flip the estimated phase difference around the desired axes. We calculate four different versions of it: the original, flipped vertically, flipped horizontally, and flipped both ways. The last part of the “Transform Model” network is a softmax layer, which weights these four versions. The weighted sum is then routed to the phase-adding operation to calculate the next frame. The input to the model is the predicted frame without the “Transform Model” applied. Note that, for a more efficient inference implementation, if the object does not need to change its transformation, we can route the predicted frame directly to the “Refine Model”. The implementation of the “Transform Model” for the Moving Morse Code dataset is different: since we do not need to change the velocity at the border, we only denoise the phase difference using fully connected layers.
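The flip-and-weight step can be sketched as follows (scalar softmax weights for brevity; in the actual model the weights are produced by the network):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def weighted_flips(T, logits):
    """Weight the original transformation code and its three flipped
    versions (vertical, horizontal, both) by a softmax over logits."""
    candidates = np.stack([T, T[::-1], T[:, ::-1], T[::-1, ::-1]])
    return np.tensordot(softmax(logits), candidates, axes=1)
```

When the object does not touch a border, the softmax should put nearly all weight on the original version, leaving the motion unchanged.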
Due to numerical imprecision, the predicted frame can become blurry when predicting far into the future. To mitigate this issue, we propose the second learnable model, the “Refine Model”, consisting of three convolutional layers followed by ReLU activations. The results of these layers are multiplied element-wise to produce the output. The effect of eliminating this model is shown in Fig. 4(e).
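One plausible reading of this multiplicative combination, with the three convolutions applied in parallel to the frame, can be sketched with toy (untrained) kernels; the real model learns them:

```python
import numpy as np

def conv2d_same(x, kernel):
    """'Same' 2-D cross-correlation via zero padding and a window view."""
    p = kernel.shape[0] // 2
    win = np.lib.stride_tricks.sliding_window_view(np.pad(x, p), kernel.shape)
    return np.einsum('ijkl,kl->ij', win, kernel)

def refine(frame, kernels):
    """Three convolutions with ReLU whose outputs are multiplied
    element-wise, acting like a gated sharpening of the blurry frame."""
    outs = [np.maximum(conv2d_same(frame, k), 0.0) for k in kernels]
    return outs[0] * outs[1] * outs[2]
```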
In our first experiments, we used the Moving Morse Code dataset to sanity-check our implementation. A sample result from this dataset is depicted in Fig. 2.
In Moving Morse Code, we predicted 18 frames from two noisy seed frames. Transform Model can learn to denoise the frequency-domain representation.
We evaluated our architecture on both the Moving MNIST and Bouncing Ball datasets, using Conv-PGP and VLN as baselines for comparison. In these experiments, we predicted eight frames from two seed inputs. Sample results of our models, as well as of the baselines, are presented in Fig. 3.
Table 1 reports the prediction loss and the number of parameters for the evaluated models. Both of our proposed models outperform the baselines on both the Moving MNIST and Bouncing Ball datasets.
The model is trained end-to-end using backpropagation through time, with the Adam optimizer and MSE loss. Similar to the VLN and Conv-PGP models, our method predicts one frame at each time step; in contrast to them, however, our model, which is trained to predict sequences of ten frames, also works well on longer sequences. A sample of a longer prediction is presented in Fig. 4.
| Model | Moving MNIST | Bouncing Ball | Number of parameters |
5 Conclusion
We proposed an end-to-end learnable neural network with a structure specialized for estimating the transformation between consecutive video frames in the frequency domain and using this estimate to predict future frames. Experiments indicate that our proposed architecture can solve the video prediction task on synthetic datasets, significantly outperforming both the VLN and Conv-PGP models on the Moving MNIST and Bouncing Ball datasets. The fully connected version performs better than the convolutional one, though with more parameters.
This work was funded by grant BE 2556/16-1 (Research Unit FOR 2535 Anticipating Human Behavior) of the German Research Foundation (DFG).
-  Vincent Michalski, Roland Memisevic, and Kishore Konda. Modeling deep temporal dependencies with recurrent grammar cells. In NIPS, 2014.
-  Roland Memisevic. Learning to relate images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1829–1846, 2013.
-  Francesco Cricri, Xingyang Ni, Mikko Honkala, Emre Aksu, and Moncef Gabbouj. Video ladder networks. arXiv:1612.01756, 2016.
-  N. Kalchbrenner, A.v.d. Oord, K. Simonyan, I. Danihelka, O. Vinyals, A. Graves, and K. Kavukcuoglu. Video pixel networks. arXiv:1610.00527, 2016.
-  Filip De Roos. Modeling spatiotemporal information with convolutional gated networks. Master’s thesis, Chalmers University of Technology, 2016.
-  B.S. Reddy and B.N. Chatterji. An FFT-based technique for translation, rotation, and scale-invariant image registration. IEEE Tr. on Image Processing, 5(8), 1996.
-  Hassan Foroosh, Josiane Zerubia, and Marc Berthod. Extension of phase correlation to subpixel registration. IEEE Tr. on Image Processing, 11(3):188–200, 2002.
-  J.N. Sarvaiya, S. Patnaik, and K. Kothari. Image registration using log polar transform and phase correlation to recover higher scale. J. of Pattern Recognition Research, 7(1):90–105, 2012.
-  Niloofar Azizi, Hafez Farazi, and Sven Behnke. Location dependency in video prediction. In International Conference on Artificial Neural Networks (ICANN), 2018.