Fast Fourier Inception Networks for Occluded Video Prediction

06/17/2023
by Ping Li, et al.

Video prediction is a pixel-level task that generates future frames from historical frames. Videos often contain continuous, complex motions, such as overlapping objects and scene occlusion, which pose great challenges to this task. Previous works either fail to capture long-term temporal dynamics well or do not handle occlusion masks. To address these issues, we develop the fully convolutional Fast Fourier Inception Networks for video prediction, termed FFINet, which includes two primary components, i.e., the occlusion inpainter and the spatiotemporal translator. The former adopts fast Fourier convolutions to enlarge the receptive field, so that missing (occluded) areas with complex geometric structures are filled in by the inpainter. The latter employs the stacked Fourier transform inception module to learn the temporal evolution by group convolutions and the spatial movement by channel-wise Fourier convolutions, capturing both local and global spatiotemporal features. This encourages the generation of more realistic, high-quality future frames. To optimize the model, a recovery loss is added to the objective, i.e., minimizing the mean squared error between the ground-truth frame and the recovered frame. Both quantitative and qualitative experimental results on five benchmarks, namely Moving MNIST, TaxiBJ, Human3.6M, Caltech Pedestrian, and KTH, demonstrate the superiority of the proposed approach. Our code is available at GitHub.
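
The two ideas the abstract leans on, a pointwise operation in the frequency domain acting over the whole frame and a mean squared error recovery loss, can be illustrated with a minimal NumPy sketch. This is not the FFINet implementation: the function names `spectral_mix` and `recovery_loss`, the random `weight` array standing in for learned parameters, and the 8x8 frame size are all assumptions made for illustration.

```python
import numpy as np

def spectral_mix(frame, weight):
    """Scale each frequency of a 2-D frame by a weight, then return to pixel space.

    Hypothetical helper: a stand-in for the learned pointwise operation
    that a Fourier convolution applies in the frequency domain.
    """
    spectrum = np.fft.rfft2(frame)      # complex array of shape (H, W // 2 + 1)
    mixed = spectrum * weight           # pointwise operation per frequency
    return np.fft.irfft2(mixed, s=frame.shape)

def recovery_loss(ground_truth, recovered):
    """Mean squared error between the ground-truth frame and the recovered frame."""
    return float(np.mean((ground_truth - recovered) ** 2))

rng = np.random.default_rng(0)
frame = rng.random((8, 8))              # toy 8x8 frame (assumed size)
weight = rng.random((8, 5))             # assumed stand-in for learned weights

# With all-ones weights the transform round-trips to the input frame.
identity = spectral_mix(frame, np.ones((8, 5)))

# A single bright pixel influences positions far outside any 3x3 window,
# which is the sense in which the receptive field is enlarged.
impulse = np.zeros((8, 8))
impulse[3, 4] = 1.0
response = spectral_mix(impulse, weight)
touched = int(np.count_nonzero(np.abs(response) > 1e-12))

loss = recovery_loss(frame, spectral_mix(frame, weight))
```

Because every output pixel of the inverse transform depends on every frequency, one frequency-domain multiplication mixes information across the entire frame, whereas a small spatial kernel would need many stacked layers to reach the same extent.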

research
07/20/2020

Learning Joint Spatial-Temporal Transformations for Video Inpainting

High-quality video inpainting that completes missing regions in video fr...
research
11/19/2022

NIO: Lightweight neural operator-based architecture for video frame interpolation

We present, NIO - Neural Interpolation Operator, a lightweight efficient...
research
10/17/2021

Robust Pedestrian Attribute Recognition Using Group Sparsity for Occlusion Videos

Occlusion processing is a key issue in pedestrian attribute recognition ...
research
05/24/2021

Taylor saves for later: disentanglement for video prediction using Taylor representation

Video prediction is a challenging task with wide application prospects i...
research
05/24/2019

From Here to There: Video Inbetweening Using Direct 3D Convolutions

We consider the problem of generating plausible and diverse video sequen...
research
09/15/2020

Comparison of Spatiotemporal Networks for Learning Video Related Tasks

Many methods for learning from video sequences involve temporally proces...