Hierarchical Masked 3D Diffusion Model for Video Outpainting

09/05/2023
by Fanda Fan, et al.

Video outpainting aims to adequately complete missing areas at the edges of video frames. Compared to image outpainting, it presents an additional challenge: the model must keep the filled area temporally consistent. In this paper, we introduce a masked 3D diffusion model for video outpainting. We train the 3D diffusion model with mask modeling, which allows us to use multiple guide frames to connect the results of multiple video clip inferences, ensuring temporal consistency and reducing jitter between adjacent frames. We also extract global frames of the video as prompts and use cross-attention to guide the model toward information from outside the current video clip. Finally, we introduce a hybrid coarse-to-fine inference pipeline to alleviate the artifact accumulation problem. Existing coarse-to-fine pipelines use only the infilling strategy, which degrades quality because the time interval between sparse frames is too large. Our pipeline benefits from the bidirectional learning of mask modeling and can therefore employ a hybrid strategy of infilling and interpolation when generating sparse frames. Experiments show that our method achieves state-of-the-art results in video outpainting tasks. More results are provided on our project page: https://fanfanda.github.io/M3DDM/.
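
As a rough illustration of two ideas mentioned in the abstract (guide frames exposed through masking, and a coarse pass over sparse frames followed by a fine pass), the following Python sketch builds a per-clip mask and a two-level frame schedule. It is not the authors' implementation: the function names, the mask convention (1 = known pixel, 0 = pixel to generate), the centered known region, and the fixed stride are all assumptions made for the example, and no diffusion model is involved.

```python
def make_outpainting_mask(num_frames, height, width, keep_ratio, guide_frames=()):
    """Build a frames x height x width mask (hypothetical convention:
    1 = known pixel, 0 = pixel the model would have to generate)."""
    keep_w = max(1, round(width * keep_ratio))  # width of the known centre band
    left = (width - keep_w) // 2                # first known column
    guide = set(guide_frames)
    mask = []
    for t in range(num_frames):
        if t in guide:
            # Assumption: a guide frame is fully known, e.g. taken from a
            # previous clip's result, so nothing is masked in it.
            row = [1] * width
        else:
            # Only the centre band is known; left and right edges are missing.
            row = [1 if left <= x < left + keep_w else 0 for x in range(width)]
        mask.append([list(row) for _ in range(height)])
    return mask


def coarse_to_fine_schedule(num_frames, stride):
    """Split frame indices into sparse keyframes (coarse pass) and the
    segments between them that still need filling (fine pass)."""
    keyframes = list(range(0, num_frames, stride))
    if keyframes[-1] != num_frames - 1:
        keyframes.append(num_frames - 1)  # always include the last frame
    # A segment (a, b) needs a fine pass only if frames lie strictly between.
    segments = [(a, b) for a, b in zip(keyframes, keyframes[1:]) if b - a > 1]
    return keyframes, segments


mask = make_outpainting_mask(num_frames=4, height=2, width=8,
                             keep_ratio=0.5, guide_frames=[0])
keyframes, segments = coarse_to_fine_schedule(num_frames=10, stride=4)
```

In this sketch, each segment returned by the schedule has known frames at both ends, which is the situation where an interpolation-style pass is possible; a pipeline restricted to infilling would instead condition only on earlier frames.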

