Tell Me What Happened: Unifying Text-guided Video Completion via Multimodal Masked Video Generation

11/23/2022
by Tsu-Jui Fu, et al.

Generating a video from its first few static frames is challenging because the model must anticipate plausible future frames that remain temporally coherent. Besides video prediction, the ability to rewind from the last frame or to infill between the head and tail frames is also crucial, yet both have rarely been explored for video completion. Since just a few hint frames can lead to many different outcomes, a system that follows natural language to perform video completion may significantly improve controllability. Inspired by this, we introduce a novel task, text-guided video completion (TVC), which asks the model to generate a video from partial frames guided by an instruction. We then propose Multimodal Masked Video Generation (MMVG) to address the TVC task. During training, MMVG discretizes the video frames into visual tokens and masks most of them to perform video completion from any time point. At inference time, a single MMVG model can address all three cases of TVC (video prediction, rewind, and infilling) by applying the corresponding masking conditions. We evaluate MMVG in various video scenarios, including egocentric, animation, and gaming. Extensive experimental results indicate that MMVG is effective in generating high-quality visual appearances with text guidance for TVC.
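
To make the three inference-time masking conditions concrete, here is a minimal sketch of how a per-frame mask could be built for each TVC case. It is an illustration only, not the authors' implementation: the function name `tvc_frame_mask`, the `num_given` parameter, and the choice to mask whole frames rather than individual visual tokens are all assumptions made for this example.

```python
from typing import List


def tvc_frame_mask(num_frames: int, task: str, num_given: int = 1) -> List[bool]:
    """Build a hypothetical per-frame mask for text-guided video completion.

    True  = the frame is masked and must be generated.
    False = the frame is given to the model as a hint.
    Frame-level masking is an assumption of this sketch; the abstract only
    says that MMVG masks visual tokens.
    """
    if num_given < 1:
        raise ValueError("num_given must be at least 1")

    if task == "prediction":
        # Keep the first frames, generate everything after them.
        given = list(range(num_given))
    elif task == "rewind":
        # Keep the last frames, generate everything before them.
        given = list(range(num_frames - num_given, num_frames))
    elif task == "infilling":
        # Keep both the head and the tail, generate the middle.
        given = list(range(num_given)) + list(
            range(num_frames - num_given, num_frames)
        )
    else:
        raise ValueError(f"unknown task: {task!r}")

    # At least one frame must remain to be generated.
    if len(set(given)) >= num_frames or num_frames - num_given < 0:
        raise ValueError("too many given frames for this video length")

    mask = [True] * num_frames
    for index in given:
        mask[index] = False
    return mask


if __name__ == "__main__":
    for task in ("prediction", "rewind", "infilling"):
        print(task, tvc_frame_mask(5, task))
```

The point of the sketch is that one function covers all three cases by changing only which frames are left unmasked, which mirrors the abstract's claim that a single model handles prediction, rewind, and infilling through different masking conditions.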

research
06/13/2023

Rerender A Video: Zero-Shot Text-Guided Video-to-Video Translation

Large text-to-image diffusion models have exhibited impressive proficien...
research
03/29/2023

Sounding Video Generator: A Unified Framework for Text-guided Sounding Video Generation

As a combination of visual and audio signals, video is inherently multi-...
research
11/23/2017

Deep Video Generation, Prediction and Completion of Human Action Sequences

Current deep learning results on video generation are limited while ther...
research
04/02/2021

Language-based Video Editing via Multi-Modal Multi-Level Transformer

Video editing tools are widely used nowadays for digital design. Althoug...
research
08/28/2023

MagicAvatar: Multimodal Avatar Generation and Animation

This report presents MagicAvatar, a framework for multimodal video gener...
research
05/13/2023

Mask to reconstruct: Cooperative Semantics Completion for Video-text Retrieval

Recently, masked video modeling has been widely explored and significant...
research
04/05/2019

Point-to-Point Video Generation

While image manipulation achieves tremendous breakthroughs (e.g., genera...
