InstructVid2Vid: Controllable Video Editing with Natural Language Instructions

05/21/2023
by Bosheng Qin, et al.

We present an end-to-end diffusion-based method for editing videos with natural language instructions, named $\textbf{InstructVid2Vid}$. Our approach edits input videos based on natural language instructions without any per-example fine-tuning or inversion. The proposed InstructVid2Vid model combines a pretrained image generation model, Stable Diffusion, with a conditional 3D U-Net architecture to generate a time-dependent sequence of video frames. To obtain training data, we combine the knowledge and expertise of several models, including ChatGPT, BLIP, and Tune-a-Video, to synthesize video-instruction triplets, a more cost-efficient alternative to collecting such data in real-world scenarios. To improve consistency between adjacent frames of the generated videos, we propose the Frame Difference Loss, which is incorporated during training. During inference, we extend classifier-free guidance to text-video input so that the generated results adhere more closely to both the input video and the instruction. Experiments demonstrate that InstructVid2Vid generates high-quality, temporally coherent videos and performs diverse edits, including attribute editing, background change, and style transfer. These results highlight the versatility and effectiveness of our proposed method. Code is released at $\href{https://github.com/BrightQin/InstructVid2Vid}{InstructVid2Vid}$.


Related research

- 11/17/2022 · InstructPix2Pix: Learning to Follow Image Editing Instructions
- 12/22/2022 · Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation
- 04/12/2023 · Improving Diffusion Models for Scene Text Editing with Dual Encoders
- 03/17/2023 · DialogPaint: A Dialog-based Image Editing Model
- 04/02/2021 · Language-based Video Editing via Multi-Modal Multi-Level Transformer
- 04/05/2019 · Point-to-Point Video Generation
- 10/22/2020 · Castle in the Sky: Dynamic Sky Replacement and Harmonization in Videos
