Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation

12/22/2022
by   Jay Zhangjie Wu, et al.

To replicate the success of text-to-image (T2I) generation, recent works on text-to-video (T2V) generation employ large-scale text-video datasets for fine-tuning. However, such a paradigm is computationally expensive. Humans have the remarkable ability to learn new visual concepts from a single exemplar. We therefore study a new T2V generation problem, One-Shot Video Generation, in which only a single text-video pair is presented for training an open-domain T2V generator. Intuitively, we propose to adapt a T2I diffusion model pretrained on massive image data for T2V generation. We make two key observations: 1) T2I models can generate images that align well with verb terms; 2) extending T2I models to generate multiple images concurrently exhibits surprisingly good content consistency. To further learn continuous motion, we propose Tune-A-Video with a tailored Sparse-Causal Attention, which generates videos from text prompts via efficient one-shot tuning of pretrained T2I diffusion models. Tune-A-Video produces temporally coherent videos across various applications, such as changing the subject or background, attribute editing, and style transfer, demonstrating the versatility and effectiveness of our method.
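The Sparse-Causal Attention mentioned above restricts each frame's attention to a sparse subset of other frames rather than all of them; a common reading is that frame i attends only to the first (anchor) frame and the immediately preceding frame. The following is a minimal, framework-free sketch of that attention pattern, assuming this anchor-plus-previous-frame layout; the function name, token shapes, and the use of NumPy instead of a deep-learning framework are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax over the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sparse_causal_attention(frames_q, frames_k, frames_v):
    """Sketch of a sparse-causal attention pattern.

    For each frame i, queries attend only to the keys/values of
    frame 0 (the anchor frame) and frame i-1 (the previous frame),
    instead of all T frames as in full spatio-temporal attention.

    frames_q, frames_k, frames_v: arrays of shape (T, N, d), where
    T = number of frames, N = tokens per frame, d = head dimension.
    """
    T, N, d = frames_q.shape
    out = np.empty_like(frames_v)
    for i in range(T):
        prev = max(i - 1, 0)
        # concatenate anchor-frame and previous-frame keys/values
        k = np.concatenate([frames_k[0], frames_k[prev]], axis=0)  # (2N, d)
        v = np.concatenate([frames_v[0], frames_v[prev]], axis=0)  # (2N, d)
        # scaled dot-product attention restricted to those tokens
        attn = softmax(frames_q[i] @ k.T / np.sqrt(d))             # (N, 2N)
        out[i] = attn @ v
    return out
```

Because frame i never sees frames beyond i-1 (other than the shared anchor), perturbing the keys of a late frame leaves the outputs of all earlier frames unchanged, which is the causality property the name refers to.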


