SimDA: Simple Diffusion Adapter for Efficient Video Generation

08/18/2023
by   Zhen Xing, et al.

The recent wave of AI-generated content has witnessed the great development and success of Text-to-Image (T2I) technologies. By contrast, Text-to-Video (T2V) still falls short of expectations despite attracting increasing interest. Existing works either train from scratch or adapt a large T2I model to videos, both of which are computationally expensive and resource-intensive. In this work, we propose a Simple Diffusion Adapter (SimDA) that fine-tunes only 24M out of the 1.1B parameters of a strong T2I model, adapting it to video generation in a parameter-efficient way. In particular, we adapt the T2I model for T2V by designing lightweight spatial and temporal adapters for transfer learning. Besides, we replace the original spatial attention with the proposed Latent-Shift Attention (LSA) for temporal consistency. With a similar model architecture, we further train a video super-resolution model to generate high-definition (1024×1024) videos. In addition to T2V generation in the wild, SimDA can also be used for one-shot video editing with only 2 minutes of tuning. In this way, our method minimizes the training effort, requiring extremely few tunable parameters for model adaptation.
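The two ideas in the abstract, lightweight adapters added to a frozen backbone and a shift of latent features across frames before attention, can be sketched in a few lines. The following NumPy snippet is an illustrative assumption, not the paper's actual code: the bottleneck width, shift fraction, and tensor layout are hypothetical, and the shift follows the general TSM-style pattern of exchanging a slice of channels with neighboring frames.

```python
import numpy as np

def adapter(x, w_down, w_up):
    """Bottleneck adapter: down-project, nonlinearity, up-project, plus a
    residual connection. Only w_down / w_up would be trained; the frozen
    backbone weights stay untouched (hence the small trainable-parameter count)."""
    h = x @ w_down
    h = h * 0.5 * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (h + 0.044715 * h**3)))  # tanh-approx GELU
    return x + h @ w_up

def latent_shift(x, shift_frac=0.25):
    """Shift a fraction of latent channels one step along the time axis, so
    per-frame spatial attention also sees features from neighboring frames.
    Layout (frames, tokens, channels) and shift_frac are assumptions."""
    t, n, c = x.shape
    k = int(c * shift_frac)
    out = x.copy()
    out[1:, :, :k] = x[:-1, :, :k]        # first slice: shifted forward in time
    out[:-1, :, k:2 * k] = x[1:, :, k:2 * k]  # second slice: shifted backward
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 16, 32))           # 4 frames, 16 tokens, 32 channels
w_down = rng.standard_normal((32, 8)) * 0.02   # bottleneck width 8 << 32
w_up = rng.standard_normal((8, 32)) * 0.02
y = adapter(latent_shift(x), w_down, w_up)
print(y.shape)  # (4, 16, 32)
```

The residual form means the adapter starts near identity, so the pretrained T2I behavior is preserved at initialization while the small projections learn the video-specific adjustment.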



