CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers

05/29/2022
by Wenyi Hong, et al.

Large-scale pretrained transformers have created milestones in text (GPT-3) and text-to-image (DALL-E and CogView) generation. Their application to video generation still faces many challenges: the potentially huge computation cost makes training from scratch unaffordable, and the scarcity and weak relevance of text-video datasets hinder the model from understanding complex movement semantics. In this work, we present CogVideo, a 9B-parameter transformer trained by inheriting a pretrained text-to-image model, CogView2. We also propose a multi-frame-rate hierarchical training strategy to better align text and video clips. As (probably) the first open-source large-scale pretrained text-to-video model, CogVideo outperforms all publicly available models by a large margin in machine and human evaluations.
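The multi-frame-rate strategy mentioned above conditions generation on the frame rate at which a clip was sampled. A minimal sketch of how such conditioning could be wired into a token sequence, assuming a hypothetical vocabulary layout (the token ids and helper below are illustrative, not the released CogVideo code):

```python
# Illustrative sketch: condition a text-to-video transformer on frame rate
# by prepending a frame-rate token to the input sequence. The vocabulary
# offsets here are made up for demonstration.

FRAME_RATE_TOKEN_BASE = 50000  # hypothetical offset for frame-rate tokens
SEP_TOKEN = 49999              # hypothetical text/frame separator id

def build_input_sequence(frame_rate, text_tokens, frame_tokens):
    """Concatenate [frame-rate token] + text tokens + [SEP] + frame tokens.

    frame_rate: frames per second sampled for this training clip, so the
    model learns that the same caption can describe clips of different
    temporal granularity.
    """
    fr_token = FRAME_RATE_TOKEN_BASE + frame_rate
    return [fr_token] + list(text_tokens) + [SEP_TOKEN] + list(frame_tokens)

# Example: an 8-fps clip with three text tokens and four frame tokens.
seq = build_input_sequence(8, [101, 102, 103], [7, 8, 9, 10])
```

At generation time, varying the frame-rate token lets the model first produce a coarse, low-frame-rate clip and then fill in frames at higher rates, which is the hierarchical part of the strategy.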


Related research

07/21/2022
TinyViT: Fast Pretraining Distillation for Small Vision Transformers
Vision transformer (ViT) recently has drawn great attention in computer ...

07/26/2021
Don't Sweep your Learning Rate under the Rug: A Closer Look at Cross-modal Transfer of Pretrained Transformers
Self-supervised pre-training of large-scale transformer models on text c...

12/22/2022
Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation
To reproduce the success of text-to-image (T2I) generation, recent works...

06/21/2021
CLIP2Video: Mastering Video-Text Retrieval via Image CLIP
We present CLIP2Video network to transfer the image-language pre-trainin...

08/05/2021
Finetuning Pretrained Transformers into Variational Autoencoders
Text variational autoencoders (VAEs) are notorious for posterior collaps...

07/07/2022
Training Transformers Together
The infrastructure necessary for training state-of-the-art models is bec...

04/18/2023
SViTT: Temporal Learning of Sparse Video-Text Transformers
Do video-text transformers learn to model temporal relationships across ...
