NÜWA: Visual Synthesis Pre-training for Neural visUal World creAtion

11/24/2021
by   Chenfei Wu, et al.

This paper presents NÜWA, a unified multimodal pre-trained model that can generate new visual data or manipulate existing visual data (i.e., images and videos) for various visual synthesis tasks. To cover language, image, and video at the same time for different scenarios, a 3D transformer encoder-decoder framework is designed, which can not only handle videos as 3D data but also adapt to texts and images as 1D and 2D data, respectively. A 3D Nearby Attention (3DNA) mechanism is also proposed to account for the nature of the visual data and reduce the computational complexity. We evaluate NÜWA on 8 downstream tasks. Compared to several strong baselines, NÜWA achieves state-of-the-art results on text-to-image generation, text-to-video generation, and video prediction, among other tasks. Furthermore, it shows surprisingly good zero-shot capabilities on text-guided image and video manipulation tasks. The project repository is at https://github.com/microsoft/NUWA.
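
The abstract does not spell out how 3DNA works, so the following is only a rough sketch of one plausible reading: each token in a (time, height, width) grid attends only to tokens inside a fixed local window around it. The function names `nearby_mask` and `nearby_attention`, the `extent` parameter, and the dense-mask formulation are all assumptions made for illustration and are not taken from the paper or the repository.

```python
import numpy as np


def nearby_mask(shape, extent):
    """Return an (N, N) boolean mask, N = T*H*W; True where a query may attend a key.

    shape  -- (T, H, W) size of the token grid
    extent -- (et, eh, ew) full window size per axis (hypothetical parameter)
    """
    # Integer (t, h, w) coordinates of every token, flattened to (N, 3).
    grids = np.meshgrid(*[np.arange(n) for n in shape], indexing="ij")
    coords = np.stack(grids, axis=-1).reshape(-1, 3)
    # Absolute per-axis distance between every query/key pair: (N, N, 3).
    diff = np.abs(coords[:, None, :] - coords[None, :, :])
    half = np.array(extent) // 2
    # A key is "nearby" if it lies within the half-window on all three axes.
    return np.all(diff <= half, axis=-1)


def nearby_attention(q, k, v, mask):
    """Scaled dot-product attention restricted to the pairs allowed by mask."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -np.inf)      # block far-away keys
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

Under this assumed formulation, an image could be treated as a grid of shape `(1, H, W)` and a text sequence as `(1, 1, W)`, which mirrors the abstract's description of handling 1D, 2D, and 3D data in one framework. Note that the dense mask only illustrates the sparsity pattern; an implementation that actually reduces computation would gather just the nearby keys for each query instead of scoring all pairs.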

02/15/2020

UniViLM: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation

We propose UniViLM: a Unified Video and Language pre-training Model for ...
04/30/2021

GODIVA: Generating Open-DomaIn Videos from nAtural Descriptions

Generating videos from text is a challenging task due to its high comput...
11/19/2021

Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions

We study joint video and language (VL) pre-training to enable cross-moda...
12/08/2021

Prompting Visual-Language Models for Efficient Video Understanding

Visual-language pre-training has shown great success for learning joint ...
01/17/2021

Narration Generation for Cartoon Videos

Research on text generation from multimodal inputs has largely focused o...
12/04/2021

VT-CLIP: Enhancing Vision-Language Models with Visual-guided Texts

Contrastive Vision-Language Pre-training (CLIP) has drawn increasing att...
12/15/2021

StyleMC: Multi-Channel Based Fast Text-Guided Image Generation and Manipulation

Discovering meaningful directions in the latent space of GANs to manipul...

Code Repositories

NUWA

A unified 3D Transformer Pipeline for visual synthesis

