NÜWA: Visual Synthesis Pre-training for Neural visUal World creAtion

by Chenfei Wu, et al.

This paper presents NÜWA, a unified multimodal pre-trained model that can generate new visual data or manipulate existing visual data (i.e., images and videos) for various visual synthesis tasks. To cover language, image, and video at the same time for different scenarios, a 3D transformer encoder-decoder framework is designed, which can not only deal with videos as 3D data but also adapt to texts and images as 1D and 2D data, respectively. A 3D Nearby Attention (3DNA) mechanism is also proposed to take the nature of visual data into account and to reduce computational complexity. We evaluate NÜWA on 8 downstream tasks. Compared to several strong baselines, NÜWA achieves state-of-the-art results on tasks including text-to-image generation, text-to-video generation, and video prediction. Furthermore, it also shows surprisingly good zero-shot capabilities on text-guided image and video manipulation tasks. The project repository is at https://github.com/microsoft/NUWA.
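The abstract does not spell out how 3DNA works, so the following is only a minimal sketch of the general idea of attention restricted to a local 3D neighborhood, not the authors' implementation. Everything in it is an assumption made for illustration: the function names `nearby_indices` and `nearby_attention`, the `extent` parameter, the `(T, H, W, D)` layout, and the use of the input itself as query, key, and value with no learned projections. Under this assumed layout, a text of length L could be stored as `(L, 1, 1, D)`, an image as `(1, H, W, D)`, and a video as `(T, H, W, D)`, so one routine handles 1D, 2D, and 3D data.

```python
import numpy as np

def nearby_indices(pos, shape, extent):
    """Flat indices of all grid cells within the local window around `pos`.

    `extent` is the (hypothetical) window size along each axis; windows are
    clipped at the grid boundary.
    """
    ranges = []
    for p, size, e in zip(pos, shape, extent):
        lo = max(0, p - e // 2)
        hi = min(size, p + e // 2 + 1)
        ranges.append(np.arange(lo, hi))
    grid = np.meshgrid(*ranges, indexing="ij")
    return np.ravel_multi_index(tuple(g.ravel() for g in grid), shape)

def nearby_attention(x, extent=(3, 3, 3)):
    """Self-attention where each token attends only to its local 3D window.

    x has shape (T, H, W, D). Assumption: queries, keys, and values are all
    the raw input; a real model would apply learned projections first.
    """
    T, H, W, D = x.shape
    flat = x.reshape(-1, D)
    out = np.empty_like(flat)
    for i in range(flat.shape[0]):
        pos = np.unravel_index(i, (T, H, W))
        idx = nearby_indices(pos, (T, H, W), extent)
        scores = flat[idx] @ flat[i] / np.sqrt(D)   # scaled dot products
        weights = np.exp(scores - scores.max())      # stable softmax
        weights /= weights.sum()
        out[i] = weights @ flat[idx]
    return out.reshape(T, H, W, D)

# The same routine applied to video-, image-, and text-shaped inputs.
rng = np.random.default_rng(0)
video = nearby_attention(rng.normal(size=(4, 6, 6, 8)))
image = nearby_attention(rng.normal(size=(1, 6, 6, 8)))
text = nearby_attention(rng.normal(size=(10, 1, 1, 8)))
```

In this sketch, the number of score computations falls from (T·H·W)² for full attention to at most T·H·W times the window volume, which is the kind of saving the abstract alludes to when it says 3DNA reduces computational complexity.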





UniViLM: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation

We propose UniViLM: a Unified Video and Language pre-training Model for ...

GODIVA: Generating Open-DomaIn Videos from nAtural Descriptions

Generating videos from text is a challenging task due to its high comput...

Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions

We study joint video and language (VL) pre-training to enable cross-moda...

Prompting Visual-Language Models for Efficient Video Understanding

Visual-language pre-training has shown great success for learning joint ...

Narration Generation for Cartoon Videos

Research on text generation from multimodal inputs has largely focused o...

VT-CLIP: Enhancing Vision-Language Models with Visual-guided Texts

Contrastive Vision-Language Pre-training (CLIP) has drawn increasing att...

StyleMC: Multi-Channel Based Fast Text-Guided Image Generation and Manipulation

Discovering meaningful directions in the latent space of GANs to manipul...
