NÜWA: Visual Synthesis Pre-training for Neural visUal World creAtion

11/24/2021
by   Chenfei Wu, et al.
26

This paper presents a unified multimodal pre-trained model called NÜWA that can generate new or manipulate existing visual data (i.e., images and videos) for various visual synthesis tasks. To cover language, image, and video at the same time for different scenarios, a 3D transformer encoder-decoder framework is designed, which can not only deal with videos as 3D data but also adapt to texts and images as 1D and 2D data, respectively. A 3D Nearby Attention (3DNA) mechanism is also proposed to consider the nature of the visual data and reduce the computational complexity. We evaluate NÜWA on 8 downstream tasks. Compared to several strong baselines, NÜWA achieves state-of-the-art results on text-to-image generation, text-to-video generation, video prediction, etc. Furthermore, it also shows surprisingly good zero-shot capabilities on text-guided image and video manipulation tasks. Project repo is https://github.com/microsoft/NUWA.

READ FULL TEXT

page 13

page 14

page 15

page 16

page 21

page 22

page 25

page 26

research
02/15/2020

UniViLM: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation

We propose UniViLM: a Unified Video and Language pre-training Model for ...
research
06/14/2022

LAVENDER: Unifying Video-Language Understanding as Masked Language Modeling

Unified vision-language frameworks have greatly advanced in recent years...
research
11/19/2021

Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions

We study joint video and language (VL) pre-training to enable cross-moda...
research
05/23/2023

Perception Test: A Diagnostic Benchmark for Multimodal Video Models

We propose a novel multimodal video benchmark - the Perception Test - to...
research
09/21/2023

TextCLIP: Text-Guided Face Image Generation And Manipulation Without Adversarial Training

Text-guided image generation aimed to generate desired images conditione...
research
03/04/2022

Show Me What and Tell Me How: Video Synthesis via Multimodal Conditioning

Most methods for conditional video synthesis use a single modality as th...
research
07/20/2022

NUWA-Infinity: Autoregressive over Autoregressive Generation for Infinite Visual Synthesis

In this paper, we present NUWA-Infinity, a generative model for infinite...

Please sign up or login with your details

Forgot password? Click here to reset