Unified Discrete Diffusion for Simultaneous Vision-Language Generation

11/27/2022
by Minghui Hu, et al.

Recently developed discrete diffusion models perform extraordinarily well on the text-to-image task, showing significant promise for handling multi-modality signals. In this work, we harness these traits and present a unified multimodal generation model that can conduct both "modality translation" and "multi-modality generation" tasks with a single model, performing text-based, image-based, and even simultaneous vision-language generation. Specifically, we unify the discrete diffusion process for multimodal signals by proposing a unified transition matrix. Moreover, we design a mutual attention module with a fused embedding layer and a unified objective function to emphasise the inter-modal linkages, which are vital for multi-modality generation. Extensive experiments indicate that our proposed method performs comparably to state-of-the-art solutions in various generation tasks.
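
To make the idea of a unified transition matrix concrete, the sketch below builds one possible per-step matrix over a joint vocabulary of image tokens, text tokens, and a shared mask token. It is an illustration only, not the formulation from the paper: the block layout, the function names `unified_transition_matrix` and `forward_step`, and the parameters `beta_img`, `beta_txt`, and `gamma` are all assumptions, chosen to follow the common mask-and-replace style of discrete diffusion.

```python
import numpy as np


def unified_transition_matrix(k_img, k_txt, beta_img, beta_txt, gamma):
    """Row-stochastic matrix Q with Q[i, j] = P(x_t = j | x_{t-1} = i).

    Assumed state layout (hypothetical): indices [0, k_img) are image tokens,
    [k_img, k_img + k_txt) are text tokens, and the last index is a shared
    [MASK] token.
    """
    n = k_img + k_txt + 1
    # Probability of keeping the same token, chosen so that each row sums to 1.
    alpha_img = 1.0 - k_img * beta_img - gamma
    alpha_txt = 1.0 - k_txt * beta_txt - gamma
    if alpha_img < 0 or alpha_txt < 0:
        raise ValueError("beta and gamma are too large for a valid distribution")

    q = np.zeros((n, n))
    # Image block: uniform resampling within the image vocabulary only.
    q[:k_img, :k_img] = beta_img + alpha_img * np.eye(k_img)
    # Text block: uniform resampling within the text vocabulary only.
    txt = slice(k_img, k_img + k_txt)
    q[txt, txt] = beta_txt + alpha_txt * np.eye(k_txt)
    # Any non-mask token may be replaced by [MASK]; [MASK] is absorbing.
    q[:-1, -1] = gamma
    q[-1, -1] = 1.0
    return q


def forward_step(tokens, q, rng):
    """Sample one noising step for a sequence of mixed image/text token ids."""
    return np.array([rng.choice(q.shape[0], p=q[t]) for t in tokens])


q = unified_transition_matrix(k_img=4, k_txt=3, beta_img=0.05, beta_txt=0.1, gamma=0.2)
rng = np.random.default_rng(0)
noised = forward_step(np.array([0, 1, 4, 6]), q, rng)
```

In this sketch the off-diagonal blocks between the two modalities are zero, so an image token is never corrupted into a text token or the reverse; the two modalities share only the mask state and the single matrix that drives both.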


research
06/01/2023

Cocktail: Mixing Multi-Modality Controls for Text-Conditional Image Generation

Text-conditional diffusion models are able to generate high-fidelity ima...
research
02/11/2023

A Reparameterized Discrete Diffusion Model for Text Generation

This work studies discrete diffusion probabilistic models with applicati...
research
03/12/2023

One Transformer Fits All Distributions in Multi-Modal Diffusion at Scale

This paper proposes a unified diffusion framework (dubbed UniDiffuser) t...
research
02/01/2023

mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video

Recent years have witnessed a big convergence of language, vision, and m...
research
09/06/2021

Vision Guided Generative Pre-trained Language Models for Multimodal Abstractive Summarization

Multimodal abstractive summarization (MAS) models that summarize videos ...
research
03/18/2023

3DQD: Generalized Deep 3D Shape Prior via Part-Discretized Diffusion Process

We develop a generalized 3D shape generation prior model, tailored for m...
research
06/19/2023

MotionGPT: Finetuned LLMs are General-Purpose Motion Generators

Generating realistic human motion from given action descriptions has exp...
