One Transformer Fits All Distributions in Multi-Modal Diffusion at Scale

03/12/2023
by   Fan Bao, et al.

This paper proposes a unified diffusion framework (dubbed UniDiffuser) to fit all distributions relevant to a set of multi-modal data in one model. Our key insight is that learning diffusion models for marginal, conditional, and joint distributions can be unified as predicting the noise in the perturbed data, where the perturbation levels (i.e., timesteps) can differ across modalities. Inspired by this unified view, UniDiffuser learns all distributions simultaneously with a minimal modification to the original diffusion model: it perturbs data in all modalities instead of a single modality, takes individual timesteps for the different modalities as input, and predicts the noise of all modalities instead of a single one. UniDiffuser is parameterized by a transformer for diffusion models so that it can handle input types from different modalities. Trained on large-scale paired image-text data, UniDiffuser can perform image, text, text-to-image, image-to-text, and image-text pair generation by setting the proper timesteps, without additional overhead. In particular, UniDiffuser produces perceptually realistic samples in all tasks, and its quantitative results (e.g., FID and CLIP score) are not only superior to those of existing general-purpose models but also comparable to bespoke models (e.g., Stable Diffusion and DALL-E 2) on representative tasks (e.g., text-to-image generation).
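The core modification described above can be sketched in a few lines: perturb each modality with its own independently sampled timestep and regress the concatenated noise. The following is a minimal NumPy sketch of that training objective; the linear beta schedule, the array shapes, and the `eps_model` interface are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def alpha_bar(t, T=1000):
    """Cumulative signal-retention factor under a simple linear beta
    schedule (illustrative; the paper uses a standard diffusion schedule)."""
    betas = np.linspace(1e-4, 0.02, T)
    return np.prod(1.0 - betas[: t + 1])

def unidiffuser_step(x_img, x_txt, eps_model, T=1000):
    """One training step: perturb BOTH modalities with independent
    timesteps and regress the noise of both jointly."""
    # Independent perturbation levels per modality (the key change
    # relative to a single-modality diffusion model).
    t_img, t_txt = rng.integers(0, T, size=2)
    eps_img = rng.standard_normal(x_img.shape)
    eps_txt = rng.standard_normal(x_txt.shape)
    a_i, a_t = alpha_bar(t_img, T), alpha_bar(t_txt, T)
    z_img = np.sqrt(a_i) * x_img + np.sqrt(1.0 - a_i) * eps_img
    z_txt = np.sqrt(a_t) * x_txt + np.sqrt(1.0 - a_t) * eps_txt
    # The (hypothetical) joint noise-prediction network sees both noisy
    # modalities and both timesteps, and predicts noise for both.
    pred_img, pred_txt = eps_model(z_img, z_txt, t_img, t_txt)
    target = np.concatenate([eps_img.ravel(), eps_txt.ravel()])
    pred = np.concatenate([pred_img.ravel(), pred_txt.ravel()])
    return np.mean((pred - target) ** 2)

# Toy stand-in model that predicts zero noise; the loss is then just
# the mean squared magnitude of the sampled noise.
loss = unidiffuser_step(
    np.zeros(8),
    np.zeros(4),
    eps_model=lambda zi, zt, ti, tt: (np.zeros_like(zi), np.zeros_like(zt)),
)
```

At sampling time, per the abstract, the same network covers all distributions by choosing the timesteps: fixing one modality's timestep at 0 keeps that modality clean and yields conditional generation (e.g., text-to-image), while setting both timesteps to run through the full reverse process yields joint image-text generation.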


