Masked Diffusion Transformer is a Strong Image Synthesizer

03/25/2023
by Shanghua Gao, et al.

We observe that, despite their success in image synthesis, diffusion probabilistic models (DPMs) often lack the contextual reasoning ability needed to learn the relations among object parts in an image, which leads to a slow learning process. To solve this issue, we propose a Masked Diffusion Transformer (MDT) that introduces a mask latent modeling scheme to explicitly enhance the DPMs' ability to learn contextual relations among the semantic parts of objects in an image. During training, MDT operates in the latent space and masks certain tokens. An asymmetric masking diffusion transformer is then designed to predict the masked tokens from the unmasked ones while maintaining the diffusion generation process. MDT can thus reconstruct the full information of an image from incomplete contextual input, which enables it to learn the associated relations among image tokens. Experimental results show that MDT achieves superior image synthesis performance, e.g., a new SoTA FID score on the ImageNet dataset, and learns about 3x faster than the previous SoTA, DiT. The source code is released at https://github.com/sail-sg/MDT.
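
The masking step described in the abstract can be illustrated with a minimal sketch. This is not the released MDT code: the function names, the mask ratio of 0.3, and the use of plain Python lists in place of latent tensors are all assumptions made for illustration. It shows only how a token sequence might be split into visible tokens for the model to process and masked positions for it to predict.

```python
import random


def mask_latent_tokens(tokens, mask_ratio=0.3, seed=None):
    """Randomly hide a fraction of tokens (hypothetical helper).

    Returns the visible tokens, their positions, and the masked positions.
    mask_ratio=0.3 is an illustrative default, not a value from the abstract.
    """
    rng = random.Random(seed)
    n = len(tokens)
    n_masked = int(n * mask_ratio)
    # Choose which positions to hide, without replacement.
    masked_idx = sorted(rng.sample(range(n), n_masked))
    masked_set = set(masked_idx)
    visible_idx = [i for i in range(n) if i not in masked_set]
    visible_tokens = [tokens[i] for i in visible_idx]
    return visible_tokens, visible_idx, masked_idx


def restore_sequence(visible_tokens, visible_idx, masked_idx, mask_token):
    """Rebuild a full-length sequence, placing a placeholder at masked positions.

    A model that predicts masked tokens from unmasked ones would take a
    sequence like this as input and fill in the placeholders.
    """
    full = [mask_token] * (len(visible_idx) + len(masked_idx))
    for tok, i in zip(visible_tokens, visible_idx):
        full[i] = tok
    return full


if __name__ == "__main__":
    latent_tokens = list(range(16))  # stand-in for 16 latent patch tokens
    visible, vis_idx, msk_idx = mask_latent_tokens(latent_tokens, 0.25, seed=0)
    print(len(visible), "visible tokens,", len(msk_idx), "masked tokens")
    print(restore_sequence(visible, vis_idx, msk_idx, mask_token=None))
```

In this sketch only the visible tokens would be fed to the main network, and the placeholders mark where predictions are needed; how MDT actually arranges this asymmetry is described in the full paper and the released source.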


research
12/27/2022

Exploring Transformer Backbones for Image Diffusion Models

We present an end-to-end Transformer based Latent Diffusion model for im...
research
10/20/2022

Representation Learning with Diffusion Models

Diffusion models (DMs) have achieved state-of-the-art results for image ...
research
03/30/2023

Token Merging for Fast Stable Diffusion

The landscape of image generation has been forever changed by open vocab...
research
12/19/2022

Scalable Diffusion Models with Transformers

We explore a new class of diffusion models based on the transformer arch...
research
12/06/2022

Semantic-Conditional Diffusion Networks for Image Captioning

Recent advances on text-to-image generation have witnessed the rise of d...
research
05/27/2022

MixMIM: Mixed and Masked Image Modeling for Efficient Visual Representation Learning

In this study, we propose Mixed and Masked Image Modeling (MixMIM), a si...
research
06/08/2023

ADDP: Learning General Representations for Image Recognition and Generation with Alternating Denoising Diffusion Process

Image recognition and generation have long been developed independently ...
