Energy-Based Cross Attention for Bayesian Context Update in Text-to-Image Diffusion Models

06/16/2023
by Geon Yeong Park, et al.

Despite the remarkable performance of text-to-image diffusion models in image generation tasks, recent studies have shown that generated images sometimes fail to capture the intended semantic content of the text prompts, a phenomenon often called semantic misalignment. To address this, we present a novel energy-based model (EBM) framework. Specifically, we first formulate EBMs of latent image representations and text embeddings in each cross-attention layer of the denoising autoencoder. We then obtain the gradient of the log posterior of the context vectors, which can be updated and transferred to the subsequent cross-attention layer, thereby implicitly minimizing a nested hierarchy of energy functions. Our latent EBMs further enable zero-shot compositional generation as a linear combination of cross-attention outputs from different contexts. Through extensive experiments, we demonstrate that the proposed method is highly effective in handling various image generation tasks, including multi-concept generation, text-guided image inpainting, and real and synthetic image editing.
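To make the idea concrete, below is a minimal, hypothetical sketch of the two ingredients described in the abstract: a gradient step on the text context vectors with respect to an assumed attention-based energy, and zero-shot composition as a linear combination of cross-attention outputs from different contexts. The energy definition, function names, and step size here are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only: an assumed energy E(c) over text context vectors,
# one gradient-descent update on c, and linear composition of per-context
# cross-attention outputs. Shapes: q_img (B, N, d), ctx (B, L, d_ctx),
# W_k / W_v (d_ctx, d).
import torch
import torch.nn.functional as F


def cross_attention(q_img, k_ctx, v_ctx):
    """Standard scaled dot-product cross-attention."""
    d = q_img.shape[-1]
    attn = F.softmax(q_img @ k_ctx.transpose(-1, -2) / d**0.5, dim=-1)
    return attn @ v_ctx


def context_energy(q_img, k_ctx):
    """Assumed energy of the context given image queries (illustrative)."""
    d = q_img.shape[-1]
    logits = q_img @ k_ctx.transpose(-1, -2) / d**0.5
    return -torch.logsumexp(logits, dim=-1).mean()


def update_context(q_img, ctx, W_k, step_size=0.1):
    """One gradient step on the context embeddings along -dE/dc,
    standing in for the Bayesian context update described above."""
    ctx = ctx.detach().requires_grad_(True)
    energy = context_energy(q_img, ctx @ W_k)
    (grad,) = torch.autograd.grad(energy, ctx)
    return (ctx - step_size * grad).detach()


def compose_contexts(q_img, contexts, W_k, W_v, weights):
    """Zero-shot composition: weighted sum of per-context attention outputs."""
    outs = [cross_attention(q_img, c @ W_k, c @ W_v) for c in contexts]
    return sum(w * o for w, o in zip(weights, outs))
```

In the paper's setting, the updated context would be passed on to the next cross-attention layer of the denoising network at each sampling step; the sketch above only shows a single layer in isolation.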


