A-STAR: Test-time Attention Segregation and Retention for Text-to-Image Synthesis

06/26/2023
by Aishwarya Agarwal et al.

While recent developments in text-to-image generative models have led to a suite of high-performing methods capable of producing creative imagery from free-form text, these methods still have notable limitations. By analyzing the cross-attention representations of these models, we notice two key issues. First, for text prompts that contain multiple concepts, the cross-attention maps of different concepts overlap significantly in pixel space (i.e., they attend to the same spatial regions). This overlap prevents the model from distinguishing between the concepts, and one of them ends up ignored in the final generation. Second, while these models attempt to capture all such concepts at the beginning of denoising (e.g., the first few steps), as evidenced by their cross-attention maps, this knowledge is not retained by the end of denoising (e.g., the last few steps). Such loss of knowledge eventually leads to inaccurate generation outputs. To address these issues, we introduce two test-time attention-based loss functions that substantially improve the performance of pretrained baseline text-to-image diffusion models. First, our attention segregation loss reduces the cross-attention overlap between attention maps of different concepts in the text prompt, thereby reducing the conflict among concepts and ensuring that all concepts are captured in the generated output. Second, our attention retention loss explicitly forces text-to-image diffusion models to retain cross-attention information for all concepts across all denoising time steps, thereby reducing information loss and preserving all concepts in the generated output.
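To make the two losses concrete, here is a minimal PyTorch sketch of how such test-time objectives could be computed from per-concept cross-attention maps. The IoU-style overlap for segregation, the mean-threshold binarization for retention, and the gradient-based latent update shown in the comments are illustrative assumptions, not the paper's exact formulations.

```python
import torch

def attention_segregation_loss(concept_maps, eps=1e-8):
    """Penalize pixel-space overlap between cross-attention maps of
    different concepts. `concept_maps` is a list of (H, W) tensors,
    one per concept token, assumed rescaled to [0, 1].
    Illustrative IoU-style form; the paper may use a different measure."""
    loss = concept_maps[0].new_zeros(())
    for i in range(len(concept_maps)):
        for j in range(i + 1, len(concept_maps)):
            intersection = torch.minimum(concept_maps[i], concept_maps[j]).sum()
            union = torch.maximum(concept_maps[i], concept_maps[j]).sum()
            loss = loss + intersection / (union + eps)
    return loss

def attention_retention_loss(concept_maps_t, concept_maps_prev):
    """Penalize losing the high-attention regions each concept had at the
    previous (earlier, noisier) denoising step. The mean-threshold
    binarization below is a placeholder design choice."""
    loss = concept_maps_t[0].new_zeros(())
    for a_t, a_prev in zip(concept_maps_t, concept_maps_prev):
        keep = (a_prev > a_prev.mean()).float()  # regions to retain
        # attention mass the current step has dropped inside those regions
        loss = loss + ((keep - a_t).clamp(min=0) * keep).sum()
    return loss

# Hypothetical test-time guidance at denoising step t (no fine-tuning):
#   loss = attention_segregation_loss(maps_t) \
#        + attention_retention_loss(maps_t, maps_prev)
#   grad = torch.autograd.grad(loss, latent_t)[0]
#   latent_t = latent_t - step_size * grad   # then run the next denoising step
```

Because both losses act only on the latent during sampling, this style of guidance (as in Attend-and-Excite-style approaches) requires no retraining of the underlying diffusion model.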


