Linguistic Binding in Diffusion Models: Enhancing Attribute Correspondence through Attention Map Alignment

06/15/2023
by   Royi Rassin, et al.

Text-conditioned image generation models often generate incorrect associations between entities and their visual attributes. This reflects an impaired mapping between the linguistic binding of entities and modifiers in the prompt and the visual binding of the corresponding elements in the generated image. As one notable example, a query like “a pink sunflower and a yellow flamingo” may incorrectly produce an image of a yellow sunflower and a pink flamingo. To remedy this issue, we propose SynGen, an approach which first syntactically analyses the prompt to identify entities and their modifiers, and then uses a novel loss function that encourages the cross-attention maps to agree with the linguistic binding reflected by the syntax. Specifically, we encourage large overlap between the attention maps of entities and their modifiers, and small overlap with other entities and modifier words. The loss is optimized during inference, without retraining or fine-tuning the model. Human evaluation on three datasets, including one new and challenging set, demonstrates significant improvements of SynGen compared with current state-of-the-art methods. This work highlights how making use of sentence structure during inference can efficiently and substantially improve the faithfulness of text-to-image generation.
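
The loss described in the abstract can be illustrated with a small sketch. The abstract does not say how overlap is measured, so the histogram-intersection measure below is an assumption chosen for simplicity, not necessarily the measure SynGen uses. The names `overlap` and `binding_loss`, the toy 4-cell attention maps, and the hand-written modifier-entity pairs (standing in for the output of the syntactic analysis) are all hypothetical.

```python
from typing import Dict, List, Tuple

AttentionMap = List[float]  # flattened, non-negative attention weights


def normalize(attn: AttentionMap) -> AttentionMap:
    """Scale a map so its entries sum to 1."""
    total = sum(attn)
    if total <= 0:
        raise ValueError("attention map must have positive mass")
    return [v / total for v in attn]


def overlap(a: AttentionMap, b: AttentionMap) -> float:
    """Histogram intersection in [0, 1] (an assumed overlap measure)."""
    pa, pb = normalize(a), normalize(b)
    return sum(min(x, y) for x, y in zip(pa, pb))


def binding_loss(maps: Dict[str, AttentionMap],
                 pairs: List[Tuple[str, str]]) -> float:
    """Lower is better: rewards modifier-entity overlap and
    penalizes overlap of either word with every other word."""
    loss = 0.0
    for modifier, entity in pairs:
        # Positive term: the bound pair should attend to the same region.
        loss += 1.0 - overlap(maps[modifier], maps[entity])
        # Negative term: the pair should not overlap with unrelated words.
        others = [w for w in maps if w not in (modifier, entity)]
        if others:
            loss += sum(
                0.5 * (overlap(maps[modifier], maps[w])
                       + overlap(maps[entity], maps[w]))
                for w in others
            ) / len(others)
    return loss


# Hypothetical pairs for "a pink sunflower and a yellow flamingo".
pairs = [("pink", "sunflower"), ("yellow", "flamingo")]

# Toy maps over a 4-cell image: each modifier attends where its entity does.
aligned = {
    "pink": [0.9, 0.1, 0.0, 0.0], "sunflower": [0.8, 0.2, 0.0, 0.0],
    "yellow": [0.0, 0.0, 0.1, 0.9], "flamingo": [0.0, 0.0, 0.2, 0.8],
}
# Same entities, but the modifiers attend to the wrong entity's region.
swapped = {
    "pink": [0.0, 0.0, 0.1, 0.9], "sunflower": [0.8, 0.2, 0.0, 0.0],
    "yellow": [0.9, 0.1, 0.0, 0.0], "flamingo": [0.0, 0.0, 0.2, 0.8],
}

aligned_loss = binding_loss(aligned, pairs)
swapped_loss = binding_loss(swapped, pairs)
print(aligned_loss < swapped_loss)
```

In this toy setting the swapped configuration scores a higher loss than the aligned one, which is the direction an inference-time optimizer would need in order to push attention maps toward the binding expressed by the syntax. How the loss is actually minimized during inference is not covered by this sketch.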

research
09/29/2022

Re-Imagen: Retrieval-Augmented Text-to-Image Generator

Research on text-to-image generation has witnessed significant progress ...
research
05/18/2023

X-IQE: eXplainable Image Quality Evaluation for Text-to-Image Generation with Visual Large Language Models

This paper introduces a novel explainable image quality evaluation appro...
research
12/07/2022

Judge, Localize, and Edit: Ensuring Visual Commonsense Morality for Text-to-Image Generation

Text-to-image generation methods produce high-resolution and high-qualit...
research
10/19/2022

DALLE-2 is Seeing Double: Flaws in Word-to-Concept Mapping in Text2Image Models

We study the way DALLE-2 maps symbols (words) in the prompt to their ref...
research
05/24/2023

Transferring Visual Attributes from Natural Language to Verified Image Generation

Text to image generation methods (T2I) are widely popular in generating ...
research
06/26/2023

A-STAR: Test-time Attention Segregation and Retention for Text-to-image Synthesis

While recent developments in text-to-image generative models have led to...
research
07/20/2023

Divide & Bind Your Attention for Improved Generative Semantic Nursing

Emerging large-scale text-to-image generative models, e.g., Stable Diffu...
