Masked-Attention Diffusion Guidance for Spatially Controlling Text-to-Image Generation

08/11/2023
by Yuki Endo, et al.

Text-to-image synthesis has achieved high-quality results with recent advances in diffusion models. However, text input alone is spatially ambiguous and offers limited user controllability. Most existing methods allow spatial control through additional visual guidance (e.g., sketches and semantic masks) but require additional training with annotated images. In this paper, we propose a method for spatially controlling text-to-image generation without further training of diffusion models. Our method is based on the insight that cross-attention maps reflect the positional relationship between words and pixels. Our aim is to control the attention maps according to given semantic masks and text prompts. To this end, we first explore a simple approach of directly swapping the cross-attention maps with constant maps computed from the semantic regions. We then propose masked-attention guidance, which generates images more faithful to the semantic masks than the direct-swap approach. Masked-attention guidance indirectly controls the attention to each word and pixel according to the semantic regions by manipulating the noise images fed to the diffusion models. Experiments show that our method enables more accurate spatial control than baselines, both qualitatively and quantitatively.
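To make the two variants in the abstract concrete, below is a minimal sketch written against a PyTorch, diffusers-style interface. The `unet_with_attn` wrapper, the mask and attention tensor shapes, and the guidance scale are hypothetical names introduced here for illustration, not the paper's code; real use would require hooking the cross-attention layers of an actual diffusion model.

```python
import torch

def constant_attention_from_masks(word_masks):
    """Direct-swap variant: replace each word's cross-attention map with a
    constant map that is uniform over its semantic region.

    word_masks: (num_words, H, W) binary masks (1 inside each word's region).
    """
    maps = word_masks.float()
    return maps / (maps.sum(dim=(1, 2), keepdim=True) + 1e-8)

def masked_attention_loss(attn_maps, word_masks):
    """Guidance variant: penalize attention mass that falls outside each
    word's target region (a sketch of the idea, not the paper's exact loss).

    attn_maps: (num_words, H, W) cross-attention map per prompt word.
    """
    attn = attn_maps / (attn_maps.sum(dim=(1, 2), keepdim=True) + 1e-8)
    inside = (attn * word_masks).sum(dim=(1, 2))  # attention mass inside each region
    return (1.0 - inside).mean()                  # mass that leaked outside

def guided_step(latent, t, scheduler, unet_with_attn, word_masks, scale=30.0):
    """One denoising step with masked-attention guidance: nudge the noisy
    latent along the gradient that pulls attention into the masks.

    `unet_with_attn` is a hypothetical wrapper returning (noise_pred,
    attn_maps); `scheduler` follows the diffusers step() convention.
    """
    latent = latent.detach().requires_grad_(True)
    noise_pred, attn_maps = unet_with_attn(latent, t)
    loss = masked_attention_loss(attn_maps, word_masks)
    grad = torch.autograd.grad(loss, latent)[0]
    latent = latent.detach() - scale * grad       # manipulate the noise image
    return scheduler.step(noise_pred.detach(), t, latent).prev_sample
```

In practice, the cross-attention maps would be collected from the UNet's attention layers at a low resolution (for example, 16x16) with the masks downsampled to match, and the guidance scale trades mask fidelity against image quality.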


Related research

06/23/2023  Zero-shot spatial layout conditioning for text-to-image diffusion models
Large-scale text-to-image diffusion models have significantly improved t...

06/01/2023  Cocktail: Mixing Multi-Modality Controls for Text-Conditional Image Generation
Text-conditional diffusion models are able to generate high-fidelity ima...

03/01/2023  Collage Diffusion
Text-conditional diffusion models generate high-quality, diverse images....

05/25/2020  SegAttnGAN: Text to Image Generation with Segmentation Attention
In this paper, we propose a novel generative network (SegAttnGAN) that u...

06/30/2023  Counting Guidance for High Fidelity Text-to-Image Synthesis
Recently, the quality and performance of text-to-image generation signif...

04/06/2023  Training-Free Layout Control with Cross-Attention Guidance
Recent diffusion-based generators can produce high-quality images based ...

06/04/2023  Detector Guidance for Multi-Object Text-to-Image Generation
Diffusion models have demonstrated impressive performance in text-to-ima...
