From Text to Mask: Localizing Entities Using the Attention of Text-to-Image Diffusion Models

09/08/2023
by Changming Xiao, et al.

Diffusion models have recently revolutionized the field of text-to-image generation. Their unique way of fusing text and image information underlies their remarkable ability to generate images that are highly faithful to the input text. Viewed from another perspective, these generative models encode clues about the precise correlation between words and pixels. In this work, we propose a simple but effective method that exploits the attention mechanism in the denoising network of text-to-image diffusion models. Without re-training or inference-time optimization, the semantic grounding of phrases can be obtained directly. We evaluate our method on Pascal VOC 2012 and Microsoft COCO 2014 under the weakly-supervised semantic segmentation setting, where it outperforms prior methods. In addition, the acquired word-pixel correlation generalizes to the learned text embeddings of customized generation methods with only a few modifications. To validate this finding, we introduce a new practical task, "personalized referring image segmentation", together with a new dataset. Experiments in various settings demonstrate the advantages of our method over strong baselines on this task. In summary, our work reveals a novel way to extract the rich multi-modal knowledge hidden in diffusion models for segmentation.
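
The core idea, reading word-to-pixel correspondences off the cross-attention layers of the denoising U-Net, can be illustrated with a short sketch. The snippet below is a hypothetical approximation and not the paper's implementation: it assumes the Hugging Face diffusers Stable Diffusion pipeline, and the checkpoint name, attention-processor interface, choice of 16x16 attention resolution, averaging scheme, and mean threshold are all placeholder assumptions that vary across library versions and may differ from what the authors actually do.

```python
# Illustrative sketch (NOT the paper's code): collect cross-attention maps from a
# Stable Diffusion U-Net and turn the map of one prompt token into a coarse mask.
# Assumes the diffusers Attention/AttnProcessor interface; details vary by version.
import torch
import torch.nn.functional as F
from diffusers import StableDiffusionPipeline


class SaveCrossAttnProcessor:
    """Re-implements the default attention computation, but additionally stores the
    softmax-normalized cross-attention weights (query pixels x text tokens)."""

    def __init__(self, store):
        self.store = store

    def __call__(self, attn, hidden_states, encoder_hidden_states=None,
                 attention_mask=None, **kwargs):
        is_cross = encoder_hidden_states is not None
        context = encoder_hidden_states if is_cross else hidden_states
        query = attn.head_to_batch_dim(attn.to_q(hidden_states))
        key = attn.head_to_batch_dim(attn.to_k(context))
        value = attn.head_to_batch_dim(attn.to_v(context))
        probs = attn.get_attention_scores(query, key, attention_mask)  # (B*heads, HW, tokens)
        if is_cross:
            self.store.append(probs.detach().float().cpu())
        out = attn.batch_to_head_dim(torch.bmm(probs, value))
        return attn.to_out[1](attn.to_out[0](out))


pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16  # placeholder checkpoint
).to("cuda")

store = []
pipe.unet.set_attn_processor(SaveCrossAttnProcessor(store))

prompt = "a photo of a dog on the beach"
image = pipe(prompt, num_inference_steps=25).images[0]

# Aggregate: keep the 16x16-resolution maps, average over layers, heads, and steps
# (for simplicity this also averages the unconditional and conditional branches),
# then read out the column of the token of interest (here "dog").
token_idx = pipe.tokenizer(prompt).input_ids.index(
    pipe.tokenizer("dog", add_special_tokens=False).input_ids[0]
)
maps16 = [m for m in store if m.shape[1] == 16 * 16]
avg = torch.cat(maps16, dim=0).mean(dim=0)          # (256, tokens)
heat = avg[:, token_idx].reshape(1, 1, 16, 16)      # word-to-pixel correlation
heat = F.interpolate(heat, size=image.size[::-1], mode="bilinear")[0, 0]
mask = (heat > heat.mean()).numpy()                 # naive threshold, for illustration only
```

The sketch only shows where a word-pixel correlation can be read off the denoising network; turning such maps into the pseudo-masks evaluated in the paper involves further design choices described there.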



Related research

09/04/2023 · Attention as Annotation: Generating Images and Pseudo-masks for Weakly Supervised Semantic Segmentation with Diffusion

06/06/2023 · Conditional Diffusion Models for Weakly Supervised Medical Image Segmentation

01/12/2023 · Guiding Text-to-Image Diffusion Model Towards Grounded Generation

06/01/2023 · ViCo: Detail-Preserving Visual Condition for Personalized Text-to-Image Generation

10/10/2022 · What the DAAM: Interpreting Stable Diffusion Using Cross Attention

06/14/2023 · Diffusion in Diffusion: Cyclic One-Way Diffusion for Text-Vision-Conditioned Generation

08/03/2023 · ConceptLab: Creative Generation using Diffusion Prior Constraints
