Detector Guidance for Multi-Object Text-to-Image Generation

06/04/2023
by Luping Liu, et al.

Diffusion models have demonstrated impressive performance in text-to-image generation. They use a text encoder and cross-attention blocks to infuse textual information into images at the pixel level. However, their ability to generate images from text that describes multiple objects remains limited. Previous works identify the problem of information mixing in the CLIP text encoder and either introduce the T5 text encoder or incorporate strong prior knowledge to assist with alignment. We find that mixing problems also occur on the image side and in the cross-attention blocks. Noisy images can cause different objects to appear similar, and the cross-attention blocks inject information at the pixel level, leading to leakage of global object understanding and resulting in object mixing. In this paper, we introduce Detector Guidance (DG), which integrates a latent object detection model to separate different objects during the generation process. DG first performs latent object detection on cross-attention maps (CAMs) to obtain object information. Based on this information, DG then masks conflicting prompts and enhances related prompts by manipulating the subsequent CAMs. We evaluate the effectiveness of DG using Stable Diffusion on COCO, CC, and a novel multi-related object benchmark, MRO. Human evaluations demonstrate that DG provides an 8-22% advantage in preventing the amalgamation of conflicting concepts and in ensuring that each object possesses its own unique region, without any human involvement or additional iterations. Our implementation is available at <https://github.com/luping-liu/Detector-Guidance>.
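
The second step described above (masking conflicting prompts and enhancing related prompts inside each detected region) can be illustrated with a minimal sketch. This is not the authors' implementation. The function name `apply_region_guidance`, the dictionary layout of the CAMs, the `boost` factor, and the assumption that a detector has already produced one disjoint boolean region per object token are all hypothetical choices made for illustration.

```python
from typing import Dict, List

Map2D = List[List[float]]
Mask2D = List[List[bool]]


def apply_region_guidance(
    cams: Dict[str, Map2D],
    regions: Dict[str, Mask2D],
    boost: float = 2.0,
) -> Dict[str, Map2D]:
    """Return a copy of the CAMs with per-region masking and boosting.

    cams    : token -> H x W attention map (hypothetical layout).
    regions : object token -> H x W boolean mask, assumed to come from a
              detector and assumed to be disjoint across tokens.
    boost   : illustrative scale factor, not a value from the paper.
    """
    # Copy every map so the input CAMs are left untouched.
    guided = {tok: [row[:] for row in cam] for tok, cam in cams.items()}
    for owner, mask in regions.items():
        for i, mask_row in enumerate(mask):
            for j, inside in enumerate(mask_row):
                if not inside:
                    continue
                # Only object tokens compete for a region; other tokens
                # (articles, conjunctions) keep their original attention.
                for tok in regions:
                    if tok == owner:
                        guided[tok][i][j] = cams[tok][i][j] * boost
                    else:
                        guided[tok][i][j] = 0.0
    return guided


# Toy 2 x 2 example: "cat" owns the left column, "dog" the right column.
cams = {
    "cat": [[0.6, 0.4], [0.5, 0.5]],
    "dog": [[0.3, 0.7], [0.2, 0.8]],
    "and": [[0.1, 0.1], [0.1, 0.1]],
}
regions = {
    "cat": [[True, False], [True, False]],
    "dog": [[False, True], [False, True]],
}
guided = apply_region_guidance(cams, regions)
```

In this toy case the "cat" map is zeroed in the right column and the "dog" map is zeroed in the left column, so neither object token can write into the other's region, while the non-object token "and" is unchanged. A real pipeline would apply such an operation to attention tensors inside the denoising loop rather than to nested lists.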
