Collage Diffusion

by Vishnu Sarukkai, et al.

Text-conditional diffusion models generate high-quality, diverse images. However, text is often an ambiguous specification for a desired target image, creating the need for additional user-friendly controls for diffusion-based image generation. We focus on precise control over image output for scenes with several objects. Users control image generation by defining a collage: a text prompt paired with an ordered sequence of layers, where each layer is an RGBA image with a corresponding text prompt. We introduce Collage Diffusion, a collage-conditional diffusion algorithm that allows users to control both the spatial arrangement and visual attributes of objects in the scene, and also enables users to edit individual components of generated images. To ensure that different parts of the input text correspond to the locations specified in the input collage layers, Collage Diffusion modifies text-image cross-attention with the layers' alpha masks. To maintain characteristics of individual collage layers that are not specified in text, Collage Diffusion learns specialized text representations per layer. Collage input also enables layer-based controls that provide fine-grained control over the final output: users can control image harmonization on a layer-by-layer basis, and they can edit individual objects in generated images while keeping other objects fixed. Collage-conditional image generation requires harmonizing the input collage so that objects fit together. The key challenge is to minimize changes in the positions and key visual attributes of objects in the input collage while allowing other attributes of the collage to change during harmonization. By leveraging the rich information present in layer input, Collage Diffusion generates globally harmonized images that maintain desired object locations and visual characteristics better than prior approaches.
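
The collage input and the alpha-masked cross-attention described above can be illustrated with a small NumPy sketch. This is only a rough illustration under stated assumptions, not the authors' implementation: the names `Layer`, `Collage`, `composite`, `visible_masks`, and `masked_cross_attention` are hypothetical, layers are assumed to be ordered back to front, and the masking rule (a large negative penalty on attention logits outside a layer's visible region) is a simple stand-in for whatever formulation the paper actually uses.

```python
from dataclasses import dataclass
from typing import List

import numpy as np


@dataclass
class Layer:
    rgba: np.ndarray  # (H, W, 4) floats in [0, 1]; last channel is alpha
    prompt: str       # text describing this layer


@dataclass
class Collage:
    prompt: str           # text describing the whole scene
    layers: List[Layer]   # assumed order: back to front


def composite(collage: Collage) -> np.ndarray:
    """Flatten the layers into one RGB image with the 'over' operator."""
    h, w, _ = collage.layers[0].rgba.shape
    canvas = np.zeros((h, w, 3))
    for layer in collage.layers:
        rgb, alpha = layer.rgba[..., :3], layer.rgba[..., 3:4]
        canvas = alpha * rgb + (1.0 - alpha) * canvas
    return canvas


def visible_masks(collage: Collage) -> List[np.ndarray]:
    """Per-layer visibility: a layer's alpha minus occlusion by layers in front."""
    h, w, _ = collage.layers[0].rgba.shape
    occluded = np.zeros((h, w))
    masks = []
    for layer in reversed(collage.layers):  # walk front to back
        alpha = layer.rgba[..., 3]
        masks.append(alpha * (1.0 - occluded))
        occluded = occluded + alpha * (1.0 - occluded)
    return masks[::-1]  # restore back-to-front order


def masked_cross_attention(logits, token_layer, masks, penalty=-1e4):
    """Softmax over text tokens after suppressing tokens outside their layer.

    logits:      (H*W, T) pixel-to-token attention scores
    token_layer: length-T list; layer index for each token, or -1 for
                 tokens that belong to no layer (assumed to stay global)
    """
    out = np.asarray(logits, dtype=float).copy()
    for t, layer_idx in enumerate(token_layer):
        if layer_idx >= 0:
            outside = masks[layer_idx].reshape(-1) < 0.5
            out[outside, t] += penalty
    out -= out.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(out)
    return probs / probs.sum(axis=1, keepdims=True)


# Usage: an opaque red background with a blue object in the top-left pixel.
red = np.zeros((2, 2, 4)); red[..., 0] = 1.0; red[..., 3] = 1.0
blue = np.zeros((2, 2, 4)); blue[..., 2] = 1.0; blue[0, 0, 3] = 1.0
collage = Collage("a blue ball on a red rug",
                  [Layer(red, "a red rug"), Layer(blue, "a blue ball")])
image = composite(collage)
masks = visible_masks(collage)
attn = masked_cross_attention(np.zeros((4, 3)), [-1, 0, 1], masks)
```

In this toy setup, the token tied to the blue layer receives attention only at the pixel where that layer is visible, while the unassigned token attends everywhere. A real system would apply a comparable restriction inside the cross-attention layers of a diffusion model rather than on a standalone logit matrix.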




Masked-Attention Diffusion Guidance for Spatially Controlling Text-to-Image Generation

Text-to-image synthesis has achieved high-quality results with recent ad...

Directed Diffusion: Direct Control of Object Placement through Attention Guidance

Text-guided diffusion models such as DALLE-2, IMAGEN, and Stable Diffusi...

Expressive Text-to-Image Generation with Rich Text

Plain text has become a prevalent interface for text-to-image synthesis....

Localizing Object-level Shape Variations with Text-to-Image Diffusion Models

Text-to-image models give rise to workflows which often begin with an ex...

FISEdit: Accelerating Text-to-image Editing via Cache-enabled Sparse Diffusion Inference

Due to the recent success of diffusion models, text-to-image generation ...

SpaText: Spatio-Textual Representation for Controllable Image Generation

Recent text-to-image diffusion models are able to generate convincing re...

M-VADER: A Model for Diffusion with Multimodal Context

We introduce M-VADER: a diffusion model (DM) for image generation where ...
