X Fuse: Fusing Visual Information in Text-to-Image Generation

03/02/2023
by Yuval Kirstain, et al.

We introduce X Fuse, a general approach for conditioning on visual information when generating images from text. We demonstrate the potential of X Fuse in three text-to-image generation scenarios. (i) When a bank of images is available, we retrieve a related image and condition on it (Retrieve Fuse), which yields significant improvements on the MS-COCO benchmark and a state-of-the-art zero-shot FID score of 6.65. (ii) When cropped-object images are at hand, we use them to perform subject-driven generation (Crop Fuse), outperforming the textual inversion method while being more than 100x faster. (iii) Given oracle access to the image scene (Scene Fuse), we achieve a zero-shot FID score of 5.03 on MS-COCO. Our experiments indicate that X Fuse is an effective, easy-to-adapt, simple, and general approach for scenarios in which the model may benefit from additional visual information.
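The retrieval step in scenario (i) can be pictured with a small sketch. The abstract does not say how the related image is chosen, so everything below is an assumption made for illustration: the image bank is a hypothetical list of (image id, embedding) pairs, the query is a hypothetical text embedding in the same space, and the match is picked by cosine similarity. It shows only the "retrieve" half; conditioning the generator on the retrieved image is not modeled.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve(query_vec, bank):
    """Return the id of the bank entry whose embedding is closest to the query.

    `bank` is a hypothetical list of (image_id, embedding) pairs; this is an
    illustrative nearest-neighbour lookup, not the method used in the paper.
    """
    best_id, _ = max(bank, key=lambda item: cosine(query_vec, item[1]))
    return best_id

# Toy image bank with made-up 3-dimensional embeddings.
bank = [
    ("cat.png", [1.0, 0.0, 0.0]),
    ("dog.png", [0.0, 1.0, 0.0]),
    ("car.png", [0.0, 0.0, 1.0]),
]

# Made-up embedding of a text prompt; it points mostly along the second axis.
query = [0.1, 0.9, 0.2]
best = retrieve(query, bank)
print(best)
```

In a setup like this, the retrieved image would then be passed to the text-to-image model as additional visual input alongside the prompt.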


