Re-Imagen: Retrieval-Augmented Text-to-Image Generator

09/29/2022
by Wenhu Chen et al.

Research on text-to-image generation has witnessed significant progress in generating diverse and photo-realistic images, driven by diffusion and auto-regressive models trained on large-scale image-text data. Though state-of-the-art models can generate high-quality images of common entities, they often have difficulty generating images of uncommon entities, such as "Chortai (dog)" or "Picarones (food)". To tackle this issue, we present the Retrieval-Augmented Text-to-Image Generator (Re-Imagen), a generative model that uses retrieved information to produce high-fidelity and faithful images, even for rare or unseen entities. Given a text prompt, Re-Imagen accesses an external multi-modal knowledge base to retrieve relevant (image, text) pairs and uses them as references to generate the image. With this retrieval step, Re-Imagen is augmented with knowledge of both the high-level semantics and the low-level visual details of the mentioned entities, which improves its accuracy in generating the entities' visual appearances. We train Re-Imagen on a constructed dataset of (image, text, retrieval) triples to teach the model to ground its generations in both the text prompt and the retrieved references. Furthermore, we develop a new sampling strategy that interleaves classifier-free guidance between the text and retrieval conditions, balancing alignment with both. Re-Imagen achieves new state-of-the-art FID scores on two image generation benchmarks, COCO (FID = 5.25) and WikiImage (FID = 5.82), without fine-tuning. To further evaluate the capabilities of the model, we introduce EntityDrawBench, a new benchmark that evaluates image generation for diverse entities, from frequent to rare, across multiple visual domains. Human evaluation on EntityDrawBench shows that Re-Imagen performs on par with the best prior models in photo-realism, but with significantly better faithfulness, especially on less frequent entities.
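To make the sampling strategy more concrete, below is a minimal, hypothetical sketch of how a retrieval-augmented diffusion sampler could interleave classifier-free guidance between the text and retrieval conditions. Interface names such as kb.search, model.eps, model.encode_text, model.encode_neighbors, and model.denoise_step are illustrative placeholders, not the paper's actual API or released code.

```python
# Minimal sketch of retrieval-augmented sampling with interleaved
# classifier-free guidance. All model / knowledge-base interfaces here are
# hypothetical placeholders; the real Re-Imagen is a cascaded diffusion model.
import torch


def retrieve_neighbors(kb, prompt, k=2):
    """Fetch the top-k (image, text) pairs whose captions best match the prompt."""
    return kb.search(prompt, top_k=k)  # e.g. BM25 or CLIP-similarity retrieval


@torch.no_grad()
def sample_interleaved_cfg(model, prompt, neighbors, steps=1000,
                           w_text=7.5, w_retr=7.5):
    """Denoise from pure noise, alternating which condition the
    classifier-free guidance amplifies at each reverse-diffusion step."""
    text_emb = model.encode_text(prompt)
    nbr_emb = model.encode_neighbors(neighbors)   # retrieved (image, text) references
    x = torch.randn(1, 3, 64, 64)                 # base-resolution noise

    for t in reversed(range(steps)):
        full = model.eps(x, t, text=text_emb, neighbors=nbr_emb)
        if t % 2 == 0:
            # even steps: amplify the text condition
            # (baseline prediction drops the text but keeps the neighbors)
            base = model.eps(x, t, text=None, neighbors=nbr_emb)
            eps = base + w_text * (full - base)
        else:
            # odd steps: amplify the retrieval condition
            # (baseline prediction drops the neighbors but keeps the text)
            base = model.eps(x, t, text=text_emb, neighbors=None)
            eps = base + w_retr * (full - base)
        x = model.denoise_step(x, eps, t)         # one reverse-diffusion update
    return x
```

Interleaving the two guidance terms, rather than applying a single combined guidance, is what lets the sampler trade off alignment with the text prompt against faithfulness to the retrieved references, as described in the abstract.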

Related research

11/22/2022
Retrieval-Augmented Multimodal Language Modeling
Recent multimodal models such as DALL-E and CM3 have achieved remarkable...

06/15/2023
Linguistic Binding in Diffusion Models: Enhancing Attribute Correspondence through Attention Map Alignment
Text-conditioned image generation models often generate incorrect associ...

04/02/2019
Semantics Disentangling for Text-to-Image Generation
Synthesizing photo-realistic images from text descriptions is a challeng...

04/01/2023
Subject-driven Text-to-Image Generation via Apprenticeship Learning
Recent text-to-image generation models like DreamBooth have made remarka...

10/16/2022
Large-scale Text-to-Image Generation Models for Visual Artists' Creative Works
Large-scale Text-to-image Generation Models (LTGMs) (e.g., DALL-E), self...

02/22/2022
Retrieval Augmented Classification for Long-Tail Visual Recognition
We introduce Retrieval Augmented Classification (RAC), a generic approac...

08/25/2022
Multimedia Generative Script Learning for Task Planning
Goal-oriented generative script learning aims to generate subsequent ste...
