Retrieval-Augmented Multimodal Language Modeling

11/22/2022
by Michihiro Yasunaga, et al.

Recent multimodal models such as DALL-E and CM3 have achieved remarkable progress in text-to-image and image-to-text generation. However, these models store all learned knowledge (e.g., the appearance of the Eiffel Tower) in the model parameters, requiring increasingly larger models and training data to capture more knowledge. To integrate knowledge in a more scalable and modular way, we propose a retrieval-augmented multimodal model, which enables a base multimodal model (generator) to refer to relevant knowledge fetched by a retriever from external memory (e.g., multimodal documents on the web). Specifically, we implement a retriever using the pretrained CLIP model and a generator using the CM3 Transformer architecture, and train this model using the LAION dataset. Our resulting model, named Retrieval-Augmented CM3 (RA-CM3), is the first multimodal model that can retrieve and generate mixtures of text and images. We show that RA-CM3 significantly outperforms baseline multimodal models such as DALL-E and CM3 on both image and caption generation tasks (12 FID and 17 CIDEr improvements on MS-COCO), while requiring much less compute for training (<30% of DALL-E). Moreover, RA-CM3 exhibits novel capabilities such as knowledge-intensive image generation and multimodal in-context learning.
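
Below is a minimal sketch of the retrieve-then-generate setup the abstract describes: a CLIP-based retriever scores documents in an external multimodal memory against the query, and the top results are prepended to the generator's input so a CM3-style model can condition on them. The Hugging Face CLIP checkpoint, the toy in-memory document store, and the helper names (embed_text, retrieve, build_prompt) and the <doc>/<img:...> markers are illustrative assumptions, not the paper's implementation.

```python
# Sketch of retrieval-augmented multimodal generation (illustrative, not RA-CM3's code).
import torch
from transformers import CLIPModel, CLIPProcessor

# Assumption: a publicly available CLIP checkpoint stands in for the paper's pretrained retriever.
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def embed_text(texts):
    """CLIP text embeddings, L2-normalized so a dot product equals cosine similarity."""
    inputs = processor(text=texts, return_tensors="pt", padding=True, truncation=True)
    feats = clip.get_text_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)
    # For image queries, clip.get_image_features would be used analogously.

# Toy external memory of multimodal documents (caption + image reference),
# indexed offline by their CLIP embeddings. A real system would use a vector index.
memory = [
    {"caption": "The Eiffel Tower at night", "image": "eiffel.jpg"},
    {"caption": "A golden retriever on a beach", "image": "dog.jpg"},
]
memory_embeddings = embed_text([doc["caption"] for doc in memory])

def retrieve(query_text, k=2):
    """Return the k memory documents most similar to the query under CLIP."""
    scores = embed_text([query_text]) @ memory_embeddings.T  # shape: (1, num_docs)
    topk = scores[0].topk(min(k, len(memory))).indices.tolist()
    return [memory[i] for i in topk]

def build_prompt(query_text, retrieved):
    """Prepend retrieved documents to the query as the generator's input context."""
    context = "".join(f"<doc> {d['caption']} <img:{d['image']}> </doc> " for d in retrieved)
    return context + query_text

prompt = build_prompt("the Eiffel Tower", retrieve("the Eiffel Tower"))
print(prompt)
# A CM3-style generator would now decode text/image tokens conditioned on this prompt.
```

Because the embeddings are L2-normalized, the dot product in retrieve is a cosine similarity; a production setup would replace the toy list with a dense index over web-scale multimodal documents such as LAION.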

Related research

10/06/2022
MuRAG: Multimodal Retrieval-Augmented Generator for Open Question Answering over Images and Text
While language models store a massive amount of world knowledge implicit...

09/29/2022
Re-Imagen: Retrieval-Augmented Text-to-Image Generator
Research on text-to-image generation has witnessed significant progress ...

03/20/2023
Retrieving Multimodal Information for Augmented Generation: A Survey
In this survey, we review methods that retrieve multimodal knowledge to ...

02/09/2023
Re-ViLM: Retrieval-Augmented Visual Language Model for Zero and Few-Shot Image Captioning
Augmenting pretrained language models (LMs) with a vision encoder (e.g.,...

07/07/2022
Multi-Task Retrieval-Augmented Text Generation with Relevance Sampling
This paper studies multi-task training of retrieval-augmented generation...

04/27/2023
DataComp: In search of the next generation of multimodal datasets
Large multimodal datasets have been instrumental in recent breakthroughs...

05/18/2023
The Web Can Be Your Oyster for Improving Large Language Models
Large language models (LLMs) encode a large amount of world knowledge. H...
