Cross-Modal Retrieval Augmentation for Multi-Modal Classification

by Shir Gur, et al.

Recent advances in using retrieval components over external knowledge sources have shown impressive results for a variety of downstream tasks in natural language processing. Here, we explore the use of unstructured external knowledge sources of images and their corresponding captions for improving visual question answering (VQA). First, we train a novel alignment model for embedding images and captions in the same space, which achieves a substantial improvement in image-caption retrieval performance relative to similar methods. Second, we show that retrieval-augmented multi-modal transformers using the trained alignment model improve results on VQA over strong baselines. We further conduct extensive experiments to establish the promise of this approach, and examine novel inference-time applications such as hot-swapping retrieval indices.
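The core mechanics described above can be sketched as two pieces: a nearest-neighbor index over caption embeddings that live in the same space as image embeddings, and a downstream model whose index can be replaced at inference time without retraining ("hot-swapping"). The sketch below is a minimal toy illustration, assuming unit-normalized embedding vectors and using cosine similarity for retrieval; the class and method names (`RetrievalIndex`, `swap_index`, etc.) are hypothetical, and the paper's actual alignment model is a trained neural encoder, not shown here.

```python
import math


def normalize(v):
    """Return the unit-length version of vector v (list of floats)."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


class RetrievalIndex:
    """Toy nearest-neighbor index over caption embeddings that share a
    space with image embeddings (hypothetical sketch)."""

    def __init__(self, keys, values):
        # keys: list of embedding vectors; values: the captions they index
        self.keys = [normalize(k) for k in keys]
        self.values = values

    def retrieve(self, query, k=2):
        """Return the k captions whose embeddings are most similar
        (by cosine similarity) to the query image embedding."""
        q = normalize(query)
        scores = [sum(a * b for a, b in zip(key, q)) for key in self.keys]
        ranked = sorted(range(len(scores)), key=lambda i: -scores[i])
        return [self.values[i] for i in ranked[:k]]


class RetrievalAugmentedModel:
    """Wraps a downstream reader with a swappable retrieval index.
    In the full system, retrieved captions would be appended to the
    multi-modal transformer's input; here we just return them."""

    def __init__(self, index):
        self.index = index

    def swap_index(self, new_index):
        # Hot-swapping: replace the knowledge source with no retraining.
        self.index = new_index

    def answer_context(self, image_embedding, k=1):
        return self.index.retrieve(image_embedding, k=k)
```

Because the index is decoupled from the model weights, swapping in a different caption collection changes what the model retrieves immediately, which is the inference-time flexibility the abstract alludes to.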
