Pre-Training Multi-Modal Dense Retrievers for Outside-Knowledge Visual Question Answering

06/28/2023
by Alireza Salemi, et al.

This paper studies a category of visual question answering tasks in which accessing external knowledge is necessary for answering the questions. This category is called outside-knowledge visual question answering (OK-VQA). A major step in developing OK-VQA systems is to retrieve relevant documents for the given multi-modal query. The current state-of-the-art asymmetric dense retrieval model for this task uses an architecture with a multi-modal query encoder and a uni-modal document encoder. Such an architecture requires a large amount of training data for effective performance. We propose an automatic data generation pipeline for pre-training passage retrieval models for OK-VQA tasks. The proposed approach leads to a 26.9% Precision@5 improvement compared to the current state-of-the-art asymmetric architecture. Additionally, the proposed pre-training approach exhibits a good ability in zero-shot retrieval scenarios.
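
To make the asymmetric setup in the abstract concrete, the following is a minimal sketch of dense retrieval scoring, not the authors' implementation. It assumes the query and passage embeddings have already been produced by some encoders. The names `fuse_query` and `rank_passages` are hypothetical, and the element-wise average is only a stand-in for a real multi-modal query encoder.

```python
from typing import List, Sequence, Tuple


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Inner product, used here as the query-passage relevance score."""
    return sum(a * b for a, b in zip(u, v))


def fuse_query(question_vec: Sequence[float],
               image_vec: Sequence[float]) -> List[float]:
    # Hypothetical stand-in for a multi-modal query encoder:
    # averages the question and image embeddings element-wise.
    return [(q + i) / 2.0 for q, i in zip(question_vec, image_vec)]


def rank_passages(query_vec: Sequence[float],
                  passage_vecs: Sequence[Sequence[float]],
                  k: int) -> List[Tuple[int, float]]:
    # Passage vectors are assumed to come from a separate text-only
    # (uni-modal) encoder, which is what makes the setup asymmetric.
    scored = [(idx, dot(query_vec, p)) for idx, p in enumerate(passage_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    query = fuse_query([1.0, 0.0], [0.0, 1.0])
    passages = [[1.0, 1.0], [0.0, 0.0], [2.0, -1.0]]
    print(rank_passages(query, passages, k=2))
```

In a real system the passage embeddings would typically be computed once and indexed, so only the query has to be encoded at search time.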

research
05/09/2021

Passage Retrieval for Outside-Knowledge Visual Question Answering

In this work, we address multi-modal information needs that contain text...
research
04/26/2023

A Symmetric Dual Encoding Dense Retrieval Framework for Knowledge-Intensive Visual Question Answering

Knowledge-Intensive Visual Question Answering (KI-VQA) refers to answeri...
research
06/30/2022

A Unified End-to-End Retriever-Reader Framework for Knowledge-based VQA

Knowledge-based Visual Question Answering (VQA) expects models to rely o...
research
03/01/2023

RAMM: Retrieval-augmented Biomedical Visual Question Answering with Multi-modal Pre-training

Vision-and-language multi-modal pretraining and fine-tuning have shown g...
research
10/21/2021

Single-Modal Entropy based Active Learning for Visual Question Answering

Constructing a large-scale labeled dataset in the real world, especially...
research
10/07/2022

Retrieval Augmented Visual Question Answering with Outside Knowledge

Outside-Knowledge Visual Question Answering (OK-VQA) is a challenging VQ...
research
01/11/2023

Multimodal Inverse Cloze Task for Knowledge-based Visual Question Answering

We present a new pre-training method, Multimodal Inverse Cloze Task, for...
