A Symmetric Dual Encoding Dense Retrieval Framework for Knowledge-Intensive Visual Question Answering

04/26/2023
by Alireza Salemi, et al.

Knowledge-Intensive Visual Question Answering (KI-VQA) refers to answering a question about an image when the answer does not lie in the image itself. This paper presents a new pipeline for KI-VQA tasks, consisting of a retriever and a reader. First, we introduce DEDR, a symmetric dual encoding dense retrieval framework in which documents and queries are encoded into a shared embedding space using uni-modal (textual) and multi-modal encoders. We introduce an iterative knowledge distillation approach that bridges the gap between the representation spaces of these two encoders. Extensive evaluation on two well-established KI-VQA datasets, OK-VQA and FVQA, suggests that DEDR outperforms state-of-the-art baselines on both, with an 11.6% improvement on OK-VQA. Utilizing the passages retrieved by DEDR, we further introduce MM-FiD, an encoder-decoder multi-modal fusion-in-decoder model that generates a textual answer for KI-VQA tasks. MM-FiD encodes the question, the image, and each retrieved passage separately and uses all passages jointly in its decoder. Compared to competitive baselines in the literature, this approach leads to improvements on both OK-VQA and FVQA, with a 5.5% gain on OK-VQA.
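
The retrieval step described above can be illustrated with a minimal sketch. The abstract does not say how the outputs of the textual and multi-modal encoders are combined or which similarity function is used, so the concatenation in `combine`, the dot-product score, and the toy vectors below are all illustrative assumptions rather than the authors' implementation. The hypothetical helper names (`combine`, `dot`, `retrieve`) do not come from the paper.

```python
from typing import List, Sequence

Vector = List[float]


def combine(text_emb: Sequence[float], mm_emb: Sequence[float]) -> Vector:
    """Assumed fusion: concatenate the textual and multi-modal embeddings.

    The same function is applied to queries and passages, which is what
    makes this toy setup "symmetric".
    """
    return list(text_emb) + list(mm_emb)


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot-product similarity in the shared embedding space (assumed)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimensionality")
    return sum(x * y for x, y in zip(a, b))


def retrieve(query_vec: Vector, passage_vecs: List[Vector], k: int) -> List[int]:
    """Return the indices of the k passages with the highest similarity."""
    ranked = sorted(
        enumerate(passage_vecs),
        key=lambda item: dot(query_vec, item[1]),
        reverse=True,
    )
    return [idx for idx, _ in ranked[:k]]


# Toy vectors standing in for real encoder outputs.
query = combine([1.0, 0.0], [0.0, 1.0])
passages = [
    combine([1.0, 0.0], [0.0, 1.0]),  # matches the query on both halves
    combine([0.0, 1.0], [1.0, 0.0]),  # matches on neither half
    combine([1.0, 0.0], [0.0, 0.0]),  # matches on the textual half only
]
top = retrieve(query, passages, k=2)
print(top)
```

In a pipeline of the kind the abstract describes, the indices returned by `retrieve` would select the passages handed to the reader, each of which is encoded separately alongside the question and image.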

