Generate then Select: Open-ended Visual Question Answering Guided by World Knowledge

05/30/2023
by Xingyu Fu, et al.

The open-ended Visual Question Answering (VQA) task requires AI models to jointly reason over visual and natural language inputs using world knowledge. Recently, pre-trained Language Models (PLMs) such as GPT-3 have been applied to the task and shown to be powerful sources of world knowledge. However, these methods suffer from low knowledge coverage caused by PLM bias (the tendency to generate certain tokens over others regardless of prompt changes) and from a high dependency on PLM quality (only models using GPT-3 achieve the best results). To address these challenges, we propose RASO, a new VQA pipeline that, for the first time, deploys a generate-then-select strategy guided by world knowledge. Rather than following the de facto standard of training a multi-modal model that directly generates the VQA answer, RASO first uses a PLM to generate all possible answers and then trains a lightweight answer-selection model to pick the correct one. As shown in our analysis, RASO expands knowledge coverage beyond the in-domain training data by a large margin. We provide extensive experiments demonstrating the effectiveness of our pipeline, advancing the state of the art by 4.1 points. Code and models are released at http://cogcomp.org/page/publication_view/1010.
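The generate-then-select idea is straightforward to sketch. The Python snippet below is a minimal, illustrative version of such a pipeline and is not the released RASO code: the function names generate_candidates and select_answer are hypothetical stand-ins, the first for prompting a PLM with the question and image-derived text (captions, object tags), the second for the trained lightweight answer-selection model, which here is replaced by a trivial word-overlap score so the example stays self-contained and runnable.

```python
# Minimal sketch of a generate-then-select VQA pipeline (illustrative only,
# not the authors' implementation).
from collections import Counter
from typing import List


def generate_candidates(question: str, image_context: str, k: int = 5) -> List[str]:
    """Stub for the PLM generation step: return up to k possible answers.

    In the real pipeline this would prompt a pre-trained language model
    (e.g. GPT-3) with the question plus image-derived text and collect the
    distinct answers it proposes.
    """
    # Hard-coded candidates purely for illustration.
    return ["umbrella", "raincoat", "sunshade", "tent", "parasol"][:k]


def select_answer(question: str, image_context: str, candidates: List[str]) -> str:
    """Stub for the answer-selection step.

    The paper trains a lightweight multi-modal model to pick one candidate;
    here we simply score each candidate by word overlap with the image
    context to keep the example runnable without any trained model.
    """
    context_words = Counter(image_context.lower().split())

    def score(candidate: str) -> int:
        return sum(context_words[w] for w in candidate.lower().split())

    return max(candidates, key=score)


if __name__ == "__main__":
    question = "What is the person holding to stay dry?"
    image_context = "a person holding an umbrella on a rainy street"
    candidates = generate_candidates(question, image_context)
    print("candidates:", candidates)
    print("selected:", select_answer(question, image_context, candidates))
```

The design intuition this sketch tries to convey is the one stated in the abstract: ranking a short candidate list is a much lighter task than open-ended generation, which is why the selection model can stay small while the PLM is used only to propose candidates.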

Related research

A-OKVQA: A Benchmark for Visual Question Answering using World Knowledge (06/03/2022)
The Visual Question Answering (VQA) task aspires to provide a meaningful...

Visual Question Answering Using Semantic Information from Image Descriptions (04/23/2020)
Visual question answering (VQA) is a task that requires AI systems to di...

PathVQA: 30000+ Questions for Medical Visual Question Answering (03/07/2020)
Is it possible to develop an "AI Pathologist" to pass the board-certifie...

Prompting Large Language Models with Answer Heuristics for Knowledge-based Visual Question Answering (03/03/2023)
Knowledge-based visual question answering (VQA) requires external knowle...

Enhancing Self-Consistency and Performance of Pre-Trained Language Models through Natural Language Inference (11/21/2022)
While large pre-trained language models are powerful, their predictions ...

Prompting Vision Language Model with Knowledge from Large Language Model for Knowledge-Based VQA (08/30/2023)
Knowledge-based visual question answering is a very challenging and wide...
