Zero-shot Visual Question Answering with Language Model Feedback

05/26/2023
by Yifan Du, et al.

In this paper, we propose LAMOC, a novel language-model-guided captioning approach for knowledge-based visual question answering (VQA). Our approach feeds the captions generated by a captioning model as context to an answer prediction model, which is a pre-trained language model (PLM). As the major contribution, we leverage the guidance and feedback of the prediction model to improve the capability of the captioning model, so that the captioning model becomes aware of the task goal and the information needs of the PLM. To this end, we design two training stages: the first adapts the captioning model to the prediction model (selecting more suitable caption proposals for training), and the second tunes the captioning model according to the task goal (learning from the PLM's feedback). Extensive experiments demonstrate the effectiveness of the proposed approach on the knowledge-based VQA task. Specifically, on the challenging A-OKVQA dataset, LAMOC outperforms several competitive zero-shot methods and even achieves results comparable to a fine-tuned VLP model. Our code is publicly available at https://github.com/RUCAIBox/LAMOC.
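To make the pipeline concrete, below is a minimal sketch of the caption-then-answer inference flow the abstract describes, with the PLM's answer confidence standing in for the feedback signal that tunes the captioner in the second stage. All class and method names (CaptioningModel, AnswerPLM, generate_captions, answer, confidence) are hypothetical placeholders rather than the actual LAMOC API; see the repository linked above for the real implementation.

    # A minimal sketch of the caption-then-answer loop, assuming hypothetical
    # model wrappers; this is not the actual LAMOC code.
    from typing import List, Protocol


    class CaptioningModel(Protocol):
        def generate_captions(self, image, num_candidates: int = 5) -> List[str]:
            """Sample several candidate captions for the image."""
            ...


    class AnswerPLM(Protocol):
        def answer(self, caption: str, question: str) -> str:
            """Prompt the frozen PLM with the caption as context; return an answer."""
            ...

        def confidence(self, caption: str, question: str, answer: str) -> float:
            """Score how strongly the PLM supports this answer given the caption."""
            ...


    def zero_shot_vqa(image, question: str,
                      captioner: CaptioningModel, plm: AnswerPLM) -> str:
        """Caption the image, then answer from the best-supported caption.

        The per-caption confidence is the kind of feedback signal that, per the
        abstract, guides the captioning model toward the PLM's information needs.
        """
        best_answer, best_score = "", float("-inf")
        for caption in captioner.generate_captions(image):
            candidate = plm.answer(caption, question)
            score = plm.confidence(caption, question, candidate)
            if score > best_score:
                best_answer, best_score = candidate, score
        return best_answer

The key design point this sketch illustrates is that the PLM stays frozen: only the captioning model is adapted, using the prediction model's scores as a training signal.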

