Plug-and-Play VQA: Zero-shot VQA by Conjoining Large Pretrained Models with Zero Training

10/17/2022
by   Anthony Meng Huat Tiong, et al.

Visual question answering (VQA) is a hallmark of vision and language reasoning and a challenging task under the zero-shot setting. We propose Plug-and-Play VQA (PNP-VQA), a modular framework for zero-shot VQA. In contrast to most existing works, which require substantial adaptation of pretrained language models (PLMs) for the vision modality, PNP-VQA requires no additional training of the PLMs. Instead, we propose to use natural language and network interpretation as an intermediate representation that glues pretrained models together. We first generate question-guided informative image captions, and pass the captions to a PLM as context for question answering. Surpassing end-to-end trained baselines, PNP-VQA achieves state-of-the-art results on zero-shot VQAv2 and GQA. With 11B parameters, it outperforms the 80B-parameter Flamingo model by 8.5% on zero-shot VQAv2. With 738M PLM parameters, PNP-VQA achieves an improvement of 9.1% on zero-shot GQA over FewVLM with 740M PLM parameters. Code is released at https://github.com/salesforce/LAVIS/tree/main/projects/pnp-vqa
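The two-stage pipeline the abstract describes — question-guided captioning followed by reading-comprehension-style question answering over the captions — can be sketched as follows. This is a minimal illustration, not the actual LAVIS implementation; `caption_model` and `reader_plm` are hypothetical stand-ins for a pretrained captioner and a frozen question-answering PLM.

```python
def answer_question(image, question, caption_model, reader_plm, n_captions=5):
    """Hedged sketch of the PNP-VQA flow: captions are the natural-language
    'glue' between a frozen vision model and a frozen language model."""
    # Step 1: question-guided captioning. Sample several captions biased
    # toward image regions relevant to the question, so the textual
    # context is informative for this particular query.
    captions = caption_model.generate(
        image, guide=question, num_captions=n_captions
    )

    # Step 2: treat the captions as a reading-comprehension context and
    # query the frozen PLM -- no additional training of the PLM is needed.
    context = " ".join(captions)
    prompt = f"Context: {context}\nQuestion: {question}\nAnswer:"
    return reader_plm.generate(prompt)
```

Because both components are used off the shelf, either one can be swapped for a stronger pretrained model without retraining, which is the "plug-and-play" property the paper's title refers to.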



