Modularized Zero-shot VQA with Pre-trained Models

05/27/2023
by Rui Cao, et al.

Large-scale pre-trained models (PTMs) show strong zero-shot capabilities. In this paper, we study how to leverage them for zero-shot visual question answering (VQA). Our approach is motivated by a few observations. First, VQA questions often require multiple steps of reasoning, a capability that most PTMs still lack. Second, different steps in a VQA reasoning chain require different skills, such as object detection and relational reasoning, and a single PTM may not possess all of them. Third, recent work on zero-shot VQA does not explicitly model multi-step reasoning chains, which makes it less interpretable than a decomposition-based approach. We propose a modularized zero-shot network that explicitly decomposes questions into sub-reasoning steps and is highly interpretable. We convert the sub-reasoning tasks into objectives that PTMs can handle and assign each task to a suitable PTM without any adaptation. Our experiments on two VQA benchmarks under the zero-shot setting demonstrate the effectiveness of our method and its better interpretability compared with several baselines.
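The decomposition idea in the abstract can be sketched as a small dispatcher: a question is broken into sub-reasoning steps, and each step is routed to the PTM best suited for it. The module names, the chain format, and the example decomposition below are all illustrative assumptions, not the paper's actual interface; real modules would wrap pre-trained detectors and vision-language models.

```python
# Hypothetical sketch of a modularized zero-shot VQA pipeline.
# Each function stands in for a pre-trained model (PTM) handling one skill.

def detect_object(state, arg):
    # Stand-in for an object-detection PTM (would return image regions).
    state["focus"] = f"region({arg})"
    return state

def relate(state, arg):
    # Stand-in for a relational-reasoning PTM (shifts focus via a relation).
    state["focus"] = f"{arg}-of({state['focus']})"
    return state

def answer(state, arg):
    # Stand-in for an answering PTM conditioned on the focused region.
    return f"answer to '{arg}' given {state['focus']}"

MODULES = {"detect": detect_object, "relate": relate, "answer": answer}

def run_chain(steps):
    """Execute a decomposed reasoning chain step by step."""
    state = {}
    for op, arg in steps[:-1]:
        state = MODULES[op](state, arg)
    final_op, final_arg = steps[-1]
    return MODULES[final_op](state, final_arg)

# "What color is the cup left of the laptop?" might decompose into:
chain = [("detect", "laptop"), ("relate", "left"), ("answer", "color")]
print(run_chain(chain))
# → answer to 'color' given left-of(region(laptop))
```

The key design choice mirrored here is that each module is used as-is, with no fine-tuning: only the intermediate state passed between modules ties the chain together, which is what makes the per-step outputs inspectable.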


