
Zero-Shot Visual Question Answering

by Damien Teney et al.
The University of Adelaide

Part of the appeal of Visual Question Answering (VQA) is its promise to answer new questions about previously unseen images. Most current methods demand training questions that illustrate every possible concept, and will therefore never achieve this capability, since the volume of required training data would be prohibitive. Answering general questions about images requires methods capable of Zero-Shot VQA, that is, methods able to answer questions beyond the scope of the training questions. We propose a new evaluation protocol for VQA methods which measures their ability to perform Zero-Shot VQA, and in doing so highlights significant practical deficiencies of current approaches, some of which are masked by the biases in current datasets. We propose and evaluate several strategies for achieving Zero-Shot VQA, including methods based on pretrained word embeddings, object classifiers with semantic embeddings, and test-time retrieval of example images. Our extensive experiments are intended to serve as baselines for Zero-Shot VQA, and they also achieve state-of-the-art performance in the standard VQA evaluation setting.
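One of the strategies mentioned above, using pretrained word embeddings, can be illustrated with a toy sketch: score candidate answers by their embedding similarity to a detected object label, so that answers never seen among the training questions can still be ranked. The embedding vectors and labels below are made up for illustration; a real system would use pretrained vectors such as GloVe or word2vec, and this is not the paper's exact method.

```python
import math

# Hypothetical 3-d "pretrained" word embeddings (toy values for
# illustration only; real systems would load GloVe/word2vec vectors).
EMBED = {
    "banana": [0.0, 0.9, 0.3],
    "yellow": [0.1, 0.8, 0.4],
    "black":  [0.7, 0.0, 0.2],
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def zero_shot_answer(detected_object, candidate_answers):
    """Pick the candidate answer whose embedding is closest to the
    label of an object detected in the image. Because similarity is
    computed in embedding space, candidates absent from the training
    questions can still be scored."""
    return max(candidate_answers,
               key=lambda a: cosine(EMBED[a], EMBED[detected_object]))

# A detector labels an image region "banana"; among the candidate
# answers, "yellow" lies closest to "banana" in this toy space.
print(zero_shot_answer("banana", ["yellow", "black"]))  # → yellow
```

The same scoring scheme extends to the other strategies in the abstract: object classifiers with semantic embeddings replace the hand-labelled `detected_object` with classifier outputs, and test-time retrieval replaces the embedding table with features of retrieved example images.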


