Can I Trust Your Answer? Visually Grounded Video Question Answering

09/04/2023
by Junbin Xiao, et al.

We study visually grounded VideoQA in response to the emerging trend of leveraging pretraining techniques for video-language understanding. Specifically, by forcing vision-language models (VLMs) to answer questions and simultaneously provide visual evidence, we seek to ascertain the extent to which their predictions are genuinely anchored in relevant video content, rather than in spurious correlations from language or irrelevant visual context. Towards this, we construct NExT-GQA, an extension of NExT-QA with 10.5K temporal grounding (time-span) labels tied to the original QA pairs. With NExT-GQA, we scrutinize a variety of state-of-the-art VLMs. Through post-hoc attention analysis, we find that these models are weak at substantiating their answers despite strong QA performance, exposing a severe limitation of current VLMs for reliable prediction. As a remedy, we explore and propose a video grounding mechanism based on Gaussian mask optimization and cross-modal learning. Experiments with different backbones demonstrate that this grounding mechanism improves both video grounding and QA. Our dataset and code are released. With these efforts, we aim to push towards trustworthy deployment of VLMs in VQA systems.
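To make the grounding mechanism more concrete, below is a minimal PyTorch sketch of Gaussian mask optimization as described in the abstract. It is an illustrative reconstruction, not the authors' released code: a small head predicts a Gaussian center and width on the normalized video timeline from a fused question-video feature, and the resulting soft mask reweights frame features before answering, so the QA loss alone can push the mask toward the evidence segment. All names (GaussianTemporalGrounding, qv_feat, mu, sigma) and the range constraints on the width are assumptions.

```python
import torch
import torch.nn as nn

class GaussianTemporalGrounding(nn.Module):
    """Sketch of a differentiable Gaussian temporal mask for grounded VideoQA.

    A linear head predicts a per-sample center (mu) and width (sigma) over the
    normalized timeline; the soft Gaussian mask reweights frame features so
    that gradients from the downstream QA loss shift the mask toward the
    frames that actually support the answer. Shapes and names are
    illustrative, not taken from the paper's code release.
    """

    def __init__(self, dim):
        super().__init__()
        self.head = nn.Linear(dim, 2)  # predicts (mu_raw, sigma_raw)

    def forward(self, frame_feats, qv_feat):
        # frame_feats: (B, T, D) per-frame video features
        # qv_feat:     (B, D)    fused question-video feature
        B, T, _ = frame_feats.shape
        mu_raw, sigma_raw = self.head(qv_feat).chunk(2, dim=-1)   # (B, 1) each
        mu = torch.sigmoid(mu_raw)                                # center in (0, 1)
        sigma = 0.05 + 0.45 * torch.sigmoid(sigma_raw)            # width in (0.05, 0.5), assumed bounds
        t = torch.linspace(0, 1, T, device=frame_feats.device)    # normalized timeline
        mask = torch.exp(-0.5 * ((t.unsqueeze(0) - mu) / sigma) ** 2)  # (B, T)
        # Reweighted features feed the answer head; (mu, sigma) define the
        # predicted evidence window.
        return frame_feats * mask.unsqueeze(-1), (mu, sigma)
```

At inference, the predicted window (e.g., mu ± sigma on the normalized timeline) can be read off as the model's visual evidence and compared against NExT-GQA's temporal grounding labels with IoU-style metrics.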

Related research:

02/04/2022 · Grounding Answers for Visual Questions Asked by Visually Impaired People
Visual question answering is the task of answering questions about image...

09/08/2022 · Frame-Subtitle Self-Supervision for Multi-Modal Video Question Answering
Multi-modal video question answering aims to predict correct answer and ...

06/02/2022 · Structured Two-stream Attention Network for Video Question Answering
To date, visual question answering (VQA) (i.e., image QA and video QA) i...

06/19/2021 · Attend What You Need: Motion-Appearance Synergistic Networks for Video Question Answering
Video Question Answering is a task which requires an AI agent to answer ...

06/06/2022 · Invariant Grounding for Video Question Answering
Video Question Answering (VideoQA) is the task of answering questions ab...

03/13/2022 · Towards Visual-Prompt Temporal Answering Grounding in Medical Instructional Video
The temporal answering grounding in the video (TAGV) is a new task natur...

04/16/2021 · VGNMN: Video-grounded Neural Module Network to Video-Grounded Language Tasks
Neural module networks (NMN) have achieved success in image-grounded tas...
