Equivariant and Invariant Grounding for Video Question Answering

07/26/2022
by Yicong Li, et al.

Video Question Answering (VideoQA) is the task of answering natural language questions about a video. Producing an answer requires understanding the interplay between the visual scenes in the video and the linguistic semantics of the question. However, most leading VideoQA models work as black boxes, which makes the visual-linguistic alignment behind the answering process obscure. Such a black-box nature calls for visual explainability that reveals “What part of the video should the model look at to answer the question?”. Only a few works present visual explanations, and they do so in a post-hoc fashion, emulating the target model's answering process via an additional method. Nonetheless, such emulation struggles to faithfully exhibit the visual-linguistic alignment during answering. Instead of post-hoc explainability, we focus on intrinsic interpretability to make the answering process transparent. At its core is grounding the question-critical cues as the causal scene to yield answers, while filtering out the question-irrelevant information as the environment scene. Taking a causal look at VideoQA, we devise a self-interpretable framework, Equivariant and Invariant Grounding for Interpretable VideoQA (EIGV). Specifically, equivariant grounding encourages the answering to be sensitive to semantic changes in the causal scene and the question; in contrast, invariant grounding enforces the answering to be insensitive to changes in the environment scene. By imposing both on the answering process, EIGV is able to distinguish the causal scene from the environment information and explicitly present the visual-linguistic alignment. Extensive experiments on three benchmark datasets justify the superiority of EIGV over the leading baselines in terms of accuracy and visual interpretability.
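To make the equivariant/invariant idea concrete, below is a minimal sketch of how such grounding objectives could be wired up in PyTorch. This is not the authors' implementation: the `ToyGroundedVideoQA` model, the soft frame mask, the in-batch swapping used as a stand-in for scene interventions, and the loss weights are all illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyGroundedVideoQA(nn.Module):
    """Toy stand-in for a grounded VideoQA model (not the paper's architecture).

    A soft mask splits frame features into a question-critical "causal" scene
    and a question-irrelevant "environment" scene; answers are predicted from
    a pooled scene plus the question feature.
    """
    def __init__(self, dim=128, num_answers=1000):
        super().__init__()
        self.grounder = nn.Linear(2 * dim, 1)            # frame-question relevance score
        self.answerer = nn.Linear(2 * dim, num_answers)  # answer classifier

    def ground(self, video, question):
        # video: (B, T, D) frame features, question: (B, D) sentence feature
        q = question.unsqueeze(1).expand_as(video)
        mask = torch.sigmoid(self.grounder(torch.cat([video, q], dim=-1)))  # (B, T, 1)
        return mask * video, (1.0 - mask) * video        # causal scene, environment scene

    def answer(self, scene, question):
        pooled = scene.mean(dim=1)                        # (B, D)
        return self.answerer(torch.cat([pooled, question], dim=-1))

def eigv_style_losses(model, video, question, labels):
    """Illustrative combination of QA, invariant, and equivariant terms."""
    causal, env = model.ground(video, question)
    perm = torch.randperm(video.size(0), device=video.device)

    # Answering from the grounded causal scene alone.
    logits = model.answer(causal, question)
    loss_qa = F.cross_entropy(logits, labels)

    # Invariant grounding: pasting in another sample's environment scene
    # should leave the answer distribution unchanged.
    logits_inv = model.answer(causal + env[perm], question)
    loss_inv = F.kl_div(F.log_softmax(logits_inv, dim=-1),
                        F.softmax(logits.detach(), dim=-1),
                        reduction="batchmean")

    # Equivariant grounding: swapping in another sample's causal scene and
    # question should move the answer to that sample's label.
    logits_equiv = model.answer(causal[perm] + env, question[perm])
    loss_equiv = F.cross_entropy(logits_equiv, labels[perm])

    return loss_qa + 0.5 * loss_inv + 0.5 * loss_equiv   # weights are arbitrary here

# Smoke test with random tensors.
model = ToyGroundedVideoQA(dim=128, num_answers=10)
video = torch.randn(4, 16, 128)      # batch of 4 clips, 16 frames, 128-d features
question = torch.randn(4, 128)
labels = torch.randint(0, 10, (4,))
loss = eigv_style_losses(model, video, question, labels)
loss.backward()
```

The in-batch permutation is only a crude proxy for the scene and question interventions the paper describes, but it keeps the sketch self-contained while showing how the two objectives pull in opposite directions: insensitivity to the environment, sensitivity to the causal scene and question.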


