Discovering Spatio-Temporal Rationales for Video Question Answering

07/22/2023
by   Yicong Li, et al.
0

This paper strives to solve complex video question answering (VideoQA) which features long video containing multiple objects and events at different time. To tackle the challenge, we highlight the importance of identifying question-critical temporal moments and spatial objects from the vast amount of video content. Towards this, we propose a Spatio-Temporal Rationalization (STR), a differentiable selection module that adaptively collects question-critical moments and objects using cross-modal interaction. The discovered video moments and objects are then served as grounded rationales to support answer reasoning. Based on STR, we further propose TranSTR, a Transformer-style neural network architecture that takes STR as the core and additionally underscores a novel answer interaction mechanism to coordinate STR for answer decoding. Experiments on four datasets show that TranSTR achieves new state-of-the-art (SoTA). Especially, on NExT-QA and Causal-VidQA which feature complex VideoQA, it significantly surpasses the previous SoTA by 5.8% and 6.8%, respectively. We then conduct extensive studies to verify the importance of STR as well as the proposed answer interaction mechanism. With the success of TranSTR and our comprehensive analysis, we hope this work can spark more future efforts in complex VideoQA. Code will be released at https://github.com/yl3800/TranSTR.

READ FULL TEXT

page 1

page 4

page 8

research
02/27/2023

Contrastive Video Question Answering via Video Graph Transformer

We propose to perform video question answering (VideoQA) in a Contrastiv...
research
04/25/2019

TVQA+: Spatio-Temporal Grounding for Video Question Answering

We present the task of Spatio-Temporal Video Question Answering, which r...
research
07/12/2022

Video Graph Transformer for Video Question Answering

This paper proposes a Video Graph Transformer (VGT) model for Video Quet...
research
08/07/2020

Location-aware Graph Convolutional Networks for Video Question Answering

We addressed the challenging task of video question answering, which req...
research
06/19/2021

Attend What You Need: Motion-Appearance Synergistic Networks for Video Question Answering

Video Question Answering is a task which requires an AI agent to answer ...
research
01/17/2020

Temporal Interlacing Network

For a long time, the vision community tries to learn the spatio-temporal...
research
05/13/2020

Dense-Caption Matching and Frame-Selection Gating for Temporal Localization in VideoQA

Videos convey rich information. Dynamic spatio-temporal relationships be...

Please sign up or login with your details

Forgot password? Click here to reset