Keyword-Aware Relative Spatio-Temporal Graph Networks for Video Question Answering

07/25/2023
by Yi Cheng, et al.

The main challenge in video question answering (VideoQA) is to capture and understand the complex spatial and temporal relations between objects, conditioned on the given question. Existing graph-based methods for VideoQA usually ignore keywords in questions and aggregate features with a simple graph that does not model relative relations between objects, which may lead to inferior performance. In this paper, we propose a Keyword-aware Relative Spatio-Temporal (KRST) graph network for VideoQA. First, to make question features aware of keywords, we employ an attention mechanism that assigns high weights to keywords during question encoding. The keyword-aware question features are then used to guide video graph construction. Second, because relations are relative, we integrate relative relation modeling to better capture the spatio-temporal dynamics among object nodes. Moreover, we disentangle spatio-temporal reasoning into an object-level spatial graph and a frame-level temporal graph, which reduces the interference between spatial and temporal relation reasoning. Extensive experiments on the TGIF-QA, MSVD-QA and MSRVTT-QA datasets demonstrate the superiority of KRST over multiple state-of-the-art methods.
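
To make the keyword-weighting idea concrete, the following is a minimal sketch of attention pooling over word features, in which words with higher scores contribute more to the pooled question feature. It is not the KRST implementation: the abstract does not specify the scoring function, so the function name `keyword_aware_pooling`, the scoring vector `w`, the feature dimensions, and the random inputs are all assumptions made for illustration.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    shifted = x - np.max(x)
    exp = np.exp(shifted)
    return exp / exp.sum()


def keyword_aware_pooling(word_feats, w):
    """Pool word features into one question feature using attention weights.

    word_feats: array of shape (num_words, dim), one feature vector per word.
    w: array of shape (dim,), a hypothetical learned scoring vector
       (an assumption; the abstract does not describe the actual scorer).
    Returns the pooled feature of shape (dim,) and the per-word weights.
    """
    scores = word_feats @ w          # one scalar score per word
    weights = softmax(scores)        # higher score -> larger weight
    pooled = weights @ word_feats    # weighted sum of word features
    return pooled, weights


# Toy inputs (random stand-ins for real word embeddings).
rng = np.random.default_rng(0)
word_feats = rng.normal(size=(5, 8))
w = rng.normal(size=8)

pooled, weights = keyword_aware_pooling(word_feats, w)
print(pooled.shape, weights.round(3))
```

In a trained model, `w` would be learned so that question keywords receive the largest weights; here the weights only illustrate the mechanics of the weighted sum.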


