Attend What You Need: Motion-Appearance Synergistic Networks for Video Question Answering

06/19/2021
by   Ahjeong Seo, et al.

Video Question Answering is a task that requires an AI agent to answer questions grounded in video. This task entails three key challenges: (1) understanding the intention of various questions, (2) capturing various elements of the input video (e.g., object, action, causality), and (3) cross-modal grounding between language and vision information. We propose Motion-Appearance Synergistic Networks (MASN), which embed two cross-modal features grounded on motion and appearance information and selectively utilize them depending on the question's intentions. MASN consists of a motion module, an appearance module, and a motion-appearance fusion module. The motion module computes action-oriented cross-modal joint representations, while the appearance module focuses on the appearance aspect of the input video. Finally, the motion-appearance fusion module takes the outputs of the motion module and the appearance module as input and performs question-guided fusion. As a result, MASN achieves new state-of-the-art performance on the TGIF-QA and MSVD-QA datasets. We also conduct qualitative analysis by visualizing the inference results of MASN. The code is available at https://github.com/ahjeongseo/MASN-pytorch.
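
The idea of question-guided fusion can be illustrated with a minimal sketch. This is not the MASN implementation (see the repository above for that); it assumes each module has already produced a single feature vector, and it uses plain scaled dot-product attention as a stand-in for the fusion step. The function name `question_guided_fusion` and all shapes are hypothetical.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()


def question_guided_fusion(motion_feat, appearance_feat, question_feat):
    """Weight two modality features by their relevance to the question.

    Assumption: all inputs are 1-D vectors of the same dimension d.
    The scoring rule (scaled dot product) is an illustrative choice,
    not the fusion used in the paper.
    """
    feats = np.stack([motion_feat, appearance_feat])   # shape (2, d)
    d = question_feat.shape[0]
    scores = feats @ question_feat / np.sqrt(d)        # shape (2,)
    weights = softmax(scores)                          # sums to 1
    fused = weights @ feats                            # shape (d,)
    return fused, weights


# Toy usage: the motion feature is built to resemble the question vector.
rng = np.random.default_rng(0)
d = 8
question = rng.standard_normal(d)
motion = question + 0.1 * rng.standard_normal(d)
appearance = rng.standard_normal(d)
fused, weights = question_guided_fusion(motion, appearance, question)
print("weights (motion, appearance):", weights)
```

In this toy setup the weights indicate how much each stream contributes to the fused vector for a given question, which is the intuition behind selectively using motion or appearance information.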

research
07/10/2021

DualVGR: A Dual-Visual Graph Reasoning Unit for Video Question Answering

Video question answering is a challenging task, which requires agents to...
research
03/29/2018

Motion-Appearance Co-Memory Networks for Video Question Answering

Video Question Answering (QA) is an important task in understanding vide...
research
07/22/2023

Discovering Spatio-Temporal Rationales for Video Question Answering

This paper strives to solve complex video question answering (VideoQA) w...
research
06/16/2020

Mucko: Multi-Layer Cross-Modal Knowledge Reasoning for Fact-based Visual Question Answering

Fact-based Visual Question Answering (FVQA) requires external knowledge ...
research
09/04/2023

Can I Trust Your Answer? Visually Grounded Video Question Answering

We study visually grounded VideoQA in response to the emerging trends of...
research
08/07/2023

Redundancy-aware Transformer for Video Question Answering

This paper identifies two kinds of redundancy in the current VideoQA par...
research
07/12/2022

Video Graph Transformer for Video Question Answering

This paper proposes a Video Graph Transformer (VGT) model for Video Quet...
