Learning to Reason with Relational Video Representation for Question Answering

07/10/2019
by Thao Minh Le, et al.

How does a machine learn to reason about the content of a video when answering a question? A Video QA system must simultaneously understand language, represent visual content over space-time, iteratively transform these representations in response to the linguistic content of the query, and finally arrive at a sensible answer. While recent advances in textual and visual question answering have produced sophisticated visual representations and neural reasoning mechanisms, major challenges in Video QA remain in the dynamic grounding of concepts, relations and actions to support the reasoning process. We present a new end-to-end layered architecture for Video QA, composed of a question-guided video representation layer and a generic reasoning layer that produces the answer. The video is represented using a hierarchical model that encodes visual information about objects, actions and relations in space-time, given the textual cues from the question. The encoded representation is then passed to a reasoning module, which, in this paper, is implemented as a MAC net. The system is evaluated on the SVQA (synthetic) and TGIF-QA (real) datasets, demonstrating state-of-the-art results, with a large margin in the case of multi-step reasoning.
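To make the two-layer structure described above concrete, here is a minimal toy sketch in plain Python. It is not the authors' model and not a MAC net: the function names (`attend`, `represent_video`, `reason`), the dot-product attention, the fixed 0.5 memory blend, and the tiny 2-dimensional features are all assumptions chosen for illustration. It only shows the general shape of the idea: frames are pooled into clip-level vectors conditioned on the question, and a separate module then reads from those vectors over several steps.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    # Subtract the max for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(vectors, query):
    """Weighted average of `vectors`; weights are a softmax over dot products with `query`."""
    weights = softmax([dot(v, query) for v in vectors])
    dim = len(vectors[0])
    return [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]

def represent_video(frames, question, clip_len):
    """Hypothetical layer 1: pool frames into question-conditioned clip vectors."""
    clips = [frames[i:i + clip_len] for i in range(0, len(frames), clip_len)]
    return [attend(clip, question) for clip in clips]

def reason(knowledge, question, steps):
    """Hypothetical layer 2: repeated attention reads into a memory vector.

    This is a stand-in for a multi-step reasoning module, not MAC itself.
    """
    memory = [0.0] * len(question)
    for _ in range(steps):
        # The read is steered by the question plus what has been gathered so far.
        control = [q + m for q, m in zip(question, memory)]
        read = attend(knowledge, control)
        memory = [0.5 * m + 0.5 * r for m, r in zip(memory, read)]
    return memory

# Toy inputs: four 2-dimensional "frame features" and a 2-dimensional "question".
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
question = [1.0, 0.0]

clips = represent_video(frames, question, clip_len=2)
answer_state = reason(clips, question, steps=3)
```

In a real system the final memory vector would feed an answer classifier or decoder, and the features would come from learned visual and text encoders; both are omitted here.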


research
06/25/2021

Hierarchical Object-oriented Spatio-Temporal Reasoning for Video Question Answering

Video Question Answering (Video QA) is a powerful testbed to develop new...
research
11/29/2021

LiVLR: A Lightweight Visual-Linguistic Reasoning Framework for Video Question Answering

Video Question Answering (VideoQA), aiming to correctly answer the given...
research
04/08/2019

Heterogeneous Memory Enhanced Multimodal Attention Model for Video Question Answering

In this paper, we propose a novel end-to-end trainable Video Question An...
research
04/12/2021

Object-Centric Representation Learning for Video Question Answering

Video question answering (Video QA) presents a powerful testbed for huma...
research
04/30/2020

Dynamic Language Binding in Relational Visual Reasoning

We present Language-binding Object Graph Network, the first neural reaso...
research
02/25/2020

Hierarchical Conditional Relation Networks for Video Question Answering

Video question answering (VideoQA) is challenging as it requires modelin...
research
10/18/2020

Hierarchical Conditional Relation Networks for Multimodal Video Question Answering

Video QA challenges modelers in multiple fronts. Modeling video necessit...
