DeepAI
Log In Sign Up

Hierarchical Object-oriented Spatio-Temporal Reasoning for Video Question Answering

06/25/2021
by   Long Hoang Dang, et al.
11

Video Question Answering (Video QA) is a powerful testbed to develop new AI capabilities. This task necessitates learning to reason about objects, relations, and events across visual and linguistic domains in space-time. High-level reasoning demands lifting from associative visual pattern recognition to symbol-like manipulation over objects, their behavior and interactions. Toward reaching this goal we propose an object-oriented reasoning approach in that video is abstracted as a dynamic stream of interacting objects. At each stage of the video event flow, these objects interact with each other, and their interactions are reasoned about with respect to the query and under the overall context of a video. This mechanism is materialized into a family of general-purpose neural units and their multi-level architecture called Hierarchical Object-oriented Spatio-Temporal Reasoning (HOSTR) networks. This neural model maintains the objects' consistent lifelines in the form of a hierarchically nested spatio-temporal graph. Within this graph, the dynamic interactive object-oriented representations are built up along the video sequence, hierarchically abstracted in a bottom-up manner, and converge toward the key information for the correct answer. The method is evaluated on multiple major Video QA datasets and establishes new state-of-the-arts in these tasks. Analysis into the model's behavior indicates that object-oriented reasoning is a reliable, interpretable and efficient approach to Video QA.

READ FULL TEXT
04/12/2021

Object-Centric Representation Learning for Video Question Answering

Video question answering (Video QA) presents a powerful testbed for huma...
02/18/2022

(2.5+1)D Spatio-Temporal Scene Graphs for Video Question Answering

Spatio-temporal scene-graph approaches to video-based reasoning tasks su...
07/10/2019

Learning to Reason with Relational Video Representation for Question Answering

How does machine learn to reason about the content of a video in answeri...
10/18/2020

Hierarchical Conditional Relation Networks for Multimodal Video Question Answering

Video QA challenges modelers in multiple fronts. Modeling video necessit...
02/25/2020

Hierarchical Conditional Relation Networks for Video Question Answering

Video question answering (VideoQA) is challenging as it requires modelin...
07/08/2022

Video Dialog as Conversation about Objects Living in Space-Time

It would be a technological feat to be able to create a system that can ...
09/17/2020

Dynamic Regions Graph Neural Networks for Spatio-Temporal Reasoning

Graph Neural Networks are perfectly suited to capture latent interaction...