HySTER: A Hybrid Spatio-Temporal Event Reasoner

01/17/2021
by Theophile Sautory, et al.

The task of Video Question Answering (VideoQA) consists of answering natural language questions about a video and serves as a proxy for evaluating a model's performance in scene sequence understanding. Most VideoQA methods to date are end-to-end deep learning architectures, which struggle with complex temporal and causal reasoning and provide limited transparency in their reasoning steps. We present HySTER, a Hybrid Spatio-Temporal Event Reasoner, to reason over physical events in videos. Our model combines the strength of deep learning methods in extracting information from video frames with the reasoning capabilities and explainability of symbolic artificial intelligence in an answer set programming framework. We define a method based on general temporal, causal and physics rules which can be transferred across tasks. We apply our model to the CLEVRER dataset and demonstrate state-of-the-art results in question answering accuracy. This work sets the foundations for the incorporation of inductive logic programming in the field of VideoQA.
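
To make the hybrid idea concrete, the sketch below separates perception from reasoning in plain Python. It is not the authors' implementation and does not use answer set programming: the object tracks, the `CONTACT_DISTANCE` threshold, and the functions `collisions`, `happens_before` and `first_collision_partner` are all hypothetical names chosen for illustration. The tracks stand in for what a neural detector might extract from frames; the rules then derive collision events and answer a simple temporal question.

```python
from itertools import combinations
from math import hypot

# Hypothetical perception output: object centre (x, y) per frame.
# In a hybrid system these facts would come from a neural detector.
TRACKS = {
    "red_cube":    [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (2.0, 0.0)],
    "blue_sphere": [(4.0, 0.0), (3.0, 0.0), (2.5, 0.0), (4.0, 0.0)],
    "green_ball":  [(9.0, 9.0), (9.0, 9.0), (5.0, 0.0), (4.5, 0.0)],
}
CONTACT_DISTANCE = 0.6  # assumed contact threshold, not from the paper


def collisions(tracks, threshold=CONTACT_DISTANCE):
    """Physics rule: a collision starts at the first frame of each contact interval."""
    events = []
    for a, b in combinations(sorted(tracks), 2):
        touching_before = False
        for frame, ((ax, ay), (bx, by)) in enumerate(zip(tracks[a], tracks[b])):
            touching = hypot(ax - bx, ay - by) <= threshold
            if touching and not touching_before:
                events.append((frame, a, b))
            touching_before = touching
    return sorted(events)


def happens_before(event_1, event_2):
    """Temporal rule: event_1 precedes event_2 if its frame index is smaller."""
    return event_1[0] < event_2[0]


def first_collision_partner(events, obj):
    """Answer 'what did obj collide with first?', or None if it never collides."""
    for _, a, b in sorted(events):
        if obj == a:
            return b
        if obj == b:
            return a
    return None


if __name__ == "__main__":
    derived = collisions(TRACKS)
    for event in derived:
        print(event)
    print(first_collision_partner(derived, "blue_sphere"))
```

Because every answer is derived from explicit events and rules, the intermediate facts can be inspected directly, which is the kind of transparency the abstract contrasts with end-to-end models.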


04/25/2019

TVQA+: Spatio-Temporal Grounding for Video Question Answering

We present the task of Spatio-Temporal Video Question Answering, which r...
05/28/2019

Blocksworld Revisited: Learning and Reasoning to Generate Event-Sequences from Image Pairs

The process of identifying changes or transformations in a scene along w...
12/08/2020

CRAFT: A Benchmark for Causal Reasoning About Forces and inTeractions

Recent advances in Artificial Intelligence and deep learning have revive...
12/06/2016

MarioQA: Answering Questions by Watching Gameplay Videos

We present a framework to analyze various aspects of models for video qu...
05/03/2022

Episodic Memory Question Answering

Egocentric augmented reality devices such as wearable glasses passively ...
03/30/2021

AGQA: A Benchmark for Compositional Spatio-Temporal Reasoning

Visual events are a composition of temporal actions involving actors spa...