From Representation to Reasoning: Towards both Evidence and Commonsense Reasoning for Video Question-Answering

05/30/2022
by Jiangtong Li, et al.

Video understanding has achieved great success in representation learning, on tasks such as video captioning, video object grounding, and descriptive video question answering. However, current methods still struggle with video reasoning, including evidence reasoning and commonsense reasoning. To facilitate deeper video understanding towards video reasoning, we present the task of Causal-VidQA, which includes four types of questions ranging from scene description (description) to evidence reasoning (explanation) and commonsense reasoning (prediction and counterfactual). For commonsense reasoning, we set up a two-step solution: answering the question and providing a proper reason. Through extensive experiments on existing VideoQA methods, we find that state-of-the-art methods are strong on description but weak on reasoning. We hope that Causal-VidQA can guide research on video understanding from representation learning towards deeper reasoning. The dataset and related resources are available at <https://github.com/bcmi/Causal-VidQA.git>.


research
05/18/2021

NExT-QA:Next Phase of Question-Answering to Explaining Temporal Actions

We introduce NExT-QA, a rigorously designed video question answering (Vi...
research
07/08/2022

CoSIm: Commonsense Reasoning for Counterfactual Scene Imagination

As humans, we can modify our assumptions about a scene by imagining alte...
research
06/06/2022

Invariant Grounding for Video Question Answering

Video Question Answering (VideoQA) is the task of answering questions ab...
research
05/10/2023

VideoChat: Chat-Centric Video Understanding

In this study, we initiate an exploration into video understanding by in...
research
07/04/2021

Cognitive Visual Commonsense Reasoning Using Dynamic Working Memory

Visual Commonsense Reasoning (VCR) predicts an answer with corresponding...
research
10/08/2022

EgoTaskQA: Understanding Human Tasks in Egocentric Videos

Understanding human tasks through video observations is an essential cap...
research
11/17/2022

Visual Commonsense-aware Representation Network for Video Captioning

Generating consecutive descriptions for videos, i.e., Video Captioning, ...
