Dense but Efficient VideoQA for Intricate Compositional Reasoning

10/19/2022
by   Jihyeon Lee, et al.

It is well known that most conventional video question answering (VideoQA) datasets consist of easy questions requiring simple reasoning. However, long videos inevitably contain complex, compositional semantic structures along the spatio-temporal axis, which requires a model to understand the compositional structures inherent in the videos. In this paper, we suggest a new compositional VideoQA method based on a transformer architecture with a deformable attention mechanism to address complex VideoQA tasks. Deformable attention is introduced to sample a subset of informative visual features from the dense visual feature map, covering a temporally long range of frames efficiently. Furthermore, the dependency structure within complex question sentences is combined with the language embeddings so that the model can readily capture the relations among question words. Extensive experiments and ablation studies show that the suggested dense but efficient model outperforms other baselines.
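The core idea of deformable attention here is that, instead of attending densely over every frame, a query predicts a small set of fractional sampling offsets over the temporal axis and gathers features only at those positions. The following is a minimal illustrative sketch of that sampling step in 1-D (over frames); the function name, the random stand-ins for learned projections, and all dimensions are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def deformable_sample(features, query, num_points=4, rng=None):
    """Sketch of 1-D deformable attention over a video's frames.

    `features` is a dense (T, d) temporal feature map; `query` is a
    (d,) query vector. The query predicts `num_points` fractional
    offsets around a reference frame, and features at those positions
    are gathered by linear interpolation, then combined with softmax
    attention weights. The random matrices stand in for learned
    offset/weight heads (illustrative only).
    """
    T, d = features.shape
    rng = np.random.default_rng(0) if rng is None else rng
    W_off = rng.standard_normal((d, num_points)) * 0.1  # offset head
    W_att = rng.standard_normal((d, num_points)) * 0.1  # weight head

    ref = (T - 1) / 2.0                 # reference position: center frame
    offsets = query @ W_off             # (num_points,) fractional offsets
    weights = np.exp(query @ W_att)
    weights /= weights.sum()            # softmax attention weights

    out = np.zeros(d)
    for p in range(num_points):
        pos = float(np.clip(ref + offsets[p], 0, T - 1))
        lo, hi = int(np.floor(pos)), int(np.ceil(pos))
        frac = pos - lo
        # Linear interpolation between the two neighboring frames.
        sampled = (1 - frac) * features[lo] + frac * features[hi]
        out += weights[p] * sampled
    return out
```

Because only `num_points` positions are sampled per query rather than all `T` frames, the cost per query is independent of video length, which is what makes this style of attention attractive for temporally long videos.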


research  03/30/2021
AGQA: A Benchmark for Compositional Spatio-Temporal Reasoning
Visual events are a composition of temporal actions involving actors spa...

research  10/03/2022
Extending Compositional Attention Networks for Social Reasoning in Videos
We propose a novel deep architecture for the task of reasoning about soc...

research  09/05/2019
A Better Way to Attend: Attention with Trees for Video Question Answering
We propose a new attention model for video question answering. The main ...

research  01/17/2020
Cut-Based Graph Learning Networks to Discover Compositional Structure of Sequential Video Data
Conventional sequential learning methods such as Recurrent Neural Networ...

research  07/03/2019
Compositional Structure Learning for Sequential Video Data
Conventional sequential learning methods such as Recurrent Neural Networ...

research  04/12/2022
AGQA 2.0: An Updated Benchmark for Compositional Spatio-Temporal Reasoning
Prior benchmarks have analyzed models' answers to questions about videos...

research  12/16/2021
Utilizing Evidence Spans via Sequence-Level Contrastive Learning for Long-Context Question Answering
Long-range transformer models have achieved encouraging results on long-...
