AGQA: A Benchmark for Compositional Spatio-Temporal Reasoning

Visual events are a composition of temporal actions involving actors spatially interacting with objects. When developing computer vision models that can reason about compositional spatio-temporal events, we need benchmarks that can analyze progress and uncover shortcomings. Existing video question answering benchmarks are useful, but they often conflate multiple sources of error into one accuracy metric and have strong biases that models can exploit, making it difficult to pinpoint model weaknesses. We present Action Genome Question Answering (AGQA), a new benchmark for compositional spatio-temporal reasoning. AGQA contains 192M unbalanced question-answer pairs for 9.6K videos. We also provide a balanced subset of 3.9M question-answer pairs, three orders of magnitude larger than existing benchmarks, that minimizes bias by balancing the answer distributions and the types of question structures. Although human evaluators marked 86.02% of our question-answer pairs as correct, the best model achieves only 47.74% accuracy. In addition, AGQA introduces multiple training/test splits to test for various reasoning abilities, including generalization to novel compositions, to indirect references, and to more compositional steps. Using AGQA, we evaluate modern visual reasoning systems, demonstrating that the best models barely perform better than non-visual baselines that exploit linguistic biases, and that none of the existing models generalize to novel compositions unseen during training.
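The answer-balancing step is worth making concrete. Below is a minimal Python sketch, under stated assumptions, of balancing within one question template: group question-answer pairs by template and by answer, then downsample every answer group to the size of the rarest answer, so a model cannot score well by always predicting the majority answer. The record fields ("template", "question", "answer") and the balance_answers helper are hypothetical names for illustration; AGQA's published procedure also balances the distribution of question structures, which this sketch omits.

```python
import random
from collections import defaultdict

# Hypothetical record format: {"template": str, "question": str, "answer": str}
def balance_answers(qa_pairs, seed=0):
    """Downsample QA pairs so that, within each question template, every
    answer appears equally often. Illustrative sketch only; it does not
    reproduce AGQA's full balancing over question structures."""
    rng = random.Random(seed)

    # Bucket pairs first by question template, then by answer.
    by_template = defaultdict(lambda: defaultdict(list))
    for qa in qa_pairs:
        by_template[qa["template"]][qa["answer"]].append(qa)

    balanced = []
    for answer_groups in by_template.values():
        # Cap every answer group at the size of the rarest answer.
        cap = min(len(group) for group in answer_groups.values())
        for group in answer_groups.values():
            balanced.extend(rng.sample(group, cap))

    rng.shuffle(balanced)
    return balanced

# Example: a 90/10 yes/no bias in an "exists" template collapses to 50/50.
pairs = (
    [{"template": "exists", "question": "Did they hold a dish?", "answer": "yes"}] * 90
    + [{"template": "exists", "question": "Did they hold a phone?", "answer": "no"}] * 10
)
balanced = balance_answers(pairs)
assert sum(qa["answer"] == "yes" for qa in balanced) == sum(
    qa["answer"] == "no" for qa in balanced
)
```

Downsampling to the rarest answer is the simplest way to flatten an answer distribution; it trades dataset size for bias reduction, which is why the balanced subset (3.9M pairs) is much smaller than the unbalanced pool (192M pairs).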

Related research

- AGQA 2.0: An Updated Benchmark for Compositional Spatio-Temporal Reasoning (04/12/2022). Prior benchmarks have analyzed models' answers to questions about videos...
- ANetQA: A Large-scale Benchmark for Fine-grained Compositional Reasoning over Untrimmed Videos (05/04/2023). Building benchmarks to systemically analyze different capabilities of vi...
- Dense but Efficient VideoQA for Intricate Compositional Reasoning (10/19/2022). It is well known that most of the conventional video question answering ...
- CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning (12/20/2016). When building artificial intelligence systems that can reason and answer...
- Measuring Compositional Consistency for Video Question Answering (04/14/2022). Recent video question answering benchmarks indicate that state-of-the-ar...
- DVD: A Diagnostic Dataset for Multi-step Reasoning in Video Grounded Dialogue (01/01/2021). A video-grounded dialogue system is required to understand both dialogue...
- Show Why the Answer is Correct! Towards Explainable AI using Compositional Temporal Attention (05/15/2021). Visual Question Answering (VQA) models have achieved significant success...
