EgoSchema: A Diagnostic Benchmark for Very Long-form Video Language Understanding

08/17/2023
by Karttikeya Mangalam, et al.

We introduce EgoSchema, a very long-form video question-answering dataset and benchmark to evaluate the long video understanding capabilities of modern vision and language systems. Derived from Ego4D, EgoSchema consists of over 5,000 human-curated multiple-choice question-answer pairs, spanning over 250 hours of real video data and covering a very broad range of natural human activity and behavior. For each question, EgoSchema requires the correct answer to be selected from five given options based on a three-minute-long video clip. While some prior works have proposed video datasets with long clip lengths, we posit that merely the length of the video clip does not truly capture the temporal difficulty of the video task being considered. To remedy this, we introduce temporal certificate sets, a general notion for capturing the intrinsic temporal understanding length associated with a broad range of video understanding tasks and datasets. Based on this metric, we find EgoSchema to have intrinsic temporal lengths over 5.7x longer than the second closest dataset and 10x to 100x longer than any other video understanding dataset. Further, our evaluation of several current state-of-the-art video and language models shows them to be severely lacking in long-term video understanding capabilities. Even models with several billions of parameters achieve QA accuracy less than 33% (random is 20%), while humans achieve about 76% accuracy. We posit that EgoSchema, with its long intrinsic temporal structures and diverse complexity, would serve as a valuable evaluation probe for developing effective long-term video understanding systems in the future. Data and zero-shot model evaluation code are open-sourced for both public and commercial use under the Ego4D license at http://egoschema.github.io
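To make the temporal certificate idea concrete, the sketch below illustrates one way to compute a certificate length: the certificate of a (clip, question) pair is the set of sub-intervals a verifier would need to watch to confirm the annotated answer, and its length is the merged duration of those sub-intervals. The `(start, end)` seconds interval format and the `certificate_length` helper are assumptions for illustration, not the released annotation schema or evaluation code.

```python
# Minimal sketch of a temporal-certificate-length computation.
# Assumption: each certificate is given as a list of (start_s, end_s) sub-intervals
# of the clip, in seconds. This is illustrative only, not EgoSchema's actual format.

from typing import List, Tuple


def certificate_length(intervals: List[Tuple[float, float]]) -> float:
    """Total duration (seconds) covered by a set of certificate sub-intervals."""
    if not intervals:
        return 0.0
    # Sort by start time, then merge overlapping intervals so shared
    # footage is not double-counted.
    merged: List[Tuple[float, float]] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return sum(end - start for start, end in merged)


# Example: a question whose answer requires watching footage spread across a
# 180 s clip; the two overlapping intervals merge to (0, 70), giving 130 s total.
print(certificate_length([(0, 40), (30, 70), (120, 180)]))  # 130.0
```

Aggregating such per-question lengths over a dataset gives the kind of intrinsic temporal length statistic the abstract compares across benchmarks.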


