Perception Test: A Diagnostic Benchmark for Multimodal Video Models

05/23/2023
by Viorica Patraucean, et al.

We propose a novel multimodal video benchmark - the Perception Test - to evaluate the perception and reasoning skills of pre-trained multimodal models (e.g. Flamingo, BEiT-3, or GPT-4). Compared to existing benchmarks that focus on computational tasks (e.g. classification, detection, or tracking), the Perception Test focuses on skills (Memory, Abstraction, Physics, Semantics) and types of reasoning (descriptive, explanatory, predictive, counterfactual) across video, audio, and text modalities, to provide a comprehensive and efficient evaluation tool. The benchmark probes pre-trained models for their transfer capabilities in a zero-shot, few-shot, or limited fine-tuning regime. For these purposes, the Perception Test introduces 11.6k real-world videos, with an average length of 23 seconds, designed to show perceptually interesting situations and filmed by around 100 participants worldwide. The videos are densely annotated with six types of labels (multiple-choice and grounded video question-answers, object and point tracks, temporal action and sound segments), enabling both language and non-language evaluations. The fine-tuning and validation splits of the benchmark are publicly available (CC-BY license), in addition to a challenge server with a held-out test split. Human baseline results compared to state-of-the-art video QA models show a significant gap in performance (91.4 vs 43.6), indicating substantial room for improvement in multimodal video understanding. The dataset, baseline code, and challenge server are available at https://github.com/deepmind/perception_test
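For orientation, below is a minimal sketch of how the multiple-choice video QA track could be scored, assuming a JSON annotation file that lists, per video, its questions and the index of the correct option. The file name and field names (`video_id`, `mc_question`, `question_id`, `answer_id`) are illustrative assumptions, not the dataset's actual schema; consult the linked repository for the real annotation format and evaluation code.

```python
# Hypothetical sketch of scoring multiple-choice video QA predictions against
# ground-truth annotations. Field names and file layout are illustrative only,
# not the actual Perception Test schema.
import json
from typing import Dict, Tuple

Key = Tuple[str, str]  # (video_id, question_id)


def load_mc_qa_ground_truth(path: str) -> Dict[Key, int]:
    """Map (video_id, question_id) -> index of the correct answer option."""
    with open(path) as f:
        annotations = json.load(f)  # assumed: a list of per-video records
    ground_truth: Dict[Key, int] = {}
    for video in annotations:
        for question in video.get("mc_question", []):
            key = (video["video_id"], question["question_id"])
            ground_truth[key] = question["answer_id"]
    return ground_truth


def mc_qa_accuracy(predictions: Dict[Key, int],
                   ground_truth: Dict[Key, int]) -> float:
    """Fraction of questions where the predicted option matches the ground truth."""
    correct = sum(1 for key, answer in ground_truth.items()
                  if predictions.get(key) == answer)
    return correct / len(ground_truth)


if __name__ == "__main__":
    gt = load_mc_qa_ground_truth("valid_annotations.json")  # illustrative path
    preds = {key: 0 for key in gt}  # trivial baseline: always pick option 0
    print(f"mc-QA accuracy: {mc_qa_accuracy(preds, gt):.3f}")
```

Under this scoring, the reported human baseline of 91.4 and model baseline of 43.6 would correspond to per-question accuracy (in percent) over the multiple-choice QA annotations.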
