Visual Semantic Role Labeling for Video Understanding

04/02/2021
by Arka Sadhu, et al.

We propose a new framework for understanding and representing related salient events in a video using visual semantic role labeling. We represent videos as a set of related events, wherein each event consists of a verb and multiple entities that fulfill various roles relevant to that event. To study the challenging task of semantic role labeling in videos (VidSRL), we introduce the VidSitu benchmark, a large-scale video understanding data source with 29K 10-second movie clips richly annotated with a verb and semantic roles every 2 seconds. Entities are co-referenced across events within a movie clip, and events are connected to each other via event-event relations. Clips in VidSitu are drawn from a large collection of movies (∼3K) and have been chosen to be both complex (∼4.2 unique verbs within a video) and diverse (∼200 verbs have more than 100 annotations each). We provide a comprehensive analysis of the dataset in comparison to other publicly available video understanding benchmarks, present several illustrative baselines, and evaluate a range of standard video recognition models. Our code and dataset are available at vidsitu.org.
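
To make the representation described above concrete, the following is a minimal sketch of how one annotated clip might be held in memory. It is not the VidSitu file format or the authors' code: the class names (`Event`, `Clip`), the role keys (`Arg0`, `Arg1`, `Arg2`), the relation labels, and the choice to treat identical entity strings as co-referent are all assumptions made for illustration. Only the 10-second clip length and the 2-second annotation interval come from the abstract.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple

CLIP_SECONDS = 10   # clip length stated in the abstract
EVENT_SECONDS = 2   # annotation interval stated in the abstract


@dataclass
class Event:
    """One annotated segment: a verb plus the entities filling its roles."""
    start: int               # segment start time in seconds
    verb: str
    roles: Dict[str, str]    # hypothetical role name -> entity description


@dataclass
class Clip:
    """A clip as an ordered list of events plus event-event relations."""
    events: List[Event] = field(default_factory=list)
    # (source event index, target event index, hypothetical relation label)
    relations: List[Tuple[int, int, str]] = field(default_factory=list)

    def unique_verbs(self) -> Set[str]:
        """Distinct verbs in the clip, a rough proxy for clip complexity."""
        return {event.verb for event in self.events}

    def coreferent_entities(self) -> Dict[str, List[int]]:
        """Entities that appear in more than one event, with event indices.

        Assumption: two mentions co-refer when their strings are identical.
        """
        mentions: Dict[str, List[int]] = {}
        for idx, event in enumerate(self.events):
            for entity in set(event.roles.values()):
                mentions.setdefault(entity, []).append(idx)
        return {e: idxs for e, idxs in mentions.items() if len(idxs) > 1}


def example_clip() -> Clip:
    """Build an invented clip with one event per 2-second segment."""
    events = [
        Event(0, "run", {"Arg0": "woman in red coat"}),
        Event(2, "open", {"Arg0": "woman in red coat", "Arg1": "door"}),
        Event(4, "enter", {"Arg0": "woman in red coat", "Arg1": "room"}),
        Event(6, "look", {"Arg0": "man at desk", "Arg1": "woman in red coat"}),
        Event(8, "talk", {"Arg0": "man at desk", "Arg2": "woman in red coat"}),
    ]
    relations = [(1, 2, "enables"), (2, 3, "causes")]
    return Clip(events, relations)


if __name__ == "__main__":
    clip = example_clip()
    print(sorted(clip.unique_verbs()))
    print(clip.coreferent_entities())
```

In this sketch, a clip holds `CLIP_SECONDS // EVENT_SECONDS` events, and the relations are stored as index pairs so that the events themselves stay independent of one another.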


