Large-Scale Automatic Labeling of Video Events with Verbs Based on Event-Participant Interaction

04/16/2012
by Andrei Barbu, et al.

We present an approach to labeling short video clips with English verbs as event descriptions. A key distinguishing aspect of this work is that it labels videos with verbs that describe the spatiotemporal interaction between event participants (humans and objects interacting with each other). It abstracts away all object-class information and fine-grained image characteristics and relies solely on the coarse-grained motion of the event participants. We apply our approach to a large set of 22 distinct verb classes and a corpus of 2,584 videos, yielding two surprising outcomes. First, we obtain a classification accuracy of greater than 70% on a 1-out-of-22 labeling task and greater than 85% on a variety of 1-out-of-10 subsets of this labeling task, regardless of which of two different time-series classifiers we employ. Second, we achieve this level of accuracy using a highly impoverished intermediate representation consisting solely of the bounding boxes of one or two event participants as a function of time. This indicates that successful event recognition depends more on choosing appropriate features that characterize the linguistic invariants of the event classes than on the particular classifier algorithm.
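The pipeline the abstract describes (per-frame bounding boxes of one or two participants, treated as a time series and fed to a classifier) can be illustrated with a minimal sketch. Everything concrete here is an assumption for illustration only: the `(x_min, y_min, x_max, y_max)` box format, the hypothetical feature set in `track_features`, and the use of a 1-nearest-neighbour classifier under dynamic time warping in `dtw_distance` and `classify`. The abstract does not say which two time-series classifiers the authors used, and this is not their implementation.

```python
import math
from typing import List, Optional, Sequence, Tuple

# Assumed box format: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def center(box: Box) -> Tuple[float, float]:
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2.0, (y0 + y1) / 2.0)


def track_features(agent: Sequence[Box],
                   other: Optional[Sequence[Box]] = None) -> List[List[float]]:
    """Hypothetical per-frame features built only from box geometry.

    For each frame: agent centre (x, y), width, height, frame-to-frame
    velocity, and the offset of the second participant's centre from the
    agent's centre (zeros when there is no second participant).
    """
    feats = []
    for t, box in enumerate(agent):
        cx, cy = center(box)
        w, h = box[2] - box[0], box[3] - box[1]
        vx, vy = 0.0, 0.0
        if t > 0:
            px, py = center(agent[t - 1])
            vx, vy = cx - px, cy - py
        dx, dy = 0.0, 0.0
        if other is not None:
            ox, oy = center(other[t])
            dx, dy = ox - cx, oy - cy
        feats.append([cx, cy, w, h, vx, vy, dx, dy])
    return feats


def dtw_distance(a: Sequence[Sequence[float]],
                 b: Sequence[Sequence[float]]) -> float:
    """Dynamic time warping with a Euclidean per-frame cost, so clips of
    different lengths or speeds can still be compared."""
    n, m = len(a), len(b)
    inf = float("inf")
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = math.dist(a[i - 1], b[j - 1])
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]


def classify(query, training) -> str:
    """Return the verb label of the nearest training clip under DTW."""
    return min(training, key=lambda item: dtw_distance(query, item[0]))[1]


# Toy usage: an agent box sliding horizontally toward or away from a static box.
def sliding(x_start: float, step: float, frames: int) -> List[Box]:
    return [(x_start + step * t, 0.0, x_start + step * t + 10.0, 20.0)
            for t in range(frames)]


def static(frames: int) -> List[Box]:
    return [(100.0, 0.0, 110.0, 20.0)] * frames


training = [
    (track_features(sliding(0, 10, 8), static(8)), "approach"),
    (track_features(sliding(80, -10, 8), static(8)), "leave"),
]
query = track_features(sliding(5, 9, 10), static(10))
print(classify(query, training))  # → approach
```

Because only box geometry enters the features, this sketch would assign the same label whether the boxes surround a person, a ball, or a car, in the spirit of the abstract's point that object-class information is abstracted away.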

Related research

08/14/2019
Detecting 11K Classes: Large Scale Object Detection without Fine-Grained Bounding Boxes
Recent advances in deep learning greatly boost the performance of object...

04/02/2021
Visual Semantic Role Labeling for Video Understanding
We propose a new framework for understanding and representing related sa...

08/18/2023
Audiovisual Moments in Time: A Large-Scale Annotated Dataset of Audiovisual Actions
We present Audiovisual Moments in Time (AVMIT), a large-scale dataset of...

04/12/2012
Video In Sentences Out
We present a system that produces sentential descriptions of video: who ...

08/09/2023
Constructing Holistic Spatio-Temporal Scene Graph for Video Semantic Role Labeling
Video Semantic Role Labeling (VidSRL) aims to detect the salient events ...

05/24/2016
EventNet Version 1.1 Technical Report
EventNet is a large-scale video corpus and event ontology consisting of ...

04/08/2018
Scaling Egocentric Vision: The EPIC-KITCHENS Dataset
First-person vision is gaining interest as it offers a unique viewpoint ...