Audiovisual Moments in Time: A Large-Scale Annotated Dataset of Audiovisual Actions

08/18/2023
by   Michael Joannou, et al.

We present Audiovisual Moments in Time (AVMIT), a large-scale dataset of audiovisual action events. In an extensive annotation task, 11 participants labelled a subset of 3-second audiovisual videos from the Moments in Time dataset (MIT). For each trial, participants assessed whether the labelled audiovisual action event was present and whether it was the most prominent feature of the video. The dataset comprises annotations for 57,177 audiovisual videos, each independently evaluated by 3 of the 11 trained participants. From this initial collection, we created a curated test set of 16 distinct action classes with 60 videos each (960 videos). We also offer 2 sets of pre-computed audiovisual feature embeddings, using VGGish/YAMNet for audio data and VGG16/EfficientNetB0 for visual data, thereby lowering the barrier to entry for audiovisual DNN research. We explored the advantages of the AVMIT annotations and feature embeddings for audiovisual event recognition. A series of 6 Recurrent Neural Networks (RNNs) were trained on either AVMIT-filtered audiovisual events or modality-agnostic events from MIT, and then tested on our audiovisual test set. In all RNNs, top-1 accuracy increased by 2.71-5.94% when training exclusively on audiovisual events, an advantage that outweighed even a three-fold increase in training data. We anticipate that the newly annotated AVMIT dataset will serve as a valuable resource for research and comparative experiments involving computational models and human participants, particularly for research questions where audiovisual correspondence is of critical importance.
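
To illustrate how feature embeddings of this kind might be reproduced or consumed, the sketch below computes YAMNet audio embeddings and EfficientNetB0 visual embeddings for a single 3-second clip and feeds the fused sequence to a small GRU classifier. The frame sampling rate, stream alignment, fusion strategy, and classifier head are illustrative assumptions, not the exact pipeline or RNN architecture used for the released AVMIT embeddings.

```python
# Minimal sketch: per-clip audio/visual embeddings and a small recurrent classifier.
# Frame rate, alignment, fusion and classifier are assumptions, not the AVMIT pipeline.
import tensorflow as tf
import tensorflow_hub as hub

NUM_CLASSES = 16          # size of the curated AVMIT test set label space
SR = 16_000               # YAMNet expects 16 kHz mono audio in [-1, 1]

# --- Audio embeddings: YAMNet yields one 1024-d vector per ~0.48 s patch ---
yamnet = hub.load("https://tfhub.dev/google/yamnet/1")
waveform = tf.random.uniform([3 * SR], minval=-1.0, maxval=1.0)   # placeholder 3 s clip
_, audio_emb, _ = yamnet(waveform)             # shape: (num_patches, 1024)

# --- Visual embeddings: pooled EfficientNetB0 features per sampled frame ---
cnn = tf.keras.applications.EfficientNetB0(include_top=False, pooling="avg",
                                           weights="imagenet")
frames = tf.random.uniform([6, 224, 224, 3], maxval=255.0)        # placeholder frames
frames = tf.keras.applications.efficientnet.preprocess_input(frames)
visual_emb = cnn(frames, training=False)       # shape: (num_frames, 1280)

# --- Align the two streams to a common number of time steps and fuse ---
T = min(audio_emb.shape[0], visual_emb.shape[0])
fused = tf.concat([audio_emb[:T], visual_emb[:T]], axis=-1)       # (T, 2304)

# --- A small recurrent classifier over the fused embedding sequence ---
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(None, fused.shape[-1])),
    tf.keras.layers.GRU(256),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
probs = model(fused[tf.newaxis])               # (1, NUM_CLASSES) class probabilities
print(probs.shape)
```

In a real experiment the placeholder waveform and frames would be replaced by decoded AVMIT clips (or by the pre-computed embeddings distributed with the dataset), and the GRU would be trained on the fused sequences with the annotated action labels.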

Related research

01/09/2018 · Moments in Time Dataset: one million videos for event understanding
We present the Moments in Time Dataset, a large-scale human-annotated co...

05/01/2020 · The AVA-Kinetics Localized Human Actions Video Dataset
This paper describes the AVA-Kinetics localized human actions video data...

02/01/2023 · Epic-Sounds: A Large-scale Dataset of Actions That Sound
We introduce EPIC-SOUNDS, a large-scale dataset of audio annotations cap...

04/16/2012 · Large-Scale Automatic Labeling of Video Events with Verbs Based on Event-Participant Interaction
We present an approach to labeling short video clips with English verbs ...

06/03/2019 · Mining YouTube - A dataset for learning fine-grained action concepts from webly supervised video data
Action recognition is so far mainly focusing on the problem of classific...

03/13/2015 · The YLI-MED Corpus: Characteristics, Procedures, and Plans
The YLI Multimedia Event Detection corpus is a public-domain index of vi...

10/06/2022 · Ambiguous Images With Human Judgments for Robust Visual Event Classification
Contemporary vision benchmarks predominantly consider tasks on which hum...
