PEANUT: A Human-AI Collaborative Tool for Annotating Audio-Visual Data

07/27/2023
by Zheng Zhang, et al.

Audio-visual learning seeks to enhance a computer's multi-modal perception by leveraging the correlation between the auditory and visual modalities. Despite many useful downstream tasks, such as video retrieval, AR/VR, and accessibility, the performance and adoption of existing audio-visual models have been impeded by the limited availability of high-quality datasets. Annotating audio-visual datasets is laborious, expensive, and time-consuming. To address this challenge, we designed and developed Peanut, an efficient audio-visual annotation tool. Peanut's human-AI collaborative pipeline separates the multi-modal task into two single-modal tasks, and it uses state-of-the-art object-detection and sound-tagging models to reduce both the annotators' effort on each frame and the number of frames that must be annotated manually. A within-subjects user study with 20 participants found that Peanut significantly accelerates audio-visual data annotation while maintaining high annotation accuracy.
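The abstract only outlines the pipeline at a high level. Below is a minimal Python sketch of one plausible reading of it, not the authors' actual implementation: the audio track is tagged once by a sound-tagging model, an object detector is run only on sparse keyframes, and a detection becomes a machine-proposed audio-visual annotation when its class agrees with a sound tag, with low-confidence proposals routed to the human annotator. All names here (propose_av_annotations, Detection, Proposal, the detect callable, keyframe_stride) are hypothetical illustrations, not identifiers from the paper.

from dataclasses import dataclass
from typing import Any, Callable, List, Tuple

@dataclass
class Detection:
    label: str                    # detector class name, e.g. "dog"
    box: Tuple[int, int, int, int]  # (x, y, w, h) in pixels
    score: float                  # detector confidence in [0, 1]

@dataclass
class Proposal:
    frame_idx: int
    detection: Detection
    needs_review: bool            # True -> send to the human annotator

def propose_av_annotations(
    frames: List[Any],                           # decoded video frames
    sound_tags: List[str],                       # labels from a sound-tagging model
    detect: Callable[[Any], List[Detection]],    # any off-the-shelf object detector
    conf_threshold: float = 0.6,
    keyframe_stride: int = 10,                   # run detection on every Nth frame only
) -> List[Proposal]:
    """Split the multi-modal task into two single-modal passes:
    tag the audio once, detect objects on sparse keyframes, and keep
    only detections whose class matches a sound tag. Low-confidence
    matches are flagged so a human verifies them instead of
    annotating every frame from scratch."""
    proposals: List[Proposal] = []
    for idx in range(0, len(frames), keyframe_stride):
        for det in detect(frames[idx]):
            if det.label in sound_tags:  # cross-modal agreement check
                proposals.append(
                    Proposal(idx, det, needs_review=det.score < conf_threshold)
                )
    return proposals

Under these assumptions, the human workload shrinks along both axes the abstract names: keyframe_stride cuts the number of frames touched at all, and the agreement-plus-confidence filter reduces per-frame effort to verifying or correcting proposals rather than drawing annotations from scratch.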

