Large Scale Audiovisual Learning of Sounds with Weakly Labeled Data

05/29/2020
by Haytham M. Fayek, et al.

Recognizing sounds is a key aspect of computational audio scene analysis and machine perception. In this paper, we advocate that sound recognition is inherently a multi-modal audiovisual task, in that it is easier to differentiate sounds using both the audio and visual modalities than either one alone. We present an audiovisual fusion model that learns to recognize sounds from weakly labeled video recordings. The proposed fusion model uses an attention mechanism to dynamically combine the outputs of the individual audio and visual models. Experiments on the large-scale sound events dataset AudioSet demonstrate the efficacy of the proposed model, which outperforms single-modal models as well as state-of-the-art fusion and multi-modal models. We achieve a mean Average Precision (mAP) of 46.16 on AudioSet, outperforming the prior state of the art by approximately +4.35 mAP (relative: 10.4%).
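To make the attention-based fusion concrete, the sketch below shows one way per-modality predictions can be mixed with learned attention weights. This is a minimal PyTorch illustration under stated assumptions, not the authors' implementation: the embedding dimension, the linear classification heads, the shared attention layer, and the 527-class output size are placeholders chosen for the example.

```python
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    """Minimal sketch of attention-based audiovisual fusion.

    Each modality branch (a stand-in for the paper's audio/visual models)
    produces clip-level class probabilities from an embedding; a small
    attention head scores the two embeddings and the resulting softmax
    weights decide, per example, how much each modality contributes.
    """

    def __init__(self, embed_dim: int, num_classes: int = 527):
        super().__init__()
        # Per-modality classifiers (placeholders for the single-modal models).
        self.audio_head = nn.Linear(embed_dim, num_classes)
        self.visual_head = nn.Linear(embed_dim, num_classes)
        # Attention head: one score per modality embedding (assumed shared).
        self.attn = nn.Linear(embed_dim, 1)

    def forward(self, audio_emb: torch.Tensor, visual_emb: torch.Tensor):
        # Class probabilities from each single-modal branch (multi-label task).
        p_audio = torch.sigmoid(self.audio_head(audio_emb))     # (B, C)
        p_visual = torch.sigmoid(self.visual_head(visual_emb))  # (B, C)

        # Attention weights over the two modalities, computed per example.
        scores = torch.cat([self.attn(audio_emb), self.attn(visual_emb)], dim=1)  # (B, 2)
        weights = torch.softmax(scores, dim=1)                                    # (B, 2)

        # Dynamically weighted combination of the single-modal predictions.
        fused = weights[:, 0:1] * p_audio + weights[:, 1:2] * p_visual
        return fused, weights


# Usage with random embeddings (dimensions are illustrative only).
model = AttentionFusion(embed_dim=128)
audio_emb = torch.randn(4, 128)   # hypothetical audio clip embeddings
visual_emb = torch.randn(4, 128)  # hypothetical visual clip embeddings
probs, attn = model(audio_emb, visual_emb)
print(probs.shape, attn.shape)    # torch.Size([4, 527]) torch.Size([4, 2])
```

The key design point conveyed by the abstract is that the fusion weights are input-dependent rather than fixed, so the model can lean on audio when the visual stream is uninformative and vice versa; the sketch reflects that by computing the softmax weights from the embeddings of each example.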

Related research

09/16/2018 · Towards Good Practices for Multi-modal Fusion in Large-scale Video Classification
Leveraging both visual frames and audio has been experimentally proven e...

07/27/2023 · PEANUT: A Human-AI Collaborative Tool for Annotating Audio-Visual Data
Audio-visual learning seeks to enhance the computer's multi-modal percep...

11/29/2017 · Predicting Depression Severity by Multi-Modal Feature Engineering and Fusion
We present our preliminary work to determine if patient's vocal acoustic...

12/14/2021 · Multi-Modal Perception Attention Network with Self-Supervised Learning for Audio-Visual Speaker Tracking
Multi-modal fusion is proven to be an effective method to improve the ac...

07/28/2021 · Squeeze-Excitation Convolutional Recurrent Neural Networks for Audio-Visual Scene Classification
The use of multiple and semantically correlated sources can provide comp...

01/05/2023 · What You Say Is What You Show: Visual Narration Detection in Instructional Videos
Narrated "how-to" videos have emerged as a promising data source for a w...

09/26/2022 · Multi-encoder attention-based architectures for sound recognition with partial visual assistance
Large-scale sound recognition data sets typically consist of acoustic re...
