A Light-Weight Multimodal Framework for Improved Environmental Audio Tagging

12/27/2017
by Juncheng Li, et al.

The lack of strong labels has severely limited the scaling of state-of-the-art fully supervised audio tagging systems to larger datasets. Meanwhile, audio-visual learning models based on unlabeled videos have been successfully applied to audio tagging, but they are inevitably resource-hungry and take a long time to train. In this work, we propose a light-weight, multimodal framework for environmental audio tagging. The audio branch of the framework is a convolutional and recurrent neural network (CRNN) based on multiple instance learning (MIL), trained on the audio tracks of a large collection of weakly labeled YouTube video excerpts. The video branch uses pretrained state-of-the-art image recognition networks and word embeddings to extract information from the video track and to map visual objects to sound events. Experiments on the audio tagging task of the DCASE 2017 challenge show that incorporating video information improves a strong baseline audio tagging system by 5.3% absolute in terms of F_1 score. The entire system can be trained within 6 hours on a single GPU and can be easily carried over to other audio tasks such as speech sentiment analysis.
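
The two ideas named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say which MIL pooling function or which embedding similarity is used, so max pooling and cosine similarity are assumptions here, and all function names and toy values are hypothetical.

```python
import numpy as np


def mil_max_pool(frame_probs):
    """Aggregate frame-level event probabilities into clip-level ones.

    Standard MIL assumption: a clip (bag) contains an event if at least
    one of its frames (instances) does. Max pooling is an assumed choice.
    frame_probs has shape (n_frames, n_events).
    """
    return frame_probs.max(axis=0)


def cosine_similarity_matrix(a, b):
    """Pairwise cosine similarity between the rows of a and the rows of b."""
    a_n = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_n = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_n @ b_n.T


def objects_to_event_scores(object_probs, object_emb, event_emb):
    """Map visual object probabilities to sound event scores.

    Each object votes for each event with a weight equal to the
    (non-negative) cosine similarity of their word embeddings.
    This weighting scheme is an assumption made for illustration.
    """
    sim = cosine_similarity_matrix(object_emb, event_emb)
    sim = np.clip(sim, 0.0, None)  # ignore negatively related pairs
    return object_probs @ sim


# Toy audio branch output: 3 frames, 2 sound events.
frame_probs = np.array([[0.1, 0.8],
                        [0.3, 0.2],
                        [0.9, 0.1]])
clip_probs = mil_max_pool(frame_probs)

# Toy video branch output: 2 visual objects with 2-d "word embeddings".
object_probs = np.array([0.7, 0.2])
object_emb = np.array([[1.0, 0.0],
                       [0.0, 1.0]])
event_emb = np.array([[1.0, 0.0],
                      [0.0, 1.0]])
video_scores = objects_to_event_scores(object_probs, object_emb, event_emb)
```

In a real system the frame probabilities would come from the CRNN and the object probabilities from a pretrained image recognition network. How the two branches are combined is not described in the abstract and is left out of this sketch.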


research
12/27/2017

Multiple Instance Deep Learning for Weakly Supervised Audio Event Detection

State-of-the-art audio event detection (AED) systems rely on supervised ...
research
12/27/2017

Multiple Instance Deep Learning for Weakly Supervised Small-Footprint Audio Event Detection

State-of-the-art audio event detection (AED) systems rely on supervised ...
research
08/12/2018

Sample Mixed-Based Data Augmentation for Domestic Audio Tagging

Audio tagging has attracted increasing attention since last decade and h...
research
03/02/2019

Weakly Labelled AudioSet Tagging with Attention Neural Networks

Audio tagging is the task of predicting the presence or absence of sound...
research
10/30/2018

General audio tagging with ensembling convolutional neural network and statistical features

Audio tagging aims to infer descriptive labels from audio clips. Audio t...
research
07/26/2018

General-purpose Tagging of Freesound Audio with AudioSet Labels: Task Description, Dataset, and Baseline

This paper describes Task 2 of the DCASE 2018 Challenge, titled "General...
research
04/19/2022

Audio-Visual Wake Word Spotting System For MISP Challenge 2021

This paper presents the details of our system designed for the Task 1 of...
