Transferring Voice Knowledge for Acoustic Event Detection: An Empirical Study

10/07/2021
by Dawei Liang, et al.

Detecting common events and scenes from audio is useful for extracting and understanding human contexts in daily life. Prior studies have shown that leveraging knowledge from a relevant domain benefits a target acoustic event detection (AED) process. Inspired by the observation that many human-centered acoustic events in daily life involve voice elements, this paper investigates the potential of transferring high-level voice representations extracted from a public speaker dataset to enrich an AED pipeline. To this end, we develop a dual-branch neural network architecture for the joint learning of voice and acoustic features during an AED process, and we conduct thorough empirical studies to examine its performance on the public AudioSet dataset [1] with different types of inputs. Our main observations are: 1) joint learning of audio and voice inputs improves AED performance, measured as mean average precision (mAP), for both a CNN baseline (0.292 vs 0.134 mAP) and a TALNet [2] baseline (0.361 vs 0.351 mAP); 2) augmenting the extra voice features is critical to maximizing model performance with dual inputs.
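Since the results above are reported as mean average precision, a minimal sketch of how that metric can be computed for a multi-label task may help in reading the numbers. This is not the authors' evaluation code: the function names, the toy labels and scores, and the choice to score a class with no positive clips as 0.0 are all assumptions made for illustration.

```python
from typing import List, Sequence


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Average precision for one class: mean precision at the rank of each positive clip."""
    # Rank clips from the highest to the lowest predicted score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, idx in enumerate(order, start=1):
        if labels[idx] == 1:
            hits += 1
            precision_sum += hits / rank
    # Assumption: a class with no positive clips contributes 0.0.
    return precision_sum / hits if hits else 0.0


def mean_average_precision(
    label_matrix: Sequence[Sequence[int]],
    score_matrix: Sequence[Sequence[float]],
) -> float:
    """Mean of per-class average precision; row c holds the clips for class c."""
    aps: List[float] = [
        average_precision(labels, scores)
        for labels, scores in zip(label_matrix, score_matrix)
    ]
    return sum(aps) / len(aps)


# Hypothetical toy data: 2 event classes, 4 audio clips.
toy_labels = [[1, 0, 1, 0], [0, 1, 0, 0]]
toy_scores = [[0.9, 0.8, 0.7, 0.1], [0.2, 0.6, 0.9, 0.1]]
toy_map = mean_average_precision(toy_labels, toy_scores)
```

In this toy case the first class has positives at ranks 1 and 3 and the second class has its single positive at rank 2, so the per-class values are averaged into one score in the same way the mAP figures above summarize performance over all event classes.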

