Autonomous Soundscape Augmentation with Multimodal Fusion of Visual and Participant-linked Inputs

03/15/2023
by   Kenneth Ooi, et al.
0

Autonomous soundscape augmentation systems typically use trained models to pick optimal maskers to effect a desired perceptual change. While acoustic information is paramount to such systems, contextual information, including participant demographics and the visual environment, also influences acoustic perception. Hence, we propose modular modifications to an existing attention-based deep neural network, to allow early, mid-level, and late feature fusion of participant-linked, visual, and acoustic features. Ablation studies on module configurations and corresponding fusion methods using the ARAUS dataset show that contextual features improve the model performance in a statistically significant manner on the normalized ISO Pleasantness, to a mean squared error of 0.1194±0.0012 for the best-performing all-modality model, against 0.1217±0.0009 for the audio-only model. Soundscape augmentation systems can thereby leverage multimodal inputs for improved performance. We also investigate the impact of individual participant-linked factors using trained models to illustrate improvements in model explainability.

READ FULL TEXT
research
03/04/2023

The DKU Post-Challenge Audio-Visual Wake Word Spotting System for the 2021 MISP Challenge: Deep Analysis

This paper further explores our previous wake word spotting system ranke...
research
11/14/2020

On the Benefits of Early Fusion in Multimodal Representation Learning

Intelligently reasoning about the world often requires integrating data ...
research
06/29/2021

Alzheimer's Dementia Recognition Using Acoustic, Lexical, Disfluency and Speech Pause Features Robust to Noisy Inputs

We present two multimodal fusion-based deep learning models that consume...
research
08/03/2020

Multimodal Semi-supervised Learning Framework for Punctuation Prediction in Conversational Speech

In this work, we explore a multimodal semi-supervised learning approach ...
research
01/29/2020

Audio-Visual Decision Fusion for WFST-based and seq2seq Models

Under noisy conditions, speech recognition systems suffer from high Word...
research
04/19/2022

Audio-Visual Wake Word Spotting System For MISP Challenge 2021

This paper presents the details of our system designed for the Task 1 of...
research
08/24/2023

Attention-Based Acoustic Feature Fusion Network for Depression Detection

Depression, a common mental disorder, significantly influences individua...

Please sign up or login with your details

Forgot password? Click here to reset