Robustness of Neural Architectures for Audio Event Detection

by Juncheng B. Li, et al.

Traditionally, in the audio recognition pipeline, noise is suppressed by the "frontend", relying on preprocessing techniques such as speech enhancement. However, it is not guaranteed that noise will not cascade into downstream pipelines. To understand the actual influence of noise on the entire audio pipeline, in this paper we directly investigate the impact of noise on different types of neural models without the preprocessing step. We measure the recognition performance of 4 different neural network models on the task of environmental sound classification under 3 types of noise: occlusion (emulating intermittent noise), Gaussian noise (modeling continuous noise), and adversarial perturbations (the worst-case scenario). Our intuition is that the different ways in which these models process their input (e.g., CNNs have strong locality inductive biases, which Transformers do not) should lead to observable differences in performance and/or robustness, an understanding of which will enable further improvements. We perform extensive experiments on AudioSet, the largest weakly labeled sound event dataset available. We also seek to explain the behavior of the different models through output distribution changes and weight visualization.
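The three noise types described above can be illustrated with a minimal sketch. The function names, parameter values, and the FGSM-style adversarial step below are illustrative assumptions, not the paper's actual implementation; the input is a toy log-mel spectrogram array.

```python
import numpy as np

rng = np.random.default_rng(0)

def add_gaussian_noise(spec, snr_db=20.0):
    """Continuous noise: add white Gaussian noise at a target SNR in dB (illustrative)."""
    signal_power = np.mean(spec ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    return spec + rng.normal(0.0, np.sqrt(noise_power), spec.shape)

def add_occlusion(spec, max_width=16):
    """Intermittent noise: zero out a random contiguous block of time frames."""
    out = spec.copy()
    width = int(rng.integers(1, max_width + 1))
    start = int(rng.integers(0, max(1, spec.shape[-1] - width)))
    out[..., start:start + width] = 0.0
    return out

def fgsm_perturb(spec, grad, eps=0.01):
    """Worst case: a single FGSM-style step along the sign of a loss gradient
    (here a stand-in gradient; a real attack would backpropagate through the model)."""
    return spec + eps * np.sign(grad)

# Toy log-mel spectrogram: (mel_bins, time_frames)
spec = rng.normal(size=(64, 100))
noisy = add_gaussian_noise(spec)
occluded = add_occlusion(spec)
adv = fgsm_perturb(spec, grad=rng.normal(size=spec.shape))
```

Each corruption preserves the input shape, so the same model can be evaluated on clean and corrupted spectrograms side by side.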


Evaluating Robustness of the You Only Hear Once (YOHO) Algorithm on Noisy Audios in the VOICe Dataset

Sound event detection (SED) in machine listening entails identifying the...

Improving Speech Enhancement via Event-based Query

Existing deep learning based speech enhancement (SE) methods either use ...

Audio Attacks and Defenses against AED Systems – A Practical Study

Audio Event Detection (AED) Systems capture audio from the environment a...

Audiovisual Transformer Architectures for Large-Scale Classification and Synchronization of Weakly Labeled Audio Events

We tackle the task of environmental event classification by drawing insp...

Multi-Domain Processing via Hybrid Denoising Networks for Speech Enhancement

We present a hybrid framework that leverages the trade-off between tempo...

Audio-Visual Event Recognition through the lens of Adversary

As audio/visual classification models are widely deployed for sensitive ...
