DeepAI AI Chat
Log In Sign Up

A Study on Robustness to Perturbations for Representations of Environmental Sound

by   Sangeeta Srivastava, et al.

Audio applications involving environmental sound analysis increasingly use general-purpose audio representations, also known as embeddings, for transfer learning. Recently, Holistic Evaluation of Audio Representations (HEAR) evaluated twenty-nine embedding models on nineteen diverse tasks. However, the evaluation's effectiveness depends on the variation already captured within a given dataset. Therefore, for a given data domain, it is unclear how the representations would be affected by the variations caused by myriad microphones' range and acoustic conditions – commonly known as channel effects. We aim to extend HEAR to evaluate invariance to channel effects in this work. To accomplish this, we imitate channel effects by injecting perturbations to the audio signal and measure the shift in the new (perturbed) embeddings with three distance measures, making the evaluation domain-dependent but not task-dependent. Combined with the downstream performance, it helps us make a more informed prediction of how robust the embeddings are to the channel effects. We evaluate two embeddings – YAMNet, and OpenL^3 on monophonic (UrbanSound8K) and polyphonic (SONYC UST) datasets. We show that one distance measure does not suffice in such task-independent evaluation. Although Fréchet Audio Distance (FAD) correlates with the trend of the performance drop in the downstream task most accurately, we show that we need to study this in conjunction with the other distances to get a clear understanding of the overall effect of the perturbation. In terms of the embedding performance, we find OpenL^3 to be more robust to YAMNet, which aligns with the HEAR evaluation.


HEAR 2021: Holistic Evaluation of Audio Representations

What audio embedding approach generalizes best to a wide range of downst...

Towards Learning Universal Audio Representations

The ability to learn universal audio representations that can solve dive...

Contrastive Environmental Sound Representation Learning

Machine hearing of the environmental sound is one of the important issue...

BYOL for Audio: Exploring Pre-trained General-purpose Audio Representations

Pre-trained models are essential as feature extractors in modern machine...

Listening to the World Improves Speech Command Recognition

We study transfer learning in convolutional network architectures applie...

A General Framework for Learning Procedural Audio Models of Environmental Sounds

This paper introduces the Procedural (audio) Variational autoEncoder (Pr...

HCU400: An Annotated Dataset for Exploring Aural Phenomenology Through Causal Uncertainty

The way we perceive a sound depends on many aspects-- its ecological fre...