Audio-Visual Event Recognition through the lens of Adversary

by   Juncheng B. Li, et al.

As audio/visual classification models are widely deployed for sensitive tasks like content filtering at scale, it is critical to understand their robustness along with improving the accuracy. This work aims to study several key questions related to multimodal learning through the lens of adversarial noises: 1) The trade-off between early/middle/late fusion affecting its robustness and accuracy 2) How do different frequency/time domain features contribute to the robustness? 3) How do different neural modules contribute to the adversarial noise? In our experiment, we construct adversarial examples to attack state-of-the-art neural models trained on Google AudioSet. We compare how much attack potency in terms of adversarial perturbation of size ϵ using different L_p norms we would need to "deactivate" the victim model. Using adversarial noise to ablate multimodal models, we are able to provide insights into what is the best potential fusion strategy to balance the model parameters/accuracy and robustness trade-off and distinguish the robust features versus the non-robust features that various neural networks model tend to learn.


page 1

page 2

page 3

page 4


On Adversarial Robustness of Large-scale Audio Visual Learning

As audio-visual systems are being deployed for safety-critical tasks suc...

Understanding and Measuring Robustness of Multimodal Learning

The modern digital world is increasingly becoming multimodal. Although m...

Can audio-visual integration strengthen robustness under multimodal attacks?

In this paper, we propose to make a systematic study on machines multise...

Audio Attacks and Defenses against AED Systems – A Practical Study

Audio Event Detection (AED) Systems capture audio from the environment a...

Adversarial Feature Stacking for Accurate and Robust Predictions

Deep Neural Networks (DNNs) have achieved remarkable performance on a va...

Towards Adversarial Attack on Vision-Language Pre-training Models

While vision-language pre-training model (VLP) has shown revolutionary i...

Robustness of Neural Architectures for Audio Event Detection

Traditionally, in Audio Recognition pipeline, noise is suppressed by the...