Can audio-visual integration strengthen robustness under multimodal attacks?

04/05/2021
by   Yapeng Tian, et al.

In this paper, we present a systematic study of machine multisensory perception under attacks. We use audio-visual event recognition under multimodal adversarial attacks as a proxy for investigating the robustness of audio-visual learning. We attack the audio modality, the visual modality, and both together to explore whether audio-visual integration still strengthens perception and how different fusion mechanisms affect the robustness of audio-visual models. To interpret the multimodal interactions under attack, we learn a weakly-supervised sound source visual localization model that localizes sounding regions in videos. To mitigate multimodal attacks, we propose an audio-visual defense approach based on an audio-visual dissimilarity constraint and external feature memory banks. Extensive experiments demonstrate that audio-visual models are susceptible to multimodal adversarial attacks; that audio-visual integration can decrease model robustness rather than strengthen it under multimodal attacks; that even a weakly-supervised sound source visual localization model can be successfully fooled; and that our defense method can improve the robustness of audio-visual networks without significantly sacrificing performance on clean inputs.
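To make the three attack settings above concrete (audio only, visual only, both), here is a minimal sketch on a toy model. It is not the paper's implementation: the abstract does not name an attack algorithm, so the one-step fast gradient sign method (FGSM) is assumed here. The linear late-fusion classifier, the weights, and the function names `fused_logit`, `prob`, and `fgsm` are all hypothetical and chosen only for illustration.

```python
import math


def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def fused_logit(audio, visual, w_a, w_v, bias=0.0):
    """Toy late fusion (an assumption): sum of per-modality linear scores."""
    audio_score = sum(w * x for w, x in zip(w_a, audio))
    visual_score = sum(w * x for w, x in zip(w_v, visual))
    return audio_score + visual_score + bias


def prob(logit):
    """Sigmoid: probability of the positive class."""
    return 1.0 / (1.0 + math.exp(-logit))


def fgsm(audio, visual, label, w_a, w_v, eps, attack=("audio", "visual")):
    """One FGSM step on the chosen modalities (assumed attack, not the paper's).

    For a linear-logistic model with cross-entropy loss, the gradient of the
    loss with respect to an input feature is (p - label) * weight, so each
    attacked feature moves by eps in the direction of that gradient's sign.
    """
    p = prob(fused_logit(audio, visual, w_a, w_v))
    g = p - label
    adv_audio, adv_visual = list(audio), list(visual)
    if "audio" in attack:
        adv_audio = [x + eps * sign(g * w) for x, w in zip(audio, w_a)]
    if "visual" in attack:
        adv_visual = [x + eps * sign(g * w) for x, w in zip(visual, w_v)]
    return adv_audio, adv_visual


# Hypothetical weights and inputs for a positive-class example.
w_a, w_v = [1.0, -2.0], [0.5, 1.5]
audio, visual = [0.2, -0.1], [0.4, 0.3]

for setting in (("audio",), ("visual",), ("audio", "visual")):
    adv_a, adv_v = fgsm(audio, visual, 1, w_a, w_v, eps=0.1, attack=setting)
    print(setting, round(prob(fused_logit(adv_a, adv_v, w_a, w_v)), 3))
```

In this toy setting the joint attack lowers the correct-class probability more than either single-modality attack. That follows from the linear model, where the two perturbations simply add up. It illustrates the setup only and says nothing about the paper's reported results.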


