Where and When: Space-Time Attention for Audio-Visual Explanations

05/04/2021
by Yanbei Chen, et al.

Explaining the decision of a multi-modal decision-maker requires determining the evidence from both modalities. Recent advances in XAI provide explanations for models trained on still images. However, when it comes to modeling multiple sensory modalities in a dynamic world, how to demystify the dynamics of a complex multi-modal model remains underexplored. In this work, we take a crucial step forward and explore learnable explanations for audio-visual recognition. Specifically, we propose a novel space-time attention network that uncovers the synergistic dynamics of audio and visual data over both space and time. Our model predicts audio-visual video events while justifying its decision by localizing where the relevant visual cues appear and when the predicted sounds occur in the video. We benchmark our model on three audio-visual video event datasets, comparing extensively against several recent multi-modal representation learners and intrinsic explanation models. Experimental results demonstrate that our model clearly outperforms existing methods on audio-visual video event recognition. Moreover, we conduct an in-depth study of our model's explainability, based on robustness analysis via perturbation tests and on pointing games using human annotations.
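The abstract does not detail the architecture, so the following is only a minimal, hypothetical PyTorch sketch of how space-time attention over audio and visual features could be wired: a spatial attention branch scores where relevant visual cues appear in each frame, and a temporal attention branch weights when the predicted sound occurs. All module names, feature dimensions, and the fusion scheme are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of space-time attention for audio-visual event recognition.
# Shapes, module names, and the fusion scheme are assumptions for illustration,
# not the architecture proposed in the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SpaceTimeAttention(nn.Module):
    def __init__(self, visual_dim=512, audio_dim=128, hidden_dim=256, num_classes=28):
        super().__init__()
        # Project both modalities into a shared embedding space.
        self.v_proj = nn.Linear(visual_dim, hidden_dim)
        self.a_proj = nn.Linear(audio_dim, hidden_dim)
        # "Where": score each spatial location against the audio-conditioned query.
        self.space_score = nn.Linear(hidden_dim, 1)
        # "When": score each time step of the fused audio-visual stream.
        self.time_score = nn.Linear(hidden_dim, 1)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, visual_feats, audio_feats):
        # visual_feats: (B, T, H*W, visual_dim) per-frame spatial feature maps
        # audio_feats:  (B, T, audio_dim)       per-segment audio features
        v = self.v_proj(visual_feats)                                    # (B, T, HW, D)
        a = self.a_proj(audio_feats)                                     # (B, T, D)

        # Spatial attention over locations, conditioned on the audio of that segment.
        space_logits = self.space_score(torch.tanh(v + a.unsqueeze(2)))  # (B, T, HW, 1)
        space_attn = F.softmax(space_logits, dim=2)                      # spatial heatmap
        v_pooled = (space_attn * v).sum(dim=2)                           # (B, T, D)

        # Temporal attention over the fused audio-visual sequence.
        fused = v_pooled + a                                             # (B, T, D)
        time_logits = self.time_score(torch.tanh(fused))                 # (B, T, 1)
        time_attn = F.softmax(time_logits, dim=1)                        # temporal weights
        clip_repr = (time_attn * fused).sum(dim=1)                       # (B, D)

        logits = self.classifier(clip_repr)
        # space_attn localizes where cues appear; time_attn indicates when sounds occur.
        return logits, space_attn.squeeze(-1), time_attn.squeeze(-1)
```

Under these assumptions, the returned spatial and temporal attention maps are what an explainability analysis would probe: they can be compared against human-annotated regions in a pointing game, or the most-attended regions can be masked in a perturbation test to check whether the prediction degrades accordingly.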

Related research

12/14/2021 · Multi-Modal Perception Attention Network with Self-Supervised Learning for Audio-Visual Speaker Tracking
Multi-modal fusion is proven to be an effective method to improve the ac...

10/13/2022 · Sparse in Space and Time: Audio-visual Synchronisation with Trainable Selectors
The objective of this paper is audio-visual synchronisation of general v...

04/15/2018 · Watch, Listen, and Describe: Globally and Locally Aligned Cross-Modal Attentions for Video Captioning
A major challenge for video captioning is to combine audio and visual cu...

11/10/2021 · Space-Time Memory Network for Sounding Object Localization in Videos
Leveraging temporal synchronization and association within sight and sou...

09/12/2023 · DF-TransFusion: Multimodal Deepfake Detection via Lip-Audio Cross-Attention and Facial Self-Attention
With the rise in manipulated media, deepfake detection has become an imp...

07/27/2023 · PEANUT: A Human-AI Collaborative Tool for Annotating Audio-Visual Data
Audio-visual learning seeks to enhance the computer's multi-modal percep...

02/25/2022 · Human Detection of Political Deepfakes across Transcripts, Audio, and Video
Recent advances in technology for hyper-realistic visual effects provoke...
