Multi-level Attention Fusion Network for Audio-visual Event Recognition

06/12/2021
by   Mathilde Brousmiche, et al.
0

Event classification is inherently sequential and multimodal. Therefore, deep neural models need to dynamically focus on the most relevant time window and/or modality of a video. In this study, we propose the Multi-level Attention Fusion network (MAFnet), an architecture that can dynamically fuse visual and audio information for event recognition. Inspired by prior studies in neuroscience, we couple both modalities at different levels of visual and audio paths. Furthermore, the network dynamically highlights a modality at a given time window relevant to classify events. Experimental results in AVE (Audio-Visual Event), UCF51, and Kinetics-Sounds datasets show that the approach can effectively improve the accuracy in audio-visual event classification. Code is available at: https://github.com/numediart/MAFnet

READ FULL TEXT

page 3

page 6

research
03/23/2018

Audio-Visual Event Localization in Unconstrained Videos

In this paper, we introduce a novel problem of audio-visual event locali...
research
01/23/2020

Audiovisual SlowFast Networks for Video Recognition

We present Audiovisual SlowFast Networks, an architecture for integrated...
research
02/12/2022

Audio-Visual Fusion Layers for Event Type Aware Video Recognition

Human brain is continuously inundated with the multisensory information ...
research
07/11/2020

Look and Listen: A Multi-modality Late Fusion Approach to Scene Classification for Autonomous Machines

The novelty of this study consists in a multi-modality approach to scene...
research
08/03/2022

Estimating Visual Information From Audio Through Manifold Learning

We propose a new framework for extracting visual information about a sce...
research
05/27/2023

Modality-Independent Teachers Meet Weakly-Supervised Audio-Visual Event Parser

Audio-visual learning has been a major pillar of multi-modal machine lea...
research
07/02/2021

Continuous Emotion Recognition with Audio-visual Leader-follower Attentive Fusion

We propose an audio-visual spatial-temporal deep neural network with: (1...

Please sign up or login with your details

Forgot password? Click here to reset