Audio-Visual Fusion Layers for Event Type Aware Video Recognition

02/12/2022
by   Arda Senocak, et al.

The human brain is continuously inundated with multisensory information from the outside world and the complex interactions among its streams. Our brains automatically analyze such information by binding or segregating it. While this task may seem effortless for humans, building a machine that performs comparably is extremely challenging, since these complex interactions cannot be handled by a single type of integration but require more sophisticated approaches. In this paper, we propose a new model that addresses the multisensory integration problem with individual event-specific layers in a multi-task learning scheme. Unlike previous works that use a single type of fusion, we design event-specific layers to handle different audio-visual relationship tasks, enabling different forms of audio-visual fusion. Experimental results show that our event-specific layers can discover unique properties of the audio-visual relationships in videos. Moreover, although our network is trained with single labels, it can output additional true multi-labels to describe the given videos. We also demonstrate that the proposed framework exposes the modality bias of video data in a category-wise and dataset-wise manner on popular benchmark datasets.
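The core idea — routing an audio-visual feature pair through a fusion layer chosen per event type, rather than one fixed fusion for all events — can be sketched as below. This is a minimal illustration, not the authors' implementation; the fusion functions, event names, and routing table are all assumptions made for clarity.

```python
import numpy as np

# Hypothetical event-specific fusion layers. Each event type is paired with
# its own way of combining audio and visual features, mirroring the paper's
# idea that no single integration type fits all audio-visual relationships.

def fuse_sum(audio, visual):
    # Additive fusion: treats the two modalities symmetrically.
    return audio + visual

def fuse_concat(audio, visual):
    # Concatenation fusion: preserves modality-specific information.
    return np.concatenate([audio, visual])

def fuse_gated(audio, visual):
    # Gated fusion: a sigmoid gate from the audio stream modulates
    # the visual features.
    gate = 1.0 / (1.0 + np.exp(-audio))
    return gate * visual

# Assumed routing table: event type -> its dedicated fusion layer.
EVENT_FUSIONS = {
    "speech": fuse_gated,
    "music": fuse_sum,
    "ambient": fuse_concat,
}

def event_specific_fusion(event_type, audio_feat, visual_feat):
    """Route the feature pair through the layer matching the event type."""
    return EVENT_FUSIONS[event_type](audio_feat, visual_feat)

audio = np.ones(4)
visual = np.full(4, 2.0)
fused = event_specific_fusion("music", audio, visual)      # elementwise sum
stacked = event_specific_fusion("ambient", audio, visual)  # concatenated, length 8
```

In a trained model, the per-event layers would be learned modules and the routing could be soft (weighted over all layers) rather than a hard dictionary lookup; the sketch only shows the routing structure.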


