Hear Me Out: Fusional Approaches for Audio Augmented Temporal Action Localization

06/27/2021
by   Anurag Bagchi, et al.

State-of-the-art architectures for Temporal Action Localization (TAL) in untrimmed video have considered only the RGB and Flow modalities, leaving the information-rich audio modality entirely unexploited. Audio fusion has been explored for the related, but arguably easier, problem of trimmed (clip-level) action recognition; TAL, however, poses a unique set of challenges. In this paper, we propose simple but effective fusion-based approaches for TAL. To the best of our knowledge, our work is the first to jointly consider audio and video modalities for supervised TAL. We show experimentally that our schemes consistently improve performance for state-of-the-art video-only TAL approaches. In particular, they achieve new state-of-the-art performance on the large-scale benchmark datasets ActivityNet-1.3 (54.34 mAP@0.5) and THUMOS14 (57.18 mAP@0.5). Our experiments include ablations over multiple fusion schemes, modality combinations and TAL architectures. Our code, models and associated data are available at https://github.com/skelemoa/tal-hmo.
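To make the idea of audio-visual fusion concrete, here is a minimal sketch of the two canonical fusion families such approaches draw on: feature-level (early) fusion by concatenation, and score-level (late) fusion by weighted averaging. The function names, feature dimensions, and fusion weight below are illustrative assumptions, not the authors' actual pipeline.

```python
import numpy as np

def early_fusion(video_feats, audio_feats):
    """Concatenate per-snippet video and audio features along the
    channel axis before feeding them to a TAL head.
    (Hypothetical helper; shapes are assumptions.)"""
    return np.concatenate([video_feats, audio_feats], axis=-1)

def late_fusion(video_scores, audio_scores, w=0.5):
    """Weighted average of per-snippet scores produced by separate
    video-only and audio-only branches. w is an assumed mixing weight."""
    return w * video_scores + (1.0 - w) * audio_scores

# Toy example: 100 temporal snippets, 1024-d video and 128-d audio features.
v = np.random.rand(100, 1024)
a = np.random.rand(100, 128)
fused = early_fusion(v, a)  # shape (100, 1152)
```

In practice, which scheme works best depends on the downstream TAL architecture, which is precisely the kind of question the paper's fusion ablations address.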

