MAiVAR-T: Multimodal Audio-image and Video Action Recognizer using Transformers

08/01/2023
by   Muhammad Bilal Shaikh, et al.
0

In line with the human capacity to perceive the world by simultaneously processing and integrating high-dimensional inputs from multiple modalities like vision and audio, we propose a novel model, MAiVAR-T (Multimodal Audio-Image to Video Action Recognition Transformer). This model employs an intuitive approach for the combination of audio-image and video modalities, with a primary aim to escalate the effectiveness of multimodal human action recognition (MHAR). At the core of MAiVAR-T lies the significance of distilling substantial representations from the audio modality and transmuting these into the image domain. Subsequently, this audio-image depiction is fused with the video modality to formulate a unified representation. This concerted approach strives to exploit the contextual richness inherent in both audio and video modalities, thereby promoting action recognition. In contrast to existing state-of-the-art strategies that focus solely on audio or video modalities, MAiVAR-T demonstrates superior performance. Our extensive empirical evaluations conducted on a benchmark action recognition dataset corroborate the model's remarkable performance. This underscores the potential enhancements derived from integrating audio and video modalities for action recognition purposes.

READ FULL TEXT

page 1

page 3

page 4

page 5

research
09/11/2022

MAiVAR: Multimodal Audio-Image and Video Action Recognizer

Currently, action recognition is predominately performed on video data a...
research
03/06/2022

Learnable Irrelevant Modality Dropout for Multimodal Action Recognition on Modality-Specific Annotated Videos

With the assumption that a video dataset is multimodality annotated in w...
research
06/20/2022

M M Mix: A Multimodal Multiview Transformer Ensemble

This report describes the approach behind our winning solution to the 20...
research
11/25/2021

PolyViT: Co-training Vision Transformers on Images, Videos and Audio

Can we train a single transformer model capable of processing multiple m...
research
07/06/2023

Read, Look or Listen? What's Needed for Solving a Multimodal Dataset

The prevalence of large-scale multimodal datasets presents unique challe...
research
03/04/2021

Perceiver: General Perception with Iterative Attention

Biological systems understand the world by simultaneously processing hig...
research
09/10/2023

Unified Contrastive Fusion Transformer for Multimodal Human Action Recognition

Various types of sensors have been considered to develop human action re...

Please sign up or login with your details

Forgot password? Click here to reset