Unified Contrastive Fusion Transformer for Multimodal Human Action Recognition

09/10/2023
by Kyoung Ok Yang, et al.

Various types of sensors have been considered for developing human action recognition (HAR) models, and robust HAR performance can be achieved by fusing the multimodal data they acquire. In this paper, we introduce a new multimodal fusion architecture, referred to as the Unified Contrastive Fusion Transformer (UCFFormer), designed to integrate data with diverse distributions and thereby enhance HAR performance. Based on the embedding features extracted from each modality, UCFFormer employs a Unified Transformer to capture the inter-dependencies among embeddings in both the time and modality domains. We present Factorized Time-Modality Attention, which performs self-attention efficiently within the Unified Transformer. UCFFormer also incorporates contrastive learning to reduce the discrepancy in feature distributions across modalities, producing semantically aligned features for information fusion. Performance evaluation on two popular datasets, UTD-MHAD and NTU RGB+D, demonstrates that UCFFormer achieves state-of-the-art performance, outperforming competing methods by considerable margins.
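To make the two key ideas above concrete, here is a minimal sketch, not the authors' released code: a PyTorch mock-up of (i) factorized attention that attends along the time axis and then along the modality axis, rather than jointly over all T x M tokens, and (ii) a symmetric InfoNCE-style contrastive term that aligns matching samples across two modalities. The tensor layout (B, T, M, D), all class and function names, and the specific InfoNCE form are assumptions made for illustration only.

    # Sketch under stated assumptions; not the authors' implementation.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class FactorizedTimeModalityAttention(nn.Module):
        # Attends over the time axis first, then the modality axis, instead
        # of joint attention over all T*M tokens, reducing the attention
        # cost from O((T*M)^2) to O(T^2 * M + M^2 * T).
        def __init__(self, dim, num_heads=4):
            super().__init__()
            self.time_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            self.mod_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            self.norm_t = nn.LayerNorm(dim)
            self.norm_m = nn.LayerNorm(dim)

        def forward(self, x):  # x: (B, T, M, D) = batch, time, modality, feature
            B, T, M, D = x.shape
            # Time attention: fold modalities into the batch, attend over T.
            xt = x.permute(0, 2, 1, 3).reshape(B * M, T, D)
            h = self.norm_t(xt)
            xt = xt + self.time_attn(h, h, h, need_weights=False)[0]
            # Modality attention: fold time into the batch, attend over M.
            xm = xt.reshape(B, M, T, D).permute(0, 2, 1, 3).reshape(B * T, M, D)
            h = self.norm_m(xm)
            xm = xm + self.mod_attn(h, h, h, need_weights=False)[0]
            return xm.reshape(B, T, M, D)

    def cross_modal_nce(z_a, z_b, tau=0.07):
        # Symmetric InfoNCE: the i-th sample of modality A and modality B
        # are positives; every other pairing in the batch is a negative.
        z_a, z_b = F.normalize(z_a, dim=-1), F.normalize(z_b, dim=-1)
        logits = z_a @ z_b.t() / tau
        target = torch.arange(z_a.size(0), device=z_a.device)
        return 0.5 * (F.cross_entropy(logits, target)
                      + F.cross_entropy(logits.t(), target))

    # Toy usage: 2 clips, 16 time steps, 3 modalities, 64-dim embeddings.
    x = torch.randn(2, 16, 3, 64)
    y = FactorizedTimeModalityAttention(64)(x)  # (2, 16, 3, 64)
    loss = cross_modal_nce(y[:, :, 0].mean(1), y[:, :, 1].mean(1))

Because each attention pass sees a short sequence (length T or M instead of T*M), the factorization keeps self-attention in the Unified Transformer tractable as frames or modalities are added, while the contrastive term nudges per-modality embeddings toward a shared semantic space before fusion.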

Related research

01/05/2021 · Trear: Transformer-based RGB-D Egocentric Action Recognition
08/22/2020 · Multidomain Multimodal Fusion for Human Action Recognition Using Inertial Sensors
08/01/2023 · MAiVAR-T: Multimodal Audio-Image and Video Action Recognizer Using Transformers
09/12/2022 · SANCL: Multimodal Review Helpfulness Prediction with Selective Attention and Natural Contrastive Learning
07/31/2023 · DCTM: Dilated Convolutional Transformer Model for Multimodal Engagement Estimation in Conversation
06/20/2022 · M&M Mix: A Multimodal Multiview Transformer Ensemble
08/21/2022 · CMSBERT-CLR: Context-driven Modality Shifting BERT with Contrastive Learning for Linguistic, Visual, Acoustic Representations
