Multimodal Transformer for Unaligned Multimodal Language Sequences

06/01/2019
by   Yao-Hung Hubert Tsai, et al.
0

Human language is often multimodal, which comprehends a mixture of natural language, facial gestures, and acoustic behaviors. However, two major challenges in modeling such multimodal human language time-series data exist: 1) inherent data non-alignment due to variable sampling rates for the sequences from each modality; and 2) long-range dependencies between elements across modalities. In this paper, we introduce the Multimodal Transformer (MulT) to generically address the above issues in an end-to-end manner without explicitly aligning the data. At the heart of our model is the directional pairwise crossmodal attention, which attends to interactions between multimodal sequences across distinct time steps and latently adapt streams from one modality to another. Comprehensive experiments on both aligned and non-aligned multimodal time-series show that our model outperforms state-of-the-art methods by a large margin. In addition, empirical analysis suggests that correlated crossmodal signals are able to be captured by the proposed crossmodal attention mechanism in MulT.

READ FULL TEXT
research
12/03/2021

LMR-CBT: Learning Modality-fused Representations with CB-Transformer for Multimodal Emotion Recognition from Unaligned Multimodal Sequences

Learning modality-fused representations and processing unaligned multimo...
research
06/16/2022

Multi-scale Cooperative Multimodal Transformers for Multimodal Sentiment Analysis in Videos

Multimodal sentiment analysis in videos is a key task in many real-world...
research
11/22/2019

Factorized Multimodal Transformer for Multimodal Sequential Learning

The complex world around us is inherently multimodal and sequential (con...
research
07/26/2022

Multimodal Speech Emotion Recognition using Cross Attention with Aligned Audio and Text

In this paper, we propose a novel speech emotion recognition model calle...
research
08/17/2021

Graph Capsule Aggregation for Unaligned Multimodal Sequences

Humans express their opinions and emotions through multiple modalities w...
research
01/29/2023

Global Flood Prediction: a Multimodal Machine Learning Approach

Flooding is one of the most destructive and costly natural disasters, an...
research
12/24/2020

Detecting Hateful Memes Using a Multimodal Deep Ensemble

While significant progress has been made using machine learning algorith...

Please sign up or login with your details

Forgot password? Click here to reset