Analyzing Unaligned Multimodal Sequence via Graph Convolution and Graph Pooling Fusion

11/27/2020
by Sijie Mai, et al.

In this paper, we study the task of multimodal sequence analysis, which aims to draw inferences from visual, language and acoustic sequences. A majority of existing works focus on fusing aligned modality sequences, mostly at the word level, which is impractical in real-world scenarios where alignment is rarely available. We therefore address multimodal sequence analysis on unaligned modality sequences, a setting that remains relatively underexplored and is also more challenging. Recurrent neural networks (RNNs) and their variants are widely used in multimodal sequence analysis, but they are susceptible to vanishing/exploding gradients and high time complexity due to their recurrent nature. We propose a novel model, termed Multimodal Graph, to investigate the effectiveness of graph neural networks (GNNs) on modeling multimodal sequential data. The graph-based structure enables parallel computation along the time dimension and can capture longer temporal dependencies in long unaligned sequences. Specifically, Multimodal Graph is hierarchically structured in two stages, i.e., intra- and inter-modal dynamics learning. In the first stage, a graph convolutional network is employed for each modality to learn intra-modal dynamics. In the second stage, because the multimodal sequences are unaligned, the commonly used word-level fusion does not apply; we therefore devise a graph pooling fusion network that automatically learns the associations between nodes from different modalities. Additionally, we define multiple ways to construct the adjacency matrix for sequential data. Experimental results show that our graph-based model achieves state-of-the-art performance on two benchmark datasets.
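Since the abstract only sketches the architecture, the following NumPy snippet is a minimal, illustrative stand-in rather than the authors' implementation: a banded temporal adjacency (one hypothetical choice among the multiple constructions the paper mentions), a standard GCN layer applied per modality for stage one, and simple mean pooling plus concatenation as a placeholder for the graph pooling fusion network of stage two. All names, sizes, and the fusion step are assumptions.

```python
import numpy as np

def temporal_adjacency(T, window=2):
    # Connect each time step to neighbours within `window` steps,
    # plus self-loops. The paper defines several constructions;
    # this banded variant is only one illustrative assumption.
    A = np.eye(T)
    for i in range(T):
        lo, hi = max(0, i - window), min(T, i + window + 1)
        A[i, lo:hi] = 1.0
    return A

def gcn_layer(A, H, W):
    # Standard GCN propagation (Kipf & Welling, 2017):
    # symmetrically normalise A, then apply a linear map and ReLU.
    d_inv_sqrt = np.diag(A.sum(axis=1) ** -0.5)
    A_hat = d_inv_sqrt @ A @ d_inv_sqrt
    return np.maximum(A_hat @ H @ W, 0.0)

rng = np.random.default_rng(0)

# Unaligned inputs: each modality has its own length and feature size.
seqs = {
    "language": rng.normal(size=(20, 300)),
    "visual": rng.normal(size=(50, 35)),
    "acoustic": rng.normal(size=(80, 74)),
}

hidden = 32  # hypothetical shared hidden size
pooled = []
for name, X in seqs.items():
    A = temporal_adjacency(X.shape[0])
    W = rng.normal(scale=0.1, size=(X.shape[1], hidden))
    H = gcn_layer(A, X, W)         # stage 1: intra-modal dynamics
    pooled.append(H.mean(axis=0))  # crude per-modality node pooling

# Stage 2 stand-in: the paper's graph pooling fusion network learns
# associations between nodes across modalities; plain concatenation
# of pooled summaries is only a placeholder for that step.
fused = np.concatenate(pooled)
print(fused.shape)  # (96,)
```

Note that, unlike an RNN, the GCN layer processes all time steps in a single matrix multiplication, which is the source of the parallelism along the time dimension that the abstract highlights.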

Related research

08/17/2021
Graph Capsule Aggregation for Unaligned Multimodal Sequences
Humans express their opinions and emotions through multiple modalities w...

08/12/2018
Multimodal Language Analysis with Recurrent Multistage Fusion
Computational modeling of human multimodal language is an emerging resea...

10/22/2020
MTGAT: Multimodal Temporal Graph Attention Networks for Unaligned Human Multimodal Language Sequences
Human communication is multimodal in nature; it is through multiple moda...

09/17/2016
GeThR-Net: A Generalized Temporally Hybrid Recurrent Neural Network for Multimodal Information Fusion
Data generated from real world events are usually temporal and contain m...

11/26/2021
Geometric Multimodal Deep Learning with Multi-Scaled Graph Wavelet Convolutional Network
Multimodal data provide complementary information of a natural phenomeno...

10/16/2020
Deep-HOSeq: Deep Higher Order Sequence Fusion for Multimodal Sentiment Analysis
Multimodal sentiment analysis utilizes multiple heterogeneous modalities...

08/30/2023
Adaptive Multi-Modalities Fusion in Sequential Recommendation Systems
In sequential recommendation, multi-modal information (e.g., text or ima...
