Deep Multimodal Representation Learning from Temporal Data

04/11/2017
by Xitong Yang, et al.

In recent years, Deep Learning has been successfully applied to multimodal learning problems, with the aim of learning useful joint representations in data fusion applications. When the available modalities consist of time series data such as video, audio and sensor signals, it becomes imperative to consider their temporal structure during the fusion process. In this paper, we propose the Correlational Recurrent Neural Network (CorrRNN), a novel temporal fusion model for fusing multiple input modalities that are inherently temporal in nature. Key features of our proposed model include: (i) simultaneous learning of the joint representation and temporal dependencies between modalities, (ii) use of multiple loss terms in the objective function, including a maximum correlation loss term to enhance learning of cross-modal information, and (iii) the use of an attention model to dynamically adjust the contribution of different input modalities to the joint representation. We validate our model via experimentation on two different tasks: video- and sensor-based activity classification, and audio-visual speech recognition. We empirically analyze the contributions of different components of the proposed CorrRNN model, and demonstrate its robustness, effectiveness and state-of-the-art performance on multiple datasets.
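The abstract's maximum-correlation loss term rewards embeddings of different modalities that co-vary. As a hedged illustration only (the paper's exact formulation may differ, and the function name `correlation_loss` and its batch-of-embeddings interface are assumptions, not taken from the paper), one common way to realize such a term is the negative mean per-dimension Pearson correlation between two modality embeddings:

```python
import numpy as np

def correlation_loss(x, y, eps=1e-8):
    """Negative mean per-dimension Pearson correlation between two
    modality embeddings of shape (batch, dim).

    Hypothetical sketch of a maximum-correlation loss term:
    maximizing cross-modal correlation is implemented as
    minimizing its negative, so a perfectly correlated pair
    yields a loss of -1 per dimension.
    """
    # Center each embedding dimension over the batch
    xc = x - x.mean(axis=0)
    yc = y - y.mean(axis=0)
    # Per-dimension covariance (numerator) and std product (denominator)
    num = (xc * yc).sum(axis=0)
    den = np.sqrt((xc ** 2).sum(axis=0) * (yc ** 2).sum(axis=0)) + eps
    return -np.mean(num / den)
```

In a temporal fusion model, a term like this would be added to the reconstruction losses so that the recurrent encoders for each modality are pushed toward a correlated shared space; affinely related embeddings (e.g. `y = 2*x + 1`) already achieve the minimum loss of -1, since Pearson correlation is invariant to per-dimension scale and shift.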


research
10/08/2018

Dense Multimodal Fusion for Hierarchically Joint Representation

Multiple modalities can provide more valuable information than single on...
research
09/17/2016

GeThR-Net: A Generalized Temporally Hybrid Recurrent Neural Network for Multimodal Information Fusion

Data generated from real world events are usually temporal and contain m...
research
12/22/2021

Multimodal Personality Recognition using Cross-Attention Transformer and Behaviour Encoding

Personality computing and affective computing have gained recent interes...
research
04/23/2019

Latent Variable Algorithms for Multimodal Learning and Sensor Fusion

Multimodal learning has been lacking principled ways of combining inform...
research
03/26/2015

Generalized K-fan Multimodal Deep Model with Shared Representations

Multimodal learning with deep Boltzmann machines (DBMs) is a generative...
research
02/16/2022

Cross-Modal Common Representation Learning with Triplet Loss Functions

Common representation learning (CRL) learns a shared embedding between t...
research
04/05/2020

Deep Multimodal Feature Encoding for Video Ordering

True understanding of videos comes from a joint analysis of all its moda...
