Dense Multimodal Fusion for Hierarchically Joint Representation

10/08/2018
by Di Hu, et al.

Multiple modalities can provide more valuable information than a single one by describing the same content in different ways. Hence, it is desirable to learn an effective joint representation by fusing the features of different modalities. However, previous methods mainly focus on fusing shallow features or the high-level representations generated by unimodal deep networks, which capture only part of the hierarchical correlations across modalities. In this paper, we propose to densely integrate the representations by greedily stacking multiple shared layers between different modality-specific networks, an approach we name Dense Multimodal Fusion (DMF). The joint representations in different shared layers capture correlations at different levels, and the connections between shared layers also provide an efficient way to learn the dependence among hierarchical correlations. These two properties jointly contribute to the multiple learning paths in DMF, which results in faster convergence, lower training loss, and better performance. We evaluate our model on three typical multimodal learning tasks: audiovisual speech recognition, cross-modal retrieval, and multimodal classification. The noticeable performance gains in these experiments demonstrate that our model learns a more effective joint representation.
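A minimal sketch of the dense fusion idea described in the abstract: two modality-specific networks exchange information through a shared layer at every level, rather than fusing only the raw inputs or only the top-level representations. The use of PyTorch, the two-modality setting, the layer sizes, and the concatenation-based sharing are illustrative assumptions, not the authors' exact configuration.

```python
# Sketch of dense multimodal fusion with stacked shared layers (assumed details).
import torch
import torch.nn as nn


class DenseMultimodalFusionSketch(nn.Module):
    def __init__(self, dim_a=64, dim_v=64, hidden=128, shared=64, num_layers=3):
        super().__init__()
        self.layers_a = nn.ModuleList()       # modality-specific layers (e.g. audio)
        self.layers_v = nn.ModuleList()       # modality-specific layers (e.g. visual)
        self.shared_layers = nn.ModuleList()  # one shared (fusion) layer per level

        in_a, in_v, in_s = dim_a, dim_v, 0
        for _ in range(num_layers):
            # Each modality-specific layer also consumes the previous shared
            # representation, so correlations learned at lower levels can
            # influence higher levels (multiple learning paths).
            self.layers_a.append(nn.Linear(in_a + in_s, hidden))
            self.layers_v.append(nn.Linear(in_v + in_s, hidden))
            # The shared layer at this level fuses both modality-specific outputs.
            self.shared_layers.append(nn.Linear(2 * hidden, shared))
            in_a, in_v, in_s = hidden, hidden, shared

    def forward(self, x_a, x_v):
        s = None
        for layer_a, layer_v, layer_s in zip(
            self.layers_a, self.layers_v, self.shared_layers
        ):
            a_in = x_a if s is None else torch.cat([x_a, s], dim=-1)
            v_in = x_v if s is None else torch.cat([x_v, s], dim=-1)
            x_a = torch.relu(layer_a(a_in))
            x_v = torch.relu(layer_v(v_in))
            s = torch.relu(layer_s(torch.cat([x_a, x_v], dim=-1)))
        return s  # topmost joint representation


# Example usage with random features standing in for audio/visual inputs.
model = DenseMultimodalFusionSketch()
joint = model(torch.randn(8, 64), torch.randn(8, 64))
print(joint.shape)  # torch.Size([8, 64])
```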


