Learning Relationships between Text, Audio, and Video via Deep Canonical Correlation for Multimodal Language Analysis

11/13/2019
by   Zhongkai Sun, et al.
Multimodal language analysis often considers relationships between features based on text and those based on acoustic and visual properties. Text features typically outperform non-text features in sentiment analysis and emotion recognition tasks, in part because text features are derived from advanced language models or word embeddings trained on massive data sources, while audio and video features are human-engineered and comparatively underdeveloped. Given that the text, audio, and video describe the same utterance in different ways, we hypothesize that multimodal sentiment analysis and emotion recognition can be improved by learning (hidden) correlations between features extracted from the outer product of text and audio (we call this text-based audio) and analogous text-based video. This paper proposes a novel model, the Interaction Canonical Correlation Network (ICCN), to learn such multimodal embeddings. ICCN learns correlations between all three modes via deep canonical correlation analysis (DCCA), and the proposed embeddings are then tested on several benchmark datasets and against other state-of-the-art multimodal embedding algorithms. Empirical results and ablation studies confirm the effectiveness of ICCN in capturing useful information from all three views.
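The two ingredients the abstract names can be illustrated compactly: forming "text-based audio" and "text-based video" as flattened outer products, and scoring how correlated the two views are. The sketch below is illustrative only, not the paper's implementation: the dimensions and data are toy values, and where ICCN uses deep networks trained with a DCCA objective, this uses closed-form linear CCA to compute the same kind of correlation score.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy utterance-level features (dimensions are illustrative, not the paper's).
n, d_t, d_a, d_v = 64, 8, 4, 4
text = rng.normal(size=(n, d_t))
audio = rng.normal(size=(n, d_a))
video = rng.normal(size=(n, d_v))

# "Text-based audio" / "text-based video": per-utterance outer product of
# the text feature with each non-text feature, flattened to a vector.
text_audio = np.einsum('ni,nj->nij', text, audio).reshape(n, -1)
text_video = np.einsum('ni,nj->nij', text, video).reshape(n, -1)

def cca_corr(X, Y, k=4, eps=1e-6):
    """Sum of the top-k canonical correlations between views X and Y.
    ICCN maximizes this kind of quantity on deep features; here it is
    evaluated in closed form with plain linear CCA on raw features."""
    X = X - X.mean(0)
    Y = Y - Y.mean(0)
    Sxx = X.T @ X / (n - 1) + eps * np.eye(X.shape[1])
    Syy = Y.T @ Y / (n - 1) + eps * np.eye(Y.shape[1])
    Sxy = X.T @ Y / (n - 1)

    def inv_sqrt(S):
        # Inverse matrix square root via eigendecomposition (whitening).
        w, V = np.linalg.eigh(S)
        return V @ np.diag(1.0 / np.sqrt(np.clip(w, eps, None))) @ V.T

    T = inv_sqrt(Sxx) @ Sxy @ inv_sqrt(Syy)
    sv = np.linalg.svd(T, compute_uv=False)  # canonical correlations
    return float(np.sum(sv[:k]))

score = cca_corr(text_audio, text_video)
print(round(score, 3))
```

In the full model, each flattened outer product would pass through a learned network (a CNN in ICCN) before the correlation is computed, and the negative correlation serves as the training loss so the two text-conditioned views become maximally aligned.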

