
Self-Supervised Learning with Cross-Modal Transformers for Emotion Recognition

by Aparna Khare, et al.

Emotion recognition is a challenging task due to the limited availability of in-the-wild labeled datasets. Self-supervised learning has shown improvements on tasks with limited labeled data in domains such as speech and natural language. Models such as BERT learn to incorporate context into word embeddings, which translates to improved performance on downstream tasks like question answering. In this work, we extend self-supervised training to multi-modal applications. We learn multi-modal representations using a transformer trained on the masked language modeling task with audio, visual, and text features. This model is then fine-tuned on the downstream task of emotion recognition. Our results on the CMU-MOSEI dataset show that this pre-training technique can improve emotion recognition performance by up to 3%.
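To illustrate the pre-training objective described above, the sketch below adapts BERT-style masking to continuous multi-modal features: per-frame audio, visual, and text features are concatenated into one sequence, a random subset of time steps is masked out, and the model (a stand-in here) would be trained to reconstruct the original features at the masked positions. The feature dimensions, masking rate, and reconstruction loss are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

# Hypothetical feature sizes for illustration only; the paper's actual
# audio/visual/text front-ends and dimensions may differ.
rng = np.random.default_rng(0)
T, D_AUDIO, D_VISUAL, D_TEXT = 20, 8, 6, 10

# One utterance: per-frame audio, visual, and text features, concatenated
# into a single cross-modal input of shape (T, D_AUDIO + D_VISUAL + D_TEXT).
audio = rng.standard_normal((T, D_AUDIO))
visual = rng.standard_normal((T, D_VISUAL))
text = rng.standard_normal((T, D_TEXT))
frames = np.concatenate([audio, visual, text], axis=1)

def mask_frames(x, mask_prob=0.15, rng=rng):
    """BERT-style masking adapted to continuous features:
    zero out a random subset of time steps and record where."""
    mask = rng.random(x.shape[0]) < mask_prob
    corrupted = x.copy()
    corrupted[mask] = 0.0
    return corrupted, mask

corrupted, mask = mask_frames(frames)

# A transformer would consume `corrupted` and be trained to reconstruct
# `frames` at the masked positions; a zero "prediction" stands in here
# just to show the shape of the reconstruction loss.
prediction = np.zeros_like(frames)
loss = np.abs(prediction[mask] - frames[mask]).mean() if mask.any() else 0.0
print(corrupted.shape, int(mask.sum()))
```

After pre-training with this objective, the transformer encoder would be fine-tuned with an emotion-classification head, as in the paper's CMU-MOSEI experiments.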


Using Large Pre-Trained Models with Cross-Modal Attention for Multi-Modal Emotion Recognition

Recently, self-supervised pre-training has shown significant improvement...

Multi-modal embeddings using multi-task learning for emotion recognition

General embeddings like word2vec, GloVe and ELMo have shown a lot of suc...

Jointly Fine-Tuning "BERT-like" Self Supervised Models to Improve Multimodal Speech Emotion Recognition

Multimodal emotion recognition from speech is an important area in affec...

Fine-grained Multi-Modal Self-Supervised Learning

Multi-Modal Self-Supervised Learning from videos has been shown to impro...

Multi-Modal Emotion Detection with Transfer Learning

Automated emotion detection in speech is a challenging task due to the c...

Exploration of A Self-Supervised Speech Model: A Study on Emotional Corpora

Self-supervised speech models have grown fast during the past few years ...

DialogueBERT: A Self-Supervised Learning based Dialogue Pre-training Encoder

With the rapid development of artificial intelligence, conversational bo...