Multi-level Fusion of Wav2vec 2.0 and BERT for Multimodal Emotion Recognition

07/11/2022
by   Zihan Zhao, et al.
0

The research and applications of multimodal emotion recognition have become increasingly popular recently. However, multimodal emotion recognition faces the challenge of lack of data. To solve this problem, we propose to use transfer learning which leverages state-of-the-art pre-trained models including wav2vec 2.0 and BERT for this task. Multi-level fusion approaches including coattention-based early fusion and late fusion with the models trained on both embeddings are explored. Also, a multi-granularity framework which extracts not only frame-level speech embeddings but also segment-level embeddings including phone, syllable and word-level speech embeddings is proposed to further boost the performance. By combining our coattention-based early fusion model and late fusion model with the multi-granularity feature extraction framework, we obtain result that outperforms best baseline approaches by 1.3 (UA) on the IEMOCAP dataset.

READ FULL TEXT

page 1

page 2

page 3

page 4

page 5

research
02/27/2023

Using Auxiliary Tasks In Multimodal Fusion Of Wav2vec 2.0 And BERT For Multimodal Emotion Recognition

The lack of data and the difficulty of multimodal fusion have always bee...
research
09/03/2019

Multimodal Deep Learning for Mental Disorders Prediction from Audio Speech Samples

Key features of mental illnesses are reflected in speech. Our research f...
research
06/23/2022

DeepSafety:Multi-level Audio-Text Feature Extraction and Fusion Approach for Violence Detection in Conversations

Natural Language Processing has recently made understanding human intera...
research
09/20/2023

Ensembling Multilingual Pre-Trained Models for Predicting Multi-Label Regression Emotion Share from Speech

Speech emotion recognition has evolved from research to practical applic...
research
10/13/2021

Multistage linguistic conditioning of convolutional layers for speech emotion recognition

In this contribution, we investigate the effectiveness of deep fusion of...
research
11/17/2021

Information Fusion in Attention Networks Using Adaptive and Multi-level Factorized Bilinear Pooling for Audio-visual Emotion Recognition

Multimodal emotion recognition is a challenging task in emotion computin...
research
09/21/2020

Modality-Transferable Emotion Embeddings for Low-Resource Multimodal Emotion Recognition

Despite the recent achievements made in the multi-modal emotion recognit...

Please sign up or login with your details

Forgot password? Click here to reset