On the Impact of Word Error Rate on Acoustic-Linguistic Speech Emotion Recognition: An Update for the Deep Learning Era

by Shahin Amiriparian et al.

Text encodings from automatic speech recognition (ASR) transcripts and audio representations have long shown promise in speech emotion recognition (SER). Yet, it remains challenging to explain the effect of each information stream on an SER system. Furthermore, the impact of the ASR's word error rate (WER) on linguistic emotion recognition, both on its own and when fused with acoustic information, requires clarification in the age of deep ASR systems. To tackle these issues, we create transcripts from the original speech with three modern ASR systems: an end-to-end model trained with a recurrent neural network-transducer loss, a model trained with a connectionist temporal classification loss, and the wav2vec framework for self-supervised learning. Afterwards, we use pre-trained textual models to extract text representations from the ASR outputs and from the gold-standard transcripts. For extraction and learning of acoustic speech features, we utilise openSMILE, openXBoW, DeepSpectrum, and auDeep. Finally, we conduct decision-level fusion of the two information streams, acoustics and linguistics. Using the best development configuration, we achieve state-of-the-art unweighted average recall values of 73.6 % and 73.8 % on the speaker-independent development and test partitions of IEMOCAP, respectively.
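The decision-level fusion and the reported metric can be sketched in a few lines: late fusion combines the per-class posteriors of the acoustic and linguistic classifiers (here as a weighted average), and unweighted average recall (UAR) is the mean of the per-class recalls. This is a minimal illustrative sketch; the equal default weighting, the function names, and the toy inputs are assumptions, not the paper's exact configuration.

```python
import numpy as np

def late_fusion(acoustic_probs, linguistic_probs, alpha=0.5):
    """Decision-level (late) fusion of two streams' class posteriors.

    alpha weights the acoustic stream; (1 - alpha) weights the
    linguistic stream. Returns the index of the fused winning class.
    The 50/50 default is an illustrative assumption.
    """
    fused = (alpha * np.asarray(acoustic_probs, dtype=float)
             + (1.0 - alpha) * np.asarray(linguistic_probs, dtype=float))
    return int(np.argmax(fused))

def uar(y_true, y_pred):
    """Unweighted average recall: mean of per-class recalls,
    so every emotion class counts equally regardless of its frequency."""
    classes = sorted(set(y_true))
    recalls = []
    for c in classes:
        idx = [i for i, t in enumerate(y_true) if t == c]
        correct = sum(1 for i in idx if y_pred[i] == c)
        recalls.append(correct / len(idx))
    return sum(recalls) / len(recalls)
```

In practice, the fusion weight would be tuned on the development partition, which is consistent with the "best development configuration" the abstract refers to.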


Related research:

- Fusing ASR Outputs in Joint Training for Speech Emotion Recognition
  Alongside acoustic information, linguistic features based on speech tran...

- CTA-RNN: Channel and Temporal-wise Attention RNN Leveraging Pre-trained ASR Embeddings for Speech Emotion Recognition
  Previous research has looked into ways to improve speech emotion recogni...

- Incorporating End-to-End Speech Recognition Models for Sentiment Analysis
  Previous work on emotion recognition demonstrated a synergistic effect o...

- Topic Classification on Spoken Documents Using Deep Acoustic and Linguistic Features
  Topic classification systems on spoken documents usually consist of two ...

- Is Everything Fine, Grandma? Acoustic and Linguistic Modeling for Robust Elderly Speech Emotion Recognition
  Acoustic and linguistic analysis for elderly emotion recognition is an u...

- Multistage Linguistic Conditioning of Convolutional Layers for Speech Emotion Recognition
  In this contribution, we investigate the effectiveness of deep fusion of...

- Feature Replacement and Combination for Hybrid ASR Systems
  Acoustic modeling of raw waveform and learning feature extractors as par...