Modulation spectral features for speech emotion recognition using deep neural networks

01/14/2023
by Premjeet Singh, et al.

This work explores the use of constant-Q transform based modulation spectral features (CQT-MSF) for speech emotion recognition (SER). Human perception and analysis of sound comprise two important cognitive stages: early auditory analysis and cortex-based processing. Early auditory analysis yields a spectrogram-based representation, whereas cortex-based processing extracts temporal modulations from that spectrogram. This temporal modulation representation of the spectrogram is called the modulation spectral feature (MSF). Because the constant-Q transform (CQT) provides higher resolution in the emotion-salient low-frequency regions of speech, we find that the CQT-based spectrogram, together with its temporal modulations, yields a representation enriched with emotion-specific information. We argue that CQT-MSF, when used with a 2-dimensional convolutional network, provides a time-shift-invariant and deformation-insensitive representation for SER. Our results show that CQT-MSF outperforms the standard mel-scale spectrogram and its modulation features on two popular SER databases, Berlin EmoDB and RAVDESS. We also show that the proposed feature outperforms the shift- and deformation-invariant scattering transform coefficients, which highlights the importance of combining hand-crafted and self-learned feature extraction rather than relying on entirely hand-crafted features. Finally, we perform Grad-CAM analysis to visually inspect the contribution of constant-Q modulation features to SER.
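
As a rough illustration of the two-stage pipeline described in the abstract, the sketch below computes a constant-Q spectrogram (early auditory stage) and then takes a windowed FFT along the time axis of every acoustic-frequency channel to obtain a modulation spectral representation (cortical stage). It assumes librosa and NumPy are available; the hop lengths, bin counts, modulation window size, and log compression are illustrative choices, not the settings used in the paper.

```python
# Minimal sketch of CQT-based modulation spectral feature (CQT-MSF) extraction.
# All parameter values below are illustrative, not taken from the paper.
import numpy as np
import librosa

def cqt_msf(y, sr, hop_length=256, n_bins=84, bins_per_octave=12,
            mod_win=128, mod_hop=64):
    """Return a (acoustic_freq, modulation_freq, time) CQT-MSF tensor."""
    # Early auditory stage: constant-Q spectrogram, log-compressed magnitude.
    cqt = np.abs(librosa.cqt(y, sr=sr, hop_length=hop_length,
                             n_bins=n_bins, bins_per_octave=bins_per_octave))
    log_cqt = np.log1p(cqt)                      # shape: (n_bins, n_frames)

    # Cortical stage: modulation spectrum, i.e. a windowed FFT taken along
    # the time axis of each acoustic-frequency channel.
    frames = librosa.util.frame(log_cqt, frame_length=mod_win,
                                hop_length=mod_hop)
    # frames shape: (n_bins, mod_win, n_mod_frames)
    window = np.hanning(mod_win)
    mod_spec = np.abs(np.fft.rfft(frames * window[None, :, None], axis=1))
    return mod_spec                              # (n_bins, mod_win//2 + 1, n_mod_frames)

# Example usage on a mono speech file (path is a placeholder):
# y, sr = librosa.load("speech.wav", sr=16000)
# features = cqt_msf(y, sr)
# The resulting tensor can then be fed to a 2-D CNN, for instance by treating
# the modulation-frequency axis as input channels.
```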


Related research

05/11/2021 - Deep scattering network for speech emotion recognition
This paper introduces scattering transform for speech emotion recognitio...

11/29/2022 - Analysis of constant-Q filterbank based representations for speech emotion recognition
This work analyzes the constant-Q filterbank-based time-frequency repres...

02/08/2021 - Non-linear frequency warping using constant-Q transformation for speech emotion recognition
In this work, we explore the constant-Q transform (CQT) for speech emoti...

04/12/2019 - Multimodal Speech Emotion Recognition and Ambiguity Resolution
Identifying emotion from speech is a non-trivial task pertaining to the ...

10/07/2021 - SERAB: A multi-lingual benchmark for speech emotion recognition
Recent developments in speech emotion recognition (SER) often leverage d...

11/14/2022 - Temporal Modeling Matters: A Novel Temporal Emotional Modeling Approach for Speech Emotion Recognition
Speech emotion recognition (SER) plays a vital role in improving the int...

08/15/2020 - Deep Architectures for Modulation Recognition with Multiple Receive Antennas
Modulation recognition using deep neural networks has shown promising ad...
