An Ensemble 1D-CNN-LSTM-GRU Model with Data Augmentation for Speech Emotion Recognition

12/10/2021
by Md. Rayhan Ahmed, et al.

In this paper, we propose an ensemble of deep neural networks, trained with data augmentation (DA) on effective speech-based features, to recognize emotions from speech. The ensemble is built on three deep neural network models. Each model is built from basic local feature acquiring blocks (LFABs), which are consecutive layers of dilated 1D convolutional neural networks followed by max pooling and batch normalization layers. To capture the long-term dependencies in speech signals, we propose two further variants that add Gated Recurrent Unit (GRU) and Long Short-Term Memory (LSTM) layers, respectively. All three models have consecutive fully connected layers before the final softmax layer for classification. The ensemble uses a weighted average to produce the final classification. We evaluate on five standard benchmark datasets: TESS, EMO-DB, RAVDESS, SAVEE, and CREMA-D. We perform DA by injecting additive white Gaussian noise, shifting the pitch, and stretching the signal, which helps the models generalize, increasing their accuracy and reducing overfitting. We handcraft five categories of features from each audio sample: Mel-frequency cepstral coefficients, log Mel-scaled spectrogram, zero-crossing rate, chromagram, and the statistical root mean square energy value. These features are the input to the LFABs, which extract hidden local features; depending on the model type, these are then fed either to the fully connected layers or to LSTM or GRU layers to acquire additional long-term contextual representations. LFABs followed by GRU or LSTM layers perform better than the baseline model. The ensemble model achieves state-of-the-art weighted average accuracy on all the datasets.
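Two of the steps named in the abstract, noise injection for DA and the weighted-average ensemble, can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the `snr_db` parameter, and the example weights and probabilities in the usage note are hypothetical, since the abstract does not state the noise level or the ensemble weights.

```python
import math
import random


def add_awgn(signal, snr_db, rng=None):
    """Return a copy of `signal` with additive white Gaussian noise.

    `snr_db` (target signal-to-noise ratio in dB) is an assumed
    parameterization; the abstract does not say how the noise is scaled.
    """
    rng = rng or random.Random()
    # Mean power of the clean signal.
    signal_power = sum(x * x for x in signal) / len(signal)
    # Noise power that yields the requested signal-to-noise ratio.
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [x + rng.gauss(0.0, sigma) for x in signal]


def weighted_average_ensemble(model_probs, weights):
    """Combine per-model class probabilities with a weighted average.

    `model_probs` holds one softmax output (list of class probabilities)
    per model; `weights` holds one non-negative weight per model.
    Returns the combined probabilities and the index of the top class.
    """
    if len(model_probs) != len(weights):
        raise ValueError("need exactly one weight per model")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    n_classes = len(model_probs[0])
    # Normalize by the weight sum so the result is still a distribution.
    combined = [
        sum(w * probs[c] for w, probs in zip(weights, model_probs)) / total
        for c in range(n_classes)
    ]
    predicted = max(range(n_classes), key=combined.__getitem__)
    return combined, predicted
```

For example, calling `weighted_average_ensemble` with three hypothetical softmax outputs (one each for the baseline, GRU, and LSTM models) and weights such as `[0.2, 0.4, 0.4]` returns the averaged class distribution together with the predicted class index. In practice the weights would be tuned on validation data.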


