DeepEMO: Deep Learning for Speech Emotion Recognition

09/09/2021 ∙ by Enkhtogtokh Togootogtokh, et al.

We propose an industry-level deep learning approach to speech emotion recognition. In industry, carefully designed deep transfer learning delivers practical results because training data is usually scarce, machine training is costly, and models must specialize on dedicated AI tasks. The proposed speech emotion recognition framework, called DeepEMO, consists of two main pipelines: preprocessing, which extracts efficient main features, and a deep transfer learning model, which is trained to recognize emotions. Main source code is in repository




I Introduction

The ability to understand and manage emotions, called emotional intelligence, has been shown to play an important role in decision-making. Emotional intelligence refers to the ability to perceive, control, and evaluate emotions, and some researchers suggest that it can be learned and strengthened. For machine intelligence, it is likewise important to understand, and even generate, such emotional intelligence. Here we propose a simple yet effective deep learning technique for speech emotion recognition, called the DeepEMO framework. It has been conceived considering the main AI pipeline (from sensors to results) together with modern technology trends. DeepEMO has two main pipelines: extraction of strong speech features, and deep transfer learning for the emotion recognition task. We apply them to English emotional speech; in general, they can be applied to any natural language. There is a clear demand for recognizing speech emotion with advanced technology.

Concretely, the key contributions of the proposed work are:

  • An industry-level AI technology for speech emotion recognition

  • A general, modern deep learning framework for speech recognition that can be reused for similar tasks

Systematic experiments conducted on real-world acquired data have shown that:

  • The framework can serve as a common framework for many types of speech recognition tasks.

  • It is possible to achieve 99.9% recognition accuracy on well-prepared training data.

  • It is possible to later generate sufficiently realistic synthetic emotional speech data with multiple variations.

The rest of the paper is organized as follows. The proposed framework is described in Section II. The deep convolutional recognition model is explained in Section II-B. The experimental results are presented in Section III. Finally, Section IV provides the conclusions and future work.

II The proposed method (DeepEMO)

Fig. 1: The DeepEMO framework

In this section, we discuss the proposed DeepEMO model for speech emotion recognition AI applications, shown in Figure 1. DeepEMO has two main pipelines: preprocessing to extract efficient features, and a deep transfer learning mechanism. We discuss them in detail in the following sections.

II-A The preprocessing feature extraction

Extracting significant audio features is an important part of modern deep learning, and there are many mechanisms for doing it. Here we extract the melspectrogram as the audio feature on which the model is later trained with high accuracy.

Specifically, it consists of the following general steps:

  • Compute the fast Fourier transform (FFT)

  • Generate the mel scale

  • Generate the spectrogram

The FFT is an algorithm that efficiently computes the Fourier transform. The Fourier transform is a mathematical formula that decomposes a signal into its individual frequencies and their amplitudes. In other words, it converts the signal from the time domain into the frequency domain; the result is called a spectrum. The mel scale is a unit of pitch proposed by researchers such that equal distances in pitch sound equally distant to the listener, and a mathematical operation converts frequencies to this scale. When a signal's frequency content varies over time, as in non-periodic signals, a single spectrum is not an adequate representation. Instead, we compute several spectra by performing the FFT on overlapping windowed segments of the signal, which is called the short-time Fourier transform. The resulting sequence of spectra over time is called the spectrogram, and mapping its frequency axis to the mel scale gives the melspectrogram.
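The three steps above can be sketched with NumPy alone. This is a minimal illustration, not the preprocessing code of DeepEMO: the function names, the window length, hop size, and number of mel bands are all assumed values, and the Hz-to-mel formula is one common convention (2595 log10(1 + f/700)) rather than one stated in this paper.

```python
import numpy as np


def hz_to_mel(hz):
    """Convert Hz to mels using the common 2595*log10(1 + f/700) convention."""
    return 2595.0 * np.log10(1.0 + np.asarray(hz, dtype=float) / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (np.asarray(mel, dtype=float) / 2595.0) - 1.0)


def mel_filterbank(sr, n_fft, n_mels):
    """Triangular filters evenly spaced on the mel scale, shape (n_mels, n_fft//2 + 1)."""
    fft_freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    mel_max = float(hz_to_mel(sr / 2.0))
    # n_mels filters need n_mels + 2 edge frequencies (left, center, right).
    edges = mel_to_hz(np.linspace(0.0, mel_max, n_mels + 2))
    bank = np.zeros((n_mels, fft_freqs.size))
    for i in range(n_mels):
        lo, mid, hi = edges[i], edges[i + 1], edges[i + 2]
        rising = (fft_freqs - lo) / (mid - lo)
        falling = (hi - fft_freqs) / (hi - mid)
        bank[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank


def mel_spectrogram(signal, sr, n_fft=512, hop=128, n_mels=40):
    """Short-time Fourier transform power, projected onto mel bands.

    Returns an array of shape (n_mels, n_frames).
    """
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    # Overlapping windowed segments of the signal.
    frames = np.stack(
        [signal[t * hop:t * hop + n_fft] * window for t in range(n_frames)]
    )
    # One FFT per segment gives the spectrogram (power per frequency bin).
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    return mel_filterbank(sr, n_fft, n_mels) @ power.T
```

In practice the result is usually converted to a log (decibel) scale and rendered as an image before being passed to a convolutional network; that step is omitted here.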

II-B The deep transfer learning

Specifically, we define the transfer learning model by the following steps [1]:

  • Prepare the pre-trained model

  • Redefine the last output layer as a layer of n neurons (in our case, n=8) for the new task

  • Train the network

This is called transfer learning, i.e. we have a model trained on another task, and we need to tune it for the new dataset we have in hand.

To recognize speech emotion, we propose a deep convolutional transfer neural network. After the melspectrogram feature extraction preprocessing, the task is essentially a computer vision problem. The deep convolutional backbone model is ResNet18 [2], which consists of an assembly of convolutional layers and batch norms, as shown in Algorithm 1. For simplicity, PyTorch-style pseudocode is provided. The cross-entropy (CE) loss function and the Adam optimizer are used for the model.


  import torch
  import torch.nn as nn
  import torchvision
Algorithm 1 Transfer Learning Model
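Training with the cross-entropy loss and the Adam optimizer can be sketched as a single-epoch loop. This is a generic PyTorch pattern offered as an assumption about how such training is typically written; the helper name `train_one_epoch`, the learning rate, and the batch format are hypothetical, since the paper does not state them.

```python
import torch
import torch.nn as nn


def train_one_epoch(model, loader, optimizer, criterion, device="cpu"):
    """Run one pass over `loader`; return (mean loss, training accuracy).

    `loader` yields (inputs, labels) batches, with labels as class indices.
    """
    model.train()
    total_loss, correct, seen = 0.0, 0, 0
    for inputs, labels in loader:
        inputs, labels = inputs.to(device), labels.to(device)
        optimizer.zero_grad()
        logits = model(inputs)
        loss = criterion(logits, labels)   # cross-entropy on raw logits
        loss.backward()
        optimizer.step()
        total_loss += loss.item() * labels.size(0)
        correct += (logits.argmax(dim=1) == labels).sum().item()
        seen += labels.size(0)
    return total_loss / seen, correct / seen


# Typical wiring (learning rate is an assumed value, not from the paper):
#   criterion = nn.CrossEntropyLoss()
#   optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
#   loss, acc = train_one_epoch(model, train_loader, optimizer, criterion)
```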

III Experimental Results

In this section, we first discuss the setup, and then evaluate the deep transfer learning recognition results and the melspectrogram results in systematic scenarios.

III-A Setup

We train and test on an Ubuntu 18 machine with the following capacity: CPU: Intel(R) Xeon(R) CPU @ 2.20GHz; RAM: 16 GB; GPU: NVIDIA GeForce GTX 1070, 16 GB.

III-B The dataset

We use the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) [4] dataset. It has 8 emotion labels for speech: neutral, calm, happy, sad, angry, fearful, disgust, and surprised.
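As an illustration of how the eight labels can be obtained, RAVDESS encodes metadata in each file name as seven hyphen-separated two-digit fields, with the third field giving the emotion code. The sketch below assumes that naming convention; the helper `emotion_from_filename` and the class-index ordering are hypothetical and not taken from the paper.

```python
from pathlib import Path

# Emotion codes from the RAVDESS file-name convention (third field).
EMOTIONS = {
    "01": "neutral", "02": "calm", "03": "happy", "04": "sad",
    "05": "angry", "06": "fearful", "07": "disgust", "08": "surprised",
}
# Class indices 0..7 follow the order of the codes above (an assumed ordering).
LABELS = list(EMOTIONS.values())


def emotion_from_filename(path):
    """Return (class_index, emotion_name) for a RAVDESS-style file name."""
    fields = Path(path).stem.split("-")
    if len(fields) != 7:
        raise ValueError(f"unexpected RAVDESS file name: {path}")
    name = EMOTIONS[fields[2]]
    return LABELS.index(name), name
```

For example, a file named `03-01-06-01-02-01-12.wav` carries emotion code `06` and therefore maps to the label "fearful".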

III-C The recognition results

Table I shows the training and validation accuracy by number of epochs. After 42 epochs, we achieved sufficient accuracy: training accuracy, loss, and validation accuracy are 100%, 0.009, and 100%, respectively.

Number of epochs   Training accuracy   Loss    Validation accuracy
10                 0.88                0.470   0.970
20                 0.991               0.011   1.000
42                 1.000               0.009   1.000
50                 1.000               0.006   1.000

Table I: Training and validation accuracy of the deep convolutional neural network recognition model by number of epochs.

Figures 2, 3, 4, and 5 show the recognition results for test samples of happy, calm, sad, and surprised emotional speech, respectively. For each case, we print the top-8 class probabilities together with the corresponding melspectrogram.

Fig. 2: The recognition result for happy emotional speech
Fig. 3: The recognition result for calm emotional speech
Fig. 4: The recognition result for sad emotional speech
Fig. 5: The recognition result for surprised emotional speech
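Ranked class probabilities such as those shown in the figures can be produced by applying a softmax to the model's output scores and sorting. The sketch below is a plain-Python illustration under assumptions: the label ordering, the helper names, and the example scores are hypothetical and are not values from the experiments.

```python
import math

# Assumed class ordering; the paper does not state the index of each label.
LABELS = ["neutral", "calm", "happy", "sad",
          "angry", "fearful", "disgust", "surprised"]


def softmax(logits):
    """Convert raw scores to probabilities (max subtracted for stability)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(logits, labels, k=8):
    """Return the k most probable (label, probability) pairs, best first."""
    probs = softmax(logits)
    ranked = sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


if __name__ == "__main__":
    # Made-up scores for one utterance, for demonstration only.
    example_logits = [0.1, 0.3, 4.0, 0.2, -1.0, 0.0, -0.5, 1.5]
    for label, prob in top_k(example_logits, LABELS):
        print(f"{label:>10s}  {prob:.4f}")
```

With k=8 and eight classes, the output lists every class, so the probabilities sum to one.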

IV Conclusion

We proposed DeepEMO, a modern deep learning framework for speech emotion recognition and its application to industry use cases. Modern state-of-the-art deep learning approaches are implemented to recognize typical emotional speech. The main algorithm is provided directly in this work to support the first phase of emotion recognition development. Real visual results and key accuracy scores are presented. In future work, we will publish the next series of research, applying deep learning to emotional speech generation.