Multi-Modal Emotion recognition on IEMOCAP Dataset using Deep Learning

04/16/2018 · Samarth Tripathi, et al.

Emotion recognition has become an important field of research in Human Computer Interaction as we improve our techniques for modelling the various aspects of behaviour. As technology and our understanding of emotions advance, there is a growing need for automatic emotion recognition systems. One direction this research is heading is the use of Neural Networks, which are adept at estimating complex functions that depend on a large number of diverse input sources. In this paper we exploit this effectiveness of Neural Networks to perform multimodal emotion recognition on the IEMOCAP dataset using speech, text, and motion-capture data covering facial expressions, head rotation and hand movements. Prior research has concentrated on emotion detection from speech on the IEMOCAP dataset, but our approach is the first to use the multiple modes of data offered by IEMOCAP for more robust and accurate emotion detection.




1 Introduction and Related Works

Emotion is a psycho-physiological process that can be triggered by conscious and/or unconscious perception of objects and situations, and is associated with a multitude of factors such as mood, temperament, personality, disposition, and motivation [1]. Emotions are very important in human decision making, interaction and cognitive processes [2]. As technology and our understanding of emotions advance, there is a growing need for automatic emotion recognition systems. Emotion recognition has been studied widely using speech [3] [4] [5], text [6], facial cues [7], and EEG-based brain waves [8] individually.

One of the largest open-source multimodal resources available for emotion detection is the IEMOCAP dataset [9], which consists of approximately 12 hours of audio-visual data, including facial recordings, speech and text transcriptions. In this paper we combine these modes to build a stronger and more robust emotion detector. Most research on IEMOCAP, however, has concentrated specifically on emotion detection using speech data. One of the early important papers on this dataset is [10], which beat the state of the art by 20% over techniques that used HMMs (Hidden Markov Models), SVMs (Support Vector Machines) and other shallow learning methods. They perform segment-level feature extraction and feed those features to an MLP-based architecture, where the input is a 750-dimensional feature vector, followed by 3 hidden layers of 256 neurons each with rectified linear units as the non-linearity.

[3] follows [10] and trains a long short-term memory (LSTM) based recurrent neural network. They first divide each utterance into small segments containing voiced regions, then assume that the label sequence of each segment follows a Markov chain. They extract 32 features for every frame: F0 (pitch), voicing probability, zero-crossing rate, 12-dimensional Mel-frequency cepstral coefficients (MFCC) with log energy, and their first time derivatives. The network contains 2 hidden layers with 128 BLSTM cells (64 forward nodes and 64 backward nodes).

Another work we closely follow is [4], where a CTC loss function is used to improve upon RNN-based emotion prediction. They use 34 features, including 12 MFCC and chromagram-based and spectral properties like flux and roll-off. For all speech intervals they calculate features over a 0.2-second window moved with a 0.1-second step. The CTC loss helps because, often, almost the whole utterance carries no emotion, and the emotionality is contained in only a few words or phonemes, which the CTC loss handles well. Unlike [3], which uses only the improvised data, Chernykh et al. use all the session data for emotion classification. Another important work on speech-based emotion recognition is [12], which uses transfer learning to improve neural models for emotion detection. Their model uses 1D convolutions and GRU layers to initialize a neural model for Automatic Speech Recognition inspired by Deep Speech. They train on many ASR datasets with CTC loss, and then fine-tune this model on IEMOCAP.

To detect emotions using the data from the modalities of IEMOCAP, we explore various deep learning based architectures to first get the best individual detection accuracy from each of the different modes. We then combine them in an ensemble-based architecture to allow training across the different modalities using variations of the better individual models. Our ensemble consists of Long Short Term Memory networks, Convolutional Neural Networks and fully connected Multi-Layer Perceptrons, complemented by techniques such as Dropout, adaptive optimizers such as Adam, pretrained word-embedding models and Attention-based RNN decoders. Comparing our speech-based emotion detection with [3], we achieve 62.72% accuracy versus their 62.85%; comparing with [4], we achieve 55.65% accuracy versus their CTC-based 54%. After combining the speech (individually 55.65% accuracy) and text (individually 64.78% accuracy) modes, we improve to 68.40% accuracy. When we also account for the MoCap data (individually 51.11% accuracy), we achieve a further improvement to 71.04%.

2 Experimental Setup

IEMOCAP contains 12 hours of audio-visual data from 10 actors, where the recordings follow dialogues between a male and a female actor on both scripted and improvised topics. After the audio-visual data has been collected, it is divided into small utterances of length between 3 and 15 seconds, which are then labelled by evaluators. Each utterance is evaluated by 3-4 assessors. The evaluation form contained 10 options (neutral, happiness, sadness, anger, surprise, fear, disgust, frustration, excited, other). We consider only 4 of them, namely anger, excitement (happiness), neutral and sadness, so as to remain consistent with prior research. We keep utterances where at least 2 experts were consistent in their decision, which covers more than 70% of the dataset, again consistent with prior research.
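The agreement filter above can be sketched as follows. This is a minimal illustration, not the paper's actual preprocessing code; the label codes in KEPT are assumptions chosen for the example.

```python
from collections import Counter

# Hypothetical label codes for the 4 retained classes (an assumption,
# not taken from the paper's released code).
KEPT = {"ang", "exc", "neu", "sad"}

def majority_label(evaluator_labels):
    """Return the agreed label if at least 2 evaluators chose it and it
    is one of the 4 retained classes; otherwise None (utterance dropped)."""
    label, count = Counter(evaluator_labels).most_common(1)[0]
    if count >= 2 and label in KEPT:
        return label
    return None

print(majority_label(["ang", "ang", "neu"]))  # -> ang
print(majority_label(["hap", "neu", "sad"]))  # -> None (no agreement)
```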

Along with the .wav file for each dialogue, we also have the transcript of each utterance. For each session one actor wears the Motion Capture (MoCap) markers, which record the facial expressions, head rotation and hand movements of that actor. The MoCap data contains column tuples: 165 dimensions for facial expressions, 18 for hand positions and 6 for head rotations. As this MoCap data is very extensive, we use it instead of the video recordings in the dataset. These three modes (speech, text, MoCap) form the basis of our multimodal emotion detection pipeline.

Next we preprocess the IEMOCAP data for these modes. For the speech data our preprocessing follows the work of [4]. We use Fourier-frequency and energy-based features, among them Mel-frequency cepstral coefficients (MFCC), for a total of 34 features. They include 13 MFCC, 13 chromagram-based features and 8 time-spectral features like zero-crossing rate, short-term energy, short-term entropy of energy, spectral centroid and spread, spectral entropy, spectral flux and spectral roll-off. We calculate features over a 0.2-second window moved with a 0.1-second step, at a 16 kHz sample rate. We keep a maximum of 100 frames, or approximately 10 seconds of input, zero-pad the remainder, and end up with a (100, 34) feature vector for each utterance. We also experiment with delta and double-delta MFCC features, but they don't produce any performance improvement while adding extra computational overhead.
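The framing-and-padding step above can be sketched shape-wise as follows; this is an illustrative sketch only, and in practice the 34 acoustic features per frame would come from an audio-analysis library.

```python
import numpy as np

MAX_FRAMES, N_FEATURES = 100, 34  # fixed shape described in the text

def pad_or_truncate(frames: np.ndarray) -> np.ndarray:
    """Zero-pad (or truncate) a (n_frames, 34) feature matrix to (100, 34)."""
    out = np.zeros((MAX_FRAMES, N_FEATURES), dtype=np.float32)
    n = min(len(frames), MAX_FRAMES)
    out[:n] = frames[:n]
    return out

# A 0.2 s window moved in 0.1 s steps yields one frame per 0.1 s, so a
# 6-second utterance gives floor((6 - 0.2) / 0.1) + 1 = 59 frames:
utterance = np.random.randn(59, N_FEATURES)
print(pad_or_truncate(utterance).shape)  # (100, 34)
```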

For the text transcript of each utterance we use pretrained Glove embeddings [13] of dimension 300, along with a maximum sequence length of 500, to obtain a (500, 300) vector for each utterance. For the MoCap data, for each mode (face, hand, head rotation) we sample all the feature values between the start and finish times and split them into 200 partitioned arrays. We then average each of the 200 arrays along the columns (165 for face, 18 for hands, and 6 for rotation), and finally concatenate all of them to obtain a (200, 189)-dimensional vector for each utterance.
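The MoCap resampling described above can be sketched as follows; frame counts here are arbitrary placeholders, since real utterances vary in length.

```python
import numpy as np

def resample_stream(frames: np.ndarray, n_parts: int = 200) -> np.ndarray:
    """Split a (n_frames, n_cols) stream into 200 partitions and average
    each partition column-wise, giving a fixed (200, n_cols) matrix."""
    parts = np.array_split(frames, n_parts, axis=0)
    return np.stack([p.mean(axis=0) for p in parts])

# Illustrative frame count; the three streams are recorded per utterance.
face = np.random.randn(4807, 165)
hand = np.random.randn(4807, 18)
rot = np.random.randn(4807, 6)
features = np.concatenate(
    [resample_stream(face), resample_stream(hand), resample_stream(rot)],
    axis=1)
print(features.shape)  # (200, 189)
```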

3 Models and Results

3.1 Speech Based Emotion Detection

Figure 1: Neural Model for Speech based Emotion detection

Our first model (Model1) consists of three fully connected MLP layers with 1024, 512 and 256 hidden units with ReLU activation, and 4 output neurons with Softmax. The model takes the flattened speech vectors as input and trains using cross-entropy loss with Adadelta as the optimizer. Model2 uses two stacked LSTM layers with 512 and 256 units, followed by a Dense layer with 512 units and ReLU activation. Model3 uses 2 LSTM layers with 128 units each, where the second LSTM layer also implements Attention, followed by a Dense layer of 512 units with ReLU activation. Model4 improves both the encoding LSTM and the Attention-based decoding LSTM by making them bidirectional. These last 3 models all use Adadelta as the optimizer. We divide our dataset with a randomly chosen 20% validation split and report our accuracies on this set. As shown, the final Attention-based LSTM model performs best. We also try many variations of the speech data, including Mel spectrograms and a smaller window (0.08 s) with longer context (200 timestamps), as well as combining these approaches into one big network, but do not achieve improvements.
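Model1's architecture can be illustrated with a pure-NumPy forward pass. The weights here are random placeholders purely to show the layer shapes; in the paper they are learned with cross-entropy loss and Adadelta.

```python
import numpy as np

rng = np.random.default_rng(0)
# Flattened (100, 34) input -> 1024 -> 512 -> 256 -> 4 output classes.
sizes = [100 * 34, 1024, 512, 256, 4]
weights = [rng.normal(0, 0.01, (m, n)) for m, n in zip(sizes, sizes[1:])]

def forward(x: np.ndarray) -> np.ndarray:
    h = x.reshape(-1)                  # flatten the (100, 34) speech vector
    for w in weights[:-1]:
        h = np.maximum(h @ w, 0.0)     # ReLU hidden layers
    logits = h @ weights[-1]
    e = np.exp(logits - logits.max())  # numerically stable softmax
    return e / e.sum()                 # probabilities over the 4 emotions

probs = forward(rng.normal(size=(100, 34)))
print(probs.shape)  # (4,)
```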

Model Accuracy
Model1 50.6%
Model2 51.32%
Model3 54.15%
Model4 55.65%
Table 1: Speech emotion detection models and accuracy
Model Accuracy
Lee and Tashev [3] 62.85%
Ours (improv only) 62.72%
Chernykh [4] 54%
Neumann [5] 56.10%
Lakomkin [12] 56%
Ours (all) 55.65%
Ours (all, Speech + text + Mocap) 71.04%
Table 2: Comparison between our Speech emotion detection models and previous research

To compare our results with prior research, we take our best model (Model4) and evaluate it under conditions similar to those of the previous studies. We train on Sessions 1-4 and use Session 5 as our test set. Like [3] we use only the improvisation sessions for both training and testing, and achieve similar results. Compared with [4] [5] [12], who use both the scripted and improvisation sessions, we again achieve similar results. One important insight from our results is that, with minimal preprocessing and no complex loss functions or noise injection during training, we can easily match prior research's performance using Attention-based bidirectional LSTMs.

3.2 Text based Emotion Recognition

Our task of performing emotion detection using only the text transcripts resembles sentiment analysis, a very common and highly researched task in Natural Language Processing. Here we try two approaches. Model1 uses 1D convolutions of kernel size 3, with 256, 128, 64 and 32 filters, ReLU activation and Dropout with 0.2 probability, followed by a 256-dimensional fully connected layer with ReLU, feeding into 4 output neurons with Softmax. Model2 uses two stacked LSTM layers with 512 and 256 units, followed by a Dense layer with 512 units and ReLU activation. Both models are initialized with Glove-embedding-based word vectors. We also try random initialization with 128 dimensions in Model3 and obtain performance similar to Model2. The LSTM-based models use Adadelta and the convolution-based models use Adam as optimizers.
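The first convolutional layer of the text branch can be sketched shape-wise in NumPy, assuming valid (no) padding, which is an assumption since the paper does not state the padding mode:

```python
import numpy as np

def conv1d(x: np.ndarray, kernels: np.ndarray) -> np.ndarray:
    """Kernel-size-3 1D convolution with ReLU, valid padding.
    x: (seq_len, in_ch); kernels: (n_filters, 3, in_ch)."""
    # Gather all length-3 windows: (seq_len - 2, 3, in_ch).
    windows = np.stack([x[i:i + 3] for i in range(len(x) - 2)])
    out = np.einsum("wki,fki->wf", windows, kernels)
    return np.maximum(out, 0.0)  # ReLU

seq = np.random.randn(500, 300)                 # one Glove-embedded utterance
h = conv1d(seq, np.random.randn(256, 3, 300))   # first layer: 256 filters
print(h.shape)  # (498, 256)
```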

Figure 2: Neural Model for Text based Emotion detection
Model Accuracy
Model1 62.55%
Model2 64.68%
Model3 64.78%
Table 3: Text emotion detection models and accuracy

3.3 MoCap based Emotion Detection

For MoCap-based emotion detection we use LSTM and convolution-based models. For emotion detection using only the head rotation we try 2 models: the first (Model1) uses an LSTM with 256 units followed by a Dense layer with ReLU activation, while the second (Model2) uses just a 256-unit Dense layer with ReLU and achieves better performance. We use the same two models for hand-movement-based emotion detection, and Model2 again performs better. For the facial-expression MoCap data (which has a larger dimensionality than the head and hand MoCap data), Model1 uses two stacked LSTM layers with 512 and 256 units followed by a Dense layer with 512 units and ReLU activation. Model2 on face MoCap uses 5 2D convolutions, each with kernel size 3 and stride 2, with 32, 64, 64, 128 and 128 filters, along with ReLU activation and 0.2 Dropout. These layers are followed by a Dense layer with 256 neurons and ReLU, and then 4 output neurons with Softmax. We also try Model3, a slight variation of Model2 in which we replace the last convolution layer with a Dense layer of 1024 units. We finally use the Model3-based architecture for the concatenated MoCap data with 189 input features. The LSTM-based models use Adadelta, and the convolutional and fully connected models use Adam as optimizers.
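The spatial sizes through the 5-layer stride-2 convolution stack can be traced with simple arithmetic. This assumes "same" padding, so each stride-2 layer halves the size rounding up; the paper does not state the padding mode, so this is an assumption for illustration.

```python
# Output size of a stride-2 convolution with "same" padding (assumed):
# ceil(in_size / stride).
def conv_out(size: int, stride: int = 2) -> int:
    return -(-size // stride)  # ceiling division

# Trace the (200, 189) combined MoCap input through the 5 layers.
h, w = 200, 189
for n_filters in (32, 64, 64, 128, 128):
    h, w = conv_out(h), conv_out(w)
    print(f"{n_filters} filters -> {h} x {w}")
# 32 filters -> 100 x 95
# ...
# 128 filters -> 7 x 6
```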

Figure 3: Neural Model for MoCap based Emotion detection
Model Accuracy
MoCap-head Model1 37.75%
MoCap-head Model2 40.28%
MoCap-hand Model1 33.70%
MoCap-hand Model2 36.94%
MoCap-face Model1 48.99%
MoCap-face Model2 48.58%
MoCap-face Model3 49.39%
MoCap-combined Model3 51.11%
Table 4: MoCap emotion detection models and accuracy

3.4 Combined Multi-modal Emotion Detection

Figure 4: Final Combined Neural Network
Figure 5: Accuracy graph of our Final Model
Model Accuracy
Text + Speech Model1 65.38%
Text + Speech Model2 67.41%
Text + Speech Model3 69.74%
Text + Speech + Mocap Model4 67.94%
Text + Speech + Mocap Model5 68.58%
Text + Speech + Mocap Model6 71.04%
Table 5: Multimodal emotion detection models and accuracy

For the final part of our experiment we train models using all three modes discussed above. We first use the text-transcript and speech-based vectors in one model. We try an architecture that uses Model1 of the text pipeline and Model1 of the speech pipeline, both without their output neurons, with their final hidden layers concatenated into a 512-dimensional hidden layer feeding into 4 output neurons. This architecture does not yield good results. We then try a new model (Model1) which uses 3 Dense layers (1024, 512 and 256 neurons) each for the text and speech features, whose outputs are concatenated and followed by another Dense layer with 256 neurons using ReLU and Dropout of 0.2, and 4 output Softmax neurons. Our Model2 uses 2 stacked LSTMs of 256 units followed by a Dense layer with 256 neurons for the text data, and 2 Dense layers with 1024 and 256 neurons for the speech data; these are concatenated and followed by another Dense layer with 256 neurons using ReLU and Dropout of 0.2, and 4 output Softmax neurons. Both Model1 and Model2 use random initialization of 128-dimensional embeddings. For Model3 we replace Model2's random embeddings with Glove embeddings.

We then proceed to include the MoCap data as well in one complete model. For Model4 we combine the previous Model3 with the MoCap-based Model2 and concatenate all three 256-dimensional final outputs. For Model5 we combine the previous Model3 with the MoCap-based Model1 and concatenate all three 256-dimensional final outputs. For Model6 we replace the Dense layers in the speech part of the previous Model4 with Attention-based LSTM architectures. All the code is openly available for reference.
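The late-fusion scheme above can be sketched in NumPy: each modality branch ends in a 256-dimensional hidden vector, the vectors are concatenated, and a 256-unit ReLU layer feeds a 4-way softmax. Weights are random placeholders here, purely to illustrate the shapes of the fusion head.

```python
import numpy as np

rng = np.random.default_rng(0)
# Final 256-dim hidden vectors from the text, speech and MoCap branches
# (random stand-ins for the branch outputs).
text_h, speech_h, mocap_h = (rng.normal(size=256) for _ in range(3))

fused = np.concatenate([text_h, speech_h, mocap_h])   # (768,)
w1 = rng.normal(size=(768, 256))
w2 = rng.normal(size=(256, 4))

hidden = np.maximum(fused @ w1, 0.0)                  # Dense 256 + ReLU
logits = hidden @ w2
e = np.exp(logits - logits.max())                     # stable softmax
probs = e / e.sum()                                   # 4 emotion classes
print(probs.shape)  # (4,)
```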