Learning Alignment for Multimodal Emotion Recognition from Speech

Speech emotion recognition is a challenging problem because humans convey emotions in subtle and complex ways. For emotion recognition on human speech, one can either extract emotion-related features from audio signals or employ speech recognition techniques to generate text from speech and then apply natural language processing to analyze the sentiment. Although emotion recognition benefits from audio-textual multimodal information, it is not trivial to build a system that learns from multiple modalities. One can build models for the two input sources separately and combine them at the decision level, but this method ignores the interaction between speech and text in the temporal domain. In this paper, we propose to use an attention mechanism to learn the alignment between speech frames and text words, aiming to produce more accurate multimodal feature representations. The aligned multimodal features are fed into a sequential model for emotion recognition. We evaluate the approach on the IEMOCAP dataset, and the experimental results show that the proposed approach achieves state-of-the-art performance on the dataset.



1 Introduction

Despite the tremendous progress made in speech and natural language understanding in recent years, we are still far from being able to naturally interact with machines. Building a system to understand human emotions is paramount for many human-computer interaction applications. However, it is very challenging to build such systems.

Humans express emotion through various modalities such as voice, facial expression, and body posture. Utilizing multiple modalities may therefore capture expressed emotion more accurately and lead to better recognition results than unimodal approaches [ngiam2011multimodal]. Many studies have focused on audio-visual modalities for emotion recognition because both are highly informative about emotional expression. However, in many real applications audio-visual data is not accessible and only audio is available, for example, in emotion recognition for call centers or fatigue detection for drivers. In such cases, an emotion recognition system that uses only speech signals is preferable.

In daily life, humans utter sentences naturally, conveying emotional states through both voice and content. Although there are many studies on emotion recognition in speech and on sentiment analysis in text, only a few have considered them jointly. Furthermore, in scenarios where only speech data is accessible, one can use automatic speech recognition (ASR) to convert audio signals into text and then apply a multimodal model to learn emotion from speech and text simultaneously. In this way, the text is created by an ASR system, which is usually trained on a separate, large dataset for the purpose of speech recognition. Arguably, we therefore employ prior knowledge learned from another dataset for the emotion recognition task. This can be considered a transfer learning scheme, similar to pretraining word embeddings in natural language processing (NLP) or pretraining models on ImageNet for object recognition.


To effectively utilize both speech and text data, one needs to design a model that jointly learns features from the different domains. Although some studies combined both features and trained a multimodal model, few works have focused on the temporal relationship between speech and text at a fine-grained level. We believe that, since speech and text inherently co-exist in the temporal dimension, a multimodal system will benefit from using the alignment information. In fact, in an end-to-end speech recognition system, the model employs an attention mechanism so that each decoded word attends to its corresponding speech frames [chorowski2015attention, chan2016listen]. Inspired by this work, we utilize an attention network to learn the alignment between speech and text. The aligned speech and text features are combined at the word level and serve as multimodal features for an emotional utterance. We then use a recurrent network, for example a long short-term memory (LSTM) network, to model the sequence for emotion recognition. We emphasize that, although an ASR system can output an alignment result (i.e., hard alignment for hidden Markov model based systems, and soft alignment for attention-based systems), our approach does not require the alignment from ASR. The alignment is learned entirely by the attention mechanism in the model. There are two advantages to using the learned alignment: first, our approach is suitable for scenarios where the ASR system is a black box and can only output the recognized text, for example, when using the Google speech recognition API; second, the alignment is learned for the purpose of emotion recognition and may be better suited to the task than the alignment from speech recognition.

In the next section, we relate our work to prior emotion recognition studies. We then describe our proposed approach in detail in Section 3. We show the experimental results in Section 4 and conclude the paper in Section 5.

2 Related Work

Machine learning has been applied to speech emotion recognition for decades. Previous studies usually extracted engineered low-level features or high-level statistical features and applied a classifier for emotion recognition, such as Gaussian mixture models [neiberg2006emotion], hidden Markov models [nogueiras2001speech], support vector machines, and neural networks [stuhlsatz2011deep, kim2013emotion].

Recent studies on deep learning have shown that neural networks are capable of learning high-level features from raw data, and a growing number of studies have built emotion recognition systems using neural architectures. Researchers have demonstrated the effectiveness of emotional feature learning using deep neural networks (DNNs). Some studies employed recurrent neural networks (RNNs) for emotion recognition because of the sequential structure of speech signals [lee2015high, mirsamadi2017automatic, sarma2018emotion, li2018attention]. In addition, since convolutional neural networks (CNNs) are designed to learn local spatial features, which makes them suitable for feature extraction in the spectral domain, some studies used CNNs to extract features and combined them with a sequential model such as an LSTM [trigeorgis2016adieu, satt2017efficient].

Multimodal learning is an important topic in machine learning [ngiam2011multimodal]. In emotion recognition, many studies extracted features from the audio, visual, or textual domains and then fused them at either the feature level or the decision level [busso2004analysis, wollmer2010context, poria2017review]. To leverage information from speech signals and text sequences, a previous study [yoon2018multimodal] used neural networks to model the two sequences separately and directly concatenated the two modalities for emotion classification. In [zadeh2017tensor], a tensor fusion network was proposed to fuse features from different modalities and learn intra-modality and inter-modality dynamics. In [poria2017context], an LSTM-based model was utilized to learn contextual information from the utterances for sentiment analysis.

Attention networks are also related to our work. In [bahdanau2014neural], an attention network was first proposed to align the input and output sequences for machine translation in NLP. Following this study, researchers in the speech area adopted the idea and utilized the attention mechanism for end-to-end speech recognition [chorowski2015attention, chan2016listen]. In speech emotion recognition, several studies have used attention networks [mirsamadi2017automatic, sarma2018emotion]; however, they mainly utilized attention only for sequential modeling. To our knowledge, our work is the first to use attention to align speech and text sequences.

3 Algorithm Details

The architecture of the model is shown in Figure 1. There are two paths for processing a given speech signal. One path directly extracts features from the audio for speech encoding, and the other uses an ASR system to produce text and converts it to embeddings for text encoding. The whole model therefore consists of a speech encoder, a text encoder, and a multimodal fusion network that includes an attention mechanism and an LSTM for classification. We describe each component in detail in this section.

Figure 1: The architecture of the proposed model. The yellow part indicates the speech encoder and the green part indicates the text encoder. The blue part is the multimodal fusion network consisting of an attention network to fuse both modalities and an LSTM for sequence classification.

3.1 Speech Encoder

We first discuss the speech encoder in our multimodal emotion recognition model. To extract acoustic features, we first split the time-domain speech signal into frames using a 20 ms window shifted every 10 ms. The low-level speech features extracted from each frame can be computed in the time domain (e.g., zero-crossing rate), the spectral domain (e.g., spectral spread), or the cepstral domain (e.g., Mel-frequency cepstral coefficients, MFCCs). We represent the sequence of features in an utterance as $[x_1, \dots, x_N]$, where $N$ is the number of frames in the utterance.
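The framing step can be illustrated with a short sketch. It is not the feature extraction pipeline used in the experiments; it only assumes a 16 kHz sampling rate (the rate reported in Section 4.2), a 20 ms window, and a 10 ms shift. The function names `frame_signal` and `zero_crossing_rate` are hypothetical, and dropping a trailing partial frame is an assumption.

```python
def frame_signal(samples, sample_rate=16000, window_ms=20, shift_ms=10):
    """Split a 1-D sequence of samples into overlapping fixed-length frames."""
    window = sample_rate * window_ms // 1000  # samples per frame (320 at 16 kHz)
    shift = sample_rate * shift_ms // 1000    # samples between frame starts (160 at 16 kHz)
    frames = []
    start = 0
    # Assumption: only full frames are kept; a trailing partial frame is dropped.
    while start + window <= len(samples):
        frames.append(samples[start:start + window])
        start += shift
    return frames


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ (a time-domain feature)."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


# One second of silence at 16 kHz yields 99 full frames of 320 samples each.
frames = frame_signal([0.0] * 16000)
print(len(frames), len(frames[0]))  # 99 320
```

With these settings, consecutive frames share half of their samples, so an utterance of a few seconds already produces several hundred frames, far more than the number of words it contains.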

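For speech encoding, we choose a bidirectional LSTM (BiLSTM) to model the sequential structure of speech frames:

$$\overrightarrow{s}_i = \overrightarrow{\mathrm{LSTM}}(x_i, \overrightarrow{s}_{i-1}), \qquad \overleftarrow{s}_i = \overleftarrow{\mathrm{LSTM}}(x_i, \overleftarrow{s}_{i+1}), \qquad s_i = [\overrightarrow{s}_i, \overleftarrow{s}_i].$$

Here $\overrightarrow{s}_i$ and $\overleftarrow{s}_i$ are the hidden states of the forward and backward unidirectional LSTMs at the $i$th speech frame $x_i$, respectively, and $s_i$ is their concatenation, which will be used for alignment with text.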

We note that, although we do not focus on exploring speech encoders in this paper, we have experimented with various neural architectures similar to those in previous studies, such as a CNN with an LSTM [satt2017efficient] and an LSTM with attention [mirsamadi2017automatic]. We observe comparable results for these architectures when they are combined with the proposed multimodal model.

3.2 Text Encoder

For emotion recognition from human speech, the speech can be transcribed into text with an ASR system. In our study, instead of training an ASR system specific to the speech emotion recognition dataset, we use the public Google Cloud Speech API (https://cloud.google.com/speech-to-text/) to generate text from speech, which demonstrates the generality of the proposed approach. Note that our approach can tolerate some recognition errors, and this imperfect text is sufficient to train a model. We analyze the effects of ASR in Section 4.

Given a sequence of words, we first convert each word into an embedding vector $e_j$, and the sequence is represented as $[e_1, \dots, e_M]$, where $M$ is the number of words in the sentence. Then, we use a BiLSTM to model the text sequence. The hidden state $h_j$ of the BiLSTM encodes the $j$th word in the sequence and will be used for further multimodal alignment.


3.3 Attention Based Alignment

The attention network was originally proposed in a sequence-to-sequence setting, where a decoder learns which parts of the encoder output it should attend to and decodes one word at a time [bahdanau2014neural, chorowski2015attention]. In this study, rather than using attention for decoding, we use it to learn the alignment weights between speech frames and text words. This is similar to the self-attention approach in [vaswani2017attention], except that we learn the attention between two different sequences instead of within the same sequence.

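Specifically, the attention weight between the $i$th speech frame and the $j$th word is calculated from the hidden state $h_j$ of the text LSTM and the hidden state $s_i$ of the speech LSTM:

$$a_{j,i} = \tanh(u^{\top} s_i + v^{\top} h_j + b),$$

$$\alpha_{j,i} = \frac{\exp(a_{j,i})}{\sum_{t=1}^{N} \exp(a_{j,t})},$$

$$\tilde{s}_j = \sum_{i=1}^{N} \alpha_{j,i}\, s_i,$$

where $u$, $v$ and $b$ are trainable parameters. $\alpha_{j,i}$ is the normalized attention weight over the $N$ frames of the speech sequence, indicating the soft alignment strength between the $j$th word and the $i$th speech frame. $\tilde{s}_j$ is the weighted summation of hidden states from the speech LSTM, which is considered as an aligned speech feature vector corresponding to the $j$th word.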
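The soft alignment can be sketched in a few lines of plain Python. This is an illustration only, not the trained model: it assumes a single attention head and an additive score (a tanh over a weighted sum of the two hidden states plus a bias), and the parameter values and the function name `align_speech_to_words` are made up for the example.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def align_speech_to_words(speech_states, text_states, u, v, b):
    """For each word, return softmax weights over frames and the weighted sum of frame states."""
    weights, aligned = [], []
    dim = len(speech_states[0])
    for h in text_states:
        # Assumed additive score between this word and every speech frame.
        scores = [math.tanh(dot(u, s) + dot(v, h) + b) for s in speech_states]
        # Softmax over the speech frames (shifted by the maximum for numerical stability).
        top = max(scores)
        exps = [math.exp(a - top) for a in scores]
        total = sum(exps)
        alpha = [e / total for e in exps]
        # Aligned speech vector: attention-weighted sum of the speech hidden states.
        summary = [sum(alpha[i] * s[d] for i, s in enumerate(speech_states))
                   for d in range(dim)]
        weights.append(alpha)
        aligned.append(summary)
    return weights, aligned


speech = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 frames, 2-dim hidden states
text = [[0.5, -0.5], [1.0, 1.0]]               # 2 words, 2-dim hidden states
weights, aligned = align_speech_to_words(speech, text, u=[2.0, 0.0], v=[0.1, 0.4], b=0.0)
print(len(weights), len(weights[0]), len(aligned[0]))  # 2 3 2
```

The output has one aligned speech vector per word, so the speech sequence is effectively resampled to the length of the text sequence regardless of how many frames the utterance contains.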

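We then concatenate the aligned speech feature $\tilde{s}_j$ and the hidden state $h_j$ of the text LSTM to form a combined multimodal feature vector, which is fed into a multimodal BiLSTM for feature fusion:

$$c_j = \mathrm{BiLSTM}([\tilde{s}_j, h_j]),$$

where $c_j$ is the hidden state of the multimodal BiLSTM at the $j$th word.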


For emotion classification on a sequence, we apply a max-pooling layer over all hidden states $c_1, \dots, c_M$ of the multimodal BiLSTM to get a fixed-length vector and then use a fully-connected layer with rectified linear units (ReLUs) for non-linear transformation. The loss $\mathcal{L}$ for each example is computed using a softmax layer with cross entropy for $K$-class classification:

$$c = \mathrm{maxpool}(c_1, \dots, c_M), \qquad y = \mathrm{softmax}(g(W c)), \qquad \mathcal{L} = -\sum_{k=1}^{K} t_k \log y_k,$$

where $W$ is a trainable weight matrix, $g$ is a point-wise ReLU transformation, $y_k$ is the $k$th element in $y$, and $t_k = 1$ if the ground-truth label is $k$, else $t_k = 0$.
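The classification head can be traced with a toy example. This sketch assumes the order max-pooling, fully-connected layer, ReLU, softmax, which is one reading of the description above; the weight values are arbitrary and the helper names (`max_pool`, `classify`, `cross_entropy`) are hypothetical.

```python
import math


def max_pool(states):
    """Element-wise maximum over a sequence of equal-length hidden states."""
    return [max(column) for column in zip(*states)]


def relu(vec):
    """Point-wise rectified linear unit."""
    return [max(0.0, x) for x in vec]


def softmax(vec):
    """Normalize scores into class probabilities (shifted by the maximum for stability)."""
    top = max(vec)
    exps = [math.exp(x - top) for x in vec]
    total = sum(exps)
    return [e / total for e in exps]


def classify(states, weight):
    """Pool the sequence, apply one fully-connected layer with ReLU, then softmax."""
    pooled = max_pool(states)
    hidden = [sum(w * x for w, x in zip(row, pooled)) for row in weight]
    return softmax(relu(hidden))


def cross_entropy(probs, label):
    """Negative log-probability assigned to the ground-truth class index."""
    return -math.log(probs[label])


states = [[1.0, -2.0], [0.5, 3.0]]                 # 2 time steps, 2-dim hidden states
weight = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]    # 3 classes x 2 inputs (arbitrary values)
probs = classify(states, weight)
print(probs.index(max(probs)), cross_entropy(probs, 1))
```

Because max-pooling collapses the time axis, the classifier input has the same size for every utterance, whatever the number of words.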

4 Evaluations

We discuss the dataset, implementation details and experimental results in this section.

4.1 Data

We use the Interactive Emotional Dyadic Motion Capture database (IEMOCAP, https://sail.usc.edu/iemocap/index.html) [busso2008iemocap] for experiments. The dataset was recorded from ten actors and is divided into five sessions. Each dialog contains audio, transcriptions, video, and motion-capture recordings; we only use the audio in our study. Each session contains both improvised and scripted performances by two actors of different genders. The recorded dialogs have been segmented into utterances and labeled with 10 categories (angry, happy, sad, neutral, frustrated, excited, fearful, surprised, disgusted, other). Each utterance was annotated by three different evaluators. In our experiments, we use four emotions (angry, happy, neutral and sad) for classification, with four sessions for model training and the remaining session for testing. This setting is consistent with prior studies.

4.2 Implementation

For speech features, each utterance is sampled at 16 kHz, with durations ranging from 0.5 to about 20 seconds. The time-domain signal is split into 20 ms frames with 10 ms overlap. We use a Python library [giannakopoulos2015pyaudioanalysis] to extract a 34-dimensional feature vector from each frame, including MFCCs, zero-crossing rate, spectral spread, spectral centroid, etc.

For text features, as we mentioned before, we first use Google Cloud speech service to generate the text from speech signals. Based on the text transcripts provided by the IEMOCAP dataset, the word error rate of Google speech service is 14.7%. For word representation, we use a 300-dimensional GloVe embedding [pennington2014glove] as the pretrained text embedding.
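For reference, word error rate is conventionally computed as the word-level edit distance between the reference transcript and the recognized text, divided by the number of reference words. The sketch below follows that standard definition; it is not necessarily the exact scoring procedure (for example, the text normalization) used to obtain the figure above, and the function name is hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from the hypothesis
            insertion = curr[j - 1] + 1  # extra word in the hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("cat" -> "hat") and one deletion ("down") over 4 reference words.
print(word_error_rate("the cat sat down", "the hat sat"))  # 0.5
```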

To implement the model, we use 100 hidden units in each unidirectional LSTM in the speech encoder, the text encoder, and the multimodal encoder, so the dimensionality of a hidden state in a BiLSTM is 200. The attention network has 5 attention heads, each of which includes 40 weights. The fully-connected layer is a weight matrix whose dimensions correspond to the number of hidden states and the number of classes. To train the model, we use the Adam optimizer with a learning rate of 0.001.

We adopt two widely used metrics for evaluation: weighted accuracy (WA) that is the overall classification accuracy and unweighted accuracy (UA) that is the average recall over the emotion categories.
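The difference between the two metrics matters when the classes are imbalanced, as the following sketch shows. The labels are made up for illustration and the function names are hypothetical.

```python
from collections import defaultdict


def weighted_accuracy(y_true, y_pred):
    """Overall accuracy: correct predictions divided by all examples."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def unweighted_accuracy(y_true, y_pred):
    """Mean of per-class recalls, so every class counts equally."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        hits[t] += int(t == p)
    recalls = [hits[c] / totals[c] for c in totals]
    return sum(recalls) / len(recalls)


# Toy labels: a majority class ("neu") and two minority classes.
y_true = ["neu", "neu", "neu", "neu", "ang", "sad"]
y_pred = ["neu", "neu", "neu", "neu", "neu", "sad"]
print(weighted_accuracy(y_true, y_pred))    # 5 of 6 correct
print(unweighted_accuracy(y_true, y_pred))  # recalls 1, 0, 1 averaged
```

In this toy case WA is 5/6 while UA is 2/3, because the single missed "ang" example lowers the recall of an entire class to zero.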

4.3 Experiments

For comparison, we first train models on each single modality separately. For the speech modality, we use an LSTM to model the sequence of speech frames and an attention mechanism to learn a weighted sum over the sequence. This structure is the same as in [mirsamadi2017automatic] but with different speech features. We also report the results of CNN+LSTM in [satt2017efficient] and TDNN+LSTM in [sarma2018emotion] for comparison. For the text modality, we employ an LSTM with attention, which has the same structure as the text encoder in our approach.

We also compare our approach with other multimodal approaches. To combine speech and text, a straightforward way is to train an LSTM for each modality separately, and then use pooling or attention to aggregate the hidden states to obtain a fixed-length vector for each sequence. The two vectors can be concatenated together for the sequence level classification. This “Concat” approach is similar to the method in [yoon2018multimodal] but with different features, and we show the results in the paper for comparison.
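A minimal sketch of this sequence-level fusion is given below. It uses mean pooling as the aggregation step, which is only one of the options mentioned above, and the function names are hypothetical.

```python
def mean_pool(states):
    """Average a sequence of equal-length hidden states into one vector."""
    return [sum(column) / len(states) for column in zip(*states)]


def concat_features(speech_states, text_states):
    """Pool each modality on its own, then join the two fixed-length vectors."""
    return mean_pool(speech_states) + mean_pool(text_states)


speech_states = [[1.0, 3.0], [3.0, 5.0]]  # 2 frames, 2-dim hidden states
text_states = [[2.0], [4.0], [6.0]]       # 3 words, 1-dim hidden states
print(concat_features(speech_states, text_states))  # [2.0, 4.0, 4.0]
```

The two sequences meet only after each has been reduced to a single vector, so any correspondence between a particular word and the frames in which it was spoken is lost before fusion.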

| Methods | WA (%) | UA (%) |
|---|---|---|
| Speech-only | | |
| LSTM+Attn (our implementation) | 63.4 | 57.4 |
| LSTM+Attn (Mirsamadi et al., 2017) | 63.5 | 58.8 |
| CNN+LSTM (Satt et al., 2018) | 68 | 59.4 |
| TDNN+LSTM (Sarma et al., 2018) | 70.1 | 60.7 |
| Text-only (ASR text) | | |
| LSTM+Attn | 60.3 | 54.8 |
| Multimodal (ASR text) | | |
| Concat (our implementation) | 68.1 | 66.0 |
| Concat (Yoon et al., 2018) | 69.1 | - |
| Proposed | **70.4** | **69.5** |

Table 1: Comparison results on the IEMOCAP dataset using speech-only, text-only, and multimodal models. All experiments in this table use recognized text. Bold fonts indicate the best performance.

As shown in Table 1, “LSTM+Attn” in speech and “LSTM+Attn” in text are the two unimodal models corresponding to our multimodal approach. By combining speech and recognized text, the multimodal approaches boost both WA and UA, with the largest gain in UA. Among the multimodal methods, the proposed approach outperforms the direct concatenation approaches, showing the advantage of the learned alignment between speech and text. We also list the results reported in the original papers; the proposed approach achieves the best results on both WA and UA.

Since IEMOCAP provides text transcripts and word-level alignment, we conduct several experiments to analyze their influence. First, we keep the structure of the proposed model unchanged and only replace the recognized text with the transcripts. This is considered an upper bound for the proposed approach, as it uses the oracle text. Another experiment uses the oracle text together with the provided alignment. With the word-level alignment, the attention mechanism is not necessary. For each word in the text sequence, we simply average the hidden states of the speech LSTM over the corresponding frames and concatenate the result with the hidden state of the text LSTM. This is a hard-alignment version of the proposed approach. For comparison, we also use the oracle text to train a unimodal model and a concatenation model as in [yoon2018multimodal].
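The hard-alignment variant can be sketched as follows. The sketch assumes that each word's time boundaries have already been converted to a half-open range of frame indices (with a 10 ms frame shift, roughly 100 frames per second); the function names and the span format are hypothetical.

```python
def hard_align(speech_states, word_spans):
    """Average the speech hidden states inside each word's [start, end) frame range."""
    aligned = []
    for start, end in word_spans:
        segment = speech_states[start:end]
        if not segment:
            raise ValueError("each word must cover at least one frame")
        aligned.append([sum(column) / len(segment) for column in zip(*segment)])
    return aligned


def fuse(aligned_speech, text_states):
    """Concatenate each word's aligned speech vector with its text hidden state."""
    return [s + h for s, h in zip(aligned_speech, text_states)]


speech_states = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0], [7.0, 6.0]]  # 4 frames
word_spans = [(0, 2), (2, 4)]                                      # 2 words
text_states = [[0.5], [0.25]]
print(fuse(hard_align(speech_states, word_spans), text_states))
# [[2.0, 1.0, 0.5], [6.0, 5.0, 0.25]]
```

Compared with the attention-based alignment, every frame inside a word's span receives the same weight and frames outside the span receive none.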

| Methods | WA (%) | UA (%) |
|---|---|---|
| Oracle text | | |
| Text-only | 63.3 | 57.8 |
| Concat (our implementation) | 71 | 67.7 |
| Concat (Yoon et al., 2018) | 71.8 | - |
| Hard alignment | 71.5 | 68.6 |
| Proposed | **72.5** | **70.9** |
| Proposed (ASR text) | 70.4 | 69.5 |

Table 2: Experiment results on the oracle text and alignment. The proposed approach with the recognized text is shown in the last row for reference. Bold fonts indicate the best performance.

Table 2 shows the results using the provided transcripts and alignment. Compared with the results in Table 1, the oracle text improves WA by around 3% (absolute) for both the text-only method and the direct concatenation method. Yoon et al. [yoon2018multimodal] also used the oracle text in their experiments and achieved slightly better results than our implementation. The proposed approach with the oracle text achieves the best results on the dataset, showing that further improvement can be obtained with more accurate speech recognition. It is interesting to compare the proposed attention alignment with the hard alignment. Although the hard alignment approach uses the ground-truth alignment to aggregate the speech features, its performance is lower than that of the attention-based method, suggesting that the attention network is optimized for emotion recognition rather than speech recognition.

5 Conclusions

In this paper, we address emotion recognition from speech. With an ASR system, we can generate text from speech signals and build a multimodal model for emotion recognition. We propose an attention mechanism to learn the alignment between the original speech and the recognized text, which is then used to fuse features from the two modalities. The fused features are fed into a sequence model for emotion classification. The experimental results show that the proposed approach outperforms the other approaches and achieves state-of-the-art results on the dataset.