Towards adversarial learning of speaker-invariant representation for speech emotion recognition

by   Ming Tu, et al.

Speech emotion recognition (SER) has attracted great attention in recent years due to the high demand for emotionally intelligent speech interfaces. Deriving speaker-invariant representations for speech emotion recognition is crucial. In this paper, we propose to apply adversarial training to SER to learn speaker-invariant representations. Our model consists of three parts: a representation learning sub-network with time-delay neural network (TDNN) and LSTM layers followed by statistical pooling, an emotion classification network and a speaker classification network. Both the emotion and speaker classification networks take the output of the representation learning network as input. Two training strategies are employed: one based on domain adversarial training (DAT) and the other based on cross-gradient training (CGT). Besides the conventional data set, we also evaluate our proposed models on a much larger publicly available emotion data set with 250 speakers. Evaluation results show that on IEMOCAP, DAT and CGT provide 5.6% and 7.4% improvement, respectively, over a baseline system without speaker-invariant representation learning on 5-fold cross validation. On the larger emotion data set, while CGT fails to yield better results than the baseline, DAT can still provide 9.8% relative improvement on a standalone test set.





1 Introduction

With intelligent speech assistants such as Alexa, Google Home, Siri and Cortana being used in our daily lives more than ever, we still notice a performance gap between these machine dialogues and human interactions, because these systems lack the capability to recognize our emotions and react to them as a human partner would. Therefore, demand is rising for speech emotion recognition (SER) to empower dialogue systems to respond in emotionally intelligent ways, especially for customer service chatbots. However, SER is challenging due to mismatches between training and testing data in terms of speaker variations, recording environments/channels, etc. It is impractical to train one SER system that covers every application scenario.

Various methods have been proposed in the SER literature to tackle the negative effect of data variations and domain mismatch. Some studies resort to extracting richer features that represent emotion variation or to applying more powerful deep neural networks (DNN) with careful architecture designs, expecting better generalization to unseen scenarios [1, 2]. Another research direction deals explicitly with the domain mismatch problem by compensating for data variations with either robust feature learning or model/training strategy design [3, 4]. Ideas from domain adaptation in the general machine learning literature are borrowed to help similar tasks in the SER field.

Generative Adversarial Networks (GANs) have achieved much success in speech applications [5, 6, 7, 8]. Domain adaptation techniques developed from GANs have also been applied to common domain mismatch problems, such as automatic speech recognition (ASR) [9, 10, 11], cross-corpus speaker recognition [12] and SER [13]. All these studies are based on the domain adversarial training (DAT) proposed in [14]. DAT was applied to unsupervised domain adaptation by training the main task and a domain classifier at the same time. A gradient reversal layer (GRL) is inserted before the domain classifier so that the learned representation confuses the domain classifier while still accomplishing the main task well. Representations learned in this way are more robust to domain shifts and variations, as has been shown in the aforementioned studies.

In this work, we propose to apply adversarial training to SER. To deal with speaker variations, we aim to learn speaker-invariant representations for SER and expect the representation learning network to generalize well to unseen test speakers. Our model consists of three parts: a representation learning network with time-delay neural network (TDNN) and LSTM layers followed by a statistical pooling layer (a variation of the popular x-vector [15] for speaker recognition in Kaldi [16]), an emotion classification network and a speaker classification network. Both the emotion and speaker classification networks take the output of the representation learning network as input. Two adversarial training strategies are employed to achieve speaker-invariant representations: one based on the original DAT [14], and the other based on cross-gradient training (CGT) [17]. In contrast to the GRL used in DAT, CGT develops a domain-guided perturbation that serves as data augmentation during training, and [17] shows that this training strategy can generalize well to unseen domains. We evaluate the proposed systems on two data sets: one is the commonly used Interactive Emotional Dyadic Motion Capture (IEMOCAP) data set, and the other is a publicly available Mandarin speech emotion data set collected by SpeechOcean, which has 250 speakers in total (many more than IEMOCAP). Evaluation results show that on IEMOCAP, DAT and CGT provide 5.6% and 7.4% improvement, respectively, over a baseline SER system without speaker-invariant representation learning on 5-fold cross validation. On the larger Mandarin speech emotion data set, while CGT fails to yield better results than the baseline, DAT can still provide 9.8% relative improvement on a standalone test set.

Relation to prior work: Our work is related to previous studies that use DAT-based domain adaptation for speech applications [9, 10, 12, 13]. These studies mainly focus on cross-domain tasks and solve them as domain adaptation problems. Such work requires unlabeled target-domain data for adversarial training. Our primary task, however, is to derive speaker-invariant representations for SER that generalize well to unseen speakers (closer to a domain generalization problem). Furthermore, we do not assume that target speaker data (either labeled or unlabeled) is available during training. Different from [11], which uses adversarial training to derive speaker-invariant features for ASR, we also employ another training strategy, CGT, besides DAT. Instead of using a GRL to achieve adversarial learning, CGT uses the gradients of adversarial tasks to guide perturbations of the input to achieve domain generalization [17]. To the best of our knowledge, this paper is the first work to apply CGT to SER. Our work also differs from previous work on domain adaptation/generalization for SER or cross-corpus SER [3, 4], in that we employ adversarial training strategies to achieve generalization to unseen speakers.

2 Systems Description

Assume a data set $\mathcal{D} = \{(\mathbf{X}_i, \mathbf{y}_i^e, \mathbf{y}_i^s)\}_{i=1}^{N}$, where $\mathbf{X}_i \in \mathbb{R}^{d \times T_i}$ is the sequence of Mel Frequency Cepstral Coefficients (MFCC) of utterance $i$ ($d$ is the MFCC dimension and $T_i$ is the sequence length), and $\mathbf{y}_i^e$ and $\mathbf{y}_i^s$ are the one-hot encoded emotion label and speaker label of utterance $i$. Our goal is to derive a fixed-length embedding $\mathbf{h}_i$ for utterance $i$ that discriminates speech emotions well while being independent of speaker identity, i.e., to train an embedding network for SER that is robust to unseen speakers.

2.1 Model architecture

To achieve speaker-invariant embeddings, we propose a network in a multi-task learning setting that utilizes both the emotion and speaker labels of the training set. Figure 1 illustrates the network architecture used in this study. The model consists of three sub-networks. The embedding sub-network takes a sequence of MFCC features $\mathbf{X}$ as input. Two TDNN layers and one layer of bidirectional recurrent neural network (RNN) with Long Short-Term Memory (LSTM) nodes are in charge of sequential feature extraction. Then, the mean and standard deviation are calculated along the sequence and concatenated to derive a fixed-dimension utterance-level embedding $\mathbf{h}$, which is the input to the two following sub-networks that perform the corresponding classification tasks. The left sub-network outputs the predicted emotion label $\hat{\mathbf{y}}^e$ and the right sub-network outputs the predicted speaker label $\hat{\mathbf{y}}^s$. Both are fully connected networks with two hidden layers. Without the speaker recognition (SR) sub-network, the model becomes a single-task SER system; with normal DNN training for both SER and SR, it becomes a multi-task learning system [18]. The next subsections introduce two adversarial training strategies for the same model architecture.

Figure 1: Model architecture
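The statistics pooling step described above, which turns a variable-length sequence of frame-level features into a fixed-dimension vector, can be sketched in plain Python as follows. This is a minimal illustration rather than the authors' implementation: the function name is hypothetical, and the use of the population standard deviation (dividing by the number of frames) is an assumption, since the text does not say which estimator is used.

```python
import math


def statistics_pooling(frames):
    """Collapse a sequence of frame vectors into one fixed-length vector.

    frames: list of T frame-level feature vectors, each a list of D floats.
    Returns 2*D floats: the per-dimension mean followed by the
    per-dimension standard deviation.
    """
    if not frames:
        raise ValueError("need at least one frame")
    num_frames = len(frames)
    dim = len(frames[0])
    means = [sum(f[d] for f in frames) / num_frames for d in range(dim)]
    # Assumption: population standard deviation (divide by T, not T - 1).
    stds = [
        math.sqrt(sum((f[d] - means[d]) ** 2 for f in frames) / num_frames)
        for d in range(dim)
    ]
    return means + stds


# Two utterances of different length map to embeddings of the same size.
short_utt = [[1.0, 2.0], [3.0, 2.0]]
long_utt = [[1.0, 0.0], [2.0, 0.0], [3.0, 0.0], [4.0, 0.0], [5.0, 0.0]]
emb_short = statistics_pooling(short_utt)
emb_long = statistics_pooling(long_utt)
```

Both embeddings have length 4 (twice the frame dimension) regardless of the number of frames, which is what allows the two fully connected classifiers to operate on utterances of any duration.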

2.2 Domain adversarial training (DAT)

The key idea of DAT is to treat the speaker label $\mathbf{y}^s$ as the domain label. The right sub-network in Figure 1 then becomes a domain classifier. In order to learn domain-invariant representations between source and target domains, DAT inserts a gradient reversal layer (GRL) that reverses the gradient flowing from the right sub-network to the embedding sub-network by multiplying the gradients there by a negative value $-\lambda$; in the forward pass there is no such effect and the GRL acts as an identity mapping. In this way, the embedding $\mathbf{h}$ tends to confuse the SR sub-network while still predicting the emotion label well. The loss of DAT can be defined as:
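One form consistent with the description above, following the saddle-point objective of [14], is given below. The split of the parameters into embedding, emotion and speaker parts ($\theta_f$, $\theta_e$, $\theta_s$) and the exact symbol names are assumptions made for illustration; the emotion loss is minimized while the speaker loss enters with a reversed, $\lambda$-scaled sign for the embedding parameters.

```latex
% Assumed notation: \theta = \{\theta_f, \theta_e, \theta_s\} are the parameters of the
% embedding, emotion and speaker sub-networks; \lambda > 0 is the GRL weight.
\mathcal{L}_{\mathrm{DAT}}(\theta)
  = \mathcal{L}_e(\theta_f, \theta_e) - \lambda \, \mathcal{L}_s(\theta_f, \theta_s)

% Saddle point sought by training, as in [14]:
(\hat{\theta}_f, \hat{\theta}_e)
  = \arg\min_{\theta_f, \theta_e} \mathcal{L}_{\mathrm{DAT}}(\theta_f, \theta_e, \hat{\theta}_s),
\qquad
\hat{\theta}_s
  = \arg\max_{\theta_s} \mathcal{L}_{\mathrm{DAT}}(\hat{\theta}_f, \hat{\theta}_e, \theta_s)

% Gradient reversal layer R acting on the embedding h:
% identity in the forward pass, sign-flipped and scaled gradient in the backward pass.
R(\mathbf{h}) = \mathbf{h},
\qquad
\frac{\partial R}{\partial \mathbf{h}} = -\lambda \mathbf{I}
```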

where $\theta$ denotes the model parameters and $\mathcal{L}_e$ is the loss of the emotion classifier (likewise $\mathcal{L}_s$ for the speaker classifier). The original DAT requires unlabelled data from, or similar to, the target domain to achieve domain adaptation. Previous studies use either recording condition or corpus identity as domain labels to deal with environment or corpus variation. This paper applies DAT to learn speaker-invariant representations by joint adversarial training on training data with both emotion and speaker labels.

2.3 Cross-gradient training (CGT)

CGT was proposed in [17] to solve the domain generalization problem. It removes the requirement for target-domain data during training and develops a scheme that can generalize to unseen domains at test time. Instead of aiming to reduce domain-specific information in the embedding $\mathbf{h}$, CGT introduces domain-guided perturbations of the input based on the gradients of the adversarial tasks. The emotion recognition sub-network can then be trained on both the original input and the perturbed input produced by this data augmentation during batch training. The domain-guided perturbation of the inputs makes the model cover more domain variations and be more robust to potential domain shifts during testing. The basic training procedure of CGT is summarized as follows:
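A sketch of the procedure, adapted from the cross-gradient algorithm of [17] by mapping its label classifier to the emotion task and its domain classifier to the speaker task, is given below. The symbols $\epsilon$ (perturbation step size), $\alpha$ (augmentation weight), $\eta$ (learning rate) and the parameter groups $\theta_e$, $\theta_s$ are assumptions, as is the use of a single $\epsilon$ and $\alpha$ for both tasks; how the shared embedding parameters are assigned to the two updates is not specified here.

```latex
% Perturbed inputs: each one moves the input X along the gradient of the
% other task's loss.
\mathbf{X}_s = \mathbf{X} + \epsilon \, \nabla_{\mathbf{X}} \mathcal{L}_s(\mathbf{X}, \mathbf{y}^s; \theta_s)
\mathbf{X}_e = \mathbf{X} + \epsilon \, \nabla_{\mathbf{X}} \mathcal{L}_e(\mathbf{X}, \mathbf{y}^e; \theta_e)

% Parameter updates: each task is trained on a mix of the original input
% and the input perturbed by the other task.
\theta_e \leftarrow \theta_e - \eta \, \nabla_{\theta_e} \big[ (1 - \alpha) \, \mathcal{L}_e(\mathbf{X}, \mathbf{y}^e; \theta_e)
    + \alpha \, \mathcal{L}_e(\mathbf{X}_s, \mathbf{y}^e; \theta_e) \big]
\theta_s \leftarrow \theta_s - \eta \, \nabla_{\theta_s} \big[ (1 - \alpha) \, \mathcal{L}_s(\mathbf{X}, \mathbf{y}^s; \theta_s)
    + \alpha \, \mathcal{L}_s(\mathbf{X}_e, \mathbf{y}^s; \theta_s) \big]
```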

where $\mathbf{X}_s$ and $\mathbf{X}_e$ are the perturbed inputs. Eq. 4 gives the parameter update formula. It has been shown in [17] that CGT is more stable and easier to train than DAT.

3 Experimental Setup

3.1 Data preparation

IEMOCAP: IEMOCAP [19] was collected in 5 sessions, each of which has one female and one male speaker in both scripted and improvised scenarios. Categorical emotion annotations include 9 classes. In this paper, we use only the improvised recordings. We merge happy and excitement into one class to achieve a more balanced label distribution. In total there are 4 emotion classes {happy, sad, angry, neutral} and 2943 utterances. To obtain a more reliable evaluation, 5-fold cross validation is employed, with 4 sessions for training, one speaker in the remaining session for validation and the other speaker for testing.

Mandarin speech emotion data set: This data set was collected by SpeechOcean with 250 recruited speakers. Each speaker was asked to read 240-260 sentences in Mandarin with four emotions {happy, sad, angry, surprise}, together with laughing and crying. The four emotion classes were balanced so that each has almost the same number of utterances. Mobile phones of different brands and with different operating systems were used as recorders. This data set is suitable for speech emotion recognition research because it has many more speakers and utterances than the other available speech emotion data sets in the literature; it also has a very balanced emotion class distribution. For this study, 53803 utterances (durations: 2.80 ± 0.68 seconds) with four emotions from all 250 speakers are employed. We randomly picked 200 speakers (43202 utterances) for training, 25 speakers (5407 utterances) for validation and 25 speakers (5194 utterances) for testing.

Feature extraction: 13-dimensional MFCCs together with their first- and second-order derivatives are extracted with Kaldi. Simple energy-based voice activity detection is used to remove silences and reduce their impact on statistics pooling. Each utterance is thus represented by a sequence of 39-dimensional MFCC features, an emotion label and a speaker label. For both data sets, speakers in the validation and testing sets are never seen in training.
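The requirement that validation and testing speakers never appear in training amounts to splitting by speaker rather than by utterance. A minimal sketch of such a split is shown below; the function name, the fixed random seed and the toy corpus are hypothetical and are not taken from the paper.

```python
import random


def split_by_speaker(utterances, n_val, n_test, seed=0):
    """Split (utterance_id, speaker_id) pairs so that no speaker is shared.

    The seed and the order in which held-out speakers are drawn are
    illustrative assumptions.
    """
    speakers = sorted({spk for _, spk in utterances})
    if n_val + n_test >= len(speakers):
        raise ValueError("not enough speakers left for training")
    rng = random.Random(seed)
    rng.shuffle(speakers)
    test_speakers = set(speakers[:n_test])
    val_speakers = set(speakers[n_test:n_test + n_val])

    splits = {"train": [], "val": [], "test": []}
    for utt_id, spk in utterances:
        if spk in test_speakers:
            splits["test"].append((utt_id, spk))
        elif spk in val_speakers:
            splits["val"].append((utt_id, spk))
        else:
            splits["train"].append((utt_id, spk))
    return splits


# Toy corpus: 10 hypothetical speakers with 3 utterances each.
toy = [(f"spk{s:02d}_utt{u}", f"spk{s:02d}") for s in range(10) for u in range(3)]
splits = split_by_speaker(toy, n_val=2, n_test=2)
speaker_sets = {name: {spk for _, spk in items} for name, items in splits.items()}
```

With 250 speakers, `n_val=25` and `n_test=25` would correspond to the 200/25/25 speaker partition described for the Mandarin data set.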

Sub-network             IEMOCAP              Mandarin SE dataset
Embedding               TDNN 128-5-2         TDNN 128-5-2
                        TDNN 64-3-4          TDNN 128-3-4
                        Bi-LSTM 64           Bi-LSTM 128
                        FC 256               FC 512
Statistics pooling      256-dim mean         512-dim mean
                        + 256-dim std        + 512-dim std
Emotion classification  FC 512-64-64-4       FC 1024-128-128-4
Speaker classification  FC 512-64-64-8       FC 1024-128-128-200
Table 1: Model configurations for two data sets.

3.2 Model configuration

In Table 1, we show the model configurations for the two data sets in terms of the three sub-networks. Our model is implemented with PyTorch [20]. For the embedding sub-network, each TDNN layer is realized by a 1-dimensional convolution, and “TDNN N-K-D” denotes N output nodes, a K×1 kernel and dilation D (for example, “TDNN 128-5-2” means 128 output nodes, a 5×1 kernel and a dilation of 2). “Bi-LSTM N” means we use a bidirectional RNN with N LSTM nodes. We also add an FC layer after the output of the bidirectional LSTM to increase the feature dimension and avoid too much information loss from statistics pooling over the whole utterance. The dimension is set to 256 for IEMOCAP and 512 for the Mandarin speech emotion data set. We calculate the mean and standard deviation in the statistics pooling step. For the emotion and speaker classification sub-networks, we use an FC network configured as “D1-D2-D3-D4”, where D1 is the number of input nodes, D2 and D3 are the numbers of hidden nodes in each layer and D4 is the number of output nodes. Both TDNN and FC layers (except for output layers) are followed by ReLU activation function, batch normalization (applied to RNN) and dropout with 0.5 keep probability (applied to RNN). We use stochastic gradient descent for optimization, with a learning rate of 1E-03 and Nesterov momentum (factor 0.9). All experiments run for 100 epochs, and the epoch with the best performance on the validation set is saved for evaluation on the testing set. We have four models in total: the SER-only model, SER and SR in a multi-task learning (MTL) setting, the DAT setting and the CGT setting.
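As a small worked example of the “TDNN N-K-D” notation, the sketch below parses the layer strings from Table 1 and computes how many input frames one output frame depends on when the layers are stacked. It assumes stride 1 and no other temporal operations between the layers, neither of which is stated in the text; the helper names are hypothetical.

```python
def parse_tdnn(spec):
    """Parse a string such as 'TDNN 128-5-2' into (output_nodes, kernel, dilation)."""
    name, dims = spec.split()
    if name != "TDNN":
        raise ValueError(f"not a TDNN layer spec: {spec!r}")
    output_nodes, kernel, dilation = (int(v) for v in dims.split("-"))
    return output_nodes, kernel, dilation


def receptive_field(specs):
    """Number of input frames seen by one output frame of the stacked layers.

    Assumes every layer is a dilated 1-D convolution with stride 1, so each
    layer widens the context by (kernel - 1) * dilation frames.
    """
    field = 1
    for spec in specs:
        _, kernel, dilation = parse_tdnn(spec)
        field += (kernel - 1) * dilation
    return field


iemocap_layers = ["TDNN 128-5-2", "TDNN 64-3-4"]
mandarin_layers = ["TDNN 128-5-2", "TDNN 128-3-4"]
iemocap_context = receptive_field(iemocap_layers)
mandarin_context = receptive_field(mandarin_layers)
```

Under these assumptions both configurations cover 1 + (5 - 1)·2 + (3 - 1)·4 = 17 consecutive MFCC frames per TDNN output frame; the two configurations differ only in the number of output nodes.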

4 Results and Discussion

Model        Val              Test
SER_only     57.0% (±1.0%)    53.9% (±1.0%)
SER_SR_MTL   56.6% (±1.1%)    54.9% (±0.6%)
SER_SR_DAT   56.6% (±1.2%)    56.5% (±1.4%)
SER_SR_CGT   55.6% (±0.8%)    57.3% (±1.2%)
Table 2: Performance comparison on IEMOCAP

In Table 2, we show the accuracies of the 4 models when classifying 4 emotions in IEMOCAP, on both the validation and testing sets of all 5 folds together. All numbers are averages over 5 runs, with the standard deviation in parentheses. The model trained with SER only achieves the highest performance on the validation set, possibly due to over-tuning. However, it gives the lowest accuracy on the test set, which means that it generalizes poorly. For MTL, the performance gap between the validation and testing sets is smaller, and there is some improvement over the single-task model on the testing set. DAT produces almost no performance gap between the validation and testing sets. CGT achieves the highest accuracy on the testing set even though its models do not perform very well on the validation set. This verifies that even without adaptation data, DAT can learn speaker-invariant representations that generalize to unseen testing speakers. We also observe that CGT can achieve domain generalization in SER on unseen speakers.

Model        Val              Test
SER_only     83.7% (±0.7%)    81.7% (±1.2%)
SER_SR_MTL   82.5% (±0.7%)    80.9% (±0.3%)
SER_SR_DAT   84.9% (±1.2%)    83.5% (±0.6%)
SER_SR_CGT   82.0% (±0.6%)    81.1% (±0.3%)
Table 3: Performance comparison on Mandarin speech emotion data set

Table 3 gives the accuracy comparison among the same 4 models on the Mandarin speech emotion data set. All numbers are averages over 3 runs. On the large Mandarin speech emotion data set, a trend similar to that on IEMOCAP can be observed: the accuracy gap between the validation and testing sets is decreasing, although for all models the accuracy on the test set is slightly lower than on the validation set. The MTL model shows no gain over the single-task model. DAT gives the highest performance on both the validation and test sets. This makes sense because when there is a large number of domain labels (speakers in this study), some training speakers may be similar to speakers in the validation and testing sets (the target domain). This is equivalent to providing DAT with target-domain data, which yields an overall improvement of the model. However, CGT is unable to beat the baseline system on this data set. The authors of [17] observed similar results when the number of domain labels is large, and commented that in this case the training data already covers more domain variations, so the augmentation-based training strategy shows no improvement. This explanation aligns with our experimental results: on the small IEMOCAP data set, with only 8 speaker labels during training, CGT yields better generalization; on the large Mandarin speech emotion data set, with 200 speaker labels during training, it fails to bring any benefit. We also show in Figure 2 that the embedding generated by our proposed adversarial training is speaker-irrelevant. For space considerations, we only show the t-SNE plots [21] of the embeddings learnt by the DAT model. It is obvious that DAT can remove speaker information from the embeddings, which allows generalization to new testing speakers.

Figure 2: t-SNE plots of validation set embeddings with both emotion labels (left) and speaker labels (right, we randomly picked 4 speakers).
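The text does not state how the quoted improvements of 5.6%, 7.4% and 9.8% are computed. The test-set accuracies in Tables 2 and 3 reproduce these figures if they are read as relative reductions in error rate with respect to the SER_only baseline; this reading is an inference from the numbers rather than a definition given by the authors. The short calculation below, with a hypothetical helper name, makes the arithmetic explicit.

```python
def relative_error_reduction(baseline_acc, system_acc):
    """Relative reduction in error rate, in percent.

    Accuracies are given in percent; the error rate is 100 - accuracy.
    """
    baseline_err = 100.0 - baseline_acc
    return 100.0 * (system_acc - baseline_acc) / baseline_err


# Test accuracies taken from Table 2 (IEMOCAP) and Table 3 (Mandarin).
iemocap_dat = relative_error_reduction(53.9, 56.5)
iemocap_cgt = relative_error_reduction(53.9, 57.3)
mandarin_dat = relative_error_reduction(81.7, 83.5)

print(round(iemocap_dat, 1), round(iemocap_cgt, 1), round(mandarin_dat, 1))
# prints: 5.6 7.4 9.8
```

In absolute terms the same gains are 2.6, 3.4 and 1.8 percentage points of accuracy, respectively.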

5 Conclusion

In this paper, we propose to use two adversarial training schemes, DAT and CGT, to achieve speaker-invariant representations for speech emotion recognition. While DAT aims to reduce the domain information in representation learning of speech emotion, CGT tackles the problem from a domain generalization perspective, based on domain-guided data augmentation during training. Experiments on the small IEMOCAP data set and a larger Mandarin speech emotion database show that even without data from target speakers, DAT can still provide gains when testing on new speakers. Although CGT shows no improvement on the larger data set, it can still generalize better than DAT when the number of domain labels is small.