In many practical applications of emotion recognition, the target users are often absent from the training corpus. The mismatch between training speakers and testing speakers leads to a performance degradation of the trained models. Therefore, it is vital to build speaker-independent systems for emotion recognition.
To ensure the model is speaker-independent, prior works focus on the data split strategies [1, 2]. These methods ensure no speaker overlap between the training set and the testing set. For example, the IEMOCAP dataset contains five sessions, and each session has different actors. Hazarika et al. used utterances from the first four sessions for training and those from the last session for testing. However, it is unclear whether these methods can actually learn speaker-independent representations.
To deal with these problems, we focus on the domain adversarial neural network (DANN) for emotion recognition. Commonly, the mismatch in data distribution between the train (source domain) and test (target domain) sets leads to a performance degradation. DANN relies on adversarial training for domain adaptation, aiming to reduce the mismatch between the source domain and the target domain. In this paper, we use DANN to reduce the mismatch between different speakers, thus ensuring that the model can learn speaker-independent representations. Furthermore, different from previous unsupervised learning methods, DANN can extract useful information from the unlabeled data while retaining information that is discriminative for emotion recognition.
In this paper, we present a DANN based framework for emotion recognition. The main contributions of this paper lie in three aspects: 1) to deal with low-resource training samples, the proposed method can extract useful information from the unlabeled data while retaining emotion-related information; 2) to reduce the mismatch between speakers, the proposed method ensures the model can learn speaker-independent representations; 3) our proposed method is superior to other state-of-the-art approaches for emotion recognition. To the best of our knowledge, it is the first time that DANN is used for emotion recognition.
2 Proposed Method
In this paper, we propose a multimodal learning framework for emotion recognition. As shown in Fig. 1, the proposed framework consists of three components: the feature encoder, the domain classifier and the emotion classifier.
2.1 Problem Definition
Dataset: We have a training set with annotated emotional data, and a testing set with unlabeled data. Following previous experimental settings, speaker identities are available for both the training set and the testing set.
Task: Define a conversation $U = \{u_1, u_2, \dots, u_L\}$, where $u_i$ is the $i$-th utterance in the conversation and $L$ is the total number of utterances. The primary task is to predict the emotion label of each utterance. The secondary task is to learn a common representation in which speaker identities cannot be distinguished.
2.2 Feature Encoder
The feature encoder contains two key components: the Audio-Text Fusion component (AT-Fusion) for multi-modalities fusion and the Self-Attention based Gated Recurrent Unit (SA-GRU) for contextual feature extraction.
Multi-modalities Fusion (AT-Fusion): Different modalities contribute differently to emotion recognition. To focus on the important modalities, we utilize the attention mechanism for multi-modality fusion. Specifically, we first extract acoustic features and lexical features from each utterance. Then we equalize the dimensions of these features to size $d$ using two fully-connected layers, respectively. This provides the final acoustic features $a_i \in \mathbb{R}^{d}$ and lexical features $t_i \in \mathbb{R}^{d}$ for the utterance $u_i$. AT-Fusion takes $a_i$ and $t_i$ as inputs, and outputs the attention vector $\alpha_f$ over these modalities. Finally, the fusion representation $f_i$ is generated as follows:

$\alpha_f = \mathrm{softmax}\big(w^{\top}\tanh(W M_i)\big) \quad (1)$

$f_i = M_i \, \alpha_f \quad (2)$

where $M_i = [a_i, t_i] \in \mathbb{R}^{d \times 2}$ stacks the two modality features as columns, and $W$ and $w$ are trainable parameters. Here, $\alpha_f \in \mathbb{R}^{2}$ and $f_i \in \mathbb{R}^{d}$. This multimodal representation is generated for all $L$ utterances in the conversation $U$, marked as $F = [f_1, f_2, \dots, f_L]$.
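A minimal sketch of one common attention-fusion formulation (softmax over per-modality scores; the `tanh` scoring function and all variable names here are our illustrative assumptions, not the paper's exact definition):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def at_fusion(a_i, t_i, W, w):
    """Attention-based fusion of acoustic (a_i) and lexical (t_i) features.

    a_i, t_i : (d,) modality features after the dimension-equalizing layers
    W        : (d, d) trainable projection
    w        : (d,) trainable scoring vector
    Returns the fused representation f_i of shape (d,).
    """
    M = np.stack([a_i, t_i], axis=1)       # (d, 2): one column per modality
    scores = w @ np.tanh(W @ M)            # (2,): one score per modality
    alpha = softmax(scores)                # attention weights over the two modalities
    return M @ alpha                       # (d,): attention-weighted sum

# toy usage
rng = np.random.default_rng(0)
d = 4
f = at_fusion(rng.normal(size=d), rng.normal(size=d),
              rng.normal(size=(d, d)), rng.normal(size=d))
print(f.shape)  # (4,)
```

The fused vector is a convex combination of the two modality features, so a modality with a higher attention score dominates the representation.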
Contextual Feature Extraction (SA-GRU): SA-GRU combines a bi-directional GRU (bi-GRU) with the self-attention mechanism to amplify the important contextual evidence for emotion recognition. Specifically, the multimodal representations $F$ are given as inputs to the bi-GRU. Outputs of this layer form $H = [h_1, h_2, \dots, h_L]$, where $h_i$ is the hidden state for utterance $u_i$. Then $H$ is fed into the self-attention network, which consists of a multi-head attention to extract the cross-position information. Each head $\mathrm{head}_j$ ($j = 1, \dots, n$, where $n$ is the number of heads) is generated using the scaled inner product as follows:

$\mathrm{head}_j = \mathrm{softmax}\!\left(\frac{(H W_j^{Q})(H W_j^{K})^{\top}}{\sqrt{d_k}}\right) H W_j^{V} \quad (3)$

where $W_j^{Q}$, $W_j^{K}$ and $W_j^{V}$ are trainable parameters and $d_k$ is the per-head dimension. Then the outputs of the $n$ attention heads are concatenated together as the final values:

$C = [\mathrm{head}_1; \mathrm{head}_2; \dots; \mathrm{head}_n] \quad (4)$

As this contextual representation is generated for all utterances in the conversation $U$, it can also be written as $C = [c_1, c_2, \dots, c_L]$, where $c_i$ is the contextual representation of utterance $u_i$.
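A minimal sketch of multi-head scaled dot-product self-attention in the standard formulation of Vaswani et al. (the per-head projection lists and function names are our assumptions):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(H, Wq, Wk, Wv):
    """Scaled dot-product self-attention over bi-GRU outputs H.

    H          : (L, d) hidden states, one row per utterance
    Wq, Wk, Wv : lists of n per-head projections, each of shape (d, d_k)
    Returns the concatenated heads C of shape (L, n * d_k).
    """
    heads = []
    for Wq_j, Wk_j, Wv_j in zip(Wq, Wk, Wv):
        Q, K, V = H @ Wq_j, H @ Wk_j, H @ Wv_j
        d_k = Q.shape[-1]
        A = softmax(Q @ K.T / np.sqrt(d_k))   # (L, L) cross-position weights
        heads.append(A @ V)                   # (L, d_k) per-head output
    return np.concatenate(heads, axis=-1)     # concatenated heads: (L, n * d_k)

# toy usage: 6 utterances, 8-dim states, 4 heads of size 2
rng = np.random.default_rng(0)
Wq = [rng.normal(size=(8, 2)) for _ in range(4)]
Wk = [rng.normal(size=(8, 2)) for _ in range(4)]
Wv = [rng.normal(size=(8, 2)) for _ in range(4)]
C = multi_head_self_attention(rng.normal(size=(6, 8)), Wq, Wk, Wv)
print(C.shape)  # (6, 8)
```

Each row of the attention matrix sums to one, so every utterance's contextual representation is a weighted mixture over all positions in the conversation.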
2.3 Domain Adversarial Neural Network for Emotion Recognition
DANN is trained using labeled data from the training set and unlabeled data from the testing set. The network learns two classifiers – the emotion classifier and the domain classifier. Both classifiers share the feature encoder that determines the representations of the data used for classification. The approach introduces a gradient reversal layer between the domain classifier and the feature encoder. This layer passes the data through unchanged during forward propagation and inverts the sign of the gradient during backward propagation. Therefore, DANN attempts to minimize the emotion classification error and maximize the domain classification error. By considering these two goals, the model ensures a discriminative representation for emotion recognition, while making the samples from different speakers indistinguishable.
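A minimal framework-free sketch of the gradient reversal behavior (real implementations hook the sign flip into an autograd engine; the class and method names here are our assumptions):

```python
import numpy as np

class GradReverse:
    """Gradient reversal layer: identity in the forward pass,
    -lambda * grad in the backward pass."""

    def __init__(self, lam=1.0):
        self.lam = lam

    def forward(self, x):
        return x                            # pass data through unchanged

    def backward(self, grad_output):
        return -self.lam * grad_output      # invert (and scale) the gradient
```

Because of this layer, the feature encoder receives gradients that push it to increase the domain loss, while the domain classifier itself still minimizes that loss — the adversarial game described above.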
In our proposed method, we train the emotion recognition task with the training set, for which we have emotion labels. For the domain classifier, we train the classifier with data from both the training set and the testing set. Notice that the domain classifier does not require emotion labels, so we rely on unlabeled data from the testing set. These classifiers are trained in parallel. The objective function is defined as follows:
$L = \frac{1}{N_l}\sum_{i=1}^{N_l} L_e^{(i)} \;-\; \lambda\left(\frac{1}{N_l}\sum_{i=1}^{N_l} L_d^{(i)} + \frac{1}{N_u}\sum_{j=1}^{N_u} L_d^{(j)}\right) \quad (5)$

where $L_e$ is the emotion recognition loss, $L_d$ is the domain classification loss, $N_l$ represents the number of labeled data from the training set, and $N_u$ represents the number of unlabeled data from the testing set. Here, $\lambda$ is a hyperparameter that controls the trade-off between the two losses.
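To make the objective concrete, a toy computation of the combined loss can be sketched as follows (the cross-entropy helper and all names are our assumptions; in actual training the gradient reversal layer realizes the minus sign for the encoder):

```python
import numpy as np

def cross_entropy(probs, label):
    """Negative log-likelihood of the true class."""
    return -np.log(probs[label])

def dann_loss(emo_probs, emo_labels, dom_probs, dom_labels, lam):
    """Joint DANN objective: emotion loss minus lambda * domain loss.

    emo_probs : (N_l, n_emotions) predictions for labeled training samples
    dom_probs : (N_l + N_u, 2) domain predictions for all samples
    lam       : trade-off hyperparameter (lambda)
    """
    L_e = np.mean([cross_entropy(p, y) for p, y in zip(emo_probs, emo_labels)])
    L_d = np.mean([cross_entropy(p, y) for p, y in zip(dom_probs, dom_labels)])
    # From the encoder's perspective the domain loss enters with a minus sign:
    # the encoder maximizes L_d while both classifiers minimize their own loss.
    return L_e - lam * L_d
```

Setting `lam = 0` recovers the fully supervised baseline that ignores the domain classifier.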
Compared with the fully supervised learning strategy (where $\lambda = 0$ in Eq. (5)), DANN has the following advantages. Firstly, DANN learns a representation that confuses the domain classifier, which ensures the model is speaker-independent. Secondly, DANN uses available unlabeled data to further reduce the mismatch between different speakers. These speaker-independent representations retain the discriminative information learned during training on the labeled emotional data. Therefore, the proposed method can extract useful information from the unlabeled data while retaining discriminative information for emotion recognition.
3 Experiments and Discussion
3.1 Corpus Description
We perform experiments on the IEMOCAP dataset. It contains audio-visual conversations spanning 12.46 hours of various dialogue scenarios. There are five sessions, and each session involves a distinct pair of actors. All the conversations are split into small utterances, which are annotated using the following categories: anger, happiness, sadness, neutral, excitement, frustration, fear, surprise and other. To compare our method with state-of-the-art methods [1, 2], we consider the first four categories, where the happy and excited categories are merged into a single happy category. Thus 5531 utterances are involved. The number of utterances and dialogues of each session are listed in Table 1.
3.2 Experimental Setup
Features: We extract acoustic features using the openSMILE toolkit . Specifically, we use the Computational Paralinguistic Challenge (ComParE) feature set introduced by Schuller et al. . Totally, 6373-dimensional utterance-level acoustic features are extracted, including energy, spectral, MFCCs and their statistics; In the meantime, we use word embeddings to represent the lexical information. Specifically, we employ deep contextualized word representations using the language model ELMo . Compared with previous word vectors , these representations have proven to capture syntax and semantics aspects as well as the diversity of the linguistic context of words . To extract utterance-level lexical features, we calculate mean values of word representations in the utterance. Totally, 1024-dimensional utterance-level lexical features are extracted.
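The utterance-level pooling described above is a simple mean over the word axis; a sketch (the function and array names are ours, and the 1024-dim size follows ELMo):

```python
import numpy as np

def utterance_lexical_features(word_vectors):
    """Mean-pool contextual word vectors into one utterance-level feature.

    word_vectors : (n_words, 1024) ELMo representations for one utterance
    Returns a (1024,) utterance-level lexical feature vector.
    """
    return word_vectors.mean(axis=0)

# toy usage: a 3-word utterance
feats = utterance_lexical_features(np.ones((3, 1024)))
print(feats.shape)  # (1024,)
```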
Settings: AT-Fusion equalizes the feature dimensions of different modalities to size $d$. SA-GRU contains a bi-GRU layer (50 units for each GRU component) and a self-attention layer (100-dimensional states and 4 attention heads). We test different values of $\lambda$ in Eq. (5) and select the value that gains the best recognition performance. To optimize the parameters, we use the Adam optimization scheme with a learning rate of 0.0001. We train our models for at least 200 epochs with a batch size of 20. $L_2$ regularization is also utilized to alleviate over-fitting problems. In our experiments, each configuration is tested 20 times with various weight initializations. The weighted accuracy (WA) is chosen as our evaluation criterion.
3.3 Classification Performance of the Proposed Method
Two systems are evaluated in the experiments. In addition to the proposed system, one comparison system is also implemented to verify the effectiveness of our proposed method:
(1) Our proposed system (Our): This is our proposed framework. For the emotion classifier, we train the classifier with data from the training set, for which we have emotion labels. For the domain classifier, we train the classifier with data from both the training set and the testing set. As the domain classifier does not require emotion labels, we rely on unlabeled data from the testing set.
(2) Comparison system 1 (C1): It follows the proposed method but ignores the domain classifier. Specifically, we only optimize the emotion classifier by setting $\lambda$ in Eq. (5) to 0.
Furthermore, to explore the impact of the amount of labeled samples in the training set, five training settings are discussed, including TS_1234, TS_123, TS_134, TS_234 and TS_23. These training settings follow the same naming convention. For example, TS_123 represents that the training data contains Sessions 1, 2 and 3, while the testing data contains the other sessions (Sessions 4 and 5). As Session 5 always belongs to the testing data under these settings, we evaluate the classification performance on Session 5. Experimental results of WA are listed in Table 2.
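The session-based training settings above amount to filtering utterances by session id (a sketch; the function and variable names are our assumptions):

```python
def make_split(sessions, train_ids):
    """Split IEMOCAP-style data by session id (speaker-independent split).

    sessions  : dict mapping session id (1..5) -> list of utterances
    train_ids : session ids used for training, e.g. (1, 2, 3) for TS_123
    Every remaining session goes to the test side; under the settings in
    the text, Session 5 is always held out for evaluation.
    """
    train = [u for sid in train_ids for u in sessions[sid]]
    test = [u for sid in sessions if sid not in train_ids
            for u in sessions[sid]]
    return train, test
```

Because splits are made at the session level, no speaker appears on both sides of any split.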
To verify the effectiveness of the proposed method, we compare the performance of the proposed method and C1. Experimental results in Table 2 demonstrate that our proposed method is superior to C1 in all cases. Compared with C1, our proposed method can learn speaker-independent representations, which lets the model focus on emotion-related information while ignoring differences between speaker identities. Therefore, this method achieves better performance on unseen speakers (in Session 5). Furthermore, compared with C1, the proposed method can use unlabeled samples in the training process. This semi-supervised approach uses unlabeled samples to further reduce the mismatch between different speakers, while retaining the discriminative information learned from the labeled emotional data. Therefore, our proposed method is more suitable for emotion recognition than C1.
To show the impact of the amount of training samples, we compare the performance under different training settings. Experimental results in Table 2 demonstrate that when we reduce training samples, C1 suffers a 0.2%–3.5% performance decrement. Without enough training samples, C1 faces the risk of over-fitting, so its recognition performance on the unseen data becomes worse. Interestingly, we notice that our proposed method gains a 0.2%–1.5% performance improvement when we reduce training samples. Meanwhile, comparing the proposed method with C1, we observe that the margin of improvement increases with small amounts of training samples. These phenomena reveal that if we utilize unlabeled samples properly, we can even achieve better performance than fully supervised learning methods. Different from previous unsupervised learning methods, the proposed method can extract useful information from the unlabeled data while retaining discriminative information for emotion recognition.
3.4 Comparison to State-of-the-art Approaches
To verify the effectiveness of the proposed method, we further compare our method with other state-of-the-art approaches. Experimental results of different methods are listed in Table 3.
|Method|WA (%)|
|Rozgić et al. (2012)|67.40|
|Jin et al. (2015)|69.20|
|Poria et al. (2017)|74.31|
|Li et al. (2018)|74.80|
|Hazarika et al. (2018)|77.62|
|Li et al. (2019)|79.20|
Like our proposed method, these approaches [1, 2, 14, 15, 16, 17] also utilized acoustic features and lexical features for emotion recognition. Context-free systems [14, 15, 16, 17] inferred emotions based only on the current utterance, while context-based networks utilized LSTMs to capture contextual information from surrounding utterances. However, context-based networks were incapable of capturing inter-speaker dependencies. To model the inter-speaker emotion influence, Hazarika et al. used memory networks to perform speaker-specific modeling.
Experimental results in Table 3 demonstrate the effectiveness of the proposed method. Our proposed method shows an absolute improvement of 3.48% over state-of-the-art strategies. This serves as strong evidence that the domain adversarial neural network can yield a promising performance for emotion recognition.
4 Conclusion
In this paper, we present a DANN based approach for emotion recognition. Experimental results demonstrate that our method enables the model to focus on emotion-related information while ignoring differences between speaker identities. Interestingly, we notice that our proposed method gains a performance improvement when we reduce training samples, which reveals that our method can utilize unlabeled samples properly. Due to the above advantages, this novel framework is superior to state-of-the-art strategies for emotion recognition.
-  Soujanya Poria, Erik Cambria, Devamanyu Hazarika, Navonil Majumder, Amir Zadeh, and Louis-Philippe Morency, “Context-dependent sentiment analysis in user-generated videos,” in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, 2017, vol. 1, pp. 873–883.
-  Devamanyu Hazarika, Soujanya Poria, Amir Zadeh, Erik Cambria, Louis-Philippe Morency, and Roger Zimmermann, “Conversational memory network for emotion recognition in dyadic dialogue videos,” in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2018, pp. 2122–2132.
-  Carlos Busso, Murtaza Bulut, Chi-Chun Lee, Abe Kazemzadeh, Emily Mower, Samuel Kim, Jeannette N Chang, Sungbok Lee, and Shrikanth S Narayanan, “Iemocap: Interactive emotional dyadic motion capture database,” Language resources and evaluation, vol. 42, no. 4, pp. 335, 2008.
-  Gary McKeown, Michel Valstar, Roddy Cowie, Maja Pantic, and Marc Schroder, “The semaine database: Annotated multimodal records of emotionally colored conversations between a person and a limited agent,” IEEE Transactions on Affective Computing, vol. 3, no. 1, pp. 5–17, 2011.
-  Christopher Poultney, Sumit Chopra, Yann L Cun, et al., “Efficient learning of sparse representations with an energy-based model,” in Advances in Neural Information Processing Systems, 2007, pp. 1137–1144.
-  Carlos Busso and Shrikanth S Narayanan, “Interrelation between speech and facial gestures in emotional utterances: a single subject study,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 8, pp. 2331–2347, 2007.
-  Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky, “Domain-adversarial training of neural networks,” The Journal of Machine Learning Research, vol. 17, no. 1, pp. 2096–2030, 2016.
-  Carlos Busso, Murtaza Bulut, Shrikanth Narayanan, J Gratch, and S Marsella, “Toward effective automatic recognition systems of emotion in speech,” Social emotions in nature and artifact: emotions in human and human-computer interaction, J. Gratch and S. Marsella, Eds, pp. 110–127, 2013.
-  Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems, 2017, pp. 5998–6008.
-  Florian Eyben, Martin Wöllmer, and Björn Schuller, “Opensmile: the munich versatile and fast open-source audio feature extractor,” in Proceedings of the 18th ACM International Conference on Multimedia, 2010, pp. 1459–1462.
-  Björn Schuller, Stefan Steidl, Anton Batliner, Alessandro Vinciarelli, Klaus Scherer, Fabien Ringeval, Mohamed Chetouani, Felix Weninger, Florian Eyben, Erik Marchi, et al., “The interspeech 2013 computational paralinguistics challenge: social signals, conflict, emotion, autism,” in Interspeech, 2013.
-  Matthew E Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer, “Deep contextualized word representations,” Proceedings of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2227–2237, 2018.
-  Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean, “Efficient estimation of word representations in vector space,” in Proceedings of the 1st International Conference on Learning Representations (ICLR), 2013.
-  Viktor Rozgić, Sankaranarayanan Ananthakrishnan, Shirin Saleem, Rohit Kumar, and Rohit Prasad, “Ensemble of svm trees for multimodal emotion recognition,” in Proceedings of The 2012 Asia Pacific Signal and Information Processing Association Annual Summit and Conference. IEEE, 2012, pp. 1–4.
-  Qin Jin, Chengxin Li, Shizhe Chen, and Huimin Wu, “Speech emotion recognition with acoustic and lexical features,” in 2015 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, 2015, pp. 4749–4753.
-  Runnan Li, Zhiyong Wu, Jia Jia, Jingbei Li, Wei Chen, and Helen Meng, “Inferring user emotive state changes in realistic human-computer conversational dialogs,” in 2018 ACM international conference on Multimedia. ACM, 2018, pp. 136–144.
-  Runnan Li, Zhiyong Wu, Jia Jia, Yaohua Bu, Sheng Zhao, and Helen Meng, “Towards discriminative representation learning for speech emotion recognition,” in Proceedings of the 28th International Joint Conference on Artificial Intelligence (IJCAI), 2019, pp. 5060–5066.