Domain adversarial learning for emotion recognition

Zheng Lian et al. ∙ 10/24/2019

In practical applications of emotion recognition, target users are not always present in the training corpus. The mismatch between training speakers and testing speakers degrades the performance of the trained model. To deal with this problem, we need the model to focus on emotion-related information while ignoring differences between speaker identities. In this paper, we investigate the use of the domain adversarial neural network (DANN) to extract a common representation across different speakers. The primary task is to predict emotion labels. The secondary task is to learn a common representation in which speaker identities cannot be distinguished. By using a gradient reversal layer, the gradients coming from the secondary task are used to bring the representations of different speakers closer. To verify the effectiveness of the proposed method, we conduct experiments on the IEMOCAP database. Experimental results demonstrate that the proposed framework yields an absolute improvement of 3.48% over state-of-the-art strategies.

1 Introduction

In many practical applications of emotion recognition, target users are not always present in the training corpus. The mismatch between training speakers and testing speakers leads to a performance degradation of the trained models. Therefore, it is vital to build speaker-independent systems for emotion recognition.

To ensure the model is speaker-independent, prior works focus on data split strategies [1, 2]. These methods ensure that there is no speaker overlap between the training set and the testing set. For example, the IEMOCAP dataset [3] contains five sessions, and each session has different actors. Hazarika et al. [2] used utterances from the first four sessions for training and the remainder for testing. However, it is unclear whether such methods actually learn speaker-independent representations.

Furthermore, obtaining large amounts of realistic data is currently challenging and expensive for emotion recognition. The publicly available datasets (such as IEMOCAP [3] and SEMAINE [4]) contain a relatively small number of utterances. Prior works utilize unsupervised learning approaches to deal with low-resource training samples. One common method is to train autoencoders [5]. Autoencoders work by converting original features into compressed representations, aiming to capture intrinsic structures of the data. However, it is unclear whether compressed representations preserve the emotion component of the input. In fact, prior works have found that the emotion component can be lost after feature compression [6].

To deal with these problems, we focus on the domain adversarial neural network (DANN) [7] for emotion recognition. Commonly, the mismatch in data distribution between the train (source domain) and test (target domain) sets leads to a performance degradation [8]. DANN relies on adversarial training for domain adaptation, aiming to reduce the mismatch between the source domain and the target domain. In this paper, we use DANN to reduce the mismatch between different speakers, thus ensuring that the model can learn speaker-independent representations. Furthermore, different from previous unsupervised learning methods [5], DANN can extract useful information from the unlabeled data while retaining discriminative information for emotion recognition.

In this paper, we present a DANN-based framework for emotion recognition. The main contributions of this paper lie in three aspects: 1) to deal with low-resource training samples, the proposed method can extract useful information from the unlabeled data while retaining emotion-related information; 2) to reduce the mismatch between speakers, the proposed method ensures the model can learn speaker-independent representations; 3) our proposed method is superior to other state-of-the-art approaches for emotion recognition. To the best of our knowledge, this is the first time that DANN has been applied to emotion recognition.

Figure 1: Overall structure of the proposed framework.

2 Proposed Method

In this paper, we propose a multimodal learning framework for emotion recognition. As shown in Fig. 1, the proposed framework consists of three components: the feature encoder, the domain classifier and the emotion classifier.

2.1 Problem Definition

Dataset: We have a training set with annotated emotional data and a testing set with unlabeled data. Following the experimental settings in [2], speaker identities are available for both the training set and the testing set.

Task: Define a conversation $C = \{u_1, u_2, \dots, u_L\}$, where $u_i$ is the $i$-th utterance in the conversation and $L$ is the total number of utterances. The primary task is to predict the emotion label of each utterance. The secondary task is to learn a common representation in which speaker identities cannot be distinguished.

2.2 Feature Encoder

The feature encoder contains two key components: the Audio-Text Fusion component (AT-Fusion) for multimodal fusion and the Self-Attention based Gated Recurrent Unit (SA-GRU) for contextual feature extraction.

Multimodal Fusion (AT-Fusion): Different modalities make different contributions to emotion recognition. To focus on the important modalities, we utilize an attention mechanism for multimodal fusion. Specifically, we first extract acoustic features and lexical features from each utterance. Then we equalize the dimensions of these features to a common size $d$ using two fully-connected layers, respectively. This provides the final acoustic features $a_i \in \mathbb{R}^d$ and lexical features $t_i \in \mathbb{R}^d$ for the utterance $u_i$. AT-Fusion takes $a_i$ and $t_i$ as inputs, and outputs an attention vector $\alpha_{fuse}$ over these modalities. Finally, the fusion representation $f_i$ is generated as follows:

$M_i = [a_i, t_i]$   (1)

$\alpha_{fuse} = \mathrm{softmax}\big(w^{T}\tanh(W_{fuse} M_i)\big)$   (2)

$f_i = M_i\,\alpha_{fuse}^{T}$   (3)

where $W_{fuse} \in \mathbb{R}^{d \times d}$ and $w \in \mathbb{R}^{d}$ are trainable parameters. Here, $M_i \in \mathbb{R}^{d \times 2}$, $\alpha_{fuse} \in \mathbb{R}^{1 \times 2}$ and $f_i \in \mathbb{R}^{d}$.

This multimodal representation is generated for all $L$ utterances in the conversation $C$, denoted as $F = [f_1, f_2, \dots, f_L]$.
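For concreteness, a minimal PyTorch sketch of this attention-based fusion is given below. It follows Eqs. (1)-(3) as written above; the module name, the choice of dimensions, and the use of nn.Linear layers are illustrative assumptions rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ATFusion(nn.Module):
    """Attention-based audio-text fusion, a sketch of Eqs. (1)-(3)."""

    def __init__(self, acoustic_dim, lexical_dim, d):
        super().__init__()
        # Two fully-connected layers equalize both modalities to size d.
        self.fc_a = nn.Linear(acoustic_dim, d)
        self.fc_t = nn.Linear(lexical_dim, d)
        self.W_fuse = nn.Linear(d, d, bias=False)   # W_fuse in Eq. (2)
        self.w = nn.Linear(d, 1, bias=False)        # w in Eq. (2)

    def forward(self, acoustic, lexical):
        a = self.fc_a(acoustic)                  # (batch, d)
        t = self.fc_t(lexical)                   # (batch, d)
        m = torch.stack([a, t], dim=1)           # (batch, 2, d), Eq. (1)
        scores = self.w(torch.tanh(self.W_fuse(m))).squeeze(-1)  # (batch, 2)
        alpha = F.softmax(scores, dim=-1)        # attention over the two modalities
        f = (alpha.unsqueeze(-1) * m).sum(dim=1)  # (batch, d), Eq. (3)
        return f


# Example usage with illustrative dimensions (6373-d acoustic, 1024-d lexical, d = 100):
# fusion = ATFusion(acoustic_dim=6373, lexical_dim=1024, d=100)
# f_i = fusion(torch.randn(8, 6373), torch.randn(8, 1024))   # shape (8, 100)
```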

Contextual Feature Extraction (SA-GRU): SA-GRU uses a bi-directional GRU (bi-GRU) in combination with the self-attention mechanism [9] to amplify the important contextual evidence for emotion recognition. Specifically, the multimodal representations $F$ are given as inputs to the bi-GRU. The outputs of this layer form $H = [h_1, h_2, \dots, h_L]$, where $h_i$ is the hidden state for the utterance $u_i$. Then $H$ is fed into the self-attention network, which consists of a multi-head attention to extract cross-position information. Each head $\mathrm{head}_j$ ($1 \le j \le h$, where $h$ is the number of heads) is generated using the scaled inner product as follows:

$\mathrm{head}_j = \mathrm{softmax}\!\left(\dfrac{(H W_j^{Q})(H W_j^{K})^{T}}{\sqrt{d_k}}\right) H W_j^{V}$   (4)

where $W_j^{Q}$, $W_j^{K}$ and $W_j^{V}$ are trainable parameters and $d_k$ is the dimension of each head.

Then the outputs of the $h$ attention heads are concatenated together to form the final values. As a contextual representation $g_i$ is generated for every utterance in the conversation $C$, the output can also be represented as $G = [g_1, g_2, \dots, g_L]$.
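The SA-GRU component can be sketched in the same spirit; here nn.MultiheadAttention stands in for the multi-head attention of Eq. (4), and the unit counts follow the settings reported in Section 3.2. This is an assumption-laden sketch, not the authors' code.

```python
import torch
import torch.nn as nn


class SAGRU(nn.Module):
    """Bi-GRU followed by multi-head self-attention over the utterance sequence."""

    def __init__(self, d=100, gru_units=50, num_heads=4):
        super().__init__()
        self.bigru = nn.GRU(d, gru_units, batch_first=True, bidirectional=True)
        # Multi-head self-attention plays the role of Eq. (4) for all heads at once.
        self.self_attn = nn.MultiheadAttention(
            embed_dim=2 * gru_units, num_heads=num_heads, batch_first=True)

    def forward(self, f_seq):
        # f_seq: (batch, L, d) fused utterance representations F = [f_1, ..., f_L].
        h, _ = self.bigru(f_seq)        # H = [h_1, ..., h_L], shape (batch, L, 2*gru_units)
        g, _ = self.self_attn(h, h, h)  # queries = keys = values = H
        return g                        # G = [g_1, ..., g_L]
```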

2.3 Domain Adversarial Neural Network for Emotion Recognition

DANN is trained using labeled data from the training set and unlabeled data from the testing set. The network learns two classifiers: the emotion classifier and the domain classifier. Both classifiers share the feature encoder that determines the representations used for classification. The approach introduces a gradient reversal layer [7] between the domain classifier and the feature encoder. This layer passes the data through unchanged during forward propagation and inverts the sign of the gradient during backward propagation. Therefore, DANN attempts to minimize the emotion classification error while maximizing the domain classification error. By considering these two goals, the model learns a representation that is discriminative for emotion recognition, while making the samples from different speakers indistinguishable.
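The gradient reversal layer itself is only a few lines in a modern autograd framework. The sketch below (in PyTorch, with illustrative names) passes activations through unchanged in the forward pass and flips and scales the gradient in the backward pass:

```python
import torch


class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; multiplies the gradient by -lam in the backward pass."""

    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Invert the sign of the gradient flowing back into the feature encoder.
        return -ctx.lam * grad_output, None


def grad_reverse(x, lam=1.0):
    """Insert between the feature encoder output and the domain classifier."""
    return GradReverse.apply(x, lam)
```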

In our proposed method, we train the emotion recognition task on the training set, for which we have emotion labels. For the domain classifier, we train the classifier with data from both the training set and the testing set. Notice that the domain classifier does not require emotion labels, so we can rely on unlabeled data from the testing set. These classifiers are trained in parallel. The objective function is defined as follows:

$E = \dfrac{1}{N_s}\sum_{i=1}^{N_s} L_e^{i} \;-\; \dfrac{\lambda}{N_s + N_t}\sum_{j=1}^{N_s + N_t} L_d^{j}$   (5)

where $L_e$ is the emotion recognition loss, $L_d$ is the domain classification loss, $N_s$ represents the number of labeled samples from the training set, and $N_t$ represents the number of unlabeled samples from the testing set. Here, $\lambda$ is a hyperparameter that controls the trade-off between the two losses.
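In practice, the min-max objective of Eq. (5) is usually realized implicitly through the gradient reversal layer: both loss terms are simply added, and the reversal takes care of maximizing the domain loss with respect to the encoder. Below is a hedged sketch of one such training step; the function and variable names are illustrative, and grad_reverse is the helper from the previous sketch.

```python
import torch
import torch.nn.functional as F


def dann_step(encoder, emotion_clf, domain_clf, labeled_batch, unlabeled_batch, lam):
    """Compute one training loss combining the emotion and domain (speaker) terms."""
    x_src, y_emotion, spk_src = labeled_batch   # training set: features, emotion and speaker labels
    x_tgt, spk_tgt = unlabeled_batch            # testing set: features and speaker labels only

    z_src = encoder(x_src)
    z_tgt = encoder(x_tgt)

    # Emotion classification loss on labeled training data only.
    loss_emotion = F.cross_entropy(emotion_clf(z_src), y_emotion)

    # Speaker (domain) classification loss on both sets, routed through the
    # gradient reversal layer so the encoder is pushed to confuse the classifier.
    z_all = torch.cat([z_src, z_tgt], dim=0)
    spk_all = torch.cat([spk_src, spk_tgt], dim=0)
    loss_domain = F.cross_entropy(domain_clf(grad_reverse(z_all, lam)), spk_all)

    return loss_emotion + loss_domain
```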

Compared with the fully supervised learning strategy (where $\lambda = 0$ in Eq. (5)), DANN has the following advantages. Firstly, DANN learns a representation that confuses the domain classifier, which encourages the model to be speaker-independent. Secondly, DANN uses the available unlabeled data to further reduce the mismatch between different speakers. These speaker-independent representations retain the discriminative information learned while training the model with emotional data from the training set. Therefore, the proposed method can extract useful information from the unlabeled data while retaining discriminative information for emotion recognition.

3 Experiments and Discussion

3.1 Corpus Description

We perform experiments on the IEMOCAP dataset [3]. It contains 12.46 hours of audio-visual conversations covering various dialogue scenarios. There are five sessions, and each session involves a pair of speakers. All the conversations are split into small utterances, which are annotated with the following categories: anger, happiness, sadness, neutral, excitement, frustration, fear, surprise and other. To compare our method with state-of-the-art methods [1, 2], we consider the first four categories, where the happy and excited categories are merged into a single happy category. Thus 5531 utterances are involved. The number of utterances and dialogues in each session is listed in Table 1.

Session           1      2      3      4      5
No. utterances    1085   1023   1151   1031   1241
No. dialogues     28     30     32     30     31
Table 1: The data distribution of the IEMOCAP dataset.

3.2 Experimental Setup

Features: We extract acoustic features using the openSMILE toolkit [10]. Specifically, we use the Computational Paralinguistics Challenge (ComParE) feature set introduced by Schuller et al. [11]. In total, 6373-dimensional utterance-level acoustic features are extracted, including energy, spectral and MFCC features together with their statistics. Meanwhile, we use word embeddings to represent the lexical information. Specifically, we employ deep contextualized word representations from the language model ELMo [12]. Compared with previous word vectors [13], these representations have been shown to capture syntactic and semantic aspects as well as the diversity of the linguistic context of words [12]. To extract utterance-level lexical features, we compute the mean of the word representations in each utterance. In total, 1024-dimensional utterance-level lexical features are extracted.
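As a rough illustration of this feature pipeline, the sketch below uses the opensmile Python wrapper (which ships the ComParE 2016 functionals, matching the 6373-dimensional size) and a hypothetical elmo_embed helper for the word vectors. The file name and the helper are assumptions; the original work may have used the command-line openSMILE tool and a different ELMo interface.

```python
import numpy as np
import opensmile  # pip install opensmile (Python wrapper around openSMILE)

# 6373-dimensional utterance-level acoustic features (ComParE functionals).
smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.ComParE_2016,
    feature_level=opensmile.FeatureLevel.Functionals,
)
acoustic = smile.process_file("utterance.wav").to_numpy().squeeze()  # shape: (6373,)


def utterance_lexical_feature(tokens, elmo_embed):
    """Mean-pool ELMo word vectors into a 1024-dimensional utterance feature.

    `elmo_embed` is a hypothetical callable returning an (n_words, 1024) array
    of contextualized word representations for the tokenized utterance.
    """
    word_vectors = np.asarray(elmo_embed(tokens))
    return word_vectors.mean(axis=0)  # shape: (1024,)
```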

Settings: AT-Fusion equalizes the feature dimensions of the different modalities to a common size $d$. SA-GRU contains a bi-GRU layer (50 units for each GRU component) and a self-attention layer (100-dimensional states and 4 attention heads). We test different values of $\lambda$ in Eq. (5) and choose the one that gives the best recognition performance. To optimize the parameters, we use the Adam optimization scheme with a learning rate of 0.0001. We train our models for at least 200 epochs with a batch size of 20. Weight regularization is also utilized to alleviate over-fitting. In our experiments, each configuration is tested 20 times with different weight initializations. The weighted accuracy (WA) is chosen as the evaluation criterion.
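Under these settings, the optimizer configuration might look like the following sketch. The regularization weight is a placeholder, since the exact value is not reproduced here; weight_decay in Adam acts as an L2-style penalty on the weights.

```python
import torch

REG_WEIGHT = 1e-5  # placeholder; the exact regularization weight is an assumption

# `model` stands in for the full encoder + classifiers described in Section 2.
model = torch.nn.Sequential(torch.nn.Linear(100, 4))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=REG_WEIGHT)

BATCH_SIZE = 20   # as reported above
MIN_EPOCHS = 200  # models are trained for at least 200 epochs
```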

3.3 Classification Performance of the Proposed Method

Two systems are evaluated in the experiments. In addition to the proposed system, one comparison system is also implemented to verify the effectiveness of our proposed method:

(1) Our proposed system (Our): This is our proposed framework. For the emotion classifier, we train the classifier with data from the training set, for which we have emotion labels. For the domain classifier, we train the classifier with data from both the training set and the testing set. As the domain classifier does not require emotion labels, we rely on unlabeled data from the testing set.

(2) Comparison system 1 (C1): It is derived from the proposed method, but ignores the domain classifier. Specifically, we only optimize the emotion classifier by setting $\lambda$ in Eq. (5) to 0.

Furthermore, to explore the impact of the amount of labeled samples in the training set, five training settings are discussed: TS_1234, TS_123, TS_134, TS_234 and TS_23. These training settings follow the same naming convention, where the digits indicate which sessions form the training data. For example, TS_123 represents that the training data contains Sessions 1-3, while the testing data contains the remaining sessions (Sessions 4-5); see the sketch after Table 2. As Session 5 always belongs to the testing data under these settings, we evaluate the classification performance on Session 5. Experimental results in terms of WA are listed in Table 2.

TS_1234 TS_123 TS_134 TS_234 TS_23
C1 81.06 80.82 79.85 78.89 77.60
Our 81.14 82.68 82.27 82.43 81.39
Table 2: Experimental results of two systems under different training settings.
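For concreteness, the session-based splits described above can be expressed as follows. This is a minimal sketch; the utterance record format is an assumption for illustration.

```python
# Training settings map to the IEMOCAP sessions used for training; the remaining
# sessions (always including Session 5) form the testing data.
TRAIN_SETTINGS = {
    "TS_1234": {1, 2, 3, 4},
    "TS_123": {1, 2, 3},
    "TS_134": {1, 3, 4},
    "TS_234": {2, 3, 4},
    "TS_23": {2, 3},
}


def split_by_session(utterances, setting):
    """Split utterance records (assumed to carry a 'session' field in 1..5) by setting."""
    train_sessions = TRAIN_SETTINGS[setting]
    train = [u for u in utterances if u["session"] in train_sessions]
    test = [u for u in utterances if u["session"] not in train_sessions]
    return train, test
```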

To verify the effectiveness of the proposed method, we compare the performance of the proposed method and C1. The experimental results in Table 2 demonstrate that our proposed method is superior to C1 in all cases. Compared with C1, our proposed method can learn speaker-independent representations. This enables the model to focus on emotion-related information while ignoring differences between speaker identities. Therefore, the method achieves better performance on unseen speakers (in Session 5). Furthermore, compared with C1, the proposed method can use unlabeled samples in the training process. This semi-supervised approach uses unlabeled samples to further reduce the mismatch between different speakers, while retaining the discriminative information learned from the emotional data during training. Therefore, our proposed method is more suitable for emotion recognition than C1.

To show the impact of the amount of training samples, we compare the performance under different training settings. The experimental results in Table 2 demonstrate that when we reduce the training samples, C1 suffers a 0.2% to 3.5% performance decrement. Without enough training samples, C1 faces the risk of over-fitting, so its recognition performance on unseen data becomes worse. Interestingly, we notice that our proposed method gains a 0.2% to 1.5% performance improvement when we reduce the training samples. Moreover, when comparing the proposed method with C1, we observe that the margin of improvement increases with smaller amounts of training samples. These phenomena reveal that if we utilize unlabeled samples properly, we can even achieve better performance than fully supervised learning methods. Different from previous unsupervised learning methods [5], the proposed method can extract useful information from the unlabeled data while retaining discriminative information for emotion recognition.

3.4 Comparison to State-of-the-art Approaches

To further verify the effectiveness of the proposed method, we compare our method with other state-of-the-art approaches. The experimental results of the different methods are listed in Table 3.

Approaches WA (%)
Rozgić et al. (2012) [14] 67.40
Jin et al. (2015) [15] 69.20
Poria et al. (2017) [1] 74.31
Li et al. (2018) [16] 74.80
Hazarika et al. (2018) [2] 77.62
Li et al. (2019) [17] 79.20
Proposed method 82.68
Table 3: The performance of state-of-the-art approaches and the proposed approach on the IEMOCAP database.

Like our proposed method, these approaches [1, 2, 14, 15, 16, 17] also utilized acoustic features and lexical features for emotion recognition. Context-free systems [14, 15, 16, 17] inferred emotions based only on the current utterance in the conversation, while context-based networks [1] utilized LSTMs to capture contextual information from the surrounding utterances. However, context-based networks [1] suffered from the incapability of capturing inter-speaker dependencies. To model inter-speaker emotion influence, Hazarika et al. [2] used memory networks to perform speaker-specific modeling.

The experimental results in Table 3 demonstrate the effectiveness of the proposed method. Our proposed method shows an absolute improvement of 3.48% over state-of-the-art strategies. This serves as strong evidence that the domain adversarial neural network can yield promising performance for emotion recognition.

4 Conclusions

In this paper, we present a DANN-based approach for emotion recognition. Experimental results demonstrate that our method enables the model to focus on emotion-related information while ignoring differences between speaker identities. Interestingly, we notice that our proposed method gains a performance improvement when we reduce the training samples, which reveals that our method utilizes unlabeled samples properly. Due to the above advantages, the proposed framework is superior to state-of-the-art strategies for emotion recognition.

References

  • [1] Soujanya Poria, Erik Cambria, Devamanyu Hazarika, Navonil Majumder, Amir Zadeh, and Louis-Philippe Morency, "Context-dependent sentiment analysis in user-generated videos," in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, 2017, vol. 1, pp. 873–883.
  • [2] Devamanyu Hazarika, Soujanya Poria, Amir Zadeh, Erik Cambria, Louis-Philippe Morency, and Roger Zimmermann, "Conversational memory network for emotion recognition in dyadic dialogue videos," in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2018, pp. 2122–2132.
  • [3] Carlos Busso, Murtaza Bulut, Chi-Chun Lee, Abe Kazemzadeh, Emily Mower, Samuel Kim, Jeannette N. Chang, Sungbok Lee, and Shrikanth S. Narayanan, "IEMOCAP: Interactive emotional dyadic motion capture database," Language Resources and Evaluation, vol. 42, no. 4, pp. 335, 2008.
  • [4] Gary McKeown, Michel Valstar, Roddy Cowie, Maja Pantic, and Marc Schröder, "The SEMAINE database: Annotated multimodal records of emotionally colored conversations between a person and a limited agent," IEEE Transactions on Affective Computing, vol. 3, no. 1, pp. 5–17, 2011.
  • [5] Christopher Poultney, Sumit Chopra, Yann LeCun, et al., "Efficient learning of sparse representations with an energy-based model," in Advances in Neural Information Processing Systems, 2007, pp. 1137–1144.
  • [6] Carlos Busso and Shrikanth S. Narayanan, "Interrelation between speech and facial gestures in emotional utterances: A single subject study," IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 8, pp. 2331–2347, 2007.
  • [7] Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky, "Domain-adversarial training of neural networks," The Journal of Machine Learning Research, vol. 17, no. 1, pp. 2096–2030, 2016.
  • [8] Carlos Busso, Murtaza Bulut, Shrikanth Narayanan, J. Gratch, and S. Marsella, "Toward effective automatic recognition systems of emotion in speech," in Social Emotions in Nature and Artifact: Emotions in Human and Human-Computer Interaction, J. Gratch and S. Marsella, Eds., pp. 110–127, 2013.
  • [9] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin, "Attention is all you need," in Advances in Neural Information Processing Systems, 2017, pp. 5998–6008.
  • [10] Florian Eyben, Martin Wöllmer, and Björn Schuller, "openSMILE: The Munich versatile and fast open-source audio feature extractor," in Proceedings of the 18th ACM International Conference on Multimedia, 2010, pp. 1459–1462.
  • [11] Björn Schuller, Stefan Steidl, Anton Batliner, Alessandro Vinciarelli, Klaus Scherer, Fabien Ringeval, Mohamed Chetouani, Felix Weninger, Florian Eyben, Erik Marchi, et al., "The INTERSPEECH 2013 computational paralinguistics challenge: Social signals, conflict, emotion, autism," in Interspeech, 2013.
  • [12] Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer, "Deep contextualized word representations," in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2018, pp. 2227–2237.
  • [13] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean, "Efficient estimation of word representations in vector space," in Proceedings of the 1st International Conference on Learning Representations (ICLR), 2013.
  • [14] Viktor Rozgić, Sankaranarayanan Ananthakrishnan, Shirin Saleem, Rohit Kumar, and Rohit Prasad, "Ensemble of SVM trees for multimodal emotion recognition," in Proceedings of the 2012 Asia Pacific Signal and Information Processing Association Annual Summit and Conference. IEEE, 2012, pp. 1–4.
  • [15] Qin Jin, Chengxin Li, Shizhe Chen, and Huimin Wu, "Speech emotion recognition with acoustic and lexical features," in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2015, pp. 4749–4753.
  • [16] Runnan Li, Zhiyong Wu, Jia Jia, Jingbei Li, Wei Chen, and Helen Meng, "Inferring user emotive state changes in realistic human-computer conversational dialogs," in Proceedings of the 2018 ACM International Conference on Multimedia. ACM, 2018, pp. 136–144.
  • [17] Runnan Li, Zhiyong Wu, Jia Jia, Yaohua Bu, Sheng Zhao, and Helen Meng, "Towards discriminative representation learning for speech emotion recognition," in Proceedings of the 28th International Joint Conference on Artificial Intelligence (IJCAI), 2019, pp. 5060–5066.