BERT-Max based Contextual Emotion Classifier
We propose a contextual emotion classifier based on a transferable language model and dynamic max pooling, which predicts the emotion of each utterance in a dialogue. A representative emotion analysis task, EmotionX, requires considering contextual information from colloquial dialogues and dealing with a class imbalance problem. To alleviate these problems, our model leverages a self-attention based transferable language model and a weighted cross entropy loss. Furthermore, we apply post-training and fine-tuning mechanisms to enhance the domain adaptability of our model and utilize several machine learning techniques to improve its performance. We conduct experiments on two emotion-labeled datasets named Friends and EmotionPush. As a result, our model outperforms the previous state-of-the-art model and also shows competitive performance in the EmotionX 2019 challenge. The code will be available on the GitHub page.
Sentiment analysis, considered one of the most important methods for analyzing real-world communication [Picard2000], is a classification task that extracts subjective information from language. The traditional sentiment analysis method [Yu and Hatzivassiloglou2003] returns the opinion polarity towards something, but it is confined to analyzing a single sentence or document, regardless of its surrounding information. To resolve this issue, [Wilson et al.2005] performed sentiment analysis on datasets in which context information has to be considered. Another advanced dataset, the Twitter corpus [Pak and Paroubek2010], is built from social media and is more similar to real-world communication.
Against this background, [Chen et al.2018] released an emotion-labeled corpus of multi-party conversations, EmotionLines, for contextual sentiment analysis. One example from the dataset is shown in Table 1. In the example, a woman and a man are arguing because she has feelings for him but he is watching someone. The last utterance, spoken by Rachel, would be labeled as Joy if it were a single sentence, but it has to be labeled as Anger when the whole dialogue context is considered.
Over the last decade, various neural network models have been proposed for this task, motivated by their promising performance in many classification tasks. Despite this progress, two critical problems remain: learning long-term dependencies [Bengio et al.1994] and processing in parallel [Eckmiller et al.1990].
Recently, [Vaswani et al.2017] proposed a self-attention mechanism that captures long-term dependencies and can be computed in parallel. The multi-layered self-attention based language model [Devlin et al.2018] drastically advances the performance of several natural language processing tasks. We discuss pre-trained language models in detail in Section 2.
In this paper, we propose a contextual emotion classifier applied to the EmotionX shared task. The proposed model leverages a transferable language model and dynamic max pooling to effectively consider each utterance together with its context information. We treat contextual emotion classification as a sequence labeling problem, in that each utterance has a context in a dialogue. In addition, we propose several machine learning techniques to handle the inherent problems of the shared task, which lead to performance gains.
The contributions of our paper are as follows:
We suggest a successful method to deal with the intended problems of the EmotionX shared task: context understanding, domain adaptation, and class imbalance.
Our model advances the previous state-of-the-art model, EmotionX-AR [Khosla2018], by replacing the encoder with a self-attention based transferable language model.
Our study is positioned at the intersection of the following two topics.
Pre-trained language models, such as ELMo [Peters et al.2018], OpenAI GPT [Radford et al.2018], and BERT [Devlin et al.2018], have been broadly applied to a variety of NLP tasks (e.g., sentiment analysis, machine reading comprehension, and textual entailment) and have achieved great success.
They can generate deep contextualized embeddings because they are pre-trained on a massive unlabeled corpus (e.g., English Wikipedia), so methods built on top of them can achieve better performance.
Several studies that leverage pre-trained language models for sentiment analysis have recently been proposed. [Sun et al.2019] fine-tuned the BERT model and obtained strong results on a sentiment analysis task. However, they focused only on single-sentence classification and did not conduct experiments on a conversational corpus in which the utterances are semantically related.
One of the previous baselines for emotion detection in dialogue proposes a CNN-DCNN auto-encoder based emotion classifier, but it does not use the dialogue context when predicting the emotion of an utterance [Khosla2018]. Our approach takes the work of [Sun et al.2019] and [Khosla2018] one step further by applying a bi-directional pre-trained language model (BERT) to emotion detection at the dialogue level, so that the context from both directions of the dialogue is reflected much more strongly.
The emotion detection task on a multi-party dialogue is similar to the dialogue act sequence labeling task. Dialogue acts are semantic labels attached to the utterances in a dialogue that briefly identify the speaker's intention in producing those utterances.
Dialogue act identification can be interpreted as a sequence labeling problem and can be solved naively by assigning a label to each element of the sequence independently. Inspired by sequence labeling at the dialogue level, we apply this concept to our task. Specifically, the dialogue-level tokens are input to the post-trained language model, which takes the whole dialogue into account and thus models the dependency among both labels and utterances, an important consideration in natural dialogue [Kumar et al.2018].
More closely related to our task, [Xu et al.2018] study double embeddings and CNN based sequence labeling for aspect extraction. Aspect extraction is one of the sentiment analysis tasks and aims to extract opinion aspects from opinion-based text [Yang et al.2018]. They achieve very good results on review sentences, but the purpose of that task differs from ours in that it does not operate at the dialogue level. Furthermore, they use pre-trained general-purpose embeddings for aspect extraction (e.g., GloVe [Pennington et al.2014]), while we fine-tune the pre-trained language model to adapt it to the dialogue setting.
To the best of our knowledge, this study is the first to report such a sequence labeling framework at the dialogue level, based on a pre-trained language model, for emotion detection.
The specification of our model is described in this section.
We propose a contextual emotion classifier that combines a transferable language model and dynamic max pooling, as in Figure 1. (1) First, the input utterances are tokenized according to a byte pair encoding (BPE) algorithm. (2) Then the language model embeds the tokenized inputs into deep contextualized token representations, which are in turn converted to utterance representations via dynamic max pooling. (3) Finally, the classifier detects the contextual emotion from these representations.
Every utterance is lower-cased and tokenized by the BPE tokenizer, and all the tokens in the same dialogue are concatenated, with the special token [SEP] inserted between utterances. As an exception, if appending the tokens of the next utterance would make the token list of the dialogue exceed the preset maximum input length, the tokens of that utterance are excluded from the list.
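The input construction described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `build_dialogue_tokens` is hypothetical, the tokenizer is passed in as a plain callable (whitespace splitting stands in for the BPE tokenizer in the usage example), and it is an assumption that processing stops at the first utterance that does not fit. Any additional special tokens such as a leading [CLS] are omitted.

```python
from typing import Callable, List

SEP = "[SEP]"


def build_dialogue_tokens(
    utterances: List[str],
    tokenize: Callable[[str], List[str]],
    max_len: int = 512,
) -> List[str]:
    """Concatenate the tokens of a dialogue, with [SEP] between utterances."""
    tokens: List[str] = []
    for utterance in utterances:
        piece = tokenize(utterance.lower())
        # Every utterance but the first also needs one slot for the separator.
        needed = len(piece) + (1 if tokens else 0)
        if len(tokens) + needed > max_len:
            # Assumption: once an utterance does not fit, later ones are dropped too.
            break
        if tokens:
            tokens.append(SEP)
        tokens.extend(piece)
    return tokens


dialogue = ["Hi there", "How are you", "Fine"]
print(build_dialogue_tokens(dialogue, str.split, max_len=6))
```

With a budget of six tokens, the third utterance is excluded because adding it (plus its separator) would exceed the limit.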
The tokens are embedded through WordPiece embeddings [Wu et al.2016]. The input embeddings, indexed by the utterance and by the token position within that utterance, are the sum of the token embeddings and the positional embeddings.
We adopt a Transformer [Vaswani et al.2017] based pre-trained language model (BERT) because it alleviates the long-term dependency problem, helping to capture useful context information effectively. The language model also performs well because its pre-trained parameters are transferable to other tasks.
To enhance the domain adaptability of the language model, we post-train [Xu et al.2019] the model via masked language modeling (MLM) and next sentence prediction (NSP). Then all layers of the model are fine-tuned on the emotion classification task.
The language model converts the tokenized inputs into deep contextualized token representations. However, each dialogue has a different number of utterances, and each utterance has a different number of tokens. To handle this, we apply a dynamic max pooling technique that creates uniform-sized utterance representations from varying numbers of tokens, as in Figure 1. Max pooling is also expected to help keep the important information in each dimension.
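A minimal sketch of dynamic max pooling is given below, assuming the token representations of a dialogue arrive as one flat list of vectors together with the token count of each utterance, and that separator tokens have already been removed. The function name and data layout are illustrative assumptions; a real implementation would operate on tensors.

```python
from typing import List

Vector = List[float]


def dynamic_max_pool(
    token_vectors: List[Vector], utterance_lengths: List[int]
) -> List[Vector]:
    """Reduce each utterance's token vectors to one vector by element-wise max."""
    if sum(utterance_lengths) != len(token_vectors):
        raise ValueError("utterance lengths must cover all token vectors")
    pooled: List[Vector] = []
    start = 0
    for length in utterance_lengths:
        if length <= 0:
            raise ValueError("every utterance needs at least one token")
        span = token_vectors[start:start + length]
        # zip(*span) iterates over dimensions; max keeps the strongest value in each.
        pooled.append([max(dim) for dim in zip(*span)])
        start += length
    return pooled


vectors = [[1.0, -2.0], [0.5, 3.0], [-1.0, -1.0]]
print(dynamic_max_pool(vectors, [2, 1]))
```

Regardless of how many tokens an utterance has, its pooled representation has the same dimensionality as a single token vector, which is what allows a fixed-size classifier to follow.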
The utterance representations from the encoder pass through the classifier, which consists of two linear layers, one activation function, and dropout, as in Figure 1. The activation function, scaled exponential linear units (SELUs) [Klambauer et al.2017], has a built-in normalizing effect, so gradient descent can converge more quickly. Dropout is applied to prevent overfitting.
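The classifier head can be illustrated with a small pure-Python forward pass. The layer order used here (linear, SELU, dropout, linear) and the dropout rate are assumptions, since the text only lists the components; the SELU constants are the standard ones from [Klambauer et al.2017]. All function names are hypothetical.

```python
import math
import random
from typing import List, Optional

SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805


def selu(x: float) -> float:
    """Scaled exponential linear unit."""
    return SELU_SCALE * (x if x > 0 else SELU_ALPHA * (math.exp(x) - 1.0))


def linear(x: List[float], weight: List[List[float]], bias: List[float]) -> List[float]:
    """Affine map; weight holds one row per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def dropout(x: List[float], rate: float, training: bool, rng: random.Random) -> List[float]:
    """Inverted dropout: zero units at random and rescale the survivors."""
    if not training or rate == 0.0:
        return list(x)
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in x]


def classify(
    utterance_vec: List[float],
    w1: List[List[float]], b1: List[float],
    w2: List[List[float]], b2: List[float],
    rate: float = 0.1,  # hypothetical value, not taken from the paper
    training: bool = False,
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Return unnormalized class scores (logits) for one utterance."""
    rng = rng or random.Random(0)
    hidden = [selu(h) for h in linear(utterance_vec, w1, b1)]
    hidden = dropout(hidden, rate, training, rng)
    return linear(hidden, w2, b2)
```

At inference time (`training=False`) dropout is a no-op, so the output depends only on the two linear layers and the SELU activation.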
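The Friends and EmotionPush datasets suffer from a severe class imbalance problem, as shown in Table 2. To deal with it, we use weighted cross entropy (WCE) as the training loss, which up-weights the samples of minority classes as below.

$$\mathrm{WCE} = -\sum_{i=1}^{N} w_i \, t_i \log p_i$$

where $N$ is the number of classes, $w_i$ is the weight of class $i$, determined by the number of samples $n_i$ of class $i$ in the training set so that minority classes receive larger weights, $t_i$ is the ground-truth label, and $p_i$ is the predicted probability for the corresponding class.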
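A weighted cross entropy of this kind can be sketched as below. The exact weighting scheme is an assumption: class weights are taken here to be normalized inverse class frequencies, which is one common choice and may differ from the one used in the paper. Function names are hypothetical.

```python
import math
from typing import List


def inverse_frequency_weights(class_counts: List[int]) -> List[float]:
    """Assumed scheme: weights proportional to 1 / n_i, normalized to sum to 1."""
    inverse = [1.0 / n for n in class_counts]
    total = sum(inverse)
    return [v / total for v in inverse]


def weighted_cross_entropy(probs: List[float], target: int, weights: List[float]) -> float:
    """Loss for one sample with a one-hot target: -w_target * log(p_target)."""
    return -weights[target] * math.log(probs[target])


weights = inverse_frequency_weights([1, 3])  # class 0 is the minority class
print(weights)
print(weighted_cross_entropy([0.5, 0.5], target=0, weights=weights))
print(weighted_cross_entropy([0.5, 0.5], target=1, weights=weights))
```

For the same predicted probability, a mistake on the minority class costs three times as much as one on the majority class in this toy setting, which is the intended effect of the weighting.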
We discuss experiment results of our model in this section.
The EmotionX 2019 Challenge is a shared task of Social NLP 2019 (https://sites.google.com/site/socialnlp2019/) on detecting emotion in dialogue utterances. Two datasets (https://sites.google.com/view/emotionx2019/shared-task/datasets) are released for the challenge, and participants are asked to detect the emotion among four labels (i.e., Neutral, Joy, Sadness, and Anger). One is the Friends dataset [Chen et al.2018], which consists of multi-party conversations collected from a famous TV series, and the other is the EmotionPush dataset [Huang and Ku2018], which contains messages collected from social network messengers (e.g., Facebook).
Each dataset contains 4,000 dialogues: 1,000 original English-language dialogues and 3,000 augmented versions, which are back-translated via French, German, and Italian, respectively. Each dialogue is composed of several utterances, each labeled with one of 7 emotions (e.g., Neutral, Joy, and Sadness). Among the emotions, we regard four emotions (i.e., Fear, Surprise, Disgust, and Non-Neutral) as Out-Of-Domain, since they are not tested in the evaluation phase.
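The label handling described above amounts to a small mapping from the annotated labels to five training classes. The sketch below assumes lower-cased label strings and a hypothetical class name `out-of-domain`; the actual label spellings in the released files may differ.

```python
IN_DOMAIN = ("neutral", "joy", "sadness", "anger")
OUT_OF_DOMAIN = ("fear", "surprise", "disgust", "non-neutral")
OOD_CLASS = "out-of-domain"  # hypothetical name for the merged class


def to_training_label(label: str) -> str:
    """Keep the four evaluated emotions and merge the rest into one class."""
    key = label.strip().lower()
    if key in IN_DOMAIN:
        return key
    if key in OUT_OF_DOMAIN:
        return OOD_CLASS
    raise ValueError(f"unknown label: {label!r}")


print([to_training_label(x) for x in ["Joy", "Surprise", "Non-Neutral", "Anger"]])
```

Raising on unknown labels, rather than silently mapping them to the merged class, makes annotation typos visible during preprocessing.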
Table 2 describes the data and label distributions of the Friends and EmotionPush datasets. In both datasets, Neutral is the most common class, followed by Joy, Sadness, and Anger. Both datasets have an imbalanced class distribution, and the proportions of Sadness and Anger in particular are very small. For instance, they account for only 3.4% and 5.2%, respectively, in the Friends dataset. In EmotionPush, the Anger label accounts for less than 1% of the training set.
For the evaluation phase, about 3,000 utterances from 240 dialogues are given, each to be assigned one of the four emotions. The class distribution of each evaluation set is similar to that of the corresponding training set.
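The learning rate decreases from $\eta_{max}$ to $\eta_{min}$ according to a cosine annealing schedule [Loshchilov and Hutter2016] as follows:

$$\eta_t = \eta_{min}^{i} + \frac{1}{2}\left(\eta_{max}^{i} - \eta_{min}^{i}\right)\left(1 + \cos\left(\frac{T_{cur}}{T_i}\pi\right)\right)$$

where $i$ denotes the index of the run, $T_i$ is the number of epochs in the $i$-th run, and $T_{cur}$ refers to the number of epochs since the last restart. We use $\eta_{max}$ as the initial learning rate and adopt the Adam optimizer [Kingma and Ba2014] with the scheduled learning rate.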
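The cosine annealing schedule of [Loshchilov and Hutter2016] within a single run can be written as a short function. The parameter names and the numeric values in the usage lines are illustrative assumptions, not the settings used in the experiments.

```python
import math


def cosine_annealing_lr(
    t_cur: float, t_i: float, eta_max: float, eta_min: float = 0.0
) -> float:
    """Learning rate after t_cur epochs of a run that lasts t_i epochs."""
    if t_i <= 0:
        raise ValueError("t_i must be positive")
    cosine = math.cos(math.pi * t_cur / t_i)
    return eta_min + 0.5 * (eta_max - eta_min) * (1.0 + cosine)


# Hypothetical run of 10 epochs, annealing from 1e-3 down to 0.
for epoch in (0, 5, 10):
    print(epoch, cosine_annealing_lr(epoch, 10, eta_max=1e-3))
```

The rate starts at `eta_max`, passes through the midpoint halfway through the run, and reaches `eta_min` at the end of the run.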
We adopt the pre-trained uncased BERT-Base model as the transferable language model, with a maximum input length of 512. The number of stacked layers, each combining multi-head attention and a feed-forward neural network (N in Figure 1), is 12. The language model is post-trained via next sentence prediction (NSP) and masked language modeling (MLM) on the released Friends, EmotionPush, and Emory (https://github.com/emorynlp/emotion-detection) [Zahiri and Choi2018] datasets for 100,000 steps. The dimension of the hidden representations is set to 768, and the internal hidden size of the classification layer is set to 384. The number of classes is five: the four target classes and an out-of-domain class.
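To evaluate prediction performance, we mainly use the micro-F1 score, which is equivalent to weighted accuracy (WA) when every instance is tagged with exactly one class, as in the EmotionLines dataset. It is obtained by the formula below.

$$\text{micro-F1} = \frac{2 \cdot P \cdot R}{P + R}, \qquad P = \frac{\sum_{c} TP_c}{\sum_{c} (TP_c + FP_c)}, \qquad R = \frac{\sum_{c} TP_c}{\sum_{c} (TP_c + FN_c)}$$

where $TP_c$, $FP_c$, and $FN_c$ are the numbers of true positives, false positives, and false negatives for class $c$.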
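The micro-averaged F1 score can be computed by pooling true positives, false positives, and false negatives over all classes, as in the standard definition. The sketch below is illustrative (the function name and the toy labels are assumptions) and shows that, when each instance has exactly one label and all classes are scored, the result coincides with plain accuracy.

```python
from typing import Sequence


def micro_f1(gold: Sequence[str], pred: Sequence[str], labels: Sequence[str]) -> float:
    """Micro-averaged F1 over the given label set."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    tp = fp = fn = 0
    for label in labels:
        for g, p in zip(gold, pred):
            if p == label and g == label:
                tp += 1
            elif p == label:
                fp += 1
            elif g == label:
                fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


gold = ["joy", "neutral", "anger", "neutral"]
pred = ["joy", "neutral", "neutral", "anger"]
accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
print(micro_f1(gold, pred, ["joy", "neutral", "anger"]), accuracy)
```

If the score is restricted to a subset of the labels (for example, excluding an out-of-domain class), the pooled counts change and the equality with accuracy no longer holds in general.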
Experiments are conducted with the released Friends and EmotionPush datasets, augmented via back-translation, as the training set, and with their gold evaluation datasets as the test set.
To convert token sequences of different lengths into uniform-sized representations, we design two converters: dynamic averaging and dynamic max pooling. Even though the former sometimes performs better than the latter, as shown in Table 3, the overall performance of the latter is better when the two datasets are trained together. Thus, we build our model with the post-trained language model and dynamic max pooling.
When the datasets are trained separately rather than together, the overall scores on EmotionPush increase, but the performance on the Friends dataset decreases. We conjecture that the parameters of the transferable language model, pre-trained on a formal corpus, might be degraded when fine-tuning on the chat-based dialogues of EmotionPush.
For the submission version, we implement a k-fold cross-validation ensemble method with k = 5 to use all the data as efficiently as possible. Our ensemble model labels each utterance with the most-voted emotion among the decisions of the k models, which are trained with different training and validation sets.
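The k-fold ensemble can be sketched in two parts: generating k train/validation splits and taking a majority vote over the k models' predictions. Model training itself is omitted. The interleaved fold assignment, the tie-breaking rule (the label seen first among the tied ones wins), and the function names are assumptions for illustration; in practice the split would presumably be made over dialogues rather than single utterances.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def k_fold_indices(n_items: int, k: int) -> List[Tuple[List[int], List[int]]]:
    """Return k (train, validation) index splits with interleaved folds."""
    folds = [list(range(i, n_items, k)) for i in range(k)]
    splits = []
    for i in range(k):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, folds[i]))
    return splits


def majority_vote(predictions: Sequence[Sequence[str]]) -> List[str]:
    """predictions[m][u] is the label model m assigns to utterance u."""
    voted = []
    for votes in zip(*predictions):
        # most_common orders ties by first occurrence, i.e. by model order.
        voted.append(Counter(votes).most_common(1)[0][0])
    return voted


per_model = [
    ["joy", "anger"],
    ["joy", "neutral"],
    ["sadness", "neutral"],
]
print(majority_vote(per_model))
```

Each item appears in exactly one validation fold, so every training example is used for validation once and for training k - 1 times.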
We proposed a contextual emotion classifier that consists of a transferable language model and dynamic max pooling. Our model alleviates the three inherent problems of the EmotionX shared task: capturing contextual information, understanding informal text dialogues, and overcoming class imbalance. It outperforms the previous state-of-the-art model and shows competitive performance in the challenge. However, our model cannot consider all the utterances of a dialogue when the number of input tokens exceeds the preset maximum length, and we leave this limitation to future work.
This research was supported by the MSIT (Ministry of Science and ICT), Korea, under the ITRC (Information Technology Research Center) support program (IITP-2018-0-01405) supervised by the IITP (Institute for Information & communications Technology Planning & Evaluation).
EmotionX-AR: CNN-DCNN autoencoder based emotion classifier. In Proceedings of the Sixth International Workshop on Natural Language Processing for Social Media, pages 37–44, 2018.
Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, 2014.
Emotion detection on TV show transcripts with sequence-based convolutional neural networks. In Workshops at the Thirty-Second AAAI Conference on Artificial Intelligence, 2018.