Emotion recognition in conversation (ERC) has attracted considerable interest from the NLP community in recent years due to its potential applications in many areas, such as opinion mining in social media chatterjee2019semeval, dialogue generation huang2018automatic and fake news detection guo2019dean. The objective of ERC is to detect the emotion expressed by the speaker in each utterance of a conversation. Previous works on ERC usually solve this problem in two steps. In the first step, each utterance is encoded separately into an utterance-level representation. In the second step, these representations are used as the input to sequence-based models (majumder2019dialoguernn; hazarika2018icon; jiao2019higru) or graph-based models (ghosal2019dialoguegcn; ishiwatari2020relation). Despite their success, previous works still leave considerable room for improvement poria2019emotion.
Curriculum learning (CL) bengio2009curriculum is a training strategy that imitates the meaningful learning order in human curricula. The core idea of CL is to train the machine learning model on easier data subsets first, and then gradually increase the difficulty level of the data until the whole training dataset is used. As an easy-to-use plug-in, the CL strategy has demonstrated its power in improving the overall performance of various models in a wide range of scenarios wang2020survey. Inspired by the success of CL in other NLP tasks zhou2020uncertainty; liu2018curriculum; su2020dialogue, in this paper we leverage the spirit of CL to improve traditional ERC methods. Due to the hierarchical structure of the ERC datasets, we need to construct curricula at two granularities: one curriculum sorts the conversations in the dataset from easy to hard, and the other sorts the utterances in each conversation from easy to hard.
The question that arises is how to measure the difficulty of conversations and utterances. Previous studies majumder2019dialoguernn; shen2021dialogxl have reported that most ERC methods mainly suffer from two issues: 1) The “emotion shift” problem. These methods cannot efficiently handle scenarios in which the emotions of two consecutive utterances are different ghosal-etal-2021-exploring. 2) The “confusing label” problem. Previous methods ghosal2019dialoguegcn; shen2021directed usually fail to distinguish well between similar emotions. This is due to the subtle semantic difference between certain emotion labels such as happy and excited. These two phenomena provide us with the key to quantifying the difficulty of conversations and utterances in ERC.
In this paper, we design a hybrid curriculum learning (HCL) framework tailored to the ERC task. The HCL framework consists of two complementary curriculum strategies: a conversation-level curriculum (CC) and an utterance-level curriculum (UC). In CC, we construct a difficulty measurer based on the “emotion shift” frequency within a conversation, and the conversations with lower difficulty are presented to the model before harder ones. This way, the model gradually increases its ability to tackle the “emotion shift” problem.
In UC, since ERC requires reasoning over multiple utterances in the conversation, we cannot directly schedule the utterances asynchronously in the “easy to hard” scheme. As a result, we design an emotion-similarity based curriculum (ESC) to implement utterance-level curriculum learning. Specifically, inspired by the “confusing label” problem mentioned above, we assume that, within a conversation, utterances with confusing labels are more difficult than others. Therefore, we make the model focus on the utterances with easily recognizable emotion labels in the early stage, and then progressively strengthen the model’s capability of identifying the confusing emotions.
More specifically, based on previous studies in psychology plutchik1982psychoevolutionary; mikels2005emotional, we employ the intersection angle between different emotion labels in the Valence-Arousal 2D emotion space guo2019dean; yang2021circular to measure the similarity between emotion labels. During ESC, instead of a one-hot encoding, the target represents a probability distribution over all possible emotion labels. The probability of each label is determined by the similarity between that label and the gold label. In other words, instead of solely belonging to its true emotion label, each utterance can also belong to similar emotions to a lesser extent. At the beginning of the training process, the targets of utterances with the emotions happy and excited should be almost the same, but always very different from sad. During the training process, the label representation gradually shifts to the one-hot encoding. This way, small mistakes are corrected less than big mistakes in the beginning, which resembles a curriculum in which broad concepts are explained before subtle differences are emphasized.
Our hybrid curriculum learning framework is model-agnostic. We evaluate our approach on five representative ERC models. Results on four benchmark datasets demonstrate that the proposed hybrid curriculum learning framework leads to significant performance improvements.
In summary, our main contributions are as follows:
We propose a hybrid curriculum learning framework to tackle the task of ERC. In the conversation-level curriculum, we utilize the emotion-shift frequency to measure the difficulty of each conversation.
We propose emotion-similarity based curriculum learning to achieve utterance-level curriculum learning. It implements the basic idea that, at the early stage of training, distinguishing between similar emotions is less important than separating very different emotions.
We conduct experiments on four ERC benchmark datasets. Empirical results show that our proposed hybrid curriculum learning framework can effectively improve the overall performance of various ERC models, including the state-of-the-art.
Emotion Recognition in Conversation
Emotion recognition in conversations (ERC) has been widely studied due to its potential applications. The key point of ERC is how to effectively model the context of each utterance and the corresponding speaker. Existing works generally resort to deep learning methods to capture contextual characteristics, and can be divided into sequence-based and graph-based methods. Another direction is to improve the performance of existing models by incorporating various external knowledge, which we classify as knowledge-based methods.
Sequence-based Methods Many previous works consider contextual information as utterance sequences. ICON hazarika2018icon and CMN hazarika2018conversational both utilize gated recurrent units (GRUs) to model the utterance sequences. DialogueRNN majumder2019dialoguernn employs a GRU to capture the global context, which is updated by the speaker state GRUs. jiao2019higru propose a hierarchical neural network model that comprises two GRUs for the modelling of tokens and utterances respectively. hu2021dialoguecrn introduce multi-turn reasoning modules on a bi-directional LSTM to model the ERC task from a cognitive perspective.
Graph-based Methods In this category, some existing works ghosal2019dialoguegcn; ishiwatari2020relation; zhang2019modeling utilize various graph neural networks to capture multiple dependencies in the conversation. DialogXL shen2021dialogxl modifies the memory block in XLNet yang2019xlnet to store historical context and leverages the self-attention mechanism in XLNet to deal with the multi-turn multi-party structure in conversation. shen2021directed design a directed acyclic graph (DAG) to model the intrinsic structure within a conversation, which achieves state-of-the-art performance among models that do not introduce external knowledge.
Knowledge-based Methods KET zhong2019knowledge employs hierarchical transformers with concept representations extracted from ConceptNet speer2017conceptnet for emotion detection, and is the first ERC model that integrates common-sense knowledge. COSMIC ghosal2020cosmic adopts a network structure very close to DialogueRNN and adds external commonsense knowledge from ATOMIC sap2019atomic to improve its performance. TODKAT zhu2021topic leverages an encoder-decoder architecture which incorporates topic representation with commonsense knowledge from ATOMIC for ERC.
Starting from the work by bengio2009curriculum, a variety of curriculum learning approaches (wang2020survey; soviany2021curriculum) have been studied. In the field of NLP, curriculum learning has been used for various tasks such as neural machine translation (zhou2020uncertainty; liu2020norm), relation extraction huang2019self and natural answer generation liu2018curriculum. To the best of our knowledge, we are the first to leverage curriculum learning in the ERC task.
In ERC, a conversation $C$ contains a sequence of textual utterances $\{u_1, u_2, \dots, u_N\}$, where $N$ denotes the number of utterances. Each utterance $u_i$ consists of $n_i$ tokens, where $n_i$ is the length of $u_i$. There are $M$ participants $P = \{p_1, p_2, \dots, p_M\}$ in $C$. Each utterance is uttered by one participant in $P$. Given a pre-defined emotion label set $Y$, the objective of the ERC task is to predict the emotion label $y_i \in Y$ of each utterance $u_i$ in $C$ with the information provided above.
In curriculum learning, a typical curriculum design consists of two core components: a difficulty measurer and a training scheduler bengio2009curriculum. The difficulty measurer is used to quantify the relative “easiness” of each data example. The training scheduler arranges the sequence of data subsets throughout the training process based on the judgment of the difficulty measurer. For ERC-oriented curriculum learning, the challenge is how to design a suitable difficulty measurer and training scheduler for emotion recognition.
A conversation consists of a sequence of utterances. This hierarchical structure inspired us to construct two curricula for scheduling conversations and utterances respectively. Therefore, our framework consists of two nested curricula, conversation-level curriculum (CC) on the outside and utterance-level curriculum (UC) on the inside.
For CC, we design an emotion-shift based difficulty measurer. A widely used CL strategy called baby step spitkovsky2010baby is leveraged as the training scheduler. For UC, due to the characteristics of the ERC task, the utterances in the same conversation must be fed into the same batch simultaneously during the training process. As a result, it is infeasible to employ a traditional training scheduler such as baby step to arrange the training order of the utterances. We propose emotion-similarity based curriculum learning to address this issue.
The proposed HCL framework is illustrated in Figure 1, and the details of CC and UC are elaborated in the following two subsections, respectively.
To design the conversation-level curriculum for ERC, we need to answer one question: what kind of conversation should be considered easier than others? Since previous ERC models majumder2019dialoguernn; shen2021dialogxl tend to suffer from the emotion-shift issue, we adopt the emotion-shift frequency to measure the difficulty of each conversation. The main idea is that the more frequently emotion shifts occur in a conversation, the more difficult it is. Therefore, the conversation-level difficulty score of a conversation $c_i$ is defined as
$$DIF(c_i) = \frac{N_{es}(c_i) + N_{sp}(c_i)}{N_{u}(c_i) + N_{sp}(c_i)} \qquad (1)$$
where $N_{es}(c_i)$ and $N_{u}(c_i)$ denote the number of emotion-shift occurrences in $c_i$ and the total number of utterances in $c_i$, respectively. $N_{sp}(c_i)$ is the number of speakers taking part in $c_i$, and it acts as a smoothing factor.
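The difficulty measurer can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the score is the smoothed ratio (shifts + speakers) / (utterances + speakers), and that an emotion shift is counted whenever two consecutive utterances carry different labels, regardless of speaker. The function names are hypothetical.

```python
from typing import Sequence


def count_emotion_shifts(labels: Sequence[str]) -> int:
    """Count positions whose label differs from the preceding utterance's label."""
    return sum(1 for prev, cur in zip(labels, labels[1:]) if prev != cur)


def conversation_difficulty(labels: Sequence[str], speakers: Sequence[str]) -> float:
    """Smoothed emotion-shift frequency of a single conversation.

    Assumed form: (n_shifts + n_speakers) / (n_utterances + n_speakers).
    """
    if len(labels) != len(speakers):
        raise ValueError("labels and speakers must have the same length")
    if not labels:
        raise ValueError("a conversation needs at least one utterance")
    n_shifts = count_emotion_shifts(labels)
    n_utterances = len(labels)
    n_speakers = len(set(speakers))  # acts as the smoothing factor
    return (n_shifts + n_speakers) / (n_utterances + n_speakers)


if __name__ == "__main__":
    speakers = ["A", "B", "A", "B"]
    print(conversation_difficulty(["happy", "happy", "happy", "happy"], speakers))
    print(conversation_difficulty(["happy", "sad", "happy", "sad"], speakers))
```

Under these assumptions, a conversation whose emotion never changes receives a lower score than one of the same length in which every utterance flips the emotion.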
We leverage the baby step training scheduler spitkovsky2010baby to arrange conversations and organize the training process. Specifically, the whole training set $D$ is divided into different buckets, i.e., $\{D_1, D_2, \dots, D_K\}$, in which conversations with similar difficulty scores are categorized into the same bucket. The training starts from the easiest bucket. After a fixed number of training epochs or convergence, the next bucket is merged into the current training subset. Finally, after all the buckets are merged and used, the training process continues for several extra epochs. Our HCL framework is described in Algorithm 1, and the process of CC corresponds to Lines 1-5.
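The bucket-and-merge schedule described above can be sketched as follows. This is a hedged illustration: it assumes near-equal-sized buckets cut from the difficulty-sorted list and a fixed number of epochs per step (the text also allows switching on convergence, which is not modelled here). `make_buckets` and `baby_step_schedule` are hypothetical names.

```python
from typing import Callable, Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def make_buckets(items: Sequence[T], difficulty: Callable[[T], float],
                 num_buckets: int) -> List[List[T]]:
    """Sort items by difficulty and cut them into contiguous, near-equal buckets."""
    if num_buckets < 1:
        raise ValueError("num_buckets must be at least 1")
    ordered = sorted(items, key=difficulty)
    size, extra = divmod(len(ordered), num_buckets)
    buckets: List[List[T]] = []
    start = 0
    for b in range(num_buckets):
        end = start + size + (1 if b < extra else 0)
        buckets.append(ordered[start:end])
        start = end
    return buckets


def baby_step_schedule(buckets: Sequence[Sequence[T]], epochs_per_step: int,
                       extra_epochs: int) -> Iterator[List[T]]:
    """Yield the training subset for each epoch, starting from the easiest bucket."""
    current: List[T] = []
    for bucket in buckets:
        current = current + list(bucket)  # merge the next bucket into the subset
        for _ in range(epochs_per_step):
            yield list(current)
    for _ in range(extra_epochs):  # keep training on the full set at the end
        yield list(current)


if __name__ == "__main__":
    scores = [0.9, 0.1, 0.5, 0.3]  # stand-ins for per-conversation difficulty
    for epoch, subset in enumerate(baby_step_schedule(make_buckets(scores, float, 2), 2, 1)):
        print(epoch, subset)
```

Each yielded list is the data that one epoch would iterate over; the model and optimizer are deliberately left out of the sketch.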
As it is infeasible to employ a traditional CL training scheduler to asynchronously arrange the order of the utterances, the question becomes how to measure the difficulty of the utterances and establish a feasible curriculum at the utterance level.
We address this problem by assuming that the utterances with confusing emotion labels are more difficult for prediction and our utterance-level curriculum is based on the pairwise similarities between the emotion labels.
Previous studies in psychology plutchik1982psychoevolutionary; mikels2005emotional; russell1980circumplex hold that emotion has two dimensions, arousal and valence, and use a wheel-like 2D coordinate system to describe emotions. Inspired by these works, we propose a new emotion wheel, shown in Figure 2, which contains all emotions in the standard ERC datasets. As depicted in Figure 2, each emotion label can be mapped to a point on the unit circle. We then calculate the similarity between emotion labels as in Equation 2.
Here, $sim(i, j)$ stands for the similarity of label $i$ and label $j$, and $V_i$ denotes the valence value of label $i$. We take the cosine of the included angle $\theta_{ij}$ between $i$ and $j$ as their similarity. If $\cos\theta_{ij} < 0$ (i.e., $\theta_{ij} > 90^\circ$), the similarity is set to 0. If the valence polarities of $i$ and $j$ are opposite, the similarity is also set to 0. The similarity between the label neutral and each of the other labels is defined as $1/N_e$, where $N_e$ is the total number of emotions in the corresponding dataset.
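The rules above (cosine of the included angle, clipped at zero, and zero for opposite valence polarities) can be sketched as below. The angles in `EXAMPLE_ANGLES` are made-up placeholders and not the coordinates of the emotion wheel in Figure 2; the sketch also assumes that valence is the horizontal coordinate of a label on the unit circle, and it leaves the special handling of neutral out.

```python
import math
from typing import Dict

# Hypothetical label positions (degrees on the unit circle), for illustration only.
EXAMPLE_ANGLES: Dict[str, float] = {
    "happy": 20.0,
    "excited": 50.0,
    "angry": 120.0,
    "sad": 200.0,
}


def valence(angle_deg: float) -> float:
    """Assumed convention: valence is the x-coordinate of the point on the unit circle."""
    return math.cos(math.radians(angle_deg))


def label_similarity(a: str, b: str, angles: Dict[str, float]) -> float:
    """Cosine of the included angle, set to 0 when negative or when valences are opposite."""
    if valence(angles[a]) * valence(angles[b]) < 0:
        return 0.0  # opposite valence polarity
    cos_between = math.cos(math.radians(angles[a] - angles[b]))
    return max(cos_between, 0.0)  # an angle above 90 degrees gives similarity 0


if __name__ == "__main__":
    for pair in [("happy", "excited"), ("happy", "sad"), ("angry", "sad")]:
        print(pair, round(label_similarity(*pair, EXAMPLE_ANGLES), 4))
```

With these placeholder angles, happy and excited come out as highly similar, while happy and sad get similarity 0 because their valence polarities differ.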
The process of emotion-similarity based curriculum learning (ESC) is described in Lines 6-13 of Algorithm 1. We first calculate the similarity between each pair of emotion labels as in Equation 2 and generate the emotion similarity matrix $S$; then $S$ is normalized as $\hat{S}$. At the beginning of ESC training, we take the rows of $\hat{S}$ as the initial target probability distributions over all possible classes, where each row corresponds to an emotion label. That is, instead of solely belonging to its ground-truth label, each input utterance can also belong to similar labels to a lesser extent. During the training process, this label representation is gradually shifted towards the standard one-hot encoding. We define the update strategy as in Lines 9-11, where $\hat{S}^{t}_{i,j}$ denotes the probability of the $j$-th element of the $i$-th row in $\hat{S}$ at training step $t$. The constant parameter $\epsilon$ controls how quickly the label vectors converge to the one-hot-encoded labels. Row-wise normalization is performed after each update. This update strategy leads to a proper label-weighting curriculum.
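One way to realise such an update is sketched below. The exact rule of Algorithm 1 is not reproduced in the text, so this is an assumption: every off-diagonal entry is multiplied by a decay factor in (0, 1), after which each row is renormalized. Any rule that shrinks the off-diagonal mass relative to the diagonal would give the same qualitative behaviour. The function names are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def normalize_rows(matrix: Matrix) -> Matrix:
    """Scale every row so that it sums to 1."""
    normalized: Matrix = []
    for row in matrix:
        total = sum(row)
        if total <= 0.0:
            raise ValueError("each row needs a positive sum")
        normalized.append([value / total for value in row])
    return normalized


def decay_towards_one_hot(targets: Matrix, decay: float) -> Matrix:
    """Assumed update: shrink off-diagonal entries by `decay`, then renormalize rows."""
    if not 0.0 < decay < 1.0:
        raise ValueError("decay must lie strictly between 0 and 1")
    shrunk = [
        [value if i == j else value * decay for j, value in enumerate(row)]
        for i, row in enumerate(targets)
    ]
    return normalize_rows(shrunk)


if __name__ == "__main__":
    similarity = [[1.0, 0.8, 0.0], [0.8, 1.0, 0.0], [0.0, 0.0, 1.0]]  # toy values
    targets = normalize_rows(similarity)
    for step in range(4):
        print(step, [round(v, 3) for v in targets[0]])
        targets = decay_towards_one_hot(targets, decay=0.5)
```

Under this rule the ratio between an off-diagonal entry and its diagonal entry shrinks geometrically, so a smaller decay factor moves the targets to one-hot vectors in fewer updates.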
For each training step, the predicted probability distribution of utterance $u_{i,j}$ is denoted as $P_{i,j}$. Finally, the model is trained with the standard cross-entropy loss function as in Equation 3, where $P_{i,j}[k]$ denotes the predicted probability that the label of $u_{i,j}$ in conversation $c_i$ is $k$, and $\hat{S}^{t}_{y_{i,j},k}$ denotes the target probability of label $k$ in the current label-similarity matrix at training step $t$. $L$ is the total number of conversations in the training set, and $N_i$ is the number of utterances in conversation $c_i$. In this way, we implement UC through ESC.
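A cross-entropy against soft targets can be written out as below. This is a plain-Python sketch for illustration: it assumes the loss is summed (not averaged) over conversations and utterances, that `target_rows[g]` is the current target distribution for gold label index `g`, and that predictions are already normalized probabilities. All names are hypothetical.

```python
import math
from typing import Sequence


def soft_cross_entropy(predicted: Sequence[float], target: Sequence[float],
                       eps: float = 1e-12) -> float:
    """Return -sum_k target[k] * log(predicted[k]) for a single utterance."""
    if len(predicted) != len(target):
        raise ValueError("predicted and target must have the same length")
    # Terms with zero target weight contribute nothing and are skipped.
    return -sum(t * math.log(max(p, eps)) for p, t in zip(predicted, target) if t > 0.0)


def corpus_loss(predictions: Sequence[Sequence[Sequence[float]]],
                gold: Sequence[Sequence[int]],
                target_rows: Sequence[Sequence[float]]) -> float:
    """Sum the soft cross-entropy over every utterance of every conversation."""
    total = 0.0
    for conv_predictions, conv_gold in zip(predictions, gold):
        for distribution, label in zip(conv_predictions, conv_gold):
            total += soft_cross_entropy(distribution, target_rows[label])
    return total


if __name__ == "__main__":
    rows = [[0.7, 0.3, 0.0], [0.3, 0.7, 0.0], [0.0, 0.0, 1.0]]  # toy soft targets
    preds = [[[0.6, 0.3, 0.1], [0.2, 0.2, 0.6]]]  # one conversation, two utterances
    print(corpus_loss(preds, [[0, 2]], rows))
```

When every target row is one-hot, this reduces to the usual cross-entropy, which is the regime the targets converge to at the end of training.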
We evaluate our method on the following four published ERC datasets (these datasets are multi-modal; we only focus on the textual information so as to be consistent with previous works): IEMOCAP busso2008iemocap, MELD poria2019meld, DailyDialog li2017dailydialog and EmoryNLP zahiri2018emotion. The detailed statistics of the datasets are reported in Table 1 (some baseline methods made slight adjustments to the data splits; for a fair comparison, we keep exactly the same settings as the corresponding methods).
Following previous works (ghosal2019dialoguegcn; zhong2019knowledge; ishiwatari2020relation), the evaluation metrics are chosen as micro-F1 excluding the extremely high majority class (neutral) for DailyDialog and weighted-F1 for the other three datasets.
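For reference, the two metrics can be computed from label sequences as follows. This is a self-contained sketch with hypothetical function names; it follows the usual definitions (micro-F1 pooled over all non-excluded classes, weighted-F1 as the support-weighted mean of per-class F1), which is also what scikit-learn's `f1_score` computes with `average="micro"` plus a restricted `labels` list and with `average="weighted"`.

```python
from collections import Counter
from typing import Sequence


def micro_f1_excluding(gold: Sequence[str], pred: Sequence[str],
                       excluded: str = "neutral") -> float:
    """Micro-averaged F1 over every class except `excluded`."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    true_pos = sum(1 for g, p in zip(gold, pred) if g == p and g != excluded)
    pred_pos = sum(1 for p in pred if p != excluded)
    gold_pos = sum(1 for g in gold if g != excluded)
    if true_pos == 0:
        return 0.0
    # 2TP / (2TP + FP + FN), with TP + FP = pred_pos and TP + FN = gold_pos
    return 2.0 * true_pos / (pred_pos + gold_pos)


def weighted_f1(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Per-class F1 averaged with weights proportional to class support in gold."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    if not gold:
        raise ValueError("at least one example is required")
    total = 0.0
    for label, support in Counter(gold).items():
        true_pos = sum(1 for g, p in zip(gold, pred) if g == p == label)
        predicted = sum(1 for p in pred if p == label)
        f1 = 2.0 * true_pos / (support + predicted) if true_pos else 0.0
        total += f1 * support
    return total / len(gold)


if __name__ == "__main__":
    gold = ["neutral", "happy", "sad", "happy"]
    pred = ["neutral", "happy", "happy", "neutral"]
    print(micro_f1_excluding(gold, pred), weighted_f1(gold, pred))
```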
Since HCL is a model-agnostic framework, we choose the following five ERC models to verify whether HCL is able to further improve the performance of these models.
DialogueRNN majumder2019dialoguernn: A well-known sequence-based ERC model, which uses three GRUs to model the speaker, the context given by the preceding utterances, and the emotion behind the preceding utterances.
DialogueGCN ghosal2019dialoguegcn: A representative graph-based ERC model. It captures self-dependency and inter-speaker dependency by using two-layer graph neural networks.
DAG-ERC shen2021directed: The state of the art among ERC models that do not employ external knowledge. DAG-ERC utilizes a directed acyclic graph to model the structure of a conversation.
COSMIC ghosal2020cosmic: A representative knowledge-based ERC model. It leverages external commonsense knowledge to improve the performance.
TODKAT zhu2021topic: The state-of-the-art knowledge-based ERC model. Besides commonsense knowledge, it also incorporates topic information.
All of the baseline models mentioned above have released their source code. We keep exactly the same settings as reported in the original papers during our experiments. For HCL, the tunable hyperparameters include the number of buckets in CC, the maximum number of training epochs during each baby step, the interval (in steps) between training target updates in UC, and the decay factor in UC. These hyperparameters are manually tuned on each dataset with hold-out validation. The results reported in our experiments are all based on the average score of 5 random runs on the test set. Our experiments are conducted on a single Tesla V100M32 GPU.
Results and Analysis
The overall experimental results are reported in Table 2, where “X+HCL” means training the model X with the proposed HCL framework. We can see that HCL has improved the performance of all baseline models, showing the robustness and universality of our approach.
In general, the performance boosts achieved by HCL on models with simpler feature extractors (i.e., DialogueRNN and DialogueGCN) are more remarkable. An exception is that TODKAT+HCL achieves significant improvements on three datasets. The reason may be that the original TODKAT model does not take speaker information into account, while our CC introduces the inter-speaker emotion shift into the difficulty measurer, which is equivalent to considering speaker information to a certain extent and is beneficial for TODKAT.
To reveal the individual effects of CC and UC, we try different variants of HCL on TODKAT by removing either CC or UC. The experimental results on IEMOCAP and EmoryNLP are shown in Table 3, from which we see that both CC and UC make positive contributions to the overall performance when used alone. Although only utilizing UC leads to larger improvements than only using CC, the optimal performance is achieved when CC and UC are combined, indicating that CC and UC are complementary to each other.
In addition, we also tried two other strategies to combine CC and UC: CC-First (CCF) and UC-First (UCF). CCF performs CC and then UC in a pipeline manner. In UCF, the execution order of CC and UC is reversed. The results of CCF and UCF are also outlined in Table 3. They show that UCF is better than CCF, and that HCL outperforms both CCF and UCF. This is intuitive, because UCF follows the order from fine-grained to coarse-grained, which is more in line with the “easy to hard” scheme in CL. Compared with UCF, HCL makes UC and CC interact with each other during the training process, which is more consistent with the hierarchical structure of conversation, so the performance is even better than that of UCF.
Performance for Emotion-shift
To verify the effect of HCL in the emotion-shift scenario, we summarize the results of TODKAT+HCL on different types of utterances. The results are presented in Table 4, where ES and N-ES denote utterances with emotion-shift and utterances without emotion-shift, respectively. HCL improves the performance of TODKAT on both ES and N-ES of the two datasets. The improvement on ES in EmoryNLP is more significant than on ES in IEMOCAP.
A plausible explanation is that the training set of IEMOCAP contains far fewer conversations and the average length of its conversations is much longer, so the difficulty scores of conversations in IEMOCAP are usually lower. Therefore, for IEMOCAP, the difference in difficulty between the buckets in the training scheduler is not as pronounced as for EmoryNLP.
Performance on Different Emotions
In this subsection, we aim to verify whether HCL can improve the performance of the baseline model on “confusing labels”. For each pair of emotion labels in an ERC dataset, if their similarity (defined in Equation 2) is larger than 0, then both of them are regarded as confusing labels in our setting (neutral is not included). We report the results of DAG-ERC and DAG-ERC+HCL on every emotion label in IEMOCAP. There are a total of four confusing labels in this dataset: happy (H), excited (E), sad (S) and frustrated (F). As presented in Table 5, DAG-ERC+HCL outperforms DAG-ERC on all emotion labels other than neutral, and the overall performance on the confusing labels is better (69.37 vs 67.88 on weighted-F1). This shows that HCL does strengthen the ability of DAG-ERC to distinguish the confusing emotion labels. However, the performance is limited by neutral. The reason is that neutral is similar to every other label to some extent under Equation 2, which makes it harder to recognize.
Figure 3(a) shows a conversation passage sampled from the IEMOCAP dataset. The goal is to predict the emotion label of the last utterance in the blue box. Because an emotion shift occurs, all the baseline methods in our experiment tend to mistakenly identify the emotion as frustrated. Most of our “X+HCL” methods are able to recognize the emotion of this utterance correctly, which indicates that HCL alleviates this problem to some extent. Figure 3(b) depicts a case with confusing labels. The gold emotion label of the last utterance in the red box is excited. Some of the baseline models, such as DialogueGCN and DAG-ERC, mistook the emotion for happy. After being trained with the HCL framework, DialogueGCN+HCL and DAG-ERC+HCL successfully identified the correct label excited.
Why Does Curriculum Learning Work?
According to the theory of curriculum learning bengio2009curriculum, the curriculum will work only if the entropy of the data distribution increases during the training process. In HCL, the conversation-level curriculum leverages the emotion-shift frequency to measure the difficulty. The more frequently emotion shifts occur in a conversation, the greater the diversity of the emotion labels and, in other words, the higher the entropy. For the utterance-level curriculum, since emotion-similarity based CL does not distinguish similar emotions in the early stage, it is equivalent to merging some emotion labels and can be considered as reducing the diversity of emotions. As a result, it also meets the condition that the entropy should increase gradually.
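The entropy argument can be made concrete with a toy calculation. The sketch below is an illustration only: it uses the Shannon entropy of the label histogram of a single conversation as a stand-in for "diversity of the emotion labels", and it models the early stage of the utterance-level curriculum as a hard merge of similar labels, which is a simplification of the soft targets used in ESC. The function names and label groupings are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence


def label_entropy(labels: Sequence[str]) -> float:
    """Shannon entropy (in bits) of the empirical label distribution."""
    counts = Counter(labels)
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def merge_labels(labels: Sequence[str], groups: Dict[str, str]) -> List[str]:
    """Replace each label by its group representative, if one is given."""
    return [groups.get(label, label) for label in labels]


if __name__ == "__main__":
    steady = ["happy", "happy", "happy", "happy"]        # no emotion shift
    shifting = ["happy", "excited", "sad", "sad"]        # two emotion shifts
    merged = merge_labels(shifting, {"excited": "happy"})  # similar labels merged
    print(label_entropy(steady), label_entropy(merged), label_entropy(shifting))
```

In this toy case the entropy grows from the conversation without shifts, to the shifting conversation with similar labels merged, to the same conversation with all labels kept apart.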
In this paper, we propose a simple but effective hybrid curriculum learning (HCL) framework for emotion recognition in conversations. HCL is a flexible framework that is independent of the underlying model. During training, HCL simultaneously employs conversation-level and utterance-level curricula to carry out the training process following an easy-to-hard schema. The conversation-level curriculum consists of an emotion-shift based difficulty measurer and a baby step scheduler. The utterance-level curriculum is implemented as emotion-similarity based CL. Experiments on four benchmark datasets demonstrate the generality and effectiveness of HCL.
In the future, we plan to improve our method in three directions. First, we will attempt to seek other suitable features to construct difficulty measurer for ERC. Second, we aim to introduce other training schedulers for CL to further improve the performance. Finally, we aim to apply a learning-based approach to model the similarity between emotion labels.