Code for WASSA'22 shared task.
Detecting emotions in language is important for achieving complete interaction between humans and machines. This paper describes our contribution to the WASSA 2022 shared task, which addresses this crucial task of emotion detection. Given an essay text, we must identify one of the following emotions: sadness, surprise, neutral, anger, fear, disgust, or joy. We use an ensemble of ELECTRA and BERT models to tackle this problem, achieving an F1 score of 62.76. Our project (https://wandb.ai/acl_wassa_pictxmanipal/acl_wassa) is publicly available.
Even after engineering a 175B-parameter language model like GPT-3 (Brown et al., 2020), we are far from artificial general intelligence. One key factor separating these language models from human cognition is emotion. Emotion is a concept that is challenging to describe; however, as human beings, we understand the emotional effect that situations can have on other people. It is interesting to see how we can infuse this knowledge into machines. This work explores whether it is possible for machines to consciously map emotions to situations. Emotion in text has been studied for quite a while and has yielded some interesting insights. The dataset we use is an extended version of the Ekman (1992) dataset. Our team, MPA_ED, participated in the WASSA 2022 Shared Task on Empathy Detection and Emotion Classification, Track 2: Emotion Classification (EMO), which consists of predicting the emotion at the essay level. As in the last edition of the workshop, our best single-model solution uses ELECTRA (Mundra et al., 2021). This paper makes the following contributions:
We propose three new datasets generated using various sampling techniques that overcome the class imbalance. We present our ensemble-based solution, consisting of multiple ELECTRA and BERT (Devlin et al., 2018) models, for the emotion classification task. We provide a detailed analysis of the performance of the cluster of models and reflect on the shortcomings of both the models and the generated datasets that affected performance.
The dataset consists of 1860 data points, each comprising an essay and its emotion label. The emotions are classified into seven types: anger, disgust, fear, joy, neutral, sadness, and surprise. The validation and test splits have 270 and 525 data points respectively. The training classes exhibit high imbalance, as shown in Fig 1. The emotion "sadness" has the most data points, whereas "joy" has the fewest. The distribution is highly skewed, and hence data augmentation is required to mitigate it. We performed basic preprocessing such as removing punctuation, numbers, multiple spaces, and single newline characters.
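The basic preprocessing described above can be sketched with regular expressions; the exact rules used in our pipeline may differ slightly, so treat this as an illustrative sketch:

```python
import re

def preprocess(essay: str) -> str:
    """Basic cleaning sketch: drop newlines, punctuation, digits,
    and repeated spaces, then lowercase the essay."""
    essay = essay.replace("\n", " ")          # single newline characters
    essay = re.sub(r"[^\w\s]", " ", essay)    # punctuation
    essay = re.sub(r"\d+", " ", essay)        # numbers
    essay = re.sub(r"\s+", " ", essay)        # multiple spaces
    return essay.strip().lower()
```

Applied to a raw essay, this produces a single lowercase line of space-separated words.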
To overcome the class imbalance, the GoEmotions dataset, a similar dataset with 27 emotions, is used. We explored undersampling and (repeated) oversampling of the provided dataset; however, this performed poorly compared to our data augmentation methods. We propose three data augmentation techniques using this dataset, described as follows:
Augmented Over-UnderSampling (AOUS): Let n denote the target number of data points per class. If a class has more than n data points, we undersample it by randomly removing essays. Otherwise, we oversample it by adding the longest Reddit comments from the GoEmotions dataset (sorted by length) (Fig 3). As the average comment length in the GoEmotions dataset is 12 words while the average essay length in the WASSA dataset is 84, the longest comments are chosen for oversampling.
Random synthetic oversampling (RSO): We observe a significant difference between the average comment length of the GoEmotions dataset and the average essay length of the WASSA dataset. To avoid disturbing the length distribution of the WASSA dataset after oversampling, we create synthetic essays by concatenating multiple random comments with the same emotion (Fig 4). We match the length distribution of the synthetically generated essays with that of the original dataset using systematic sampling, and eliminate the deficit in each class by adding synthetically generated essays.
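The two ingredients of RSO can be sketched as below; the helper names are hypothetical, and lengths are measured in words for illustration:

```python
import random

def make_synthetic_essay(comments, target_len, rng):
    """Concatenate random same-emotion comments until the word count
    reaches target_len, then truncate to exactly target_len words."""
    words = []
    while len(words) < target_len:
        words += rng.choice(comments).split()
    return " ".join(words[:target_len])

def systematic_lengths(essay_lengths, k):
    """Systematic sampling: take every n-th value from the sorted list of
    original essay lengths, so the k synthetic essays follow the
    original length distribution."""
    lengths = sorted(essay_lengths)
    step = max(1, len(lengths) // k)
    return lengths[::step][:k]
```

For each minority class, one would draw target lengths with `systematic_lengths` and build one synthetic essay per target length.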
Augmented Oversampling (AOS): Let m denote the highest number of data points in any class. If a class has fewer than m data points, it is oversampled by adding the longest comments from the GoEmotions dataset (Fig 3).
The data distribution post-augmentation is balanced, with the number of samples in the AOS, RSO, and AOUS datasets equal to 4528, 4828, and 2800 respectively.
Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2018) is a transformer-based (Vaswani et al., 2017) machine learning technique for natural language processing pre-training developed by Google.
ELECTRA (Clark et al., 2020) is a variation of BERT with a different pre-training approach, which requires less compute time than BERT.
A linear layer followed by softmax is used as the classification head, trained with cross-entropy loss. We employed the Adam optimizer with a fixed learning rate and batch size, and fixed the torch seed to 3407.
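A minimal sketch of the head and training setup in PyTorch follows; the learning rate shown is illustrative, and the random tensors stand in for the encoder's pooled outputs:

```python
import torch
import torch.nn as nn

torch.manual_seed(3407)  # fixed torch seed, as in the paper

NUM_EMOTIONS = 7
HIDDEN = 768  # hidden size of a base BERT/ELECTRA encoder (assumption)

# Linear classification head; softmax is implicit in CrossEntropyLoss.
head = nn.Linear(HIDDEN, NUM_EMOTIONS)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(head.parameters(), lr=2e-5)  # lr is illustrative

# One toy training step on stand-in pooled encoder outputs.
pooled = torch.randn(8, HIDDEN)                   # batch of 8 essays
labels = torch.randint(0, NUM_EMOTIONS, (8,))
logits = head(pooled)
loss = criterion(logits, labels)
loss.backward()
optimizer.step()
```

In the full system the head sits on top of the pre-trained BERT or ELECTRA encoder and the whole network is fine-tuned end to end.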
In the final system, we use an ensemble of ELECTRA and BERT models trained on datasets with various augmentations.
We conducted extensive experimentation and observed that some models performed substantially better than others. We shortlisted models based on the validation F1-score and ensembled them for better performance. The four shortlisted models, combined with majority voting, are: BERT with AOUS, ELECTRA with AOS, ELECTRA with RSO, and ELECTRA with AOUS.
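Majority voting over the four models' per-essay predictions can be sketched as:

```python
from collections import Counter

def majority_vote(predictions):
    """Combine per-model predictions by majority voting.
    `predictions` is a list of equal-length prediction lists, one per
    model; ties are broken in favour of the earlier-listed model."""
    ensembled = []
    for votes in zip(*predictions):
        ensembled.append(Counter(votes).most_common(1)[0][0])
    return ensembled
```

With four voters, a 2-2 tie is possible; here `Counter` resolves it by insertion order, so the ordering of the model list acts as a tie-break priority.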
The confusion matrix of the resultant ensemble is shown in Fig 6. Note that all confusion matrices are normalized by the number of true samples in each class of the evaluation dataset. We deduce the following observations:
When the true label is "disgust", all models confuse the emotions "anger" and "disgust". All models have below-average performance on "anger" and "disgust".
Models trained on the AOUS dataset (c, d in Fig 5) are less prone to confusion between multiple close classes like "disgust", "fear", and "sadness".
The emotions "anger" and "disgust" do not benefit from the ensemble, whereas "fear" suffers slightly. However, we observe that the emotions "neutral", "sadness", and "surprise" experience significant gains from this process.
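The row normalization applied to all confusion matrices above (dividing each row by the number of true samples in that class) can be sketched with NumPy:

```python
import numpy as np

def normalized_confusion(y_true, y_pred, num_classes):
    """Confusion matrix normalized row-wise by true-class counts,
    so each row sums to 1 for classes that occur in y_true."""
    cm = np.zeros((num_classes, num_classes))
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    row_sums = cm.sum(axis=1, keepdims=True)
    return cm / np.where(row_sums == 0, 1, row_sums)  # avoid divide-by-zero
```

Entry (i, j) then gives the fraction of class-i essays predicted as class j, which makes per-class confusion comparable despite the skewed evaluation distribution.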
Some of the observations made during our extensive experimentation are as follows:
Batch size 8 outperforms larger batch sizes: We observed improvements across all models and datasets using a batch size of 8 over 32 or 64. We speculate that this is because a smaller batch size aids generalization, as the stochasticity of individual batches increases.
ELECTRA fine-tuned on the AOUS dataset outperforms other models: ELECTRA performs better than BERT on all our augmented datasets. We believe models fine-tuned on the AOUS dataset perform better because it has 400 labels per class, making the dataset balanced while limiting the adulteration introduced by the GoEmotions dataset.
Multi-task learning has poor performance: We experimented with multi-task learning where empathy and distress tasks (Track 1) and emotion classification task (Track 2) were trained together with a shared backbone. We observed that the training was erratic, and the training loss did not converge.
Models are sensitive to data imbalance: When trained on the original dataset with class imbalance, the model is biased towards predicting classes with more training samples. We used the data augmentation techniques mentioned in Section 3 to tackle this issue. After handling the class imbalance with data augmentation, the macro F1 score of the BERT model increased.
Emotion "joy" vs "surprise": These are the only two positive emotions in the dataset. We expected all of the models to confuse these emotions, as they are semantically similar. However, to our "surprise", we observed that the models performed spectacularly on these two emotions. We think this is because "surprise" and "joy" have distinct appearances in the corpus: "surprise" examples carry some sort of exclamation or questioning tone, which leaves "joy" as the only other positive emotion in the corpus.

Fig 5: Confusion matrices of (a) ELECTRA with AOS, (b) ELECTRA with RSO, (c) ELECTRA with AOUS, and (d) BERT with AOUS.
Randomly created synthetic essays provide little understanding: We observed the model trained on RSO augmented data often predicts other emotions as "sadness" (see Fig 5 (b)). We speculate this is because there was no addition of synthetically generated data for the "sadness" class as it is the largest class.
We further hypothesize the synthetic data in RSO, being randomly concatenated, disrupts the context of the entire essay as a whole.
We present the following statistics: true positives (TP), standard deviation (σ), and mean (μ), computed over the four shortlisted models.
The highest mean TP is for the "sadness" and "fear" emotions, with μ = 76 and 67.25 respectively. Interestingly, both of these emotions also have the lowest standard deviations, with σ = 3.92 and 2.87 respectively.
The lowest mean TP is for the "disgust" and "joy" emotions, with μ = 31 and 48.5 respectively. "joy" also has the highest standard deviation (σ = 8.81), which suggests the models each classify different data points correctly as "joy". "disgust", in contrast, has one of the lowest standard deviations (σ = 4.0), just behind "fear" and "sadness"; this suggests all models agree on only a small subset of the class data to classify as "disgust".
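The per-class mean and standard deviation above are straightforward to compute from a (models × classes) table of true-positive counts; the sketch below shows the computation with placeholder values, not the paper's actual counts:

```python
import numpy as np

def tp_stats(tp_counts):
    """Given a (models x classes) array of per-class true-positive counts,
    return the per-class mean and (population) standard deviation
    across models."""
    tp = np.asarray(tp_counts, dtype=float)
    return tp.mean(axis=0), tp.std(axis=0)
```

A low σ with a high μ (as for "sadness" and "fear") means the models agree on which essays they get right; a high σ (as for "joy") means each model succeeds on different essays.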
In this work, we have explored BERT and ELECTRA as a means to the task of emotion classification. Various data sampling techniques were used to overcome the large class imbalance. In the end, the best metrics were achieved by a majority-voting ensemble of the four best models. We foresee multiple future directions, including multi-task learning with a shared backbone, pretraining on the entire GoEmotions dataset, and studying and rectifying the spurious behaviour of the "anger" and "disgust" labels.