1 Introduction
Sentiment analysis is a subject that has long interested multiple researchers in the domain of natural language understanding. It is a task which aims to identify sentiment polarity for a given signal, which can be of the audio, visual or textual modality. Emotion recognition is a related task which consists of assigning more fine-grained labels, such as anger, joy, sadness, etc.
This work focuses on analyzing textual inputs. The ability to recognize the sentiment or emotion behind a given sentence or paragraph can lead to multiple applications, such as empathetic dialogue agents and tools to assess the mental state of a patient.
While sentiment analysis in the form of assigning polarities (positive, negative, and sometimes neutral) to text data is a well-studied task for which adequate results have already been obtained on multiple datasets, identifying finer-grained labels such as specific emotions remains a challenge. In addition to the complexity of the task, in most datasets available for it some emotions are much less represented than others, making the training data imbalanced. To address this issue, the model proposed in this work combines knowledge from less complex tasks and is trained using methods that counteract class imbalance. It is a Transformer-based model with a fusion of Adapter layers that leverages knowledge from the simpler sentiment analysis task.
The results obtained are competitive with state-of-the-art multi-modal models on the CMU-MOSEI dataset (Bagher Zadeh et al., 2018), while only utilizing the textual modality. Our main contribution can be formulated as:
- We designed a method that capitalizes on both pretrained Transformer language models and knowledge from complementary tasks to improve emotion recognition, while using Adapter layers that require fewer trainable parameters than the conventional fine-tuning approach and taking class imbalance into account.
2 Prior Works and Background
Multiple approaches have been used to solve text-based sentiment analysis and emotion detection tasks, namely rule-based and machine learning approaches. Rule-based approaches consist of creating grammatical and logical rules and using lexicons to assign emotions or polarities to words. Previous works using this approach include those of Udochukwu and He (2015), Tan et al. (2015) and Seal et al. (2019). These methods are limited by the size and contents of the lexicon used and by the ambiguity of some keywords.
Most recent methods are based on the machine learning approach, where a network is trained to learn the relationships between words and emotions. Methods such as those proposed by Abdul-Mageed and Ungar (2017), Tang et al. (2015) and Ma et al. (2019) use recurrent neural networks to break down sentences and model the relationship between the succession of words and sentiments or emotions. Since the release of pretrained models, recent works (Park et al., 2019; Acheampong et al., 2021) have focused on fine-tuning Transformer models, which have consistently outperformed previous methods thanks to the multi-head attention applied on words. To improve on previous textual emotion recognition methods, we believe that, in addition to transfer learning, multi-task learning and class imbalance should be considered.
2.1 Transfer Learning
Transfer learning is a method where the weights of a model trained on one task are used as a starting point to train a model for another task. The use of transfer learning with pretrained models has been, for the past few years, the way to obtain state-of-the-art results on multiple natural language understanding (NLU) tasks. Transformer-based pretrained models such as BERT (Devlin et al., 2018), RoBERTa (Liu et al., 2019c) and XLNet (Yang et al., 2019) have been dominating the field over previously used methods.
2.2 Multi-Task Learning
Multi-task learning is used to train one model to solve multiple tasks instead of fine-tuning separate models. Multiple approaches have been used to solve multi-task learning problems. Liu et al. (2019b) proposed a Multi-Task Deep Neural Network (MT-DNN) with a shared transformer encoder and task-specific heads. Clark et al. (2019) and Liu et al. (2019a) presented a new training procedure based on knowledge distillation to improve the performance of the MT-DNN. These approaches allow the model to learn a shared representation between all tasks. Houlsby et al. (2019) introduced a new model architecture using task-specific adapter layers while keeping the weights of the pretrained encoder frozen. This method, while preventing task interference and catastrophic forgetting, does not allow knowledge to be transferred between tasks. To counter this weakness, Pfeiffer et al. (2020a) proposed AdapterFusion, a way to combine knowledge from multiple adapters.
2.3 Class Imbalance
Class imbalance is a challenge in many artificial intelligence tasks. It occurs when one or more classes make up significantly fewer samples of the data than the majority class or classes, often leading to poor predictive performance on those minority classes. Classic approaches to this problem include re-sampling minority class samples or weighting the loss function according to class frequency. In the field of computer vision, Lin et al. (2018) proposed a modified version of the cross-entropy loss, called the focal loss, to handle imbalance.
3 Proposed Approach
To improve over previous methods, we based our method on transfer learning and multi-task learning, and we specifically considered class imbalance. To capitalize on transfer learning, our method builds on a strong language model, BERT (Devlin et al., 2018). We motivate this choice by the fact that identifying emotion requires a good overall understanding of a language, as captured by BERT. Since sentiment analysis and emotion detection are closely related, we propose a model that learns to combine knowledge from multiple tasks of that nature. This allows leveraging datasets that are annotated only with sentiment for the emotion detection task. Finally, our model is designed to take class imbalance into account.
Our method is described in detail in the following.
3.1 Model
The proposed model is based on the pretrained Transformer encoder BERT (Devlin et al., 2018) and on a fusion of separately trained Adapter layers. The overall architecture of the model can be seen in Figure 1. We chose the BERT encoder (base size), which comprises a stack of twelve encoder layers, preceded by token, sentence and position embeddings. Following the encoder, the last hidden state corresponding to the special classification token ([CLS]) is fed to a classification head formed by two feed-forward layers.
Adapters are layers inserted in each of the encoder layers and are trained to adapt the encoder pretrained knowledge to a specific task, while the weights of the encoder are kept frozen (see Figure 2). In this work, each adapter layer trained for a specific task has the same structure, which is the one Pfeiffer et al. (2020a) found to be the best across multiple diverse tasks. They are composed of a feed forward layer that projects the encoder hidden state to a lower dimension, a non-linear function and a feed forward layer that projects it back up to the original hidden size. Pfeiffer et al. (2020a) also found that a reduction factor of 16 for the projection down layer adds a reasonable number of parameters per task whilst still achieving good results. All adapters were therefore trained using this reduction factor.
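The adapter structure described above can be sketched in a few lines of plain Python. This is only an illustration, not the implementation used in this work: the ReLU non-linearity, the residual connection around the adapter and the zero initialisation of the up-projection are assumptions of the sketch, and the class name `BottleneckAdapter` is hypothetical.

```python
import random


def linear(weights, bias, x):
    """Apply y = W x + b for a weight matrix stored as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


class BottleneckAdapter:
    """Down-projection, non-linearity, up-projection, plus an assumed residual."""

    def __init__(self, hidden_size, reduction_factor, seed=0):
        rng = random.Random(seed)
        self.bottleneck = hidden_size // reduction_factor
        # Down-projection: hidden_size -> bottleneck (small random init).
        self.w_down = [[rng.gauss(0.0, 0.02) for _ in range(hidden_size)]
                       for _ in range(self.bottleneck)]
        self.b_down = [0.0] * self.bottleneck
        # Up-projection: bottleneck -> hidden_size. Zero-initialised here
        # (an assumption) so the untrained adapter leaves the input unchanged.
        self.w_up = [[0.0] * self.bottleneck for _ in range(hidden_size)]
        self.b_up = [0.0] * hidden_size

    def __call__(self, hidden_state):
        z = relu(linear(self.w_down, self.b_down, hidden_state))
        delta = linear(self.w_up, self.b_up, z)
        # Residual connection: the adapter only learns a correction.
        return [h + d for h, d in zip(hidden_state, delta)]


adapter = BottleneckAdapter(hidden_size=768, reduction_factor=16)
h = [0.1] * 768
out = adapter(h)
print(adapter.bottleneck, len(out))
```

With a hidden size of 768 and a reduction factor of 16, the bottleneck has 48 units, which is why each adapter adds only a small number of parameters per encoder layer.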

There are as many adapter layers as there are tasks. Figure 2 illustrates that there are several adapter layers that are used in parallel in our model. To combine the knowledge of each adapter, the AdapterFusion method is used (Pfeiffer et al., 2020a). This method consists of learning a composition of the knowledge of different trained adapters. In this stage of the learning, the weights of the pretrained encoder and of all single adapters are frozen, while the classification and fusion layers are trained. The architecture of the fusion layers is also presented in Figure 2.
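As a rough illustration of the composition step, the sketch below follows the attention-style formulation of AdapterFusion (Pfeiffer et al., 2020a): the encoder hidden state acts as a query, and each adapter output provides a key and a value. The function names, the toy dimensions and the identity matrices are hypothetical; in practice the three matrices are the trainable fusion parameters of each encoder layer.

```python
import math


def matvec(matrix, x):
    return [sum(a * b for a, b in zip(row, x)) for row in matrix]


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def fuse(hidden_state, adapter_outputs, w_query, w_key, w_value):
    """Combine the outputs of several frozen adapters with learned attention."""
    query = matvec(w_query, hidden_state)
    keys = [matvec(w_key, z) for z in adapter_outputs]
    values = [matvec(w_value, z) for z in adapter_outputs]
    # One score per adapter: dot product between the query and that adapter's key.
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    # Fused output: attention-weighted sum of the adapter values.
    fused = [sum(w * v[d] for w, v in zip(weights, values))
             for d in range(len(hidden_state))]
    return fused, weights


# Toy example with three adapters and a hidden size of 4.
hidden = [1.0, 0.0, 0.0, 0.0]
outputs = [[2.0, 0.0, 0.0, 0.0], [0.0, 2.0, 0.0, 0.0], [0.0, 0.0, 2.0, 0.0]]
eye = identity(4)
fused, weights = fuse(hidden, outputs, eye, eye, eye)
print(weights)
```

In this toy setting the first adapter, whose output is most aligned with the hidden state, receives the largest weight; the other two receive equal, smaller weights.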
3.2 Loss Function
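The loss function used to counter the imbalance present in emotion detection datasets is a modified version of the classic Binary Cross-Entropy (BCE) loss used for multi-label classification and can be defined as follows:
$$\mathcal{L} = -\frac{1}{N C} \sum_{n=1}^{N} \sum_{c=1}^{C} \left[ p_c \, y_{n,c} \log \sigma(x_{n,c}) + (1 - y_{n,c}) \log\left(1 - \sigma(x_{n,c})\right) \right] \qquad (1)$$
where $N$ is the number of samples in the batch, $C$ is the number of classes, $x_{n,c}$ is the output of the classification layer of the model for class $c$ of sample $n$, $y_{n,c} \in \{0, 1\}$ is the corresponding target label, $\sigma$ is the sigmoid function, and $p_c$ is the positive answer weighting factor for class $c$ defined as:
$$p_c = \frac{\text{number of negative samples of class } c}{\text{number of positive samples of class } c}$$
This weighting factor is computed on the statistics of the training set data. It weights the loss function to increase recall when the data contains more negative samples of class $c$ than positive samples ($p_c > 1$), and to increase precision in the opposite situation ($p_c < 1$).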
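To make the effect of the weighting factor concrete, here is a minimal plain-Python sketch of a positively weighted BCE loss. It is written under two assumptions that should be checked against the actual implementation: the positive weight of a class is taken as the ratio of negative to positive samples, and the rounded proportions of Table 2 are used as a stand-in for the training-set statistics. The function names are hypothetical.

```python
import math


def positive_weights(positive_proportions):
    """Assumed weighting: (# negative samples) / (# positive samples) per class."""
    return [(1.0 - q) / q for q in positive_proportions]


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))


def weighted_bce(logits, targets, pos_weights):
    """Mean over samples and classes of the positively weighted BCE loss."""
    total, count = 0.0, 0
    for sample_logits, sample_targets in zip(logits, targets):
        for x, y, p in zip(sample_logits, sample_targets, pos_weights):
            # log(1 - sigmoid(x)) is computed as log(sigmoid(-x)).
            total -= p * y * log_sigmoid(x) + (1 - y) * log_sigmoid(-x)
            count += 1
    return total / count


# Joy, sadness, anger, surprise, disgust, fear (rounded shares of positives).
proportions = [0.52, 0.25, 0.21, 0.10, 0.17, 0.08]
weights = positive_weights(proportions)

# One sample labelled only with fear, scored by an undecided model (all logits 0).
logits = [[0.0] * 6]
targets = [[0, 0, 0, 0, 0, 1]]
plain = weighted_bce(logits, targets, [1.0] * 6)
weighted = weighted_bce(logits, targets, weights)
print(weights, plain, weighted)
```

Under these assumptions the weight for fear is about 11.5 while the weight for joy is slightly below 1, so missing a rare emotion is penalised far more than missing a frequent one. The same behaviour is available in PyTorch through the `pos_weight` argument of `torch.nn.BCEWithLogitsLoss`.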
Adapting the focal loss to multi-label classification was also tested but did not significantly improve the performances of the model in comparison to using the classic BCE loss.
4 Experiments
Our proposed method was tested using three datasets. We also performed several ablation studies to assess the contribution of each component.
4.1 Datasets
CMU-MOSEI (Bagher Zadeh et al., 2018): This dataset comprises visual, acoustic and textual features for around 23,500 sentences extracted from videos. It is meant to be used to train multi-modal models, but in this work only the textual inputs were used. The dataset is labelled for sentiment on a scale of [-3, 3] and for the Ekman emotions (Ekman, 1992) of joy, sadness, anger, surprise, disgust and fear on a scale of [0, 3]. For binary sentiment classification, the labels are binarized to negative (labels less than 0) and non-negative (labels greater than or equal to 0). The emotions are discretized to non-present (label equal to 0) or present (label greater than 0). Multiple emotions can be present for the same sample. The performance of models on this dataset is measured with standard binary accuracy (A) and F1 scores (F1) for each emotion, as well as an overall non-weighted mean accuracy score and an overall weighted F1 score.
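The label discretization described above can be sketched as follows. The function names and the dictionary-based input format are hypothetical; only the thresholds come from the text.

```python
EMOTIONS = ("joy", "sadness", "anger", "surprise", "disgust", "fear")


def binarize_sentiment(score):
    """Map a sentiment score in [-3, 3] to 0 (negative) or 1 (non-negative)."""
    return 0 if score < 0 else 1


def binarize_emotions(intensities):
    """Map emotion intensities in [0, 3] to a multi-hot vector (present if > 0)."""
    return [1 if intensities.get(name, 0.0) > 0 else 0 for name in EMOTIONS]


print(binarize_sentiment(-0.3), binarize_sentiment(0.0))
print(binarize_emotions({"joy": 1.3, "surprise": 0.33}))
```

Because several emotions can be present for the same sample, the emotion target is a multi-hot vector, not a single class index.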
SST-2 (Socher et al., 2013) & IMDB (Maas et al., 2011): SST-2 comprises over 60,000 sentences extracted from movie reviews. IMDB contains 50,000 movie reviews. Both are labelled for sentiment analysis in a 2-class split (positive or negative). These datasets were obtained using the HuggingFace Datasets library (https://huggingface.co/datasets). The performance of models on these datasets is measured with the same binary accuracy scores (A) as CMU-MOSEI.
4.2 Experimental Setup
All experiments use BERT-base (cased) (Devlin et al., 2018) as the pretrained model, which has 12 encoder layers and a hidden size of 768. Adapter and AdapterFusion layers are added to each of those encoder layers. The classification heads are composed of two fully connected linear layers with sizes equal to the hidden size of the transformer layer (768) and the number of labels (6) respectively, and with activation functions. The input of the first linear layer is the last hidden state of the BERT model corresponding to the classification token ([CLS]) at the beginning of the input sequence. All models were trained using AdamW (Loshchilov and Hutter, 2017) with a linear learning rate scheduler, a learning rate of 1e-5, and a weight decay of 1e-2. All models were trained for up to 10 epochs, with early stopping after 3 epochs if the validation metric did not improve. The Adapter-Transformers library (Pfeiffer et al., 2020b) was used to incorporate the Adapter and AdapterFusion layers into the model. The results presented in the following section are averaged over 3 runs.
Two types of fusion models were trained: one using a fusion of only CMU-MOSEI tasks (Fusion3: binary sentiment, 7-class sentiment and emotion classification) and one using additional knowledge from the SST-2 and IMDB sentiment analysis tasks (Fusion5).
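The training schedule can be illustrated with a small framework-free sketch. It assumes that the linear scheduler decays the learning rate from 1e-5 to 0 without warm-up and that early stopping uses a patience of 3 epochs on a validation metric where higher is better; both are interpretations of the setup, and the function names and validation scores are hypothetical.

```python
def linear_lr(step, total_steps, base_lr=1e-5):
    """Learning rate decayed linearly from base_lr to 0 (no warm-up assumed)."""
    return base_lr * max(0.0, 1.0 - step / total_steps)


def train_with_early_stopping(validation_scores, max_epochs=10, patience=3):
    """Return (best_epoch, stopped_epoch) given one validation score per epoch."""
    best_score, best_epoch, stale = float("-inf"), 0, 0
    for epoch, score in enumerate(validation_scores[:max_epochs], start=1):
        if score > best_score:
            best_score, best_epoch, stale = score, epoch, 0
        else:
            stale += 1
            if stale >= patience:
                # No improvement for `patience` consecutive epochs: stop.
                return best_epoch, epoch
    return best_epoch, min(len(validation_scores), max_epochs)


# Hypothetical validation F1-scores, one per epoch.
scores = [0.50, 0.52, 0.53, 0.53, 0.52, 0.51, 0.55]
best_epoch, stopped_epoch = train_with_early_stopping(scores)
print(best_epoch, stopped_epoch, linear_lr(50, 100))
```

In this example training stops after the sixth epoch and the checkpoint from the third epoch would be kept, so the later improvement at the seventh epoch is never reached; this is the usual trade-off of a short patience.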
4.3 Results
The results for the emotion detection task of CMU-MOSEI are presented in Table 1. The performance of the proposed model is compared to that of a fine-tuned BERT model and of a model using a single task-specific adapter, both using the same classification head as our proposed model. The results of the current state-of-the-art model for this dataset (Delbrouck et al., 2020) are also presented. Note that this state-of-the-art model is a Transformer-based model that utilizes both textual and audio modalities.
Table 1: Accuracy (A) and F1-score (F1) for the emotion detection task of CMU-MOSEI.
Model | Joy (A/F1) | Sadness (A/F1) | Anger (A/F1) | Surprise (A/F1) | Disgust (A/F1) | Fear (A/F1) | Overall (A/F1) |
---|---|---|---|---|---|---|---|
TBJE¹ | 66.0/71.7 | 73.9/17.8 | 81.9/17.3 | 89.2/3.5 | 86.5/45.3 | 90.6/0.0 | 81.5/40.5 |
BERT | 66.3/69.0 | 69.4/42.8 | 74.2/44.3 | 85.8/21.9 | 83.1/53.1 | 83.8/18.7 | 77.1/51.8 |
Adapter | 67.3/69.4 | 66.3/46.1 | 70.4/48.5 | 73.4/26.5 | 77.3/52.3 | 70.9/22.7 | 70.9/53.7 |
Fusion3 | 67.5/70.5 | 66.5/44.4 | 72.5/47.3 | 81.4/25.9 | 79.0/52.9 | 81.1/21.1 | 74.7/53.6 |
Fusion5 | 67.5/70.7 | 69.1/44.6 | 73.1/47.5 | 81.3/26.6 | 79.9/53.0 | 82.2/20.3 | 75.5/53.7 |
¹ Accuracy scores obtained from Delbrouck et al. (2020). F1 scores were computed using the two provided model checkpoints, as the ones presented in their paper were weighted F1 scores.
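As a worked example of the overall scores, the sketch below recomputes them for the Fusion5 row. The non-weighted mean accuracy follows directly from the per-emotion accuracies. For the weighted F1-score, the sketch assumes that each emotion is weighted by its number of positive samples and uses the rounded proportions of Table 2 as approximate weights, so only an approximate match with the reported value should be expected. The function names are hypothetical.

```python
def mean_accuracy(accuracies):
    """Non-weighted mean of the per-emotion accuracies."""
    return sum(accuracies) / len(accuracies)


def weighted_f1(f1_scores, supports):
    """F1-scores averaged with weights proportional to the class support."""
    return sum(f * s for f, s in zip(f1_scores, supports)) / sum(supports)


# Fusion5 row: joy, sadness, anger, surprise, disgust, fear.
fusion5_accuracy = [67.5, 69.1, 73.1, 81.3, 79.9, 82.2]
fusion5_f1 = [70.7, 44.6, 47.5, 26.6, 53.0, 20.3]
# Rounded share of positive samples per emotion (in percent), used as weights.
positive_share = [52, 25, 21, 10, 17, 8]

overall_accuracy = mean_accuracy(fusion5_accuracy)
overall_f1 = weighted_f1(fusion5_f1, positive_share)
print(round(overall_accuracy, 1), round(overall_f1, 1))
```

The mean accuracy rounds to the reported 75.5. The approximate weighted F1-score lands within a few tenths of the reported 53.7, a gap that is plausible given the rounded weights and the averaging over runs.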
All models trained with our proposed loss function achieve better overall F1-scores than the current state-of-the-art. While a fully fine-tuned BERT model achieves better overall accuracy, the proposed Fusion model offers the best accuracy/F1-score trade-off across emotions. As observed in Table 2, the distributions of all emotions except joy are heavily imbalanced, so accuracy is not an appropriate metric for this dataset: it does not fully represent the model's ability to identify each emotion. It is therefore better to use the F1-score as the basis for comparison. Single Adapter models achieve good F1-scores but do not reach accuracy scores comparable to those of the Fusion models, which further indicates that combining knowledge from multiple tasks improves the performance of the model. Capitalizing on knowledge from additional sentiment analysis tasks outside of the CMU-MOSEI dataset also allows the Fusion5 model to perform slightly better than the Fusion3 model, which only includes knowledge from the CMU-MOSEI tasks. The proposed model also requires far fewer trainable parameters than full fine-tuning, as can be seen in Table 3.
Table 2: Proportion of positive samples for each emotion.
Emotion | Joy | Sadness | Anger | Surprise | Disgust | Fear |
---|---|---|---|---|---|---|
Proportion of positive samples | 52% | 25% | 21% | 10% | 17% | 8% |
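To illustrate why accuracy is misleading under such imbalance, the following sketch scores a trivial classifier that never predicts an emotion, using the proportions above on a hypothetical set of 1,000 samples. The function names are illustrative only.

```python
def accuracy_and_f1(tp, fp, fn, tn):
    """Binary accuracy and F1-score from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return accuracy, f1


def always_negative(positive_share, n=1000):
    """Scores of a classifier that predicts 'not present' for every sample."""
    positives = round(positive_share * n)
    return accuracy_and_f1(tp=0, fp=0, fn=positives, tn=n - positives)


fear_accuracy, fear_f1 = always_negative(0.08)
joy_accuracy, joy_f1 = always_negative(0.52)
print(fear_accuracy, fear_f1, joy_accuracy, joy_f1)
```

For fear, this trivial classifier reaches 92% accuracy with an F1-score of 0, whereas for joy the same strategy only reaches 48% accuracy. A high accuracy paired with a near-zero F1-score on a rare emotion is therefore a sign that the emotion is almost never predicted.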
Table 3: Number of parameters of each model.
Model | All parameters | Trainable parameters |
---|---|---|
BERT (fine-tuned) | 108.3 M | 108.3 M |
Adapter | 109.8 M | 1.5 M |
Fusion3 | 132.8 M | 21.8 M |
Fusion5 | 134.6 M | 21.8 M |
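The parameter counts above can be approximately reproduced with a back-of-the-envelope calculation. The sketch assumes one adapter per encoder layer with bias terms, a classification head of two linear layers (768 to 768, then 768 to 6), and three bias-free 768 x 768 fusion matrices per encoder layer; these are assumptions about the implementation, and layer normalisation parameters are ignored. The size of BERT itself is taken from the table.

```python
HIDDEN, LAYERS, REDUCTION, LABELS = 768, 12, 16, 6
BERT_PARAMS = 108.3e6  # taken from the table above


def linear_params(n_in, n_out, bias=True):
    """Number of parameters of a fully connected layer."""
    return n_in * n_out + (n_out if bias else 0)


def millions(n):
    return round(n / 1e6, 1)


bottleneck = HIDDEN // REDUCTION
# One adapter = down-projection + up-projection in each encoder layer.
adapter = LAYERS * (linear_params(HIDDEN, bottleneck) + linear_params(bottleneck, HIDDEN))
# Classification head: 768 -> 768 -> 6.
head = linear_params(HIDDEN, HIDDEN) + linear_params(HIDDEN, LABELS)
# Fusion: query, key and value matrices in each encoder layer (assumed bias-free).
fusion = LAYERS * 3 * linear_params(HIDDEN, HIDDEN, bias=False)

adapter_trainable = adapter + head
fusion_trainable = fusion + head
adapter_total = BERT_PARAMS + adapter + head
fusion3_total = BERT_PARAMS + 3 * adapter + fusion + head
fusion5_total = BERT_PARAMS + 5 * adapter + fusion + head

print(millions(adapter_trainable), millions(fusion_trainable))
print(millions(adapter_total), millions(fusion3_total), millions(fusion5_total))
```

Under these assumptions each adapter adds roughly 0.9 M parameters, which explains why going from three to five fused adapters increases the total size by under 2 M while the number of trainable parameters in the fusion stage stays the same.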
4.4 Comparison of Loss Functions
The choice of loss function greatly impacts the performance of the model, especially on emotions that are less present in the dataset. The performance obtained with the different loss functions tested is presented in Table 4.
Table 4: Accuracy (A) and F1-score (F1) obtained with the different loss functions: binary cross-entropy (BCE), focal loss (FL) and the proposed loss (PL).
Loss | Joy (A/F1) | Sadness (A/F1) | Anger (A/F1) | Surprise (A/F1) | Disgust (A/F1) | Fear (A/F1) | Overall (A/F1) |
---|---|---|---|---|---|---|---|
BCE | 67.9/71.5 | 75.8/22.2 | 78.5/25.4 | 90.5/1.3 | 85.6/48.5 | 91.7/0.5 | 81.7/42.8 |
FL | 67.7/70.9 | 75.8/24.9 | 78.4/23.1 | 90.5/0.6 | 85.6/46.1 | 91.7/0.0 | 81.6/42.2 |
PL | 67.5/70.7 | 69.1/44.6 | 73.1/47.5 | 81.3/26.6 | 79.9/53.0 | 82.2/20.3 | 75.5/53.7 |
The difference in performance between the classic Binary Cross-Entropy loss and the focal loss is not significant. While the loss function proposed in this paper somewhat decreases the accuracy for most emotions, it greatly improves the F1-score for all emotions except joy.
4.5 Performance of Single Adapters
This section presents the performance of the single Adapters on the other tasks used for knowledge composition in the Fusion models. Tables 5 and 6 compare the results obtained by BERT trained with task-specific adapters (Adapter) to fully fine-tuned models and state-of-the-art models. Unless stated otherwise, all accuracy values were obtained by averaging the results over 3 runs using the experimental setup described in section 4.2. Base size versions of BERT, RoBERTa and XLNet models were used for a fair comparison.
Table 5: Accuracy (A) on the CMU-MOSEI sentiment analysis tasks.
Model | 2-class sentiment (A) | 7-class sentiment (A) |
---|---|---|
TBJE¹ | 84.2 | 45.5 |
BERT | 84.3 | 46.8 |
Adapter | 83.9 | 46.5 |
¹ Accuracy scores obtained from Delbrouck et al. (2020).
The performance of the Adapter models is comparable to that of a fully fine-tuned BERT model. On the CMU-MOSEI tasks, they performed on par with or better than state-of-the-art results. On the SST-2 and IMDB tasks, they slightly underperformed compared to state-of-the-art fine-tuned language models. However, regardless of the dataset, this experiment shows that adapters can capture useful task-specific information at a lower training cost. Furthermore, adapter fusion makes it possible to combine the knowledge from these well-performing task-specific adapters. This explains why our proposed adapter fusion model benefits from the related tasks of sentiment analysis to improve emotion recognition.
Table 6: Accuracy (A) on the SST-2 and IMDB sentiment analysis tasks.
Model | SST-2 (A) | IMDB (A) |
---|---|---|
RoBERTa | 94.8 | 94.5 |
XLNet | 93.4 | 95.1 |
BERT | 93.5 | 94.0 |
Adapter | 92.6 | 93.7 |
5 Conclusion
The model presented in this work surpasses state-of-the-art results in terms of F1-score for emotion recognition on CMU-MOSEI, even while using only the textual modality. Improvement is still needed for the rarer emotions in the dataset, but at the time of producing this article, the results presented are substantially stronger than other contributions in terms of F1-scores. Due to the lack of large-scale datasets for emotion detection in text, testing the model on purely textual data will have to be done in further studies as such data become available.
Acknowledgments
This work was supported by Mitacs through the Mitacs Accelerate program and by Airudi.
References
- Abdul-Mageed and Ungar (2017). EmoNet: fine-grained emotion detection with gated recurrent neural networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vancouver, Canada, pp. 718–728.
- Acheampong et al. (2021). Transformer models for text-based emotion detection: a review of BERT-based approaches. Artificial Intelligence Review. ISSN 1573-7462.
- Bagher Zadeh et al. (2018). Multimodal language analysis in the wild: CMU-MOSEI dataset and interpretable dynamic fusion graph. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, pp. 2236–2246.
- Clark et al. (2019). BAM! born-again multi-task networks for natural language understanding. CoRR abs/1907.04829.
- Delbrouck et al. (2020). A transformer-based joint-encoding for emotion recognition and sentiment analysis. CoRR abs/2006.15955.
- Devlin et al. (2018). BERT: pre-training of deep bidirectional transformers for language understanding. CoRR abs/1810.04805.
- Ekman (1992). An argument for basic emotions. Cognition and Emotion 6 (3-4), pp. 169–200. https://doi.org/10.1080/02699939208411068
- Houlsby et al. (2019). Parameter-efficient transfer learning for NLP. CoRR abs/1902.00751.
- Lin et al. (2018). Focal loss for dense object detection. arXiv 1708.02002.
- Liu et al. (2019a). Improving multi-task deep neural networks via knowledge distillation for natural language understanding. CoRR abs/1904.09482.
- Liu et al. (2019b). Multi-task deep neural networks for natural language understanding. CoRR abs/1901.11504.
- Liu et al. (2019c). RoBERTa: A robustly optimized BERT pretraining approach. CoRR abs/1907.11692.
- Loshchilov and Hutter (2017). Fixing weight decay regularization in Adam. CoRR abs/1711.05101.
- Ma et al. (2019). PKUSE at SemEval-2019 task 3: emotion detection with emotion-oriented neural attention network. In Proceedings of the 13th International Workshop on Semantic Evaluation, Minneapolis, Minnesota, USA, pp. 287–291.
- Maas et al. (2011). Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Portland, Oregon, USA, pp. 142–150.
- Park et al. (2019). Toward dimensional emotion detection from categorical emotion annotations. CoRR abs/1911.02499.
- Pfeiffer et al. (2020a). AdapterFusion: non-destructive task composition for transfer learning. CoRR abs/2005.00247.
- Pfeiffer et al. (2020b). AdapterHub: a framework for adapting transformers. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp. 46–54.
- Seal et al. (2019). Sentence-level emotion detection from text based on semantic rules. pp. 423–430. ISBN 978-981-13-7165-3.
- Socher et al. (2013). Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, Seattle, Washington, USA, pp. 1631–1642.
- Tan et al. (2015). Rule-based sentiment analysis for financial news. In 2015 IEEE International Conference on Systems, Man, and Cybernetics, pp. 1601–1606.
- Tang et al. (2015). Target-dependent sentiment classification with long short term memory. CoRR abs/1512.01100.
- Udochukwu and He (2015). A rule-based approach to implicit emotion detection in text. In NLDB.
- Yang et al. (2019). XLNet: generalized autoregressive pretraining for language understanding. CoRR abs/1906.08237.