Leveraging Sentiment Analysis Knowledge to Solve Emotion Detection Tasks

Identifying and understanding underlying sentiment or emotions in text is a key component of multiple natural language processing applications. While simple polarity sentiment analysis is a well-studied subject, fewer advances have been made in identifying more complex, finer-grained emotions using only textual data. In this paper, we present a Transformer-based model with a Fusion of Adapter layers which leverages knowledge from simpler sentiment analysis tasks to improve emotion detection on large-scale datasets, such as CMU-MOSEI, using the textual modality only. Results show that our proposed method is competitive with other approaches. We obtained state-of-the-art results for emotion recognition on CMU-MOSEI even while using only the textual modality.



1 Introduction

Sentiment analysis is a subject that has long interested multiple researchers in the domain of natural language understanding. It is a task which aims to identify sentiment polarity for a given signal, which can be of the audio, visual or textual modality. Emotion recognition is a related task which consists of assigning more fine-grained labels, such as anger, joy, sadness, etc.

This work focuses on analyzing textual inputs. The ability to recognize the sentiment or emotion behind a given sentence or paragraph can lead to multiple applications, such as empathetic dialogue agents and tools to assess the mental state of a patient.

While sentiment analysis in the form of assigning polarities (positive, negative, and sometimes neutral) to text data is a task that is often studied and for which adequate results have already been obtained for multiple datasets, identifying finer-grained labels such as specific emotions is still a challenge. In addition to the task complexity, in most datasets available for this task, some emotions are much less represented than others, making the training data imbalanced. To address this issue, the model proposed in this work combines knowledge from less complex tasks and is trained using methods to counteract class imbalance. It is a Transformer-based model with a Fusion of Adapter layers that leverages knowledge from the simpler sentiment analysis task.

The results obtained are competitive with state-of-the-art multi-modal models on the CMU-MOSEI dataset (Bagher Zadeh et al., 2018), while only utilizing the textual modality. Our main contribution can be formulated as:

  • We designed a method that capitalizes on both pretrained Transformer language models and knowledge from complementary tasks to improve on the emotion recognition task, whilst using Adapter layers that require fewer trainable parameters than the conventional fine-tuning approach and taking into account class imbalance.

2 Prior Works and Background

There are multiple approaches that have been used to solve text-based sentiment analysis and emotion detection tasks, namely rule-based and machine learning approaches. Rule-based approaches consist of creating grammatical and logical rules to assign emotions and use lexicons to assign emotions or polarities to words. Previous works using this approach include those of Udochukwu and He (2015), Tan et al. (2015) and Seal et al. (2019). These methods are limited by the size and contents of the lexicon used and by the ambiguity of some keywords.

Most recent methods are based on the machine learning approach, where a network is trained to learn the relationships between words and emotions. Methods such as those proposed by Abdul-Mageed and Ungar (2017), Tang et al. (2015) and Ma et al. (2019) use recurrent neural networks to break down sentences and understand the relationship between the succession of words and sentiments or emotions. Since the release of pretrained models, recent works (Park et al., 2019; Acheampong et al., 2021) have focused on fine-tuning Transformer models, which have consistently outperformed previous methods thanks to the multi-head attention applied to words. To improve on previous textual emotion recognition methods, we believe that in addition to transfer learning, multi-task learning and class imbalance should be considered.

2.1 Transfer Learning

Transfer learning is a method where the weights of a model trained on one task are used as a starting point to train a model for another task. The use of transfer learning with pretrained models has been, for the past few years, the way to obtain state-of-the-art results for multiple natural language understanding (NLU) tasks. Transformer-based pretrained models such as BERT (Devlin et al., 2018), RoBERTa (Liu et al., 2019c) and XLNet (Yang et al., 2019) have come to dominate the field over previously used methods.

2.2 Multi-Task Learning

Multi-task learning is used to train one model to solve multiple tasks instead of fine-tuning separate models. Multiple approaches have been used to solve multi-task learning problems. Liu et al. (2019b) proposed a Multi-Task Deep Neural Network (MT-DNN) with a shared Transformer encoder and task-specific heads. Clark et al. (2019) and Liu et al. (2019a) presented a new training procedure based on knowledge distillation to improve the performance of the MT-DNN. These approaches allow the model to learn a shared representation between all tasks. Houlsby et al. (2019) introduced a new model architecture using task-specific adapter layers and keeping the weights of the pretrained encoder frozen. This method, while preventing task interference and catastrophic forgetting, does not allow knowledge to be transferred between tasks. To counter this weakness, Pfeiffer et al. (2020a) proposed AdapterFusion, a way to combine knowledge from multiple adapters.

2.3 Class Imbalance

Class imbalance is a challenge in solving many artificial intelligence tasks. It occurs when one or multiple classes make up significantly fewer samples of the data than the majority class or classes, often leading to poor predictive performance for those minority classes. Classic approaches to this problem include re-sampling minority class samples or weighting the loss function according to class frequency. In the field of computer vision, Lin et al. (2018) proposed a modified version of the cross-entropy loss called the focal loss to handle imbalance.

3 Proposed Approach

To improve over previous methods, we have based our method on transfer learning and multi-task learning, and we specifically considered class imbalance. To capitalize on transfer learning, our method is based on a strong language model, BERT (Devlin et al., 2018). We motivate this choice by the fact that identifying emotion requires a good overall understanding of a language, as captured by BERT. Since sentiment analysis and emotion detection are closely related, we propose a model that learns to combine knowledge from multiple tasks of that nature. This allows leveraging datasets that are annotated only with sentiment for the emotion detection task. Finally, our model is designed to consider class imbalance.

Our method is described in detail in the following.

3.1 Model

The proposed model consists of the pretrained Transformer encoder BERT (Devlin et al., 2018) and a fusion of separately trained Adapter layers. The overall architecture of the model can be seen in Figure 1. We chose the BERT encoder (base size), which is composed of a stack of twelve encoder layers, preceded by token, sentence and position embeddings. Following the encoder, the last hidden state corresponding to the special classification token ([CLS]) is fed to a classification head formed by two feed forward layers.

Figure 1: Architecture of the proposed model.

Adapters are layers inserted in each of the encoder layers and are trained to adapt the encoder pretrained knowledge to a specific task, while the weights of the encoder are kept frozen (see Figure 2). In this work, each adapter layer trained for a specific task has the same structure, which is the one Pfeiffer et al. (2020a) found to be the best across multiple diverse tasks. They are composed of a feed forward layer that projects the encoder hidden state to a lower dimension, a non-linear function and a feed forward layer that projects it back up to the original hidden size. Pfeiffer et al. (2020a) also found that a reduction factor of 16 for the projection down layer adds a reasonable number of parameters per task whilst still achieving good results. All adapters were therefore trained using this reduction factor.
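As an illustration of the adapter structure described above, the following is a minimal sketch in plain Python rather than the implementation used in this work. The ReLU non-linearity, the residual connection around the adapter, the weight initialisation, and the names `Adapter`, `make_linear` and `linear` are assumptions made for the example; the text only specifies a down-projection, a non-linear function, an up-projection, and a reduction factor of 16.

```python
import math
import random


def make_linear(in_dim, out_dim, rng):
    """Return (weights, bias) for a dense layer with small random weights and zero bias."""
    scale = 1.0 / math.sqrt(in_dim)
    weights = [[rng.uniform(-scale, scale) for _ in range(in_dim)] for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weights, bias


def linear(x, layer):
    """Apply a dense layer to a single vector."""
    weights, bias = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


class Adapter:
    """Bottleneck adapter: project down, apply a non-linearity, project back up."""

    def __init__(self, hidden_size, reduction_factor=16, seed=0):
        rng = random.Random(seed)
        self.bottleneck = hidden_size // reduction_factor
        self.down = make_linear(hidden_size, self.bottleneck, rng)
        self.up = make_linear(self.bottleneck, hidden_size, rng)

    def forward(self, hidden):
        projected = linear(relu(linear(hidden, self.down)), self.up)
        # Residual connection (an assumption of this sketch).
        return [h + p for h, p in zip(hidden, projected)]


adapter = Adapter(hidden_size=768)
output = adapter.forward([0.1] * 768)
print(adapter.bottleneck, len(output))
```

With a hidden size of 768 and a reduction factor of 16, the bottleneck has 48 units, and the output keeps the original hidden size so that the adapter can be inserted into an encoder layer without changing the surrounding dimensions.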

Figure 2: Modified encoder layer proposed in our method (middle), as defined in Pfeiffer et al. (2020a). On the left, architecture of an adapter layer. On the right, architecture of a fusion layer.

There are as many adapter layers as there are tasks. Figure 2 illustrates that there are several adapter layers that are used in parallel in our model. To combine the knowledge of each adapter, the AdapterFusion method is used (Pfeiffer et al., 2020a). This method consists of learning a composition of the knowledge of different trained adapters. In this stage of the learning, the weights of the pretrained encoder and of all single adapters are frozen, while the classification and fusion layers are trained. The architecture of the fusion layers is also presented in Figure 2.
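The composition step can be pictured as an attention over the outputs of the individual adapters. The sketch below is a simplified, single-token illustration under stated assumptions: the query is computed from the encoder layer's hidden state and the keys and values from each adapter's output, following the general description of AdapterFusion in Pfeiffer et al. (2020a). The function names, the toy two-dimensional vectors and the identity projection matrices are hypothetical and chosen only to keep the example readable.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(hidden, adapter_outputs, query_w, key_w, value_w):
    """Combine adapter outputs with attention weights computed from the hidden state."""
    query = matvec(query_w, hidden)
    keys = [matvec(key_w, out) for out in adapter_outputs]
    values = [matvec(value_w, out) for out in adapter_outputs]
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    fused = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return fused, weights


# Toy example: three adapters, hidden size 2, identity projections.
identity = [[1.0, 0.0], [0.0, 1.0]]
hidden = [1.0, 0.0]
adapter_outputs = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
fused, weights = fuse(hidden, adapter_outputs, identity, identity, identity)
print(weights)
```

In this toy case the first adapter output is the most aligned with the hidden state and therefore receives the largest weight. In the fusion stage only the projection matrices (and the classification head) would be updated, while the adapters that produce `adapter_outputs` stay frozen.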

3.2 Loss Function

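The loss function used to counter the imbalance present in emotion detection datasets is a modified version of the classic Binary Cross-Entropy (BCE) loss used for multi-label classification and can be defined as follows:

$$\mathcal{L} = -\frac{1}{NC} \sum_{n=1}^{N} \sum_{c=1}^{C} \left[ p_c \, y_{n,c} \log \sigma(x_{n,c}) + (1 - y_{n,c}) \log\left(1 - \sigma(x_{n,c})\right) \right]$$

where $N$ is the number of samples in the batch, $C$ is the number of classes, $x_{n,c}$ is the output of the classification layer of the model for class $c$ of sample $n$, $y_{n,c} \in \{0, 1\}$ is the corresponding target label, $\sigma$ is the sigmoid function, and $p_c$ is the positive answer weighting factor for class $c$ defined as:

$$p_c = \frac{\text{number of negative samples of class } c}{\text{number of positive samples of class } c}$$

This weighting factor is computed on the statistics of the training set data. It weights the loss function to increase recall when the data contains more negative samples of class $c$ than positive samples, and to increase precision in the opposite situation.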
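A small numerical sketch may help make the weighting concrete. It takes the weighting factor of each class to be the ratio of negative to positive training samples, which is the convention of the `pos_weight` argument of PyTorch's `BCEWithLogitsLoss` and is treated here as an assumption, and it averages the loss over all samples and classes. The function names and the toy labels are hypothetical, and the code favours readability over numerical stability.

```python
import math


def positive_weights(labels):
    """Per-class ratio of negative to positive samples (assumes every class has a positive)."""
    num_classes = len(labels[0])
    weights = []
    for c in range(num_classes):
        positives = sum(sample[c] for sample in labels)
        negatives = len(labels) - positives
        weights.append(negatives / positives)
    return weights


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def weighted_bce(logits, labels, pos_weight):
    """Binary cross-entropy with the positive term of each class scaled by pos_weight."""
    total = 0.0
    for sample_logits, sample_labels in zip(logits, labels):
        for x, y, p in zip(sample_logits, sample_labels, pos_weight):
            prob = sigmoid(x)
            total -= p * y * math.log(prob) + (1 - y) * math.log(1.0 - prob)
    return total / (len(logits) * len(pos_weight))


# Toy training labels: class 0 is balanced, class 1 has one positive out of four.
train_labels = [[1, 0], [1, 0], [0, 1], [0, 0]]
pos_weight = positive_weights(train_labels)
logits = [[0.0, 0.0] for _ in train_labels]
loss = weighted_bce(logits, train_labels, pos_weight)
print(pos_weight, loss)
```

Here the rare class receives a weight of 3, so a missed positive of that class costs three times as much as a false alarm, which is what pushes the model towards higher recall on under-represented emotions.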

Adapting the focal loss to multi-label classification was also tested but did not significantly improve the performances of the model in comparison to using the classic BCE loss.
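The exact multi-label adaptation of the focal loss that was tested is not detailed here. One natural variant, assumed in the sketch below, applies the focal modulation of Lin et al. (2018) independently to the binary term of each class; the function name and the default focusing parameter of 2 are choices made for this illustration only.

```python
import math


def multilabel_focal_loss(logits, labels, gamma=2.0):
    """Focal loss applied independently to each (sample, class) binary decision."""
    total, count = 0.0, 0
    for sample_logits, sample_labels in zip(logits, labels):
        for x, y in zip(sample_logits, sample_labels):
            prob = 1.0 / (1.0 + math.exp(-x))
            # Probability assigned to the true label of this binary decision.
            p_t = prob if y == 1 else 1.0 - prob
            # Well-classified decisions (p_t close to 1) are down-weighted.
            total -= (1.0 - p_t) ** gamma * math.log(p_t)
            count += 1
    return total / count


plain_bce = multilabel_focal_loss([[0.0]], [[1]], gamma=0.0)
focal = multilabel_focal_loss([[0.0]], [[1]], gamma=2.0)
print(plain_bce, focal)
```

Setting `gamma` to zero recovers the regular BCE loss, which makes the relationship between the two losses compared in this work easy to check.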

4 Experiments

Our proposed method was tested using three datasets. We also performed several ablation studies to assess the contribution of each component.

4.1 Datasets

CMU-MOSEI (Bagher Zadeh et al., 2018): This dataset is composed of visual, acoustic and textual features for around 23,500 sentences extracted from videos. This dataset is meant to be used to train multi-modal models, but in this work, only the textual inputs were used. The dataset is labelled for sentiment on a scale of [-3, 3] and for the Ekman emotions (Ekman, 1992) of joy, sadness, anger, surprise, disgust and fear on a scale of [0, 3]. For binary sentiment classification, the labels are binarized to negative (labels less than 0) and non-negative (labels greater than or equal to 0). The emotions are discretized to non-present (label equal to 0) or present (label greater than 0). Multiple emotions can be present for the same sample. The performance of models on this dataset is measured with standard binary accuracy (A) and F1 scores (F1) for each emotion, as well as an overall non-weighted mean accuracy score and an overall weighted F1 score.
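The label preparation described above can be sketched as follows. The function names and the example annotation values are hypothetical; only the thresholds come from the text.

```python
EMOTIONS = ("joy", "sadness", "anger", "surprise", "disgust", "fear")


def binarize_sentiment(score):
    """Map a sentiment score in [-3, 3] to 0 (negative) or 1 (non-negative)."""
    return 0 if score < 0 else 1


def binarize_emotions(intensities):
    """Map emotion intensities in [0, 3] to a multi-label 0/1 vector (present if > 0)."""
    return [1 if value > 0 else 0 for value in intensities]


# Hypothetical annotation for one sentence, in the order given by EMOTIONS.
sentiment_label = binarize_sentiment(-0.5)
emotion_labels = binarize_emotions([0.0, 1.3, 0.0, 0.33, 0.0, 0.0])
print(sentiment_label, dict(zip(EMOTIONS, emotion_labels)))
```

Because several intensities can be greater than zero for the same sentence, the emotion target is a multi-label vector rather than a single class index.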

SST-2 (Socher et al., 2013) & IMDB (Maas et al., 2011): SST-2 is composed of over 60,000 sentences extracted from movie reviews. IMDB contains 50,000 movie reviews. Both are labelled for sentiment analysis in a 2-class split (positive or negative). These datasets were obtained using the HuggingFace Datasets library (https://huggingface.co/datasets). The performance of models on these datasets is measured with the same binary accuracy scores (A) as CMU-MOSEI.

4.2 Experimental Setup

All experiments use BERT-base (cased) (Devlin et al., 2018) as the pretrained model, which has 12 encoder layers and a hidden size of 768. Adapter and AdapterFusion layers are added to each of those encoder layers. The classification heads are composed of two fully connected linear layers with sizes equal to the hidden size of the Transformer layer (768) and the number of labels (6) respectively, and with activation functions. The input of the first linear layer is the last hidden state of the BERT model corresponding to the classification token ([CLS]) at the beginning of the input sequence. All models were trained using AdamW (Loshchilov and Hutter, 2017) with a linear learning rate scheduler, a learning rate of 1e-5, and a weight decay of 1e-2. All models were trained for 10 epochs with early stopping after 3 epochs if the validation metric did not improve. The Adapter-Transformers library (Pfeiffer et al., 2020b) was used to incorporate the Adapter and AdapterFusion layers into the model. The results presented in the following section are averaged over 3 runs.
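The training schedule can be illustrated with a short, framework-free sketch. It assumes that the linear scheduler decays the learning rate from 1e-5 to zero without warm-up and that a higher validation metric is better; neither detail is stated above. The function names and the validation scores are hypothetical stand-ins for a real training loop.

```python
def linear_lr(step, total_steps, base_lr=1e-5):
    """Learning rate decayed linearly from base_lr to zero (no warm-up assumed)."""
    remaining = max(0, total_steps - step)
    return base_lr * remaining / total_steps


def run_with_early_stopping(validation_scores, max_epochs=10, patience=3):
    """Return (best epoch index, number of epochs run) for a sequence of validation scores."""
    best_score = float("-inf")
    best_epoch = None
    epochs_run = 0
    stale_epochs = 0
    for epoch, score in enumerate(validation_scores[:max_epochs]):
        epochs_run = epoch + 1
        if score > best_score:
            best_score, best_epoch, stale_epochs = score, epoch, 0
        else:
            stale_epochs += 1
            if stale_epochs >= patience:
                break  # no improvement for `patience` consecutive epochs
    return best_epoch, epochs_run


# Hypothetical validation scores, one per epoch.
scores = [0.50, 0.52, 0.53, 0.53, 0.52, 0.51, 0.55]
best_epoch, epochs_run = run_with_early_stopping(scores)
print(best_epoch, epochs_run, linear_lr(50, 100))
```

In this made-up run, training stops after the sixth epoch because the score has not improved for three consecutive epochs, and the checkpoint from the third epoch (index 2) would be kept.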

Two types of fusion models were trained: one using a fusion of only CMU-MOSEI tasks (Fusion3: binary sentiment, 7-class sentiment and emotion classification) and one using additional knowledge from the SST-2 and IMDB sentiment analysis tasks (Fusion5).

4.3 Results

The results for the emotion detection task of CMU-MOSEI are presented in Table 1. The performance of the proposed model is compared to that of a fine-tuned BERT model and of a model using a single task-specific adapter, both using the same classification head as our proposed model. The results of the current state-of-the-art model for this dataset (Delbrouck et al., 2020) are also presented. Note that this state-of-the-art model is a Transformer-based model that utilizes both textual and audio modalities.

| Model | Joy (A/F1) | Sadness (A/F1) | Anger (A/F1) | Surprise (A/F1) | Disgust (A/F1) | Fear (A/F1) | Overall (A/F1) |
|---|---|---|---|---|---|---|---|
| TBJE¹ | 66.0/71.7 | 73.9/17.8 | 81.9/17.3 | 89.2/3.5 | 86.5/45.3 | 90.6/0.0 | 81.5/40.5 |
| BERT | 66.3/69.0 | 69.4/42.8 | 74.2/44.3 | 85.8/21.9 | 83.1/53.1 | 83.8/18.7 | 77.1/51.8 |
| Adapter | 67.3/69.4 | 66.3/46.1 | 70.4/48.5 | 73.4/26.5 | 77.3/52.3 | 70.9/22.7 | 70.9/53.7 |
| Fusion3 | 67.5/70.5 | 66.5/44.4 | 72.5/47.3 | 81.4/25.9 | 79.0/52.9 | 81.1/21.1 | 74.7/53.6 |
| Fusion5 | 67.5/70.7 | 69.1/44.6 | 73.1/47.5 | 81.3/26.6 | 79.9/53.0 | 82.2/20.3 | 75.5/53.7 |

¹ Accuracy scores obtained from (Delbrouck et al., 2020). F1 scores were computed using the two provided model checkpoints, as the ones presented in their paper were weighted F1 scores.

Table 1: Results on CMU-MOSEI for emotion detection. Adapter: BERT with task-specific adapters, Fusion3: BERT with fusion of adapters for CMU-MOSEI tasks (binary sentiment, 7-class sentiment and emotion classification), Fusion5: BERT with fusion of adapters for tasks of all datasets (CMU-MOSEI tasks and sentiment classification tasks of SST-2 & IMDB).
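As a worked example of how the overall scores relate to the per-emotion scores, the sketch below recomputes them for the Fusion5 row of Table 1. The overall accuracy is the plain mean of the six accuracies. For the weighted F1, the sketch assumes that the weights are the per-class numbers of positive samples and uses the rounded proportions of Table 2 as a stand-in for them, so the result is only approximate.

```python
# Per-emotion scores of the Fusion5 model, copied from Table 1.
accuracy = {"joy": 67.5, "sadness": 69.1, "anger": 73.1,
            "surprise": 81.3, "disgust": 79.9, "fear": 82.2}
f1 = {"joy": 70.7, "sadness": 44.6, "anger": 47.5,
      "surprise": 26.6, "disgust": 53.0, "fear": 20.3}

# Rounded proportions of positive samples from Table 2, used here as
# approximate class weights (an assumption of this sketch).
support = {"joy": 52, "sadness": 25, "anger": 21,
           "surprise": 10, "disgust": 17, "fear": 8}

mean_accuracy = sum(accuracy.values()) / len(accuracy)
weighted_f1 = sum(f1[e] * support[e] for e in f1) / sum(support.values())

print(round(mean_accuracy, 1))  # 75.5, matching the Overall accuracy in Table 1
print(round(weighted_f1, 1))    # close to, but not exactly, the reported 53.7
```

The small gap on the weighted F1 is expected, since the weights used here are rounded percentages and the per-emotion scores are themselves rounded.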

All models trained with our proposed loss function achieve better F1-scores than the current state-of-the-art. While a fully fine-tuned BERT model achieves better overall accuracy, the proposed Fusion model is the one that has the best accuracy/F1-score trade-off across emotions. As observed in Table 2, all distributions of emotions except joy are heavily imbalanced, so accuracy is not an appropriate metric for this dataset, as it does not fully represent the model's ability to identify each emotion. Therefore, it is better to use the F1-score as a measurement basis. Single Adapter models are able to achieve good F1 scores, but do not reach accuracy scores that are comparable to those of Fusion models, which further indicates that combining knowledge from multiple tasks improves the performance of the model. Capitalizing on knowledge from additional sentiment analysis tasks outside of the CMU-MOSEI dataset also allows the Fusion5 model to perform slightly better than the Fusion3 model, which only includes knowledge from the CMU-MOSEI tasks. The proposed model also requires far fewer trainable parameters, as can be seen in Table 3.

| | Joy | Sadness | Anger | Surprise | Disgust | Fear |
|---|---|---|---|---|---|---|
| Proportion of positive samples | 52% | 25% | 21% | 10% | 17% | 8% |

Table 2: Positive samples per class for CMU-MOSEI
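To illustrate why accuracy can be misleading at these class proportions, consider a hypothetical set of 100 samples in which 8 are positive, mirroring the proportion reported for fear. The two sets of predictions below are made up for the example.

```python
def accuracy_and_f1(predictions, labels):
    """Binary accuracy and F1 of the positive class."""
    pairs = list(zip(predictions, labels))
    tp = sum(1 for p, y in pairs if p == 1 and y == 1)
    fp = sum(1 for p, y in pairs if p == 1 and y == 0)
    fn = sum(1 for p, y in pairs if p == 0 and y == 1)
    correct = sum(1 for p, y in pairs if p == y)
    denominator = 2 * tp + fp + fn
    f1 = 2 * tp / denominator if denominator else 0.0
    return correct / len(pairs), f1


labels = [1] * 8 + [0] * 92

# A model that never predicts the emotion.
always_negative = [0] * 100
# A model that finds 6 of the 8 positives at the cost of 14 false alarms.
flagging = [1] * 6 + [0] * 2 + [1] * 14 + [0] * 78

print(accuracy_and_f1(always_negative, labels))  # accuracy 0.92, F1 0.0
print(accuracy_and_f1(flagging, labels))         # accuracy 0.84, F1 about 0.43
```

The model that never detects the emotion has the higher accuracy, while only the F1-score reflects that the second model actually identifies most of the positive samples.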
| Model | All parameters | Trainable parameters |
|---|---|---|
| BERT (fine-tuned) | 108.3 M | 108.3 M |
| Adapter | 109.8 M | 1.5 M |
| Fusion3 | 132.8 M | 21.8 M |
| Fusion5 | 134.6 M | 21.8 M |

Table 3: Number of parameters per model
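The orders of magnitude in Table 3 can be checked with a back-of-the-envelope count. The sketch below assumes one adapter per encoder layer with biases on both projections, a fusion layer made of three square projection matrices (query, key, value) without biases, and a two-layer classification head with 768 and 6 units. These structural details are assumptions of the example, and the actual library implementation may differ slightly.

```python
HIDDEN, LAYERS, LABELS, REDUCTION = 768, 12, 6, 16


def dense(in_dim, out_dim, bias=True):
    """Number of parameters of a fully connected layer."""
    return in_dim * out_dim + (out_dim if bias else 0)


bottleneck = HIDDEN // REDUCTION  # 48
adapter_per_layer = dense(HIDDEN, bottleneck) + dense(bottleneck, HIDDEN)
adapter_stack = adapter_per_layer * LAYERS

head = dense(HIDDEN, HIDDEN) + dense(HIDDEN, LABELS)

# Query, key and value projections in every encoder layer (assumed bias-free).
fusion_stack = 3 * dense(HIDDEN, HIDDEN, bias=False) * LAYERS

single_adapter_trainable = adapter_stack + head
fusion_trainable = fusion_stack + head

print(round(single_adapter_trainable / 1e6, 1))  # 1.5
print(round(fusion_trainable / 1e6, 1))          # 21.8
```

Under these assumptions each adapter stack has roughly 0.9 M parameters, so adding three or five stacks plus the fusion parameters to the 108.3 M of BERT gives about 132.8 M and 134.6 M in total, in line with the two fusion rows of Table 3.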

4.4 Comparison of Loss Functions

The choice of loss function greatly impacts the performance of the model, especially on emotions that are less present in the dataset. The performance obtained with the different loss functions tested is presented in Table 4.

| Loss | Joy (A/F1) | Sadness (A/F1) | Anger (A/F1) | Surprise (A/F1) | Disgust (A/F1) | Fear (A/F1) | Overall (A/F1) |
|---|---|---|---|---|---|---|---|
| BCE | 67.9/71.5 | 75.8/22.2 | 78.5/25.4 | 90.5/1.3 | 85.6/48.5 | 91.7/0.5 | 81.7/42.8 |
| FL | 67.7/70.9 | 75.8/24.9 | 78.4/23.1 | 90.5/0.6 | 85.6/46.1 | 91.7/0.0 | 81.6/42.2 |
| PL | 67.5/70.7 | 69.1/44.6 | 73.1/47.5 | 81.3/26.6 | 79.9/53.0 | 82.2/20.3 | 75.5/53.7 |

Table 4: Results on CMU-MOSEI for emotion detection for different loss functions. BCE: Regular Binary Cross-Entropy loss, FL: Focal loss, PL: Proposed loss function.

The difference in performance between the classic Binary Cross-Entropy loss and the Focal loss is not significant. While the use of the loss function proposed in this paper decreases to some extent the accuracy for most emotions, it greatly improves the F1-score for all emotions with the exception of joy.

4.5 Performance of Single Adapters

This section presents the performance of the single Adapters on the other tasks used for knowledge composition in the Fusion models. Tables 5 and 6 compare the results obtained by BERT trained with task-specific adapters (Adapter) to fully fine-tuned models and state-of-the-art models. Unless stated otherwise, all accuracy values were obtained by averaging the results over 3 runs using the experimental setup described in section 4.2. Base size versions of BERT, RoBERTa and XLNet models were used for a fair comparison.

| Model | 2-class sentiment | 7-class sentiment |
|---|---|---|
| TBJE¹ | 84.2 | 45.5 |
| BERT | 84.3 | 46.8 |
| Adapter | 83.9 | 46.5 |

¹ Accuracy scores obtained from (Delbrouck et al., 2020).

Table 5: Results on CMU-MOSEI sentiment analysis tasks. BERT: fine-tuned BERT, Adapter: BERT with task-specific adapters.

The performance of the Adapter models is comparable to that of a fully fine-tuned BERT model. In the case of CMU-MOSEI tasks, they performed on par with or better than state-of-the-art results. In the case of SST-2 and IMDB tasks, they slightly underperformed compared to state-of-the-art fine-tuned language models. However, regardless of the dataset, this experiment shows that adapters can capture useful task-specific information at a lower training cost. Furthermore, adapter fusion makes it possible to combine the knowledge from these well-performing task-specific adapters. This explains why our proposed adapter fusion model benefits from the related tasks of sentiment analysis to improve emotion recognition.

| Model | SST-2 | IMDB |
|---|---|---|
| RoBERTa | 94.8 | 94.5 |
| XLNet | 93.4 | 95.1 |
| BERT | 93.5 | 94.0 |
| Adapter | 92.6 | 93.7 |

Table 6: Results on SST-2 & IMDB sentiment analysis tasks. RoBERTa, XLNet and BERT: language model fine-tuned on the specific task, Adapter: BERT with task-specific adapters.

5 Conclusion

The model presented in this work surpasses state-of-the-art results for emotion recognition on CMU-MOSEI even while using only the textual modality. There is still improvement needed for the rarer emotions in the dataset, but at the time of producing this article, the results presented are substantially stronger than other contributions in terms of F1-scores. Due to the lack of large-scale datasets for emotion detection in text, testing the model on purely textual data will have to be done in further studies as such data become available.


This work was supported by Mitacs through the Mitacs Accelerate program and by Airudi.


  • M. Abdul-Mageed and L. Ungar (2017) EmoNet: fine-grained emotion detection with gated recurrent neural networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vancouver, Canada, pp. 718–728.
  • F. A. Acheampong, H. Nunoo-Mensah, and W. Chen (2021) Transformer models for text-based emotion detection: a review of BERT-based approaches. Artificial Intelligence Review. ISSN 1573-7462.
  • A. Bagher Zadeh, P. P. Liang, S. Poria, E. Cambria, and L. Morency (2018) Multimodal language analysis in the wild: CMU-MOSEI dataset and interpretable dynamic fusion graph. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, pp. 2236–2246.
  • K. Clark, M. Luong, U. Khandelwal, C. D. Manning, and Q. V. Le (2019) BAM! born-again multi-task networks for natural language understanding. CoRR abs/1907.04829.
  • J. Delbrouck, N. Tits, M. Brousmiche, and S. Dupont (2020) A transformer-based joint-encoding for emotion recognition and sentiment analysis. CoRR abs/2006.15955.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) BERT: pre-training of deep bidirectional transformers for language understanding. CoRR abs/1810.04805.
  • P. Ekman (1992) An argument for basic emotions. Cognition and Emotion 6 (3-4), pp. 169–200. https://doi.org/10.1080/02699939208411068
  • N. Houlsby, A. Giurgiu, S. Jastrzebski, B. Morrone, Q. de Laroussilhe, A. Gesmundo, M. Attariyan, and S. Gelly (2019) Parameter-efficient transfer learning for NLP. CoRR abs/1902.00751.
  • T. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár (2018) Focal loss for dense object detection. arXiv:1708.02002.
  • X. Liu, P. He, W. Chen, and J. Gao (2019a) Improving multi-task deep neural networks via knowledge distillation for natural language understanding. CoRR abs/1904.09482.
  • X. Liu, P. He, W. Chen, and J. Gao (2019b) Multi-task deep neural networks for natural language understanding. CoRR abs/1901.11504.
  • Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov (2019c) RoBERTa: A robustly optimized BERT pretraining approach. CoRR abs/1907.11692.
  • I. Loshchilov and F. Hutter (2017) Fixing weight decay regularization in Adam. CoRR abs/1711.05101.
  • L. Ma, L. Zhang, W. Ye, and W. Hu (2019) PKUSE at SemEval-2019 task 3: emotion detection with emotion-oriented neural attention network. In Proceedings of the 13th International Workshop on Semantic Evaluation, Minneapolis, Minnesota, USA, pp. 287–291.
  • A. L. Maas, R. E. Daly, P. T. Pham, D. Huang, A. Y. Ng, and C. Potts (2011) Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Portland, Oregon, USA, pp. 142–150.
  • S. Park, J. Kim, J. Jeon, H. Park, and A. Oh (2019) Toward dimensional emotion detection from categorical emotion annotations. CoRR abs/1911.02499.
  • J. Pfeiffer, A. Kamath, A. Rücklé, K. Cho, and I. Gurevych (2020a) AdapterFusion: non-destructive task composition for transfer learning. CoRR abs/2005.00247.
  • J. Pfeiffer, A. Rücklé, C. Poth, A. Kamath, I. Vulić, S. Ruder, K. Cho, and I. Gurevych (2020b) AdapterHub: a framework for adapting transformers. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp. 46–54.
  • D. Seal, U. Roy, and R. Basak (2019) Sentence-level emotion detection from text based on semantic rules. pp. 423–430. ISBN 978-981-13-7165-3.
  • R. Socher, A. Perelygin, J. Wu, J. Chuang, C. D. Manning, A. Ng, and C. Potts (2013) Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, Seattle, Washington, USA, pp. 1631–1642.
  • L. I. Tan, W. S. Phang, K. O. Chin, and A. Patricia (2015) Rule-based sentiment analysis for financial news. In 2015 IEEE International Conference on Systems, Man, and Cybernetics, pp. 1601–1606.
  • D. Tang, B. Qin, X. Feng, and T. Liu (2015) Target-dependent sentiment classification with long short term memory. CoRR abs/1512.01100.
  • O. Udochukwu and Y. He (2015) A rule-based approach to implicit emotion detection in text. In NLDB.
  • Z. Yang, Z. Dai, Y. Yang, J. G. Carbonell, R. Salakhutdinov, and Q. V. Le (2019) XLNet: generalized autoregressive pretraining for language understanding. CoRR abs/1906.08237.