
Deep Learning Brasil – NLP at SemEval-2020 Task 9: Overview of Sentiment Analysis of Code-Mixed Tweets

In this paper, we describe a methodology to predict sentiment in code-mixed (Hindi-English) tweets. Our team, registered as verissimo.manoel on CodaLab, developed an approach based on an ensemble of four models (MultiFiT, BERT, ALBERT, and XLNet). The final classification is obtained by averaging the softmax values predicted by these four models. This architecture was used and evaluated in the context of the SemEval-2020 challenge (Task 9), and our system achieved a 72.7% F1 score in the competition.



1 Introduction

It is a common tendency among multilingual people who are non-native English speakers to code-mix in their speech using English-based phonetic typing. This linguistic phenomenon, particularly in social media like Twitter, poses a great challenge to the conventional Natural Language Processing (NLP) study area.

Within the context of Sentiment Analysis, the study of code-mixed language is important to the research community because this behavior is increasingly common. Interest in this area has grown due to the volume of data that social networks generate, and also because of the value this information has for understanding people's opinions when they are expressed in written text.

In this paper, we explain our methodology to predict sentiment in tweets, describing how our method is based on a combination of recent language models, and also how such models contributed to a great advance in this task. This configuration was employed and evaluated in the SemEval-2020 challenge (Task 9), in which the goal is to predict the sentiment of code-mixed tweets written in English and Hindi [12]. The models used in this combination are MultiFiT [6], which is an evolution of ULMFiT [8], BERT [5], ALBERT [10], and XLNet [16].

This work is organized as follows: Section 2 discusses related work, Section 3 addresses the methodology applied to the task, Section 4 describes the dataset used, Section 5 presents the results, and finally Section 6 presents our final considerations as well as possible future work.

2 Related Works

Sentiment Analysis on Twitter is considered a very important task from both academic and commercial perspectives. Many companies use Twitter data to inform marketing and business decisions.

A particular challenge is applying Sentiment Analysis to texts written in two different languages, such as English and Hindi. This happens because many people around the world speak two or more languages, and when they write texts they sometimes mix more than one language.

The authors of [13] present their work on Sentiment Analysis for Indian Languages (SAIL) on code-mixed text. They implemented an algorithm using Multinomial Naïve Bayes trained on n-gram and SentiWordNet features.

According to [1], users tend to express their thoughts by mixing words from multiple languages, because most of the time they are more comfortable in their regional language, and mixed languages are common when users write on social media. They divide their technique into two stages, viz. Language Identification and Sentiment Mining. They evaluated their results against a baseline obtained from sentences machine-translated into English, and found their approach to be around 8% better in terms of precision.

For [7], a preprocessing step to remove noise from raw text is very important. The authors developed a Multilayer Perceptron to determine sentiment polarity in code-mixed social media text from Facebook. This example using texts from Facebook is important to show that the phenomenon extends to social media platforms other than the one used in SemEval 2020.


The work in [11] cites some data about the Census in India, showing the existence of 22 scheduled languages and 462 million internet users. To express their feelings, such users likely use more than one language when writing text on social media.

In order to help NLP researchers, a corpus of English-Hindi code-mixed tweets was created by [14], marked for the presence of sarcasm and irony, where each token is also annotated with a language tag. In the present work, this corpus was used to train the MultiFiT model.

3 Methodology

The methodology applied in this task consists of training four models (MultiFiT, BERT, ALBERT, and XLNet) and using their prediction values. After retrieving the predictions, our ensemble computes the average of the softmax values from the four models, as shown in Figure 1. BERT, ALBERT, and XLNet were trained on a DGX-1, while MultiFiT was trained on a GTX 1070 Ti 8GB. The hyperparameters of the four models are described in Table 1.


Model     Batch Size  Learning Rate  Max Length  Optimizer  Training    Base Model
MultiFiT  32          1e-3           –           –          20 epochs   –
BERT      8           2e-5           128         AdamW      30 epochs   BERT-Base, Multilingual Cased
ALBERT    16          2e-5           256         AdamW      30 epochs   Xxlarge
XLNet     16          2e-5           256         –          8000 steps  XLNetLarge, Cased
Table 1: Hyperparameters.
Figure 1: Solution Architecture.
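The ensemble step described above can be sketched as follows. The softmax vectors here are purely illustrative placeholders, not actual model outputs:

```python
import numpy as np

# Hypothetical softmax outputs (positive, negative, neutral) from the
# four fine-tuned models for a single tweet; values are illustrative.
probs = {
    "multifit": np.array([0.60, 0.25, 0.15]),
    "bert":     np.array([0.55, 0.30, 0.15]),
    "albert":   np.array([0.70, 0.10, 0.20]),
    "xlnet":    np.array([0.50, 0.35, 0.15]),
}

# The ensemble prediction is the arithmetic mean of the softmax vectors.
ensemble = np.mean(list(probs.values()), axis=0)

labels = ["positive", "negative", "neutral"]
prediction = labels[int(np.argmax(ensemble))]
```

Averaging probabilities (rather than hard votes) lets a model that is very confident in one class outweigh models that are only mildly confident in another.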

3.1 Preprocessing

This step consists of eliminating noise and terms that have no semantic significance for sentiment prediction. To this end, we remove links, numbers, and special characters, and transform the text to lowercase.
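The cleaning steps above can be sketched with simple regular expressions (an illustrative sketch; the exact patterns used by the authors are not given in the paper):

```python
import re

def preprocess(text: str) -> str:
    """Lowercase and strip links, numbers, and special characters,
    mirroring the preprocessing steps described above."""
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # remove links
    text = re.sub(r"\d+", " ", text)                    # remove numbers
    text = re.sub(r"[^a-zA-Z\s]", " ", text)            # remove special characters
    return re.sub(r"\s+", " ", text).strip().lower()    # collapse spaces, lowercase
```

For example, `preprocess("@user Check https://t.co/abc NRC 370 !!")` yields `"user check nrc"`.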

3.2 MultiFiT

Nowadays there are many advances in NLP, but the majority of research is based on the English language, and those advances can be slow to transfer beyond English.

The MultiFiT [6] method is based on Universal Language Model Fine-tuning (ULMFiT) [8]; its goal is to make language-model fine-tuning more efficient for languages other than English.

There are two changes compared to the older model: it utilizes tokenization based on subwords rather than words, and it uses a QRNN [2] rather than an LSTM. The model architecture can be seen in Figure 2.

The architecture consists of a subword embedding layer, four QRNN layers, an aggregation layer, and two linear layers. In this architecture, subword tokenization has two very important properties:

  • Subwords more easily represent inflections, including common prefixes and suffixes, which makes them well-suited for morphologically rich languages.

  • Out-of-vocabulary tokens are a common problem, and subword tokenization is a good way to prevent it.

Figure 2: MultiFiT Architecture.
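The out-of-vocabulary behavior of subword tokenization can be illustrated with a toy greedy longest-match segmenter. Both the vocabulary and the matching scheme here are hypothetical; real models such as MultiFiT learn their subword vocabulary from data (e.g. via BPE or a unigram model):

```python
def subword_tokenize(word, vocab):
    """Greedy longest-match-first segmentation: at each position, take
    the longest vocabulary piece; fall back to single characters, so no
    word is ever out-of-vocabulary."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            piece = word[i:j]
            if piece in vocab or j == i + 1:  # single chars always allowed
                pieces.append(piece)
                i = j
                break
    return pieces

# Toy learned vocabulary capturing common stems and affixes.
vocab = {"play", "ing", "ed", "un", "happi", "ness"}
```

With this vocabulary, `"playing"` becomes `["play", "ing"]` and the unseen `"unhappiness"` becomes `["un", "happi", "ness"]`, illustrating how inflections and affixes are shared across words.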

3.3 BERT

Bidirectional Encoder Representations from Transformers (BERT) [5] is a model designed to pre-train deep bidirectional representations from unlabeled data. The pre-trained BERT model can be fine-tuned with just one additional output layer, which makes it applicable to sentiment analysis and other NLP tasks.

BERT involves two steps: pre-training and fine-tuning. In the pre-training step, the model is trained on unlabeled data over different pre-training tasks, using a corpus in a specific language or multiple corpora in different languages. In the fine-tuning step, the BERT model is initialized with the pre-trained parameters, and all of the parameters are fine-tuned using labeled data from the specific task.
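The "single additional output layer" used for fine-tuning can be sketched in NumPy. The `h_cls` vector here is random and merely stands in for BERT's pooled [CLS] representation; only the shapes (hidden size 768, three sentiment labels) reflect the actual BERT-Base setup:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

hidden_size, num_labels = 768, 3

# Stand-in for the pooled [CLS] representation produced by the encoder.
h_cls = rng.standard_normal(hidden_size)

# The single additional output layer: a linear map to the label logits.
W = rng.standard_normal((num_labels, hidden_size)) * 0.02
b = np.zeros(num_labels)

logits = W @ h_cls + b
probs = softmax(logits)  # class probabilities for positive/negative/neutral
```

During fine-tuning, the gradient of the classification loss updates both this head and all of the pre-trained encoder parameters.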

Dataset repositories like NLP-progress track results and progress on many Natural Language Processing (NLP) benchmarks, as well as the current state of the art for the most common NLP tasks. Comparing the results available in such repositories, BERT achieved state-of-the-art performance on many NLP tasks, which gives an excellent reason to use BERT in our architecture, even though many of the reasons for BERT's state-of-the-art performance are not fully understood [9, 3].

3.4 ALBERT

Recent language models have shown a tendency to grow in size and in the number of trainable parameters. They often bring improvements on many NLP tasks, but they suffer as a consequence from the need for many hours of training, which in turn increases operating costs. ALBERT [10], A Lite BERT for Self-supervised Learning of Language Representations, offers parameter-reduction techniques to address this problem.

There are two changes that reduce the size of the model relative to BERT. The first is a factorized embedding parametrization, which decomposes the large vocabulary embedding matrix into two smaller matrices. This decomposition reduces the number of trainable parameters and saves significant time during the training phase. The second change is cross-layer parameter sharing, which prevents the parameter count from growing with the depth of the network.
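The savings from the factorized embedding parametrization can be checked with a quick parameter count. The sizes below follow the BERT-Base/ALBERT configurations (a 30,000-token vocabulary V, hidden size H = 768, embedding size E = 128):

```python
# Embedding-block parameter counts with and without factorization.
V, H, E = 30_000, 768, 128

bert_style = V * H            # one large V x H embedding matrix
albert_style = V * E + E * H  # V x E embedding followed by E x H projection

savings = bert_style - albert_style
```

The factorized version needs about 3.9M parameters instead of 23M for the embedding block, a reduction of more than 80%, because the V x H product is replaced by the much smaller V x E + E x H.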

3.5 XLNet

XLNet [15] is a model that uses a bidirectional learning mechanism as an alternative to the word corruption via masking implemented by BERT. XLNet applies a permutation operation over the tokens of an input sequence, so a single phrase can be reused across training steps while providing different examples. Permutation during training keeps the token positions fixed while iterating over every token in the training phrases, rendering the model able to exploit information from both tokens and their positions in a given phrase. XLNet also draws inspiration from Transformer-XL [4], especially its pretraining ideas.
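The permutation idea can be illustrated on a toy sequence. This is purely illustrative: XLNet samples factorization orders during training rather than enumerating all of them, and each token keeps its original position index regardless of the order used:

```python
import itertools

# Toy sequence; positions stay attached to their tokens.
tokens = ["the", "movie", "was", "great"]
positions = list(range(len(tokens)))

# All possible factorization orders for the autoregressive objective.
factorization_orders = list(itertools.permutations(positions))

# For a sequence of length n there are n! orders, so a single phrase
# yields many distinct training views.
```

Under the order `(2, 0, 3, 1)`, for example, the model predicts token 2 first, then token 0 given token 2, and so on, which lets every token be conditioned on context from both sides across orders.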

4 Dataset and Task

Dataset. The data for the task consists of 17,000 tweets for training and 3,000 for test/evaluation, provided in CoNLL format. The label distribution of the dataset can be seen in Figure 3.
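Reading such a file can be sketched as follows. This is a minimal sketch that assumes the SentiMix layout of a `meta` line carrying the tweet id and sentiment label, followed by one token (with its language tag) per line, and blank lines between tweets; the actual field layout may differ:

```python
def read_conll(lines):
    """Parse CoNLL-style tweet blocks into (tokens, label) pairs."""
    tweets, tokens, label = [], [], None
    for line in lines:
        line = line.strip()
        if not line:                      # blank line ends a tweet block
            if tokens:
                tweets.append((tokens, label))
                tokens, label = [], None
        elif line.startswith("meta"):     # e.g. "meta 1 positive"
            label = line.split()[-1]
        else:                             # e.g. "movie Eng"
            tokens.append(line.split()[0])
    if tokens:                            # flush a trailing block
        tweets.append((tokens, label))
    return tweets
```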

Figure 3: Number of labels per classes.

Task details. The task objective is to predict the sentiment of a given code-mixed tweet. The sentiment labels are positive, negative, or neutral, and the code-mixed languages will be English-Hindi. The challenge is to predict the sentiment in texts written in these two languages [12].

Here are some examples of sentences taken from the dataset.

Positive Sentence

@AmitShah @narendramodi All India me nrc lagu kare w Kashmir se dhara 370ko khatam kare ham Indian ko apse yahi umid hai

Negative Sentence

@RahulGandhi television media congress ke liye nhi h . Ye toh aapko pata chal hi gya hoga . Achha hoga ki Congress ke … https//

Neutral Sentence

@sardanarohit jaaz saab ko salo saal ke pending case ko soultion me maza nahi aata * inko to public paise monthly case miljaye *

5 Results

In this section, we report the results obtained by our models according to the evaluation metrics used by the challenge: macro F1, precision, recall, accuracy, and F1 for each class. Results are reported for each individual model and for the ensemble combining the four models (XLNet, BERT, ALBERT, and MultiFiT). Table 2 shows the models' performance, and Table 3 presents the F1 score per class.

Model     F1     P      R      Acc
Ensemble  0.727  0.729  0.726  0.723
XLNet     0.679  0.696  0.692  0.690
ALBERT    0.679  0.684  0.676  0.675
BERT      0.675  0.680  0.672  0.670
MultiFiT  0.665  0.665  0.669  0.662
Table 2: Results on SemEval-2020.

Class     F1
Negative  0.671
Positive  0.760
Neutral   0.606
Table 3: Ensemble F1 by class.

The results obtained on the test data indicate that our ensemble produces the best F1 score, while XLNet is the best among the individual models. It is important to note that the final ensemble is a combination of the results of the four models used in this architecture.

For the official results of the competition, the organizers used only the first three submissions; in our case, those submissions covered only the MultiFiT and BERT models. Using only these architectures, our result was 66.5%.

6 Conclusion

In this paper, we propose a combination of four models for SemEval-2020 Task 9, with which our team obtained a 72.7% F1 score in the competition. All of these models are based on language models and transfer learning. Individually they performed well, but together in an ensemble they performed even better.

In some applications, it is difficult to use an ensemble consisting of four models, especially because of the overhead of inference time, which can make the approach impractical. On the other hand, the individual results of the four models are very close, meaning that for this task any of the models could be used on its own.

It is important to note that MultiFiT has the worst result, but the difference is very small, and this specific model takes much less time to train, being the lightest model in the ensemble.

As future work, we intend to explore these models for Sentiment Analysis in other multilingual and monolingual scenarios.


  • [1] R. Bhargava, Y. Sharma, and S. Sharma (2016) Sentiment analysis for mixed script indic sentences. In 2016 International Conference on Advances in Computing, Communications and Informatics (ICACCI), pp. 524–529. External Links: Document Cited by: §2.
  • [2] J. Bradbury, S. Merity, C. Xiong, and R. Socher (2016) Quasi-recurrent neural networks. External Links: 1611.01576 Cited by: §3.2.
  • [3] K. Clark, U. Khandelwal, O. Levy, and C. D. Manning (2019-08) What does BERT look at? an analysis of BERT’s attention. In Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, Florence, Italy, pp. 276–286. External Links: Link, Document Cited by: §3.3.
  • [4] Z. Dai, Z. Yang, Y. Yang, J. Carbonell, Q. V. Le, and R. Salakhutdinov (2019) Transformer-xl: attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860. Cited by: §3.5.
  • [5] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) BERT: pre-training of deep bidirectional transformers for language understanding. External Links: 1810.04805 Cited by: §1, §3.3.
  • [6] J. Eisenschlos, S. Ruder, P. Czapla, M. Kardas, S. Gugger, and J. Howard (2019) MultiFiT: efficient multi-lingual language model fine-tuning. External Links: 1909.04761 Cited by: §1, §3.2.
  • [7] S. Ghosh, S. Ghosh, and D. Das (2017) Sentiment identification in code-mixed social media text. External Links: 1707.01184 Cited by: §2.
  • [8] J. Howard and S. Ruder (2018) Universal language model fine-tuning for text classification. External Links: 1801.06146 Cited by: §1, §3.2.
  • [9] O. Kovaleva, A. Romanov, A. Rogers, and A. Rumshisky (2019-11) Revealing the dark secrets of BERT. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 4365–4374. External Links: Link, Document Cited by: §3.3.
  • [10] Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut (2019) ALBERT: a lite bert for self-supervised learning of language representations. External Links: 1909.11942 Cited by: §1, §3.4.
  • [11] B. G. Patra, D. Das, and A. Das (2018) Sentiment analysis of code-mixed indian languages: an overview of sail_code-mixed shared task @icon-2017. External Links: 1803.06745 Cited by: §2.
  • [12] P. Patwa, G. Aguilar, S. Kar, S. Pandey, S. PYKL, B. Gambäck, T. Chakraborty, T. Solorio, and A. Das (2020-12) SemEval-2020 task 9: overview of sentiment analysis of code-mixed tweets. In Proceedings of the 14th International Workshop on Semantic Evaluation (SemEval-2020), Barcelona, Spain. Cited by: §1, §4.
  • [13] K. Sarkar (2018) JU_KS@sail_codemixed-2017: sentiment analysis for indian code mixed social media texts. External Links: 1802.05737 Cited by: §2.
  • [14] S. Swami, A. Khandelwal, V. Singh, S. S. Akhtar, and M. Shrivastava (2018) A corpus of english-hindi code-mixed tweets for sarcasm detection. External Links: 1805.11869 Cited by: §2.
  • [15] Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. Salakhutdinov, and Q. Le (2019) XLNet: generalized autoregressive pretraining for language understanding. CoRR abs/1906.08237. Cited by: §3.5.
  • [16] Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. Salakhutdinov, and Q. V. Le (2019) XLNet: generalized autoregressive pretraining for language understanding. External Links: 1906.08237 Cited by: §1.