
Benchmarking Multi-Task Learning for Sentiment Analysis and Offensive Language Identification in Under-Resourced Dravidian Languages

Obtaining extensive annotated data for under-resourced languages is challenging, so in this research we investigate whether it is beneficial to train models using multi-task learning. Sentiment analysis and offensive language identification share similar discourse properties. The selection of these tasks is motivated by the lack of large labelled data for user-generated code-mixed datasets. This paper works on code-mixed YouTube comments for Tamil, Malayalam, and Kannada languages. Our framework is applicable to other sequence classification problems irrespective of the size of the datasets. Experiments show that our multi-task learning model can achieve high results compared with single-task learning while reducing the time and space constraints required to train the models on individual tasks. Analysis of fine-tuned models indicates the preference of multi-task learning over single-task learning, resulting in a higher weighted F1-score on all three languages. We apply two multi-task learning approaches to three Dravidian languages: Kannada, Malayalam, and Tamil. Maximum scores on Kannada and Malayalam were achieved by mBERT subjected to cross-entropy loss with hard parameter sharing. The best scores on Tamil were achieved by DistilBERT subjected to cross-entropy loss with soft parameter sharing as the architecture type. On sentiment analysis and offensive language identification, respectively, the best-performing models scored weighted F1-scores of (66.8% and 90.5%), (59% and 70%), and (62.1% and 75.3%) for Kannada, Malayalam, and Tamil. The data and approaches discussed in this paper are published on GitHub (Dravidian-MTL-Benchmarking).





1 Introduction

Nowadays, the internet is key to accessing information and communicating with others. Easy access to the internet has enabled the majority of the population to use social media and amplify their communication and connections throughout the world. Social media platforms such as Facebook, YouTube, and Twitter have paved the way for users to express any sentiment about posted content in any language Severyn et al. (2014); Clarke and Grieve (2017); Tian et al. (2017). These texts are informal as they are written in a spoken tone that does not follow strict grammatical rules. Understanding social media content has lately attracted much attention from the natural language processing (NLP) community Barman et al. (2014), owing to the growing amount of data and active users. Sentiment analysis (SA) refers to the method of extracting subjectivity and polarity from text Pang and Lee (2008). The lack of moderation on social media has resulted in the persistent use of offensive language towards other users on these platforms. Offensive language identification (OLI) is the task of identifying whether a comment contains offensive language or not. Traditionally, SA and OLI are each attempted in a mono-lingual, single-task setting. Such approaches to text classification work well for high-resourced languages. However, they fail for languages with limited resources, and they also fail on code-mixed text Chakravarthi et al. (2020b).

Multi-task learning (MTL) is a practical approach that utilises shared characteristics of tasks to improve system performance Caruana (1997). In MTL, the objective is to learn multiple tasks simultaneously to improve the performance of the system Martínez Alonso and Plank (2017). Since SA and OLI are both essentially sequence classification tasks, this inspired us to do MTL. To utilise MTL on Dravidian languages, we have experimented with several recent pretrained transformer-based natural language models on Tamil (ISO 639-3: tam), Malayalam (ISO 639-3: mal), and Kannada (ISO 639-3: kan).

Kannada and Malayalam are among the Dravidian languages predominantly spoken in South India and are also official languages in the states of Karnataka and Kerala, respectively Reddy and Sharoff (2011). The Tamil language has official status in Tamil Nadu of India and countries like Sri Lanka, Singapore, Malaysia, and other parts of the world. Dravidian languages are morphologically rich; along with code-mixing, it becomes even more challenging to process these languages Bhat (2012), and they are under-resourced Prabhu et al. (2020).

We propose a method to perform MTL to address two main challenges arising when creating a system for user-generated comments from social media. The challenges are:

  1. Code-mixing: In multi-lingual countries like India, Sri Lanka, and Singapore, the speakers are likely to be polyglottic and often switch between multiple languages. This phenomenon is called code-mixing, and these code-mixed texts are even written in non-native scripts Das and Gambäck (2014); Bali et al. (2014); Chakravarthi et al. (2020a). Code-mixing can be referred to as a blend of two or more languages in a single sentence or conversation, which is prevalent in the social media platforms such as Facebook and YouTube.

  2. Scarcity of Data: Although there is an enormous number of speakers for Tamil, Kannada, and Malayalam, these languages are extensively considered as under-resourced languages. One of the main reasons is the lack of user-generated content extracted from social media applications. One of the ways to tackle the problem is to annotate the data on several tasks.

We address (1) by using pretrained multi-lingual natural language models and (2) by using the MTL approach.
The rest of the paper is organised as follows. Section 2 reviews previous work on SA, OLI, and MTL in NLP. Section 3 gives a detailed description of the datasets. Section 4 describes the proposed single-task models and their loss functions. Popular MTL frameworks are described in Section 4.2. We describe the experimental results and analysis in Section 6 and conclude our work in Section 7.

2 Related Work

In this section, we briefly review previous relevant work related to (i) SA, (ii) OLI, and finally, (iii) MTL in NLP.

2.1 Sentiment Analysis

SA is one of the leading research domains devoted to analysing people’s sentiments and opinions on any given entity. Due to its broad applications, a plethora of research has been performed in several languages. However, the same is not true for the Dravidian languages. As stated earlier, there is a significant lack of data for conducting experiments on code-mixed text in the Dravidian languages. SA is one of the most prominent downstream tasks of NLP, as obtaining people’s opinions is essential for several business applications in the e-commerce market.

A data set was created as a part of a shared task on SA of Indian languages (SAIL), which consisted of around 1,663 code-mixed tweets extracted from Twitter in Tamil Patra et al. (2015), where SentiWordNet outperformed all of the other systems Phani et al. (2016); Das and Bandyopadhyay (2010). There has been a plethora of research performed on several downstream tasks in other languages, primarily due to the abundance of user-generated data in social media, which has developed an interest in people’s opinions and emotions with respect to a specific target. Existing research is relatively low due to the lack of data in code-mixing. SA is one of the downstream tasks that are performed on any natural language model. The largely available crowd-sourced data on social media applications such as Twitter and YouTube have resulted in developing several code-mixed datasets for the task.

To our knowledge, very few Kannada-English code-mixed datasets exist for SA. A Kannada-English code-mixed data set for emotion prediction was created Appidi et al. (2020). A probabilistic approach was employed to classify parts of speech (POS) tags Sowmya Lakshmi and Shambhavi (2017). Several studies have addressed SA in Tamil Se et al. (2016); Thilagavathi and Krishnakumari (2016). A recursive neural network approach was adopted to improve classification accuracy on texts in Tamil Padmamala and Prema (2017). A dynamic mode decomposition (DMD) method with random mapping was developed for SA on the SAIL 2015 data set Kumar et al. (2020). A lexicon-based approach was employed along with several feature representation approaches to analyse the sentiments of Tamil texts Thavareesan and Mahesan (2019). For Malayalam, several supervised machine learning and rule-based approaches were used to analyse the sentiments of Malayalam movie reviews Nair et al. (2015, 2014); Soumya and Pramod (2020). A fuzzy logic-based hybrid approach was also used to analyse movie reviews in Malayalam Anagha et al. (2015).

2.2 Offensive Language Identification

A surge in the popularity of social media platforms has resulted in a rise of trolling, aggression, and hostile and abusive language, which is a concerning issue given the positive or negative impact a message can have on an individual or groups of people Tontodimamma et al. (2021); Plaza-del-Arco et al. (2021). This issue has led several researchers to work on identifying offensive language and posts in order to moderate content on social media platforms and promote positivity Chakravarthi (2020). Offensive language can be defined as any text entailing certain forms of unacceptable language, which may include insults, threats, or bad words Plaza-del-Arco et al. (2021). In comparison, hate speech seems indistinguishable from offensive language. Hate speech detection aims to detect ‘abusive’ words, which are considered a type of degradation Nobata et al. (2016); Djuric et al. (2015).

There are several ways to detect offensive language. A supervised learning technique was used Dadvar et al. (2013), which was based on three decisive factors: content, cyberbullying, and user-based features to tackle cyberbullying. A multi-level classification system was developed that extracts features at different conceptual levels and applies pattern recognition to detect flames (rants, taunts, and squalid phrases) in sentences Razavi et al. (2010). Offensive language can also be detected by ferreting out offensive and toxic spans in the texts. A toxic span detection system was developed by leveraging token classification and span prediction techniques that are based on bidirectional encoder representations from transformers (BERT) Chhablani et al. (2021). Multi-lingual detection of offensive spans (MUDES) Ranasinghe and Zampieri (2021) was developed to detect offensive spans in texts. Several systems were developed to identify offensive language as a part of shared tasks conducted to stimulate research in this domain for Arabic, Danish, English, Greek, and Turkish Zampieri et al. (2019, 2020). Consequently, several NLP researchers have worked on developing systems to detect hate speech on social media Kumar et al. (2020); Zampieri et al. (2019). However, most of the work done on OLI is language-specific, focusing on mono-lingual users rather than multi-lingual users, whose social media posts often contain code-mixed text Bali et al. (2014).

For code-mixed sentences, certain researchers analysed the existing state-of-the-art (SoTA) hate speech detection on Hindi-English Rani et al. (2020), while other researchers compared the existing pretrained embeddings for convolutional neural networks (CNN) Banerjee et al. (2020). When it comes to OLI, several systems were developed as a part of shared tasks for OLI in Dravidian languages Chakravarthi et al. (2021); Ghanghor et al. (2021); Yasaswini,Konthala et al. (2021) and Indo-European languages Mandl et al. (2020).

2.3 Multi-Task Learning

The primary objective of an MTL model is to improve the learning of a model for a given task by utilising the knowledge encompassed in other tasks, where all or a subset of the tasks are related Zhang and Yang (2018). The essence of MTL is that solving many tasks together provides a shared inductive bias that leads to more robust and generalisable systems Changpinyo et al. (2018). MTL has long been studied in the domain of machine learning and has applications to neural networks in the NLP domain Caruana (1997). However, to the best of our knowledge, MTL models have not been developed for the Dravidian languages yet.

There is a considerable variety of unique model designs in the field of computer vision, since the requirements in that domain are vast. Tasks such as image classification, object localisation, object segmentation Mou and Zhu (2018), and object tracking are widespread. Moreover, multi-tasking in videos for real-time applications Ouyang et al. (2019) plays a vital role in day-to-day life. Multi-task models have also been used for medical images Zhai et al. (2020).

MTL models are usually designed with shared encoders that can be customised for the preferred tasks. A multi-scale approach was used to combine long short-term memory (LSTM) and CNN for better results, as it adds up the benefits of both CNN (nearby sentiments) and LSTM (sentiments that are further away) Jin et al. (2020). Attention-based mechanisms can also be used in the encoder, such as a multi-layer bidirectional transformer encoder and a knowledge encoder that injects knowledge into the language expressions Zhang et al. (2020a). Not only LSTMs but also gated recurrent units (GRUs) can be used as a part of a multi-task model, depending on the specific tasks; although GRUs have a much simpler architecture than LSTMs, they are capable of forecasting future values based on previous values Zhang et al. (2020b). Such model choices can be adopted in the domain of NLP in order to make the model lighter and faster. A more recent development applies multi-task learning to semi-supervised learning Li et al. (2020): by stacking recurrent neural networks and utilising a sliding window algorithm, the sentiments are transferred on to the next item. An empirical study of MTL models for a variety of biomedical and clinical NLP tasks on BERT was conducted Peng et al. (2020).

MTL models for the domain of NLP are fewer in number. Ideas that were applied in other domains such as computer vision, time series, and semi-supervised learning can be adopted to improve models in NLP. As such models are rarely seen, this paper introduces MTL for SA and OLI, incorporating ideas from other domains of deep learning.

3 Dataset

We make use of the multi-lingual dataset DravidianCodeMix Chakravarthi et al. (2021), consisting of over 60,000 manually annotated YouTube comments. The data set comprises sentences from three code-mixed Dravidian languages: Kannada Hande et al. (2020), Malayalam Chakravarthi et al. (2020a), and Tamil Chakravarthi et al. (2020b, 2021). Each comment is annotated for the tasks of SA and OLI. The code-mixed Kannada dataset consists of 7,273 comments, while the corresponding Malayalam and Tamil code-mixed datasets consist of 12,771 and 43,349 comments, respectively. After removing repetitive sentences, the class-wise distribution of the datasets, which are to be split into train, validation, and test sets, is specified in Table 1.

(a) Kannada
Sl. No.  Sentiment analysis class  Distribution  Offensive language identification class  Distribution
1        Positive                  3,291         Not offensive                            4,121
2        Negative                  1,481         Offensive untargeted                     274
3        Mixed feelings            678           Offensive targeted individual            624
4        Neutral                   820           Offensive targeted group                 411
5        Other languages           1,003         Offensive targeted others                145
6        -                         -             Other languages                          1,698
         Total                     7,273         Total                                    7,273

(b) Malayalam
Sl. No.  Sentiment analysis class  Distribution  Offensive language identification class  Distribution
1        Positive                  5,565         Not offensive                            11,357
2        Negative                  1,394         Offensive untargeted                     171
3        Mixed feelings            794           Offensive targeted individual            179
4        Neutral                   4,063         Offensive targeted group                 113
5        Other languages           955           Other languages                          951
         Total                     12,771        Total                                    12,771

(c) Tamil
Sl. No.  Sentiment analysis class  Distribution  Offensive language identification class  Distribution
1        Positive                  24,501        Not offensive                            31,366
2        Negative                  5,190         Offensive untargeted                     3,594
3        Mixed feelings            4,852         Offensive targeted individual            2,928
4        Neutral                   6,748         Offensive targeted group                 3,110
5        Other languages           2,058         Offensive targeted others                582
6        -                         -             Other languages                          1,769
         Total                     43,349        Total                                    43,349

Table 1: Class-wise distribution of the datasets for Kannada, Malayalam, and Tamil

The class labels in the dataset are as follows:
Sentiment analysis:

  • Positive state: Comment contains an explicit or implicit clue in the text suggesting that the speaker is in a positive state.

  • Negative state: Comment contains an explicit or implicit clue in the text suggesting that the speaker is in a negative state.

  • Mixed feelings: Comment contains explicit or implicit clues of both positive and negative feelings.

  • Neutral state: Comment does not contain an explicit or implicit indicator of the speaker’s emotional state.

  • Not in intended language: For Kannada, if the sentence does not contain Kannada written in Kannada script or Latin script, then it is not Kannada.

Offensive language identification:

  • Not offensive: Comment does not contain offence or profanity.

  • Offensive untargeted: Comment contains offence or profanity without any target. These are comments that contain unacceptable language that does not target anyone.

  • Offensive targeted individual: Comment contains offence or profanity that targets the individual.

  • Offensive targeted group: Comment contains offence or profanity that targets the group.

  • Offensive targeted other: Comment contains offence or profanity that does not belong to any of the previous two categories (e.g., a situation, an issue, an organisation, or an event).

  • Not in intended language: Comment is not in the Kannada language.

The overall class types are similar in all languages. The code-mixed datasets of Kannada and Tamil consist of six classes in OLI, while Malayalam consists of five classes. The offensive targeted others (OTO) class is absent from the Malayalam dataset.

4 Methodology

We explore the suitability of several NLP models for the task of sequence classification. Several pretrained multi-lingual transformer models are investigated to find the best fit for the code-mixed datasets of Tamil, Kannada, and Malayalam. For our purpose, we define single-task learning (STL) models as language models trained on each of the two tasks separately. In this section, we discuss several pretrained transformer-based models that have been used for both STL and MTL. We have implemented the proposed STL and MTL models with the PyTorch library.

4.1 Transformer-Based Models

Recurrent models such as LSTMs and GRUs fail to achieve SoTA results on longer sequences, owing to memory limitations while batching. While factorisation and conditional computation approaches Shazeer et al. (2017); Kuchaiev and Ginsburg (2017) have improved the efficiency of such models, the underlying issue of sequential computation persists. To overcome this, the transformer was proposed Vaswani et al. (2017), an architecture that completely shuns recurrence and resorts to attention mechanisms. An architecture built on attention mechanisms proves to be much more efficient than recurrent architectures. The transformer follows a stacked encoder-decoder architecture with multi-headed attention and feed-forward layers. Self-attention is an attention mechanism relating different positions of a single sequence to compute a representation of that sequence.

Scaled dot-product attention is computed using three vectors derived from each of the encoder’s input vectors: Query, Key, and Value vectors. Key and Value assume dimensions $d_k$ and $d_v$, respectively. A softmax function is applied to the scaled dot products of the queries and keys in order to compute the weights of the values. In practice, the attention function is computed on a set of queries simultaneously, packed into a matrix $Q$. The keys and values are likewise packed into matrices $K$ and $V$. The matrix of outputs is computed as follows:

\[ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V \]

The above dot-product attention is preferred over additive attention owing to its practical efficiency in both space and time. Self-attention is computed several times in parallel in the transformer’s architecture; this is referred to as multi-head attention. This approach jointly attends to information from different representations at different positions.
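As a minimal sketch of the scaled dot-product attention described above, the following pure-Python function computes softmax(QK^T / sqrt(d_k)) V for small list-of-lists matrices. The toy query, key, and value matrices are illustrative assumptions rather than values from our models, and a practical implementation would use batched tensor operations instead.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V for list-of-lists matrices."""
    d_k = len(K[0])
    d_v = len(V[0])
    outputs = []
    for q in Q:
        # Scaled dot product of this query with every key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        outputs.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(d_v)])
    return outputs

# Toy example (assumed values): three tokens with d_k = d_v = 2.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
Z = scaled_dot_product_attention(Q, K, V)
```

Each row of Z is a convex combination of the rows of V, so a query that scores all keys equally simply returns the mean of the value vectors.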

Sentiment analysis  Kannada  Malayalam  Tamil    Offensive language identification  Kannada  Malayalam  Tamil
Positive            334      544        2,426    Not offensive                      407      1,134      3,148
Negative            164      135        529      Offensive untargeted               27       28         336
Mixed feelings      63       81         465      Offensive targeted individual      82       16         316
Neutral             80       424        702      Offensive targeted group           44       10         305
Other language      87       94         213      Offensive targeted others          14       -          52
-                   -        -          -        Other language                     154      90         178
Total               728      1,278      4,335    Total                              728      1,278      4,335
Table 2: Class-wise distribution of the test sets of Kannada, Malayalam, and Tamil datasets
Figure 1: A transformer architecture by Vaswani et al. (2017)

Consider the phrase “Ask Powerful Questions”. To calculate the self-attention of the first word, Ask, a score is computed for every word in the phrase with respect to Ask; these scores determine how much importance is placed on the other words while Ask is being encoded. Each score is the dot product of the query vector of Ask with the key vector of the respective word in the input sentence. The scores are divided by 8, which is the square root of the dimension of the key vector ($d_k = 64$), and are then normalised using the softmax activation. The normalised scores are multiplied by the value vectors and summed up to obtain the self-attention vector for Ask, which is then passed to the feed-forward network (FFN) as input. The vectors for the other words are calculated in a similar way.

Softmax is an activation function that transforms a vector of numbers into a vector of probabilities. We use the softmax function when we want to represent a probability distribution over a discrete variable with $K$ possible values. It is a generalisation of the sigmoid activation function, which is used to represent the probability distribution over a binary variable. The softmax activation function over $K$ classes is represented as follows:

\[ \mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}, \quad i = 1, \ldots, K \]

The sigmoid is the binary counterpart of the softmax activation function. It is given by:

\[ \sigma(x) = \frac{1}{1 + e^{-x}} \]
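The relationship between the two activations can be checked numerically. The sketch below is illustrative only (the logits are assumed values): it implements softmax and sigmoid and shows that a two-class softmax over the logits (x, 0) gives the same probability as the sigmoid of x.

```python
import math

def softmax(z):
    """Softmax over a list of K logits; subtracting the max avoids overflow."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x):
    """Logistic sigmoid for a single logit."""
    return 1.0 / (1.0 + math.exp(-x))

probs = softmax([2.0, 1.0, 0.1])   # K = 3 classes, assumed logits
two_class = softmax([1.5, 0.0])    # K = 2 classes with logits (x, 0)
binary = sigmoid(1.5)              # same probability as two_class[0]
```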


Along with the attention sub-layers in the transformer block, a fully connected FFN is present in each of the layers of its encoder and decoder. It consists of two linear transformations with a rectified linear unit (ReLU) activation in between:

\[ \mathrm{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2 \]

ReLU, a rectified linear activation function, is a piece-wise linear function that outputs the input directly if it is positive and outputs zero otherwise. ReLU mitigates the vanishing gradient problem, thus allowing models to learn faster and perform better. It is computed as follows:

\[ \mathrm{ReLU}(x) = \max(0, x) \]
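A minimal sketch of the position-wise FFN with a ReLU between two linear transformations is given below. The weight matrices, biases, and the toy sizes (model dimension 2, inner dimension 3) are assumptions chosen so that the result can be verified by hand.

```python
def relu(x):
    """Return x if it is positive, otherwise zero."""
    return max(0.0, x)

def linear(x, W, b):
    """Return x W + b for a row vector x, a matrix W (one row per input), and a bias b."""
    return [sum(xi * W[i][j] for i, xi in enumerate(x)) + b[j] for j in range(len(b))]

def ffn(x, W1, b1, W2, b2):
    """Two linear transformations with a ReLU activation in between."""
    hidden = [relu(h) for h in linear(x, W1, b1)]
    return linear(hidden, W2, b2)

# Assumed toy parameters.
W1 = [[1.0, -1.0, 0.5], [0.0, 2.0, -1.0]]
b1 = [0.0, 0.0, 0.0]
W2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b2 = [0.1, -0.1]
y = ffn([1.0, 1.0], W1, b1, W2, b2)
```

Here the hidden pre-activations are (1, 1, -0.5); the ReLU zeroes the negative entry, so only the first two hidden units contribute to the output.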


While attention mechanism is a major improvement over recurrence-based seq2seq models, it does have its own limitations. As attention can only deal with fixed-length strings, the text has to be split into a number of chunks before feeding them into the inputs. The chunking inadvertently results in context fragmentation. This means that a sentence, if split in the middle, will result in significant loss of context. It would mean that the text is split without considering sentence or any other semantic boundary.

Due to the absence of recurrent or convolutional layers, some information about the relative or absolute position of the tokens must be supplied so that the model can make use of the sequence order. Thus, positional encodings are added to the input embeddings, as shown in Fig. 1. Most SoTA language models use the transformer block as the fundamental building block of each layer.
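One common choice, used in the original transformer of Vaswani et al. (2017), is the fixed sinusoidal encoding, in which even dimensions use a sine and odd dimensions use a cosine of a position-dependent angle. The sketch below illustrates that scheme under assumed toy sizes; note that BERT-style models learn their position embeddings instead, so this is not the encoding used by the models in this paper.

```python
import math

def positional_encoding(num_positions, d_model):
    """Sinusoidal encodings: sin on even dimensions, cos on odd dimensions."""
    pe = []
    for pos in range(num_positions):
        row = []
        for i in range(d_model):
            # Dimensions 2k and 2k+1 share the same wavelength.
            angle = pos / (10000 ** ((2 * (i // 2)) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        pe.append(row)
    return pe

pe = positional_encoding(num_positions=4, d_model=8)
token_embedding = [0.5] * 8   # hypothetical embedding of the token at position 2
model_input = [t + p for t, p in zip(token_embedding, pe[2])]
```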

4.1.1 BERT

Computer vision researchers have repeatedly demonstrated the advantages of transfer learning, that is, pretraining a neural network model on a known task Deng et al. (2009).

BERT, a language representation model, is designed to pretrain deep bidirectional representations from unlabelled text by jointly conditioning on both left and right context in all layers Devlin et al. (2019). It obtained SoTA results on 11 NLP tasks.

Figure 2: Illustration of BERT input representations. The Tamil sentence translates to “This is a Flower. It is blue in colour.” The input embeddings are the sum of token, segmentation, and positional embeddings Devlin et al. (2019)


BERT is pretrained on two unsupervised tasks:

  • Masked Language Modelling (MLM):

    Standard LMs Radford (2018) can only be trained left-to-right or right-to-left, because bidirectional conditioning would allow each word to indirectly see itself. To pretrain deep bidirectional representations, a certain percentage of the input tokens are masked at random, and the model then predicts those masked tokens. This procedure is referred to as “Masked Language Modelling” and was previously known as the Cloze task Taylor (1953). Before feeding word sequences into BERT, 15% of the words in each sequence are replaced with a [MASK] token. The model then attempts to predict the original value of the masked words, based on the context provided by the other, non-masked words in the sequence.

  • Next-Sentence Prediction (NSP):

    One of the primary drawbacks of LMs is their inability to directly capture the relationship between two sentences. However, important downstream tasks such as question answering (QA) and natural language inference (NLI) depend on the relationship between two sentences. To overcome this obstacle, BERT is pretrained on a binarised next-sentence prediction task. Specifically, when choosing the sentences A and B for every pretraining example, 50% of the time B is the actual next sentence that follows A (labelled as Is Next), and 50% of the time it is a random sentence from the corpus (labelled as Not Next). In Fig. 2, the input consists of two Tamil sentences. After tokenising the sentences, we observe that there are 13 input tokens, including the [CLS] and [SEP] tokens. $E_0, E_1, \ldots, E_{12}$ represent the positional embeddings, while $E_A$ and $E_B$ represent the segment embeddings that mark whether a token belongs to the first or the second sentence.
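The two pretraining tasks can be sketched as simple data-preparation routines. The code below is an illustration under stated simplifications: it replaces roughly 15% of the tokens with [MASK], as described above, and omits the random-replacement and keep-unchanged cases of the full BERT recipe. The function names, token strings, and sentence list are hypothetical.

```python
import random

def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Replace about mask_rate of the tokens with [MASK]; return (masked, labels)."""
    rng = random.Random(seed)
    n_mask = max(1, round(len(tokens) * mask_rate))
    positions = set(rng.sample(range(len(tokens)), n_mask))
    masked = ["[MASK]" if i in positions else t for i, t in enumerate(tokens)]
    labels = {i: tokens[i] for i in positions}   # targets the model must recover
    return masked, labels

def make_nsp_example(sentences, index, rng):
    """Pair sentence A with its true successor (Is Next) or a random other sentence (Not Next)."""
    a = sentences[index]
    if rng.random() < 0.5:
        return a, sentences[index + 1], "Is Next"
    candidates = [s for j, s in enumerate(sentences) if j not in (index, index + 1)]
    return a, rng.choice(candidates), "Not Next"

tokens = [f"w{i}" for i in range(20)]            # placeholder tokens
masked, labels = mask_tokens(tokens)

sentences = ["s0", "s1", "s2", "s3", "s4"]       # placeholder corpus
a, b, label = make_nsp_example(sentences, 1, random.Random(1))
```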

Figure 3: Illustration of fine-tuning BERT on single sentence classification tasks such as SA and OLI Devlin et al. (2019)


We plug the task-specific inputs and outputs into BERT and fine-tune all parameters end to end. At the input, sentences A and B from pretraining correspond to a degenerate pair consisting of a single text and an empty second segment in sequence classification. The [CLS] representation is fed into an output layer for classification, as observed in Fig. 3.
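To make the input format concrete, the sketch below assembles the token, segment, and position indices whose embeddings are summed to form the BERT input. It is a hypothetical helper rather than the tokeniser used in our experiments, and the word pieces are placeholders. A single comment, as in SA and OLI, is the degenerate case with an empty second segment.

```python
def build_inputs(tokens_a, tokens_b=None):
    """Return tokens, segment ids, and position ids for one or two segments."""
    tokens = ["[CLS]"] + list(tokens_a) + ["[SEP]"]
    segment_ids = [0] * len(tokens)
    if tokens_b:
        tokens += list(tokens_b) + ["[SEP]"]
        segment_ids += [1] * (len(tokens_b) + 1)
    position_ids = list(range(len(tokens)))
    return tokens, segment_ids, position_ids

# Sentence pair, as in pretraining: 5 + 5 word pieces plus [CLS] and two [SEP] = 13 tokens.
pair = build_inputs(["a1", "a2", "a3", "a4", "a5"], ["b1", "b2", "b3", "b4", "b5"])

# Single comment, as in fine-tuning for SA or OLI.
single = build_inputs(["c1", "c2", "c3"])
```

In the single-comment case every segment id is 0, and the hidden state at position 0 ([CLS]) is the one passed to the classification layer.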

4.1.2 DistilBERT

DistilBERT was proposed Sanh et al. (2019) as a method to pretrain a smaller language representation model that serves the same purpose as other large-scale models like BERT. DistilBERT is 60% faster and has 40% fewer parameters than BERT. It also preserves 97% of BERT performance in several downstream tasks by leveraging the knowledge distillation approach. Knowledge distillation is a refining approach in which a smaller model—the student—is trained to recreate the performance of a larger model—the teacher. DistilBERT is trained on a triple loss with student initialisation.

Triple loss is a linear combination of the distillation loss $L_{ce}$ with the supervised training loss (here the MLM loss $L_{mlm}$) and a cosine embedding loss ($L_{cos}$) that aligns the directions of the student and teacher hidden state vectors, where:

\[ L_{ce} = -\sum_{i} t_i \log(s_i) \]

where $t_i$ (resp. $s_i$) is a probability estimated by the teacher (resp. the student) Sanh et al. (2019).
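The sketch below illustrates the three terms of the triple loss on toy inputs. The temperature value, the loss weights, and the logits are assumptions made for illustration; they are not the settings used to train DistilBERT.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax with a temperature that softens the distribution."""
    scaled = [v / temperature for v in logits]
    m = max(scaled)
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """Cross-entropy between teacher probabilities t_i and student probabilities s_i."""
    t = softmax(teacher_logits, temperature)
    s = softmax(student_logits, temperature)
    return -sum(ti * math.log(si) for ti, si in zip(t, s))

def cosine_loss(h_teacher, h_student):
    """1 - cos(h_teacher, h_student): zero when the hidden states point the same way."""
    dot = sum(a * b for a, b in zip(h_teacher, h_student))
    norm = math.sqrt(sum(a * a for a in h_teacher)) * math.sqrt(sum(b * b for b in h_student))
    return 1.0 - dot / norm

def triple_loss(l_ce, l_mlm, l_cos, weights=(1.0, 1.0, 1.0)):
    """Linear combination of the three losses (weights are hypothetical)."""
    w_ce, w_mlm, w_cos = weights
    return w_ce * l_ce + w_mlm * l_mlm + w_cos * l_cos

teacher = [2.0, 0.5, -1.0]
matched = distillation_loss(teacher, teacher)
mismatched = distillation_loss(teacher, [-1.0, 0.5, 2.0])
aligned = cosine_loss([1.0, 2.0], [2.0, 4.0])
```

The distillation term is smallest when the student reproduces the teacher distribution, which is why `matched` does not exceed `mismatched`.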

We use a pretrained multi-lingual DistilBERT model from the Hugging Face Transformers library, distilbert-base-multilingual-cased, which is distilled from the mBERT model checkpoint.

4.1.3 ALBERT

Present SoTA LMs consist of hundreds of millions, if not billions, of parameters. Scaling such models further is restricted by the memory limitations of compute hardware such as GPUs or TPUs. It is also found that increasing the number of hidden layers in the BERT-large model (340M parameters) can lead to worse performance. Several parameter reduction techniques have been attempted to reduce the size of models without affecting their performance. Thus, a lite BERT (ALBERT) for self-supervised learning of language representations was proposed. ALBERT Lan et al. (2019) overcomes the large memory consumption by incorporating several memory reduction techniques: factorised embedding parameterisation, cross-layer parameter sharing, and a sentence order prediction objective.

  • Factorised embedding parameterisation

    WordPiece embeddings are designed to learn context-independent representations, whereas hidden-layer embeddings are designed to learn context-dependent representations. BERT relies heavily on learning context-dependent representations in the hidden layers. Because the embedding size E is tied to the hidden layer size H, the embedding matrix must grow with H, which results in models with billions of parameters. However, most of these parameters are only sparsely updated during training, indicating that many of them are not useful. Thus, ALBERT reduces memory consumption by factorising the original embedding parameters P (the product of the vocabulary size V and the hidden layer size H) into two smaller matrices:

    \[ O(V \times H) \rightarrow O(V \times E + E \times H) \]

    where E represents the size of the low-dimensional embedding space. In BERT, E = H, while in ALBERT, $H \gg E$, so the number of parameters is greatly reduced.

  • Cross-layer parameter sharing

    ALBERT aims to improve parameter efficiency by sharing all parameters across all layers. Hence, the FFN and attention parameters are shared across all layers, as shown in Fig. 4.

    Figure 4: No shared parameters in BERT vs cross-layer parameter sharing in ALBERT
  • Sentence order prediction (SOP)

    ALBERT uses MLM, similar to BERT, masking n-grams of up to three words (max(n_gram) = 3). ALBERT also uses SOP to compute an inter-sentence coherence loss. Consider two consecutive segments taken from the same document. The positive example keeps the segments in their correct order, while the negative example swaps their order. SOP results in the model learning finer-grained distinctions about coherence properties, while additionally solving the NSP task to a reasonable degree. Thus, ALBERT models are more effective in improving downstream tasks’ performance for multi-sentence encoding tasks.

By incorporating these features and loss functions, ALBERT requires far fewer parameters than the base and large versions of BERT proposed earlier in Devlin et al. (2019), without hurting performance.
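The effect of the two parameter reduction techniques can be illustrated with simple arithmetic. In the sketch below, the vocabulary size, hidden size, embedding size, and per-layer parameter count are assumed round numbers chosen for illustration, not the exact configuration of any released checkpoint.

```python
def embedding_params_tied(vocab_size, hidden_size):
    """Embedding parameters when E = H: a single V x H matrix."""
    return vocab_size * hidden_size

def embedding_params_factorised(vocab_size, embedding_size, hidden_size):
    """Embedding parameters after factorisation: V x E plus E x H."""
    return vocab_size * embedding_size + embedding_size * hidden_size

def encoder_params(per_layer_params, num_layers, shared):
    """Encoder parameters with or without cross-layer parameter sharing."""
    return per_layer_params if shared else per_layer_params * num_layers

V, H, E = 30000, 768, 128                            # assumed sizes
tied = embedding_params_tied(V, H)                   # 23,040,000
factorised = embedding_params_factorised(V, E, H)    # 3,938,304

unshared = encoder_params(7_000_000, 12, shared=False)   # hypothetical per-layer count
shared = encoder_params(7_000_000, 12, shared=True)
```

With these assumed sizes, factorisation shrinks the embedding parameters by roughly a factor of six, and sharing reduces the encoder parameters by the number of layers.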

4.1.4 RoBERTa

Robustly optimised BERT (RoBERTa) Liu et al. (2019) is a BERT-based model. The main difference lies in the masking technique. BERT performs masking once during data preprocessing, which is essentially a static mask; the resulting model tends to see the same masks across training epochs. RoBERTa instead uses dynamic masking, in which a new masking pattern is generated every time an input sequence is fed to the model, and this plays a critical role during pretraining. The encoding used was byte-pair encoding (BPE) Sennrich et al. (2016), a hybrid between character-level and word-level encoding that relies on subwords instead of full words and allows easy handling of huge text corpora. BERT was additionally trained with an auxiliary NSP loss, and Devlin et al. (2019) observed that removing this loss hurt performance, with significant degradation of results on QNLI, MNLI, and SQuAD; Liu et al. (2019), however, found that the NSP loss could be dropped without harming downstream performance.
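The contrast between static and dynamic masking can be sketched in a few lines. This is a simplified illustration: each token is masked independently with a fixed probability, whereas the actual pretraining recipe also replaces some selected tokens with random or unchanged tokens. The example sentence, the masking probability, and the function name are assumptions made for illustration.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, mask_prob, rng):
    """Replace each token with [MASK] independently with probability mask_prob."""
    return [MASK if rng.random() < mask_prob else tok for tok in tokens]


tokens = "the movie was really good and the songs were great".split()

# Static masking: the pattern is drawn once during preprocessing and reused.
static_view = mask_tokens(tokens, 0.15, random.Random(0))
static_epochs = [static_view for _ in range(3)]

# Dynamic masking: a fresh pattern is drawn every time the sequence is fed in.
rng = random.Random(0)
dynamic_epochs = [mask_tokens(tokens, 0.15, rng) for _ in range(3)]

for epoch, (s, d) in enumerate(zip(static_epochs, dynamic_epochs)):
    print(epoch, " ".join(s), "|", " ".join(d))
```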

4.1.5 XLM

The XLM model was proposed in cross-lingual language model pretraining Dai et al. (2019). This model uses a shared vocabulary for different languages. BPE was used to tokenise the text corpus. Causal language modelling (CLM) maximises the probability of a token appearing at the t-th position in a given sequence, given the preceding tokens. MLM maximises the probability of a given masked token appearing at the t-th position in a given sequence. Both CLM and MLM work on mono-lingual data. To exploit parallel data as well, XLM adds translation language modelling (TLM): the sequences are taken from translation data, and tokens are randomly masked in both the source and the target sentence. Similar to other transformer-based mono-lingual models, XLM was fine-tuned on the XNLI data set for cross-lingual classification. The downstream tasks on which it was evaluated include cross-lingual classification, neural machine translation, and LMs for low-resource languages.

4.1.6 XLNet

BERT performed extremely well on almost every language modelling task. It was a revolutionary model, as it could be fine-tuned for any downstream task. However, it came with a few flaws of its own. BERT replaces random words in a sentence with a special [MASK] token and attempts to predict the original word. XLNet Yang et al. (2019) pointed out certain major issues with this process. The [MASK] token used in pretraining does not appear when the model is fine-tuned or applied to downstream tasks. This mismatch can cause problems, because the [MASK] tokens are not replaced once pretraining ends, and the model finds it difficult to learn when the input sentence contains no [MASK] tokens. BERT also generates its predictions independently, meaning that it does not account for the dependencies among its predictions.

XLNet uses autoregressive (AR) language modelling, which aims to estimate the probability distribution of a text corpus without using the [MASK] token or parallel independent predictions. This is possible because AR modelling provides a natural way to apply the product rule when factorising the joint probability of the predicted tokens. XLNet uses a particular type of language modelling called “permutation language modelling”, in which the tokens of a sentence are predicted in random order rather than in sequential order. The model is thereby forced to learn bidirectional dependencies between all combinations of the input. It is important to note that only the factorisation order is permuted and not the sequence order; the original order is retained through the positional embeddings. For sequence classification, the model is fine-tuned as a sentence classifier: it does not predict tokens but predicts the sentiment from the embedding. The architecture of XLNet uses Transformer-XL as its baseline Dai et al. (2019). Transformer-XL adds recurrence at the segment level instead of the word level. Hence, fine-tuning is carried out by caching the hidden states of the previous segments and passing them as keys or values when processing the current sequence. Transformer-XL also uses relative positional embeddings instead of absolute positional embeddings by encoding the relative distance between the words.
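To make the idea of permuting only the factorisation order concrete, the following sketch samples one prediction order and lists which positions are visible at each step. It is a toy illustration rather than XLNet's actual attention-mask implementation; the function name and the example tokens are hypothetical.

```python
import random


def permutation_lm_steps(tokens, rng):
    """Sample a factorisation order and list the context visible at each step."""
    order = list(range(len(tokens)))
    rng.shuffle(order)                    # permute the prediction order only
    steps = []
    for t, position in enumerate(order):
        visible = sorted(order[:t])       # positions predicted in earlier steps
        steps.append((position, visible))
    return order, steps


tokens = ["padam", "romba", "nalla", "irukku"]  # hypothetical code-mixed input
order, steps = permutation_lm_steps(tokens, random.Random(0))

for position, visible in steps:
    context = [tokens[i] for i in visible]
    print(f"predict {tokens[position]!r} at position {position} given {context}")
```

The token list itself is never reordered; each token keeps its original position index, which is how the sequence order stays available to the model.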

4.1.7 XLM-RoBERTa

XLM-RoBERTa was proposed as an unsupervised cross-lingual representation approach Conneau et al. (2020), and it significantly outperformed multi-lingual BERT on a variety of cross-lingual benchmarks. XLM-R was trained on CommonCrawl data covering 100 languages and fine-tuned on different downstream tasks for evaluation and inference. On the XNLI data set, it was evaluated for cross-lingual classification, including settings that rely on machine translation from English to other languages and vice versa. It was also evaluated on named entity recognition (NER) and cross-lingual QA. It achieved strong results on the standard general language understanding evaluation (GLUE) benchmark and SoTA results on several tasks.

4.1.8 CharacterBERT

The success of BERT has led many language representation models to adopt the transformer architecture as their primary building block, consequently inheriting the wordpiece tokenisation. For specialised domains, this can result in an intricate model that focuses on subwords rather than words. CharacterBERT is therefore a new variant of BERT that drops the wordpiece system completely and instead uses a Character-CNN module to represent entire words by consulting their characters El Boukkouri et al. (2020), as shown in Fig. 5.

CharacterBERT is based on the “base-uncased” version of BERT (L = 12, H = 768, A = 12, and total parameters = 109.5 M). The resulting CharacterBERT architecture has 104.6 M parameters. Using the Character-CNN yields a smaller overall model despite the more complex character module, because BERT’s wordpiece matrix holds 30K × 768-D vectors, while CharacterBERT uses 16-D character embeddings with mostly small CNNs. Seven 1-D CNNs with the following filters are used: [1, 32], [2, 32], [3, 64], [4, 128], [5, 256], [6, 512], and [7, 1024]. Fig. 5 uses the Kannada word chennagide, written in Latin script and translatable as “great”, to compare how CharacterBERT focuses on the entire word rather than on subwords, as BERT does.

Figure 5: Comparison of the context-independent representation systems and tokenisation approach in BERT and characterBERT El Boukkouri et al. (2020)

4.2 Multi-Task Learning

We generally focus on optimising a particular metric to score well on a certain benchmark. Thus, we train a single model or an ensemble of models to perform the desired task, and fine-tune it until we no longer see any major increase in performance. The downside is that we lose much of the information that could have helped the model train better. If there are multiple related tasks, we can devise an MTL model that improves its performance on each of them by learning shared representations between the related tasks. We aim to achieve better F1-scores by leveraging this approach. Since both SA and OLI are sequence classification tasks, we believe that an MTL model trained on these two related tasks will achieve better results than models trained separately on each task. Most MTL methodologies in NLP build on the success of MTL models in computer vision. We use two novel approaches for MTL in NLP, which are discussed in the subsequent sections.

4.2.1 Hard Parameter Sharing

This is one of the most common approaches to MTL. Hard parameter sharing of hidden layers Caruana (1997) is applied by sharing the hidden layers between the tasks while keeping the task-specific output layers separate. Hard parameter sharing is represented in Fig. 6. As the two tasks are closely related, each task is expected to benefit from the representations learned for the other. The more tasks we learn simultaneously, the more general the representation that the model has to find, and the lower the chance of overfitting Ruder (2017). As shown in Fig. 6, the model has two outputs. We use different loss functions to optimise the performance of the models.

Figure 6: Hard parameter sharing for sentiment analysis and offensive language identification

A loss function takes a pair of inputs (output and target) and computes a value that estimates how far the output is from the target. Before back-propagating the errors, we first clear the gradients to prevent them from accumulating. Instead of back-propagating the two losses separately and letting the optimiser update the parameters for each task in turn, we add the losses of the two tasks once the loss function has calculated them, and the optimiser then updates the parameters for both tasks simultaneously:

$$\mathcal{L} = \mathcal{L}_{SA} + \mathcal{L}_{OLI}$$

where $\mathcal{L}_{SA}$ and $\mathcal{L}_{OLI}$ are the losses for SA and OLI, respectively.
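A minimal PyTorch sketch of one training step with hard parameter sharing is given below. It assumes a small `nn.EmbeddingBag` as a stand-in for the pooled output of a pretrained LM, random toy data, and an arbitrary learning rate; the class name is hypothetical. The 128-unit shared layer, ReLU, dropout of 0.4, AdamW, and equal loss weights follow the settings listed for our experiments.

```python
import torch
from torch import nn


class HardSharedClassifier(nn.Module):
    """Shared encoder and hidden layer, with one output head per task."""

    def __init__(self, vocab_size=1000, hidden_size=64, n_sa=5, n_oli=6):
        super().__init__()
        # Stand-in for the pooler output of a pretrained LM (assumption).
        self.encoder = nn.EmbeddingBag(vocab_size, hidden_size, mode="mean")
        self.shared = nn.Sequential(
            nn.Linear(hidden_size, 128), nn.ReLU(), nn.Dropout(0.4)
        )
        self.sa_head = nn.Linear(128, n_sa)     # sentiment analysis head
        self.oli_head = nn.Linear(128, n_oli)   # offensive language head

    def forward(self, token_ids):
        hidden = self.shared(self.encoder(token_ids))
        return self.sa_head(hidden), self.oli_head(hidden)


torch.manual_seed(0)
model = HardSharedClassifier()
optimiser = torch.optim.AdamW(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# Toy batch: 8 "sentences" of 12 token ids, with one label per task.
token_ids = torch.randint(0, 1000, (8, 12))
sa_labels = torch.randint(0, 5, (8,))
oli_labels = torch.randint(0, 6, (8,))

optimiser.zero_grad()                        # clear accumulated gradients
sa_logits, oli_logits = model(token_ids)
loss_sa = criterion(sa_logits, sa_labels)
loss_oli = criterion(oli_logits, oli_labels)
loss = loss_sa + loss_oli                    # equal loss weights [1, 1]
loss.backward()                              # one backward pass for both tasks
optimiser.step()                             # shared and task layers updated together
```

Because both heads read the same hidden representation, the gradient reaching the shared layer is the sum of the gradients from the two tasks.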


4.2.2 Soft Parameter Sharing

This is another approach to MTL, in which constraints are added to encourage similarity among related parameters. Unlike hard parameter sharing, each task has its own model with its own parameters, and the distances between the parameters of the different models are regularised to encourage the parameters to be similar. This approach gives the tasks more flexibility by only loosely coupling the representations of the shared space. We use the Frobenius norm as the additional loss term in our experiments.

Figure 7: Soft parameter sharing for sentiment analysis and offensive language identification
  1. Additional Loss Term

    This approach to soft parameter sharing imposes similarity between corresponding parameters by augmenting the loss function with an additional loss term, as we have to learn two tasks. Let $W_{j}^{(i)}$ denote the parameters of task $j$ for the $i$-th layer.

    $$\mathcal{L} = l + \sum_{i} \lambda_{i} \left\| W_{1}^{(i)} - W_{2}^{(i)} \right\|_{F}^{2}$$

    where $l$ is the original loss function for both of the tasks, $\lambda_{i}$ is the regularisation weight for the $i$-th layer, and $\|\cdot\|_{F}^{2}$ is the squared Frobenius norm. A similar approach was used for low-resource dependency parsing, by employing cross-lingual parameter sharing whose learning objective involved regularisation using the Frobenius norm Duong et al. (2015).

  2. Trace Norm

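    We intend to penalise the trace norm of the matrix obtained by stacking the layer-$i$ parameters of the two tasks, $W_{1}^{(i)}$ and $W_{2}^{(i)}$. Such a penalty shrinks the coefficient estimates, thus resulting in dimension reduction Yang and Hospedales (2017). The trace norm can be defined as the sum of singular values:

    $$\| W \|_{*} = \sum_{k} \sigma_{k}$$

    where $\sigma_{k}$ are the singular values of the stacked matrix $W$.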
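The additional loss term with the squared Frobenius norm can be sketched in plain Python as follows. This is only an illustration under stated assumptions: weight matrices are represented as nested lists, a single regularisation weight `lam` is shared by all layers, and the function names are hypothetical. It covers the additional loss term only, not the trace norm penalty.

```python
def squared_frobenius_distance(w1, w2):
    """Sum of squared element-wise differences between two equally shaped matrices."""
    return sum(
        (a - b) ** 2
        for row1, row2 in zip(w1, w2)
        for a, b in zip(row1, row2)
    )


def soft_sharing_loss(task_loss, layers_task1, layers_task2, lam=0.01):
    """Original loss plus lam times the summed layer-wise parameter distance."""
    penalty = sum(
        squared_frobenius_distance(w1, w2)
        for w1, w2 in zip(layers_task1, layers_task2)
    )
    return task_loss + lam * penalty


# One corresponding layer from each task-specific model (toy values).
w_sa = [[1.0, 2.0], [3.0, 4.0]]
w_oli = [[1.0, 0.0], [3.0, 2.0]]

total = soft_sharing_loss(1.5, [w_sa], [w_oli], lam=0.1)
print(total)
```

The penalty is zero when the two models have identical weights and grows as they drift apart, which is what loosely couples the task-specific models.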


4.3 Loss Functions

A loss function evaluates how well a given model fits the given data. If the predictions deviate from the expected results, the loss function outputs a high value; its output decreases as the predictions start to match the expected results. With the help of an optimisation function, the loss function is used to reduce the error in the predictions. In this section, we discuss the loss functions that are used in our experiments.

4.3.1 Cross-Entropy Loss

This is one of the most commonly used loss functions for classification tasks. It originates from information theory and entropy, and it measures the difference between two probability distributions for a given random variable. Entropy can be defined as the number of bits required to transmit a randomly selected event from a probability distribution. “Cross-entropy is the average number of bits needed to encode data from a source coming from a distribution p when we use a model q” Murphy (2012). The intuition behind this definition becomes clear if we consider the target as an underlying probability distribution P and an approximation of the target distribution as Q. Let the cross-entropy of Q from P be denoted H(P, Q). It is calculated as follows:

$$H(P, Q) = -\sum_{x} P(x) \log Q(x)$$

For multi-class classification, it is computed as follows:

$$CE = -\sum_{i}^{C} t_{i} \log (s_{i})$$

where $t_{i}$ and $s_{i}$ are the ground truth and the predicted score for each class i in C.
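As a small worked example, the following sketch computes the cross-entropy between a one-hot target and a predicted distribution using only the standard library. The function names, the `eps` clipping constant, and the example values are assumptions made for illustration; the natural logarithm is used, so the result is in nats rather than bits.

```python
import math


def softmax(logits):
    """Turn raw scores into a probability distribution."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]   # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target, predicted, eps=1e-12):
    """H(P, Q) = -sum_x P(x) * log Q(x), with Q clipped away from zero."""
    return -sum(p * math.log(max(q, eps)) for p, q in zip(target, predicted))


target = [0, 1, 0]                 # one-hot ground truth for a 3-class problem
predicted = [0.1, 0.7, 0.2]        # model output after softmax
loss = cross_entropy(target, predicted)
print(round(loss, 4))
```

With a one-hot target, only the term for the true class survives, so the loss reduces to the negative log-probability assigned to that class.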
Considering Class Imbalance

As specified earlier, the datasets are class-imbalanced and not equally distributed, as shown in Table 1. Thus, we compute a weight for each class and pass the weights as a tensor to the loss function when computing the loss. The class weights are inversely related to class frequency, so that errors on underrepresented classes are penalised more heavily.


where C is the number of classes. The resultant tensor is
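One common way to obtain inverse class weights is sketched below. The exact weighting formula is an assumption here: the sketch uses w_c = N / (C · n_c), where N is the number of examples, C the number of classes, and n_c the count of class c, which may differ from the formula used in our experiments. The function name and the toy labels are hypothetical.

```python
from collections import Counter


def inverse_class_weights(labels, n_classes):
    """Assumed formula w_c = N / (C * n_c): rarer classes get larger weights."""
    counts = Counter(labels)
    n_examples = len(labels)
    return [
        n_examples / (n_classes * counts[c]) if counts[c] else 0.0
        for c in range(n_classes)
    ]


# Toy imbalanced label list: class 0 is frequent, class 2 is rare.
labels = [0] * 6 + [1] * 3 + [2] * 1
weights = inverse_class_weights(labels, n_classes=3)
print([round(w, 3) for w in weights])
```

In PyTorch, such a list would typically be converted to a tensor and supplied to the loss through its class-weight argument.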

4.3.2 Multi-Class Hinge Loss

The hinge loss function was initially proposed as an alternative to cross-entropy for binary classification. It was primarily developed for use with SVM models. The target values are in the set {–1, 1}. The main purpose of hinge loss is to maximise the margin of the decision boundary between the two groups that are to be discriminated, that is, in a binary classification problem. For a linear classifier with score $y = w \cdot x$ and target $t$, the hinge loss is computed as follows:

$$\ell(y) = \max(0, 1 - t \cdot y)$$
The squared hinge loss function was developed as an extension, which computes the square of the hinge loss. It smooths the surface of the error function, thus making it easier to work with numerically.

Categorical hinge loss, or multi-class hinge loss, is computed as follows: “For a prediction y, take all values unequal to t, and compute the loss. Eventually, sum them all together to find the multi-class hinge loss” Weston and Watkins (1999); Zhang et al. (2014); Rakhlin (2016); Shalev-Shwartz and Ben-David (2014). The regular targets are converted into categorical data, and the loss function is calculated as follows:

$$\ell(y) = \sum_{y \neq t} \max(0, 1 + w_{y} \cdot x - w_{t} \cdot x)$$

where $w_{t} \cdot x$ is the score of the target class $t$ and $w_{y} \cdot x$ is the score of any other class $y$.
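The three hinge variants discussed in this section can be sketched as follows. The multi-class version assumes the summation form in which every wrong class that comes within a margin of 1 of the target score contributes to the loss; function names and example scores are chosen for illustration.

```python
def binary_hinge(score, target):
    """Hinge loss for a raw score and a target in {-1, +1}."""
    return max(0.0, 1.0 - target * score)


def squared_hinge(score, target):
    """Square of the binary hinge loss."""
    return binary_hinge(score, target) ** 2


def multiclass_hinge(scores, target_index):
    """Sum of margin violations over all classes other than the target."""
    true_score = scores[target_index]
    return sum(
        max(0.0, 1.0 + s - true_score)
        for j, s in enumerate(scores)
        if j != target_index
    )


print(binary_hinge(2.0, 1))                   # confidently correct: no loss
print(binary_hinge(0.3, -1))                  # wrong side of the margin
print(multiclass_hinge([2.0, 0.5, 1.5], 0))   # only class 2 violates the margin
```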

Precision (sentiment analysis)
Positive 0.701 0.692 0.634 0.667 0.681 0.711 0.567 0.631
Negative 0.573 0.501 0.556 0.120 0.552 0.571 0.520 0.689
Mixed feelings 0.329 0.443 0.0 0.162 0.266 0.292 0.0 0.248
Neutral 0.371 0.551 0.350 0.264 0.568 0.498 0.467 0.602
Other language 0.564 0.192 0.496 0.012 0.498 0.508 0.073 0.503
Macro-average 0.509 0.471 0.427 0.245 0.461 0.441 0.325 0.452
Weighted average 0.592 0.564 0.536 0.362 0.519 0.587 0.418 0.542
Precision (Offensive language identification)
Not offensive 0.781 0.778 0.715 0.679 0.759 0.719 0.735 0.749
Offensive untargeted 0.0 0.123 0.0 0.0 0.0 0.0 0.0 0.0
Offensive targeted individual 0.452 0.461 0.654 0.769 0.798 0.641 0.085 0.721
Offensive targeted group 0.0 0.391 0.0 0.571 0.371 0.0 0.0 0.428
Offensive targeted others 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
Other languages 0.704 0.731 0.702 0.748 0.678 0.712 0.008 0.678
Macro-average 0.321 0.408 0.345 0.461 0.431 0.345 0.138 0.431
Weighted average 0.651 0.691 0.622 0.656 0.678 0.625 0.425 0.669
Recall (sentiment analysis)
Positive 0.742 0.753 0.793 0.667 0.663 0.649 0.759 0.867
Negative 0.581 0.442 0.524 0.467 0.676 0.778 0.604 0.532
Mixed feelings 0.100 0.229 0.0 0.074 0.108 0.061 0.0 0.051
Neutral 0.493 0.581 0.263 0.144 0.376 0.401 0.073 0.343
Other language 0.590 0.209 0.678 0.006 0.608 0.702 0.117 0.618
Macro-average 0.501 0.441 0.452 0.271 0.508 0.521 0.311 0.479
Weighted average 0.602 0.572 0.592 0.361 0.628 0.612 0.470 0.628
Recall (Offensive language identification)
Not offensive 0.842 0.813 0.875 0.917 0.858 0.708 0.746 0.864
Offensive untargeted 0.0 0.041 0.0 0.0 0.0 0.0 0.0 0.0
Offensive targeted individual 0.533 0.629 0.415 0.484 0.599 0.610 0.025 0.653
Offensive targeted group 0.0 0.251 0.0 0.089 0.229 0.0 0.0 0.303
Offensive targeted others 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
Other languages 0.741 0.748 0.812 0.601 0.768 0.872 0.038 0.731
Macro-average 0.379 0.412 0.350 0.349 0.408 0.365 0.135 0.409
Weighted average 0.719 0.709 0.707 0.696 0.714 0.706 0.418 0.718
F1-score (sentiment analysis)
Positive 0.721 0.723 0.705 0.665 0.714 0.681 0.649 0.729
Negative 0.573 0.473 0.583 0.190 0.662 0.663 0.559 0.608
Mixed feelings 0.153 0.301 0.0 0.102 0.182 0.102 0.0 0.089
Neutral 0.424 0.558 0.300 0.187 0.449 0.443 0.126 0.431
Other languages 0.581 0.201 0.573 0.008 0.548 0.591 0.090 0.609
Macro-average 0.493 0.453 0.432 0.230 0.498 0.491 0.285 0.489
Weighted average 0.591 0.561 0.556 0.349 0.595 0.589 0.418 0.594
F1-score (offensive language identification)
Not offensive 0.811 0.801 0.787 0.780 0.792 0.788 0.741 0.801
Offensive untargeted 0.0 0.061 0.0 0.0 0.0 0.0 0.0 0.0
Offensive targeted individual 0.529 0.534 0.507 0.594 0.685 0.625 0.039 0.649
Offensive targeted group 0.0 0.299 0.0 0.154 0.285 0.0 0.0 0.348
Offensive targeted others 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
Other languages 0.743 0.741 0.753 0.667 0.725 0.710 0.013 0.698
Macro-average 0.351 0.421 0.341 0.366 0.419 0.354 0.132 0.418
Weighted average 0.686 0.698 0.656 0.651 0.685 0.661 0.416 0.680
Table 3: Precision, recall, and F1-scores of STL approach for Kannada

4.3.3 Focal Loss

The focal loss was initially proposed for the dense object detection task as an alternative to CE loss. It enables the training of highly accurate dense object detectors on highly imbalanced datasets Lin et al. (2017). For imbalanced text datasets, it handles class imbalance by calculating the ratio of each class and assigning weights according to whether an example is hard or easy (an inlier) Tula et al. (2021); Ma et al. (2020).

Precision (sentiment analysis)
Positive 0.704 0.657 0.446 0.615 0.713 0.695 0.750 0.673
Negative 0.343 0.0 0.0 0.512 0.536 0.521 0.494 0.623
Mixed feelings 0.0 0.0 0.0 0.175 0.521 0.571 0.562 0.447
Neutral 0.664 0.600 0.333 0.712 0.765 0.591 0.751 0.679
Other languages 0.750 0.750 0.622 0.744 0.735 0.814 0.706 0.787
Macro-average 0.492 0.402 0.280 0.552 0.654 0.639 0.653 0.642
Weighted average 0.612 0.533 0.346 0.618 0.701 0.643 0.708 0.664
Precision (Offensive language identification)
Not offensive 0.939 0.947 0.933 0.946 0.929 0.951 0.716 0.934
Offensive untargeted 0.0 0.529 0.0 0.400 0.0 0.0 0.0 1.000
Offensive targeted individual 0.0 0.222 0.0 0.0 0.0 0.027 0.0 0.0
Offensive targeted group 0.0 0.0 0.0 0.0 0.0 0.028 0.0 0.0
Other languages 0.717 0.716 0.744 0.807 0.756 0.045 0.829 0.724
Macro-average 0.331 0.483 0.335 0.431 0.337 0.210 0.353 0.532
Weighted average 0.884 0.905 0.880 0.904 0.878 0.847 0.885 0.901
Recall (sentiment analysis)
Positive 0.801 0.826 0.963 0.829 0.827 0.697 0.776 0.814
Negative 0.274 0.0 0.0 0.304 0.437 0.363 0.630 0.244
Mixed feelings 0.0 0.0 0.0 0.136 0.309 0.147 0.233 0.159
Neutral 0.719 0.717 0.009 0.531 0.505 0.559 0.596 0.712
Other languages 0.742 0.717 0.596 0.681 0.566 0.411 0.566 0.528
Macro-average 0.506 0.452 0.314 0.496 0.509 0.495 0.503 0.502
Weighted average 0.663 0.642 0.457 0.620 0.608 0.640 0.609 0.621
Recall (Offensive language identification)
Not offensive 0.983 0.966 0.981 0.976 0.983 0.648 0.935 0.979
Offensive untargeted 0.0 0.321 0.0 0.286 0.0 0.0 0.0 0.071
Offensive targeted individual 0.0 0.125 0.0 0.0 0.0 0.125 0.0 0.0
Offensive targeted group 0.0 0.0 0.0 0.0 0.0 0.100 0.0 0.0
Other languages 0.789 0.756 0.711 0.789 0.656 0.156 0.815 0.700
Macro-average 0.353 0.434 0.338 0.410 0.328 0.206 0.359 0.350
Weighted average 0.922 0.919 0.920 0.928 0.919 0.588 0.886 0.919
F1-score (sentiment analysis)
Positive 0.750 0.732 0.609 0.706 0.666 0.666 0.662 0.637
Negative 0.305 0.0 0.0 0.381 0.482 0.428 0.554 0.351
Mixed feelings 0.0 0.0 0.0 0.153 0.388 0.145 0.219 0.128
Neutral 0.691 0.654 0.018 0.608 0.634 0.665 0.622 0.655
Other languages 0.742 0.733 0.609 0.711 0.591 0.627 0.605 0.598
Macro-average 0.497 0.424 0.247 0.512 0.624 0.481 0.488 0.492
Weighted average 0.635 0.581 0.310 0.605 0.623 0.630 0.603 0.624
F1-score (offensive language identification)
Not offensive 0.958 0.956 0.956 0.961 0.955 0.771 0.962 0.956
Offensive untargeted 0.0 0.400 0.0 0.333 0.0 0.0 0.0 0.133
Offensive targeted individual 0.0 0.160 0.0 0.0 0.0 0.044 0.0 0.0
Offensive targeted group 0.0 0.0 0.0 0.0 0.0 0.043 0.0 0.0
Other languages 0.751 0.685 0.727 0.798 0.702 0.070 0.836 0.712
Macro-average 0.342 0.440 0.337 0.418 0.332 0.186 0.356 0.360
Weighted average 0.902 0.901 0.900 0.900 0.897 0.690 0.898 0.901
Table 4: Precision, recall, and F1-scores of STL approach for Malayalam

In CE, even easily classified examples incur a loss of non-trivial magnitude. We define focal loss by adding a modulating factor $(1 - p_{t})^{\gamma}$ to the cross-entropy loss, with a tunable focusing parameter $\gamma \geq 0$ Lin et al. (2017), as:

$$FL(p_{t}) = -(1 - p_{t})^{\gamma} \log (p_{t})$$

where $p_{t}$ is the probability estimated by the model for the given class. When $\gamma = 0$, the above formula reduces to CE. If $\gamma > 0$, the modulating factor reduces the loss for easily classified examples ($p_{t} > 0.5$), so that training concentrates on correcting misclassified examples.
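The effect of the modulating factor can be seen numerically in the sketch below. The default γ = 2 is chosen only for illustration and is not necessarily the value used in our experiments; the function name is hypothetical.

```python
import math


def focal_loss(p_t, gamma=2.0):
    """Focal loss for the probability p_t assigned to the true class."""
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


for p_t in (0.9, 0.5, 0.1):
    ce = focal_loss(p_t, gamma=0.0)     # gamma = 0 recovers cross-entropy
    fl = focal_loss(p_t, gamma=2.0)
    print(f"p_t={p_t:.1f}  CE={ce:.4f}  FL={fl:.4f}  FL/CE={fl / ce:.2f}")
```

An easy example with p_t = 0.9 keeps only about 1% of its cross-entropy loss, whereas a hard example with p_t = 0.1 keeps about 81%, so the hard examples dominate the gradient.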

4.3.4 Kullback–Leibler Divergence Loss

The Kullback–Leibler divergence (KLD) measures how the probability distribution of one random variable differs from that of another Kullback and Leibler (1951). Theoretically, a KLD score of 0 indicates that the two distributions are identical.

Precision (sentiment analysis)
Positive 0.713 0.740 0.664 0.702 0.689 0.686 0.714 0.676
Negative 0.413 0.417 0.393 0.392 0.462 0.444 0.399 0.505
Mixed feelings 0.411 0.329 0.436 0.345 0.440 0.419 0.397 0.307
Neutral 0.444 0.472 0.464 0.465 0.534 0.423 0.526 0.540
Other language 0.713 0.621 0.667 0.630 0.675 0.660 0.517 0.543
Macro-average 0.520 0.516 0.525 0.507 0.560 0.527 0.511 0.514
Weighted average 0.596 0.607 0.574 0.584 0.609 0.584 0.602 0.587
Precision (Offensive language identification)
Not offensive 0.797 0.860 0.796 0.826 0.815 0.837 0.853 0.839
Offensive untargeted 0.294 0.385 0.284 0.438 0.428 0.370 0.409 0.368
Offensive targeted individual 0.0 0.371 0.0 0.384 0.382 0.377 0.409 0.349
Offensive targeted group 0.0 0.318 0.0 0.297 0.423 0.329 0.454 0.323
Offensive targeted others 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
Other language 0.840 0.829 0.727 0.857 0.780 0.660 0.806 0.780
Macro-average 0.322 0.461 0.301 0.467 0.472 0.429 0.458 0.451
Weighted average 0.636 0.738 0.630 0.718 0.715 0.714 0.725 0.719
Recall (sentiment analysis)
Positive 0.854 0.805 0.882 0.827 0.888 0.863 0.831 0.891
Negative 0.391 0.401 0.265 0.427 0.329 0.323 0.440 0.274
Mixed feelings 0.129 0.228 0.131 0.123 0.174 0.095 0.174 0.144
Neutral 0.373 0.442 0.322 0.365 0.366 0.386 0.363 0.298
Other language 0.554 0.601 0.404 0.545 0.526 0.474 0.582 0.620
Macro-average 0.460 0.495 0.401 0.457 0.457 0.428 0.478 0.445
Weighted average 0.627 0.625 0.612 0.614 0.641 0.618 0.625 0.626
Recall (Offensive language identification)
Not offensive 0.964 0.887 0.947 0.935 0.946 0.910 0.915 0.919
Offensive untargeted 0.345 0.408 0.384 0.339 0.432 0.426 0.479 0.414
Offensive targeted individual 0.0 0.288 0.0 0.266 0.082 0.275 0.326 0.288
Offensive targeted group 0.0 0.351 0.0 0.134 0.154 0.092 0.210 0.102
Offensive targeted others 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
Other languages 0.618 0.708 0.539 0.742 0.719 0.775 0.715 0.685
Macro-average 0.321 0.440 0.312 0.403 0.389 0.413 0.439 0.401
Weighted average 0.752 0.750 0.740 0.765 0.767 0.752 0.741 0.755
F1-score (sentiment analysis)
Positive 0.777 0.771 0.758 0.759 0.776 0.765 0.768 0.769
Negative 0.402 0.409 0.316 0.409 0.384 0.374 0.419 0.355
Mixed feelings 0.196 0.269 0.202 0.181 0.250 0.154 0.242 0.196
Neutral 0.406 0.456 0.380 0.409 0.434 0.404 0.430 0.384
Other languages 0.586 0.611 0.503 0.584 0.591 0.552 0.552 0.579
Macro-average 0.473 0.503 0.432 0.469 0.487 0.450 0.489 0.457
Weighted average 0.600 0.614 0.571 0.589 0.607 0.583 0.609 0.585
F1-score (offensive language identification)
Not offensive 0.873 0.873 0.865 0.877 0.876 0.872 0.882 0.877
Offensive untargeted 0.317 0.396 0.326 0.383 0.430 0.396 0.441 0.387
Offensive targeted individual 0.0 0.324 0.0 0.314 0.135 0.318 0.362 0.315
Offensive targeted group 0.0 0.334 0.0 0.185 0.226 0.144 0.287 0.155
Offensive targeted others 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
Other languages 0.712 0.764 0.619 0.795 0.749 0.713 0.737 0.751
Macro-average 0.317 0.448 0.302 0.426 0.403 0.407 0.430 0.414
Weighted average 0.688 0.743 0.679 0.735 0.726 0.727 0.734 0.731
Table 5: Precision, recall, and F1-scores of STL approach for Tamil
Hard parameter sharing Soft parameter sharing
Losses Cross-entropy Hinge loss Focal loss KLD CE HL FL KLD
Precision (sentiment analysis)
Positive 0.683 0.673 0.666 0.459 0.713 0.534 0.679 0.459
Negative 0.677 0.685 0.651 0.0 0.596 0.733 0.631 0.0
Mixed feelings 0.0 0.0 0.0 0.0 0.143 0.0 0.0 0.0
Neutral 0.5 0.700 0.479 0.0 0.491 0.577 0.492 0.0
Other languages 0.442 0.477 0.459 0.0 0.538 0.703 0.454 0.0
Macro-average 0.460 0.507 0.451 0.092 0.496 0.509 0.451 0.092
Weighted average 0.574 0.597 0.560 0.210 0.592 0.558 0.562 0.210
Precision (Offensive language identification)
Not offensive 0.802 0.755 0.764 0.559 0.727 0.707 0.796 0.559
Offensive untargeted 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
Offensive targeted individual 0.640 0.625 0.588 0.0 0.641 0.838 0.718 0.0
Offensive targeted group 0.171 0.0 0.0 0.0 0.250 0.0 0.132 0.0
Offensive targeted others 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
Other languages 0.703 0.700 0.709 0.0 0.730 0.716 0.682 0.0
Macro-average 0.680 0.348 0.344 0.093 0.391 0.377 0.388 0.093
Weighted average 0.680 0.642 0.644 0.313 0.648 0.641 0.678 0.313
Recall (sentiment analysis)
Positive 0.787 0.790 0.781 1.0 0.746 0.967 0.772 1.0
Negative 0.689 0.689 0.659 0.0 0.774 0.268 0.689 0.0
Mixed feelings 0.0 0.0 0.0 0.0 0.016 0.0 0.0 0.0
Neutral 0.237 0.287 0.322 0.0 0.325 0.188 0.375 0.0
Other languages 0.701 0.828 0.644 0.0 0.655 0.299 0.563 0.0
Macro-average 0.483 0.496 0.474 0.200 0.503 0.344 0.480 0.200
Weighted average 0.626 0.636 0.615 0.459 0.632 0.560 0.618 0.459
Recall (Offensive language identification)
Not offensive 0.828 0.857 0.853 1.0 0.882 0.870 0.794 1.0
Offensive untargeted 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
Offensive targeted individual 0.585 0.671 0.610 0.0 0.610 0.378 0.622 0.0
Offensive targeted group 0.159 0.0 0.0 0.0 0.023 0.0 0.159 0.0
Offensive targeted others 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
Other languages 0.877 0.818 0.870 0.0 0.721 0.883 0.877 0.0
Macro-average 0.408 0.391 0.389 0.167 0.373 0.355 0.409 0.167
Weighted average 0.724 0.728 0.729 0.559 0.716 0.716 0.709 0.559
F1-score (sentiment analysis)
Positive 0.732 0.727 0.719 0.629 0.729 0.688 0.723 0.629
Negative 0.683 0.687 0.655 0.0 0.674 0.393 0.659 0.0
Mixed feelings 0.0 0.0 0.0 0.0 0.029 0.0 0.0 0.0
Neutral 0.322 0.280 0.359 0.0 0.391 0.283 0.426 0.0
Other languages 0.542 0.605 0.536 0.0 0.591 0.419 0.603 0.0
Macro-average 0.456 0.460 0.454 0.126 0.483 0.357 0.462 0.126
Weighted average 0.590 0.591 0.581 0.289 0.602 0.485 0.587 0.289
F1-score (offensive language identification)
Not offensive 0.815 0.803 0.806 0.717 0.797 0.780 0.795 0.717
Offensive untargeted 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
Offensive targeted individual 0.611 0.647 0.610 0.0 0.625 0.521 0.667 0.0
Offensive targeted group 0.165 0.0 0.0 0.0 0.042 0.0 0.144 0.0
Offensive targeted others 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
Other languages 0.780 0.759 0.781 0.0 0.725 0.791 0.767 0.0
Macro-average 0.395 0.368 0.364 0.120 0.365 0.349 0.395 0.120
Weighted average 0.700 0.683 0.683 0.401 0.672 0.662 0.690 0.401
Table 6: MTL results on the Kannada data set

KLD is defined as “the average number of extra bits needed to encode the data, due to the fact that we used the distribution q to encode the data instead of the true distribution p” Murphy (2012) and is also known as relative entropy. It is generally used for models with complex objectives, such as variational autoencoders (VAEs) for text generation, rather than for simple multi-class classification tasks Prokhorov et al. (2019); Asperti and Trentin (2020). Consider two distributions, P and Q. The main intuition behind KLD is that if the probability of an event under Q is small but that under P is large, then there is a large divergence. It can be used to compute the divergence between discrete as well as continuous probability distributions Brownlee. For discrete distributions, it is computed as follows:

$$D_{KL}(P \,\|\, Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}$$
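A short sketch of the discrete KLD follows. It uses the natural logarithm (nats rather than bits), skips terms where P(x) = 0, and assumes Q(x) > 0 wherever P(x) > 0; the distributions are toy values.

```python
import math


def kl_divergence(p, q):
    """D_KL(P || Q) for discrete distributions given as equal-length lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


p = [0.5, 0.5]
q = [0.9, 0.1]

print(kl_divergence(p, p))    # identical distributions: zero divergence
print(kl_divergence(p, q))    # Q underestimates the second event
print(kl_divergence(q, p))    # a different value: KLD is not symmetric
```

The asymmetry is why KLD is described as a divergence rather than a distance.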


5 Experiments

5.1 Experiments Setup

We use the pretrained models available on Hugging Face Transformers Wolf et al. (2020), fine-tune them with the Python-based deep learning framework PyTorch Paszke et al. (2019), and use Scikit-Learn Pedregosa et al. (2011) to evaluate their performance. For MTL experiments, we take the naive weighted sum by assigning equal weights to the losses Eigen and Fergus (2015). We have experimented with various hyperparameters, which are listed in Table 7. The experiments were conducted by dividing the data set into three parts: 80% for training, 10% for validation, and 10% for testing. The class-wise support of the test set is shown in Fig. 2. The experiments for all models were conducted on Google Colab Bisong (2019).

  • Epoch: As the pretrained LMs consisted of hundreds of millions of parameters, we limited training up to 5 epochs, owing to memory limitations Pan and Yang (2009).

  • Batch size: Different batch sizes were used among [16, 32, 64].

  • Optimiser: We used the Adam optimiser with weight decay Loshchilov and Hutter (2019).

  • Loss weights: We give equal importance to the two tasks and add the losses as shown in Equation 8.

Hyper-parameters Characteristics
Shared layers STL: Pooler output from pretrained LM, 1 linear layer (hidden_size, 128) with a ReLU activation function, output layer with (128, n_classes) for the tasks. MTL: Same characteristics as STL.
Task-specific layers MTL: 1 linear layer of shape (128, n_classes) depending on the number of labels in the datasets.
Output layers STL: 1 linear layer each for the sentiment task (5 neurons for every language) and the offensive task (6, 5, and 6 neurons for Kannada, Malayalam, and Tamil, respectively). MTL: 2 linear layers, one per task ([5, 6], [5, 5], and [5, 6] output neurons for Kannada, Malayalam, and Tamil, respectively).
Loss [CE, KLD, HL, FL]
Epoch 5
Batch size [16, 32, 64]
Optimiser AdamW Loshchilov and Hutter (2019)
Dropout 0.4 Srivastava et al. (2014)
Loss weights [1, 1] (All tasks are treated equally.)
Table 7: Various hyper-parameters used for our experiments

5.2 Transfer Learning Fine-Tuning

We have fine-tuned several pretrained language models for text classification. All of the models we use are pretrained on large corpora of unlabelled text. As we are dealing with code-mixed text, it is interesting to see how the models perform on Kannada, Malayalam, and Tamil: since all models are pretrained on either mono-lingual or multi-lingual corpora, the fine-tuned models may find it difficult to classify code-mixed sentences. For the optimiser, we use Adam with decoupled weight decay (AdamW), which separates the weight decay from the gradient update Loshchilov and Hutter (2019); Kingma and Ba (2014). The first step is to use the pretrained tokeniser to split each word into tokens. Then, we add the special tokens needed for sentence classification ([CLS] at the first position and [SEP] at the end of the sentence, as shown in Fig. 2). In the figure, tokens T1, T2, …, Tn represent the tokens obtained after tokenising. After the special tokens are added, the tokeniser replaces each token with its id from the embedding table, which is a component we obtain from the pretrained model.
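The preprocessing steps described above (wrapping the tokens in special tokens and mapping them to ids) can be sketched with a toy vocabulary. Everything in this sketch is hypothetical: the vocabulary, the example tokens, and the `encode` helper only mimic what a pretrained tokeniser does and are not part of any library.

```python
# Hypothetical toy vocabulary; a pretrained tokeniser ships its own.
VOCAB = {"[UNK]": 0, "[CLS]": 1, "[SEP]": 2, "super": 3, "##b": 4, "padam": 5}


def encode(tokens, vocab):
    """Add [CLS] and [SEP], then replace every token with its vocabulary id."""
    wrapped = ["[CLS]"] + tokens + ["[SEP]"]
    ids = [vocab.get(tok, vocab["[UNK]"]) for tok in wrapped]
    return wrapped, ids


# Tokens as they might look after subword tokenisation (assumed example).
wrapped, ids = encode(["super", "##b", "padam", "thalaiva"], VOCAB)
print(wrapped)
print(ids)
```

Tokens that are missing from the vocabulary fall back to the [UNK] id, which is one reason why code-mixed text can be hard for models pretrained on mono-lingual corpora.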

BERT Devlin et al. (2019) was originally pretrained on English texts and was later extended to mBERT Pires et al. (2019), a language model pretrained on the Wikipedia dumps of the top 104 languages. mBERT accepts up to 512 input tokens, represents its output as a 768-dimensional vector, and has 12 attention heads. It is worth noting that we have used both multi-lingual and mono-lingual models (pretrained on English) to analyse the improvements, as we are dealing with code-mixed texts. Two mBERT models are available; for our task, we use the BERT-base, Multi-lingual Cased checkpoint. In MTL, we use distilbert-base-multilingual-cased for Tamil and bert-base-multilingual-cased for the Malayalam and Kannada datasets, based on their STL performance on the DravidianCodeMix data set.

6 Results

This section presents a comprehensive analysis of the capabilities of several pretrained LMs. We have employed the popular metrics in NLP tasks, including precision (P), recall (R), F1-score (F), weighted average, and macro-average. F1-score is the harmonic mean of recall and precision, taking values between 0 and 1. For each class $i$, the metrics are computed as follows:

$$P_{i} = \frac{TP_{i}}{TP_{i} + FP_{i}}, \qquad R_{i} = \frac{TP_{i}}{TP_{i} + FN_{i}}, \qquad F_{i} = \frac{2 \, P_{i} \, R_{i}}{P_{i} + R_{i}}, \qquad i = 1, \dots, c$$

where c is the number of classes:

  • TP – True positive examples are predicted to be positive and are positive;

  • TN – True negative examples are predicted to be negative and are negative;

  • FP – False positive examples are predicted to be positive but are negative; and

  • FN – False negative examples are predicted to be negative but are positive.
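As a rough sketch, the per-class metrics and their macro and weighted averages can be computed from the TP, FP, and FN counts as below. The labels are toy data, not taken from our datasets, and setting an undefined ratio to 0.0 is a convention assumed here.

```python
def per_class_scores(y_true, y_pred, labels):
    """Return {label: (precision, recall, f1, support)} from TP/FP/FN counts."""
    pairs = list(zip(y_true, y_pred))
    scores = {}
    for c in labels:
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        # Assumed convention: an undefined ratio (0/0) is reported as 0.0.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        scores[c] = (precision, recall, f1, tp + fn)
    return scores


def macro_and_weighted_f1(scores):
    """Macro: plain mean over classes. Weighted: mean weighted by support."""
    total = sum(s[3] for s in scores.values())
    macro = sum(s[2] for s in scores.values()) / len(scores)
    weighted = sum(s[2] * s[3] for s in scores.values()) / total
    return macro, weighted


# Hypothetical toy labels, for illustration only.
y_true = ["pos", "pos", "pos", "pos", "neg", "neg", "mixed", "mixed"]
y_pred = ["pos", "pos", "pos", "neg", "neg", "pos", "pos", "pos"]
scores = per_class_scores(y_true, y_pred, ["pos", "neg", "mixed"])
macro, weighted = macro_and_weighted_f1(scores)
print(round(macro, 3), round(weighted, 3))  # 0.367 0.425
```

In this toy case the class that is never predicted pulls the macro-average down more than the weighted average, because its support is small.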

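The weighted average and macro-average of a per-class metric M (P, R, or F) are computed as follows:

$$M_{\text{macro}} = \frac{1}{c} \sum_{i=1}^{c} M_i, \qquad M_{\text{weighted}} = \sum_{i=1}^{c} \frac{n_i}{N} \, M_i$$

where M_i is the value of the metric for class i, n_i is the support (number of test samples) of class i, and N is the total number of test samples.
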
We have evaluated the performance of the models on various metrics such as precision, recall, and F1-score. Accuracy gives more importance to the TP and TN, while disregarding FN and FP. The F1-score is the harmonic mean of precision and recall and therefore considers both FP and FN. Due to the class imbalance in our datasets, we use the weighted F1-score as the evaluation metric. The benefits of MTL are multi-fold, as the two tasks are related to each other. MTL offers several advantages, such as improved data efficiency, reduced overfitting, and faster learning by leveraging auxiliary representations Ruder (2017). As a result, MTL allows us to have a single shared model in lieu of training independent models per task Dobrescu et al. (2020). Hard parameter sharing shares model weights between multiple tasks, which are trained jointly to minimise multiple losses. In soft parameter sharing, by contrast, each task has its own model with its own weights, and the distance between the parameters of the task-specific models is added to the objective to be optimised Crawshaw (2020). We experiment with MTL to achieve a slight improvement in the performance of the model, along with reduced time and space requirements in hard parameter sharing.
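The difference between the two training objectives can be sketched as follows. This is a simplified illustration, not our training code: the squared L2 distance and the coefficient `lam` are assumptions, and the parameters are shown as short lists of floats rather than full model weights.

```python
def hard_sharing_loss(loss_sa, loss_oli):
    """Hard parameter sharing: one shared encoder with two task heads.

    The joint objective is the plain sum of the two task losses.
    """
    return loss_sa + loss_oli


def soft_sharing_loss(loss_sa, loss_oli, params_sa, params_oli, lam=0.01):
    """Soft parameter sharing: one model per task, tied by a distance penalty.

    Assumption: the penalty is the squared L2 distance between the two
    parameter vectors, scaled by a hypothetical coefficient `lam`.
    """
    distance = sum((a - b) ** 2 for a, b in zip(params_sa, params_oli))
    return loss_sa + loss_oli + lam * distance


print(hard_sharing_loss(0.7, 0.3))                                 # 1.0
print(soft_sharing_loss(0.7, 0.3, [1.0, 2.0], [1.0, 0.0], lam=0.5))  # 3.0
```

In the hard variant a single set of encoder weights receives gradients from both losses, whereas in the soft variant the penalty only pulls the two sets of weights towards each other.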

Columns 1–4: hard parameter sharing (CE, HL, FL, KLD); columns 5–8: soft parameter sharing (CE, HL, FL, KLD)
Class CE HL FL KLD CE HL FL KLD
Precision (sentiment analysis)
Positive 0.726 0.448 0.453 0.426 0.450 0.620 0.660 0.426
Negative 0.453 0.0 0.0 0.0 0.0 0.0 0.521 0.0
Mixed feelings 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
Neutral 0.709 0.0 0.0 0.0 0.0 0.595 0.743 0.0
Other languages 0.745 0.653 0.613 0.0 0.616 0.724 0.758 0.0
Macro-average 0.527 0.220 0.213 0.085 0.213 0.388 0.536 0.085
Weighted average 0.647 0.239 0.238 0.181 0.237 0.515 0.638 0.181
Precision (Offensive language identification)
Not offensive 0.939 0.930 0.930 0.887 0.933 0.933 0.939 0.887
Offensive untargeted 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
Offensive targeted individual 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
Offensive targeted group 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
Other languages 0.750 0.635 0.585 0.0 0.714 0.780 0.789 0.0
Macro-average 0.338 0.313 0.303 0.177 0.330 0.343 0.345 0.177
Weighted average 0.886 0.870 0.866 0.787 0.879 0.883 0.888 0.787
Recall (sentiment analysis)
Positive 0.800 0.972 0.965 1.000 0.965 0.801 0.871 1.000
Negative 0.496 0.0 0.0 0.0 0.0 0.0 0.370 0.0
Mixed feelings 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
Neutral 0.724 0.0 0.0 0.0 0.0 0.670 0.653 0.0
Other language 0.777 0.681 0.777 0.0 0.734 0.755 0.734 0.0
Macro-average 0.559 0.331 0.348 0.200 0.340 0.445 0.526 0.200
Weighted average 0.690 0.464 0.468 0.426 0.465 0.619 0.681 0.426
Recall (Offensive language identification)
Not offensive 0.979 0.969 0.961 1.000 0.977 0.984 0.983 1.000
Offensive untargeted 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
Offensive targeted individual 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
Offensive targeted group 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
Other languages 0.800 0.678 0.689 0.0 0.722 0.711 0.789 0.0
Macro-average 0.356 0.329 0.330 0.200 0.340 0.339 0.354 0.200
Weighted average 0.925 0.908 0.901 0.887 0.918 0.923 0.928 0.887
F1-score (sentiment analysis)
Positive 0.761 0.614 0.617 0.597 0.614 0.699 0.751 0.597
Negative 0.473 0.0 0.0 0.0 0.0 0.0 0.433 0.0
Mixed feelings 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
Neutral 0.716 0.0 0.0 0.0 0.0 0.630 0.695 0.0
Other language 0.760 0.667 0.685 0.0 0.670 0.740 0.746 0.0
Macro-average 0.542 0.256 0.260 0.119 0.257 0.414 0.525 0.119
Weighted average 0.668 0.310 0.313 0.254 0.311 0.561 0.651 0.254
F1-score (offensive language identification)
Not offensive 0.959 0.949 0.945 0.940 0.955 0.958 0.960 0.940
Offensive untargeted 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
Offensive targeted individual 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
Offensive targeted group 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
Other languages 0.774 0.656 0.633 0.0 0.718 0.744 0.789 0.0
Macro-average 0.347 0.321 0.316 0.188 0.335 0.340 0.350 0.188
Weighted average 0.905 0.888 0.883 0.834 0.898 0.902 0.908 0.834
Table 8: MTL results on the Malayalam data set
CE: Cross-entropy loss, HL: Multi-class hinge loss, FL: Focal loss, KLD: Kullback–Leibler divergence
Columns 1–4: hard parameter sharing (CE, HL, FL, KLD); columns 5–8: soft parameter sharing (CE, HL, FL, KLD)
Class CE HL FL KLD CE HL FL KLD
Precision (sentiment analysis)
Positive 0.721 0.703 0.729 0.560 0.710 0.686 0.717 0.676
Negative 0.446 0.467 0.449 0.500 0.430 0.444 0.428 0.505
Mixed feelings 0.443 0.396 0.389 0.428 0.370 0.419 0.292 0.307
Neutral 0.501 0.523 0.464 0.0 0.540 0.599 0.511 0.540
Other language 0.669 0.695 0.667 0.0 0.663 0.721 0.570 0.543
Macro-average 0.556 0.557 0.537 0.112 0.543 0.583 0.504 0.514
Weighted average 0.619 0.612 0.613 0.313 0.610 0.614 0.596 0.587
Precision (Offensive language identification)
Not offensive 0.844 0.805 0.843 0.726 0.862 0.821 0.870 0.839
Offensive untargeted 0.408 0.431 0.370 0.0 0.392 0.438 0.395 0.368
Offensive targeted individual 0.341 0.462 0.335 0.0 0.371 0.418 0.318 0.349
Offensive targeted group 0.353 0.373 0.317 0.0 0.410 0.420 0.327 0.323
Offensive targeted others 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
Other languages 0.823 0.834 0.828 0.0 0.761 0.856 0.850 0.830
Macro-average 0.461 0.485 0.449 0.121 0.466 0.492 0.460 0.451
Weighted average 0.728 0.713 0.721 0.527 0.744 0.726 0.744 0.719
Recall (sentiment analysis)
Positive 0.844 0.867 0.830 1.0 0.857 0.928 0.831 1.0
Negative 0.461 0.433 0.465 0 0.429 0.278 0.342 0
Mixed feelings 0.168 0.155 0.155 0 0.168 0.159 0.241 0
Neutral 0.420 0.372 0.433 0 0.366 0.289 0.356 0
Other languages 0.568 0.568 0.587 0 0.601 0.559 0.615 0
Macro-average 0.492 0.479 0.494 0.200 0.484 0.443 0.477 0.200
Weighted average 0.643 0.643 0.637 0.560 0.639 0.645 0.620 0.560
Recall (Offensive language identification)
Not offensive 0.927 0.959 0.919 1.000 0.909 0.945 0.881 1.000
Offensive untargeted 0.443 0.357 0.414 0 0.485 0.375 0.435 0
Offensive targeted individual 0.196 0.114 0.212 0 0.269 0.177 0.323 0
Offensive targeted group 0.200 0.102 0.174 0 0.252 0.190 0.325 0
Offensive targeted others 0.0 0.0 0.0 0 0.0 0.0 0.0 0.0
Other language 0.730 0.708 0.730 0 0.787 0.736 0.730 0
Macro-average 0.416 0.373 0.408 0.167 0.450 0.404 0.449 0.167
Weighted average 0.766 0.769 0.757 0.726 0.767 0.772 0.750 0.726
F1-score (sentiment analysis)
Positive 0.778 0.777 0.777 0.718 0.777 0.778 0.770 0.718
Negative 0.454 0.449 0.457 0.0 0.430 0.357 0.380 0
Mixed feelings 0.243 0.223 0.222 0.0 0.231 0.232 0.264 0
Neutral 0.457 0.435 0.450 0.0 0.436 0.390 0.420 0
Other language 0.614 0.625 0.616 0.0 0.631 0.630 0.591 0
Macro-average 0.509 0.502 0.504 0.144 0.501 0.477 0.485 0.144
Weighted average 0.621 0.614 0.625 0.402 0.614 0.598 0.603 0.402
F1-score (offensive language identification)
Not offensive 0.883 0.875 0.879 0.841 0.885 0.879 0.876 0.841
Offensive untargeted 0.425 0.393 0.390 0 0.434 0.404 0.414 0
Offensive targeted individual 0.249 0.183 0.260 0 0.312 0.249 0.320 0
Offensive targeted group 0.255 0.160 0.225 0 0.312 0.262 0.326 0
Offensive targeted others 0.0 0.0 0.000 0 0.0 0.0 0 0
Other language 0.774 0.766 0.776 0 0.773 0.792 0.785 0
Macro-average 0.431 0.396 0.422 0.140 0.453 0.431 0.453 0.140
Weighted average 0.742 0.729 0.745 0.611 0.753 0.739 0.746 0.611
Table 9: MTL results on the Tamil data set

Apart from the pretrained LMs shown in Tables 3, 4, and 5, we have tried other domain-specific LMs. IndicBERT Kakwani et al. (2020), released together with fastText-based word embeddings, is an ALBERT-based LM trained on 11 Indian languages along with English; it is a multi-lingual model similar to mBERT Pires et al. (2019). IndicBERT was pretrained on IndicCorp Kakwani et al. (2020), the sentences of which were tokenised using a SentencePiece tokeniser Kudo and Richardson (2018). However, when IndicBERT was fine-tuned for SA and OLI, the model performed very poorly, despite being pretrained on 12 languages, including Kannada, Malayalam, Tamil, and English. We believe that one of the main reasons for its poor performance, despite being pretrained on a large corpus, lies in the architecture of the model, an issue also encountered by Puranik et al. (2021). Even though IndicBERT was pretrained in a multi-lingual setting, it follows the architecture of ALBERT with the standard MLM objective. We believe that the cross-layer parameter sharing of ALBERT hinders its performance in a multi-lingual setting (when it is pretrained on more than one language), as ALBERT is designed primarily to reduce memory footprint and training time Lan et al. (2019).

We have also experimented with Multilingual Representations for Indian Languages (MuRIL), an LM that, like IndicBERT, was pretrained on Indian languages but differs in its pretraining strategy and corpus Khanuja et al. (2021). It follows the architecture of the BERT-base encoder; however, it is trained on two objectives, MLM and transliterated language modeling (TLM). Unlike MLM, TLM leverages parallel data. However, in spite of having a TLM objective, the model performed worse than BERT-base, which is the base encoder model for MuRIL. Hence, we have not tabulated the classification reports of MuRIL and IndicBERT. In the following subsections, we analyse the performance of the models on each data set separately.

6.1 Kannada

Table 3 and Table 6 present the classification reports of the models on the Kannada data set. We opted to train the STL models with CE loss, which is a popular choice among loss functions for classification De Boer et al. (2005). We observe that mBERT (multi-lingual BERT) achieves the highest weighted F1-score among all models, as shown in Table 3.

Figure 8: Train Accuracy during MTL

mBERT achieved 0.591 for SA and 0.686 for OLI. In spite of achieving the highest weighted average of F1-scores, some classes performed very poorly, with an F1-score of 0.0. In OLI, most of the models perform poorly on three classes: Offensive Untargeted, Offensive Targeted Group, and Offensive Targeted Others. One of the main reasons could be the low support of these classes in the test set. Offensive Untargeted has a support of 27, Offensive Targeted Group has 44, and Offensive Targeted Others has a mere 14 out of a total test-set support of 728. The ratio of the majority class to the minority class is very severe (407:14), which is essentially why the F1-scores of Not Offensive and Offensive Targeted Individual are much higher than those of the low-resourced classes. Among mono-lingual LMs, characterBERT performs on par with mBERT. It is interesting to note that we use general-character-bert, an LM that was pretrained only on English Wikipedia and OpenWebText Liu and Curran (2006). We believe that the approach employed in characterBERT, which attends to the characters to obtain a single word embedding instead of the sub-word tokenisation used in BERT (as illustrated in Fig. 5), is one of the reasons why characterBERT outperforms other mono-lingual models such as RoBERTa, ALBERT, and XLNet, even though those models are pretrained on more data with better strategies.
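The effect of the low support can be made concrete with a short calculation that uses only the support figures quoted above. The snippet computes each minority class's share of the test set and the resulting ceiling on the weighted F1-score if all three classes score 0.0; the variable names are illustrative.

```python
total_support = 728
minority_support = {
    "Offensive Untargeted": 27,
    "Offensive Targeted Group": 44,
    "Offensive Targeted Others": 14,
}

# Weight of each minority class in the weighted average.
shares = {name: n / total_support for name, n in minority_support.items()}
combined_share = sum(minority_support.values()) / total_support

# If all three classes score 0.0, the remaining classes can contribute
# at most their own combined weight to the weighted F1-score.
weighted_f1_ceiling = 1.0 - combined_share

for name, share in shares.items():
    print(f"{name}: {share:.1%}")
print(f"combined: {combined_share:.1%}, ceiling: {weighted_f1_ceiling:.3f}")
```

Together the three classes account for roughly 12% of the test set, so a model can ignore them entirely and still obtain a high weighted F1-score, which is why the class-wise scores are reported alongside it.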

(a) Pos: Positive, Neut: Neutral, OL: Other languages, MF: Mixed feelings, Neg: Negative
(b) NO: Not offensive, OL: Other language, OU: Offensive untargeted, OTG: Offensive targeted group, OTI: Offensive targeted individual, OTO: Offensive targeted others
Figure 9: Heatmap of confusion matrix for the best performing model on the Kannada data set

To our surprise, we observe that XLM-RoBERTa, a multi-lingual LM pretrained on the top 100 languages, performed the worst among the models. One of the main reasons for the low performance of XLM-RoBERTa (–20%) could be the low support in the test set. However, due to the nature of the language, it would be hard to extract more data. Another reason for the poor performance lies in the architecture of the RoBERTa models, upon which XLM-R is based. Unlike the conventional WordPiece Schuster and Nakajima (2012) and unigram tokenisers Kudo (2018), RoBERTa uses a byte-level BPE tokeniser Sennrich et al. (2016). However, BPE tends to have a poor morphological alignment with the original text Jain et al. (2020). As Dravidian languages are morphologically rich Tanwar and Majumder (2020), this approach results in poor performance on SA and OLI. DistilBERT scores higher than the other multi-lingual models such as XLM and XLM-R, in spite of having far fewer parameters than large LMs such as XLM-R base (66M vs 270M). However, among all models, mBERT had the best overall scores in both tasks. Hence, we used mBERT for the MTL experiments, training it with hard parameter sharing and soft parameter sharing.
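To illustrate why BPE segmentation need not follow morpheme boundaries, the merge-learning step can be sketched on a toy corpus. This is a character-level simplification (RoBERTa operates on bytes), and the romanised words and their frequencies are hypothetical.

```python
from collections import Counter


def pair_counts(words):
    """Count adjacent symbol pairs over a {symbol_tuple: frequency} dict."""
    counts = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            counts[(a, b)] += freq
    return counts


def merge_pair(words, pair):
    """Replace every occurrence of `pair` by the concatenated symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged


def learn_bpe(word_freqs, num_merges):
    """Repeatedly merge the most frequent adjacent pair of symbols."""
    words = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        counts = pair_counts(words)
        if not counts:
            break
        best = max(counts, key=counts.get)
        merges.append(best)
        words = merge_pair(words, best)
    return merges, words


# Hypothetical romanised toy corpus with made-up frequencies.
corpus = {"padam": 5, "padathil": 3, "paattu": 2}
merges, segmented = learn_bpe(corpus, num_merges=1)
print(merges)  # [('p', 'a')]
```

The merges are chosen purely by pair frequency, so the resulting sub-word units reflect corpus statistics rather than stems and suffixes.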

We have fine-tuned mBERT with four loss functions, as it outperformed the other pretrained LMs when trained with CE. The highest weighted F1-score for SA was 0.602, achieved when mBERT was trained with the soft parameter sharing strategy and CE as the loss function. This is an increase of 0.011 over STL. However, the second task scored 0.672, which is lower than the STL score (-0.014 F1-score). As we take the naive sum of the losses when training in a multi-task setting, the loss of one task suppresses the other, so that mainly one task benefits from the approach Maninis et al. (2019). However, it is worth noting that the approach did achieve a weighted F1-score that is competitive with, if not greater than, the STL score. The highest F1-score achieved for OLI was 0.700, obtained by training mBERT with the hard parameter sharing strategy and CE as its loss function. However, the same suppression of performance on the other task was observed, as it scored 0.590 on SA. Only two of the classes scored 0.0 in MTL, in contrast to three in STL. When both models were trained with KLD as the loss function, the performance was abysmal: each model was only able to predict a single class in the respective tasks (Positive and Not Offensive). When trained on hinge loss and focal loss, we observe that they attend to the class imbalance. Hence, we observe that MTL frameworks tend to perform slightly better than when the tasks are treated individually.
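For reference, the four loss functions can be written for a single example as below. These are textbook forms given as a sketch; the focusing parameter `gamma`, the margin, and the averaged multi-class hinge formulation are assumptions and may differ from the settings used in our experiments.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    """Negative log-probability of the target class."""
    return -math.log(softmax(logits)[target])


def focal_loss(logits, target, gamma=2.0):
    """Cross-entropy down-weighted for well-classified examples."""
    p_t = softmax(logits)[target]
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


def multiclass_hinge(logits, target, margin=1.0):
    """Margin violations of the non-target classes, averaged over classes."""
    violations = sum(max(0.0, margin - logits[target] + z)
                     for i, z in enumerate(logits) if i != target)
    return violations / len(logits)


def kl_divergence(target_dist, logits):
    """KL(target || prediction); zero-probability target terms are skipped."""
    probs = softmax(logits)
    return sum(q * math.log(q / p)
               for q, p in zip(target_dist, probs) if q > 0)


logits = [0.0, 0.0]                        # an undecided two-class prediction
print(cross_entropy(logits, 0))            # log(2), about 0.693
print(focal_loss(logits, 0))               # 0.25 * log(2), about 0.173
print(multiclass_hinge(logits, 0))         # 0.5
print(kl_divergence([1.0, 0.0], logits))   # log(2), about 0.693
```

With a one-hot target distribution, the KL term as written here takes the same value as the cross-entropy; the two differ only when the target distribution is soft.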

Fig. 8 represents the training accuracy of SA and OLI in an MTL scenario. The graph shows a bigger jump for SA after the first epoch, in contrast to OLI, where there is a steady increase in accuracy. Fig. 9(a) and Fig. 9(b) exhibit the confusion matrices of the best performing model for the respective tasks. For SA, we observe that many labels are misclassified. The trend of misclassification can also be observed in OLI, where the samples are classified into only four classes (instead of six), with no samples being classified into OTO and OU. The samples of OTO and OU are misclassified into other classes, which is mainly due to the low support of these classes in the test set.
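A confusion matrix of this kind, and the check for classes that are never predicted, can be sketched as follows. The labels and predictions are a hypothetical toy sample, not our test set.

```python
def confusion_matrix(y_true, y_pred, labels):
    """Rows are true labels, columns are predicted labels."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix


def never_predicted(matrix, labels):
    """Labels whose column is all zeros: the model never outputs them."""
    return [label for j, label in enumerate(labels)
            if all(row[j] == 0 for row in matrix)]


# Hypothetical toy sample, for illustration only.
labels = ["NO", "OTI", "OU", "OTO"]
y_true = ["NO", "NO", "OTI", "OU", "OTO", "OTI"]
y_pred = ["NO", "NO", "OTI", "NO", "OTI", "NO"]

matrix = confusion_matrix(y_true, y_pred, labels)
print(matrix)                           # [[2, 0, 0, 0], [1, 1, 0, 0], [1, 0, 0, 0], [0, 1, 0, 0]]
print(never_predicted(matrix, labels))  # ['OU', 'OTO']
```

An all-zero column corresponds to an empty column in the heatmap and to a precision, recall, and F1-score of 0.0 for that class.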

Figure 10: Train accuracy during MTL of DistilmBERT on the Tamil data set

6.2 Tamil

Table 2 gives an insight into the support of the classes in the test set. We observe that the support of Offensive Targeted Others is quite low in contrast to Not Offensive (52:3148). From Table 5, DistilmBERT is the best performing model among all models. In spite of being a smaller, distilled form of BERT, it performs better than the parent model. Even though it was suggested that DistilBERT retains 97% of BERT’s accuracy, it has outperformed BERT previously Maslej-Krešňáková et al. (2020), mainly due to the custom triple loss function and fewer parameters Lan et al. (2019). The weighted average F1-score is 0.614 for SA and 0.743 for OLI, both trained with CE. Even though BERT achieves competitive performance, three out of six classes have an F1-score of 0.0, due to misclassification. Hence, we train DistilmBERT on MTL. We observe that XLM has also performed better than BERT and XLM-RoBERTa.

(a) Pos: Positive, Neut: Neutral, OL: Other languages, MF: Mixed feelings, Neg: Negative
(b) NO: Not offensive, OL: Other languages, OU: Offensive untargeted, OTG: Offensive targeted group, OTI: Offensive targeted individual
Figure 11: Heatmap of confusion matrix for the best performing model on the Tamil data set

It can be observed from Table 9 that DistilmBERT, when trained using the soft parameter sharing strategy with CE loss, yields the best results for OLI, with a weighted F1-score of 0.753 (+1.0%), and 0.614 on SA (+/-0.0%). However, the best score for SA was achieved by training DistilmBERT with the hard parameter sharing strategy and focal loss, which scored 0.625 (+1.1%) on SA and 0.745 (+0.2%) on OLI. It is to be noted that training on both CE and FL achieves better results than the best scores set forth by the STL model. The hinge loss function helps achieve better results for precision and recall. During training, the accuracy of SA increases at a greater rate than that of OLI, as displayed in Fig. 10. Fig. 11 displays the class-wise predictions of the best performing DistilmBERT model trained with CE. Most of the samples of the NO class are predicted correctly, with the majority of the rest being mislabelled as OU. Due to low support in the training and test sets, no samples of OTO have been predicted correctly. Most of the misclassified samples are predicted as OU, which is primarily due to the close interconnection among the classes. In SA, a significant number of samples of the Positive and Negative classes have been predicted as Neutral. Misclassification is also observed in Mixed Feelings (MF), where most of the samples have been misclassified as Positive or Negative. This is mainly due to the nature of the class, as Mixed Feelings samples can resemble either of the two classes.

6.3 Malayalam

Figure 12: Train Accuracy during MTL of mBERT on the Malayalam data set

Tables 4 and 8 present the classification reports of the models on STL and MTL, respectively. Among the STL models, fine-tuning mBERT with CE gave a weighted F1-score of 0.635 for SA and 0.902 for OLI. DistilBERT outperforms BERT on OLI; however, it scores poorly on SA. DistilBERT reduces the occurrence of misclassification, as it is able to classify four out of the five classes separately, unlike mBERT, which correctly classifies two out of five classes in OLI. Even though XLM has higher class-wise F1-scores for both tasks, it performed very poorly when trained in an MTL setting. This could be due to the pretraining strategy of XLM, as it was only pretrained on Wikipedia, which does not improve the performance on low-resource languages. In addition, acquiring parallel data for TLM during pretraining could be very challenging Grégoire and Langlais (2018). Several classes have a class-wise F1-score of 0.0 due to low support in the test set (class imbalance). Since mBERT is the strongest model overall, we use it for the MTL experiments.

(a) Pos: Positive, Neut: Neutral, OL: Other languages, MF: Mixed feelings, Neg: Negative
(b) NO: Not offensive, OL: Other languages, OU: Offensive untargeted, OTG: Offensive targeted group, OTI: Offensive targeted individual
Figure 13: Heatmap of confusion matrix for the best performing model on the Malayalam data set

We observe that mBERT fine-tuned with the hard parameter sharing strategy and CE as its loss achieved the highest score of 0.668 (+3.3% from STL) in SA, while its secondary task scored 0.905 (+0.3%). Although the latter is not the best weighted F1-score for OLI, it outperformed the STL result of mBERT. When mBERT is trained using soft parameter sharing with focal loss, it attains the highest weighted F1-score for OLI (0.908, +0.6%), while scoring 0.651 (+1.6%) on SA. We also observe that the performance of KLD is very poor, which could be due to the nature of the loss function. KLD is likely to treat the multi-class classification problem as a regression problem, hence predicting the single class that has the highest support. Although it has previously been used for classification problems Nakov et al. (2016), in our case the loss performs worse than the baseline algorithms Chakravarthi et al. (2020a). It is to be noted that the increase in performance is greater for SA than for OLI. The time required to train, along with the memory constraints, is one of the reasons why we opt for MTL.
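The gains over STL follow directly from the weighted F1-scores in Tables 4 and 8 and can be checked with a few lines. The helper name and the dictionary layout are illustrative; the gains are expressed in percentage points.

```python
# Weighted F1-scores of mBERT on the Malayalam data set, as reported above.
stl = {"SA": 0.635, "OLI": 0.902}
mtl = {
    "hard sharing + CE": {"SA": 0.668, "OLI": 0.905},
    "soft sharing + focal loss": {"SA": 0.651, "OLI": 0.908},
}


def gain_in_points(mtl_score, stl_score):
    """Difference between MTL and STL, in percentage points."""
    return round((mtl_score - stl_score) * 100, 1)


gains = {
    setting: {task: gain_in_points(scores[task], stl[task]) for task in stl}
    for setting, scores in mtl.items()
}
for setting, result in gains.items():
    print(setting, result)
```

For hard parameter sharing with CE this gives 3.3 points on SA and 0.3 points on OLI; for soft parameter sharing with focal loss, 1.6 and 0.6 points, respectively.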

Fig. 12 represents the training accuracy of both tasks in the best performing model. After the first epoch, there is a steady increase in both tasks. However, after the third epoch, we do not see any improvement in either task. Fig. 13 reports the confusion matrices of the best performing models (in MTL). For SA, we observe that most of the samples of Mixed Feelings are misclassified, which is mainly due to the low support in the training and test sets, as observed in Fig. 13(a). For OLI, we observe that the model tends to classify all of the samples in the test set into two classes: NO and OL. As stated previously, the low support of the other classes during training is why the model misclassifies them.

7 Conclusion

Despite the rising popularity of social media, code-mixed data in Dravidian languages remain scarce, which has motivated us to develop MTL frameworks for SA and OLI in three code-mixed Dravidian corpora, namely, Kannada, Malayalam, and Tamil. The proposed approach of fine-tuning multi-lingual BERT with hard parameter sharing and cross-entropy loss yields the best performance for both of the tasks in Kannada and Malayalam, achieving competitive scores compared with treating the tasks separately. For Tamil, our approach of fine-tuning multi-lingual DistilBERT with soft parameter sharing and cross-entropy loss scores better than the other models, mainly due to the triple loss function employed during its pretraining. The performance of these models highlights the advantages of using an MTL model to attend to two tasks at a time, mainly reducing the time and space required compared with training separate models. For future work, we intend to use uncertainty weighting to calculate the impact of one loss function on the other during MTL.
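The exact formulation of uncertainty weighting is left open here. As a rough sketch of one widely used simplified form, each task loss L_i is scaled by exp(-s_i) and regularised by s_i, where s_i is a learnable log-variance per task; constant factors vary between implementations, and in practice the s_i would be trained jointly with the model weights.

```python
import math


def uncertainty_weighted_loss(task_losses, log_vars):
    """Combine task losses as sum_i exp(-s_i) * L_i + s_i.

    Assumed formulation: s_i is a learnable log-variance for task i.
    A larger s_i down-weights that task's loss, while the additive s_i
    term stops the weights from collapsing to zero.
    """
    return sum(math.exp(-s) * loss + s
               for loss, s in zip(task_losses, log_vars))


# With all log-variances at zero, this reduces to the naive sum of losses.
print(uncertainty_weighted_loss([0.7, 0.3], [0.0, 0.0]))
# Down-weighting the first (hypothetical) task by a factor of two.
print(uncertainty_weighted_loss([2.0, 1.0], [math.log(2.0), 0.0]))
```

Compared with the naive sum used in this work, such a weighting lets the relative contribution of SA and OLI be adjusted during training instead of being fixed in advance.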

The author Bharathi Raja Chakravarthi was supported in part by a research grant from Science Foundation Ireland (SFI) under Grant Number SFI/12/RC/2289P2 (Insight2), co-funded by the European Regional Development Fund and Irish Research Council grant IRCLA/2017/129 (CARDAMOM-Comparative Deep Models of Language for Minority and Historical Languages).


This research has not been funded by any company or organization

Compliance with Ethical Standards

Conflict of interest: The authors declare that they have no conflict of interest.

Availability of data and material: The datasets used in this paper are obtained from

Code availability: The data and approaches discussed in this paper are available at

Ethical Approval: This article does not contain any studies with human participants or animals performed by any of the authors.


  • M. Anagha, R. R. Kumar, K. Sreetha, and P. R. Raj (2015) Fuzzy logic based hybrid approach for sentiment analysis of malayalam movie reviews. In 2015 IEEE International Conference on Signal Processing, Informatics, Communication and Energy Systems (SPICES), pp. 1–4. Cited by: §2.1.
  • A. R. Appidi, V. K. Srirangam, D. Suhas, and M. Shrivastava (2020) Creation of corpus and analysis in code-mixed Kannada-English Twitter data for emotion prediction. In Proceedings of the 28th International Conference on Computational Linguistics, Barcelona, Spain (Online), pp. 6703–6709. External Links: Link Cited by: §2.1.
  • A. Asperti and M. Trentin (2020) Balancing reconstruction error and kullback-leibler divergence in variational autoencoders. IEEE Access 8 (), pp. 199440–199448. External Links: Document Cited by: §4.3.4.
  • K. Bali, J. Sharma, M. Choudhury, and Y. Vyas (2014) “I am borrowing ya mixing ?” an analysis of English-Hindi code mixing in Facebook. In Proceedings of the First Workshop on Computational Approaches to Code Switching, Doha, Qatar, pp. 116–126. External Links: Link, Document Cited by: item 1, §2.2.
  • S. Banerjee, B. R. Chakravarthi, and J. P. McCrae (2020) Comparison of pretrained embeddings to identify hate speech in indian code-mixed text. In 2020 2nd International Conference on Advances in Computing, Communication Control and Networking (ICACCCN), pp. 21–25. Cited by: §2.2.
  • U. Barman, A. Das, J. Wagner, and J. Foster (2014) Code mixing: a challenge for language identification in the language of social media. In Proceedings of the First Workshop on Computational Approaches to Code Switching, Doha, Qatar, pp. 13–23. External Links: Link, Document Cited by: §1.
  • S. Bhat (2012) Morpheme segmentation for Kannada standing on the shoulder of giants. In Proceedings of the 3rd Workshop on South and Southeast Asian Natural Language Processing, Mumbai, India, pp. 79–94. External Links: Link Cited by: §1.
  • E. Bisong (2019) Google colaboratory. In Building Machine Learning and Deep Learning Models on Google Cloud Platform: A Comprehensive Guide for Beginners, pp. 59–64. External Links: ISBN 978-1-4842-4470-8, Document, Link Cited by: §5.1.
  • [9] J. Brownlee How to calculate the kl divergence for machine learning. Note: Available at (2019/11/01) Cited by: §4.3.4.
  • R. Caruana (1997) Multitask learning. Machine Learning 28 (1), pp. 41–75. External Links: ISSN 1573-0565, Document, Link Cited by: §1, §2.3, §4.2.1.
  • B. R. Chakravarthi, N. Jose, S. Suryawanshi, E. Sherly, and J. P. McCrae (2020a) A sentiment analysis dataset for code-mixed Malayalam-English. In Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL), Marseille, France, pp. 177–184 (English). External Links: Link, ISBN 979-10-95546-35-1 Cited by: item 1, §3, §6.3.
  • B. R. Chakravarthi, V. Muralidaran, R. Priyadharshini, and J. P. McCrae (2020b) Corpus creation for sentiment analysis in code-mixed Tamil-English text. In Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL), Marseille, France, pp. 202–210 (English). External Links: Link, ISBN 979-10-95546-35-1 Cited by: §1, §3.
  • B. R. Chakravarthi, R. Priyadharshini, N. Jose, A. K. M, T. Mandl, P. K. Kumaresan, R. Ponnusamy, V,Hariharan, and E. McCrae (2021) Findings of the shared task on Offensive Language Identification in Tamil, Malayalam, and Kannada. In Proceedings of the First Workshop on Speech and Language Technologies for Dravidian Languages, Cited by: §2.2, §3.
  • B. R. Chakravarthi, R. Priyadharshini, V. Muralidaran, N. Jose, S. Suryawanshi, E. Sherly, and J. P. McCrae (2021) DravidianCodeMix: sentiment analysis and offensive language identification dataset for dravidian languages in code-mixed text. Language Resources and Evaluation. Cited by: §3.
  • B. R. Chakravarthi (2020) HopeEDI: a multilingual hope speech detection dataset for equality, diversity, and inclusion. In Proceedings of the Third Workshop on Computational Modeling of People’s Opinions, Personality, and Emotion’s in Social Media, Barcelona, Spain (Online), pp. 41–53. External Links: Link Cited by: §2.2.
  • S. Changpinyo, H. Hu, and F. Sha (2018) Multi-task learning for sequence tagging: an empirical study. In Proceedings of the 27th International Conference on Computational Linguistics, Santa Fe, New Mexico, USA, pp. 2965–2977. External Links: Link Cited by: §2.3.
  • G. Chhablani, Y. Bhartia, A. Sharma, H. Pandey, and S. Suthaharan (2021) NLRG at semeval-2021 task 5: toxic spans detection leveraging bert-based token classification and span prediction techniques. External Links: 2102.12254 Cited by: §2.2.
  • I. Clarke and J. Grieve (2017) Dimensions of abusive language on Twitter. In Proceedings of the First Workshop on Abusive Language Online, Vancouver, BC, Canada, pp. 1–10. External Links: Link, Document Cited by: §1.
  • A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott, L. Zettlemoyer, and V. Stoyanov (2020) Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, pp. 8440–8451. External Links: Link, Document Cited by: §4.1.7.
  • M. Crawshaw (2020) Multi-task learning with deep neural networks: a survey. arXiv preprint arXiv:2009.09796. Cited by: §6.
  • M. Dadvar, D. Trieschnigg, R. Ordelman, and F. de Jong (2013) Improving cyberbullying detection with user context. In Advances in Information Retrieval, P. Serdyukov, P. Braslavski, S. O. Kuznetsov, J. Kamps, S. Rüger, E. Agichtein, I. Segalovich, and E. Yilmaz (Eds.), Berlin, Heidelberg, pp. 693–696. External Links: ISBN 978-3-642-36973-5 Cited by: §2.2.
  • Z. Dai, Z. Yang, Y. Yang, J. G. Carbonell, Q. V. Le, and R. Salakhutdinov (2019) Transformer-xl: attentive language models beyond a fixed-length context. CoRR abs/1901.02860. External Links: Link, 1901.02860 Cited by: §4.1.6.
  • Z. Dai, Z. Yang, Y. Yang, J. Carbonell, Q. Le, and R. Salakhutdinov (2019) Transformer-XL: attentive language models beyond a fixed-length context. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 2978–2988. External Links: Link, Document Cited by: §4.1.5.
  • A. Das and S. Bandyopadhyay (2010) SentiWordNet for Indian languages. In Proceedings of the Eighth Workshop on Asian Language Resouces, Beijing, China, pp. 56–63. External Links: Link Cited by: §2.1.
  • A. Das and B. Gambäck (2014) Identifying languages at the word level in code-mixed Indian social media text. In Proceedings of the 11th International Conference on Natural Language Processing, Goa, India, pp. 378–387. External Links: Link Cited by: item 1.
  • P. De Boer, D. P. Kroese, S. Mannor, and R. Y. Rubinstein (2005) A tutorial on the cross-entropy method. Annals of operations research 134 (1), pp. 19–67. Cited by: §6.1.
  • J. Deng, W. Dong, R. Socher, L. Li, Kai Li, and Li Fei-Fei (2009) ImageNet: a large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, Vol. , pp. 248–255. External Links: Document Cited by: §4.1.1.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 4171–4186. External Links: Link, Document Cited by: Figure 2, Figure 3, §4.1.1, §4.1.3, §5.2.
  • N. Djuric, J. Zhou, R. Morris, M. Grbovic, V. Radosavljevic, and N. Bhamidipati (2015) Hate speech detection with comment embeddings. In Proceedings of the 24th international conference on world wide web, pp. 29–30. Cited by: §2.2.
  • A. Dobrescu, M. V. Giuffrida, and S. A. Tsaftaris (2020) Doing more with less: a multitask deep learning approach in plant phenotyping. Frontiers in plant science 11. Cited by: §6.
  • L. Duong, T. Cohn, S. Bird, and P. Cook (2015) Low resource dependency parsing: cross-lingual parameter sharing in a neural network parser. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), Beijing, China, pp. 845–850. External Links: Link, Document Cited by: item 1.
  • D. Eigen and R. Fergus (2015) Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In 2015 IEEE International Conference on Computer Vision (ICCV), pp. 2650–2658. External Links: Document Cited by: §5.1.
  • H. El Boukkouri, O. Ferret, T. Lavergne, H. Noji, P. Zweigenbaum, and J. Tsujii (2020) CharacterBERT: reconciling ELMo and BERT for word-level open-vocabulary representations from characters. In Proceedings of the 28th International Conference on Computational Linguistics, Barcelona, Spain (Online), pp. 6903–6915. External Links: Link Cited by: Figure 5, §4.1.8.
  • N. K. Ghanghor, P. Krishnamurthy, S. Thavareesan, R. Priyadharshini, and B. R. Chakravarthi (2021) IIITK@DravidianLangTech-EACL2021: Offensive Language Identification and Meme Classification in Tamil, Malayalam and Kannada. In Proceedings of the First Workshop on Speech and Language Technologies for Dravidian Languages, Online. Cited by: §2.2.
  • F. Grégoire and P. Langlais (2018) Extracting parallel sentences with bidirectional recurrent neural networks to improve machine translation. In Proceedings of the 27th International Conference on Computational Linguistics, Santa Fe, New Mexico, USA, pp. 1442–1453. External Links: Link Cited by: §6.3.
  • A. Hande, R. Priyadharshini, and B. R. Chakravarthi (2020) KanCMD: Kannada CodeMixed dataset for sentiment analysis and offensive language detection. In Proceedings of the Third Workshop on Computational Modeling of People’s Opinions, Personality, and Emotion’s in Social Media, Barcelona, Spain (Online), pp. 54–63. External Links: Link Cited by: §3.
  • K. Jain, A. Deshpande, K. Shridhar, F. Laumann, and A. Dash (2020) Indic-transformers: an analysis of transformer language models for indian languages. External Links: 2011.02323 Cited by: §6.1.
  • N. Jin, J. Wu, X. Ma, K. Yan, and Y. Mo (2020) Multi-task learning model based on multi-scale cnn and lstm for sentiment classification. IEEE Access 8, pp. 77060–77072. External Links: Document Cited by: §2.3.
  • D. Kakwani, A. Kunchukuttan, S. Golla, G. N.C., A. Bhattacharyya, M. M. Khapra, and P. Kumar (2020) IndicNLPSuite: monolingual corpora, evaluation benchmarks and pre-trained multilingual language models for Indian languages. In Findings of the Association for Computational Linguistics: EMNLP 2020, Online, pp. 4948–4961. External Links: Link, Document Cited by: §6.
  • S. Khanuja, D. Bansal, S. Mehtani, S. Khosla, A. Dey, B. Gopalan, D. K. Margam, P. Aggarwal, R. T. Nagipogu, S. Dave, et al. (2021) MuRIL: multilingual representations for indian languages. arXiv preprint arXiv:2103.10730. Cited by: §6.
  • D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §5.2.
  • O. Kuchaiev and B. Ginsburg (2017) Factorization tricks for LSTM networks. CoRR abs/1703.10722. External Links: Link, 1703.10722 Cited by: §4.1.
  • T. Kudo and J. Richardson (2018) SentencePiece: a simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Brussels, Belgium, pp. 66–71. External Links: Link, Document Cited by: §6.
  • T. Kudo (2018) Subword regularization: improving neural network translation models with multiple subword candidates. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, pp. 66–75. External Links: Link, Document Cited by: §6.1.
  • S. Kullback and R. A. Leibler (1951) On information and sufficiency. Ann. Math. Statist. 22 (1), pp. 79–86. External Links: Document, Link Cited by: §4.3.4.
  • R. Kumar, A. Kr. Ojha, B. Lahiri, M. Zampieri, S. Malmasi, V. Murdock, and D. Kadar (Eds.) (2020) Proceedings of the second workshop on trolling, aggression and cyberbullying. European Language Resources Association (ELRA), Marseille, France. External Links: Link, ISBN 979-10-95546-56-6 Cited by: §2.2.
  • S. S. Kumar, M. A. Kumar, K. Soman, and P. Poornachandran (2020) Dynamic mode-based feature with random mapping for sentiment analysis. In Intelligent systems, technologies and applications, pp. 1–15. Cited by: §2.1.
  • Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut (2019) ALBERT: A lite BERT for self-supervised learning of language representations. CoRR abs/1909.11942. External Links: Link, 1909.11942 Cited by: §4.1.3, §6.2, §6.
  • N. Li, C.-Y. Chow, and J.-D. Zhang (2020) SEML: a semi-supervised multi-task learning framework for aspect-based sentiment analysis. IEEE Access 8, pp. 189287–189297. External Links: Document Cited by: §2.3.
  • T. Lin, P. Goyal, R. Girshick, K. He, and P. Dollar (2017) Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Cited by: §4.3.3, §4.3.3.
  • V. Liu and J. R. Curran (2006) Web text corpus for natural language processing. In 11th Conference of the European Chapter of the Association for Computational Linguistics, Trento, Italy. External Links: Link Cited by: §6.1.
  • Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov (2019) RoBERTa: A robustly optimized BERT pretraining approach. CoRR abs/1907.11692. External Links: Link, 1907.11692 Cited by: §4.1.4.
  • I. Loshchilov and F. Hutter (2019) Decoupled weight decay regularization. In International Conference on Learning Representations, External Links: Link Cited by: 3rd item, §5.2, Table 7.
  • Y. Ma, L. Zhao, and J. Hao (2020) XLP at SemEval-2020 task 9: cross-lingual models with focal loss for sentiment analysis of code-mixing language. In Proceedings of the Fourteenth Workshop on Semantic Evaluation, Barcelona (online), pp. 975–980. External Links: Link Cited by: §4.3.3.
  • T. Mandl, S. Modha, A. Kumar M, and B. R. Chakravarthi (2020) Overview of the hasoc track at fire 2020: hate speech and offensive language identification in tamil, malayalam, hindi, english and german. In Forum for Information Retrieval Evaluation, FIRE 2020, New York, NY, USA, pp. 29–32. External Links: ISBN 9781450389785, Link, Document Cited by: §2.2.
  • K. Maninis, I. Radosavovic, and I. Kokkinos (2019) Attentive single-tasking of multiple tasks. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1851–1860. Cited by: §6.1.
  • H. Martínez Alonso and B. Plank (2017) When is multitask learning effective? semantic sequence prediction under varying data conditions. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, Valencia, Spain, pp. 44–53. External Links: Link Cited by: §1.
  • V. Maslej-Krešňáková, M. Sarnovský, P. Butka, and K. Machová (2020) Comparison of deep learning models and various text pre-processing techniques for the toxic comments classification. Applied Sciences 10 (23), pp. 8631. Cited by: §6.2.
  • L. Mou and X. X. Zhu (2018) Vehicle instance segmentation from aerial image and video using a multitask learning residual fully convolutional network. IEEE Transactions on Geoscience and Remote Sensing 56 (11), pp. 6699–6711. External Links: Document Cited by: §2.3.
  • K. P. Murphy (2012) Machine learning: a probabilistic perspective. The MIT Press, London, England. Cited by: §4.3.1, §4.3.4.
  • D. S. Nair, J. P. Jayan, R. Rajeev, and E. Sherly (2015) Sentiment analysis of malayalam film review using machine learning techniques. In 2015 international conference on advances in computing, communications and informatics (ICACCI), pp. 2381–2384. Cited by: §2.1.
  • D. S. Nair, J. P. Jayan, R. R. Rajeev, and E. Sherly (2014) SentiMa - sentiment extraction for malayalam. In 2014 International Conference on Advances in Computing, Communications and Informatics (ICACCI), pp. 1719–1723. External Links: Document Cited by: §2.1.
  • P. Nakov, A. Ritter, S. Rosenthal, F. Sebastiani, and V. Stoyanov (2016) SemEval-2016 task 4: sentiment analysis in Twitter. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), San Diego, California, pp. 1–18. External Links: Link, Document Cited by: §6.3.
  • C. Nobata, J. Tetreault, A. Thomas, Y. Mehdad, and Y. Chang (2016) Abusive language detection in online user content. In Proceedings of the 25th international conference on world wide web, pp. 145–153. Cited by: §2.2.
  • X. Ouyang, S. Xu, C. Zhang, P. Zhou, Y. Yang, G. Liu, and X. Li (2019) A 3d-cnn and lstm based multi-task learning architecture for action recognition. IEEE Access 7, pp. 40757–40770. External Links: Document Cited by: §2.3.
  • R. Padmamala and V. Prema (2017) Sentiment analysis of online tamil contents using recursive neural network models approach for tamil language. In 2017 IEEE International Conference on Smart Technologies and Management for Computing, Communication, Controls, Energy and Materials (ICSTM), pp. 28–31. External Links: Document Cited by: §2.1.
  • S. J. Pan and Q. Yang (2009) A survey on transfer learning. IEEE Transactions on knowledge and data engineering 22 (10), pp. 1345–1359. Cited by: 1st item.
  • B. Pang and L. Lee (2008) Opinion mining and sentiment analysis. Found. Trends Inf. Retr. 2 (1–2), pp. 1–135. External Links: ISSN 1554-0669, Link, Document Cited by: §1.
  • A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Köpf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala (2019) PyTorch: an imperative style, high-performance deep learning library. In NeurIPS, Cited by: §5.1.
  • B. G. Patra, D. Das, A. Das, and R. Prasath (2015) Shared task on sentiment analysis in indian languages (sail) tweets - an overview. In Mining Intelligence and Knowledge Exploration, R. Prasath, A. K. Vuppala, and T. Kathirvalavakumar (Eds.), Cham, pp. 650–655. External Links: ISBN 978-3-319-26832-3 Cited by: §2.1.
  • F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, G. Louppe, P. Prettenhofer, R. Weiss, R. J. Weiss, J. VanderPlas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay (2011) Scikit-learn: machine learning in python. J. Mach. Learn. Res. 12, pp. 2825–2830. Cited by: §5.1.
  • Y. Peng, Q. Chen, and Z. Lu (2020) An empirical study of multi-task learning on BERT for biomedical text mining. In Proceedings of the 19th SIGBioMed Workshop on Biomedical Language Processing, Online, pp. 205–214. External Links: Link, Document Cited by: §2.3.
  • S. Phani, S. Lahiri, and A. Biswas (2016) Sentiment analysis of tweets in three Indian languages. In Proceedings of the 6th Workshop on South and Southeast Asian Natural Language Processing (WSSANLP2016), Osaka, Japan, pp. 93–102. External Links: Link Cited by: §2.1.
  • T. Pires, E. Schlinger, and D. Garrette (2019) How multilingual is multilingual BERT?. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 4996–5001. External Links: Link, Document Cited by: §5.2, §6.
  • F. M. Plaza-del-Arco, M. D. Molina-González, L. A. Ureña-López, and M. T. Martín-Valdivia (2021) Comparing pre-trained language models for spanish hate speech detection. Expert Systems with Applications 166, pp. 114120. External Links: ISSN 0957-4174, Document Cited by: §2.2.
  • S. Prabhu, U. Narayan, A. Debnath, S. S, and M. Shrivastava (2020) Detection and annotation of events in Kannada. In 16th Joint ACL - ISO Workshop on Interoperable Semantic Annotation PROCEEDINGS, Marseille, pp. 88–93 (English). External Links: Link, ISBN 979-10-95546-48-1 Cited by: §1.
  • V. Prokhorov, E. Shareghi, Y. Li, M. T. Pilehvar, and N. Collier (2019) On the importance of the Kullback-Leibler divergence term in variational autoencoders for text generation. In Proceedings of the 3rd Workshop on Neural Generation and Translation, Hong Kong, pp. 118–127. External Links: Link, Document Cited by: §4.3.4.
  • K. Puranik, A. Hande, R. Priyadharshini, S. Thavareesan, and B. R. Chakravarthi (2021) IIITT@LT-EDI-EACL2021-Hope Speech Detection: There is always hope in Transformers. In Proceedings of the First Workshop on Language Technology for Equality, Diversity and Inclusion. Cited by: §6.
  • A. Radford (2018) Improving language understanding by generative pre-training. Cited by: 1st item.
  • A. Rakhlin (2016) MIT Online Methods in Machine Learning 6.883, Lecture Notes: Multiclass and multilabel problems. Note: URL: Last visited on 2021/02/08 Cited by: §4.3.2.
  • T. Ranasinghe and M. Zampieri (2021) MUDES: multilingual detection of offensive spans. External Links: 2102.09665 Cited by: §2.2.
  • P. Rani, S. Suryawanshi, K. Goswami, B. R. Chakravarthi, T. Fransen, and J. P. McCrae (2020) A comparative study of different state-of-the-art hate speech detection methods in hindi-english code-mixed data. In Proceedings of the Second Workshop on Trolling, Aggression and Cyberbullying, pp. 42–48. Cited by: §2.2.
  • A. H. Razavi, D. Inkpen, S. Uritsky, and S. Matwin (2010) Offensive language detection using multi-level classification. In Canadian Conference on Artificial Intelligence, pp. 16–27. Cited by: §2.2.
  • S. Reddy and S. Sharoff (2011) Cross language POS taggers (and other tools) for Indian languages: an experiment with Kannada using Telugu resources. In Proceedings of the Fifth International Workshop On Cross Lingual Information Access, Chiang Mai, Thailand, pp. 11–19. External Links: Link Cited by: §1.
  • S. Ruder (2017) An overview of multi-task learning in deep neural networks. CoRR abs/1706.05098. External Links: Link, 1706.05098 Cited by: §4.2.1, §6.
  • V. Sanh, L. Debut, J. Chaumond, and T. Wolf (2019) DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. CoRR abs/1910.01108. External Links: Link, 1910.01108 Cited by: §4.1.2, §4.1.2.
  • M. Schuster and K. Nakajima (2012) Japanese and korean voice search. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5149–5152. External Links: Document Cited by: §6.1.
  • S. Se, R. Vinayakumar, M. A. Kumar, and K. Soman (2016) Predicting the sentimental reviews in tamil movie using machine learning algorithms. Indian Journal of Science and Technology 9 (45), pp. 1–5. Cited by: §2.1.
  • R. Sennrich, B. Haddow, and A. Birch (2016) Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Berlin, Germany, pp. 1715–1725. External Links: Link, Document Cited by: §4.1.4, §6.1.
  • A. Severyn, A. Moschitti, O. Uryupina, B. Plank, and K. Filippova (2014) Opinion mining on YouTube. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Baltimore, Maryland, pp. 1252–1261. External Links: Link, Document Cited by: §1.
  • S. Shalev-Shwartz and S. Ben-David (2014) Understanding machine learning: from theory to algorithms. Cambridge University Press, New York. Cited by: §4.3.2.
  • N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. Le, G. Hinton, and J. Dean (2017) Outrageously large neural networks: the sparsely-gated mixture-of-experts layer. External Links: 1701.06538 Cited by: §4.1.
  • S. Soumya and K. Pramod (2020) Sentiment analysis of malayalam tweets using machine learning techniques. ICT Express 6 (4), pp. 300–305. Cited by: §2.1.
  • B. S. Sowmya Lakshmi and B. R. Shambhavi (2017) An automatic language identification system for code-mixed english-kannada social media text. In 2017 2nd International Conference on Computational Systems and Information Technology for Sustainable Solution (CSITSS), pp. 1–5. External Links: Document Cited by: §2.1.
  • N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov (2014) Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15 (1), pp. 1929–1958. External Links: ISSN 1532-4435 Cited by: Table 7.
  • A. Tanwar and P. Majumder (2020) Translating morphologically rich indian languages under zero-resource conditions. ACM Trans. Asian Low-Resour. Lang. Inf. Process. 19 (6). External Links: ISSN 2375-4699, Link, Document Cited by: §6.1.
  • W. L. Taylor (1953) “Cloze procedure”: a new tool for measuring readability. Journalism & Mass Communication Quarterly 30, pp. 415–433. Cited by: 1st item.
  • S. Thavareesan and S. Mahesan (2019) Sentiment analysis in tamil texts: a study on machine learning techniques and feature representation. In 2019 14th Conference on Industrial and Information Systems (ICIIS), pp. 320–325. External Links: Document Cited by: §2.1.
  • R. Thilagavathi and K. Krishnakumari (2016) Tamil english language sentiment analysis system. International Journal of Engineering Research & Technology (IJERT) 4, pp. 114–118. Cited by: §2.1.
  • Y. Tian, T. Galery, G. Dulcinati, E. Molimpakis, and C. Sun (2017) Facebook sentiment: reactions and emojis. In Proceedings of the Fifth International Workshop on Natural Language Processing for Social Media, Valencia, Spain, pp. 11–16. External Links: Link, Document Cited by: §1.
  • A. Tontodimamma, E. Nissi, A. Sarra, and L. Fontanella (2021) Thirty years of research into hate speech: topics of interest and their evolution. Scientometrics 126 (1), pp. 157–179. Cited by: §2.2.
  • D. Tula, P. Potluri, S. Ms, S. Doddapaneni, P. Sahu, R. Sukumaran, and P. Patwa (2021) Bitions@DravidianLangTech-EACL2021: ensemble of multilingual language models with pseudo labeling for offence detection in Dravidian languages. In Proceedings of the First Workshop on Speech and Language Technologies for Dravidian Languages, Kyiv, pp. 291–299. External Links: Link Cited by: §4.3.3.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017) Attention is all you need. External Links: 1706.03762 Cited by: Figure 1, §4.1.
  • J. Weston and C. Watkins (1999) Support vector machines for multi-class pattern recognition. In ESANN, Cited by: §4.3.2.
  • T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. Le Scao, S. Gugger, M. Drame, Q. Lhoest, and A. Rush (2020) Transformers: state-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Online, pp. 38–45. External Links: Link, Document Cited by: §5.1.
  • Y. Yang and T. M. Hospedales (2017) Trace norm regularised deep multi-task learning. ArXiv abs/1606.04038. Cited by: item 2.
  • Z. Yang, Z. Dai, Y. Yang, J. G. Carbonell, R. Salakhutdinov, and Q. V. Le (2019) XLNet: generalized autoregressive pretraining for language understanding. CoRR abs/1906.08237. External Links: Link, 1906.08237 Cited by: §4.1.6.
  • K. Yasaswini, K. Puranik, A. Hande, R. Priyadharshini, S. Thavareesan, and B. R. Chakravarthi (2021) IIITT@DravidianLangTech-EACL2021: Transfer Learning for Offensive Language Detection in Dravidian Languages. In Proceedings of the First Workshop on Speech and Language Technologies for Dravidian Languages. Cited by: §2.2.
  • M. Zampieri, S. Malmasi, P. Nakov, S. Rosenthal, N. Farra, and R. Kumar (2019) SemEval-2019 task 6: identifying and categorizing offensive language in social media (OffensEval). In Proceedings of the 13th International Workshop on Semantic Evaluation, Minneapolis, Minnesota, USA, pp. 75–86. External Links: Link, Document Cited by: §2.2.
  • M. Zampieri, P. Nakov, S. Rosenthal, P. Atanasova, G. Karadzhov, H. Mubarak, L. Derczynski, Z. Pitenis, and Ç. Çöltekin (2020) SemEval-2020 task 12: multilingual offensive language identification in social media (OffensEval 2020). In Proceedings of the Fourteenth Workshop on Semantic Evaluation, Barcelona (online), pp. 1425–1447. External Links: Link Cited by: §2.2.
  • P. Zhai, Y. Tao, H. Chen, T. Cai, and J. Li (2020) Multi-task learning for lung nodule classification on chest ct. IEEE Access 8, pp. 180317–180327. External Links: Document Cited by: §2.3.
  • H. Zhang, S. Sun, Y. Hu, J. Liu, and Y. Guo (2020a) Sentiment classification for chinese text based on interactive multitask learning. IEEE Access 8, pp. 129626–129635. External Links: Document Cited by: §2.3.
  • K. Zhang, L. Wu, Z. Zhu, and J. Deng (2020b) A multitask learning model for traffic flow and speed forecasting. IEEE Access 8, pp. 80707–80715. External Links: Document Cited by: §2.3.
  • Y. Zhang and Q. Yang (2018) A survey on multi-task learning. External Links: 1707.08114 Cited by: §2.3.
  • Z. Zhang, C. Chen, G. Dai, W. Li, and D. Yeung (2014) Multicategory large margin classification methods: hinge losses vs. coherence functions. Artificial Intelligence 215, pp. 55–78. External Links: ISSN 0004-3702, Document Cited by: §4.3.2.