PELESent: Cross-domain polarity classification using distant supervision

The enormous amount of text published daily by Internet users has fostered the development of methods to analyze this content in several natural language processing areas, such as sentiment analysis. The main goal of this task is to classify the polarity of a message. Even though many approaches have been proposed for sentiment analysis, some of the most successful ones rely on the availability of large annotated corpora, whose creation is expensive and time-consuming. In recent years, distant supervision has been used to obtain larger datasets. Inspired by these techniques, in this paper we extend such approaches to incorporate emojis, popular graphic symbols used in electronic messages, in order to create a large sentiment corpus for Portuguese. Trained on almost one million tweets, several models were tested on both same-domain and cross-domain corpora. Our methods obtained very competitive results on five annotated corpora from mixed domains (Twitter and product reviews), which supports the domain-independent property of this approach. In addition, our results suggest that the combination of emoticons and emojis is able to properly capture the sentiment of a message.



I Introduction

In the last few years, Sentiment Analysis has become a prominent field in natural language processing (NLP), mostly due to its direct application in several real-world scenarios [pang2008opinion], such as product reviews, government intelligence, and stock market prediction. One of the main tasks in Sentiment Analysis is polarity classification, i.e., classifying texts into categories according to the emotions expressed in them. In general, the classes are positive, negative and neutral.

A popular application of polarity classification is in social media content. Microblogging and social network websites, such as Twitter, have been used to express personal thoughts. According to Twitter's website, more than 500 million short messages, known as tweets, are posted each day. The analysis of this type of content is particularly challenging due to its specific language, which is mostly informal, with spelling errors, out-of-vocabulary words, and the use of emoticons and emojis to express ideas and sentiments.

Machine learning methods have been widely applied to polarity classification tasks in the context of social networks. This is particularly evident in shared tasks such as the SemEval Sentiment Analysis tasks [rosenthal2015semeval, nakov2016semeval], where these methods usually outperform lexical-based approaches. However, a major drawback of machine learning is its high dependency on large annotated corpora. Since manual annotation is usually time-consuming and expensive [pan2010cross], many non-English languages lack this type of resource or, where such resources exist, they are very limited and specific, as is the case for Portuguese.

In this paper, we adapt a distant supervision approach [go2009twitter] to annotate a large number of tweets in Portuguese and use them to train state-of-the-art methods for polarity classification. We applied these methods to manually annotated corpora from the same domain (Twitter) and from a different domain (product reviews). The obtained results indicate that the proposed approach is well suited to both the same-domain and the cross-domain settings. Moreover, it is a powerful alternative for producing sentiment analysis corpora with less effort than manual annotation.

This paper is organized as follows. Section II gives a brief overview of some approaches for sentiment analysis and presents some works that have applied distant supervision to this task. Our approach is described in Section III. The evaluation corpora, machine learning algorithms and results are given in Section IV. Finally, our conclusions are drawn in Section V.

II Related Work

Currently, methods devised to perform sentiment analysis and, more specifically, polarity classification range from machine learning to lexical-based approaches. While machine learning methods have proved useful in scenarios where a large amount of training data is available along with top-quality NLP resources (such as taggers, parsers and others), they usually perform poorly when these are missing. Since most non-English languages, Portuguese among them, face resource limitations, lexical-based approaches have become very popular. Some works following this line are [souza2011construction, BalageFilho2013, avancco2014lexicon].

Another alternative for languages with fewer resources is the use of hybrid systems, which combine machine learning and lexical-based methods. Avanço et al.[avanccoimproving] showed that this combination outperforms both individual approaches. This may imply that the development of better individual elements will lead to better results in the final combination.

Machine learning approaches rely on document representations, normally vectorial ones with features like n-grams [pang2008opinion]; a simple example is the bag-of-words model. Once a representation has been chosen, several classification methods are available, such as Support Vector Machines (SVM), Naive Bayes (NB), Maximum Entropy (MaxEnt), Conditional Random Fields (CRF), and ensembles of classifiers.


Apart from the traditional features, such as n-grams, some researchers have taken advantage of word embeddings, which are known to capture some linguistic properties, such as semantic and syntactic features. A well-known example of word embeddings is Word2Vec [mikolov2013efficient, mikolov2013distributed]. Algebraic operations, such as sum or average, can be applied to convert word vectors into a sentence or document vector [zhou2016ecnu, correa2017nilc]. However, this representation does not consider the order of the words in the sentence.

Paragraph vectors [mikolovdoc2vec] (also known as Doc2Vec) can be understood as a generalization of Word2Vec for larger blocks of text, such as paragraphs or documents. This technique has obtained state-of-the-art results on sentiment analysis for two datasets of movie reviews [mikolovdoc2vec]. These dense representations are trained to predict the words in those blocks. Le and Mikolov [mikolovdoc2vec] proposed two models, one of which accounts for the word order.

In addition, deep neural networks also consider the word order. These methods have achieved good results in sentiment analysis, as shown in [socher2013recursive, kalchbrenner2014convolutional, kim2014cnn] and in the SemEval Sentiment Analysis Tasks [rosenthal2015semeval, nakov2016semeval]. Nevertheless, these approaches need large datasets for training. Distant supervision is a good alternative to obtain these datasets for the training/pre-training of deep neural networks [kalchbrenner2014convolutional, severyn2015unitn, deriu2016swisscheese].

Distant supervision is an alternative way to create large datasets without the need for manual annotation. Some works have reported the use of emoticons as semantic indicators of sentiment [read2005using, go2009twitter, pak2010twitter, severyn2015unitn], while others use emoticons and hashtags for the same purpose [davidov2010enhanced, kouloumpis2011twitter]. Go et al. [go2009twitter], the first work to apply distant supervision to Twitter data, collected approximately 1.6 million tweets containing positive and negative emoticons (e.g. ":)" and ":("), equally distributed into two classes. They combined sets of features (unigrams, bigrams, part-of-speech (POS) tags) in order to train machine learning algorithms (NB, MaxEnt and SVM) and evaluated them on manually annotated datasets. The best accuracy was achieved using unigrams and bigrams as features for a MaxEnt classifier.

Severyn and Moschitti [severyn2015unitn] used distant supervision to pre-train a Convolutional Neural Network (CNN) with an architecture similar to the one proposed by Kim [kim2014cnn]. The network is composed of a first layer that converts words into dense vectors, followed by a single convolutional layer with a non-linear activation function, max pooling and softmax. Deriu et al. [deriu2016swisscheese] used a combination of 2 CNNs with a Random Forest classifier. However, this approach did not obtain improvements with distant supervision.

Despite the numerous studies and investigations of different techniques and methods for polarity classification, the problem of relying on large annotated corpora remains open, and the difficulty is intensified in non-English languages. In this paper, our contributions are a framework adapted to build a polarity classification corpus for Portuguese, the built corpus itself, and an evaluation of different state-of-the-art methods trained on this corpus, on both same-domain and cross-domain corpora.

III Approach

Following the approach of Go et al. [go2009twitter], we initially collected a large number of tweets in order to create the distant supervision corpus. Only tweets in Portuguese were crawled, and no specific queries were employed. In total, 41 million tweets were collected.

After collecting the tweets, the next step was to split them into positive and negative classes. In order to do so, we used lists of emojis and emoticons selected according to the sentiment they convey. The polarity of a tweet is therefore determined by the emojis and emoticons it contains: if it only contains positive ones (from the positive list), its polarity is assigned as positive, and likewise for the negative class. If a tweet contains both positive and negative elements, it is discarded since it is likely to be ambiguous. Following this idea, we used the same list of emoticons used by Go et al. [go2009twitter], which is presented in Table I. Go et al. [go2009twitter] did not use emojis, but these graphic symbols are also employed to convey ideas and sentiments [novak2015sentiment]. In contrast to the small set of emoticons, there are hundreds of possible emojis. Therefore, we selected a representative list with positive and negative emojis. All the emojis conveying positive emotion are presented in Fig. 1. Fig. 2 illustrates the selected ones with negative emotion.

After filtering the tweets by the aforementioned criteria, we obtained a labeled corpus comprising 554,623 positive tweets and 425,444 negative ones. This corpus was used to train the machine learning methods. It is important to highlight that emojis and emoticons were removed from the tweets in the final corpus, so that their presence as a sentiment indicator is not learned by the models.

In addition to the filtering process, some preprocessing steps were performed to improve the corpus quality. Details about the preprocessing steps are given in the Supplementary Material, Section A. After these steps, tweets containing fewer than 4 tokens were discarded from the corpus. The complete framework (tweet collection, filtering and preprocessing methods) along with all experimental evaluation will be made available.

Polarity    Emoticons
Positive    :)  :-)  :D  =)
Negative    :(  :-(
TABLE I: All emoticons used to represent emotion.
Fig. 1: All emojis used to represent positive emotion. Their respective Unicode code points are: (a) U+1F60A, (b) U+1F60B, (c) U+1F60D, (d) U+1F603, (e) U+1F606, (f) U+1F600, and (g) U+1F61D.
Fig. 2: All emojis used to represent negative emotion. Their respective Unicode code points are: (a) U+1F620, (b) U+1F627, (c) U+1F61E, (d) U+1F628, (e) U+1F626, (f) U+1F623, (g) U+1F614, (h) U+1F629, (i) U+1F612, (j) U+1F621, (k) U+2639, and (l) U+1F61F.
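
The labeling rule described in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the released framework: the function name label_tweet is hypothetical, markers are detected by plain substring matching rather than on the tokenized tweet, and the four-token minimum is applied right after marker removal instead of after the full preprocessing.

```python
# Markers taken from Table I (emoticons) and Figs. 1-2 (emoji code points).
POSITIVE = [":)", ":-)", ":D", "=)"] + [
    chr(c) for c in (0x1F60A, 0x1F60B, 0x1F60D, 0x1F603,
                     0x1F606, 0x1F600, 0x1F61D)
]
NEGATIVE = [":(", ":-("] + [
    chr(c) for c in (0x1F620, 0x1F627, 0x1F61E, 0x1F628, 0x1F626, 0x1F623,
                     0x1F614, 0x1F629, 0x1F612, 0x1F621, 0x2639, 0x1F61F)
]
# Remove longer markers first so ":-(" is not partially consumed by ":(".
ALL_MARKERS = sorted(POSITIVE + NEGATIVE, key=len, reverse=True)


def label_tweet(text, min_tokens=4):
    """Return (cleaned_text, label), or None when the tweet is discarded.

    Assumption: substring matching stands in for the token-level matching
    that a real pipeline would perform after tokenization.
    """
    has_pos = any(marker in text for marker in POSITIVE)
    has_neg = any(marker in text for marker in NEGATIVE)
    if has_pos == has_neg:
        # No marker at all, or both polarities present (ambiguous).
        return None
    # Strip the markers so a model cannot learn them as a shortcut.
    for marker in ALL_MARKERS:
        text = text.replace(marker, " ")
    tokens = text.split()
    if len(tokens) < min_tokens:
        return None
    label = "positive" if has_pos else "negative"
    return " ".join(tokens), label
```

For instance, a tweet such as "adorei o show de ontem :)" would be kept as a positive example with the emoticon removed, while a tweet mixing ":)" and ":(" would be dropped.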

IV Experimental Evaluation

In order to evaluate the quality of the corpus built using distant supervision, we trained state-of-the-art methods for polarity classification and applied the learned models to 5 well-known manually annotated sentiment corpora. In the following, we present these corpora along with the message polarity classification methods and, finally, the obtained results.

IV-A Corpora

Sentiment classifiers are usually trained on manually annotated corpora. Because sentiments may be expressed differently in different domains [pan2010cross], it is common to create domain-specific corpora. Since we intend to create a robust and generic corpus that is not domain-specific, we selected 5 corpora for evaluation, 2 from the same domain (Twitter) and 3 from a different domain (product reviews). Below, we present the corpora that were used.

Brazilian Presidential Election [silva2011effective]

This dataset is formed by tweets about the Brazilian presidential election held in 2010. The corpus is divided into two parts, one referencing Dilma Rousseff (BPE-Dilma) and the other José Serra (BPE-Serra), the two most popular candidates in the election. The corpora were manually annotated as positive or negative, and were used to evaluate stream-based sentiment analysis systems.

Buscapé [hartmann2014large]

This dataset is formed by product reviews extracted from the Buscapé website. The documents were automatically labeled based on two pieces of information given by the users. The first (Buscape-1) is based on a recommendation tag, while the second (Buscape-2) is based on a 5-star scale (1-2 stars for negative and 4-5 stars for positive). Both corpora are balanced between the two classes, even though there is a notable difference in their sizes, possibly due to the low use of the recommendation tag.

Mercado Livre [avanccoimproving]

Similar to the Buscapé dataset, this corpus is formed by product reviews from the online marketplace Mercado Livre. The corpus was also automatically annotated based on a 5-star scale given by the authors of the reviews. The dataset is balanced between the positive and negative classes.

Table II presents a summary of the corpora.

Dataset Total Positive Negative
Mercado Livre
TABLE II: Datasets used in the evaluation of the system.

IV-B Machine Learning Methods

Machine learning has dominated the area of sentiment analysis, mostly because of its high performance when manually annotated data is available. However, given the great variety of methods, there is no consensus about which method is the best in this scenario. In the last editions of the SemEval Sentiment Analysis Task, most of the best methods/systems used deep learning techniques [rosenthal2015semeval, nakov2016semeval]. In this work, the evaluated methods range from simple linear models for classification using vector space models to hybrid (machine learning and lexical-based) and deep learning methods. The idea was to thoroughly evaluate the quality of the corpus regardless of the technique being used for learning. Below, each method is briefly described.

Logistic Regression (LR)

Also known as logit regression, LR can be understood as a generalization of linear regression models to the binary classification scenario, where a sigmoid function outputs the class probabilities. In this paper, the logistic regression model predicts the class probabilities of a text, where the classes are "positive" and "negative". As input for this classifier, three text representations were used: (1) a bag-of-words model (LR+tfidf), where each document (tweet or review) is represented by its set of words weighted by tf-idf [salton1989automatic]; (2) a word-embedding-based model (LR+w2v), where each document is represented by the weighted average of the embedding vectors (generated by Word2Vec [mikolov2013distributed, mikolov2013efficient]) of the words that compose the document, with the weights defined by tf-idf; (3) the Paragraph Vector model (LR+d2v), which uses a neural network to generate embeddings for words and documents simultaneously in an unsupervised manner. In this last model, only the vectorial representations of documents were used by the classifier.
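
To make representation (2) concrete, the sketch below averages word vectors using tf-idf weights. It is an illustration only: the idf formula (log(N/df) + 1), the function names and the toy two-dimensional embeddings are assumptions chosen for readability, not the configuration used in the experiments.

```python
import math
from collections import Counter


def idf_table(docs):
    """Inverse document frequency per word; log(N/df) + 1 is an assumed variant."""
    n_docs = len(docs)
    doc_freq = Counter(word for doc in docs for word in set(doc))
    return {word: math.log(n_docs / df) + 1.0 for word, df in doc_freq.items()}


def doc_vector(doc, idf, embeddings):
    """tf-idf weighted average of the embedding vectors of the words in doc."""
    dim = len(next(iter(embeddings.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for word, tf in Counter(doc).items():
        if word not in embeddings or word not in idf:
            continue  # out-of-vocabulary words are skipped
        weight = tf * idf[word]
        weight_sum += weight
        for i, value in enumerate(embeddings[word]):
            total[i] += weight * value
    if weight_sum == 0.0:
        return total  # all-zero vector when no word is known
    return [value / weight_sum for value in total]


# Toy data (hypothetical): two tokenized documents and 2-d embeddings.
docs = [["bom", "filme"], ["filme", "ruim"]]
emb = {"bom": [1.0, 0.0], "ruim": [-1.0, 0.0], "filme": [0.0, 1.0]}
idf = idf_table(docs)
vec = doc_vector(docs[0], idf, emb)
```

In this toy example "bom" appears in fewer documents than "filme", so it receives a larger weight and pulls the document vector toward its own embedding. The resulting vectors would then be fed to the logistic regression classifier.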

Convolutional Neural Networks (CNNs)

With the popularity of deep learning, CNNs have been applied to many different contexts, including several NLP tasks [collobert2011nlp] and, more specifically, sentiment analysis [kalchbrenner2014convolutional, kim2014cnn, severyn2015unitn, deriu2016swisscheese]. Our CNN is similar to the architecture proposed by Kim [kim2014cnn], which uses a single convolutional layer. In this architecture, the network receives as input a matrix representing the document, and each word in the document is represented by a dense continuous vector. The output of the network is the probability of a document being negative or positive.

Recurrent Convolutional Neural Networks (RCNNs)

This deep neural architecture uses both convolutional and recurrent layers. Recently explored by many works in NLP [kalchbrenner2013recurrent, treviso2017sentence, lai2015recurrent], this architecture has been successfully applied to sentiment analysis [lai2015recurrent, nakov2016semeval]. Our architecture consists of a slight modification of the one used by Treviso et al.[treviso2017sentence], where the final layer returns the probability for the whole document, indicating a positive/negative polarity. Using this combination of convolutional and recurrent layers, we explored the principle that nearby words have a greater influence in the classification, while distant words may also have some impact.


Hybrid

This method is a combination of two classifiers previously used for sentiment classification in cross-domain corpora [avanccoimproving] and follows the same setting introduced by Avanço [avancco2015normalizaccao]. The method consists of an SVM classifier combined with a lexical-based approach. The documents are represented by arrays of features including a binary bag-of-words (presence/absence of terms), emoticons, sentiment words and POS tags. Documents located near the separating hyperplane learned by the SVM (within a given threshold) are considered to be uncertain. Those documents are then classified with a lexical-based approach that uses linguistic rules for polarity classification in Portuguese.
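
The decision rule of this combination can be sketched as follows. The threshold value is not recoverable from the text, so the default of 0.5 below is purely hypothetical, as are the function name and the tie-breaking rule used when the lexical score is exactly zero.

```python
def hybrid_predict(svm_margin, lexical_score, threshold=0.5):
    """Combine an SVM decision value with a lexical polarity score.

    svm_margin: signed distance to the separating hyperplane
                (positive values mean the positive class).
    lexical_score: summed polarity from the lexical-based classifier.
    threshold: hypothetical uncertainty band around the hyperplane.
    """
    if abs(svm_margin) >= threshold:
        # The SVM is confident enough: keep its decision.
        return "positive" if svm_margin > 0 else "negative"
    if lexical_score != 0:
        # Uncertain document: defer to the lexical-based classifier.
        return "positive" if lexical_score > 0 else "negative"
    # Assumed tie-break: fall back to the sign of the SVM margin.
    return "positive" if svm_margin >= 0 else "negative"
```

With this rule, a document with margin 0.1 and lexical score -2 would be labeled negative, because the SVM output falls inside the uncertainty band and the lexical classifier takes over.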

For all methods, well-known machine learning libraries were used, such as Scikit-learn [pedregosa2011scikit] and Keras [chollet2015keras]. Particularities such as parameters, details about the architecture, initializations and others can be found in the Supplementary Material, Section B.

IV-C Results and Discussion

To evaluate and compare the methods in each corpus, F1 score (macro-averaged), recall (macro-averaged) and accuracy were chosen, mostly because of their traditional use in sentiment analysis [rosenthal2015semeval, nakov2016semeval].
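
For reference, the three measures can be computed as in the sketch below, written in plain Python for a two-class setting. The function name evaluate and the toy labels are illustrative assumptions.

```python
def evaluate(y_true, y_pred, labels=("positive", "negative")):
    """Return macro-averaged F1, macro-averaged recall and accuracy."""
    f1_scores, recalls = [], []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        f1_scores.append(f1)
        recalls.append(recall)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    return {
        "f1": sum(f1_scores) / len(labels),      # macro average over classes
        "recall": sum(recalls) / len(labels),    # macro average over classes
        "accuracy": accuracy,
    }


# Toy example: one positive document is misclassified as negative.
gold = ["positive", "positive", "negative", "negative"]
pred = ["positive", "negative", "negative", "negative"]
scores = evaluate(gold, pred)
```

Macro averaging gives each class the same weight regardless of its size, which matters for corpora that are not balanced between positive and negative documents.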

The main results are shown in Table III. Along with the results of each polarity classification method, we present the state-of-the-art (SotA) result reported for each corpus. Because the BPE corpora were conceived for a different context, there are no SotA reported results for those corpora. We also ranked each evaluated method by its F1 score.

Compared with the SotA, the best method for each corpus (in bold) achieves very competitive results, given that all reported SotA results were obtained with a 10-fold cross-validation scheme, while our methods used a corpus from a different domain for training. Of all the methods, the Hybrid had the best performance in the product review corpora. This result is likely due to the regularity of the language in this type of corpus, which makes lexical approaches highly effective. However, in domains such as Twitter, errors, abbreviations and slang are very common, which decreases the effectiveness of lexical-based approaches. This effect can be seen in the BPE-Dilma corpus.

An important aspect of Sentiment Analysis is the sensitivity of its methods to elements such as domain and temporality. In our evaluation, both varied across the selected corpora, which suggests that the constructed corpus is robust to temporality and to the non-regularity of the language.

Regarding the deep learning methods (CNN and RCNN), both ranked highly in almost all corpora. However, there was no large difference between deep and shallow methods (logistic regression + document representation), indicating that large datasets decrease the performance difference between methods of different natures (even between simple and complex methods), a result commonly found in the big data era [halevy2009unreasonable].

Dataset Method F1 score Recall Accuracy
BPE-Dilma LR + w2v 5
LR + tfidf 0.64771 0.7128
LR + d2v 4
RCNN 2 0.6586
Hybrid 6

LR + w2v 6
LR + tfidf 5
LR + d2v 3
Hybrid 0.57451 0.6073 0.7344

LR + w2v 4
LR + tfidf 3
LR + d2v 6
Hybrid 0.76681 0.7695 0.7695

LR + w2v 6
LR + tfidf 2
LR + d2v 5
Hybrid 0.79171 0.7930 0.7934

Mercado Livre
LR + w2v 6
LR + tfidf 3
LR + d2v 4
Hybrid 0.86141 0.8614 0.8614
TABLE III: Results obtained by each method trained on the distant supervision corpus.

V Conclusion and Future Work

In recent years, the polarity classification task has drawn the attention of the scientific community, mainly due to its direct application in scenarios such as social media content and product reviews. Even though machine learning methods present themselves as high-performance alternatives, they need a large amount of data during their training phases. In this paper, we adapted a distant supervision approach to build a large sentiment corpus for Portuguese. State-of-the-art methods were trained on this corpus and applied to 5 selected corpora, from both the same domain and a different domain (cross-domain). Competitive results were obtained for all methods, although our best results did not outperform the best ones reported for the same corpora.

As future work, we intend to explore ways to improve the quality of the distant supervision corpus by applying techniques to remove outliers and tweets that do not convey any sentiment or convey the wrong sentiment. We also intend to modify this framework so that it can represent the neutral class.


E.A.C.J. acknowledges financial support from Google (Google Research Awards in Latin America grant) and CAPES (Coordination for the Improvement of Higher Education Personnel). V.Q.M. acknowledges financial support from FAPESP (grant no. 15/05676-8). L.B.S. acknowledges financial support from Google (Google Research Awards in Latin America grant) and CNPq (National Council for Scientific and Technological Development, Brazil). T.F.C.B., M.V.T., and H.B.B. acknowledge financial support from CNPq. In part of this work a GPU donated by NVIDIA Corporation was used.


Supplementary Material

A Preprocessing

In order to properly tokenize and preprocess the tweets, the following steps were performed:

  • Punctuation marks forming an emoticon were considered as a single token (e.g. :-( and ;) )

  • Sequences of consecutive emojis were split so that each emoji formed a single token

  • Additional punctuation marks (not forming any emoticon) were removed

  • Usernames (an @ symbol followed by up to 15 characters) were replaced by the tag USERNAME

  • Hashtags (a # symbol followed by any sequence of characters) were replaced by the tag HASHTAG

  • URLs were completely replaced by the tag URL

  • Numbers, including dates, telephone numbers and currency values were replaced by the tag NUMBER

  • Subsequent character repetitions were limited to 3 – i.e., all sequences of the same character were trimmed to fit the limit of 3

The following tweet is used to illustrate the preprocessing steps:

Original: hj tenho aula de manhã, tarde e noite. das 8h ate 19h :(( #cansado

Preprocessed: hj tenho aula de manhã tarde e noite das NUMBER ate NUMBER :(( HASHTAG
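
The steps listed above can be approximated with a handful of regular expressions. The sketch below is a simplified reading of them and makes several assumptions: tweets are split on whitespace, any token containing a digit is treated as a number, the emoticon pattern is deliberately small, and emoji handling is omitted (the punctuation rule as written here would drop emojis, so a full pipeline would need to handle them separately).

```python
import re

# Assumed, deliberately small emoticon pattern, e.g. :) ;) :-( :(( =D
EMOTICON = re.compile(r"^[:;=][\-']?[\)\(DPp]+$")
URL = re.compile(r"^(https?://|www\.)\S+$", re.IGNORECASE)
USERNAME = re.compile(r"^@\w{1,15}\W*$")
HASHTAG = re.compile(r"^#\S+$")
REPEAT = re.compile(r"(.)\1{3,}")   # 4 or more repetitions of a character
PUNCT = re.compile(r"[^\w\s]")      # anything that is not a word character


def preprocess(tweet):
    """Apply the tag replacements and cleaning rules token by token."""
    output = []
    for token in tweet.split():
        if URL.match(token):
            output.append("URL")
        elif USERNAME.match(token):
            output.append("USERNAME")
        elif HASHTAG.match(token):
            output.append("HASHTAG")
        elif EMOTICON.match(token):
            output.append(token)            # emoticons stay as single tokens
        elif any(ch.isdigit() for ch in token):
            output.append("NUMBER")         # assumption: digit => number
        else:
            token = PUNCT.sub("", token)    # drop remaining punctuation
            token = REPEAT.sub(r"\1\1\1", token)  # cap repetitions at 3
            if token:
                output.append(token)
    return " ".join(output)


example = "hj tenho aula de manhã, tarde e noite. das 8h ate 19h :(( #cansado"
cleaned = preprocess(example)
```

Applied to the example tweet, this sketch yields the same preprocessed string shown above.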

B Details about the machine learning methods

Logistic Regression

There is no additional information about this classifier.

Convolutional Neural Networks

The complete architecture is presented in Figure 3, where the input layer is a matrix composed of the input words, each of which is represented by a real-valued vector of fixed dimensionality. The convolutional layer receives these vectors as input and is responsible for the automatic extraction of features over a sliding window of fixed length. The output from the convolutional layer is then passed to the max-over-time pooling layer, and the new extracted features are concatenated. This results in a high-dimensional vector that is passed to a fully connected layer, where the softmax operation [Murphy:2012] is applied, returning the probability of a document being negative or positive. In the penultimate layer, we employed dropout with a constraint on the l2-norms of the weight vectors to reduce the chance of overfitting [Art:Srivastava:2014:dropout].
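
The core of this forward pass, one convolutional filter followed by max-over-time pooling and a softmax, can be illustrated without any deep learning library. The sketch below is didactic only: the ReLU activation, the function names and the toy numbers are assumptions, and a real model would learn many filters of several window lengths, for instance with Keras.

```python
import math


def conv_max_over_time(doc, filt, bias=0.0):
    """Slide one filter over the word vectors and keep the maximum response.

    doc:  list of word vectors (one list of floats per word).
    filt: list of h weight vectors, one per position in the window.
    """
    window = len(filt)
    if len(doc) < window:
        raise ValueError("document is shorter than the filter window")
    responses = []
    for start in range(len(doc) - window + 1):
        total = bias
        for offset in range(window):
            total += sum(w * x for w, x in zip(filt[offset], doc[start + offset]))
        responses.append(max(0.0, total))  # ReLU non-linearity (assumed)
    return max(responses)                  # max-over-time pooling


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    norm = sum(exps)
    return [e / norm for e in exps]


# Toy document with three 2-d word vectors and a filter of window length 2.
doc = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
filt = [[1.0, 0.0], [0.0, 1.0]]
feature = conv_max_over_time(doc, filt)
probs = softmax([feature, 0.0])
```

Each filter contributes a single pooled feature, so the length of the vector passed to the fully connected layer equals the number of filters, independently of the document length.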

Fig. 3: CNN architecture adapted from [kim2014cnn].

Recurrent Convolutional Neural Networks

The complete architecture is illustrated in Figure 4. The architecture is composed of an input layer that receives a sequence of input features, each with a fixed dimensionality. The convolutional layer is responsible for the automatic extraction of new features depending on neighboring words. Then, a max-pooling operation is applied over time, looking at a region of consecutive elements to find the most significant features. The new extracted features are fed into a bidirectional recurrent layer composed of Long Short-Term Memory units [Art:Hochreiter:1997:lstm], which are able to learn long-range dependencies between words. Finally, the last recurrent state output is passed to a fully connected layer, where the softmax operation [Murphy:2012] is computed, giving the probability of a document being negative or positive. Between these two layers, dropout is used to reduce the chance of overfitting [Art:Srivastava:2014:dropout].

Fig. 4: RCNN architecture adapted from [treviso2017sentence].

For both neural network models (CNN and RCNN), we employed the early stopping strategy to avoid overfitting, i.e., the training phase finishes when the validation loss has stopped improving. Other experimental settings for CNN and RCNN (number of epochs, batch size, learning rate, etc.) can be found in their original papers [kim2014cnn, treviso2017sentence], respectively.
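
The early stopping criterion can be sketched independently of any framework. The patience value is not recoverable from the text, so the default of 2 below is hypothetical, as is the function name; in Keras the same behavior is provided by the EarlyStopping callback.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the index of the epoch at which training would stop.

    Training stops once the validation loss has not improved for
    `patience` consecutive epochs (hypothetical default of 2).
    """
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss = loss
            best_epoch = epoch
        elif epoch - best_epoch >= patience:
            return epoch
    return len(val_losses) - 1  # never triggered: ran all epochs


# Validation loss improves until epoch 2, then stalls.
stop_at = early_stopping_epoch([0.90, 0.70, 0.60, 0.65, 0.66, 0.50])
```

In this toy run training stops at epoch 4, two epochs after the best validation loss, so the later improvement at epoch 5 is never observed. That is the usual trade-off when choosing a small patience.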


Hybrid

The two methods combined to classify a document into polarity classes are described below:

  • SVM classifier: The SVM employed uses an RBF kernel (gamma = 1/n_features), defined as K(x, x') = exp(-gamma ||x - x'||^2), and L1 penalty for regularization.

  • Lexical-based classifier: Each word present in a sentiment lexicon receives a value according to its polarity: positive words receive a positive value and negative ones a negative value. The presence of an intensification word (e.g. muito, demais) in a window around the word multiplies its value by a fixed factor. The presence of a downtoner divides the current value by a fixed factor. A negation multiplies the value of a word by a negative factor, inverting its polarity. Whenever a negation is in the same window as an intensification, it becomes a downtoner (e.g. não muito); the same occurs with a downtoner (não pouco). The polarity values are then summed up to determine the document polarity.
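
A minimal sketch of these lexical rules is given below. All numeric values are hypothetical placeholders (word values of +1 and -1, a factor of 3, a window of 3 tokens on each side), and the tiny word lists are illustrative only. The sketch also assumes that a negated downtoner behaves as an intensifier, which is one reading of the rule above.

```python
# Hypothetical resources: a real system would load a full sentiment lexicon.
LEXICON = {"bom": 1.0, "ótimo": 1.0, "ruim": -1.0, "péssimo": -1.0}
INTENSIFIERS = {"muito", "demais"}
DOWNTONERS = {"pouco"}
NEGATIONS = {"não", "nunca"}
FACTOR = 3.0   # hypothetical intensification/downtoning factor
WINDOW = 3     # hypothetical number of tokens inspected on each side


def lexical_polarity(tokens):
    """Sum the rule-adjusted polarity values of all lexicon words."""
    score = 0.0
    for i, token in enumerate(tokens):
        if token not in LEXICON:
            continue
        value = LEXICON[token]
        context = tokens[max(0, i - WINDOW):i] + tokens[i + 1:i + 1 + WINDOW]
        negated = any(t in NEGATIONS for t in context)
        intensified = any(t in INTENSIFIERS for t in context)
        downtoned = any(t in DOWNTONERS for t in context)
        if negated and (intensified or downtoned):
            # "não muito" acts as a downtoner; "não pouco" is assumed
            # to act as an intensifier. Polarity is not inverted.
            intensified, downtoned = downtoned, intensified
        elif negated:
            value *= -1.0              # plain negation inverts polarity
        if intensified:
            value *= FACTOR
        if downtoned:
            value /= FACTOR
        score += value
    return score
```

Under these placeholder values, "filme muito bom" scores 3.0, "não é bom" scores -1.0, and "não muito bom" scores about 0.33, so the sign of the sum gives the document polarity.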