Polarity classification using distant supervision
The enormous amount of text published daily by Internet users has fostered the development of methods to analyze this content in several natural language processing areas, such as sentiment analysis. The main goal of this task is to classify the polarity of a message. Even though many approaches have been proposed for sentiment analysis, some of the most successful ones rely on the availability of large annotated corpora, whose construction is an expensive and time-consuming process. In recent years, distant supervision has been used to obtain larger datasets. Inspired by these techniques, in this paper we extend such approaches to incorporate popular graphic symbols used in electronic messages, the emojis, in order to create a large sentiment corpus for Portuguese. Trained on almost one million tweets, several models were tested on both same-domain and cross-domain corpora. Our methods obtained very competitive results on five annotated corpora from mixed domains (Twitter and product reviews), which supports the domain independence of such an approach. In addition, our results suggest that the combination of emoticons and emojis is able to properly capture the sentiment of a message.
Source code to https://arxiv.org/pdf/1707.02657.pdf
In the last few years, Sentiment Analysis has become a prominent field in natural language processing (NLP), mostly due to its direct application in several real-world scenarios [pang2008opinion], such as product reviews, government intelligence, and stock market prediction. One of the main tasks in Sentiment Analysis is polarity classification, i.e., classifying texts into categories according to the emotions expressed in them. In general, the classes are positive, negative and neutral.
A popular application of polarity classification is in social media content. Microblogging and social network websites, such as Twitter, have been used to express personal thoughts. According to Twitter's website (https://business.twitter.com/en/basics.html), more than 500 million short messages, known as tweets, are posted each day. The analysis of this type of content is particularly challenging due to its specific language, which is mostly informal, with spelling errors and out-of-vocabulary words, as well as the usage of emoticons and emojis to express ideas and sentiments.
Machine learning methods have been widely applied to polarity classification tasks in the context of social networks. This is particularly evident in shared tasks such as the SemEval Sentiment Analysis tasks [rosenthal2015semeval, nakov2016semeval], where these methods usually outperform lexical-based approaches. However, a major drawback of machine learning is its high dependency on large annotated corpora. Since manual annotation is usually time-consuming and expensive [pan2010cross], many non-English languages lack this type of resource or, when it exists, it is very limited and specific, as is the case for Portuguese.
In this paper, we adapt a distant supervision approach [go2009twitter] to annotate a large number of tweets in Portuguese and use them to train state-of-the-art methods for polarity classification. We applied these methods to manually annotated corpora from the same domain (Twitter) and from a different domain (product reviews). The obtained results indicate that the proposed approach is well suited to both same-domain and cross-domain settings. Moreover, it is a powerful alternative for producing sentiment analysis corpora with less effort than manual annotation.
This paper is organized as follows. Section II gives a brief overview of some approaches for sentiment analysis and presents some works that have applied distant supervision to this task. Our approach is described in Section III. The evaluation corpora, machine learning algorithms and results are given in Section IV. Finally, our conclusions are drawn in Section V.
Currently, methods devised to perform sentiment analysis and, more specifically, polarity classification range from machine learning to lexical-based approaches. While machine learning methods have proved useful in scenarios where a large amount of training data is available along with top quality NLP resources (such as taggers, parsers and others), they usually have low performance in opposite scenarios. Since most non-English languages face resource limitations, for example Portuguese, lexical-based approaches have become very popular. Some works following this line are [souza2011construction, BalageFilho2013, avancco2014lexicon].
Another alternative for languages with fewer resources is the use of hybrid systems, which combine machine learning and lexical-based methods. Avanço et al.[avanccoimproving] showed that this combination outperforms both individual approaches. This may imply that the development of better individual elements will lead to better results in the final combination.
Machine learning approaches rely on document representations, normally vectorial ones with features like n-grams [pang2008opinion]; a simple example is the bag-of-words model. Once a representation has been chosen, several classification methods are available, such as Support Vector Machines (SVM), Naive Bayes (NB), Maximum Entropy (MaxEnt), Conditional Random Fields (CRF), and ensembles of classifiers [nakov2016semeval].
Apart from the traditional features, such as n-grams, some researchers have taken advantage of word embeddings, which are known to capture some linguistic properties, such as semantic and syntactic features. A well-known example of word embeddings is Word2Vec [mikolov2013efficient, mikolov2013distributed]. Algebraic operations, such as sum or average, can be applied to convert word vectors into a sentence or document vector [zhou2016ecnu, correa2017nilc]. However, this representation does not consider the order of the words in the sentence.
Paragraph vectors [mikolovdoc2vec] (also known as Doc2Vec) can be understood as a generalization of Word2Vec for larger blocks of text, such as paragraphs or documents. This technique has obtained state-of-the-art results on sentiment analysis for two datasets of movie reviews [mikolovdoc2vec]. The main goal of these dense representations is to predict the words in those blocks. Two models were proposed by Le and Mikolov [mikolovdoc2vec], one of which accounts for the word order.
In addition, deep neural networks also consider the word order. These methods have achieved good results in sentiment analysis, as shown in [socher2013recursive, kalchbrenner2014convolutional, kim2014cnn] and in the SemEval Sentiment Analysis Tasks [rosenthal2015semeval, nakov2016semeval]. Nevertheless, these approaches need large datasets for training. Distant supervision is a good alternative to obtain these datasets for the training or pre-training of deep neural networks [kalchbrenner2014convolutional, severyn2015unitn, deriu2016swisscheese].
Distant supervision is an alternative for creating large datasets without the need for manual annotation. Some works have reported the use of emoticons as semantic indicators of sentiment [read2005using, go2009twitter, pak2010twitter, severyn2015unitn], while others use emoticons and hashtags for the same purpose [davidov2010enhanced, kouloumpis2011twitter]. Go et al. [go2009twitter], the first work to apply distant supervision to Twitter data, collected approximately 1.6 million tweets containing positive and negative emoticons (e.g., “:)” and “:(”), equally distributed into two classes. They combined sets of features (unigrams, bigrams, part-of-speech (POS) tags) in order to train machine learning algorithms (NB, MaxEnt and SVM) and evaluate them on manually annotated datasets. The best accuracy was achieved using unigrams and bigrams as features for a MaxEnt classifier.
Severyn and Moschitti [severyn2015unitn] used distant supervision to pre-train a Convolutional Neural Network (CNN) with an architecture similar to the one proposed by Kim [kim2014cnn]. Deriu et al. [deriu2016swisscheese] used a combination of 2 CNNs with a Random Forest classifier. However, this approach did not obtain improvements with distant supervision.
Despite the numerous studies and investigations of different techniques and methods for polarity classification, the problem of relying on large annotated corpora remains open, and the difficulty is intensified in non-English languages. In this paper, our contributions are the adapted framework for building a polarity classification corpus for Portuguese, the built corpus itself, and an evaluation of different state-of-the-art methods using this corpus, on same-domain and cross-domain corpora.
Following the approach of Go et al. [go2009twitter], we initially collected a large number of tweets in order to create the distant supervision corpus. Only tweets in Portuguese were crawled, and no specific queries were employed. In total, 41 million tweets were collected.
After collecting the tweets, the next step was to split them into positive and negative classes. In order to do so, we used lists of emojis and emoticons selected according to the sentiment conveyed by them. Therefore, the polarity of a tweet is determined by the presence of emojis and emoticons in it – if it only contains positive ones (from the positive list), its polarity is assigned as positive. If a tweet contains both positive and negative elements, it is discarded since it is likely to be ambiguous. Following this idea, we used the same list of emoticons used by Go et al. [go2009twitter], which is presented in Table I. Go et al.[go2009twitter] did not use emojis, but these graphic symbols are also employed to convey ideas and sentiments [novak2015sentiment]. In contrast to the small set of emoticons, there are hundreds of possible emojis. Therefore, we selected a representative list with positive and negative emojis. All the emojis conveying positive emotion are presented in Fig. 1. Fig. 2 illustrates the selected ones with negative emotion.
After filtering the tweets by the aforementioned criteria, we obtained a labeled corpus comprising 554,623 positive tweets and 425,444 negative ones. This corpus was used to train the machine learning methods. It is important to highlight that emojis and emoticons were removed from the tweets in the final corpus, so that their presence as a sentiment indicator is not learned by the models.
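The labeling rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `label_tweet` is hypothetical, the emoticons are those of Table I, and the two emojis per class are stand-ins for the full lists in Figs. 1 and 2. The matching is a naive substring test, not necessarily what the released framework does.

```python
# Emoticons from Table I plus a few illustrative emojis (assumed stand-ins
# for the full lists shown in Figs. 1 and 2).
POSITIVE = [":)", ":-)", ":D", "=)", "\U0001F600", "\U0001F60D"]
NEGATIVE = [":(", ":-(", "\U0001F622", "\U0001F621"]


def label_tweet(text):
    """Return (label, cleaned_text), or None if the tweet is discarded."""
    has_pos = any(sym in text for sym in POSITIVE)
    has_neg = any(sym in text for sym in NEGATIVE)
    # Discard tweets with no indicator, or with both kinds (ambiguous).
    if has_pos == has_neg:
        return None
    # Remove the indicators so that a model cannot simply learn them.
    # Longer symbols are removed first so ":-)" is not split by ":)"-like
    # patterns.
    for sym in sorted(POSITIVE + NEGATIVE, key=len, reverse=True):
        text = text.replace(sym, " ")
    label = "positive" if has_pos else "negative"
    return label, " ".join(text.split())
```

For instance, `label_tweet("que dia lindo :)")` yields a positive example whose text no longer contains the emoticon, while a tweet containing both `:)` and `:(` is dropped.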
In addition to the filtering process, some preprocessing steps were performed to improve the corpus quality. Details about the preprocessing steps are given in the Supplementary Material, Section A. After these steps, tweets containing fewer than 4 tokens were discarded from the corpus. The complete framework (tweet collection, filtering and preprocessing methods), along with all experimental evaluation, will be made available at https://github.com/edilsonacjr/pelesent.
Table I: Emoticons used to label tweets as positive or negative.
|Positive||Negative|
|:) :-) :D =)||:( :-(|
In order to evaluate the quality of the corpus built using distant supervision, we trained state-of-the-art methods for polarity classification and applied the learned models to 5 well known manually annotated sentiment corpora. In the following, we present these corpora along with the message polarity classification methods, and finally, the obtained results.
Sentiment classifiers are usually trained on manually annotated corpora. Because sentiments may be expressed differently in different domains [pan2010cross], it is common to create domain-specific corpora. Since we intend to create a robust and generic corpus that is not domain-specific, we selected 5 corpora for evaluation, 2 from the same domain (Twitter) and 3 from a different domain (product reviews). Below, we present the corpora that were used.
This dataset is formed by tweets about the Brazilian presidential election held in 2010. The corpus is divided into two parts, one referring to Dilma Rousseff (BPE-Dilma) and the other to José Serra (BPE-Serra), the two most popular candidates in the election. The corpora were manually annotated as positive or negative, and were used to evaluate stream-based sentiment analysis systems.
This dataset is formed by product reviews extracted from the Buscapé website (http://www.buscape.com.br/). The documents were automatically labeled based on two pieces of information given by the users. The first labeling (Buscape-1) is based on a recommendation tag, while the second (Buscape-2) is based on a 5-star scale (1-2 stars for negative and 4-5 stars for positive). Both corpora are balanced between the two classes, even though there is a notable difference in their sizes, possibly due to the low use of the recommendation tag.
Similar to the Buscapé dataset, this corpus is formed by product reviews from the online marketplace Mercado Livre (http://www.mercadolivre.com.br/). The corpus was also automatically annotated based on a 5-star scale given by the authors of the reviews. The dataset is balanced between the positive and negative classes.
Table II presents a summary of the corpora.
Machine learning has dominated the area of sentiment analysis, mostly because of its high performance when manually annotated data is available. However, given the great variety of methods, there is no consensus about which method is the best in this scenario. In the last editions of the SemEval Sentiment Analysis Task, most of the best methods/systems used deep learning techniques [rosenthal2015semeval, nakov2016semeval]. In this work, the evaluated methods range from simple linear models for classification using vector space models to hybrid (machine learning and lexical-based) and deep learning methods. The idea was to thoroughly evaluate the quality of the corpus regardless of the technique being used for learning. Below, each method is briefly described.
In this paper, the logistic regression model predicts the class probabilities of a text, where the classes are "positive" and "negative". As input for this classifier, three text representations were used: (1) a bag-of-words model (LR+tfidf), where each document (tweet or review) is represented by its set of words weighted by tf-idf [salton1989automatic]; (2) a word embeddings based model (LR+w2v), where each document is represented by the weighted average of the embedding vectors (generated by Word2Vec [mikolov2013distributed, mikolov2013efficient]) of the words that compose the document, with the weights defined by tf-idf; (3) the Paragraph Vector model (LR+d2v), which uses a neural network to generate embeddings for words and documents simultaneously in an unsupervised manner. Only the vectorial representations of documents were used by the classifier.
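To make the LR+w2v representation more concrete, the sketch below computes a tf-idf weighted average of word vectors in plain Python. It is an illustration only: the function names, the smoothed idf formula, and the toy two-dimensional embeddings are assumptions, not the settings used in the experiments.

```python
import math
from collections import Counter


def idf_table(tokenized_docs):
    """Smoothed inverse document frequency (one common variant, assumed here)."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter(tok for doc in tokenized_docs for tok in set(doc))
    return {tok: math.log((1 + n_docs) / (1 + df)) + 1.0
            for tok, df in doc_freq.items()}


def doc_vector(tokens, embeddings, idf):
    """Average the word vectors of a document, weighting each by tf * idf."""
    dim = len(next(iter(embeddings.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for tok, tf in Counter(tokens).items():
        # Words without an embedding or an idf value are ignored.
        if tok not in embeddings or tok not in idf:
            continue
        weight = tf * idf[tok]
        weight_sum += weight
        for i, value in enumerate(embeddings[tok]):
            total[i] += weight * value
    if weight_sum == 0.0:
        return total  # all-zero vector for documents with no known word
    return [value / weight_sum for value in total]
```

The resulting fixed-length vector is what a classifier such as logistic regression would receive as input; word order is lost, as noted earlier for averaged embeddings.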
With the popularity of deep learning, CNNs have been applied to many different contexts, including several NLP tasks [collobert2011nlp] and, more specifically, sentiment analysis [kalchbrenner2014convolutional, kim2014cnn, severyn2015unitn, deriu2016swisscheese]. Our CNN is similar to the architecture proposed by Kim [kim2014cnn], which uses a single convolutional layer. In this architecture, the network receives as input a matrix representing the document, and each word in the document is represented by a dense continuous vector. The output of the network is the probability of a document being negative or positive.
This deep neural architecture uses both convolutional and recurrent layers. Recently explored by many works in NLP [kalchbrenner2013recurrent, treviso2017sentence, lai2015recurrent], this architecture has been successfully applied to sentiment analysis [lai2015recurrent, nakov2016semeval]. Our architecture consists of a slight modification of the one used by Treviso et al.[treviso2017sentence], where the final layer returns the probability for the whole document, indicating a positive/negative polarity. Using this combination of convolutional and recurrent layers, we explored the principle that nearby words have a greater influence in the classification, while distant words may also have some impact.
This method is a combination of two classifiers previously used for sentiment classification in cross-domain corpora [avanccoimproving] and follows the same setting introduced by Avanço [avancco2015normalizaccao]. The method consists of an SVM classifier combined with a lexical-based approach. The documents are represented by arrays of features including a binary bag-of-words (presence/absence of terms), emoticons, sentiment words and POS tags. Documents located near the separating hyperplane learned by the SVM (within a fixed threshold distance) are considered to be uncertain. Those documents are then classified with a lexical-based approach that uses linguistic rules for polarity classification in Portuguese.
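The decision rule of the hybrid method can be sketched as follows. This is a minimal illustration, assuming that the SVM exposes a signed distance to the hyperplane and that the lexical approach returns a summed polarity score; the function name, the threshold value passed by the caller, and the tie-breaking toward the positive class are all hypothetical.

```python
def hybrid_predict(svm_score, lexicon_score, threshold):
    """Combine an SVM decision value with a lexical polarity score.

    svm_score: signed distance of the document to the SVM hyperplane.
    lexicon_score: summed polarity from the lexical-based classifier.
    threshold: margin below which the SVM is considered uncertain (assumed).
    """
    if abs(svm_score) >= threshold:
        # The document is far enough from the hyperplane: trust the SVM.
        return "positive" if svm_score > 0 else "negative"
    # Uncertain region: fall back to the lexical-based classifier.
    return "positive" if lexicon_score >= 0 else "negative"
```

With a threshold of 0.5, a document scored 1.5 by the SVM keeps the SVM label, whereas one scored 0.1 is decided by the lexical score.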
For all methods, well-known machine learning libraries were used, such as Scikit-learn [pedregosa2011scikit] and Keras [chollet2015keras]. Particularities such as parameters, details about the architecture, initializations and others can be found in the Supplementary Material, Section B.
To evaluate and compare the methods in each corpus, F1 score (macro-averaged), recall (macro-averaged) and accuracy were chosen, mostly because of their traditional use in sentiment analysis [rosenthal2015semeval, nakov2016semeval].
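For reference, the three measures can be computed as in the sketch below. It is a plain-Python illustration of macro-averaging over the two classes; the function name and the label strings are assumptions, and in practice a library such as Scikit-learn would normally be used.

```python
def macro_scores(y_true, y_pred, labels=("positive", "negative")):
    """Return macro-averaged F1, macro-averaged recall, and accuracy."""
    recalls, f1_scores = [], []
    pairs = list(zip(y_true, y_pred))
    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        recalls.append(recall)
        f1_scores.append(f1)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    return {
        # Macro-averaging: every class counts equally, regardless of size.
        "f1": sum(f1_scores) / len(labels),
        "recall": sum(recalls) / len(labels),
        "accuracy": accuracy,
    }
```

Because each class contributes equally to the macro averages, a classifier that ignores the minority class is penalized even when its accuracy looks high.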
The main results are shown in Table III. Along with the results of each polarity classification method, we present the state-of-the-art (SotA) result reported for each corpus. Because the BPE corpora were conceived for a different context, there are no SotA reported results for those corpora. We also ranked each evaluated method by its F1 score.
The differences between the best method (in bold) and the SotA are modest. These are very competitive results, given that all reported SotA results were obtained with a 10-fold cross-validation scheme within each corpus, whereas our methods used a corpus from a different domain for training. Of all the methods, the Hybrid was the one with the best performance on the product review corpora. This result is due to the regularity of the language in this type of corpus, which makes lexical approaches highly effective. However, in domains such as Twitter, errors, abbreviations and slang are very common, which decreases the effectiveness of lexical-based approaches. This effect can be seen in the BPE-Dilma corpus.
An important aspect of Sentiment Analysis is the sensitivity of its methods to elements such as domain and temporality. In our evaluation, both were present in the selected corpora, which demonstrates the robustness of the constructed corpus and its resilience to temporality and the non-regularity of the language.
Regarding the deep learning methods (CNN and RCNN), both achieved high rankings in almost all corpora. However, there was no large difference between deep and shallow methods (logistic regression + document representation), indicating that large datasets decrease the performance difference between methods of different natures (even between simple and complex methods), a result commonly found in the big data era [halevy2009unreasonable].
|BPE-Dilma||LR + w2v||5|
|LR + tfidf||0.64771||0.7128|
|LR + d2v||4|
|LR + w2v||6|
|LR + tfidf||5|
|LR + d2v||3|
|LR + w2v||4|
|LR + tfidf||3|
|LR + d2v||6|
|LR + w2v||6|
|LR + tfidf||2|
|LR + d2v||5|
|LR + w2v||6|
|LR + tfidf||3|
|LR + d2v||4|
In recent years, the polarity classification task has drawn the attention of the scientific community, mainly due to its direct application in scenarios such as social media content and product reviews. Even though machine learning methods present themselves as high-performance alternatives, they suffer from the need for a large amount of data during their training phases. In this paper, we adapted a distant supervision approach to build a large sentiment corpus for Portuguese. State-of-the-art methods were trained on this corpus and applied to 5 selected corpora, from the same domain and from a different domain (cross-domain). Competitive results were obtained for all methods, although our best results did not outperform the best ones reported for the same corpora.
As future work, we intend to explore ways to improve the quality of the distant supervision corpus by applying techniques to remove outliers and tweets that do not convey any sentiment or convey the wrong sentiment. We also intend to modify this framework to make it able to represent the neutral class.
E.A.C.J. acknowledges financial support from Google (Google Research Awards in Latin America grant) and CAPES (Coordination for the Improvement of Higher Education Personnel). V.Q.M. acknowledges financial support from FAPESP (grant no. 15/05676-8). L.B.S. acknowledges financial support from Google (Google Research Awards in Latin America grant) and CNPq (National Council for Scientific and Technological Development, Brazil). T.F.C.B., M.V.T., and H.B.B. acknowledge financial support from CNPq. In part of this work a GPU donated by NVIDIA Corporation was used.
In order to properly tokenize and preprocess the tweets, the following steps were performed:
Punctuation marks forming an emoticon were considered as a single token (e.g. :-( and ;) )
Sequences of consecutive emojis were split so that each emoji formed a single token
Additional punctuation marks (not forming any emoticon) were removed
Usernames (an @ symbol followed by up to 15 characters) were replaced by the tag USERNAME
Hashtags (a # symbol followed by any sequence of characters) were replaced by the tag HASHTAG
URLs were completely replaced by the tag URL
Numbers, including dates, telephone numbers and currency values were replaced by the tag NUMBER
Subsequent character repetitions were limited to 3 – i.e., all sequences of the same character were trimmed to fit the limit of 3
The following tweet is used to illustrate the preprocessing steps:
Original: hj tenho aula de manhã, tarde e noite. das 8h ate 19h :(( #cansado
Preprocessed: hj tenho aula de manhã tarde e noite das NUMBER ate NUMBER :(( HASHTAG
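The steps above can be approximated with a single regular-expression tokenizer, as sketched below. This is not the released implementation: the patterns (in particular the emoticon, number and emoji ranges) are simplified assumptions chosen so that the example tweet is reproduced, and real data would likely require broader patterns.

```python
import re

# One alternative per token type, tried in order. Characters that match
# none of them (whitespace, other punctuation) are skipped by finditer.
TOKEN_RE = re.compile(
    r"(?P<URL>https?://\S+|www\.\S+)"
    r"|(?P<USERNAME>@\w{1,15})"
    r"|(?P<HASHTAG>#\w+)"
    r"|(?P<EMOTICON>[:;=]-?[()D]+)"
    r"|(?P<NUMBER>(?:R?\$)?\d[\d.,:/-]*\w*)"
    r"|(?P<WORD>\w+)"
    r"|(?P<EMOJI>[\U0001F300-\U0001FAFF\u2600-\u27BF])"
)

# Token types that are replaced by a tag named after the type.
TAGGED = {"URL", "USERNAME", "HASHTAG", "NUMBER"}


def preprocess(text):
    """Tokenize a tweet and return the preprocessed text."""
    # Trim runs of the same character to at most 3 repetitions.
    text = re.sub(r"(.)\1{3,}", r"\1\1\1", text)
    tokens = []
    for match in TOKEN_RE.finditer(text):
        kind = match.lastgroup
        # Emoticons, emojis (one per token) and words are kept as they are.
        tokens.append(kind if kind in TAGGED else match.group())
    return " ".join(tokens)
```

Applied to the original tweet shown above, this sketch yields the preprocessed form given in the example, including the preserved `:((` emoticon.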
There is no additional information about this classifier.
The complete architecture is presented in Figure 3. The input layer is a matrix composed of the input words, where each word is represented by a real-valued vector of fixed dimensionality. The convolutional layer receives these vectors as input and is responsible for the automatic extraction of features over a sliding window of fixed length. The output of the convolutional layer is then passed to the max-over-time pooling layer, and the newly extracted features are concatenated. This results in a high-dimensional vector that is passed to a fully connected layer, where the softmax operation [Murphy:2012] is applied, returning the probability of a document being negative or positive. In the penultimate layer, we employed dropout with a constraint on the l2-norms of the weight vectors to reduce the chance of overfitting [Art:Srivastava:2014:dropout].
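The convolution followed by max-over-time pooling can be illustrated for a single filter with the plain-Python sketch below. The function name, the ReLU non-linearity, and the zero output for documents shorter than the window are assumptions made for illustration; the actual model was built with a deep learning library and uses many filters.

```python
def conv_max_over_time(doc, filt, bias=0.0):
    """Apply one convolutional filter and max-over-time pooling.

    doc:  list of word vectors (one list of floats per word).
    filt: list of weight rows, one per position in the sliding window.
    """
    window_len = len(filt)
    feature_map = []
    for start in range(len(doc) - window_len + 1):
        window = doc[start:start + window_len]
        activation = bias + sum(
            weight * value
            for weights, vector in zip(filt, window)
            for weight, value in zip(weights, vector)
        )
        feature_map.append(max(0.0, activation))  # ReLU (assumed)
    # Max-over-time pooling keeps the strongest response of the filter,
    # wherever in the document it occurred.
    return max(feature_map, default=0.0)
```

Each filter thus contributes one number per document, and concatenating the outputs of all filters gives the vector passed to the fully connected layer.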
The complete architecture is illustrated in Figure 4. It is composed of an input layer of input features, each with a fixed dimensionality. The convolutional layer is responsible for the automatic extraction of new features depending on neighboring words. Then, a max-pooling operation is applied over time, looking at a fixed-size region of elements to find the most significant features. The newly extracted features are fed into a bidirectional recurrent layer of Long Short-Term Memory (LSTM) units [Art:Hochreiter:1997:lstm], which are able to learn long-range dependencies between words. Finally, the last recurrent state output is passed to a fully connected layer, where the softmax operation [Murphy:2012] is calculated, giving the probability of a document being negative or positive. Between these two layers, dropout is used to reduce the chance of overfitting [Art:Srivastava:2014:dropout].
For both neural network models (CNN and RCNN), we employed the early stopping strategy to avoid overfitting, i.e., the training phase finishes when the validation loss has stopped improving. Other experimental settings for the CNN and RCNN (number of epochs, batch size, learning rate, etc.) can be found in their original papers [kim2014cnn, treviso2017sentence], respectively.
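The early stopping criterion can be sketched as follows. The function name and the default patience of 2 epochs are hypothetical, chosen only for illustration; the sketch assumes a non-empty list of per-epoch validation losses.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the (0-based) epoch at which training would stop.

    Training stops once the validation loss has failed to improve on its
    best value for `patience` consecutive epochs (patience is assumed).
    """
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    # The loss kept improving (or patience was never exhausted).
    return len(val_losses) - 1
```

In a library such as Keras, the same behavior is usually obtained with a ready-made callback rather than hand-written logic.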
The two methods combined to classify a document in polarity classes are described below:
SVM classifier: The SVM employed uses an RBF kernel, defined as $K(x, x') = \exp(-\gamma \lVert x - x' \rVert^2)$ with $\gamma = 1/\text{n\_features}$, and L1 penalty for regularization.
Lexical-based classifier: Each word present in a sentiment lexicon receives a value according to its polarity: positive words receive a positive value and negative words a negative one. The presence of an intensification word (e.g., muito, demais) in a window around the word multiplies its value by a constant factor, while the presence of a downtoner divides the current value by a constant factor. A negation reverses the sign of the word's value, inverting its polarity. Whenever a negation is in the same window as an intensification, it becomes a downtoner (e.g., não muito); the same occurs with a downtoner (não pouco). The polarity values are then summed up to determine the document polarity.
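The lexical rules can be illustrated with the sketch below. Every constant in it is a hypothetical placeholder: the tiny lexicon, the word values of +1 and -1, the factor of 3, and the window of 3 tokens on each side are assumptions for illustration, not the values of the original method. The sketch also assumes one reading of the negation rule, namely that a negated intensifier acts as a downtoner and a negated downtoner acts as an intensifier.

```python
# Hypothetical resources and constants, for illustration only.
LEXICON = {"bom": 1.0, "ótimo": 1.0, "ruim": -1.0, "péssimo": -1.0}
INTENSIFIERS = {"muito", "demais"}
DOWNTONERS = {"pouco"}
NEGATIONS = {"não"}
FACTOR = 3.0   # assumed multiplier/divisor
WINDOW = 3     # assumed number of tokens inspected on each side


def lexical_polarity(tokens):
    """Sum the adjusted polarity values of all lexicon words in a document."""
    total = 0.0
    for i, token in enumerate(tokens):
        if token not in LEXICON:
            continue
        value = LEXICON[token]
        context = tokens[max(0, i - WINDOW):i] + tokens[i + 1:i + 1 + WINDOW]
        negated = any(t in NEGATIONS for t in context)
        intensified = any(t in INTENSIFIERS for t in context)
        toned_down = any(t in DOWNTONERS for t in context)
        if negated and intensified:
            value /= FACTOR        # "não muito" behaves as a downtoner
        elif negated and toned_down:
            value *= FACTOR        # "não pouco" behaves as an intensifier
        elif negated:
            value = -value         # plain negation inverts the polarity
        elif intensified:
            value *= FACTOR
        elif toned_down:
            value /= FACTOR
        total += value
    return total
```

A positive total indicates a positive document and a negative total a negative one; how a total of exactly zero is resolved is left open here.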