Building a Sentiment Corpus of Tweets in Brazilian Portuguese

12/24/2017 ∙ by Henrico Bertini Brum, et al. ∙ Universidade de São Paulo

The large amount of data available in social media, forums and websites motivates research in several areas of Natural Language Processing, such as sentiment analysis. The popularity of the area, due to its subjective and semantic characteristics, motivates research on novel methods and approaches for classification. Hence, there is a high demand for datasets in different domains and different languages. This paper introduces TweetSentBR, a sentiment corpus for Brazilian Portuguese of 15,000 manually annotated sentences in the TV show domain. The sentences were labeled in three classes (positive, neutral and negative) by seven annotators, following literature guidelines for ensuring reliability of the annotation. We also ran baseline experiments on polarity classification using three machine learning methods, reaching 80.99% F-measure and 82.06% accuracy in binary classification, and 64.62% accuracy in the three-class task.


1 Introduction

Sentiment Analysis (SA) became a popular area of Natural Language Processing in the last decade. The classification of semantic orientation of documents is a challenge for artificial intelligence methods since it is based not only on the regular meaning of words, but also on their semantic role in the context and on the author’s intention. Furthermore, the amount of data available in blogs, social media posts and forums has created a great opportunity for researchers to build datasets for evaluating methods and studying new linguistic phenomena.

Websites on e-commerce, movie reviews and hotel reservations usually allow the user to provide an objective evaluation besides the written commentary. This objective evaluation (binary recommendation, star score, 10-point scale) can be a good feature for automatically labeling large datasets on semantic orientation, and has improved the resources available for research over the past decades [Pang et al.2002, Pang and Lee2005, Blitzer et al.2007].

The limitation of this technique is the availability of data in these conditions. Social media, for example, is a large source of user opinions and evaluations [Pak and Paroubek2010], but the lack of an objective score attached to the posts demands manual annotation in order for the data to become useful for SA, even though the data is enriched by linguistic phenomena such as expressions, slang and irony.

Manual annotation ends up being more expensive and time-consuming, since it demands several guarantees of accuracy, such as developing guidelines, training annotators and revising the annotation [Hovy and Lavid2010].

In this paper we introduce TweetSentBR (TTsBR), a corpus manually annotated with data extracted from Twitter. Section 2 presents related work on SA and corpus annotation. Section 3 presents the corpus and its properties, such as its size, the annotation tags, information on the annotators and the data extraction process. Section 4 presents data analysis and polarity classification experiments on the corpus. Section 5 is a brief discussion of the importance of the corpus and how it can be used in Brazilian Portuguese research on SA.

2 Related Work

Several works present new methods and approaches for tasks such as polarity classification [Turney2002, Pang and Lee2005], detection of irony [Carvalho et al.2009, Reyes et al.2012] and aspect extraction in text [Hu and Liu2004].

One of the major issues in this area is building datasets for evaluating methods and for training machine learning models. One of the first works on polarity classification [Turney2002] used product reviews labeled as “recommended” and “not recommended”. The source of the data was a website called Epinions, where users could evaluate products and leave a five-star score for each review. Reviews with less than three stars were considered “not recommended”. [Pang et al.2002] use a similar score (star rating) in order to compile a corpus of movie reviews in three classes (positive, negative and neutral).

The automatic approach worked very well for building large datasets, but the method limited research to domains where users input an objective score. Despite the challenges of manual annotation, researchers began building new datasets by training annotators to label the data. [Socher et al.2013] introduce the Stanford Sentiment Treebank, a re-labeling of the previous movie review corpus presented in [Pang and Lee2005]. SemEval, an important semantic evaluation event, also produces several datasets for English designed for SA tasks [Nakov et al.2016]. Some authors even used distant supervision techniques to automatically label large datasets quickly using features such as emoticons [Go et al.2009].

In Brazilian Portuguese, several works have presented corpora for SA. [Freitas et al.2012] introduce ReLi, a sentiment corpus of book reviews manually annotated in three classes (positive, neutral and negative). The authors chose books aimed at different audiences in order to vary the linguistic phenomena in the corpus (from teenage books to literature classics). ReLi contains annotation of semantic orientation, part-of-speech tags and aspects of opinion, and it was later used as a resource for research in SA [Balage et al.2013, Brum et al.2016]. One issue with this corpus noted in the literature is its unbalanced classes: the majority of sentences are neutral, while the negative class represents only a small fraction of the data.

In the product review domain, [Hartmann et al.2014] presented the Buscapé corpus, a large corpus in Brazilian Portuguese. The corpus contains reviews labeled as positive or negative, using the scores given by users on Buscapé, a popular e-commerce website. A similar dataset is the Mercado Livre corpus, introduced in [Avanço2015], containing product reviews also labeled automatically and balanced between the two classes.

[Silva et al.2011] collected a corpus from Twitter in Portuguese. The dataset was collected by searching for two entities in the social network (Dilma and Serra, two running candidates at the time) and manually annotated as positive or negative. The corpus is balanced between positive and negative documents. It was originally constructed for sentiment stream analysis, meaning it contains several retweets and links, phenomena that may interfere with sentiment classification but are vital to maintaining the stream for the former task.

Also working with Twitter data, [Moraes et al.2015] introduce the 7x1PT corpus, a Brazilian Portuguese corpus of Twitter comments posted during the 2014 World Cup semi-final. The corpus presents some interesting user behavior, such as irony, sarcasm, cheering and anger due to the final match score. 7x1PT contains tweets labeled manually in three classes; the neutral class represents tweets that do not align with either positive or negative sentiment.

[Moraes et al.2016] also use Twitter as the source of data, but compile a corpus of tweets about computer products. The data is manually labeled in three classes, and the authors also performed experiments on SA using lexicon-based classifiers and SVM.

A large Twitter corpus was compiled by [Correa Junior et al.2017] using distant supervision. The authors labeled tweets in Brazilian Portuguese using emojis representing positive and negative sentiments, following the work of [Go et al.2009] for English. The corpus contains both positive and negative tweets. The approach is a fast way to label data, but it cannot guarantee the absence of noisy data such as irony, sarcasm or incorrect labels.

3 TweetSentBR

TweetSentBR is composed of 15,000 tweets extracted using Python-Twitter (https://github.com/bear/python-twitter), a wrapper for the Twitter API. Due to the limitations of the Twitter API, we developed a continuous crawler in order to obtain documents during the first semester of 2017. The final dataset is split in two parts: a training set and a test set, both labeled in positive, neutral and negative with similar class distributions. See Table 1 for the number of documents in each class.

Class      Training set   Test set   Total
Positive        -             -         -
Neutral         -             -         -
Negative        -             -         -
Total           -             -         -
Table 1: Number of documents in the corpus in each class.

3.1 Data source

Data was extracted from Twitter between January and July of 2017. We chose to focus on the TV show domain because of the large amount of user-generated content on Twitter during the exhibition of the shows. Hashtags are used on social media to group messages by topic, and TV shows usually ask their audience to use a specific hashtag in order to gain visibility on these social networks. Some program hashtags group hundreds of thousands of messages during the exhibition of a show, and that content can include suggestions, complaints, evaluations and questions directed at the entities related to the programs.

We empirically selected nine programs from three major TV channels in Brazil based on their popularity and activity on social media. Talk shows, reality shows (gastronomy and music) and variety shows were chosen in order to diversify the phenomena in the corpus. The periodicity of the shows also differs: some go live daily while others air once or twice a week.

Since we were looking for user-generated content, we ignored documents generated by public entities, such as celebrities, companies, TV channels or any official user on Twitter. We also discarded retweets, which are reposts of popular posts on the social network.
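
As an illustration, a minimal sketch of this kind of collection step with the python-twitter wrapper is shown below. The credentials, the hashtag and the filtering details are placeholders; the continuous crawler used for the corpus is not distributed, so this only demonstrates the API calls involved.

import twitter  # python-twitter wrapper for the Twitter API

api = twitter.Api(consumer_key="...", consumer_secret="...",
                  access_token_key="...", access_token_secret="...")

def crawl_hashtag(hashtag, since_id=None, count=100):
    """Fetch recent Portuguese-language tweets for one show hashtag."""
    statuses = api.GetSearch(term=hashtag, since_id=since_id,
                             count=count, lang="pt")
    collected = []
    for status in statuses:
        # Skip retweets and verified (official) accounts, as in Section 3.1.
        if status.retweeted_status is None and not status.user.verified:
            collected.append((status.id, status.text))
    return collected

for tweet_id, text in crawl_hashtag("#encontro"):
    print(tweet_id, text)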

3.2 Classes definition

Following the recommendations of [Hovy and Lavid2010], a codebook (annotation manual) was written to ensure agreement between annotators. The codebook contains examples, definitions and tips for the annotation process. The rules and guidelines were formed by empirically observing the crawled dataset before the annotation, and were updated based on feedback from the annotators during the early stages of annotation.

The definitions were created based on the domain and on the feedback received from the annotators after their first contact with the data. These are the guidelines for the annotation in TTsBR:

Positive class: Positive sentences describe feelings of pleasure, satisfaction, compliment or recommendation. The target of the sentiment must be the TV show or any entity related to it (host, guests, audience, sketches, invited bands…). Positive comparisons, such as “This show is better than the other”, are considered positive, and emojis can be strong indicators of positivity.

Negative class: Negative sentences describe feelings of disagreement, disapproval, complaint or hate. The target of the sentiment must be the TV show or any entity related to it (hosts, guests, audience, sketches, invited bands…). Negative sentences can be direct, as in “Today’s show is terrible…”, or implied in a suggestion, as in “the host could improve his jokes, right?”. Factual information such as delays, abrupt cuts or technical failures is also considered negative, as long as it refers to the show or any entity related to it. Emojis are also good indicators of negativity.

Neutral class: The neutral label must be used for any sentence the annotator could not identify as an opinion (positive or negative), direct or implied. Factual sentences that do not represent a hit or a flaw, such as “Show X just started”, sentences with unclear semantic orientation (“Don’t know what to think about this”), and sentences the annotator could not completely comprehend were instructed to be labeled as neutral. Some tweets in the corpus were generated by social media bots (most of them about audience measurements), and the annotators labeled these as neutral as well.

We also wanted to keep track of the sentences that most caused doubt in the annotators. The annotators had a check box to mark in case of doubt in the annotation, even though this option did not prevent them from labeling the sentiment of the sentence. The annotators were instructed to mark the doubt option every time they felt divided between two or more classes or when they took longer than the average time on a sentence. The addition of a doubt option gives us new information about the data and also reduces the stress on annotators. In the first stages of annotation, only a small share of sentences were marked.

3.3 Annotation process

For the annotation process we recruited seven native speakers of Brazilian Portuguese from three different areas: linguistics, journalism and computer science. The annotation process was based on [Hovy and Lavid2010], following the eight steps of annotation in order to improve the reliability of the resource.

We developed a user-friendly interface for the annotators to label the tweets (Figure 1). The interface contains the codebook, the phases of annotation, a progress bar and a panel with tweets for labeling. The annotation panel shows the annotator the three classes, a box to be checked when in doubt (even though every tweet must receive a label in order to proceed to the next phase), and a side box with quick tips, contact information and a link to the codebook.

Figure 1: Snapshot of the annotation interface.

Each annotator received a set with one hundred tweets to be labeled. After this step we measured the average time and the agreement (all annotators received the same set), and took note of the questions about the codebook guidelines. We then rewrote the codebook, adding more examples and detailing the definitions based on the questions raised in this first annotation round.

The next step was a meeting with all annotators to collect their feedback, in which the tweets were revised by everyone and we presented the new version of the codebook.

We then proceeded to the regular annotation. First, the participants labeled a common set of tweets in order to measure agreement. We used Krippendorff’s Alpha [Krippendorff2004] to measure the agreement of the annotators, computing both the nominal and the interval measures in this phase. The annotators then began the individual phases, each labeling their share of tweets over six weeks, completing the training portion of the corpus.
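
For reference, Krippendorff’s Alpha can be computed with NLTK’s agreement module, as sketched below. The triples are toy data following the corpus label convention (1 = positive, 0 = neutral, -1 = negative), not the actual annotation.

from nltk.metrics.agreement import AnnotationTask
from nltk.metrics.distance import binary_distance, interval_distance

# (annotator, item, label) triples for every tweet each annotator saw.
data = [
    ("ann1", "tw1", 1), ("ann2", "tw1", 1), ("ann3", "tw1", 0),
    ("ann1", "tw2", -1), ("ann2", "tw2", -1), ("ann3", "tw2", -1),
]
print("nominal alpha: ", AnnotationTask(data=data, distance=binary_distance).alpha())
print("interval alpha:", AnnotationTask(data=data, distance=interval_distance).alpha())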

Two supervisors annotated a small portion of the corpus in order to obtain a revised test set. The goal was to form a part of the dataset specially labeled for evaluating machine learning methods. Some of the data annotated in the agreement phase was also used to compose this set.

For the release, we defined the general sentiment of each document by majority voting over the labels provided by the annotators. Some documents were labeled by only one annotator, while others were annotated by several. Documents with tied votes have no sentiment label, even though they were kept in the dataset with a “none” label.
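
A minimal sketch of this release-time vote, assuming the {1, 0, -1} label convention used throughout the corpus:

from collections import Counter

def majority_label(labels):
    """Return the label most annotators chose, or "none" on a tie."""
    counts = Counter(labels).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return "none"  # tied documents keep a "none" label
    return counts[0][0]

assert majority_label([1, 1, 0]) == 1       # clear majority
assert majority_label([1, -1]) == "none"    # tie between two annotators
assert majority_label([0]) == 0             # single annotator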

3.4 Release and distribution

The dataset is available at http://bitbucket.org/HBrum/tweetsentbr/. Twitter has a Privacy Policy forbidding the redistribution of the data, so we provide only the ids of the tweets in the corpus. Any user with valid credentials can retrieve the tweets freely.

We provide the dataset with the ids, the hashtags used in the search, the full annotator labels, the count of how many annotators checked the doubt option, and the general sentiment of each document, as well as a tool for downloading the dataset given a valid credential (provided by Twitter itself). An example of the dataset is presented below:

id        hashtag      labels   doubts  sentiment  split
---------------------------------------------------------
86304477  #encontro    [1,1,1]  0       1          train
86558371  #theNoite    [1,0,0]  2       0          test
86506323  #encontro    [1]      1       1          train
86466839  #masterChef  [-1]     0       -1         test

4 Experiments

In order to investigate the properties of the corpus, we defined a series of experiments to determine word frequency and class balance, and we also performed polarity classification using baseline methods for Portuguese.

For the experiments, we preprocessed the data: we replaced numbers (dates, currency values) with a NUMBER token, and replaced user names and links with the tokens USERNAME and URL, respectively. We also trimmed repeated characters (e.g. “looooove” becomes “love”).
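
A minimal sketch of this preprocessing with regular expressions follows. The token names match the paper, while the exact number pattern and the decision to collapse repetitions down to a single character are assumptions.

import re

def preprocess(text):
    text = re.sub(r"https?://\S+", "URL", text)       # links
    text = re.sub(r"@\w+", "USERNAME", text)          # user mentions
    text = re.sub(r"\d+([.,/]\d+)*", "NUMBER", text)  # dates, currency, scores
    # Collapse runs of repeated word characters ("looooove" -> "love");
    # collapsing to a single character is an assumption here.
    text = re.sub(r"(\w)\1{2,}", r"\1", text)
    return text

print(preprocess("@maria adoooro o show!!! nota 10/10 https://t.co/abc"))
# -> "USERNAME adoro o show!!! nota NUMBER URL"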

4.1 Corpus statistics

In order to extract information from the corpus, we measured the relevance of words in each class. We calculated the tf-idf value of each term, ignoring hashtags, stop-words, emojis and punctuation. We chose to report only the polarity classes (positive and negative), since the neutral class groups several characteristics (facts, off-topic sentences, confusing content) that led the analysis to unexpressive terms, such as the names of the shows, user nicknames and neutral verbs (present, watch, …).

The terms shown in Table 2 display a notable semantic orientation in each class: the positive class features the verb “to love” and positive adjectives, while the negative class features negative adjectives (the word “trash” is popularly used as an adjective on Twitter) and the verb “to remove”, which may indicate a request to remove a guest from a show or even a participant from a reality show.

     Positive class            Negative class
#    PT-BR        EN           PT-BR      EN
1    amo          to love      ridículo   ridiculous
2    fofura       cute         péssimo    awful
3    adorando     loving       lixo       trash
4    emocionada   emotional    tirem      to remove
5    linda        beautiful    mala       boring
Table 2: Five most relevant terms according to tf-idf for each polarity class.
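
One way to reproduce this kind of ranking is to average tf-idf weights per class, as sketched below. The simple token pattern stands in for the hashtag/stop-word/emoji filtering described above, and the example documents are invented.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def top_terms(docs, k=5):
    """Rank the terms of one class by their mean tf-idf weight."""
    vec = TfidfVectorizer(token_pattern=r"(?u)\b[^\W\d_]{2,}\b")
    tfidf = vec.fit_transform(docs)
    scores = np.asarray(tfidf.mean(axis=0)).ravel()
    terms = np.array(vec.get_feature_names_out())
    return list(terms[np.argsort(scores)[::-1][:k]])

positive_docs = ["amo esse programa", "que fofura, linda demais"]
negative_docs = ["péssimo, ridículo", "tirem esse convidado, que lixo"]
print("positive:", top_terms(positive_docs))
print("negative:", top_terms(negative_docs))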

4.2 Polarity classification task

In order to evaluate the corpus on the polarity classification task we used three machine learning methods proposed in [Avanço2015] for product review corpora in Brazilian Portuguese. We ran two classic classifiers, Naive Bayes and SVM (linear kernel), and a hybrid approach initially proposed for cross-domain polarity classification.

The hybrid classifier combines SVM and a lexicon-based approach for binary sentiment classification. The lexicon-based classifier uses linguistic rules designed for binary classification, such as identifying strong sentiment words in sentences and increasing or decreasing their value based on the presence of intensifying terms (e.g. very, less, much). The lexicon-based method is used when the sentence representation is located near the hyperplane (within a threshold assumed by the original author). This approach has been used in other works in the literature [Avanço et al.2016, Correa Junior et al.2017].
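
The decision rule can be sketched as follows: trust the SVM unless the example falls inside a margin band around the hyperplane, in which case a lexicon rule decides. The margin value and the toy lexicon below are assumptions; the rules in [Avanço2015] are richer (intensifiers, negation).

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

MARGIN = 0.5  # distance threshold around the hyperplane (assumed value)
LEXICON = {"amo": 1, "bom": 1, "péssimo": -1, "lixo": -1}  # toy lexicon

def lexicon_polarity(tokens):
    score = sum(LEXICON.get(t, 0) for t in tokens)
    return 1 if score >= 0 else -1

def hybrid_predict(svm, X, token_lists):
    preds = []
    for score, tokens in zip(svm.decision_function(X), token_lists):
        if abs(score) < MARGIN:               # too close to the hyperplane
            preds.append(lexicon_polarity(tokens))
        else:
            preds.append(1 if score > 0 else -1)
    return preds

vec = CountVectorizer(binary=True)
X = vec.fit_transform(["amo o show", "péssimo programa"])
svm = LinearSVC().fit(X, [1, -1])
text = "acho o programa bom"
print(hybrid_predict(svm, vec.transform([text]), [text.split()]))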

We represented the sentences as vectors using a binary bag-of-words (presence/absence of terms), sentiment words (obtained through Portuguese lexicons), emoticons and POS tags. The methods were built using the Scikit-learn library [Pedregosa et al.2011].
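
A reduced sketch of this representation is shown below, combining the binary bag-of-words with lexicon-count features; the POS and emoticon features are omitted, and the two word lists are toy stand-ins for the Portuguese lexicons.

from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB
from sklearn.svm import LinearSVC

POS_WORDS = {"amo", "linda", "bom"}          # toy sentiment lexicons
NEG_WORDS = {"lixo", "péssimo", "ridículo"}

def lexicon_features(texts):
    """Two columns: counts of positive and negative lexicon words."""
    rows = [[sum(w in POS_WORDS for w in t.lower().split()),
             sum(w in NEG_WORDS for w in t.lower().split())] for t in texts]
    return csr_matrix(rows)

train_texts = ["amo esse programa", "que lixo de show"]
train_labels = [1, -1]

bow = CountVectorizer(binary=True)           # presence/absence of terms
X_train = hstack([bow.fit_transform(train_texts), lexicon_features(train_texts)])

nb = BernoulliNB().fit(X_train, train_labels)
svm = LinearSVC().fit(X_train, train_labels)

test = ["programa péssimo"]
X_test = hstack([bow.transform(test), lexicon_features(test)])
print(nb.predict(X_test), svm.predict(X_test))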

Table 3 presents the results obtained for binary classification, which is a common approach in the area [Turney2002, Avanço et al.2016, Nakov et al.2016]. The neutral class was not used in these experiments.

Method F-Measure Accuracy
Naive Bayes 0.7987 0.8099
SVM 0.8099 0.8206
Hybrid Classifier 0.7659 0.7684
Table 3: Pos/neg classification of TweetSentBR.

For the three-class classification we followed the pipeline approach used in [Moraes et al.2016]: first we classify sentences as objective or subjective (meaning neutral or polar, respectively), and then classify the polar ones as positive or negative. We did not evaluate the corpus using the Hybrid Classifier, since its lexicon-based component could not be refactored to handle the neutral class. The results are shown in Table 4.
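
The two-step pipeline can be sketched as below: one classifier separates neutral from polar tweets, and a second one, trained only on polar tweets, decides positive versus negative. The features are reduced to a binary bag-of-words for brevity.

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

def fit_pipeline(texts, labels):             # labels in {1, 0, -1}
    labels = np.array(labels)
    vec = CountVectorizer(binary=True)
    X = vec.fit_transform(texts)
    subj = LinearSVC().fit(X, labels != 0)   # step 1: neutral vs. polar
    polar = labels != 0
    pol = LinearSVC().fit(X[polar], labels[polar])  # step 2: pos vs. neg
    return vec, subj, pol

def predict(vec, subj, pol, texts):
    X = vec.transform(texts)
    is_polar = subj.predict(X)
    pred = np.zeros(len(texts), dtype=int)   # default: neutral (0)
    if is_polar.any():
        pred[is_polar] = pol.predict(X[is_polar])
    return pred

texts = ["amo esse programa", "que lixo de show", "o programa começou"]
vec, subj, pol = fit_pipeline(texts, [1, -1, 0])
print(predict(vec, subj, pol, ["show péssimo", "começou agora"]))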

Method F-Measure Accuracy
Naive Bayes 0.5985 0.6462
SVM 0.5964 0.6462
Table 4: Pos/neu/neg classification of TweetSentBR.

5 Discussion and future work

TweetSentBR is a manually annotated corpus designed for polarity classification. The corpus covers a novel domain for the Brazilian Portuguese language, which can be exploited by new machine learning approaches such as deep learning architectures and ensembles.

It also offers a new resource for linguistic approaches to natural language, such as the study of expressions, social media behavior or hate speech detection. The doubt label, for example, can be used for a finer evaluation of classifiers by comparing machine learning errors with human uncertainty in labeling the data.

This corpus also differs from other approaches by including the neutral class. The addition of the neutral class brings the corpus closer to popular applications, since polarity classifiers in industry must find solutions for separating user opinions from noisy data. ReLi [Freitas et al.2012] and Computer-BR [Moraes et al.2016] are the only corpora we found in the literature that describe the use of a neutral class in the annotation.

We believe this corpus can still be improved by labeling more data manually or by using semi-supervised methods, such as self-training or co-training [da Silva et al.2016].

6 Acknowledgements

We acknowledge financial support from CNPq and the help of Amanda Carneiro, Fernando Nóbrega, Juliana Batista, Rafael Anchiêta and Thales Bertaglia for the volunteer annotation of TTsBR, and also Marcos Treviso for the development of the annotation interface.

7 Bibliographical References


  • [Avanço et al.2016] Avanço, L. V., Brum, H. B., and Nunes, M. (2016). Improving opinion classifiers by combining different methods and resources. XIII Encontro Nacional de Inteligência Artificial e Computacional (ENIAC), pages 25–36.
  • [Avanço2015] Avanço, L. V. (2015). Sobre normalização e classificação de polaridade de textos opinativos na web.
  • [Balage et al.2013] Balage, P. P., Pardo, T. A. S., and Aluísio, S. M. (2013). An evaluation of the Brazilian Portuguese LIWC dictionary for sentiment analysis. Proceedings of the 9th Brazilian Symposium in Information and Human Language Technology (STIL), pages 215–219.
  • [Blitzer et al.2007] Blitzer, J., Dredze, M., Pereira, F., et al. (2007). Biographies, bollywood, boom-boxes and blenders: Domain adaptation for sentiment classification. ACL, 7:440–447.
  • [Brum et al.2016] Brum, H., Araujo, F., and Kepler, F. (2016). Sentiment analysis for Brazilian Portuguese over a skewed class corpora. International Conference on Computational Processing of the Portuguese Language, pages 134–138.
  • [Carvalho et al.2009] Carvalho, P., Sarmento, L., Silva, M. J., and De Oliveira, E. (2009). Clues for detecting irony in user-generated contents: oh…!! it’s so easy ;-). In Proceedings of the 1st international CIKM workshop on Topic-sentiment analysis for mass opinion, pages 53–56. ACM.
  • [Correa Junior et al.2017] Correa Junior, E. A., Marinho, V. Q., Santos, L. B. d., Bertaglia, T. F., Treviso, M. V., and Brum, H. B. (2017). Pelesent: Cross-domain polarity classification using distant supervision. arXiv preprint arXiv:1707.02657.
  • [da Silva et al.2016] da Silva, N. F. F., Coletta, L. F., Hruschka, E. R., and Hruschka Jr, E. R. (2016). Using unsupervised information to improve semi-supervised tweet sentiment classification. Information Sciences, 355:348–365.
  • [Freitas et al.2012] Freitas, C., Motta, E., Milidiú, R., and César, J. (2012). Vampiro que brilha… rá! Desafios na anotação de opinião em um corpus de resenhas de livros. Encontro de Linguística de Corpus, 11:3.
  • [Go et al.2009] Go, A., Bhayani, R., and Huang, L. (2009). Twitter sentiment classification using distant supervision. CS224N Project Report, Stanford, 1(2009):12.
  • [Hartmann et al.2014] Hartmann, N. S., Avanço, L. V., Balage Filho, P. P., Duran, M. S., Nunes, M. d. G. V., Pardo, T. A. S., and Aluisio, S. M. (2014). A large corpus of product reviews in portuguese: tackling out-of-vocabulary words. 9th International Conference on Language Resources and Evaluation.
  • [Hovy and Lavid2010] Hovy, E. and Lavid, J. (2010). Towards a ’science’ of corpus annotation: a new methodological challenge for corpus linguistics. International journal of translation, 22(1):13–36.
  • [Hu and Liu2004] Hu, M. and Liu, B. (2004). Mining and summarizing customer reviews. In Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 168–177. ACM.
  • [Krippendorff2004] Krippendorff, K. (2004). Reliability in content analysis: Some common misconceptions. Human Communication Research, 30:411–433.
  • [Moraes et al.2015] Moraes, S. M. W., Manssour, I. H., and Silveira, M. S. (2015). 7x1pt: um corpus extraído do twitter para análise de sentimentos em língua portuguesa. Proceedings of Symposium in Information and Human Language Technology.
  • [Moraes et al.2016] Moraes, S. M., Santos, A. L., Redecker, M. S., Machado, R. M., and Meneguzzi, F. R. (2016). Classificação de sentimentos em nível de sentença: uma abordagem de múltiplas camadas para língua portuguesa. XIII Encontro Nacional de Inteligência Artificial e Computacional.
  • [Nakov et al.2016] Nakov, P., Ritter, A., Rosenthal, S., Sebastiani, F., and Stoyanov, V. (2016). Semeval-2016 task 4: Sentiment analysis in twitter. Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval 2016).
  • [Pak and Paroubek2010] Pak, A. and Paroubek, P. (2010). Twitter as a corpus for sentiment analysis and opinion mining. In LREc, volume 10.
  • [Pang and Lee2005] Pang, B. and Lee, L. (2005). Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, ACL ’05, pages 115–124, Stroudsburg, PA, USA. Association for Computational Linguistics.
  • [Pang et al.2002] Pang, B., Lee, L., and Vaithyanathan, S. (2002). Thumbs up?: Sentiment classification using machine learning techniques. In Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing - Volume 10, EMNLP ’02, pages 79–86, Stroudsburg, PA, USA. Association for Computational Linguistics.
  • [Pedregosa et al.2011] Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., et al. (2011). Scikit-learn: Machine learning in python. Journal of Machine Learning Research, 12(Oct):2825–2830.
  • [Reyes et al.2012] Reyes, A., Rosso, P., and Buscaldi, D. (2012). From humor recognition to irony detection: The figurative language of social media. Data & Knowledge Engineering, 74:1–12.
  • [Silva et al.2011] Silva, I. S., Gomide, J., Veloso, A., Meira Jr, W., and Ferreira, R. (2011). Effective sentiment stream analysis with self-augmenting training and demand-driven projection. In Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval, pages 475–484. ACM.
  • [Socher et al.2013] Socher, R., Perelygin, A., Wu, J., Chuang, J., Manning, C. D., Ng, A., and Potts, C. (2013). Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 conference on empirical methods in natural language processing, pages 1631–1642.
  • [Turney2002] Turney, P. D. (2002). Thumbs up or thumbs down?: Semantic orientation applied to unsupervised classification of reviews. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ACL ’02, pages 417–424, Stroudsburg, PA, USA. Association for Computational Linguistics.