SemEval-2014 Task 9: Sentiment Analysis in Twitter

Sara Rosenthal, et al. (Johns Hopkins University, Qatar Foundation, Columbia University, Carnegie Mellon University)

We describe the Sentiment Analysis in Twitter task, run as part of SemEval-2014. It is a continuation of last year's task, which ran successfully as part of SemEval-2013. As in 2013, this was the most popular SemEval task; a total of 46 teams contributed 27 submissions for subtask A (21 teams) and 50 submissions for subtask B (44 teams). This year, we introduced three new test sets: (i) regular tweets, (ii) sarcastic tweets, and (iii) LiveJournal sentences. We further tested on (iv) 2013 tweets, and (v) 2013 SMS messages. The highest F1-score on (i) was achieved by NRC-Canada at 86.63 for subtask A and by TeamX at 70.96 for subtask B.


1 Introduction

In the past decade, new forms of communication have emerged and have become ubiquitous through social media. Microblogs (e.g., Twitter), Weblogs (e.g., LiveJournal) and cell phone messages (SMS) are often used to share opinions and sentiments about the surrounding world, and the availability of social content generated on sites such as Twitter creates new opportunities to automatically study public opinion.

This work is licensed under a Creative Commons Attribution 4.0 International Licence. Page numbers and proceedings footer are added by the organisers. Licence details: http://creativecommons.org/licenses/by/4.0/

Working with these informal text genres presents new challenges for natural language processing beyond those encountered when working with more traditional text genres such as newswire. The language in social media is very informal, with creative spelling and punctuation, misspellings, slang, new words, URLs, and genre-specific terminology and abbreviations, e.g., RT for re-tweet and #hashtags (a type of tagging for Twitter messages).

Moreover, tweets and SMS messages are short: a sentence or a headline rather than a document.

How to handle such challenges so as to automatically mine and understand people’s opinions and sentiments has only recently been the subject of research [8, 2, 3, 5, 13, 14, 18, 9].

Several corpora with detailed opinion and sentiment annotation have been made freely available, e.g., the MPQA newswire corpus [19], the movie reviews corpus [15], or the restaurant and laptop reviews corpora that are part of this year’s SemEval Task 4 [16]. These corpora have proved very valuable as resources for learning about the language of sentiment in general, but they do not focus on tweets. While some Twitter sentiment datasets were created prior to SemEval-2013, they were either small and proprietary, such as the i-sieve corpus [9], or focused solely on message-level sentiment.

Thus, the primary goal of our SemEval task is to promote research that will lead to better understanding of how sentiment is conveyed in Social Media. Toward that goal, we created the SemEval Tweet corpus as part of our inaugural Sentiment Analysis in Twitter Task, SemEval-2013 Task 2 [12]. It contains tweets and SMS messages with sentiment expressions annotated with contextual phrase-level and message-level polarity. This year, we extended the corpus by adding new tweets and LiveJournal sentences.

Another interesting phenomenon that has been studied in Twitter is the use of the #sarcasm hashtag to indicate that a tweet should not be taken literally [7, 10]. In fact, sarcasm indicates that the message polarity should be flipped. With this in mind, this year, we also evaluate on sarcastic tweets.

In the remainder of this paper, we first describe the task, the dataset creation process and the evaluation methodology. We then summarize the characteristics of the approaches taken by the participating systems, and we discuss their scores.

2 Task Description

As in SemEval-2013 Task 2, we included two subtasks: an expression-level subtask and a message-level subtask. Participants could choose to participate in either or both. Below we provide short descriptions of the objectives of these two subtasks.

Subtask A: Contextual Polarity Disambiguation

Given a message containing a marked instance of a word or a phrase, determine whether that instance is positive, negative or neutral in that context. The instance boundaries were provided: this was a classification task, not an entity recognition task.

Subtask B: Message Polarity Classification

Given a message, decide whether it is of positive, negative, or neutral sentiment. For messages conveying both positive and negative sentiment, the stronger one is to be chosen.

Each participating team was allowed to submit results for two different systems per subtask: one constrained, and one unconstrained. A constrained system could only use the provided data for training, but it could also use other resources such as lexicons obtained elsewhere. An unconstrained system could use any additional data as part of the training process; this could be done in a supervised, semi-supervised, or unsupervised fashion.

Note that constrained/unconstrained refers to the data used to train a classifier. For example, if other data (excluding the test data) was used to develop a sentiment lexicon, and the lexicon was used to generate features, the system would still be constrained. However, if other data (excluding the test data) was used to develop a sentiment lexicon, and this lexicon was used to automatically label additional Tweet/SMS messages and then used with the original data to train the classifier, then such a system would be considered unconstrained.

3 Datasets

Corpus                  Positive   Negative   Objective/Neutral
Twitter2013-train          5,895      3,131                 471
Twitter2013-dev              648        430                  57
Twitter2013-test           2,734      1,541                 160
SMS2013-test               1,071      1,104                 159
Twitter2014-test           1,807        578                  88
Twitter2014-sarcasm           82         37                   5
LiveJournal2014-test         660        511                 144
Table 1: Dataset statistics for Subtask A.

In this section, we describe the process of collecting and annotating the 2014 testing tweets, including the sarcastic ones, and LiveJournal sentences.

3.1 Datasets Used

For training and development, we released the Twitter train/dev/test datasets from SemEval-2013 Task 2, as well as the SMS test set, which contains messages from the NUS SMS corpus [4] that we annotated for sentiment in 2013.

We further added a new 2014 Twitter test set, as well as a small set of tweets that contained the #sarcasm hashtag to determine how sarcasm affects the tweet polarity. Finally, we included sentences from LiveJournal in order to determine how systems trained on Twitter perform on other sources. The statistics for each dataset and for each subtask are shown in Tables 1 and  2.

Corpus                  Positive   Negative   Objective/Neutral
Twitter2013-train          3,662      1,466               4,600
Twitter2013-dev              575        340                 739
Twitter2013-test           1,572        601               1,640
SMS2013-test                 492        394               1,207
Twitter2014-test             982        202                 669
Twitter2014-sarcasm           33         40                  13
LiveJournal2014-test         427        304                 411
Table 2: Dataset statistics for Subtask B.

3.2 Annotation

Source            Example                                                                               Polarity
Twitter           Why would you [still]- wear shorts when it’s this cold?! I [love]+ how Britain
                  see’s a bit of sun and they’re [like ’OOOH]+ LET’S STRIP!’                            positive
SMS               [Sorry]- I think tonight [cannot]- and I [not feeling well]- after my rest.           negative
LiveJournal       [Cool]+ posts , dude ; very [colorful]+ , and [artsy]+ .                              positive
Twitter Sarcasm   [Thanks]+ manager for putting me on the schedule for Sunday                           negative
Table 3: Example of polarity for each source of messages. The target phrases are marked in [], and are followed by their polarity; the sentence-level polarity is shown in the last column.

We annotated the new tweets as in 2013: by identifying tweets from popular topics that contain sentiment-bearing words, using SentiWordNet [1] as a filter. We altered the annotation task for the sarcastic tweets, displaying them to the Mechanical Turk annotators without the #sarcasm hashtag; the Turkers had to determine whether the tweet was sarcastic on their own. Moreover, we asked Turkers to indicate the degree of sarcasm as (a) definitely sarcastic, (b) probably sarcastic, or (c) not sarcastic.

As in 2013, we combined the annotations using intersection, where a word had to appear in 2/3 of the annotations to be accepted. An annotated example from each source is shown in Table 3.
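To make the intersection rule concrete, below is a minimal sketch, assuming each annotator's output is represented as a mapping from word positions to polarity labels (the actual annotation format is not shown in this paper); a label is kept only if at least two of the three annotators assigned it.

    from collections import Counter

    def combine_annotations(annotations, min_votes=2):
        # Keep a (position, polarity) label only if at least `min_votes`
        # of the annotators (here 2 out of 3) agree on it.
        votes = Counter()
        for ann in annotations:                      # ann: {word_position: polarity}
            for pos, polarity in ann.items():
                votes[(pos, polarity)] += 1
        return {pos: pol for (pos, pol), n in votes.items() if n >= min_votes}

    # Three hypothetical annotators labelling word positions of one tweet:
    a1 = {3: "positive", 7: "negative"}
    a2 = {3: "positive"}
    a3 = {3: "positive", 7: "neutral"}
    print(combine_annotations([a1, a2, a3]))         # {3: 'positive'}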

3.3 Tweets Delivery

We did not deliver the annotated tweets to the participants directly; instead, we released annotation indexes, a list of corresponding Twitter IDs, and a download script that extracts the corresponding tweets via the Twitter API (https://dev.twitter.com). We provided the tweets in this manner in order to ensure that Twitter’s terms of service were not violated. Unfortunately, due to this restriction, the task participants had access to a different number of training tweets depending on when they did the downloading. This varied between a minimum of 5,215 tweets and the full set of 10,882 tweets. On average, the teams were able to collect close to 9,000 tweets; for teams that did not participate in 2013, this was about 8,500. The difference in training data size did not seem to have had a major impact. In fact, the top two teams in subtask B (coooolll and TeamX) trained on less than 8,500 tweets.
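The released download script is not reproduced here; the sketch below only illustrates the general workflow, assuming a tab-separated index file with one tweet ID and gold label per line, and a placeholder fetch_tweet helper standing in for the actual Twitter API call.

    import csv

    def load_gold_ids(index_path):
        # Read the released index: tweet ID and gold label per line (assumed tab-separated).
        with open(index_path, encoding="utf-8") as f:
            return [(row[0], row[1]) for row in csv.reader(f, delimiter="\t")]

    def fetch_tweet(tweet_id):
        # Placeholder for the real Twitter API call made by the official script.
        # Returns None for tweets that have been deleted or made private.
        return None

    def build_corpus(index_path):
        corpus = []
        for tweet_id, label in load_gold_ids(index_path):
            text = fetch_tweet(tweet_id)
            if text is not None:   # unavailable tweets explain the varying corpus sizes
                corpus.append((tweet_id, text, label))
        return corpus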

4 Scoring

The participating systems were required to perform a three-way classification for both subtasks. A particular marked phrase (for subtask A) or an entire message (for subtask B) was to be classified as positive, negative or objective/neutral. We scored the systems by computing a score for predicting positive/negative phrases/messages. For instance, to compute positive precision, $P_{pos}$, we find the number of phrases/messages that a system correctly predicted to be positive, and we divide that number by the total number it predicted to be positive. To compute positive recall, $R_{pos}$, we find the number of phrases/messages correctly predicted to be positive and we divide that number by the total number of positives in the gold standard. We then calculate an F1-score for the positive class as follows: $F_{pos} = \frac{2 P_{pos} R_{pos}}{P_{pos} + R_{pos}}$. We carry out a similar computation for $F_{neg}$, for the negative phrases/messages. The overall score is then $F = \frac{F_{pos} + F_{neg}}{2}$.
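A minimal re-implementation of this metric (a sketch, not the official scorer) could look as follows:

    def f1_for_class(gold, pred, cls):
        # Precision, recall and F1 for one class, e.g. "positive".
        tp = sum(1 for g, p in zip(gold, pred) if p == cls and g == cls)
        n_pred = sum(1 for p in pred if p == cls)
        n_gold = sum(1 for g in gold if g == cls)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_gold if n_gold else 0.0
        return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

    def task_score(gold, pred):
        # Overall metric: average of the positive and negative F1 (neutral is ignored).
        return (f1_for_class(gold, pred, "positive") +
                f1_for_class(gold, pred, "negative")) / 2

    print(task_score(["positive", "negative", "neutral", "positive"],
                     ["positive", "negative", "positive", "neutral"]))   # 0.75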

We used the two test sets from 2013 and the three from 2014, which we combined into one test set and shuffled to make it hard to guess which set a sentence came from. This guaranteed that participants would submit predictions for all five test sets. It also allowed us to test how well systems trained on standard tweets generalize to sarcastic tweets and to LiveJournal sentences, without the participants putting extra effort into this. The participants were also not told which sources the extra test sets came from.

We provided the participants with a scorer that outputs the overall score and a confusion matrix for each of the five test sets.

5 Participants and Results

The results are shown in Tables 4 and 5, and the team affiliations are shown in Table LABEL:T:teams. Tables 4 and 5 contain results on the two progress test sets (tweets and SMS messages), which are the official test sets from the 2013 edition of the task, and on the three new official 2014 testsets (tweets, tweets with sarcasm, and LiveJournal). The tables further show macro- and micro-averaged results over the 2014 datasets. There is an index for each result showing the relative rank of that result within the respective column. The participating systems are ranked by their score on the Twitter-2014 testset, which is the official ranking for the task; all remaining rankings are secondary.

As we mentioned above, the participants were not told that the 2013 test sets would be included in the big 2014 test set, so that they would not overtune their systems on them. However, the 2013 test sets were made available for development, but it was explicitly forbidden to use them for training. Still, some participants did not notice this restriction, which resulted in their unusually high scores on Twitter2013-test; we did our best to identify all such cases, and we asked the authors to submit corrected runs. The tables mark such resubmissions accordingly.

Most of the submissions were constrained, with just a few unconstrained: 7 out of 27 for subtask A, and 8 out of 50 for subtask B. In any case, the best systems were constrained. Some teams participated with both a constrained and an unconstrained system, but the unconstrained system was not always better than the constrained one: sometimes it was worse, sometimes it performed the same. Thus, we decided to produce a single ranking, including both constrained and unconstrained systems, where we mark the latter accordingly.

5.1 Subtask A

                                  2013: Progress         2014: Official                             2014: Average
 #   System             Uncon.?   Tweet      SMS         Tweet     Tweet sarcasm   LiveJournal      Macro      Micro
1 NRC-Canada 90.14 88.03 86.63 77.13 85.49 83.08 85.61
2 SentiKLUE 90.11 85.16 84.83 79.32 85.61 83.25 85.15
3 CMUQ-Hybrid 88.94 87.98 84.40 76.99 84.21 81.87 84.05
4 CMU-Qatar 89.85 88.08 83.45 78.07 83.89 81.80 83.56
5 ECNU 87.29 89.26 82.93 73.71 81.69 79.44 81.85
6 ECNU 87.28 89.31 82.67 73.71 81.67 79.35 81.75
7 Think_Positive 88.06 87.65 82.05 76.74 80.90 79.90 81.15
8 Kea 84.83 84.14 81.22 65.94 81.16 76.11 80.70
9 Lt_3 86.28 85.26 81.02 70.76 80.44 77.41 80.33
10 senti.ue 84.05 78.72 80.54 82.75 81.90 81.73 81.47
11 LyS 85.69 81.44 79.92 71.67 83.95 78.51 82.21
12 UKPDIPF 80.45 79.05 79.67 65.63 81.42 75.57 80.33
13 UKPDIPF 80.45 79.05 79.67 65.63 81.42 75.57 80.33
14 TJP 81.13 84.41 79.30 71.20 78.27 76.26 78.39
15 SAP-RI 80.32 80.26 77.26 70.64 77.68 75.19 77.32
16 senti.ue 83.80 82.93 77.07 80.02 79.70 78.93 78.83
17 SAIL 78.47 74.46 76.89 65.56 70.62 71.02 72.57
18 columbia_nlp 81.50 74.55 76.54 61.76 78.19 72.16 77.11
19 IIT-Patna 76.54 75.99 76.43 71.43 77.99 75.28 77.26
20 Citius 76.59 69.31 75.21 68.40 75.82 73.14 75.38
21 Citius 74.71 61.44 73.03 65.18 71.64 69.95 71.90
22 IITPatna 70.91 77.04 72.25 66.32 76.03 71.53 74.45
23 SU-sentilab 74.34 62.58 68.26 53.31 69.53 63.70 68.59
24 Univ. Warwick 62.25 60.12 67.28 58.08 64.89 63.42 65.48
25 Univ. Warwick 64.91 63.01 67.17 60.59 67.46 65.07 67.14
26 DAEDALUS 67.42 63.92 60.98 45.27 61.01 55.75 60.50
27 DAEDALUS 61.95 55.97 58.11 49.19 58.65 55.32 58.17
Majority baseline 38.1 31.5 42.2 39.8 33.4
Table 4: Results for subtask A. Resubmitted systems (which initially trained on Twitter2013-test) and systems that include a task co-organizer as a team member are marked accordingly. The systems are sorted by their score on the Twitter2014 test dataset; the rankings on the individual datasets are indicated with a subscript. The last two columns show macro- and micro-averaged results across the three 2014 test datasets.

Table 4 shows the results for subtask A, which attracted 27 submissions from 21 teams. There were seven unconstrained submissions: five teams submitted both a constrained and an unconstrained run, and two teams submitted an unconstrained run only. The best systems were constrained. All participating systems outperformed the majority class baseline by a sizable margin.

5.2 Subtask B

                                  2013: Progress         2014: Official                             2014: Average
 #   System             Uncon.?   Tweet      SMS         Tweet     Tweet sarcasm   LiveJournal      Macro      Micro
1 TeamX 72.12 57.36 70.96 56.50 69.44 65.63 69.99
2 coooolll 70.40 67.68 70.14 46.66 72.90 63.23 70.51
3 RTRGO 69.10 67.51 69.95 47.09 72.20 63.08 70.15
4 NRC-Canada 70.75 70.28 69.85 58.16 74.84 67.62 71.37
5 TUGAS 65.64 62.77 69.00 52.87 69.79 63.89 68.84
6 CISUC_KIS 67.56 65.90 67.95 55.49 74.46 65.97 70.02
7 SAIL 66.80 56.98 67.77 57.26 69.34 64.79 68.06
8 SWISS-CHOCOLATE 64.81 66.43 67.54 49.46 73.25 63.42 69.15
9 Synalp-Empathic 63.65 62.54 67.43 51.06 71.75 63.41 68.57
10 Think_Positive 68.15 63.20 67.04 47.85 66.96 60.62 66.47
11 SentiKLUE 69.06 67.40 67.02 43.36 73.99 61.46 68.94
12 JOINT_FORCES 66.61 62.20 66.79 45.40 70.02 60.74 67.39
13 AMI_ERIC 70.09 60.29 66.55 48.19 65.32 60.02 65.58
14 AUEB 63.92 64.32 66.38 56.16 70.75 64.43 67.71
15 CMU-Qatar 65.11 62.95 65.53 40.52 65.63 57.23 64.87
16 Lt_3 65.56 64.78 65.47 47.76 68.56 60.60 66.12
17 columbia_nlp 64.60 59.84 65.42 40.02 68.79 58.08 65.96
18 LyS 66.92 60.45 64.92 42.40 69.79 59.04 66.10
19 NILC_USP 65.39 61.35 63.94 42.06 69.02 58.34 65.21
20 senti.ue 67.34 59.34 63.81 55.31 71.39 63.50 66.38
21 UKPDIPF 60.65 60.56 63.77 54.59 71.92 63.43 66.53
22 UKPDIPF 60.65 60.56 63.77 54.59 71.92 63.43 66.53
23 SU-FMI 60.96 61.67 63.62 48.34 68.24 60.07 64.91
24 ECNU 62.31 59.75 63.17 51.43 69.44 61.35 65.17
25 ECNU 63.72 56.73 63.04 49.33 64.08 58.82 63.04
26 Rapanakis 58.52 54.02 63.01 44.69 59.71 55.80 61.28
27 Citius 63.25 58.28 62.94 46.13 64.54 57.87 63.06
28 CMUQ-Hybrid 63.22 61.75 62.71 40.95 65.14 56.27 63.00
29 Citius 62.53 57.69 61.92 41.00 62.40 55.11 61.51
30 KUNLPLab 58.12 55.89 61.72 44.60 63.77 56.70 62.00
31 senti.ue 65.21 56.16 61.47 54.09 68.08 61.21 63.71
32 UPV-ELiRF 63.97 55.36 59.33 37.46 64.11 53.63 60.49
33 USP_Biocom 58.05 53.57 59.21 43.56 67.80 56.86 61.96
34 DAEDALUS 58.94 54.96 57.64 35.26 60.99 51.30 58.26
35 IIT-Patna 52.58 51.96 57.25 41.33 60.39 52.99 57.97
36 DejaVu 57.43 55.57 57.02 42.46 64.69 54.72 59.46
37 GPLSI 57.49 46.63 56.06 53.90 57.32 55.76 56.47
38 BUAP 56.85 44.27 55.76 51.52 53.94 53.74 54.97
39 SAP-RI 50.18 49.00 55.47 48.64 57.86 53.99 56.17
40 UMCC_DLSI_Sem 51.96 50.01 55.40 42.76 53.12 50.43 54.20
41 IBM_EG 54.51 46.62 52.26 34.14 59.24 48.55 54.34
42 Alberta 53.85 49.05 52.06 40.40 52.38 48.28 51.85
43 lsis_lif 46.38 38.56 52.02 34.64 61.09 49.25 54.90
44 SU-sentilab 50.17 49.60 49.52 31.49 55.11 45.37 51.09
45 SINAI 50.59 57.34 49.50 31.15 58.33 46.33 52.26
46 IITPatna 50.32 40.56 48.22 36.73 54.68 46.54 50.29
47 Univ. Warwick 39.17 29.50 45.56 39.77 39.60 41.64 43.19
48 UMCC_DLSI_Graph 43.24 36.66 45.49 53.15 47.81 48.82 46.56
49 Univ. Warwick 34.23 24.63 45.11 31.40 29.34 35.28 38.88
50 DAEDALUS 36.57 40.86 33.03 28.96 40.83 34.27 35.81
Majority baseline 29.2 19.0 34.6 27.7 27.2
Table 5: Results for subtask B. Resubmitted systems (which initially trained on Twitter2013-test) and systems that include a task co-organizer as a team member are marked accordingly. The systems are sorted by their score on the Twitter2014 test dataset; the rankings on the individual datasets are indicated with a subscript. The last two columns show macro- and micro-averaged results across the three 2014 test datasets.

The results for subtask B are shown in Table 5. The subtask attracted 50 submissions from 44 teams. There were eight unconstrained submissions: six teams submitted both a constrained and an unconstrained run, and two teams submitted an unconstrained run only. As for subtask A, the best systems were constrained. Again, all participating systems outperformed the majority class baseline; however, some systems were very close to it.

6 Discussion

Overall, we observed similar trends as in SemEval-2013 Task 2. Almost all systems used supervised learning. Most systems were constrained, including the best ones in all categories. As in 2013, we observed several cases of a team submitting a constrained and an unconstrained run and the constrained run performing better.

It is unclear why unconstrained systems did not outperform constrained ones. It could be because participants did not use enough external data, or because the data they used was too different from Twitter or from our annotation method. Or it could be due to our definition of unconstrained, which labels as unconstrained systems that use additional tweets directly for training, but considers constrained those that only use additional tweets to build sentiment lexicons and then use these lexicons to derive features.

As in 2013, the most popular classifiers were SVM, MaxEnt, and Naive Bayes. Moreover, two submissions used deep learning: coooolll (Harbin Institute of Technology) and Think_Positive (IBM Research, Brazil), which were ranked second and tenth on subtask B, respectively.

The features used were quite varied, including word-based features (e.g., word and character n-grams, word shapes, and lemmata), syntactic features, and Twitter-specific features such as emoticons and abbreviations. The participants still relied heavily on lexicons of opinion words, the most popular ones being the same as in 2013: MPQA, SentiWordNet and Bing Liu’s opinion lexicon. Also popular this year was the NRC lexicon [11], created by the best-performing team in 2013, which is top-performing this year as well.
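As a toy illustration of the n-gram and lexicon-count features mentioned above (the feature templates and tiny lexicons below are placeholders, not those of any particular participating system):

    def char_ngrams(text, n=3):
        # All character n-grams of a string (empty list if the string is shorter than n).
        return [text[i:i + n] for i in range(len(text) - n + 1)]

    def extract_features(tokens, positive_words, negative_words):
        # Word unigrams, character trigrams and simple lexicon counts.
        feats = {}
        for tok in tokens:
            feats["w=" + tok.lower()] = 1
        for ng in char_ngrams(" ".join(tokens).lower()):
            feats["c=" + ng] = feats.get("c=" + ng, 0) + 1
        feats["lex_pos"] = sum(t.lower() in positive_words for t in tokens)
        feats["lex_neg"] = sum(t.lower() in negative_words for t in tokens)
        return feats

    # Tiny placeholder lexicons; real systems used MPQA, SentiWordNet, Bing Liu's lexicon, etc.
    print(extract_features(["I", "love", "this", "phone"], {"love", "great"}, {"hate", "awful"}))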

Preprocessing of tweets was still a popular technique. In addition to standard NLP steps such as tokenization, stemming, lemmatization, stop-word removal and POS tagging, most teams applied some kind of Twitter-specific processing such as substitution/removal of URLs, substitution of emoticons, word normalization, abbreviation lookup, and punctuation removal. Finally, several of the teams used Twitter-tuned NLP tools such as part-of-speech and named-entity taggers [6, 17].
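A minimal sketch of this kind of Twitter-specific normalization (the emoticon map and rules below are illustrative, not an inventory of what any team actually used):

    import re

    URL_RE = re.compile(r"https?://\S+")
    USER_RE = re.compile(r"@\w+")
    ELONGATION_RE = re.compile(r"(.)\1{2,}")            # "sooo" -> "soo"
    EMOTICONS = {":)": " EMO_POS ", ":(": " EMO_NEG "}  # tiny illustrative map

    def normalize_tweet(text):
        # URL/user substitution, emoticon replacement, elongation squeezing.
        text = URL_RE.sub("URL", text)
        text = USER_RE.sub("USER", text)
        for emo, tag in EMOTICONS.items():
            text = text.replace(emo, tag)
        text = ELONGATION_RE.sub(r"\1\1", text)
        return " ".join(text.lower().split())

    print(normalize_tweet("Sooo happy :) check http://t.co/xyz @friend"))
    # soo happy emo_pos check url user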

The similarity of preprocessing techniques, NLP tools, classifiers and features used in 2013 and this year is probably partially due to many teams participating in both years. As Table LABEL:T:teams shows, 18 out of the 46 teams are returning teams.

Comparing the results on the Twitter progress test set in 2013 and 2014, we can see that NRC-Canada, the 2013 winner for subtask A, has now improved its F1-score from 88.93 to 90.14, which is the 2014 best score. The best score on the SMS progress test set in 2014, 89.31, belongs to ECNU; this is a big jump compared to their 2013 score of 76.69, and a more modest improvement over the 2013 best of 88.37, achieved by GU-MLT-LT. For subtask B, on the Twitter progress test set, the 2013 winner NRC-Canada improves its 2013 result from 69.02 to 70.75, which is the second best in 2014; the winner in 2014, TeamX, achieves 72.12. On the SMS progress test set, the 2013 winner NRC-Canada improves its F1-score from 68.46 to 70.28. Overall, we see consistent improvements on the progress test sets for both subtasks: 0-1 and 2-3 points absolute for subtasks A and B, respectively.

Finally, note that for both subtasks, the best systems on the Twitter-2014 dataset are those that performed best on the 2013 progress Twitter dataset: NRC-Canada for subtask A, and TeamX (Fuji Xerox Co., Ltd.) for subtask B.

It is interesting to note that the best results for Twitter2014-test are lower than those for Twitter2013-test for both subtask A (86.63 vs. 90.14) and subtask B (70.96 vs. 72.12). This is so despite the baselines for Twitter2014-test being higher than those for Twitter2013-test: 42.2 vs. 38.1 for subtask A, and 34.6 vs. 29.2 for subtask B. Most likely, having had access to Twitter2013-test at development time, teams have overfitted on it. It could also be the case that some of the sentiment dictionaries that were built in 2013 had become somewhat outdated by 2014.

Finally, note that while some teams such as NRC-Canada performed well across all test sets, others, such as TeamX, which used a weighting scheme tuned specifically for the class imbalances in tweets, were only strong on the Twitter datasets.

7 Conclusion

We have described the data, the experimental setup and the results for SemEval-2014 Task 9. As in 2013, our task was the most popular one at SemEval-2014, attracting 46 participating teams: 21 in subtask A (27 submissions) and 44 in subtask B (50 submissions).

We introduced three new test sets for 2014: an in-domain Twitter dataset, an out-of-domain LiveJournal test set, and a dataset of tweets containing sarcastic content. While the performance on the LiveJournal test set was mostly comparable to the in-domain Twitter test set, for most teams there was a sharp drop in performance for sarcastic tweets, highlighting better handling of sarcastic language as one important direction for future work in Twitter sentiment analysis.

We plan to run the task again in 2015 with the inclusion of a new sub-evaluation on detecting sarcasm with the goal of stimulating research in this area; we further plan to add one more test domain.

In the 2015 edition of the task, we might also remove the constrained/unconstrained distinction.

Finally, as there are multiple opinions about a topic in Twitter, we would like to focus on detecting the sentiment trend towards a topic.

Acknowledgements

We would like to thank Kathleen McKeown and Smaranda Muresan for funding the 2014 Twitter test sets. We also thank the anonymous reviewers.

References

  • [1] S. Baccianella, A. Esuli, and F. Sebastiani (2010) SentiWordNet 3.0: an enhanced lexical resource for sentiment analysis and opinion mining. In Proceedings of the Seventh International Conference on Language Resources and Evaluation, LREC ’10, Valletta, Malta. External Links: ISBN 2-9517408-6-7 Cited by: §3.2.
  • [2] L. Barbosa and J. Feng (2010) Robust sentiment detection on Twitter from biased and noisy data. In Proceedings of the 23rd International Conference on Computational Linguistics: Posters, COLING ’10, Beijing, China, pp. 36–44. External Links: Link Cited by: §1.
  • [3] A. Bifet, G. Holmes, B. Pfahringer, and R. Gavaldà (2011) Detecting sentiment change in Twitter streaming data. Journal of Machine Learning Research, Proceedings Track 17, pp. 5–11. Cited by: §1.
  • [4] T. Chen and M. Kan (2013) Creating a live, public short message service corpus: the NUS SMS corpus. Language Resources and Evaluation 47 (2), pp. 299–335 (English). External Links: ISSN 1574-020X, Document, Link Cited by: §3.1.
  • [5] D. Davidov, O. Tsur, and A. Rappoport (2010) Semi-supervised recognition of sarcasm in Twitter and Amazon. In Proceedings of the Fourteenth Conference on Computational Natural Language Learning, CoNLL ’10, Uppsala, Sweden, pp. 107–116. External Links: Link Cited by: §1.
  • [6] K. Gimpel, N. Schneider, B. O’Connor, D. Das, D. Mills, J. Eisenstein, M. Heilman, D. Yogatama, J. Flanigan, and N. A. Smith (2011) Part-of-speech tagging for Twitter: annotation, features, and experiments. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, ACL-HLT ’11, Portland, Oregon, USA, pp. 42–47. External Links: Link Cited by: §6.
  • [7] R. González-Ibáñez, S. Muresan, and N. Wacholder (2011) Identifying sarcasm in Twitter: a closer look. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Short Papers, ACL-HLT ’11, Portland, Oregon, USA, pp. 581–586. Cited by: §1.
  • [8] B. Jansen, M. Zhang, K. Sobel, and A. Chowdury (2009) Twitter power: tweets as electronic word of mouth. J. Am. Soc. Inf. Sci. Technol. 60 (11), pp. 2169–2188. External Links: ISSN 1532-2882, Link, Document Cited by: §1.
  • [9] E. Kouloumpis, T. Wilson, and J. Moore (2011) Twitter sentiment analysis: the good the bad and the OMG!. In Proceedings of the Fifth International Conference on Weblogs and Social Media, ICWSM ’11, Barcelona, Catalonia, Spain. Cited by: §1, §1.
  • [10] C. Liebrecht, F. Kunneman, and A. Van den Bosch (2013) The perfect solution for detecting sarcasm in tweets #not. In Proceedings of the 4th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, Atlanta, Georgia, USA, pp. 29–37. External Links: Link Cited by: §1.
  • [11] S. Mohammad, S. Kiritchenko, and X. Zhu (2013) NRC-Canada: building the state-of-the-art in sentiment analysis of tweets. In Proceedings of the Seventh international workshop on Semantic Evaluation Exercises, SemEval-2013, Atlanta, Georgia, USA, pp. 321–327. Cited by: §6.
  • [12] P. Nakov, S. Rosenthal, Z. Kozareva, V. Stoyanov, A. Ritter, and T. Wilson (2013) SemEval-2013 task 2: sentiment analysis in Twitter. In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation, SemEval ’13, Atlanta, Georgia, USA, pp. 312–320. External Links: Link Cited by: §1.
  • [13] B. O’Connor, R. Balasubramanyan, B. Routledge, and N. Smith (2010) From tweets to polls: linking text sentiment to public opinion time series. In Proceedings of the Fourth International Conference on Weblogs and Social Media, ICWSM ’10, Washington, DC, USA. Cited by: §1.
  • [14] A. Pak and P. Paroubek (2010) Twitter based system: using Twitter for disambiguating sentiment ambiguous adjectives. In Proceedings of the 5th International Workshop on Semantic Evaluation, SemEval ’10, Uppsala, Sweden, pp. 436–439. External Links: Link Cited by: §1.
  • [15] B. Pang, L. Lee, and S. Vaithyanathan (2002) Thumbs up?: sentiment classification using machine learning techniques. In Proceedings of the Conference on Empirical Methods in Natural Language Processing - Volume 10, EMNLP ’02, pp. 79–86. External Links: Link, Document Cited by: §1.
  • [16] M. Pontiki, H. Papageorgiou, D. Galanis, I. Androutsopoulos, J. Pavlopoulos, and S. Manandhar (2014) SemEval-2014 task 4: aspect based sentiment analysis. In Proceedings of the 8th International Workshop on Semantic Evaluation, SemEval ’14, Dublin, Ireland. Cited by: §1.
  • [17] A. Ritter, S. Clark, Mausam, and O. Etzioni (2011) Named entity recognition in tweets: an experimental study. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP ’11, Edinburgh, Scotland, UK, pp. 1524–1534. External Links: Link Cited by: §6.
  • [18] A. Tumasjan, T. Sprenger, P. Sandner, and I. Welpe (2010) Predicting elections with Twitter: what 140 characters reveal about political sentiment. In Proceedings of the Fourth International Conference on Weblogs and Social Media, ICWSM ’10, Washington, DC, USA. Cited by: §1.
  • [19] J. Wiebe, T. Wilson, and C. Cardie (2005) Annotating expressions of opinions and emotions in language. Language Resources and Evaluation 39 (2-3), pp. 165–210. Cited by: §1.