In recent years, sentiment analysis in social media has attracted a lot of research interest and has been used for a number of applications. Unfortunately, research has been hindered by the lack of suitable datasets, complicating the comparison between approaches. To address this issue, we have proposed SemEval-2013 Task 2: Sentiment Analysis in Twitter, which included two subtasks: A, an expression-level subtask, and B, a message-level subtask. We used crowdsourcing on Amazon Mechanical Turk to label a large Twitter training dataset along with additional test sets of Twitter and SMS messages for both subtasks. All datasets used in the evaluation are released to the research community. The task attracted significant interest and a total of 149 submissions from 44 teams. The best-performing team achieved an F1 of 88.9% and 69% for subtasks A and B, respectively.
In the past decade, new forms of communication, such as microblogging and text messaging have emerged and become ubiquitous. Twitter messages (tweets) and cell phone messages (SMS) are often used to share opinions and sentiments about the surrounding world, and the availability of social content generated on sites such as Twitter creates new opportunities to automatically study public opinion.
Working with these informal text genres presents new challenges for natural language processing beyond those encountered when working with more traditional text genres such as newswire.
Tweets and SMS messages are short in length: a sentence or a headline rather than a document. The language they use is very informal, with creative spelling and punctuation, misspellings, slang, new words, URLs, and genre-specific terminology and abbreviations, e.g., RT for re-tweet and #hashtags (hashtags are a type of tagging for Twitter messages). How to handle such challenges so as to automatically mine and understand the opinions and sentiments that people are communicating has only very recently been the subject of research [6, 2, 3, 5, 8, 9, 12, 7].
Another aspect of social media data, such as Twitter messages, is that they include rich structured information about the individuals involved in the communication. For example, Twitter maintains information about who follows whom. Re-tweets (re-shares of a tweet) and tags inside of tweets provide discourse information. Modeling such structured information is important because it provides means for empirically studying social interactions where opinion is conveyed, e.g., we can study the properties of persuasive language or those associated with influential users.
|Twitter||RT @tash_jade: That’s really sad, Charlie RT “Until tonight I never realised how fucked up I was” - Charlie Sheen #sheenroast|
|SMS||Glad to hear you are coping fine in uni… So, wat interview did you go to? How did it go?|
Several corpora with detailed opinion and sentiment annotation have been made freely available, e.g., the MPQA corpus of newswire text. These corpora have proved very valuable as resources for learning about the language of sentiment in general, but they did not focus on social media.
While some Twitter sentiment datasets have already been created, they were either small and proprietary, such as the i-sieve corpus, or they were created only for Spanish, like the TASS corpus (http://www.daedalus.es/TASS/corpus.php), or they relied on noisy labels obtained from emoticons and hashtags. They further focused on message-level sentiment, and no Twitter or SMS corpus with expression-level sentiment annotations has been made available so far.
Thus, the primary goal of our SemEval-2013 task 2 has been to promote research that will lead to a better understanding of how sentiment is conveyed in Tweets and SMS messages. Toward that goal, we created the SemEval Tweet corpus, which contains Tweets (for both training and testing) and SMS messages (for testing only) with sentiment expressions annotated with contextual phrase-level polarity as well as an overall message-level polarity. We used this corpus as a testbed for the system evaluation at SemEval-2013 Task 2.
In the remainder of this paper, we first describe the task, the dataset creation process, and the evaluation methodology. We then summarize the characteristics of the approaches taken by the participating systems and we discuss their scores.
We had two subtasks: an expression-level subtask and a message-level subtask. Participants could choose to participate in either or both subtasks. Below we provide short descriptions of the objectives of these two subtasks.
Subtask A (expression-level): Given a message containing a marked instance of a word or a phrase, determine whether that instance is positive, negative or neutral in that context. The boundaries for the marked instance were provided: this was a classification task, not an entity recognition task.
Subtask B (message-level): Given a message, decide whether it is of positive, negative, or neutral sentiment. For messages conveying both a positive and a negative sentiment, whichever is the stronger one was to be chosen.
Each participating team was allowed to submit results for two different systems per subtask: one constrained, and one unconstrained. A constrained system could only use the provided data for training, but it could also use other resources such as lexicons obtained elsewhere. An unconstrained system could use any additional data as part of the training process; this could be done in a supervised, semi-supervised, or unsupervised fashion.
Note that constrained/unconstrained refers to the data used to train a classifier. For example, if other data (excluding the test data) was used to develop a sentiment lexicon, and the lexicon was used to generate features, the system would still be constrained. However, if other data (excluding the test data) was used to develop a sentiment lexicon, and this lexicon was used to automatically label additional Tweet/SMS messages and then used with the original data to train the classifier, then such a system would be unconstrained.
Table 2: Statistics for subtask A.
|Corpus|Avg. # of words|Avg. # of characters|Positive phrases|Negative phrases|Neutral phrases|Vocabulary size|
|---|---|---|---|---|---|---|
|Twitter - Training|25.4|120.0|5,895|3,131|471|20,012|
|Twitter - Dev|25.5|120.0|648|430|57|4,426|
|Twitter - Test|25.4|121.2|2,734|1,541|160|11,736|
|SMS - Test|24.5|95.6|1,071|1,104|159|3,562|
In the following sections we describe the collection and annotation of the Twitter and SMS datasets.
Twitter is the most common micro-blogging site on the Web, and we used it to gather tweets that express sentiment about popular topics. We first extracted named entities using a Twitter-tuned NER system from millions of tweets, which we collected over a one-year period spanning from January 2012 to January 2013; we used the public streaming Twitter API to download tweets.
We then identified popular topics as those named entities that are frequently mentioned in association with a specific date. Given this set of automatically identified topics, we gathered tweets from the same time period which mentioned the named entities. The testing messages had different topics from training and spanned later periods.
To identify messages that express sentiment towards these topics, we filtered the tweets using SentiWordNet. We removed messages that contained no sentiment-bearing words, keeping only those with at least one word with a positive or negative sentiment score greater than 0.3 in SentiWordNet for at least one sense of the word. Without filtering, we found class imbalance to be too high. (Filtering based on an existing lexicon does bias the dataset to some degree; however, note that the text still contains sentiment expressions outside those in the lexicon.)
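The lexicon filter described above can be sketched as follows. This is only an illustration under stated assumptions, not the organizers' code: `TOY_LEXICON` is a hypothetical stand-in for SentiWordNet (a word mapped to per-sense positive and negative scores), and the whitespace tokenizer and function names are invented for the example.

```python
from typing import Dict, List, Tuple

# Hypothetical stand-in for SentiWordNet:
# word -> list of (positive_score, negative_score), one pair per sense.
TOY_LEXICON: Dict[str, List[Tuple[float, float]]] = {
    "love": [(0.625, 0.0), (0.25, 0.0)],
    "sad": [(0.0, 0.75)],
    "plan": [(0.0, 0.0)],
    "fine": [(0.25, 0.0), (0.125, 0.125)],
}

THRESHOLD = 0.3  # score cut-off stated in the text


def has_sentiment_word(message: str,
                       lexicon: Dict[str, List[Tuple[float, float]]] = TOY_LEXICON,
                       threshold: float = THRESHOLD) -> bool:
    """Return True if some token has a positive or negative score strictly
    greater than `threshold` for at least one of its senses."""
    for token in message.lower().split():
        word = token.strip(".,!?:;\"'()")  # crude punctuation stripping (assumption)
        for pos, neg in lexicon.get(word, []):
            if pos > threshold or neg > threshold:
                return True
    return False


def filter_messages(messages: List[str]) -> List[str]:
    """Keep only messages that contain at least one sentiment-bearing word."""
    return [m for m in messages if has_sentiment_word(m)]
```

Because the test is per sense, a word such as "love" passes as soon as one of its senses exceeds the threshold, even if its other senses do not.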
Table 3: Statistics for subtask B.
|Corpus|Positive|Negative|Objective/Neutral|
|---|---|---|---|
|Twitter - Training|3,662|1,466|4,600|
|Twitter - Dev|575|340|739|
|Twitter - Test|1,573|601|1,640|
|SMS - Test|492|394|1,208|
Twitter messages are rich in social media features, including out-of-vocabulary (OOV) words, emoticons, and acronyms; see Table 1. A large portion of the OOV words are hashtags (e.g., #sheenroast) and mentions (e.g., @tash_jade).
We annotated the same Twitter messages with annotations for subtask A and subtask B. However, the final training and testing datasets overlap only partially between the two subtasks since we had to throw away messages with low inter-annotator agreement, and this differed between the subtasks. For testing, we also annotated SMS messages, taken from the NUS SMS corpus (http://wing.comp.nus.edu.sg/SMSCorpus/). Tables 2 and 3 show statistics about the corpora we created for subtasks A and B.
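Table 4: Annotation bounds (%) for subtasks A and B.
|Corpus|Subtask A: lower|Subtask A: average|Subtask A: upper|Subtask B: average|
|---|---|---|---|---|
|Twitter - Train|64.7|82.4|90.8|82.7|
|Twitter - Dev|51.2|74.7|87.8|78.4|
|Twitter - Test|68.8|83.6|90.9|76.9|
|SMS - Test|66.5|88.5|81.2|77.6|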
|Authorities are only too aware that Kashgar is 4,000 kilometres (2,500 miles) from Beijing but only a tenth of the distance from the Pakistani border, and are desperate to ensure instability or militancy does not leak over the frontiers.|
|Taiwan-made products stood a good chance of becoming even more competitive thanks to wider access to overseas markets and lower costs for material imports, he said.|
|”March appears to be a more reasonable estimate while earlier admission cannot be entirely ruled out,” according to Chen, also Taiwan’s chief WTO negotiator.|
|friday evening plans were great, but saturday’s plans didnt go as expected – i went dancing & it was an ok club, but terribly crowded :-(|
|WHY THE HELL DO YOU GUYS ALL HAVE MRS. KENNEDY! SHES A FUCKING DOUCHE|
|AT&T was okay but whenever they do something nice in the name of customer service it seems like a favor, while T-Mobile makes that a normal everyday thin|
|obama should be impeached on TREASON charges. Our Nuclear arsenal was TOP Secret. Till HE told our enemies what we had. #Coward #Traitor|
|My graduation speech: ”I’d like to thanks Google, Wikipedia and my computer! :D #iThingteens|
In addition, we filtered spammers by considering the following kinds of annotations invalid:
containing overlapping subjective phrases;
subjective but without a subjective phrase;
marking every single word as subjective;
not having the overall sentiment marked.
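The four validity checks above can be sketched as a single predicate. The `Annotation` schema, its field names, and the reading of "subjective" as an overall positive or negative label are assumptions made for illustration; the source does not specify the data format.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Annotation:
    """One worker's annotation of one message (hypothetical schema)."""
    num_tokens: int
    # Each phrase is (start, end, polarity) with inclusive token offsets.
    phrases: List[Tuple[int, int, str]] = field(default_factory=list)
    # Overall message label, or None if the worker left it unmarked.
    overall: Optional[str] = None


def is_valid(a: Annotation) -> bool:
    """Return False for any of the four kinds of invalid annotation."""
    spans = sorted((start, end) for start, end, _ in a.phrases)

    # 1. Overlapping subjective phrases.
    for (_, end1), (start2, _) in zip(spans, spans[1:]):
        if start2 <= end1:
            return False

    # 2. Overall sentiment not marked.
    if a.overall is None:
        return False

    # 3. Subjective overall but no subjective phrase marked
    #    (assumption: "subjective" means overall positive or negative).
    if a.overall in ("positive", "negative") and not a.phrases:
        return False

    # 4. Every single word marked as subjective.
    covered = {i for start, end in spans for i in range(start, end + 1)}
    if a.num_tokens > 0 and len(covered) == a.num_tokens:
        return False

    return True
```

For example, `is_valid(Annotation(5, [(0, 1, "positive")], "positive"))` is accepted, while the same annotation with `overall=None` is rejected.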
Our datasets were annotated for sentiment on Mechanical Turk. Each sentence was annotated by five Mechanical Turk workers (Turkers). In order to qualify for the hits, the Turker had to have an approval rate greater than 95% and have completed 50 approved hits. Each Turker was paid three cents per hit. The Turker had to mark all the subjective words/phrases in the sentence by indicating their start and end positions and say whether each subjective word/phrase was positive, negative, or neutral (subtask A). They also had to indicate the overall polarity of the sentence (subtask B).
For subtask A, we combined the annotations of each of the workers using intersection as indicated in the last row of Table 6. A word had to appear in 2/3 of the annotations in order to be considered subjective. Similarly, a word had to be labeled with a particular polarity (positive, negative, or neutral) 2/3 of the time in order to receive that label.
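A minimal sketch of this 2/3 intersection rule is given below. It assumes each worker's annotation is a dictionary from token index to polarity (hypothetical format), and it reads "2/3 of the time" as 2/3 of the workers who marked that word; the source does not settle this detail.

```python
from collections import Counter
from fractions import Fraction
from typing import Dict, List

THRESHOLD = Fraction(2, 3)  # agreement level stated in the text


def combine_by_intersection(annotations: List[Dict[int, str]]) -> Dict[int, str]:
    """Combine per-worker word labels into one gold annotation.

    A token is kept only if at least 2/3 of the workers marked it as
    subjective AND at least 2/3 of those workers agree on its polarity.
    """
    n = len(annotations)
    combined: Dict[int, str] = {}
    if n == 0:
        return combined

    all_tokens = set().union(*annotations)  # union of the dictionaries' keys
    for tok in sorted(all_tokens):
        labels = [a[tok] for a in annotations if tok in a]
        if Fraction(len(labels), n) < THRESHOLD:
            continue  # not enough workers marked this word
        label, count = Counter(labels).most_common(1)[0]
        if Fraction(count, len(labels)) >= THRESHOLD:
            combined[tok] = label
    return combined
```

Exact fractions are used instead of floats so that a 2-out-of-3 agreement is not lost to rounding.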
We also experimented with combining annotations by computing the union of the sentences, and taking the sentence of the worker who annotated the most hits, but we found that these methods were not as accurate. Table 4 shows the lower, average, and upper bounds for all the hits by computing the bounds for each hit and averaging them together. This gives a good indication about how well we can expect the systems to perform. For example, even if we used the best annotator each time, it would still not be possible to get perfect accuracy.
For subtask B, the polarity of the entire sentence was determined based on the majority of the labels. If there was a tie, the sentence was discarded. In order to reduce the number of sentences lost, we combined the objective and the neutral labels, which Turkers tended to mix up. Table 4 shows the average bound for subtask B by computing the bounds for each hit and averaging them together. Since the polarity is chosen based on the majority, the upper bound is 100%.
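The message-level aggregation can be sketched as a small voting function. The label strings and the function name are assumptions; the logic follows the description above: objective and neutral are merged, the most frequent label wins, and ties cause the message to be discarded (signalled here by `None`).

```python
from collections import Counter
from typing import List, Optional


def message_polarity(labels: List[str]) -> Optional[str]:
    """Return the winning label for a message, or None if the vote is tied."""
    # Merge "objective" into "neutral", since workers tended to mix them up.
    merged = ["neutral" if label == "objective" else label for label in labels]
    ranked = Counter(merged).most_common()
    if not ranked:
        return None
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # tie between the top labels: discard the message
    return ranked[0][0]
```

With five workers and three classes, a unique most frequent label always has at least three votes, so this coincides with a strict majority.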
|Worker 1||I would love to watch Vampire Diaries :) and some Heroes! Great combination||9/13|
|Worker 2||I would love to watch Vampire Diaries :) and some Heroes! Great combination||11/13|
|Worker 3||I would love to watch Vampire Diaries :) and some Heroes! Great combination||10/13|
|Worker 4||I would love to watch Vampire Diaries :) and some Heroes! Great combination||13/13|
|Worker 5||I would love to watch Vampire Diaries :) and some Heroes! Great combination||11/13|
|Intersection||I would love to watch Vampire Diaries :) and some Heroes! Great combination|
For both subtasks, the participating systems were required to perform a three-way classification – a particular marked phrase (for subtask A) or an entire message (for subtask B) was to be classified as positive, negative, or objective. For each system, we computed a score for predicting positive/negative phrases/messages vs. the other two classes.
For instance, to compute positive precision, $P_{pos}$, we find the number of phrases/messages that a system correctly predicted to be positive, and we divide that number by the total number of phrases/messages it predicted to be positive. To compute recall for the positive class, $R_{pos}$, we find the number of phrases/messages correctly predicted to be positive and we divide that number by the total number of positive phrases/messages in the gold standard.
The overall score for each system run is then given by the average of the F1-scores for the positive and negative classes: $F = (F_{pos} + F_{neg})/2$, where $F_{pos} = 2 P_{pos} R_{pos} / (P_{pos} + R_{pos})$ and $F_{neg}$ is computed analogously for the negative class.
Note that ignoring $F_{neutral}$ does not reduce the task to predicting positive vs. negative labels only (even though some participants have chosen to do so) since the gold standard still contains neutral labels which are to be predicted: $F_{pos}$ and $F_{neg}$ would suffer if these examples are labeled as positive and/or negative instead of neutral.
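The scoring scheme described above can be sketched in a few lines. This is not the official scorer: the function names are hypothetical, and returning 0 when a precision or recall denominator is empty is an assumed convention.

```python
from typing import List


def f1_for_class(gold: List[str], pred: List[str], label: str) -> float:
    """F1 of one class against all other classes."""
    true_pos = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
    n_predicted = sum(1 for p in pred if p == label)
    n_gold = sum(1 for g in gold if g == label)
    precision = true_pos / n_predicted if n_predicted else 0.0  # assumed convention
    recall = true_pos / n_gold if n_gold else 0.0                # assumed convention
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def task_score(gold: List[str], pred: List[str]) -> float:
    """Average of the positive-class and negative-class F1 scores."""
    return (f1_for_class(gold, pred, "positive")
            + f1_for_class(gold, pred, "negative")) / 2
```

In the small example `gold = ["positive", "negative", "neutral", "positive"]` and `pred = ["positive", "negative", "positive", "neutral"]`, the gold-neutral item predicted as positive lowers positive precision to 0.5, which illustrates why neutral items still matter even though no neutral F1 enters the average.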
We provided participants with a scorer. In addition to outputting the overall F-score, it produced a confusion matrix for the three prediction classes (positive, negative, and objective), and it also validated the data submission format.
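As a rough illustration of the two extra functions of such a scorer, the sketch below builds a three-class confusion matrix and rejects malformed input. The label strings, the nested-dictionary layout, and the validation rules are assumptions; the actual scorer's submission format is not described here.

```python
from collections import Counter
from typing import Dict, List

LABELS = ("positive", "negative", "objective")


def confusion_matrix(gold: List[str], pred: List[str]) -> Dict[str, Dict[str, int]]:
    """Return matrix[gold_label][predicted_label] -> count.

    Raises ValueError if the two lists differ in length or contain a
    label outside LABELS (a simple stand-in for format validation).
    """
    if len(gold) != len(pred):
        raise ValueError("gold and predicted label lists differ in length")
    for label in list(gold) + list(pred):
        if label not in LABELS:
            raise ValueError(f"unknown label: {label!r}")
    counts = Counter(zip(gold, pred))
    return {g: {p: counts[(g, p)] for p in LABELS} for g in LABELS}
```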
The results for subtask A are shown in Tables 7 and 8 for Twitter and for SMS messages, respectively; those for subtask B are shown in Table 9 for Twitter and in Table 10 for SMS messages. Systems are ranked by their scores for the constrained runs; the ranking based on scores for unconstrained runs is shown as a subindex.
For both subtasks, there were teams that only submitted results for the Twitter test set. Some teams submitted both a constrained and an unconstrained version (e.g., AVAYA and teragram). As one would expect, the results on the Twitter test set tended to be better than those on the SMS test set since the SMS data was out-of-domain with respect to the training (Twitter) data.
Moreover, the results for subtask A were significantly better than those for subtask B, which shows that it is a much easier task, probably because there is less ambiguity at the phrase-level.
Table 7 shows that subtask A, Twitter, attracted 23 teams, who submitted 21 constrained and 7 unconstrained systems. Five teams submitted both a constrained and an unconstrained system, and two other teams submitted constrained systems that are on the boundary between being constrained and unconstrained.
One system was semi-supervised, and the rest were supervised. The supervised systems used classifiers such as SVM (8 systems), Naive Bayes (7 systems), and Maximum Entropy (3 systems). Other approaches used include an ensemble of classifiers, manual rules, and a linear classifier. Two of the systems chose not to predict neutral as a possible classification label.
The average F1-measure on the Twitter test set was 74.1% for constrained systems and 60.5% for unconstrained ones; this does not mean that using additional data does not help, it just shows that the best teams only participated with a constrained system. NRC-Canada had the best constrained system with an F1-measure of 88.9%, and AVAYA had the best unconstrained one with F1=87.4%.
Table 8 shows the results for the SMS test set, where 20 teams submitted 19 constrained and 7 unconstrained systems (again, this included two teams that submitted boundary systems, marked accordingly). The average F-measure on this test set was 70.8% for constrained systems and 65.7% for unconstrained systems. The best constrained system was that of GU-MLT-LT with an F-measure of 88.4%, and AVAYA had the best unconstrained system with an F1 of 85.8%.
Table 9 shows that subtask B, Twitter, attracted 38 teams, who submitted 36 constrained and 15 unconstrained systems (and two boundary ones).
The average F1-measure was 53.7% for the constrained and 54.6% for the unconstrained systems.
These averages are much lower than those for subtask A, which indicates that subtask B is harder, probably because a message can contain parts expressing both positive and negative sentiment.
Once again, NRC-Canada had the best constrained system with an F1-measure of 69%, followed by teragram, which had the best unconstrained system with an F1-measure of 64.9%.
As Table 10 shows, the average F1-measure on the SMS test set was 50.2% for constrained and 50.3% for unconstrained systems. NRC-Canada had the best constrained system with an F1=68.5%, and AVAYA had the best unconstrained one with F1-measure of 59.5%.
Overall, the results achieved by the best teams were very strong, especially for the simpler subtask A:
F1=88.93, NRC-Canada on subtask A, Twitter;
F1=88.37, GU-MLT-LT on subtask A, SMS;
F1=69.02, NRC-Canada on subtask B, Twitter;
F1=68.46, NRC-Canada on subtask B, SMS.
We can see that the strongest team overall was that of NRC-Canada, which was ranked first on three of the four conditions; and it was second on subtask A, SMS. There were two other teams that were strong across both tasks and on both test sets: GU-MLT-LT and AVAYA. Three other teams, namely teragram, BOUNCE and KLUE, were ranked in the top-3 in at least one subtask and test set.
We have seen that most participants restricted themselves to the provided data and submitted constrained systems. Indeed, the best systems for each of the two subtasks and for each of the two testing datasets were constrained systems; of course, this does not mean that additional data would not be useful. Curiously, in some cases where a team submitted a constrained and unconstrained run, the unconstrained run actually performed worse.
Not surprisingly, most systems were supervised; there were only five semi-supervised systems, and there was only one unsupervised system. One additional team declared their system as unsupervised since it was not making use of the training data; we still classified it as supervised though since it did use supervision – in the form of manual rules.
Most participants predicted all three labels (positive, negative and neutral), even though some participants opted for not predicting neutral, which made some sense since the final F1-score was averaged over the positive and the negative predictions only.
The most popular classifiers included SVM, MaxEnt, linear classifier, Naive Bayes; in some cases, manual rules or ensembles of classifiers were used.
A variety of features were used, including word-related (e.g., words, stems, n-grams, word clusters), word-shape (e.g., punctuation, capitalization), syntactic (e.g., POS tags, dependency relations), Twitter-specific (e.g., repeated characters, emoticons, URLs, hashtags, slang, abbreviations), and sentiment-related (e.g., negation); one team also used discourse relations. Almost all participants relied heavily on various sentiment lexicons, the most popular ones being MPQA and SentiWordNet, as well as AFINN and Bing Liu’s Opinion Lexicon; some participants used their own lexicons, either preexisting or built from the provided data.
Given that Twitter messages are noisy, most participants did some preprocessing, including tokenization, stemming, lemmatization, stopword removal, normalization/removal of URLs, hashtags, users, slang, emoticons, repeated vowels, punctuation; some even did pronoun resolution.
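A few of these preprocessing steps can be sketched with regular expressions. This is a generic illustration, not any participant's pipeline: the placeholder tokens `<url>` and `<user>`, the whitespace tokenizer, and the choice to squeeze any character repeated three or more times (rather than vowels only) are assumptions.

```python
import re
from typing import List

URL_RE = re.compile(r"https?://\S+|www\.\S+")   # URLs
USER_RE = re.compile(r"@\w+")                   # user mentions
REPEAT_RE = re.compile(r"(.)\1{2,}")            # same character 3+ times in a row


def normalize_tweet(text: str) -> List[str]:
    """Normalize URLs, mentions and elongated words, then tokenize."""
    text = URL_RE.sub("<url>", text)       # replace URLs with a placeholder
    text = USER_RE.sub("<user>", text)     # replace mentions with a placeholder
    text = REPEAT_RE.sub(r"\1\1", text)    # "sooooo" -> "soo"
    return text.lower().split()            # lowercase, whitespace tokenization
```

Squeezing to two characters instead of one keeps a trace of the elongation, which can itself be a sentiment cue.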
We have described a new task that entered SemEval-2013: Task 2 on Sentiment Analysis in Twitter. The task has attracted a very high number of participants: 149 submissions from 44 teams.
We believe that the datasets that we have created as part of the task, and which we have released to the community (http://www.cs.york.ac.uk/semeval-2013/task2/) under a Creative Commons Attribution 3.0 Unported License (http://creativecommons.org/licenses/by/3.0/), will be found useful by researchers beyond SemEval.
The authors would like to thank Kathleen McKeown for her insight in creating the Amazon Mechanical Turk annotation task.
Funding for the Amazon Mechanical Turk annotations was provided by the JHU Human Language Technology Center of Excellence and by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), through the U.S. Army Research Lab. All statements of fact, opinion or conclusions contained herein are those of the authors and should not be construed as representing the official views or policies of IARPA, the ODNI or the U.S. Government.
Journal of Machine Learning Research - Proceedings Track 17, pp. 5–11.