DeepAI
Log In Sign Up

YASO: A New Benchmark for Targeted Sentiment Analysis

12/29/2020
by Matan Orbach, et al. (IBM)

Sentiment analysis research has shifted over the years from the analysis of full documents or single sentences to a finer level of detail – identifying the sentiment towards single words or phrases – with the task of Targeted Sentiment Analysis (TSA). While this problem is attracting a plethora of works focusing on algorithmic aspects, they are typically evaluated on a selection from a handful of datasets, and little effort, if any, is dedicated to the expansion of the available evaluation data. In this work, we present YASO – a new crowd-sourced TSA evaluation dataset, collected using a new annotation scheme for labeling targets and their sentiments. The dataset contains 2,215 English sentences from movie, business and product reviews, and 7,415 terms and their corresponding sentiments annotated within these sentences. Our analysis verifies the reliability of our annotations, and explores the characteristics of the collected data. Lastly, benchmark results using five contemporary TSA systems lay the foundation for future work, and show there is ample room for improvement on this challenging new dataset.

1 Introduction

Understanding sentiment has been a major research area in NLP for decades, from the classification of review documents as expressing a positive or negative sentiment towards their topic (e.g. in Pang et al. (2002)), through a similar analysis of single sentences, and, in recent years, going deeper into identifying the sentiment expressed towards single words or phrases. For example, the sentence "it’s a useful dataset with a complex download procedure" has a positive sentiment expressed towards dataset, and a negative one conveyed towards download procedure. These are commonly referred to as the "targets" of the sentiment, and the task of Targeted Sentiment Analysis (TSA) is aimed at finding such targets and their corresponding sentiments in texts. TSA can be divided into two subtasks: target extraction (TE), focused on identifying all sentiment targets in a given text; and sentiment classification (SC), focused on determining the sentiment expressed towards a specific candidate target in a given text. The TSA outputs may then be used by downstream applications or in various kinds of quantitative or qualitative analyses.
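
To make the task output concrete, the following minimal sketch (ours, for illustration only, not a format defined by YASO or any other dataset) encodes the targets and sentiments of the example sentence as simple records:

```python
# Illustrative only: a possible in-memory representation of TSA output,
# where each target span carries its own sentiment label.
from dataclasses import dataclass
from typing import List

@dataclass
class Target:
    text: str        # the target phrase as it appears in the sentence
    start: int       # character offset where the span starts
    end: int         # character offset where the span ends (exclusive)
    sentiment: str   # "positive", "negative" or "mixed"

sentence = "it's a useful dataset with a complex download procedure"

def make_target(phrase: str, sentiment: str) -> Target:
    start = sentence.find(phrase)
    return Target(phrase, start, start + len(phrase), sentiment)

# Target extraction (TE) recovers the spans; sentiment classification (SC)
# assigns each span a label; the full TSA task does both.
targets: List[Target] = [
    make_target("dataset", "positive"),
    make_target("download procedure", "negative"),
]

for t in targets:
    print(f"{t.text!r} [{t.start}:{t.end}] -> {t.sentiment}")
```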

Recent TSA works evaluate performance on a selection from a handful of datasets. The most widely used ones are the datasets accompanying the SemEval (SE) shared tasks on aspect-based sentiment analysis (ABSA) by Pontiki et al. (2014, 2015, 2016) (henceforth SE’14, SE’15 and SE’16 respectively). Others include opinions on city neighborhoods (Saeidi et al., 2016) and tweets (Dong et al., 2014). Alas, no new datasets have emerged over the last few years, presumably due to the difficulty in collecting high quality TSA labeled data at scale.

(a) Target candidates annotation
(b) Sentiment annotation
Figure 1: An example of the UI of our two-phase annotation scheme: Target candidate annotation (top) allows multiple target candidates to be marked in one sentence. Next, all marked candidates are passed through sentiment annotation (bottom), which separately collects the sentiment of every candidate.

Broadly, while a common evaluation standard allows for an easy comparison between different methods, it comes at the risk of overfitting the available test sets when they are extensively reused over time, and is sensitive to issues which may be hidden within a specific dataset or split. The smaller the set of available evaluation datasets is, the higher are those risks. Beyond that, as ongoing work improves results on existing evaluation sets, the headroom for improvements decreases, and new challenging datasets are needed. Since the set of evaluation datasets available for TSA research is small, expanding it can help mitigate the aforementioned risks and make further headroom for new algorithmic advancements. We therefore set that as our goal: to create a new TSA evaluation benchmark.

To pursue this endeavour, we begin by presenting a new two-phased annotation scheme for collecting TSA labeled data. First, annotators are shown single sentences, and are asked to mark target candidates – terms that have a sentiment conveyed towards them – within each sentence. Next, each target candidate is shown to several annotators (in the context of its containing sentence) who are asked to determine its sentiment (if any). This scheme is exemplified in Figure 1.

The two phases are complementary – the first is recall oriented, without any strict quality control. The second guarantees precision by integrating measures which ensure that sentiment labels assigned in this phase are correct. Each phase on its own is simple, thus allowing the use of crowd workers who can swiftly annotate large corpora with no special training – an advantage over the complex or pages-long guidelines used for the collection of the SE datasets (see, for example, the guidelines of SE'14 and SE'15), which required annotation by linguists or students.

Our scheme is applied to the annotation of movie, product, and business reviews from several existing datasets (collected for other tasks). Several analyses are performed on the collected data: (i) its reliability is established through a manual analysis of a sample; (ii) the collected annotations are compared with existing labeled data, when available; (iii) differences from existing datasets are characterized; and (iv) an alternative to our suggested scheme is explored. All collected data are released through our website: www.research.ibm.com/haifa/dept/vst/debating_data.shtml#Targeted Sentiment Analysis

To establish benchmark performance on the new evaluation dataset, five state-of-the-art (SOTA) TSA systems were reproduced, using their available codebases. In our view, the primary use of the new datasets is evaluation, so our focus is on a cross-domain setup – these baselines were trained on data from Pontiki et al. (2014), and evaluated on our new data.

In summary, our main contributions are (i) a new annotation scheme for collecting TSA labeled data; (ii) a new dataset with targets and sentiment annotations of sentences from several sources, collected using this new scheme; (iii) a detailed analysis of the produced annotations, validating their reliability; and reporting (iv) cross-domain benchmark results on the new dataset with several SOTA baseline systems.

2 Related work

As mentioned above, the TSA task is composed of two subtasks, TE and SC. Pipelined TSA systems run two subtask-specific models one after the other (e.g. Karimi et al. (2020)), while joint, or end-to-end, systems have a single model for the whole task, which is typically regarded as an end-to-end sequence labeling problem (Li and Lu, 2019).

ABSA (Liu, 2012) is a variant of TSA, which focuses on targets that are entities (e.g. products), a part of the entity, or its attribute. Like TSA, it involves two main subtasks, aspect term extraction (e.g. Ma et al. (2019); Yu et al. (2019)), and aspect-level sentiment classification (e.g. Xu et al. (2020a)). It may also include additional subtasks – detecting aspect categories (e.g. in Tulkens and van Cranenburgh (2020)) and their corresponding sentiments. Aspect categories may not be explicitly mentioned in the text but rather implicitly related to a mentioned aspect. For example, the aspect category price is negative in the sentence "The restaurant is good but very expensive". Although ABSA results in a more elaborate output than TSA, which may be useful for some downstream applications, its labeling requires manually curated per domain aspect lists, as well as complex guidelines.

Another related task is opinion term extraction with the goal of detecting terms in texts that express a sentiment. These can be co-extracted with their corresponding targets (Wang et al., 2016, 2017), or as triplets of target term, sentiment and opinion term (Peng et al., 2020; Xu et al., 2020b). The labeling required for these tasks, namely, of opinion terms in sentences, is out-of-scope for this work.

The SE’14 dataset (Pontiki et al., 2014) is an ABSA dataset which contains sentences annotated with targets and their sentiments from two review domains – laptops and restaurants. In this dataset, only aspects were annotated as targets, and not any other terms that have a sentiment expressed towards them. For example, computer in the sentence "good computer and fast" is not annotated in SE’14 as a target although it has a sentiment expressed towards it.

The possible sentiment labels in SE’14 were positive, negative, mixed (named "conflict" in that work), and neutral, the latter indicating an aspect that has no sentiment expressed towards it. Mixed stands for both a positive and a negative sentiment, for example, the sentiment towards car in "a beautiful yet unreliable car". Later on, Pontiki et al. (2015) added a new domain (hotels), moved to annotating full reviews, and updated the annotation guidelines to allow the entities themselves to be labeled as targets. In Pontiki et al. (2016) annotations on non-English texts were added, as well as more English test data for the restaurants and laptops domains.

Other TSA datasets include the Twitter data of Dong et al. (2014), the Sentihood dataset (Saeidi et al., 2016) extracted from discussions on urban neighbourhoods, and BabyCare (Yang et al., 2018), a dataset in Chinese collected from baby care forums. Pei et al. (2019) survey their properties. Mitchell et al. (2013) use an annotation scheme similar to ours, where given targets are annotated for their sentiment by crowd-workers, yet the annotated targets are limited to automatically detected named entities.

Algorithmically, earlier works on TSA (Tang et al., 2016a, 2016b; Ruder et al., 2016; Ma et al., 2018; Huang et al., 2018; He et al., 2018) have utilized pre-transformer models, such as LSTMs (Schmidhuber and Hochreiter, 1997), with word embedding representations (see surveys by Schouten and Frasincar (2015); Zhang et al. (2018)). Recent works have shifted to using pre-trained language models (Sun et al., 2019; Song et al., 2019; Zeng et al., 2019; Phan and Ogunbona, 2020). Generalization to unseen domains has also been explored, with pre-training that includes domain-specific data (Xu et al., 2019; Rietzler et al., 2020), adds sentiment-related objectives (Tian et al., 2020), or combines instance-based domain adaptation (Gong et al., 2020).

3 Data

Two types of data were considered as annotation inputs. First is TSA datasets, containing sentences which have an existing annotation of targets and their sentiments. Annotating such sentences allows a reliability estimation of our approach, by comparing the newly collected data to the existing one. For this purpose, sentences were randomly sampled from the test set of each SE’14 domain (laptops or restaurants).

The second type of data, forming the bulk of the annotation input, is review datasets, which, to the best of our knowledge, have not been previously annotated for TSA. Sentences from several datasets were sampled and used as annotation input, as follows:

Yelp

A dataset of 8M user review documents discussing more than 200k businesses. The sample included review documents, each containing between to sentences, each with a length of between to tokens. The reviews were split into sentences.

Amazon

A multilingual dataset in languages with 210k reviews per language. The English test set was sampled in the same manner as Yelp, yielding sentences from full reviews.

Sst

The Stanford Sentiment Treebank (SST) contains 11,855 movie review sentences extracted from Rotten Tomatoes (Socher et al., 2013). sentences, with a minimum length of tokens, were randomly sampled from its test set.

Opinosis

A corpus of 7,086 user review sentences from Tripadvisor (hotels), Edmunds.com (cars) and Amazon (electronics) (Ganesan et al., 2010). Each sentence discusses a topic composed of a product name and an aspect of the product (e.g. "performance of Toyota Camry"). At least sentences were randomly sampled from each of the topics in the dataset, yielding sentences.

4 YASO: A New TSA Evaluation Dataset

Figure 2: The process for creating YASO, the new TSA evaluation dataset.

Next, we detail the process of creating the new TSA evaluation dataset, which we name YASO, for the first letters of the names of the input datasets we annotate. An input sentence was first passed through two phases of annotation, followed by several post-processing steps. Figure 2 depicts an overview of that process, as context to the details given below.

4.1 Annotation

Target candidates annotation

Each sentence sampled from the input datasets was tokenized (using spaCy by Honnibal and Montani (2017)) and shown to annotators who were asked to mark target candidates by selecting corresponding token sequences within the sentence. Then, they were instructed to select the sentiment expressed towards the candidate – positive, negative, or mixed.

Selecting multiple non-overlapping target candidates in one sentence was allowed, each with its own sentiment (see Figure 1a). To avoid clutter and maintain a reasonable number of detected candidates, selecting overlapping spans was prohibited.
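
As a rough sketch of how such selections can be represented, the snippet below tokenizes a sentence with spaCy (as stated above) and records a candidate as a token range together with its first-phase sentiment; the model name, helper functions and field names are illustrative, not taken from our actual implementation:

```python
# A sketch, assuming spaCy tokenization; requires an English pipeline such as
# en_core_web_sm to be installed. Field names are illustrative.
import spacy

nlp = spacy.load("en_core_web_sm")

def mark_candidate(sentence: str, first_token: int, last_token: int, sentiment: str) -> dict:
    """Record a target candidate as an inclusive token-index range."""
    doc = nlp(sentence)
    return {
        "tokens": [t.text for t in doc],
        "span": (first_token, last_token),
        "text": doc[first_token:last_token + 1].text,
        "sentiment": sentiment,  # "positive", "negative" or "mixed"
    }

def overlaps(a: tuple, b: tuple) -> bool:
    """True if two (first, last) token ranges overlap; such selections are rejected."""
    return not (a[1] < b[0] or b[1] < a[0])

candidate = mark_candidate("The food was good, but the atmosphere was awful.", 1, 1, "positive")
print(candidate["text"])  # food
```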

Sentiment annotation

To verify the correctness of the target candidates, each was highlighted within its containing sentence, and presented to annotators who were asked to determine its corresponding sentiment, independently of the sentiment selected in the first phase. For cases in which the annotator believed a candidate was wrongly identified and has no sentiment expressed towards it, an additional option of "none" was added to the three original labels (see Figure 1b).

To control for annotation quality, test questions with an a-priori known answer were integrated in-between the regular questions. A per-annotator accuracy was computed on these questions, and under-performers were excluded. Annotation was done in batches, with test questions initially taken from unanimously answered questions labeled by two of the authors, and later on from unanimous answers in completed batches.
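
The snippet below is a minimal sketch of this quality control, under assumed data structures (each crowd answer records its annotator, whether it was a hidden test question, and the known answer); the 0.75 accuracy cut-off is a placeholder, not the value used in our annotation:

```python
# A sketch of per-annotator accuracy on hidden test questions, used to
# exclude under-performing annotators. The cut-off below is a placeholder.
from collections import defaultdict

def filter_annotators(answers: list, min_accuracy: float = 0.75):
    correct = defaultdict(int)
    total = defaultdict(int)
    for a in answers:
        if a["is_test_question"]:
            total[a["annotator"]] += 1
            correct[a["annotator"]] += int(a["label"] == a["gold_label"])
    excluded = {ann for ann, n in total.items()
                if correct[ann] / n < min_accuracy}
    kept = [a for a in answers if a["annotator"] not in excluded]
    return kept, excluded
```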

All annotations were done by a group of crowd annotators who took part in past annotations done by our team, using the Appen platform. The guidelines for each phase are included in the Appendix.

4.2 Post-processing

The sentiment label of each candidate was determined by majority vote from the available crowd answers, and the percentage of annotators who chose that majority label was defined as the confidence in the annotation. A threshold defined on these confidence values (set to 0.7 based on an analysis detailed below) separated the annotations into high-confidence ones (with a confidence greater or equal to the threshold) and low-confidence ones (with a confidence strictly below the threshold).
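
A minimal sketch of this aggregation step is given below, using the 0.7 threshold mentioned above (the record format is illustrative):

```python
# Majority vote over the crowd answers of one target candidate; the fraction
# of annotators choosing the majority label is the annotation confidence.
from collections import Counter

def aggregate(labels: list, threshold: float = 0.7) -> dict:
    """labels: per-annotator labels, e.g. ["positive", "positive", "none", ...]"""
    majority_label, votes = Counter(labels).most_common(1)[0]
    confidence = votes / len(labels)
    return {
        "label": majority_label,
        "confidence": confidence,
        "high_confidence": confidence >= threshold,
    }

print(aggregate(["positive", "positive", "positive", "none", "positive"]))
# {'label': 'positive', 'confidence': 0.8, 'high_confidence': True}
```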

A target candidate is considered as valid when annotated with high-confidence as having a sentiment (i.e. its sentiment label is not "none"). The valid targets were clustered by considering overlapping spans as being in the same cluster, and merging two clusters if they contained a shared target.
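
This clustering amounts to computing connected components over the "spans overlap" relation; a minimal sketch, assuming each valid target carries character offsets, follows:

```python
# Group valid targets into clusters: overlapping spans share a cluster, and
# clusters sharing a target are merged (incremental connected components).
def cluster_targets(targets: list) -> list:
    """targets: dicts with 'start' and 'end' character offsets (end exclusive)."""
    clusters = []
    for t in targets:
        overlapping = [c for c in clusters
                       if any(t["start"] < o["end"] and o["start"] < t["end"] for o in c)]
        merged = [t]
        for c in overlapping:
            merged.extend(c)
            clusters.remove(c)
        clusters.append(merged)
    return clusters
```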

4.3 Results

Confidence

The per-dataset distribution of the confidence in the annotations is depicted in Figure 3a. For each confidence bin, one of the authors manually annotated a random sample of target candidates for their sentiments, and computed a per-bin annotation error rate (see Table 1). Based on this analysis, the threshold for the confidence in the annotations was set to 0.7, since the estimated annotation error rate below such a confidence value was high. In total, between %-% of all annotations – dark red in Figure 3a – were considered as low-confidence.

(a) Annotation confidence distribution
(b) Sentiment labels distribution
(c) Cluster size distribution
(d) Clusters per sentence distribution
Figure 3: Per-dataset statistics of the collected annotations, showing the distributions of: (a) the confidence in the sentiment annotation of each target candidate; (b) the sentiment labels of targets annotated with high-confidence (HC); (c) the number of valid targets within each targets cluster; (d) the number of clusters in each annotated sentence. The datasets are marked as: SE14-L (L), SE14-R (R), Yelp (Y), Amazon (A), SST (S) and Opinosis (O).
Table 1: The annotation error rate for each confidence bin, estimated by manually labeling a sample of 30 annotations from each bin (across all datasets).

Sentiment labels

Observing the distribution of sentiment labels annotated with high-confidence (Figure 3b), hardly any targets are annotated as mixed, and in all datasets (except Amazon) there are more positive labels than negative ones. As many as % of the target candidates are labeled as not having a sentiment in this phase (black in Figure 3b), demonstrating the need for the second annotation phase.

Clusters

In theory, a targets cluster can include targets with different sentiments, in which case our process assigns the majority sentiment to the cluster. However, in practice, all overlapping targets are annotated with the same sentiment by the crowd, adding support for the quality of the sentiment annotation. Thus, the sentiment of the cluster is simply the sentiment of its targets.

Figure 3c depicts the distribution of the number of valid targets contained in each cluster. As can be seen, the majority of clusters contain a single target. Among the clusters that contain two targets, a common pattern is "the/this/a/their <T>" for some term T, e.g. color and the color. In addition, some clusters contain three or more targets, mostly stemming from conjunctions or lists of targets. For example, in "Her office routine and morning routine are wonderful.", one cluster was identified with six targets: morning routine, office routine, Her office routine, Her office routine and morning routine, office routine and morning routine, and routine.

The distribution of the number of clusters identified in each annotated sentence is depicted in Figure 3d. Around % of the sentences have one cluster identified within, and as many as % have two or more clusters (for Opinosis). Between % to % of the sentences contain no clusters, i.e. no term with a sentiment towards it was detected. Exploring the connection between the number of identified clusters and properties of the annotated sentences, such as their length, is an interesting direction for future work.

Summary

Table 2 summarizes the statistics of the collected data, detailing the counts of annotated sentences and target candidates, targets annotated with high-confidence, valid targets and target clusters, per dataset. It also shows the average pairwise inter-annotator agreement, computed with Cohen’s Kappa (Cohen, 1960), which was in the range considered as moderate agreement (substantial agreement for SE14-R) according to Landis and Koch (1977).

Table 2: Per-dataset annotation statistics for SE14-L, SE14-R, Yelp, Amazon, SST and Opinosis, and in total: the number of annotated sentences (#S) and target candidates annotated within those sentences (#TC); the average pairwise inter-annotator agreement (K); the number of targets annotated with high confidence (#HC) and as valid targets (#VT); and the number of clusters formed from the valid targets. See §4.

The YASO dataset which we release includes 2,215 sentences and 7,415 annotated target candidates. For completeness and to enable further analysis, the dataset includes all candidate targets, not just valid ones. Each target is marked with its confidence, sentiment label (including raw annotation counts), and span.
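
For reference, the average pairwise inter-annotator agreement reported above (Table 2) can be computed along the lines of the following sketch, which assumes per-annotator label dictionaries keyed by item id and uses scikit-learn's implementation of Cohen's Kappa:

```python
# A sketch of average pairwise Cohen's Kappa over the items that each pair
# of annotators labeled in common. Data layout is assumed, not prescribed.
from itertools import combinations
from sklearn.metrics import cohen_kappa_score

def average_pairwise_kappa(annotations: dict) -> float:
    """annotations: {annotator_id: {item_id: label}}"""
    kappas = []
    for a, b in combinations(annotations, 2):
        shared = sorted(set(annotations[a]) & set(annotations[b]))
        if len(shared) < 2:
            continue  # too few common items for a meaningful score
        kappas.append(cohen_kappa_score(
            [annotations[a][i] for i in shared],
            [annotations[b][i] for i in shared]))
    return sum(kappas) / len(kappas) if kappas else float("nan")
```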

5 Analysis

Next, several questions pertaining to the collected data and its annotation scheme are explored.

Is the sentiment annotation phase mandatory?

Table 3: An analysis of the sentiment labels selected in the first and second annotation phases, for each dataset (Amazon, Opinosis, SST, Yelp, SE14-L and SE14-R). For a given number of annotators (A, from 1 to 5), the table shows (i) the percentage of target candidates (TC) identified by that many annotators in the first annotation phase; and (ii) the percentage of those candidates whose first-phase sentiment was verified as correct in the second phase (C). See §5.

Each annotator marking a target candidate in the first annotation phase also selected its sentiment. Table 3 compares this "first-phase" sentiment, which is based on the answers of however many annotators marked the candidate, to the final sentiment label assigned in the following sentiment annotation phase, which is always based on a fixed number of answers. As expected, when more annotators mark a candidate and choose the same sentiment in the first phase, that sentiment is often verified as correct in the second phase.

These results suggest that some candidates may be exempt from the second phase of sentiment annotation, thus reducing annotation costs: for example, candidates marked in the first phase by at least three annotators have a correctly identified sentiment in more than % of cases (see the three bottom rows of Table 3). Excluding these from the sentiment annotation phase could have saved between % (for Amazon) and % (for SE14-R) of the required annotations, and is thus an attractive optimization for future similar annotation efforts. Still, sentiment annotation is required for most candidates – those which are initially identified by one or two annotators.

Domain       Common (Ag.)  Common (Dis.)  Exclusive (YASO)  Exclusive (SE)  Total
Laptops      41            5              64                11              121
Restaurants  93            5              38                9               145
Table 4: A comparison of our annotations to labels from the SE’14 dataset, on randomly sampled sentences from each of its domains. For targets having labels in both datasets, the labels agree in most cases and disagree in some (columns Ag. and Dis.). Targets having a label in only one dataset (YASO or SE) are further analyzed in §5.
Category L R Examples
Entities 14 6 Apple, iPhone, Culinaria
Product 13 6 laptop, this bar, this place
Other 10 11 process, decision, choice
Indirect 24 11 it, she, this, this one, here
Error 3 4
Table 5: A categorization of valid targets in our data that are not part of SE’14, for the laptops (L) and restaurants (R) domains. The categories are detailed in §5.

What are the differences from SE’14?

Table 4 compares the collected clusters for the sentences sampled from SE’14, to the annotations provided within that dataset. For this comparison, the SE’14 "neutral" annotations were considered as "none" sentiment labels, since they refer to terms that have no sentiment expressed towards them. The comparison was performed by pairing each cluster, based solely on its span, with overlapping SE’14 annotations (when such annotations existed) and comparing the corresponding sentiments within each pair. In most cases, the sentiments were the same.
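
A sketch of this pairing procedure is shown below, assuming that both the clusters and the SE'14 annotations carry character offsets and sentiment labels (field names are illustrative):

```python
# Pair each YASO cluster with overlapping SE'14 annotations (by span only)
# and compare sentiments; unmatched clusters are exclusive to YASO.
def compare_to_se14(clusters: list, se14_targets: list):
    agree = disagree = yaso_only = 0
    for c in clusters:
        matches = [t for t in se14_targets
                   if t["start"] < c["end"] and c["start"] < t["end"]]
        if not matches:
            yaso_only += 1
        elif any(t["sentiment"] == c["sentiment"] for t in matches):
            agree += 1
        else:
            disagree += 1
    return agree, disagree, yaso_only
```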

Notably, a significant number of clusters are exclusively present in YASO – they do not overlap any SE’14 annotation. Manually analysing such clusters, a few were identified as annotation errors, and the others were observed to belong to one of the following categories: (i) Entities, such as company names; (ii) Product terms like computer or restaurant; (iii) Other terms that are not product aspects, such as decision in "I think that was a great decision to buy"; (iv) Indirect references, including pronouns, such as It in "It was delicious and large!". This difference is expected, as such terms are by definition excluded from SE’14. In contrast, they are included in YASO since, by design, it includes all spans people consider as having a sentiment. This makes YASO more complete, while enabling those interested to discard terms as needed for downstream applications. The per-domain frequency of each category, along with additional examples, is given in Table 5.

A similar analysis performed on the targets that were exclusively found in SE’14 (i.e. not paired with any of the clusters) showed that some cases were annotation errors. Three were due to complex expressions with an implicit or unclear sentiment. For example, in "Yes, they’re a bit more expensive then typical, but then again, so is their food.", the sentiment of food is unclear (and labeled as positive in SE). The rest of the SE’14 errors were due to wrong target/sentiment labels. Of the other labels not paired with any cluster, three were YASO annotation errors, and the others were annotated but with low-confidence.

6 Baseline Systems

To establish benchmark results on the new data, the following five TSA systems were reproduced using their available codebases.

BAT

Karimi et al. (2020): A pipelined system with domain-specific language models (Xu et al., 2019) augmented with adversarial data.

LCF

Yang et al. (2020): A joint model based on BERT-SPC, a BERT sentence pair classification model adapted to ABSA in Song et al. (2019), with domain adaptation and a local context focus mechanism.

Table 6: Benchmark results on the new dataset (Yelp, Amazon, SST and Opinosis) with five SOTA systems (BAT, BERT-E2E, HAST_MCRF, LCF and RACL), each trained on data from one SE’14 domain (laptops – Lap. or restaurants – Res.). The reported metric is F1 for target extraction (TE) and the entire task (TSA), and macro-F1 for sentiment classification (SC).

RACL

Chen and Qian (2020): An end-to-end multi-task learning and relation propagation system. We used the RACL-GloVe variant, based on pre-trained word embeddings.

BERT-E2E

Li et al. (2019): A BERT-based end-to-end system of sequence labeling with a unified tagging scheme. We used the simplest architecture of a linear classification layer for the downstream model.

HAST_MCRF

A pipeline of two systems: (i) HAST – Truncated History-Attention (THA) and Selective Transformation Network (STN), for capturing aspect detection history and opinion summaries (Li et al., 2018); and (ii) MCRF-SA – a multiple-CRF-based structured attention model for extracting aspect-specific opinion spans and then classifying their sentiment (Xu et al., 2020a).

7 Benchmark Results

As the main purpose of our new dataset is evaluation, we experiment in a cross-domain setup, training each system on the training set of each of the SE’14 domains, yielding ten models overall. Using these models, the results were separately evaluated for the TE and SC subtasks, as well as the full TSA task. As a pre-processing step, any predicted target whose span equals the span of a target candidate annotated with low confidence was excluded from the evaluation, since its true label is unclear.

Metrics

A predicted target and a targets cluster are span-matched if the span of the cluster contains the span of the prediction. Similarly, they are fully-matched if, in addition to being span-matched, their sentiments are the same. Predictions that were not span-matched to some cluster were assumed to be errors for the TE task, since their span was not detected at all during annotation. Predictions that were not fully-matched to some cluster are errors for the full task. Using span-matches for the TE task, and full-matches for the full TSA task, precision is the percentage of target predictions that match a cluster, and recall is the percentage of clusters that were matched to at least one prediction.
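
The following sketch spells out these definitions, assuming predictions and clusters carry character offsets and sentiment labels (field names are illustrative):

```python
# TE uses span matches; the full TSA task uses full matches.
def span_match(pred: dict, cluster: dict) -> bool:
    """The cluster's span contains the prediction's span."""
    return cluster["start"] <= pred["start"] and pred["end"] <= cluster["end"]

def full_match(pred: dict, cluster: dict) -> bool:
    """Span match plus agreeing sentiments."""
    return span_match(pred, cluster) and pred["sentiment"] == cluster["sentiment"]

def precision_recall(predictions: list, clusters: list, match) -> tuple:
    matched_preds = sum(any(match(p, c) for c in clusters) for p in predictions)
    matched_clusters = sum(any(match(p, c) for p in predictions) for c in clusters)
    precision = matched_preds / len(predictions) if predictions else 0.0
    recall = matched_clusters / len(clusters) if clusters else 0.0
    return precision, recall

# te_p, te_r = precision_recall(preds, clusters, span_match)
# tsa_p, tsa_r = precision_recall(preds, clusters, full_match)
```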

For SC, evaluation was restricted to predictions that were span-matched to a targets cluster. For each possible sentiment label, precision is the percentage of predictions with that sentiment whose span-matched cluster shares that sentiment, and recall is the percentage of clusters with that sentiment that match at least one prediction of that same sentiment. Lastly, macro-F1 was calculated by averaging the F1 of the positive and negative sentiment labels, ignoring mixed since it was scarce in the data.
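
A simplified sketch of the SC metric follows; for brevity it scores each span-matched (prediction, cluster) pair independently, which slightly simplifies the recall definition above:

```python
# Macro-F1 over the positive and negative labels, computed on span-matched
# (predicted sentiment, cluster sentiment) pairs; "mixed" is ignored.
def macro_f1(matched_pairs: list) -> float:
    f1s = []
    for label in ("positive", "negative"):
        tp = sum(p == label and g == label for p, g in matched_pairs)
        pred = sum(p == label for p, _ in matched_pairs)
        gold = sum(g == label for _, g in matched_pairs)
        precision = tp / pred if pred else 0.0
        recall = tp / gold if gold else 0.0
        f1s.append(2 * precision * recall / (precision + recall) if precision + recall else 0.0)
    return sum(f1s) / len(f1s)
```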

Results

Table 6 presents the results of our evaluation. The best results for TE and the full task, on three of the four datasets (Yelp, SST and Opinosis), are achieved by BAT (trained on the restaurants domain). For SC, BERT-E2E was the best performing model on three datasets. Generally, results for SC are relatively high, while TE results by some models may be very low.

TSA task performance is lowest for SST, perhaps because its domain of movie reviews is the furthest of all datasets from the product-review training data. Interestingly, it is also the dataset on which humans disagree most (see Figure 3a).

In principle, using predictions of automatic target extraction as target candidates can reduce annotation costs. However, the TE recall of the available automatic systems is low (even when combining all five systems; data not shown), and hence automatic TE cannot yet replace the target candidates annotation phase.

8 Conclusions

We presented a new paradigm for creating labeled data for TSA, based on a two-phased annotation scheme, and applied it to collect a new large-scale and diverse evaluation dataset, YASO, that is released as part of this work. The reliability of our annotations has been verified through a manual analysis of sampled annotations, as well as a comparison against existing labeled data. Further analysis has shown that the second annotation phase is crucial for good precision.

Our paradigm can be easily applied to new domains, and unlike the SemEval datasets does not require any domain-specific knowledge, such as manually curated per-domain aspect lists. Furthermore, our paradigm does not require elaborate guidelines, nor professional annotators, since we take the simple approach of collecting all terms lay-people consider as having a sentiment. Our post-processing ensures the high quality of the resulting data. One limitation of this approach is that further analysis is required if one desires to understand the sentiment towards specific aspects.

Finally, cross-domain benchmark results established with five contemporary TSA systems show there is ample headroom for improvement on the new dataset. More importantly, this new dataset allows future TSA research to evaluate performance beyond the extensively explored SemEval datasets, which are mainly focused on two domains. Some options for improving the presented results are training on multiple domains or datasets, adapting pre-trained models to the target domains in an unsupervised manner, exploring various data augmentation techniques, or utilizing multi-task or weak-supervision algorithms.

References

  • Z. Chen and T. Qian (2020) Relation-aware collaborative learning for unified aspect-based sentiment analysis. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 3685–3694.
  • J. Cohen (1960) A coefficient of agreement for nominal scales. Educational and Psychological Measurement 20 (1), pp. 37–46.
  • L. Dong, F. Wei, C. Tan, D. Tang, M. Zhou, and K. Xu (2014) Adaptive recursive neural network for target-dependent Twitter sentiment classification. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Baltimore, Maryland, pp. 49–54.
  • K. Ganesan, C. Zhai, and J. Han (2010) Opinosis: a graph-based approach to abstractive summarization of highly redundant opinions. In Proceedings of the 23rd International Conference on Computational Linguistics, pp. 340–348.
  • C. Gong, J. Yu, and R. Xia (2020) Unified feature and instance based domain adaptation for aspect-based sentiment analysis. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, pp. 7035–7045.
  • R. He, W. S. Lee, H. Ng, and D. Dahlmeier (2018) Effective attention modeling for aspect-level sentiment classification. In COLING.
  • M. Honnibal and I. Montani (2017) spaCy 2: natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing. To appear.
  • B. Huang, Y. Ou, and K. M. Carley (2018) Aspect level sentiment classification with attention-over-attention neural networks. In Social, Cultural, and Behavioral Modeling – 11th International Conference, SBP-BRiMS 2018, Washington, DC, USA, July 10-13, 2018, Proceedings, Lecture Notes in Computer Science, Vol. 10899, pp. 197–206.
  • A. Karimi, L. Rossi, A. Prati, and K. Full (2020) Adversarial training for aspect-based sentiment analysis with BERT. arXiv preprint arXiv:2001.11316.
  • J. R. Landis and G. G. Koch (1977) The measurement of observer agreement for categorical data. Biometrics, pp. 159–174.
  • H. Li and W. Lu (2019) Learning explicit and implicit structures for targeted sentiment analysis. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China.
  • X. Li, L. Bing, P. Li, W. Lam, and Z. Yang (2018) Aspect term extraction with history attention and selective transformation. In Proceedings of the 27th International Joint Conference on Artificial Intelligence (IJCAI'18), pp. 4194–4200.
  • X. Li, L. Bing, W. Zhang, and W. Lam (2019) Exploiting BERT for end-to-end aspect-based sentiment analysis. In Proceedings of the 5th Workshop on Noisy User-generated Text (W-NUT 2019), Hong Kong, China, pp. 34–41.
  • B. Liu (2012) Sentiment analysis and opinion mining. Synthesis Lectures on Human Language Technologies 5 (1), pp. 1–167.
  • D. Ma, S. Li, F. Wu, X. Xie, and H. Wang (2019) Exploring sequence-to-sequence learning in aspect term extraction. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 3538–3547.
  • Y. Ma, H. Peng, and E. Cambria (2018) Targeted aspect-based sentiment analysis via embedding commonsense knowledge into an attentive LSTM. In AAAI.
  • M. Mitchell, J. Aguilar, T. Wilson, and B. Van Durme (2013) Open domain targeted sentiment. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, Seattle, Washington, USA, pp. 1643–1654.
  • B. Pang, L. Lee, and S. Vaithyanathan (2002) Thumbs up? Sentiment classification using machine learning techniques. In Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing (EMNLP 2002), pp. 79–86.
  • J. Pei, A. Sun, and C. Li (2019) Targeted sentiment analysis: a data-driven categorization. arXiv preprint arXiv:1905.03423.
  • H. Peng, L. Xu, L. Bing, F. Huang, W. Lu, and L. Si (2020) Knowing what, how and why: a near complete solution for aspect-based sentiment analysis. Proceedings of the AAAI Conference on Artificial Intelligence 34 (05), pp. 8600–8607.
  • M. H. Phan and P. O. Ogunbona (2020) Modelling context and syntactical features for aspect-based sentiment analysis. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, pp. 3211–3220.
  • M. Pontiki, D. Galanis, H. Papageorgiou, I. Androutsopoulos, S. Manandhar, M. AL-Smadi, M. Al-Ayyoub, Y. Zhao, B. Qin, O. De Clercq, V. Hoste, M. Apidianaki, X. Tannier, N. Loukachevitch, E. Kotelnikov, N. Bel, S. M. Jiménez-Zafra, and G. Eryiğit (2016) SemEval-2016 task 5: aspect based sentiment analysis. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), San Diego, California, pp. 19–30.
  • M. Pontiki, D. Galanis, H. Papageorgiou, S. Manandhar, and I. Androutsopoulos (2015) SemEval-2015 task 12: aspect based sentiment analysis. In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), Denver, Colorado, pp. 486–495.
  • M. Pontiki, D. Galanis, J. Pavlopoulos, H. Papageorgiou, I. Androutsopoulos, and S. Manandhar (2014) SemEval-2014 task 4: aspect based sentiment analysis. In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014), Dublin, Ireland, pp. 27–35.
  • A. Rietzler, S. Stabinger, P. Opitz, and S. Engl (2020) Adapt or get left behind: domain adaptation through BERT language model finetuning for aspect-target sentiment classification. In Proceedings of the 12th Language Resources and Evaluation Conference, Marseille, France, pp. 4933–4941.
  • S. Ruder, P. Ghaffari, and J. G. Breslin (2016) A hierarchical model of reviews for aspect-based sentiment analysis. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, Texas, pp. 999–1005.
  • M. Saeidi, G. Bouchard, M. Liakata, and S. Riedel (2016) SentiHood: targeted aspect based sentiment analysis dataset for urban neighbourhoods. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, Osaka, Japan, pp. 1546–1556.
  • J. Schmidhuber and S. Hochreiter (1997) Long short-term memory. Neural Computation 9 (8), pp. 1735–1780.
  • K. Schouten and F. Frasincar (2015) Survey on aspect-level sentiment analysis. IEEE Transactions on Knowledge and Data Engineering 28 (3), pp. 813–830.
  • R. Socher, A. Perelygin, J. Wu, J. Chuang, C. D. Manning, A. Ng, and C. Potts (2013) Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, Seattle, Washington, USA, pp. 1631–1642.
  • Y. Song, J. Wang, T. Jiang, Z. Liu, and Y. Rao (2019) Attentional encoder network for targeted sentiment classification. CoRR abs/1902.09314.
  • C. Sun, L. Huang, and X. Qiu (2019) Utilizing BERT for aspect-based sentiment analysis via constructing auxiliary sentence. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 380–385.
  • D. Tang, B. Qin, X. Feng, and T. Liu (2016a) Effective LSTMs for target-dependent sentiment classification. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, Osaka, Japan, pp. 3298–3307.
  • D. Tang, B. Qin, and T. Liu (2016b) Aspect level sentiment classification with deep memory network. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, Texas, pp. 214–224.
  • H. Tian, C. Gao, X. Xiao, H. Liu, B. He, H. Wu, H. Wang, and F. Wu (2020) SKEP: sentiment knowledge enhanced pre-training for sentiment analysis. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, pp. 4067–4076.
  • S. Tulkens and A. van Cranenburgh (2020) Embarrassingly simple unsupervised aspect extraction. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, pp. 3182–3187.
  • W. Wang, S. J. Pan, D. Dahlmeier, and X. Xiao (2016) Recursive neural conditional random fields for aspect-based sentiment analysis. arXiv preprint arXiv:1603.06679.
  • W. Wang, S. J. Pan, D. Dahlmeier, and X. Xiao (2017) Coupled multi-layer attentions for co-extraction of aspect and opinion terms. In Thirty-First AAAI Conference on Artificial Intelligence.
  • H. Xu, B. Liu, L. Shu, and P. Yu (2019) BERT post-training for review reading comprehension and aspect-based sentiment analysis. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 2324–2335.
  • L. Xu, L. Bing, W. Lu, and F. Huang (2020a) Aspect sentiment classification with aspect-specific opinion spans. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 3561–3567.
  • L. Xu, H. Li, W. Lu, and L. Bing (2020b) Position-aware tagging for aspect sentiment triplet extraction. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, pp. 2339–2349.
  • H. Yang, B. Zeng, J. Yang, Y. Song, and R. Xu (2020) A multi-task learning model for Chinese-oriented aspect polarity classification and aspect term extraction. arXiv preprint arXiv:1912.07976.
  • J. Yang, R. Yang, C. Wang, and J. Xie (2018) Multi-entity aspect-based sentiment analysis with context, entity and aspect memory. In AAAI.
  • J. Yu, J. Jiang, and R. Xia (2019) Global inference for aspect and opinion terms co-extraction based on multi-task neural networks. IEEE/ACM Transactions on Audio, Speech, and Language Processing 27 (1), pp. 168–177.
  • B. Zeng, H. Yang, R. Xu, W. Zhou, and X. Han (2019) LCF: a local context focus mechanism for aspect-based sentiment classification. Applied Sciences 9 (16).
  • L. Zhang, S. Wang, and B. Liu (2018) Deep learning for sentiment analysis: a survey. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 8 (4), pp. e1253.

Appendix A Target Candidates Annotation

Below are the guidelines for the labeling task of detecting potential targets and their sentiment.

General instructions

In this task you will review a set of sentences. Your goal is to identify items in the sentences that have a sentiment expressed towards them.

Steps

  1. Read the sentence carefully.

  2. Identify items that have a sentiment expressed towards them.

  3. Mark each item, and for each selection choose the expressed sentiment:

    1. Positive: the expressed sentiment is positive.

    2. Negative: the expressed sentiment is negative.

    3. Mixed: the expressed sentiment is both positive and negative.

  4. If there are no items with a sentiment expressed towards them, proceed to the next sentence.

Rules & Tips

  • Select all items in the sentence that have a sentiment expressed towards them.

  • It could be that there are several correct overlapping selections. In such cases, it is OK to choose only one of these overlapping selections.

  • The sentiment towards a selected item(s) should be expressed from other parts of the sentence, it cannot come from within the selected item (see Example #2 below).

  • Under each question is a comments box. Optionally, you can provide question-specific feedback in this box. This may include a rationalization of your choice, a description of an error within the question or the justification of another answer which was also plausible. In general, any relevant feedback would be useful, and will help in improving this task.

Examples

Here are a few example sentences, categorized into several example types. For each sentence, the examples show item(s) which should be selected, and the sentiment expressed towards each such item. Further explanations are provided within the examples, when needed. Please review the examples carefully before starting the task.

  1. Basics

    Example #1.1: The food was good.
    Correct answer: The food was good.
    Explanation: The word good expresses a positive sentiment towards food.

    Example #1.2: The food was bad.
    Correct answer: The food was bad.
    Explanation: The word bad expresses a negative sentiment towards food.

    Example #1.3: The food was tasty but expensive.
    Correct answer: The food was tasty but expensive.
    Explanation: tasty expresses a positive sentiment, while expensive expresses a negative sentiment, so the correct answer is Mixed.

    Example #1.4: The food was served.
    Correct answer: Nothing should be selected, since there is no sentiment expressed in the sentence.

  2. Sentiment location

    Example #2.1: I love this great car.
    Correct answer #1: I love this great car.
    Correct answer #2: I love this great car.
    Explanation: The word love expresses a positive sentiment towards great car or car.
    Note: It is OK to select only one of the above options, since they overlap.

    Example #2.2: I have a great car.
    Correct answer: I have a great car.
    Explanation: The word great expresses a positive sentiment towards car.
    Note: Do NOT select the item great car, because there is NO sentiment expressed towards great car outside of the phrase great car itself. The only other information is that I have the item, which does not convey a sentiment towards it.

  3. Multiple selections in one sentence

    Example #3.1: The food was good, but the atmosphere was awful.
    Correct answer: The food was good, but the atmosphere was awful.
    Explanation: the word good expresses a positive sentiment towards food, while the word awful expresses a negative sentiment towards atmosphere.
    Note: Both items should be selected!

    Example #3.2: The camera has excellent lens.
    Correct answer: The camera has excellent lens.
    Explanation: The word excellent expresses a positive sentiment towards lens. An excellent lens is a positive thing for a camera to have, thus expressing a positive sentiment towards camera.
    Note: Both items should be selected!

    Example #3.3: My new camera has excellent lens, but its price is too high.
    Correct answer: My new camera has excellent lens, but its price is too high.
    Explanation: The word excellent expresses a positive sentiment towards lens, while the words too high express a negative sentiment towards price. There is a positive sentiment towards the camera, due to its excellent lens, and also a negative sentiment, because its price is too high, so the sentiment towards camera is Mixed.
    Note: All three items should be selected. Other acceptable selections with a Mixed sentiment are new camera or My new camera. Since they overlap, it is OK to select just one of them.

  4. Sentences without any expressed sentiments
    Below are some examples of sentences without any expressed sentiment in them. For such sentences, nothing should be selected.

    Example #4.1: Microwave, refrigerator, coffee maker in room.
    Example #4.2: I took my Mac to work yesterday.

  5. Long selected items
    There is no restriction on the length of a selected item, so long as there is an expressed sentiment towards it in the sentence (which does not come from within the marked item).

    Example #5.1: The food from the Italian restaurant near my office was very good.
    Correct answer #1: The food from the Italian restaurant near my office was very good.
    Correct answer #2: The food from the Italian restaurant near my office was very good.
    Correct answer #3: The food from the Italian restaurant near my office was very good.
    Correct answer #4: The food from the Italian restaurant near my office was very good.
    Explanation: the words very good express a positive sentiment towards food.

    Note: It is also a valid choice to select food along with its detailed description: food from the Italian restaurant near my office, or add the prefix The to the selection (or both). The selection must be a coherent phrase; food from the is not a valid selection. Since these selections all overlap, it is OK to select one of them.

Appendix B Sentiment Annotation

Below are the guidelines for labeling the sentiment of identified target candidates.

General instructions

In this task you will review a set of sentences, each containing one marked item. Your goal is to determine the sentiment expressed in the sentence towards the marked item.

Steps

  1. Read the sentence carefully.

  2. Identify the sentiment expressed in the sentence towards the marked item, by selecting one of these four options:

    1. Positive: the expressed sentiment is positive.

    2. Negative: the expressed sentiment is negative.

    3. Mixed: the expressed sentiment is both positive and negative.

    4. None: there is no sentiment expressed towards the item.

  3. If there are no items with a sentiment expressed towards them, proceed to the next sentence.

Rules & Tips

  • The sentiment should be expressed towards the marked item, it cannot come from within the marked item (see Example #2 below).

  • A sentence may appear multiple times, each time with one marked item. Different marked items may have different sentiments expressed towards each of them in one sentence (see Example #3 below)

  • Under each question is a comments box. Optionally, you can provide question-specific feedback in this box. This may include a rationalization of your choice, a description of an error within the question or the justification of another answer which was also plausible. In general, any relevant feedback would be useful, and will help in improving this task.

Examples

Here are a few examples, each containing a sentence and a marked item, along with the correct answer and further explanations (when needed). Please review the examples carefully before starting the task.

  1. Basics

    Example #1.1: The food was good.
    Answer: Positive

    Example #1.2: The food was bad.
    Answer: Negative

    Example #1.3: The food was tasty but expensive.
    Answer: Mixed
    Explanation: tasty expresses a positive sentiment, while expensive expresses a negative sentiment, so the correct answer is Mixed.

    Example #1.4: The food was served.
    Answer: None

  2. Sentiment location

    Example #2.1: I love this great car.
    Answer: Positive
    Explanation: There is a positive sentiment expressed towards great car outside of the marked item car – in the statement that I love the car.

    Example #2.2: I love this great car.
    Answer: Positive
    Explanation: There is a positive sentiment expressed towards car outside of the marked item car – in the word great and the statement that I love the car.

    Example #2.3: I have a great car.
    Answer: Positive
    Explanation: There is a positive sentiment (great) expressed towards car outside of the marked item car.

  3. Different marked items in one sentence

    Example #3.1: The food was good, but the atmosphere was awful.
    Answer: Positive
    The food was good, but the atmosphere was awful.
    Answer: Negative

    Example #3.2: The camera has excellent lens.
    Answer: Positive
    The camera has excellent lens.
    Answer: Positive

    Example #3.3: My new camera has excellent lens, but its price is too high.
    Answer: Mixed
    Explanation: There is a positive sentiment towards the camera, due to its excellent lens, and also a negative sentiment, because its price is too high, so the correct answer is Mixed.
    My new camera has excellent lens, but its price is too high.
    Answer: Positive
    My new camera has excellent lens, but its price is too high.
    Answer: Negative

  4. Marked items without a sentiment

    Below are some examples of marked items without an expressed sentiment in the sentence. In cases where there is an expressed sentiment towards other words in the same sentence, it is exemplified as well.

    Example #4.1: Microwave, refrigerator, coffee maker in room.
    Answer: None

    Example #4.2: Note that they do not serve beer, you must bring your own.
    Answer: None

    Example #4.3: The cons are more annoyances that can be lived with.
    Answer: None
    Explanation: While the marked item contains a negative sentiment, there is no sentiment towards the marked item.

    Example #4.4: working with Mac is so much easier, so many cool features.
    Answer: None
    working with Mac is so much easier, so many cool features.
    Answer: Positive
    working with Mac is so much easier, so many cool features.
    Answer: Positive

    Example #4.5: The battery life is excellent- 6-7 hours without charging.
    Answer: None
    The battery life is excellent- 6-7 hours without charging.
    Answer: Positive

    Example #4.6: I wanted a computer that was quiet, fast, and that had overall great performance.
    Answer: None

  5. “the” can be a part of a marked item

    I feel a little bit uncomfortable in using the Mac system.
    Answer: Negative
    I feel a little bit uncomfortable in using the Mac system.
    Answer: Negative
    I feel a little bit uncomfortable in using the Mac system.
    Answer: None

  6. Long marked items

    There is no restriction on the length of a marked item, so long as there is an expressed sentiment towards it in the sentence (which does not come from within the marked item).

    The food from the Italian restaurant near my office was very good.
    Answer: Positive
    The food from the Italian restaurant near my office was very good.
    Answer: Positive
    The food from the Italian restaurant near my office was very good.
    Answer: None

  7. Idioms

    A sentiment may be conveyed with an idiom – be sure you understand the meaning of an input sentence before answering. When unsure, look up potential idioms online.

    The laptop’s performance was in the middle of the pack, but so is its price.
    Answer: None
    Explanation: in the middle of the pack does not convey a positive nor a negative sentiment, and certainly not both (so the answer is not "mixed" either).