Disentangling Aspect and Opinion Words in Target-based Sentiment Analysis using Lifelong Learning

02/16/2018 ∙ by Shuai Wang, et al. ∙ Association for Computing Machinery, University of Illinois at Chicago

Given a target name, which can be a product aspect or entity, identifying its aspect words and opinion words in a given corpus is a fine-grained task in target-based sentiment analysis (TSA). This task is challenging, especially when we have no labeled data and want to perform it for any given domain. To address it, we propose a general two-stage approach. Stage one extracts/groups the target-related words (called t-words) for a given target. This is relatively easy, as we can apply an existing semantics-based learning technique. Stage two separates the aspect and opinion words from the grouped t-words, which is challenging because we often do not have enough word-level aspect and opinion labels. In this work, we formulate this problem in a PU learning setting and incorporate the idea of lifelong learning to solve it. Experimental results show the effectiveness of our approach.




1 Introduction

Target-based sentiment analysis (TSA) is an important topic in sentiment analysis [Jiang et al.2011, Vo and Zhang2015, Wang et al.2016a]. A target can be a product aspect or entity, and is also referred to as an aspect in the literature. In this paper, we use target to mean an aspect category, and aspect words to mean related mentions of the target. We focus on a fine-grained TSA (FTSA) task: given a target name, identify its aspect words and opinion words in a given domain corpus. For example, one may be interested in opinions about the target “screen” of a camera and want to find all related aspect and opinion words mentioned in reviews. We may find aspect words like “LCD,” “display” and “resolution,” and opinion-related words like “scratched,” “blurry” and “bubbly.”

This problem is challenging, especially when there is no labeled data and we want to perform it in any domain (reviews of any product). In practice, it is not feasible to manually annotate all possible targets beforehand for every domain, so an unsupervised or semi-supervised method is needed. To this end, we develop a two-stage approach that does not require manual labeling.

Figure 1: Two-stage approach to fine-grained TSA (FTSA): (a) grouping; (b) disentangling.

Stage one is defined as target-related word extraction/grouping. That is, given a target name, we first identify the target-related words (called t-words) in a corpus. For instance, when the target is screen, the t-words “display”, “LCD”, “scratched” and “bubbly” are extracted (as shown in Figure 1(a)). We can achieve this by using the semantic representations of words obtained from distributional representation learning [Levy et al.2015]. Specifically, given a target name, we extract its semantically similar words in the learned semantic space as t-words. This grouping stage is very similar to (unsupervised) aspect extraction in aspect-level sentiment analysis, so many existing approaches that exploit learned semantics [Mukherjee and Liu2012, Liu2012], such as topic modeling [Blei et al.2003], can be utilized too.

One key issue with most semantics learning techniques for our task (FTSA) is that they inevitably couple the target-related aspect words (called t-aspect words) and opinion words (called t-opinion words). We interpret the cause from a linguistic perspective. Most semantics learning models are built on the distributional hypothesis: linguistic items occurring in similar contexts have similar meanings [Harris1954]. They therefore group two different types of semantic similarity together, namely conceptual and associative similarity. Conceptual similarity means two words are conceptually similar (likely interchangeable), like “dog” and “canine”. Associative similarity means two words tend to appear in similar contexts, like “dog” and “bark.” The distinction between them is well known in cognitive science [Tversky1977], and it has also been discussed in NLP [Kiela et al.2015, Levy et al.2015]. In regard to sentiment analysis, the t-aspect words “display” and “LCD” and the t-opinion words “scratched” and “bubbly” are all mixed together in our example (Figure 1(a)).

Despite this drawback, we argue that semantics-based models are still suitable for our final goal (i.e., FTSA), for three reasons. First, the mixture benefits t-word grouping, since both types of semantic correlation can be extracted jointly. To be concrete, aspect words like “display” can be found because of conceptual similarity (similar to “screen”), and opinion words like “scratched” can be discovered due to associative similarity (associated with “screen”). Second, existing and emerging aspect extraction models can be utilized, which paves the way for better FTSA. Third, these semantics-based models are usually learned in an unsupervised or semi-supervised manner, which meets our need. However, when we use them for stage one, we have to overcome the drawback above in order to perform FTSA, which leads to our stage two.

Stage two is defined as follows: given a list of target-related words (t-words), separate them into target-related aspect words (t-aspect words) and target-related opinion words (t-opinion words). Figure 1(b) shows an example. Notice that the list of t-words is assumed to be given, as grouped by an existing semantics-based learning technique, so we refer to this problem as disentangling aspect and opinion words from extracted/grouped target-related words, to distinguish it from other related sentiment analysis problems (see Section 2).

An intuitive solution to this problem is to model it as a word-level binary classification task, that is, to build a classifier for learning and predicting t-aspect and t-opinion words. However, this is difficult in practice, because it requires both aspect and opinion word-level labels for every domain, which demands intensive human effort. We therefore formulate the classification problem in a PU (Positive-Unlabeled) learning setting. The idea is to use general/common opinion words (treated as positive examples) to distill other opinion words from unlabeled words. However, a notable issue in this PU setting is that errors from false positive (FP) examples (wrongly predicted opinion words) can propagate, resulting in more errors and degrading performance. To address this issue, we exploit the idea of lifelong machine learning [Thrun1998, Chen and Liu2016] and incorporate it into the PU learning process. We call the result Lifelong PU learning (LPU). It accumulates the knowledge learned from multiple past domains and uses it to restrict the propagation of FP examples and to ensure the reliability of the newly learned opinion words. Our experimental results show its effectiveness.

The main contributions of this paper are summarized as follows: (1) It proposes to perform the fine-grained target-based sentiment analysis (FTSA) task in a two-stage manner that does not require manual labeling. (2) It proposes a lifelong PU (LPU) learning approach to the problem of disentangling target-related aspect and opinion words from extracted/grouped words. To the best of our knowledge, none of the existing studies has employed the lifelong PU technique. (3) Experimental results on real-world review datasets with two general aspect extraction techniques show its effectiveness and extensibility.

2 Related Work

Target-based sentiment analysis. Target-based sentiment analysis (TSA) aims at analyzing the sentiment toward a specific given target. Most previous studies [Jiang et al.2011, Vo and Zhang2015] focused on the target-dependent sentiment classification task, which classifies the sentiment polarity of a sentence towards a given target, for example, determining whether a tweet shows positive or negative sentiment towards a company. Our task is not to classify a single sentence, but to identify all aspect and opinion words in a given corpus for a specified target name. Wang et al. [Wang et al.2016a] proposed a targeted topic model to generate target-related topics. However, their work dealt with neither opinions nor the word disentangling problem.

Aspect-level sentiment analysis. Our work is also related to the widely studied aspect-level sentiment analysis (ASA) [Liu2012]. To show how ASA differs from and relates to our problem (FTSA), we categorize previous studies into three general groups. The first group uses linguistic tools, association structures or hand-crafted rules for aspect or opinion word extraction, or co-extraction [Hu and Liu2004, Qiu et al.2009, Xu et al.2013]. The second group models aspect and opinion word identification as a supervised sequence labeling problem [Irsoy and Cardie2014, Wang et al.2017]. These two groups require intensive human labor for feature engineering, pattern design or manual labeling. The third group, in contrast, does not rely on human involvement; it exploits distributional semantics, e.g., topic modeling, word embeddings, or their variants [Mukherjee and Liu2012, Wang et al.2016b, Tixier et al.2016]. We discussed the suitability of these techniques for our task in Section 1. Note that our focus is on disentangling (separating) t-aspect and t-opinion words from the t-words grouped by them. Some studies in the third group also separated aspect and opinion words by using POS features with manual or automatic labels. This is a purely syntax-based solution, which we compare against and analyze in our experiments. Most importantly, most existing methods perform a full analysis of all aspects, while our task is target-oriented.

Semantic space and representation. Semantics-based learning models project words into a semantic space and represent each word as a dense vector. Such semantics-bearing vectors can be created by matrix factorization (e.g., LSI) [Deerwester et al.1990] and topic modeling (e.g., LDA) [Blei et al.2003]. Their word vectors have been used to tackle some word-level classification problems [Maas et al.2011, Pu et al.2015]. More recently, neural word embeddings [Mikolov and Dean2013, Pennington et al.2014] have emerged, offering better semantic representations of words and improving many NLP tasks [Turian et al.2010, Collobert et al.2011].

Lifelong machine learning. Our work is also related to lifelong learning [Thrun1998, Chen and Liu2016]. In the sentiment analysis context, several lifelong learning models have been proposed for improving topic quality [Wang et al.2016b], aspect extraction [Liu et al.2016] and document-level sentiment classification [Chen et al.2015]. However, they do not address the TSA task and are not applicable to the word disentangling problem. Additionally, we formulate our problem in a PU learning setting with no labeled data, and we incorporate lifelong learning into the PU learning process. To the best of our knowledge, none of the previous studies has employed the lifelong PU technique.

3 Stage one: Grouping

In this stage, we group the target-related words (t-words) for a specified target name. The basic idea is to extract its semantically correlated words based on the vector representation of the target in a learned semantic space. Specifically, we use the neural word embedding model [Mikolov and Dean2013] to learn word vectors for a given domain corpus, resulting in an embedding matrix of size $|V| \times d$, where $|V|$ and $d$ are the vocabulary size and the vector dimension. A semantic similarity matrix of size $|V| \times |V|$ is then calculated as the dot product of the embedding matrix and its transpose. When a user specifies a target, the nearest neighbors of the target word are returned as t-words, based on their similarity values in this matrix. Other semantics learning models can be used in the same way [Deerwester et al.1990, Pennington et al.2014]. Probabilistic topic models [Blei et al.2003] can be used as well, by searching for the topic corresponding to the given target and returning its topical words. This stage is to some extent similar to (unsupervised) aspect extraction in aspect-level sentiment analysis [Liu2012], and many of its models can be used here. The main difference is that, in our setting, the target/aspect name is specified, so we do not have to perform a full extraction of all aspects covered by a corpus (e.g., by clustering in the word embedding space); we only need to focus on the given target by returning its nearest neighbors.
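A minimal sketch of this grouping step follows. It assumes word vectors have already been learned (a hypothetical toy dictionary stands in for skip-gram output), computes similarities as pairwise dot products of word vectors, and returns the top-k neighbours of the target. The function names, the 2-d vectors and the value of k are illustrative only, not taken from the paper.

```python
from typing import Dict, List

Vector = List[float]


def dot(u: Vector, v: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def similarity_matrix(vectors: Dict[str, Vector]) -> Dict[str, Dict[str, float]]:
    """Pairwise dot products (the entries of E @ E.T), keyed by word."""
    words = list(vectors)
    return {w: {v: dot(vectors[w], vectors[v]) for v in words} for w in words}


def t_words(target: str, sim: Dict[str, Dict[str, float]], k: int) -> List[str]:
    """Return the k words most similar to the target, excluding the target itself."""
    row = sim[target]
    others = [w for w in row if w != target]
    return sorted(others, key=lambda w: row[w], reverse=True)[:k]


# Hypothetical 2-d vectors; real skip-gram vectors would have d = 200.
vectors = {
    "screen": [1.0, 0.2],
    "display": [0.9, 0.1],
    "lcd": [0.8, 0.3],
    "scratched": [0.6, 0.7],
    "battery": [-0.5, 0.9],
    "shipping": [-0.9, -0.4],
}
sim = similarity_matrix(vectors)
print(t_words("screen", sim, k=3))  # ['display', 'lcd', 'scratched']
```

The returned t-words already mix aspect words ("display", "lcd") with an opinion word ("scratched"), which is the coupling that stage two has to undo.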

4 Stage Two: Disentangling

4.1 PU Learning using Word Vectors

This stage separates the given t-words into t-aspect words and t-opinion words. As discussed in Section 1, in order to provide a general approach without manual labeling for every domain, we formulate this disentangling problem as a binary classification task in a PU learning setting [Li and Liu2005].

Clearly, in addition to aspect and opinion words, a domain vocabulary also contains other words, such as general/background words. However, as indicated in [Mukherjee and Liu2012, Wang et al.2016b], these words have little negative effect, as they are unlikely to be semantically similar to a given target. Therefore, we assume that most of the non-opinion words semantically correlated with a given target are aspect words. This assumption holds well in practice, as shown in previous studies [Mukherjee and Liu2012, Wang et al.2016b] and in our experiments.

PU learning is a type of semi-supervised learning that learns a binary classifier using only positive and unlabeled examples (with no negative examples). Here $P$ represents the set of examples with positive labels. In our task, the opinion words from an opinion lexicon, such as “good”, “bad” and “angry”, form $P$. $U$ denotes the set of examples with unknown labels; in our case, the words that are not in the lexicon are in $U$. Note that $U$ in fact contains both true opinion words and non-opinion words. With word vectors as features and a set of general opinion words as positive labels, we can build a PU classifier. In our work, we use logistic regression for classification, as it generates a probabilistic score for a word being in the positive class (i.e., being an opinion word). In this way, words from $U$ with high prediction scores can be identified as new (likely) opinion words, and we can extract more words iteratively using the PU classifier.
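The sketch below illustrates this PU setup under simplifying assumptions: words in the unlabeled set U are treated as provisional negatives (a common PU baseline; the paper does not spell out its exact training scheme), a plain logistic regression is fit by batch gradient descent in pure Python, and the words in U are ranked by their predicted probability of being opinion words. The vectors, learning rate, epoch count and function names are hypothetical.

```python
import math
from typing import Dict, List, Set

Vector = List[float]


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logreg(X: List[Vector], y: List[int], lr: float = 0.5, epochs: int = 500):
    """Fit logistic regression weights and bias with batch gradient descent."""
    w, b, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(epochs):
        gw, gb = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            gw = [g + err * xj for g, xj in zip(gw, xi)]
            gb += err
        w = [wj - lr * g / n for wj, g in zip(w, gw)]
        b -= lr * gb / n
    return w, b


def pu_scores(vectors: Dict[str, Vector], positives: Set[str]) -> Dict[str, float]:
    """Train P (label 1) vs. U (label 0); return P(opinion) for every word in U."""
    words = sorted(vectors)
    unlabeled = [w for w in words if w not in positives]
    X = [vectors[w] for w in words]
    y = [1 if w in positives else 0 for w in words]
    w, b = train_logreg(X, y)
    return {u: sigmoid(sum(wj * xj for wj, xj in zip(w, vectors[u])) + b) for u in unlabeled}


vectors = {  # hypothetical 2-d word vectors
    "good": [1.0, 0.9], "bad": [0.9, 1.0],            # lexicon words (P)
    "blurry": [0.8, 0.9], "scratched": [0.9, 0.7],    # unlabeled opinion words
    "display": [-0.9, -0.8], "lcd": [-0.8, -0.9],     # unlabeled aspect words
}
scores = pu_scores(vectors, positives={"good", "bad"})
new_opinions = sorted(scores, key=scores.get, reverse=True)[:2]
print(sorted(new_opinions))  # ['blurry', 'scratched']
```

Even though "blurry" and "scratched" were labeled 0 during training, they score far above the aspect words because they lie close to the lexicon words in the vector space; taking the top-scoring words of U is what lets the classifier surface new opinion words.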

However, a notable issue in this PU setting is that errors from false positive (FP) examples (wrongly predicted opinion words) can propagate, degrading performance. To address this, we exploit the idea of lifelong machine learning [Thrun1998, Chen and Liu2016] and incorporate it into the PU learning process. The idea is to use classification knowledge from past domains to increase the correctness, or reliability, of the newly found opinion words.

4.2 Lifelong Machine Learning

Lifelong machine learning [Thrun1998, Chen and Liu2016], or lifelong learning for short, retains the knowledge learned from past tasks and uses it to help future learning, i.e., the current or upcoming task. It mimics how humans learn. With regard to sentiment analysis, humans learn many opinion expressions over their lives across different domains/areas, which enables them to better understand and identify opinion words in a new domain.

In a similar way, our system retains the newly learned opinion words each time it finishes processing a domain (a task), treating them as knowledge and accumulating them in a knowledge base. The system accumulates such knowledge continuously from domain to domain. So whenever it has processed $N$ domains and starts to process the $(N+1)$th domain, the accumulated knowledge is used to help generate more reliable opinion words suited to the $(N+1)$th domain. Based on this general idea, we develop a lifelong PU (LPU) learning algorithm.

4.3 Lifelong PU learning (LPU)

Our proposed LPU algorithm consists of four main steps: knowledge accumulation, current domain setup, knowledge mining and preparation, and restricted PU iterations. The overall algorithm is given in Algorithm 1 (Alg. 1).

Step 1: Knowledge Accumulation (lines 1-8). This step follows the traditional classification process, but with knowledge retention for building a knowledge base from past domains. Specifically, for each past domain (task), we first obtain its vocabulary and the semantic representations of its words (line 3). With a general opinion lexicon, we then obtain the lexicon-based opinion words, i.e., positive examples $P$, and the unlabeled examples $U$ (line 4). A PU classifier is trained (line 5) and used to predict the probabilistic class scores of the words in $U$ to find new opinion words (line 6). After that, we retain these new opinion words as knowledge in a sentiment knowledge base (line 7).

Notice that in practice, we do not need to repeat this step every time we start a new domain/task. Instead, this step is performed naturally and continuously as domains are processed, from task 1 to task $N$. Because we keep retaining their results, when a new domain (task $N+1$) arrives, the knowledge base is already constructed and ready for use.

Step 2: Current Domain Setup (lines 9-13). This step sets up the processing of the current domain. The vocabulary words and their semantic representations, the lexicon-based opinion words (positive examples) and the unlabeled words of the current domain are first obtained (lines 10-11). Then we build a hash table storing the nearest neighbors of all words (simply using the top 10 neighbors works consistently well for different domains in our experiments), which can be easily constructed from the similarity matrix (see Section 3). Once the table is established (line 12), which is a one-time effort, each similarity query becomes a lookup operation. This helps not only in the current step 2 but also in step 4, as we will see shortly. Using the hash table, we find the nearest neighbors of the lexicon-based opinion words and call them reliable neighbors (line 13). This is an initial constraint, and it is intuitive: candidate/unlabeled words similar to opinion words from the lexicon are more likely to be reliable opinion words.
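As a rough illustration of step 2, the sketch below builds the neighbor hash table once (a plain dictionary mapping each word to the set of its k most similar words) and then derives the reliable neighbors as the unlabeled words that appear among the neighbors of some lexicon opinion word. The vectors are hypothetical, k = 1 is used only to keep the toy example small, and the function names are our own.

```python
from typing import Dict, List, Set

Vector = List[float]


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def build_neighbor_table(vectors: Dict[str, Vector], k: int) -> Dict[str, Set[str]]:
    """One-time pass: map every word to the set of its k most similar words."""
    table = {}
    for w, vec in vectors.items():
        others = [o for o in vectors if o != w]
        others.sort(key=lambda o: dot(vec, vectors[o]), reverse=True)
        table[w] = set(others[:k])
    return table


def reliable_neighbors(table: Dict[str, Set[str]], positives: Set[str],
                       unlabeled: Set[str]) -> Set[str]:
    """Unlabeled words that are neighbors of at least one lexicon opinion word."""
    found: Set[str] = set()
    for p in positives:
        found |= table.get(p, set()) & unlabeled  # a dictionary lookup, no recomputation
    return found


vectors = {  # hypothetical toy vectors
    "good": [1.0, 0.9], "blurry": [0.9, 0.8],
    "display": [-0.9, -0.8], "lcd": [-0.8, -0.9],
}
table = build_neighbor_table(vectors, k=1)  # the text reports top 10 works well
print(reliable_neighbors(table, {"good"}, {"blurry", "display", "lcd"}))  # {'blurry'}
```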

Step 3: Knowledge Mining and Preparation (lines 14-19). This step mines knowledge and prepares it for later use. From the knowledge accumulated over many past domains and stored in the knowledge base, we extract the reliable knowledge (line 15). Here we adopt the data mining technique of frequent itemset mining (FIM) [Agrawal et al.1994], because a candidate word that frequently appears as a predicted opinion word in many different domains is naturally more trustworthy. The intersection of the reliable neighbors and the reliable knowledge initializes the newly learned sentiment (line 19). Lines 16-18 define the other variables used in step 4: the sentiment knowledge for the current domain, the reliable learned sentiment (opinion words) during the PU learning iterations, and the newly predicted opinion words in an ongoing iteration.
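For single words, frequent itemset mining reduces to counting in how many past domains a word was predicted as an opinion word and keeping the words whose count reaches the minimum support. The sketch below shows only this 1-itemset case (an assumption; an Apriori-style miner would also find larger itemsets), followed by the intersection that initializes the newly learned sentiment. The knowledge-base contents and function name are hypothetical.

```python
from collections import Counter
from typing import Iterable, Set


def mine_reliable_knowledge(knowledge_base: Iterable[Set[str]], min_support: int) -> Set[str]:
    """Words predicted as opinion words in at least `min_support` past domains."""
    support: Counter = Counter()
    for domain_words in knowledge_base:
        support.update(set(domain_words))  # count each word at most once per domain
    return {w for w, c in support.items() if c >= min_support}


# Each set holds the opinion words newly predicted in one past domain (hypothetical).
knowledge_base = [
    {"blurry", "slow"},
    {"blurry", "cheap"},
    {"blurry", "slow", "display"},
]
reliable_knowledge = mine_reliable_knowledge(knowledge_base, min_support=2)
reliable_nbrs = {"blurry", "display"}  # output of step 2 for the current domain
initial_new_sentiment = reliable_nbrs & reliable_knowledge
print(sorted(reliable_knowledge), initial_new_sentiment)  # ['blurry', 'slow'] {'blurry'}
```

The word "display", wrongly predicted as an opinion word in one past domain, is filtered out by the support threshold, which is the intended effect of mining across domains.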

Step 4: Restricted PU Iterations (lines 21-31). This step performs PU learning iterations with constraints. Unlike unconstrained self-bootstrapping, LPU controls how newly predicted opinion words are added as positive examples, and only the reliable ones are used further. The initial new opinion words have already been restricted (see steps 2 and 3) and are used here as the initial reliable sentiment. During iterative learning, the reliable sentiment keeps being updated (line 23) by adding only reliable opinion words (line 27). We develop two ways of finding new reliable opinion words: one learns from the reliable knowledge (line 25), and the other learns from the classifier's own predictions (line 26). Both ways of expansion are restricted by the reliability score defined in Algorithm 2 (Alg. 2). This score is the number of identified positive neighbors of a candidate word, and it is also used for ranking. In Alg. 2, A denotes the candidate word set, and the positive examples are recorded in a second set. The identified positive neighbors of a candidate word are the intersection of the positive examples and the candidate's neighbors (provided by the hash table). In each iteration, only the top-ranked words are trusted and added as new positive examples. When the maximum number of iterations is reached or the system can learn no more new opinion words, the iterative learning process stops and all newly detected opinion words are returned (line 31).

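Input: Current domain corpus
       Past domain corpora
       Opinion words in lexicon
       Maximum learning iteration
       Number of learned words in one iteration
Output: All newly-extracted opinion words in the current domain
1:  // Step 1. Knowledge Accumulation
2:  for each past domain corpus do
3:     obtain the vocabulary and word vectors of the domain
4:     obtain the lexicon-based opinion words P and the unlabeled words U
5:     train a PU classifier on P and U
6:     predict new opinion words from U
7:     add the new opinion words to the sentiment knowledge base
8:  end for
9:  // Step 2. Current Domain Setup
10:  obtain the vocabulary and word vectors of the current domain
11:  obtain its lexicon-based opinion words P and unlabeled words U
12:  Create a hash table to store the top neighbors of all words
13:  find the reliable neighbors of the words in P
14:  // Step 3. Knowledge Mining and Preparation
15:  mine the reliable knowledge from the knowledge base with FIM
16:  // senti-knowledge for the current domain
17:  // reliable learned opinion words
18:  // current positive prediction (opinion words)
19:  newly learned sentiment ← reliable neighbors ∩ reliable knowledge
20:  // Step 4. Restricted PU Iterations
22:  while the maximum iteration is not reached and new opinion words are found do
23:     // update the reliable sentiment
25:     expand from the reliable knowledge (restricted by Alg. 2)
26:     expand from the self-predicted opinion words (restricted by Alg. 2)
27:     add only the top-ranked reliable opinion words
30:  end while
31:  return all newly learned opinion words
Algorithm 1 Lifelong PU (LPU) Learning
1:  // counts positive neighbors for every word in A
2:  for each word a in A do
3:     count the positive examples among the neighbors of a (from the hash table)
4:  end for
5:  return the counts
Algorithm 2 Reliability Score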
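The following sketch puts Alg. 2 and the step-4 loop together. It is a simplified reading of the description above, not the authors' code: the classifier is abstracted as a hypothetical `predict` callable that returns the words it currently labels as opinion words, a candidate must have at least one positive neighbor to be accepted (our assumption), and ties are broken alphabetically.

```python
from typing import Callable, Dict, Set

Table = Dict[str, Set[str]]


def reliability_scores(candidates: Set[str], positives: Set[str], table: Table) -> Dict[str, int]:
    """Alg. 2: number of positive examples among each candidate's neighbors."""
    return {a: len(table.get(a, set()) & positives) for a in candidates}


def top_reliable(candidates: Set[str], positives: Set[str], table: Table, k: int) -> Set[str]:
    """Keep at most k candidates, ranked by reliability score (score must be > 0)."""
    scores = reliability_scores(candidates, positives, table)
    ranked = sorted((a for a in candidates if scores[a] > 0), key=lambda a: (-scores[a], a))
    return set(ranked[:k])


def restricted_pu(lexicon: Set[str], initial_new: Set[str], knowledge: Set[str],
                  predict: Callable[[Set[str]], Set[str]], table: Table,
                  k: int, max_iter: int) -> Set[str]:
    reliable = set(initial_new)
    for _ in range(max_iter):
        positives = lexicon | reliable
        # Expansion 1: reliable knowledge mined from past domains.
        from_knowledge = top_reliable(knowledge - positives, positives, table, k)
        # Expansion 2: the classifier's own predictions, restricted the same way.
        from_self = top_reliable(predict(positives) - positives, positives, table, k)
        new_words = from_knowledge | from_self
        if not new_words:  # nothing more to learn: stop early
            break
        reliable |= new_words
    return reliable


table = {  # hypothetical neighbor table from step 2
    "blurry": {"good", "fuzzy"}, "fuzzy": {"blurry", "bad"},
    "dim": {"fuzzy", "display"}, "display": {"lcd", "screen"},
}
result = restricted_pu(
    lexicon={"good", "bad"}, initial_new=set(), knowledge={"blurry"},
    predict=lambda pos: {"fuzzy", "dim", "display"},  # stand-in classifier with one FP
    table=table, k=1, max_iter=10,
)
print(sorted(result))  # ['blurry', 'dim', 'fuzzy']
```

In this toy run, "dim" is only accepted in the second iteration, after "fuzzy" has become a positive example, while the false positive "display" never gains a positive neighbor and is never added, which is the kind of error containment the restricted iterations aim for.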

5 Experiments

5.1 Candidate Methods for Comparison

Adjective Extraction (ADJ): This baseline simply regards all adjective words as opinion words. This is a simple but widely used solution. We performed POS tagging and extracted all adjectives. No classifier is used for ADJ.
Part-Of-Speech (POS): POS features have been shown to be very effective for aspect and opinion extraction tasks. This is a representative syntax-based approach used in many related works [Mukherjee and Liu2012, Wang et al.2016b]. Here every word is represented by the POS tags of its context, i.e., the tags in a small window around the word, and this representation is used for building a classifier.
Latent Semantic Indexing (LSI): LSI is a standard matrix factorization technique to construct latent semantic vectors/features. Its factorized word-feature correlation matrix can be used as word vector representation [Pu et al.2015].
Latent Dirichlet Allocation (LDA): LDA [Blei et al.2003] is a classic topic model which discovers hidden topics from documents and groups words into topics. Similar to LSI, the term-topic matrix is used as the word vector representation to build a classifier [Maas et al.2011].
Non-Lifelong Learning (NLL): This baseline follows our approach but with no lifelong learning. It uses the word vectors learned by neural word embeddings to build a classifier.
Lifelong PU (LPU): This is our proposed lifelong PU learning algorithm introduced in Alg. 1.
Lifelong PU minor (LPU-): This is an LPU variant that does not perform risky self-prediction exploration and relies more on the mined past knowledge. In other words, it uses only the first way of expansion (learning from the reliable knowledge, line 25 in Alg. 1) and not the second (learning from self-predictions, line 26). It can be viewed as a conservative version of LPU.

5.2 Experimental Setup

Data. We use a large corpus of Amazon reviews from 20 different domains provided by [McAuley et al.2015] (http://jmcauley.ucsd.edu/data/amazon/); the full list is shown in Table 1. For training (all PU classifiers), a general opinion lexicon (http://www.cs.uic.edu/liub/FBS/sentiment-analysis.html) is used, so the words appearing in it are automatically labeled as P. For testing/evaluation, we manually label the aspect and opinion words. Specifically, three domains from different product categories are selected, namely cellphone, beauty and office. For each domain, three targets are specified (see Table 1). In Table 1, the vocabulary size is the number of words remaining after filtering out words that occur fewer than 5 times. We did not do any further preprocessing such as stemming or lemmatization, as opinion words are also related to their grammatical forms.

Dataset Name | Number of Reviews | Vocabulary Size | Words in Lexicon | Targets for Evaluation
CellPhone    | 194,439           | 28,942          | 2,764            | display, volumes, weight
Beauty       | 198,502           | 29,695          | 2,778            | cleansers, fragrance, groomers
Office       | 53,258            | 20,858          | 2,332            | papers, clips, chairs
Full domain list: apps for android, amazon instant video, automotive, baby, beauty, cd, cellphone, cloth, digital music, electronics, grocery, health, kindle, tools/home improvement, home and kitchen, office product, pet supplies, sport, toy, video game

Table 1: Detailed information about the three domains for evaluation and the full domain list.

Parameters and Settings. For every candidate method except ADJ, word vectors/features are learned and used for classification. Specifically, for LSI and LDA, we obtained the term-feature matrix and the term-topic matrix using the gensim package (https://radimrehurek.com/gensim/). For NLL, LPU- and LPU, we used the skip-gram model [Mikolov and Dean2013]. The vector dimension is set to 200 by default, and we use the same size for LDA and LSI. Logistic regression is used as the classifier in all methods. For LPU, we treat the other 19 domains as the past domains from which to mine knowledge. Notice that for the current domain, only its own data and the automatically accumulated knowledge are used, and no extra domain data is available, which follows the lifelong learning experimental setting of existing works [Chen et al.2015, Wang et al.2016b]. We empirically set the minimum support for frequent opinion word mining to 5. We set the maximum number of iterations to 10 and the number of words learned in each iteration to 50. A larger number of words per iteration makes learning faster, as more words are considered in one iteration, whereas a smaller number makes learning slower. We show the effect of iterative learning in Section 5.5.
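For reference, the settings reported for LPU can be gathered into a single configuration record. The sketch below is only a convenience: the field names are hypothetical, while the values are those stated in this subsection, in the data description and in the neighbor-count remark of step 2.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class LPUSettings:
    """Experimental settings described in the text (field names are hypothetical)."""

    vector_dim: int = 200          # skip-gram vector size (same size used for LSI/LDA)
    min_word_count: int = 5        # words occurring fewer times are filtered out
    num_neighbors: int = 10        # neighbors stored per word in the hash table
    num_past_domains: int = 19     # every other domain serves as a past domain
    min_support: int = 5           # minimum support for frequent opinion word mining
    max_iterations: int = 10       # cap on restricted PU iterations
    words_per_iteration: int = 50  # top-ranked words trusted per iteration


default = LPUSettings()
# A larger per-iteration budget makes learning faster (more words per iteration).
faster = replace(default, words_per_iteration=100)
print(default)
print(faster.words_per_iteration, faster.max_iterations)  # 100 10
```

Freezing the record keeps a run's settings immutable; `dataclasses.replace` creates a modified copy for an ablation without touching the defaults.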

5.3 Quantitative Evaluation

We use accuracy as the metric, as our task is a binary classification problem and the distribution of aspect and opinion words is nearly balanced (close to 3:2 in our labeled data). However, as it is hard to know the exact number of all words related to a given target (t-words), we use accuracy@n (acc@n) as our evaluation measure, where n is set to 50, 100 and 150. Given a target, we first obtain its t-words (nearest neighbors in the semantic space) and then manually label the top n words as opinion or aspect (non-opinion) words. With the annotations obtained, we apply every trained candidate model to classify these top n words and compute its acc@n.
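A small sketch of acc@n as described: take the top n t-words in similarity order, compare each model label (opinion or not) with the manual label, and report the fraction that agree. The word lists, gold labels and predictions below are hypothetical.

```python
from typing import Dict, List, Set


def acc_at_n(ranked_t_words: List[str], gold: Dict[str, str],
             predicted_opinions: Set[str], n: int) -> float:
    """Fraction of the top-n t-words whose predicted label matches the gold label."""
    top = ranked_t_words[:n]
    correct = sum(
        (w in predicted_opinions) == (gold[w] == "opinion") for w in top
    )
    return correct / len(top)


ranked = ["display", "lcd", "scratched", "blurry"]  # hypothetical t-words, by similarity
gold = {"display": "aspect", "lcd": "aspect", "scratched": "opinion", "blurry": "opinion"}
predicted = {"scratched", "lcd"}  # a hypothetical model's opinion predictions
print(acc_at_n(ranked, gold, predicted, n=4))  # 0.5
```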

Figure 2: Acc@n for all models and targets: (a) acc@150; (b) acc@100; (c) acc@50.

The results are reported in Figure 2(a), 2(b) and 2(c). Based on them, we make the following observations:

  1. LPU and LPU- markedly outperform all baselines. LPU improves on the best baseline results by 8.29%, 7.11% and 4.00% in acc@150, acc@100 and acc@50. Likewise, LPU- improves on the best baseline results by 7.55%, 6.44% and 5.77% in acc@150, acc@100 and acc@50. These results demonstrate the effectiveness of lifelong learning.

  2. LPU performs better than LPU- in acc@150 and acc@100 but is inferior to LPU- in acc@50. This indicates that LPU is more accurate for a larger n (more t-words), while LPU- could be more suitable if we only focus on the top-ranked words.

  3. Among the other baselines, NLL and POS perform the best. While POS explicitly reflects contextual syntax, it is worth noting that the neural word embeddings used in NLL are also implicitly learned from a word-context matrix [Levy et al.2015]. This implies that syntactic information is very useful for word separation.

Target: Volumes (Domain: CellPhone)
Model | Aspect                                   | Opinion
LPU   | volumes, bass, undistorted, volume       | muddiness, shrill, trebles, distortion
      | Gaga, pitches, sound, harshness          | hissy, loud, sibilance, thumping
      | cymbals, treble, eq, conf                | thump, highs, soundstage, midrange
LPU-  | bass, volume, Gaga, pitches              | volumes, muddiness, undistorted, shrill
      | sound, cymbals, treble, conf             | trebles, distortion, hissy, loud
      | Mids, reproduces, bitrate, Highs         | sibilance, harshness, thumping, eq
NLL   | volumes, muddiness, bass, undistorted    | shrill, distortion, hissy, loud
      | trebles, volume, Gaga, pitches           | sibilance, midrange, equalization, tinny
      | sound, harshness, cymbals, treble        | distorted, piercing, muddy, louder
POS   | volumes, muddiness, trebles, Gaga        | bass, undistorted, shrill, distortion
      | pitches, sibilance, cymbals, Mids        | hissy, volume, loud, sound
      | LiveAudio, highs, reproducing, Highs     | harshness, treble, thumping, eq
ADJ   | volumes, muddiness, bass, trebles        | undistorted, shrill, distortion, hissy
      | volume, Gaga, pitches, sound             | loud, treble, loudest, tinny
      | sibilance, harshness, cymbals, thumping  | distorted, Treble, resonant, muddy
Table 2: Results for target volumes in domain cellphone. Incorrect sentiment words are italicized and marked in red.

5.4 Qualitative Evaluation

This subsection shows some example results in Table 2. Since LSI and LDA perform much worse than the others, we do not include their results here. The listed words are the top predicted aspect words and top predicted opinion words among the t-words of a given target. Incorrect opinion words are italicized and marked in red. As we can see, LPU and LPU- better distinguish aspect and opinion words. For example, the opinion word “muddiness” is extracted by them but not by the other models. POS identifies many wrong opinion words like “sound” and “volume”. Although ADJ is good at extracting adjective opinion words, it misses other opinion words like “muddiness”, “sibilance” and “thumping”. Like ADJ, NLL also misses many opinion words.

Topic: Skin (Domain: Beauty)
Aspect: face, skin, use, acne (-b), using, just (-b), wash, feel (-b), make, day, product (-b), lotion
Newly-Identified Opinion: dry (-c,d), rid (-c,d), oily (-c,d), drying (-a,c,d), mild (-c,d), notice (-c,d), huge (-d), new (-c,d), ok (-c,d), younger (-c)
Table 3: Topic about skin in domain beauty. The “-” symbol indicates that the models following it do not identify the word.
Topic: Headset (Domain: CellPhone)
Aspect: headset, sound, quality (-b), bluetooth, adapter, hear (-b), ear, volume, headsets, way (-b)
Newly-Identified Opinion: really (-c,d), long (-a), low (-d), away (-c,d), high (-d), short, quite (-c,d), ok, idea (-c,d), close (-c,d)
Table 4: Topic about headset in domain cellphone. The “-” symbol indicates that the models following it do not identify the word.

5.5 Further Analysis

To further evaluate the generality and extensibility of our approach, we applied it to another popular aspect extraction technique, topic modeling. Specifically, we ran LDA [Blei et al.2003] to generate topics and then used our algorithm to separate aspect words and opinion words. It also produces reasonably good results, as shown in Tables 3 and 4. The notation “-(a, b, c, d)” is explained below.

We also investigated the effect of alleviating FP error propagation in LPU. We denote three iterative-learning models as (a), (b) and (c); each is trained for 10 iterations, treating its newly identified opinion words as positive examples: (a) LPU, using Alg. 1; (b) a PU model that uses its predicted positive examples in the current iteration as P for the next iteration; (c) a PU model that always combines the newly predicted positive examples with the initial lexicon-based positive examples, without the constraints in LPU. For comparison, we also denote NLL as model (d), which does not learn iteratively.
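The three iterative models differ mainly in how the positive set P is rebuilt before each iteration. The function below is our reading of the descriptions (a), (b) and (c), written as a hypothetical helper; `reliable` stands for the words LPU has accepted under its constraints, and model (d) is omitted because it does not iterate.

```python
from typing import Set


def next_positive_set(model: str, lexicon: Set[str], predicted: Set[str],
                      reliable: Set[str]) -> Set[str]:
    """Positive examples used in the next iteration by the compared models.

    predicted: words the classifier labels as opinion words in this iteration.
    reliable:  words accepted so far under LPU's constraints (model a only).
    """
    if model == "a":  # LPU: lexicon plus constrained, reliable words only
        return lexicon | reliable
    if model == "b":  # this iteration's predictions replace P entirely
        return set(predicted)
    if model == "c":  # lexicon plus all predictions, without constraints
        return lexicon | predicted
    raise ValueError(f"unknown model: {model!r}")


lexicon = {"good", "bad"}
predicted = {"good", "blurry", "display"}  # "display" is a false positive
reliable = {"blurry"}
for m in "abc":
    print(m, sorted(next_positive_set(m, lexicon, predicted, reliable)))
```

Under this reading, the false positive "display" enters P for models (b) and (c) but not for LPU, and model (b) also drops the lexicon word "bad", which matches the error-propagation behaviour discussed for Tables 3 and 4.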

We now take a closer look at Tables 3 and 4, where a “-” marks the models that do not identify the word. Note that the opinion words from the lexicon are excluded here, so that we can see how the models perform in classifying the unlabeled words. We observe that: (1) like model (d), model (c) misses many interesting opinion words, which indicates that its positive examples remain very similar across iterations, i.e., it learns few new positive examples; (2) model (b) misclassifies many aspect words as opinion words because its FP errors propagate across iterations, i.e., the model is confused by the newly added false-positive examples; (3) model (a), i.e., LPU, performs robustly.

Figure 3 reports the effect of iterative learning in LPU quantitatively. The red line shows the average acc@150 score on our labeled data. The results show that LPU is both effective and stable.
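The acc@150 metric is not defined in this excerpt. As a hedged sketch, assuming acc@K is the fraction of a model's top-K ranked predicted words that are correct (a precision-at-K style score) and that the reported value is averaged over targets, it could be computed as follows; the function names and the averaging scheme are illustrative assumptions, not the paper's definition.

```python
from typing import Dict, List, Set


def acc_at_k(ranked: List[str], gold: Set[str], k: int = 150) -> float:
    """Fraction of the top-k ranked predicted words that appear in the gold set.
    Assumed definition: if fewer than k words are ranked, divide by what is there."""
    top = ranked[:k]
    if not top:
        return 0.0
    return sum(word in gold for word in top) / len(top)


def mean_acc_at_k(rankings: Dict[str, List[str]],
                  gold: Dict[str, Set[str]], k: int = 150) -> float:
    """Average acc@k over all targets (assumed averaging scheme)."""
    if not rankings:
        return 0.0
    scores = [acc_at_k(rankings[t], gold.get(t, set()), k) for t in rankings]
    return sum(scores) / len(scores)
```

Evaluating such a score after each iteration is one way to trace a learning curve of the kind plotted in Figure 3.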

Figure 3: Iterative learning of LPU

6 Conclusion

This paper discussed the problem of disentangling t-opinion words and t-aspect words from the grouped t-words for fine-grained target-based sentiment analysis (FTSA). We formulated this problem in a PU learning setting and incorporated the lifelong learning idea to alleviate error propagation in PU learning. To this end, we proposed a novel lifelong PU learning (LPU) model. Experimental results on real-world data demonstrated its effectiveness.

References

  • [Agrawal et al.1994] Rakesh Agrawal, Ramakrishnan Srikant, et al. Fast algorithms for mining association rules. In VLDB, volume 1215, pages 487–499, 1994.
  • [Blei et al.2003] David M Blei, Andrew Y Ng, and Michael I Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan):993–1022, 2003.
  • [Chen and Liu2016] Zhiyuan Chen and Bing Liu. Lifelong machine learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, 10(3):1–145, 2016.
  • [Chen et al.2015] Zhiyuan Chen, Nianzu Ma, and Bing Liu. Lifelong learning for sentiment classification. In ACL, page 750, 2015.
  • [Collobert et al.2011] Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. Natural language processing (almost) from scratch. JMLR, 12(Aug):2493–2537, 2011.
  • [Deerwester et al.1990] Scott Deerwester, Susan T Dumais, George W Furnas, Thomas K Landauer, and Richard Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391, 1990.
  • [Harris1954] Zellig S Harris. Distributional structure. Word, 10(2-3):146–162, 1954.
  • [Hu and Liu2004] Minqing Hu and Bing Liu. Mining and summarizing customer reviews. In KDD, pages 168–177. ACM, 2004.
  • [Irsoy and Cardie2014] Ozan Irsoy and Claire Cardie. Opinion mining with deep recurrent neural networks. In EMNLP, pages 720–728, 2014.
  • [Jiang et al.2011] Long Jiang, Mo Yu, Ming Zhou, Xiaohua Liu, and Tiejun Zhao. Target-dependent twitter sentiment classification. In ACL, pages 151–160, 2011.
  • [Kiela et al.2015] Douwe Kiela, Felix Hill, and Stephen Clark. Specializing word embeddings for similarity or relatedness. In EMNLP, pages 2044–2048, 2015.
  • [Levy et al.2015] Omer Levy, Yoav Goldberg, and Ido Dagan. Improving distributional similarity with lessons learned from word embeddings. Transactions of the Association for Computational Linguistics, 3:211–225, 2015.
  • [Li and Liu2005] Xiao-Li Li and Bing Liu. Learning from positive and unlabeled examples with different data distributions. In ECML, pages 218–229. Springer, 2005.
  • [Liu et al.2016] Qian Liu, Bing Liu, Yuanlin Zhang, Doo Soon Kim, and Zhiqiang Gao. Improving opinion aspect extraction using semantic similarity and aspect associations. In AAAI, 2016.
  • [Liu2012] Bing Liu. Sentiment analysis and opinion mining. Synthesis Lectures on Human Language Technologies, 5(1):1–167, 2012.
  • [Maas et al.2011] Andrew L Maas, Raymond E Daly, Peter T Pham, Dan Huang, Andrew Y Ng, and Christopher Potts. Learning word vectors for sentiment analysis. In ACL, pages 142–150, 2011.
  • [McAuley et al.2015] Julian McAuley, Rahul Pandey, and Jure Leskovec. Inferring networks of substitutable and complementary products. In KDD, pages 785–794. ACM, 2015.
  • [Mikolov and Dean2013] T Mikolov and J Dean. Distributed representations of words and phrases and their compositionality. In NIPS, 2013.
  • [Mukherjee and Liu2012] Arjun Mukherjee and Bing Liu. Aspect extraction through semi-supervised modeling. In ACL, pages 339–348. ACM, 2012.
  • [Pennington et al.2014] Jeffrey Pennington, Richard Socher, and Christopher D Manning. GloVe: Global vectors for word representation. In EMNLP, volume 14, pages 1532–1543, 2014.
  • [Pu et al.2015] Xiaojia Pu, Rong Jin, Gangshan Wu, Dingyi Han, and Gui-Rong Xue. Topic modeling in semantic space with keywords. In CIKM, pages 1141–1150. ACM, 2015.
  • [Qiu et al.2009] Guang Qiu, Bing Liu, Jiajun Bu, and Chun Chen. Expanding domain sentiment lexicon through double propagation. In IJCAI, volume 9, pages 1199–1204, 2009.
  • [Thrun1998] Sebastian Thrun. Lifelong learning algorithms. In Learning to learn, pages 181–209. Springer, 1998.
  • [Tixier et al.2016] Antoine J-P Tixier, Michalis Vazirgiannis, and Matthew R Hallowell. Word embeddings for the construction domain. arXiv preprint arXiv:1610.09333, 2016.
  • [Turian et al.2010] Joseph Turian, Lev Ratinov, and Yoshua Bengio. Word representations: a simple and general method for semi-supervised learning. In ACL, pages 384–394, 2010.
  • [Tversky1977] Amos Tversky. Features of similarity. Psychological review, 84(4):327, 1977.
  • [Vo and Zhang2015] Duy-Tin Vo and Yue Zhang. Target-dependent twitter sentiment classification with rich automatic features. In IJCAI, pages 1347–1353, 2015.
  • [Wang et al.2016a] Shuai Wang, Zhiyuan Chen, Geli Fei, Bing Liu, and Sherry Emery. Targeted topic modeling for focused analysis. In KDD, pages 1235–1244. ACM, 2016.
  • [Wang et al.2016b] Shuai Wang, Zhiyuan Chen, and Bing Liu. Mining aspect-specific opinion using a holistic lifelong topic model. In WWW, pages 167–176, 2016.
  • [Wang et al.2017] Wenya Wang, Sinno Jialin Pan, Daniel Dahlmeier, and Xiaokui Xiao. Coupled multi-layer attentions for co-extraction of aspect and opinion terms. In AAAI, pages 3316–3322, 2017.
  • [Xu et al.2013] Liheng Xu, Kang Liu, Siwei Lai, Yubo Chen, and Jun Zhao. Mining opinion words and opinion targets in a two-stage framework. In ACL, volume 1, pages 1764–1773, 2013.