Summarizing Opinions: Aspect Extraction Meets Sentiment Prediction and They Are Both Weakly Supervised

Stefanos Angelidis et al. · August 27, 2018

We present a neural framework for opinion summarization from online product reviews which is knowledge-lean and only requires light supervision (e.g., in the form of product domain labels and user-provided ratings). Our method combines two weakly supervised components to identify salient opinions and form extractive summaries from multiple reviews: an aspect extractor trained under a multi-task objective, and a sentiment predictor based on multiple instance learning. We introduce an opinion summarization dataset that includes a training set of product reviews from six diverse domains and human-annotated development and test sets with gold standard aspect annotations, salience labels, and opinion summaries. Automatic evaluation shows significant improvements over baselines, and a large-scale study indicates that our opinion summaries are preferred by human judges according to multiple criteria.


1 Introduction

Opinion summarization, i.e., the aggregation of user opinions as expressed in online reviews, blogs, internet forums, or social media, has drawn much attention in recent years due to its potential for various information access applications. For example, consumers have to wade through many product reviews in order to make an informed decision. The ability to summarize these reviews succinctly would allow customers to efficiently absorb large amounts of opinionated text and manufacturers to keep track of what customers think about their products Liu (2012).

Figure 1: Aspect-based opinion summarization. Opinions on image quality, sound quality, connectivity, and price of an LCD television are extracted from a set of reviews. Their polarities are then used to sort them into positive and negative, while neutral or redundant comments are discarded.

The majority of work on opinion summarization is entity-centric, aiming to create summaries from text collections that are relevant to a particular entity of interest, e.g., product, person, company, and so on. A popular decomposition of the problem involves three subtasks Hu and Liu (2004, 2006): (1) aspect extraction which aims to find specific features pertaining to the entity of interest (e.g., battery life, sound quality, ease of use) and identify expressions that discuss them; (2) sentiment prediction which determines the sentiment orientation (positive or negative) on the aspects found in the first step, and (3) summary generation which presents the identified opinions to the user (see Figure 1 for an illustration of the task).

A number of techniques have been proposed for aspect discovery, using part-of-speech tagging Hu and Liu (2004), syntactic parsing Lu et al. (2009), clustering Mei et al. (2007); Titov and McDonald (2008b), data mining Ku et al. (2006), and information extraction Popescu and Etzioni (2005). Various lexicon- and rule-based methods Hu and Liu (2004); Ku et al. (2006); Blair-Goldensohn et al. (2008) have been adopted for sentiment prediction, together with a few learning approaches Lu et al. (2009); Pappas and Popescu-Belis (2017); Angelidis and Lapata (2018). As for the summaries, a common format involves a list of aspects and the number of positive and negative opinions for each Hu and Liu (2004). While this format gives an overall idea of people's opinions, reading the actual text might be necessary to gain a better understanding of specific details. Textual summaries are mostly created with extractive methods (but see Ganesan et al. 2010 for an abstractive approach), in formats ranging from lists of words Popescu and Etzioni (2005), to phrases Lu et al. (2009), and sentences Mei et al. (2007); Blair-Goldensohn et al. (2008); Lerman et al. (2009); Wang and Ling (2016).

In this paper, we present a neural framework for opinion extraction from product reviews. We follow the standard architecture for aspect-based summarization, while taking advantage of the success of neural network models in learning continuous features without recourse to preprocessing tools or linguistic annotations. Central to our system is the ability to accurately identify aspect-specific opinions by using different sources of information freely available with product reviews (product domain labels, user ratings) and minimal domain knowledge (essentially a few aspect-denoting keywords). We incorporate these ideas into a recently proposed aspect discovery model He et al. (2017), which we combine with a weakly supervised sentiment predictor Angelidis and Lapata (2018) to identify highly salient opinions. Our system outputs extractive summaries using a greedy algorithm to minimize redundancy. Our approach relies on weak supervision signals only, requires minimal human intervention, and needs no gold-standard salience labels or summaries for training.

Our contributions in this work are three-fold: (1) a novel neural framework for the identification and extraction of salient customer opinions that combines aspect and sentiment information and does not require unrealistic amounts of supervision; (2) the introduction of an opinion summarization dataset which consists of Amazon reviews from six product domains, and includes development and test sets with gold standard aspect annotations, salience labels, and multi-document extractive summaries; and (3) a large-scale user study on the quality of the final summaries, paired with automatic evaluations for each stage in the summarization pipeline (aspects, extraction accuracy, final summaries). Experimental results demonstrate that our approach outperforms strong baselines in terms of opinion extraction accuracy and similarity to gold standard summaries. Human evaluation further shows that our summaries are preferred over comparison systems across multiple criteria.

2 Related Work

It is outside the scope of this paper to provide a detailed treatment of the vast literature on opinion summarization and related tasks. For a comprehensive overview of non-neural methods we refer the interested reader to Kim et al. (2011) and Liu and Zhang (2012). We are not aware of previous studies which propose a neural-based system for end-to-end opinion summarization without direct supervision, although as we discuss below, recent efforts tackle various subtasks independently.

Aspect Extraction

Several neural network models have been developed for the identification of aspects (e.g., words or phrases) expressed in opinions. This is commonly viewed as a supervised sequence labeling task; Liu et al. (2015) employ recurrent neural networks, whereas Yin et al. (2016) use dependency-based embeddings as features in a Conditional Random Field (CRF). Wang et al. (2016) combine a recursive neural network with CRFs to jointly model aspect and sentiment terms. He et al. (2017) propose an aspect-based autoencoder to discover fine-grained aspects without supervision, in a process similar to topic modeling. Their model outperforms LDA-style approaches and forms the basis of our aspect extractor.

Sentiment Prediction

Fully-supervised approaches based on neural networks have achieved impressive results on fine-grained sentiment classification Kim (2014); Socher et al. (2013). More recently, Multiple Instance Learning (MIL) models have been proposed that use freely available review ratings to train segment-level predictors. Kotzias et al. (2015) and Pappas and Popescu-Belis (2017) train sentence-level predictors under a MIL objective, while our previous work (Angelidis and Lapata, 2018) introduced MilNet, a hierarchical model that is trained end-to-end on document labels and produces polarity-based opinion summaries of single reviews. Here, we use MilNet to predict the sentiment polarity of individual opinions.

Multi-document Summarization

A few extractive neural models have recently been applied to generic multi-document summarization. Cao et al. (2015) train a recursive neural network using a ranking objective to identify salient sentences, while follow-up work Cao et al. (2017) employs a multi-task objective to improve sentence extraction, an idea we adapt to our task. Yasunaga et al. (2017) propose a graph convolution network to represent sentence relations and estimate sentence salience. Our summarization method is tailored to the opinion extraction task: it identifies aspect-specific and salient units, while minimizing the redundancy of the final summary with a greedy selection algorithm Cao et al. (2015); Yasunaga et al. (2017). Redundancy is also addressed by Ganesan et al. (2010), who propose a graph-based framework for abstractive summarization. Wang and Ling (2016) introduce an encoder-decoder neural method for opinion summarization. Their approach requires direct supervision via gold-standard summaries for training, in contrast to our weakly supervised formulation.

3 Problem Formulation

Let $C$ denote a corpus of reviews on a set of products from a single domain, e.g., televisions or keyboards. For every product $p$, the corpus contains a set of reviews $R_p$ expressing customers' opinions. Each review is accompanied by the author's overall rating and is split into segments, where each segment $s$ is in turn viewed as a sequence of words $(w_1, \dots, w_n)$. A segment can be a sentence, a phrase, or, in our case, an Elementary Discourse Unit (EDU; Mann and Thompson 1988) obtained from a Rhetorical Structure Theory (RST) parser Feng and Hirst (2012). EDUs roughly correspond to clauses and have been shown to facilitate performance in summarization Li et al. (2016), document-level sentiment analysis Bhatia et al. (2015), and single-document opinion extraction Angelidis and Lapata (2018).

A segment may discuss zero or more aspects, i.e., different product attributes. We use $\mathcal{A}$ to refer to the set of aspects pertaining to the domain. For example, picture quality, sound quality, and connectivity are all aspects of televisions. By convention, a general aspect is assigned to segments that do not discuss any specific aspect. Let $a_s \subseteq \mathcal{A}$ denote the set of aspects mentioned in segment $s$; $pol_s \in [-1, +1]$ marks the polarity a segment conveys, where $-1$ indicates maximally negative and $+1$ maximally positive sentiment. An opinion is represented by the tuple $o = (s, a_s, pol_s)$, and $O_p$ represents the set of all opinions expressed in the reviews of product $p$.

For each product $p$, our goal is to produce a summary of the most salient opinions expressed in its reviews $R_p$, by selecting a small subset of segments. We expect segments that discuss specific product aspects to be better candidates for useful summaries. We hypothesize that general comments mostly describe customers' overall experience, which can also be inferred from their rating, whereas aspect-related comments provide specific reasons for that overall opinion. We also assume that segments conveying highly positive or negative sentiment are more likely to present informative opinions than neutral ones, a claim supported by previous work Angelidis and Lapata (2018).

We describe our novel approach to aspect extraction in Section 4 and detail how we combine aspect, sentiment, and redundancy information to produce opinion summaries in Section 5.

4 Aspect Extraction

Our work builds on the aspect discovery model developed by He et al. (2017), which we extend to facilitate the accurate extraction of aspect-specific review segments in a more realistic setting. In this section, we first describe their approach, point out its shortcomings, and then present the extensions and modifications introduced in our Multi-Seed Aspect Extractor (MATE) model.

4.1 Aspect-Based Autoencoder

The Aspect-Based Autoencoder (ABAE; He et al. 2017) is an adaptation of the Relationship Modeling Network Iyyer et al. (2016), originally designed to identify attributes of fictional book characters and their relationships. The model learns a segment-level aspect predictor without supervision by attempting to reconstruct the input segment's encoding as a linear combination of aspect embeddings. ABAE starts by pairing each word $w$ with a pre-trained word embedding $e_w \in \mathbb{R}^{d}$, thus constructing a word embedding dictionary $E \in \mathbb{R}^{V \times d}$, where $V$ is the size of the vocabulary. The model also keeps an aspect embedding dictionary $A \in \mathbb{R}^{K \times d}$, where $K$ is the number of aspects to be identified and the $k$-th row is a point in the word embedding space. Matrix $A$ is initialized using the centroids from a $k$-means clustering on the vocabulary's word embeddings.
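To make the initialization step concrete, here is a minimal sketch using scikit-learn's KMeans; the function name and array shapes are our own illustrative choices, not part of the original implementation:

```python
import numpy as np
from sklearn.cluster import KMeans

def init_aspect_matrix(word_embeddings: np.ndarray, num_aspects: int, seed: int = 0) -> np.ndarray:
    """Initialize the (K x d) aspect matrix A with k-means centroids of the vocabulary embeddings."""
    km = KMeans(n_clusters=num_aspects, random_state=seed, n_init=10)
    km.fit(word_embeddings)       # word_embeddings: (V, d) array, one row per vocabulary word
    return km.cluster_centers_    # (K, d): one centroid per aspect
```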

The autoencoder first produces a vector $v_s$ for review segment $s$ using an attention encoder that learns to attend to aspect words. The segment encoding is computed as the weighted average of word vectors:

$v_s = \sum_{i=1}^{n} \alpha_i\, e_{w_i}$    (1)
$\alpha_i = \mathrm{softmax}\big(e_{w_i}^\top M\, y_s\big)$    (2)
$y_s = \frac{1}{n} \sum_{i=1}^{n} e_{w_i}$    (3)

where $\alpha_i$ is the $i$-th word's attention weight, $y_s$ is a simple average of the segment's word embeddings, and the attention matrix $M$ is learned during training.
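As a concrete illustration, a minimal NumPy sketch of this attention encoder (Equations 1-3); the snippet assumes the embeddings and attention matrix are given and omits all training details:

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max())
    return e / e.sum()

def encode_segment(E_s: np.ndarray, M: np.ndarray) -> np.ndarray:
    """Attention-based segment encoding (Equations 1-3).

    E_s -- (n, d) matrix of the segment's word embeddings
    M   -- (d, d) attention matrix, learned during training
    """
    y_s = E_s.mean(axis=0)     # Eq. (3): plain average of the word embeddings
    d_i = E_s @ M @ y_s        # bilinear relevance of each word to the segment
    alpha = softmax(d_i)       # Eq. (2): attention weights
    return alpha @ E_s         # Eq. (1): attention-weighted encoding v_s
```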

The vector $v_s$ is fed into a softmax classifier to predict a probability distribution over the $K$ aspects:

$p_s = \mathrm{softmax}(W v_s + b)$    (4)

where $W$ and $b$ are the classifier's weight and bias parameters. The segment's vector is then reconstructed as the weighted sum of aspect embeddings:

$r_s = A^\top p_s$    (5)

The model is trained by minimizing a max-margin reconstruction loss that uses $m$ randomly sampled segments $n_1, \dots, n_m$ as negative examples (ABAE also uses a uniqueness regularization term that is not shown here and is not used in our Multi-Seed Aspect Extractor model):

$\mathcal{L}(\theta) = \sum_{s \in C} \sum_{i=1}^{m} \max\big(0,\, 1 - r_s^\top v_s + r_s^\top n_i\big)$    (6)
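A minimal sketch of the forward pass (Equations 4-5) and the margin loss of Equation (6) for a single segment; names and shapes are ours, and the negative samples are assumed to be pre-computed segment encodings:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def aspect_probs_and_reconstruction(v_s, A, W, b):
    """Aspect distribution (Eq. 4) and reconstruction (Eq. 5) for one segment encoding."""
    p_s = softmax(W @ v_s + b)   # (K,) distribution over aspects
    r_s = A.T @ p_s              # (d,) weighted sum of aspect embeddings
    return p_s, r_s

def reconstruction_loss(v_s, r_s, negatives):
    """Max-margin loss of Eq. (6); `negatives` holds encodings of m randomly sampled segments."""
    return sum(max(0.0, 1.0 - r_s @ v_s + r_s @ n_i) for n_i in negatives)
```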

ABAE is essentially a neural topic model; it discovers topics which will hopefully map to aspects, without any preconceptions about the aspects themselves, a feature shared with most previous LDA-style aspect extraction approaches Titov and McDonald (2008a); He et al. (2017); Mukherjee and Liu (2012). These models set the number of topics to be discovered to a much larger number than the number of genuine aspects actually found in the data. This requires a many-to-one mapping between discovered topics and genuine aspects, which is performed manually.

Figure 2: Multi-Seed Aspect Extractor (MATE).

4.2 Multi-Seed Aspect Extractor

Dynamic aspect extraction is advantageous since it assumes nothing more than a set of relevant reviews for a product and may discover unusual and interesting aspects (e.g., whether a plasma television has protective packaging). However, it suffers from the fact that the identified aspects are fine-grained, have to be interpreted post hoc, and must be manually mapped to coarse-grained ones.

We propose a new weakly supervised setup for aspect extraction which requires little human involvement. For every aspect $a_k$, we assume there exists a small set of seed words which are good descriptors of $a_k$. We can think of these seeds as query terms that someone would use to search for segments discussing $a_k$. They can be set manually by a domain expert or selected using a small number of aspect-annotated reviews. Figure 2 (top) depicts four television aspects (image, sound, connectivity and price) and three of their seeds in word embedding space. MATE replaces ABAE's aspect dictionary $A$ with multiple seed matrices $Z_1, \dots, Z_K$. Every matrix $Z_k \in \mathbb{R}^{l \times d}$ contains one row per seed word and holds the seeds' word embeddings, as illustrated by the set of matrices in Figure 2.

MATE still needs to produce an aspect matrix $A$ in order to reconstruct the input segment's embedding. We accomplish this by reducing each seed matrix to a single aspect embedding with the help of seed weight vectors $z_k \in \mathbb{R}^{l}$ and stacking the results, as illustrated by the aspect matrix in Figure 2:

$a_k = Z_k^\top z_k$    (7)
$A = [a_1; a_2; \dots; a_K]$    (8)

The segment is reconstructed as in Equation (5). Weight vectors $z_k$ can be uniform (for manually selected seeds), fixed, learned during training, or set dynamically for each input segment, based on the cosine distance of its encoding to each seed embedding. Our experiments showed that fixed weights, selected through a technique described below, result in the most stable performance across domains. We focus only on this variant due to space restrictions (but provide more details in the supplementary material).
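A minimal sketch of how the aspect matrix is assembled from the seed matrices (Equations 7-8); the argument names are ours and the weight vectors are assumed to be fixed and pre-normalized:

```python
import numpy as np

def mate_aspect_matrix(seed_matrices, seed_weights):
    """Collapse each seed matrix Z_k into one aspect embedding (Eq. 7) and stack them (Eq. 8).

    seed_matrices -- list of (l, d) arrays Z_k, one per aspect
    seed_weights  -- list of (l,) arrays z_k whose entries sum to one
    """
    aspect_embeddings = [Z_k.T @ z_k for Z_k, z_k in zip(seed_matrices, seed_weights)]
    return np.stack(aspect_embeddings)   # (K, d) matrix A used in the reconstruction of Eq. (5)
```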

When a small number of aspect-annotated reviews are available, seeds and their fixed weights can be selected automatically. To obtain a ranked list of terms that are most characteristic of each aspect, we use a variant of the clarity scoring function, first introduced in information retrieval Cronen-Townsend et al. (2002). Clarity measures how much more likely it is to observe word $w$ in the subset of segments that discuss aspect $a$, compared to the corpus as a whole:

$\mathrm{clarity}(w, a) = t_a(w)\, \log\frac{t_a(w)}{t(w)}$    (9)

where $t_a(w)$ and $t(w)$ are the $l_1$-normalized tf-idf scores of $w$ in the segments annotated with aspect $a$ and in all annotated segments, respectively. Higher scores indicate higher term importance; truncating the ranked list of terms gives a fixed set of seed words, as well as their seed weights, obtained by normalizing the scores to sum to one. Table 1 shows the highest ranked terms obtained for every aspect in the televisions domain of our corpus (see Section 6 for a detailed description of our data).
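As an illustration, a self-contained sketch of clarity scoring over tokenized segments; the smoothed idf and helper names are our simplifications, not the exact weighting used in our experiments:

```python
import math
from collections import Counter

def clarity_scores(aspect_segments, all_segments):
    """Clarity score of Eq. (9) for every word seen in the aspect's segments.

    Both arguments are lists of token lists; `aspect_segments` is the subset of
    `all_segments` annotated with the aspect of interest.
    """
    n_docs = len(all_segments)
    df = Counter(w for seg in all_segments for w in set(seg))
    idf = {w: math.log(1.0 + n_docs / df[w]) for w in df}   # smoothed idf, always > 0

    def l1_tfidf(segments):
        tf = Counter(w for seg in segments for w in seg)
        weights = {w: c * idf[w] for w, c in tf.items()}
        total = sum(weights.values())
        return {w: v / total for w, v in weights.items()}

    t_a, t = l1_tfidf(aspect_segments), l1_tfidf(all_segments)
    return {w: t_a[w] * math.log(t_a[w] / t[w]) for w in t_a if w in t}
```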

4.3 Multi-Task Objective

MATE (and ABAE) relies on the attention encoder to identify and attend to each segment’s aspect-signalling words. The reconstruction objective only provides a weak training signal, so we devise a multi-task extension to enhance the encoder’s effectiveness without additional annotations.

 Aspect Top Terms
 Image picture color quality black bright
 Sound sound speaker quality bass loud
 Connectivity hdmi port computer input component
 Price price value money worth paid
 Apps & Interface netflix user file hulu apps
 Ease of Use easy remote setup user menu
 Customer Service paid support service week replace
 Size & Look size big bigger difference screen
 General tv bought hdtv happy problem
Table 1: Highest ranked words for the television corpus according to Equation (9).

We assume that aspect-relevant words not only provide a better basis for the model’s aspect-based reconstruction, but are also good indicators of the product’s domain. For example, the words colors and crisp, in the segment “The colors are perfectly crisp” should be sufficient to infer that the segment comes from a television review, whereas the words keys and type in the segment “The keys feel great to type on” are more representative of the keyboard domain. Additionally, all four words are characteristic of specific aspects.

Let $\mathcal{C} = C_1 \cup \dots \cup C_M$ denote the union of multiple review corpora, where one corpus, $C$, is considered in-domain and the rest are considered out-of-domain. We use $c_s$ to denote the true domain of segment $s$ and define a classifier that uses the vectors $v_s$ from our segment encoder as inputs:

$\hat{d}_s = \mathrm{softmax}(W_d\, v_s + b_d)$    (10)

where $\hat{d}_s$ is a probability distribution over product domains for segment $s$, and $W_d$ and $b_d$ are the classifier's weight and bias parameters. We use the negative log-likelihood of the domain prediction as the objective function, combined with the reconstruction loss of Equation (6), to obtain a multi-task objective:

$\mathcal{L}_{MT}(\theta) = \mathcal{L}(\theta) + \lambda \sum_{s \in \mathcal{C}} -\log \hat{d}_s(c_s)$    (11)

where $\lambda$ controls the influence of the classification loss. Note that the negative log-likelihood is summed over all segments in $\mathcal{C}$, whereas $\mathcal{L}(\theta)$ is only summed over the in-domain segments $s \in C$. It is important not to use the out-of-domain segments for segment reconstruction, as they would confuse the aspect extractor due to the aspect mismatch between different domains.
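A minimal sketch of how the two terms of Equation (11) combine; the per-segment losses and log-probabilities are assumed to be computed elsewhere, and the function name is ours:

```python
def multitask_loss(recon_losses_in_domain, domain_log_probs, lam=10.0):
    """Multi-task objective of Eq. (11).

    recon_losses_in_domain -- per-segment reconstruction losses, in-domain segments only
    domain_log_probs       -- log-probability of the true domain for every segment,
                              in-domain and out-of-domain alike
    lam                    -- weight of the classification term (10 in our experiments)
    """
    return sum(recon_losses_in_domain) - lam * sum(domain_log_probs)
```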

 Segment Salience
 1. The color and definition are perfect. [+]0.89
 2. Set up was extremely easy, [+]0.79
 3. Not worth $ 300. [-]0.75
 4. The sound on this is horrendous. [-]0.52
 5. The sound is TERRIBLE. [-]0.45
 6. Nice and bright with good colors. [+]0.44
Table 2: Most salient opinions according to scores from Equation (12) for an LCD TV.
 Domain Products Reviews EDUs Vocab
 Laptop Bags 2,040 (10) 42,727 (100) 602K (1,262) 30,443
 B/T Headsets 1,471 (10) 80,239 (100) 1.46M (1,344) 51,263
 Boots 4,723 (10) 77,593 (100) 987K (1,198) 30,364
 Keyboards 983 (10) 33,713 (100) 625K (1,396) 34,095
 Televisions 1,894 (10) 56,510 (100) 1.47M (1,483) 59,051
 Vacuums 1,184 (10) 68,266 (100) 1.50M (1,492) 46,259
Table 3: The OpoSum corpus. Numbers in parentheses correspond to the human-annotated subset.

5 Opinion Summarization

We now move on to describe our opinion summarization framework which is based on the aspect extraction component discussed so far, a polarity prediction model, and a segment selection policy which identifies and discards redundant opinions.

Opinion Polarity

Aside from describing a product's aspects, segments also express polarity (i.e., positive or negative sentiment). We identify segment polarity with the recently proposed Multiple Instance Learning Network model (MilNet; Angelidis and Lapata 2018). Although it is trained only on freely available document-level sentiment labels, i.e., customer ratings on a scale from 1 (negative) to 5 (positive), MilNet learns a segment-level sentiment predictor using a hierarchical, attention-based neural architecture.

Given a review consisting of segments, MilNet uses a CNN segment encoder to obtain segment vectors, each used as input to a segment-level sentiment classifier. For every segment vector, the classifier produces a sentiment prediction, i.e., a probability distribution over sentiment classes ranging from most negative to most positive. The resulting segment predictions are combined via a GRU-based attention mechanism to produce a document-level prediction, and the model is trained end-to-end on the reviews' user ratings using negative log-likelihood.

The essential by-product of MilNet is the set of segment-level sentiment predictions, which are transformed into polarities $pol_s$ by projecting them onto the $[-1, +1]$ range using a uniformly spaced sentiment class weight vector.
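For instance, a minimal sketch of this projection (the function name is ours; five classes correspond to 1-to-5 star ratings):

```python
import numpy as np

def to_polarity(sentiment_probs: np.ndarray) -> float:
    """Project a distribution over C ordered sentiment classes onto [-1, +1]
    using uniformly spaced class weights (C = 5 for 1-to-5 star ratings)."""
    weights = np.linspace(-1.0, 1.0, num=len(sentiment_probs))
    return float(sentiment_probs @ weights)
```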

Opinion Ranking

Aspect predictions and polarities form the opinion set $O_p$ for every product $p$. For simplicity, we set the predicted aspect set $a_s$ to only include the aspect with the highest probability, although it is straightforward to allow for multiple aspects. We rank every opinion $o = (s, a_s, pol_s)$ according to its salience:

$\mathrm{sal}(o) = |pol_s| \cdot \big( \max_{a \in \mathcal{A}} p_s(a) - p_s(\mathrm{gen}) \big)$    (12)

where the quantity in parentheses is the probability difference between the most probable aspect and the general aspect. The salience score is high for opinions that are very positive or very negative and are also likely to discuss a non-general aspect.
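A minimal sketch of the salience computation (the function signature and aspect labels are our own illustrative choices):

```python
def salience(polarity: float, aspect_probs: dict, general: str = "general") -> float:
    """Salience of Eq. (12): |polarity| times the gap between the most probable
    aspect and the general aspect."""
    return abs(polarity) * (max(aspect_probs.values()) - aspect_probs[general])
```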

Opinion Selection

The final step towards producing summaries is to discard potentially redundant opinions, something that is not taken into account by our salience scoring method. Table 2 shows a partial ranking of the most salient opinions found in the reviews for an LCD television. All segments provide useful information, but it is evident that segments 1 and 6, as well as 4 and 5, are paraphrases of the same opinions.

We follow previous work on multi-document summarization Cao et al. (2015); Yasunaga et al. (2017) and use a greedy algorithm to eliminate redundancy. We start with the highest ranked opinion and keep adding opinions to the final summary one by one, skipping any candidate segment whose cosine similarity to a segment already included in the summary exceeds a threshold.
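A minimal sketch of this greedy selection; the similarity threshold and the 100-word budget handling shown here are illustrative (the actual threshold is tuned on the development set):

```python
import numpy as np

def greedy_select(segments, encodings, saliences, budget=100, max_sim=0.5):
    """Walk down the salience ranking and keep a segment unless it is too similar
    (cosine) to one already selected or the word budget would be exceeded."""
    unit = [e / (np.linalg.norm(e) + 1e-8) for e in encodings]   # unit vectors for cosine similarity
    order = np.argsort(saliences)[::-1]
    summary, used_words = [], 0
    for i in order:
        if any(float(unit[i] @ unit[j]) > max_sim for j in summary):
            continue                                             # redundant with a kept segment
        n_words = len(segments[i].split())
        if used_words + n_words > budget:
            continue
        summary.append(i)
        used_words += n_words
    return [segments[i] for i in summary]
```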

Aspect Extraction (F1) L. Bags B/T H/S Boots Keyb/s TVs Vac/s AVG
Majority 37.9 39.8 37.1 43.2 41.7 41.6 40.2
ABAE 38.1 37.6 35.2 38.6 39.5 38.1 37.9
ABAE (seed init) 41.6 48.5 41.2 41.3 45.7 40.6 43.2
MATE 46.2 52.2 45.6 43.5 48.8 42.3 46.4
MATE+MT 48.6 54.5 46.4 45.3 51.8 47.7 49.1
Salience (MAP/P@5) L. Bags B/T H/S Boots Keyb/s TVs Vac/s AVG
MilNet 21.8 / 40.0 19.8 / 36.7 17.0 / 39.3 14.1 / 28.0 14.3 / 36.0 14.6 / 31.3 16.9 / 35.2
ABAE 19.9 / 48.5 27.5 / 49.7 13.8 / 28.1 19.0 / 44.9 16.8 / 42.4 16.1 / 34.0 18.8 / 41.3
MATE 23.0 / 57.1 30.9 / 50.7 15.4 / 31.9 21.0 / 43.1 18.7 / 44.7 19.9 / 44.0 21.5 / 45.2
MATE+MT 26.3 / 60.8 37.5 / 66.7 17.3 / 33.6 20.9 / 44.9 23.6 / 48.0 22.4 / 43.9 24.7 / 49.6
MilNet+ABAE 27.1 / 56.0 33.5 / 66.5 19.3 / 34.8 22.4 / 51.7 19.0 / 43.7 20.8 / 43.5 23.7 / 49.4
MilNet+MATE 28.2 / 54.7 36.0 / 66.5 21.7 / 39.3 24.0 / 52.0 20.8 / 46.1 23.5 / 49.3 25.7 / 51.3
MilNet+MATE+MT 32.1 / 69.2 40.0 / 74.7 23.3 / 40.4 24.8 / 56.4 23.8 / 52.8 26.0 / 53.1 28.3 / 57.8
Table 4: Experimental results for the identification of aspect segments (top) and the retrieval of salient segments (bottom) on OpoSum’s six product domains and overall (AVG).

6 The OpoSum Dataset

We created OpoSum, a new dataset for the training and evaluation of Opinion Summarization models which contains Amazon reviews from six product domains: Laptop Bags, Bluetooth Headsets, Boots, Keyboards, Televisions, and Vacuums. The six training collections were created by downsampling from the Amazon Product Dataset (http://jmcauley.ucsd.edu/data/amazon/) introduced in McAuley et al. (2015) and contain reviews and their respective ratings. The reviews were segmented into EDUs using a publicly available RST parser Feng and Hirst (2012).

To evaluate our methods and facilitate research, we produced a human-annotated subset of the dataset. For each domain, we uniformly sampled (across ratings) 10 different products with 10 reviews each, amounting to a total of 600 reviews, to be used only for development (300) and testing (300). We obtained EDU-level aspect annotations, salience labels and gold standard opinion summaries, as described below. Statistics are provided in Table 3 and in supplementary material.

Aspects

For every domain, we pre-selected nine representative aspects, including the general aspect. We presented the EDU-segmented reviews to three annotators and asked them to select the aspects discussed in each segment (multiple aspects were allowed). Final labels were obtained using a majority vote among annotators. Inter-annotator agreement across domains and annotated segments was measured using Cohen's Kappa coefficient.

Opinion Summaries

We produced opinion summaries for the 60 products in our benchmark using a two-stage procedure. First, all reviews for a product were shown to three annotators. Each annotator read the reviews one by one and selected the subset of segments they thought best captured the most important and useful comments, without taking redundancy into account. This phase produced binary salience labels against which we can judge a system's ability to identify important opinions. Agreement among annotators, again measured using the Kappa coefficient, was moderate; note, however, that Radev et al. (2003) show that inter-annotator agreement for extractive summarization is usually even lower. In the second stage, annotators were shown the salient segments they had identified (for every product) and asked to create a final extractive summary by choosing opinions based on their popularity, fluency and clarity, while avoiding redundancy and staying under a budget of 100 words. We used ROUGE Lin and Hovy (2003) as a proxy for inter-annotator agreement. For every product, we treated one reference summary as system output and computed how it agrees with the rest. ROUGE scores are reported in Table 5 (last row).

7 Experiments

In this section, we discuss implementation details and present our experimental setup and results. We evaluate model performance on three subtasks: aspect identification, salient opinion extraction, and summary generation.

Implementation Details

Reviews were lemmatized and stop words were removed. We initialized MATE using 200-dimensional word embeddings trained on each product domain using skip-gram Mikolov et al. (2013) with default parameters. We used 30 seed words per aspect, obtained via Equation (9). Word embeddings $E$, seed matrices $Z_k$ and seed weight vectors $z_k$ were fixed throughout training. We used the Adam optimizer Kingma and Ba (2014) with a mini-batch size of 50, and trained for 10 epochs. We used 20 negative examples per input for the reconstruction loss and, when used, the multi-tasking coefficient $\lambda$ was set to 10. Seed words and hyperparameters were selected on the development set and we report results on the test set, averaged over 5 runs.

Aspect Extraction

We trained aspect models on the collections of Table 3 and evaluated their predictions against the human-annotated portion of each corpus. Our MATE model and its multi-task counterpart (MATE+MT) were compared against a majority baseline and two ABAE variants: vanilla ABAE, where the aspect matrix $A$ is initialized using $k$-means centroids and fine-tuned during training; and a seed-initialized variant, ABAE (seed init), where the rows of $A$ are fixed to the centroids of the respective seed embeddings. This allows us to examine the benefits of our multi-seed aspect representation. Table 4 (top) reports the results using micro-averaged F1. Our models outperform both variants of ABAE across domains. ABAE (seed init) improves upon the vanilla model, affirming that informed aspect initialization can facilitate the task. The richer multi-seed representation of MATE, however, helps our model achieve a further 3.2% increase in F1. Additional improvements are gained by the multi-task model, which boosts performance by another 2.7%.

Opinion Salience

We are also interested in our system's ability to identify salient opinions in reviews. The first phase of our opinion extraction annotation provides us with binary salience labels, which we use as gold standard to evaluate system opinion rankings. For every product $p$, we score each segment using Equation (12) and evaluate the obtained rankings via Mean Average Precision (MAP) and Precision at the 5th retrieved segment (P@5); a system's salience ranking is compared individually against the labels from each annotator and we report the average. Polarity scores were produced by MilNet; we obtained aspect probabilities from ABAE, MATE, and MATE+MT. We also experimented with a variant that only uses MilNet's polarities and, additionally, with variants that ignore polarities and only use aspect probabilities.
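For reference, a minimal sketch of the two retrieval metrics on a single product's ranking (MAP averages these scores over products and annotators; the example values are illustrative only):

```python
import numpy as np

def average_precision(ranked_labels):
    """Average precision for a full ranking of binary salience labels."""
    hits, total = 0, 0.0
    for rank, relevant in enumerate(ranked_labels, start=1):
        if relevant:
            hits += 1
            total += hits / rank
    return total / max(hits, 1)

def precision_at_k(ranked_labels, k=5):
    return sum(ranked_labels[:k]) / k

# Rank segments by salience, then score the ranking against one annotator's labels.
saliences = [0.89, 0.10, 0.75, 0.05, 0.52, 0.44]
labels    = [1, 0, 1, 0, 1, 0]
ranking = [labels[i] for i in np.argsort(saliences)[::-1]]
ap, p5 = average_precision(ranking), precision_at_k(ranking)
```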

Results are shown in Table 4 (bottom). The combined use of polarity and aspect information improves the retrieval of salient opinions across domains, as all model variants that use the salience formula of Equation (12) outperform the MilNet-only and aspect-only baselines. When comparing aspect-based alternatives, we observe that extraction accuracy correlates with the quality of aspect prediction. In particular, ranking with MilNet+MATE+MT gives the best results, with a 2.6% increase in MAP over MilNet+MATE and 4.6% over MilNet+ABAE. The trend persists even when MilNet polarities are ignored, although the quality of the rankings is worse in this case.

Summarization ROUGE-1   ROUGE-2   ROUGE-L
Random  35.1 11.3 34.3
Lead  35.5 15.2 34.8
SumBasic  34.0 11.2 32.6
LexRank  37.7 14.1 36.6
Opinosis  36.8 14.3 35.7
Opinosis+MATE+MT  38.7 15.8 37.4
MilNet+MATE+MT  43.5 21.7 42.8
MilNet+MATE+MT+RD  44.1 21.8 43.3
Inter-annotator Agreement  54.7 36.6 53.9
Table 5: Summarization results on OpoSum.

Opinion Summaries

We now turn to the summarization task itself, where we compare our best performing model (MilNet+MATE+MT), with and without a redundancy filter (RD), against the following methods: a baseline that selects segments randomly; a Lead baseline that only selects the leading segments from each review; SumBasic, a generic frequency-based extractive summarizer Nenkova and Vanderwende (2005); LexRank, a generic graph-based extractive summarizer Erkan and Radev (2004); Opinosis, a graph-based abstractive summarizer that is designed for opinion summarization Ganesan et al. (2010). All extractive methods operate on the EDU level with a 100-word budget. For Opinosis, we tested an aspect-agnostic variant that takes every review segment for a product as input, and a variant that uses MATE’s groupings of segments to produce and concatenate aspect-specific summaries.

Inform. Polarity Coherence Redund.
Gold  2.04  8.70 10.93  6.11
This work  9.26  3.15  1.11  2.96
Opinosis -12.78 -10.00 -9.08 -9.45
Lead  1.48 -1.85 -2.96  0.37
Table 6: Best-Worst Scaling human evaluation.

Table 5 presents ROUGE-1, ROUGE-2 and ROUGE-L F1 scores, averaged across domains. Our model (MilNet+MATE+MT) significantly outperforms all comparison systems (paired bootstrap resampling; Koehn 2004), and adding the redundancy filter slightly improves performance further. Assisting Opinosis with aspect predictions is beneficial; however, it remains significantly inferior to our model (see the supplementary material for additional results).

Product domain: Televisions
Product name: Sony BRAVIA 46-Inch HDTV

Human

Plenty of ports and settings. Easy hookups to audio and satellite sources. The sound is good and strong. This TV looks very good. and the price is even better. The on-screen menu/options is quite nice. and the internet apps work as expected. The picture is clear and sharp. which is TOO SLOW to stream HD video… The software and apps built into this TV. are difficult to use and setup. Their service is handled off shore making. communication a bit difficult. :(

LexRank

Get a Roku or Netflix box. I watch cable, Netflix, Hulu Plus, YouTube videos and computer movie files on it. Sound is good much better. DO NOT BUY! this SONY Bravia ‘ Smart ’ TV… and avoid the Sony apps at all costs. Because of these two issues, I returned the Sony TV. Also you can change the display and sound settings on each port. However, the streaming speed for netflix is just down right terrible. Most of the time I just quit. Since I do not own the cable box, So, I have the cable.

Opinosis

The picture and not bright at all even compared to my 6-year old sony lcd tv. It will not work with an hdmi. Connection because of a conflict with comcast’s dhcp. Being generous because I usuallly like the design and attention to detail of sony products). I am very disappointed with this tv for two reasons: picture brightness and channel menu. Numbers of options available in the on-line area of the tv are numerous and extremely useful. Wow look at the color, look at the sharpness of the picture, amazing and the amazing.

This work

Plenty of ports and settings and have been extremely happy with it. The sound is good and strong. The picture is beautiful. And the internet apps work as expected. And the price is even better. Unbelieveable picture and the setup is so easy. Wow look at the color, look at the sharpness of the picture. The Yahoo! widgets do not work. And avoid the Sony apps at all costs. Communication a bit difficult. :(
Figure 3: Human and system summaries for a product in the Televisions domain.

We also performed a large-scale user study. For every product in the OpoSum test set, participants were asked to compare summaries produced by: a (randomly selected) human annotator, our best performing model (MilNet+MATE+MT+RD), Opinosis, and the Lead baseline. The study was conducted on the Crowdflower platform using Best-Worst Scaling (BWS; Louviere and Woodworth 1991; Louviere et al. 2015), a less labour-intensive alternative to paired comparisons that has been shown to produce more reliable results than rating scales Kiritchenko and Mohammad (2017). We arranged every 4-tuple of competing summaries into four triplets. Every triplet was shown to three crowdworkers, who were asked to decide which summary was best and which one was worst according to four criteria: Informativeness (How much useful information about the product does the summary provide?), Polarity (How well does the summary highlight positive and negative opinions?), Coherence (How coherent and easy to read is the summary?), and Redundancy (How successfully does the summary avoid redundant opinions?).

For every criterion, a system's score is computed as the percentage of times it was selected as best minus the percentage of times it was selected as worst Orme (2009). The scores range from -100 (unanimously worst) to +100 (unanimously best) and are shown in Table 6. Participants favored our model over the comparison systems across all criteria (all differences are statistically significant according to post-hoc Tukey HSD tests). Human summaries are generally preferred over our model's; however, the difference is significant only in terms of coherence.
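A minimal sketch of this counting scheme (the input format is our own illustrative choice):

```python
from collections import Counter

def bws_scores(judgments):
    """Best-Worst Scaling counts: % of times a system was chosen best minus % of times
    it was chosen worst, over the triplets in which it appeared.

    judgments -- list of (systems_shown, best, worst) tuples, one per annotated triplet
    """
    best, worst, shown = Counter(), Counter(), Counter()
    for systems, b, w in judgments:
        shown.update(systems)
        best[b] += 1
        worst[w] += 1
    return {s: 100.0 * (best[s] - worst[s]) / shown[s] for s in shown}
```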

Finally, Figure 3 shows example summaries for a product from our televisions domain, produced by one of our annotators and by three systems (LexRank, Opinosis, and our MilNet+MATE+MT+RD). The human summary is primarily focused on aspect-relevant opinions, a characteristic that is also captured to a large extent by our method. There is substantial overlap between extracted segments, although our redundancy filter fails to identify a few highly similar opinions (e.g., those relating to picture quality). The LexRank summary is inferior, as it only identifies a few useful opinions and instead selects many general or non-opinionated comments. Lastly, the abstractive summary of Opinosis does a good job of capturing opinions about specific aspects but lacks fluency, as it contains grammatical errors. For additional system outputs, see the supplementary material.

8 Conclusions

We presented a weakly supervised neural framework for aspect-based opinion summarization. Our method combined a seeded aspect extractor that is trained under a multi-task objective without direct supervision, and a multiple instance learning sentiment predictor, to identify and extract useful comments in product reviews. We evaluated our weakly supervised models on a new opinion summarization corpus across three subtasks, namely aspect identification, salient opinion extraction, and summary generation. Our approach delivered significant improvements over strong baselines in each of the subtasks, while a large-scale judgment elicitation study showed that crowdworkers favor our summarizer over competitive extractive and abstractive systems.

In the future, we plan to develop a more integrated approach where aspects and sentiment orientation are jointly identified, and work with additional languages and domains. We would also like to develop methods for abstractive opinion summarization using weak supervision signals.

Acknowledgments

We gratefully acknowledge the financial support of the European Research Council (award number 681760).

References

  • Angelidis and Lapata (2018) Stefanos Angelidis and Mirella Lapata. 2018. Multiple Instance Learning Networks for Fine-Grained Sentiment Analysis. Transactions of the Association for Computational Linguistics, 6:17–31.
  • Bhatia et al. (2015) Parminder Bhatia, Yangfeng Ji, and Jacob Eisenstein. 2015. Better document-level sentiment analysis from RST discourse parsing. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 2212–2218, Lisbon, Portugal.
  • Blair-Goldensohn et al. (2008) Sasha Blair-Goldensohn, Kerry Hannan, Ryan McDonald, Tyler Neylon, George Reis, and Jeff Reynar. 2008. Building a sentiment summarizer for local service reviews. In Proceedings of the WWW Workshop on NLP Challenges in the Information Explosion Era (NLPIX), Beijing, China.
  • Cao et al. (2017) Ziqiang Cao, Wenjie Li, Sujian Li, and Furu Wei. 2017. Improving multi-document summarization via text classification. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, pages 3053–3059, San Francisco, California, USA.
  • Cao et al. (2015) Ziqiang Cao, Furu Wei, Li Dong, Sujian Li, and Ming Zhou. 2015. Ranking with recursive neural networks and its application to multi-document summarization. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, pages 2153–2159, Austin, Texas, USA.
  • Cronen-Townsend et al. (2002) Steve Cronen-Townsend, Yun Zhou, and W. Bruce Croft. 2002. Predicting query performance. In Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’02, pages 299–306, New York, NY, USA.
  • Erkan and Radev (2004) Günes Erkan and Dragomir R Radev. 2004. Lexrank: Graph-based lexical centrality as salience in text summarization. Journal of Artificial Intelligence Research, 22:457–479.
  • Feng and Hirst (2012) Wei Vanessa Feng and Graeme Hirst. 2012. Text-level discourse parsing with rich linguistic features. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, volume 1, pages 60–68, Jeju Island, Korea.
  • Ganesan et al. (2010) Kavita Ganesan, ChengXiang Zhai, and Jiawei Han. 2010. Opinosis: a graph-based approach to abstractive summarization of highly redundant opinions. In Proceedings of the 23rd International Conference on Computational Linguistics, pages 340–348, Beijing, China.
  • He et al. (2017) Ruidan He, Wee Sun Lee, Hwee Tou Ng, and Daniel Dahlmeier. 2017. An unsupervised neural attention model for aspect extraction. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, volume 1, pages 388–397, Vancouver, Canada.
  • Hu and Liu (2004) Minqing Hu and Bing Liu. 2004. Mining and summarizing customer reviews. In Proceeding of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 168–177, Seattle, Washington, USA.
  • Hu and Liu (2006) Minqing Hu and Bing Liu. 2006. Opinion extraction and summarization on the web. In Proceedings of the 21st National Conference on Artificial Intelligence, pages 1621–1624, Boston, Massachusettes, USA.
  • Iyyer et al. (2016) Mohit Iyyer, Anupam Guha, Snigdha Chaturvedi, Jordan Boyd-Graber, and Hal Daumé III. 2016. Feuding families and former friends: Unsupervised learning for dynamic fictional relationships. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1534–1544, San Diego, California.
  • Kim et al. (2011) Hyun Duk Kim, Kavita Ganesan, Parikshit Sondhi, and ChengXiang Zhai. 2011. Comprehensive review of opinion summarization. Technical report.
  • Kim (2014) Yoon Kim. 2014. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pages 1746–1751, Doha, Qatar.
  • Kingma and Ba (2014) Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  • Kiritchenko and Mohammad (2017) Svetlana Kiritchenko and Saif Mohammad. 2017. Best-worst scaling more reliable than rating scales: A case study on sentiment intensity annotation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), volume 2, pages 465–470, Vancouver, Canada.
  • Koehn (2004) Philipp Koehn. 2004. Statistical significance tests for machine translation evaluation. In Proceedings of EMNLP 2004, pages 388–395, Barcelona, Spain.
  • Kotzias et al. (2015) Dimitrios Kotzias, Misha Denil, Nando De Freitas, and Padhraic Smyth. 2015. From group to individual labels using deep features. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 597–606, Sydney, Australia.
  • Ku et al. (2006) Lun-Wei Ku, Yun-Ting Liang, and Hsin-Hsi Chen. 2006. Opinion extraction, summarization and tracking in news and blog corpora. In AAAI Symposium on Computational Approaches to Analysing Weblogs, pages 100–107, Palo Alto, California, USA.
  • Lerman et al. (2009) Kevin Lerman, Sasha Blair-Goldensohn, and Ryan McDonald. 2009. Sentiment summarization: Evaluating and learning user preferences. In Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics, EACL ’09, pages 514–522. Association for Computational Linguistics.
  • Li et al. (2016) Junyi Jessy Li, Kapil Thadani, and Amanda Stent. 2016. The role of discourse units in near-extractive summarization. In Proceedings of the SIGDIAL 2016 Conference, The 17th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 137–147, Los Angeles, California, USA.
  • Lin and Hovy (2003) Chin-Yew Lin and Eduard Hovy. 2003. Automatic evaluation of summaries using n-gram co-occurrence statistics. In Proceedings of the 2003 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, pages 71–78. Association for Computational Linguistics.
  • Liu (2012) Bing Liu. 2012. Sentiment analysis and opinion mining. Synthesis lectures on human language technologies, 5(1):1–167.
  • Liu and Zhang (2012) Bing Liu and Lei Zhang. 2012. A survey of opinion mining and sentiment analysis. Mining Text Data, Springer, pages 415–463.
  • Liu et al. (2015) Pengfei Liu, Shafiq Joty, and Helen Meng. 2015. Fine-grained opinion mining with recurrent neural networks and word embeddings. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1433–1443, Lisbon, Portugal.
  • Louviere et al. (2015) Jordan J Louviere, Terry N Flynn, and Anthony Alfred John Marley. 2015. Best-worst scaling: Theory, methods and applications. Cambridge University Press.
  • Louviere and Woodworth (1991) Jordan J Louviere and George G Woodworth. 1991. Best-worst scaling: A model for the largest difference judgments. University of Alberta: Working Paper.
  • Lu et al. (2009) Yue Lu, ChengXiang Zhai, and Neel Sundaresan. 2009. Rated aspect summarization of short comments. In Proceedings of the 18th International Conference on World Wide Web, pages 131–140, Madrid, Spain.
  • Mann and Thompson (1988) William C. Mann and Sandra A. Thompson. 1988. Rhetorical structure theory: Toward a functional theory of text organization. Text-Interdisciplinary Journal for the Study of Discourse, 8(3):243–281.
  • McAuley et al. (2015) Julian McAuley, Christopher Targett, Qinfeng Shi, and Anton Van Den Hengel. 2015. Image-based recommendations on styles and substitutes. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 43–52, Santiago, Chile.
  • Mei et al. (2007) Qiaozhu Mei, Xu Ling, Matthew Wondra, Hang Su, and ChengXiang Zhai. 2007. Topic sentiment mixture: Modeling facets and opinions in weblogs. In Proceedings of the 16th International Conference on World Wide Web, pages 171–180, Banff, Alberta, Canada.
  • Mikolov et al. (2013) Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119, Lake Tahoe, California, USA.
  • Mukherjee and Liu (2012) Arjun Mukherjee and Bing Liu. 2012. Modeling review comments. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, pages 320–329, Jeju Island, Korea.
  • Nenkova and Vanderwende (2005) Ani Nenkova and Lucy Vanderwende. 2005. The impact of frequency on summarization. Technical report.
  • Orme (2009) Bryan Orme. 2009. Maxdiff analysis: Simple counting, individual-level logit, and HB. Technical report.
  • Pappas and Popescu-Belis (2017) Nikolaos Pappas and Andrei Popescu-Belis. 2017. Explicit document modeling through weighted multiple-instance learning. Journal of Artificial Intelligence Research, 58:591–626.
  • Popescu and Etzioni (2005) Ana-Maria Popescu and Oren Etzioni. 2005. Extracting product features and opinions from reviews. In Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, pages 339–346, Vancouver, British Columbia, Canada.
  • Radev et al. (2003) Dragomir R. Radev, Simone Teufel, Horacio Saggion, Wai Lam, John Blitzer, Hong Qi, Arda Çelebi, Danyu Liu, and Elliott Drabek. 2003. Evaluation challenges in large-scale document summarization. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, pages 375–382. Association for Computational Linguistics.
  • Socher et al. (2013) Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Y. Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1631–1642, Seattle, Washington, USA.
  • Titov and McDonald (2008a) Ivan Titov and Ryan McDonald. 2008a. A joint model of text and aspect ratings for sentiment summarization. In Proceedings of the Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 308–316, Columbus, Ohio, USA.
  • Titov and McDonald (2008b) Ivan Titov and Ryan McDonald. 2008b. Modeling online reviews with multi-grain topic models. In Proceedings of the 17th International Conference on World Wide Web, pages 111–120, Beijing, China.
  • Wang and Ling (2016) Lu Wang and Wang Ling. 2016. Neural network-based abstract generation for opinions and arguments. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 47–57. Association for Computational Linguistics.
  • Wang et al. (2016) Wenya Wang, Sinno Jialin Pan, Daniel Dahlmeier, and Xiaokui Xiao. 2016. Recursive neural conditional random fields for aspect-based sentiment analysis. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 616–626, Austin, Texas, USA.
  • Yasunaga et al. (2017) Michihiro Yasunaga, Rui Zhang, Kshitijh Meelu, Ayush Pareek, Krishnan Srinivasan, and Dragomir Radev. 2017. Graph-based neural multi-document summarization. In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017), pages 452–462, Vancouver, Canada.
  • Yin et al. (2016) Yichun Yin, Furu Wei, Li Dong, Kaimeng Xu, Ming Zhang, and Ming Zhou. 2016. Unsupervised word and dependency path embeddings for aspect term extraction. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, pages 2979–2985, New York, NY, USA.
  • Zhao et al. (2005) Ying Zhao, George Karypis, and Usama Fayyad. 2005. Hierarchical clustering algorithms for document datasets. Data Mining and Knowledge Discovery, 10(2):141–168.