Unsupervised Multi-Document Opinion Summarization as Copycat-Review Generation

11/06/2019 ∙ by Arthur Bražinskas, et al. ∙ 14

Summarization of opinions is the process of automatically creating text summaries that reflect subjective information expressed in input documents, such as product reviews. While most previous research in opinion summarization has focused on the extractive setting, i.e. selecting fragments of the input documents to produce a summary, we let the model generate novel sentences and hence produce fluent text. Supervised abstractive summarization methods typically rely on large quantities of document-summary pairs, which are expensive to acquire. In contrast, we consider the unsupervised setting; in other words, we do not use any summaries in training. We define a generative model for a multi-product review collection. Intuitively, we want to design the model so that, when generating a new review given a set of other reviews of the product, we can control the ‘amount of novelty’ going into the new review or, equivalently, vary the degree of deviation from the input reviews. At test time, when generating summaries, we force the novelty to be minimal, and produce a text reflecting consensus opinions. We capture this intuition by defining a hierarchical variational autoencoder model. Both individual reviews and the products they correspond to are associated with stochastic latent codes, and the review generator (‘decoder’) has direct access to the text of input reviews through the pointer-generator mechanism. In experiments on Amazon and Yelp data, we show that setting the review's latent code to its mean at test time lets this model produce fluent and coherent summaries.






1 Introduction

Summary This restaurant is a hidden gem in Toronto. The food is delicious, and the service is impeccable. Highly recommend for anyone who likes French bistro.
Reviews We got the steak frites and the chicken frites both of which were very goodGreat service I really love this placeCôte de BoeufA Jewel in the big city French jewel of Spadina and Adelaide , JulesThey are super accommodatingmoules and frites are delicious Food came with tons of greens and fries along with my main course , thumbs uppp Chef has a very cool and fun attitude Great little French Bistro spotGo if you want French bistro food classics Great placethe steak frites and it was amazingBest Steak Fritesin Downtown Toronto Favourite french spot in the citycrème brule for dessert
Table 1: A summary produced by our model; colors encode its alignment to the input reviews. The reviews are truncated, and delimited with the symbol ‘’.

Summarization of user opinions expressed in online resources, such as blogs, reviews, social media, or internet forums, has drawn much attention due to its potential for various information access applications, such as creating digests, search, and report generation (hu2004mining; angelidis2018summarizing; medhat2014sentiment). Although there has been significant progress recently in summarization in the non-subjective context (rush2015neural; nallapati2016abstractive; paulus2017deep; see2017get; liu2018generating), modern deep learning methods rely on large amounts of annotated data that are not readily available in the opinion-summarization domain and are expensive to produce. Moreover, the annotation would be needed for multiple domains, as online reviews are inherently multi-domain (blitzer2007biographies) and summarization systems are highly domain-sensitive (isonuma2017extractive). Thus, unsurprisingly, there is a long history of applying unsupervised and weakly-supervised methods to opinion summarization (e.g., mei2007topic; titov2008modeling; angelidis2018summarizing). However, these approaches have primarily focused on extractive summarization, i.e. producing summaries by copying parts of the input reviews.

In this work, we instead consider abstractive summarization, which involves generating new phrases, possibly rephrasing or using words that were not in the original text. Abstractive summaries are often claimed to be preferable to extractive ones, both in the non-subjective summarization context (barzilay1999information) and when summarizing subjective content (carenini2008extractive; di2014hybrid). We focus on the unsupervised setting; in other words, we do not use any summaries for training. Unlike aspect-based summarization (liu2012sentiment), which rewards the diversity of opinions, we aim to generate summaries that reflect consensus in the reviews. Our summaries can be regarded as summarizing reviews reflecting dominant opinions, which can be useful for quick decision making. For an example of the output, see Table 1.

More specifically, we assume that we are provided with a large collection of reviews for various products and businesses and define a generative model of this collection. (For simplicity, we refer to both products, e.g., iPhone X, and businesses, e.g., a specific Starbucks branch, as products.) Intuitively, we want to design the model so that, when generating a review relying on a set of other reviews of the product, we can control the ‘amount of novelty’ going into the new review or, equivalently, vary the degree of deviation from the input reviews. At test time, when generating summaries, we will force the novelty to be minimal, and produce a text reflecting consensus opinions.

We capture this intuition by defining a hierarchical variational autoencoder (VAE) model. Both products and individual reviews are associated with latent representations. The product representations can store, for example, the overall sentiment, common topics, and opinions expressed about the product. In contrast, latent representations of reviews depend on the product representations and capture the content of individual reviews. While at training time the latent representations are random variables, we fix them at their respective means at test time. As desired for summarization, these ‘average’ (or ‘copycat’) reviews differ in writing style from a typical review. For example, they do not contain irrelevant details that are common in customer reviews, such as mentioning the occasion or saying how many family members accompanied the reviewer. In order to encourage the summaries to include specifics, the review generator (‘decoder’) has direct access to the text of input reviews through the pointer-generator mechanism (see2017get). In the example in Table 1, the model included specific information about the restaurant type and its location in the generated summary. As we will see in ablations, without this conditioning, the model performance drops substantially, as the summaries become more generic.

We evaluate our models on two datasets, Amazon product reviews and Yelp reviews of businesses. The only previous method dealing with unsupervised multi-document opinion summarization, as far as we are aware, is MeanSum (chu2019meansum). Similarly to our work, they generate consensus summaries and consider the Yelp benchmark. Whereas we rely on continuous latent representations, they treat the summary itself as a discrete latent representation of a product. Though this encodes the intuition that a summary should capture the key information about the product, using discrete latent sequences makes optimization challenging (miao2016language; baziotis2019seq), and chu2019meansum have to use an extra training loss term and biased gradient estimators.

Our contributions can be summarized as follows:

  • we introduce a simple end-to-end approach to unsupervised abstractive summarization;

  • we demonstrate that the approach substantially outperforms the previous method, both when measured with automatic metrics and in human evaluation;

  • we provide a dataset of abstractive summaries for Amazon products.

2 Model and Estimation

As we discussed above, we approach the summarization task from the generative modeling perspective. We start with a high level description of our model, then in Sections 2.2 and 2.3 we show how we estimate the model and provide extra technical details. In Section 3, we will explain how the summaries are generated.

2.1 Overview of the Generative Model

Our text collection consists of review groups, with each group corresponding to a single product, and our model, Latent Summarizer (LSumm), captures this organization. LSumm can be regarded as a hierarchical extension of the vanilla text-VAE model (bowman2015generating) and uses two sets of latent variables. The continuous variable $c$ is used to capture the ‘latent semantics’ of a group of reviews. Another continuous variable $z_i$ encodes the latent semantics of each individual review $r_i$ in the group. The information stored in $z_i$ is used by the decoder to produce the review text $r_i$. The marginal log-likelihood for one group of reviews $r_{1:N} = (r_1, \ldots, r_N)$ is given by

$$\log p_\theta(r_{1:N}) = \log \int p_\theta(c) \prod_{i=1}^{N} \left[ \int p_\theta(r_i \mid z_i)\, p_\theta(z_i \mid c)\, dz_i \right] dc.$$

Intuitively, when generating a new review for a product, the latent representations $c$ and $z_i$ serve as (stochastic) hidden layers passing information from the previous reviews to the decoder. This bottleneck is undesirable, as it makes it hard for the model to pass fine-grained information. For example, at generation time, the model should reuse named entities (e.g., product names or technical parameters) from other reviews rather than ‘hallucinating’ them or avoiding them altogether, which results in generic and non-informative text. We alleviate this issue by letting the decoder directly access other reviews. We can formulate this as an autoregressive model:

$$p_\theta(r_{1:N} \mid c) = \prod_{i=1}^{N} p_\theta(r_i \mid r_1, \ldots, r_{i-1}, c).$$


As we will discuss in Section 2.3, the conditioning is instantiated using the pointer-generator mechanism (see2017get) and, thus, will specifically help in generating rare words (e.g., named entities). As in this work we want our summarizer to rely equally on every review, we do not want to impose any order on the generation process (e.g., the temporal one). Instead, as illustrated in Figure 1, we let the decoder access all other reviews of the group. This is closely related to pseudolikelihood estimation (besag1975statistical) or Skip-Thought's sentence embedding objective (kiros2015skip). The final formulation that we aim to maximize for one group of reviews $r_{1:N}$ is

$$\log \int p_\theta(c) \prod_{i=1}^{N} \left[ \int p_\theta(r_i \mid z_i, r_{-i})\, p_\theta(z_i \mid c)\, dz_i \right] dc, \qquad (2)$$

where $r_{-i}$ corresponds to all other reviews in the same group as $r_i$.

We will confirm in ablation experiments that both hierarchical modeling (i.e. using $c$) and the direct conditioning on other reviews are crucial for generating quality summaries.
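The leave-one-out conditioning described above, where each review is generated with access to all other reviews of its group, can be illustrated with a minimal sketch. The helper name `leave_one_out` and the toy reviews are hypothetical and only show how the (target, others) pairs are formed.

```python
def leave_one_out(group):
    """Yield (target, others) pairs: each review is paired with the rest of its group."""
    for i, target in enumerate(group):
        others = group[:i] + group[i + 1:]  # every review except the i-th one
        yield target, others


# Toy group of three reviews (illustrative data only).
reviews = ["great fries", "friendly staff", "cozy bistro"]
pairs = list(leave_one_out(reviews))
```

Because every review takes the role of the target once, no ordering over the reviews is imposed.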

Figure 1: The decoder relies on text of the other reviews in the group along with the target review's latent code $z_i$.

2.2 Model Estimation

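As is standard with VAEs and variational inference in general (kingma2013auto), instead of directly maximizing the intractable marginal likelihood in Equation 2, we maximize its lower bound (see the derivations in Appendix A.1):

$$\mathcal{L}(\theta, \phi; r_{1:N}) = \mathbb{E}_{c \sim q_\phi(c \mid r_{1:N})} \left[ \sum_{i=1}^{N} \mathbb{E}_{z_i \sim q_\phi(z_i \mid r_i, c)} \left[ \log p_\theta(r_i \mid z_i, r_{-i}) \right] - \sum_{i=1}^{N} D_{KL}\left[ q_\phi(z_i \mid r_i, c) \,\|\, p_\theta(z_i \mid c) \right] \right] - D_{KL}\left[ q_\phi(c \mid r_{1:N}) \,\|\, p_\theta(c) \right]. \qquad (3)$$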


The lower bound includes two ‘inference networks’, $q_\phi(c \mid r_{1:N})$ and $q_\phi(z_i \mid r_i, c)$, which are neural networks parameterized with $\phi$ and will be discussed in detail in Section 2.3. They approximate the corresponding posterior distributions of the generative model. The bound is maximized with respect to both the generative model's parameters $\theta$ and the inference networks' parameters $\phi$. Due to Gaussian assumptions, the Kullback-Leibler (KL) divergence terms are available in closed form, while we rely on the reparameterization trick (kingma2013auto) to compute gradients of the first term in Equation 3 (i.e. the reconstruction error).
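The two ingredients mentioned above, the reparameterization trick and the closed-form KL divergence between diagonal Gaussians, can be sketched in NumPy as follows. This is an illustrative sketch rather than the authors' implementation; the function names and the log-variance parameterization are assumptions.

```python
import numpy as np


def reparameterize(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I), so z is a deterministic function of (mu, log_var)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps


def kl_diag_gaussians(mu_q, log_var_q, mu_p, log_var_p):
    """Closed-form KL[N(mu_q, diag(var_q)) || N(mu_p, diag(var_p))] for diagonal Gaussians."""
    var_q, var_p = np.exp(log_var_q), np.exp(log_var_p)
    per_dim = log_var_p - log_var_q + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0
    return 0.5 * np.sum(per_dim)


# Toy check: KL between N(1, I) and N(0, I) in two dimensions.
kl_example = kl_diag_gaussians(np.ones(2), np.zeros(2), np.zeros(2), np.zeros(2))
z_example = reparameterize(np.zeros(3), np.zeros(3), np.random.default_rng(0))
```

Because the noise is sampled independently of the parameters, gradients of the reconstruction term can flow through `mu` and `log_var` in an autodiff framework.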

The inference network predicting the posterior for a review-specific variable is needed only in training and is discarded afterwards. In contrast, we will exploit the product-specific inference network when generating summaries, as discussed in Section 3.

2.3 Design of Model Components

Figure 2: Production of the latent code $z_i$ for the review $r_i$.

2.3.1 Text representations

A GRU encoder (cho2014learning) embeds the words of a review $r_i$ to obtain hidden states $h_i^1, \ldots, h_i^{T_i}$. Those representations are reused across the system, e.g., in the inference networks and the decoder.

The full architecture used to produce the latent codes $c$ and $z_i$ is shown in Figure 2. We make Gaussian assumptions for all distributions (i.e. posteriors and priors). As in kingma2013auto, we use separate linear projections (LPs) to compute the means and diagonal log-covariances.

2.3.2 Prior $p_\theta(c)$ and posterior $q_\phi(c \mid r_{1:N})$

We set the prior over group latent codes to the standard normal distribution, $p_\theta(c) = \mathcal{N}(c; 0, I)$. In order to compute the approximate posterior $q_\phi(c \mid r_{1:N})$, we first predict the contribution (‘importance’) of each word in each review to the code of the group:

$$\alpha_i^t = \frac{\exp\left(f_\phi^\alpha(m_i^t)\right)}{\sum_{j=1}^{N} \sum_{k=1}^{T_j} \exp\left(f_\phi^\alpha(m_j^k)\right)},$$

where $T_i$ is the length of $r_i$ and $f_\phi^\alpha$ is a feed-forward neural network (FFNN) which takes as input the concatenated word embeddings and hidden states of the GRU encoder, $m_i^t = [h_i^t \circ w_i^t]$, and returns a scalar. (We use FFNNs with the tanh non-linearity in several model components. Whenever a FFNN is mentioned in the subsequent discussion, the stated architecture is assumed.)

Next, we compute the intermediate representation with the weighted sum: $\hat{h} = \sum_{i=1}^{N} \sum_{t=1}^{T_i} \alpha_i^t m_i^t$. The formulation is both permutation-invariant and allows for an arbitrary number of reviews of arbitrary lengths (zaheer2017deep).

Finally, we compute the Normal distribution's parameters by linear projections of the intermediate representation:

$$\mu_\phi(r_{1:N}) = L\hat{h} + b_L, \qquad \log \sigma_\phi(r_{1:N}) = G\hat{h} + b_G.$$
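The computation of the group posterior described in this subsection (word-level importance scores, a softmax over all words of all reviews, a weighted sum, and two linear projections) can be sketched as follows. The function name, the toy dimensions, and the randomly initialized weights are assumptions made for illustration; only the overall structure follows the description above.

```python
import numpy as np


def group_posterior_params(m_list, score_fn, L, b_L, G, b_G):
    """Pool per-word features of all reviews into the parameters of the group posterior.

    m_list: one (T_i, d) array per review; rows are [hidden state ; word embedding].
    score_fn: maps a (d,) feature vector to a scalar importance score.
    """
    m = np.concatenate(m_list, axis=0)             # all words of all reviews
    scores = np.array([score_fn(row) for row in m])
    scores -= scores.max()                         # numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()  # softmax over every word in the group
    h_hat = alpha @ m                              # weighted sum, shape (d,)
    mu = L @ h_hat + b_L                           # mean projection
    log_sigma = G @ h_hat + b_G                    # log-scale projection
    return mu, log_sigma, alpha


rng = np.random.default_rng(0)
d, k = 6, 4                                         # toy sizes (hypothetical)
W, v = rng.normal(size=(5, d)), rng.normal(size=5)
score = lambda x: float(v @ np.tanh(W @ x))         # one-hidden-layer tanh FFNN
L, G = rng.normal(size=(k, d)), rng.normal(size=(k, d))
b_L, b_G = np.zeros(k), np.zeros(k)
group = [rng.normal(size=(t, d)) for t in (3, 5, 2)]  # three reviews of different lengths
mu, log_sigma, alpha = group_posterior_params(group, score, L, b_L, G, b_G)
mu_perm, _, _ = group_posterior_params(group[::-1], score, L, b_L, G, b_G)
```

Reordering the reviews leaves `mu` unchanged, which is the permutation invariance noted in the text, and the group may contain any number of reviews of any length.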


2.3.3 Prior $p_\theta(z_i \mid c)$ and posterior $q_\phi(z_i \mid r_i, c)$

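To compute the prior on the review code $z_i$, $p_\theta(z_i \mid c) = \mathcal{N}(z_i; \mu_\theta(c), I\sigma_\theta(c))$, we linearly project the product code $c$. Similarly, to compute the parameters of the approximate posterior $q_\phi(z_i \mid r_i, c) = \mathcal{N}(z_i; \mu_\phi(r_i, c), I\sigma_\phi(r_i, c))$, we concatenate the last encoder state $h_i^{T_i}$ of the review $r_i$ and $c$, and perform linear projections.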

2.3.4 Decoder

To compute the distribution $p_\theta(r_i \mid z_i, r_{-i})$, we use an auto-regressive GRU decoder with the attention mechanism (bahdanau2014neural) and a pointer-generator network.

We compute the context vector $c_i^t = \mathrm{att}(s_i^t, h_{-i})$ by attending to all the encoder's hidden states $h_{-i}$ of the other reviews $r_{-i}$ of the group, where the decoder's hidden state $s_i^t$ is used as a query. The hidden state of the decoder is computed using the GRU cell as

$$s_i^t = \mathrm{GRU}_\theta\left(s_i^{t-1}, [w_i^t \circ c_i^{t-1} \circ z_i]\right).$$

The cell inputs the previous hidden state $s_i^{t-1}$, as well as the concatenated word embedding $w_i^t$, context vector $c_i^{t-1}$, and latent code $z_i$. We found that the model is less prone to ‘posterior collapse’ (bowman2015generating), i.e. not utilizing the latent code $z_i$, if the latent code is passed as input at every time-step.

Finally, we compute the word distributions using the pointer-generator network $g_\theta$:

$$p_\theta(r_i \mid z_i, r_{-i}) = \prod_{t=1}^{T_i} g_\theta\left(r_i^t \mid s_i^t, c_i^t, w_i^t, r_{-i}\right).$$

The pointer-generator network computes two internal word distributions that are hierarchically aggregated into one distribution (morin2005hierarchical). One distribution assigns probabilities to words being generated using a fixed vocabulary, and the other to words being copied directly from the other reviews $r_{-i}$. In our case, the network helps to preserve details and, especially, to generate rare tokens.
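The aggregation of the generation and copy distributions can be sketched in the style of the standard pointer-generator formulation, where a scalar gate mixes the two. The function name, argument names, and toy numbers are illustrative assumptions; the exact gating used in the model is described only at a high level in the text.

```python
import numpy as np


def pointer_generator_dist(p_vocab, attn, src_ids, p_gen, ext_vocab_size):
    """Mix a fixed-vocabulary distribution with a copy distribution over source tokens.

    p_vocab: probabilities over the fixed vocabulary (sums to 1).
    attn: attention weights over source positions (sums to 1).
    src_ids: extended-vocabulary id of the token at each source position.
    p_gen: probability of generating from the fixed vocabulary instead of copying.
    """
    final = np.zeros(ext_vocab_size)
    final[: len(p_vocab)] = p_gen * p_vocab
    # np.add.at accumulates correctly even when a token occurs at several source positions.
    np.add.at(final, src_ids, (1.0 - p_gen) * attn)
    return final


# Toy example: fixed vocabulary of 3 words, one out-of-vocabulary source token (id 3).
dist = pointer_generator_dist(
    p_vocab=np.array([0.5, 0.3, 0.2]),
    attn=np.array([0.6, 0.4]),
    src_ids=np.array([1, 3]),
    p_gen=0.7,
    ext_vocab_size=4,
)
```

The out-of-vocabulary token receives probability mass only through the copy path, which is how rare words such as product names can appear in the output.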

3 Summary Generation

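Given the reviews $r_{1:N}$, we generate a summary that reflects common information using trained components of the model. Formally, we could sample a new review from

$$p_\theta(r \mid r_{1:N}) = \mathbb{E}_{c \sim q_\phi(c \mid r_{1:N})}\left[ \mathbb{E}_{z \sim p_\theta(z \mid c)}\left[ p_\theta(r \mid z, r_{1:N}) \right] \right].$$

As we argued in the introduction and will revisit in experiments, a summary, or a ‘summarizing review’, should be generated relying on the mean of the reviews' latent code. Consequently, instead of sampling $z$ from $p_\theta(z \mid c) = \mathcal{N}(z; \mu_\theta(c), I\sigma_\theta(c))$, we set it to $\mu_\theta(c)$. In terms of evaluation metrics, we also found it beneficial not to sample $c$ either, but to rely on the mean predicted by the inference network $q_\phi(c \mid r_{1:N})$. We also found beam search to result in higher quality summaries than sampling from $p_\theta(r \mid z, r_{1:N})$.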
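The difference between sampling a latent code and fixing it at its mean, as done for summary generation, can be made concrete with a small sketch. The function name `latent_code` and the log-scale parameterization are assumptions for illustration.

```python
import numpy as np


def latent_code(mu, log_sigma, sample=False, rng=None):
    """Return the mean of a diagonal Gaussian (summary mode) or a sample from it (review mode)."""
    if not sample:
        return mu                                  # minimal novelty: use the mean
    if rng is None:
        rng = np.random.default_rng()
    return mu + np.exp(log_sigma) * rng.standard_normal(mu.shape)


mu_z = np.array([0.2, -1.0, 0.5])
log_sigma_z = np.zeros(3)
z_summary = latent_code(mu_z, log_sigma_z)                      # used at test time
z_review = latent_code(mu_z, log_sigma_z, sample=True,
                       rng=np.random.default_rng(0))            # adds review-specific novelty
```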

4 Experimental Setup

4.1 Datasets

Our experiments were conducted on customer online reviews about businesses from the Yelp Dataset Challenge and on Amazon product reviews (he2016ups). The datasets present different challenges to abstractive summarization systems. With Yelp, a summarization system needs to properly abstract away detailed information, and distill a wide range of controversial opinions often expressed in a highly informal way. For instance, we would expect a good summary to discard details about a wide range of menu items mentioned in the reviews (e.g., exact prices, sauces, toppings), and distill common opinions about the menu overall. In contrast, on Amazon, users often provide extra details that are important for decision making and thus should be preserved. For example, for the review “The player has no USB ports, and the power cable gets too hot and starts to melt”, we would prefer not to generate a phrase about attributes overall, but instead to preserve the specifics. In other words, a good summary is expected to operate at different levels of abstraction on the two datasets.

We used pre-processing similar to that of chu2019meansum. Specifically, we selected only businesses and products with a minimum of 10 reviews, and reviews with a minimum and maximum length of 20 and 70 words, respectively; overly popular groups above a percentile threshold were removed. Each group was set to contain 8 reviews during training. The final data statistics on Yelp were comparable to those of chu2019meansum and are shown in Table 2. From the Amazon dataset, we selected 4 categories: Electronics; Clothing, Shoes and Jewelry; Home and Kitchen; Health and Personal Care. The final statistics are shown in Table 3.
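The filtering and grouping steps above can be sketched as follows. The function names, the whitespace tokenization, and the order of operations (length filter first, then the minimum-count check) are assumptions for illustration; the removal of overly popular groups is omitted.

```python
def filter_groups(groups, min_reviews=10, min_len=20, max_len=70):
    """Keep reviews within the length range, then keep products with enough remaining reviews."""
    kept = {}
    for product, reviews in groups.items():
        ok = [r for r in reviews if min_len <= len(r.split()) <= max_len]
        if len(ok) >= min_reviews:
            kept[product] = ok
    return kept


def chunk(reviews, size=8):
    """Split a product's reviews into consecutive training groups of exactly `size` reviews."""
    return [reviews[i:i + size] for i in range(0, len(reviews) - size + 1, size)]


# Toy data: product "a" has 10 usable reviews, product "b" only 9.
toy = {
    "a": [" ".join(["word"] * 25)] * 10,
    "b": [" ".join(["word"] * 25)] * 9,
}
kept = filter_groups(toy)
training_groups = chunk(list(range(17)), size=8)
```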

             Training     Validation
Businesses   38,776       4,311
Reviews      1,012,290    113,373
Table 2: Yelp data statistics after the pre-processing.

             Training     Validation
Products     183,103      9,639
Reviews      4,566,519    240,819
Table 3: Amazon data statistics after the pre-processing.

For evaluation on Yelp, we used 200 human-generated abstractive summaries, each based on 8 reviews (chu2019meansum); the set was split in half for validation and testing. For Amazon, we used Amazon Mechanical Turk to create 3 summaries for each of 60 products, each based on 8 Amazon reviews. We used 28 products for development and 32 for testing. A summary of the creation process is given in Appendix A.4.

4.2 Experimental Details

For sequential encoding and decoding, we used GRUs (cho2014learning) with 600-dimensional hidden states. The word embedding dimension was set to 200, and the embeddings were shared across the model (press2016using). The vocabulary consisted of the 50,000 most frequent words, with an extra 30,000 allowed in the extended vocabulary; all words were lower-cased. We used Moses' (koehn2007moses) reversible tokenizer and truecaser. Xavier uniform initialization (glorot2010understanding) was used for 2D weights, and 1D weights were initialized with scaled normal noise. We used the Adam optimizer (kingma2014adam), and set the learning rate to 0.0008 and 0.0001 on Yelp and Amazon, respectively. For summary decoding, we used length-normalized beam search of size 5, and relied on latent code means. In order to overcome ‘posterior collapse’ (bowman2015generating), we applied cyclical annealing (liu2019cyclical) to both the $z$- and $c$-related KL terms in Equation 3, with a new cycle starting approximately every 2 epochs over the training set. The maximum annealing scalar was set to 1 for the $z$-related KL term on both datasets, and to 0.3 and 0.65 for the $c$-related KL term on Yelp and Amazon, respectively. For ROUGE computations we used the script provided by chu2019meansum, and the reported automatic metrics are based on F1.
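A cyclical KL annealing schedule of the kind referred to above can be sketched as a weight that ramps up linearly at the start of each cycle and then stays at its maximum. The function name and the `ramp_ratio` parameter (the fraction of a cycle spent ramping) are hypothetical; the value used in the experiments is not recoverable from the text, so the numbers below are placeholders.

```python
def cyclical_beta(step, cycle_len, ramp_ratio, max_beta=1.0):
    """KL weight at a training step: linear ramp for `ramp_ratio` of each cycle, then constant."""
    tau = (step % cycle_len) / cycle_len           # position within the current cycle, in [0, 1)
    return max_beta * min(1.0, tau / ramp_ratio)


# Placeholder settings: cycles of 100 steps, ramping during the first half of each cycle.
schedule = [cyclical_beta(s, cycle_len=100, ramp_ratio=0.5) for s in (0, 25, 50, 75, 100)]
capped = cyclical_beta(50, cycle_len=100, ramp_ratio=0.5, max_beta=0.3)
```

The `max_beta` argument plays the role of the maximum annealing scalar, which differs between the two KL terms in the setup described above.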

The dimensions of the variables $c$ and $z$ were set to 600, and the scoring neural network of the $c$ posterior had a 300-dimensional hidden layer and the tanh non-linearity.

The decoder’s attention mechanism used a single layer neural network with a 200-dimensional hidden layer, and the tanh non-linearity. The copy gate in the pointer-generator network was computed with a 100-dimensional single-hidden layer network, with the same non-linearity.

4.3 Baseline Models

Below we describe the models that were used as the baselines.

The Opinosis model is a graph-based abstractive summarizer (ganesan2010opinosis) that is designed for the generation of short opinions based on high redundancy of text. Though it is referred to as abstractive, it can only select words from the reviews.

LexRank is an unsupervised approach to text summarization based on graph-based centrality scoring of sentences similar to PageRank (page1999pagerank), and is closely related to tf-idf information retrieval. The main idea is that if one sentence is very similar to many others, it will likely be a sentence of great importance.

The MeanSum model is the unsupervised abstractive summarization method (chu2019meansum) discussed in the introduction. (We used the checkpoint provided by the authors on Yelp, as we obtained very similar ROUGE scores when the model had been retrained.)

We also treated reviews of a group as independent from each other and trained a VAE model with the GRU encoder and decoder. For the summary decoding, we averaged the means of the posterior distributions of the reviews.

Also, we computed the clustroid review for each group as follows. We took each review from a group and computed ROUGE-L with respect to all other reviews. The review that was the most representative of the group in terms of the highest ROUGE score was selected as the clustroid review.
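The clustroid selection can be sketched with a simplified ROUGE-L based on the longest common subsequence (LCS) of whitespace tokens. This is an illustrative approximation rather than the evaluation script used in the experiments, and averaging the F1 scores over the other reviews is an assumption about how the per-review scores are aggregated.

```python
def lcs_len(a, b):
    """Length of the longest common subsequence of two token lists (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """Simplified ROUGE-L F1 over whitespace tokens."""
    c, r = candidate.split(), reference.split()
    lcs = lcs_len(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)


def clustroid(reviews):
    """Index of the review with the highest average ROUGE-L F1 against all other reviews."""
    def avg_score(i):
        others = [r for j, r in enumerate(reviews) if j != i]
        return sum(rouge_l_f1(reviews[i], r) for r in others) / len(others)
    return max(range(len(reviews)), key=avg_score)


toy_reviews = ["good food", "good food and service", "good service"]
best = clustroid(toy_reviews)
```

The oracle baseline can reuse `rouge_l_f1` by scoring each review against the gold summary instead of the other reviews.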

Similar to the clustroid review, we compared each review in a group with the actual summary using ROUGE-L. The review that had the best score was selected. We refer to this method as the oracle review as the selection is based on the actual summary. Furthermore, we selected a random review from each group to be used as the summary. Also, we constructed the summary by selecting the leading sentences from each review of a group.

5 Evaluation Results

5.1 Automatic Evaluation

            R1       R2       RL
Ours        0.2947   0.0526   0.1809
MeanSum     0.2846   0.0366   0.1557
LexRank     0.2501   0.0362   0.1467
Opinosis    0.2488   0.0278   0.1409
VAE         0.2542   0.0311   0.1504
Clustroid   0.2628   0.0348   0.1536
Lead        0.2634   0.0372   0.1386
Random      0.2304   0.0244   0.1344
Oracle      0.2907   0.0527   0.1863
Table 4: ROUGE scores on the Yelp gold summary test set.

            R1       R2       RL
Ours        0.3197   0.0581   0.2016
MeanSum     0.2920   0.0470   0.1815
LexRank     0.2874   0.0547   0.1675
Opinosis    0.2842   0.0457   0.1550
VAE         0.2287   0.0275   0.1446
Clustroid   0.2928   0.0441   0.1778
Lead        0.3032   0.0590   0.1578
Random      0.2766   0.0472   0.1695
Oracle      0.3398   0.0788   0.2160
Table 5: ROUGE scores on the Amazon gold summary test set.

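The evaluation results on the Yelp and Amazon gold summaries from the test sets are shown in Tables 4 and 5, respectively. Among the non-oracle methods, our full model yields the highest scores on all three metrics on Yelp, and the highest R1 and RL on Amazon, where only Lead is marginally higher on R2 (0.0590 vs. 0.0581).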

We observed that the Amazon reviews are very broad in terms of micro categories of products. We speculate that the vanilla VAE struggles to properly represent the variety of micro categories under a single restrictive latent space prior, as is evident from its gap to the other models on Amazon. Consequently, the averaged latent code can correspond to a semantically similar yet functionally different micro category. For example, reviews about a sweater can result in a summary about socks. Our model, on the other hand, allows each group to have its own conditional prior and gives the decoder access to other reviews, which yields a significant improvement over the VAE that is evident on both datasets, but especially on Amazon. Finally, qualitatively, we observed that the summaries produced by MeanSum are relatively fluent on the individual sentence level, but more often contain hallucinations, i.e. information that is not present in the input reviews. Example summaries from our final model and baselines are provided in the Appendix.

5.2 Human Evaluation

5.2.1 Best-Worst Scaling

Additionally, we performed a human evaluation using the Mechanical Turk platform. We sampled 50 businesses from the human-annotated Yelp test set and used all 32 test products from the Amazon set, and assigned 3 workers to evaluate each tuple containing summaries produced by MeanSum, Ours, LexRank, and the human annotators. The reviews and summaries were presented to the workers in random order and were judged using Best-Worst Scaling (louviere1991best; louviere2015best), which has been shown to produce more reliable results than rating scales (kiritchenko2017capturing). The truncated judgment criteria are described below, and the full ones are described in Appendix A.3, where the non-redundancy and coherence criteria were taken directly from dang2005overview.

Fluency: the summary sentences should be grammatically correct, easy to read and understand; Coherence: the summary should be well structured and well organized; Non-redundancy: there should be no unnecessary repetition in the summary; Opinion consensus: the summary should reflect common opinions expressed in the reviews; Overall: based on your own criteria (judgment) please select the best and the worst summary of the reviews.

For every criterion, a system’s score is computed as the percentage of times it was selected as best minus the percentage of times it was selected as worst (orme2009maxdiff). The scores range from -1 (unanimously worst) to +1 (unanimously best).
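The Best-Worst Scaling score described above can be computed with a few lines of code. The function name and the toy judgments are illustrative; each judgment lists the systems shown, the one picked as best, and the one picked as worst.

```python
from collections import Counter


def best_worst_scores(judgments):
    """Score per system: (times chosen best - times chosen worst) / times shown."""
    best, worst, shown = Counter(), Counter(), Counter()
    for options, picked_best, picked_worst in judgments:
        shown.update(options)
        best[picked_best] += 1
        worst[picked_worst] += 1
    return {s: (best[s] - worst[s]) / shown[s] for s in shown}


# Toy judgments from two workers over the same three systems.
toy_judgments = [
    (("A", "B", "C"), "A", "C"),
    (("A", "B", "C"), "A", "B"),
]
scores = best_worst_scores(toy_judgments)
```

A system that is always picked as best scores +1, and one that is always picked as worst scores -1.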

          Fluency   Coherence   Non-Red.   Opinion Cons.   Overall
Ours      0.5802    0.5161      0.4722     -0.0909         0.3818
MeanSum   -0.5294   -0.4857     0.0270     -0.6235         -0.7468
LexRank   -0.7662   -0.8293     -0.7699    0.3500          -0.5278
Gold      0.6486    0.8140      0.6667     0.3750          0.8085
Table 6: Human evaluation results in terms of Best-Worst Scaling on the Yelp dataset.

          Fluency   Coherence   Non-Red.   Opinion Cons.   Overall
Ours      0.4444    0.3750      0.0270     -0.4286         -0.1429
MeanSum   -0.6410   -0.8667     -0.6923    -0.7736         -0.8305
LexRank   -0.2963   -0.3208     -0.3962    0.4348          0.1064
Gold      0.3968    0.7097      0.7460     0.6207          0.7231
Table 7: Human evaluation results in terms of Best-Worst Scaling on the Amazon dataset.

As shown in Table 6, our model scores higher than the other systems on all criteria except opinion consensus, where LexRank scores higher, indicating better quality of the summaries and a human preference for them. According to a post-hoc Tukey HSD test, our scores differ significantly from those of the other approaches on all criteria, while the difference from Gold on fluency is not significant.

The results on Amazon are shown in Table 7, where our system outperforms the other methods in terms of fluency, coherence, and non-redundancy. However, LexRank shows an advantage on opinion consensus, as on Yelp, and a slightly better result on the overall criterion. According to a post-hoc Tukey HSD test, our scores are again significantly different on all criteria.

Opinion consensus is a criterion that captures recall over common opinions, and we believe that it plays different roles on the two datasets. On Yelp, LexRank is not preferred over our approach even though it extracts more opinions from the input. On the other hand, while its opinion consensus score remains similar, LexRank received a much higher overall preference on Amazon than on Yelp. We believe this is related to the fact that a breadth of details is more important on Amazon, which, as we observed, is challenging for both abstractive systems.

5.2.2 Content Support

The ROUGE metric relies on unweighted n-gram overlap and can be insensitive to generation of facts and entities that are not present in the reviews. For example, a summarizer might produce a summary text that mentions a burger restaurant instead of a Chinese one, and its location being Toronto instead of Seattle, while the ROUGE scores will be marginally affected. However, users might experience a strong aversion to the summary containing information that is not reflected in the reviews.

To investigate how well the content of the produced summaries is supported by the input reviews, we performed a second human evaluation study. We used the same sets of Yelp businesses and Amazon products as in the human evaluation in Section 5.2.1, and split MeanSum's and our system's summaries into sentences. Then, we assigned 3 Mechanical Turk workers to each sentence to evaluate its content support with respect to the reviews. The workers were advised to rate sentences using one of the following three options.

Full support: all the content is reflected in the reviews; Partial support: only some content is reflected in the reviews; No support: content is not reflected in the reviews.

          Full     Partial   No
Ours      0.4450   0.3248    0.2301
MeanSum   0.2841   0.3066    0.4092
Table 8: Yelp summary sentence content support with respect to the input reviews; the results are normalized.

          Full     Partial   No
Ours      0.3823   0.3395    0.2783
MeanSum   0.2441   0.3123    0.4436
Table 9: Amazon summary sentence content support with respect to the input reviews; the results are normalized.

The normalized results are shown in Tables 8 and 9 for Yelp and Amazon, respectively. They indicate that our system generates summaries that better preserve information from the input reviews. These results are consistent with our summaries being significantly preferred over MeanSum's in the Best-Worst Scaling study.

6 Analysis

6.1 Copy Mechanism

We analyzed which words the full model prefers to copy during summary generation on the Amazon dataset. Generally, the model copies around 3-4 tokens per summary, which are of medium or low frequency. We observed a tendency to copy product-type specific words (e.g., shoes and laptop), as well as exact product brands and names.

6.2 Ablations

               R1       R2       RL
w/o $r_{-i}$   0.2866   0.0454   0.1863
w/o $c$        0.2767   0.0507   0.1919
w/o $z$        0.2926   0.0416   0.1739
Sampling       0.2563   0.0434   0.1716
Full           0.3197   0.0581   0.2016
Table 10: Ablated models' ROUGE scores on the Amazon gold summary test set. The Sampling row corresponds to sampling $z$ and $c$ as opposed to using their mean values.

To gain insight into the importance of the individual components of the model, we ablated the model by removing the latent variables ($z$ and $c$, one at a time), and the attention mechanism over the other reviews of the group ($r_{-i}$). The models were re-trained on the Amazon dataset. The results are shown in Table 10. They indicate that all components play a role, yet we observed that the most significant difference in terms of ROUGE was achieved when the variable $z$ was removed, and only $c$ remained. Visually, summaries obtained from the latter system are wordier and look more like reviews, which is consequently reflected in the lower scores. The model without the attention (w/o $r_{-i}$) often produces summaries that are less detailed than those of the full model because of the inability to directly copy details from the input. Finally, the smallest visual and ROUGE-L quality drop was observed when the variable $c$ was removed.

We hypothesized in the introduction that using latent variable means would result in our model generating summarizing text, whereas sampling would result in texts including many novel and potentially irrelevant details. In order to empirically test this hypothesis, we sampled the latent variables during summary generation, as opposed to using mean values (see Section 3). As we hypothesized, we observed that the summaries were wordier, less fluent, and less aligned to the input reviews, as is also reflected in the ROUGE scores shown in Table 10.

7 Related Work

Recently, extractive opinion summarization was tackled using a pipeline of weakly- and semi-supervised models (angelidis2018summarizing). First, the authors parse product reviews into segments, then learn to assign their sentiment polarity in a weakly-supervised fashion. Then, they induce aspect labels for segments relying on a small sample of gold summaries. Finally, they use a heuristic to construct a summarizing set of segments. Opinion summarization in Opinosis (ganesan2010opinosis) was addressed from an unsupervised perspective. The model relies on redundancies in opinionated text and POS tags in order to generate short opinions. However, the model is not well suited for the generation of coherent long summaries. Also, though the authors position their approach as abstractive, it combines fragments of input text and cannot generate novel words and phrases. The extractive model LexRank (erkan2004lexrank) builds a sentence graph in order to determine the importance of sentences, and then selects a number of the most representative ones as a summary. This approach is conceptually similar to the proposed consensus opinion summarization. isonuma2019unsupervised introduce an unsupervised approach to single review summarization, where they rely on latent discourse trees. As we already discussed above, the most related approach is MeanSum (chu2019meansum), which treats a summary as a discrete latent state of an autoencoder. In contrast, we define a hierarchical generative model of a review collection and use continuous latent codes.

8 Conclusions

In this work, we presented an abstractive summarizer of opinions, which does not use any summaries in training and is trained end-to-end on a large collection of reviews. The model compares favorably to the competitors, especially to the only other unsupervised abstractive multi-review summarization system. Also, we performed a human evaluation of the content of the generated summaries by considering their alignment with the reviews. We show that summaries produced by our model better reflect the content of the input reviews.


We thank Stefanos Angelidis for help with the data as well as Jonathan Mallinson, Serhii Havrylov, and other members of Edinburgh NLP group for discussion. The project was supported by the European Research Council (ERC Starting Grant 678254) and by the Dutch National Science Foundation (NWO VIDI 639.022.518).


Appendix A Appendices

A.1 Derivation of the Lower Bound

To make the notation below less cluttered, we make a couple of simplifications: and .


A.2 Human Evaluation Setup

To perform the human evaluation experiments described in Sections 5.2.1 and 5.2.2, we combined both tasks into single Human Intelligence Tasks (HITs). Namely, the workers first marked sentences as described in Section 5.2.2 and then proceeded to the task in Section 5.2.1. We explicitly asked them to re-read the reviews before each task.

As worker requirements, we set a 98% approval rate, 1000+ HITs, a location in the USA, UK, or Canada, and the maximum score on a qualification test that we designed. The test asked whether the workers were native English speakers and verified that they correctly understood the instructions of both tasks by having them complete a mini version of the actual HIT.

A.3 Full Human Evaluation Instructions

  • Fluency: The summary sentences should be grammatically correct, easy to read and understand.

  • Coherence: The summary should be well structured and well organized. The summary should not just be a heap of related information, but should build from sentence to sentence to a coherent body of information about a topic.

  • Non-redundancy: There should be no unnecessary repetition in the summary. Unnecessary repetition might take the form of whole sentences that are repeated, or repeated facts, or the repeated use of a noun or noun phrase (e.g., “Bill Clinton”) when a pronoun (“he”) would suffice.

  • Opinion consensus: The summary should reflect common opinions expressed in the reviews. For example, if many reviewers complain about a musty smell in the hotel’s rooms, the summary should include this information.

  • Overall: Based on your own criteria (judgment) please select the best and the worst summary of the reviews.

A.4 Amazon Summaries Creation

First, we sampled 15 products from each of the Amazon review categories: Electronics; Clothing, Shoes and Jewelry; Home and Kitchen; Health and Personal Care. Then, we selected 8 reviews from each product to be summarized. We used the same worker requirements as for the human evaluation in A.2. We assigned 3 workers to each product, and instructed them to read the reviews and produce a summary text. We followed the instructions provided in (chu2019meansum), and used the following points in our instructions:

  • The summary should reflect common opinions about the product expressed in the reviews. Try to preserve the common sentiment of the opinions and their details (e.g. what exactly the users like or dislike). For example, if most reviews are negative about the sound quality, then also write negatively about it. Please make the summary coherent and fluent in terms of sentence and information structure. Iterate over the written summary multiple times to improve it, and re-read the reviews whenever necessary.

  • Please write your summary as if it were a review itself, e.g. ‘This place is expensive’ instead of ‘Users thought this place was expensive’. Keep the length of the summary reasonably close to the average length of the reviews.

  • Please try to write the summary using your own words instead of copying text directly from the reviews. Using the exact words from the reviews is allowed, but do not copy more than 5 consecutive words from a review.
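The limit on copied text in the last instruction can be checked automatically. The sketch below is only an illustration of such a check, not a description of how the written summaries were actually validated; it assumes whitespace tokenization and case-insensitive matching, and the function names are hypothetical.

```python
def longest_copied_run(summary, review):
    """Length of the longest run of consecutive words shared by both texts."""
    s = summary.lower().split()
    r = review.lower().split()
    best = 0
    prev = [0] * (len(r) + 1)
    for i in range(1, len(s) + 1):
        cur = [0] * (len(r) + 1)
        for j in range(1, len(r) + 1):
            if s[i - 1] == r[j - 1]:
                # Extend the run that ended at the previous word pair.
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best


def violates_copy_limit(summary, reviews, max_run=5):
    """True if the summary copies more than max_run consecutive words."""
    return any(longest_copied_run(summary, r) > max_run for r in reviews)


# Hypothetical example: six consecutive words are shared with the review.
summary = "it works great with my kindle fire"
reviews = ["I think it works great with my kindle"]
flagged = violates_copy_limit(summary, reviews)
```

The check compares the summary against each review separately, so reusing short phrases from several reviews is still allowed as long as no single run exceeds the limit.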

A.5 Latent Codes Analysis

mean Bought this for my Kindle Fire HD and it works great. I have had no problems with it. I would recommend it to anyone looking for a good quality cable.
sample 1 Works fine with my Kindle Fire HD 8.9”. The picture quality is very good, but it doesn’t work as well as the picture. I’m not sure how long it will last, but i am very disappointed.
sample 2 This is a great product. I bought it to use with my Kindle Fire HD and it works great. I would recommend it to anyone who is looking for a good quality cable for the price.
sample 3 Good product, does what it is supposed to do. I would recommend it to anyone looking for a HDMI cable.
Rev 1 Love this HDMI cable , but it only works with HD Kindle and not the HDX Kindle which makes me kinda crazy . I have both kinds of Kindles but the HDX is newer and I can ’t get a cable for the new one . I guess my HD Kindle will be my Amazon Prime Kindle . It works great !
Rev 2 I got a kindle for Christmas . I had no idea how to work one etc . I discovered you can stream movies to your tv and this is the exact cable for it . Works great and seems like its good quality . A bit long though.
Rev 3 this is great for watching movies from kindle to tv . Now the whole family can enjoy rather than one person at a time . Picture quality isn ’t amazing , but it ’s good .
Rev 4 I just received this wire in the mail , and it does not work in the slightest . I am very displeased with this product .
Rev 5 Works great ! ! Now I can watch Netflix on my TV with my Kindle Fire HD … I love it and so will you !
Rev 6 Works awesome . Great item for the price.Got it very quickly . Was as described in the ad.Exactly what I was looking for.
Rev 7 I plugged it into my Kindle fire HD and into the TV and works perfectly . Have had no problems with it !
Rev 8 This is just what I was looking for to connect my Kindle Fire to view on our TV ! Great price too!
Table 11: Amazon summaries of the full model with sampled and mean assignments to the review latent code. The assignment to the product latent code was fixed to its mean value under the approximate posterior.

We performed a qualitative analysis of the review latent variable to shed additional light on what it stores and on the sensitivity of the decoder to its input. Specifically, we computed the mean value of the product latent variable using the approximate posterior, and then sampled the review latent variable from the prior.

First, we observed that the summaries produced using the mean of the review latent code are more fluent. For example, in Table 11, the summary based on the first sample states: “The picture quality is very good, but it doesn’t work as well as the picture.”, where the second clause could be phrased more fluently. Also, we found that mean-based summaries contain fewer details that are only partially supported or not supported at all by the reviews. For example, in the table, the summary based on the first sample mentions Kindle Fire HD 8.9”, while the dimension is never mentioned in the reviews. Finally, different samples were observed to result in texts that contain different details about the reviews. For example, the first sample results in a summary that captures the picture quality, while the second captures that the item is good for its price. Overall, we observed that the latent variable stores content-based information that results in syntactically diverse texts which nevertheless reflect information about the same business or product.

Ours This place is the best Mexican restaurant i have ever been to. The food was delicious and the staff was very friendly and helpful. Our server was very attentive and made sure we were taken care of. We’ll be back for sure.
MeanSum A little on the pricey side but I was pleasantly surprised. We went there for a late lunch and it was packed with a great atmosphere, food was delicious and the staff was super friendly. Very friendly staff. We had the enchiladas with a few extra veggies and they were delicious! Will be back for sure!
LexRank We will definitely be going back for more great food! Everything we had so far was great. The staff was great and so nice! Good food! Great atmosphere!
Gold This place is simply amazing! Its the best Mexican spot in town. Their tacos are delicious and full of flavor. They also have chips and salsa that is to die for! The salsa is just delectable! It has a sweet, tangy flavor that you can’t find anywhere else. I highly recommend!
Rev 1 Classic style Mexican food done nicely! Yummy crispy cheese crisp with a limey margarita will will win my heart any day of the week! The classic frozen with a chambord float is my favorite and they do it well here.The salad carbon was off the chain- served on a big platter and worked for me as 2 full dinners.
Rev 2 For delicious Mexican food in north Phoenix, try La Pinata. This was our visit here and we were so stunned by the speed in which our food was prepared that we were sure it was meant for another table. The food was hot and fresh and well within our budget. My husband got a beef chimichanga and I got bean and cheese burrito, which we both enjoyed. Chips and salsa arrived immediately; the salsa tastes sweeter than most and is equally flavorful. We will be back!
Rev 3 Good food! Great atmosphere! Great patio. Staff was super friendly and accommodating! We will definately return!
Rev 4 This place was very delicious! I got the ranchero burro and it was so good. The plate could feed at least two people. The staff was great and so nice! I also got the fried ice cream it was good. I would recommend this place to all my friends.
Rev 5 We arrive for the first time, greeted immediately with a smile and seated promptly. Our server was fantastic, he was funny and fast. Gave great suggestions on the menu and we both were very pleased with the food, flavors, speed and accuracy of our orders. We will definitely be going back for more great food!
Rev 6 Well was very disappointed to see out favorite ice cream parlor closed but delightfully surprised at how much we like this spot!!Service was FANTASTIC TOP notch!! Taco was great lots of cheese. Freshly deep fried shell not like SO MANY Phoenix mex restaurants use! Enchilada was very good. My wife really enjoyed her chimichanga. My moms chilli reanno was great too. Everything we had so far was great. We will return. Highly recommended.
Rev 7 I’m only on the salsa and it’s just as fabulous as always. I love the new location and the decor is beautiful. Open 5 days and the place is standing room only. To the previous negative commentor, they are way took busy to fill an order for beans. Go across the street….you’ll be angry lol.
Rev 8 I just tried to make a reservation for 15 people in March at 11 am on a Tuesday and was informed by a very rude female. She said ”we do not take reservations” and I asked if they would for 15 people and she said ” I told you we don’t take reservations” and hung up on me. Is that the way you run a business? Very poor customer service and I have no intentions of ever coming there or recommending it to my friends.
Table 12: Yelp summaries produced by different models.
Ours This place is the worst service I’ve ever had. The food was mediocre at best. The service was slow and the waiter was very rude. I would not recommend this place to anyone who wants to have a good time at this location.
MeanSum I love the decor, but the food was mediocre. Service is slow and we had to ask for refills. They were not able to do anything and not even charge me for it. It was a very disappointing experience and the service was not good at all. I had to ask for a salad for a few minutes and the waitress said he didn’t know what he was talking about. All I can say is that the staff was nice and attentive. I would have given 5 stars if I could.
LexRank Food was just okay, server was just okay. The atmosphere was great, friendly server. It took a bit long to get a server to come over and then it took our server a while to get our bread and drinks. However there was complementary bread served.The Pizza I ordered was undercooked and had very little sauce.Macaroni Grill has unfortunately taken a dive. Went to dinner with 4 others and had another bad experience at the Macaroni Grill.
Gold I’m really not a fan of Macaroni Grill, well, at least THIS Macaroni Grill. The staff is slow and really doesn’t seem to car about providing quality service. It took well over 30 minutes to get my food and the place wasn’t even packed with people. I ordered pizza and it didn’t taste right. I think it wasn’t fully cooked. I won’t be coming back.
Rev 1 10/22/2011 was the date of our visit. Food was just okay, server was just okay. The manager climbed up on the food prep counter to fix a light. We felt like that was the most unsanitary thing anyone could do - he could have just come from the restroom for all we knew. Needless to say, lackluster service, mediocre food and lack of concern for the cleanliness of the food prep area will guarantee we will NEVER return.
Rev 2 We like the food and prices are reasonable. Our biggest complaint is the service. It took a bit long to get a server to come over and then it took our server a while to get our bread and drinks. They really need to develop a better sense of teamwork. While waiting for things there were numerous servers standing around gabbing. It really gave us the impression of ”Not my table.” ”Not my problem.” Only other complaint is they need to get some rinse aid for the dishwasher. I had to dry our bread plates when the hostess gave them to us.
Rev 3 Not enough staff is on hand the two times I have been in to properly pay attention to paying customers. I agree that the portions have shrunk over the years, and the effort is no longer there. It is convenient to have nearby but not worth my time when other great restaurants are around. Wish I could rate it better but it’s just not that good at all.
Rev 4 Went to dinner with 4 others and had another bad experience at the Macaroni Grill. When will we ever learn? The server was not only inattentive, but p o’d when we asked to be moved to another table. When the food came it was at best, luke warm. They had run out of one of our ordered dishes, but didn’t inform us until 20 minutes after we had ordered. Running out at 6:00 p.m.: Really? More delay and no apologies. There is no excuse for a cold meal and poor service. We will not go back since the Grill seems not to care and there are plenty of other restaurants which do.
Rev 5 The service is kind and friendly. However there was complementary bread served.The Pizza I ordered was undercooked and had very little sauce.Macaroni Grill has unfortunately taken a dive. Best to avoid the place or at the very least this location.
Rev 6 I know this is a chain, but Between this and Olive Garden, I would def pick this place. Service was great at this location and food not bad at all, although not excellent, I think it still deserves a good 4 stars
Rev 7 I had a 2 for 1 $9.00 express dinner coupon so we order up 2 dinners to go. The deal was 9 min or its free, it took 20, but since I was getting 2 meals for $9.00 I did not make a fuss. The actual pasta was fine and amount was fair but it had maybe a 1/4 of a chicken breast. The chicken tasted like it came from Taco Bell, VERY processed. The sauce straight from a can. I have had much better frozen dinners. My husband and I used to like Macaroni Grill it sad too see its food go so down hill.
Rev 8 The atmosphere was great, friendly server. Although the food I think is served from frozen. I ordered mama trio. The two of three items were great. Plate came out hot, couldn’t touch it. Went to eat lasagna and was ice cold in the center, nit even warm. The server apologized about it offered new one or reheat this one. I chose a new one to go. I saw her go tell manager. The manager didn’t even come over and say anything. I was not even acknowledged on my way out and walked past 3 people. I will not be going back. Over priced for frozen food.
Table 13: Yelp summaries produced by different models.
Ours My wife and i have been here several times now and have never had a bad meal. The service is impeccable, and the food is delicious. We had the steak and lobster, which was delicious. I would highly recommend this place to anyone looking for a good meal.
MeanSum Our first time here, the restaurant is very clean and has a great ambiance. I had the filet mignon with a side of mashed potatoes. They were both tasty and filling. I’ve had better at a chain restaurant, but this is a great place to go for a nice dinner or a snack. Have eaten at the restaurant several times and have never had a bad meal here.
LexRank Had the filet… Really enjoyed my filet and lobster. In addition to excellent drinks, they offer free prime filet steak sandwiches. I have had their filet mignon which is pretty good, calamari which is ok, scallops which aren’t really my thing, sour dough bread which was fantastic, amazing stuffed mushrooms. Very good steak house.
Gold The steak is the must have dish at this restaurant. One small problem with the steak is that you want to order it cooked less than you would at a normal restaurant. They have the habit of going a bit over on the steak. The drinks are excellent and the stuffed mushrooms as appetizers were amazing. This is a classy place that is also romantic. The staff pays good attention to you here.
Rev 1 The ambiance is relaxing, yet refined. The service is always good. The steak was good, although not cooked to the correct temperature which is surprising for a steakhouse. I would recommend ordering for a lesser cook than what you normally order. I typically order medium, but at donovan’s would get medium rare. The side dish menu was somewhat limited, but we chose the creamed spinach and asparagus, both were good. Of course, you have to try the creme brulee - Yum!
Rev 2 Hadn’t been there in several years and after this visit I remember why, I don’t like onions or shallots in my macaroni and cheese. The food is good but not worth the price just a very disappointing experience and I probably won’t go back
Rev 3 My wife and I come here every year for our anniversary (literally every year we have been married). The service is exceptional and the food quality is top-notch. Furthermore, the happy hour is one of the best in the Valley. In addition to excellent drinks, they offer free prime filet steak sandwiches. I highly recommend this place for celebrations or a nice dinner out.
Rev 4 I get to go here about once a month for educational dinners. I have never paid so don’t ask about pricing. I have had their filet mignon which is pretty good, calamari which is ok, scallops which aren’t really my thing, sour dough bread which was fantastic, amazing stuffed mushrooms. The vegetables are perfectly cooked and the mashed potatoes are great. At the end we get the chocolate mousse cake that really ends the night well. I have enjoyed every meal I have eaten there.
Rev 5 Very good steak house. Steaks are high quality and the service was very professional. Attentive, but not hovering. Classic menus and atmosphere for this kind of restaurant. No surprises. A solid option, but not a clear favorite compared to other restaurants in this category.
Rev 6 Had a wonderful experience here last night for restaurant week. Had the filet… Which was amazing and cooked perfectly with their yummy mashed potatoes and veggies. The bottle of red wine they offered for an additional $20 paired perfectly with the dinner. The staff were extremely friendly and attentive. Can’t wait to go back!
Rev 7 The seafood tower must change in selection of seafood, which is good, which is also why mine last night was so fresh fresh delicious. Its good to know that you can get top rate seafood in Phoenix. Bacon wrapped scallops were very good, and I sacrificied a full steak (opting for the filet medallion) to try the scallops. I asked for medium rare steak, but maybe shouldve asked for rare…my cousin had the ribeye and could not have been any happier than he was :) yum for fancy steak houses. Its an ultra romantic place to, fyi.the wait staff is very attentive.
Rev 8 Donovans, how can you go wrong. Had some guests in town and some fantastic steaks paired with some great cabernets. Really enjoyed my filet and lobster.
Table 14: Yelp summaries produced by different models.
Ours I love this tank. It fits well and is comfortable to wear. I wish it was a little bit longer, but I’m sure it will shrink after washing. I would recommend this to anyone.
MeanSum I normally wear a large so it was not what I expected. It’s a bit large but I think it’s a good thing. I’m 5 ’4 ”and the waist fits well. I’m 5 ’7 and this is a bit big.
LexRank I’m 5 ’4 ’and this tank fits like a normal tank top, not any longer. The only reason I’m rating this at two stars is because it is listed as a ’long’ tank top and the photo even shows it going well past the models hips, however I’m short and the tank top is just a normal length. I bought this tank to wear under shirts when it is colder out. I was trying to find a tank that would cover past my hips, so I could wear it with leggings.
Gold Great tank top to wear under my other shirts as I liking layering and the material has a good feel. There was a good choice of colors to pick from. Although, the top is a thin material I don’t mind since I wear it under something else.
Rev 1 The description say it long… NOT so it is average. That’s why I purchased it because it said it was long. This is a basic tank.I washed it and it didn’t warp but did shrink a little. Nothing to brag about.
Rev 2 I’m 5 ’4 ’and this tank fits like a normal tank top, not any longer. I was trying to find a tank that would cover past my hips, so I could wear it with leggings. Don’t order if you’re expecting tunic length.
Rev 3 This shirt is OK if you are layering for sure. It is THIN and runs SMALL. I usually wear a small and read the reviews and ordered a Medium. It fits tight and is NOT long like in the picture. Glad I only purchased one.
Rev 4 The tank fit very well and was comfortbale to wear. The material was thinner than I expected, and I felt it was probably a little over priced. I’ve bought much higher quality tanks for $5 at a local store.
Rev 5 The only reason I’m rating this at two stars is because it is listed as a ’long’ tank top and the photo even shows it going well past the models hips, however I’m short and the tank top is just a normal length.
Rev 6 I usually get them someplace out but they no longer carry them. I thought I would give these a try. I received them fast, although I did order a brown and got a black (which I also needed a black anyway). They were a lot thinner than I like but they are okay.
Rev 7 Every women should own one in every color. They wash well perfect under everything. Perfect alone. As I write I’m waiting on another of the same style to arrive. Just feels quality I don’t know how else to explain it, but I’m sure you get it ladies!
Rev 8 I bought this tank to wear under shirts when it is colder out. I bought one in white and one in an aqua blue color. They are long enough that the color peeks out from under my tops. Looks cute. I do wish that the neck line was a bit higher cut to provide more modest coverage of my chest.
Table 15: Amazon summaries produced by different models.
Ours This is the best acupressure mat I have ever used. I use it for my back pain and it helps to relieve my back pain. I have used it for several months now and it seems to work well. I would recommend it to anyone.
MeanSum I have used this for years and it works great. I have trouble with my knee pain, but it does help me to get the best of my feet. I have had no problems with this product. I have had many compliments on it and is still in great shape.
LexRank I ordered this acupressure mat to see if it would help relieve my back pain and at first it seemed like it wasn’t doing much, but once you use it for a second or third time you can feel the pain relief and it also helps you relax. its great to lay on to relax you after a long day at work. I really like the Acupressure Mat. I usually toss and turn a lot when I sleep, now I use this before I go to bed and it helps relax my body so that I can sleep more sound without all the tossing and turning.
Gold These acupressure mats are used to increase circulation and reduce body aches and pains and are most effective when you can fully relax. Consistence is key to receive the full, relaxing benefits of the product. However, if you are using this product after surgery it is responsible to always consult with your physician to ensure it is right for your situation.
Rev 1 Always consult with your doctor before purchasing any circulation product after surgery. I had ankle surgery and this product is useful for blood circulation in the foot. This increase in circulation has assisted with my ability to feel comfortable stepping down on the foot (only after doc said wait bearing was okay). I use it sitting down barefoot.
Rev 2 I really like the Acupressure Mat. I usually toss and turn a lot when I sleep, now I use this before I go to bed and it helps relax my body so that I can sleep more sound without all the tossing and turning.
Rev 3 I used the mat the first night after it arrived and every-other night since. After 2 ten minute sessions, I am sold. I have slept much better at night - I think it puts me in a more relaxed state, making it easier to fall asleep. A rather inexpensive option to relieving tension in my neck, upper back and shoulders.
Rev 4 This is the best thing! you can use socks if your feet are tender to walk on it or bare foot if you can take it. I use it every morning to walk across to jump start my body. when I think about it I will lay on it, it feels wonderful.
Rev 5 I love these spike mats and have recommended them to everyone that has had any kind of body ache. its great to lay on to relax you after a long day at work. Helps with pain in my back and pain in my legs. Its not a cure, but it sure helps with the healing process.
Rev 6 I wish I hadn’t purchased this item. I just can’t get use to it, it’s not comfortable. I have not seen any benefits from using it but that could be because I don’t relax or use it for long enough.
Rev 7 I run an alternative health center and use Acupressure pin mats from different sources to treat my patients, but this product is the patients choice, they are asking allways for this mat against other brands so I changed all of them for Britta, moreover the S & H was outstanding and really fast.
Rev 8 I ordered this acupressure mat to see if it would help relieve my back pain and at first it seemed like it wasn’t doing much, but once you use it for a second or third time you can feel the pain relief and it also helps you relax. I use it almost everyday now and it really helps. I recommed this product and this seller.
Table 16: Amazon summaries produced by different models.