ExplainIt: Explainable Review Summarization with Opinion Causality Graphs

by Nofar Carmeli, et al.
Megagon Labs

We present ExplainIt, a review summarization system centered around opinion explainability: the simple notion of high-level opinions (e.g., "noisy room") being explainable by lower-level ones (e.g., "loud fridge"). ExplainIt utilizes a combination of supervised and unsupervised components to mine opinion phrases from reviews and organize them in an Opinion Causality Graph (OCG), a novel semi-structured representation which summarizes causal relations. To construct an OCG, we cluster semantically similar opinions into single nodes, thus canonicalizing opinion paraphrases, and draw directed edges between node pairs that are likely connected by a causal relation. OCGs can be used to generate structured summaries at different levels of granularity and for certain aspects of interest, while simultaneously providing explanations. In this paper, we present the system's individual components and evaluate their effectiveness on their respective sub-tasks, where we report substantial improvements over baselines across two domains. Finally, we validate these results with a user study, showing that ExplainIt produces reasonable opinion explanations according to human judges.




1. Introduction

Reviews are omnipresent in e-commerce websites of products or services, and heavily influence customers’ decisions. According to a recent study (https://fanandfuel.com/no-online-customer-reviews-means-big-problems-2017/), the majority of customers read reviews before visiting a business or making a purchase. However, the overwhelming number of reviews makes it difficult for customers to understand the prevalent opinions on their entities of interest without investing significant time in reading reviews.

Many websites summarize what previous customers say about an entity by listing a set of popular opinions found in reviews, thus attempting to minimize the information overload for the customers. Existing review summarization techniques (hu2004aaai; qiu2011coling; liu2012sentiment; pontiki2015semeval; pontiki2016semeval; angelidis2018summarizing; xu2019bert), however, are largely tailored to identifying summary-worthy opinions about a predefined set of aspects, or to shoehorning the opinions on each aspect into a single polarity score that sums up reviewers’ attitudes. There are two major limitations to these types of summaries. First, the predefined set of aspects tends to include only commonly occurring, high-level choices (e.g., “rooms” and “location” for hotel reviews) and may miss tail aspects such as “water pressure” or “thin walls”. Hence, customers may still need to read the reviews to find the relevant discussions on tail aspects. Second, these types of summaries are not trivially explainable. That is, existing summarization techniques cannot explain why the summary of “location” is described as “very good” or rated 8 out of 10. The onus is still on the customers to peruse the reviews to understand why.

Review #1, Score 85, Date: Feb. 30, 2020 Pros: Friendly and helpful staff. Great location. Walking distance to Muni bus stops. Not too far away from Fisherman’s Wharf, Aquatic Park. Cons: extremely noisy room with paper thin walls. Review #2, Score 80, Date: Feb. 30, 2021 Good location close to the wharf, aquatic park and the many other attractions. the rooms are ok but a bit noisy, loud fridge and AC.
Figure 1. Opinion Causality Graph (bottom) based on reviews and their extracted opinion phrases (top). Child opinion nodes explain their parent opinion nodes.

In an effort towards more explainable opinion summarization, we present ExplainIt, a novel summarization system that attempts to overcome the aforementioned limitations. Specifically, ExplainIt summarizes reviews into a novel graph-like representation, the Opinion Causality Graph (OCG). The proposed graph organizes the set of opinion phrases, including tail ones, according to their semantic similarity and pairwise causal relationships. It is a comprehensive and versatile semi-structured representation of the reviews from which one can, among other things, (a) support structured review summarization at different levels of granularity or only on certain aspects of the graph, to tailor to users’ needs, (b) explain the summarized opinions, and (c) generate a fully textual summary, with the help of an additional abstractive summarization system (OpinionDigest). Furthermore, similar to knowledge bases, such graphs can be effectively used for facilitating search over opinion phrases or criteria (Li:2019:Opine).

Figure 1 shows an OCG that is generated from the hotel reviews at the top of the figure. Each node in the graph represents a set of semantically similar opinion phrases. Each opinion phrase consists of an opinion term, followed by an aspect term. For example, in “good location”, “good” is an opinion term and “location” is an aspect term. Each edge represents the causal relationship between opinion phrases. For example, “paper thin walls” explains “extremely noisy room”. With this OCG, summaries can be created for the entire graph (e.g., all aspects of the hotel) or only for portions of the graph, such as which attractions the hotel is in close proximity to. More importantly, this graph allows users or downstream applications to navigate aspects and opinions based on their specific needs and seek explanations for why the summary states that the hotel is “extremely noisy” or why it is in a “great location”. To the best of our knowledge, ExplainIt is the first system that can summarize reviews with the explicit ability to provide explanations.

In this paper, we describe the core challenge in developing ExplainIt, which is to construct an opinion causality graph from a given set of reviews. Once this graph is obtained, one can use data-to-text (konstas2013inducing; wiseman2017challenges) or graph-to-text (Koncel:2019:NAACL; OpinionDigest) models to generate textual summaries from such graphs. It is challenging to construct an opinion graph from reviews for two main reasons. First, the review sentences are inherently noisy, making mining the opinion phrases and the causal relationships between them difficult. Second, all the opinion phrases and their predicted causal relationships need to be integrated into the opinion graph, while taking into account potential inaccuracies, as well as nuanced text characteristics like paraphrasing. We make the following contributions:

  • We develop ExplainIt, a system that generates an opinion graph from a set of reviews about an entity. It (a) mines opinion phrases, (b) determines the causal relationships between them, and (c) canonicalizes semantically similar opinions into opinion clusters. Finally, we combine the inferred causal relationships and opinion clusters to generate the entity’s OCG. To train a causal relationship classifier, we design a crowdsourcing pipeline to acquire labeled data and treat the problem in a supervised manner. To the best of our knowledge, this is the first attempt to predict opinion explanations. Additionally, for our canonicalization component, we developed a fine-tuning model for opinion phrase embeddings to improve the clustering over opinion phrases.

  • We evaluate the performance of ExplainIt through a series of experiments. We experimentally show that our explanation classifier is better than a fine-tuned BERT model (devlin2018bert) and better than a re-trained textual entailment model (parikh-etal-2016-decomposable). We show that fine-tuned opinion phrase embeddings significantly improve existing clustering algorithms. Finally, our user study showed that human judges agreed with the predicted graph edges produced by our system in more than 77% of cases.

  • Our crowdsourced labeled datasets (in the hotel and restaurant domains) for two subtasks (mining causal relationships and canonicalizing semantically similar opinion phrases) that we use for training and evaluation will be made publicly available.

Outline  We give an overview of ExplainIt in Section 2. We present the component for mining causal relationships in Section 3. We demonstrate how we canonicalize similar opinion phrases in Section 4 and how we construct an opinion graph in Section 5. We evaluate ExplainIt in Section 6. We outline related work in Section 7 and conclude this paper in Section 8.

2. System Overview

Figure 2. Overview of ExplainIt.

2.1. Preliminaries

An opinion phrase is a tuple p = (o, a), where o is the opinion term and a is the aspect term that o refers to. For example, the last sentence of Review #1 in Figure 1, “Cons: extremely noisy room with paper thin walls”, contains two opinion phrases: (“extremely noisy”, “room”) and (“paper thin”, “walls”). An explanation is a relationship p_1 → p_2 between two opinion phrases, where p_1 causes or explains p_2. For example, (“paper thin walls” → “extremely noisy room”) and (“close to attractions” → “great location”) are two valid explanations.

Definition: An Opinion Causality Graph G = (V, E) for a set of opinion phrases P is a directed graph such that (1) every opinion phrase p ∈ P belongs to exactly one node v ∈ V, (2) each node consists of semantically consistent opinion phrases, and (3) an edge (u, v) ∈ E represents a causal relationship from u to v. That is, the member phrases of u explain the member phrases of v.

For example, the bottom of Figure 1 depicts an opinion causality graph obtained from the opinion phrases mined from the given reviews. Observe that a perfectly constructed node would only contain phrases which are paraphrases of each other. It follows that two perfect nodes u and v will be connected with an edge if and only if p_u → p_v for every p_u ∈ u and p_v ∈ v, as is the case in the example of Figure 1. In practice, however, we often deal with imperfect nodes, containing merely semantically similar phrases. Consequently, we aim to construct an OCG such that edges are drawn between nodes where a causal relationship is very likely, i.e., when a significant number of phrase-level causal relations exist.
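As a quick illustration, lifting phrase-level causal pairs to cluster-level edges can be sketched in a few lines of Python. The clusters, phrase pairs, and node names below are illustrative stand-ins, not system output:

```python
# Nodes: each cluster groups semantically similar opinion phrases.
nodes = {
    "noisy_room": ["extremely noisy room", "a bit noisy rooms"],
    "thin_walls": ["paper thin walls"],
    "loud_appliances": ["loud fridge and AC"],
}

# Phrase-level causal pairs (cause -> effect) mined from reviews.
phrase_edges = [
    ("paper thin walls", "extremely noisy room"),
    ("loud fridge and AC", "a bit noisy rooms"),
]

def build_ocg(nodes, phrase_edges):
    """Lift phrase-level causal pairs to edges between clusters."""
    phrase_to_node = {p: n for n, ps in nodes.items() for p in ps}
    edges = set()
    for cause, effect in phrase_edges:
        u, v = phrase_to_node[cause], phrase_to_node[effect]
        if u != v:  # ignore intra-cluster pairs
            edges.add((u, v))
    return edges

# Edges: thin_walls -> noisy_room, loud_appliances -> noisy_room
print(build_ocg(nodes, phrase_edges))
```

In practice the system aggregates edge evidence rather than accepting any single phrase pair (Section 5); this sketch shows only the idealized case.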

2.2. System Components

We break down the construction of an opinion causality graph into the four components illustrated in Figure 2, whose high-level descriptions are given below.

Opinion Mining 

The first step is to mine opinion phrases from a set of reviews about an entity. For this, we can leverage Aspect-based Sentiment Analysis (ABSA) models 

(pontiki2015semeval; pontiki2016semeval)

and, in our pipeline, we use an open-source system 

(Li:2019:Opine). The system also predicts the aspect category and sentiment associated with every opinion phrase. As we describe in Section 4, we exploit these additional signals to improve the opinion canonicalization component.

Explanation Mining  Next, ExplainIt mines explanations between pairs of extracted opinion phrases, a task similar to textual entailment (parikh-etal-2016-decomposable). Existing textual entailment models are trained over open domain data and cannot take into account useful context information that may exist in reviews. We develop a supervised multi-task classifier, trained with domain-specific labeled data, to mine explanation pairs. Our model outperforms a fine-tuned BERT model (devlin2018bert) and competitive textual entailment models (rocktaschel2015reasoning; parikh-etal-2016-decomposable).

Opinion Canonicalization  This step groups semantically similar opinion phrases together (e.g., “not far away from Fisherman’s Wharf” and “close to the wharf”) to form a node in the opinion graph. This is necessary as reviews overlap significantly in content and, hence, contain many opinion paraphrases. We leverage clustering algorithms, similar to techniques used in entity canonicalization for open knowledge base construction (Vahishth:2018:CESI; Chen:2019:CanonicalizingKB), and use dense embeddings to determine the similarity between opinion phrases. Importantly, we consistently improve the clustering quality using our fine-tuned opinion phrase embedding model. Our model uses a multi-task objective and leverages, among others, previously predicted aspect categories, sentiment polarity scores, and explanations.

Opinion Causality Graph Generation  Finally, we present an algorithm to construct the final OCG from the predictions of previous components. The algorithm constructs an OCG, by connecting graph nodes according to the aggregated explanation predictions between their members. Our user study shows that our method produces graphs that are both accurate and intuitive.

Figure 3. Model architecture of our explanation classifier.

3. Mining Explanations

A significant task underlying the construction of an opinion graph is to determine when one opinion explains another. For example, “close to Muni bus stops” is an explanation of “convenient location”, but is not an explanation for “close to local attractions”. Similarly, “on a busy main thoroughfare” is an explanation for “very noisy rooms” but not necessarily an explanation for “convenient location”.

Given two opinions, the problem of classifying whether one implies the other is highly related to the problem of recognizing textual entailment (RTE) (Dagan:2005:PascalRTE), which is the task of predicting whether a piece of text can be inferred from another piece of text. However, we hypothesized that generic RTE models trained on open domain data might not perform well on the task. This was based on two observations: (a) domain-specific knowledge is often necessary to understand the nuances of opinion relationships; and (b) in many cases, having access to the full review is crucial to judge potential explanations. Preliminary experiments on our review-domain data partially confirmed this. We evaluated a state-of-the-art RTE model (parikh-etal-2016-decomposable) trained on open-domain data (snli:emnlp2015) and recorded very low explanation accuracy. In contrast, the same model trained on in-domain opinion data achieved substantially higher accuracy (Section 6).

In the remainder of the section, we first describe how we collect domain-specific data for training through crowd-sourcing and then present our multi-task classifier.

3.1. Training data

We use a two-phase procedure to collect two domain-specific training datasets for the hotel and restaurant domains. The goal of the first phase is to prune pairs of opinion phrases that are irrelevant to each other. In the second phase, crowd workers label the remaining relevant pairs of opinion phrases. That is, given a pair of relevant opinion phrases, we ask crowd workers to determine if one opinion phrase explains the other. If the answer is yes, we further ask them to label the direction of the explanation. In both phases, we provide the review where the opinion phrases co-occur as context, to assist crowd workers in understanding the opinion phrases and making better judgments. Through this data collection process, we obtained labeled examples whose positive examples are opinions in a causal relationship. For our experiments, we used a balanced dataset.

3.2. Explanation Classifier

We observe that contextual information is very useful for mining explanations. For example, the two phrases “very noisy room” and “right above Kearny St”, appear to be irrelevant to each other since one is about room quietness and the other is about location. However, from the review context where they co-occur, “Our room was very noisy. It is right above Kearny St.”, we are able to conclude that “right above Kearny St” is actually an explanation for “noisy room”. Therefore, besides the two mined opinions, ExplainIt takes as input the review context where the two opinions co-occur.

To mine the explanations, we design a multi-task model (Figure 3) for two classification tasks: (1) review classification: whether the review contains explanations; and (2) explanation classification: whether the first opinion phrase explains the second one. The intuition behind the design is that we want the model to capture signals from both the context and the opinion phrases. Our account for contextual information surrounding the opinions is a departure from methods used in open-domain RTE, which only consider two text fragments in isolation. Table 1 summarizes the notation used in this section.

Notation                  Meaning
p = (o, a)                Opinion phrase, opinion term, and aspect term
n                         Sequence length of a review
m_1, m_2                  Binary masks of the phrases p_1 and p_2
d                         Size of hidden states
H = (h_1, …, h_n)         Output hidden states from the BiLSTM layer
W_s, b_s, w_s             Self-attention trainable weights
W^y, W^h, W^r, W^t, w_a   Alignment-attention trainable weights
Table 1. Notation for the explanation classifier.

Input and Phrase Masks. The input to the classifier consists of a review x = (x_1, …, x_n) with n words, and two opinion phrases, p_1 and p_2. For each phrase p_i, we create a binary mask, m_i, which allows the model to selectively use the relevant parts of the full review encoding:

    m_i[j] = 1 if token x_j belongs to p_i, and 0 otherwise.

We denote the binary masks for p_1 and p_2 as m_1 and m_2 respectively.
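A minimal sketch of the phrase masks, under the assumption that each opinion phrase is given by its token span [start, end) in the review (span indices below are illustrative):

```python
def phrase_mask(n, start, end):
    """Binary mask m with m[j] = 1 iff token j belongs to the phrase."""
    return [1 if start <= j < end else 0 for j in range(n)]

review = "Our room was very noisy . It is right above Kearny St .".split()
m1 = phrase_mask(len(review), 3, 5)    # "very noisy"
m2 = phrase_mask(len(review), 8, 12)   # "right above Kearny St"
print(m1)  # [0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
```

Multiplying the review encoding by such a mask zeroes out every hidden state outside the phrase, which is how the model restricts attention to the relevant tokens.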

Encoding. We first encode the tokens of the review, x_1, …, x_n, through an embedding layer, followed by a BiLSTM layer. We denote the output vectors from the BiLSTM layer as H = (h_1, …, h_n), where the hidden layer dimension d is a hyperparameter. We do not encode the two opinion phrases separately, but mask the review encoding using m_1 and m_2. Note that we can also replace the first embedding layer with one of the pre-trained models, e.g., BERT (devlin2018bert). Our experiments demonstrate that using BERT further improves performance compared to a word2vec embedding layer.

Self-attention. There are common linguistic patterns for expressing explanations. A simple example is the use of connectives such as “because” or “due to”. To capture the linguistic features used to express explanations, we use the self-attention mechanism (bahdanau2014neural):

    u_j = tanh(W_s h_j + b_s)
    α_j = softmax_j(w_s · u_j)
    s = Σ_j α_j h_j

where W_s, b_s, and w_s are three trainable parameters. We obtain the final sentence representation as s.
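The additive self-attention step can be sketched in numpy; the weights here are randomly initialized stand-ins for the trained parameters, and the sizes are toy values:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 13, 8                      # review length, hidden size (toy values)
H = rng.normal(size=(n, d))       # BiLSTM outputs h_1..h_n

W = rng.normal(size=(d, d))       # stand-ins for W_s, b_s, w_s
b = rng.normal(size=d)
w = rng.normal(size=d)

scores = np.tanh(H @ W + b) @ w                  # one scalar score per token
alpha = np.exp(scores) / np.exp(scores).sum()    # softmax attention weights
s = alpha @ H                                    # sentence representation

assert s.shape == (d,) and np.isclose(alpha.sum(), 1.0)
```

The result s is a single vector summarizing the review, weighted toward tokens (e.g., connectives) that the trained attention finds indicative of explanations.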

Alignment attention. To determine whether the first opinion phrase explains the second opinion phrase, we follow the two-way word-by-word attention mechanism (rocktaschel2015reasoning) to produce a soft alignment between words in the given phrases. To align p_1 with p_2, for each word x_j of p_2, we get a weight vector α_j over the words of p_1 as follows:

    M_j = tanh(W^y (H ⊙ m_1) + (W^h h_j + W^r r_{j−1}) ⊗ e_n)
    α_j = softmax(w_a · M_j)
    r_j = (H ⊙ m_1) α_j + tanh(W^t r_{j−1})

where W^y, W^h, W^r, W^t, and w_a are five trainable parameters; h_j is the j-th output hidden state of H; r_{j−1} is the representation of the previous word; and 1 − m_1 is the reversed binary mask tensor for p_1, used to exclude tokens outside the phrase from the alignment. The final representation of opinion p_1 is obtained from a non-linear combination of the last hidden state h_n and the last output vector r_n:

    v_1 = tanh(W^p r_n + W^x h_n)

where W^p and W^x are two trainable parameters. We follow a similar procedure to align opinion p_2 from p_1 and obtain its final representation v_2.

Prediction and Training. The probability distributions for the review classification (y_rev) and relation classification (y_exp) tasks are obtained from two softmax classifiers respectively:

    y_rev = softmax(W_rev z + b_rev)
    y_exp = softmax(W_exp z + b_exp)

where z = [s; v_1; v_2] is the concatenation of the sentence’s and opinion phrases’ representations; W_rev, W_exp and b_rev, b_exp are the classifiers’ weights and biases respectively. Finally, we define the training objective as follows:

    L = L_rev + λ L_exp

where L_rev and L_exp are the Cross-Entropy losses for the first and second classification tasks respectively; λ is a tunable hyper-parameter.
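A toy sketch of the combined multi-task objective (loss of the review task plus a weighted loss of the explanation task); the logits and the weight value below are made-up illustrative numbers:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def cross_entropy(logits, gold):
    """Negative log-probability of the gold class."""
    return -np.log(softmax(logits)[gold])

logits_review = np.array([0.2, 1.5])   # does the review contain an explanation?
logits_expl = np.array([2.0, -0.3])    # does p1 explain p2?
lam = 0.5                              # tunable hyper-parameter (illustrative)

loss = cross_entropy(logits_review, 1) + lam * cross_entropy(logits_expl, 0)
assert loss > 0
```

Both tasks share the encoder, so gradients from the auxiliary review task also shape the representation used by the explanation classifier.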

4. Canonicalizing Opinions

The goal of this component is to identify duplicate or very similar opinion phrases in order to build the OCG nodes as concisely as possible. Although reviews often contain semantically similar content, the same opinion might be expressed in varied ways. For example, “one block from beach”, “close to the pacific ocean”, “unbeatable beach access”, and “very close the sea” are different phrases used in hotel reviews to describe the same opinion.

Our canonicalization method proceeds in two stages: (1) fine-tuning opinion phrase embeddings and (2) clustering phrases using these embeddings. In the first step, ExplainIt learns opinion phrase representations better suited for clustering. Then, a clustering algorithm, like k-means, groups similar opinion phrases together.

Our canonicalization method has two major benefits. First, our method allows us to leverage existing clustering algorithms, which we improve by simply replacing the original vector representations with our fine-tuned embeddings. Second, our method only utilizes weak supervision: our fine-tuning model is unsupervised with respect to the optimal clustering, but takes advantage of predictions from previous components, namely the opinions’ predicted aspect categories, sentiment polarity, and explanations.

4.1. Fine-tuning Opinion Phrase Embeddings

A popular choice for representing phrases is the average of the phrase’s word embeddings under a pre-trained word embedding model (e.g., word2vec, GloVe). However, such representations are ill-suited for clustering opinion phrases. For example, with such embeddings, the opinion phrase “very close to ocean” will be placed closer to the irrelevant opinion phrase “very close to tram” than to the semantically closer opinion phrase “2 mins walk beach” (see Figure 5 (a) for more real examples). This is because the average word embedding favors phrases which share many tokens (in this example: very, close, to).
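The pitfall can be demonstrated with toy 2-d "word embeddings" (purely illustrative values, not real GloVe vectors): phrases sharing function words end up closer than semantically related phrases.

```python
import numpy as np

emb = {
    "very": [0.0, 1.0], "close": [0.0, 1.0], "to": [0.0, 1.0],
    "ocean": [1.0, 0.0], "tram": [-1.0, 0.0],
    "walk": [0.9, 0.1], "beach": [1.0, 0.1],
}

def avg(phrase):
    """Average word embedding of a phrase."""
    return np.mean([emb[w] for w in phrase.split()], axis=0)

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

a = avg("very close to ocean")
b = avg("very close to tram")    # irrelevant, but shares 3 tokens with a
c = avg("walk beach")            # semantically closer, no token overlap

assert cos(a, b) > cos(a, c)     # token overlap dominates semantics
```

The fine-tuning described next counteracts exactly this effect by encoding aspect and opinion components separately and training with task-specific losses.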

To ensure that similarity is tuned towards semantics rather than word overlap, we propose a method that fine-tunes opinion phrase embeddings by minimizing a vector reconstruction loss, as well as additional losses based on predicted aspect category, polarity, and explanations. Furthermore, our method separately encodes the aspect and the opinion components of an opinion phrase, so it becomes easier to learn to distinguish “very close to tram” and “very close to ocean” using the aspects “tram” and “ocean”.

Figure 4

illustrates our fine-tuning model. The model encodes an opinion phrase into an embedding vector, which is used as input for a set of predictors. The total multi-task loss is used to fine-tune the model parameters, including the opinion phrase embeddings themselves. We describe each component and loss function next.

Figure 4. Model and loss functions for fine-tuning opinion phrase embeddings.

Input. The input to the model is a set of opinion phrases extracted from reviews about a single entity (e.g., a hotel), and a set of explanations over these opinion phrases. Recall that each opinion phrase p = (o, a) consists of two token sequences, the opinion term o and the aspect term a. (We omit the index of the opinion phrase and explanation below when it is clear from the context.) We use c and s to denote the aspect category and sentiment labels of a phrase, predicted by the ABSA model during the opinion mining stage.

Opinion Phrase Encoding. Given an opinion phrase p = (o, a), we first use an embedding layer and an attention-based auto-encoder (ABAE) (he2017unsupervised) to compute an aspect embedding e_a and an opinion embedding e_o. The aspect embedding is obtained by attending over the aspect term tokens w_1, …, w_k of a:

    β_i = softmax_i(e_{w_i} · M e_avg)
    e_a = Σ_i β_i e_{w_i}

where e_{w_i} is the output of the embedding layer for word w_i, e_avg is the average word embedding for the words in a, and M is a trainable parameter used to calculate the attention weights. We encode the opinion term o into e_o in the same manner. Then, we concatenate the two embedding vectors into a single opinion phrase embedding e_p:

    e_p = [e_o; e_a]
Reconstruction loss. Following the auto-encoding paradigm, the main idea behind the reconstruction loss is to fine-tune input vectors so that they are easily reconstructed from a representative matrix T. Similar to (he2017unsupervised; angelidis2018summarizing), we set the rows of T using an existing clustering algorithm (e.g., k-means) over the initial phrase embeddings, such that every row corresponds to a cluster centroid. (We freeze T during training after initialization to facilitate training stability, as suggested in (angelidis2018summarizing).) To reconstruct the phrase vector e_p, we first feed it to a softmax classifier to obtain a probability distribution q over the rows of T:

    q = softmax(W_t e_p + b_t)

where W_t and b_t are the weight and bias parameters of the classifier respectively. We get the reconstructed vector r for opinion phrase p as follows:

    r = Tᵀ q
We use the triplet margin loss (balntas2016learning) as the cost function, which moves the input opinion phrase e_p closer to the reconstruction r, and further away from randomly sampled negative examples:

    L_r = (1/k) Σ_{i=1..k} max(0, μ + d(e_p, r) − d(e_p, n_i))

where n_1, …, n_k are randomly selected negative examples, d(·, ·) is the distance between two vectors, and μ is the margin. The sampling procedure tries to select opinion phrases that are not similar to the input opinion phrase: for an opinion phrase p, the probability of another opinion phrase being selected as a negative example is inversely proportional to the cosine similarity between e_p and that phrase’s embedding.
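One simple way to realize the negative sampling heuristic: candidates dissimilar to the input phrase (low cosine similarity) are drawn with higher probability. The weighting `1 - cos` below is an illustrative choice standing in for "inversely related to similarity", and the embeddings are random:

```python
import numpy as np

rng = np.random.default_rng(2)
E = rng.normal(size=(20, 8))    # candidate phrase embeddings (toy)
e_p = rng.normal(size=8)        # input phrase embedding

cos = (E @ e_p) / (np.linalg.norm(E, axis=1) * np.linalg.norm(e_p))
weights = 1.0 - cos             # low similarity -> high weight (in [0, 2])
probs = weights / weights.sum()

# Draw 3 distinct negatives, biased toward dissimilar phrases.
negatives = rng.choice(len(E), size=3, replace=False, p=probs)
assert len(set(negatives)) == 3 and probs.min() >= 0
```

Sampling hard-to-confuse negatives keeps the triplet loss informative: trivially distant phrases would yield zero loss and no gradient.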

Aspect category and polarity loss. We also leverage additional signals that we collected from the previous steps to obtain better representations. For example, we would like to avoid opinion phrases such as “friendly staff” and “unfriendly staff” from being close in the embedding space. Hence, we incorporate sentiment polarity information and, similarly, aspect category information to fine-tune opinion phrase embeddings accordingly.

In ExplainIt, we add two classification tasks to fine-tune the parameters. Specifically, we feed an opinion phrase embedding e_p into two softmax classifiers to predict the probability distributions of the aspect category and the sentiment polarity respectively:

    y_cat = softmax(W_cat e_p + b_cat)
    y_pol = softmax(W_pol e_p + b_pol)

The distributions y_cat and y_pol are used to compute cross-entropy losses L_cat and L_pol against silver-standard aspect and sentiment labels, predicted for each extracted phrase during opinion mining.

Intra-cluster explanation loss. Our mined explanations should also provide additional signals to learn better embeddings. Essentially, if an opinion phrase p_i explains an opinion phrase p_j, they should belong to different clusters (i.e., they should not belong to the same cluster). To reduce intra-cluster explanations, we define the intra-cluster explanation loss using the Kullback-Leibler divergence (KL) between the cluster assignment distributions q_i and q_j:

    L_e = − Σ_{(p_i, p_j) ∈ X} KL(q_i || q_j)

where X is the set of pairs of opinion phrases in the mined explanations, and KL(· || ·) is the KL divergence between two distributions. When a pair p_i and p_j are in a causal relationship, we would like to penalize the case where KL(q_i || q_j) is small (i.e., the two phrases are likely to be in the same cluster). As a result, we are able to push apart the embeddings of opinion phrases that stand in an explanation relationship, discouraging intra-cluster explanations.
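A sketch of this idea: for an explanation pair, the KL divergence between the two soft cluster assignments is maximized (equivalently, its negative is added to the loss), so phrases in a causal relationship are pushed toward different clusters. The distributions below are made-up examples:

```python
import numpy as np

def kl(p, q, eps=1e-9):
    """KL divergence between two discrete distributions."""
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

q_i = np.array([0.7, 0.2, 0.1])   # assignment of the explaining phrase
q_j = np.array([0.1, 0.2, 0.7])   # assignment of the explained phrase

loss = -kl(q_i, q_j)              # small KL (same cluster) is penalized
assert loss < 0                   # here the pair is already well separated
```

If both phrases were concentrated on the same cluster, the KL term would be near zero and the loss near its maximum, producing a gradient that separates them.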

Training objective. We define the final loss function by combining the four loss functions defined above:

    L = L_r + λ_1 L_cat + λ_2 L_pol + λ_3 L_e

where λ_1, λ_2, and λ_3 are three hyper-parameters that control the influence of each corresponding loss. In practice, we prepare two types of mini-batches: one of single opinion phrases and one of explanation pairs. For each training step, we create and use these mini-batches separately: we use the single-phrase mini-batch to evaluate the reconstruction, aspect category, and polarity losses, and the explanation mini-batch to evaluate the explanation loss. At the end of every training step, we accumulate the loss values following the equation above and update the model parameters.

Canonicalized opinion nodes. After training the above fine-tuning model, we use existing clustering algorithms over the fine-tuned phrase vectors to obtain the final opinion clusters, which correspond to the nodes of the opinion causality graph. As we will show in Section 6, our two-stage fine-tuning approach consistently boosts the performance of clustering algorithms.

We could consider directly using the cluster assignment distribution q for clustering instead of applying a clustering algorithm to the fine-tuned opinion embeddings. However, we found that q by itself does not perform well compared to our two-stage approach. This is expected, as the classifier responsible for producing q has only been trained via the reconstruction loss, whereas the phrase embeddings have used all four loss signals, thus producing much richer representations.

5. Generating Opinion Graphs

Based on the mined explanations and canonicalized opinions from Sections 3 and 4, the final step for generating an opinion causality graph is to predict edges between nodes. In theory, when using perfectly accurate explanations and opinion clusters, generating such edges is trivial. Intuitively, when an opinion phrase explains another opinion phrase, opinion phrases that are paraphrases of the first phrase should also explain phrases that paraphrase the latter one. In other words, given a set of explanations and two groups of opinion phrases, C_1 and C_2, there should be an edge from C_1 to C_2 if there exists an explanation p_i → p_j between two opinion phrases p_i ∈ C_1 and p_j ∈ C_2.

For example, when we know that “close to the beach” → “good location”, we are able to conclude (“close to the beach”, “near the beach”, “walking distance to the beach”) → (“good location”, “great location”, “awesome location”).

However, in practice, our obtained explanations and nodes are not perfect. As a result, we may get many false positive edges based on the above criterion. To minimize the false positives, we use a simple heuristic to further prune the edges, based on the observation that two groups of opinions seldom explain each other at the same time: we keep the edge from C_1 to C_2 only if

    Σ_{e ∈ E_12} P(e) > Σ_{e ∈ E_21} P(e)

where E_12 and E_21 are the explanations from C_1 to C_2 and from C_2 to C_1 respectively; and P(e) denotes the explanation probability obtained from our explanation mining classifier.
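The pruning heuristic can be sketched as follows. The mean aggregation and the 0.5 threshold are illustrative assumptions (the system compares aggregated classifier probabilities; the exact aggregation and threshold are design choices), and the probability values are made up:

```python
def cluster_edge(probs_12, probs_21, threshold=0.5):
    """Return +1 for C1 -> C2, -1 for C2 -> C1, 0 for no edge.

    probs_12 / probs_21: explanation probabilities of phrase pairs
    from C1 to C2 and from C2 to C1, as scored by the classifier.
    """
    s12 = sum(probs_12) / max(len(probs_12), 1)
    s21 = sum(probs_21) / max(len(probs_21), 1)
    if max(s12, s21) < threshold:
        return 0                       # too little causal evidence
    return 1 if s12 >= s21 else -1     # keep only the stronger direction

# "thin walls" -> "noisy room": strong one-way evidence, weak reverse.
print(cluster_edge([0.9, 0.8, 0.7], [0.2]))  # 1
```

Keeping only the dominant direction encodes the observation that two opinion groups seldom explain each other simultaneously.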

Deriving edges between canonicalized opinions is a difficult problem in general. There are many ways to optimize this step further, and we leave this as part of our future work.

Group     Model                   Acc.
RTE       Two-way attention       74.78
RTE       Decomposable attention  76.26
RTE       RTE-BERT                79.75
Sent      BiLSTM-self-attention   75.41
Sent      Sent-BERT               81.79
Proposed  Our model (GloVe)       82.20
Proposed  Our model (BERT)        86.23
Table 2. Comparing the accuracy of explanation classifiers.
                       Homogeneity (Precision)       Completeness (Recall)         V-measure (F1)
                       k-means  GMM     Cor.Clust.   k-means  GMM     Cor.Clust.   k-means  GMM     Cor.Clust.
Hotel       GloVe      0.6695   0.6785  0.7240       0.7577   0.7728  0.6756       0.7102   0.7219  0.6985
            ABAE       0.6626   0.6628  0.6964       0.7609   0.7522  0.7113       0.7075   0.7039  0.6966
            Our method 0.7073   0.7177  0.7460       0.8115   0.8184  0.8370       0.7551   0.7641  0.7848
Restaurant  GloVe      0.5854   0.5509  0.5851       0.8168   0.7801  0.8103       0.6778   0.6413  0.6761
            ABAE       0.5563   0.5553  0.6256       0.7927   0.7779  0.7819       0.6492   0.6432  0.6918
            Our method 0.5920   0.5572  0.6158       0.8333   0.8111  0.8155       0.6877   0.6555  0.6985
Table 3. Opinion canonicalization performance on Hotel and Restaurant datasets.

6. Evaluation

We evaluate ExplainIt with three types of experiments. We use two review datasets for evaluation: a public Yelp corpus of restaurant reviews and a private Hotel corpus of hotel reviews (collected from multiple hotel booking websites). For the explanation mining and opinion canonicalization components, we perform automatic evaluation over crowdsourced gold labels (we release the labeled datasets at https://anonymous.). To evaluate the quality of the generated opinion graph, we conducted a user study.

6.1. Mining Explanations

6.1.1. Dataset and metric

Based on the data collection process we described in Section 3.1, we used a balanced dataset in the hotel domain. We further split the labeled data into training, validation, and testing sets. We evaluate the models by their prediction accuracy.

6.1.2. Methods

We compare our explanation classifier model against baseline methods, which we categorize into three groups. The first group (RTE) consists of three different models for RTE, the second group (Sent) consists of two models for sentence classification, and the last group (Proposed) consists of different configurations of our model. We trained all the models on the same training data with the Adam optimizer (kingma2014adam), using the same learning rate, beta parameters, decay factor, and number of epochs. All models except BERT used the same word embedding model (glove.6B.300d) (pennington-etal-2014-glove) for the embedding layers.

For the RTE group, the input to the models is a pair of opinion phrases. The review context information associated with the pairs is ignored.

Two-way attention: The two-way attention model (rocktaschel2015reasoning) is a BiLSTM model with two-way word-by-word attention, which is also used in our proposed model. It can be considered a degraded version of our proposed model that retains only the alignment attention and takes opinion phrases without context information.

Decomposable attention: The decomposable attention model (parikh-etal-2016-decomposable) is a widely used model and the strongest non-pre-trained model for RTE tasks.

RTE-BERT: BERT (devlin2018bert) is a pre-trained self-attention model, which is known to achieve state-of-the-art performance in many NLP tasks. We fine-tuned the BERT-base model with our training data.

For the Sent group, we formulate the explanation classification problem as a single-sentence classification task. We “highlight” opinion phrases in a review with special dummy symbols [OP1] and [OP2]. For example, “[OP1] Good location [OP1] with [OP2] easy access to beach [OP2]” highlights two opinion phrases: “good location” and “easy access to beach”. With this input format, we can train a sentence classification model that takes context information into account while recognizing which tokens constitute the opinion phrases.
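A small sketch of this input construction, assuming character offsets of the opinion phrases are available from the extraction step (the `highlight` helper is our own illustration):

```python
# Wrap each opinion phrase in a review with [OP1]/[OP2] markers so that a
# sentence classifier can see both the phrases and their context.
def highlight(review, span1, span2):
    """span1/span2 are (start, end) character offsets; span1 precedes span2."""
    s1, e1 = span1
    s2, e2 = span2
    return (review[:s1] + "[OP1] " + review[s1:e1] + " [OP1]"
            + review[e1:s2] + "[OP2] " + review[s2:e2] + " [OP2]"
            + review[e2:])

text = "Good location with easy access to beach"
print(highlight(text, (0, 13), (19, 39)))
# [OP1] Good location [OP1] with [OP2] easy access to beach [OP2]
```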

BiLSTM-self-attention: We trained a BiLSTM model with self-attention (Lin:2017:SelfAttentionLSTM), originally developed for sentence classification tasks. The architecture can be considered a degraded version of our model without the two-way word-by-word attention. Note that the model is trained to predict the explanation label, not a review-level label.

Sent-BERT: We fine-tuned the BERT-base model for the sentence classification task with the training data. Unlike RTE-BERT, Sent-BERT takes an entire review (with opinion phrase markers) as input, so it can take context information into account.

The last group (Proposed) includes two variations of our model:

Our model (GloVe): The default model with an embedding layer initialized with the GloVe (glove.6B.300d) model.

Our model (BERT): We replace the embedding layer with the BERT-base model to obtain contextualized word embeddings.

6.1.3. Result analysis

As shown in Table 2, our proposed model achieves a significant improvement over the baseline approaches: it largely outperforms both the non-pre-trained textual entailment models and the sentence classification models. Furthermore, models that consider context information tend to mine explanations more accurately: BERT over entire sentences (Sent-BERT) is more accurate than BERT over opinion phrases only (RTE-BERT). Lastly, leveraging a pre-trained model further improves performance: replacing the embedding layer with BERT improves accuracy by about 4 points (82.20 to 86.23).

6.2. Canonicalizing Opinions

6.2.1. Dataset and metrics

For both the Hotel and Restaurant domains, we first exclude entities with too few or too many reviews and randomly select entities from the remaining ones. We also developed a non-trivial crowdsourcing process to collect the gold clusters.

We evaluate the performance with three metrics: (1) homogeneity, (2) completeness, and (3) V-measure, which behave analogously to precision, recall, and F1-score. Homogeneity measures the precision of each cluster and scores 1.0 if each cluster contains only members of a single class. Completeness measures the recall of each true class and scores 1.0 if all members of a given class are assigned to the same cluster. The V-measure is the harmonic mean of the homogeneity and completeness scores.
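These metrics can be computed directly from the label entropies; the following self-contained sketch implements the standard definitions (homogeneity = 1 − H(C|K)/H(C), completeness = 1 − H(K|C)/H(K), V-measure = their harmonic mean):

```python
from collections import Counter
from math import log

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())

def conditional_entropy(labels, given):
    # H(labels | given), computed over the joint distribution.
    n = len(labels)
    joint = Counter(zip(given, labels))
    cond = Counter(given)
    return -sum((c / n) * log(c / cond[g]) for (g, _), c in joint.items())

def v_measure(true_classes, pred_clusters):
    h_c, h_k = entropy(true_classes), entropy(pred_clusters)
    # Homogeneity: 1.0 iff each cluster contains members of a single class.
    hom = 1.0 if h_c == 0 else 1.0 - conditional_entropy(true_classes, pred_clusters) / h_c
    # Completeness: 1.0 iff all members of a class land in the same cluster.
    com = 1.0 if h_k == 0 else 1.0 - conditional_entropy(pred_clusters, true_classes) / h_k
    v = 0.0 if hom + com == 0 else 2 * hom * com / (hom + com)
    return hom, com, v

# A perfect clustering (cluster ids are arbitrary) scores 1.0 on all three.
print(v_measure([0, 0, 1, 1], [1, 1, 0, 0]))  # (1.0, 1.0, 1.0)
```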

6.2.2. Methods

To understand the benefits of our fine-tuned opinion phrase embeddings, we evaluate whether they consistently improve the performance of existing clustering algorithms. We select three representative algorithms: k-means, Gaussian Mixture Models (GMM) (Bishop:2006:PRML), and Correlation Clustering over similarity scores (Bansal:2004:CorrelationClustering; Elsner-schudy-2009-bounding:CorrelationClustering). We set the number of clusters for k-means and GMM separately for the Hotel and Restaurant datasets, and use a single threshold setting for Correlation Clustering on both datasets. We compared the following methods, all of which use the same word embedding model (glove.6B.300d):

GloVe: We first compute the average word embedding for the opinion term and for the aspect term separately, and then concatenate the two averaged embeddings to form the final opinion phrase embedding.
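A minimal sketch of this baseline, with toy 3-dimensional vectors standing in for glove.6B.300d embeddings (the lookup table and helper names are illustrative):

```python
# Average the word vectors of the opinion term and the aspect term
# separately, then concatenate the two averages into one phrase embedding.
EMB = {  # hypothetical stand-in for a GloVe lookup table
    "good": [0.1, 0.2, 0.3],
    "location": [0.4, 0.0, 0.2],
}

def avg_embedding(words, emb, dim=3):
    vecs = [emb[w] for w in words if w in emb]
    if not vecs:
        return [0.0] * dim  # out-of-vocabulary fallback
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def phrase_embedding(opinion_term, aspect_term, emb):
    # Concatenation keeps the opinion and aspect components separate.
    return (avg_embedding(opinion_term.split(), emb)
            + avg_embedding(aspect_term.split(), emb))

vec = phrase_embedding("good", "location", EMB)
print(len(vec))  # 6
```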

ABAE: We fine-tune the word embedding model without any additional labels, which can be considered an ABAE model (he2017unsupervised).

Our method: We fine-tune the word embedding model with supervision from aspect category, polarity, and explanation labels. The explanations were obtained with our explanation classifier (Section 3); we use the same trained model for both Hotel and Restaurant.
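Conceptually, the fine-tuning objective combines the reconstruction loss with the three supervision signals. The sketch below shows only the weighted-sum form; the weight values are illustrative and not the paper's actual hyper-parameters:

```python
# Hypothetical combined objective: reconstruction loss plus weighted
# aspect, polarity, and explanation losses (weights are illustrative).
def total_loss(l_recon, l_aspect, l_polarity, l_explain,
               w_aspect=1.0, w_polarity=1.0, w_explain=1.0):
    return (l_recon
            + w_aspect * l_aspect
            + w_polarity * l_polarity
            + w_explain * l_explain)

# Setting all supervision weights to 0 recovers the ABAE-style objective.
print(total_loss(0.5, 0.25, 0.25, 0.5))  # 1.5
```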

(a) Before embedding fine-tuning
(b) After embedding fine-tuning
Figure 5. Embedding space comparison. Color-coding denotes true cluster assignments. After fine-tuning, semantically similar opinions are significantly closer to each other and unrelated ones are further apart, allowing easier opinion canonicalization.

6.2.3. Result analysis

As shown in Table 3, our fine-tuned embeddings (Our method) achieve the best performance among all settings and consistently boost the performance of the existing clustering algorithms over the baseline methods in both the Hotel and Restaurant domains. In addition, we confirm that our model significantly benefits from the additional supervision from opinion and explanation mining, as it significantly outperforms ABAE, which does not use this supervision.

6.2.4. Embedding space visualization

We also present a qualitative analysis of how our fine-tuning model helps to canonicalize opinions. Figure 5 shows a two-dimensional t-SNE projection (maaten2008visualizing) of embeddings for a fraction of the opinion phrases about a hotel. The phrase vectors obtained before and after fine-tuning are shown on the left and right side of the figure, respectively. The color codes denote true cluster assignments.

We observe that, before fine-tuning, the vectors are more uniformly dispersed in the embedding space, and hence the cluster boundaries are less prominent. Additionally, we annotated the figure with a number of particularly problematic cases. For example, each of the pairs of opinion phrases (“very close to tram”, “very close to ocean”), (“2 mins walk to beach”, “2 mins walk to zoo”), and (“good location near the ocean”, “good location near the zoo”) appears in close proximity even though its members belong to different clusters. After fine-tuning, these problematic pairs of vectors are clearly separated into their respective clusters.

6.3. Opinion Graph Quality: User Study

In addition to the experimental evaluation of the explanation mining and opinion canonicalizing modules, we designed a user study to verify the quality of the final opinion graphs produced by ExplainIt. Assessing the quality of an entire opinion graph at once is impractical due to its size and complexity. Instead, we broke down the evaluation of each generated graph into a series of pairwise tests, where human judges were asked to verify the causality relation (or lack thereof) between pairs of nodes in the graph.

More specifically, given a predicted graph about an entity, we sampled node pairs so as to obtain a balanced number of pairs for which we predicted the existence or absence of a causal relation. For every pair, we present the two nodes to the user and show five member opinion phrases from each. We further show the predicted relation between the nodes (“explains” or “does not explain”) and ask the users whether they agree with it.

We generated examples for 10 hotels (i.e., their constructed opinion graphs), amounting to 166 node pairs in total. Every pair was shown to 3 judges, and we obtained a final judgment for it using a majority vote. The judges agreed with our predicted relation in 77.1% of cases.
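The per-pair majority-vote aggregation can be sketched as follows (the annotator labels are illustrative):

```python
# Each node pair is judged by 3 annotators; the final label for the pair
# is the majority answer, which can never tie with binary labels and an
# odd number of judges.
from collections import Counter

def majority(votes):
    return Counter(votes).most_common(1)[0][0]

judgments = [["agree", "agree", "disagree"],
             ["disagree", "disagree", "agree"]]
print([majority(v) for v in judgments])  # ['agree', 'disagree']
```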

7. Related Work

Opinion Summarization: Mining opinions from online reviews has been studied since (hu2004kdd; hu2004aaai). These studies developed opinion extraction systems that use association mining techniques to extract frequent noun phrases from reviews and then aggregate sentiment polarity scores, forming aspect-based summaries of online product reviews. Opinion Observer (Liu:2005:OpinionObserver) extended the method into a system that visualizes the polarity information of each aspect with bar plots as a summary. While ExplainIt is closely related to opinion summarization, it differs from prior work in two significant ways. First, ExplainIt predicts explanations between extracted opinion phrases to form an OCG; the resulting summary is an explainable OCG from which a textual summary can also be generated. Second, ExplainIt canonicalizes opinion phrases to remove redundancy from the output while retaining not only frequent but also long-tail opinion phrases.

Text summarization is another approach to summarizing multiple reviews. Recent advances in neural summarization methods (cheng-lapata-2016-neural; Isonuma:2019:Unsupervised; liu-lapata-2019-hierarchical; Paulus:2017:DeepReinforced; See:2017:PointerGenerator) enable generating a textual summary from multiple documents. These techniques mostly rely on reference summaries for training, which are costly to obtain. Among them, Opinosis (Ganesan:2010:Opinosis) uses a graph-based method for abstractive summarization, but its graph captures syntactic relationships between words, not opinion phrases, so the graph itself cannot serve as a summary. ExplainIt distinguishes itself from text summarization approaches with its OCG: the opinion phrases and the explanations between them form a structured summary of the reviews from which a textual summary can also be generated.

Explanation Classifier: Recognizing Textual Entailment (RTE) (Dagan:2005:PascalRTE) and Natural Language Inference (NLI) (snli:emnlp2015) are tasks that judge, given two statements, whether one can be inferred from the other. They are usually formulated as a sentence-pair classification problem where the input is two sentences. A major difference between RTE models and our explanation classifier is that our classifier judges whether one opinion phrase explains another within the same review text; the two opinion phrases may appear in the same sentence or in different sentences. Hence, as described in Section 3, we added another task, explanation existence judgment, on top of the opinion-phrase classification task, improving performance via a multi-task learning framework.

Opinion Canonicalization: Aspect-based auto-encoder (he2017unsupervised) is an unsupervised neural model that clusters sentences while learning better representations of the words. It showed better performance than conventional topic models (e.g., LDA (Blei:2003:LDA), Biterm Topic Model (Yan:2013:BTM)) in aspect identification tasks. Our opinion phrase fine-tuning method extends their approach by (1) having an opinion phrase encoder that consists of two encoders for aspect and opinion terms, and (2) incorporating additional loss values such as the sentiment/polarity losses and explanation loss in addition to the reconstruction loss.

Opinion canonicalization is closely related to KB canonicalization (Galarraga:2014:CIKM:Canonicalizing; Vahishth:2018:CESI), which canonicalizes entities or relations (or both) by merging triples, each consisting of two entities and a relation, based on similarity. Galárraga et al. (Galarraga:2014:CIKM:Canonicalizing) proposed several manually-crafted features for clustering triples; (Vahishth:2018:CESI) showed that word-embedding features using GloVe outperform those methods, so we consider GloVe word embeddings as a baseline for the opinion canonicalization task. CESI (Vahishth:2018:CESI) uses side information (e.g., entity linking, WordNet) to train better embedding representations for KB canonicalization. Unlike these models, ExplainIt does not rely on external structured knowledge such as WordNet or KBs, mainly because it is not straightforward to construct a single KB that reflects the wide variety of subjective opinions written in reviews. Instead, we aim to construct an OCG for each entity.

8. Conclusion

We presented ExplainIt, the first review summarization system that generates explainable summaries with Opinion Causality Graphs (OCGs) based on opinion phrases and the causality relations between them. ExplainIt consists of a number of components, including explanation mining and opinion canonicalization components, which we developed to construct OCGs from online reviews. We created and released labeled datasets for the explanation mining and opinion phrase canonicalization tasks. Experimental results on these datasets show that our methods perform significantly better than baseline methods on both tasks. Our user study also confirmed that human judges agreed with the predicted graph edges produced by our system in more than 77% of cases.