User reviews are valuable for companies seeking to improve the quality of their services or products, since reviews express the opinions of their customers. Nowadays, a company can easily collect reviews from users via e-commerce platforms and recommender systems, but it is difficult to read through all the wordy user reviews. Therefore, distilling salient information from user reviews is necessary. To achieve this goal, review summarization and sentiment classification have been widely studied. The objective of review summarization is to generate a short and concise summary that expresses the key opinions and sentiment of the review. Sentiment classification is the task of predicting a sentiment label that indicates the sentiment attitude of the review. For example, a sentiment label may range from 1 to 5, where 1 indicates the most negative attitude and 5 indicates the most positive attitude.
|Review: very colorful and decent for prep and serving small kids. bright colors are nice and food colors don’t seem to transfer to the bowls. we use them most frequently for salad dressing when the kids have baby carrots or celery sticks. my complaint is that the bases are too small and consequently they tip over very easily. that’s ok for ranch dressing or ketchup, not so good for soy sauce. if you flip them inside out it gives them a wider base which makes them more stable. since they flex they make their own funnel … we have some stainless mini bowls which are almost an inch larger in diameter (but shallower) which makes them a little more handy …|
|Summary: very colorful and handy, but tippy and smaller than expected.|
|Sentiment label: 3 out of 5|
Figure 1 shows an example of a review with its summary and sentiment label. As shown in the summary, the user thinks the product is “colorful and handy, but tippy and smaller than expected”. The sentiment label of the review is 3 out of 5, indicating that the review has a neutral sentiment attitude. We can observe that the sentiment tendency of the review summary is consistent with the sentiment label of the review, which indicates that there exists a close tie between these two tasks (Ma et al., 2018). Hence, they can benefit each other. Specifically, the sentiment classification task can guide a summarization model to capture the sentiment attitude of the review. Meanwhile, the review summarization task can filter out the less informative content of the review, which helps a sentiment classification model predict the sentiment label of the review.
However, most existing methods address only one of the two problems, either review summarization or sentiment classification. For review summarization, traditional extractive methods select important phrases or sentences from the input review text to produce an output summary. Recent abstractive methods for review summarization adopt the attentional encoder-decoder framework (Bahdanau et al., 2015) to generate an output summary word by word, which can generate novel words that do not appear in the review and produce a more concise summary. For sentiment classification, early methods are based on support vector machines or statistical models, while recent methods employ recurrent neural networks (RNNs) (Hochreiter and Schmidhuber, 1997). Though some previous methods (Mane et al., 2015; Hole and Takalikar, 2013) can predict both the sentiment label and the summary for a social media text, their sentiment classification and summarization modules are trained separately and rely on rich hand-crafted features.
Recently, Ma et al. (Ma et al., 2018) proposed an end-to-end model that jointly improves the performance of review summarization and sentiment classification. In their model, an encoder first learns a context representation for the review, which captures the context and sentiment information. Based on this representation, the decoder iteratively computes a hidden state and uses it to generate a summary word from a predefined vocabulary until the end-of-sequence (EOS) token is generated. Meanwhile, an attention layer uses the decoder hidden states to attend to the review and compute a weighted sum of the review context representation, which acts as a summary-aware context representation of the review. Next, the review context representation and the summary-aware context representation are fed to a max-pooling based classifier to predict the sentiment label. However, both of these representations collect sentiment information from the review only. Thus, the model does not fully utilize the sentiment information contained in the summary.
To effectively leverage the sentiment information in both the review and the summary, we propose a dual-view model for joint review summarization and sentiment classification. In our model, the sentiment information in the review and the summary is modeled by the source-view and summary-view sentiment classifiers, respectively. The source-view sentiment classifier uses the review context representation from the encoder to predict a sentiment label for the review, while the summary-view sentiment classifier uses the decoder hidden states to predict a sentiment label for the generated summary. The ground-truth sentiment label of the review is used to compute a classification loss for both of these classifiers. In addition, we introduce an inconsistency loss function to penalize the disagreement between these two classifiers.
By promoting the consistency between the source-view and summary-view classifiers, we encourage the sentiment information in the decoder states to be close to that in the review context representation, which helps the decoder to generate a summary that has the same sentiment attitude as the review. Moreover, the source-view and summary-view sentiment classifiers can learn from each other to improve the sentiment classification performance. This shares a similar spirit with the multi-view learning paradigm in semi-supervised learning (Xu et al., 2013). Our model, therefore, provides more supervision signals for both review summarization and sentiment classification without additional training data. During testing, we use the sentiment label predicted by the source-view classifier as the final classification output, since the summary-view sentiment classifier is affected by the exposure bias issue (Ranzato et al., 2016) in the testing stage (more details in Section 4.4).
We conduct extensive experiments to evaluate the performance of our dual-view model. Experimental results on four real-world datasets from different domains demonstrate that our model outperforms strong baselines on both review summarization and sentiment classification. The ablation study shows the effectiveness of each individual component of our model. Furthermore, we compare the classification results of our source-view sentiment classifier, our summary-view sentiment classifier, and a merged sentiment classifier that combines the former two classifiers.
We summarize our main contributions as follows: (1) we propose a novel dual-view model for jointly improving the performance of review summarization and sentiment classification; (2) we introduce an inconsistency loss to penalize the inconsistency between our source-view and summary-view sentiment classifiers, which benefits both review summarization and sentiment classification; and (3) experimental results on benchmark datasets show that our model outperforms the state-of-the-art models for the joint review summarization and sentiment classification task.
2. Related Work
Opinion Summarization. Review summarization belongs to the area of opinion summarization (Ganesan et al., 2010; Gerani et al., 2014; Li et al., 2019; Frermann and Klementiev, 2019; Tian et al., 2019; Musto et al., 2019). Early approaches for opinion summarization are extractive, i.e., they can only produce words that appear in the input document. Ganesan et al. (Ganesan et al., 2010) proposed a graph-based approach. It first converts the input opinion text into a directed graph. Then it applies heuristic rules to score different paths that encode valid sentences and takes the top-ranked paths as the output summary. Hu and Liu (Hu and Liu, 2004) proposed a method that first identifies product features mentioned in the reviews and then extracts opinion sentences for the identified features. An unsupervised learning method was proposed to extract a review summary by exploiting the helpfulness scores in reviews (Xiong and Litman, 2014).
On the other hand, abstractive approaches for opinion summarization can generate novel words that do not appear in the input document. Gerani et al. (Gerani et al., 2014) proposed a template filling strategy to generate a review summary. Wang and Ling (Wang and Ling, 2016) applied the attentional encoder-decoder model to generate an abstractive summary for opinionated documents. All of the above methods consider opinion summarization in a multi-document setting, while our work considers opinion summarization in a single-document setting. Li et al. (Li et al., 2019) studied the problem of personalized review summarization in a single-review setting. They incorporated a user’s frequently used words into the encoder-decoder model to generate a user-aware summary. In contrast, we focus on modeling the shared sentiment information between the tasks of review summarization and sentiment classification, which is orthogonal to the personalized review generation problem. Hsu et al. (Hsu et al., 2018) proposed a unified model for extractive and abstractive summarization with an inconsistency loss to penalize the disagreement between the extractive and abstractive attention scores. Their model penalizes the inconsistencies between two different attention methods on the same view (i.e., the source input). In contrast, we introduce an inconsistency loss to penalize the disagreement between the outputs of two different classifiers on different views (i.e., the source view and the summary view).
Review Sentiment Classification. Review sentiment classification aims at analyzing online consumer reviews and predicting the sentiment attitude of a consumer towards a product. Review sentiment classification tasks can be categorized into three groups: document-level, sentence-level, and aspect-level sentiment classification (Zhang et al., 2018; Cheng et al., 2017). We focus on document-level sentiment classification, which assigns an overall sentiment orientation to the input document (Zhang et al., 2018) and is usually treated as a kind of document classification task (Pang et al., 2002; Pang and Lee, 2005). Traditional methods focus on designing effective features that are used in either supervised learning methods (Qu et al., 2010; Gao et al., 2013) or graph-based semi-supervised methods (Goldberg and Zhu, 2006). Recently, neural network based methods, which do not require hand-crafted features, have achieved state-of-the-art performance on this task. For example, Tang et al. (Tang et al., 2015) proposed a neural network based hierarchical encoding process to learn an effective review representation. Hierarchical attention mechanisms (Yang et al., 2016; Yin et al., 2017; Zhou et al., 2016) have also been extensively explored in this task for constructing an effective representation of the review. Different from previous methods that are designed only for the review sentiment classification task, we propose a unified model for simultaneously generating the summary of the review and classifying the review sentiment.
Joint Text Summarization and Classification. There are different joint models for both text summarization and classification. Cao et al. (Cao et al., 2017) proposed a neural network model that jointly classifies the category and extracts summary sentences for a group of news articles, but it can only improve the performance of text summarization. Yang et al. (Yang et al., 2018) proposed a joint model that uses domain classification as an auxiliary task to improve the performance of review summarization. Moreover, their model uses different lexicons to identify sentiment words and aspect words in the review text, and then incorporates them into the decoder via an attention mechanism to generate aspect/sentiment-aware review summaries. The above two methods use a domain classification task to improve the performance of summarization, while our method jointly improves the performance of both review summarization and sentiment classification. Two models (Mane et al., 2015; Hole and Takalikar, 2013) were proposed to jointly extract a summary and predict the sentiment label for a social media post, but the summarization module and classification module of these models are trained separately and require rich hand-crafted features. Recently, Ma et al. (Ma et al., 2018) proposed an end-to-end neural model that jointly improves the performance of the review summarization and sentiment classification tasks. However, their sentiment classifier only collects sentiment information from the review. Our dual-view model has a source-view sentiment classifier and a summary-view sentiment classifier to model the sentiment information in the review and the summary, respectively. We also introduce an inconsistency loss to encourage the consistency between these two classifiers.
Notations. We use bold lowercase characters to denote vectors, bold uppercase characters to denote matrices, and calligraphic characters to denote sets. We use $\mathbf{W}$ and $\mathbf{b}$ to denote a projection matrix and a bias vector in a neural network layer.
Problem definition. We formally define the problem of review summarization and sentiment classification as follows. Given a review text, we output the summary and the sentiment label of the review text. The review text and its summary are both sequences of words. The sentiment label is an integer that indicates the sentiment attitude of the review text, where 1 denotes the most negative sentiment and 5 denotes the most positive sentiment.
4. Model Architecture
Our joint model consists of three major modules: (M1) shared text encoder module, (M2) summary decoder module, and (M3) dual-view sentiment classification module. The input review text is first encoded by the shared text encoder into context-aware representations, which form a memory bank for the summary decoder module and the dual-view sentiment classification module. Then, the summary decoder uses the memory bank from the encoder to compute a sequence of hidden states and generates a review summary word by word. The ground-truth summary is used to compute a summary generation loss for the model. Our dual-view classification module consists of a source-view sentiment classifier and a summary-view sentiment classifier. The source-view sentiment classifier reads the encoder memory bank and predicts a sentiment label for the review, while the summary-view sentiment classifier uses the decoder hidden states to predict a sentiment label for the generated summary. We use the ground-truth sentiment label of the review to compute a sentiment classification loss for both of these sentiment classifiers. We also introduce an inconsistency loss function to penalize the disagreement between these two classifiers. We jointly minimize all the above loss functions within a multi-task learning framework. During testing, we use the sentiment label predicted by the source-view classifier as the output sentiment label. Figure 2 illustrates the overall architecture of our model. We describe the details of each module as follows.
4.2. Shared Text Encoder (M1)
The shared text encoder converts the review text into a memory bank for the sentiment classification and summary decoder modules. First, the encoder reads the input sequence and uses a lookup table to convert each input word to a word embedding vector. To incorporate the contextual information of the review text into the representation of each word, we feed the embedding vectors to a bi-directional Gated Recurrent Unit (GRU) (Cho et al., 2014) to learn a shallow hidden representation. More specifically, the bi-directional GRU consists of a forward GRU that reads the embedding sequence from the first word to the last and a backward GRU that reads it from the last word to the first. We concatenate the forward and backward hidden states at each position to form the shallow hidden representation of the corresponding word.
Next, we pass the shallow hidden representations to another bi-directional GRU to model more complex interactions among the words in the review text. As in the first layer, we concatenate the forward and backward hidden states of this GRU at each position. Then we apply a residual connection from the shallow hidden representation to the output of the second GRU, which is a standard technique for avoiding the vanishing gradient problem in deep neural networks (He et al., 2016). The residual connection involves a hyperparameter. We regard the final encoder representations as the memory bank for the later summary decoder and dual-view classification modules.
4.3. Summary Decoder (M2)
The decoder uses a forward GRU to generate an output summary step by step. The architecture of our summary decoder follows the decoder of the pointer-generator network (See et al., 2017). At each decoding step, the forward GRU reads the embedding of the previous prediction and the previous decoder hidden state to yield the current decoder hidden state. At the first step, the input is the embedding of the start token. To gather relevant information from the document, a neural attention layer (Bahdanau et al., 2015) is then applied to compute an attention score between the current decoder hidden state and each of the vectors in the encoder memory bank, using trainable model parameters. The attention scores are next used to compute a weighted sum of the memory bank vectors and produce an aggregated vector, which acts as the representation of the input sequence at the current step. After that, we use the aggregated vector and the decoder hidden state to compute a probability distribution over the words in a predefined vocabulary, conditioned on the partial sequence of previously generated words and the input review. The projection matrices and bias vectors of this output layer are trainable parameters.
Although we can directly use this vocabulary distribution as our final prediction distribution, the decoder would then be unable to generate out-of-vocabulary (OOV) words. To address this problem, we adopt the copy mechanism from (See et al., 2017) to empower the decoder to predict OOV words by directly copying words from the input review. In the copy mechanism, we first compute a soft gate that chooses between generating a word from the predefined vocabulary according to the vocabulary distribution and copying a word from the source text according to the attention distribution. The gate is computed with trainable parameters followed by the logistic sigmoid function. Next, we define a dynamic vocabulary as the union of the predefined vocabulary and all the words that appear in the review. Finally, we compute a probability distribution over the words in the dynamic vocabulary by combining the generation and copying probabilities, weighted by the soft gate. In this distribution, the generation probability is defined as zero for all words outside the predefined vocabulary (OOV words). If a word does not appear in the review, its copying probability is zero.
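The mixing step of the copy mechanism can be illustrated with a minimal Python sketch. It is not the implementation used in our experiments: the function name `final_distribution`, the dictionary-based representation, and the toy probabilities are assumptions made for illustration, and the gate value is passed in directly instead of being computed by a sigmoid layer.

```python
def final_distribution(p_gen, vocab_probs, attention, source_tokens):
    """Mix generation and copy probabilities over a dynamic vocabulary.

    p_gen         : soft gate in [0, 1] (probability of generating).
    vocab_probs   : dict word -> probability over the predefined vocabulary.
    attention     : attention weights over source positions (sums to 1).
    source_tokens : the review tokens, aligned with `attention`.
    """
    # Generation part: words outside the predefined vocabulary implicitly get 0.
    mixed = {word: p_gen * prob for word, prob in vocab_probs.items()}
    # Copy part: accumulate the attention mass of every position holding the word.
    for token, weight in zip(source_tokens, attention):
        mixed[token] = mixed.get(token, 0.0) + (1.0 - p_gen) * weight
    return mixed


# Toy example (hypothetical numbers): "tippy" is out-of-vocabulary but
# appears twice in the review, so it can only be produced by copying.
vocab_probs = {"very": 0.5, "colorful": 0.3, "<unk>": 0.2}
source_tokens = ["colorful", "but", "tippy", "tippy"]
attention = [0.1, 0.1, 0.4, 0.4]
dist = final_distribution(0.6, vocab_probs, attention, source_tokens)
print(dist)
```

In this toy case the OOV word "tippy" receives probability mass only through the copy branch, while "colorful" receives mass from both branches.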
Summary generation loss function. We use the negative log-likelihood of the ground-truth summary as the loss function for the review summarization task. The loss sums the negative log-probability of each ground-truth summary word over all words in the ground-truth review summary.
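As a small illustration of this loss, the sketch below sums the negative log-probabilities of the ground-truth words. The function name `summary_nll` and the per-step dictionaries are hypothetical, and the sketch assumes that every ground-truth word has non-zero probability at its step.

```python
import math


def summary_nll(step_distributions, reference):
    """Negative log-likelihood of a reference summary.

    step_distributions : list of dicts (word -> probability), one per
                         decoding step, produced under teacher forcing.
    reference          : list of ground-truth summary words.
    """
    loss = 0.0
    for dist, word in zip(step_distributions, reference):
        # Assumption: the reference word has non-zero probability.
        loss -= math.log(dist[word])
    return loss


# Toy example: probabilities 0.5 and 0.25 give a loss of log(8).
steps = [{"very": 0.5, "good": 0.5}, {"good": 0.25, ".": 0.75}]
print(summary_nll(steps, ["very", "good"]))
```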
Inference. In the testing stage, we use beam search to generate the output summary from the summary decoder. This is a standard technique to approximate the output sequence that achieves the highest generation probability.
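To make the procedure concrete, the following is a simplified beam search sketch over a toy next-word model. The function names, the lookup-table "decoder", and the absence of length normalization are assumptions made for illustration; a real decoder would compute each next-word distribution from its hidden state.

```python
import math


def beam_search(next_token_probs, beam_size, max_depth, eos="<eos>"):
    """Return (tokens, log_prob) of the best sequence found by beam search."""
    beams = [([], 0.0)]  # (tokens so far, cumulative log-probability)
    finished = []
    for _ in range(max_depth):
        candidates = []
        for tokens, score in beams:
            for token, prob in next_token_probs(tokens).items():
                if prob > 0.0:
                    candidates.append((tokens + [token], score + math.log(prob)))
        # Keep the beam_size highest-scoring expansions.
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_size]:
            if tokens[-1] == eos:
                finished.append((tokens, score))
            else:
                beams.append((tokens, score))
        if not beams:
            break
    pool = finished if finished else beams
    return max(pool, key=lambda item: item[1])


def toy_model(prefix):
    """Hypothetical next-word distributions keyed by the generated prefix."""
    table = {
        (): {"good": 0.6, "very": 0.4},
        ("good",): {"<eos>": 0.5, "bowl": 0.5},
        ("very",): {"handy": 1.0},
    }
    return table.get(tuple(prefix), {"<eos>": 1.0})


print(beam_search(toy_model, beam_size=2, max_depth=5)[0])  # ['very', 'handy', '<eos>']
print(beam_search(toy_model, beam_size=1, max_depth=5)[0])  # ['good', '<eos>']
```

With a beam size of 1 the search is greedy and commits to "good", whose best continuation has probability 0.3, whereas a beam size of 2 recovers the sequence with probability 0.4.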
4.4. Dual-view Sentiment Classification Module (M3)
We propose a dual-view sentiment classification module to learn a sentiment label for the review. It consists of a source-view sentiment classifier and a summary-view sentiment classifier.
Source-view sentiment classifier. The source-view sentiment classifier utilizes the encoder memory bank to predict a sentiment label for the review. Since not all words in the review contribute equally to the prediction of the sentiment label, we apply the attention mechanism (Bahdanau et al., 2015) to aggregate sentiment information from the encoder memory bank into a sentiment context vector. An additional glimpse operation (Vinyals et al., 2016) is incorporated in this aggregation process, since it has been shown to improve the performance of several attention-based classification models (Bello et al., 2016; Chen and Bansal, 2018).
First, a trainable query vector attends to the encoder memory bank and produces a glimpse vector, computed as an attention-weighted sum of the memory bank vectors with trainable model parameters. Then, the glimpse vector attends to the memory bank again to compute a review sentiment context vector as a weighted sum of the memory bank vectors with the corresponding attention weights. From this context vector, the classifier computes a probability distribution over the sentiment labels using its own model parameters. The sentiment label with the highest probability is treated as the predicted sentiment label for the review.
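The two-pass attention with a glimpse can be sketched as follows. This is a simplification under stated assumptions: the sketch uses plain dot-product scores in place of the parameterized attention layer of the model, and the names `attend` and `glimpse_context` are hypothetical.

```python
import math


def softmax(scores):
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, memory):
    """Dot-product attention: return (weights, weighted sum of memory rows)."""
    scores = [sum(q * m for q, m in zip(query, row)) for row in memory]
    weights = softmax(scores)
    dim = len(memory[0])
    context = [sum(w * row[d] for w, row in zip(weights, memory)) for d in range(dim)]
    return weights, context


def glimpse_context(query, memory):
    """First pass builds a glimpse vector; second pass builds the context."""
    _, glimpse = attend(query, memory)           # query -> glimpse vector
    weights, context = attend(glimpse, memory)   # glimpse -> sentiment context
    return weights, context


# Toy memory bank with three 2-dimensional vectors (hypothetical values).
memory = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, context = glimpse_context([0.0, 0.0], memory)
print(weights, context)
```

In the model, the resulting context vector would then be passed to the classification layers to obtain the distribution over sentiment labels.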
Summary-view sentiment classifier. The summary-view sentiment classifier uses the decoder hidden states to predict a sentiment label for the generated summary. Since each decoder hidden state is used to generate a summary word, we treat the decoder hidden states as the representation of the generated summary. Then, we apply the attention mechanism (Bahdanau et al., 2015) with the glimpse operation (Vinyals et al., 2016) to compute a sentiment context vector for the generated summary. The architecture of the summary-view sentiment classifier is the same as that of the source-view sentiment classifier, but with a separate set of parameters. The only difference is that it uses the decoder hidden states as the input instead of the encoder memory bank, i.e., the encoder memory bank vectors in the equations of the source-view sentiment classifier are replaced by the decoder hidden states. Then, the summary-view sentiment classifier outputs a probability distribution over the sentiment labels for the generated summary.
Sentiment classification loss function. We use the negative log-likelihood of the ground-truth sentiment label as the classification loss function for both the source-view sentiment classifier and the summary-view sentiment classifier. This yields two classification losses, one for the source-view classifier and one for the summary-view classifier.
|Dataset||Training||Validation||Testing||Ave. RL||Ave. SL||Copy Ratio||Rating 1||Rating 2||Rating 3||Rating 4||Rating 5|
Inconsistency loss function. We introduce an inconsistency loss to penalize the disagreement between the source-view sentiment classifier and the summary-view sentiment classifier. The intuition is that the review summary should have the same sentiment attitude as the input review. We define our inconsistency loss function as the Kullback-Leibler (KL) divergence between the sentiment label distributions predicted by the source-view and summary-view sentiment classifiers.
Since the summary-view sentiment classifier uses the decoder hidden states to predict the sentiment label for the generated summary, the inconsistency loss encourages the sentiment information in the decoder states to be close to that in the encoder memory bank. Thus, it helps the decoder to generate a summary that has a consistent sentiment with the review. Moreover, the source-view and summary-view sentiment classifiers can learn from each other to improve the sentiment classification performance.
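A minimal sketch of a KL-divergence penalty between two predicted sentiment distributions is given below. The direction of the divergence (which classifier plays the role of the first argument) and the smoothing constant `eps` are assumptions of this sketch, not details taken from the model description.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0.0
    )


# Hypothetical predictions over the five sentiment labels.
source_view = [0.05, 0.10, 0.60, 0.20, 0.05]
summary_view = [0.05, 0.10, 0.30, 0.50, 0.05]
print(kl_divergence(source_view, summary_view))  # positive: the views disagree
print(kl_divergence(source_view, source_view))   # zero: the views agree
```

The penalty is zero only when the two classifiers predict the same distribution, so minimizing it pushes the two views towards agreement.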
Inference. In the testing stage, we use the sentiment label predicted by the source-view classifier as the final classification prediction. The reason is that the decoder suffers from the well-known exposure bias problem (Ranzato et al., 2016) during testing, which degrades the performance of the summary-view classifier at inference time. We conduct experiments to analyze the influence of the exposure bias on the classification performance and provide more discussion in Section 6.3.
4.5. Multi-task Training Objective
We adopt a multi-task learning framework to jointly minimize the review summarization loss, source-view sentiment classification loss, summary-view sentiment classification loss, and inconsistency loss. The objective function is a weighted sum of these four loss functions, where the weights are hyper-parameters tuned on the validation sets. Thus, each module of our joint model can be trained end-to-end.
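The weighted combination can be sketched as below. The loss names and all numeric values, including the weights, are hypothetical placeholders chosen for illustration; they are not the values used in our experiments.

```python
def joint_loss(losses, weights):
    """Weighted sum of the individual training losses."""
    if set(losses) != set(weights):
        raise ValueError("losses and weights must use the same names")
    return sum(weights[name] * losses[name] for name in losses)


# Hypothetical loss values and weights for one training batch.
losses = {
    "summary": 2.0,            # summary generation loss
    "source_cls": 1.0,         # source-view classification loss
    "summary_cls": 1.5,        # summary-view classification loss
    "inconsistency": 0.5,      # KL-based inconsistency loss
}
weights = {"summary": 1.0, "source_cls": 0.5, "summary_cls": 0.5, "inconsistency": 0.1}
print(joint_loss(losses, weights))
```

Because the four terms share the encoder and decoder parameters, a single backward pass through this scalar would update every module jointly.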
5. Experimental Setup
In this work, we conduct experiments on four real-world datasets from different domains. The datasets are collected from the Amazon 5-core review repository (McAuley et al., 2015). We adopt product reviews from the following four domains as our datasets: Sports & Outdoors; Movies & TV; Toys & Games; and Home & Kitchen. In our experiments, each data sample consists of a review text, a summary, and a rating. We regard the rating as a sentiment label, which is an integer in the range of 1 to 5. For text preprocessing, we lowercase all the letters and tokenize the text using Stanford CoreNLP (Manning et al., 2014). We append a period to a summary sentence if it is not ended properly. To reduce the noise in these datasets, we filter out data samples whose review length is less than 16 or greater than 800, or whose summary length is less than 4. We randomly split each dataset into training, validation, and testing sets. The statistics of these datasets are shown in Table 1.
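The filtering and normalization rules above can be sketched as follows. The sketch assumes that the inputs are already tokenized, that lengths are measured in tokens before the period is appended, and that a summary is properly ended when its last token is one of ".", "!", or "?"; these choices and the name `clean_sample` are assumptions for illustration.

```python
TERMINAL = {".", "!", "?"}  # assumed set of proper sentence endings


def clean_sample(review_tokens, summary_tokens,
                 min_review=16, max_review=800, min_summary=4):
    """Lowercase, filter by length, and terminate the summary with a period.

    Returns (review, summary) or None when the sample is filtered out.
    """
    review = [token.lower() for token in review_tokens]
    summary = [token.lower() for token in summary_tokens]
    if len(review) < min_review or len(review) > max_review:
        return None
    if len(summary) < min_summary:
        return None
    if not summary or summary[-1] not in TERMINAL:
        summary.append(".")
    return review, summary


sample = clean_sample(["Word"] * 20, ["Very", "Colorful", "And", "Handy"])
print(sample[1])  # ['very', 'colorful', 'and', 'handy', '.']
print(clean_sample(["word"] * 5, ["good", "bowl", "for", "kids"]))  # None
```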
5.2. Evaluation Metrics
For review summarization, we use the ROUGE score (Lin, 2004) as the evaluation metric, which is a standard evaluation metric in the field of text summarization (See et al., 2017; Li et al., 2017). Following (Ma et al., 2018), we use ROUGE-1, ROUGE-2, and ROUGE-L scores to evaluate the quality of the generated review summaries. ROUGE-1 and ROUGE-2 measure the overlapping uni-grams and bi-grams between the generated review summary and the ground-truth review summary. ROUGE-L measures the longest common subsequence between the generated summary and the ground-truth review summary. We refer the readers to (Lin, 2004) for details.
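For intuition, the overlap statistics behind these metrics can be sketched in a few lines. This is a simplified illustration, not the official ROUGE toolkit: it reports an F1-style score, ignores stemming and other toolkit options, and the function names are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def f1(overlap, cand_total, ref_total):
    if overlap == 0 or cand_total == 0 or ref_total == 0:
        return 0.0
    precision = overlap / cand_total
    recall = overlap / ref_total
    return 2 * precision * recall / (precision + recall)


def rouge_n(candidate, reference, n):
    """F1 over clipped n-gram overlap between candidate and reference."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    overlap = sum((cand & ref).values())  # multiset intersection clips counts
    return f1(overlap, sum(cand.values()), sum(ref.values()))


def lcs_length(a, b):
    """Length of the longest common subsequence via dynamic programming."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            if x == y:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[len(a)][len(b)]


def rouge_l(candidate, reference):
    return f1(lcs_length(candidate, reference), len(candidate), len(reference))


generated = "very colorful but tippy".split()
reference = "very colorful and handy but tippy".split()
print(rouge_n(generated, reference, 1))  # approximately 0.8
print(rouge_n(generated, reference, 2))  # approximately 0.5
print(rouge_l(generated, reference))     # approximately 0.8
```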
For sentiment classification, we use the macro $F_1$ score and the balanced accuracy (Brodersen et al., 2010) as the evaluation metrics. We denote the macro $F_1$ score as “M. $F_1$” and the balanced accuracy as “B. Acc”. From Table 1, we can observe that the class distribution of the sentiment labels is imbalanced. Therefore, we do not use overall accuracy as the evaluation metric. To compute the macro $F_1$ score, we first calculate the precision and recall for each individual sentiment class. Next, we compute the macro-averaged precision and recall by averaging the per-class precision and recall values over all sentiment classes. The macro $F_1$ score is the harmonic mean of the macro-averaged precision and the macro-averaged recall. The balanced accuracy is a variant of the accuracy metric for imbalanced datasets (Brodersen et al., 2010; Kelleher et al., 2015). It is defined as the macro-average of the recall obtained on each class.
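The two classification metrics, as defined above, can be computed with the short sketch below. The function names are hypothetical, and the convention of assigning zero precision or recall to a class that is never predicted or never observed is an assumption of this sketch.

```python
def per_class_precision_recall(y_true, y_pred, labels):
    stats = {}
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        predicted = sum(1 for p in y_pred if p == label)
        actual = sum(1 for t in y_true if t == label)
        precision = tp / predicted if predicted else 0.0  # assumed convention
        recall = tp / actual if actual else 0.0           # assumed convention
        stats[label] = (precision, recall)
    return stats


def macro_f1_and_balanced_accuracy(y_true, y_pred, labels):
    """Macro F1 (harmonic mean of macro P and macro R) and balanced accuracy."""
    stats = per_class_precision_recall(y_true, y_pred, labels)
    macro_p = sum(p for p, _ in stats.values()) / len(labels)
    macro_r = sum(r for _, r in stats.values()) / len(labels)
    if macro_p + macro_r == 0.0:
        return 0.0, macro_r
    macro_f1 = 2 * macro_p * macro_r / (macro_p + macro_r)
    return macro_f1, macro_r  # balanced accuracy equals the macro recall


y_true = [1, 1, 2, 2, 3, 3]
y_pred = [1, 2, 2, 2, 3, 1]
print(macro_f1_and_balanced_accuracy(y_true, y_pred, labels=[1, 2, 3]))
```

Note that this macro $F_1$ is the harmonic mean of the macro-averaged precision and recall, which can differ from the mean of per-class $F_1$ scores reported by some toolkits.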
Our baselines are categorized into three groups: (1) summarization-only models; (2) sentiment-only models; and (3) joint models. We use the following summarization-only models as our review summarization baselines.
PGNet (See et al., 2017): A popular summarization model which is based on the encoder-decoder framework with attention and copy mechanisms.
The following sentiment-only models are employed as our sentiment classification baselines.
BiGRU+Attention: A bi-directional GRU layer (Cho et al., 2014) first encodes the input review into hidden states. Then it uses an attention mechanism (Bahdanau et al., 2015) with glimpse operation (Vinyals et al., 2016) to aggregate information from the encoder hidden states and produce a vector. The vector is then fed through a two-layer feed-forward neural network to predict the sentiment label.
DARLM (Zhou et al., 2018): The state-of-the-art model for sentence classification.
We also use the following joint models as the baselines of both review summarization and sentiment classification.
HSSC (Ma et al., 2018): The state-of-the-art model for jointly improving review summarization and sentiment classification.
Max (Ma et al., 2018): A bi-directional GRU layer first encodes the input review into hidden states. These hidden states are then shared by a summary decoder and a sentiment classifier. The sentiment classifier uses max pooling to aggregate the encoder hidden states into a vector, which is then fed through a two-layer feed-forward neural network to predict the sentiment label.
Max+copy: We also incorporate the copy mechanism (See et al., 2017) into the Max model as another strong baseline.
5.4. Implementation Details
We train a word2vec model (Mikolov et al., 2013) with an embedding dimension of 128 on the training set of each dataset to initialize the word embeddings of all the models, including the baselines. All the initialized embeddings are fine-tuned by back-propagation during training. The vocabulary is defined as the 50,000 most frequent words in the training set. The hidden sizes of our model are set to 512. The hyperparameter in the residual connection is set to 0.5. The initial hidden states of the GRU cells in the shared text encoder are zero vectors, while the initial hidden state of the GRU cell in the summary decoder is set to . A dropout layer with is applied to the two-layer feed-forward neural networks in the source-view and summary-view sentiment classifiers. The hidden sizes of the GRU cells and the feed-forward neural networks in our baselines are set to 512, which is the same as in our model. During training, we truncate the input review text to 400 tokens and the output summary to 100 tokens. We use the Adam optimization algorithm (Kingma and Ba, 2015) with a batch size of 32 and an initial learning rate of 0.001. We apply gradient clipping with a maximum L2 norm of 2.0. The learning rate is reduced by half if the validation loss stops dropping. We apply early stopping when the validation loss stops decreasing for three consecutive checkpoints. During testing, we use the beam search algorithm with a beam size of 5 and a maximum depth of 120 for all models. We repeat all experiments five times with different random seeds and report the averaged results. Source code is available at https://github.com/kenchan0226/dual_view_review_sum.
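The learning-rate and early-stopping rules can be sketched as a small simulation over a sequence of validation losses. The sketch assumes that "stops dropping" means "does not improve on the best validation loss seen so far"; this interpretation and the name `run_schedule` are assumptions, and the loss values are made up.

```python
def run_schedule(validation_losses, initial_lr=0.001, patience=3):
    """Halve the learning rate on every non-improving checkpoint and stop
    after `patience` consecutive non-improving checkpoints."""
    lr = initial_lr
    best = float("inf")
    bad_checkpoints = 0
    history = []  # (checkpoint index, learning rate after the checkpoint)
    for step, loss in enumerate(validation_losses):
        if loss < best:
            best = loss
            bad_checkpoints = 0
        else:
            bad_checkpoints += 1
            lr *= 0.5
        history.append((step, lr))
        if bad_checkpoints >= patience:
            break  # early stopping
    return lr, history


# Hypothetical validation losses at successive checkpoints.
losses = [1.0, 0.8, 0.9, 0.85, 0.7, 0.75, 0.72, 0.71, 0.6]
final_lr, history = run_schedule(losses)
print(final_lr, len(history))
```

In this toy run, training stops after the eighth checkpoint, so the final loss value of 0.6 is never reached.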
|M. $F_1$||B. Acc||M. $F_1$||B. Acc||M. $F_1$||B. Acc||M. $F_1$||B. Acc|
6. Results Analysis
Our experiments are intended to address the following research questions.
RQ1: What is the performance of our proposed model on review summarization and sentiment classification?
RQ2: What is the impact of each component of our model on the overall performance?
RQ3: Which of the source-view and summary-view classifiers is better? Can we further improve the sentiment classification performance if we combine the source-view and summary-view classifiers by ensemble?
RQ4: Are the generated review summaries consistent with the predicted sentiment labels?
6.1. Main Results (RQ1)
We show the review summarization results on the four datasets in Table 2. We note that our dual-view model achieves the best performance on almost all the metrics across the four real-world datasets, demonstrating the effectiveness of our model in summarizing product reviews from different domains. We also conduct a significance test comparing our model with HSSC, Max, and PGNet. The results show that our dual-view model significantly outperforms these three baselines on most of the metrics (paired t-test).
The review sentiment classification results are shown in Table 3. For our dual-view model, the classification results of the source-view classifier are reported in this table. We observe that our dual-view model consistently outperforms all the baseline methods on all the datasets. These results show that our model can predict more accurate sentiment labels than the baselines. Moreover, the significance test results against HSSC and Max indicate that the improvements achieved by our dual-view model are significant on most of the metrics (paired t-test).
6.2. Ablation Study (RQ2)
We conduct an ablation study to verify the effectiveness of each important component of our model. The results on the Sports dataset are displayed in Table 4. “-I” denotes that we do not incorporate the inconsistency loss during training. By comparing the performance of the full model and “-I” in the table, we observe that after removing the inconsistency loss, the performance of both review summarization and sentiment classification drops noticeably. Our experimental results on the Sports validation set also show that our inconsistency loss substantially reduces the proportion of inconsistent predicted sentiment labels between the source-view and summary-view sentiment classifiers from 11.9% to 6.3%. If we replace the attention mechanism in the classifiers with a max-pooling operation (i.e., compare “Full” and “-A” in the table), the performance decreases as we anticipated. We also find that after removing the residual connection in the encoder (i.e., compare “Full” and “-R”), the performance of both review summarization and sentiment classification degrades, which suggests that the residual connection module is effective for both tasks. Moreover, we note that the copy mechanism (i.e., compare “Full” and “-C”) is helpful for both review summarization and sentiment classification.
|Classifiers||Validation Set (w/ TF)||Validation Set (w/o TF)||Testing Set (w/ TF)||Testing Set (w/o TF)|
6.3. Classifier Ensemble (RQ3)
Although Table 3 reports the performance of the source-view sentiment classifier of our dual-view model, we are also interested in the performance of the summary-view sentiment classifier and in whether merging the two classifiers improves sentiment classification. To combine the predictions of the source-view and summary-view sentiment classifiers, we average their predicted sentiment label probability distributions into a merged prediction probability distribution. We report the macro F1 scores of the source-view, summary-view, and merged sentiment classifiers in Table 5.
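The merging step described above can be sketched as an element-wise average of the two probability vectors. This is an illustration under the assumption that each classifier outputs a distribution over the five sentiment labels; the function names and the probabilities are hypothetical.

```python
def merge_predictions(p_source, p_summary):
    """Average two probability distributions over the same label set."""
    if len(p_source) != len(p_summary):
        raise ValueError("distributions must cover the same labels")
    return [(a + b) / 2 for a, b in zip(p_source, p_summary)]


def predict_label(probs):
    """Return the most probable label, with labels numbered from 1."""
    return max(range(len(probs)), key=probs.__getitem__) + 1


# Hypothetical distributions over sentiment labels 1 to 5.
p_source = [0.05, 0.10, 0.50, 0.25, 0.10]
p_summary = [0.05, 0.05, 0.30, 0.50, 0.10]
merged = merge_predictions(p_source, p_summary)
print(predict_label(p_source), predict_label(p_summary), predict_label(merged))
```

Because both inputs sum to one, their average also sums to one, so the merged vector can be used directly as a distribution.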
From this table, we note that the source-view classifier achieves the best results on three datasets. Thus, combining the source-view and summary-view sentiment classifiers does not yield better performance in most cases. Moreover, the source-view sentiment classifier consistently outperforms the summary-view sentiment classifier on all of these datasets. The main reason is that the exposure bias issue (Ranzato et al., 2016) of the RNN decoder affects the performance of the summary-view sentiment classifier during testing. More specifically, in the training stage, the previous ground-truth summary tokens are fed into the decoder to predict the next summary token (i.e., teacher forcing), so the hidden states of the decoder can be regarded as hidden representations of the ground-truth summary. In the testing stage, the ground-truth summary is not available. Instead, the previously predicted tokens are fed into the decoder to predict the next summary token, and errors accumulate in the decoder hidden states. Therefore, the summary-view classifier, which is based on the decoder hidden states, performs worse during testing.
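The difference between the two decoding regimes can be illustrated with a toy example. In this sketch a lookup table stands in for the RNN decoder, which is a deliberate simplification and not the model used in the paper; the table entries and token sequences are hypothetical. The point is only to show which tokens the decoder receives as input in each regime.

```python
def decode(next_token, gold, teacher_forcing):
    """Run a toy decoder and return (inputs fed to the decoder, predicted tokens)."""
    prev = "<s>"
    inputs, outputs = [], []
    for target in gold:
        inputs.append(prev)
        pred = next_token.get(prev, "<unk>")
        outputs.append(pred)
        # Teacher forcing feeds the ground-truth token; otherwise feed the prediction.
        prev = target if teacher_forcing else pred
    return inputs, outputs


# Toy "decoder": maps the previous token to a predicted next token.
# It makes one mistake at the first step ("good" instead of "great").
toy_decoder = {
    "<s>": "good", "great": "ice", "ice": "cream", "cream": "!",
    "good": "maker", "maker": ".",
}
gold_summary = ["great", "ice", "cream", "!"]

tf_inputs, tf_outputs = decode(toy_decoder, gold_summary, teacher_forcing=True)
free_inputs, free_outputs = decode(toy_decoder, gold_summary, teacher_forcing=False)
print("with teacher forcing:   ", tf_inputs, "->", tf_outputs)
print("without teacher forcing:", free_inputs, "->", free_outputs)
```

With teacher forcing, the first-step error does not propagate because every later input is a ground-truth token. Without it, the decoder conditions on its own wrong prediction, so all later inputs (and, in a real RNN, the hidden states built from them) drift away from the ground-truth summary.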
We conduct experiments to demonstrate the effects of the exposure bias problem. Table 6 displays the macro F1 scores of different sentiment classifiers with and without teacher forcing. When teacher forcing is applied (i.e., w/ TF in the table), the summary-view classifier outperforms the source-view classifier by a large margin. However, its performance drops dramatically on both the validation and testing sets once teacher forcing is removed (i.e., w/o TF). As we anticipated, the source-view classifier is not affected by whether teacher forcing is adopted, since it performs sentiment classification from the source input view rather than the summary view. The balanced accuracy scores lead to similar observations, so we omit them from Tables 5 and 6 for brevity. These results suggest that narrowing the gap between the training-time and testing-time performance of the summary-view sentiment classifier is a direction for future work.
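For reference, the two classification metrics mentioned in this section can be computed as follows. This is a minimal sketch assuming that the macro score is the macro-averaged F1 and that balanced accuracy is the mean of per-class recall, which is the common multi-class definition; the function names and the labels are hypothetical and unrelated to the reported results.

```python
def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 scores."""
    f1_scores = []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


def balanced_accuracy(y_true, y_pred, labels):
    """Unweighted mean of per-class recall, skipping classes with no true examples."""
    recalls = []
    for c in labels:
        support = sum(1 for t in y_true if t == c)
        if support == 0:
            continue
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        recalls.append(tp / support)
    return sum(recalls) / len(recalls)


# Hypothetical gold and predicted sentiment labels (illustration only).
y_true = [1, 1, 3, 3, 5, 5, 5, 5]
y_pred = [1, 3, 3, 3, 5, 5, 5, 1]
labels = [1, 3, 5]
mf1 = macro_f1(y_true, y_pred, labels)
bacc = balanced_accuracy(y_true, y_pred, labels)
print(f"macro F1 = {mf1:.4f}, balanced accuracy = {bacc:.4f}")
```

Both metrics weight every class equally, which matters for review datasets where the sentiment labels are typically imbalanced.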
6.4. Case Studies (RQ4)
We conduct case studies to analyze the readability of the generated review summaries and the sentiment consistency between the predicted sentiment labels and summaries. Table 7 compares the predicted sentiment labels and the generated summaries from the HSSC+copy model (a strong baseline that enhances the state-of-the-art model HSSC (Ma et al., 2018) with a copy mechanism (See et al., 2017)) and our dual-view model on the testing sets. We use “oracle” to denote a model that always outputs the ground truth. From row “a” to row “d” of Table 7, we observe that the sentiment labels and the summaries predicted by the HSSC+copy model are not consistent with each other. For example, in row “a” of the table, the sentiment label predicted by HSSC+copy is 3, which indicates a neutral sentiment and matches the ground truth. However, its generated summary (“not worth the money”) conveys a negative sentiment. In row “d” of the table, the HSSC+copy model generates a summary with a positive sentiment that is similar to the ground truth, but it predicts a sentiment label of 3, which is not consistent with the positive sentiment of the generated summary. In contrast, the sentiment labels and the review summaries predicted by our dual-view model are consistent with each other in sentiment on these rows. Moreover, the HSSC+copy model sometimes omits sentiment words in the generated summaries, which makes the summaries less informative. For example, in row “f” of the table, the summary generated by HSSC+copy omits the sentiment word “paramount”. Thus, it does not reflect the opinion that the user likes the product very much, whereas the summary generated by our model contains “paramount”, which expresses the consumer’s opinion accurately.
|Id||Model||Predicted Sentiment Label and Summary|
|a||Oracle||3: you get what you pay for .|
|a||HSSC+copy||3: not worth the money .|
|a||Dual-view||3: you get what you pay for .|
|b||Oracle||1: do not buy it .|
|b||HSSC+copy||1: this thing is okay .|
|b||Dual-view||1: not worth it .|
|c||Oracle||1: worst toaster i have ever owned .|
|c||HSSC+copy||2: worst toaster i have ever owned .|
|c||Dual-view||1: worst toaster i have ever owned .|
|d||Oracle||5: best for side/stomach sleepers .|
|d||HSSC+copy||3: good for side/stomach sleepers .|
|d||Dual-view||5: comfortable for side/stomach sleepers .|
|e||Oracle||5: makes great ice cream !|
|e||HSSC+copy||5: ice cream maker .|
|e||Dual-view||5: great ice cream attachment !|
|f||Oracle||5: paramount stockton 5-piece daybed ensemble , twin leggett & platt - home textiles .|
|f||HSSC+copy||5: 5-piece daybed ensemble .|
|f||Dual-view||5: paramount stockton 5-piece daybed ensemble .|
|g||Oracle||4: works great for less money .|
|g||HSSC+copy||5: it ’s a rangefinder .|
|g||Dual-view||4: great for the price .|
7. Conclusion
We propose a novel dual-view model with an inconsistency loss to jointly improve the performance of review summarization and sentiment classification. Compared to previous work, our model has two advantages. First, it encourages the sentiment information in the decoder states to be similar to that in the review context representation, which helps the decoder generate a summary that has the same sentiment tendency as the review. Second, the source-view and summary-view sentiment classifiers can learn from each other to improve sentiment classification performance. Experimental results demonstrate that our model achieves better performance than the state-of-the-art models for the task of joint review summarization and sentiment classification.
Acknowledgements. The work described in this paper was partially supported by the Research Grants Council of the Hong Kong Special Administrative Region, China (CUHK 2300174 (Collaborative Research Fund, No. C5026-18GF)). We would like to thank the four anonymous reviewers for their comments.
References
- Neural machine translation by jointly learning to align and translate. In ICLR, May 7-9, 2015.
- CoRR abs/1611.09940.
- The balanced accuracy and its posterior distribution. In ICPR, 23-26 August 2010, pp. 3121–3124.
- Improving multi-document summarization via text classification. In AAAI, February 4-9, 2017, pp. 3053–3059.
- Fast abstractive summarization with reinforce-selected sentence rewriting. In ACL, July, 2018, Melbourne, Australia, pp. 675–686.
- Aspect-level sentiment classification with HEAT (hierarchical attention) network. In CIKM, November, 2017, pp. 97–106.
- Learning phrase representations using RNN encoder-decoder for statistical machine translation. In EMNLP, October, 2014, Doha, Qatar, pp. 1724–1734.
- Inducing document structure for aspect-based summarization. In ACL, July, 2019, Florence, Italy, pp. 6263–6273.
- Opinosis: A graph based approach to abstractive summarization of highly redundant opinions. In COLING, August, 2010, Beijing, China, pp. 340–348.
- Modeling user leniency and product popularity for sentiment classification. In IJCNLP, October, 2013, Nagoya, Japan, pp. 1107–1111.
- Bottom-up abstractive summarization. In EMNLP, October 31 - November 4, 2018, pp. 4098–4109.
- Abstractive summarization of product reviews using discourse structure. In EMNLP, October, 2014, Doha, Qatar, pp. 1602–1613.
- Seeing stars when there aren’t many stars: graph-based semi-supervised learning for sentiment categorization. In ACL, TextGraphs Workshop, June, 2006, New York City, pp. 45–52.
- Deep residual learning for image recognition. In CVPR, June 27-30, 2016, pp. 770–778.
- Long short-term memory. Neural Computation 9 (8), pp. 1735–1780.
- Real time tweet summarization and sentiment analysis of game tournament. International Journal of Science and Research 4 (9), pp. 1774–1780.
- A unified model for extractive and abstractive summarization using inconsistency loss. In ACL, July, 2018, Melbourne, Australia, pp. 132–141.
- Mining and summarizing customer reviews. In KDD, August 22-25, 2004, New York, NY, USA, pp. 168–177.
- Fundamentals of machine learning for predictive data analytics: algorithms, worked examples, and case studies. MIT Press.
- Adam: A method for stochastic optimization. In ICLR, May 7-9, 2015.
- Towards personalized review summarization via user-aware sequence network. In AAAI, January 27 - February 1, 2019, pp. 6690–6697.
- Neural rating regression with abstractive tips generation for recommendation. In SIGIR, August 7-11, 2017, pp. 345–354.
- ROUGE: a package for automatic evaluation of summaries. Text Summarization Branches Out.
- A hierarchical end-to-end model for jointly improving text summarization and sentiment classification. In IJCAI, July 13-19, 2018, pp. 4251–4257.
- Summarization and sentiment analysis from user health posts. In ICPC, 2015, pp. 1–4.
- The Stanford CoreNLP natural language processing toolkit. In ACL: System Demonstrations, June, 2014, Baltimore, Maryland, pp. 55–60.
- Image-based recommendations on styles and substitutes. In SIGIR, 2015, New York, NY, USA, pp. 43–52.
- Distributed representations of words and phrases and their compositionality. In NeurIPS, December 5-8, 2013, pp. 3111–3119.
- Combining text summarization and aspect-based sentiment analysis of users’ reviews to justify recommendations. In RecSys, September, 2019, pp. 383–387.
- Thumbs up? Sentiment classification using machine learning techniques. In EMNLP, July, 2002, Philadelphia, PA, USA, pp. 79–86.
- Seeing stars: exploiting class relationships for sentiment categorization with respect to rating scales. In ACL, June, 2005, Ann Arbor, Michigan, pp. 115–124.
- The bag-of-opinions method for review rating prediction from sparse text patterns. In COLING, August, 2010, Beijing, China, pp. 913–921.
- Sequence level training with recurrent neural networks. In ICLR, May 2-4, 2016.
- Get to the point: summarization with pointer-generator networks. In ACL, July, 2017, Vancouver, Canada, pp. 1073–1083.
- Document modeling with gated recurrent neural network for sentiment classification. In EMNLP, September, 2015, Lisbon, Portugal, pp. 1422–1432.
- Aspect and opinion aware abstractive review summarization with reinforced hard typed decoder. In CIKM, November, 2019, pp. 2061–2064.
- Attention is all you need. In NeurIPS, 4-9 December 2017, pp. 5998–6008.
- Order matters: sequence to sequence for sets. In ICLR, May 2-4, 2016.
- Neural network-based abstract generation for opinions and arguments. In NAACL, June, 2016, San Diego, California, pp. 47–57.
- Empirical analysis of exploiting review helpfulness for extractive summarization of online reviews. In COLING, August, 2014, Technical Papers, Dublin, Ireland, pp. 1985–1995.
- A survey on multi-view learning. CoRR abs/1304.5634.
- Aspect and sentiment aware abstractive review summarization. In COLING, August, 2018, Santa Fe, New Mexico, USA, pp. 1110–1120.
- Hierarchical attention networks for document classification. In NAACL, June, 2016, San Diego, California, pp. 1480–1489.
- Document-level multi-aspect sentiment classification as machine comprehension. In EMNLP, September, 2017, Copenhagen, Denmark, pp. 2044–2054.
- Deep learning for sentiment analysis: A survey. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 8 (4).
- Differentiated attentive representation learning for sentence classification. In IJCAI, July 13-19, 2018, pp. 4630–4636.
- Attention-based LSTM network for cross-lingual sentiment classification. In EMNLP, November, 2016, Austin, Texas, pp. 247–256.