With the ever-increasing interest in user-generated reviews on online marketplace websites, such as Amazon, Yelp and TripAdvisor, it is necessary to provide tools that encourage users to give feedback more efficiently and effectively, as only a small fraction of users actually take the time to write their own reviews Chen and Xie (2008). Automatic review generation, for example, takes product information and user behavior as input and generates user reviews that follow an arbitrarily specified sentiment and a writing style personalized to the specific product and user.
Researchers have proposed various product review generation methods Yao et al. (2017); Dong et al. (2017); Lipton et al. (2015); Radford et al. (2017) and achieved strong performance. However, these methods do not consider the inner hierarchical word-sentence-paragraph structure of user reviews, which significantly limits the length and coherence of their generated results. Li et al. (2015); Zang and Wan (2017) did include the hierarchical connection in their review generation models, but they did not address the problem of controllable and personalized review generation targeted at a specific product and user, which is essential for the usefulness of generated reviews. Most importantly, none of the aforementioned generative models includes product descriptions in the generation process, so their generated results lack credibility and diversity.
To address these problems, we propose a novel model, RevGAN, that automatically generates high-quality user reviews given product descriptions, sentiment labels and users’ historical reviews. The proposed RevGAN model follows a three-stage process. In Stage 1, we use a Self-Attentive Recursive Autoencoder to map the discrete user reviews and product descriptions into continuous embeddings, which captures the “essence” of the textual information and is convenient for the subsequent optimization processes. In Stage 2, we utilize a novel Conditional Discriminator structure to control the sentiment of generated reviews by conditioning the sentiment on the discriminator and forcing the generator to adapt its generation policy correspondingly. Finally, in Stage 3, to improve the personalization of generated reviews, we use a new Personalized Decoder method to decode the generated review embeddings according to users’ writing styles extracted from their history corpus.
We conduct extensive experiments using multiple real-world datasets and show that the proposed RevGAN model significantly and consistently outperforms state-of-the-art baseline models and leads to the automated generation of reviews that are very similar in style and content to the set of original reviews.
In general, this paper makes the following contributions:
a. We propose a novel RevGAN model that automatically generates controllable and personalized reviews from product information, a set of user reviews and their writing styles. In particular, we propose three novel components of the generative framework: a Self-Attentive Recursive Autoencoder that captures the hierarchical structure and latent semantic meanings of user reviews; a Conditional Discriminator that generates controllable user reviews by conditioning the sentiment information on the discriminator, improving generation performance in terms of sentence quality and context accuracy; and a Personalized Decoder that takes the personalized writing style into account by concatenating the users’ vocabulary preference onto the decoder, improving the personalization and credibility of the generated results, which is validated by the empirical human evaluation.
b. We empirically demonstrate that our proposed RevGAN model achieves state-of-the-art review generation performance, statistically and empirically outperforming several important benchmarks on multiple datasets. We also empirically show that the reviews generated by our method are very similar to the organically generated reviews and that the linguistic features of generated reviews follow the same statistical linguistics laws as reviews organically produced by the users.
2 Related Work
In this section, we briefly summarize the related work along two lines: previous work on automated review generation and GANs for NLG. We point out the connections and differences between our proposed model and the prior literature, which lead to significant improvements in review generation performance.
2.1 Automated Review Generation
Researchers have utilized multiple versions of the Seq2Seq Sutskever et al. (2014) framework to generate online product reviews of good quality, including Aspect-Aware Representations Ni and McAuley (2018), Gated Contexts to Sequences Tang et al. (2016), RNN Yao et al. (2017), Aspect-Sentiment Score Zang and Wan (2017), Generative Concatenative Nets Lipton et al. (2015) and Sentiment Units Radford et al. (2017). In particular, Li et al. (2015) proposed a two-stage LSTM neural network to construct a hierarchical autoencoder for long-text representation and generation.
However, these review generation models include neither product information nor users’ writing styles in the generation process, which makes the generated reviews less persuasive. They are also limited in length and lack coherence because they neglect the hierarchical connections within sentences, which are very important for the helpfulness of a review Mudambi and Schuff (2010). Our RevGAN model, on the other hand, combines a self-attentive hierarchical autoencoder with a conditional discriminator for improved and controllable review generation, and at the same time concatenates the contextual labels and users’ history corpus into the personalized decoder. Experimental results support that our proposed model achieves significantly better generation results than the prior literature.
2.2 GAN for NLG
GAN Goodfellow et al. (2014) has become a powerful method for reconstruction and generation in real data space, which leaves great potential for natural language generation. Various methods have been proposed to overcome the major problem of the discontinuity of textual information, including SeqGAN Yu et al. (2017), TextGAN Zhang et al. (2017), RankGAN Lin et al. (2017) and LeakGAN Guo et al. (2017). However, for long-text generation tasks, the computational complexity of all the models above might be too high to provide satisfying results. Nor do these models take contextual and personalized information into consideration for controllable generation. Conditional GAN Mirza and Osindero (2014); Hu et al. (2017); Dong et al. (2017) concatenates the supervised labels into the input of the generator and is able to control the generation of simple sentences. However, considering the high dimension of the latent embedding vectors, concatenating significantly lower-dimensional supervised information into the input might not be strong enough to force the generator to update in the designated direction.
Therefore, to address these problems, we propose a novel conditional discriminator model that conditions the sentiment label on the discriminator to artificially change how the discriminator works, and forces it to backpropagate a loss that makes the generator learn what the user really wants. Experimental results reported in Section 5 show that we outperform all other GAN-based NLG models in generation performance, and outperform Conditional GAN in terms of sentiment accuracy as well.
In this section, we introduce the proposed RevGAN model for review generation that includes three novel components: Self-Attentive Recursive AutoEncoder, Conditional Discriminator and Personalized Decoder. We first use Self-Attentive Recursive AutoEncoder to map the discontinuous user reviews and product descriptions into a latent continuous space, and utilize a novel version of cGAN to generate review embeddings for subsequent personalized decoding and review generation. Experimental results show that the combination of all three novel components achieves state-of-the-art review generation performance.
3.1 Self-Attentive Recursive AutoEncoder
The proposed Self-Attentive Recursive AutoEncoder is illustrated in Figure 1. We implement a bidirectional Gated Recurrent Unit (GRU) Cho et al. (2014) neural network as the encoder and as the decoder of the model. Compared with classical models like RNN or LSTM, GRU is computationally more efficient and better captures latent semantic meanings. We split each user review or product description into single sentences, and then map each sentence to the corresponding word indexes in the pre-defined dictionary. The index sequences constitute the input of our proposed model.
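We use $R$ to represent a review consisting of $n$ sentences, $R = (s_1, s_2, \dots, s_n)$. Each sentence $s_i$ consists of $m$ words, $s_i = (w_{i,1}, w_{i,2}, \dots, w_{i,m})$, where $w_{i,j} \in \{1, \dots, V\}$ represents the index of a word in the vocabulary of size $V$. We denote $W_z, U_z$ and $W_r, U_r$ as the weight matrices for the update gate and the reset gate, $z_t$ and $r_t$ as the states of the update gate and the reset gate, and $x_t$ and $h_t$ as the input and output vectors at time $t$, respectively. The GRU learns the hidden representations using the following equations, and the hidden state $h_m$ at the end of the sequence constitutes the latent sentence embedding:
$$z_t = \sigma(W_z x_t + U_z h_{t-1}), \qquad r_t = \sigma(W_r x_t + U_r h_{t-1}),$$
$$\tilde{h}_t = \tanh\big(W_h x_t + U_h (r_t \odot h_{t-1})\big), \qquad h_t = z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t.$$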
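The sentence-encoding step can be illustrated with a minimal sketch. It assumes the standard GRU update of Cho et al. (2014) without bias terms, a single direction rather than the bidirectional encoder used in the paper, and toy dimensions; the parameter names (`W_z`, `U_z`, `W_r`, `U_r`, `W_h`, `U_h`) and the helper functions are illustrative choices, not the authors' implementation.

```python
import math
import random


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def gru_step(x, h_prev, p):
    """One GRU update (standard formulation, biases omitted for brevity)."""
    z = [sigmoid(a + b) for a, b in zip(matvec(p["W_z"], x), matvec(p["U_z"], h_prev))]
    r = [sigmoid(a + b) for a, b in zip(matvec(p["W_r"], x), matvec(p["U_r"], h_prev))]
    gated = [ri * hi for ri, hi in zip(r, h_prev)]
    cand = [math.tanh(a + b) for a, b in zip(matvec(p["W_h"], x), matvec(p["U_h"], gated))]
    # Interpolate between the previous state and the candidate state.
    return [zi * hi + (1.0 - zi) * ci for zi, hi, ci in zip(z, h_prev, cand)]


def encode_sentence(word_vectors, p, hidden_size):
    """Run the GRU over a sentence; the final hidden state is the sentence embedding."""
    h = [0.0] * hidden_size
    for x in word_vectors:
        h = gru_step(x, h, p)
    return h


def random_matrix(rows, cols, rng):
    # Uniform initialization in [-0.1, 0.1], as in the experiment settings.
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
EMB, HID = 4, 3  # toy sizes (the paper uses embedding dimension 300)
params = {
    name: random_matrix(HID, EMB if name.startswith("W") else HID, rng)
    for name in ("W_z", "U_z", "W_r", "U_r", "W_h", "U_h")
}
sentence = [[rng.uniform(-1, 1) for _ in range(EMB)] for _ in range(5)]
embedding = encode_sentence(sentence, params, HID)
```

A bidirectional encoder would run a second pass over the reversed sentence and combine the two final states.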
Besides, to capture relative position representations, we incorporate the self-attention mechanism Shaw et al. (2018) into the encoding process. Each output element $y_i$ is computed as a weighted sum of linearly transformed input elements $x_j$:
$$y_i = \sum_{j} \alpha_{ij} \, (x_j W^V).$$
Each weight coefficient $\alpha_{ij}$ is computed using a softmax function:
$$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k} \exp(e_{ik})},$$
and $e_{ij}$ is computed using a compatibility function that compares the two input elements $x_i$ and $x_j$. We visualize the self-attention mechanism with an example in Figure 2.
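As an illustration of the weighted-sum and softmax steps, the sketch below uses a scaled dot product as the compatibility function. That choice, the projection matrices `W_q`, `W_k`, `W_v`, and the omission of the relative-position terms of Shaw et al. (2018) are simplifying assumptions, not details confirmed by the text.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def project(x, W):
    """Row vector x times matrix W (list of rows)."""
    return [sum(x[k] * W[k][j] for k in range(len(x))) for j in range(len(W[0]))]


def self_attention(X, W_q, W_k, W_v):
    """Return the attended outputs and the attention weights for inputs X."""
    queries = [project(x, W_q) for x in X]
    keys = [project(x, W_k) for x in X]
    values = [project(x, W_v) for x in X]
    scale = math.sqrt(len(queries[0]))
    outputs, weights = [], []
    for q in queries:
        # Compatibility scores between this element and every element.
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        alpha = softmax(scores)
        weights.append(alpha)
        # Weighted sum of the linearly transformed inputs.
        outputs.append([
            sum(alpha[j] * values[j][c] for j in range(len(X)))
            for c in range(len(values[0]))
        ])
    return outputs, weights


identity = [[1.0, 0.0], [0.0, 1.0]]
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(X, identity, identity, identity)
```

Each row of `weights` sums to one, so every output is a convex combination of the transformed inputs.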
After obtaining the sentence-level embeddings within each review, we merge them recursively into a paragraph embedding via a binary tree structure. We denote the embedding of sentence $s_i$ as $e_i$. During the encoding process, the first parent node $p_1$ is computed from the first two child nodes by a standard neural network layer:
$$p_1 = f\big(W_e [e_1; e_2] + b_e\big),$$
where $[e_1; e_2]$ is the concatenation of the two embedding vectors, $W_e$ is the weight matrix, whose input dimension is twice the size of the embedding vectors, $b_e$ is a bias term and $f$ is the activation function. The second parent node $p_2$ is computed from the concatenation of the first parent node $p_1$ and the following embedding vector $e_3$:
$$p_2 = f\big(W_e [p_1; e_3] + b_e\big).$$
We obtain the representations of the remaining nodes similarly. The embedding of the root node constitutes the embedding of the entire review. Training follows that of a standard MLP network, where we optimize the reconstruction loss over every layer of the binary tree.
To unfold the recursive autoencoder, we start from the root node of the binary tree. Using another MLP network, we expand the paragraph embedding vector into two vectors: a leaf node and a lower-level parent node. The parent node then goes through the same procedure until all of the leaf nodes in the binary tree are decoded.
Finally, we assemble the paragraph from the bottom leaf node to the top one to complete the review reconstruction process.
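The recursive merge and its unfolding can be sketched as follows. The `tanh` activation, the untrained random weights and the names `encode_review` and `decode_review` are assumptions made for illustration; the sketch shows only the tree traversal, not the reconstruction-loss training.

```python
import math
import random


def layer(W, b, v):
    """A single dense layer with tanh activation (activation is an assumption)."""
    return [math.tanh(sum(w * x for w, x in zip(row, v)) + bias) for row, bias in zip(W, b)]


def encode_review(sentence_embeddings, W_enc, b_enc):
    """Fold sentence embeddings left to right into one review embedding."""
    parent = sentence_embeddings[0]
    for child in sentence_embeddings[1:]:
        parent = layer(W_enc, b_enc, parent + child)  # '+' concatenates the lists
    return parent


def decode_review(root, W_dec, b_dec, n_sentences):
    """Unfold the root into n_sentences embeddings, from the top of the tree down."""
    d = len(root)
    leaves, parent = [], root
    for _ in range(n_sentences - 1):
        expanded = layer(W_dec, b_dec, parent)  # length 2 * d
        parent, leaf = expanded[:d], expanded[d:]
        leaves.append(leaf)
    leaves.append(parent)  # the last remaining parent is the first sentence
    return leaves[::-1]


rng = random.Random(0)
D = 3  # toy embedding size
W_enc = [[rng.uniform(-0.1, 0.1) for _ in range(2 * D)] for _ in range(D)]
b_enc = [0.0] * D
W_dec = [[rng.uniform(-0.1, 0.1) for _ in range(D)] for _ in range(2 * D)]
b_dec = [0.0] * (2 * D)

sentences = [[rng.uniform(-1, 1) for _ in range(D)] for _ in range(4)]
review_embedding = encode_review(sentences, W_enc, b_enc)
reconstructed = decode_review(review_embedding, W_dec, b_dec, len(sentences))
```

With trained weights, `reconstructed` would approximate `sentences`; here the weights are random, so only the shapes are meaningful.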
3.2 Conditional Discriminator
To generate meaningful reviews from the product information, we construct a cGAN model that transfers product descriptions into user reviews given specific sentiment labels. However, unlike traditional conditional GAN methods Mirza and Osindero (2014); Hu et al. (2017), we do not concatenate the sentiment label directly into the latent codes: considering the relatively high dimension of the latent embeddings, concatenating the sentiment scalar into the input might not be powerful enough to force the generator to update itself to match the designated sentiment. Instead, we condition the sentiment labels on the discriminator to artificially change the rules by which the discriminator works, and force it to backpropagate losses that update the generator policy correspondingly. The generated reviews and the original reviews with the opposite sentiment are judged as negative examples, while only the original reviews that match the given sentiment are judged as positive examples. For positive sentiment, the target of the novel conditional discriminator D on an input review $x$ is therefore
$$D(x) = \begin{cases} 1, & x \text{ is an original review with positive sentiment}, \\ 0, & x \text{ is an original review with negative sentiment or a generated review}. \end{cases}$$
The generator is optimized to minimize the reconstruction error between generated reviews and original reviews, measured by the Kullback-Leibler (KL) divergence, while the discriminator D is trained with its own loss over the positive and negative examples defined above. Under ideal circumstances, when the generator and the discriminator both reach equilibrium, the generated reviews are indistinguishable from the original reviews from the discriminator’s point of view.
The core idea is that, by artificially forcing the discriminator to accept only a certain type of review as real samples, the generator learns the conditioned information and transforms the generated data distribution accordingly. This structure makes the controllable review generation process possible, and experimental results support the strength of our model over classical cGAN models.
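The labeling rule of the conditional discriminator can be written down directly. The sketch below only builds the training targets; the tuple layout and the function name `discriminator_targets` are hypothetical, and the actual network and loss are not shown.

```python
def discriminator_targets(batch, target_sentiment):
    """Label each example 1 (real) or 0 (fake) for the conditional discriminator.

    batch: list of (source, sentiment) pairs, where source is "original" or
    "generated". Only original reviews whose sentiment matches the designated
    one count as real; everything else is treated as fake.
    """
    return [
        1 if source == "original" and sentiment == target_sentiment else 0
        for source, sentiment in batch
    ]


batch = [
    ("original", "positive"),
    ("original", "negative"),
    ("generated", "positive"),
    ("generated", "negative"),
]
targets_for_positive = discriminator_targets(batch, "positive")
targets_for_negative = discriminator_targets(batch, "negative")
```

Because an original review with the wrong sentiment is labeled fake, the generator can only fool the discriminator by producing reviews with the designated sentiment.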
3.3 Personalized Decoder
To personalize the generation process, apart from the conditioning of sentiment labels, we also take the users’ specific writing styles into account. We provide the definition of writing style according to Zheng et al. (2006) :
Writing Style refers to the user’s distinctive vocabulary choices and style of expression in his review creations.
Given the corpus of historical reviews written by user $i$, we calculate the usage frequency of each vocabulary word in that corpus, which is denoted as a $V$-dimensional writing style vector $u_i$. The intuition is that, during the decoding process, instead of generating each word directly from the word distribution $p$ computed by the GRU network, we concatenate the writing style vector onto the distribution and sample the generated word afterwards, so that the sampled word is determined by both the writing style vector $u_i$ and the distribution vector $p$.
Note that, to deal with the cold start problem when the user has no historical reviews, we could simply set the writing style vector as the identity matrix and generate the reviews under normal settings. Experimental results show that the involvement of personalized information (sentiment information and writing style) indeed improves the generation results as well as the helpfulness score in the empirical study.
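One way to picture the personalized decoding step is sketched below. The text does not preserve the exact rule for combining the writing style vector with the word distribution, so the multiplicative reweighting with a `smoothing` constant, the all-ones cold-start vector and the function names are assumptions for illustration only.

```python
from collections import Counter


def writing_style_vector(history_tokens, vocab):
    """Relative usage frequency of each vocabulary word in a user's history."""
    vocab_set = set(vocab)
    counts = Counter(token for token in history_tokens if token in vocab_set)
    total = sum(counts.values())
    if total == 0:
        # Cold start: a neutral vector that leaves the distribution unchanged.
        return [1.0] * len(vocab)
    return [counts[word] / total for word in vocab]


def personalize(distribution, style, smoothing=0.05):
    """Reweight the decoder's word distribution by the style vector (assumed rule)."""
    weights = [p * (s + smoothing) for p, s in zip(distribution, style)]
    total = sum(weights)
    return [w / total for w in weights]


vocab = ["guitar", "comfortable", "feel", "quality", "life"]
history = "they have a very comfortable feel and great sound comfortable".split()
style = writing_style_vector(history, vocab)

decoder_distribution = [0.2, 0.2, 0.2, 0.2, 0.2]
personalized = personalize(decoder_distribution, style)
cold_start = personalize(decoder_distribution, writing_style_vector([], vocab))
```

Words that the user has used before ("comfortable", "feel") gain probability mass, while a user without history receives the unmodified distribution.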
4 Experimental Results
To empirically validate our proposed model, we implemented RevGAN (code and pre-trained models will be made publicly available upon acceptance of this paper) on three subsets of the Amazon Review Dataset He and McAuley (2016); McAuley et al. (2015) (http://jmcauley.ucsd.edu/data/amazon/), namely Musical Instrument, Automotive and Patio, which include 44,006 reviews written by 3,697 users on 6,039 items.
4.2 Experiment Settings
The self-attentive recursive autoencoder is implemented with bidirectional GRUs with embedding dimension 300. GRU parameters and word embeddings are initialized from a uniform distribution over [-0.1, 0.1]. The initial learning rate is 1e-3 and is halved every 50 epochs until convergence. Batch size is set to 128 (128 sentences across review documents) for batch normalization Ioffe and Szegedy (2015). Gradient clipping Gulrajani et al. (2017) is adopted by scaling the gradients when the norm exceeds the threshold 1. For the recursive structure, the parameter settings are the same as for the sentence-level autoencoder, except for the size of the weight matrix, which takes the concatenation of two embedding vectors as input. The beam size for beam search Wiseman and Rush (2016) is fixed at 3. To validate the emotion label of each review, we use the state-of-the-art sentiment classifier VADER Gilbert (2014) (https://github.com/cjhutto/vaderSentiment) to assign a sentiment score to each review. The baseline SeqGAN, RankGAN and LeakGAN models are implemented with the Texygen Zhu et al. (2018) toolkit (https://github.com/geek-ai/Texygen). The generator and the conditional discriminator of the GAN are both Multilayer Perceptrons (MLP) Rumelhart et al. (1985) with 300 hidden layers. Their parameters are initialized from the normal distribution N(0, 0.02). The learning rates for the generator and the conditional discriminator are fixed at 5e-5 and 1e-5, respectively. During each epoch, the generator G iterates 5 times while the discriminator D iterates only once. The model is updated 30,000 times in total. We implemented our model in PyTorch (https://pytorch.org/) on a Tesla K80 GPU, where the whole training takes about 12 hours.
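The schedule described above can be summarized in a few lines of plain Python. The function names are hypothetical, and the sketch only mirrors the stated hyperparameters (learning rate halved every 50 epochs, gradient norm clipped at 1, five generator steps per discriminator step); it is not the PyTorch training loop itself.

```python
import math


def learning_rate(epoch, initial=1e-3, halve_every=50):
    """Autoencoder learning rate: halved every `halve_every` epochs."""
    return initial * 0.5 ** (epoch // halve_every)


def clip_gradient(gradient, threshold=1.0):
    """Rescale the gradient when its L2 norm exceeds the threshold."""
    norm = math.sqrt(sum(g * g for g in gradient))
    if norm > threshold:
        return [g * threshold / norm for g in gradient]
    return list(gradient)


def update_order(g_steps=5, d_steps=1):
    """Order of generator (G) and discriminator (D) updates within one epoch."""
    return ["G"] * g_steps + ["D"] * d_steps


rates = [learning_rate(epoch) for epoch in (0, 49, 50, 120)]
clipped = clip_gradient([3.0, 4.0])
order = update_order()
```

In a PyTorch implementation the same effects would typically come from a step learning-rate scheduler and a gradient-norm clipping utility; the sketch avoids naming specific calls.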
4.3 Evaluation Metrics
To demonstrate that our proposed model achieves state-of-the-art review generation performance, we use various evaluation metrics: distribution-based Log-Likelihood and Perplexity, coherence-based Word Mover Distance (WMD) Kusner et al. (2015), n-gram-based BLEU Papineni et al. (2002) and ROUGE Lin (2004), contextual label accuracy, and human evaluation. Specifically, following the same metric as Dong et al. (2017), we use sentiment accuracy, the ratio of reviews whose sentiment matches the given label, as an important indication of the personalization ability of the generator. The higher the sentiment accuracy, the better the generator follows the supervised label. Besides, we conduct a human evaluation to assess the quality and helpfulness of generated results: we randomly select the same number of reviews from the original dataset, from the output of RevGAN and from the output of the baseline models, and ask participants to judge which ones are generated by the machine and which ones are written by humans. The significance test shows that our generated reviews are indistinguishable from the original data.
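Sentiment accuracy as defined above is a simple ratio. In the sketch below, `toy_classifier` is a hypothetical keyword rule standing in for a real sentiment classifier such as VADER, and the three reviews are taken from the generated examples shown later in the paper.

```python
def sentiment_accuracy(generated_reviews, target_label, classify):
    """Share of generated reviews whose predicted sentiment matches the target."""
    matches = sum(1 for review in generated_reviews if classify(review) == target_label)
    return matches / len(generated_reviews)


def toy_classifier(review):
    # Hypothetical stand-in for a real sentiment classifier.
    return "negative" if "not" in review.lower().split() else "positive"


reviews = [
    "These seat covers look good.",
    "It is not recommended.",
    "For the price, these are a great buy.",
]
accuracy = sentiment_accuracy(reviews, "positive", toy_classifier)
```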
4.4 Baseline Models
To demonstrate that our proposed model achieves state-of-the-art review generation performance, we compare it across the evaluation metrics above with several important benchmarks, including charRNN Yao et al. (2017), MLE Bahl et al. (1990), SeqGAN Yu et al. (2017), LeakGAN Guo et al. (2017), RankGAN Lin et al. (2017) and Attr2Seq Dong et al. (2017). Besides, to verify the effectiveness of combining the three novel components in the RevGAN model, we also compare the performance of RevGAN+CD (Conditional Discriminator), RevGAN+CD+SA (Self-Attentive Autoencoder) and RevGAN+CD+SA+PD (Personalized Decoder). The results show that our model outperforms all the selected benchmark models significantly and consistently.
4.5 Significance Testing
We conduct significance tests to identify whether the difference between two review generation algorithms indicates a difference in true system quality. Following Koehn (2004), we use bootstrap re-sampling to obtain the asymptotic standard error of the estimated value of each evaluation metric. A paired two-sample t-test is then used to test whether the population means differ significantly.
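A minimal sketch of this procedure is given below. The per-review scores are synthetic placeholders rather than results from the paper, and the pairing of resamples (the same indices drawn for both systems) follows the usual paired-bootstrap convention, which is an assumption about the exact setup.

```python
import math
import random
import statistics


def paired_bootstrap_means(scores_a, scores_b, n_resamples, rng):
    """Resample the test set with replacement and record both systems' mean scores."""
    n = len(scores_a)
    means_a, means_b = [], []
    for _ in range(n_resamples):
        indices = [rng.randrange(n) for _ in range(n)]  # same items for both systems
        means_a.append(statistics.fmean(scores_a[i] for i in indices))
        means_b.append(statistics.fmean(scores_b[i] for i in indices))
    return means_a, means_b


def paired_t_statistic(sample_a, sample_b):
    """t statistic of the paired differences between two samples."""
    diffs = [a - b for a, b in zip(sample_a, sample_b)]
    standard_error = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return statistics.fmean(diffs) / standard_error


rng = random.Random(0)
# Synthetic per-review metric scores for two hypothetical systems.
scores_b = [rng.random() for _ in range(50)]
scores_a = [b + 0.1 + rng.uniform(-0.05, 0.05) for b in scores_b]

means_a, means_b = paired_bootstrap_means(scores_a, scores_b, 200, rng)
t_value = paired_t_statistic(means_a, means_b)
```

The resulting `t_value` would be compared against the critical value of the t distribution with `len(means_a) - 1` degrees of freedom.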
To assess whether the generated results are indistinguishable from original reviews, we conduct a chi-square test for independence to test whether there is a significant association between the human assessment and the actual source of each review. As the results of the statistical test are insignificant, we claim that our generated reviews are indistinguishable from the original ones in the sense that humans cannot tell them apart.
4.6 Evaluation of Review Generation
To illustrate the superiority and generalizability of our RevGAN model, we evaluate it on three different domains of the Amazon Review Dataset: musical instruments, automotive and patio products. The summary of our experimental results is reported in Table 1, from which we clearly observe that, compared with the baseline text generation models, our proposed RevGAN model performs significantly better in sentence quality and coherence. On average, we observe a 5% increase in Word Mover Distance (WMD), an 80% improvement in BLEU and a 10% rise in ROUGE. Besides, the comparison between different variations of the RevGAN model verifies that the combination of all three novel components gives the best generation performance. Using the bootstrap re-sampling technique introduced in the previous section, we conduct hypothesis tests, all of which confirm the significant improvement of our RevGAN model. In that sense, we claim that our model achieves state-of-the-art results on review generation. We also showcase some generated reviews at the end of this section.
4.7 Evaluation of Controllable Generation
In this part, we evaluate the controllable generation performance of our proposed model by pre-setting the contextual labels. We fix the sentiment label conditioned on the discriminator as ‘positive’ and ‘negative’ respectively, and then evaluate the sentiment accuracy of the generated reviews. The results are reported in Table 2, where our model beats the state-of-the-art algorithm Attr2Seq Dong et al. (2017) and the classical Conditional GAN model, both of which are conditioned on the same sentiment.
4.8 Evaluation of Personalized Generation
Besides the statistical and semantic metrics, we also design an empirical study to test the personalization performance of our generated reviews. We randomly select 15 reviews for each questionnaire: 5 from the original dataset, 5 from RevGAN-generated results with personalization and 5 from RevGAN-generated results without personalization. We ask participants to judge which reviews are generated by the machine and which are written by humans. They are also asked to rate the helpfulness of each review on a scale of 1 to 5. We sent out 100 questionnaires in total and received 36 responses, the confusion matrix of which is reported in Table 3. To test whether the RevGAN-generated reviews are statistically indistinguishable from the original ones, we run a chi-square test, which shows that, at the 95% confidence level, there is no statistically significant difference between our machine-generated reviews and those actually written by humans.
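The chi-square test for independence on a confusion matrix can be computed by hand, as in the sketch below. The counts are hypothetical placeholders, not the study's data from Table 3; 3.841 is the standard critical value of the chi-square distribution with one degree of freedom at the 0.05 level.

```python
def chi_square_independence(table):
    """Return the chi-square statistic and degrees of freedom for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, dof


# Hypothetical counts: rows = actual source (human, machine),
# columns = participant judgment (judged human, judged machine).
confusion = [[50, 40], [45, 45]]
statistic, dof = chi_square_independence(confusion)

CRITICAL_VALUE_DOF1_ALPHA05 = 3.841
is_significant = statistic > CRITICAL_VALUE_DOF1_ALPHA05
```

A statistic below the critical value means the judgments show no significant association with the true source, which is the sense in which generated reviews are called indistinguishable.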
Besides, the results indicate that our generated reviews show no statistical difference in helpfulness scores from those written by consumers, with average helpfulness scores of 3.10 and 3.03 for machine-generated and real-world reviews respectively. Thus, based on the t-test, we cannot reject the hypothesis that there is no statistical difference between the two groups in terms of helpfulness.
Finally, we conduct a t-test comparing the personalized and non-personalized generated results, with helpfulness scores of 3.10 and 2.91 respectively, which indicates a significant improvement in helpfulness from incorporating users’ writing style.
Table 4: Examples of generated reviews for different domains and sentiment labels.

| Domain | Sentiment | Generated Review |
|---|---|---|
| Musical | Positive | These chords got me to play my guitar better in less than one day. An excellent overdrive and an incredible value. I’ll use them all the time. |
| Musical | Negative | These pedals are not budget friendly. If you are looking for classic rock sounds, you won’t love these expensive hardware. |
| Automotive | Positive | I bought two sets of seat covers and this roll kit. Both fit well and look good. They were much easier to slide over the leather seats. |
| Automotive | Positive | These seat covers look good and seem to be made of a good quality material. For the price, these are a great buy. |
| Patio | Negative | It is not recommended. The cover is a little tight and hard to open and close. |
| Patio | Positive | These traps have caught more mice than ever give. You only need a little peanut butter for the bait and tomcat would caught so many mice in one night. Will order again if needed . |

Table 5: Effect of the Personalized Decoder (PD) on a generated review, given the user’s history reviews.

| User History Reviews | RevGAN | RevGAN+PD |
|---|---|---|
| 1.They play well and hold up well(never had one break) They are my second favorite pick. But this pick is better in some songs than my favorite. 2.I ordered 5 different kind of picks and these were my favorite picks. They have a very comfortable feel and great sound! | The guitar has always made a quality and you would really love that life from it! | The guitar has always made better quality and you would really love that comfortable feel from it! |
We present several showcases of our generated results with different contextual labels and domains as shown in Table 4. Additionally, we showcase the modification process in Table 5, where the personalized generated reviews tend to use more words from the user’s history corpus.
Besides, we check whether reviews generated by RevGAN have the same linguistic features as original reviews by testing two major statistical laws of linguistics Altmann and Gerlach (2016): Zipf’s law Zipf (1935) and Heaps’ law Herdan (1964). The former states that if words are ranked by their frequency of appearance, the frequency $f(r)$ of the $r$-th word scales with the rank as $f(r) \propto r^{-\alpha}$, while the latter states that the number of different words $V$ scales with the database size $N$, measured in the total number of words, as $V \propto N^{\beta}$. As shown in Figures 3 and 4, both the original reviews and the generated ones satisfy these two linguistic laws.
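Both laws can be checked on any tokenized corpus by computing a rank-frequency list and a vocabulary growth curve and then fitting a straight line in log-log space. The sketch below uses a tiny toy token list and hypothetical function names; it illustrates the measurement only and does not reproduce Figures 3 and 4.

```python
import math
from collections import Counter


def rank_frequency(tokens):
    """Word frequencies sorted from most to least frequent (Zipf's law input)."""
    return sorted(Counter(tokens).values(), reverse=True)


def vocabulary_growth(tokens):
    """Number of distinct words seen after each token (Heaps' law input)."""
    seen, growth = set(), []
    for token in tokens:
        seen.add(token)
        growth.append(len(seen))
    return growth


def loglog_slope(xs, ys):
    """Least-squares slope of log(y) against log(x), i.e. the power-law exponent."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    mean_x = sum(log_x) / len(log_x)
    mean_y = sum(log_y) / len(log_y)
    numerator = sum((a - mean_x) * (b - mean_y) for a, b in zip(log_x, log_y))
    denominator = sum((a - mean_x) ** 2 for a in log_x)
    return numerator / denominator


tokens = "a b a c a b d".split()
frequencies = rank_frequency(tokens)
growth = vocabulary_growth(tokens)

# Exponent estimates: the Zipf slope is negative, the Heaps slope lies between 0 and 1.
zipf_slope = loglog_slope(range(1, len(frequencies) + 1), frequencies)
heaps_slope = loglog_slope(range(1, len(growth) + 1), growth)
```

Comparing the fitted exponents for original and generated reviews gives a compact numerical counterpart to the visual comparison in the figures.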
In this paper, we proposed RevGAN, which automatically generates personalized product reviews from product embeddings as opposed to labels and can produce results targeted at specific products and users. To do this, we incorporate three novel components: a self-attentive recursive autoencoder, a conditional discriminator and a personalized decoder. Experimental results show that RevGAN performs significantly better than the baseline models and that our generated reviews are very similar to organically generated user reviews, as shown in Section 5.2 and Table 3.
As part of future work, we would like to improve the review generation process so that it can receive several key words from users as input and generate reviews based on this prior information. Another direction of future research lies in developing novel methods that distinguish the type of reviews described in this paper from organic reviews.
- Altmann and Gerlach (2016). Statistical laws in linguistics. In Creativity and Universality in Language, pp. 7–26.
- Bahl et al. (1990). A maximum likelihood approach to continuous speech recognition. In Readings in speech recognition, pp. 308–319.
- Chen and Xie (2008). Online consumer review: word-of-mouth as a new element of marketing communication mix. Management science 54 (3), pp. 477–491.
- Cho et al. (2014). Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.
- Dong et al. (2017). Learning to generate product reviews from attributes. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, Vol. 1, pp. 623–632.
- Gilbert (2014). Vader: a parsimonious rule-based model for sentiment analysis of social media text. In Eighth International Conference on Weblogs and Social Media (ICWSM-14). Available at (20/04/16) http://comp.social.gatech.edu/papers/icwsm14.vader.hutto.pdf.
- Goodfellow et al. (2014). Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680.
- Gulrajani et al. (2017). Improved training of wasserstein gans. In Advances in Neural Information Processing Systems, pp. 5769–5779.
- Guo et al. (2017). Long text generation via adversarial training with leaked information. arXiv preprint arXiv:1709.08624.
- He and McAuley (2016). Ups and downs: modeling the visual evolution of fashion trends with one-class collaborative filtering. In proceedings of the 25th international conference on world wide web, pp. 507–517.
- Herdan (1964). Quantitative linguistics.
- Hu et al. (2017). Toward controlled generation of text. International Conference on Machine Learning, pp. 1587–1596.
- Ioffe and Szegedy (2015). Batch normalization: accelerating deep network training by reducing internal covariate shift. In International conference on machine learning, pp. 448–456.
- Koehn (2004). Statistical significance tests for machine translation evaluation. Proceedings of the 2004 conference on empirical methods in natural language processing.
- Kusner et al. (2015). From word embeddings to document distances. In International Conference on Machine Learning, pp. 957–966.
- Li et al. (2015). A hierarchical neural autoencoder for paragraphs and documents. arXiv preprint arXiv:1506.01057.
- Lin (2004). Rouge: a package for automatic evaluation of summaries. Text Summarization Branches Out.
- Lin et al. (2017). Adversarial ranking for language generation. In Advances in Neural Information Processing Systems, pp. 3155–3165.
- Lipton et al. (2015). Generative concatenative nets jointly learn to write and classify reviews. arXiv preprint arXiv:1511.03683.
- McAuley et al. (2015). Image-based recommendations on styles and substitutes. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 43–52.
- Mirza and Osindero (2014). Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784.
- Mudambi and Schuff (2010). Research note: what makes a helpful online review? a study of customer reviews on amazon.com. MIS quarterly, pp. 185–200.
- Ni and McAuley (2018). Personalized review generation by expanding phrases and attending on aspect-aware representations. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Vol. 2, pp. 706–711.
- Papineni et al. (2002). BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics, pp. 311–318.
- Radford et al. (2017). Learning to generate reviews and discovering sentiment. arXiv preprint arXiv:1704.01444.
- Rumelhart et al. (1985). Learning internal representations by error propagation. Technical report, California Univ San Diego La Jolla Inst for Cognitive Science.
- Shaw et al. (2018). Self-attention with relative position representations. arXiv preprint arXiv:1803.02155.
- Sutskever et al. (2014). Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pp. 3104–3112.
- Tang et al. (2016). Context-aware natural language generation with recurrent neural networks. arXiv preprint arXiv:1611.09900.
- Wiseman and Rush (2016). Sequence-to-sequence learning as beam-search optimization. arXiv preprint arXiv:1606.02960.
- Yao et al. (2017). Automated crowdturfing attacks and defenses in online review systems. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, pp. 1143–1158.
- Yu et al. (2017). SeqGAN: sequence generative adversarial nets with policy gradient. In AAAI, pp. 2852–2858.
- Zang and Wan (2017). Towards automatic generation of product reviews from aspect-sentiment scores. In Proceedings of the 10th International Conference on Natural Language Generation, pp. 168–177.
- Zhang et al. (2017). Adversarial feature matching for text generation. arXiv preprint arXiv:1706.03850.
- Zheng et al. (2006). A framework for authorship identification of online messages: writing-style features and classification techniques. Journal of the American society for information science and technology 57 (3), pp. 378–393.
- Zhu et al. (2018). Texygen: a benchmarking platform for text generation models. arXiv preprint arXiv:1802.01886.
- Zipf (1935). The psycho-biology of language.