Generating Sentiment-Preserving Fake Online Reviews Using Neural Language Models and Their Human- and Machine-based Detection

07/22/2019, by David Ifeoluwa Adelani, et al.

Advanced neural language models (NLMs) are widely used in sequence generation tasks because they are able to produce fluent and meaningful sentences. They can also be used to generate fake reviews, which can then be used to attack online review systems and influence the buying decisions of online shoppers. A key problem in fake review generation is how to produce text with the desired sentiment or topic. Existing solutions first generate an initial review based on some keywords and then modify some of the words in the initial review so that it has the desired sentiment or topic. We overcome this problem by using the GPT-2 NLM to generate a large number of high-quality reviews based on a seed review with the desired sentiment and then using a BERT-based text classifier (with an accuracy of 96%) to filter out reviews with undesired sentiments. Because none of the words in the generated reviews are modified, they are as fluent as samples drawn from the learned distribution of the training data. A subjective evaluation with 80 participants demonstrated that this simple method can produce reviews that are as fluent as those written by people. It also showed that the participants tended to identify fake reviews essentially at random. Two countermeasures, GROVER and GLTR, were nevertheless able to accurately detect the fake reviews.


I Introduction

Neural text generation is one of the most active research areas in deep learning. It involves building a neural network based language model (known as a neural language model (NLM) [1]) given a set of training text token sequences and then using the learned model to produce texts similar to the training data. With the development of deep learning algorithms, neural text generation has become an indispensable technique in the natural language processing field as it can generate more fluent and semantically meaningful text than conventional methods [2]. Its applications mainly include machine translation [3], image captioning [4], text summarization [5], dialogue generation [6], and speech recognition [7].

Despite the benefits that the advances in neural text generation techniques have brought, their abuse has created obvious security issues. In particular, high-performance neural language models can be used to generate fake reviews or fake comments/news, and the generated fake reviews or fake comments/news can then be used to attack online systems or fool human readers. For example, a review system can be flooded with positive reviews to increase a company’s profit [8] or with negative reviews to reduce a competitor’s profit, and fake comments/news can be posted on social websites for political benefits. Previous work [9, 10] demonstrated the feasibility of fake review attacks. However, because basic language models (LMs) were used, it was difficult to generate high-quality reviews, and post-processing was needed to adjust the contents to match the desired topic. In this paper, we investigate how well up-to-date LMs can generate reviews. We also investigate how these fake reviews can fool human readers and how susceptible they are to machine-based countermeasures.

Fig. 1: Threat model proposed in this work. A review with the desired sentiment (positive or negative here) is taken from the target shopping website automatically and input to a fake review generator to produce a large number of fake reviews with the same sentiment.

Figure 1 shows the threat model proposed in our investigation. We suppose that an attacker is able to access reviews (or comments) on a website (e.g., a shopping website) and use a method to automatically identify reviews with a desired sentiment (i.e., positive or negative in this work). We also suppose that the attacker can access a large database containing real reviews (written by people) to train an LM for automatic text generation. The attacker then inputs the identified reviews to the LM to generate a large number of fake reviews. The generated reviews that have the same sentiment as the original review are added to a fake review pool. Since the fake reviews are generated on the basis of an original review, the context of the original review (e.g., an Italian restaurant) should be implicitly embedded in them. Finally, the attacker submits the selected fake reviews to the site to increase or decrease the rating of a product, service, etc.

To generate sentiment-preserving fake reviews, we use a pre-trained GPT-2 NLM [11], which is able to generate variable-length, fluent, meaningful sentences, to generate reviews and then use a fine-tuned text classifier based on BERT [12] to filter out reviews with undesired sentiments. Since the GPT-2 training data differs from the data used in our experiments (i.e., Amazon reviews [13] and Yelp reviews [14]), it may generate reviews on irrelevant topics. We solved this problem by adapting the original GPT-2 model to the two databases we used. A subjective evaluation with 80 participants demonstrated that the fake reviews generated by our method are as fluent as those written by people. It also demonstrated that it was difficult for the participants to identify fake reviews: when asked to pick the review most likely to be real, they tended to choose at random. However, two countermeasures, GROVER [15] and GLTR, a tool for detecting text generated by an LM [16], enabled the fake reviews to be accurately identified.

II Related Work

The most common attack on online review systems is a crowdturfing attack [17, 18], whereby a bad actor recruits a group of workers to write fake reviews on a specified topic for a specified context and then submits them to the target website. Since this method has an economic cost, the scale of such an attack is limited. Automated crowdturfing, in which machine learning algorithms are used to generate fake reviews, is a less expensive and more efficient way to attack online review systems.

Yao et al. [9] proposed such an attack method. Their idea is to first generate an initial fake review based on a given keyword using a long short-term memory (LSTM)-based LM. Because the initial fake review is stochastically sampled from a learned distribution, it may be irrelevant to the desired context. Specific nouns in the fake review are therefore replaced with ones that better fit the desired context. Juuti et al. [10] proposed a similar method for generating fake reviews that further requires additional meta-information such as shop name, location, and rating.

Our method differs from these methods in that we use a whole review as the seed for generating a large number of fake reviews without using additional information or additional processing and then filter out the ones without the desired sentiment. Our method is thus more straightforward. We do not modify the generated reviews, so their fluency is close to that of the training samples. Since the LM used is adapted from a pre-trained model, our method can be easily implemented even by low-skill attackers.

In addition, adversarial text examples can also be used for attacking online review systems [19, 20]. The aim is to deceive text classifiers, not people, by adding small perturbations to the input. Unlike this type of method, fake reviews generated by our method are aimed at changing overall user impressions.

III Fake Review Generation

The most important component of the proposed method for generating sentiment-preserving fake reviews is the GPT-2 text generation model [11]. The details of our method are given below.

III-A GPT-2 Model

The task of an LM is to estimate the probability distribution of a text corpus or, equivalently, to estimate the probability of the next token conditioned on the context tokens. Given a sequence of tokens $x = (x_1, x_2, \ldots, x_n)$, the probability of the sequence can be factorized as

$p(x) = \prod_{i=1}^{n} p(x_i \mid x_1, \ldots, x_{i-1})$.    (1)

This probability is approximated by learning the conditional probability of each token given a fixed number $k$ of context tokens using a neural network with parameters $\theta$. The tokens used for training can be of different granularities such as word [21], character [22], sub-word unit [23], or hybrid word-character [24]. The objective function of the LM is to maximize the sum of the log conditional probabilities over a sequence of tokens:

$L(\theta) = \sum_{i=1}^{n} \log p(x_i \mid x_{i-k}, \ldots, x_{i-1}; \theta)$.    (2)
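To make Eqs. (1) and (2) concrete, the short pure-Python sketch below evaluates the factorized probability and log-likelihood of a toy token sequence under a hand-specified bigram model; the vocabulary, probability values, and sequence are invented purely for illustration.

```python
import math

# Toy bigram conditionals p(next | previous); values are invented for illustration.
cond_prob = {
    ("<s>", "the"): 0.6, ("the", "food"): 0.5,
    ("food", "was"): 0.7, ("was", "great"): 0.4,
}

sequence = ["<s>", "the", "food", "was", "great"]

# Eq. (1): p(x) as a product of conditional probabilities.
p_x = 1.0
for prev, nxt in zip(sequence, sequence[1:]):
    p_x *= cond_prob[(prev, nxt)]

# Eq. (2): the training objective sums the log conditionals (here with k = 1).
log_likelihood = sum(math.log(cond_prob[(prev, nxt)])
                     for prev, nxt in zip(sequence, sequence[1:]))

print(p_x)             # 0.084
print(log_likelihood)  # log(0.6) + log(0.5) + log(0.7) + log(0.4), about -2.477
```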

The neural network parameters $\theta$ can be learned using various architectures such as a feed-forward neural network [21], a recurrent neural network (RNN) such as a vanilla RNN [25, 26], an LSTM [27] and its variants [28], and the transformer architecture [29, 30]. The GPT-2 model, which is based on the transformer architecture, achieves the lowest perplexity on various language modeling datasets and generates high-quality, fluent texts.

The GPT-2 model was trained on a large unlabeled dataset: 8 million web documents (the WebText dataset) obtained by scraping about 45 million outbound links from Reddit, resulting in about 40 GB of text. The LM generalizes easily to corpora from domains that differ from that of the original training data. For instance, the GPT-2 LM attained state-of-the-art (lower) perplexity on seven out of eight tested datasets in a zero-shot setting. In addition, generative pre-trained models such as GPT-2 are transferable to many natural language understanding tasks such as document classification, question answering, and textual entailment through discriminative fine-tuning of the models within a few epochs. Moreover, the GPT-2 LM can be adapted to a new domain by fine-tuning the model on a corpus in that domain, e.g., online reviews.

There are four GPT-2 models of different sizes. We used the smallest one (117 million parameters; https://github.com/openai/gpt-2). As of this writing, OpenAI has released only the smaller models (117M and 345M) to prevent malicious use of the larger models. Even with the smallest one, we were able to generate realistic reviews.

III-B Sentiment-Preserving Fake Review Generation

Fig. 2: Fake review generation procedure

As shown in Figure 2, we use a two-step approach to generating sentiment-preserving reviews: generation and validation. In the generation step, the attacker provides an original review with a given sentiment as the seed text to the GPT-2 LM, which then generates a different review based on it. We refer to the generated review as a fake review; it differs from the original in its literal representation. There is no strict guarantee that the original review and the fake review share the same context, because the fake review is sampled from the probability distribution represented by the model, although context information may be implicitly embedded in it to some degree. Part of the fake review can therefore be thought of as a continuation or paraphrase of the original review.

The validation step aims to filter out generated reviews with an undesired sentiment. In this step, the attacker determines whether the fake review has the same sentiment as the original by using the BERT text classifier [12], which is similar to GPT-2 in that it is also based on the transformer but further takes bidirectional context information into account. We assume that the attacker has access to such a classifier and uses it to quickly check the sentiment of the generated reviews. If the sentiment of a fake review matches that of the original, it is added to the fake review pool; otherwise it is discarded.
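As an illustration of this generate-then-validate loop, the sketch below uses the publicly available Hugging Face transformers library rather than the original OpenAI code; the model names, sampling parameters, and the off-the-shelf sentiment pipeline (standing in for our fine-tuned BERT classifier) are assumptions chosen for the example, not the exact configuration used in our experiments.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer, pipeline

# Generator: a GPT-2 LM (here the generic "gpt2" checkpoint; in the attack it
# would be a model fine-tuned on review text).
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
generator = GPT2LMHeadModel.from_pretrained("gpt2")
generator.eval()

# Validator: an off-the-shelf sentiment classifier standing in for the
# fine-tuned BERT classifier described in the text.
sentiment = pipeline("sentiment-analysis")

def generate_fake_reviews(seed_review, n=5, max_length=120):
    """Sample n continuations of the seed review from the LM."""
    input_ids = tokenizer.encode(seed_review, return_tensors="pt")
    with torch.no_grad():
        outputs = generator.generate(
            input_ids,
            do_sample=True, top_k=40, temperature=1.0,
            max_length=max_length, num_return_sequences=n,
            pad_token_id=tokenizer.eos_token_id,
        )
    # Drop the seed prefix and keep only the newly generated text.
    return [tokenizer.decode(o[input_ids.shape[1]:], skip_special_tokens=True)
            for o in outputs]

def filter_by_sentiment(seed_review, fake_reviews):
    """Keep only fake reviews whose predicted sentiment matches the seed's."""
    seed_label = sentiment(seed_review)[0]["label"]
    return [r for r in fake_reviews if sentiment(r)[0]["label"] == seed_label]

seed = "Great little Italian place, the pasta was fresh and the staff were lovely."
pool = filter_by_sentiment(seed, generate_fake_reviews(seed))
```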

III-C Fine-tuning Language Model on Review Data

One major advantage of LMs like GPT-2 is that they are very easy to adapt (i.e., fine-tune) to a new dataset or domain. During fine-tuning, the model is initialized with the pre-trained parameters rather than random weights before training. Fine-tuning takes far less time than training a high-capacity LM from scratch on millions of web documents. Furthermore, text classification and other natural language understanding tasks benefit from pre-training the model on a large amount of unlabeled text: it has been shown that fine-tuning on labeled data after initializing the model with pre-trained parameters improves accuracy on downstream tasks [12]. Therefore, we fine-tuned both the GPT-2 LM and the BERT classifier. We used the Amazon and Yelp review databases, which contain both positive and negative reviews written in English. Following the approach of Yang et al. [14], we divided the reviews in each database into training and test datasets, as shown in Table I. The model was fine-tuned on each training dataset, and evaluation was performed on the respective test dataset.

                              Amazon        Yelp
Total number of reviews       4 million     598,000
Number of training examples   3.6 million   560,000
Number of test examples       400,000       38,000
Number of class labels        2             2
TABLE I: Statistics for the Amazon and Yelp review databases used for fake review generation.
Method                          Seed/generated review
Original review (seed)          I currently live in europe, and this is the book I recommend for my visitors. It covers many countries, colour pictures, and is a nice starter for before you go, and once you are there.
Pre-trained GPT-2 fake review   Just as I recommend before you go. And there are lots more things to read. What are your favourite books of the day? This is my take on the day before a work trip to
Fine-tuned GPT-2 fake review    Great for kids too. Recommended for all young people as the pictures are good (my kid's are 11) favourite books of the day? This is my take on the day before a work trip to
TABLE II: Example reviews generated using the pre-trained and fine-tuned GPT-2 LMs.

As of this writing, the authors of GPT-2 have not released their training code, but we found a reliable implementation on GitHub (https://github.com/nshepperd/gpt-2) for training the GPT-2 model, which we used to fine-tune the pre-trained model on the review databases. We fine-tuned GPT-2 by concatenating all reviews, separated by newline symbols, into one giant text file; we did not distinguish between positive and negative reviews during fine-tuning. We fine-tuned the 117M GPT-2 model on the Amazon training set for two weeks (485K epochs) and on the Yelp training set for five days (190K epochs) using the default hyper-parameters, stopping training when the validation error no longer decreased. We found that the pre-trained GPT-2 LM sometimes produced texts that were not review-like, as shown in Table II; after fine-tuning, the generated texts were review-like.
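For readers who prefer a higher-level toolkit than the repository above, the following is a minimal sketch of the same causal-LM fine-tuning idea using the Hugging Face transformers Trainer; the file name reviews.txt (all reviews concatenated, one per line) and the hyper-parameters are placeholders, not the settings used in our experiments.

```python
from transformers import (GPT2LMHeadModel, GPT2Tokenizer, TextDataset,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# reviews.txt: all reviews concatenated, one per line (placeholder file name).
train_dataset = TextDataset(tokenizer=tokenizer, file_path="reviews.txt", block_size=128)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)  # causal LM, no masking

args = TrainingArguments(
    output_dir="gpt2-reviews",
    num_train_epochs=1,              # placeholder; tune on a validation set
    per_device_train_batch_size=4,
    save_steps=10_000,
)

Trainer(model=model, args=args, data_collator=collator,
        train_dataset=train_dataset).train()
```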

Similarly, we fine-tuned the BERT text classifier on the Amazon and Yelp training sets for three epochs to classify reviews as positive or negative, achieving approximately 96% accuracy on the original Amazon and Yelp test datasets. Fine-tuning BERT took only a few hours, and the performance was better than that reported for the character-level CNN [14] on both test datasets.
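A minimal sketch of this classifier fine-tuning step is shown below, again using the Hugging Face transformers library; the toy in-memory examples, label convention, and hyper-parameters are assumptions for illustration and stand in for the full Amazon/Yelp polarity training sets.

```python
import torch
from transformers import (BertTokenizer, BertForSequenceClassification,
                          Trainer, TrainingArguments)

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Toy labeled data standing in for the Amazon/Yelp polarity training sets.
texts = ["Absolutely loved this product.", "Terrible quality, broke after a day."]
labels = [1, 0]  # 1 = positive, 0 = negative

class ReviewDataset(torch.utils.data.Dataset):
    def __init__(self, texts, labels):
        self.enc = tokenizer(texts, truncation=True, padding=True, max_length=128)
        self.labels = labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.labels[i])
        return item

args = TrainingArguments(output_dir="bert-sentiment", num_train_epochs=3,
                         per_device_train_batch_size=8)
Trainer(model=model, args=args, train_dataset=ReviewDataset(texts, labels)).train()
```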

III-D Explicit Sentiment Modeling

In addition to the above basic attack method, which simply fine-tunes the pre-trained GPT-2 LM, we further propose a "skill-up" method in which an LM is explicitly conditioned on a specified sentiment. This method requires a natural language processing expert to train a tailored LM.

Radford et al. [31] reported that a sentiment neuron can be learned by using a single-layer multiplicative LSTM (mLSTM) [28]. The sentiment neuron can be found by manually visualizing the distribution of the output values of the hidden units; a unit whose output values fall into two groups across multiple sentiment databases can be considered a sentiment neuron. It has been reported that the mLSTM outperforms the LSTM because it allows each possible input to have a different recurrent transition function [28], so fake review generation based on the mLSTM is better than that based on the LSTM [9]. By replacing the output value of the sentiment neuron with a fixed positive value (positive sentiment) or a fixed negative value (negative sentiment), we can explicitly force the output to be conditioned on a specified sentiment [31]. We refer to this method as "sentiment modeling". Our implementation is based on that of Puri et al. [32] (https://github.com/NVIDIA/sentiment-discovery); the mLSTM had 4,096 hidden units.
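The sketch below illustrates only the mechanics of clamping a single hidden unit during autoregressive generation. It uses a plain PyTorch LSTMCell with untrained weights rather than the trained mLSTM described above, and the unit index and clamp value are placeholders, so it demonstrates the control mechanism rather than an actual sentiment-conditioned model.

```python
import torch
import torch.nn as nn

vocab_size, embed_dim, hidden_dim = 100, 32, 64
neuron_idx, clamp_value = 7, 1.0   # placeholder sentiment-unit index and clamp value

# Untrained components, used only to illustrate the clamping mechanism.
embed = nn.Embedding(vocab_size, embed_dim)
cell = nn.LSTMCell(embed_dim, hidden_dim)   # the real model uses an mLSTM
readout = nn.Linear(hidden_dim, vocab_size)

def generate(seed_ids, steps=20):
    h = torch.zeros(1, hidden_dim)
    c = torch.zeros(1, hidden_dim)
    tokens = list(seed_ids)
    with torch.no_grad():
        for t in list(tokens) + [None] * steps:
            if t is None:
                # Sample the next token from the model's output distribution.
                probs = torch.softmax(readout(h), dim=-1)
                t = torch.multinomial(probs, 1).item()
                tokens.append(t)
            h, c = cell(embed(torch.tensor([t])), (h, c))
            # Overwrite the "sentiment neuron" after every step to force the
            # desired sentiment (use a negative clamp value for negative sentiment).
            h[0, neuron_idx] = clamp_value
    return tokens

print(generate([1, 2, 3]))
```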

IV Experiment

IV-A Measurements and Setup

We measured the effectiveness of the proposed method for generating sentiment-preserving fake reviews in three ways. 1) The sentiment-preserving rate was used to evaluate whether the sentiment of the original review was preserved, with the BERT text classifier used for sentiment prediction. It is defined as the ratio of the number of fake reviews whose sentiment was correctly preserved to the total number of fake reviews; all generated reviews (without filtering) were used. 2) Subjective evaluation was used to evaluate the fluency of the generated reviews and how well people could distinguish between real and fake reviews. 3) The detection rate was used to evaluate how well machine-based detection methods could identify fake reviews.
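A minimal sketch of how the sentiment-preserving rate can be computed from classifier predictions is given below; the label values and example inputs are invented for illustration.

```python
def sentiment_preserving_rate(seed_labels, fake_labels):
    """Fraction of fake reviews whose predicted sentiment matches the
    sentiment of the seed review they were generated from."""
    assert len(seed_labels) == len(fake_labels)
    matches = sum(s == f for s, f in zip(seed_labels, fake_labels))
    return matches / len(fake_labels)

# Example with invented labels: 3 of 4 fake reviews keep the seed's sentiment.
print(sentiment_preserving_rate(["pos", "pos", "neg", "neg"],
                                ["pos", "neg", "neg", "neg"]))  # 0.75
```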

Four types of LMs were investigated: a pre-trained GPT-2 LM, a fine-tuned GPT-2 LM, an mLSTM LM, and the sentiment modeling LM. Given the high computational cost, we randomly selected 1,000 reviews from each test dataset for use as seed texts, under the assumption that most of the reviews were written by a person. For each LM, we then generated 20 different fake reviews based on each real review, for a total of 20,000 fake reviews per LM per dataset. The generated reviews contained from 1 to 165 words, with an average of 94 words. Training of the LMs and review generation were performed on a machine with a Tesla P100 GPU.

For the subjective evaluation, we first asked 80 volunteers (39 native and 41 non-native English speakers) to evaluate the fluency of reviews. Fifty real reviews (200 to 300 characters, half positive and half negative) were randomly selected from each test dataset, and fake reviews were generated on the basis of those reviews. For the fluency evaluation, we used the real reviews and, for each one, the fake review whose sentiment most closely matched it. The evaluation was done using a 5-point Likert mean opinion score (MOS) scale, with 5 being the most fluent. We then asked the volunteers to select, from a set of four reviews containing one real review and three fake reviews, the one they thought was most likely to be real. The average correct selection rate was used as the metric. To facilitate evaluation, the reviews were shortened to their first three sentences. The evaluations were performed on a web interface (an image of the interface is available at https://nii-yamagishilab.github.io/wifs2019_fakereview/) with the real and fake reviews listed in random order. The participants evaluated a minimum of 10 and a maximum of 100 random reviews; most evaluated only ten. We obtained 1,025 data points each for the fluency and real/fake selection evaluations.

For machine-based fake review detection, we used GROVER [15] and GLTR [16] as countermeasures. GROVER is a neural network based detector that can defend against fake news generated by an NLM such as GPT-2; its reported detection accuracy is 92%. GLTR does not directly judge whether text is real or fake; rather, it helps a person distinguish real from fake text by reporting how likely each word in the text was to have been machine generated. It has been reported to improve human fake-text judgment accuracy from 54% to 72%. We used the fine-tuned GPT-2 LM as the text generation model.

IV-B Sentiment-Preserving Fake Review Analysis

LM                    Amazon    Yelp
Pretrained GPT-2
Fine-tuned GPT-2
mLSTM
Sentiment modeling
TABLE III: Rate (in %) and standard error of fake reviews preserving the sentiment of the original review.

As shown in Table III, the fine-tuned GPT-2 model was better at preserving the sentiment of the original review than the pre-trained GPT-2 model for both databases. This means that a large number of fake reviews with a desired sentiment can be efficiently generated by just fine-tuning an LM. The sentiment modeling method had the highest rate for the Amazon database because explicitly modeling sentiment benefits from the additional sentiment information given before the fake reviews are generated; this indicates that explicitly modeling sentiment could be an even more efficient way to generate reviews with a desired sentiment. For the Yelp reviews, the fine-tuned GPT-2 was again clearly better than the pre-trained GPT-2, and the mLSTM had the highest rate. Further analysis revealed that the mLSTM model performs very well only on food and restaurant reviews; it did not generalize well to other domains and sometimes generated reviews completely outside the context of the original review. As shown in Table IV, the fake review generated by the mLSTM model talks about a restaurant experience while the original review was about a shirt purchase. This suggests that context needs to be explicitly preserved as well. (We leave this for future work.)

Original reviews (seed)
  Amazon: I really enjoyed this book. It shows the judgmental tendencies in our human race and how one woman strove to live a life of service to others to gain redemption for her mistake. Can't go wrong with the classics.
  Yelp: Very bad purchase experience. I bought a shirt with a hole covered in the rolled up sleeves, but they denied my request to return it. I am so angery at this and will never shop their chothes anymore.
Pretrained GPT-2
  Amazon: I think he is an excellent young writer who will keep his title for another year. Highly recommended! by The author has taken several important decisions. He is an excellent historical detective and the best detective ever to be published.
  Yelp: The shirt was completely ripped with no stitching, it was a mess with no color or anything else that suggested anything. This is now in a shop. - Michael I'm very, very sorry about this experience and that they will consider me a part of a larger family of customers.
Fine-tuned GPT-2
  Amazon: A great read. It reads good in that it is not predictable. You really do learn from the story as you continue to follow the story, although most of it is a little more simple.
  Yelp: This store is disgusting. I went in a couple weeks ago to pick up a blouse of mine. The manager on duty was extremely rude and made me feel like I was interrupting her personal conversation.
mLSTM
  Amazon: I wish the author had more books out there to enjoy as well as this one. I bought this for my son for Christmas. He loves it! It is so simple to use and the colors are bright and beautiful.
  Yelp: I will never go back to this place. The staff at this location are the worst I have ever encountered in a restaurant. The food is overpriced and the servers are so rude to them and they will take your business to a club that is close to me.
Sentiment modeling
  Amazon: I really enjoyed the book and have recommended it to many friends. This is a great book for any child who loves to read. It is a great story about a child who has a hard time with books that are simple and easy to read.
  Yelp: I will never go to this store again. I would never go back to this store again. The store is still okay but not worth the price they charge for everything.
TABLE IV: Examples of fake reviews generated by the four models using the original review as the seed. Bold font indicates words expressing sentiment.

IV-C Subjective Evaluation

As shown in Table V, the non-native English speakers tended to give higher fluency scores than the native English speakers to the original reviews, while the native speakers tended to give higher scores in most cases of fake reviews (5 of 8); the differences, however, are slight. Fine-tuning improved fluency compared with the reviews generated by the pre-trained GPT-2, which suggests that an attack can be made more effective by simply fine-tuning existing models. For the Amazon dataset, the reviews generated by explicitly modeling the sentiment (sentiment modeling) had the highest overall score, followed by those generated by the fine-tuned GPT-2 model. Interestingly, the scores for all fake reviews were higher than that for the original reviews. This observation is similar to that of Yao et al. [9], who observed that people tended to consider fake reviews highly reliable. The observation does not hold for the Yelp database, where the score for the original reviews is higher than those for the fake ones. Among the fake review generation models, the fine-tuned GPT-2 model had the highest score (3.30).

Table VI shows the results for judging which of the four listed reviews was most likely to be the real one. It was surprisingly difficult for the participants to identify the real review among the four options. The lowest overall correctness rates were 25.4% and 20.8%, and the highest were 29.1% and 34.6%, for the Amazon and Yelp databases, respectively. Because these rates are close to the chance rate of 25%, the participants essentially judged at random which of the four reviews was most likely to be real.

                      Amazon                           Yelp
Model                 Native  Non-native  Overall      Native  Non-native  Overall
Original review
Pretrained GPT-2
Fine-tuned GPT-2
mLSTM
Sentiment modeling
TABLE V: Fluency of reviews (in MOS). Bold font indicates the highest score.
                      Amazon                           Yelp
Model                 Native  Non-native  Overall      Native  Non-native  Overall
Pretrained GPT-2
Fine-tuned GPT-2
mLSTM
Sentiment modeling
TABLE VI: Correctness (in %) for judging which of four reviews was most likely to be the real review. Bold font indicates the worst case.

IV-D Automatic Fake Review Detection

We tested the ability of GROVER to detect fake reviews on 150 Amazon and Yelp fake reviews. It correctly detected 97% of the Amazon fake reviews and 87% of the Yelp ones. As shown by the examples in Table VII, however, two fake reviews were misidentified as real with high confidence.

As shown in Table VIII, using the GLTR tool revealed that the fake reviews contained slightly more top-10 words (words most likely to be generated by an LM) than the real reviews. Since this tool does not directly judge whether text is real or fake, it is difficult to say how well it can distinguish fake reviews. The authors of GLTR argue that fake texts have fewer words beyond the top-1000 words likely to be generated by LMs, but this characteristic alone is a weak signal for identifying fake reviews. If we were to set a hard threshold for detection, e.g., that real reviews have more than 4% of tokens beyond the top-1000 words ("Remainder" in Table VIII), GLTR would have a fake review detection accuracy of 92% for the Amazon reviews and 84% for the Yelp reviews.
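The hard-threshold rule above can be approximated as follows. This sketch assumes the small GPT-2 checkpoint from the Hugging Face transformers library rather than the exact model behind GLTR; it ranks each observed token under the LM's predictive distribution and flags a review as fake when fewer than 4% of its tokens fall outside the top-1000 predictions.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def remainder_fraction(text, top_k=1000):
    """Fraction of tokens whose rank under the LM's prediction is beyond top_k."""
    ids = tokenizer.encode(text, return_tensors="pt")
    with torch.no_grad():
        logits = model(ids).logits  # (1, seq_len, vocab_size)
    beyond, total = 0, 0
    for pos in range(ids.size(1) - 1):
        observed = ids[0, pos + 1]
        # Rank = number of vocabulary items scored higher than the observed token.
        rank = (logits[0, pos] > logits[0, pos, observed]).sum().item()
        beyond += int(rank >= top_k)
        total += 1
    return beyond / max(total, 1)

def looks_machine_generated(text, threshold=0.04):
    # Flag as fake if fewer than 4% of tokens fall outside the top-1000 predictions.
    return remainder_fraction(text) < threshold
```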

Fake review: I ordered the pork tenderloin sandwich. Delicious! Really nice, juicy, tender meat. I could eat an entire sandwich by myself! Great spot with friendly staffs.
GROVER's judgment: written by machine (quite sure)

Fake review: We visited this little gem during our visit Arizona. We wanted to get away from our kids for dinner and it sure was! The menu was extensive, the quality of food was very good and the service was top notch:) The place was nearly empty at prime dinner hours. It was refreshing and not to noisy.
GROVER's judgment: written by a person (quite sure)

Fake review: Wow! The food, and the price were outstanding This place is tucked away, but worth the effort. It was great - not your typical bar food, we were greeted at the door promptly, and sat down.
GROVER's judgment: written by a person (quite sure)

TABLE VII: Example results for GROVER detection: two fake reviews were judged to have been written by a person.
                      Amazon              Yelp
Fake words in         Fake     Real       Fake     Real
Top-10
Top-100
Top-1000
Remainder
TABLE VIII: Results using GLTR: distribution (in %) of generated (fake) words in different rank ranges. Top-10 means the ten words most likely to be generated by the LM, and likewise for Top-100 and Top-1000.

V Conclusion

We proposed a sentiment-preserving fake review generation method. It fine-tunes a GPT-2 model to generate a large number of reviews based on a review with the desired sentiment taken from the website to be attacked and then uses the BERT text classifier to filter out the ones with undesired sentiments. Since there is no post-processing or word modification, the generated reviews can be as fluent as the samples used for language model training. Subjective evaluation of review fluency by 80 participants produced a mean opinion score of 3.23 (on a scale of 1 to 5) for fake reviews based on Amazon real reviews and 3.30 for fake reviews based on Yelp real reviews; the values for the real reviews were 2.95 and 3.49, respectively. This means that the generated reviews had about the same fluency as reviews written by a person. Subjective judgment of which of four reviews (one real review and three fake reviews in random order) was most likely to be real produced correctness rates between 20.8% and 34.6%, which is roughly equivalent to random selection. Application of two countermeasures, GROVER and GLTR, to the detection of the fake reviews demonstrated detection accuracies of around 90%.

We plan to investigate ways to further preserve both sentiment and context information by using cold fusion [33] or simple fusion [34]. Since the generated reviews are close to the most probable sequences under the model, they lack diversity, and the corresponding region of the distribution may already be covered by the countermeasures; this may be why the countermeasures could easily detect the fake reviews. To generate more robust reviews, we plan to develop a method that generates reviews with more diversity [35]. We also plan to develop a countermeasure for detecting such generated reviews.

Acknowledgments: This research was carried out when the first and second authors were at the National Institute of Informatics (NII) of Japan in 2018 and 2019 as part of the NII International Internship Program. This work was partially supported by a JST CREST Grant (JPMJCR18A6) (VoicePersonae Project), Japan, and by MEXT KAKENHI Grants (16H06302, 17H04687, 18H04120, 18H04112, 18KT0051), Japan.

References