
GANgster: A Fraud Review Detector based on Regulated GAN with Data Augmentation

The financial impact of written reviews gives businesses a strong incentive to pay fraudsters to write fraud reviews, or to use bots to generate them. The promising performance of Deep Neural Networks (DNNs) in text classification has attracted research on using them for fraud review detection. However, the lack of trusted labeled data has limited the performance of current solutions in detecting fraud reviews. Unsupervised and semi-supervised methods are among the most applicable methods for dealing with the data scarcity problem. The Generative Adversarial Network (GAN), as a semi-supervised method, has been demonstrated to be effective for data augmentation. The state-of-the-art solution utilizes a GAN to overcome the data limitation problem. However, it fails to incorporate behavioral clues in both fraud generation and detection. Besides, the state-of-the-art approach suffers from a common limitation in the training convergence of the GAN, which slows down the training procedure. In this work, we propose a regularized GAN for fraud review detection that makes use of both review text and review rating scores. Scores are incorporated into the loss function through Information Gain Maximization for two reasons. One is to generate near-authentic, more human-like, score-correlated reviews. The other is to improve the stability of the GAN. Experimental results show better convergence of the regulated GAN. In addition, the scores are used in combination with word embeddings of the review text as input to the discriminators for better performance. Results show that the proposed framework outperforms the existing state-of-the-art framework, FakeGAN, in terms of AP by a relative 7% and 5% on the Yelp and TripAdvisor datasets, respectively.



I Introduction

Social media is full of users’ opinions about matters such as news, personal events, advertisements, and businesses. Opinions concerning businesses can greatly influence users’ decisions on purchasing certain products or services. A study in 2015 demonstrated that about 70 percent of people in the US read other users’ reviews of a product before purchasing. The openness of popular review platforms (Amazon, eBay, TripAdvisor, Yelp, etc.) provides an opportunity for marketers to promote their own business or defame their competitors by deploying new techniques such as bots, or by hiring humans to write fraud reviews for them. The reviews produced in this way are called “Fraud Reviews” [17, 22, 35]. Studies show that fraud reviews increased in Yelp by 5% to 25% [20] from 2005 to 2016. It is worth mentioning that fraud content with the same characteristics appears in other social media contexts [10]. Fake news, for example, consists of articles intentionally written to convey false information for a variety of purposes such as financial or political manipulation [40, 36]. Studying these types of content requires sufficient knowledge of political science, journalism, psychology, etc. [6, 32].

Since the first work on social fraud reviews in 2008 by Jindal et al. [14], many approaches have been used to address this problem. These include text based features, which are extracted from the review text [25], such as language models [30], and behavioral features, which extract clues from users’ behavior patterns using metadata or users’ profiles [23]. These approaches can also be combined for better performance [30, 31]. Hand-crafted features are fed to classifiers such as the Multi-Layer Perceptron (MLP), Naive Bayes, and Support Vector Machines (SVMs) to predict whether a review is genuine or not. We call approaches that use hand-crafted features “classic approaches”. In recent years, Deep Learning (DL) has been used for fraud review detection, modeling it as a “text classification” task, for better feature representation and to address the overfitting problem [28].

To deal with data scarcity, a recent attempt [1] adopted GAN in a framework called “FakeGAN”. FakeGAN consists of a generator, which generates fake samples as auto-generated reviews, and two discriminators: one discriminates between fake and real samples, and the other discriminates between fraud human reviews and fraud generated ones. Despite FakeGAN’s simplicity and effectiveness, it suffers from major limitations. The first limitation is the lack of high quality score-correlated data. Reviews generated by FakeGAN contain text only and provide no metadata such as the score, which has been shown to be more useful than review text when it comes to fraud detection [30, 31]. Generating high quality data correlated with the score provides a better feature representation learned jointly from both text and metadata. Second, FakeGAN suffers from a lack of stability in the training step; in other words, the training procedure in FakeGAN takes time to stabilize. Regularizing the objective function is one way to help the GAN converge. Finally, the performance of FakeGAN was evaluated on only one dataset. Experiments on datasets from different domains are required to ensure the scalability of the proposed approach.

In this paper, we propose to use Generative Adversarial Networks (GANs) [13] in our framework to solve the data scarcity problem. To ensure the generation of more meaningful and authentic reviews, we use a concept from Information Gain Maximization theory [5] to select generated samples that have the highest information gain against the score of reviews. In the real world, fraudsters mimic the same process by incorporating new techniques to produce genuine-like reviews. Two discriminators are used for the same purpose as in [1]. To ensure the generation of score-correlated reviews, the discriminator is modified to calculate the gain between the text and the score; in the training phase, gain maximization is achieved by forcing the generator to generate score-correlated reviews. As a result, the newly generated score-correlated reviews help the discriminator learn joint representations of the text and score from the augmented data.

Finally, to address the problem of stability, which is a well-known problem with GANs [13], Information Gain Maximization [5] also provides a regularization term that is applied to the objective function of the GAN. We summarize our contributions as follows:

  • Our proposed GAN framework addresses the problem of fraud review detection by incorporating, for the first time, the score in review generation based on information gain maximization theory. Our results show that customizing the generated reviews based on the score leads to a relative improvement in fraud review detection of 7% on the Yelp dataset and 5% on TripAdvisor compared to the state-of-the-art system (Sec. IV-E1, IV-E2 ‘Effect of Score’).

  • The objective function of the GAN is regularized using Information Gain Maximization. Our experiments demonstrate an improvement in the convergence of the proposed approach in terms of the number of required iterations (Sec. IV-E2 ‘Effect of Regularization’).

  • We demonstrate that using the GAN to generate labeled data addresses the data scarcity and stability problems, which are among the main challenges in fraud review detection. We show that, even with less data, generating reviews can help the detector converge to the final performance on two different datasets, Yelp and TripAdvisor (Sec. IV-E3).

II Related Works

II-A Fraud Detection

There are two different approaches for fraud detection: Classic and Deep Learning (DL) approaches.

II-A1 Classic Approaches

Two types of features are used in fraud detection in this context; behavioral features and text based features [23]. These features are either used separately or in combination.

Text based features: Text based approaches extract features directly from the reviews or from text statistics, language models, etc. [24]. Pairwise features, which are extracted by comparing pairs of reviews, were used by Chang et al. [4] to spot fraudsters. Their study used similarity among reviews to spot group fraud reviews. Previous studies have also shown how n-gram models improve the accuracy of fraud and fraudster detection [33]. For example, fraudsters tend to dominate their reviews and, as a result, use more first person pronouns to strengthen their impression. In addition, to make their reviews bolder, they use CAPITALIZED words to catch the readers’ attention [18]. In summary, handcrafted text based features use the review text for feature extraction, while the metadata of a review can also play an important role in discriminating between genuine and fraud reviews.

Behavioral features: These types of features were initially proposed to address the inadequacy and incompleteness of text based features. Some of the important behavioral features are reported in [14][30][31]. As an example, consider a reviewer who writes reviews about every hotel in a town; this is unusual, since a traveler will likely use just one hotel in that town [21]. Fraudsters also tend to write as many reviews as they can, since they are paid based on the number of reviews they write. So the more reviews a reviewer writes, the higher the probability that he/she is a fraudster [2]. In addition, normal users show a low deviation in their opinions, while fraudsters tend to promote the services of the companies they are working for and defame the services of competitors. So a user’s Rating Deviation (RD) can also be considered a behavioral feature of a fraudster [9].

II-A2 Deep Learning

In recent years, DL has attracted attention for the extraction of sentiment representations from text and sentences, for two main reasons. First, hidden layers are able to extract complex hidden information in sentences. Second, a global representation of a sentence is achievable using such networks, while hand-crafted features fail to do both [28][8]. Ren et al. [37] claimed to be the first to use DL to spot fraud reviews. In their work, a CNN is used to extract a representation of a sentence from words. The sentence representation is then fed to a Gated Recurrent Neural Network (GRNN), which generates a document representation of the whole text. These features are then fed to a softmax layer to determine whether the document is fraud or not. This approach demonstrated a 3% improvement in fraud classification on TripAdvisor. Term frequency, word2vec and LDA (Latent Dirichlet Allocation) were combined by Jia et al. [29] to spot spam reviews on Yelp. This work extracted 5 topics from fake and non-fake reviews and described each topic using 8 words. Finally, they used SVM, Logistic Regression and MLP, and the results show that the MLP along with Logistic Regression achieved the best performance (81.3% accuracy).

II-B Generative Adversarial Networks

Generative Adversarial Networks (GANs) [12] are among the latest approaches that have been used in Artificial Intelligence (AI) for different applications. Liang et al. [19] used G to produce a description of an image, while D tried to distinguish between the generated description and the real description. Their work claimed that the insufficient labelled description data available for images may cause overfitting, and that a GAN can generate new descriptions that both augment and balance the data and can, in addition, provide real descriptions of images.

Aghakhani et al. [1] proposed FakeGAN to investigate the problem of fraud review detection using a GAN, in which the generator produces fraud reviews and the GAN is fed with review text. FakeGAN consists of a generator, which generates fake samples as auto-generated reviews, and two discriminators: one discriminates between fake and real samples, and the other discriminates between fraud human reviews and fraud generated ones. Unlike the original GAN, FakeGAN makes use of dual discriminators to overcome the well-known mode collapse problem. Mode collapse refers to a situation in which the generator switches between different modes during the training phase because of the complexity of the input data distribution. FakeGAN was evaluated on the TripAdvisor dataset from [25], containing 800 reviews: 400 real and 400 deceptive.

III Proposed Methodology

III-A Problem Definition

Given a set of real reviews , consisting of genuine reviews () and fraud human reviews (), our purpose is to design a system that generates a set of fraud bot reviews (). We denote the fraud reviews as . First, we train a discriminator to differentiate from . This will ensure that we generate more human like fraud reviews . This in turn allows us to train the discriminator to differentiate between and .

Fig. 1 depicts the overall system architecture of GANgster, our proposed GAN based fraud review detection system with regularized GAN. In the following, we provide relevant details of the different components of GANgster.

III-A1 Generator

In the generator (Fig. 1, Module 1), is a random noise from a parametric distribution with being the distribution parameters, and is a latent variable that we intend to discover and put as a constraint for the generated samples from the random noise. To adapt the system with a new constraint , we use the mutual information between and the constraint ; namely for regulation. Details of the regulated GAN are provided in Sec. III-C.

III-A2 Generated Reviews

Generated reviews are generated by as a function of from the random noise . Note that in the original GAN, there is no constraint on the generated samples and the generator generates samples that satisfy . denotes the probability that the generator generates a sample given constraint . is the probability that the generator generates a sample without regard to any constraints. The constraint in this case is the score s determined by vector s as an input for the generator in Fig. 1.

III-A3 Discriminators

Dual discriminators are used in GANgster. Discriminator (Fig. 1, Module 2, 3) helps to generate more fraud like reviews, and the discriminator (Module 3 in Fig. 1) as the main discriminator, tries to force to generate reviews which seem more authentic. Similar to [1] two discriminators are used to deal with the “mode collapse” problem. “Mode collapse” is a well-known limitation of GAN. It refers to the generator instability to generate samples from the learned distribution, because of the input complexity and multidimensionality of data distribution.

Fig. 1: Framework of GANgster.

III-B Preliminaries

GANgster embodies three innovations. First, we use the score to generate reviews, so that the generated reviews are customized to different scores. Next, the score is combined with the review text to improve the detection accuracy. Finally, to address the instability of FakeGAN, the cost function is updated using the concept of information gain maximization.

Information Gain Maximization was also proposed in [5] in a framework named ”InfoGAN”. The basic idea of “InfoGAN” is to generate handwritten numbers using . This framework uses information gain maximization in order to generate image samples considering a special constraint such as the angle and thickness of a digit’s stroke. The idea of incorporating constraints is also applicable to reviews, for example the polarity (score) of reviews can be considered as a constraint. In Sec. III-C, we explain how to apply these constraints.

III-C Information Gain Maximization GAN Regularization

As mentioned in Sec. III-A1, we need to increase the mutual information between and our model . In order to achieve this goal, a regularization must be included in the objective function. The constraint information is also added to , and we also have to add another layer to both and . From the information theory viewpoint, we intend to increase the amount of knowledge that comes from each review by and vice-versa, which is . In order to do that we need to expand the following equation:


From Eq. (3) it is difficult to maximize the information gain directly, since sampling the posterior is required. Therefore we use variational mutual information maximization [3] to find a lower bound over to make the lower bound as tight as possible. To do that, we first need to define an auxiliary distribution; which is an approximate for . So we extend Eq. (3):


In the above equation we are able to maximize the entropy of , since the distribution is fixed. Using the lemma (proof in [5]), we have:


There are a couple of approaches to calculate ; Here we just add a fully connected layer to the output parameters of the dual discriminators to calculate . Hence, we can regulate the objective function of GAN to solve the minmax game for as follows:


In addition to training GAN, the objective function of Eq. 7 tries to preserve the contribution of during the generation process.
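For reference, the standard InfoGAN derivation from [5], which this section follows, can be sketched as below. This is a reconstruction under assumptions rather than the exact form of Eqs. 3 to 7: the symbols ($c$ for the constraint, which here is the review score; $z$ for the noise; $Q$ for the auxiliary distribution; $\lambda$ for the regularization weight) follow the InfoGAN paper, and the version used in GANgster with dual discriminators may differ in notation and in how $V(D, G)$ is defined.

```latex
\begin{align}
I(c; G(z,c)) &= H(c) - H(c \mid G(z,c)) \\
 &= \mathbb{E}_{x \sim G(z,c)}\Big[\mathbb{E}_{c' \sim P(c \mid x)}\big[\log P(c' \mid x)\big]\Big] + H(c) \\
 &= \mathbb{E}_{x \sim G(z,c)}\Big[D_{\mathrm{KL}}\big(P(\cdot \mid x)\,\|\,Q(\cdot \mid x)\big)
    + \mathbb{E}_{c' \sim P(c \mid x)}\big[\log Q(c' \mid x)\big]\Big] + H(c) \\
 &\ge \mathbb{E}_{x \sim G(z,c)}\Big[\mathbb{E}_{c' \sim P(c \mid x)}\big[\log Q(c' \mid x)\big]\Big] + H(c)
\end{align}

% Lower bound that avoids sampling from the posterior (lemma in [5]):
\begin{equation}
L_I(G, Q) = \mathbb{E}_{c \sim P(c),\, x \sim G(z,c)}\big[\log Q(c \mid x)\big] + H(c) \;\le\; I(c; G(z,c))
\end{equation}

% Regularized minimax objective:
\begin{equation}
\min_{G, Q} \max_{D} \; V_I(D, G, Q) = V(D, G) - \lambda\, L_I(G, Q)
\end{equation}
```

The inequality holds because the KL term is non-negative, and the bound becomes tight as $Q$ approaches the true posterior. $H(c)$ is treated as a constant because the distribution of the constraint is fixed.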

III-D Models

III-D1 Generator Model

Since Recurrent Neural Networks (RNNs) are the most used deep networks for sequential patterns, we adopted this type of network in this work and fed the embedding to the RNN as an input and the hidden states were trained in recursive mode:


In Eq. 8, represents the sequence of the input word embedding which is mapped into a sequence of hidden states, , using function . The output was generated using a softmax function.


where is the output of the softmax function, is the weight matrix and is the bias vector. We also used Long Short-Term Memory (LSTM) cells in the RNN to deal with the well-known vanishing and exploding gradient problem.
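The recurrence and softmax output described above can be illustrated with a minimal sketch. It assumes a plain tanh recurrence rather than the LSTM cells the paper uses, and every name and dimension (`rnn_step`, `token_distribution`, `EMB`, `HID`, `VOCAB`, the random weights) is hypothetical and chosen only for illustration.

```python
import math
import random


def rnn_step(h_prev, x_t, w_h, w_x, b):
    """One recurrence step: h_t = tanh(W_h h_{t-1} + W_x x_t + b)."""
    return [
        math.tanh(
            b[i]
            + sum(w_h[i][j] * h_prev[j] for j in range(len(h_prev)))
            + sum(w_x[i][j] * x_t[j] for j in range(len(x_t)))
        )
        for i in range(len(b))
    ]


def token_distribution(h_t, v, c):
    """Softmax over the vocabulary: p = softmax(V h_t + c)."""
    logits = [c[k] + sum(v[k][j] * h_t[j] for j in range(len(h_t)))
              for k in range(len(c))]
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def rand_matrix(rng, rows, cols):
    """Small random weights, standing in for trained parameters."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
EMB, HID, VOCAB = 3, 4, 5  # toy sizes, not the paper's settings
w_h, w_x = rand_matrix(rng, HID, HID), rand_matrix(rng, HID, EMB)
v = rand_matrix(rng, VOCAB, HID)
b, c = [0.0] * HID, [0.0] * VOCAB

h = [0.0] * HID
for x_t in rand_matrix(rng, 6, EMB):  # six toy word embeddings
    h = rnn_step(h, x_t, w_h, w_x, b)
probs = token_distribution(h, v, c)  # distribution over the next token
```

Swapping `rnn_step` for an LSTM cell changes how the hidden state is updated but leaves the softmax output layer unchanged.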


III-D2 Discriminator Model

Similar to the generator we also extracted embedding vectors in the first stage of discrimination. We represent the embedding of each word as , where is the embedding of word . The representation is then concatenated with the review score . CNN was used as the classifier for fraud review detection [15]. We first concatenated the extracted word embeddings and score:


where symbol represents concatenation. We apply a convolution layer to a window size of

using an ReLU function:


where is the kernel function, and represents a bias value. The output for this step is

. A max-pooling is applied to

to get the combination of the different kernel outputs. Finally, a softmax function was applied to calculate the class probability of the fraud and real reviews. A softmax function is used to calculate the to enforce an update on the constraint evolution in the cost function, using the following function:


where is the final input representation obtained from applying max-pooling on . is a weight matrix trained based on different scoring, and is the bias vector.
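The discriminator pipeline described above (concatenate embeddings with the score, convolve with ReLU, max-pool, softmax) can be sketched in a few lines of plain Python. This is an illustration under assumptions, not the authors' implementation: the score is appended as one extra feature to every word vector, which is only one possible reading of the concatenation, and all names, kernel sizes, and weights are hypothetical.

```python
import math
import random


def softmax(logits):
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def discriminator_forward(embeddings, score, kernels, biases, w_out, b_out):
    """Return class probabilities for one review."""
    # Assumption: the review score is appended to each word embedding.
    x = [list(vec) + [float(score)] for vec in embeddings]
    width = len(x[0])
    pooled = []
    for kernel, bias in zip(kernels, biases):
        window = len(kernel)  # number of consecutive words the kernel covers
        feature_map = []
        for i in range(len(x) - window + 1):
            act = bias + sum(kernel[r][col] * x[i + r][col]
                             for r in range(window) for col in range(width))
            feature_map.append(max(0.0, act))  # ReLU
        pooled.append(max(feature_map))  # max-pooling over positions
    logits = [b + sum(w * p for w, p in zip(row, pooled))
              for row, b in zip(w_out, b_out)]
    return softmax(logits)


rng = random.Random(1)
DIM, LENGTH = 4, 6  # toy embedding size and review length


def rand_matrix(rows, cols):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


review = rand_matrix(LENGTH, DIM)
kernels = [rand_matrix(2, DIM + 1), rand_matrix(3, DIM + 1)]  # window sizes 2 and 3
class_probs = discriminator_forward(
    review, score=5, kernels=kernels, biases=[0.0, 0.0],
    w_out=rand_matrix(2, 2), b_out=[0.0, 0.0],
)
```

Each kernel contributes one pooled value, so the final layer sees a vector whose length equals the number of kernels.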

III-E Training

Training of the GAN in GANgster consists of two steps: pre-training and training. Pre-training is used to generate first reviews for the subsequent training of the with and . For training, we use Monte-Carlo search to solve the problem of discrete token generation for the generator. In brief, Monte-Carlo search is a heuristic search algorithm that identifies the most promising moves in a game. For each state, it plays the game to the end a fixed number of times, based on a specific policy. The best move is selected based on a reward for a complete sequence of moves, which are words in this case.

III-E1 Pre-training

In the pre-training section, we need to generate some as an input for . So we first train the on using Maximum Likelihood Estimation. The discriminators are also pre-trained using cross-entropy, with generated reviews and fake reviews as input for , and fake reviews and real reviews as input for .

III-E2 Adversarial Training

We adopt the idea of the roll-out policy model for reinforcement learning to generate the sequence of words for . So in adversarial training, the generated samples have to achieve higher scores (more realistic) from the discriminators. This forces the generator to take actions that lead to a better score from the discriminators and subsequently higher rewards in the policy gradient of the Monte-Carlo search. So the indicates the reward or quality of the generated reviews from . The action-value for an action taken by the generator is calculated by:


The problem here is that in , every generated word for the final review generation needs a reward, while the discriminators can only calculate the reward for complete generated sentences, real or fake, and not for incomplete ones. Therefore, for reviews with length less than a complete sentence ( as the length for a complete review and for our arbitrary length) we need to perform a Monte-Carlo search on words to predict the remaining ones. For a good prediction, a Monte-Carlo search is engaged. The reviews generated by the Monte-Carlo search are defined as , which are sampled using the roll-out policy based on their current state. It is worth mentioning that is the same generative model we used for generation, hence . As mentioned before, for a complete review is the reward. For an incomplete sentence, though, it can be calculated using the following equation:


where is the reward value of the incomplete generated review which is completed times using Monte-Carlo search and is rewarded the same number of iterations by two discriminators.
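The roll-out idea can be sketched as follows: an incomplete review is completed several times with the roll-out policy, each completion is scored, and the scores are averaged. The names `rollout_reward`, `sample_next`, and `reward_fn` are hypothetical; in particular, `reward_fn` stands in for whatever combination of the two discriminators' outputs is used, which this sketch does not attempt to reproduce.

```python
import random


def rollout_reward(prefix, full_len, sample_next, reward_fn, n_rollouts=8):
    """Estimate the reward of a (possibly incomplete) token sequence.

    prefix      -- tokens generated so far
    full_len    -- length of a complete review
    sample_next -- roll-out policy: maps a token list to the next token
    reward_fn   -- scores a complete review (stand-in for the discriminators)
    """
    if len(prefix) >= full_len:
        # A complete review is scored directly.
        return reward_fn(prefix)
    total = 0.0
    for _ in range(n_rollouts):
        sequence = list(prefix)
        while len(sequence) < full_len:
            sequence.append(sample_next(sequence))
        total += reward_fn(sequence)
    return total / n_rollouts


def toy_reward(tokens):
    """Toy reward: fraction of tokens equal to 'good'."""
    return sum(token == "good" for token in tokens) / len(tokens)


rng = random.Random(0)


def random_policy(tokens):
    """Toy roll-out policy that ignores the context."""
    return rng.choice(["good", "bad"])


estimate = rollout_reward(["good", "bad"], 6, random_policy, toy_reward)
```

Averaging over several completions reduces the variance of the reward assigned to each intermediate token, at the cost of extra generator and discriminator passes.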
Using this approach we can convert the discrete nature of the produced words into continuous form. The updates can be then propagated backwards from discriminators through the generator. The last step to complete this loop is to complete the adversarial training, which is a function to maximize the final reward. Inspired by [34], we use the following objective function:


So for updating we just need:


where is the learning rate and is set to 1. In addition we use Eq. 7 for training the discriminators. So both the generator and the discriminators are updated mutually to finally converge to an optimum point. Algorithm 1 describes how GANgster works.

Result: reviews probability ranked by as fraud, customized reviews generated by
% Pre-training;
generating word embedding of words in and ;
pre-train with word embedding of ;
% Training;
while  convergence do
       % Generator training;
       for  to  do
            % generating customized reviews base on score;
             generate from and score ;
             for  to  do
                   % Reward for each word based on and  compute by Eq. 14;
             end for
            update by Eq. 16;
       end for
      for  to  do
             % Samples from generator as positive input for and negative one for ;
             generate ;
             train with as (+) and as (-);
             train with as (+) and as (-)
       end for
end while
% Testing;
generate by ;
compute probability of , to be fraud;
Algorithm 1 GANgster Algorithm

IV Results and Evaluation

IV-A Datasets

As discussed in Sec. I, datasets for fraud review detection labeled by humans are referred to as “near ground-truth”. Most of the existing datasets only provide review text, rather than both text and metadata. We need a labeled dataset containing both text and metadata to apply the proposed approach. In this work we used the Yelp and TripAdvisor datasets. Since most of the reviews are shorter than 400 words, we selected those reviews with fewer than 400 words and padded them with “END” tokens to a length of 400. Yelp is a social media platform which provides the opportunity for people to write reviews of their experience of different restaurants and hotels in NYC. The dataset is labeled by the Yelp filtering system, which is more trusted than datasets labeled by humans [23]. The dataset contains the review ID, item ID, user ID, score (rating from 1 to 5) given by different users to different items, date of the written reviews, and the text itself. TripAdvisor provides the opportunity for people to write reviews about different entertainment places and rate them. Unlike in the Yelp dataset, however, people cannot give a graded rating; they can only like or dislike the business according to their positive or negative tendency. The dataset contains the review texts together with the people’s tendency in the form of like or dislike (positive or negative polarity), labeled by human judges. The dataset does not provide any information about the users who wrote the reviews. We used two datasets representing different businesses. Our purpose is to show both the scalability of the proposed approach and the impact of missing data on our approach. Details of the two datasets are listed in Table I.

Datasets    | Reviews (spam%) | Users | Resto. & hotels | Rating
Yelp-main   | 6,000 (19.66%)  | 47    | 5,046           | 1-5
TripAdvisor | 1,600 (50%)     | -     | 20              | -1 (dislike), 1 (like)
TABLE I: Details of datasets.
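The length filtering and padding step described above is simple enough to sketch directly. Whitespace tokenization and the function names are assumptions made for illustration; the 400-word limit and the “END” padding token come from the text.

```python
MAX_LEN = 400
PAD_TOKEN = "END"


def pad_review(text, max_len=MAX_LEN, pad_token=PAD_TOKEN):
    """Return a list of exactly max_len tokens, or None if the review is too long."""
    tokens = text.split()  # assumption: simple whitespace tokenization
    if len(tokens) >= max_len:
        return None  # only reviews with fewer than max_len words are kept
    return tokens + [pad_token] * (max_len - len(tokens))


def prepare_reviews(texts):
    """Drop over-long reviews and pad the rest to a common length."""
    padded = (pad_review(text) for text in texts)
    return [tokens for tokens in padded if tokens is not None]


sample = prepare_reviews(["great food and friendly staff", "word " * 500])
```

A fixed length lets every review be fed to the CNN discriminator as an input of the same shape.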

IV-B Experimental Setup

Advanced word embedding techniques such as ELMo [27], BERT [7], and XLNet [38] have been developed in the past years and have achieved new state-of-the-art results on many natural language processing tasks. However, to provide a fair comparison with the two benchmark systems, we used GloVe [26] as the baseline word embedding, with dimension 50 and a batch size of 64 for the inputs of the discriminators and the generator. For the CNN, we tuned over different kernels taken from successful studies of text classification [16][39]. The input filters are representative of the n-gram language model, which in our case is chosen from . We used the weighting matrix to map the input features (obtained from concatenation using Eq. 10) to a one dimension representation with different filter size, chosen from . The learning rate for the discriminators is and it is for the generator. The training iterations are set to 100 for the generator and discriminators. For adversarial training, we used 120 iterations. In Algorithm 1, for each outer loop, the generator was trained for 5 iterations in the inner loop, and the number of training epochs for the discriminators is set to 3.

Dataset     | Framework | AP               | AUC              | Accuracy
Yelp-main   | NetSpam   | 0.5832 ± 0.0028  | 0.7623 ± 0.0192  | 0.7232 ± 0.0293
Yelp-main   | FakeGAN   | 0.6159 ± 0.03684 | 0.8686 ± 0.01334 | 0.8280 ± 0.0045
Yelp-main   | GANgster  | 0.6516 ± 0.0275  | 0.8878 ± 0.02012 | 0.8476 ± 0.0048
TripAdvisor | NetSpam   | 0.6194 ± 0.0093  | 0.7782 ± 0.0174  | 0.7428 ± 0.0029
TripAdvisor | FakeGAN   | 0.6858 ± 0.0403  | 0.8510 ± 0.0336  | 0.7619 ± 0.0461
TripAdvisor | GANgster  | 0.7160 ± 0.0058  | 0.8767 ± 0.0197  | 0.7726 ± 0.0084
TABLE II: GANgster performance in comparison with FakeGAN and NetSpam, using 70% of the Yelp-main as a training set and 30% as a test set.

IV-C Evaluation Metrics

For evaluation, we rank the reviews by their fraud probability. Reviews with higher values are more likely to be fraud and vice versa. We used three metrics to describe the performance: Area Under Curve (AUC), Average Precision (AP), and Accuracy. Accuracy is a well-known and common metric for measuring the performance of ML approaches. Hence, we only elaborate on the first two metrics: AUC and AP.

IV-C1 Area Under Curve

For AUC, the area under the plot of the True Positive Ratio (TPR) on the y-axis against the False Positive Ratio (FPR) on the x-axis is integrated. Consider as a list of sorted reviews according to their probability to be fraud. If the number of fraud (genuine) reviews higher than the review in index , is , then TPR (FPR) for index is , where is the total number of fraud (genuine) reviews. The AUC is calculated as follows:


where is the total number of reviews.
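As an illustration of the ranking-based computation, the sketch below walks down the list of reviews sorted by fraud probability, tracks TPR and FPR, and integrates with the trapezoid rule. The function name and the label convention (1 for fraud, 0 for genuine) are assumptions, and tied scores are not given any special treatment.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve from fraud scores and 0/1 labels.

    scores -- predicted fraud probability per review
    labels -- 1 for fraud, 0 for genuine (assumed convention)
    Assumes distinct scores; ties are processed in sorted order.
    """
    n_fraud = sum(labels)
    n_genuine = len(labels) - n_fraud
    if n_fraud == 0 or n_genuine == 0:
        raise ValueError("both classes must be present")
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    true_pos = false_pos = 0
    prev_tpr = prev_fpr = 0.0
    area = 0.0
    for i in order:
        if labels[i] == 1:
            true_pos += 1
        else:
            false_pos += 1
        tpr, fpr = true_pos / n_fraud, false_pos / n_genuine
        # Trapezoid between the previous and the current ROC point.
        area += (fpr - prev_fpr) * (tpr + prev_tpr) / 2.0
        prev_tpr, prev_fpr = tpr, fpr
    return area


auc_mixed = roc_auc([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0])
auc_perfect = roc_auc([0.9, 0.8, 0.7, 0.6], [1, 1, 0, 0])
```

In the first call, three of the four (fraud, genuine) pairs are ordered correctly, which is why the area is 0.75.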

IV-C2 Average Precision

For AP, we need to have a list of sorted reviews based on their probability to be fraud. If is a list of sorted review indices based on their probability and is the total number of fraud reviews, then AP is formalized by:


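Average precision over a ranked list can be sketched as follows, using the common definition: the precision is evaluated at the rank of each fraud review and the values are averaged over the total number of fraud reviews. The function name and the label convention (1 for fraud, 0 for genuine) are assumptions.

```python
def average_precision(scores, labels):
    """Average precision of a ranking by fraud score.

    scores -- predicted fraud probability per review
    labels -- 1 for fraud, 0 for genuine (assumed convention)
    """
    n_fraud = sum(labels)
    if n_fraud == 0:
        raise ValueError("at least one fraud review is required")
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            precision_sum += hits / rank  # precision at this fraud review's rank
    return precision_sum / n_fraud


ap_mixed = average_precision([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0])
```

For this toy ranking the fraud reviews sit at ranks 1 and 3, giving precisions 1/1 and 2/3 and an average of 5/6.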
IV-D Baseline Systems

For comparison we chose two state-of-the-art systems which were used on the Yelp and TripAdvisor datasets. GANgster is compared with these two systems in terms of the different metrics of Sec. IV-C.

IV-D1 NetSpam

NetSpam was proposed in [31] and is among the most recent works in the area of fraud detection. It combines text and behavioral features in four categories, providing 8 different types of features. These features are fed to a Heterogeneous Information Network (HIN) based classifier, which outputs a ranked list of fraud probabilities for reviews. This work used the Yelp and Amazon datasets as real world datasets, and four other datasets created from the Yelp datasets, to demonstrate the scalability of the work.

IV-D2 FakeGAN

FakeGAN [1] uses GAN as the main framework to improve the detection accuracy. It extracts deep contextual features using GLoVe as the word embedding from the TripAdvisor dataset to input to both the generator and discriminators.

IV-E Main Results

In this section we evaluate the performance of our proposed system, carry out an ablative study on the effect of score and regularization and examine its robustness to data scarcity. Note that all the results are based on the performance of as the main discriminator.

IV-E1 Performance

We used both datasets from Table I and we partitioned the dataset into two sets: training and test set.

As Table II shows, the proposed methodology outperforms the other systems on both datasets according to the three metrics. AP is highly dependent on the fraud percentage in the dataset, while AUC is independent of the fraud percentage. The improvement is mostly because this method combines the key strengths of both FakeGAN (synthetic data generation) and NetSpam (combination of multiple features).

Impact of Generated Fraud Reviews: As mentioned in Sec. I, generated reviews play an important role in improving the performance. Fig. 2 displays the performance of GANgster when it is only trained with human fraud reviews (green) vs. when it is trained with the combination of generated and fraud reviews.

Fig. 2: Effect of the generated reviews on performance with 70% of data as training set.

For both datasets, the performance of GANgster improves on all of the metrics when real data is combined with generated data. Generated reviews increase the amount of training data, which leads to better performance. In addition, generated reviews can imitate the bot-written reviews in the datasets, helping the discriminators to learn from more diverse data and to spot the bot reviews in real datasets more accurately.

IV-E2 Ablative Study

To show the effectiveness of the different parts employed in this study, we conducted various experiments on the GANgster framework. The experiments include two sections; first, the effect of the score employed for both the generator and discriminator is examined, and then the effect of regularization is analyzed to support using the regularization process in GANgster.

Effect of Score: In this section, we examine the importance of using the score in both the generator and the discriminator. To assess the effectiveness of using the score in our proposed approach, we removed the score first from the generator and the discriminator separately, and then from both simultaneously, to observe the effect on performance.

Fig. 3: Effect of including the score in the generator and the discriminator with 70% of data as training set.

Fig. 3 shows the effect of using the score in the generator and the discriminator on the performance of the proposed approach. Results on the Yelp dataset show that the performance improves gradually as the score is included in the generator and the discriminator. Specifically, generating the reviews with the score has a greater impact on performance than including the score in the discriminator. One possible explanation is that generating reviews correlated with the score increases the diversity of the generated data, and more diverse data helps the discriminator learn the model more accurately. Including the score in the discriminator also improves the performance for all three metrics, but by a smaller margin than when the score is used in the generator. One simple conclusion is that a larger amount of training data helps to achieve a consistently increasing performance pattern as the score is included in the generator and the discriminator. Conversely, the results on TripAdvisor do not show such an increasing pattern. For AP, the improvement is clear when the score is included in both the generator and the discriminator. For Accuracy, the improvement is evident, although including the score in the generator shows a negative impact. The degradation in performance could result from the binary score (positive or negative) used in TripAdvisor. Such a binary score prevents the accurate generation of reviews correlated with the score, and this inconsistency results in improper training of the discriminator.

Effect of Regularization: One of the important issues of the standard GAN is instability. Different techniques exist to overcome this issue. Here we use regularization, as explained in Sec. III-C. Convergence refers to the situation in which the key evaluation metrics reach a steady-state best value after a specific number of iterations.

To examine convergence, we remove the regularization term from the objective function (Unregulated GANgster) to show the impact of regularization on convergence. We compare the metrics of the different systems with those of GANgster over successive iterations of adversarial learning. In pre-training, the generator converges after 100 iterations and the discriminator after 50. The adversarial training step took 100 iterations to converge.
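The notion of convergence described above can be made concrete with a minimal sketch. It assumes a per-iteration history of one metric; the `tolerance` and `patience` parameters and the function name are hypothetical choices made here for illustration and are not taken from the paper.

```python
def convergence_iteration(history, tolerance=0.005, patience=10):
    """Return the 1-based iteration from which the metric stays within
    `tolerance` of its best value until the end of training, or None if
    that steady state lasts for fewer than `patience` iterations."""
    if not history:
        return None
    best = max(history)
    start = len(history)
    # Walk backwards while the metric remains close to its best value.
    while start > 0 and best - history[start - 1] <= tolerance:
        start -= 1
    if len(history) - start < patience:
        return None  # the metric never settled for long enough
    return start + 1


# A metric that climbs and then plateaus (hypothetical values).
settling = [0.5, 0.6, 0.7, 0.8] + [0.85] * 10
# A metric that keeps oscillating, as an unstable GAN might (hypothetical values).
oscillating = [0.5, 0.9] * 10

settling_at = convergence_iteration(settling)
oscillating_at = convergence_iteration(oscillating)
print(settling_at, oscillating_at)
```

Under this definition, a run whose metric keeps oscillating returns `None`, which corresponds to a system that has not converged within the observed iterations.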

Fig. 4: AP, AUC, and Accuracy for different supervisions of different frameworks on Yelp dataset.

Fig. 4 shows that for FakeGAN, 140 iterations are required to achieve the final performance, while for the unregulated GANgster, this is not achieved even after 200 iterations. With our regulated cost function, GANgster achieves convergence to the final performance in 100 iterations.

IV-E3 Robustness with Data Scarcity

Because labeled data is scarce, robustness is an important consideration in fraud review detection. In this section, we conducted two sets of experiments. The first observes the effect on performance of using different percentages of the dataset as the training set. In the second, different numbers of reviews are selected from both datasets, and the cross-dataset performance of GANgster is compared with that of the other frameworks.

Robustness to Different Supervisions: As mentioned in Sec. IV-E1, we partitioned the main dataset into training and test sets with different amounts of data, and we refer to these partitions as “supervisions” (0.7, 0.5, 0.3, and 0.1 of the data as the training set, respectively, with the remainder as the test set). We use these supervisions to demonstrate the robustness of GANgster. Fig. 5 shows the results for different supervisions.
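The partitioning into supervisions can be sketched as follows. This is a minimal illustration that assumes a simple seeded random shuffle; the source does not state how the split was drawn (for example, whether it was stratified by label), and `split_by_supervision` is a hypothetical helper name.

```python
import random

# Training-set proportions ("supervisions") listed in the text.
SUPERVISIONS = (0.7, 0.5, 0.3, 0.1)


def split_by_supervision(reviews, supervision, seed=0):
    """Shuffle a copy of `reviews` and return (train, test), where the
    training set holds a `supervision` fraction of the data."""
    if not 0.0 < supervision < 1.0:
        raise ValueError("supervision must lie strictly between 0 and 1")
    shuffled = list(reviews)
    random.Random(seed).shuffle(shuffled)  # assumption: plain random split
    cut = int(round(len(shuffled) * supervision))
    return shuffled[:cut], shuffled[cut:]


# Stand-in for a labeled review dataset (hypothetical placeholder items).
reviews = list(range(1000))
splits = {s: split_by_supervision(reviews, s) for s in SUPERVISIONS}
for s, (train, test) in splits.items():
    print(f"supervision={s}: {len(train)} train / {len(test)} test")
```

Fixing the seed keeps the partitions reproducible, so each framework can be evaluated on the same train and test sets for a given supervision.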

Fig. 5: AP, AUC, and Accuracy for different supervisions for different frameworks.

Fig. 5 shows that our framework is robust to data scarcity for all three metrics, as can be observed from the effects of different supervisions. For , the best result is obtained with 0.7 supervision for TripAdvisor, while the performance with 0.5 of the data as the training set remains the same. GANgster performs similarly on the Yelp dataset and converges to a constant value with 30% supervision. The performance of the other frameworks is unstable across supervisions, mostly because these approaches are mainly semi-supervised or even unsupervised, and their performance is rarely proportional to supervision. In addition, the variation in AP is greater than the variation in AUC or Accuracy, especially for GANgster. This happens because the values of AP are relatively low, so a slight change in the amount of training data leads to a larger change than in the other measures. For , the rate of improvement of GANgster decreases as the supervision increases. NetSpam works better than FakeGAN with little training data on TripAdvisor, but as the number of samples increases, FakeGAN becomes superior, while GANgster performs best for all of the supervisions. In addition, the performance of GANgster reaches a convergence point for all three metrics on the Yelp dataset (almost five times larger than TripAdvisor), while the performance of NetSpam increases linearly given more training data. This shows how important the amount of data is for these approaches to learn their models appropriately.

Finally, GANgster demonstrates promising scalable results in terms of . Prior works use as the main metric to measure performance and robustness. Therefore, this metric plays an important role in measuring detection quality.

Cross-dataset Robustness: Since the same supervisions on different datasets turn out to yield different amounts of data, we also conducted a cross-dataset evaluation of robustness. Cross-dataset evaluation can verify the performance of our proposed approach on different domains, and it also shows that data augmentation yields fair performance even with a small training dataset. In this experiment, different numbers of samples are selected from each dataset, and the performance of the frameworks is compared on the same number of samples. We call this process the “cross-dataset” experiment.
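The selection step of the cross-dataset experiment can be sketched as below: the same number of reviews is drawn from every dataset so that frameworks are compared at equal sample counts. The sample sizes, the placeholder review strings, and the helper name `sample_equal_sizes` are assumptions made for illustration; the source does not list the exact counts or the sampling procedure.

```python
import random


def sample_equal_sizes(datasets, sizes, seed=0):
    """For each size n, draw n reviews without replacement from every
    dataset, returning {n: {dataset_name: sampled_reviews}}."""
    smallest = min(len(reviews) for reviews in datasets.values())
    rng = random.Random(seed)
    subsets = {}
    for n in sizes:
        if n > smallest:
            raise ValueError(f"cannot draw {n} reviews from every dataset")
        subsets[n] = {
            name: rng.sample(reviews, n) for name, reviews in datasets.items()
        }
    return subsets


# Placeholder datasets of unequal size (hypothetical contents).
datasets = {
    "Yelp": [f"yelp-review-{i}" for i in range(50)],
    "TripAdvisor": [f"tripadvisor-review-{i}" for i in range(10)],
}
subsets = sample_equal_sizes(datasets, sizes=(2, 5, 10))
for n, per_dataset in subsets.items():
    print(n, {name: len(sample) for name, sample in per_dataset.items()})
```

Because every dataset contributes exactly `n` reviews at each step, differences in performance at a given `n` reflect the domain and the framework rather than the amount of training data.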

Fig. 6: AP, AUC, and Accuracy for cross-dataset experiment on different frameworks.

Fig. 6 presents the cross-dataset performance of all frameworks on both datasets. Generally, the performance of GANgster remains consistent across the number of samples for the Yelp dataset. On the other hand, the performance of GANgster improves on TripAdvisor given more samples from the training set. The fluctuation of the other frameworks reflects their sensitivity to the training data proportion. The robustness of all frameworks decays when the number of reviews in the training dataset is small; however, the performance of GANgster changes only imperceptibly compared to the other frameworks.

V Conclusion

This work proposed GANgster, a regulated GAN with one generator and dual discriminators for fraud review detection that is capable of making use of both the review text and metadata, such as review rating scores. Information gain maximization between the score and the generated rated review is the basic idea behind a new loss function, which not only stabilises the GAN, addressing the slow convergence issue, but also guides the generator to automatically produce more human-like bot reviews. On the Yelp dataset, GANgster produced an AUC of 88.78% and an AP of 65.16%, a significant improvement over what FakeGAN [1] and NetSpam [31] have reported. Future work will focus on fraud review detection that combines text features with other features, and will investigate how other word embedding methods such as Word2Vec, ELMo, and BERT affect the performance. This can also help in acquiring a joint representation of both text and metadata.


  • [1] H. Aghakhani, A. Machiry, S. Nilizadeh, C. Kruegel, and G. Vigna (2018) Detecting deceptive reviews using generative adversarial networks. In 2018 IEEE Security and Privacy Workshops (SPW), San Francisco, USA, pp. 89–95. Cited by: §I, §I, §II-B, §III-A3, §IV-D2, §V.
  • [2] A. Mukherjee, B. Liu, and N. Glance (2012) Spotting fake reviewer groups in consumer reviews. In Proceedings of the 21st international conference on World Wide Web, Lyon, France, pp. 191–200. Cited by: §II-A1.
  • [3] D. Barber and F. Agakov (2003) The IM algorithm: a variational approach to information maximization. In Proceedings of the 16th International Conference on Neural Information Processing Systems, NIPS’03, Cambridge, MA, USA, pp. 201–208. External Links: Link Cited by: §III-C.
  • [4] X. Chang and J. Zhang (2015) Combating product review spam campaigns via multiple heterogeneous pairwise features. In Proceedings of the 2015 SIAM International Conference on Data Mining, Vancouver, British Columbia, Canada, pp. 172–180. Cited by: §II-A1.
  • [5] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel (2016) InfoGAN: interpretable representation learning by information maximizing generative adversarial nets. In Proceedings of the 30th International Conference on Neural Information Processing Systems, NIPS’16, Barcelona, Spain, pp. 2180–2188. External Links: ISBN 978-1-5108-3881-9, Link Cited by: §I, §I, §III-B, §III-C.
  • [6] N. J. Conroy, V. L. Rubin, and Y. Chen (2015) Automatic deception detection: methods for finding fake news. In Proceedings of the 78th ASIS&T Annual Meeting: Information Science with Impact: Research in and for the Community, ASIST ’15, Silver Springs, MD, USA, pp. 82:1–82:4. External Links: ISBN 0-87715-547-X, Link Cited by: §I.
  • [7] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019-06) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 4171–4186. External Links: Link, Document Cited by: §IV-B.
  • [8] D. Tang, B. Qin, and T. Liu (2015) Document modeling with gated recurrent neural network for sentiment classification. In Proceedings of the 2015 conference on empirical methods in natural language processing, Lisbon, Portugal, pp. 1422–1432. Cited by: §II-A2.
  • [9] E.-P. Lim, V. Nguyen, N. Jindal, B. Liu, and H. W. Lauw (2010) Detecting product review spammers using rating behaviors. In Proceedings of the 19th ACM international conference on Information and knowledge management, Toronto, ON, Canada, pp. 939–948. Cited by: §II-A1.
  • [10] M. Gong, Y. Gao, Y. Xie, and A. K. Qin (2020) An attention-based unsupervised adversarial model for movie review spam detection. IEEE Transactions on Multimedia, pp. 1–1. Cited by: §I.
  • [11] I. Goodfellow, Y. Bengio, and A. Courville (2016) Deep learning. The MIT Press. External Links: ISBN 0262035618, 9780262035613 Cited by: §III-D1.
  • [12] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS’14, Cambridge, MA, USA, pp. 2672–2680. External Links: Link Cited by: §II-B.
  • [13] I. Goodfellow, J. Shlens, and C. Szegedy (2015) Explaining and harnessing adversarial examples. In International Conference on Learning Representations, San Diego, CA, USA, pp. 15–23. External Links: Link Cited by: §I, §I.
  • [14] N. Jindal and B. Liu (2008) Opinion spam and analysis. In Proceedings of the 2008 international conference on web search and data mining, Palo Alto, California, USA, pp. 219–230. Cited by: §I, §II-A1.
  • [15] S. Khan, H. Rahmani, S. A. A. Shah, M. Bennamoun, G. Medioni, and S. Dickinson (2018) A guide to convolutional neural networks for computer vision. Cited by: §III-D2.
  • [16] S. Lai, L. Xu, K. Liu, and J. Zhao (2015) Recurrent convolutional neural networks for text classification. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, AAAI’15, pp. 2267–2273. External Links: ISBN 0-262-51129-0, Link Cited by: §IV-B.
  • [17] K. Lee and S. Webb (2014) The dark side of micro-task marketplaces: Characterizing Fiverr and automatically detecting crowdturfing. In Proc. of ICWSM, Ann Arbor, MI, USA, pp. 115–125. Cited by: §I.
  • [18] F. Li, M. Huang, Y. Yang, and X. Zhu (2011) Learning to identify review spam. In Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence - Volume Three, IJCAI’11, Vol. 3, Barcelona, Catalonia, Spain, pp. 2488–2493. External Links: ISBN 978-1-57735-515-1, Link, Document Cited by: §II-A1.
  • [19] X. Liang, Z. Hu, H. Zhang, C. Gan, and E. P. Xing (2017) Recurrent topic-transition GAN for visual paragraph generation. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, pp. 3362–3371. Cited by: §II-B.
  • [20] M. Luca and G. Zervas (2016) Fake it till you make it: reputation, competition, and yelp review fraud. Management Science 62 (12), pp. 3412–3427. Cited by: §I.
  • [21] A. J. Minnich, N. Chavoshi, A. Mueen, S. Luan, and M. Faloutsos (2015) Trueview: Harnessing the power of multiple review sites. In Proceedings of the 24th International Conference on World Wide Web, Florence, Italy, pp. 787–797. Cited by: §II-A1.
  • [22] M. Motoyama, D. McCoy, K. Levchenko, S. Savage, and G. M. Voelker (2011) Dirty jobs: The role of freelance labor in web service abuse. In Proc. of SEC, Berkeley, CA, USA, pp. 14–14. Cited by: §I.
  • [23] A. Mukherjee, V. Venkataraman, B. Liu, and N. S. Glance (2013) What Yelp fake review filter might be doing? In ICWSM, Ann Arbor, MI, USA, pp. 134–144. Cited by: §I, §II-A1, §IV-A.
  • [24] M. Ott, C. Cardie, and J. Hancock (2012) Estimating the prevalence of deception in online review communities. In Proceedings of the 21st international conference on World Wide Web, New York, NY, USA, pp. 201–210. Cited by: §II-A1.
  • [25] M. Ott, Y. Choi, C. Cardie, and J. T. Hancock (2011) Finding deceptive opinion spam by any stretch of the imagination. In Proceedings of the 49th annual meeting of the association for computational linguistics: Human language technologies-volume 1, Stroudsburg, PA, USA, pp. 309–319. Cited by: §I, §II-B.
  • [26] J. Pennington, R. Socher, and C. Manning (2014) Glove: global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, pp. 1532–1543. External Links: Link Cited by: §IV-B.
  • [27] M. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer (2018-06) Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), New Orleans, Louisiana, pp. 2227–2237. External Links: Link, Document Cited by: §IV-B.
  • [28] Q. Le and T. Mikolov (2014) Distributed representations of sentences and documents. In International Conference on Machine Learning, Beijing, China, pp. 1188–1196. Cited by: §I, §II-A2.
  • [29] S. Jia, X. Zhang, X. Wang, and Y. Liu (2018) Fake reviews detection based on LDA. In 4th International Conference on Information Management (ICIM), Oxford, UK, pp. 280–283. Cited by: §II-A2.
  • [30] S. Rayana and L. Akoglu (2015) Collective opinion spam detection: Bridging review networks and metadata. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, USA, pp. 985–994. Cited by: §I, §II-A1.
  • [31] S. Shehnepoor, M. Salehi, R. Farahbakhsh, and N. Crespi (2017) NetSpam: A network-based spam detection framework for reviews in online social media. IEEE Transactions on Information Forensics and Security 12 (7), pp. 1585–1595. Cited by: §I, §II-A1, §IV-D1, §V.
  • [32] K. Shu, A. Sliva, S. Wang, J. Tang, and H. Liu (2017-09) Fake news detection on social media: a data mining perspective. SIGKDD Explor. Newsl. 19 (1), pp. 22–36. External Links: ISSN 1931-0145, Link, Document Cited by: §I.
  • [33] F. Song, R. Banerjee, and Y. Choi (2012) Syntactic stylometry for deception detection. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers, Stroudsburg, PA, USA, pp. 171–175. Cited by: §II-A1.
  • [34] R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour (1999) Policy gradient methods for reinforcement learning with function approximation. In Proceedings of the 12th International Conference on Neural Information Processing Systems, NIPS’99, Cambridge, MA, USA, pp. 1057–1063. External Links: Link Cited by: §III-E2.
  • [35] G. Wang, C. Wilson, X. Zhao, Y. Zhu, M. Mohanlal, H. Zheng, and B. Y. Zhao (2012) Serf and turf: crowdturfing for fun and profit. In Proc. of WWW, Lyon, France, pp. 679–688. Cited by: §I.
  • [36] W. Y. Wang (2017) “Liar, liar pants on fire”: a new benchmark dataset for fake news detection. In ACL (2), R. Barzilay and M. Kan (Eds.), pp. 422–426. External Links: ISBN 978-1-945626-76-0, Link Cited by: §I.
  • [37] Y. Ren and Y. Zhang (2016) Deceptive opinion spam detection using neural network. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, New York, NY, USA, pp. 140–150. Cited by: §II-A2.
  • [38] Z. Yang, Z. Dai, Y. Yang, J. G. Carbonell, R. Salakhutdinov, and Q. V. Le (2019) XLNet: generalized autoregressive pretraining for language understanding. CoRR abs/1906.08237. External Links: Link, 1906.08237 Cited by: §IV-B.
  • [39] X. Zhang, J. Zhao, and Y. LeCun (2015) Character-level convolutional networks for text classification. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 1, NIPS’15, Cambridge, MA, USA, pp. 649–657. External Links: Link Cited by: §IV-B.
  • [40] X. Zhou, R. Zafarani, K. Shu, and H. Liu (2019) Fake news: fundamental theories, detection strategies and challenges. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, WSDM ’19, New York, NY, USA, pp. 836–837. External Links: ISBN 978-1-4503-5940-5, Link, Document Cited by: §I.