Introduction
In the era of e-commerce, consumers typically buy products or services based on reviews. Reviews are therefore increasingly valuable to sellers and service providers, given the benefits of positive reviews and the damage caused by negative ones. In light of this, fake reviews are flourishing and pose a real threat to the proper conduct of e-commerce platforms. A group of reviewers (we will call them opinion spammers) post fake reviews to promote their own products or demote their competitors’ products.
Fake reviews are often written by experienced professionals who are paid to write high-quality, believable reviews. Detecting opinion fraud is a non-trivial and challenging problem that has been extensively studied in the literature using various approaches. Some approaches are based solely on the review content jindal2008opinion; lietal2013topicspam; ottetal2011finding, others on reviewer behavior lim2010detecting; mukherjee2013yelp, and others on the tripartite relationships between reviewers, reviews, and products 10.1145/2187836.2187863; rayana2015collective; rayana2016collective; wang2011review; 10.1007/s1011501710687. While each paper presented a method that is useful to some extent for detecting certain kinds of spamming activity, there is no one-size-fits-all solution. This is because spammers keep changing their strategies, often adaptively, in response to spam detection policies. Therefore, there is a need to study and incorporate as many approaches as possible.
Another challenge is the fact that most datasets are imbalanced, as are the three datasets that we used for evaluation, where only about 20% of the users are spammers (Table 1).
Despite the vast commercial impact of opinion spam detection, most machine-learning-based solutions for this problem do not achieve very high performance, due to insufficient labeled data to properly train an ML model. In addition, standard ML models often treat each sample separately, disregarding the underlying graph structure of the spammer group. Indeed, this graph is often latent and must be derived from the data in an ad-hoc manner. One can use deep graph learning methodology to automatically embed the users, and thus infer the underlying graph structure, but such an approach requires a large training set, which is often not available or costly to obtain.
The setting where only part of the data is labeled is often called “semi-supervised learning”. When very few examples are labeled, this setting is also termed “few-shot learning”. When, in addition, one may choose which set of users will be labeled (given a budget, one decides for which users to invest the budget in order to acquire their labels), this is called the active-learning setting. In this paper we study the few-shot active learning setting for the opinion spam detection problem.
The few-shot active learning setting is very natural for the opinion-spam detection problem: labeled data never comes for free, operators of e-commerce websites de facto choose which users they want to check manually and label, and typically only a small fraction of the users are labeled.
Our Contribution
We propose a classification algorithm, Clique Reviewer Spammer Detection Network (CRSDnet), for detecting fake reviews in the few-shot active learning setting. CRSDnet harnesses the power of both machine learning algorithms and classical graphical-model algorithms such as Belief Propagation. We show that this combination yields better performance than either approach separately.
We evaluate our algorithm on a gold-standard dataset for the spammer detection task, the Yelp Challenge Data YelpData. The performance of our algorithm is better than all previous work in almost every metric. We also outperform other methods that use the graph structure, such as graph embedding and graph neural network algorithms liu2020alleviating; wang2019fdgars; 9435380. These algorithms use much more labeled data for training (30% and over, compared to at most 2.5% in our case).
We show that using both machine learning and the graph structure (via label-propagation algorithms) improves over standalone machine learning by at least 10% in the AUC measure.
We attribute the success of our algorithm to the following key innovations in our approach:

We derive from the raw data a user–user graph, where two users share an edge if they wrote a review for the same item. Figure 1 illustrates this graph, which by construction is a union of interconnected cliques, one per product. In previous work a tripartite user–review–product graph was used.

The user–user graph may be much denser than the tripartite user–review–product graph. To overcome the computational issues that such density entails, we design a careful edge-sparsification procedure that speeds up the algorithm without compromising performance much. The sparsification is guided by the rule that each node should end up with just enough edges connecting it to nodes of its own class (spammer or not) and of the opposite class.

We run a label-propagation algorithm (concretely, Belief Propagation), but some parameters of that algorithm (the node and edge potentials) are determined using a machine learning model. To the best of our knowledge, this is the first time such a combination of approaches has been undertaken and its usefulness demonstrated.

We propose a new way of choosing the set of users whose labels will be obtained (active learning). Instead of randomly choosing a set of users up to the allowed budget, we choose random users from the largest clique of the user–user graph. The intuition behind this rule comes from the work of Wang et al. wang2020collueagle, where it was shown that collusive spamming (or co-spamming) is a useful lens for identifying spammers.
Related Work
Opinion spam detection has different nuances, such as fake review detection jindal2008opinion; ottetal2011finding; 10.1145/2187980.2188164; 10.1145/1281192.1281280, fake reviewer detection lim2010detecting; wang2011review; 10.1145/2505515.2505700, and spammer group detection 10.1145/2187836.2187863; 10.1007/9783319235288_17; wang2020collueagle; 10.1007/s1011501710687; 10.1007/s1048901811421. Two survey papers, Crawford2015SurveyOR; VivianiSurvey, provide a broad perspective on the field.
Our method belongs to the family of graph-based models, which take into account the relationships among reviewers, comments, and products. The key algorithm in this approach is Belief Propagation (BP) pearl2014probabilistic, applied to a carefully designed graph and the Markov Random Field (MRF) associated with it. The first to use this approach were Akoglu et al. akoglu2013opinion, who suggested FraudEagle, a BP-based algorithm that runs on the bipartite reviewer–product graph, where the edge potentials are based on the sentiment of the review. In later work, Rayana et al. rayana2015collective introduced SpEagle, where node and edge potentials are derived from a richer set of metadata features, improving significantly over the performance of FraudEagle. Wang et al. wang2011review consider the tripartite user–review–product network and define scores for the trustiness of users, honesty of reviews, and reliability of products. They use an ad-hoc iterative procedure to compute the scores, rather than BP. In 7865975 an algorithm called NetSpam was introduced, which utilizes spam features to model review datasets as heterogeneous information networks.
A graph-based approach was also suggested in fei2013exploiting, but there the graph contains edges for reviews that were written within a certain time difference of each other (a “burst”). The authors use a different dataset, reviews from Amazon.com, to evaluate their method.
The authors of wang2020collueagle present ColluEagle, a graphbased algorithm to detect both review spam campaigns and collusive review spammers. To measure the collusiveness of pairs of reviewers, they identify reviewers that review the same product in a similar way.
A different approach to the spammer detection problem is via Graph Neural Networks (GNNs) and Graph Convolutional Networks (GCNs). These are deep learning architectures for graph-structured data, whose core idea is to learn node representations from local neighborhoods kipf2016semi. In liu2020alleviating, the authors design a new GNN framework, GraphConsis, to tackle the fraud detection task. They evaluate the method on four datasets, one of which we use as well. GraphConsis is benchmarked on different training set sizes, from 40% to 80% of the data. With a training set of 80%, GraphConsis achieves an AUC of 0.742 on the Yelp Chicago dataset, whereas our method achieves 0.754 with only 2.5% (Table 8). Note that GraphConsis uses all the metadata that we use as well (the graph structure and the reviews). A GCN-based algorithm was designed in 10.1145/3308560.3316586 and tested on reviews from Tencent Inc. The algorithm outperformed four baseline algorithms (Logistic Regression, Random Forest, DeepWalk perozzi2014deepwalk, LINE tang2015line).

Our work departs from previous work in several ways. First, compared to works where label-propagation algorithms were used, we use machine learning to predict the edge and node potentials rather than hand-crafted threshold functions. Second, we consider the user–user graph and not a bi/tripartite user–review–product graph. To overcome the computational challenge incurred by the density of the user–user graph, we apply a new rule for edge sparsification, based on the ML prediction. Finally, in the active learning setting, we introduce a new sampling rule. All these modifications have led to an improvement over FraudEagle akoglu2013opinion and SpEagle rayana2015collective.
Active Learning:
The active learning approach aims to achieve high accuracy using few queries, and therefore the “most informative” points are natural candidates for label acquisition. Various heuristics have been proposed to determine the “most informative” nodes, e.g., uncertainty sampling lewis1994heterogeneous; culotta2005reducing; settles2008analysis and variance reduction flaherty2006robust; schein2007active. In our setting, we chose a rule that is native to the problem itself: sampling from the largest clique, following the take-home message of wang2020collueagle about collusive spamming.

Many works on active learning choose the training set using adaptive rules, point by point. This, however, is infeasible in our case, as rerunning the entire pipeline for every new example is computationally prohibitive for datasets as large as ours. Therefore we choose all users for labeling in bulk.
Methodology
In this section we describe our pipeline, end to end. The flow chart is depicted in Figure 2.
We formulate the spam detection problem as a classification task on the user network. The dataset consists of a set of reviewers \(U\) who write reviews on products from a set \(P\). The vertex set of the graph \(G\) is the set of users (reviewers); users \(u\) and \(v\) share an undirected edge if there exists some product \(p\) such that both \(u\) and \(v\) wrote a review for \(p\). The resulting graph consists of interconnected cliques, each corresponding to a different product. Figure 1 illustrates such a network. Each node \(u\) has in addition a vector of features \(x_u\) associated with it, and a binary class variable \(y_u \in \{+1, -1\}\), for spammer (\(+1\)) or benign (\(-1\)).

The classification task is: given the graph (with the nodes’ features), and possibly a set of labeled users (the “few shots” training set), predict the value of \(y_u\) for the remaining nodes (the test set).

Ideally, to solve the classification task, we would find an assignment \(y = (y_u)_{u \in U}\) that maximizes

\[ \Pr\bigl[\, y \mid G, \{x_u\}_{u \in U} \,\bigr] \tag{1} \]

This Maximum Likelihood Estimation (MLE) task is in general NP-hard. However, in practice, a useful solution (perhaps not the maximizer) may be obtained by using a Markov Random Field model for the probability space.
Markov Random Field
A Markov Random Field (MRF) is often used to model a set of random variables having a Markov property described by an undirected dependency graph. A pairwise MRF (pMRF) is an MRF satisfying the pairwise Markov property: a random variable depends only on its neighbors and is independent of all other variables.
A pMRF model involves two types of potentials: node potentials and edge potentials. Our node potential, \(\phi_u\), stands for the probability that reviewer \(u\) belongs to either class (spam/benign):

\[ \phi_u(y_u) = \Pr[\, Y_u = y_u \,], \qquad y_u \in \{+1, -1\} \tag{2} \]

The edge potential signifies the affinity of reviewers \(u\) and \(v\), namely, the probability that both belong to the same class. Formally,

\[ \psi_{uv}(y_u, y_v) = \begin{cases} p_{uv} & \text{if } y_u = y_v \\ 1 - p_{uv} & \text{if } y_u \neq y_v \end{cases} \tag{3} \]

The parameters \(\phi_u\) and \(p_{uv}\) satisfy \(\phi_u(+1) + \phi_u(-1) = 1\) and \(0 \le p_{uv} \le 1\) for all \(u, v\). To determine the values of these parameters we use machine learning applied to features extracted from the metadata of the reviewers dataset.

The pMRF model is used to approximate the expression for \(\Pr[y]\) in Eq. (1):

\[ \Pr[y] = \frac{1}{Z} \prod_{u \in U} \phi_u(y_u) \prod_{(u,v) \in E} \psi_{uv}(y_u, y_v) \tag{4} \]

where \(Z\) is a normalization factor: the sum of the unnormalized products over all possible assignments \(y\).
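For intuition, the joint distribution of Eq. (4) can be evaluated by brute force on a toy graph (feasible only for a handful of nodes, since \(Z\) sums over \(2^{|U|}\) assignments). The graph and potential values below are made up for illustration, not taken from the paper:

```python
from itertools import product

# Toy user-user graph: 3 users, labels +1 (spammer) / -1 (benign).
nodes = [0, 1, 2]
edges = [(0, 1), (1, 2)]

# Node potentials phi_u(y): illustrative priors.
phi = {0: {+1: 0.8, -1: 0.2}, 1: {+1: 0.5, -1: 0.5}, 2: {+1: 0.3, -1: 0.7}}

# Edge potentials (Eq. 3): p_uv when labels agree, 1 - p_uv otherwise.
p_uv = {(0, 1): 0.9, (1, 2): 0.9}

def unnormalized(assign):
    """Product of node and edge potentials for one assignment (Eq. 4 numerator)."""
    w = 1.0
    for u in nodes:
        w *= phi[u][assign[u]]
    for (u, v) in edges:
        agree = assign[u] == assign[v]
        w *= p_uv[(u, v)] if agree else 1 - p_uv[(u, v)]
    return w

# Normalization factor Z: sum over all 2^|U| assignments.
assignments = [dict(zip(nodes, ys)) for ys in product([+1, -1], repeat=len(nodes))]
Z = sum(unnormalized(a) for a in assignments)

# The MAP assignment maximizes Eq. (4); here node 0's strong spammer
# prior pulls its agreeing neighbors toward the spammer label as well.
best = max(assignments, key=unnormalized)
```

This exhaustive computation is exactly what LBP approximates on graphs where the \(2^{|U|}\) sum is out of reach.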
Finding the assignment that maximizes the probability in Eq. (4) is still an intractable problem; LBP (loopy belief propagation) is the go-to heuristic for approximating this intractable maximization problem.
The LBP algorithm pearl2014probabilistic is based on iterative message passing along the edges of the graph. Messages are initialized according to some user-defined rule. At iteration \(t\), a message \(m^{(t)}_{u \to v}\) is sent from node \(u\) to each neighboring node \(v\). The message represents the belief of \(u\) about the label of \(v\). If \(G\) is a tree, then BP is guaranteed to converge; if \(G\) contains cycles then convergence is not guaranteed (hence the name loopy), but in practice a cap on the number of iterations is set. We use the standard LBP messages, omitted for brevity; they can be found in pearl2014probabilistic.
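The message-passing scheme just described can be sketched as follows. This is a minimal sum-product LBP with normalized messages and a fixed iteration cap; the binary-label potentials follow Eq. (3), but the interface and graph are illustrative, not the paper's implementation (which is in the Wolfram Language):

```python
def loopy_bp(nodes, edges, phi, p_uv, iters=30):
    """Sum-product loopy BP over binary labels {+1, -1}.
    phi[u][y] is the node potential; the edge potential is p_uv[e] when the
    endpoint labels agree and 1 - p_uv[e] otherwise (as in Eq. 3).
    Returns approximate marginal beliefs beliefs[u][y]."""
    labels = (+1, -1)
    nbrs = {u: set() for u in nodes}
    und = {}  # canonical undirected key for each edge
    for u, v in edges:
        nbrs[u].add(v)
        nbrs[v].add(u)
        und[(u, v)] = und[(v, u)] = (u, v)
    # Directed messages m_{u->v}(y): start uniform.
    msgs = {(u, v): {y: 0.5 for y in labels} for u in nodes for v in nbrs[u]}
    for _ in range(iters):
        new = {}
        for (u, v) in msgs:
            out = {}
            for yv in labels:
                total = 0.0
                for yu in labels:
                    p = p_uv[und[(u, v)]]
                    psi = p if yu == yv else 1 - p
                    prod = phi[u][yu] * psi
                    # Incoming messages to u, excluding the one from v.
                    for w in nbrs[u] - {v}:
                        prod *= msgs[(w, u)][yu]
                    total += prod
                out[yv] = total
            z = sum(out.values())
            new[(u, v)] = {y: out[y] / z for y in labels}
        msgs = new  # synchronous update
    # Final beliefs: node potential times all incoming messages, normalized.
    beliefs = {}
    for u in nodes:
        b = {y: phi[u][y] for y in labels}
        for w in nbrs[u]:
            for y in labels:
                b[y] *= msgs[(w, u)][y]
        z = sum(b.values())
        beliefs[u] = {y: b[y] / z for y in labels}
    return beliefs
```

On a tree, such as a 3-node chain, these beliefs coincide with the exact marginals; on the loopy user–user graph they are only an approximation.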
Each iteration of LBP takes \(O(|E|)\) time; hence the number of edges, which may be quadratic in the number of users, plays a key role in the computational complexity. The more iterations one can perform within the same time budget, the better the performance. In the next section, we describe how to address this computational aspect using graph sparsification.
Running LBP on a Sparse Graph
Recall that our graph is defined over users, and not as a tripartite user–product–review graph. This may result in a rather dense graph, which poses a computational impediment even for LBP when the number of nodes is large. For example, the graph created from the Yelp Chicago dataset is very dense (average degree 1193). Therefore our first step is to sparsify the graph by choosing a linear number of “useful” edges (linear in the number of nodes).
To gain intuition about a useful way of sparsifying, we conducted the following experiment using the Chicago Yelp data rayana2015collective. The initial graph contains 38,063 nodes (Table 1) and, with an average degree of 1193, over 22 million edges. The sparsification procedure is parameterized by two numbers, \(k_1\) and \(k_2\). For every node \(u\) we choose \(k_1\) neighbors from \(u\)'s class (spammer or benign) and \(k_2\) neighbors from the opposite class, and color these edges red. We then remove all edges that were not colored red. The resulting graph has an average degree of at most \(2(k_1 + k_2)\) (multiple edges are merged). We set all node potentials to the uninformative \((1/2, 1/2)\); in other words, we provide no prior knowledge about the class of a node. We set all edge potentials as follows: \(p_{uv} = 1/2 + \epsilon\) if \(y_u = y_v\) and \(p_{uv} = 1/2 - \epsilon\) if \(y_u \neq y_v\) (that is, according to the true agreement relationship between users \(u\) and \(v\)), for a fixed small \(\epsilon > 0\).
We run LBP on the resulting graph, and label each node as a spammer if the probability that LBP assigns it exceeds a predefined threshold.
Figure 3 depicts the AUC of the LBP classification when varying the two sparsification parameters between 0 and 15. We see that the AUC approaches 1 when each node has “enough” neighbors from each class.
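The oracle sparsification used in this diagnostic experiment can be sketched as follows. The parameter names (`k_same`, `k_diff`) are ours, and the sketch assumes access to ground-truth labels, exactly as the experiment does:

```python
import random

def sparsify_by_class(adj, labels, k_same, k_diff, seed=0):
    """Keep, for every node, up to k_same random neighbors of its own class
    and up to k_diff of the opposite class; drop all uncolored edges.
    adj: dict node -> set of neighbors; labels: dict node -> +1/-1.
    Returns the kept edge set as (min, max) pairs (duplicate picks merge)."""
    rng = random.Random(seed)
    kept = set()
    for u, nb in adj.items():
        same = [v for v in nb if labels[v] == labels[u]]
        diff = [v for v in nb if labels[v] != labels[u]]
        for pool, k in ((same, k_same), (diff, k_diff)):
            for v in rng.sample(pool, min(k, len(pool))):
                kept.add((min(u, v), max(u, v)))
    return kept
```

Since each node colors at most `k_same + k_diff` edges, the result has at most \(n(k_1+k_2)\) edges, i.e., average degree at most \(2(k_1+k_2)\).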
Sparsification and Edge Potentials
In practice, we have the true labels of a small set of nodes, and we use them to learn and predict the edge potentials between the remaining unlabeled nodes. We use this prediction in a sparsification procedure similar to the one just described. Our sparsification proceeds as follows:
(1) The first step is to train a machine learning algorithm on the set of users whose labels we know, with the objective of predicting \(p_{uv}\), the probability that a pair of reviewers belongs to the same class (either both spammers or both benign). The exact choice of ML algorithm, along with its parameters, is explained in the Experimental Setting section. (2) Compute \(p_{uv}\) using the ML model for all the remaining edges of the graph (edges that do not connect two users from the training set). (3) Choose all edges for which the predicted \(p_{uv}\) is confidently close to 1 or to 0 (above or below fixed thresholds), and set the potential in Eq. (3) accordingly. We call these edges the “trusted” edges. LBP will be run on the graph containing the trusted edges.
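Step (3), selecting the “trusted” edges from the ML-predicted same-class probabilities, might look like the sketch below. The threshold values `hi` and `lo` are placeholders; the paper does not state its exact cutoffs:

```python
def trusted_edges(p_same, hi=0.9, lo=0.1):
    """Keep only edges whose predicted same-class probability p_uv is
    confidently high or confidently low; these set the potential in Eq. (3).
    p_same: dict edge -> predicted probability that both endpoints share a class."""
    trusted = {}
    for edge, p in p_same.items():
        if p >= hi or p <= lo:
            trusted[edge] = p  # p_uv plugs directly into the edge potential
    return trusted
```

LBP then runs only on this sparse, high-confidence subgraph.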
The input to the machine learning algorithm in steps (1) and (2) is a set of features that is extracted from the metadata of both the users and the reviews. Typical metadata includes the text of the review, the rating that the reviewer gave the product, the total number of reviews that the reviewer wrote, etc. The Yelp set of features is described in Tables 2 and 3.
Additional sparsification of the graph can be obtained by removing edges that connect users whose reviews of the same product were written far apart in time. Namely, two users \(u\) and \(v\) share an edge only if they wrote a review for the same product, and these reviews were written within a fixed window of days of each other (the window size follows previous work). Such a graph with time-dependent edges is called a bursty graph and was introduced in fei2013exploiting. We tested our pipeline with and without the bursty variant. The bursty sparsification, if applied, is done before the trusted edges are selected.
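The bursty pre-filter can be sketched as follows, with review timestamps given as day numbers. The window `delta` is a parameter; no particular value is asserted here:

```python
from collections import defaultdict

def bursty_edges(reviews, delta):
    """reviews: list of (user, product, day) triples. Return the user-user
    edges whose co-reviews of the same product are at most delta days apart."""
    by_product = defaultdict(list)
    for user, prod, day in reviews:
        by_product[prod].append((user, day))
    edges = set()
    for entries in by_product.values():
        for i, (u, du) in enumerate(entries):
            for v, dv in entries[i + 1:]:
                if u != v and abs(du - dv) <= delta:
                    edges.add((min(u, v), max(u, v)))
    return edges
```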
Node Potential
In the experiment just described, all node potentials were set to the uninformative \((1/2, 1/2)\), and only edge potentials played a role. However, there may be a gain in setting the node potentials according to the metadata features rather than ignoring them.
Similar to the way we set the edge potentials, we use machine learning to predict \(\phi_u\) in Eq. (2). The machine learning algorithm is trained on the set of users chosen for labeling (the active learning setting) with the objective of predicting spam or benign. The value of \(\phi_u\) is then predicted for all the remaining users using the trained model (which outputs the “probability” of being a spammer or benign, alongside the discrete label).
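Turning a trained classifier's probability outputs into node potentials is then straightforward. The `predict_proba` interface below is illustrative (the paper uses a random forest in the Wolfram Language):

```python
def node_potentials(predict_proba, users, features):
    """predict_proba(x) -> probability of being a spammer, from a trained model.
    Returns phi[u] = {+1: P(spam), -1: P(benign)} as in Eq. (2)."""
    phi = {}
    for u in users:
        p_spam = predict_proba(features[u])
        phi[u] = {+1: p_spam, -1: 1.0 - p_spam}
    return phi
```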
Active learning: Sampling Users
The final component of our methodology is the way we choose the set of users for training. In this work we explore two sampling rules with the same labeling budget:

Random Sampling: Pick reviewers uniformly at random from the set of all users.

Sampling from the largest clique: In this strategy, we sample users that belong to the largest clique. The largest clique corresponds to the product with the largest number of reviews. If the budget is not exhausted, we sample the remainder from the second-largest clique, and so on.
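The clique-based sampling rule above can be sketched as follows: rank products by review count and draw labeled users from the largest cliques until the budget is spent (a user already drawn from a bigger clique is skipped):

```python
import random

def sample_from_largest_cliques(product_to_users, budget, seed=0):
    """Spend the labeling budget on users of the largest clique (the most-
    reviewed product), spilling over to the second largest, and so on."""
    rng = random.Random(seed)
    chosen, seen = [], set()
    for users in sorted(product_to_users.values(), key=len, reverse=True):
        pool = [u for u in users if u not in seen]
        rng.shuffle(pool)  # random users *within* the clique
        for u in pool:
            if len(chosen) == budget:
                return chosen
            chosen.append(u)
            seen.add(u)
    return chosen  # budget exceeds the total number of users
```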
Dataset | #Reviews (fake %) | #Users (spammer %) | #Products
Y’Chi | 67,395 (13.23%) | 38,063 (20.33%) | 201
Y’NYC | 359,052 (10.27%) | 160,225 (17.79%) | 923
Y’Zip | 608,598 (13.22%) | 260,277 (23.91%) | 5,044
Data Description
To evaluate our methodology we use three datasets that contain reviews from Yelp.com; summary statistics are presented in Table 1. The datasets contain reviews of restaurants and hotels and were collected by mukherjee2013yelp; rayana2015collective. YelpChi covers the Chicago area, YelpNYC covers NYC, and YelpZip, the largest, includes ratings and reviews for restaurants in a continuous region covering NJ, VT, CT, PA, and NY. They differ in size (YelpChi is the smallest and YelpZip the largest), as well as in the percentage of spammers out of the total number of users. Yelp has a filtering algorithm that identifies fake/suspicious reviews, and the three datasets contain these labels. We partition the users into spammers (authors of at least one filtered review) and benign users (authors with no filtered reviews).
Alongside the text of the reviews, the datasets contain additional metadata such as ratings and timestamps. From the text and the additional data, various features are extracted, as used in previous work that studied these datasets rayana2015collective; mukherjee2013yelp; lim2010detecting. Tables 2 and 3 include brief descriptions of these features. Most of them are self-explanatory, and hence we omit detailed explanations for brevity. Note that we used exactly the same set of features as rayana2015collective to allow a fair comparison.
The features are used to compute both the node and the edge potentials as explained in the Methodology section.
User Features  

MNR  Max. number of reviews written in a day mukherjee2013spotting; mukherjee2013yelp 
PR  Ratio of positive reviews (4–5 stars) mukherjee2013yelp 
NR  Ratio of negative reviews (1–2 stars) mukherjee2013yelp 
avgRD  Avg. rating deviation of user’s reviews mukherjee2013yelp; lim2010detecting; fei2013exploiting 
WRD  Weighted rating deviation lim2010detecting 
BST  Burstiness mukherjee2013yelp; fei2013exploiting (spammers are often shortterm members of the site). 
RL  Avg. review length in number of words mukherjee2013yelp 
ACS  Avg. content similarity—pairwise cosine similarity among user’s (product’s) reviews, where a review is represented as a bagofbigrams lim2010detecting; fei2013exploiting 
MCS  Max. content similarity—maximum cosine similarity among all review pairs mukherjee2013spotting 
Review Features  

Rank  Rank order among all the reviews of product jindal2008opinion 
RD  Absolute rating deviation from product’s average rating li2011learning 
EXT  Extremity of rating mukherjee2013spotting 
DEV  Thresholded rating deviation of review mukherjee2013spotting 
ETF  Early time frame mukherjee2013spotting (spammers often review early to increase impact) 
ISR  If the review is the user’s sole review, then 1, otherwise 0 rayana2015collective 
PCW  Percentage of ALL-capitals words li2011learning; jindal2008opinion 
PC  Percentage of capital letters li2011learning 
L  Review length in words li2011learning 
PP1  Ratio of 1st-person pronouns (‘I’, ‘my’, etc.) li2011learning 
RES  Ratio of exclamation sentences containing ‘!’ li2011learning 
Evaluation
In this section, we describe the results of the experiments we ran on the three Yelp datasets. We report our results and results obtained by previous work on the same datasets.
Evaluation Metrics
We evaluated the performance of CRSDnet using four popular metrics, which were also used in previous work: the Average Precision (AP), which is the area under the precision–recall curve; the ROC AUC; and precision@k. To compute precision@k we rank the reviewers according to the probability that LBP assigns to each one of being a spammer, and compute the fraction of real spammers among the top \(k\) places. We compute precision@k for values of \(k\) between 100 and 1000 (Table 7).
Finally, we use the Discounted Cumulative Gain (DCG@k), which provides a weighted score that favors correct spammer predictions at the top indices. Formally,

\[ \mathrm{DCG@}k = \sum_{i=1}^{k} \frac{rel_i}{\log_2(i+1)}, \]

where \(rel_i = 1\) if the user at the \(i\)-th place is a correctly identified spammer, and 0 otherwise. For compatibility with other works, we actually report the normalized \(\mathrm{NDCG@}k\), obtained by dividing \(\mathrm{DCG@}k\) by the ideal \(\mathrm{IDCG@}k\), which is the \(\mathrm{DCG@}k\) when all \(rel_i = 1\) (all top \(k\) places are indeed spammers in the ideal ranking).
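Both ranking metrics take the list of users sorted by their LBP spam probability and can be computed as:

```python
import math

def precision_at_k(ranked_is_spammer, k):
    """Fraction of true spammers among the top-k ranked users.
    ranked_is_spammer: list of 0/1 relevance flags in ranking order."""
    return sum(ranked_is_spammer[:k]) / k

def ndcg_at_k(ranked_is_spammer, k):
    """DCG@k with binary relevance and log2(rank + 1) discount,
    normalized by the ideal DCG (all top-k positions true spammers)."""
    dcg = sum(rel / math.log2(i + 2)  # i = 0 corresponds to rank 1
              for i, rel in enumerate(ranked_is_spammer[:k]))
    idcg = sum(1.0 / math.log2(i + 2) for i in range(k))
    return dcg / idcg
```

Note that NDCG rewards placing correct spammer predictions earlier: a hit at rank 1 counts more than the same hit at rank 2.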
Setting  Nodes  Edges  Sampling  Bursty 

1  ML  None  Random  No 
2  ML  Threshold  Random  No 
3  Threshold  ML  Random  No 
4  ML  ML  Random  No 
5  ML  ML  Clique  No 
6  ML  ML  Random  Yes 
7  ML  ML  Clique  Yes 
Method  Y’Chi  Y’NYC  Y’Zip  

SpEagle  0.691  0.657  0.671  
SpEagle+  0.708  0.683  0.691  
NetSpam  0.650  0.650  
Set. 1  0.519  0.602  0.620  0.632  0.561  0.578  0.558  0.585  0.566  0.593  0.672  0.693 
Set. 2  0.669  0.701  0.711  0.729  0.664  0.663  0.687  0.692  0.685  0.700  0.784  0.794 
Set. 3  0.688  0.699  0.702  0.702  0.659  0.677  0.691  0.692  0.562  0.708  0.779  0.831 
Set. 4  0.689  0.712  0.723  0.731  0.673  0.681  0.685  0.665  0.684  0.707  0.706  0.828 
Set. 5  0.718  0.724  0.735  0.754  0.669  0.688  0.720  0.766  0.703  0.729  0.790  0.848 
Set. 6  0.673  0.730  0.719  0.741  0.668  0.671  0.682  0.696  0.628  0.707  0.783  0.835 
Set. 7  0.672  0.694  0.662  0.668  0.666  0.666  0.645  0.645  0.631  0.726  0.763  0.824 
Method  Y’Chi  Y’NYC  Y’Zip  

SpEagle  0.396  0.348  0.424  
NetSpam  0.300  0.28  
Set. 1  0.874  0.802  0.896  0.852  0.902  0.913  0.917  0.916  0.885  0.896  0.901  0.886 
Set. 2  0.901  0.879  0.882  0.906  0.915  0.912  0.921  0.912  0.782  0.805  0.859  0.883 
Set. 3  0.901  0.793  0.897  0.883  0.912  0.919  0.924  0.924  0.794  0.845  0.923  0.941 
Set. 4  0.890  0.825  0.896  0.901  0.917  0.922  0.968  0.927  0.859  0.870  0.869  0.875 
Set. 5  0.909  0.906  0.900  0.914  0.875  0.727  0.925  0.926  0.903  0.906  0.935  0.942 
Set. 6  0.907  0.707  0.886  0.913  0.914  0.920  0.926  0.921  0.864  0.891  0.927  0.946 
Set. 7  0.885  0.873  0.886  0.883  0.914  0.920  0.948  0.914  0.899  0.833  0.847  0.870 
  Y’Zip  Y’NYC  Y’Chi 
k  FraudEagle  Wang  SpEagle  CRSDnet  FraudEagle  Wang  SpEagle  CRSDnet  FraudEagle  Wang  SpEagle  CRSDnet 
100  0.30  0.21  0.93  1  0.21  0.15  0.96  0.98  0.55  0.18  0.90  0.99 
200  0.30  0.19  0.81  1  0.19  0.19  0.96  0.91  0.52  0.18  0.91  0.99 
300  0.38  0.21  0.69  0.93  0.17  0.18  0.95  0.86  0.48  0.20  0.91  0.99 
400  0.33  0.26  0.61  0.80  0.21  0.17  0.95  0.86  0.49  0.20  0.92  0.99 
500  0.29  0.27  0.57  0.75  0.22  0.17  0.95  0.88  0.48  0.20  0.92  0.93 
600  0.28  0.27  0.56  0.74  0.27  0.17  0.96  0.89  0.47  0.21  0.92  0.90 
700  0.27  0.29  0.54  0.76  0.37  0.16  0.95  0.90  0.47  0.21  0.92  0.91 
800  0.26  0.30  0.51  0.73  0.45  0.16  0.90  0.91  0.49  0.22  0.91  0.92 
900  0.26  0.30  0.50  0.69  0.5  0.15  0.85  0.92  0.48  0.22  0.91  0.92 
1000  0.28  0.32  0.49  0.67  0.45  0.16  0.82  0.92  0.47  0.22  0.90  0.93 
Dataset  Method  AUC 

Y’Zip  DFraud (80%) 9435380  0.733 
Y’Zip  RF (30%)  0.740 
Y’Zip  CRSDnet (2.5%)  0.847 
Y’Chi  GraphConsis liu2020alleviating (80%)  0.742 
Y’Chi  RF (30%)  0.735 
Y’Chi  CRSDnet (2.5%)  0.754 
The Experimental Setting
There are four choices that affect the performance of CRSDnet: (a) the way the node potentials are computed, using ML or using a threshold function as in rayana2015collective; (b) the way the edge potentials are computed, again using ML or using a threshold function rayana2015collective; (c) the active-learning sampling rule, which specifies how to choose the users whose labels are revealed; (d) whether the time-dependent bursty sparsification is applied.
Table 4 summarizes the seven configurations with which we tested CRSDnet. Each configuration was tested with labeled sets of several sizes, each a small fraction of the entire set of users, and each experiment was run 10 times with fresh randomness.
The machine learning algorithm that we used to compute the edge and node potentials was a random forest, written in the Wolfram Language. The code is available on GitHub: https://github.com/users/KirilDan/projects/1. We chose this implementation since Wolfram has good support for graph structures, on which LBP can easily be run. All parameters of the random forest are the default ones except the following: 950 trees, a maximum tree depth of 16, and a cap on the fraction of features considered per split. The features that we used are the ones in Tables 2 and 3.
To measure the extent to which each new component of our pipeline is responsible for the improvement over previous results, we also ran our pipeline with edge and node potentials computed as in rayana2015collective. For completeness, we describe this method briefly. The method is completely unsupervised. A set of features is computed for every user and review. Let \(x_i(u)\) be the value of feature \(i\) for user \(u\). For every feature \(i\), the probability \(P(x_i(u))\) is estimated from the data. Finally, a “spam score” \(S(u)\) is computed for every user by aggregating the per-feature probabilities \(P(x_i(u))\) as in rayana2015collective (Eq. (5) there), and the node potential of reviewer \(u\) is set according to \(S(u)\). A similar procedure is carried out to determine the edge potentials.
Results
Tables 5, 6 and 7 present the results of running CRSDnet according to the aforementioned experimental settings, reporting the different evaluation metrics.
We compared the performance of CRSDnet to other algorithms that were evaluated on the same datasets: SpEagle rayana2015collective, FraudEagle akoglu2013opinion, NetSpam 7865975, Wang et al. wang2011review and ColluEagle wang2020collueagle. The results that we report are taken from the relevant papers and are not reproductions that we carried out.
Different papers report different metrics and for different budgets. Hence some cells in the tables are left empty. Some algorithms are completely missing from a table/plot if the paper did not report that metric at all.
Table 5 provides a comparison using the AUC measure. As evident from the table, our method is superior to all previous work. The most interesting comparison is with SpEagle+ rayana2015collective. That work reports results only for a budget of 1% of the users. For all three datasets, we already obtain a better result than SpEagle+ when using only 0.25% of the users (6% better for the Chicago dataset, 10% better for the NYC dataset, and 17% better for the ZIP dataset).
Table 6 reports the AP measure. Here the difference is even more dramatic: our results are two to three times better than SpEagle+ on all three datasets.
Table 7 reports the precision@k measure when using 1% of the users and our best configuration, #5. For the Chicago and ZIP datasets, our algorithm has the upper hand; for NYC, SpEagle+ outperforms CRSDnet for intermediate values of \(k\), but the difference is very small. When our algorithm outperforms the other competitors, it is by a very large margin in most cases.
Tables 5 and 6 suggest that the best way to run CRSDnet is according to configuration #5 in Table 4: both edge and node potentials are set using ML, and the budget is spent on users from the largest clique. Comparing settings 2 and 3 with setting 4, we see that using ML for both nodes and edges (setting 4) is preferable to using ML for only one of them (settings 2, 3). Settings 6 and 7 show that adding the time-dependent aspect, bursty edges, only harms performance. Configuration #1, using only ML applied to the users’ features, gave the worst performance in the AUC measure, by a large gap.
Figures 4, 5 and 6 plot the NDCG@k measure for \(k\) between 0 and 1000. Here, all five competing algorithms are represented. Again, CRSDnet outperforms all other algorithms for most values of \(k\), in all three datasets.
Table 8 shows a comparison with two GNN-based approaches, 9435380 and liu2020alleviating. These approaches use much more data for training (80%). As an additional baseline, we trained our Random Forest classifier, this time on 30% of the data, and predicted the reviewers’ class. As evident from the table, more data is no guarantee of better performance. One possible reason for the relatively poor performance of NN-based methods is that much more data (in absolute terms) is needed to successfully train the NN.
Conclusion
In this work, we proposed a new holistic framework called CRSDnet for detecting review spammers. Our method combines machine learning with a more classical algorithmic approach (Belief Propagation) to better exploit the relational data (user–review–product) and metadata (behavioral and text data) to detect such users. Adding to previous work in this line of research, we contribute two new components: using machine learning to predict the edge and node potentials, and a new sampling rule in the active learning setting (sampling users from the largest clique).
Our results suggest that the two components improve performance one on top of the other and, when combined, give the best results obtained so far for the Yelp datasets.
Another point that our work highlights is that while NN-based methods give the best results in many settings, this is highly contingent on having sufficient data for training. The spammer detection problem is exactly one of those problems where obtaining a lot of labeled data is expensive and non-trivial. Fake reviews are often written by professionals, and it takes an experienced person to identify them. Hence platforms like Amazon Mechanical Turk may not provide an easy solution to the data-shortage problem. In such cases, old-school algorithmic ideas become relevant again (Belief Propagation), and as we demonstrate in this paper, their performance may be boosted by incorporating ML in a suitable manner (computing potentials, in our case) alongside domain expertise (sampling from the largest clique, following the insight about collusive spamming wang2020collueagle).
One limitation of our work is that we only tested on one platform, Yelp. Future work should run our pipeline on other datasets, once they become publicly available (as far as we know, there is only one more publicly available dataset in English, from Amazon). Also, we only considered the problem of user classification. It would be interesting to extend our method to the task of review classification.