1 Introduction
Product reviews and ratings on e-commerce websites provide customers with detailed insights into various aspects of a product. Ratings allow customers to gauge the quality of the product as perceived by other customers who have bought it. Consequently, customers rely significantly on product reviews and ratings while making buying decisions on e-commerce platforms. Given their influence on customer spending, product reviews are a fertile ground for abuse.
A common form of abuse involves a seller running campaigns soliciting fake, genuine-looking positive reviews for their own products or fake negative reviews about their competitors' products. The paid reviewers have varied modi operandi. They can either create their own accounts and start posting paid reviews (fake reviews), or they can hijack an inactive account in good standing and post seemingly innocuous reviews from that account. There are also businesses/agencies which promise a fee for writing fake reviews on Amazon. Figure 1 shows a social media snippet (name anonymized) of a seller or an agency soliciting reviewers for a fee. To circumvent identification, paid reviewers also distribute the volume of fake reviews across multiple accounts.
In order to maintain customers' trust in the reviews and, in turn, in the e-commerce platform, it is imperative for e-commerce websites to ensure that the reviews remain sacrosanct. As a result, identifying and taking enforcement actions against fake reviews, paid reviewers and the underlying abusive sellers is a significant focus area within e-commerce companies.
In this paper, we broadly focus on the problem of detecting abusive entities (sellers and reviewers) within the product reviews ecosystem. We formalize the interactions between sellers, products and reviewers as a binary tensor. Tensors are multidimensional arrays [KB09] that capture multidimensional features effectively. We subsequently apply tensor decomposition techniques to identify dense cores. These dense cores are treated as anomalous interactions within a tensor whose interactions are otherwise predominantly uniformly distributed. Additionally, we use known abusive sellers and reviewers from the past as partial supervision to further enhance the model, with each form of abuse as a separate target signal. To summarize, our technical contributions are as follows:
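To make this formalization concrete, the sketch below builds a sparse binary tensor from toy (seller, reviewer, product, rating, week) tuples. The entity IDs, helper names and dict-of-indices representation are our own illustrative choices, not part of SENTINEL:

```python
import numpy as np

# Hypothetical review events: (seller, reviewer, product, rating, week).
# The IDs are purely illustrative, not from the real dataset.
events = [
    ("s1", "r1", "p1", 5, 0),
    ("s1", "r2", "p1", 5, 0),
    ("s2", "r1", "p2", 1, 1),
]

def build_binary_tensor(events):
    """Map each mode's values to contiguous indices; mark observed cells with 1."""
    modes = list(zip(*events))  # one tuple of values per mode
    index_maps = [{v: i for i, v in enumerate(sorted(set(m), key=str))}
                  for m in modes]
    shape = tuple(len(m) for m in index_maps)
    tensor = {}  # sparse dict-of-indices representation
    for ev in events:
        idx = tuple(index_maps[k][v] for k, v in enumerate(ev))
        tensor[idx] = 1
    return tensor, shape

tensor, shape = build_binary_tensor(events)
print(shape)        # one dimension per mode: sellers, reviewers, products, ratings, weeks
print(len(tensor))  # only observed relationships are stored; all other cells are 0
```

Only cells where a relationship exists carry a 1; everything else is implicitly 0, matching the extreme sparsity of real review tensors.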

We formulate the detection of abusive entities (sellers and reviewers), i.e., the task of identifying dense cores in the seller-reviewer bipartite graph, as a tensor decomposition problem.

We apply unsupervised Bayesian binary tensor decomposition to detect dense blocks in the seller-reviewer relationship. This is based on the Logistic CP tensor decomposition model.

We then develop a semi-supervised, binary multi-target extension to the unsupervised model (via Pólya-Gamma data augmentation) so that we can incorporate prior information about multiple forms of abuse to improve the detection of abusive entities. We call our proposed approach SENTINEL.

Finally, we develop stochastic partial natural gradient learning for the semi-supervised model and show that it empirically achieves faster convergence than stochastic gradient descent and EM with sufficient statistics.

We show the efficacy of the proposed approach as compared to state-of-the-art techniques for tensor decomposition and review abuse detection.
To the best of our knowledge, this is the first time that a) semi-supervised multi-target binary tensor decomposition has been applied to the problem of detecting fake entities in the review spam domain, and b) natural gradients have been used for inference within tensor decomposition. Although the paper focuses on the problem formulation and scientific aspects of SENTINEL, we have additionally developed an extensive platform that uses machine learning models to flag abusive entities. The platform allows a combination of automated as well as manual enforcement. The feedback from enforcement is channeled back into the platform to further enhance SENTINEL.
The rest of this paper is organized as follows. Section 2 introduces related work as well as some background regarding our application of tensor decomposition to detecting abuse in the seller-reviewer relationship. Section 3 describes the baseline unsupervised binary tensor decomposition technique. Section 4 describes SENTINEL, encapsulating the semi-supervised multi-target extensions. Section 5 describes our proposed stochastic partial natural gradient learning for the inference of all the latent parameters of the semi-supervised model. Experimental results are shown in Section 6.
2 Related Work and Background
The problem of finding fake reviewers on online e-commerce platforms has received a lot of attention recently. Jindal et al. [JL07] were among the first to show that review spam exists and proposed simple text-based features to classify fake reviewers.
Identifying Abusive Reviews: Abusive reviewers have grown in sophistication since the initial efforts [JL07, JB08], employing professional writing skills to avoid detection via text-based techniques. In [FBC12], the authors propose stylistic features derived from Probabilistic Context Free Grammar parse trees to detect review spam. To detect more complex fake review patterns, researchers have proposed 1) graph-based approaches such as approximate bipartite cores and lockstep behavior detection among reviewers [LMCH16, BXG13, HSB16b, JCB15], 2) techniques to identify network footprints of reviewers in the reviewer-product graph [YA15], and 3) using anomalies in ratings distributions [HSB16a]. Some recent research has pointed at the importance of time in identifying fake reviews, since producing as many reviews as possible in a short period of time is critical for the abuse to be economically viable. Methods exploiting temporal and spatial features related to reviewers/reviews [LCM15, YKA16], as well as the sequence of reviews [LZW14], have been proposed. While it is not possible to capture all the work on review spam detection, [CKP15] provides a broad coverage of efforts in this area.
Tensor-based methods: Techniques such as CrossSpot [JBC15], MZoom [SHF16], and MultiAspectForensics [MGF11] propose identifying dense blocks in tensors or dense subgraphs in heterogeneous networks, which can also be applied to the problem of identifying fake reviewers. MZoom is an improved version of CrossSpot that computes dense blocks in tensors which indicate anomalous or fraudulent behavior. The number of dense blocks (i.e., subtensors) returned by MZoom is configured a priori. Note that the identified dense blocks may overlap, i.e., a tuple could be included in two or more blocks. MZoom is used as one of the baseline unsupervised methods in our experiments. MultiAspectForensics, on the other hand, automatically detects and visualizes novel patterns that include bipartite cores in heterogeneous networks.
Tensor decomposition [SPBW15, HRC15, RWG14, RHHC15, RWC15] is also applied to detect abusive entities (i.e., sellers and reviewers), since it facilitates the detection of dense bipartite subgraphs. Dense bipartite subgraphs are indicative of suspicious behavior for the following reason: at any given time, there is a common pool of fake reviewers (paid reviewers) who are available and willing to write a positive or negative review in exchange for a fee. Sellers recruit these fake reviewers through an intermediary channel (such as social media groups, third-party brokers, etc.) as shown in Figure 2, where nodes on the left indicate sellers, nodes on the right indicate reviewers, and an edge indicates a written review. Note that a seller has to recruit a sizeable number of fake reviewers in a short amount of time to make an impact on the overall product rating: a positive impact for her own products and a negative impact for her competitors' products. Given a common pool of available fake reviewers at any given time, suspicious behavior manifests as dense bipartite connections between a group of sellers and a group of reviewers with similar ratings, such as near-1-star or near-5-star ratings. Hence the presence of a bipartite core (a dense bipartite subgraph) in some contiguous time interval is a strong indicator of abuse by the group of sellers and reviewers involved.
Therefore, our goal boils down to finding bipartite cores (or dense blocks) using the factor matrices resulting from tensor decomposition. The modes of the tensor for our problem space correspond to the seller, reviewer, product, rating, and time of review. The entities with relatively higher values in the corresponding factor matrices are indicative of abuse. Aggregating these abusive entities across the modes of the tensor results in discovering bipartite cores, where each core consists of a group of reviewers that provide similar ratings across a similar set of sellers (and their products), with all of these reviews occurring within a short contiguous interval of time.
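To make the aggregation step concrete, the following sketch flags entities whose factor values in a given rank-one component are relatively high; paired across modes, these entities form a candidate bipartite core. The factor matrices are synthetic and the thresholding rule is a hypothetical simplification of ours, not SENTINEL's actual scoring:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy factor matrices from a rank-2 decomposition (entities x rank);
# the values are synthetic, purely for illustration.
seller_factors = rng.random((6, 2))
reviewer_factors = rng.random((8, 2))
seller_factors[[1, 4], 0] += 2.0       # plant a dense core in component 0
reviewer_factors[[0, 2, 5], 0] += 2.0

def core_members(factors, component, threshold):
    """Entities whose factor value in one rank-one component exceeds a threshold."""
    return np.where(factors[:, component] > threshold)[0]

sellers = core_members(seller_factors, 0, 1.5)
reviewers = core_members(reviewer_factors, 0, 1.5)
print(sellers, reviewers)  # the planted groups form a candidate bipartite core
```

Repeating this per component and intersecting with the rating and time modes yields the cores described above.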
We have a small set of known abusive sellers and reviewers identified via manual audits. We leverage this partial supervision to propose a semi-supervised extension to tensor decomposition that detects new abusive entities with greater fidelity. To leverage correlations between different forms of abuse (e.g., paid reviews abuse, abuse related to compromised accounts), we incorporate multiple binary targets based on a logistic model with Pólya-Gamma data augmentation.
Natural gradient learning: Natural gradient learning [Ama98] is an alternative to traditional gradient-descent-based learning. We develop stochastic partial natural gradient learning for the semi-supervised tensor decomposition model and show that it empirically achieves faster convergence as compared to stochastic gradient descent and EM with sufficient statistics.
3 Logistic CP Decomposition
Let $\mathcal{X} \in \mathbb{R}^{I_1 \times I_2 \times I_3}$ be a 3-mode (aka 3-way) tensor. The CP decomposition of a tensor is defined as:

$\mathcal{X} \approx \sum_{r=1}^{R} \lambda_r \, a_r \circ b_r \circ c_r$   (1)

where $\lambda_r$ denotes the weight, $a_r$, $b_r$, and $c_r$ are vectors (or rank-1 tensors), and $\circ$ represents the vector outer product. $R$ is called the rank of the tensor. CP decomposition is a generalization of matrix decomposition. Each tuple in Amazon's review data captures the 'reviewing/rating' event wherein a reviewer writes a review/rating at a specific time for a product sold by a seller. These relationships can be described as a 5-way (or 4-way, if the seller is omitted) binary tensor. Such a binary tensor is incomplete, since not all reviewers would have rated all products. Hence, the elements of a binary tensor can be modeled as being generated from a logistic function, i.e., a value of 1 when the relationship exists and a value of 0 where it does not. In our probabilistic tensor decomposition framework, we only consider those entries where a current relationship exists. Logistic CP tensor decomposition is CP tensor decomposition in a probabilistic framework [RWG14, RHHC15, RWC15]. The generative model for a $K$-mode tensor $\mathcal{Y}$ is:

$y_{i_1 \dots i_K} \sim \mathrm{Bernoulli}\Big(\sigma\Big(\sum_{r=1}^{R} \lambda_r \prod_{k=1}^{K} u^{(k)}_{i_k r}\Big)\Big)$
where $\sigma(\cdot)$ denotes the logistic function, giving the Bernoulli-logistic model for the binary-valued tensor. Let $u^{(k)}_{i_k} \in \mathbb{R}^R$ denote the $R$-dimensional factor corresponding to entity $i_k$ in mode $k$. The number of nonzero values of $\lambda = (\lambda_1, \dots, \lambda_R)$ determines the rank of the tensor. Gaussian priors are assigned to the latent variables $\lambda_r$ and $u^{(k)}_{i_k}$ for $k = 1, \dots, K$. The variance of the Gaussian prior for $\lambda_r$ is controlled by a Multiplicative Inverse Gamma Process [Dur16], which has the property of reducing the variance as $r$ increases. To get closed-form updates for all the latent variables, [RWG14, RHHC15, RWC15] introduce additional Pólya-Gamma distributed variables, denoted by $\omega_{i_1 \dots i_K}$, corresponding to each tuple in the binary tensor.
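The Pólya-Gamma augmentation is what makes these closed-form updates possible: for $\omega \sim \mathrm{PG}(1, c)$, the conditional expectation $E[\omega] = \tanh(c/2)/(2c)$ is available analytically (a standard identity from the Pólya-Gamma construction of Polson, Scott and Windle). A minimal sketch of that expectation, with the function name our own:

```python
import numpy as np

def pg_mean(c):
    """E[omega] for omega ~ PG(1, c): tanh(c/2) / (2c), with the c -> 0 limit 1/4."""
    c = np.asarray(c, dtype=float)
    c_safe = np.where(np.abs(c) < 1e-8, 1.0, c)  # avoid 0/0; the limit is handled below
    return np.where(np.abs(c) < 1e-8, 0.25,
                    np.tanh(c_safe / 2.0) / (2.0 * c_safe))

# In an EM-style update, each observed tuple with logit psi contributes a
# weight E[omega], which makes the logistic likelihood quadratic in the factors.
psi = np.array([0.0, 1.0, -3.0])
print(pg_mean(psi))
```

These expected weights are what turn the intractable logistic likelihood into a Gaussian-like quadratic form in each latent variable.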
We assign an Inverse-Gamma hyperprior to the variance of these Gaussian priors. The reason for adding the Inverse-Gamma hyperprior to the model is that it provides adaptive L2 regularization. Hence, we can control the amount of regularization, i.e., provide different amounts of regularization to factors of the mode(s) that have target information than to those without target information.
We use this unsupervised model as our base and develop multi-target semi-supervised extensions to it as described in Section 4. [RWG14, RHHC15, RWC15] propose using either sufficient statistics (in batch EM or as online EM) or Gibbs sampling (in batch) for inference. They claim that the online EM reaches reasonably good-quality solutions fairly quickly as compared to its batch counterparts in most situations. However, the online EM does scalar updates of each latent variable that is inherently a vector of dimension $R$, and this may result in slower convergence. Alternatively, we propose using partial natural gradient learning in a stochastic setting for inference, which is both online in nature and allows vectorized updates of each of the latent variables.
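To fix ideas, here is a minimal sketch of the Bernoulli-logistic CP likelihood for a single tensor cell. The factor matrices and weights are synthetic, and the function names are our own illustrative choices:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def logistic_cp_prob(factors, weights, index):
    """P(y = 1) for one tensor cell: sigmoid of the weighted sum, over rank
    components, of the product of each mode's factor entry at that index."""
    logit = sum(
        w * np.prod([factors[k][index[k], r] for k in range(len(factors))])
        for r, w in enumerate(weights)
    )
    return sigmoid(logit)

# Toy 3-mode, rank-2 example (mode sizes 4, 5 and 3).
rng = np.random.default_rng(1)
factors = [rng.standard_normal((4, 2)),
           rng.standard_normal((5, 2)),
           rng.standard_normal((3, 2))]
weights = np.array([1.0, 0.5])
p = logistic_cp_prob(factors, weights, (0, 1, 2))
print(0.0 < p < 1.0)  # a valid Bernoulli probability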
4 Proposed Multi-Target Semi-Supervised Logistic CP Decomposition
In this section, we describe SENTINEL, the semi-supervised extension of the Logistic CP model for detecting abusive sellers and reviewers. We have prior information associated with a subset of the entities for at least one of the modes. This prior information is specified as binary target(s), where each target corresponds to a specific type of abuse. The framework is called semi-supervised tensor decomposition because the tensor decomposition is achieved by incorporating this prior data. The intuition behind using the target information is that the patterns hidden in the known abusive entities can be leveraged to discover, with greater precision, additional entities that have similar signatures. We know that abusive behavioral patterns change as the entities continuously try to game the system. We hypothesize that the proposed semi-supervised framework offers a way to detect such changing behavior earlier and more efficiently than unsupervised techniques.
In multi-target semi-supervised tensor decomposition, one or more targets are specified for at least one of the modes (denoted as $t^{(k)}_m$ for target $m$ in mode $k$; we drop the superscript $k$ for convenience). This is represented as a matrix where each column corresponds to a single target. For each target:
1) Both positive and negative labels need to be specified.
2) Data can be specified for only a subset of the entities in that mode (semi-supervised learning).
The decomposition of the tensor is achieved by taking the multi-target label information across the mode(s) into account. The label information corresponding to each target is predicted via its own logistic model, whose coefficients are denoted as $w^{(k)}_m$ for target $m$ in mode $k$ (we drop the superscript $k$ for convenience). For a given mode, the covariates of each logistic model are the same, i.e., the factors corresponding to the entities in that mode. The coefficients are assigned Gaussian priors. Note that the logistic model corresponding to each target is learnt by taking into account the correlations of the binary labels from the other targets, and the setup is therefore called multi-target learning. For each logistic model, to get closed-form updates, we introduce additional Pólya-Gamma distributed variables denoted by $\omega_m$. Equation 8 in the Appendix (Section A.2) shows how the logistic likelihood for each task becomes a quadratic function in $w_m$ by augmenting the data with $\omega_m$.
Multi-Target Learning
Consider binary target information for mode $k$ that is specified as an $N \times M$ matrix, where $N$ denotes the number of entities for which the targets are specified and $M$ denotes the number of tasks. Note that in the semi-supervised setting the binary target information is usually specified for only a subset of the entities of some mode $k$. Based on the specified target information, the tensor decomposition technique can infer the neighbors of these abusive entities. For example, if a seller is flagged as having review abuse, then the tensor decomposition technique would infer its factor as a function of the reviewers associated with this seller during some time interval where the density is high. This in turn would lead to the detection of other sellers who are also connected to some subset of these same reviewers during this same time interval. These other sellers would then end up having a similar factor representation, indicating a high probability of being abusive.
Given $M$ tasks (where $M > 1$), the learning falls under the multi-target learning framework. Multi-target learning takes into account the correlations (or similarities) between the tasks to learn the different classifiers [HLHW15]. This is achieved by defining a matrix where each element, denoted by $\eta_{ls}$, carries the reverse cosine information between targets $l$ and $s$ and is computed as:

$\eta_{ls} = 1 - \cos(t_l, t_s)$
The interpretation of this matrix, denoted by $H$, is easily understood by taking a single task as an example. Note that all the diagonal elements of $H$ are zero (by definition). For task $l$, consider the following quantity:

$\exp\big(-\eta_{ls}\, w_l^\top w_s\big)$   (2)

Equation 2 can be regarded as an exponential distribution on $w_l^\top w_s$ (without the normalizing factor) with a rate parameter equal to $\eta_{ls}$. We want to find a suitable $w_l$, given the rate parameter, that maximizes this exponential distribution. This is described by considering three extreme cases:
1) If targets $l$ and $s$ are negatively correlated (cosine measure near $-1$ and $\eta_{ls}$ near $2$), then the coefficients defining their respective classifiers need to be of opposite signs and similar magnitude. This implies that the dot product $w_l^\top w_s$ is negative, and since $\eta_{ls}$ is positive, the exponent is positive.
2) If targets $l$ and $s$ are uncorrelated (cosine measure near $0$ and $\eta_{ls}$ near $1$), then the coefficients defining their respective classifiers need to be distinct as well. This implies that the dot product will be closer to zero, meaning that the exponent is close to zero.
3) If targets $l$ and $s$ are highly correlated (cosine measure near $1$ and $\eta_{ls}$ near $0$), then the coefficients defining their respective classifiers need to be very similar (same sign and similar magnitude). This implies that the dot product will be positive. However, since $\eta_{ls} \approx 0$, the exponent will still be close to zero.
To handle cases when the magnitudes of any of the elements of $w_l$ or $w_s$ are very small or very large, we modify equation 2 by normalizing the dot product:

$\exp\Big(-\eta_{ls}\, \frac{w_l^\top w_s}{\lVert w_l \rVert\, \lVert w_s \rVert}\Big)$   (3)

Taking the natural logarithm of equation 3 yields $-\eta_{ls}\, \frac{w_l^\top w_s}{\lVert w_l \rVert\, \lVert w_s \rVert}$.
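Under one reading of the reverse-cosine measure (an assumption on our part: $\eta_{ls} = 1 - \cos(t_l, t_s)$, consistent with the three cases above), the task-similarity matrix can be sketched as follows; the toy target columns are illustrative only:

```python
import numpy as np

def reverse_cosine_matrix(T):
    """eta[l, s] = 1 - cosine(t_l, t_s) between target columns of T;
    the diagonal is set to 0 by definition."""
    norms = np.linalg.norm(T, axis=0)
    cos = (T.T @ T) / np.outer(norms, norms)
    eta = 1.0 - cos
    np.fill_diagonal(eta, 0.0)
    return eta

# Three toy targets: column 1 equals column 0 (correlated),
# column 2 is its negation (anti-correlated).
T = np.array([[1.0, 1.0, -1.0],
              [1.0, 1.0, -1.0],
              [0.0, 0.0,  0.0],
              [1.0, 1.0, -1.0]])
eta = reverse_cosine_matrix(T)
print(eta)  # eta[0, 1] near 0 (correlated), eta[0, 2] near 2 (anti-correlated)
```

The resulting entries fall in $[0, 2]$, matching the three cases: near 0 for correlated targets, near 1 for uncorrelated, near 2 for anti-correlated.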
The Appendix (Sections A.2 and A.3) describes the modeling of the classifier models in a multi-target framework jointly with the factors from the Logistic CP decomposition. Section A.1 in the Appendix describes the modeling of the latent parameter $\lambda$. The next section describes our proposed partial natural gradient inference for the multi-target semi-supervised Logistic CP model.
5 Partial Natural Gradients: Inference
The natural gradient is defined as the product of the Euclidean (i.e., standard) gradient and the inverse of the Fisher information matrix. Natural gradient learning in the context of online learning is explained in [Ama98]. It is an optimization method that is traditionally motivated from the perspective of information geometry and works well for many applications as an alternative to stochastic gradient descent [Mar14]. Natural gradient descent is generally applicable to the optimization of probabilistic models. It has been shown that in many applications natural gradients require far fewer total iterations than gradient descent, making them a potentially attractive alternative. However, for models with many parameters, computing the natural gradient is impractical since it requires inverting a large matrix, the Fisher information matrix. This problem has been addressed in prior works where an approximation to the Fisher is calculated such that it is easier to store and invert than the exact Fisher.
The latent variables in our semi-supervised model that we need to infer are $\lambda$ of length $R$, a factor matrix for each mode $k$ of dimension $N_k \times R$, and $w_m$ of length $R$ for each target $m$ in mode $k$. The corresponding log-posteriors of these latent variables that we need to maximize are denoted by $\mathcal{L}_\lambda$ (defined in A.1), $\mathcal{L}_U$ (defined in A.3) and $\mathcal{L}_w$ (defined in A.2), respectively. For a generic latent variable $\theta$ with log-posterior $\mathcal{L}$, the natural gradient update in iteration $t$ is then defined as:

$\theta^{(t+1)} = \theta^{(t)} + \rho_t\, F^{-1}\big(\theta^{(t)}\big)\, \nabla_\theta \mathcal{L}\big(\theta^{(t)}\big)$   (4)

where the learning rate $\rho_t$ in (4) is given by:

$\rho_t = (t + \tau)^{-\kappa}$

Note that $\kappa$ is the forgetting rate and $\tau$ is the delay. The values for these are chosen such that $\sum_t \rho_t^2$ is bounded but $\sum_t \rho_t$ is unbounded.
$F^{-1}$ in (4) indicates the inversion of the Fisher information matrix (a square matrix) in each iteration. There are two difficulties:
1) Computation of the Fisher information matrix is usually not trivial since it may not lend itself to a nice closed form.
2) The size of the Fisher information matrix in our data could be in the tens of thousands or more. This impacts scalability, i.e., it could result in very expensive computations that might also pose numerical stability issues, leading to an intractable inverse computation.
For the first issue, we show that the Pólya-Gamma data augmentation facilitates easy computation of the Fisher information matrix. The detailed derivations of the partial Fisher information matrices, as well as the gradient computations for each of the arguments, namely $\lambda$, the factor matrices and $w_m$, are given in Sections B and C.
We address the second issue by exploiting the problem structure, which facilitates working with a partial Fisher information matrix (hence the name partial natural gradients). Partial natural gradients imply that we only work with the diagonal blocks of the Fisher information matrix instead of the full matrix. Note that in our problem structure, the loss function is quadratic in each of the arguments. Due to the individually quadratic nature of the loss functions, each diagonal block is a symmetric positive definite matrix whose size equals the dimension of the corresponding latent variable, which in our problem space is on the order of the rank $R$ and therefore very small. Hence, the basic convergence guarantees for full natural gradient learning extend to the partial setup as well [BCN18]. We note that computation of the partial Fisher information matrix is theoretically and numerically tractable, as we are dealing with small square matrices. For scalability over very large data sets, typical of Amazon data, we have implemented partial natural gradient learning in a stochastic setting. A concern here is that in each iteration, using a minibatch, we obtain a noisy estimate of the partial Fisher information matrix. A noisy estimate of this matrix leads to a biased inverse, since the mean of inverses is not equal to the inverse of the mean. Hence, theoretical guarantees regarding faster convergence of partial natural gradients in a stochastic setting as compared with stochastic gradients cannot be established. Another side effect of this noisy estimate is that the matrix may be highly ill-conditioned, which may lead to numerical stability problems while computing its inverse. To circumvent this issue, we have experimented with conditioning this matrix (conditioning implies adding a scaled diagonal matrix) in one of two ways. We can either use the prior of the corresponding latent variable to scale the identity matrix and add this to the partial Fisher information matrix (denoted as natural gradient 1), or we can use the diagonal of the partial Fisher information matrix to scale the identity matrix and add this to the partial Fisher information matrix (denoted as natural gradient 2).
So far, we have not come across any efforts that have applied partial natural gradient learning in a stochastic setting for inference in Bayesian CP tensor decomposition. In Section 6, we empirically show that in the semi-supervised setting, on the test data, scalar updates of vector parameters in online EM lead to very slow convergence and suboptimal performance w.r.t. ROC-AUC as compared with the partial natural gradient and stochastic gradient algorithms. We also empirically show that the partial natural gradient algorithm in a stochastic setting achieves faster convergence than stochastic gradient learning in detecting both abusive sellers and reviewers on Amazon data sets.
6 Experimental Results
This section outlines our experiments to validate the efficacy of SENTINEL towards detecting entities promoting review abuse.
Dataset: For all our experiments, we have taken a random sample of products from the Amazon reviews dataset between 2017 and 2018. Not all reviews are associated with the purchase of a product. As a result, considering only those reviews with an associated seller reduces the dataset size by half. From the review data, we construct two datasets and, correspondingly, two separate binary tensors: 1) SELLERTENSOR captures the existing association of a reviewer giving a numeric rating at a specific time for a product from a seller (5-mode tensor), and 2) NOSELLERTENSOR captures the existing association of a reviewer giving a numeric rating at a specific time for a product, without the seller information (4-mode tensor). The former is used to detect abusive sellers, while the latter is used to detect abusive reviewers. The modes of the two binary tensors are reviewer ID, product ID, seller ID (if included), numeric rating and time. Note that the numeric rating corresponds to an integer between 1 and 5 and that time is converted to a week index.
SELLERTENSOR consists of 25K reviewers, 475K products and 70K sellers. NOSELLERTENSOR consists of 90K reviewers and 1.4M products. SELLERTENSOR has 1.2M tuples, resulting in an extremely sparse tensor.
Benchmark Methods: To the best of our knowledge, the proposed binary multi-target semi-supervised enhancement for CP decomposition and its inference using partial natural gradients are novel to this paper. However, tensor decomposition in general has been applied in the past to detect abusive entities in multimodal data. Hence, we benchmark the performance of our proposed approach against the following tensor-based approaches:
BPTF: Bayesian Poisson Tensor Factorization [SPBW15] is a Bayesian CP tensor decomposition approach assuming a Poisson likelihood. The authors in [SPBW15] have implemented the unsupervised BPTF model using a batch algorithm, but it does not seamlessly lend itself to semi-supervised extensions.
MZoom: The authors in [SHF16] have proposed an unsupervised technique to identify dense blocks in tensors or dense subgraphs in heterogeneous networks, which can also be applied to identify abusive sellers and reviewers.
Unsupervised BNBCP: Beta Negative-Binomial CP decomposition [HRC15] is also a Bayesian CP tensor decomposition approach assuming a Poisson likelihood. The BNBCP model is fully conjugate, and inference is done using Variational Bayes (VB). To be able to scale to massive tensors, we have implemented an online VB version using Stochastic Variational Inference (SVI).
Semi-supervised BNBCP: Since the unsupervised BNBCP model can be easily extended to the semi-supervised setting, we implemented a semi-supervised version. Note that this is a single-target, not a multi-target, enhancement.
We use precision, recall and AUC against a test set to measure the performance of the above approaches.
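For reference, these metrics can be computed without special tooling; the sketch below implements precision and recall at a fixed threshold, plus ROC-AUC via the rank statistic (the probability that a randomly chosen positive outscores a randomly chosen negative). The toy labels and scores are illustrative:

```python
import numpy as np

def precision_recall_auc(y_true, scores, threshold=0.5):
    """Precision/recall at a decision threshold plus ROC-AUC as a rank statistic."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pred = scores >= threshold
    tp = np.sum(pred & (y_true == 1))
    precision = tp / max(pred.sum(), 1)
    recall = tp / max((y_true == 1).sum(), 1)
    # AUC = P(score of a random positive > score of a random negative),
    # with ties counted as one half.
    pos, neg = scores[y_true == 1], scores[y_true == 0]
    greater = (pos[:, None] > neg[None, :]).mean()
    ties = (pos[:, None] == neg[None, :]).mean()
    return precision, recall, greater + 0.5 * ties

y = [1, 1, 0, 0, 1]
s = [0.9, 0.8, 0.3, 0.6, 0.4]
p, r, auc = precision_recall_auc(y, s)
print(round(p, 2), round(r, 2), round(auc, 2))
```

In practice a library routine would be used, but the rank-statistic view makes explicit why methods that emit no per-entity scores (such as MZoom) have no AUC.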
Experimental Validation: Our experiments consist of the following empirical evaluations:
1) Comparison between unsupervised methods and the proposed semi-supervised enhancements in detecting abusive sellers and reviewers.
2) Impact of different inference techniques on the proposed semi-supervised enhancement.
3) Robustness of the proposed multi-target semi-supervised model to variations in the hyperparameters.
4) Stability of the proposed approach across many runs, owing to the non-convex nature of tensor decomposition.
5) Scalability of the proposed approach.
6) Comparison with baseline models in terms of early detection of abusive reviewers and its impact on customers viewing the reviews.
Within SENTINEL, we have chosen fixed values for the learning rate parameters $\kappa$ and $\tau$, for the parameters of the Inverse Gamma distributions, and for the minibatch size used in all our simulations.
6.1 Detecting Abusive Sellers
In this experiment, we take a random sample of sellers who have been flagged, via manual audits, as guilty of review abuse. These are treated as positively labeled samples. In addition, we include a random sample of sellers who are currently not flagged for any kind of abuse. These are treated as negatively labeled samples. Together, these samples form the training dataset. We have taken an additional set of sellers as a test set to measure the performance of the different techniques (high-confidence abusive sellers are hard to obtain due to business reasons, hence the small test set). The test set has a similar distribution as the training set and has been selected from beyond the training time period.
Method | Precision | Recall | F1 Score | AUC
Unsupervised
  MZoom [SHF16] | 0.61 | 0.74 | 0.67 | -
  BPTF [SPBW15] | 0.53 | 0.79 | 0.63 | 0.74
  BNBCP [HRC15] | 0.51 | 0.89 | 0.65 | 0.74
Semi-supervised
  BNBCP [SVI] | 0.85 | 0.93 | 0.89 | 0.87
Semi-supervised (until convergence)
  Logistic CP [Stochastic Gradient] | 0.89 | 0.94 | 0.91 | 0.88
  Logistic CP [Natural Gradient 1] | 0.85 | 0.93 | 0.89 | 0.88
  Logistic CP [Natural Gradient 2] | 0.84 | 0.94 | 0.89 | 0.88
Semi-supervised (stop at 200 iterations)
  Logistic CP [Stochastic Gradient] | 0.81 | 0.74 | 0.77 | 0.78
  Logistic CP [Natural Gradient 1] | 0.83 | 0.89 | 0.86 | 0.87
  Logistic CP [Natural Gradient 2] | 0.78 | 0.81 | 0.79 | 0.85
Table 1: Performance of unsupervised and semi-supervised methods in detecting abusive sellers.
Table 1 shows that all four flavors of the semi-supervised tensor decomposition have higher precision, recall and AUC as compared with the unsupervised techniques, indicating that leveraging behavioral patterns from current abusive sellers improves performance and fidelity in identifying new abusive sellers. Given the single type of seller abuse (aka single target), both the semi-supervised BNBCP and SENTINEL models have comparable AUC performance. With early stopping of the semi-supervised models (200 iterations), we observe that inference based on natural gradients outperforms stochastic gradient learning, indicating empirically faster convergence of the former.
Note that MZoom does not produce scores for each suspicious entity, and hence its AUC is not calculated. Due to this lack of per-entity scores, we cannot rank the suspicious entities so as to apply different levels of enforcement actions against them. Among the unsupervised methods, BNBCP has the best AUC, which is noticeably lower than that of the best semi-supervised approach. The best-performing unsupervised method in terms of precision is MZoom, which likewise falls well below the best semi-supervised approach.
Table 2 shows evidence from the review data indicating suspicious behavior for a representative set of sellers that was predicted by SENTINEL to be abusive. Density indicates the ratio of bipartite connections (reviews) observed in the data to the maximum number of bipartite connections between products from this seller and suspicious reviewers. These reviewers are most likely abusive since they have written a review for practically all products by the seller.
Suspicious Seller | # Suspicious Products | # Suspicious Reviewers | # Reviews | Density
A | 75 | 25 | 1802 | 0.96
B | 16 | 9 | 132 | 0.91
C | 7 | 19 | 114 | 0.86
Table 2: Evidence of suspicious bipartite density for a representative set of sellers flagged by SENTINEL.
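The density column can be reproduced (up to rounding) directly from the other columns; a quick sketch, with the function name our own:

```python
# Density of a candidate core: observed reviews over the maximum possible
# bipartite connections between the seller's products and suspicious reviewers.
def core_density(n_reviews, n_products, n_reviewers):
    return n_reviews / (n_products * n_reviewers)

# Rows of Table 2 (sellers A, B, C):
for label, products, reviewers, reviews in [("A", 75, 25, 1802),
                                            ("B", 16, 9, 132),
                                            ("C", 7, 19, 114)]:
    print(label, round(core_density(reviews, products, reviewers), 2))
```

A density close to 1 means nearly every suspicious reviewer has reviewed nearly every one of the seller's suspicious products.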
Since paid reviewer abuse mostly manifests off-Amazon (e.g., funds transferred through a bank), evidence gathered from review data is indicative, but not legally binding enough to qualify the sellers as fraudulent. Moreover, since enforcement actions on false positives would impact revenue negatively, the thresholds for enforcement are quite stringent. The majority of the sellers are warned based on the model's recommendation.
Section D shows confidence intervals of the algorithms.
6.2 Detecting Abusive Reviewers
In this experiment, we have taken a random sample of reviewers who have been identified as guilty of review abuse. Among these, roughly % belong to the paid reviewer abuse category and the remainder belong to the compromised account abuse category. Recall that compromised account abuse occurs when a good customer's account is taken over by an abusive reviewer to post fake reviews. SENTINEL supports multi-target labels in a semi-supervised setting; hence, we treat the two forms of abuse as two separate targets in our model. The BNBCP semi-supervised model, on the other hand, does not support multiple targets, so there is no differentiation between these two forms of abuse in that model. A random sample of about % of this data forms the training set and the remaining % forms the test set.
Method                                   Precision  Recall  F1 Score  AUC
Unsupervised
  MZoom [SHF16]                          0.53       0.70    0.60      –
  BPTF [SPBW15]                          0.42       0.54    0.47      0.51
  BNBCP [HRC15]                          0.59       0.46    0.52      0.61
SemiSupervised
  BNBCP [SVI]                            0.60       0.83    0.69      0.79
SemiSupervised (until convergence)
  Logistic CP [Stochastic Gradient]      0.60       0.90    0.72      0.85
  Logistic CP [Natural Gradient 1]       0.62       0.91    0.74      0.85
  Logistic CP [Natural Gradient 2]       0.59       0.91    0.72      0.85
SemiSupervised (stop at 600 iterations)
  Logistic CP [Stochastic Gradient]      0.56       0.79    0.66      0.72
  Logistic CP [Natural Gradient 1]       0.59       0.87    0.71      0.83
  Logistic CP [Natural Gradient 2]       0.58       0.86    0.69      0.82
Table 3 shows that all four variants of our semisupervised tensor decomposition have higher precision, recall, and AUC than the unsupervised techniques. Recall and AUC show a more significant increase between the unsupervised and semisupervised methods than precision does. With early stopping (600 iterations), the natural gradient outperforms stochastic gradient learning, indicating empirically faster convergence of the former. At full convergence, the multitarget logistic CP semisupervised model outperforms the semisupervised model that is not designed to differentiate between forms of abuse (i.e., the BNBCP model). To test the hypothesis that jointly learning separate models for each abuse type is better than learning a single model for all abuse types, we also ran SENTINEL with a single target formed by the union of the two abuse types. The results show an almost % gain in AUC in predicting one of the targets in the multitarget scenario, confirming our hypothesis.
6.3 Stochastic Partial Natural Gradients versus Baseline Learning Methods
We apply the multitarget semisupervised Logistic CP tensor decomposition approach to Amazon review data (SELLERTENSOR for sellers and NOSELLERTENSOR for reviewers) to compare two baseline learning methods, namely sufficient statistics (online EM) and stochastic gradient, against the proposed stochastic partial natural gradient learning. As mentioned in Section 5, we have two flavors of stochastic partial natural gradient learning, differing only in the conditioning applied to the partial Fisher information matrix. Figure 3 shows the ROCAUC for detecting abusive sellers and abusive reviewers versus the iteration number. The solid red curve corresponds to stochastic partial natural gradient learning type 1, the blue dash-dot curve to type 2, the black dashed curve to stochastic gradient learning, and the magenta dashed curve to online EM with sufficient statistics.
Stochastic gradient learning has a tendency to overtrain, since it is unable to shrink some of the elements of towards zero when the tensor rank is less than . Stochastic partial natural gradient learning (both flavors) does not suffer from significant overtraining, since it shrinks 60% of the values of towards zero within the first one thousand iterations. This leads to similar AUC on the train and test data sets for detecting both abusive sellers and abusive reviewers, compared with the other two baselines. Stochastic partial natural gradient learning type 2 has a slightly slower learning rate than type 1. Online EM with sufficient statistics performs worse on test data (for both reviewers and sellers) than stochastic gradient or partial natural gradient learning.
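The contrast between the two update rules can be illustrated on a single logistic weight. This is a minimal sketch of the general idea (a gradient preconditioned by a damped scalar Fisher information), not the paper's partial natural gradient over CP factors; all names and constants are illustrative.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

random.seed(0)
# Toy data: the label is 1 exactly when the feature is positive.
data = [(x, 1 if x > 0 else 0)
        for x in [random.uniform(-3, 3) for _ in range(200)]]

w_sgd, w_nat, lr, damping = 0.0, 0.0, 0.1, 0.1
for _ in range(5):
    for x, y in data:
        # (a) Plain stochastic gradient on the log-likelihood.
        w_sgd += lr * (y - sigmoid(w_sgd * x)) * x

        # (b) Gradient preconditioned by the damped scalar Fisher
        # information p*(1-p)*x^2, the 1-D analogue of a natural step.
        p = sigmoid(w_nat * x)
        fisher = p * (1 - p) * x * x + damping
        w_nat += lr * (y - p) * x / fisher
```

In higher dimensions the preconditioner becomes the (partial) Fisher information matrix, whose inverse rescales the gradient direction.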
Figure 3: (a) Train data: sellers. (b) Train data: reviewers. (c) Test data: sellers. (d) Test data: reviewers.
6.4 Sensitivity to Model Hyperparameter
Figure 4 (a) and (b) show the impact on the F1 score, in a semisupervised setting, of the scale parameters of the two InverseGamma distributions (i.e., the model hyperparameters). The target-mode parameter is the scale of the InverseGamma distribution that controls the variance of the Gaussian prior for the factors of the mode(s) associated with multitarget data; it is also the scale of the InverseGamma distribution that controls the variance of the Gaussian prior for the logistic regression coefficients of each task. The non-target-mode parameter plays the same role for the remaining mode(s). The non-target-mode parameter is set to a high value so that it offers very little regularization; the target-mode parameter, however, has to be chosen more carefully so that it provides the right amount of regularization in the semisupervised setting.
Figure 4: (a) F1 score versus the target-mode parameter, with the non-target-mode parameter fixed at 9. (b) F1 score versus the non-target-mode parameter, with the target-mode parameter fixed at 0.4.
In Figure 4 (a), we fix the non-target-mode parameter to an arbitrarily high value and vary the target-mode parameter; a value of 0.4 produces the highest F1 score. In Figure 4 (b), we fix the target-mode parameter at 0.4 and vary the non-target-mode parameter; its impact on the F1 score is minimal in comparison, and the best F1 score is achieved at 9. Hence all our experiments are done with the target-mode parameter set to 0.4 and the non-target-mode parameter set to 9.
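The sweep above amounts to a small grid search over the two scale parameters against a validation F1 score. In the sketch below, `validation_f1` is a purely hypothetical stand-in objective whose peak is placed at the settings reported in Figure 4 (9 and 0.4); the real experiment trains the model for each setting.

```python
# Hedged sketch of the hyperparameter sweep over the two InverseGamma
# scale parameters; all names and the objective are illustrative.

def validation_f1(a_nontarget, a_target):
    # Stand-in for training the model and computing F1 on held-out data,
    # peaking at the paper's reported setting (9, 0.4).
    return 1.0 / (1.0 + abs(a_nontarget - 9) + abs(a_target - 0.4))

grid = [(a_nt, a_t) for a_nt in (1, 3, 9) for a_t in (0.1, 0.4, 1.6)]
best = max(grid, key=lambda pair: validation_f1(*pair))
# best == (9, 0.4) for this stand-in objective
```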
6.5 Robustness of Various Techniques
Tensor decomposition is nonconvex, so different runs can converge to different solutions. Figure 5 (a) therefore shows the variation in F1 scores across ten different runs for each of the techniques, namely Logistic CP with stochastic partial natural gradient inference, BNBCP, BPTF, and MZoom (unsupervised setting), for detecting abusive reviewers.
Figure 5: (a) Robustness: F1 scores across runs. (b) Scalability: size of tensor versus convergence time.
BNBCP has the least variation across runs, followed by Logistic CP with stochastic partial natural gradient based inference. MZoom (in magenta) requires the number of blocks as input; given this number, it produces identical subtensors across runs. Hence, we measured its F1 score across different numbers of blocks, varying them monotonically from to . Below , the F1 score was much lower, so those settings are omitted. With blocks as input, MZoom's F1 score is highest and beats the other three methods. However, unlike the other three methods, MZoom does not produce a score for each suspicious reviewer; we therefore cannot rank reviewers by suspiciousness, which is an important limitation for taking enforcement actions.
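The robustness experiment itself is simple to restate: repeat a training run under different seeds and summarize the spread of F1 scores. In the sketch below, `train_and_score` is a hypothetical stand-in; in the paper each run is a full tensor decomposition followed by evaluation.

```python
import random
import statistics

# Hedged sketch of the robustness experiment across ten seeded runs.

def train_and_score(seed):
    rng = random.Random(seed)
    # Stand-in F1 with run-to-run noise from nonconvex optimization.
    return 0.70 + rng.uniform(-0.03, 0.03)

scores = [train_and_score(seed) for seed in range(10)]
spread = statistics.stdev(scores)  # smaller spread = more robust method
```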
6.6 Scalability of the proposed approach
We ran the logistic CP with partial natural gradient on the NOSELLERTENSOR dataset with a varying number of tuples. The hyperparameters were fixed at the settings mentioned above. Figure 5 (b) shows the scalability of the proposed method across five datasets of varying sizes. We plot the time to convergence (the iteration at which learning saturates) in minutes versus the size of the tensor (i.e., the number of tuples) on a log scale. We observe an almost linear increase in time to convergence as the number of tuples increases. Our algorithm is also easily parallelizable, because within an iteration each tuple in the minibatch can be processed independently; we expect this to yield much faster convergence than we currently observe. We are in the process of implementing a parallel version of our algorithm.
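The measurement protocol can be sketched as follows. Here `run_to_convergence` simply does a fixed amount of work per tuple, mimicking the linear per-tuple cost of a sweep of minibatch updates; it is an illustrative stand-in, not the SENTINEL code.

```python
import math
import time

# Hedged sketch of the scalability experiment: time a stand-in run for
# tensors of increasing size and observe roughly linear growth.

def run_to_convergence(n_tuples):
    start = time.perf_counter()
    acc = 0.0
    for i in range(n_tuples):
        acc += math.sqrt(i + 1)  # O(1) stand-in work per tuple
    return time.perf_counter() - start

sizes = [10_000, 20_000, 40_000, 80_000]
times = [run_to_convergence(n) for n in sizes]
# With linear per-tuple cost, doubling the size roughly doubles the time.
```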
6.7 Impact of Early Detection of Abusive Reviewers
In this section, we present a retrospective analysis of the impact of our model as compared with the existing system.
Early detection of abusive reviewers: Figure 6 (a) shows the fraction of suspicious reviewers predicted by SENTINEL that are later banned by existing models. The model is executed at Week 0, and % of the identified suspicious reviewers overlap with those banned by production models. This overlap increases week by week until, five weeks later (Week 5), it reaches roughly %. In other words, roughly % of the reviewers detected by our model at Week 0 are detected by the production system over the next weeks. The impact of not banning these abusive reviewers at Week 0 is captured by the number of impressions generated by their reviews until Week 5.
Figure 6: (a) Cumulative percentage of reviewers banned versus time. (b) Cumulative impressions count versus time.
Impact in terms of affected impression counts: Impressions denote the number of customers who read the (fake) reviews from these abusive reviewers before they were banned. Note that banning a reviewer removes all of his/her reviews; hence, a larger impression count implies that more people read those reviews, i.e., a bigger impact from not banning the abusive reviewers early enough. Figure 6 (b) shows the cumulative impact in terms of the number of impressions accumulated over consecutive weeks, from the time the reviewers were detected by our model until they were banned by the existing production model. These impressions on fake reviews quantify the potential impact of our model once deployed in production.
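The cumulative-impressions curve of Figure 6 (b) is a running sum of weekly impression counts. The weekly counts below are illustrative numbers, not the paper's data.

```python
from itertools import accumulate

# Hedged sketch: weekly impressions on the fake reviews of reviewers
# flagged by SENTINEL at Week 0 but banned by production only later.
weekly_impressions = [1200, 950, 800, 600, 400, 250]  # Weeks 0 through 5

cumulative = list(accumulate(weekly_impressions))
# cumulative[-1] is the total impressions accrued before banning,
# i.e., the cost of the delayed enforcement.
```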
6.8 Experiments on public data
To allow others to reproduce our results, we applied SENTINEL to the public Amazon Customer Reviews dataset [HM16], available at http://jmcauley.ucsd.edu/data/amazon/. We selected the product categories Clothing, Shoes, and Jewelry; Home and Kitchen; and Sports and Outdoors for this test. Note that this data does not include seller information. Also, since the data spans May to July , we do not have any labelled data, i.e., known abusive reviewers, for that period. To test SENTINEL, we used a random % of the suspicious reviewers flagged by MZoom to seed SENTINEL as well as the semisupervised version of BNBCP; the remaining % formed the test set. On the test set, SENTINEL obtains a % increase in AUC compared with the semisupervised BNBCP model, and a % increase in AUC compared with the best unsupervised model, i.e., BPTF. Since we treat MZoom's output as ground truth, we acknowledge that this is not an ideal setting for validating the proposed approach; it does, however, allow us to loosely compare the various methods.
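The seeding protocol can be sketched as below; the 50/50 split and reviewer IDs are illustrative stand-ins for the percentages elided in the text.

```python
import random

# Hedged sketch of the public-data protocol: treat MZoom's flagged
# reviewers as (noisy) ground truth, seed the semisupervised models
# with a random subset, and evaluate on the rest.
random.seed(0)

mzoom_flagged = [f"reviewer_{i}" for i in range(200)]  # hypothetical IDs
random.shuffle(mzoom_flagged)

half = len(mzoom_flagged) // 2
seed_set, test_set = mzoom_flagged[:half], mzoom_flagged[half:]
# seed_set seeds SENTINEL / semisupervised BNBCP; test_set is held out.
```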
7 Conclusion
We formulated the problem of identifying abusive entities (i.e., sellers and reviewers) as a tensor decomposition problem and proposed semisupervised enhancements by incorporating binary multitarget information for a subset of entities, i.e., known abusive sellers and/or reviewers. Our results demonstrated that the proposed approach (titled SENTINEL) beats the stateoftheart baselines in detecting abusive entities.
We have shown, in the supplementary manuscript, that PólyaGamma formulation simplifies calculation of the partial Fisher information matrix, and hence proposed a scalable stochastic partial natural gradient learning for inference of all the latent variables of the semisupervised model. We have empirically shown that our inference using stochastic partial natural gradient learning achieves faster convergence than online EM using sufficient statistics and stochastic gradient learning.
Future Work: Given the equivalence between tensor decomposition and convolutional rectifier networks [COA16], we would like to compare the performance of the latter (hierarchical Tucker) with our probabilistic CP decomposition model. We also want to investigate the feasibility of applying Graph Convolutional Networks to our multimodal data (either reviewers, products, and their relationships w.r.t. rating and time of rating, or reviewers, products, sellers, and their relationships) to detect abusive entities.
Acknowledgment
This work was started at Amazon under the guidance of Srinivasan H. Sengamedu, who suggested the application of semisupervised tensor decomposition techniques to review abuse detection. We would also like to thank Purushottam Kar for sharing helpful insights related to the biased estimates of the partial Fisher information matrices in our stochastic algorithm.
References
 [Ama98] S. I. Amari, Natural gradient works efficiently in learning, Neural computation 10 (1998), no. 2, 251–276.
 [BCN18] L. Bottou, F. E. Curtis, and J. Nocedal, Optimization methods for largescale machine learning, SIAM Review 60 (2018), no. 2, 223–311.
 [BD11] A. Bhattacharya and D. B. Dunson, Sparse Bayesian infinite factor models, Biometrika 98 (2011), no. 2, 291.
 [BXG13] A. Beutel, W. Xu, V. Guruswami, C. Palow, and C. Faloutsos, CopyCatch: stopping group attacks by spotting lockstep behavior in social networks, Proceedings of the 22nd International Conference on World Wide Web (WWW), 2013, pp. 119–130.
 [CKP15] M. Crawford, T. M. Khoshgoftaar, J. D. Prusa, A. N. Richter, and H. Al Najada, Survey of review spam detection using machine learning techniques, Journal of Big Data 2 (2015), no. 1, 23.
 [COA16] N. Cohen, O. Sharir, and A. Shashua, On the expressive power of deep learning: a tensor analysis, JMLR: Workshop and Conference Proceedings, vol. 49, 2016, pp. 1–31.
 [Dur16] D. Durante, A note on the multiplicative gamma process, Tech. report, arXiv preprint arXiv:1610.03408, 2016.
 [FBC12] S. Feng, R. Banerjee, and Y. Choi, Syntactic stylometry for deception detection, Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers  Volume 2, ACL ’12, 2012, pp. 171–175.
 [HLHW15] J. Huang, G. Li, Q. Huang, and X. Wu, Learning label specific features for multilabel classification, IEEE International Conference on Data Mining (ICDM), 2015, pp. 181–190.
 [HM16] R. He and J. McAuley, Ups and downs: modeling the visual evolution of fashion trends with oneclass collaborative filtering, Proceedings of the 25th International Conference on World Wide Web (Republic and Canton of Geneva, Switzerland), WWW ’16, International World Wide Web Conferences Steering Committee, 2016, pp. 507–517.
 [HRC15] C. Hu, P. Rai, C. Chen, M. Harding, and L. Carin, Scalable Bayesian nonnegative tensor factorization for massive count data, Joint European Conference on Machine Learning and Knowledge Discovery in Databases (ECML/PKDD), 2015, pp. 53–70.
 [HSB16a] B. Hooi, N. Shah, A. Beutel, S. Günnemann, L. Akoglu, M. Kumar, D. Makhija, and C. Faloutsos, BIRDNEST: Bayesian inference for ratings-fraud detection, SIAM International Conference on Data Mining (SDM), 2016.
 [HSB16b] B. Hooi, H. Song, A. Beutel, N. Shah, K. Shin, and C. Faloutsos, FRAUDAR: bounding graph fraud in the face of camouflage, Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016, pp. 895–904.
 [JB08] N. Jindal and B. Liu, Opinion spam and analysis, Proceedings of the 2008 International Conference on Web Search and Data Mining, 2008, pp. 219–230.
 [JBC15] M. Jiang, A. Beutel, P. Cui, B. Hooi, S. Yang, and C. Faloutsos, A general suspiciousness metric for dense blocks in multimodal data, IEEE International Conference on Data Mining (ICDM), 2015, pp. 781–786.
 [JCB15] M. Jiang, P. Cui, A. Beutel, C. Faloutsos, and S. Yang, Inferring lockstep behavior from connectivity pattern in large graphs, Knowledge and Information Systems 48 (2015), no. 2, 399–428.
 [JL07] N. Jindal and B. Liu, Analyzing and detecting review spam, Proceedings of the 2007 Seventh IEEE International Conference on Data Mining, ICDM ’07, 2007, pp. 547–552.
 [KB09] T. G. Kolda and B. W Bader, Tensor decompositions and applications, SIAM review 51 (2009), no. 3, 455–500.
 [LCM15] H. Li, Z. Chen, A. Mukherjee, B. Liu, and J. Shao, Analyzing and detecting opinion spam on a largescale dataset via temporal and spatial patterns, Proceedings of the 9th International AAAI Conference on Web and Social Media (ICWSM), 2015, pp. 26–29.
 [LMCH16] Y. Li, O. Martinez, X. Chen, and J. E. Hopcroft, In a world that counts: clustering and detecting fake social engagement at scale, Proceedings of the 25th International Conference on World Wide Web (WWW), 2016, pp. 111–120.
 [LZW14] Yuming Lin, Tao Zhu, Hao Wu, Jingwei Zhang, Xiaoling Wang, and Aoying Zhou, Towards online antiopinion spam: Spotting fake reviews from the review sequence, Advances in Social Networks Analysis and Mining (ASONAM), 2014, pp. 261–264.
 [Mar14] J. Martens, New insights and perspectives on the natural gradient method, Tech. report, arXiv preprint arXiv:1412.1193, 2014.
 [MGF11] K. Maruhashi, F. Guo, and C. Faloutsos, MultiAspectForensics: Pattern mining on largescale heterogenous networks with tensor analysis., Advances in Social Networks Analysis and Mining (ASONAM), 2011, pp. 203–210.
 [PS12] J. W. Pillow and J. G. Scott, Fully Bayesian inference for neural models with negative binomial spiking, Advances in Neural Information Processing Systems (NeurIPS), 2012.
 [PSW13] N. G. Polson, J. G. Scott, and J. Windle, Bayesian inference for logistic models using Pólya–Gamma latent variables, Tech. report, arXiv preprint arXiv:1205.0310v3, 2013.
 [RHHC15] P. Rai, C. Hu, M. Harding, and L. Carin, Scalable probabilistic tensor factorization for binary and count data, Proceedings of the 24th International Conference on Artificial Intelligence (IJCAI), 2015, pp. 3770–3776.
 [RWC15] P. Rai, Y. Wang, and L. Carin, Leveraging features and networks for probabilistic tensor decomposition, Association for the Advancement of Artificial Intelligence (AAAI), 2015, pp. 2942–2948.
 [RWG14] P. Rai, Y. Wang, S. Guo, G. Chen, D. Dunson, and L. Carin, Scalable Bayesian lowrank decomposition of incomplete multiway tensors, Proceedings of the 31st International Conference on Machine Learning (ICML), 2014, pp. 1800–1808.
 [SHF16] K. Shin, B. Hooi, and C. Faloutsos, MZoom: fast denseblock detection in tensors with quality guarantees, Joint European Conference on Machine Learning and Knowledge Discovery in Databases (ECML/PKDD), 2016, pp. 264–280.
 [SPBW15] A. Schein, J. Paisley, D. M. Blei, and H. Wallach, Bayesian Poisson tensor factorization for inferring multilateral relations from sparse dyadic event counts, Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2015, pp. 1045–1054.
 [YA15] J. Ye and L. Akoglu, Discovering opinion spammer groups by network footprints, Joint European Conference on Machine Learning and Knowledge Discovery in Databases, 2015, pp. 267–282.
 [YKA16] J. Ye, S. Kumar, and L. Akoglu, Temporal opinion spam detection by multivariate indicative signals, International AAAI Conference on Web and Social Media, 2016, pp. 743–746.
Appendix A Appendix
A.1 Determine
has a Gaussian prior whose variance is determined via the Multiplicative Gamma Process [BD11, Dur16], also known as the Adaptive Dimensionality Reduction technique, which induces sparsity and automatically deletes redundant parameters. The Multiplicative Gamma Process places a multiplicative Gamma prior on the precision of a Gaussian distribution, which induces a multiplicative InverseGamma prior on its variance parameter. Since its performance is sensitive to the hyperparameter settings of the InverseGamma prior, we follow the strategies shown in [Dur16] to set their values. The generative model for for in and the Multiplicative Gamma Process are:
where:
(5) 
The idea is that for increasing r, should be decreasing in a probabilistic sense. In other words, the following stochastic order should be maintained with high probability, i.e.,
To guarantee such a stochastic order with high probability, as suggested in [Dur16], we need to set and and nondecreasing for all . Hence, we choose the hyperparameter values as follows:
The update for in iteration t is given by (see [RWG14] for details):
(6) 
From (5), we can calculate the s in iteration t for . Denote as the vector consisting of for .
Let for denote an element of . We introduce auxiliary PólyaGamma distributed variables [PSW13, PS12], denoted by for each element of the input tensor data, via a data augmentation technique.
Consider a minibatch defined at iteration as . Define for each and where: denotes a vector consisting of elements for . Let be the matrix whose rows are for . Let for be the expected value of the auxiliary variable corresponding to the element of the input tensor data. The expected value of has a closedform solution given by:
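The closed form in question is standard for Pólya-Gamma variables. Stated in generic symbols (the paper's own notation is elided above), the mean of a PG(b, c) variable, following [PSW13], is:

```latex
\mathbb{E}\left[\omega \mid b, c\right] \;=\; \frac{b}{2c}\,\tanh\!\left(\frac{c}{2}\right),
\qquad \omega \sim \mathrm{PG}(b, c).
```

For the PG(1, c) variables used in logistic data augmentation, this specializes to \(\tanh(c/2)/(2c)\).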
The update for in iteration t, with the current minibatch , is obtained by maximizing the natural logarithm of the posterior distribution of given by:
(7) 
where and , and the operator represents elementwise division between the two vectors and . Note that (15) is a quadratic function of and hence has a closedform update.
A.2 Determine the coefficients for
The generative model for for in , in is:
For mode , we drop the notation for all variables. Define for each where denotes the number of elements in mode that have binary side information corresponding to one or more targets. The logistic function i.e., the likelihood corresponding to element with label is given by:
where denotes prepended with to account for the bias. With introduction of PólyaGamma variables [PSW13] denoted by for task , the likelihood with the data augmentation becomes:
where:
(8) 
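In generic notation (the paper's symbols are elided above), the Pólya-Gamma data augmentation identity of [PSW13] that underlies this step reads:

```latex
\frac{\left(e^{\psi}\right)^{y}}{1 + e^{\psi}}
\;=\; \frac{1}{2}\, e^{\kappa\psi} \int_{0}^{\infty}
e^{-\omega\psi^{2}/2}\, p_{\mathrm{PG}(1,0)}(\omega)\, d\omega,
\qquad \kappa = y - \tfrac{1}{2},
```

so that, conditioned on \(\omega\), the augmented likelihood is Gaussian in \(\psi\) and conjugate updates become available.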
The update for in iteration t is obtained by maximizing the natural logarithm of the posterior distribution of given by:
(9) 
Equation (20) is a quadratic function of and hence has a closedform update. The operator represents elementwise division between the two vectors and .
Subsequently, the update for at time step is given by:
(10) 
A.3 Determine the factors for mode
The generative model for is:
The InverseGamma hyperprior on the variance parameter of the Gaussian prior for provides adaptive L2Regularization. The InverseGamma parameters are set so that greater amount of regularization is provided for mode that has target information. Denote as the dimension vector consisting of factors corresponding to element in mode and . Define for each , , and , where each element of is for . Let be the matrix whose rows are .
Mode without target information: For mode , we drop the notation for all variables. The update for in iteration t is obtained by maximizing the natural logarithm of the posterior distribution of given by:
(11) 
where the operator represents elementwise division between the two vectors and .
Equation (21) is a quadratic function of and hence has a closedform update.
Subsequently the update for at time step is given by:
(12) 
Mode with binary target information: Given binary target information for mode we drop the notation for all variables. Let denote the binary label (either or ) for element for task . Let correspond to the data augmented variable that is PólyaGamma distributed for element the and task .
The update for in iteration t is obtained by maximizing the natural logarithm of the posterior distribution of given by:
(13) 
where denotes the vector of coefficients without the bias. Note that in (22) we are training a Logistic model using the binary target information to detect abusive entities. The operator represents elementwise division between the two vectors and . Equation (22) is a quadratic function of and hence has a closedform update.
Subsequently the update for at time step is given by:
(14) 
Algorithm
Algorithm presents the pseudocode for the multitarget semisupervised CP tensor decomposition using partial natural gradients.
Appendix B Computation of Partial Fisher information matrix
Partial Fisher information matrix w.r.t.
Consider the exponent of the function to be maximized w.r.t. , which is:
(15) 
where and , and the operator represents elementwise division between the two vectors and .
The first exponential term of the right hand side (RHS) of (15) is the joint conditional likelihood of the binary outcome , denoted as , and the conditional likelihood of the PólyaGamma distributed variable (from data augmentation), denoted as . The second exponential term of the RHS of (15) is the Gaussian prior on with variance . The stochastic natural gradient ascent-style update for at step is given as:
(16) 
where in (16) is given by the following equation:
in (16) denotes the inverse of the partial Fisher information matrix whose size is since the second order derivatives are computed only w.r.t. .
Note that the joint likelihood term in (15) is unnormalized. The partial Fisher information matrix is computed only w.r.t. the data i.e., . To do this, we first compute the expectation w.r.t. the data augmented variable . The resulting equation is a Logistic function from the following identity:
(17) 
There is a closedform solution for the integral in (17), hence we obtain:
(18) 
Equation (18) is a normalized likelihood for each and let it be denoted by . Using the definition of hyperbolic cosine, (18) becomes:
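For reference, the standard identity in question (stated in generic symbols, following [PSW13]) is the Laplace transform of a PG(1, 0) variable, which turns (18) back into a normalized logistic likelihood:

```latex
\int_{0}^{\infty} e^{-\omega\psi^{2}/2}\, p_{\mathrm{PG}(1,0)}(\omega)\, d\omega
\;=\; \frac{1}{\cosh(\psi/2)},
\qquad\text{so}\qquad
\ell(y \mid \psi) \;=\; \frac{e^{(y-1/2)\psi}}{2\cosh(\psi/2)}
\;=\; \frac{\left(e^{\psi}\right)^{y}}{1 + e^{\psi}}.
```

The last equality follows by expanding \(2\cosh(\psi/2) = e^{\psi/2} + e^{-\psi/2}\).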
To this end, the partial Fisher Information with respect to for and is given by:
where denotes the matrix whose rows are for and denotes the diagonal matrix whose diagonal elements are ; where:
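In generic notation (the paper's symbols are elided above), the Fisher information of a logistic likelihood has the standard form:

```latex
\mathcal{F} \;=\; X^{\top}\Lambda X,
\qquad
\Lambda_{ii} \;=\; \sigma(\psi_i)\left(1-\sigma(\psi_i)\right),
\qquad
\sigma(\psi) \;=\; \frac{1}{1+e^{-\psi}},
```

which matches the description above of a design matrix combined with a diagonal matrix of Bernoulli variances.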
The prior term is accounted for by considering its variance as a conditioner; hence, the conditioned partial Fisher information matrix for the parameter at step is given by:
(19) 
where diag denotes inverse of a diagonal matrix whose diagonal is .
Partial Fisher information matrix w.r.t. for task in mode
The function to be maximized w.r.t. is:
(20) 
where the operator represents elementwise division between the two vectors and .
Let be the matrix whose rows are for for mode with target information. We compute the partial Fisher information matrix