Bayesian Semi-Supervised Tensor Decomposition using Natural Gradients for Anomaly Detection

04/11/2018 ∙ by Anil R. Yelundur, et al. ∙ Amazon, Microsoft

Anomaly detection has several important applications. In this paper, our focus is on detecting anomalies in seller-reviewer data using tensor decomposition. While tensor decomposition is mostly unsupervised, we formulate a Bayesian semi-supervised tensor decomposition to take advantage of sparse labeled data. In addition, we use Pólya-Gamma data augmentation for the semi-supervised Bayesian tensor decomposition. Finally, we show that the Pólya-Gamma formulation simplifies calculation of the Fisher information matrix for partial natural gradient learning. Our experimental results show that our semi-supervised approach outperforms state-of-the-art unsupervised baselines, and that partial natural gradient learning outperforms stochastic gradient learning and Online-EM with sufficient statistics.


1 Introduction

Anomaly detection means finding patterns in the data that do not conform to normal behavior (Chandola et al., 2009). In this paper, our focus is on detecting anomalies in seller-reviewer data using tensor decomposition. Anomalies in the seller-reviewer graph occur when there is a secretive agreement between sellers and reviewers to gain an unfair market advantage. One type of abuse in Amazon e-commerce is called reviews abuse, where a seller enlists (directly or indirectly) reviewers to write fraudulent reviews about their products. In other words, sellers incentivize reviewers to create fake reviews for them, and hence sellers are the major driver of reviews abuse. In fact, there are businesses which promise reviews on Amazon for a fee. Hence, detecting anomalies in the seller-reviewer graph and enforcing actions on such sellers and reviewers addresses the root cause of fake review origination.

From a technical perspective, we represent anomalies as dense cores in the seller-reviewer bi-partite graph (or dense blocks in the seller-reviewer matrix) satisfying certain properties. To incorporate meta-data such as review timestamp, rating, etc., we model the data as a tensor and apply Bayesian tensor decomposition to detect anomalies. We develop semi-supervised extensions to the probabilistic tensor decomposition model to incorporate prior information regarding known bad sellers and/or reviewers. Model inference is achieved via a natural gradient learning framework in a stochastic setting (Amari, 1998).

Our contributions are as follows.

  1. We formulate anomaly detection as a Bayesian binary tensor decomposition problem.

  2. We develop semi-supervised extensions to the probabilistic binary tensor decomposition that incorporate binary target information for a subset of sellers (and/or reviewers) which have been tagged as being abusive or not abusive. This is based on the Logistic Model with Pólya-Gamma data augmentation.

  3. Finally we develop partial natural gradient learning for inference of all the latent variables of the probabilistic semi-supervised model.

Natural gradient based optimization is motivated from the perspective of information geometry and works well for many applications as an alternative to stochastic gradient descent (Martens, 2014); in particular, natural gradient learning seems to require far fewer total iterations than stochastic gradient descent. It is generally applicable to the optimization of probabilistic models and uses the natural gradient in place of the standard gradient. Note that natural gradient learning requires computing the inverse of the Fisher information matrix in every iteration. In most models, computing the Fisher information matrix is non-trivial and, even if it can be computed, its inverse is usually not tractable because the size of the matrix depends on the number of latent parameters in the model. We show that the Pólya-Gamma data augmentation facilitates easy computation of the Fisher information matrix. To make the inverse of this matrix tractable, we exploit the quadratic structure of the loss function in each of the latent variables, which lets us work with only the partial Fisher information matrix, which is much smaller than the full Fisher information matrix.

Our experiments show the following:

  1. Our semi-supervised approach beats the state-of-the-art unsupervised baselines in identifying abusive sellers and reviewers on Amazon data sets.

  2. Partial natural gradient learning shows better learning than online EM (using sufficient statistics) on the test data set.

  3. Partial natural gradient learning shows better learning than stochastic gradient learning, specifically on detecting abusive sellers in the test data set.

The rest of this paper is organized as follows. Section 2 introduces recent related work as well as background regarding our application of tensor decomposition for anomaly detection, including our proposed partial natural gradient learning for inference. Section 3 describes our proposed semi-supervised extensions to the binary tensor decomposition model via Pólya-Gamma data augmentation. Section 4 describes the modeling for all the latent variables. Section 5 describes the inference of all the latent variables using partial natural gradients. Experimental results are shown in Section 6. Finally, we conclude the paper in Section 7.

The detailed derivations of the partial Fisher information matrix as well as the gradient calculations for the proposed model are in Appendix A.

2 Related Work and Background

2.1 Fake Reviewer Detection

Finding fake reviewers in online e-commerce platforms has received a lot of attention recently. Jindal and Liu (2008) were among the first to show that review spam exists and proposed simple text based features to classify fake reviewers.

Abusive reviewers have grown in sophistication ever since, employing professional writing skills to avoid detection via text based techniques. To detect more complex fake review patterns, researchers have proposed graph based approaches such as approximate bi-partite cores/lockstep behavior among reviewers (Li et al., 2016; Beutal et al., 2013; Hooi et al., 2016b; Jiang et al., 2015b), network footprints of reviewers in the reviewer product graph (Ye and Akoglu, 2015) and using anomalies in ratings distribution (Hooi et al., 2016a).

Some recent research has also pointed out the importance of time in identifying fake reviews because it is critical to produce as many reviews as possible in a short period of time to be economically viable. Methods based on time-series analysis (Li et al., 2015; Ye et al., 2016) have been proposed.

Tensor based methods such as CrossSpot (Jiang et al., 2015a), M-Zoom (Shin et al., 2016), and MultiAspectForensics (Maruhashi et al., 2011) have been proposed to identify dense blocks in tensors or dense subgraphs in heterogeneous networks, and can also be applied to our case of identifying fake reviewers. Tensors are higher-dimensional generalizations of matrices, i.e., higher-dimensional arrays, that can seamlessly incorporate many types of information. The different dimensions of the tensor are called modes. M-Zoom (Multidimensional Zoom) is an improved version of CrossSpot that computes dense blocks in tensors which indicate anomalous or fraudulent behavior. The number of dense blocks (i.e., subtensors) returned by M-Zoom can be configured a priori. Note that the blocks returned may overlap, i.e., a tuple could be included in two or more blocks. The authors in (Shin et al., 2016) have implemented M-Zoom and we have leveraged it as one of the baselines (un-supervised learning) in our experiments. MultiAspectForensics, on the other hand, automatically detects and visualizes novel patterns that include bi-partite cores in heterogeneous networks.

2.2 Anomaly Detection using Tensor Factorization

Tensor factorization (also known as decomposition) can be applied to detect anomalies in seller-reviewer data because it facilitates detection of dense bi-partite subgraphs. The reasoning is as follows: at any given time, there is a common pool of fake reviewers (paid reviewers) who are available and willing to write a positive or negative review in exchange for a fee. Sellers recruit these fake reviewers through an intermediary channel (such as Facebook groups, third party brokers, etc.) as shown in Figure 1, where nodes on the left indicate sellers, nodes on the right indicate reviewers, and an edge indicates a written review. Note that a seller has to recruit a sizeable number of fake reviewers in a short amount of time to make an impact on the overall product rating: positive impact for his own and negative impact for his competitor’s products. Given a common pool of available fake reviewers at any given time, anomalies manifest as approximate bi-partite connections between a group of sellers and a group of reviewers with similar ratings, such as near 5-star or near 1-star ratings. Hence the presence of a bi-partite core (a dense bi-partite sub-graph) in some contiguous time interval is a strong indicator of anomalous connections between the group of sellers and reviewers involved.

Therefore our goal is to find bi-partite cores (or dense blocks) using tensor decomposition. Decomposing a tensor implies computing the factor matrices for each mode of the tensor. The modes of the tensor in our problem space correspond to the seller, reviewer, product, rating, and time of review. The entities in the corresponding factor matrices that have relatively higher values indicate anomalies. Aggregating these anomalous entities across the modes of the tensor reveals bi-partite cores, where each core consists of a group of reviewers that provide similar ratings across a similar set of sellers (and their products), with all of these reviews occurring within a short contiguous interval of time.

Figure 1: Seller-Reviewer Data Anomaly: key signals.
Figure 2: CP tensor decomposition.

2.2.1 CP Tensor Decomposition

Let $\mathcal{X}$ be a 3-mode tensor. We can decompose it as

$$\mathcal{X} \approx \sum_{r=1}^{R} a_r \circ b_r \circ c_r \qquad (1)$$

where $a_r$, $b_r$, and $c_r$ are vectors (or rank-1 tensors) and $\circ$ represents the vector outer product. $R$ is called the rank of the tensor. This is called the CP decomposition and is a generalization of the matrix singular value decomposition. See Figure 2.
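As a concrete illustration of Equation (1), the following NumPy sketch rebuilds a small 3-mode tensor from its rank-1 components. The factor names `A`, `B`, `C`, the sizes, and the helper `cp_reconstruct` are illustrative assumptions and not notation taken from the paper.

```python
import numpy as np


def cp_reconstruct(A, B, C):
    """Sum over r of the outer products of column r of A, B and C."""
    return np.einsum("ir,jr,kr->ijk", A, B, C)


rng = np.random.default_rng(0)
R = 2                      # assumed rank for the illustration
A = rng.random((4, R))     # mode-1 factors, one column per component
B = rng.random((3, R))     # mode-2 factors
C = rng.random((5, R))     # mode-3 factors

X = cp_reconstruct(A, B, C)

# The same tensor built explicitly as a sum of R rank-1 tensors.
X_explicit = sum(
    np.multiply.outer(np.multiply.outer(A[:, r], B[:, r]), C[:, r])
    for r in range(R)
)
```

Each column triple contributes one rank-1 tensor, so dense blocks in the data tend to show up as large entries in the matching columns of the factor matrices.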

2.3 Bayesian Tensor Factorization

Bayesian tensor factorization models based on Poisson, Negative-Binomial and Logistic formulations (described below) have been proposed in (Schein et al., 2015; Hu et al., 2015; Rai et al., 2014, 2015a, 2015b) to infer multi-mode relations, and can also be applied to identify anomalies. All of these approaches are un-supervised except the Logistic CP decomposition model (Rai et al., 2015b), which can also leverage features (via side-information) while factorizing the tensor.

2.3.1 Bayesian Poisson Tensor Factorization

Equation 1 represents a deterministic decomposition. For count tensors, the authors in (Schein et al., 2015) make it probabilistic by assuming a Poisson likelihood and call their model Bayesian Poisson Tensor Factorization (BPTF). The generative model for BPTF is shown below:

The authors in (Schein et al., 2015) have implemented BPTF (an un-supervised model) using a batch algorithm, and we have leveraged it as one of the baselines in our experiments. Because the model is Poisson while the partially labeled target information is binary or real, it does not lend itself seamlessly to semi-supervised extensions.

2.3.2 Beta Negative-Binomial CP Model

Let be a -mode tensor of size . The authors in (Hu et al., 2015) assume a count tensor and hence define a Poisson likelihood. They call their model Beta Negative-Binomial CP decomposition (BNBCP). BNBCP decomposition of tensor into components is as below:

where vectors for denote rank-1 tensors. The generative model for BNBCP, shown below, has Gamma and Dirichlet priors assigned to and respectively. The Poisson-Gamma hierarchical construction for effectively results in a Negative Binomial likelihood model, which leads to better robustness against over-dispersed counts. Vectors in the simplex can be seen as “topics” over the entities in mode k. Rank R of can be inferred from the gamma-beta construction on ’s:

The BNBCP model is a fully conjugate model and inference is done using Variational Bayes (VB). To be able to scale for massive tensors, we have implemented online VB via Stochastic Variational Inference (SVI). We use this as one of the baselines in our experiments (un-supervised comparison only). This model can be seamlessly extended to achieve semi-supervised learning but the results were sub-par as compared with the results obtained from our semi-supervised Logistic CP model. Hence its semi-supervised results are not compared in this paper.

2.3.3 Logistic CP Decomposition

In Logistic CP tensor factorization framework (Rai et al., 2014, 2015a, 2015b) the decomposition is as follows:

where specifies the Bernoulli-Logistic function for the binary valued tensor. In mode , consider entity . Denote as the dimensional factor corresponding to entity .

  1. Gaussian priors are assigned to latent variables and for .

  2. The number of non-zero values of determines the rank of the tensor. The variance of the Gaussian prior for is controlled by a Multiplicative Gamma Process that has the property of reducing the variance as increases.

  3. Given the logistic formulation for the tensor decomposition, to obtain closed form updates the data is augmented via additional variables that are Pólya-Gamma distributed.

  4. We impose non-negativity constraint on for all modes .
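The Bernoulli-Logistic link for a single tensor entry can be sketched as follows. This is a minimal illustration that assumes a weight vector `lam` with one entry per component and one factor matrix per mode (all names hypothetical). It only evaluates the likelihood and does not reproduce the priors or the inference of Rai et al.

```python
import numpy as np


def entry_logit(lam, factors, index):
    """Weighted sum over components of the product of the factor rows."""
    rows = [U[i] for U, i in zip(factors, index)]   # one length-R row per mode
    return float(lam @ np.prod(rows, axis=0))


def entry_probability(lam, factors, index):
    """Probability that the binary tensor entry at `index` equals one."""
    return 1.0 / (1.0 + np.exp(-entry_logit(lam, factors, index)))


lam = np.array([1.0, 0.5])                      # component weights (assumed)
U1 = np.array([[1.0, 2.0], [0.0, 1.0]])         # mode-1 factor matrix
U2 = np.array([[1.0, 1.0], [2.0, 0.0]])         # mode-2 factor matrix
U3 = np.array([[0.5, 1.0]])                     # mode-3 factor matrix

logit = entry_logit(lam, [U1, U2, U3], (0, 1, 0))
prob = entry_probability(lam, [U1, U2, U3], (0, 1, 0))
```

For the index `(0, 1, 0)` the element-wise product of the three rows is `[1.0, 0.0]`, so the logit is 1.0 and the probability is the logistic function evaluated at 1.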

We use this un-supervised model as our base and have developed semi-supervised extensions to it as described in Section 3. Rai et al. (2014, 2015a, 2015b) propose using either sufficient statistics (in batch EM or as online EM) or Gibbs sampling (in batch) for inference. They claim that online EM reaches reasonably good quality solutions fairly quickly compared with the batch counterparts in most situations. However, their online EM does scalar updates of each latent variable that is inherently a vector of dimension , which may result in slower convergence. We propose using partial natural gradient learning in a stochastic setting for inference, which is both online in nature and performs vectorized updates for each of the latent variables.

Natural gradient learning requires computation of the Fisher information matrix (as well as its inverse) that is obtained by taking the expectation of the Hessian matrix w.r.t. the data. In most models, computing the Fisher information matrix is non-trivial and even if it can be computed, its inverse is usually not tractable. We show that the Pólya-Gamma data augmentation facilitates easy computation of the Fisher information matrix. To make the inverse of this matrix tractable, we exploit the quadratic structure of the loss function in each of the latent variables to be able to work with only the partial Fisher information matrix instead that is much smaller in size as compared with the full Fisher information matrix. Section 5 describes this in greater detail.

We empirically show that in the semi-supervised setting, on the test data, scalar updates of vector parameters in Online-EM are sub-optimal, i.e., they result in lower ROC-AUC and precision compared with the partial natural gradient algorithm. We also show that the partial natural gradient algorithm produces better ROC-AUC and precision than stochastic gradient learning in the semi-supervised setting, specifically in detecting abusive sellers. So far, we have not come across any literature that has applied partial natural gradient learning for inference in Bayesian CP tensor decomposition.

3 Semi-Supervised Logistic CP Decomposition

In this section we describe our semi-supervised extensions to the Logistic CP model for anomaly detection. We have prior information associated with a subset of the entities for at least one of the modes. This prior information is specified as a target (either binary or real) that corresponds to a specific type of abuse. Our framework is called “semi-supervised tensor decomposition” since the tensor decomposition is achieved by simultaneously incorporating the prior data i.e., the target information given for a subset of the entities of a mode(s). The intuition behind using the target information is that the patterns hidden in the known abusive entities could be leveraged to discover more entities that have similar signatures and with greater precision. Tensor decomposition with such target information is as follows:

  1. Target information is specified for at least one of the modes.

  2. Target information in each mode can either be real numbers or binary labels.

    1. If data is binary; then both positive and negative labels need to be specified (positive labels indicate abuse and negative labels indicate no-abuse).

    2. Data can be specified for only a subset of the entities in that mode (semi-supervised learning).

    3. The factorization of the tensor is achieved by taking the target information across the mode(s) into account.

This paper is concerned only with binary targets, and all our experiments have been performed using binary targets. The binary target for mode is specified as a matrix with rows, where denotes the number of entities that have binary labels specified such that for semi-supervised learning. The first column of the matrix consists of the entity identifiers and the second column consists of the corresponding binary labels.

Let denote the binary label (either or ) associated with entity in mode . CP decomposition of a tensor produces rank-1 tensors of length for each mode ; which can also be viewed as a factor matrix of dimension for mode . Consider entity in mode that has a binary label associated with it. For entity , denote its corresponding row in the factor matrix by where forms one instance of the explanatory variables. Note that is a dimensional vector, where each element of this vector is denoted by . Since entities in mode have binary labels associated with them, the design matrix consists of the corresponding rows of the factor matrix. Let denote the vector of coefficients with the bias denoted as for mode and let denote the vector of coefficients that includes the bias. The logistic formulation for semi-supervised learning is:

The coefficients () for mode are assigned Gaussian priors. To get closed form updates of the coefficients and factors; we introduce auxiliary variables denoted by that are Pólya-Gamma distributed for each entity in mode that has a binary label associated with it. The next section provides a detailed description of the modeling.
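To make the semi-supervised link concrete, the sketch below builds the design matrix from the rows of a factor matrix that carry labels and evaluates the logistic probabilities. The names `U`, `targets`, and `beta`, the 0/1 label coding, and all numbers are assumptions chosen for illustration.

```python
import numpy as np


def design_matrix(U, labeled_ids):
    """Rows of the factor matrix for labeled entities, with a leading 1 for the bias."""
    X = U[labeled_ids]
    return np.hstack([np.ones((X.shape[0], 1)), X])


def abuse_probability(U, labeled_ids, beta):
    """Logistic probability of the positive label for each labeled entity."""
    X = design_matrix(U, labeled_ids)
    return 1.0 / (1.0 + np.exp(-(X @ beta)))


# Factor matrix of one mode: 4 entities, 2 components (made-up values).
U = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])

# Target matrix: first column entity identifier, second column binary label.
targets = np.array([[0, 0], [2, 1]])
labeled_ids = targets[:, 0]

beta = np.array([0.0, 1.0, -1.0])   # bias first, then one coefficient per component
probs = abuse_probability(U, labeled_ids, beta)
```

Only the two labeled entities enter the design matrix, while the remaining rows of `U` are still shaped by the tensor data, which is what makes the learning semi-supervised.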

Note that the binary target is usually specified (and available) for only a small subset of the entities of mode . Hence the learning is semi-supervised. Based on the labels for a subset of entities, the tensor decomposition technique can infer the neighbors of these entities via information present in other modes. For example, if a seller is flagged as having review abuse, then the tensor decomposition technique would infer the reviewers associated with seller during some time where the density is high. This will facilitate detection of other sellers who are also connected to some subset of these same reviewers during the same time interval . These other sellers would then have a high probability of being abusive.

We apply an online (stochastic) algorithm to infer the values of all the latent variables of the semi-supervised Logistic CP model. The latent variables are of length , the matrix for each mode of dimension , and of length for each mode that has target information.

The next two sections describe the modeling and inference of the latent variables in a stochastic setting for the semi-supervised Logistic CP model.

4 Model Description for Latent Variables

The sub-sections below describe in detail the modeling for all the latent variables and provide the update equations for the hyper-parameters of the semi-supervised Logistic CP model.

4.1 Model for

has a Gaussian prior whose variance is determined via Multiplicative Gamma Process (Bhattacharya and David, 2011; Durante, 2016). It falls under the Adaptive Dimensionality Reduction technique, which adaptively induces sparsity and automatically deletes redundant parameters not required to characterize the data. The Multiplicative Gamma Process consists of a multiplicative Gamma prior on the precision of a Gaussian distribution that induces a multiplicative Inverse-Gamma prior on its variance parameter. However, its performance is sensitive to the hyper-parameter settings of the Inverse-Gamma prior, so the hyper-parameter values should not be chosen naively but set following strategies based on their probabilistic characteristics. The generative model for for in and the Multiplicative Gamma Process are:

where:

(2)

The idea is that for increasing r, should be decreasing in a probabilistic sense. In other words, the following stochastic order should be maintained with high probability, i.e.,

To guarantee such a stochastic order with a high probability, as suggested in (Durante, 2016), we need to set and and non-decreasing for all . Hence we choose the hyper-parameters values as follows:

The update for in iteration t is given by (see (Rai et al., 2014) for details):

(3)

From (2), we can calculate the s in iteration t for . Denote as the vector consisting of for .
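A minimal sketch of a Multiplicative Gamma Process prior is given below, assuming the standard construction in which the precision of component r is the cumulative product of Gamma-distributed factors. The shape parameters `a1` and `a2` are placeholder values, not the hyper-parameters chosen in this paper, and the function names are hypothetical.

```python
import numpy as np


def mgp_precisions(delta):
    """Precision of component r as the product of the first r multiplicative factors."""
    return np.cumprod(delta)


def sample_mgp(R, a1, a2, rng):
    """Draw factors, precisions and zero-mean Gaussian weights for R components."""
    delta = np.empty(R)
    delta[0] = rng.gamma(shape=a1, scale=1.0)
    delta[1:] = rng.gamma(shape=a2, scale=1.0, size=R - 1)
    tau = mgp_precisions(delta)
    lam = rng.normal(0.0, 1.0 / np.sqrt(tau))   # variance is the inverse precision
    return delta, tau, lam


# Deterministic check of the cumulative product.
tau_demo = mgp_precisions(np.array([2.0, 1.5, 3.0]))

# One random draw with assumed shape parameters.
delta, tau, lam = sample_mgp(R=5, a1=2.0, a2=3.0, rng=np.random.default_rng(0))
```

When the later factors tend to exceed one, the precisions grow with the component index, so the prior variance shrinks and higher-index components are pushed toward zero.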

Let for denote an element of . We introduce auxiliary variables (Pólya-Gamma distributed variables (Polson et al., 2013; Pillow and Scott, 2012)), denoted by for each element of the input tensor data, via data augmentation technique.

Consider a mini-batch defined at iteration as . Define for each ; where: denotes a vector consisting of elements for . Let be the matrix whose rows are for . Let for be the expected value of the auxiliary variable corresponding to the element of the input tensor data. The expected value of has a closed-form solution given by:
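For reference, the mean of a Pólya-Gamma variable PG(b, c) is (b / (2c)) tanh(c / 2) (Polson et al., 2013). The sketch below evaluates it for b = 1, the case that arises with a Bernoulli-Logistic likelihood. The function name `pg_mean` and the small-argument guard are illustrative assumptions.

```python
import numpy as np


def pg_mean(psi, b=1.0):
    """Mean of PG(b, psi): b * tanh(psi / 2) / (2 * psi), with limit b / 4 at psi = 0."""
    psi = np.asarray(psi, dtype=float)
    small = np.abs(psi) < 1e-8
    safe = np.where(small, 1.0, psi)            # avoid dividing by zero
    mean = b * np.tanh(safe / 2.0) / (2.0 * safe)
    return np.where(small, b / 4.0, mean)


omega_hat = pg_mean(np.array([0.0, 2.0, -2.0]))
```

The mean is symmetric in its argument and largest at zero, so entries whose logit is close to zero receive the largest weight in the quadratic objective.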

The update for in iteration t, with the current mini-batch , is obtained by maximizing the natural logarithm of the posterior distribution of given by:

(4)

where and . The operator represents element-wise division between the two vectors and . Note that (4) is a quadratic equation in i.e., and hence has a closed-form update.

4.2 Model for for mode

The generative model for the coefficients for is:

The Inverse-Gamma hyper-prior on the variance parameter of the Gaussian distribution provides adaptive L2-Regularization. The parameters of the Inverse-Gamma distribution ( and ) have been chosen so as to provide just the right amount of regularization for the coefficients .

Let denote the number of entities in mode that have binary labels. Let denote prepended with , to account for the bias. The logistic function i.e., likelihood corresponding to entity with label is given by:

With introduction of Pólya-Gamma variables (Polson et al., 2013; Pillow and Scott, 2012) denoted by for task ; the joint likelihood corresponding to entity that includes the data augmented variable becomes:

where:

The update for in iteration t is obtained by maximizing the natural logarithm of the posterior distribution of given by:

(5)

Equation (5) is a quadratic equation in and hence has a closed-form update. The operator represents element-wise division between the two vectors and .

Subsequently, the update for at time step is given by:

(6)

4.3 Model for for mode

The generative model for the factors for is:

The Inverse-Gamma hyper-prior on the variance parameter of the Gaussian distribution provides adaptive L2-Regularization. The Inverse-Gamma parameters are set so that a greater amount of regularization is provided for the mode that has target information than for the mode that does not have target information, hence we set that results in minimal regularization. The reason is that the factors corresponding to the mode with target information also act as co-variates (explanatory variables) in predicting the binary target. Note that (from the previous sub-section) we apply the same amount of regularization to the corresponding coefficients .

Denote as the dimension vector consisting of factors corresponding to entity in mode and . Define for each and ; where each element of is for . Let be the matrix whose rows are .

Mode without target information: The update for in iteration t is obtained by maximizing the natural logarithm of the posterior distribution of given by:

(7)

where is the vector of Lagrange multipliers for the non-negativity constraint on the . The operator represents element-wise division between the two vectors and .

Equation (7) is a quadratic equation in with non-negativity constraints and hence has a closed-form update.

Subsequently the update for at time step is given by:

(8)

Mode with binary target information: Given binary target information for mode , let denote the binary label (either or ) for entity in mode . Let correspond to the data augmented variable that is Pólya-Gamma distributed for the entity .

The update for in iteration t is obtained by maximizing the natural logarithm of the posterior distribution of given by:

(9)

Note that (9) does not have the non-negativity constraint since we are training a Logistic model using the binary target information to detect abusive entities. The operator represents element-wise division between the two vectors and . Also, (9) is a quadratic equation in and hence has a closed-form update.

Subsequently the update for at time step is given by:

(10)

5 Partial Natural Gradients: Inference

Natural gradient learning (or in the context of online learning) is defined in (Amari, 1998). Natural gradient learning is an optimization method that is traditionally motivated from the perspective of information geometry and works well for many applications as an alternative to stochastic gradient descent, see (Martens, 2014). Natural gradient descent is generally applicable to the optimization of probabilistic models and uses the natural gradient in place of the standard gradient. It has been shown that in many applications, natural gradient learning seems to require far fewer total iterations than gradient descent, making it a potentially attractive alternative method. However, for models with very many parameters, computing the natural gradient is impractical since it requires computing the inverse of a large matrix, i.e., the Fisher information matrix. This problem has been addressed in prior works by using one of various approximations to the Fisher that are designed to be easier to compute, store and invert than the exact Fisher.

Given that our model also has a lot of parameters, we have addressed this problem by exploiting the problem structure that facilitates working with partial Fisher (i.e., computing partial natural gradients) as described subsequently. We apply partial natural gradient learning to update the values of the latent variables of the semi-supervised Logistic CP model. Partial natural gradients implies that we only work with diagonal blocks of the Fisher information matrix instead of the full Fisher information matrix. Note that the latent variables in the semi-supervised model that we need to infer are of length , matrix for each mode of dimension and of length for each mode that has target information.

Natural gradient update in iteration is defined as:

(11)

where in (11) is given by:

Where in (11) indicates the inversion of the Fisher information matrix (square matrix) in each iteration whose size could possibly be in the tens of thousands or more depending on the data. This impacts scalability i.e., could result in very expensive computations and might also pose numerical stability issues leading to an intractable inverse computation.

Figure 3: The block diagonal terms in the Fisher information matrix are strictly positive definite and are computationally easy to invert.

To circumvent this; we exploit the problem structure by noting the following:

  1. Loss function is quadratic in each of the arguments ().

  2. This leads to a simpler approximation of the Fisher information matrix i.e., it facilitates working with the partial blocks (i.e., diagonal blocks) of the Fisher information matrix as shown in Figure 3.

Due to the individually quadratic nature of the loss functions, each diagonal block is a symmetric positive definite matrix of size for and or for . Hence the basic convergence guarantees for full natural gradient learning extend to the partial setup as well; see (Bottou et al., 2016). We note that computation of the partial Fisher information matrix is theoretically and numerically tractable as we are dealing with square matrices of size or (), which is very small (value less than ) in our problem space.

Refer to the supplementary material for the detailed derivations of the partial Fisher information matrices as well as the gradients for each of the arguments, namely, , and .

The partial natural gradients are obtained when the corresponding gradient is multiplied by the inverse of the corresponding partial Fisher information matrix. Algorithm 1 presents the pseudo-code for the semi-supervised CP tensor decomposition using natural gradients.
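The idea of preconditioning a gradient with a small per-block matrix can be sketched for the coefficient block of a Pólya-Gamma augmented logistic model. This is an illustration under stated assumptions, not the derivation from the appendix: labels are coded 0/1, the prior is an isotropic Gaussian with variance `prior_var`, and the curvature of the augmented quadratic objective is used as a stand-in for the partial Fisher block. All names are hypothetical.

```python
import numpy as np


def pg_mean(psi):
    """Mean of PG(1, psi): tanh(psi / 2) / (2 * psi), with limit 1/4 at psi = 0."""
    psi = np.asarray(psi, dtype=float)
    small = np.abs(psi) < 1e-8
    safe = np.where(small, 1.0, psi)
    return np.where(small, 0.25, np.tanh(safe / 2.0) / (2.0 * safe))


def partial_natural_gradient_step(beta, X, y, prior_var, step):
    """One preconditioned step on the coefficient block only."""
    psi = X @ beta
    omega = pg_mean(psi)                         # expected auxiliary variables
    kappa = y - 0.5
    # Gradient of the augmented negative log posterior, quadratic in beta.
    grad = X.T @ (omega * psi - kappa) + beta / prior_var
    # Small symmetric positive definite block, cheap to solve against.
    block = X.T @ (omega[:, None] * X) + np.eye(beta.size) / prior_var
    return beta - step * np.linalg.solve(block, grad)


X = np.array([[1.0, 0.5], [1.0, -1.0], [1.0, 2.0], [1.0, -0.3]])  # bias column first
y = np.array([1.0, 0.0, 1.0, 0.0])
beta0 = np.zeros(2)
beta1 = partial_natural_gradient_step(beta0, X, y, prior_var=10.0, step=1.0)
```

With a step size of one the update jumps to the minimizer of the quadratic in this block, and the linear solve involves a matrix whose size depends only on the number of coefficients, not on the number of entities.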

6 Experiments

For all our experiments, we have chosen the following values for the learning rate parameters and . We have chosen the following values for the parameters of the Inverse Gamma distributions: , and . We have set the mini-batch size to in all our simulations. These values have been chosen by performing 5-fold cross-validation on a validation set.

Dataset

The binary tensor, i.e., the input data, corresponds to a random sampling of the products in the Amazon review data until October 2017. The modes of the tensor are reviewer ID, product ID, seller ID, rating and time. Note that rating corresponds to an integer between 1 and 5. Time is converted to a week index. The tensor consists of millions of entries, where each entry of the tensor represents a unique association that corresponds to a reviewer (buyer) rating a product from seller with a rating at time .
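A small sketch of how review records could be turned into index tuples of a binary tensor is shown below. The record layout, the identifiers, and the choice of the earliest review date as the origin of the week index are assumptions made for illustration, since the exact preprocessing is not specified here.

```python
from datetime import date


def week_index(day, origin):
    """Number of whole weeks elapsed between the origin date and the given day."""
    return (day - origin).days // 7


def build_index(values):
    """Map each distinct identifier to a consecutive integer."""
    return {v: i for i, v in enumerate(sorted(set(values)))}


# (reviewer, product, seller, rating, review date): made-up records.
reviews = [
    ("rev_a", "prod_1", "seller_x", 5, date(2017, 1, 2)),
    ("rev_b", "prod_1", "seller_x", 5, date(2017, 1, 10)),
    ("rev_a", "prod_2", "seller_y", 1, date(2017, 1, 2)),
]

origin = min(r[4] for r in reviews)
reviewer_ix = build_index(r[0] for r in reviews)
product_ix = build_index(r[1] for r in reviews)
seller_ix = build_index(r[2] for r in reviews)

# Each tuple marks one non-zero entry of the 5-mode binary tensor.
entries = {
    (reviewer_ix[rv], product_ix[p], seller_ix[s], rating - 1, week_index(d, origin))
    for rv, p, s, rating, d in reviews
}
```

Storing only the index tuples of the non-zero entries keeps the representation sparse, which matters when the tensor has millions of entries.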

6.1 Detection of Abusive Sellers

We have partial ground truth data consisting of a small number of sellers; who have been actioned against for being guilty of review abuse - treated as positively labeled samples. To this we have included sellers who currently are not flagged with any kind of abuse - treated as negatively labeled samples. The negatively labeled sellers are roughly three times the number of positively labeled sellers. This forms the training set. We have set aside an additional set of around sellers as test set to measure the performance of the un-supervised and semi-supervised tensor decomposition techniques. The test set has similar distribution of positive and negative samples; where the abusive sellers (positively labeled set) have been identified in November and December of 2017 i.e., identified beyond the training time period.

  1. Randomly initialize , and .

  2. Set the step-size schedule appropriately.

  3. repeat

    1. Sample (with replacement) mini-batch from the training data.

    2. For set
        –  for
        –  where
        –  .

    3. For set
        –  for
        –  For entity with binary target information :
             where
        –  Compute gradient and partial Fisher information matrix w.r.t. and update using (11).
        –  Update using (6).
        –  Compute gradient and partial Fisher information matrix w.r.t. and update using (11).
        –  Update using (8) or (10).

    4. Compute gradient and partial Fisher information matrix w.r.t. and update using (11).

    5. Update using (2).

  4. until forever

  Algorithm 1 Partial Natural Gradient.
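The outer loop of Algorithm 1 can be sketched as a generic stochastic-approximation driver. The schedule form `(tau0 + t)^(-kappa)`, its default values and the function names are assumptions for illustration; the source only says that the step-size schedule is set appropriately. The `update` callback stands in for the model-specific updates, which are not reproduced here.

```python
import random

def step_size(t, tau0=1.0, kappa=0.7):
    """Robbins-Monro style schedule (tau0 + t)^(-kappa); an assumed form."""
    return (tau0 + t) ** (-kappa)

def sample_minibatch(data, batch_size, rng):
    """Draw a mini-batch with replacement, as in the sampling step."""
    return [rng.choice(data) for _ in range(batch_size)]

def run(data, update, n_iters, batch_size, seed=0):
    """Call `update(batch, rho)` once per iteration with a decaying step size."""
    rng = random.Random(seed)
    for t in range(n_iters):
        batch = sample_minibatch(data, batch_size, rng)
        update(batch, step_size(t))

# Toy usage: track a running mean with the same stochastic-approximation loop.
state = {"mean": 0.0}

def update(batch, rho):
    batch_mean = sum(batch) / len(batch)
    state["mean"] += rho * (batch_mean - state["mean"])

run([1.0, 2.0, 3.0], update, n_iters=200, batch_size=8)
```

With `kappa` in (0.5, 1] the step sizes sum to infinity while their squares have a finite sum, which is the usual condition for such schedules.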

    Table 1 compares the relative performance of our semi-supervised approach with the un-supervised approaches, namely BNBCP (Hu et al., 2015), BPTF (Schein et al., 2015), and M-Zoom (Shin et al., 2016), on Precision, Recall and ROC-AUC measured on the test set. Among the un-supervised techniques, M-Zoom and BPTF are batch algorithms while BNBCP uses stochastic variational inference (our implementation). All three flavors of our semi-supervised Logistic CP model implementation are stochastic in nature. Our proposed natural gradient based implementation is the only one that requires inverting a matrix in each iteration; hence it is approximately % slower than the implementation based on online EM with sufficient statistics, which is the fastest among the three. The stochastic gradient based implementation is approximately % slower than the implementation based on online EM with sufficient statistics.

    For any column, the metric with the highest value is set at 100. All other values are computed relative to this highest value in percent. Precision and Recall are measured on the test set taking the top sellers computed from each method, where corresponds to the number of unique sellers associated with the top blocks of M-Zoom output. ROC-AUC is computed on the test set. M-Zoom does not assign a score for each entity across the tensor modes but rather produces anomalous sub-tensors (i.e., blocks) as output. Hence ROC-AUC cannot be easily computed for M-Zoom.
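The evaluation protocol described above can be sketched as follows: precision and recall are taken over the top-scoring entities, and each metric column is then rescaled relative to its best value. The function names, the toy scores and the dictionary-based inputs are assumptions for illustration only.

```python
def precision_recall_at_k(scores, labels, k):
    """Precision and recall among the k highest-scoring entities.

    scores: dict entity -> anomaly score; labels: dict entity -> 0/1.
    """
    top = sorted(scores, key=scores.get, reverse=True)[:k]
    hits = sum(labels[e] for e in top)
    positives = sum(labels.values())
    return hits / k, hits / positives

def relative_percent(values):
    """Rescale a metric column so that its best value reads 100."""
    best = max(values.values())
    return {name: round(100.0 * v / best, 1) for name, v in values.items()}

scores = {"a": 0.9, "b": 0.8, "c": 0.4, "d": 0.1}
labels = {"a": 1, "b": 0, "c": 1, "d": 0}
precision, recall = precision_recall_at_k(scores, labels, k=2)
relative = relative_percent({"method_1": 0.45, "method_2": 0.50})
```

Reporting relative values hides the absolute metric levels while still preserving the ranking of the methods within each column.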

    Setting          Method                               Precision  Recall    AUC
    Un-Supervised    M-Zoom (Shin et al., 2016)                83.8    83.3      -
    Un-Supervised    BPTF (Schein et al., 2015)                97.2    88.5   90.0
    Un-Supervised    BNBCP (Hu et al., 2015)                   93.3    78.2   78.8
    Un-Supervised    Logistic CP [Natural Gradient]            94.1    68.7   77.6
    Semi-Supervised  Logistic CP [Sufficient Statistics]       83.3    92.3   91.0
    Semi-Supervised  Logistic CP [Stochastic Gradient]         88.9    94.9   92.2
    Semi-Supervised  Logistic CP [Natural Gradient]           100.0   100.0  100.0
    Table 1: Abusive sellers: un-supervised and semi-supervised results - relative performance.

    All three flavors of our semi-supervised Logistic CP implementation have higher recall and AUC than the un-supervised techniques. The best performing un-supervised method for AUC is BPTF, and its value is % lower than that of our semi-supervised approach that uses partial natural gradient for inference. The best performing un-supervised method for precision is also BPTF, and its value is around % lower than that of our semi-supervised approach that uses partial natural gradient. The precision, recall and AUC for the semi-supervised approach that uses Online-EM with sufficient statistics (scalar updates) are lower than those of the other two semi-supervised approaches, indicating that scalar updates of vector parameters result in sub-optimal solutions.

    6.2 Detection of Abusive Reviewers

    We have partial ground truth data of roughly % of reviewers who have been actioned against for being guilty of paid reviewer abuse; these are treated as positively labeled samples. To this we have added an almost equal number of reviewers who had the lowest scores from the un-supervised tensor decomposition model; these are treated as negatively labeled samples. This forms the training set. We have an additional set of reviewers (roughly 4200) that is treated as our test set. In the test set, roughly % of the reviewers are labeled as positive, i.e., identified as being guilty of paid reviewer abuse in the months of November and December 2017, to which we have added an almost equal number of reviewers that are labeled as negative.

    Table 2 compares the relative performance of our semi-supervised approach with the un-supervised approaches, namely BNBCP (Hu et al., 2015), BPTF (Schein et al., 2015), and M-Zoom (Shin et al., 2016), on Precision, Recall and ROC-AUC. Precision and Recall are measured on the test set taking the top reviewers computed from each method, where corresponds to the number of unique reviewers associated with the top blocks of M-Zoom output. ROC-AUC is computed on the test set for all techniques except M-Zoom, for the reason stated in the previous subsection.
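ROC-AUC needs a score for every test entity, which is why it is reported only for methods that produce per-entity scores. A minimal rank-based sketch is shown below; the function name and the toy data are assumptions, and the quadratic pairwise loop is chosen for clarity rather than speed.

```python
def roc_auc(scores, labels):
    """ROC-AUC as the probability that a random positive outranks a random
    negative, counting ties as one half (Mann-Whitney U formulation)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

auc = roc_auc([0.9, 0.8, 0.4, 0.1], [1, 0, 1, 0])
```

In the toy call, three of the four positive-negative pairs are ordered correctly, so the result is 0.75.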

    Setting          Method                               Precision  Recall    AUC
    Un-Supervised    M-Zoom (Shin et al., 2016)                97.2    91.3      -
    Un-Supervised    BPTF (Schein et al., 2015)                99.7    97.8   96.7
    Un-Supervised    BNBCP (Hu et al., 2015)                   99.1    87.1   89.6
    Un-Supervised    Logistic CP [Natural Gradient]            97.4    87.6   90.8
    Semi-Supervised  Logistic CP [Sufficient Statistics]       99.4    79.3   96.2
    Semi-Supervised  Logistic CP [Stochastic Gradient]        100.0   100.0   99.2
    Semi-Supervised  Logistic CP [Natural Gradient]            99.6    98.5  100.0
    Table 2: Abusive reviewers: un-supervised and semi-supervised results - relative performance.

    Our semi-supervised method with stochastic gradient or partial natural gradient learning has higher recall and AUC than all the un-supervised techniques. However, the gain realized here is not as significant as the gain obtained in detecting abusive sellers when compared with the best performing un-supervised methods, namely BPTF and M-Zoom. The reason is that seller behavior for a given product in the time period related to review abuse is more homogeneous, i.e., the product ratings obtained during the abusive period from some common pool of reviewers are statistically higher than normal. On the other hand, a reviewer may be colluding with more than one seller to provide fake reviews and at the same time may be a normal buyer of products from a different set of sellers. This makes the training data for a reviewer noisier, so it provides less information to the semi-supervised methodology, resulting in smaller gains in precision, recall and AUC. There is very little difference in performance between stochastic gradient and natural gradient; in fact, stochastic gradient shows slightly better precision and recall than the natural gradient method. The recall for the semi-supervised approach that uses Online-EM with sufficient statistics (scalar updates) is lower than that of all the un-supervised approaches, indicating that scalar updates of vector parameters result in sub-optimal solutions.

    6.3 Multi-mode Binary Target Information

    Our framework supports providing binary target information to more than one mode of the tensor simultaneously. Hence we also experimented with simultaneously providing the binary target data for both the seller and reviewer modes. However, since the binary target information for the reviewer mode is noisy, the overall performance in detecting new sellers and reviewers was almost identical to the corresponding cases where we provide the binary target data to only one of the respective modes.

    6.4 Partial Natural Gradient versus Baselines

    We apply our semi-supervised Logistic CP tensor decomposition approach to the Amazon review data (5-mode tensor) to compare the performance of two baseline learning methods, namely sufficient statistics (Online-EM) and stochastic gradient, with our proposed partial natural gradient learning. Figure 4 shows the AUC plot for detecting abusive sellers and abusive reviewers across six epochs. Red (solid) corresponds to natural gradient learning, blue (dash-dot) to stochastic gradient learning, and black (dash) to online EM with sufficient statistics. We make the following observations:

    1. Stochastic gradient learning has a tendency to over-train. The reason is that it is unable to shrink some of the values of towards zero, given that the tensor rank is less than .

    2. Partial natural gradient learning does not suffer from significant over-training since it is able to shrink 60% of the values of towards zero within the first one thousand iterations. This leads to more similar AUC on the train and test data sets for detecting abusive sellers than with the baselines. For detecting abusive reviewers, however, the anomalous associations between the entities of the tensor mode in the test data are not necessarily indicative of abuse; hence the train and test AUC for all three learning methods are much further apart.

    3. For detecting abusive reviewers, at the end of six epochs, the test AUC is almost identical for partial natural gradient and stochastic gradient learning. However, for detecting abusive sellers, the test AUC is almost 8% lower with stochastic gradient learning than with partial natural gradient learning.

    4. For detecting abusive sellers, partial natural gradient shows better learning on test data than the baselines. For detecting abusive reviewers, though the AUC at the end of six epochs is similar for partial natural and stochastic gradient approaches, the AUC curve for partial natural gradient learning is much smoother as compared with the AUC curve from stochastic gradient learning.

    5. Online-EM with sufficient statistics shows poorer performance on test data (for both reviewers and sellers) than stochastic gradient or partial natural gradient learning.
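Observations 1 and 2 turn on how many values are shrunk towards zero during training. One simple way to monitor this is sketched below, under the assumption that the shrunk quantities are per-component weights of the CP decomposition; the tolerance `tol` and both function names are hypothetical and not taken from the source.

```python
def shrunk_fraction(weights, tol=1e-3):
    """Fraction of component weights whose magnitude is below `tol`."""
    return sum(abs(w) < tol for w in weights) / len(weights)

def effective_rank(weights, tol=1e-3):
    """Number of components that remain active after shrinkage."""
    return sum(abs(w) >= tol for w in weights)

# Toy weights: three of the five components have been shrunk to (near) zero.
lam = [0.8, 1e-5, 0.3, 2e-6, 0.0]
fraction = shrunk_fraction(lam)
rank = effective_rank(lam)
```

Logging these two numbers every few hundred iterations would show whether a learner prunes unused components early or keeps fitting with all of them.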

    (a) Train data: sellers. (b) Train data: reviewers.
    (c) Test data: sellers. (d) Test data: reviewers.
    Figure 4: Efficiency of partial natural gradient learning in identifying abusive sellers & reviewers.

    7 Conclusions and Future Work

    We have formulated anomaly detection as a semi-supervised binary tensor decomposition problem that can simultaneously incorporate binary target information for a subset of sellers and/or reviewers - based on the Logistic Model with Pólya-Gamma data augmentation. We have proposed natural gradient learning for inference of all the latent variables of the semi-supervised model and shown that the Pólya-Gamma formulation simplifies calculation of the partial Fisher information matrix. Our results have demonstrated that the proposed semi-supervised approach beats the state of the art unsupervised baselines and that our inference using partial natural gradient learning has shown better learning than online EM (using sufficient statistics) or stochastic gradient learning on test data sets from the time period that is non-overlapping with the training time period.

    Future work: Our framework can be easily extended to do multi-target learning for each mode. For example, instead of considering only one form of abuse for the seller mode, in the multi-target domain we can simultaneously incorporate label information for different types of abuse. In such a setting, each binary target in a given mode corresponds to one type of abuse. We hypothesize that this would increase the precision/AUC of predicting new abusive entities since we can borrow information by accounting for the correlation across multiple forms of abuse. Such multi-target learning could also be applied towards improving the performance of a recommender system. For example, in the movie recommendation domain, we have the publicly available MovieLens data set. The MovieLens data consists of associations (hence can be considered as a binary tensor), where each association is a tuple of person, movie, time and rating. In this setting, we could incorporate the gender, age-band and occupation type of a movie goer as multiple binary targets for the person mode. We could incorporate the genre of the movie as multiple binary targets for the movie mode. We hypothesize that such a semi-supervised CP tensor decomposition could result in recommending better movie choices that the movie goer might be interested in watching in the near future.

    Appendix A Computation of Partial Fisher information matrix

    Partial Fisher information matrix w.r.t.

    Consider the exponent of the function to be maximized w.r.t. , which is:

    (12)

    where and . And the operator represents element-wise division between the two vectors & .

    The first exponential term of the right hand side (RHS) of (12) is the joint conditional likelihood of the binary outcome , denoted as and the conditional likelihood of the Pólya-Gamma distributed variable (from data augmentation) denoted as . The second exponential term of the RHS of (12) is the Gaussian prior on with variance . The stochastic natural gradient ascent-style updates for at step are given as:

    (13)

    where in (13) is given by the following equation:

    in (13) denotes the inverse of the partial Fisher information matrix whose size is since the second order derivatives are computed only w.r.t. .

    Note that the joint likelihood term in (12) is un-normalized. The partial Fisher information matrix is computed only w.r.t. the data i.e., . To do this, we first compute the expectation w.r.t. the data augmented variable . The resulting equation is a Logistic function from the following identity:

    (14)

    There is a closed-form solution for the integral in (14), hence we obtain:

    (15)
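For reference, the standard Pólya-Gamma identity and its closed-form integral are written out below in generic notation. The symbols $\psi$ (linear predictor), $\omega$ (augmentation variable), $a$, $b$ and $\kappa$ are assumptions of this sketch and may differ from the notation of (14) and (15); the last line specializes to a single binary outcome $y$ (that is, $a = y$, $b = 1$).

```latex
\begin{align*}
\frac{(e^{\psi})^{a}}{(1+e^{\psi})^{b}}
  &= 2^{-b}\, e^{\kappa\psi} \int_{0}^{\infty} e^{-\omega\psi^{2}/2}\, p(\omega)\, d\omega,
  \qquad \kappa = a - \tfrac{b}{2},\quad \omega \sim \mathrm{PG}(b, 0), \\
\int_{0}^{\infty} e^{-\omega\psi^{2}/2}\, p(\omega)\, d\omega
  &= \cosh^{-b}\!\left(\tfrac{\psi}{2}\right), \\
p(y \mid \psi)
  &= \frac{e^{(y-1/2)\psi}}{2\cosh(\psi/2)}
   = \frac{e^{y\psi}}{1+e^{\psi}}, \qquad y \in \{0, 1\}.
\end{align*}
```

The final equality follows from $2\cosh(\psi/2) = e^{\psi/2} + e^{-\psi/2}$, which is how the augmented likelihood collapses back to the logistic function once the augmentation variable is integrated out.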

    Equation (15) is a normalized likelihood for each and let it be denoted by . Using the definition of hyperbolic cosine, (15) becomes:

    To this end, the partial Fisher Information with respect to for and is given by:

    where denotes the matrix whose rows are for and denotes the diagonal matrix whose diagonal elements are ; where:

    The prior term is accounted for by considering its variance as a conditioner, hence the conditioned partial Fisher Information matrix for the parameter at step is given by:

    (16)

    where diag denotes inverse of a diagonal matrix whose diagonal is .
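The structure described above, a data term built from a matrix and a diagonal weight matrix plus the inverse of a diagonal prior-variance matrix, has the same form as the Fisher information of a logistic model with a Gaussian prior. The NumPy sketch below shows that generic case. It is not the paper's update (13): the rows of `X` are arbitrary covariate vectors rather than products of tensor factors, and all names (`prior_var`, `rho`, the toy data) are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def conditioned_partial_fisher(X, beta, prior_var):
    """X^T diag(p(1-p)) X plus the inverse prior variance on the diagonal."""
    p = sigmoid(X @ beta)
    fisher = X.T @ (X * (p * (1.0 - p))[:, None])
    return fisher + np.diag(1.0 / prior_var)

def natural_gradient_step(X, y, beta, prior_var, rho):
    """One ascent step: beta + rho * F^{-1} grad of the log posterior."""
    grad = X.T @ (y - sigmoid(X @ beta)) - beta / prior_var
    F = conditioned_partial_fisher(X, beta, prior_var)
    # Solve F d = grad instead of forming the explicit inverse of F.
    return beta + rho * np.linalg.solve(F, grad)

# Toy data drawn from a logistic model with known coefficients.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_beta = np.array([1.0, -2.0, 0.5])
y = (rng.random(200) < sigmoid(X @ true_beta)).astype(float)

beta = np.zeros(3)
prior_var = np.full(3, 10.0)
for _ in range(60):
    beta = natural_gradient_step(X, y, beta, prior_var, rho=0.5)
```

Using `np.linalg.solve` avoids an explicit matrix inverse, but each step still costs a linear solve in the parameter dimension, which is the per-iteration overhead the experiments attribute to the natural gradient implementation.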

    Partial Fisher information matrix w.r.t. for mode

    The function to be maximized w.r.t. is:

    (17)

    where the operator represents element-wise division between the two vectors & .

    Let be the matrix whose rows are for for mode with target information. We compute the partial Fisher information matrix similar to as:

    where the is the diagonal matrix whose diagonal elements are where:

    Partial Fisher information matrix w.r.t. for mode : Without Target Information

    The function to be maximized w.r.t. factor is:

    (18)

    where is the vector of Lagrange multipliers for the non-negativity constraint on the . And the operator represents element-wise division between the two vectors & .

    Similarly we can compute the partial Fisher information for corresponding to element in mode as:

    where is a matrix whose rows are for and is the diagonal matrix whose diagonal elements are given by:

    Partial Fisher information matrix w.r.t. for mode : With Binary Target Information

    The function to be maximized w.r.t. factor is: