Subgroup Fairness in Graph-based Spam Detection

by   Jiaxin Liu, et al.

Fake reviews are prevalent on review websites such as Amazon and Yelp. GNN is the state-of-the-art method that can detect suspicious reviewers by exploiting the topologies of the graph connecting reviewers, reviews, and target products. However, the discrepancy in the detection accuracy over different groups of reviewers causes discriminative treatment of different reviewers of the websites, leading to less engagement and trustworthiness of such websites. The complex dependencies over the review graph introduce difficulties in teasing out subgroups of reviewers that are hidden within larger groups and are treated unfairly. There is no previous study that defines and discovers the subtle subgroups to improve equitable treatment of reviewers. This paper addresses the challenges of defining, discovering, and utilizing subgroup memberships for fair spam detection. We first define a subgroup membership that can lead to discrepant accuracy in the subgroups. Since the subgroup membership is usually not observable while also important to guide the GNN detector to balance the treatment, we design a model that jointly infers the hidden subgroup memberships and exploits the membership for calibrating the target GNN's detection accuracy across subgroups. Comprehensive results on two large Yelp review datasets demonstrate that the proposed model can be trained to treat the subgroups more fairly.




1. Introduction

Graphs are widely used in various applications, such as crime forecasting systems (Jin et al., 2020; Wang et al., 2020a), recommendation systems (Wu et al., 2020; Ying et al., 2018), social networks, and spam detection (Liu et al., 2020; Dou et al., 2020b). For such non-Euclidean data, Graph Neural Networks (GNNs) are powerful architectures for graph representation and learning. Because these applications are involved in our day-to-day decision-making, any strong bias in a GNN’s output can harm society. Therefore, in addition to improving accuracy, there is growing interest in enhancing fairness in GNNs (Li et al., 2020; Spinelli et al., 2021; Ma et al., 2021; Dai and Wang, 2021; Agarwal et al., 2021). Unfortunately, existing spam detection works (Dou et al., 2020a, b; Li et al., 2019; Wang et al., 2019; Wu et al., 2020) focus exclusively on the accuracy of the fraud detectors and ignore model fairness. We therefore study the fairness of GNNs in the context of the spam detection task on review graphs (Luca, 2016; Rayana and Akoglu, 2015) that contain reviewer, review, and product nodes.

Generally, fairness issues on graphs arise from two sources: (a) bias in the node features (Kang et al., 2020; Rahman et al., 2019) and (b) topology bias coming from the edge connections (Jiang et al., 2022; Spinelli et al., 2021; Li et al., 2020; Chen et al., 2021). In Figure 1, we give an example of a review graph in which an edge indicates that a user leaves a review for a product. Ideally, given the review graph, the detector infers the suspiciousness of each review and assigns higher suspiciousness to spams than to non-spams based on their review features. In practice, because spammers are anonymous, we have no access to user traits such as gender, age, and race. Therefore, it is unrealistic to divide nodes into groups based on node features and enforce statistical parity on each group.

Since our spam detection task is under the transductive setting, node degrees, which represent the number of reviews associated with each user in our graph, are observable for both the training and test sets. Previous work (Burkholder et al., 2021) reveals that users or products with fewer reviews can have a higher chance of being screened by GNN detectors. This unfairness is caused by the long-tail distribution of user node degrees. According to the computation graphs (see the right side of Figure 1), if we compare a spam posted by a high-degree user with a spam posted by a low-degree user, the spam signal of the former is diluted by the user’s other non-spam reviews as messages pass from the bottom up through the aggregation operation of the GNN. Therefore, old businesses can reduce their suspiciousness and evade detection by hiding their latest spam reviews among many past non-spam reviews, which is unfair to the many new businesses with few reviews. Accordingly, in this work we use a degree-based sensitive attribute to define the “protected group” as users who post fewer reviews than a certain threshold and the “favored group” as the remaining reviewers.

Figure 1. Left: graph-based detector (GNN) infers the suspiciousness for reviews in a graph; right: computational graphs for the GNN on spam reviews coming from different groups.

Unfortunately, unfairness also exists inside the “favored” group, where the detector treats users vastly differently. Based on the analysis above, whether a user can reduce its suspiciousness depends on whether it can weaken its spam signal by aggregating signals from its own genuine reviews. For example, in Figure 1, reviews posted by two high-degree users, a “mixed” user who posts both spams and non-spams and a “pure” user who posts spams only, will be treated differently. Since the pure user posts spam reviews only, the GNN will not downgrade its suspiciousness, no matter which spam is at the root of the computation graph. For the mixed user, on the other hand, the detector will be deceived by the non-spams. Therefore, to achieve higher accuracy on the favored group, the GNN will target “pure” users. In other words, if we force the detector to be fair with respect to the degree-based sensitive attribute only, fairness between mixed and pure users is neglected. This is an instance of Simpson’s paradox caused by aggregation bias: we wrongly assume that the detector assigns a low suspiciousness to all users from the “favored” group, which neglects the discrepancy among these users. Thus, it is critical to divide the group of high-degree users into two subgroups, “mixed users” and “pure users”, and to maintain group fairness and subgroup fairness at the same time. Solving this problem poses several challenges:

Define the subgroup correctly on the graph. Most recent works (Ustun et al., 2019; Kearns et al., 2019, 2018) study combinations of different sensitive attributes that are observable and can divide the dataset into various groups. Others (Celis et al., 2021; Awasthi et al., 2020; Mehrotra and Celis, 2021) work on the setting where the sensitive attribute is unknown or contains noisy information. Chen et al. (2019) built probabilistic models to approximate the true sensitive attribute using proxy information, such as inferring race from last name and geolocation. All these methods use i.i.d. vector data whose sensitive attributes, such as race and gender, are already defined. Unlike the above works, we handle graph data where proxies are not observed. Moreover, we consider a sensitive attribute that is related to the ground-truth label and specific to spam detection on graphs, and that has not been formally defined.

Infer unknown subgroup membership. We would like to improve subgroup fairness inside the favored group between “pure” users, who post only spam reviews, and “mixed” users, who post both spam and non-spam reviews. In fact, we obtain the precise sensitive attribute of a user only after fully observing the labels of all its reviews, which is unattainable for test users. Therefore, we have to infer subgroup membership for the test set. This raises another problem: it is hard to identify the subgroup and give its users special treatment without enough data about the subgroup users within the favored group. Prior work has studied data augmentation for graph data, such as GraphSMOTE (Zhao et al., 2021), GraphMixup (Wang et al., 2021; Wu et al., 2021), and GraphCrop (Wang et al., 2020b), which expand the training set by generating synthetic data. However, most data augmentation strategies focus on mitigating class imbalance and are not designed to provide data that helps detect subgroups. Moreover, implementing these methods requires knowing which class is in the minority and needs to be expanded. Therefore, we can only augment data for minority subgroups after inferring the subgroup membership.

Maintain subgroup fairness with group fairness. Improving subgroup fairness after inferring the subgroup membership is another challenge. Most related works (Kearns et al., 2018) formulate an optimization problem with multiple fairness constraints, one for each subgroup combination. Ustun et al. (2019) incorporate multiple classifiers, one for each subgroup. However, when the subgroup membership has to be inferred and is therefore probabilistic rather than deterministic, prior optimization formulations and model designs, such as introducing a fairness regularizer, are not applicable.

To address the above challenges, we define the subgroup membership according to the properties of our review graph. This new sensitive attribute of a user node reflects the label distribution of the reviews posted by each user from the favored group, i.e., a high-degree user posts either single-class reviews (all genuine or all fake) or mixed-class reviews (both fake and genuine). When a user posts reviews that all share the same label, the GNN’s detection is not affected by aggregating messages from the other reviews, no matter which review is at the root of the computation graph. We obtain the true sensitive attribute for the training users by accessing the ground-truth labels of their associated reviews. Next, we infer the unknown subgroup memberships of the test users with the help of data augmentation. By duplicating subgroup users together with their reviews from the training set and then pruning part of their associated non-spam reviews, we synthesize nodes to enlarge the minority subgroup. Rather than introducing more fairness constraints for the subgroup data, we directly treat the new sensitive attribute as an indicator variable appended to the node features. Finally, we construct a joint-training framework with two GNNs: the first GNN infers the indicator variables, while the second GNN uses the output of the first and classifies the nodes.

2. Preliminaries

2.1. Spam detection based on GNN

We study spam detection on a review graph, which consists of a set of nodes and a set of undirected edges. Each node has a feature vector, indexed by the node. There are three types of nodes in the graph, i.e., user, review, and product, and each node is of exactly one type, so the node set is partitioned into user, review, and product subsets. The neighbors of a node are the nodes directly connected to it.

GNN (Kipf and Welling, 2016) is a state-of-the-art multi-layer model for node prediction tasks. The GNN classifier learns a representation of each node at every layer by message passing: the previous-layer representations of a node’s neighbors are first summarized by an AGGREGATE function, and the result is then merged with the node’s own previous-layer representation by a COMBINE function. AGGREGATE and COMBINE can take various forms. In this work, AGGREGATE computes the mean of the input vectors, and COMBINE consists of a ReLU and an affine mapping with learnable parameters. The input feature vector is treated as the representation at layer 0.

The GNN outputs a prediction for each node, and we minimize the cross-entropy loss between these predictions and the target labels over the training node set. We list the main notations in Table 1.
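As an illustration of the mean-aggregation layer described above, the following is a minimal pure-Python sketch rather than the implementation used in the experiments. The function names, the toy graph, and the choice to feed COMBINE the concatenation of a node's own vector and the neighbor mean are all assumptions made for this example.

```python
from typing import Dict, List

Vector = List[float]


def aggregate_mean(vectors: List[Vector], dim: int) -> Vector:
    """AGGREGATE: element-wise mean of neighbor vectors (zeros if no neighbors)."""
    if not vectors:
        return [0.0] * dim
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def combine(h_self: Vector, h_neigh: Vector,
            weight: List[Vector], bias: Vector) -> Vector:
    """COMBINE: affine map of the concatenation [h_self, h_neigh], then ReLU."""
    z = h_self + h_neigh  # list concatenation, an assumption of this sketch
    return [max(0.0, sum(w * x for w, x in zip(row, z)) + b)
            for row, b in zip(weight, bias)]


def gnn_layer(h: Dict[str, Vector], neighbors: Dict[str, List[str]],
              weight: List[Vector], bias: Vector) -> Dict[str, Vector]:
    """One message-passing layer applied to every node."""
    dim = len(next(iter(h.values())))
    return {
        node: combine(h_node,
                      aggregate_mean([h[j] for j in neighbors[node]], dim),
                      weight, bias)
        for node, h_node in h.items()
    }


# Toy review graph: one user "u" connected to two reviews "r1" and "r2".
features = {"u": [1.0, 0.0], "r1": [0.0, 2.0], "r2": [2.0, 0.0]}
neighbors = {"u": ["r1", "r2"], "r1": ["u"], "r2": ["u"]}
# This hypothetical weight matrix adds a node's own vector to its neighbor mean.
weight = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]]
bias = [0.0, 0.0]
hidden = gnn_layer(features, neighbors, weight, bias)
```

In this toy graph the user's representation becomes the sum of its own vector and the average of its two reviews, which shows how a single review's signal is averaged with the user's other reviews during aggregation.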

Notations Definitions
Review graph
Nodes of graph
Edges of graph
Feature and label of node
Set of direct neighbors of
Cardinality of a set
Training nodes, test nodes
User, review, and product nodes
Binary sensitive attributes
The set of review, user nodes with
The set of user nodes with attributes
GNN models with parameters and
Output of model for node
Representation of node from layer
Synthetic node by mixup between node and
Label for the synthetic node
Synthetic representation from and
Table 1. Notations and definitions.

2.2. Fairness regularizer

Group. Based on node degree, which serves as the sensitive attribute, we split the user nodes into a “protected group”, whose degrees are smaller than the 95th percentile of all user node degrees, and a “favored group” containing the remaining users. We then divide the review nodes into two groups following the groups of their associated users, i.e., a user and its associated reviews have the same value of the sensitive attribute. The fairness of the GNN is evaluated between its outputs on the two review groups.
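The degree-based split can be sketched as follows. This is only an illustration: the function names are hypothetical, and the nearest-rank percentile used here is an assumption, since the exact percentile convention is not specified in the text.

```python
from typing import Dict, Set, Tuple


def split_by_degree(degree: Dict[str, int],
                    percentile: int = 95) -> Tuple[Set[str], Set[str], int]:
    """Return (protected, favored, threshold) using a nearest-rank percentile."""
    ranked = sorted(degree.values())
    rank = -(-percentile * len(ranked) // 100)  # integer ceil(percentile * n / 100)
    threshold = ranked[max(rank, 1) - 1]
    # Users strictly below the threshold degree form the protected group.
    protected = {user for user, d in degree.items() if d < threshold}
    favored = set(degree) - protected
    return protected, favored, threshold


def review_groups(author: Dict[str, str], favored: Set[str]) -> Dict[str, str]:
    """Each review inherits the group of the user who posted it."""
    return {review: ("favored" if user in favored else "protected")
            for review, user in author.items()}


# Toy example: 20 users whose degrees are 1, 2, ..., 20.
degree = {f"u{i}": i for i in range(1, 21)}
protected, favored, threshold = split_by_degree(degree)
groups = review_groups({"r1": "u20", "r2": "u3"}, favored)
```

With these toy degrees, the two highest-degree users fall into the favored group and their reviews inherit that label.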

Fairness regularizer. Due to the highly skewed class distribution in the spam detection task, i.e., most of the reviews are genuine, it is appropriate to use ranking-based metrics such as NDCG to evaluate the detector’s output on the two groups. We therefore calculate the gap between the NDCG scores of the GNN’s predictions on the two review groups. Since we would like to improve the performance (NDCG) of the detector on the “favored” group, we start from a GNN model with a fairness regularizer on the “favored” group. In other words, all the GNN models referred to in the rest of this paper are regularized by the negative NDCG of this group. We adopt a differentiable surrogate of NDCG (Burkholder et al., 2021) and use its negative as the regularizer.


The surrogate depends on the total number of pairs of positive and negative reviews used for evaluating fairness on the training set. The objective function for training our GNN model then combines the cross-entropy loss with the regularization term, weighted by a regularization coefficient.
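To make the ranking metric and the shape of the regularized objective concrete, here is a small sketch under stated assumptions. It computes the exact (non-differentiable) NDCG for binary labels, not the differentiable surrogate of Burkholder et al. (2021); the function names, the absolute-value form of the gap, and the scalar form of the objective are illustrative choices.

```python
import math
from typing import List


def ndcg(scores: List[float], labels: List[int]) -> float:
    """Exact NDCG for binary relevance (label 1 = spam, 0 = genuine)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    dcg = sum(labels[i] / math.log2(rank + 2) for rank, i in enumerate(order))
    ideal = sum(1.0 / math.log2(rank + 2) for rank in range(sum(labels)))
    return dcg / ideal if ideal > 0 else 0.0


def ndcg_gap(scores_a: List[float], labels_a: List[int],
             scores_b: List[float], labels_b: List[int]) -> float:
    """Group fairness gap: absolute NDCG difference between two review groups."""
    return abs(ndcg(scores_a, labels_a) - ndcg(scores_b, labels_b))


def regularized_objective(cross_entropy: float, ndcg_favored: float,
                          coeff: float) -> float:
    """Cross-entropy plus the negative NDCG of the favored group as a penalty."""
    return cross_entropy - coeff * ndcg_favored


perfect = ndcg([0.9, 0.1], [1, 0])   # the spam is ranked first
inverted = ndcg([0.1, 0.9], [1, 0])  # the spam is ranked last
```

A perfect ranking yields an NDCG of 1, and ranking the only spam below a genuine review lowers the score, which in turn raises the regularized objective.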

3. Methodology

3.1. Subgroup definition and selection

Subgroups. The degree-based sensitive attribute indicates the “favored” and “protected” groups. For users who post a large number of reviews (the favored group), the GNN can still treat individual users differently based on the label distribution of their posted reviews. In other words, the GNN model can be biased by the neighborhood label distribution. In particular, due to the aggregation operator of the GNN, the suspiciousness of a spam review posted by a user is reduced by the non-spams posted by the same user. We use an additional sensitive attribute to identify whether a user is “favored” due to its non-spam reviews.

This attribute is defined from the labels of the reviews posted by a user. One value marks a “mixed” user, who posts both fake and genuine reviews; the other marks a “pure” user, whose multiple reviews all belong to either the fake or the genuine class, but not both. The unfairness among high-degree users leads to a split of the favored group into two subgroups, each identified by the values of the degree-based attribute and the additional attribute. Almost all users from the protected group post reviews in just one class; thus, the additional sensitive attribute is not relevant for them.
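The subgroup definition can be illustrated with a short sketch. The encoding (1 for “mixed”, 0 for “pure”), the label convention (1 = fake, 0 = genuine), and the function names are assumptions made for this example only.

```python
from typing import Dict, List, Set


def subgroup_indicator(review_labels: List[int]) -> int:
    """Return 1 for a "mixed" user (both classes present), 0 for a "pure" user."""
    has_fake = any(label == 1 for label in review_labels)
    has_genuine = any(label == 0 for label in review_labels)
    return 1 if (has_fake and has_genuine) else 0


def label_favored_users(labels_by_user: Dict[str, List[int]],
                        favored: Set[str]) -> Dict[str, int]:
    """Compute the indicator only for favored (high-degree) users."""
    return {user: subgroup_indicator(labels)
            for user, labels in labels_by_user.items() if user in favored}


labels_by_user = {
    "u1": [1, 0, 0, 0],  # mixed: one fake review hidden among genuine ones
    "u2": [1, 1, 1],     # pure: fake reviews only
    "u3": [0],           # low-degree user, not in the favored group
}
membership = label_favored_users(labels_by_user, favored={"u1", "u2"})
```

Note that the indicator requires every review label of a user, which is why it is available for fully labeled training users only.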

Infer unknown subgroup membership. We can determine the value of the subgroup indicator for users whose reviews are all known to be spam or not. In practice, most review labels are unknown. We hypothesize that inferring the indicator for all user nodes within the larger group helps resolve the subgroup unfairness, as the GNN can use the inferred attribute values to treat the two subgroups more equitably. Therefore, we introduce an additional GNN model that infers the indicator for users whose reviews are not fully labeled. In principle, any predictive model that maps a user to its subgroup membership could be used, but we choose a GNN because its ability to model the neighborhood data distribution can help the inference. This model outputs a predicted membership for each user and is trained with a loss between its predictions and the ground-truth values of the sensitive attribute, which are available for user nodes whose reviews are fully labeled. However, this elementary model is not sufficient: on a review graph, very few user nodes are labeled as “mixed” (see Table 3), while the majority of user nodes are accessible during training (transductive learning) even though their reviews are not fully labeled to provide a value of the attribute. We propose two novel methods to address these two challenges in the next two sections.

3.2. Data augmentation for minority groups

3.2.1. Augmentation for minority user subgroup

The lack of users in the “mixed” subgroup makes training the inference model difficult, leading to poor inference of the subgroup membership. We augment the data of this subgroup to address this issue.

First, the augmentation should mimic the original distribution of the subgroup’s data. Oversampling (Chawla et al., 2002) is a straightforward augmentation that adds multiple copies of the minority data. We replicate the minority user nodes in the subgroup, along with their reviews and their connections to the same products.

Rather than using exact duplicates, we slightly perturb the copies of user and review nodes to generate more variation. Similar to augmentation methods for image data, including flipping (Krizhevsky et al., 2012), cropping (Perez and Wang, 2017), rotation (Cubuk et al., 2019), noise injection (Bishop, 1995), and so on, we randomly prune the reviews of the replicated users to create diverse neighbor distributions. The GNN therefore generates diverse representations of the minority user nodes that are similar, but not identical, to those of the original user nodes. Since genuine reviews dominate the reviews posted by such a user, we remove some randomly selected genuine reviews and always keep the scarcer spam reviews. As a result, the replicated user has a higher ratio of fake reviews while its subgroup membership is preserved (that is, it still has multiple reviews and still has both fake and genuine reviews). The augmentation of user nodes in this subgroup is illustrated in Figure 2.

Figure 2. An example of data augmentation for minority subgroup. We duplicate the minority subgroup “mixed user” and their associated reviews, then randomly prune edges linked to the non-spam reviews.
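The replicate-then-prune augmentation can be sketched as below. This is a hedged illustration, not the authors' code: the tuple layout, the `keep_prob` parameter, the naming of copies, and the rule of keeping at least one genuine review are assumptions introduced so that each copy stays a “mixed” user. The sketch does not check that a pruned copy still exceeds the degree threshold.

```python
import random
from typing import List, Tuple

Review = Tuple[str, str, int]  # (review_id, product_id, label); label 1 = spam


def replicate_and_prune(user_id: str, reviews: List[Review], n_copies: int,
                        keep_prob: float, seed: int = 0):
    """Duplicate a "mixed" user and randomly drop some of its genuine reviews."""
    rng = random.Random(seed)
    spams = [r for r in reviews if r[2] == 1]
    genuine = [r for r in reviews if r[2] == 0]
    copies = []
    for c in range(n_copies):
        # Spam reviews are always kept; each genuine review survives with keep_prob.
        kept = [r for r in genuine if rng.random() < keep_prob]
        if not kept and genuine:
            kept = [rng.choice(genuine)]  # keep one so the copy remains "mixed"
        new_user = f"{user_id}_copy{c}"
        # Copied reviews keep their product ids, i.e., the same product connections.
        new_reviews = [(f"{rid}_copy{c}", pid, label)
                       for rid, pid, label in spams + kept]
        copies.append((new_user, new_reviews))
    return copies


reviews = [("r1", "p1", 1), ("r2", "p2", 0), ("r3", "p3", 0), ("r4", "p1", 0)]
copies = replicate_and_prune("u7", reviews, n_copies=3, keep_prob=0.5, seed=1)
```

Because only genuine reviews are pruned, every copy has a spam ratio at least as high as that of the original user.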

3.2.2. Augmentation for minority review group

The number of reviews posted by the favored group is small compared to the other group, and the model may have difficulty modeling this small set of reviews. Mixup (Zhang et al., 2017), a data augmentation method, generates synthetic image data using convex combinations of pairs of original labeled data points to interpolate the otherwise sparse training distribution. Our method for augmenting the minority group is based on a mixup framework for graph data (Wang et al., 2021) that considers both node features and topology structures. Unlike their work, we do not apply mixup over all the training nodes. Instead, our augmentation method carefully delimits the nodes for mixup so that the synthetic data do not violate fairness across groups. Mixup is applied to the GNN’s input, to the node embeddings at each layer, and to the labels of the synthetic data.

At the input layer, the synthetic node attributes are a mixture of the attributes of the two mixed nodes, controlled by a mixup weight. At each subsequent layer, the synthetic representation is the mixture generated from the aggregated hidden representations of the two nodes. The label of the synthetic data is mixed in the same way.

First, since we would like the synthetic reviews to be similar to the existing reviews from the minority group, we ensure that at least one of the two nodes to be mixed is sampled from the reviews of users in the favored group. Second, we also want to mitigate the class-imbalance issue within the minority group, so we require one of the mixed nodes to be a spam review. Formally, the first node to be mixed is sampled from the spam reviews of the minority group in the training set.

We propose three sources from which the second node can be sampled. The first case is treated as our method, and the second and third cases can be seen as two baselines.

First, since we need to augment spam reviews, we sample the second node from a spam review set. Compared to the favored group, there are more spam reviews inside the “protected” review group (see Table 3), so we sample the second node from the training spam reviews of the “protected” group.

Second, during transductive training, we can access the node degrees of the test nodes, which can therefore be divided into the “protected” and “favored” groups. In this case, we only ensure that the second node belongs to the same group as the first node, and sample it from the test reviews of the “favored” group.

The third option is to use the “protected” reviews from the test set. In this case, the second node comes from a different group than the first node.

Since we do not know the label of the second node in the second and third cases, the synthetic node shares the label of the first node. To ensure that the synthetic reviews are similar to the ones from the minority group, we constrain the mixup weight accordingly.
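A minimal sketch of the convex combination behind mixup is given below. The function names and the example weight of 0.75 are hypothetical; the sketch covers only feature and label mixing for a single pair of nodes, not the layer-wise mixing of hidden representations.

```python
from typing import List, Optional


def mixup_features(x_first: List[float], x_second: List[float],
                   weight: float) -> List[float]:
    """Convex combination of two feature vectors; weight applies to the first node."""
    return [weight * a + (1.0 - weight) * b for a, b in zip(x_first, x_second)]


def mixup_label(y_first: float, y_second: Optional[float],
                weight: float) -> float:
    """Mix labels when both are known; otherwise reuse the first node's label."""
    if y_second is None:  # e.g., the second node is an unlabeled test review
        return float(y_first)
    return weight * y_first + (1.0 - weight) * y_second


# First node: a labeled spam review from the minority group (hypothetical features).
x_spam = [1.0, 0.0]
# Second node: another review whose label may be unknown.
x_other = [0.0, 1.0]
x_synthetic = mixup_features(x_spam, x_other, weight=0.75)
y_unknown = mixup_label(1, None, weight=0.75)
y_known = mixup_label(1, 0, weight=0.75)
```

A weight above one half keeps the synthetic node closer to the first node, which is one way to keep the synthetic reviews close to the minority group.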

3.3. Joint model

In this section, we present how to improve model fairness with the help of the inferred subgroup membership.

Fairness for the subgroup. If the subgroup membership were known, the most straightforward way to improve subgroup fairness would be to add a fairness regularizer over the subgroups, such as disparate impact, to Eq. (3) when training the classifier. However, we do not observe the membership and can only infer its value probabilistically using the inference model. Such probabilistic output cannot be used by existing fairness regularizers, which require deterministic sensitive attributes. We therefore design the following joint model, which infers the membership and optimizes for fairness simultaneously.

We treat the inferred subgroup membership of test user nodes as additional information that informs the GNN classifier about how to treat the two subgroups differently. The feature vector of a test user node fed into the classifier is concatenated with the probability, inferred by the first GNN, that the user has the subgroup sensitive attribute. For a training user node in the subgroup, we append the ground-truth value of the attribute, obtained from Eq. (4), to the feature vector.
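The feature concatenation can be sketched in a few lines. The function name and argument layout are hypothetical and serve only to show which value is appended for training versus test users.

```python
from typing import List


def append_membership(features: List[float], is_training_user: bool,
                      ground_truth: int, predicted_prob: float) -> List[float]:
    """Append the subgroup indicator as one extra feature dimension.

    Training users receive the ground-truth indicator; test users receive the
    probability predicted by the inference model.
    """
    value = float(ground_truth) if is_training_user else predicted_prob
    return features + [value]


train_vec = append_membership([0.2, 0.5], True, ground_truth=1, predicted_prob=0.6)
test_vec = append_membership([0.2, 0.5], False, ground_truth=0, predicted_prob=0.6)
```

The classifier therefore sees a hard 0/1 value on training users and a soft probability on test users.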

Optimization for the joint models. There are two GNNs to be trained, and we propose to optimize them jointly so that the two models can co-adapt to each other during training. The inferred attribute is not considered an observed constant but a function of the parameters of the inference model, as we train that model. The concatenated user feature therefore involves the inference model’s parameters, and so does the classifier’s loss in Eq. (3). During training, the parameters of the two GNNs are updated simultaneously by gradient steps, each with its own learning rate. See Algorithm 1 for a full description.
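One plausible form of the simultaneous updates is written out below as a sketch. The notation is assumed rather than taken from the source: $\theta_f$ and $\theta_g$ denote the parameters of the classifier and of the inference model, $\eta_f$ and $\eta_g$ their learning rates, $\mathcal{L}_f$ the regularized classification loss, and $\mathcal{L}_g$ the inference loss. Summing the two losses in the second update is also an assumption; the text only states that the inference model receives gradient from the classifier's loss.

```latex
\begin{aligned}
\theta_f &\leftarrow \theta_f - \eta_f \,\nabla_{\theta_f}\, \mathcal{L}_f(\theta_f, \theta_g), \\
\theta_g &\leftarrow \theta_g - \eta_g \,\nabla_{\theta_g} \big[ \mathcal{L}_g(\theta_g) + \mathcal{L}_f(\theta_f, \theta_g) \big].
\end{aligned}
```

The dependence of $\mathcal{L}_f$ on $\theta_g$ comes from the inferred membership that is concatenated to the user features.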

Input: graph; node features; sensitive attribute; number of training epochs; hyper-parameters.
Output: optimal parameters of the two GNN models.
Initialize the parameters of the two GNN models.
Replicate the minority-subgroup user nodes a fixed number of times as in Section 3.2.1.   ▷ Augment data for minority subgroup
for each training epoch do
     Prune replicated edges as in Section 3.2.1.   ▷ Add data variations
     Infer the subgroup membership for test users using the first GNN.
     Concatenate the inferred (ground-truth, resp.) membership to the test (training, resp.) user feature vectors.
     Mix up nodes sampled from the first-node set with nodes sampled from one of the second-node sets.   ▷ Data augmentation for minority group
     Evaluate the inference loss in Eq. (5) and the classification loss in Eq. (3).
     Update the parameters of the two GNNs following Eq. (13) and (14).
end for
Algorithm 1 Joint training for subgroup fairness

A baseline is to optimize the two models separately. On the training set, we can observe the subgroup membership of the users since we have access to the labels of their posted reviews. Therefore, we can optimize the membership inference model by minimizing the loss function in Eq. (5) on the training users. When predicting classes for the test users, we concatenate the inferred membership to the test user feature vectors and apply the classification GNN that was trained separately using the ground-truth membership on the training set.

4. Experiments

4.1. Results

Method       Metric            YelpNYC                           YelpZip
                               W/O   Pre-trained  Joint (Ours)   W/O   Pre-trained  Joint (Ours)
GNN          NDCG (all test)   85.2  85.1         85.2           88.4  87.6         88.6
             NDCG (protected)  85.1  85.3         85.2           88.4  87.6         88.6
             NDCG gap          21.9  21.8         21.3           36.3  34.3         34.8
GNN- (Ours)  NDCG (all test)   85.8  85.9         85.9           89.7  89.7         89.6
             NDCG (protected)  85.9  86.0         86.0           89.7  89.7         89.6
             NDCG gap          19.1  19.0         17.9           38.7  36.0         34.3
GNN-         NDCG (all test)   85.3  85.4         85.4           89.4  89.6         89.0
             NDCG (protected)  85.3  85.5         85.5           89.4  89.0         89.1
             NDCG gap          21.9  21.9         20.9           38.9  36.9         34.9
GNN-         NDCG (all test)   85.7  85.8         85.8           89.6  89.6         89.5
             NDCG (protected)  85.8  85.8         85.8           89.6  89.6         89.5
             NDCG gap          21.0  19.8         19.3           38.7  36.2         34.6
Table 2. NDCG for the GNN’s predictions on two Yelp datasets. Shown are the average results over ten different splits.

We first demonstrate the subgroup fairness issue commonly found in a predictive GNN model, even with a fairness regularizer as in Eq. (2). Then, we demonstrate that subgroup fairness can be improved by introducing the subgroup membership. Lastly, we show the advantages of inferring the subgroup membership and the node class simultaneously, and study the impact of the three augmentation strategies. We seek to answer the following research questions:

  • Q1: Do fairness issues exist between the subgroups of “mixed” and “pure” users and between the groups of “favored” and “protected” users when using GNN for spam detection?

  • Q2: How to infer the subgroup memberships with very limited data from one subgroup?

  • Q3: Will inferring subgroup membership help improve group and subgroup fairness?

4.2. Datasets

Dataset   Products  Reviews  Users    (High-degree users, “mixed” users)  Spam ratio
YelpNYC   923       358911   160220   (0.76%, 0.009%)                     0.0479
YelpZip   5044      608598   260277   (0.27%, 0.002%)                     0.0426
Table 3. Statistics of the datasets. We list the numbers of product, review, and user nodes, together with the proportions of high-degree users and “mixed” users in each dataset. The last column gives the ratio of spams between the two groups.

We use two Yelp review datasets (see Table 3) that are commonly used in previous spam detection work (Burkholder et al., 2021; Dou et al., 2020b, a). For the degree-based sensitive attribute, we set a cutoff degree for user nodes to distinguish the “favored” group (the top high-degree user nodes) from the “protected” group (the remaining user nodes). All review nodes have the same value of the sensitive attribute as their associated users. In the last column of Table 3, we give the ratio of spam between the two groups. The imbalanced distribution of spam reflects the fact that a review posted by a “favored” user is less likely to be spam. To study subgroup fairness inside the favored group, we further split the “favored” users into “pure” and “mixed” subgroups based on the labels of their posted reviews, following Eq. (4). We split all user nodes into training, validation, and test sets and divide the review nodes according to the users who post them. Note that the subgroup membership is known only on the training set.

4.3. Experimental Settings

4.3.1. Evaluation Metrics.

Because of the imbalanced distribution of spams and non-spams, we use NDCG to evaluate the detector’s ability to rank spams at the top for human inspection. NDCG can be evaluated on the entire test set or on individual (sub)groups. As we are also interested in the subgroup fairness within the favored group, we propose another metric called “Average False Ranking Ratio” (AFRR) to evaluate the average relative ranking between spams from “mixed” and “pure” users.

For each spam from a subgroup, AFRR computes an inner ratio: the proportion of non-spams from the favored group that are ranked higher than this spam. We then average these ratios over all the spams from the subgroup. NDCG for each subgroup only tells us whether spams are ranked higher than non-spams from the same subgroup, while AFRR considers the non-spams across the different subgroups. The lower the AFRR, the higher the ranks of spams over all non-spams. This is reasonable since we would like the detector to give a higher probability of being suspicious to spams than to non-spams from all users, rather than from just a (sub)group of users.
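The verbal definition of AFRR can be turned into a short sketch. The function name and the handling of empty inputs are assumptions; ties between a spam and a non-spam are not counted as false rankings here, which is also an assumption.

```python
from typing import List


def afrr(spam_scores: List[float], nonspam_scores: List[float]) -> float:
    """Average False Ranking Ratio for the spams of one subgroup.

    spam_scores: suspiciousness scores of the subgroup's spam reviews.
    nonspam_scores: scores of all non-spam reviews in the surrounding group.
    """
    if not spam_scores or not nonspam_scores:
        return 0.0
    ratios = [
        # Fraction of non-spams that the detector ranks above this spam.
        sum(1 for n in nonspam_scores if n > s) / len(nonspam_scores)
        for s in spam_scores
    ]
    return sum(ratios) / len(ratios)


nonspams = [0.5, 0.3, 0.2, 0.1]
pure_afrr = afrr([0.9, 0.8], nonspams)   # spams ranked above every non-spam
mixed_afrr = afrr([0.9, 0.4], nonspams)  # one spam falls below a non-spam
```

Comparing the value for “mixed” users with the value for “pure” users gives a subgroup fairness gap; a larger AFRR for one subgroup means its spams are buried deeper in the ranking.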

4.3.2. Baselines.

To demonstrate the importance of including subgroup information during training and to answer the above questions, we introduce two sets of baselines. The proposed method is denoted as “Joint+GNN-”. We use “a + b” to denote the experimental settings, where “a” is a variant of the first GNN (subgroup inference) and “b” is a variant of the second GNN (spam classification).

Variants of the first GNN:

  • W/O: vanilla GNN without the subgroup membership.

  • Random: concatenates random values as subgroup memberships to the feature vectors of favored-group users on the test set.

  • GT: concatenates the ground-truth membership to the feature vectors of favored-group users on the test set. This is the ideal yet unrealistic case, as the ground truth is unknown on the test set and has to be inferred.

  • Pre-trained: a variant of Joint. We pre-train the first GNN, infer the membership for the test users, and keep this inferred membership fixed when training the second GNN.

Variants of the second GNN:

  • : sample the second node for mixup from set in Eq. (11).

  • : sample the second node for mixup from set in Eq. (12).

We set the number of training epochs, the hyper-parameter, the learning rate, and the weight decay for both GNNs, as well as the mixup weight. In addition, we have 10 training-validation-test splits of YelpNYC and 9 training-validation-test splits of YelpZip. The following results are all averaged over all the splits.

Figure 3. Box plots of AFRR on spams from training and test “mixed” and “pure” users over nine splits of YelpZip. (a) Only the second GNN, without any mixup method. (b)-(d) Joint training of the two GNNs with the three types of mixup. Subgroup fairness is improved by introducing the subgroup membership and the joint training method.

(Sub)group fairness. To answer question Q1, we measure the difference between the NDCGs of reviews from the “favored” and “protected” groups as the group fairness metric, and the difference between the AFRRs of spams from the “mixed” and “pure” subgroups as the subgroup fairness metric. In Table 2, the columns represent three variants of the first GNN on the two datasets, and the rows represent the variants of the second GNN. For each method, we show the NDCG over all test reviews, the NDCG over test reviews from the “protected” group, and the NDCG gap between the two groups. Based on the NDCG gap for W/O+GNN, there is an evident gap in detection performance between the favored and protected groups. A lower NDCG score indicates that the unfair detector tends to assign lower suspiciousness to spams from the “favored” group.

In Figure 3, we show the training and test AFRR of the “pure” and “mixed” subgroups for four methods, computed over 9 training-validation-test splits of the YelpZip dataset. The first subfigure, representing W/O+GNN, makes clear that the detector already produces unfair predictions between “pure” and “mixed” users during training. In other words, inside the favored group, the GNN tends to rank spams from “pure” users higher than spams from “mixed” users. Moreover, judging by the median and the interquartile range of AFRR for the test “mixed” users, our method (subfigure (b)) improves subgroup fairness by raising the rank of spams from “mixed” users.

Looking closely at Table 2, if we fix the variant of the second GNN, our “Joint” training method has the smallest fairness gap in most cases (the only exception is Joint+GNN on YelpZip). Meanwhile, if we fix the variant of the first GNN, our data augmentation method “GNN- (Ours)” has the best results among all the baselines on YelpNYC, and in one case on YelpZip. Since YelpZip is a large, sparse graph with fewer minority subgroup nodes (cf. Table 3), it is difficult to augment data without modifying the original subgroup distribution. Moreover, detectors are prone to overfitting when starting from an extremely small training set.

Impact of . We attempt to answer questions and by studying the impact of subgroup membership and its inference accuracy. In Figure 4, we show the average impact of on the group fairness by generating with different methods over nine splits on the YelpZip test set. There are four groups of stacked bars, which represent four different methods for . Within each group of stacked bars, there are five variants for generating , with less and less noise in going from the left-most to the right-most bars. The fairness gap shrinks as receives a more accurate inference of : the closer the output of is to the ground truth , the smaller the group fairness gap. In particular, the setting that uses the unknown ground truth provides a theoretical lower bound on the fairness gap for the other four methods.

Method         YelpNYC   YelpZip
Joint+GNN      0.505     0.596
Joint+GNN-     0.640     0.668
Joint+GNN-     0.658     0.660
Joint+GNN-     0.687     0.660
Pre-trained    0.585     0.628
Table 4. Performance of the first GNN . AUCs for the prediction of the sensitive attribute for the test group users. The top rows are under the joint training strategy with different mixup methods for the second GNN . The bottom row is the AUC of under the pre-trained strategy, where is fixed when training .

Since accurate inference of is vital to the group fairness, we further analyze the advantage of the joint training strategy from the perspective of the prediction AUC of using the model . In Table 4, we show the average test AUC of under the “Joint” and “Pre-trained” settings on the two datasets. Interestingly, the AUC over the test users from the group is improved by most “Joint” training methods. The only two exceptions are “Joint+GNN” on the two datasets. Unlike “Pre-trained”, the “Joint” training method updates the parameters using the gradient from (see Eq. 14). When there is no data augmentation (i.e., Joint+GNN), this gradient biases to overfit the training group data, resulting in a low AUC on the test set. We find that the proposed mixup strategies over review nodes for can mitigate the overfitting problem of .
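For readers who want to reproduce the kind of numbers reported in Table 4, the sketch below computes AUC directly from its pairwise definition: the probability that a randomly chosen positive example is scored above a randomly chosen negative one, with ties counted as one half. It is a generic illustration rather than the evaluation code used here; the function name `roc_auc` and the toy data are hypothetical.

```python
def roc_auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # correctly ordered pair
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(pos) * len(neg))

# Toy example: three of the four positive-negative pairs are ordered correctly.
auc = roc_auc([0.9, 0.4, 0.6, 0.2], [1, 1, 0, 0])
```

Because a random scorer attains an AUC of 0.5, the value 0.505 for Joint+GNN on YelpNYC in Table 4 is close to chance level, which is consistent with the overfitting explanation above.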

Ablation and sensitivity study. We study the average impact of our mixup method (GNN-) and other mixup baselines (GNN-, GNN-, and no mixup) on group fairness. In Figure 5, we show the group fairness gap between the “protected” and “favored” groups with different mixup strategies. Each line represents one variant of the model , and the dashed lines depict the methods without mixup. On each solid line, the detector has the smallest NDCG gap when using our mixup strategy GNN-S. This demonstrates that delimiting appropriate nodes for mixup is important for the fairness problem. Comparing the solid and dashed lines, we can see that the solid lines are below their corresponding dashed lines in most cases on YelpNYC, and has the smallest fairness gap using the proposed mixup method. However, on YelpZip, the mixup method improves group fairness only with our “Joint” training strategy (green line). The sparsity of the YelpZip graph causes this effect. Although mixup can mitigate overfitting problems for regardless of graph characteristics such as sparsity and size, we still need to consider those characteristics when using mixup to improve the group fairness.
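The basic mixup operation (Zhang et al., 2017) that these variants build on can be sketched as follows. The code interpolates raw feature vectors and labels with a coefficient drawn from a Beta distribution and restricts the partners to a chosen candidate set, which stands in for "delimiting appropriate nodes". It is an illustration only: the mixup used with the GNN operates on graph nodes rather than plain vectors, and the names `mixup_pair`, `mixup_within`, `candidates`, and `alpha` are assumptions.

```python
import random

def mixup_pair(x_i, y_i, x_j, y_j, lam):
    """Convex combination of two examples with mixing coefficient lam."""
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x_i, x_j)]
    y = lam * y_i + (1.0 - lam) * y_j
    return x, y

def mixup_within(features, labels, candidates, alpha=1.0, rng=None):
    """Mix each candidate node with a random partner from the same candidate set."""
    rng = rng or random.Random(0)
    mixed = []
    for i in candidates:
        j = rng.choice(candidates)
        lam = rng.betavariate(alpha, alpha)  # lam ~ Beta(alpha, alpha)
        mixed.append(mixup_pair(features[i], labels[i], features[j], labels[j], lam))
    return mixed

# Toy example: only nodes 0 and 2 (say, the minority group) take part in mixup.
features = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
labels = [1.0, 0.0, 0.0]
synthetic = mixup_within(features, labels, candidates=[0, 2])
example_x, example_y = mixup_pair([1.0, 0.0], 1.0, [0.0, 1.0], 0.0, 0.25)
```

Changing `candidates` is the knob that corresponds to choosing which nodes are eligible for mixup; the ablation above suggests that this choice matters for the fairness gap.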

Besides, we study the sensitivity to the number of duplications and the impact of pruning non-spam edges during the augmentation for the minority “mixed” subgroup. In Figure 6, we show the test AUCs of the model on two datasets with different numbers of replications, with and without pruning the replicated edges. Pruning replicated edges yields better AUCs than replication alone in most cases. We fix the for YelpNYC and for YelpZip in other experiments.
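The duplication-and-pruning augmentation can be pictured with the following sketch. It assumes the user-review graph is stored as a mapping from each user to the reviews they posted, that replicas of a minority user are attached to the same review nodes as the original, and that pruning drops the replicas' edges to non-spam reviews. The function `augment_minority`, its parameters, and the replica naming scheme are hypothetical.

```python
def augment_minority(user_reviews, review_is_spam, minority_users, times, prune=True):
    """Return a new user -> reviews mapping with `times` replicas per minority user.

    When `prune` is True, each replica keeps only the edges to spam reviews.
    """
    augmented = {user: list(reviews) for user, reviews in user_reviews.items()}
    for user in minority_users:
        kept = [r for r in user_reviews[user] if review_is_spam[r] or not prune]
        for k in range(times):
            augmented[f"{user}#copy{k}"] = list(kept)
    return augmented

# Toy example: u1 is a "mixed" user with one spam and one non-spam review.
user_reviews = {"u1": ["r1", "r2"], "u2": ["r3"]}
review_is_spam = {"r1": True, "r2": False, "r3": False}

pruned = augment_minority(user_reviews, review_is_spam, ["u1"], times=2)
unpruned = augment_minority(user_reviews, review_is_spam, ["u1"], times=2, prune=False)
```

The original users keep all of their edges in both settings; only the replicas differ, which mirrors the comparison between replication with and without pruning in Figure 6.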

Figure 5. Ablation study for the mixup. The average test for different mixup strategies on three methods for . Each solid line represents one method for ; the dashed line with the corresponding color represents that specific method+GNN without any mixup strategy.
Figure 6. Sensitivity analysis for duplication time and pruning non-spam edges. The average test AUCs of on graphs with different duplication times and w/ or w/o pruning edges.

5. Related Work

Fairness on graphs. Fairness on graphs has been studied from many perspectives. In (Buyl and De Bie, 2020; Rahman et al., 2019), researchers attempt to obtain fair graph embeddings and representations, where the sensitive attribute is known and can be used to divide the graph into several groups. In (Dai and Wang, 2020; Bose and Hamilton, 2019; Agarwal et al., 2021), the authors introduce an adversarial framework to eliminate the unfairness bias concerning the sensitive attribute. They add fairness regularization terms to constrain their models to be insensitive to specific protected (sensitive) attributes. Besides, other works address unfair node embeddings by modifying the graph structure or re-weighting the edges. FairAdj (Li et al., 2020) adjusts the graph connections and learns a fair adjacency matrix by adding graph structural constraints. FairEdit (Loveland et al., 2022) proposes model-agnostic algorithms that perform edge addition and deletion by leveraging the gradient of their fairness loss. FairDrop (Spinelli et al., 2021) excludes the biased edges to counteract homophily, which causes the unfairness issue in graph representation learning. However, these methods depend on known or observable sensitive attributes, so they only aim to de-bias the node representations with respect to the groups defined by those sensitive attributes.

Subgroup fairness. Some works study the subgroup fairness problem on I.I.D. data. If we enforce group fairness with respect to a pre-defined sensitive attribute, there can still be fairness violations on the subgroups of these pre-defined groups. The works in (Kearns et al., 2018, 2019) learn classifiers subject to fairness constraints when the number of protected groups is large. The classifier satisfies the fairness constraints for the combinatorially large collection of structured subgroups definable over protected attributes. The work in (Ma et al., 2021) studies subgroup generalization and fairness on a graph and demonstrates that the distance between a test subgroup and the training set affects the performance of the GNN.

Augmentation on the graph. Data augmentation on graphs has received increasing attention recently. The works in (Wang et al., 2021; Feng et al., 2020) study node-level graph augmentation that synthesizes data by mixing up nodes or removing nodes from the original graph. Besides, some researchers perform graph augmentation at the edge level, where they modify the graph (adding or removing edges) in a deterministic (Zhao et al., 2020) or stochastic way (Rong et al., 2019).
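As a small illustration of stochastic edge-level augmentation in the spirit of DropEdge (Rong et al., 2019), the sketch below removes each edge independently with a fixed probability. It is a simplified stand-in, not the cited implementation; the function name `drop_edges`, the edge-list representation, and the seed handling are assumptions.

```python
import random

def drop_edges(edges, drop_prob, rng=None):
    """Keep each edge independently with probability 1 - drop_prob."""
    rng = rng or random.Random(0)
    # random() lies in [0, 1), so drop_prob=0 keeps everything
    # and drop_prob=1 removes everything.
    return [edge for edge in edges if rng.random() >= drop_prob]

# Toy example on a small user-review edge list.
edges = [("u1", "r1"), ("u1", "r2"), ("u2", "r3"), ("u3", "r3")]
sparser = drop_edges(edges, drop_prob=0.5)
```

A deterministic variant would replace the random test with a rule that decides which edges to add or remove.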

6. Conclusion

In this paper, we studied the subgroup fairness problem in the graph-based spam detection task. We first show that subgroup unfairness exists inside the “favored” group defined by the pre-defined sensitive attribute (node degree, denoted as ). To address this subgroup unfairness problem, we propose a new sensitive attribute () for users based on the label distribution of users’ posted reviews. For users associated with unlabeled reviews, we introduce another GNN model to infer with a graph augmentation method (duplicating the minority subgroup and pruning the non-spam edges). Furthermore, we treat the inferred as an indicator variable concatenated to the original user node feature and train another GNN for the spam detection task, applying the mixup method to the minority group. According to the experimental results, our joint training strategy for two GNN models with two augmentation methods (one for the minority subgroup, one for the minority group) effectively reduces the NDCG gap between groups and the AFRR gap between subgroups.
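The two bookkeeping steps in this pipeline, deriving a subgroup membership from review labels and appending it to the user feature, can be sketched as follows. The sketch assumes that a user is “mixed” when their labeled reviews contain both spam and non-spam and “pure” otherwise, and that the membership (observed or inferred) is appended as one extra feature dimension; the names `subgroup_membership` and `append_indicator` are hypothetical, and the precise definition is the one given earlier in the paper.

```python
def subgroup_membership(review_labels):
    """Return 'mixed' if a user's labeled reviews include both classes,
    'pure' if they all share one class, and None if no label is available."""
    if not review_labels:
        return None  # membership must be inferred by the first GNN
    return "mixed" if len(set(review_labels)) > 1 else "pure"

def append_indicator(user_features, membership_score):
    """Concatenate the membership indicator to the original user feature vector."""
    return list(user_features) + [membership_score]

# Toy example: 1 marks a spam review, 0 a non-spam review.
labels_by_user = {"u1": [1, 1], "u2": [1, 0], "u3": []}
memberships = {u: subgroup_membership(ls) for u, ls in labels_by_user.items()}

# A user inferred to be "mixed" with probability 0.8 gets that score appended.
augmented_feature = append_indicator([0.2, 0.5], 0.8)
```

Users such as `u3`, whose reviews are unlabeled, are exactly the ones for which the first GNN has to supply the membership before the second GNN can use it.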


  • C. Agarwal, H. Lakkaraju, and M. Zitnik (2021) Towards a unified framework for fair and stable graph representation learning. In Uncertainty in Artificial Intelligence, pp. 2114–2124. Cited by: §1, §5.
  • P. Awasthi, M. Kleindessner, and J. Morgenstern (2020) Equalized odds postprocessing under imperfect group information. In International Conference on Artificial Intelligence and Statistics, pp. 1770–1780. Cited by: §1.
  • C. M. Bishop (1995) Training with noise is equivalent to tikhonov regularization. Neural computation 7 (1), pp. 108–116. Cited by: §3.2.1.
  • A. Bose and W. Hamilton (2019) Compositional fairness constraints for graph embeddings. In International Conference on Machine Learning, pp. 715–724. Cited by: §5.
  • K. Burkholder, K. Kwock, Y. Xu, J. Liu, C. Chen, and S. Xie (2021) Certification and trade-off of multiple fairness criteria in graph-based spam detection. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management, pp. 130–139. Cited by: §1, §2.2, §4.2.
  • M. Buyl and T. De Bie (2020) Debayes: a bayesian method for debiasing network embeddings. In International Conference on Machine Learning, pp. 1220–1229. Cited by: §5.
  • L. E. Celis, L. Huang, V. Keswani, and N. K. Vishnoi (2021) Fair classification with noisy protected attributes: a framework with provable guarantees. In International Conference on Machine Learning, pp. 1349–1361. Cited by: §1.
  • N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer (2002) SMOTE: synthetic minority over-sampling technique. Journal of artificial intelligence research 16, pp. 321–357. Cited by: §3.2.1.
  • D. Chen, Y. Lin, G. Zhao, X. Ren, P. Li, J. Zhou, and X. Sun (2021) Topology-imbalance learning for semi-supervised node classification. Advances in Neural Information Processing Systems 34. Cited by: §1.
  • J. Chen, N. Kallus, X. Mao, G. Svacha, and M. Udell (2019) Fairness under unawareness: assessing disparity when protected class is unobserved. In Proceedings of the conference on fairness, accountability, and transparency, pp. 339–348. Cited by: §1.
  • E. D. Cubuk, B. Zoph, D. Mane, V. Vasudevan, and Q. V. Le (2019) Autoaugment: learning augmentation strategies from data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 113–123. Cited by: §3.2.1.
  • E. Dai and S. Wang (2020) Fairgnn: eliminating the discrimination in graph neural networks with limited sensitive attribute information. arXiv preprint arXiv:2009.01454. Cited by: §5.
  • E. Dai and S. Wang (2021) Say no to the discrimination: learning fair graph neural networks with limited sensitive attribute information. In Proceedings of the 14th ACM International Conference on Web Search and Data Mining, pp. 680–688. Cited by: §1.
  • Y. Dou, Z. Liu, L. Sun, Y. Deng, H. Peng, and P. S. Yu (2020a) Enhancing graph neural network-based fraud detectors against camouflaged fraudsters. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management, pp. 315–324. Cited by: §1, §4.2.
  • Y. Dou, G. Ma, P. S. Yu, and S. Xie (2020b) Robust spammer detection by nash reinforcement learning. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 924–933. Cited by: §1, §4.2.
  • W. Feng, J. Zhang, Y. Dong, Y. Han, H. Luan, Q. Xu, Q. Yang, E. Kharlamov, and J. Tang (2020) Graph random neural networks for semi-supervised learning on graphs. Advances in neural information processing systems 33, pp. 22092–22103. Cited by: §5.
  • Z. Jiang, X. Han, C. Fan, Z. Liu, N. Zou, A. Mostafavi, and X. Hu (2022) FMP: toward fair graph message passing against topology bias. arXiv preprint arXiv:2202.04187. Cited by: §1.
  • G. Jin, Q. Wang, C. Zhu, Y. Feng, J. Huang, and J. Zhou (2020) Addressing crime situation forecasting task with temporal graph convolutional neural network approach. In 2020 12th International Conference on Measuring Technology and Mechatronics Automation (ICMTMA), pp. 474–478. Cited by: §1.
  • J. Kang, J. He, R. Maciejewski, and H. Tong (2020) Inform: individual fairness on graph mining. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 379–389. Cited by: §1.
  • M. Kearns, S. Neel, A. Roth, and Z. S. Wu (2018) Preventing fairness gerrymandering: auditing and learning for subgroup fairness. In International Conference on Machine Learning, pp. 2564–2572. Cited by: §1, §5.
  • M. Kearns, S. Neel, A. Roth, and Z. S. Wu (2019) An empirical study of rich subgroup fairness for machine learning. In Proceedings of the conference on fairness, accountability, and transparency, pp. 100–109. Cited by: §1, §5.
  • T. N. Kipf and M. Welling (2016) Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907. Cited by: §2.1.
  • A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems 25. Cited by: §3.2.1.
  • A. Li, Z. Qin, R. Liu, Y. Yang, and D. Li (2019) Spam review detection with graph convolutional networks. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, pp. 2703–2711. Cited by: §1.
  • P. Li, Y. Wang, H. Zhao, P. Hong, and H. Liu (2020) On dyadic fairness: exploring and mitigating bias in graph connections. In International Conference on Learning Representations, Cited by: §1, §1, §5.
  • Z. Liu, Y. Dou, P. S. Yu, Y. Deng, and H. Peng (2020) Alleviating the inconsistency problem of applying graph neural network to fraud detection. In Proceedings of the 43rd international ACM SIGIR conference on research and development in information retrieval, pp. 1569–1572. Cited by: §1.
  • D. Loveland, J. Pan, A. F. Bhathena, and Y. Lu (2022) FairEdit: preserving fairness in graph neural networks through greedy graph editing. arXiv preprint arXiv:2201.03681. Cited by: §5.
  • M. Luca (2016) Reviews, reputation, and revenue: the case of Yelp.com. (March 15, 2016). Harvard Business School NOM Unit Working Paper (12-016). Cited by: §1.
  • J. Ma, J. Deng, and Q. Mei (2021) Subgroup generalization and fairness of graph neural networks. Advances in Neural Information Processing Systems 34. Cited by: §1, §5.
  • A. Mehrotra and L. E. Celis (2021) Mitigating bias in set selection with noisy protected attributes. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, pp. 237–248. Cited by: §1.
  • L. Perez and J. Wang (2017) The effectiveness of data augmentation in image classification using deep learning. arXiv preprint arXiv:1712.04621. Cited by: §3.2.1.
  • T. Rahman, B. Surma, M. Backes, and Y. Zhang (2019) Fairwalk: towards fair graph embedding. Cited by: §1, §5.
  • S. Rayana and L. Akoglu (2015) Collective opinion spam detection: bridging review networks and metadata. In Proceedings of the 21th acm sigkdd international conference on knowledge discovery and data mining, pp. 985–994. Cited by: §1.
  • Y. Rong, W. Huang, T. Xu, and J. Huang (2019) Dropedge: towards deep graph convolutional networks on node classification. arXiv preprint arXiv:1907.10903. Cited by: §5.
  • I. Spinelli, S. Scardapane, A. Hussain, and A. Uncini (2021) FairDrop: biased edge dropout for enhancing fairness in graph representation learning. IEEE Transactions on Artificial Intelligence. Cited by: §1, §1, §5.
  • B. Ustun, Y. Liu, and D. Parkes (2019) Fairness without harm: decoupled classifiers with preference guarantees. In International Conference on Machine Learning, pp. 6373–6382. Cited by: §1.
  • D. Wang, J. Lin, P. Cui, Q. Jia, Z. Wang, Y. Fang, Q. Yu, J. Zhou, S. Yang, and Y. Qi (2019) A semi-supervised graph attentive network for financial fraud detection. In 2019 IEEE International Conference on Data Mining (ICDM), pp. 598–607. Cited by: §1.
  • Y. Wang, L. Ge, S. Li, and F. Chang (2020a) Deep temporal multi-graph convolutional network for crime prediction. In International Conference on Conceptual Modeling, pp. 525–538. Cited by: §1.
  • Y. Wang, W. Wang, Y. Liang, Y. Cai, and B. Hooi (2020b) Graphcrop: subgraph cropping for graph classification. arXiv preprint arXiv:2009.10564. Cited by: §1.
  • Y. Wang, W. Wang, Y. Liang, Y. Cai, and B. Hooi (2021) Mixup for node and graph classification. In Proceedings of the Web Conference 2021, pp. 3663–3674. Cited by: §1, §3.2.2, §5.
  • L. Wu, H. Lin, Z. Gao, C. Tan, S. Li, et al. (2021) GraphMixup: improving class-imbalanced node classification on graphs by self-supervised context prediction. arXiv preprint arXiv:2106.11133. Cited by: §1.
  • Y. Wu, D. Lian, Y. Xu, L. Wu, and E. Chen (2020) Graph convolutional networks with markov random field reasoning for social spammer detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34, pp. 1054–1061. Cited by: §1.
  • R. Ying, R. He, K. Chen, P. Eksombatchai, W. L. Hamilton, and J. Leskovec (2018) Graph convolutional neural networks for web-scale recommender systems. In Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery & data mining, pp. 974–983. Cited by: §1.
  • H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz (2017) Mixup: beyond empirical risk minimization. arXiv preprint arXiv:1710.09412. Cited by: §3.2.2.
  • T. Zhao, X. Zhang, and S. Wang (2021) Graphsmote: imbalanced node classification on graphs with graph neural networks. In Proceedings of the 14th ACM international conference on web search and data mining, pp. 833–841. Cited by: §1.
  • T. Zhao, Y. Liu, L. Neves, O. Woodford, M. Jiang, and N. Shah (2020) Data augmentation for graph neural networks. arXiv preprint arXiv:2006.06830. Cited by: §5.