1. Introduction
Supervised learning has been shown to exhibit bias depending on the data it was trained on (Pedreshi et al., 2008). This problem is further amplified in graphs, where the graph topology has been shown to exhibit different biases (Kang et al., 2019; Singer et al., 2020). Many popular supervised-learning graph algorithms, such as graph neural networks (GNNs), employ message passing with features aggregated from neighbors, which might further intensify this bias. For example, in social networks, communities are usually more densely connected internally. As GNNs aggregate information from neighbors, it becomes even harder for a classifier to recognize the potential of an individual from a discriminated community.
Despite their success (Wu et al., 2020), little work has been dedicated to creating unbiased GNNs, where the classification is uncorrelated with sensitive attributes, such as race or gender. The little existing work focused on ignoring the sensitive attributes (Obermeyer et al., 2019). However, "fairness through unawareness" has been shown to fail, as sensitive attributes can often be predicted from other features (Barocas et al., 2019). Others (Bose and Hamilton, 2019; Dai and Wang, 2021; Buyl and De Bie, 2020; Rahman et al., 2019) focused on the criterion of Statistical Parity (SP) for fairness when training node embeddings, which is defined as follows:
Definition 1.1 (Statistical parity).
A predictor $\hat{Y}$ satisfies statistical parity with respect to a sensitive attribute $S$, if $\hat{Y}$ and $S$ are independent:
(1) $P(\hat{Y} \mid S) = P(\hat{Y})$
Recently, (Dwork et al., 2012) showed that SP does not ensure fairness and might actually cripple the utility of the prediction task. Consider the target of college acceptance and the sensitive attribute of demographics. If the target variable correlates with the sensitive attribute, statistical parity would not allow an ideal predictor. Additionally, the criterion allows accepting qualified applicants in one demographic, but unqualified in another, as long as the percentages of acceptance match.
In recent years, the notion of Equalized Odds (EO) was presented as an alternative fairness criterion (Hardt et al., 2016). Unlike SP, EO allows dependence on the sensitive attribute $S$, but only through the target variable $Y$:
Definition 1.2 (Equalized odds).
A predictor $\hat{Y}$ satisfies equalized odds with respect to a sensitive attribute $S$, if $\hat{Y}$ and $S$ are independent conditional on the true label $Y$:
(2) $P(\hat{Y} \mid S, Y) = P(\hat{Y} \mid Y)$
The definition encourages the use of features that allow $Y$ to be predicted directly, while not allowing $S$ to be leveraged as a proxy for $Y$. Consider our college acceptance example. For the outcome $\hat{Y} = \text{Accept}$, we require $\hat{Y}$ to have similar true and false positive rates across all demographics. Notice that the perfect predictor $\hat{Y} = Y$ satisfies the equalized odds constraint; the criterion thus enforces that accuracy is equally high in all demographics, and penalizes models that perform well on only the majority demographic.
In this work, we present an architecture that optimizes graph classification for the EO criterion. Given a GNN classifier predicting a target class, our architecture expands it with sampler and discriminator components. The goal of the sampler component is to learn the distribution of the sensitive attributes of the nodes given their labels. The sampler generates examples that are then fed into a discriminator. The goal of the latter is to discriminate between true and sampled sensitive attributes. We present a novel loss function the discriminator minimizes, the permutation loss. Unlike the cross-entropy loss, which compares two independent or unrelated groups, the permutation loss compares items under two separate scenarios: with the true sensitive attribute, or with a generated unbiased sensitive attribute.
We start by pretraining the sampler, and then train the discriminator along with the GNN classifier using adversarial training. This joint training allows the model to neglect information regarding the sensitive attribute only with respect to its label, as required by the equalized odds fairness criterion. To the best of our knowledge, our work is the first to optimize GNNs for the equalized odds criterion.
The contributions of this work are fourfold:
- We propose EqGNN, an algorithm with an equalized odds regularization for graph classification tasks.
- We propose a novel permutation loss which allows us to compare pairs. We use this loss in the special case of nodes in two different scenarios: one under the biased sensitive-attribute distribution, and the other under the generated unbiased distribution.
- We empirically evaluate EqGNN on several real-world datasets and show superior performance to several baselines, both in utility and in bias reduction.
- We empirically evaluate the permutation loss over both synthetic and real-world datasets and show the importance of leveraging the pair information.
2. Related Work
Supervised learning in graphs has been applied in many applications, such as protein-protein interaction prediction (Grover and Leskovec, 2016; Singer et al., 2019), human movement prediction (Yan et al., 2018), traffic forecasting (Yu et al., 2018; Cui et al., 2019) and other urban dynamics (Wang and Li, 2017). Many supervised learning algorithms have been suggested for these tasks on graphs, including matrix factorization approaches (Belkin and Niyogi, 2001; Tenenbaum et al., 2000; Yan et al., 2006; Roweis and Saul, 2000), random walk approaches (Perozzi et al., 2014; Grover and Leskovec, 2016), and graph neural networks, which have recently shown state-of-the-art results on many tasks (Wu et al., 2020). The latter are an adaptation of neural networks to the graph domain. GNNs provide differentiable layers that can be added to many different architectures and tasks, and utilize the graph structure by propagating information through the edges and nodes. For instance, GCN (Kipf and Welling, 2017) and GraphSAGE (Hamilton et al., 2017) update the node representations by averaging over the representations of all neighbors, while (Veličković et al., 2017) proposed an attention mechanism to learn the importance of each specific neighbor.
Fairness in graphs was mostly studied in the context of group fairness, by optimizing the SP fairness criterion. (Rahman et al., 2019) create fair random walks by first sampling a sensitive attribute and only then sampling a neighbor from those who hold that specific sensitive attribute. For instance, if most nodes represent men while the minority represent women, the fair random walk guarantees that men and women appear equally often in the random walks. (Buyl and De Bie, 2020) proposed a Bayesian method for learning embeddings by using a biased prior. Others focus on unbiasing the graph prediction task itself rather than the node embeddings. For example, (Bose and Hamilton, 2019) use a set of adversarial filters to remove information about predefined sensitive attributes; the filters are learned in a self-supervised way by using a graph autoencoder to reconstruct the graph edges. (Dai and Wang, 2021) offer a discriminator that discriminates between the nodes' sensitive attributes. In their setup, not all nodes' sensitive attributes are known, and therefore, they add an additional component that predicts the missing attributes. (Kang et al., 2020) tackle the challenge of individual fairness in graphs. In this work, we propose a GNN framework optimizing the EO fairness criterion. To the best of our knowledge, our work is the first to study fairness in graphs in the context of EO.
3. Equalized-Odds Fair Graph Neural Network
Let $G = (V, E, Y, S)$ be a graph, where $E$ is the list of edges, $V$ the list of nodes, $Y$ the labels, and $S$ the sensitive attributes. Each node is represented via $d$ features. We denote $X$ as the feature matrix for the nodes in $V$. Our goal is to learn a function $f_\theta$ with parameters $\theta$ that, given a graph $G$, maps a node, represented by a feature vector, to its label.
In this work, we present an architecture that can leverage any graph neural network classifier. For simplicity, we consider a simple GNN architecture for $f_\theta$ as suggested by (Kipf and Welling, 2017): we define $f_\theta$ to be two GCN (Kipf and Welling, 2017) layers, outputting a hidden representation $h_v$ for each node. This representation then enters a fully connected layer that outputs $\hat{y}_v$. The GNN optimization goal is to minimize the distance between $\hat{y}_v$ and $y_v$ using a loss function $L$, which can be categorical cross-entropy (CCE) for multi-class classification, binary cross-entropy (BCE) for binary classification, or mean square error (L2) for regression problems. In this work, we extend the optimization to minimize $L$ while satisfying Eq. 2 for fair prediction. We propose a method, EqGNN, that trains a GNN model to neglect information regarding the sensitive attribute only with respect to its label. The full architecture of EqGNN is depicted in Figure 1. Our method pretrains a sampler (Section 3.1) to learn the distribution $P(S \mid Y)$ of the sensitive attributes of the nodes given their labels (marked in blue in Figure 1). We train a GNN classifier (marked in green in Figure 1), while regularizing it with a discriminator (marked in red in Figure 1) that discriminates between true and sampled sensitive attributes. Section 3.2 presents the EO regularization. The regularization is done using a novel loss function, the "permutation loss", which is capable of comparing paired samples (formally presented in Section 3.3, with implementation details discussed in Section 3.4). For the unique setup of adversarial learning over graphs, we show that incorporating the permutation loss in a discriminator brings performance gains both in utility and in EO. Section 3.5 presents the full EqGNN optimization procedure.
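As a concrete sketch of such a classifier, the two GCN layers followed by a fully connected output layer can be written in plain NumPy. This is only illustrative: the hidden sizes, ReLU activations, and symmetric adjacency normalization are common conventions, not the paper's exact configuration.

```python
import numpy as np

def normalize_adjacency(A):
    """Symmetric normalization with self-loops: D^{-1/2} (A + I) D^{-1/2}."""
    A_tilde = A + np.eye(A.shape[0])
    d = A_tilde.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return D_inv_sqrt @ A_tilde @ D_inv_sqrt

def gcn_layer(A_hat, H, W):
    """One GCN propagation step: aggregate neighbor features, then project."""
    return np.maximum(A_hat @ H @ W, 0.0)  # ReLU activation

def classifier_forward(A, X, W1, W2, W_fc):
    """Two GCN layers produce the hidden representation h_v; a final
    fully connected layer maps it to per-node class logits y_hat."""
    A_hat = normalize_adjacency(A)
    h = gcn_layer(A_hat, gcn_layer(A_hat, X, W1), W2)
    logits = h @ W_fc
    return h, logits
```

The hidden representation `h` returned here is the same quantity that is later fed, together with the sensitive attributes, to the discriminator.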
3.1. Sampler
To comply with the SP criterion (Eq. 1), given a sample $(v, y_v, s_v)$, we wish the prediction of the classifier, $\hat{y}_v$, to be independent of the sample's sensitive attribute $s_v$. To check whether this criterion holds, we can sample a fake attribute $\tilde{s}_v$ from $P(S)$ (e.g., in the case of equal-sized groups, a uniformly random attribute), and check whether a discriminator can tell the true attribute from the fake one. If it cannot, $\hat{y}_v$ is independent of $s_v$ and the SP criterion holds. As all the information of $\hat{y}_v$ is also represented in the hidden representation $h_v$, one can simply train a discriminator to predict the sensitive attribute given the hidden representation $h_v$. A similar idea was suggested by (Dai and Wang, 2021; Bose and Hamilton, 2019).
To comply with the EO criterion (Eq. 2), the classifier should not be able to separate between an example with a real sensitive attribute and an example with an attribute sampled from the conditional distribution $P(S \mid Y)$. Therefore, we jointly train the classifier with a discriminator that learns to separate between the two examples. Formally, given a sample $(v, y_v, s_v)$, we would want the prediction of the classifier, $\hat{y}_v$, to be independent of the attribute $s_v$ only given its label $y_v$. Thus, instead of sampling the fake attribute from $P(S)$, we sample it from $P(S \mid Y = y_v)$.
We continue by describing the learning process of the sensitive attribute distribution $P(S \mid Y)$, and then present how the model samples dummy attributes that are used as "negative" examples for the discriminator.
3.1.1. Sensitive Attribute Distribution Learning
Here, our goal is to learn the distribution $P(S \mid Y)$. For a specific sensitive attribute $s$ and label $y$, the probability can be expressed using Bayes' rule as:
(3) $P(s \mid y) = \frac{P(y \mid s) \, P(s)}{P(y)}$
The term $P(y \mid s)$ can be derived from the data by counting the number of samples that have both label $y$ and sensitive attribute $s$, divided by the number of samples with sensitive attribute $s$. Similarly, $P(s)$ is calculated as the number of samples with sensitive attribute $s$, divided by the total number of samples. In a regression setup, these quantities can be approximated using a linear kernel density estimation.
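The counting procedure above can be sketched as follows; the dictionary-based representation of the distributions is our own illustrative choice, not the paper's implementation.

```python
from collections import Counter

def estimate_p_s_given_y(labels, attrs):
    """Estimate P(S = s | Y = y) via Bayes' rule from empirical counts:
    P(s | y) = P(y | s) * P(s) / P(y)."""
    n = len(labels)
    p_s = {s: c / n for s, c in Counter(attrs).items()}          # P(s)
    p_y = {y: c / n for y, c in Counter(labels).items()}         # P(y)
    joint = Counter(zip(labels, attrs))                          # counts of (y, s)
    # P(y | s): count of (y, s) divided by count of s
    p_y_given_s = {(y, s): c / (p_s[s] * n) for (y, s), c in joint.items()}
    return {(s, y): p_y_given_s[(y, s)] * p_s[s] / p_y[y]
            for (y, s) in joint}
```

For example, with labels `[0, 0, 1, 1]` and attributes `[0, 1, 0, 0]`, all nodes with label 1 carry attribute 0, so the estimate of P(S=0 | Y=1) is 1.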
3.1.2. Fair Dummy Attributes
During training of the end-to-end model (Section 3.5), the sampler receives a training example $(v, y_v, s_v)$ and generates a dummy attribute $\tilde{s}_v$ by sampling from $P(S \mid Y = y_v)$. Notice that $s_v$ and $\tilde{s}_v$ are equally distributed given $y_v$. This ensures that if the classifier satisfies the EO criterion, then $s_v$ and $\tilde{s}_v$ will receive an identical classification, whereas otherwise they will result in different classifications. In Section 3.2 we further explain the optimization process that utilizes the dummy attributes for regularizing the classifier for EO.
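Drawing a dummy attribute from the learned conditional distribution might look as follows; the input format (a mapping from attribute value to its conditional probability for the node's label) is an assumption for illustration.

```python
import random

def sample_dummy(cond_dist, rng=random):
    """cond_dist maps each sensitive-attribute value s to P(S = s | Y = y)
    for the node's label y; returns one sampled dummy attribute s~."""
    attrs = list(cond_dist)
    weights = [cond_dist[s] for s in attrs]
    return rng.choices(attrs, weights=weights, k=1)[0]
```

A degenerate conditional distribution always returns its single attribute, which matches the intuition that the dummy is indistinguishable from the real attribute whenever the conditional leaves no choice.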
3.2. Discriminator
A GNN classifier without regularization might learn to predict biased node labels based on their sensitive attributes. To satisfy EO, the classifier should be unable to distinguish between real examples and generated examples with dummy attributes. Therefore, we utilize an adversarial learning process and add a discriminator that learns to distinguish between the two. Intuitively, this regularizes the classifier to comply with the EO criterion.
Intuitively, one might consider $h_v$, the last hidden layer of the classifier for node $v$, as the unbiased representation of the node. The discriminator receives two types of examples: (1) real examples $(h_v, s_v)$ and (2) negative examples $(h_v, \tilde{s}_v)$, where $\tilde{s}_v$ is generated by the pretrained sampler. The discriminator learns a function $D_\phi$ with parameters $\phi$ that, given a sample, classifies it as true or fake. The classifier, in its turn, tries to "fool" it. This ensures the classifier does not retain bias towards specific labels and sensitive attributes, and keeps the EO criterion. The formal adversarial loss is defined as:
(4) $L_{BCE} = \mathbb{E}_v\!\left[\log D(h_v \oplus s_v)\right] + \mathbb{E}_v\!\left[\log\!\left(1 - D(h_v \oplus \tilde{s}_v)\right)\right]$
where $\mathbb{E}$ is the expected value, and $\oplus$ represents the concatenation operator. The discriminator tries to maximize $L_{BCE}$ to distinguish between true and fake samples, while the classifier tries to minimize it, in order to "fool" it. In our implementation, $D$ is a GNN with two GCN layers outputting the probability of being a true sample.
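For concreteness, the adversarial objective of Eq. 4 can be estimated from discriminator outputs with a small NumPy sketch, assuming the discriminator already outputs probabilities in (0, 1) for real and fake samples:

```python
import numpy as np

def adversarial_bce(d_real, d_fake):
    """Eq. (4) estimate: the discriminator maximizes this value, the
    classifier minimizes it. d_real are discriminator outputs on
    (h, s) pairs, d_fake on (h, s~) pairs, both in (0, 1)."""
    d_real = np.asarray(d_real, dtype=float)
    d_fake = np.asarray(d_fake, dtype=float)
    return float(np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake)))
```

At a discriminator output of 0.5 on both kinds of samples (i.e., it cannot tell them apart), the objective sits at its saddle value of 2 log 0.5.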
We observe that the true and fake attributes are paired (the same node, with a real sensitive attribute or a dummy attribute). A binary loss as defined in Eq. 4 holds for unpaired examples, and does not take advantage of knowing that the fake and true attributes are paired. Therefore, we upgrade the loss to handle paired nodes by utilizing the permutation loss. We first formally define and explain the permutation loss in Section 3.3, and then discuss its implementation details in our architecture in Section 3.4.
3.3. Permutation Loss
In this section, we formally define the new permutation loss presented in this work. Let us assume $X$ and $Y$ are two groups of subjects. Many applications are interested in learning and understanding the difference between the two. For example, in the case where $X$ represents the test results of students in one class, while $Y$ represents the test results of a different class, it is interesting to check whether the two classes are equally distributed (i.e., $P_X = P_Y$).
Definition 3.1 (T-test).
Given two groups $X, Y$, the statistical difference can be measured by the t-statistic:
$t = \frac{\bar{X} - \bar{Y}}{\sqrt{\sigma_X^2 / n_X + \sigma_Y^2 / n_Y}}$
where $\bar{X}, \bar{Y}$, $\sigma_X^2, \sigma_Y^2$, and $n_X, n_Y$ are the means, variances, and group sizes, respectively.
While this test assumes $X$ and $Y$ are scalars that are normally distributed, (Lopez-Paz and Oquab, 2017) proposed a method called C2ST that handles cases where $x_i \in \mathbb{R}^d$ for $d \geq 1$. They proposed a classifier that is trained to predict the correct group of a given sample, which belongs to either $X$ or $Y$. By doing so, given a test set, they are able to calculate the t-statistic by simply checking the number of correct predictions:
Definition 3.2 (C2ST).
Given two groups $X, Y$ (labeled 0 and 1, respectively), the statistical difference can be measured by the t-statistic (Lopez-Paz and Oquab, 2017):
$\hat{t} = \frac{1}{n_{te}} \sum_{i} \mathbb{1}\!\left[\mathbb{1}\!\left(f(z_i) > \tfrac{1}{2}\right) = l_i\right]$
where $z_i$ is a test sample, $l_i$ is $z_i$'s original group label, $n_{te}$ is the number of samples in the test set, and $f$ is a trained classifier outputting the probability that a sample is sampled from group 1.
The basic (and yet important) idea behind this test is that if a classifier is not able to predict the group of a given sample, then the groups are equally distributed. Mathematically, the t-statistic can be calculated from the number of correct predictions of the classifier.
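Since the C2ST statistic reduces to held-out classification accuracy, it can be sketched in a few lines (the trained classifier itself is assumed to exist and is represented here only by its output probabilities):

```python
import numpy as np

def c2st_statistic(probs, labels):
    """C2ST test statistic: the accuracy of a trained binary classifier
    on a held-out set. probs[i] is the predicted probability that sample i
    belongs to group 1; labels[i] is its true group (0 or 1)."""
    preds = (np.asarray(probs, dtype=float) > 0.5).astype(int)
    return float(np.mean(preds == np.asarray(labels)))
```

An accuracy near 0.5 indicates the classifier cannot separate the groups, i.e., no detectable distributional difference.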
However, the C2ST criterion is not optimal when the samples are paired ($x_i$ and $y_i$ represent the same subject), as it does not leverage the pairing information. Consider the following example: assume the pairs follow the rule $x_i \sim \mathcal{N}(0, 1)$ and $y_i = x_i - \epsilon$ for some small $\epsilon > 0$. As the two Gaussians overlap, a simple linear classifier will not be able to detect any significant difference between the two groups, while we hold the information that there is a significant difference ($y_i$ is always smaller than its pair $x_i$). Therefore, it is necessary to define a new test that can manage paired examples.
Definition 3.3 (Paired T-test).
Given two paired groups $X = \{x_i\}_{i=1}^n$ and $Y = \{y_i\}_{i=1}^n$, the statistical difference can be measured by the t-statistic:
$t = \frac{\bar{d}}{s_d / \sqrt{n}}$
where $\bar{d}$ and $s_d$ are the mean and standard deviation of the pairwise differences $d_i = x_i - y_i$, and $n$ is the number of pairs. Again, this Paired T-test assumes the differences are scalars that are normally distributed. A naive adaptation of (Lopez-Paz and Oquab, 2017) for paired data with $d > 1$ would be to first map the pairs into scalars and then calculate their differences (as in the paired student t-test for $d = 1$). This approach also assumes the samples are normally distributed, and is therefore not robust enough. An alternative to the paired t-test is the permutation test (Odén et al., 1975), which makes no assumptions on the distribution. It checks how different the t-statistic of a specific permutation is from many (or all) other random permutations' t-statistics. By doing so, it is able to calculate the p-value of that specific permutation. We suggest a differentiable version of the permutation test, implemented as a neural network that receives either a real ordering or a shuffled one, and tries to predict whether a permutation was applied. In Algorithm 1, we define the full differentiable permutation loss. The permutation phase consists of four steps: (1) For each pair $(x_i, y_i)$, each of size $d$, sample a random bit $z_i$ (line 4 in the algorithm). (2) If $z_i = 0$, concatenate the pair in the original order, $x_i \oplus y_i$, while if $z_i = 1$, concatenate in the permuted order, $y_i \oplus x_i$. This yields a vector of size $2d$ (lines 5-6). (3) This sample enters a classifier that tries to predict $z_i$ using binary cross-entropy (lines 7-8). (4) We update the classifier weights (line 9), and return to step 1, until convergence. Assuming $X$ and $Y$ are exchangeable, the classifier will not be able to distinguish whether a permutation was performed or not. The idea behind this test is that if a classifier is not able to predict the true ordering of the pairs, there is no significant difference between this specific permutation and any other permutation. Mathematically, similarly to C2ST, the t-statistic can be calculated from the number of correct predictions of the classifier.
While this is similar in motivation to the naive permutation test (Odén et al., 1975), we offer additional benefits: (1) The test is differentiable, meaning we can represent it as a neural network layer (see Section 3.4). (2) The test can handle $d > 1$. (3) We do not need to define a specific t-statistic for each problem; rather, the classifier checks for any possible signal.
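The permutation phase described above (random bit, order-dependent concatenation, binary cross-entropy on the bit) can be sketched in NumPy as follows; the classifier itself and the weight update are omitted, so this only shows how a training batch and its loss are formed:

```python
import numpy as np

rng = np.random.default_rng(0)

def permutation_batch(X, Y):
    """One batch for the permutation loss: for each pair (x_i, y_i) of size d,
    flip a fair coin z_i; concatenate in the original order x_i (+) y_i if
    z_i = 0, otherwise in the swapped order y_i (+) x_i. The classifier must
    then predict z_i from the resulting 2d-dimensional vector."""
    z = rng.integers(0, 2, size=len(X))                 # step 1: random bits
    orig = np.concatenate([X, Y], axis=1)               # x (+) y
    perm = np.concatenate([Y, X], axis=1)               # y (+) x
    swapped = np.where(z[:, None] == 0, orig, perm)     # step 2
    return swapped, z

def bce(p, z):
    """Step 3: binary cross-entropy between predicted probabilities p
    and the true permutation bits z."""
    p = np.clip(np.asarray(p, dtype=float), 1e-7, 1 - 1e-7)
    z = np.asarray(z, dtype=float)
    return float(-np.mean(z * np.log(p) + (1 - z) * np.log(1 - p)))
```

If the two groups are exchangeable, no classifier can beat chance on `z`, and the loss stays near log 2.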
As a real-life example that illustrates the power of the loss, assume a person is given a real coin and a fake coin. If she observes each coin separately, her confidence about which is which will be much lower than if she receives them together and can compare them directly. This example demonstrates the important difference between C2ST and the permutation loss (see Section 5.3 for an additional example over synthetic data).
3.4. Permutation Discriminator
Going back to our discriminator, we observe that the true and fake attributes are paired (the same node, with a real sensitive attribute or a dummy attribute). We create paired samples $(h_v, s_v, \tilde{s}_v)$. At each step we randomly permute the sensitive attribute and its dummy attribute, creating a sample labeled as permuted or not. Our samples are thus $h_v \oplus s_v \oplus \tilde{s}_v$ with label $1$, indicating no permutation was applied, and $h_v \oplus \tilde{s}_v \oplus s_v$ with label $0$, indicating a permutation was applied. The discriminator receives these samples and predicts the probability that a permutation was applied. We therefore adapt the adversarial loss of Eq. 4 to:
(5) $L_{perm} = \mathbb{E}_v\!\left[\log D(h_v \oplus s_v \oplus \tilde{s}_v)\right] + \mathbb{E}_v\!\left[\log\!\left(1 - D(h_v \oplus \tilde{s}_v \oplus s_v)\right)\right]$
The loss used in the permutation test is a binary cross-entropy and is therefore convex.
As an additional final regularization, and to improve the stability of the classifier, similarly to (Romano et al., 2020), we propose to minimize the absolute difference between the covariance of $\hat{y}$ and $s$ and the covariance of $\hat{y}$ and $\tilde{s}$:
(6) $L_{cov} = \left|\operatorname{Cov}(\hat{y}, s) - \operatorname{Cov}(\hat{y}, \tilde{s})\right|$
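A minimal sketch of this covariance penalty, assuming predictions and attributes arranged as flat numeric arrays (the sample-covariance estimator is our illustrative choice):

```python
import numpy as np

def covariance_penalty(y_hat, s, s_dummy):
    """Eq. (6) sketch: absolute difference between Cov(y_hat, s) and
    Cov(y_hat, s~), encouraging the prediction to co-vary identically
    with the real and the dummy sensitive attribute."""
    c_real = np.cov(np.asarray(y_hat, float), np.asarray(s, float))[0, 1]
    c_fake = np.cov(np.asarray(y_hat, float), np.asarray(s_dummy, float))[0, 1]
    return float(abs(c_real - c_fake))
```

When the prediction tracks the real attribute but anti-tracks the dummy, the penalty is large; when it treats both identically, the penalty vanishes.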
3.5. EqGNN
The sampler is pretrained using Eq. 3. We then jointly train the classifier and the discriminator, optimizing the objective function:
(7) $\min_{\theta} \max_{\phi} \; L(\hat{y}, y) + \lambda L_{perm} + \gamma L_{cov}$
where $\theta$ are the parameters of the classifier and $\phi$ are the parameters of the discriminator. $\lambda$ and $\gamma$ are hyperparameters that tune the different regularization terms. This objective is optimized for $\theta$ and $\phi$ one step at a time using the Adam optimizer (Kingma and Ba, 2015) with fixed learning rate and weight decay. The training is further detailed in Algorithm 2.
4. Experimental Setup
In this section, we describe our datasets, baselines, and metrics. Our baselines include fair baselines designed specifically for graphs, and general fair baselines that we adapted to the graph domain.
4.1. Datasets
Table 1 summarizes the characteristics of the datasets used in our experiments. Intra-group edges are edges between nodes with the same sensitive attribute, while inter-group edges are edges between nodes with different sensitive attributes.
Pokec (Takac and Zabovsky, 2012). Pokec is a popular social network in Slovakia. An anonymized snapshot of the network was taken in 2012. User profiles include gender, age, hobbies, interests, education, etc. The original Pokec dataset contains millions of users. We sampled a subnetwork of the "Zilinsky" province. We create two datasets, where the sensitive attribute is gender in one and region in the other. The label used for classification is the job of the user. The job field was grouped as follows: (1) "education" and "student", (2) "services & trade" and "construction", and (3) "unemployed".
NBA (Dai and Wang, 2021). This dataset was presented in the FairGNN baseline paper. The NBA Kaggle dataset contains around 400 basketball players, with features including performance statistics, nationality, age, etc. The dataset was extended in (Dai and Wang, 2021) to include the relationships of the NBA basketball players on Twitter. The binary sensitive attribute is whether a player is a U.S. player or an overseas player, while the task is to predict whether a player's salary is over the median.
Dataset  Pokec-region  Pokec-gender  NBA

# of nodes
# of attributes
# of edges
sensitive groups ratio
# of inter-group edges
# of intra-group edges
4.2. Baselines
In our evaluation, we compare to the following baselines:
GCN (Kipf and Welling, 2017): GCN is a classic GNN layer that updates a node representation by averaging the representations of its neighbors. For fair comparison, we implemented GCN as the classifier of the EqGNN architecture (i.e., an unregulated baseline, with the fairness regularization removed).
Debias (Zhang et al., 2018): Debias optimizes EO by using a discriminator that, given $\hat{y}$ and $y$, predicts the sensitive attribute. While Debias is a non-graph architecture, for fair comparison we implemented Debias with the exact architecture of EqGNN. Unlike EqGNN, Debias's discriminator receives as input only $\hat{y}$ and $y$ (without the sensitive attribute or dummy attribute) and predicts the sensitive attribute. As the discriminator receives $y$, it neglects the sensitive information with respect to the label and therefore optimizes for EO.
FairGNN (Dai and Wang, 2021): FairGNN uses a discriminator that, given $\hat{y}$, predicts the sensitive attribute. By doing so, it neglects the sensitive information in $\hat{y}$. As this is done without respect to $y$, it optimizes SP (further explained in Section 3.1). FairGNN offers an additional predictor for nodes with unknown sensitive attributes; as our setup includes all nodes' sensitive attributes, this predictor is irrelevant. We opted to use FairGCN for fair comparison. In addition, we generalized their architecture to support multi-class classification.
For all baselines, the nodes are split into train, validation, and test sets. The validation set is used for choosing the best model for each baseline throughout training. As the classifier is the only part of the architecture used at test time, early stopping was applied after the validation loss (Eq. 7) had not improved for a fixed number of epochs. The epoch with the best validation loss was then used for testing. All results are averaged over different train/validation/test splits for the Pokec datasets and for the NBA dataset. For fair comparison, we ran a grid search over $\lambda$ for baselines with a discriminator, and over $\gamma$ for baselines with a covariance expression. The same values were selected for both Pokec datasets across all baselines, while for NBA different values were selected, except for FairGNN. All experiments used a single Nvidia P100 GPU, with an average run of 5 minutes per seed for Pokec and 1 minute for NBA. Results were logged and analyzed using the Weights & Biases (Biewald, 2020) and PyTorch Lightning (Falcon, 2019) platforms.
4.3. Metrics
4.3.1. Fairness Metrics
Equalized odds.
The definition in Eq. 2 can be formally written as:
(8) $P(\hat{Y} = \hat{y} \mid Y = y, S = 0) = P(\hat{Y} = \hat{y} \mid Y = y, S = 1)$
Each probability can be calculated from the test set as follows: given all samples with label $y$ and sensitive attribute $s$, we calculate the proportion of samples that were labeled $\hat{y}$ by the model. As we handle a binary sensitive attribute, given a label $y$, we calculate the absolute difference between the two sensitive attribute values:
(9) $\Delta_{EO}(y) = \left|P(\hat{Y} = y \mid Y = y, S = 0) - P(\hat{Y} = y \mid Y = y, S = 1)\right|$
According to Eq. 8, our goal is to have both probabilities equal. Therefore, we want $\Delta_{EO}(y)$ to approach $0$. We finally aggregate over all labels using the max operator to get a final scalar metric:
(10) $\Delta_{EO} = \max_y \Delta_{EO}(y)$
As we propose an equalized odds architecture, $\Delta_{EO}$ is our main fairness metric.
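The $\Delta_{EO}$ computation described above can be sketched as follows, assuming a binary sensitive attribute encoded as 0/1:

```python
import numpy as np

def delta_eo(y_true, y_pred, s):
    """Eqs. (9)-(10) sketch: for each label y, the absolute gap between
    P(Y_hat = y | Y = y, S = 0) and P(Y_hat = y | Y = y, S = 1),
    aggregated with a max over labels."""
    y_true, y_pred, s = map(np.asarray, (y_true, y_pred, s))
    gaps = []
    for y in np.unique(y_true):
        rates = []
        for group in (0, 1):
            mask = (y_true == y) & (s == group)
            # proportion of group members with true label y predicted as y
            rates.append(float(np.mean(y_pred[mask] == y)) if mask.any() else 0.0)
        gaps.append(abs(rates[0] - rates[1]))
    return float(max(gaps))
```

A value of 0 means the per-label true positive rates match across the two sensitive groups.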
Statistical parity.
The definition in Eq. 1 can be formally written as:
(11) $P(\hat{Y} = \hat{y} \mid S = 0) = P(\hat{Y} = \hat{y} \mid S = 1)$
Each probability can be calculated from the test set as follows: given all samples with sensitive attribute $s$, we calculate the proportion of samples that were labeled $\hat{y}$ by the model. As we handle a binary sensitive attribute, given a label $y$, we calculate the absolute difference between the two sensitive attribute values:
(12) $\Delta_{SP}(y) = \left|P(\hat{Y} = y \mid S = 0) - P(\hat{Y} = y \mid S = 1)\right|$
According to Eq. 11, our goal is to have both probabilities equal. Therefore, we want $\Delta_{SP}(y)$ to approach $0$. We finally aggregate over all labels using the max operator to get a final scalar metric:
(13) $\Delta_{SP} = \max_y \Delta_{SP}(y)$
4.3.2. Performance Metrics
As our main classification metric, we used the F1 score. We examined both the micro F1 score, which is computed globally based on the true and false predictions, and the macro F1 score, computed per each class and averaged across all classes. For completeness, we also report the Accuracy (ACC).
5. Experimental Results
In this section, we report the experimental results. We start by comparing EqGNN to the baselines (Section 5.1). We then demonstrate the importance of $\lambda$ to the EqGNN architecture (Section 5.2). We continue by showing the superiority of the permutation loss compared to other loss functions, both over synthetic datasets (Section 5.3) and real datasets (Section 5.4). Finally, we explore two qualitative examples that visualize the importance of fairness in graphs (Section 5.5).
5.1. Main Result
Dataset  Metrics  GCN  FairGNN  Debias  EqGNN

Pokec-gender  Δ_EO (%)
Δ_SP (%)
ACC (%)
F1-macro (%)
F1-micro (%)
Pokec-region  Δ_EO (%)
Δ_SP (%)
ACC (%)
F1-macro (%)
F1-micro (%)
NBA  Δ_EO (%)
Δ_SP (%)
ACC (%)
F1-macro (%)
F1-micro (%)
Table 2 reports the results of EqGNN and the baselines over the datasets with respect to the performance and fairness metrics. We can notice that, while the performance metrics are very similar across all baselines (apart from Debias in Pokec-region), EqGNN outperforms all other baselines in both fairness metrics. An interesting observation is that Debias is the second best, after EqGNN, at improving the EO metric without harming the performance metrics. This can be explained by it being the only other baseline to optimize with respect to EO. Additionally, Debias gained fairness in Pokec-region, but at the cost of performance. This is a general phenomenon: the lower the performance, the better the fairness. For example, when the performance is random, the algorithm surely does not prefer any particular group and is therefore extremely fair. Here, EqGNN is able to optimize the fairness metrics while keeping the performance metrics high. The particularly low performance demonstrated by FairGNN was also validated with the authors of the paper; their previously reported results were validated over a single validation split as opposed to several, which is needed to ensure statistical significance.
5.2. The Discriminator for Bias Reduction
As a second analysis, we demonstrate the importance of the $\lambda$ parameter with respect to the performance and fairness metrics. The hyperparameter $\lambda$ trades off task performance against fairness: high values of $\lambda$ make the discriminator's EO regularization of the classifier stronger. While the EqGNN results reported in this paper use fixed $\lambda$ values for the Pokec datasets and for NBA, we show additional results for a range of $\lambda$ values. In Figure 2, we can observe that the selected $\lambda$s show the best results over all metrics for Pokec-gender, and similar results were observed over Pokec-region and NBA. Naturally, enlarging $\lambda$ results in a fairer model, but at the cost of performance. The hyperparameter is an issue of priority: depending on the task, one should decide on the performance vs. fairness prioritization. Therefore, EqGNN can be used with any desired $\lambda$; we chose ours at the elbow of the curve.
5.3. Synthetic Evaluation of the Permutation Loss
In this experiment, we wish to demonstrate the power of the permutation loss over synthetic data. Using the notation of Section 3.3, we generate two paired groups $X, Y$ as follows:
Rotation:
$y_i = R_\theta x_i$, where $x_i \sim \mathcal{N}(0, I_2)$ and $R_\theta$ is a rotation by a fixed angle $\theta$. This can simply be thought of as one group being a 2-dimensional Gaussian, while the second is the exact same Gaussian but rotated by $\theta$. As a rotation of a Gaussian is also a Gaussian, $Y$ is also a 2-dimensional Gaussian. Yet, it is paired to $X$, as given a sample from $X$, we can predict its pair in $Y$ (simply rotate it by $\theta$).
Shift:
$y_i = x_i + (\delta, 0)$, where $x_i \sim \mathcal{N}(0, I_2)$ and $\delta$ is a small constant. This can simply be thought of as one group being a 2-dimensional Gaussian, while the second is the exact same Gaussian but shifted by $\delta$ on the first axis. As shifting a Gaussian by a small value yields a largely overlapping distribution, it is hard to distinguish between the two. Yet, it is paired to $X$, as given a sample from $X$, we can predict its pair in $Y$ (simply add $\delta$).
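The two generators can be sketched as follows; the shift magnitude and rotation angle are assumed values for illustration, as the exact constants are not specified here:

```python
import numpy as np

rng = np.random.default_rng(42)

def make_shift(n, delta=0.1):
    """Shift dataset: y_i is x_i translated by delta along the first axis
    (delta = 0.1 is an assumed value)."""
    x = rng.standard_normal((n, 2))
    return x, x + np.array([delta, 0.0])

def make_rotation(n, theta=np.pi / 4):
    """Rotation dataset: y_i is x_i rotated by angle theta
    (theta = pi/4 is an assumed value)."""
    x = rng.standard_normal((n, 2))
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    return x, x @ R.T
```

Both transformations preserve the marginal Gaussian shape of each group while creating a deterministic, recoverable pairing, which is exactly the regime where unpaired tests fail.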
Over these two synthetic datasets, we train four classifiers:
T-test (adapted): As the original unpaired t-test requires one-dimensional data, we first map the samples into a single scalar using a fully connected layer and train it using the t-statistic defined in Definition 3.1.
Paired T-test (adapted): Similar to T-test, but using the paired t-statistic defined in Definition 3.3.
C2ST (Lopez-Paz and Oquab, 2017): A linear classifier that, given a sample, tries to predict which group it belongs to.
Permutation (Section 3.3): A linear classifier that, given a randomly shuffled pair, predicts whether it was shuffled or not. For implementation details please refer to Algorithm 1.
We sample pairs for training and additional pairs for testing, and average results over 5 runs. In Table 3 we report the p-values of the classifiers over the different generated datasets.
Model  Shift  Rotation

T-test  0.24  0.47
Paired T-test  0  0.36
C2ST  0.5  0.5
Permutation  0  0
We can observe that the permutation classifier captures the difference between the pairs perfectly in both datasets. This does not hold for the Paired T-test, which captures the difference only for the Shift dataset. The reason it classifies well only over the Shift dataset is that the shift is a linear transformation, which is easier to learn. We can further notice that both unpaired classifiers (T-test and C2ST) perform poorly over both datasets. The promising results of the permutation classifier on our synthetic datasets drive us to choose it as the discriminator loss in the EqGNN architecture. We validate this choice over real datasets in the next section.
5.4. The Importance of the Permutation Loss
| Dataset | Metric | Unpaired | Paired | Permutation/h | Permutation |
|---|---|---|---|---|---|
| Pokec-gender | (%) | | | | |
| | (%) | | | | |
| | ACC (%) | | | | |
| | F1-macro (%) | | | | |
| | F1-micro (%) | | | | |
| Pokec-region | (%) | | | | |
| | (%) | | | | |
| | ACC (%) | | | | |
| | F1-macro (%) | | | | |
| | F1-micro (%) | | | | |
| NBA | (%) | | | | |
| | (%) | | | | |
| | ACC (%) | | | | |
| | F1-macro (%) | | | | |
| | F1-micro (%) | | | | |
As an ablation study, we compare different loss functions for the discriminator. We compare the permutation loss with three alternatives: (1) Unpaired: inspired by (Romano et al., 2020), an unpaired binary cross-entropy loss as presented in Eq. 4. The loss is estimated by a classifier that predicts whether a sample represents a real sensitive attribute or a dummy one. (2) Permutation/h: a permutation loss without concatenating the hidden representation h to the discriminator samples. (3) Paired: a paired loss (Eq. 14), where σ denotes the sigmoid activation. This loss is a neural-network version of the known paired Student's t-test (analogous to what (Lopez-Paz and Oquab, 2017) demonstrated for the unpaired t-test). An implementation of this loss that sums the absolute differences yielded poor results; we therefore report a version with a summation over non-absolute differences.

The results of the different loss functions are reported in Table 4. One can notice that all loss functions gain fairness over our baselines (as reported in Table 2), while the permutation loss with the hidden representation h outperforms the others, and specifically Permutation/h. This implies that the hidden representation is important. In the Pokec datasets, the performance metrics are not impacted, apart from the paired loss; we hypothesize this is caused by its non-convexity in adversarial settings. Additionally, the paired loss demonstrates the same phenomenon again: the lower the performance, the better the fairness. In the NBA dataset we do not see much difference between the loss functions, which can be explained by the size of the graph. However, we do see that the permutation loss is the only one to improve the fairness metrics while not hurting the performance metrics. Finally, we notice that the paired loss functions (the permutation loss and the paired loss) perform better than the unpaired loss (apart from NBA, where the unpaired loss hurts the performance metrics). This can be explained by our paired problem, where we check for the difference between two scenarios of a node (real and fake), and it illustrates the general importance of a paired loss function for paired problems.
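As a hedged sketch of the paired loss of Eq. 14 (our own reading; the exact form in the paper may differ), the discriminator scores the true and the sampled sensitive attribute of each node, and the loss squashes an aggregate of the signed, non-absolute differences with a sigmoid:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def paired_discriminator_loss(scores_true, scores_fake):
    """Assumed form of the paired loss (Eq. 14): aggregate the signed
    differences per node pair, then squash with a sigmoid. The
    absolute-difference variant mentioned in the text performed poorly."""
    diffs = scores_true - scores_fake   # one signed difference per node
    return sigmoid(diffs.mean())
```

With identical scores the loss sits at 0.5, and it moves monotonically away from 0.5 as the discriminator separates true from sampled attributes, which is what makes it usable as an adversarial regularization signal.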
5.5. Qualitative Example
We end this section with a few qualitative examples over the Pokec-gender test set. Specifically, we present two qualitative examples, where the central node has the same sensitive attribute as the majority of its 2-hop neighbors, but holds a different label from most of them. We consider 2 hops, as our classifier includes 2 GCN layers. Figure 3 presents one example where the neighborhood is mostly male and one where it is mostly female. A biased approach would be inclined to predict for the central node the same label as its same-gender neighbors. Above the central node we present the prediction distribution with and without the discriminator. In the first example we observe that, when no discriminator is applied, and therefore there is no regularization for bias, the probability mass leans towards the label of most neighbors. On the other hand, when the discriminator is applied, the probability of that class drops, and the probability of the correct label rises. We observe the same behavior in the second example. These qualitative examples show that an equalized-odds regularizer over graphs can help make less biased predictions, even when the neighbors in the graph might induce bias.
6. Conclusions
In this work, we explored fairness in graphs. Unlike previous works, which optimize the statistical parity (SP) fairness criterion, we present a method that learns to optimize equalized odds (EO). While SP promises equal chances between groups, it might cripple the utility of the prediction task, as it does not ensure equal opportunity the way EO does. We propose a method that trains a GNN model to neglect information regarding the sensitive attribute, but only with respect to its label. Our method pretrains a sampler to learn the distribution of the sensitive attributes of the nodes given their labels. We then train a GNN classifier while regularizing it with a discriminator that discriminates between true and sampled sensitive attributes using a novel loss function – the “permutation loss”. This loss allows comparison of pairs. For the unique setup of adversarial learning over graphs, we show it brings performance gains both in utility and in EO. While this work uses the loss for the specific case of nodes in two scenarios, fake and true, the loss is general and can be used for any paired problem.
For future work, we wish to test the novel loss over additional architectures and tasks. We draw the reader's attention to the fact that the C2ST discriminator is the commonly used discriminator in many architectures that work over paired data. For instance, the pix2pix architecture (Isola et al., 2017) is a classic architecture that inspired many works. Although the pix2pix discriminator receives paired samples, it is still just an advanced C2ST discriminator. Using a paired discriminator instead can create a much more powerful discriminator and, therefore, a much more powerful generator. Surveying many works that operate over paired samples, we have not found any discriminator architectures designed specifically for pairs. We believe that, although this work uses the permutation loss for a specific use case, it is a general architecture that can be used for any paired problem.

We empirically show that our method outperforms different baselines in the combined fairness-performance metrics, over datasets with different attributes and sizes. To the best of our knowledge, we are the first to optimize GNNs for the EO criterion, and we hope this work will serve as a beacon for works to come.
References
- S. Barocas, M. Hardt, and A. Narayanan (2019). Fairness and Machine Learning. fairmlbook.org. http://www.fairmlbook.org
- M. Belkin and P. Niyogi (2002). Laplacian eigenmaps and spectral techniques for embedding and clustering. In Advances in Neural Information Processing Systems, Vol. 14, pp. 585–591.
- L. Biewald (2020). Experiment tracking with Weights and Biases. Software available from wandb.com.
- A. Bose and W. Hamilton (2019). Compositional fairness constraints for graph embeddings. In International Conference on Machine Learning, pp. 715–724.
- M. Buyl and T. De Bie (2020). DeBayes: a Bayesian method for debiasing network embeddings. In International Conference on Machine Learning, pp. 1220–1229.
- Z. Cui, K. Henrickson, R. Ke, and Y. Wang (2019). Traffic graph convolutional recurrent neural network: a deep learning framework for network-scale traffic learning and forecasting. IEEE Transactions on Intelligent Transportation Systems.
- E. Dai and S. Wang (2021). Say no to the discrimination: learning fair graph neural networks with limited sensitive attribute information. In Proceedings of the 14th ACM International Conference on Web Search and Data Mining, pp. 680–688.
- C. Dwork, M. Hardt, T. Pitassi, O. Reingold, and R. Zemel (2012). Fairness through awareness. In Proceedings of the 3rd Innovations in Theoretical Computer Science Conference, pp. 214–226.
- W. Falcon (2019). PyTorch Lightning. GitHub. https://github.com/PyTorchLightning/pytorch-lightning
- A. Grover and J. Leskovec (2016). node2vec: scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 855–864.
- W. Hamilton, R. Ying, and J. Leskovec (2017). Inductive representation learning on large graphs. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pp. 1025–1035.
- M. Hardt, E. Price, and N. Srebro (2016). Equality of opportunity in supervised learning. Advances in Neural Information Processing Systems 29, pp. 3315–3323.
- P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros (2017). Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1125–1134.
- B. Kang, J. Lijffijt, and T. De Bie (2019). Conditional network embeddings. International Conference on Learning Representations.
- J. Kang, J. He, R. Maciejewski, and H. Tong (2020). InFoRM: individual fairness on graph mining. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 379–389.
- D. P. Kingma and J. Ba (2015). Adam: a method for stochastic optimization. International Conference on Learning Representations.
- T. N. Kipf and M. Welling (2017). Semi-supervised classification with graph convolutional networks. International Conference on Learning Representations.
- D. Lopez-Paz and M. Oquab (2017). Revisiting classifier two-sample tests. In International Conference on Learning Representations.
- Z. Obermeyer, B. Powers, C. Vogeli, and S. Mullainathan (2019). Dissecting racial bias in an algorithm used to manage the health of populations. Science 366 (6464), pp. 447–453.
- A. Oden and H. Wedel (1975). Arguments for Fisher's permutation test. The Annals of Statistics 3 (2), pp. 518–520.
- D. Pedreshi, S. Ruggieri, and F. Turini (2008). Discrimination-aware data mining. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 560–568.
- B. Perozzi, R. Al-Rfou, and S. Skiena (2014). DeepWalk: online learning of social representations. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 701–710.
- T. Rahman, B. Surma, M. Backes, and Y. Zhang (2019). Fairwalk: towards fair graph embedding. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19, pp. 3289–3295.
- Y. Romano, S. Bates, and E. Candès (2020). Achieving equalized odds by resampling sensitive attributes. Advances in Neural Information Processing Systems.
- S. T. Roweis and L. K. Saul (2000). Nonlinear dimensionality reduction by locally linear embedding. Science 290 (5500), pp. 2323–2326.
- U. Singer, I. Guy, and K. Radinsky (2019). Node embedding over temporal graphs. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19, pp. 4605–4612.
- U. Singer, K. Radinsky, and E. Horvitz (2020). On biases of attention in scientific discovery. Bioinformatics, btaa1036.
- L. Takac and M. Zabovsky (2012). Data analysis in public social networks. In International Scientific Conference and International Workshop Present Day Trends of Innovations, Vol. 1.
- J. B. Tenenbaum, V. de Silva, and J. C. Langford (2000). A global geometric framework for nonlinear dimensionality reduction. Science 290 (5500), pp. 2319–2323.
- P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio (2018). Graph attention networks. International Conference on Learning Representations.
- H. Wang and Z. Li (2017). Region representation learning via mobility flow. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, pp. 237–246.
- Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, and P. S. Yu (2020). A comprehensive survey on graph neural networks. IEEE Transactions on Neural Networks and Learning Systems.
- S. Yan, D. Xu, B. Zhang, H.-J. Zhang, Q. Yang, and S. Lin (2007). Graph embedding and extensions: a general framework for dimensionality reduction. IEEE Transactions on Pattern Analysis and Machine Intelligence 29 (1), pp. 40–51.
- S. Yan, Y. Xiong, and D. Lin (2018). Spatial temporal graph convolutional networks for skeleton-based action recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 32.
- B. Yu, H. Yin, and Z. Zhu (2018). Spatio-temporal graph convolutional networks: a deep learning framework for traffic forecasting. In Proceedings of the 27th International Joint Conference on Artificial Intelligence, pp. 3634–3640.
- B. H. Zhang, B. Lemoine, and M. Mitchell (2018). Mitigating unwanted biases with adversarial learning. In Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society, pp. 335–340.