
EqGNN: Equalized Node Opportunity in Graphs

08/19/2021
by   Uriel Singer, et al.
Technion

Graph neural networks (GNNs) have been widely used for supervised learning tasks on graphs, reaching state-of-the-art results. However, little work has been dedicated to creating unbiased GNNs, i.e., where the classification is uncorrelated with sensitive attributes such as race or gender. Existing approaches either ignore the sensitive attributes or optimize for the criterion of statistical parity for fairness. However, it has been shown that neither approach ensures fairness, but rather cripples the utility of the prediction task. In this work, we present a GNN framework that allows optimizing representations for the equalized odds fairness criterion. The architecture is composed of three components: (1) a GNN classifier predicting the utility class, (2) a sampler learning the distribution of the sensitive attributes of the nodes given their labels, which generates samples fed into (3) a discriminator that discriminates between true and sampled sensitive attributes using a novel "permutation loss" function. Using these components, we train a model to neglect information regarding the sensitive attribute only with respect to its label. To the best of our knowledge, we are the first to optimize GNNs for the equalized odds criterion. We evaluate our classifier over several graph datasets and sensitive attributes and show that our algorithm reaches state-of-the-art results.


1. Introduction

Supervised learning has been shown to exhibit bias depending on the data it was trained on (Pedreshi et al., 2008). This problem is further amplified in graphs, where the graph topology has been shown to exhibit different biases (Kang et al., 2019; Singer et al., 2020). Many popular supervised-learning graph algorithms, such as graph neural networks (GNNs), employ message passing with features aggregated from neighbors, which might further intensify this bias. For example, in social networks, communities are usually more densely connected within themselves. As GNNs aggregate information from neighbors, it becomes even harder for a classifier to realize the potential of an individual from a discriminated community.

Despite their success (Wu et al., 2020), little work has been dedicated to creating unbiased GNNs, where the classification is uncorrelated with sensitive attributes such as race or gender. Some of the existing work focused on ignoring the sensitive attributes (Obermeyer et al., 2019). However, under such "fairness through unawareness", it has already been shown that sensitive attributes can be predicted from the remaining features (Barocas et al., 2019). Others (Bose and Hamilton, 2019; Dai and Wang, 2021; Buyl and De Bie, 2020; Rahman et al., 2019) focused on the criterion of Statistical Parity (SP) for fairness when training node embeddings, which is defined as follows:

Definition 1.1 (Statistical parity).

A predictor $\hat{Y}$ satisfies statistical parity with respect to a sensitive attribute $S$, if $\hat{Y}$ and $S$ are independent:

$$P(\hat{Y} \mid S) = P(\hat{Y}) \quad (1)$$

Recently, (Dwork et al., 2012) showed that SP does not ensure fairness and might actually cripple the utility of the prediction task. Consider the target of college acceptance and the sensitive attribute of demographics. If the target variable correlates with the sensitive attribute, statistical parity would not allow an ideal predictor. Additionally, the criterion allows accepting qualified applicants in one demographic, but unqualified in another, as long as the percentages of acceptance match.

In recent years, the notion of Equalized Odds (EO) was presented as an alternative fairness criterion (Hardt et al., 2016). Unlike SP, EO allows dependence on the sensitive attribute, but only through the target variable $Y$:

Definition 1.2 (Equalized odds).

A predictor $\hat{Y}$ satisfies equalized odds with respect to a sensitive attribute $S$, if $\hat{Y}$ and $S$ are independent conditional on the true label $Y$:

$$P(\hat{Y} \mid S, Y) = P(\hat{Y} \mid Y) \quad (2)$$

The definition encourages the use of features that allow to directly predict $Y$, while not allowing the model to leverage $S$ as a proxy for $Y$. Consider our college acceptance example. For the outcome $Y$ = Accept, we require $\hat{Y}$ to have similar true and false positive rates across all demographics. Notice that the perfect predictor $\hat{Y} = Y$ aligns with the equalized odds constraint, but we also enforce that the accuracy is equally high in all demographics, and penalize models that perform well only on the majority demographics.
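To make the contrast concrete, consider an illustrative calculation (the numbers here are ours, not from the paper): two equally sized demographics $s \in \{0, 1\}$ with qualification rates $P(Y{=}1 \mid S{=}0) = 0.6$ and $P(Y{=}1 \mid S{=}1) = 0.2$, where group 0 applicants are accepted exactly when qualified, and group 1 applicants are accepted uniformly at random with probability $0.6$. Then

$$\text{SP: } P(\hat{Y}{=}1 \mid S{=}0) = 0.6 = P(\hat{Y}{=}1 \mid S{=}1) \quad \text{(satisfied)}$$

$$\text{EO: } P(\hat{Y}{=}1 \mid Y{=}1, S{=}0) = 1.0 \neq 0.6 = P(\hat{Y}{=}1 \mid Y{=}1, S{=}1) \quad \text{(violated)}$$

so the acceptance rates match across demographics, yet qualified applicants from group 1 are treated no better than unqualified ones.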

In this work, we present an architecture that optimizes graph classification for the EO criterion. Given a GNN classifier predicting a target class, our architecture expands it with a sampler and a discriminator component. The goal of the sampler component is to learn the distribution of the sensitive attributes of the nodes given their labels. The sampler generates examples that are then fed into a discriminator. The goal of the latter is to discriminate between true and sampled sensitive attributes. We present a novel loss function the discriminator minimizes, the permutation loss. Unlike the cross-entropy loss, which compares two independent or unrelated groups, the permutation loss compares items under two separate scenarios: with the true sensitive attribute or with a generated balanced sensitive attribute.

We start by pretraining the sampler, and then train the discriminator along with the GNN classifier using adversarial training. This joint training allows the model to neglect information regarding the sensitive attribute only with respect to its label, as required by the equalized odds fairness criterion. To the best of our knowledge, our work is the first to optimize GNNs for the equalized odds criterion.

The contributions of this work are fourfold:


  • We propose EqGNN, an algorithm with equalized odds regulation for graph classification tasks.

  • We propose a novel permutation loss which allows us to compare pairs. We use this loss in the special case of nodes in two different scenarios: one under the biased sensitive distribution, and the other under the generated unbiased distribution.

  • We empirically evaluate EqGNN on several real-world datasets and show superior performance to several baselines both in utility and in bias reduction.

  • We empirically evaluate the permutation loss over both synthetic and real-world datasets and show the importance of leveraging the pair information.

2. Related Work

Supervised learning in graphs has been applied in many applications, such as protein-protein interaction prediction (Grover and Leskovec, 2016; Singer et al., 2019), human movement prediction (Yan et al., 2018), traffic forecasting (Yu et al., 2018; Cui et al., 2019) and other urban dynamics (Wang and Li, 2017). Many supervised learning algorithms have been suggested for these tasks on graphs, including matrix factorization approaches (Belkin and Niyogi, 2001; Tenenbaum et al., 2000; Yan et al., 2006; Roweis and Saul, 2000), random walk approaches (Perozzi et al., 2014; Grover and Leskovec, 2016) and graph neural networks, which have recently shown state-of-the-art results on many tasks (Wu et al., 2020). The latter are an adaptation of neural networks to the graph domain. GNNs define differentiable layers that can be added to many different architectures and tasks. GNNs utilize the graph structure by propagating information through the edges and nodes. For instance, GCN (Kipf and Welling, 2017) and GraphSAGE (Hamilton et al., 2017) update the node representations by averaging over the representations of all neighbors, while (Veličković et al., 2017) proposed an attention mechanism to learn the importance of each specific neighbor.

Fairness in graphs has mostly been studied in the context of group fairness, by optimizing the SP fairness criterion. (Rahman et al., 2019) creates fair random walks by first sampling a sensitive attribute and only then sampling a neighbor from those who hold that specific sensitive attribute. For instance, if most nodes represent men while the minority represent women, the fair random walk ensures that men and women appear equally often in the walks. (Buyl and De Bie, 2020) proposed a Bayesian method for learning embeddings by using a biased prior. Others focus on unbiasing the graph prediction task itself rather than the node embeddings. For example, (Bose and Hamilton, 2019) uses a set of adversarial filters to remove information about predefined sensitive attributes. It is learned in a self-supervised way by using a graph auto-encoder to reconstruct the graph edges. (Dai and Wang, 2021) offers a discriminator that predicts the nodes' sensitive attributes. In their setup, not all nodes' sensitive attributes are known, and therefore they add an additional component that predicts the missing attributes. (Kang et al., 2020) tackles the challenge of individual fairness in graphs. In this work, we propose a GNN framework optimizing the EO fairness criterion. To the best of our knowledge, our work is the first to study fairness in graphs in the context of EO fairness.

3. Equalized-Odds Fair Graph Neural Network

Figure 1. The full EqGNN architecture. The blue box represents the sampler model that, given a label, samples a dummy sensitive attribute (see Section 3.1.1 for details). This model is pretrained independently. The green box represents the classifier, which, given a graph and node features, tries to predict node labels. The red box represents the discriminator, which minimizes a permutation loss (Section 3.3). Purple arrows represent loss functions, while the magic box represents a random bit (0 or 1) used for shuffling the sensitive attribute with its dummy.

Let $G = (V, E)$ be a graph, where $E$ is the list of edges, $V$ the list of nodes, $Y$ the labels, and $S$ the sensitive attributes. Each node is represented via $d$ features. We denote $X$ as the feature matrix for the nodes in $V$. Our goal is to learn a function $f_{\theta}$ with parameters $\theta$, that given a graph $G$, maps a node $v \in V$, represented by a feature vector, to its label $y_v$.

In this work, we present an architecture that can leverage any graph neural network classifier. For simplicity, we consider a simple GNN architecture for $f_{\theta}$ as suggested by (Kipf and Welling, 2017): we define $f_{\theta}$ to be two GCN (Kipf and Welling, 2017) layers, outputting a hidden representation $h_v$ for each node. This representation then enters a fully connected layer that outputs the prediction $\hat{y}_v$. The GNN optimization goal is to minimize the distance between $\hat{y}_v$ and $y_v$ using a loss function $\mathcal{L}_{clf}$, which can be categorical cross-entropy (CCE) for multi-class classification, binary cross-entropy (BCE) for binary classification, or mean squared error (L2) for regression problems. In this work, we extend the optimization of $f_{\theta}$ to also satisfy Eq. 2 for fair prediction.
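For reference, the following is a minimal sketch of such a classifier, assuming PyTorch Geometric is available; the class name, layer sizes, and training helper are illustrative choices, not the paper's exact configuration.

import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv


class GNNClassifier(torch.nn.Module):
    """Two GCN layers producing a hidden representation h_v,
    followed by a fully connected layer producing logits for y_v."""

    def __init__(self, in_dim, hidden_dim, num_classes):
        super().__init__()
        self.conv1 = GCNConv(in_dim, hidden_dim)
        self.conv2 = GCNConv(hidden_dim, hidden_dim)
        self.fc = torch.nn.Linear(hidden_dim, num_classes)

    def forward(self, x, edge_index):
        h = F.relu(self.conv1(x, edge_index))
        h = F.relu(self.conv2(h, edge_index))    # hidden representation h_v
        logits = self.fc(h)                      # per-node class logits
        return logits, h


# Plain (unregularized) training step: cross-entropy between predictions and labels.
def classifier_step(model, optimizer, x, edge_index, y, train_mask):
    logits, _ = model(x, edge_index)
    loss = F.cross_entropy(logits[train_mask], y[train_mask])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()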

We propose a method, EqGNN, that trains a GNN model to neglect information regarding the sensitive attribute only with respect to its label. The full architecture of EqGNN is depicted in Figure 1. Our method pretrains a sampler (Section 3.1) to learn the distribution $P(S \mid Y)$ of the sensitive attributes of the nodes given their labels (marked in blue in Figure 1). We then train a GNN classifier (marked in green in Figure 1), while regularizing it with a discriminator (marked in red in Figure 1) that discriminates between true and sampled sensitive attributes. Section 3.2 presents the EO regulation. The regularization is done using a novel loss function, the "permutation loss", which is capable of comparing paired samples (formally presented in Section 3.3, with implementation details discussed in Section 3.4). For the unique setup of adversarial learning over graphs, we show that incorporating the permutation loss in a discriminator brings performance gains both in utility and in EO. Section 3.5 presents the full EqGNN model optimization procedure.

3.1. Sampler

To comply with the SP criterion (Eq. 1), given a sample, we wish the prediction of the classifier, $\hat{y}_v$, to be independent of the sample's sensitive attribute $s_v$. In order to check whether this criterion is kept, we can sample a fake attribute out of $P(S)$ (e.g., in case of equally sized groups, a uniformly random attribute), and check whether a discriminator can tell the true attribute from the fake one. If it is not able to, this means that $\hat{y}_v$ is independent of $s_v$ and the SP criterion is kept. As all information of $\hat{y}_v$ is also represented in the hidden representation $h_v$, one can simply train a discriminator to predict the sensitive attribute given the hidden representation $h_v$. A similar idea was suggested by (Dai and Wang, 2021; Bose and Hamilton, 2019).

To comply with the EO criterion (Eq. 2), the classifier should not be able to separate between an example with a real sensitive attribute and an example with an attribute sampled from the conditional distribution $P(S \mid Y)$. Therefore, we jointly train the classifier with a discriminator that learns to separate between the two examples. Formally, given a sample, we would want the prediction of the classifier, $\hat{y}_v$, to be independent of the attribute $s_v$ only given its label $y_v$. Thus, instead of sampling the fake attribute out of $P(S)$, we sample the fake attribute out of $P(S \mid Y)$.

We continue by describing the sensitive attribute distribution learning process, $P(S \mid Y)$, and then present how the model samples dummy attributes that will be used as "negative" examples for the discriminator.

3.1.1. Sensitive Attribute Distribution Learning

Here, our goal is to learn the distribution $P(S \mid Y)$. For a specific sensitive attribute $s$ and label $y$, the probability can be expressed using Bayes' rule as:

$$P(S = s \mid Y = y) = \frac{P(Y = y \mid S = s) \, P(S = s)}{P(Y = y)} \quad (3)$$

The term $P(Y = y \mid S = s)$ can be derived from the data by counting the number of samples that are both with label $y$ and sensitive attribute $s$, divided by the number of samples with sensitive attribute $s$. Similarly, $P(S = s)$ is calculated as the number of samples with sensitive attribute $s$, divided by the total number of samples. In a regression setup, these can be approximated using a linear kernel density estimation.
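For illustration, a minimal numpy sketch of this counting-based estimation, together with the dummy-attribute sampling described next (Section 3.1.2); the function names (estimate_conditional, sample_dummy) and array layout are our assumptions, not the paper's code.

import numpy as np

def estimate_conditional(s, y):
    """Estimate P(S = s | Y = y) by counting, via Bayes' rule (Eq. 3).
    s, y: integer numpy arrays of the training nodes' sensitive attributes and labels."""
    p_s_given_y = {}
    for lab in np.unique(y):
        for g in np.unique(s):
            p_y_given_s = np.mean(y[s == g] == lab)   # count(y, s) / count(s)
            p_s = np.mean(s == g)                     # count(s) / N
            p_y = np.mean(y == lab)                   # count(y) / N
            p_s_given_y[(g, lab)] = p_y_given_s * p_s / p_y
    return p_s_given_y

def sample_dummy(p_s_given_y, y, rng=np.random.default_rng()):
    """Sample a dummy attribute for every node from P(S | Y = y_v) (Section 3.1.2)."""
    groups = sorted({g for (g, _) in p_s_given_y})
    probs = np.array([[p_s_given_y[(g, lab)] for g in groups] for lab in y])
    cum = probs.cumsum(axis=1)                        # inverse-CDF sampling per node
    u = rng.random((len(y), 1))
    return np.array(groups)[(u < cum).argmax(axis=1)]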

3.1.2. Fair Dummy Attributes

During training of the end-to-end model (Section 3.5), the sampler receives a training example and generates a dummy attribute $\tilde{s}_v$ by sampling $\tilde{s}_v \sim P(S \mid Y = y_v)$. Notice that $s_v$ and $\tilde{s}_v$ are equally distributed given $y_v$. This ensures that if the classifier satisfies the EO criterion, then $s_v$ and $\tilde{s}_v$ will receive an identical classification, whereas otherwise they will result in different classifications. In Section 3.2 we further explain the optimization process that utilizes the dummy attributes for regularizing the classifier for EO.

3.2. Discriminator

A GNN classifier without regulation might learn to predict biased node labels based on their sensitive attributes. To satisfy EO, the classifier should be unable to distinguish between real examples and generated examples with dummy attributes. Therefore, we utilize an adversarial learning process and add a discriminator that learns to distinguish between real and fake examples with dummy attributes. Intuitively, this regularizes the classifier to comply with the EO criterion.

Intuitively, one might consider $h_v$, the last hidden layer of the classifier $f_{\theta}$, as the unbiased representation of node $v$. The discriminator receives two types of examples: (1) real examples $h_v \oplus s_v$ and (2) negative examples $h_v \oplus \tilde{s}_v$, where $\tilde{s}_v$ is generated by the pretrained sampler. The discriminator learns a function $d_{\phi}$ with parameters $\phi$, that given a sample, classifies it as true or fake. The classifier in its turn tries to "fool" it. This ensures the classifier does not hold bias towards specific labels and sensitive attributes and keeps the EO criterion. The formal adversarial loss is defined as:

$$\mathcal{L}_{adv} = \mathbb{E}_{v}\Big[\log d_{\phi}(h_v \oplus s_v) + \log\big(1 - d_{\phi}(h_v \oplus \tilde{s}_v)\big)\Big] \quad (4)$$

where $\mathbb{E}$ is the expected value, and $\oplus$ represents the concatenation operator. The discriminator tries to maximize $\mathcal{L}_{adv}$ to distinguish between true and fake samples, while the classifier tries to minimize it, in order to "fool" it. In our implementation, $d_{\phi}$ is a GNN with two GCN layers outputting the probability of a sample being true.
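A rough sketch of this unpaired adversarial term follows, assuming the classifier sketch above and a discriminator that scores concatenated (h_v, attribute) inputs; for brevity the discriminator here is treated as a plain module over node features, whereas in the paper's setup it is itself a small GNN.

import torch

def adversarial_loss(disc, h, s_real, s_dummy):
    """Binary adversarial objective of Eq. 4 (unpaired variant).
    disc : module mapping [h_v ; attribute] -> probability of being a real sample
    h    : (N, hidden_dim) node representations from the classifier
    s_real, s_dummy : (N, 1) real and sampled (dummy) sensitive attributes."""
    p_real = disc(torch.cat([h, s_real.float()], dim=1))
    p_fake = disc(torch.cat([h, s_dummy.float()], dim=1))
    # The discriminator ascends this objective; the classifier descends it.
    return torch.log(p_real + 1e-8).mean() + torch.log(1 - p_fake + 1e-8).mean()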

We observe that true and fake attributes are paired (the same node, with either a real sensitive attribute or a dummy attribute). A binary loss as defined in Eq. 4 holds for unpaired examples, and does not take advantage of knowing the fake and true attributes are paired. Therefore, we upgrade the loss to handle paired nodes by utilizing the permutation loss. We first formally define and explain the permutation loss in Section 3.3, and then discuss its implementation details in our architecture in Section 3.4.

3.3. Permutation Loss

In this section, we formally define the new permutation loss presented in this work. Let us assume $A$ and $B$ are two groups of subjects. Many applications are interested in learning and understanding the difference between the two. For example, if $A$ represents the test results of students in one class, while $B$ represents the test results of a different class, it is interesting to check whether the two classes are equally distributed (i.e., $P(A) = P(B)$).

Definition 3.1 (T-test).

Given two groups $A, B$, the statistical difference can be measured by the t-statistic:

$$t = \frac{\bar{A} - \bar{B}}{\sqrt{\frac{\sigma_A^2}{n_A} + \frac{\sigma_B^2}{n_B}}}$$

where $\bar{A}, \bar{B}$, $\sigma_A^2, \sigma_B^2$, and $n_A, n_B$ are the means, variances, and group sizes, respectively.

While this test assumes the samples are scalars that are normally distributed, (Lopez-Paz and Oquab, 2017) proposed a method called C2ST that handles samples in $\mathbb{R}^d$ for $d \geq 1$. They proposed a classifier that is trained to predict the correct group of a given sample, which belongs to either $A$ or $B$. By doing so, given a test set, they are able to calculate the t-statistic by simply checking the number of correct predictions:

Definition 3.2 (C2ST).

Given two groups $A, B$ (labeled 0 and 1 respectively), the statistical difference can be measured by the t-statistic (Lopez-Paz and Oquab, 2017):

$$\hat{t} = \frac{1}{n_{te}} \sum_{i=1}^{n_{te}} \mathbb{1}\Big[\mathbb{1}\big(f(x_i) > \tfrac{1}{2}\big) = l_i\Big]$$

where $x_i$ is a test sample, $l_i$ is $x_i$'s original group label, $n_{te}$ is the number of samples in the test set, and $f$ is a trained classifier outputting the probability that a sample was drawn from group 1.

The basic (and yet important) idea behind this test is that if a classifier is not able to predict the group of a given sample, then the groups are equally distributed. Mathematically, the t-statistic can be calculated from the number of correct predictions of the classifier.
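For concreteness, a small sketch of the C2ST statistic computed from a trained two-sample classifier's held-out predictions (illustrative code, not from the paper):

import numpy as np

def c2st_statistic(probs, labels):
    """C2ST test statistic: accuracy of the two-sample classifier on a held-out set.
    probs  : (n_te,) predicted probabilities that each sample belongs to group 1
    labels : (n_te,) true group labels in {0, 1}"""
    predictions = (probs > 0.5).astype(int)
    return np.mean(predictions == labels)   # ~0.5 when the groups are indistinguishable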

However, the C2ST criterion is not optimal when the samples are paired ($a_i$ and $b_i$ represent the same subject), as it does not leverage the pairing information. Consider the following example: assume the pairs follow the rule $a_i \sim \mathcal{N}(0, 1)$ and $b_i = a_i + \epsilon$ for some small $\epsilon > 0$. As the two Gaussians overlap, a simple linear classifier will not be able to detect any significant difference between the two groups, while we hold the information that there is a significant difference ($a_i$ is always smaller than its pair $b_i$). Therefore, it is necessary to define a new test that can manage paired examples.

Definition 3.3 (Paired T-test).

Given two paired groups $A = \{a_i\}_{i=1}^{n}$ and $B = \{b_i\}_{i=1}^{n}$, the statistical difference can be measured by the t-statistic:

$$t = \frac{\bar{d}}{\sigma_d / \sqrt{n}}, \qquad d_i = a_i - b_i$$

where $\bar{d}$ and $\sigma_d$ are the mean and standard deviation of the pairwise differences $d_i$, and $n$ is the number of pairs.

Again, this paired t-test assumes the differences are scalars that are normally distributed. A naive adaptation of (Lopez-Paz and Oquab, 2017) for paired data with $d > 1$ would be to first map the pairs into scalars and then calculate their differences (as done in the paired student t-test for $d = 1$). This approach also assumes the samples are normally distributed, and is therefore not robust enough. An alternative to the paired t-test is the permutation test (Odén et al., 1975), which makes no assumptions on the distribution. It checks how different the t-statistic of a specific permutation is from the t-statistics of many (or all) other random permutations. By doing so, it is able to calculate the p-value of that specific permutation. We suggest a differentiable version of the permutation test. This is done by using a neural network architecture that receives either a real ordering or a shuffled one, and tries to predict whether a permutation was applied.

Input : A, B - paired groups in R^d
        α - learning rate
Output : t - the t-statistic
1  A_tr, B_tr, A_te, B_te = train_test_split(A, B)
2  initialize classifier weights w
3  while not converged do
4      for each training pair (a_i, b_i), sample a random bit e_i ∈ {0, 1}
5      if e_i = 0: x_i = a_i ⊕ b_i   (original order)
6      if e_i = 1: x_i = b_i ⊕ a_i   (permuted order)
7      p_i = classifier_w(x_i)
8      L = BCE(p_i, e_i)
9      w = w - α ∇_w L
10 repeat steps 4-6 on the test pairs
11 count the correct predictions of the classifier on the test pairs
12 derive the t-statistic t from the number of correct predictions (as in C2ST)
13 return t
Algorithm 1 Permutation Loss

In Algorithm 1, we define the full differentiable permutation loss. The permutation phase consists of four steps: (1) For each pair $(a_i, b_i)$, each of size $d$, sample a random bit $e_i$ (line 4 in the algorithm). (2) If $e_i = 0$, concatenate the pair in the original order, $a_i \oplus b_i$, while if $e_i = 1$, concatenate in the permuted order, $b_i \oplus a_i$. This results in a vector of size $2d$ (lines 5-6). (3) This sample enters a classifier that tries to predict $e_i$ using binary cross-entropy (lines 7-8). (4) We update the classifier weights (line 9), and return to step 1, until convergence. Assuming $A$ and $B$ are exchangeable, the classifier will not be able to distinguish whether a permutation was performed or not. The idea behind this test is that if a classifier is not able to predict the true ordering of the pairs, there is no significant difference between this specific permutation and any other permutation. Mathematically, similarly to C2ST, the t-statistic can be calculated from the number of correct predictions of the classifier.
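The following is a minimal PyTorch sketch of one such permutation-loss training step, under the assumptions above (binary random bit, concatenation, BCE); it is illustrative rather than the paper's exact implementation.

import torch
import torch.nn.functional as F

def permutation_loss_step(classifier, a, b):
    """One training step of the differentiable permutation loss (Algorithm 1).
    classifier : module mapping a (N, 2d) tensor to a (N,) probability of 'permuted'
    a, b       : (N, d) tensors of paired samples."""
    n = a.size(0)
    e = torch.randint(0, 2, (n, 1), dtype=torch.float32)   # random bit per pair
    # e = 0 -> original order [a ; b], e = 1 -> permuted order [b ; a]
    first = torch.where(e.bool(), b, a)
    second = torch.where(e.bool(), a, b)
    x = torch.cat([first, second], dim=1)                   # (N, 2d)
    p = classifier(x).view(-1)
    return F.binary_cross_entropy(p, e.view(-1))             # predict whether permuted

At test time, the t-statistic is obtained as in C2ST: the fraction of held-out pairs for which the predicted bit matches the true one.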

While this is similar to the motivation of the classic permutation test (Odén et al., 1975), we offer additional benefits: (1) The test is differentiable, meaning we can use it as a neural network layer (see Section 3.4). (2) The test can handle $d > 1$. (3) We do not need to define a specific t-statistic for each problem; rather, the classifier checks for any possible signal.

As a real-life example that explains the power of the loss, assume a person is given a real coin and a fake coin. If she observes each one separately, her confidence about which is which will be much lower than if she receives them together. This example demonstrates the important difference between C2ST and the permutation loss (see Section 5.3 for an additional example over synthetic data).

3.4. Permutation Discriminator

Going back to our discriminator, we observe that true and fake attributes are paired (the same node, with either a real sensitive attribute or a dummy attribute). We create paired samples $(h_v \oplus s_v, \; h_v \oplus \tilde{s}_v)$. At each step we randomly permute the sensitive attribute and its dummy attribute, creating a sample labeled as permuted or not. Now our samples are $h_v \oplus s_v \oplus \tilde{s}_v$ with label 0, indicating no permutation was applied, and $h_v \oplus \tilde{s}_v \oplus s_v$ with label 1, indicating a permutation was applied. The discriminator therefore receives these samples and predicts the probability that a permutation was applied. We therefore adapt the adversarial loss of Eq. 4 to:

$$\mathcal{L}_{perm} = \mathbb{E}_{v, e_v}\Big[e_v \log d_{\phi}\big(x_v^{e_v}\big) + (1 - e_v) \log\big(1 - d_{\phi}\big(x_v^{e_v}\big)\big)\Big] \quad (5)$$

where $e_v$ is the random permutation bit, $x_v^{0} = h_v \oplus s_v \oplus \tilde{s}_v$, and $x_v^{1} = h_v \oplus \tilde{s}_v \oplus s_v$.

The loss used in the permutation test is a binary cross-entropy and therefore convex.

As an additional final regularization, and to improve the stability of the classifier, similarly to (Romano et al., 2020), we propose to minimize the absolute difference between the covariance of $\hat{y}_v$ and $s_v$ and the covariance of $\hat{y}_v$ and $\tilde{s}_v$:

$$\mathcal{L}_{cov} = \big|\, \mathrm{Cov}(\hat{y}_v, s_v) - \mathrm{Cov}(\hat{y}_v, \tilde{s}_v) \,\big| \quad (6)$$
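A possible implementation of this covariance penalty, assuming scalar prediction scores and attributes encoded as floats (an illustrative sketch, not the paper's exact code):

import torch

def covariance_penalty(y_hat, s_real, s_dummy):
    """Absolute difference between Cov(y_hat, s) and Cov(y_hat, s~), as in Eq. 6.
    y_hat, s_real, s_dummy : 1-D float tensors over the training nodes."""
    def cov(u, v):
        return ((u - u.mean()) * (v - v.mean())).mean()
    return torch.abs(cov(y_hat, s_real) - cov(y_hat, s_dummy))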

3.5. EqGNN

The sampler is pretrained using Eq. 3. We then jointly train the classifier and the discriminator, optimizing the objective function:

$$\min_{\theta} \max_{\phi} \; \mathcal{L}_{clf}(\theta) + \lambda \, \mathcal{L}_{perm}(\theta, \phi) + \gamma \, \mathcal{L}_{cov}(\theta) \quad (7)$$

where $\theta$ are the parameters of the classifier and $\phi$ are the parameters of the discriminator. $\lambda$ and $\gamma$ are hyper-parameters that are used to tune the different regulations. This objective is optimized for $\theta$ and $\phi$ one step at a time using the Adam optimizer (Kingma and Ba, 2015), with fixed learning rate and weight decay. The training is further detailed in Algorithm 2.

Input : X - node features
        G - graph
        S - node sensitive attributes
        Y - node labels
        N - number of nodes
        α - learning rate
Output : f_θ - best classifier according to the validation loss
1  estimate P(S | Y)   // calculate using equation (3)
2  while not converged do
3      h_v, ŷ_v = f_θ(X, G) for every node v
4      s̃_v ~ P(S | Y = y_v)   // sample dummy sensitive attributes
5      for every node v, sample a random bit e_v and build the (possibly permuted) sample h_v ⊕ s_v ⊕ s̃_v or h_v ⊕ s̃_v ⊕ s_v
6      L_clf = classification loss between ŷ_v and y_v
7      L_perm = permutation loss of the discriminator d_φ (Eq. 5)
8      L_cov = covariance regularization (Eq. 6)
9      gradient ascent step on φ for λ L_perm   // discriminator step
10     gradient descent step on θ for L_clf + λ L_perm + γ L_cov   // classifier step
11 return the classifier f_θ with the best validation loss
Algorithm 2 EqGNN training procedure
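To make the joint optimization concrete, below is a rough PyTorch-style sketch of one training step of this procedure; it reuses the illustrative helpers sketched earlier (GNNClassifier, sample_dummy, covariance_penalty) and is a simplified approximation of Algorithm 2, not the authors' code. In particular, the scalar fed into the covariance term and the plain-module discriminator are our own simplifications.

import torch
import torch.nn.functional as F

def eqgnn_step(clf, disc, opt_clf, opt_disc, x, edge_index, y, s,
               p_s_given_y, lam=1.0, gamma=1.0):
    """One joint training step of EqGNN (Algorithm 2, simplified).
    clf  : GNN classifier returning (logits, h)
    disc : module mapping [h ; attr ; attr] (possibly permuted) to P(permuted)."""
    logits, h = clf(x, edge_index)
    y_hat = logits.softmax(dim=1).max(dim=1).values   # scalar score for the covariance term (illustrative choice)
    s_dummy = torch.as_tensor(sample_dummy(p_s_given_y, y.numpy()), dtype=torch.float32)
    s_real = s.float()

    # Build permuted / non-permuted discriminator inputs with a random bit per node.
    e = torch.randint(0, 2, (h.size(0), 1)).float()
    first = torch.where(e.bool(), s_dummy.unsqueeze(1), s_real.unsqueeze(1))
    second = torch.where(e.bool(), s_real.unsqueeze(1), s_dummy.unsqueeze(1))
    disc_in = torch.cat([h, first, second], dim=1)

    # Discriminator step: predict whether a permutation was applied.
    loss_disc = F.binary_cross_entropy(disc(disc_in.detach()).view(-1), e.view(-1))
    opt_disc.zero_grad(); loss_disc.backward(); opt_disc.step()

    # Classifier step: fit the labels while fooling the discriminator.
    loss_clf = F.cross_entropy(logits, y)
    loss_adv = -F.binary_cross_entropy(disc(disc_in).view(-1), e.view(-1))
    loss_cov = covariance_penalty(y_hat, s_real, s_dummy)
    loss = loss_clf + lam * loss_adv + gamma * loss_cov
    opt_clf.zero_grad(); loss.backward(); opt_clf.step()
    return loss_clf.item(), loss_disc.item()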

4. Experimental Setup

In this section, we describe our datasets, baselines, and metrics. Our baselines include fair baselines designed specifically for graphs, and general fair baselines that we adapted to the graph domain.

4.1. Datasets

Table 1 summarizes the datasets’ characteristics used for our experiments. Intra-group edges are the edges between similar sensitive attributes, while inter-group edges are edges between different sensitive attributes.

Pokec (Takac and Zabovsky, 2012). Pokec is a popular social network in Slovakia. An anonymized snapshot of the network was taken in 2012. User profiles include gender, age, hobbies, interests, education, etc. The original Pokec dataset contains millions of users. We sampled a sub-network of the "Zilinsky" province. We create two datasets, where the sensitive attribute is the gender in one and the region in the other. The label used for classification is the job of the user. The job field was grouped in the following way: (1) "education" and "student", (2) "services & trade" and "construction", and (3) "unemployed".

NBA (Dai and Wang, 2021). This dataset was presented in the FairGNN baseline paper. The NBA Kaggle dataset contains around 400 basketball players with features including performance statistics, nationality, age, etc. It was extended in (Dai and Wang, 2021) to include the relationships of the NBA basketball players on Twitter. The binary sensitive attribute is whether a player is a U.S. player or an overseas player, while the task is to predict whether the player's salary is over the median.

Dataset Pokec-region Pokec-gender NBA
# of nodes
# of attributes
# of edges
sensitive groups ratio
# of inter-group edges
# of intra-group edges
Table 1. Datasets’ characteristics.

4.2. Baselines

In our evaluation, we compare to the following baselines:

GCN (Kipf and Welling, 2017): GCN is a classic GNN layer that updates a node representation by averaging the representations of its neighbors. For a fair comparison, we implemented GCN as the classifier of the EqGNN architecture (i.e., an unregulated baseline, with the only difference being that the fairness regulation is removed).

Debias (Zhang et al., 2018): Debias optimizes EO by using a discriminator that, given the prediction $\hat{y}_v$ and the label $y_v$, predicts the sensitive attribute. While Debias is a non-graph architecture, for a fair comparison we implemented Debias with the exact same architecture as EqGNN. Unlike EqGNN, Debias's discriminator receives as input only $\hat{y}_v$ and $y_v$ (without the sensitive attribute or dummy attribute) and predicts the sensitive attribute. As the discriminator receives $y_v$, it neglects the sensitive information with respect to $y_v$, and therefore optimizes for EO.

FairGNN (Dai and Wang, 2021): FairGNN uses a discriminator that, given the hidden representation $h_v$, predicts the sensitive attribute. By doing so, it removes the sensitive information from $h_v$. As this is done without respect to the label, it optimizes SP (further explained in Section 3.1). FairGNN offers an additional predictor for nodes with unknown sensitive attributes. As our setup includes all nodes' sensitive attributes, this predictor is irrelevant. We opted to use the FairGCN variant for a fair comparison. In addition, we generalized their architecture to support multi-class classification.

For all baselines, the nodes are split into training, validation and test sets. The validation set is used for choosing the best model for each baseline throughout training. As the classifier is the only part of the architecture used for testing, early stopping was applied after its validation loss (Eq. 7) had not improved for a fixed number of epochs. The epoch with the best validation loss was then used for testing. All results are averaged over several different train/validation/test splits for both the Pokec datasets and the NBA dataset. For a fair comparison, we performed a grid search for all baselines: over $\lambda$ for baselines with a discriminator, and over $\gamma$ for baselines with a covariance term. The selected values are shared by all baselines on both Pokec datasets, while on NBA all baselines except FairGNN share the same value. All experiments used a single Nvidia P100 GPU, with an average run of 5 minutes per seed for Pokec and 1 minute for NBA. Results were logged and analyzed using the (Biewald, 2020; Falcon, 2019) platforms.

4.3. Metrics

4.3.1. Fairness Metrics

Equalized odds.

The definition in Eq. 2 can be formally written as:

$$P(\hat{Y} = \hat{y} \mid Y = y, S = 0) = P(\hat{Y} = \hat{y} \mid Y = y, S = 1) \quad \forall y, \hat{y} \quad (8)$$

The probabilities can be calculated from the test set as follows: given all samples with label $y$ and sensitive attribute $s$, we calculate the proportion of samples that were labeled $\hat{y}$ by the model. As we handle a binary sensitive attribute, given a label $y$, we calculate the absolute difference between the two sensitive attribute values:

$$\Delta_{EO}(y) = \big| P(\hat{Y} = y \mid Y = y, S = 0) - P(\hat{Y} = y \mid Y = y, S = 1) \big| \quad (9)$$

According to Eq. 8, our goal is to have both probabilities equal. Therefore, we desire $\Delta_{EO}(y)$ to strive to 0. We finally aggregate over all labels using the max operator to get a final scalar metric:

$$\Delta_{EO} = \max_{y} \Delta_{EO}(y) \quad (10)$$

As we propose an equalized odds architecture, $\Delta_{EO}$ is our main fairness metric.

Statistical parity.

The definition in Eq. 1 can be formally written as:

$$P(\hat{Y} = \hat{y} \mid S = 0) = P(\hat{Y} = \hat{y} \mid S = 1) \quad \forall \hat{y} \quad (11)$$

The probabilities can be calculated from the test set in the following way: given all samples with sensitive attribute $s$, we calculate the proportion of samples that were labeled $\hat{y}$ by the model. As we handle a binary sensitive attribute, given a label $y$, we calculate the absolute difference between the two sensitive attribute values:

$$\Delta_{SP}(y) = \big| P(\hat{Y} = y \mid S = 0) - P(\hat{Y} = y \mid S = 1) \big| \quad (12)$$

According to Eq. 11, our goal is to have both probabilities equal. Therefore, we desire $\Delta_{SP}(y)$ to strive to 0. We finally aggregate over all labels using the max operator to get a final scalar metric:

$$\Delta_{SP} = \max_{y} \Delta_{SP}(y) \quad (13)$$
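For reference, a small sketch computing both metrics from test-set arrays (illustrative code under the definitions above, not the paper's evaluation script):

import numpy as np

def fairness_metrics(y_true, y_pred, s):
    """Compute Delta_EO (Eq. 10) and Delta_SP (Eq. 13).
    y_true, y_pred : integer arrays of true and predicted labels
    s              : binary (0/1) array of sensitive attributes."""
    delta_eo, delta_sp = 0.0, 0.0
    for y in np.unique(y_true):
        # Equalized odds: per-class true positive rate gap between the two groups.
        tpr = [np.mean(y_pred[(y_true == y) & (s == g)] == y) for g in (0, 1)]
        delta_eo = max(delta_eo, abs(tpr[0] - tpr[1]))
        # Statistical parity: per-class prediction rate gap between the two groups.
        rate = [np.mean(y_pred[s == g] == y) for g in (0, 1)]
        delta_sp = max(delta_sp, abs(rate[0] - rate[1]))
    return delta_eo, delta_sp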

4.3.2. Performance Metrics

As our main classification metric, we used the F1 score. We examined both the micro F1 score, which is computed globally based on the true and false predictions, and the macro F1 score, computed per each class and averaged across all classes. For completeness, we also report the Accuracy (ACC).

5. Experimental Results

In this section, we report the experimental results. We start by comparing EqGNN to the baselines (Section 5.1). We then demonstrate the importance of the $\lambda$ hyper-parameter to the EqGNN architecture (Section 5.2). We continue by showing the superiority of the permutation loss compared to other loss functions, both over synthetic datasets (Section 5.3) and real datasets (Section 5.4). Finally, we explore two qualitative examples that visualize the importance of fairness in graphs (Section 5.5).

5.1. Main Result

Dataset Metrics GCN FairGNN Debias EqGNN
Pokec-gender (%)
(%)
ACC (%)
F1-macro (%)
F1-micro (%)
Pokec-region (%)
(%)
ACC (%)
F1-macro (%)
F1-micro (%)
NBA (%)
(%)
ACC (%)
F1-macro (%)
F1-micro (%)
Table 2. Fairness and performance results

Table 2 reports the results of EqGNN and the baselines over the datasets with respect to the performance and fairness metrics. We can notice that, while the performance metrics are very similar across all baselines (apart from Debias in Pokec-region), EqGNN outperforms all other baselines in both fairness metrics. An interesting observation is that Debias is the second best, after EqGNN, at improving the EO metric without harming the performance metrics. This can be explained by it being the only other baseline that optimizes with respect to EO. Additionally, Debias gains fairness in Pokec-region, but at the cost of performance. This is a general phenomenon: the lower the performance, the better the fairness. For example, when the performance is random, surely the algorithm does not prefer any particular group and is therefore extremely fair. Here, EqGNN is able to optimize the fairness metrics while keeping the performance metrics high. The particularly low performance demonstrated by FairGNN was also validated with the authors of the paper; their previously reported results were obtained over a single validation split, as opposed to the several splits we use to ensure statistical significance.

5.2. The Discriminator for Bias Reduction

Figure 2. Comparison of different $\lambda$ values over the Pokec-gender dataset. Lower-right is better.

As a second analysis, we demonstrate the importance of the $\lambda$ parameter with respect to the performance and fairness metrics. The hyper-parameter $\lambda$ trades off task performance against fairness: high values of $\lambda$ make the discriminator's EO regulation on the classifier stronger. While the EqGNN results reported in this paper use a single $\lambda$ value for the Pokec datasets and another for NBA, we show additional results over a range of $\lambda$ values. In Figure 2, we can observe that the selected $\lambda$ shows the best results over all metrics for Pokec-gender, while similar results were observed over Pokec-region and NBA. As expected, enlarging $\lambda$ results in a fairer model but at the cost of performance. The hyper-parameter is a matter of priority: depending on the task, one should decide on the performance vs. fairness prioritization. Therefore, EqGNN can be used with any desired $\lambda$; we chose the value at the elbow of the curve.

5.3. Synthetic Evaluation of the Permutation Loss

In this experiment, we wish to demonstrate the power of the permutation loss over synthetic data. Going back to the notation used in Section 3.3, we generate two paired groups $A, B$ in the following ways:

Rotation:

$a_i \sim \mathcal{N}(0, I_2)$ and $b_i = R_{\theta} a_i$, where $R_{\theta}$ is a rotation matrix with a fixed angle $\theta$. This can simply be thought of as one group being a 2-dimensional Gaussian, while the second is the exact same Gaussian rotated by $\theta$. As a rotation of a Gaussian is also a Gaussian, $B$ is also a 2-dimensional Gaussian. Yet, it is paired to $A$, as given a sample from $A$, we can predict its pair in $B$ (simply rotate it by $\theta$).

Shift:

$a_i \sim \mathcal{N}(0, I_2)$ and $b_i = a_i + (\delta, 0)$ for a small constant $\delta > 0$. This can simply be thought of as one group being a 2-dimensional Gaussian, while the second is the exact same Gaussian shifted by $\delta$ on the first axis. As shifting a Gaussian by a small value yields a largely overlapping Gaussian, it is hard to distinguish between the two groups. Yet, $B$ is paired to $A$, as given a sample from $A$, we can predict its pair in $B$ (simply add $\delta$).
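A possible generator for these two synthetic settings follows (the rotation angle and shift size below are placeholder values, as the paper's exact values are not given here):

import numpy as np

def make_paired_groups(n=1000, kind="rotation", theta=np.pi / 4, delta=0.1, seed=0):
    """Generate the paired synthetic groups A, B described above.
    kind='rotation': b_i is a_i rotated by theta.
    kind='shift'   : b_i is a_i shifted by delta on the first axis."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((n, 2))
    if kind == "rotation":
        rot = np.array([[np.cos(theta), -np.sin(theta)],
                        [np.sin(theta),  np.cos(theta)]])
        b = a @ rot.T
    else:
        b = a + np.array([delta, 0.0])
    return a, b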

Over these two synthetic datasets, we train four classifiers:

T-test - adapted: As the original unpaired t-test requires one-dimensional data, we first map the samples into a single scalar using a fully connected layer and train it using the t-statistic defined in Definition 3.1.

Paired T-test - adapted: Similar to T-test, but using the paired t-statistic defined in Definition 3.3.

C2ST (Lopez-Paz and Oquab, 2017): A linear classifier that, given a sample, tries to predict to which group it belongs.

Permutation (Section 3.3): A linear classifier that given a randomly shuffled pair, predicts if it was shuffled or not. For detailed implementation please refer to Algorithm 1.

We sample a set of pairs for training and an additional held-out set for testing, and average results over 5 runs. In Table 3 we report the p-value of the classifiers over the different generated datasets.

Model Shift Rotation
T-test 0.24 0.47
Paired T-test 0 0.36
C2ST 0.5 0.5
Permutation 0 0
Table 3. P-value comparison of different classifiers over synthetic datasets, lower is better.

We can observe that the permutation classifier captures the difference between the pairs perfectly in both datasets. This does not hold for the Paired T-test, which captures the difference only for the Shift dataset; the reason it classifies well only over the Shift dataset is that the shift is a simpler transformation, which is easier to learn. We can further notice that both unpaired classifiers (T-test and C2ST) perform poorly over both datasets. The promising results of the permutation classifier on our synthetic datasets drive us to choose it as the discriminator in the EqGNN architecture. We validate this choice over real datasets in the next section.

5.4. The Importance of the Permutation Loss

Dataset Metrics Unpaired Paired Permutation/h Permutation
Pokec-gender (%)
(%)
ACC (%)
F1-macro (%)
F1-micro (%)
Pokec-region (%)
(%)
ACC (%)
F1-macro (%)
F1-micro (%)
NBA (%)
(%)
ACC (%)
F1-macro (%)
F1-micro (%)
Table 4. The comparisons of different loss functions.

As an ablation study, we compare different loss functions for the discriminator. We compare the permutation loss with three alternative loss functions: (1) Unpaired: Inspired by (Romano et al., 2020), an unpaired binary cross-entropy loss as presented in Eq. 4. The loss is estimated by a classifier that predicts whether a sample represents a real sensitive attribute or a dummy. (2) Permutation/h: A permutation loss without concatenating the hidden representation $h_v$ to the discriminator samples, leaving the sample to be only the (possibly permuted) pair of sensitive attributes. (3) Paired: A paired loss:

$$\mathcal{L}_{paired} = \frac{1}{|V|} \sum_{v \in V} \Big(\sigma\big(d_{\phi}(h_v \oplus s_v)\big) - \sigma\big(d_{\phi}(h_v \oplus \tilde{s}_v)\big)\Big) \quad (14)$$

where $\sigma$ is the Sigmoid activation. This loss is a neural network version of the known paired student t-test (analogously to what (Lopez-Paz and Oquab, 2017) did for the unpaired t-test). Implementing this loss with a summation over absolute differences yielded poor results; we therefore report a version with a summation over non-absolute differences. The results of the different loss functions are reported in Table 4. One can notice that all loss functions gain fairness over our baselines (as reported in Table 2), while the permutation loss with the hidden representation $h_v$ outperforms the others, and specifically Permutation/h. This implies that the hidden representation is important. In the Pokec datasets, the performance metrics are not impacted, apart from the paired loss. We hypothesize this is caused by its non-convexity in adversarial settings. Additionally, the paired loss demonstrates the same phenomenon again: the lower the performance, the better the fairness. In the NBA dataset we do not see much difference between the loss functions, which can be explained by the small size of the graph. However, we do see that the permutation loss is the only one that improves the fairness metrics while not hurting the performance metrics. Finally, we can notice that the paired loss functions (the permutation loss and the paired loss) perform better than the unpaired loss (apart from NBA, where the unpaired loss hurts the performance metrics). This can be explained by our paired problem, where we check for the difference between two scenarios of a node (real and fake). This illustrates the general importance of a paired loss function for paired problems.

5.5. Qualitative Example

Figure 3. Two 2-hop sub-graphs (first panel: Male, second panel: Female). In each sub-graph there is a central node (highlighted in red). The central node, and the large majority of its neighbors, are of a specific gender. The node colors represent the class they belong to. Above the central node we show the output of the classifier, representing the probabilities of belonging to each class, both without the discriminator ($\lambda = 0$) and with it.

We end this section with a few qualitative examples over the Pokec-gender test set. Specifically, we present two qualitative examples where the central node has the same sensitive attribute as most of its 2-hop neighbors, but holds a different label from most of them. We consider 2 hops, as our classifier includes 2 GCN layers. The first panel of Figure 3 presents the example where the sensitive attribute is male, while the second presents a female; i.e., the nodes in the first panel are mostly males and in the second mostly females. A biased approach would be inclined to predict for the central node the same label as its same-gender neighbors. Above the central node we show the prediction distribution with and without the discriminator. In the first example we observe that, when no discriminator is applied ($\lambda = 0$), and therefore there is no regularization for bias, most of the probability mass goes to the label of most neighbors. On the other hand, when applying the discriminator, the probability of that class drops and the probability of the correct label rises. Similarly, in the second example, when no discriminator is applied the probability leans towards the label of most neighbors, whereas applying the discriminator lowers the probability of that class and raises the probability of the correct label. These qualitative examples show that an equalized odds regulator over graphs can help make less biased predictions, even when the graph neighborhoods might induce bias.

6. Conclusions

In this work, we explored fairness in graphs. Unlike previous work, which optimizes the statistical parity (SP) fairness criterion, we present a method that learns to optimize equalized odds (EO). While SP promises equal chances between groups, it might cripple the utility of the prediction task, as it does not give equalized opportunity as EO does. We propose a method that trains a GNN model to neglect information regarding the sensitive attribute only with respect to its label. Our method pretrains a sampler to learn the distribution of the sensitive attributes of the nodes given their labels. We then continue training a GNN classifier while regularizing it with a discriminator that discriminates between true and sampled sensitive attributes using a novel loss function, the "permutation loss". This loss allows comparison of pairs. For the unique setup of adversarial learning over graphs, we show it brings performance gains both in utility and in EO. While this work uses the loss for the specific case of nodes in two scenarios, fake and true, the loss is general and can be used for any paired problem.

For future work, we wish to test the novel loss over additional architectures and tasks. We draw the reader's attention to the fact that the C2ST discriminator is the commonly used discriminator in many architectures that work over paired data. For instance, the pix2pix architecture (Isola et al., 2017) is a classic architecture that inspired many works. Although the pix2pix discriminator receives paired samples, it is still just an advanced C2ST discriminator. Using a paired discriminator instead can create a much more powerful discriminator and therefore a much more powerful generator. Surveying many works that operate over paired samples, we have not found discriminators that are designed to exploit the pairing. We believe that, although this work uses the permutation loss for a specific use case, it is a general mechanism that can be used for any paired problem.

We empirically show that our method outperforms different baselines in the combined fairness-performance metrics, over datasets with different attributes and sizes. To the best of our knowledge, we are the first to optimize GNNs for the EO criterion, and we hope this work will serve as a beacon for works to come.

References

  • S. Barocas, M. Hardt, and A. Narayanan (2019) Fairness and machine learning. fairmlbook.org. Note: http://www.fairmlbook.org Cited by: §1.
  • M. Belkin and P. Niyogi (2001) Laplacian eigenmaps and spectral techniques for embedding and clustering.. In Advances in Neural Information Processing Systems, Vol. 14, pp. 585–591. Cited by: §2.
  • L. Biewald (2020) Experiment tracking with weights and biases. Note: Software available from wandb.com External Links: Link Cited by: §4.2.
  • A. Bose and W. Hamilton (2019) Compositional fairness constraints for graph embeddings. In International Conference on Machine Learning, pp. 715–724. Cited by: §1, §2, §3.1.
  • M. Buyl and T. De Bie (2020) DeBayes: a bayesian method for debiasing network embeddings. In International Conference on Machine Learning, pp. 1220–1229. Cited by: §1, §2.
  • Z. Cui, K. Henrickson, R. Ke, and Y. Wang (2019) Traffic graph convolutional recurrent neural network: a deep learning framework for network-scale traffic learning and forecasting. IEEE Transactions on Intelligent Transportation Systems. Cited by: §2.
  • E. Dai and S. Wang (2021) Say no to the discrimination: learning fair graph neural networks with limited sensitive attribute information. In Proceedings of the 14th ACM International Conference on Web Search and Data Mining, pp. 680–688. Cited by: §1, §2, §3.1, §4.1, §4.2.
  • C. Dwork, M. Hardt, T. Pitassi, O. Reingold, and R. Zemel (2012) Fairness through awareness. In Proceedings of the 3rd innovations in theoretical computer science conference, pp. 214–226. Cited by: §1.
  • W. Falcon (2019) PyTorch lightning. GitHub. Note: https://github.com/PyTorchLightning/pytorch-lightning Cited by: §4.2.
  • A. Grover and J. Leskovec (2016) Node2vec: scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 855–864. Cited by: §2.
  • W. L. Hamilton, R. Ying, and J. Leskovec (2017) Inductive representation learning on large graphs. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pp. 1025–1035. Cited by: §2.
  • M. Hardt, E. Price, and N. Srebro (2016) Equality of opportunity in supervised learning. Advances in Neural Information Processing Systems 29, pp. 3315–3323. Cited by: §1.
  • P. Isola, J. Zhu, T. Zhou, and A. A. Efros (2017) Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1125–1134. Cited by: §6.
  • B. Kang, J. Lijffijt, and T. D. Bie (2019) Conditional network embeddings. International Conference on Learning Representations. Cited by: §1.
  • J. Kang, J. He, R. Maciejewski, and H. Tong (2020) InFoRM: individual fairness on graph mining. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 379–389. Cited by: §2.
  • D. P. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. International Conference on Learning Representations. Cited by: §3.5.
  • T. N. Kipf and M. Welling (2017) Semi-supervised classification with graph convolutional networks. International Conference on Learning Representations. Cited by: §2, §3, §4.2.
  • D. Lopez-Paz and M. Oquab (2017) Revisiting classifier two-sample tests. In International Conference on Learning Representations, Cited by: §3.3, §3.3, Definition 3.2, §5.3, §5.4.
  • Z. Obermeyer, B. Powers, C. Vogeli, and S. Mullainathan (2019) Dissecting racial bias in an algorithm used to manage the health of populations. Science 366 (6464), pp. 447–453. Cited by: §1.
  • A. Odén, H. Wedel, et al. (1975) Arguments for fisher’s permutation test. The Annals of Statistics 3 (2), pp. 518–520. Cited by: §3.3, §3.3.
  • D. Pedreshi, S. Ruggieri, and F. Turini (2008) Discrimination-aware data mining. In Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 560–568. Cited by: §1.
  • B. Perozzi, R. Al-Rfou, and S. Skiena (2014) Deepwalk: online learning of social representations. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 701–710. Cited by: §2.
  • T. A. Rahman, B. Surma, M. Backes, and Y. Zhang (2019) Fairwalk: towards fair graph embedding.. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19, pp. 3289–3295. Cited by: §1, §2.
  • Y. Romano, S. Bates, and E. J. Candès (2020) Achieving equalized odds by resampling sensitive attributes. Advances in Neural Information Processing Systems. Cited by: §3.4, §5.4.
  • S. T. Roweis and L. K. Saul (2000) Nonlinear dimensionality reduction by locally linear embedding. science 290 (5500), pp. 2323–2326. Cited by: §2.
  • U. Singer, I. Guy, and K. Radinsky (2019) Node embedding over temporal graphs. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19, pp. 4605–4612. External Links: Document, Link Cited by: §2.
  • U. Singer, K. Radinsky, and E. Horvitz (2020) On biases of attention in scientific discovery. Bioinformatics. Note: btaa1036 External Links: ISSN 1367-4803, Document, Link Cited by: §1.
  • L. Takac and M. Zabovsky (2012) Data analysis in public social networks. In International scientific conference and international workshop present day trends of innovations, Vol. 1. Cited by: §4.1.
  • J. B. Tenenbaum, V. De Silva, and J. C. Langford (2000) A global geometric framework for nonlinear dimensionality reduction. science 290 (5500), pp. 2319–2323. Cited by: §2.
  • P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio (2017) Graph attention networks. International Conference on Learning Representations. Cited by: §2.
  • H. Wang and Z. Li (2017) Region representation learning via mobility flow. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, pp. 237–246. Cited by: §2.
  • Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, and S. Y. Philip (2020) A comprehensive survey on graph neural networks. IEEE Transactions on Neural Networks and Learning Systems. Cited by: §1, §2.
  • S. Yan, D. Xu, B. Zhang, H. Zhang, Q. Yang, and S. Lin (2006) Graph embedding and extensions: a general framework for dimensionality reduction. IEEE transactions on pattern analysis and machine intelligence 29 (1), pp. 40–51. Cited by: §2.
  • S. Yan, Y. Xiong, and D. Lin (2018) Spatial temporal graph convolutional networks for skeleton-based action recognition. In Proceedings of the AAAI conference on artificial intelligence, Vol. 32. Cited by: §2.
  • B. Yu, H. Yin, and Z. Zhu (2018) Spatio-temporal graph convolutional networks: a deep learning framework for traffic forecasting. In Proceedings of the 27th International Joint Conference on Artificial Intelligence, pp. 3634–3640. Cited by: §2.
  • B. H. Zhang, B. Lemoine, and M. Mitchell (2018) Mitigating unwanted biases with adversarial learning. In Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society, pp. 335–340. Cited by: §4.2.