1. Introduction
As the world becomes more connected, graph mining is playing a crucial role in many domains such as drug discovery and recommendation systems (Fan et al., 2019; Chen et al., 2021; Rozemberczki et al., 2022). As one of its major branches, learning informative node representations is a fundamental solution to many real-world problems such as node classification and link prediction (Wang and Derr, 2021; Zhang and Chen, 2018). Numerous data-driven models have been developed for learning node representations, among which Graph Neural Networks (GNNs) have achieved unprecedented success owing to the combination of neural networks and feature propagation (Kipf and Welling, 2017; Klicpera et al., 2019; Liu et al., 2020). Despite the significant progress of GNNs in capturing higher-order neighborhood information (Chen et al., 2020), leveraging multi-hop dependencies (Wang and Derr, 2021), and recognizing complex local topology contexts (Wijesinghe and Wang, 2022), predictions of GNNs have been demonstrated to be unfair and to perpetuate undesirable discrimination (Dai and Wang, 2021; Shumovskaia et al., 2021; Xu et al., 2021; Agarwal et al., 2021; Bose and Hamilton, 2019).
Recent studies have revealed that historical data may include previous discriminatory decisions dominated by sensitive features (Mehrabi et al., 2021; Du et al., 2020). Thus, node representations learned from such data may explicitly inherit existing societal biases and hence exhibit unfairness when applied in practice. Besides the sensitive features, network topology also serves as an implicit source of societal bias (Dong et al., 2022a; Dai and Wang, 2021). By the principle of network homophily (McPherson et al., 2001), nodes with similar sensitive features tend to form closer connections than dissimilar ones. Since feature propagation smooths representations of neighboring nodes while separating distant ones, representations of nodes in different sensitive groups are further segregated and their corresponding predictions are unavoidably over-associated with sensitive features.
Besides the above topology-induced bias, feature propagation could introduce another potential issue, termed sensitive information leakage. Since feature propagation naturally allows feature interactions among neighborhoods, the correlation between two feature channels is likely to vary after feature propagation, which we term correlation variation. As such, some originally innocuous feature channels that have lower correlation to the sensitive channels and encode less sensitive information may become highly correlated to sensitive ones after feature propagation and hence encode more sensitive information, which we term sensitive attribute leakage. Some research efforts have been invested in alleviating discrimination made by GNNs. However, they either borrow approaches from traditional fair representation learning, such as adversarial debiasing (Dai and Wang, 2021; Bose and Hamilton, 2019) and contrastive learning (Köse and Shen, 2021), or directly debias node features and graph topology (Dong et al., 2022a; Agarwal et al., 2021), while overlooking the sensitive attribute leakage caused by correlation variation.
In this work, we study a novel and detrimental phenomenon where feature propagation can vary feature correlations and cause the leakage of sensitive information to innocuous features. To address this issue, we propose a principled framework Fair View Graph Neural Network (FairVGNN) to effectively learn fair node representations and avoid sensitive attribute leakage. Our major contributions are as follows:

Problem: We investigate the novel phenomenon that feature propagation could vary feature correlations and cause sensitive attribute leakage to innocuous feature channels, which could further exacerbate discrimination in predictions.

Algorithm: To prevent sensitive attribute leakage, we propose a novel framework FairVGNN to automatically learn fair views by identifying and masking sensitive-correlated channels and adaptively clamping weights to avoid leveraging sensitive-related features in learning fair node representations.

Evaluation: We perform experiments on real-world datasets to corroborate that FairVGNN can preserve model utility while reducing discrimination.
Section 2 introduces preliminaries. In Section 3, we formally introduce the phenomenon of correlation variation and sensitive attribute leakage in GNNs and design two feature masking strategies to highlight the importance of circumventing sensitive attribute leakage for alleviating discrimination. To automatically identify and mask sensitive-relevant features, we propose FairVGNN in Section 4, which consists of a generative adversarial debiasing module that prevents sensitive attribute leakage from the input perspective by learning fair feature views, and an adaptive weight clamping module that prevents sensitive attribute leakage from the model perspective by clamping weights of sensitive-correlated channels of the encoder. In Section 5, we evaluate FairVGNN by performing extensive experiments. Related work is presented in Section 6. Finally, we conclude and discuss future work in Section 7.
2. Preliminaries
2.1. Notations
We denote an attributed graph by $\mathcal{G}=\{\mathcal{V},\mathcal{E},\mathbf{X}\}$, where $\mathcal{V}$ is the set of $n$ nodes with $\mathbf{Y}$ specifying their labels, $\mathcal{E}$ is the set of edges with $e_{ij}$ being the edge between nodes $v_i$ and $v_j$, and $\mathbf{X}\in\mathbb{R}^{n\times d}$ is the node feature matrix with $\mathbf{x}_i$ indicating the features of node $v_i$ and $\mathbf{X}_{:,j}$ indicating the $j$-th channel feature. The network topology is described by the adjacency matrix $\mathbf{A}\in\{0,1\}^{n\times n}$, where $\mathbf{A}_{ij}=1$ when $e_{ij}\in\mathcal{E}$, and $\mathbf{A}_{ij}=0$ otherwise. Node sensitive features are specified by the $s$-th channel of $\mathbf{X}$, i.e., $\mathbf{s}=\mathbf{X}_{:,s}$. Details of all notations used in this work are summarized in Table 7 in Appendix A.
2.2. Fairness in Machine Learning
Group fairness and individual fairness are two commonly encountered fairness notions in real life (Du et al., 2020). Group fairness emphasizes that algorithms should not yield discriminatory outcomes for any specific demographic group (Dong et al., 2022a), while individual fairness requires that similar individuals be treated similarly (Dong et al., 2021). Here we focus on group fairness with a binary sensitive feature, i.e., $s\in\{0,1\}$, but our framework could be generalized to multi-sensitive groups, and we leave this as one future direction. Following (Dai and Wang, 2021; Agarwal et al., 2021; Dong et al., 2022a), we employ the differences of statistical parity and equal opportunity between two different sensitive groups, $\Delta_{\text{SP}}$ and $\Delta_{\text{EO}}$, to evaluate model fairness:
(1) $\Delta_{\text{SP}} = |P(\hat{y}=1\mid s=0)-P(\hat{y}=1\mid s=1)|$

(2) $\Delta_{\text{EO}} = |P(\hat{y}=1\mid y=1,s=0)-P(\hat{y}=1\mid y=1,s=1)|$

where $\Delta_{\text{SP}}$ ($\Delta_{\text{EO}}$) measures the difference of the independence level of the prediction (true positive rate) on the sensitive feature between two groups. Since group fairness expects algorithms to yield similar outcomes for different demographic groups, fairer machine learning models seek lower $\Delta_{\text{SP}}$ and $\Delta_{\text{EO}}$.

3. Sensitive Attribute Leakage and Correlation Variation
In this section, we study the phenomenon where sensitive information leaks to innocuous feature channels after their correlations to the sensitive feature increase during feature propagation in GNNs, which we define as sensitive attribute leakage. We first empirically verify that feature channels with higher correlation to the sensitive channel cause more discrimination in predictions (Zhao et al., 2022). We denote the Pearson correlation coefficient $\rho_j$ of the $j$-th feature channel to the sensitive channel as the sensitive correlation and compute it as:
(3) $\rho_j = \frac{\sum_{i=1}^{n}(\mathbf{X}_{ij}-\mu_j)(\mathbf{X}_{is}-\mu_s)}{n\,\sigma_j\sigma_s}$

where $\mu_j$ and $\sigma_j$ denote the mean and standard deviation of the $j$-th channel $\mathbf{X}_{:,j}$. Intuitively, a higher $|\rho_j|$ indicates that the $j$-th feature channel encodes more sensitive-related information, which would impose more discrimination on the prediction. To further verify this assumption, we mask each channel and train a 1-layer MLP/GCN followed by a linear layer to make predictions. As suggested by (Agarwal et al., 2021), we do not add any activation function in the MLP/GCN to avoid capturing any nonlinearity.
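As an illustrative sketch (not the authors' code), the sensitive correlation of Eq. (3) can be computed for every channel at once; the function name and `s_idx`, the index of the sensitive channel, are our own:

```python
import numpy as np

def sensitive_correlation(X, s_idx):
    """Pearson correlation of each feature channel to the sensitive channel (Eq. 3)."""
    s = X[:, s_idx]
    Xc = X - X.mean(axis=0)                    # center every channel
    sc = s - s.mean()
    cov = Xc.T @ sc / X.shape[0]               # per-channel covariance with s
    return cov / (X.std(axis=0) * s.std() + 1e-12)
```

Channels with large $|\rho_j|$ are then natural candidates for masking.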
Figures 2(a)-(b) visualize the relationships between the model utility/bias and the sensitive correlation of each masked feature channel. Clearly, we see that discrimination still exists even though we mask the sensitive channel. Compared with the no-masking situation, $\Delta_{\text{SP}}$ and $\Delta_{\text{EO}}$ almost always become lower when we mask other non-sensitive feature channels, which indicates the leakage of sensitive information to other non-sensitive feature channels. Moreover, we observe a decreasing trend of $\Delta_{\text{SP}}$ and $\Delta_{\text{EO}}$ when masking channels with higher sensitive correlation, since these channels encode more sensitive information and masking them alleviates more discrimination.
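For reference, the two fairness metrics of Eqs. (1)-(2) can be estimated from binary predictions as follows; this is a minimal sketch with our own function name, not the paper's evaluation code:

```python
import numpy as np

def fairness_metrics(y_true, y_pred, s):
    """Estimate statistical parity and equal opportunity differences (Eqs. 1-2)."""
    y_true, y_pred, s = map(np.asarray, (y_true, y_pred, s))
    d_sp = abs(y_pred[s == 0].mean() - y_pred[s == 1].mean())
    pos = y_true == 1                          # condition on the true positive class
    d_eo = abs(y_pred[(s == 0) & pos].mean() - y_pred[(s == 1) & pos].mean())
    return d_sp, d_eo
```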
Following the above observation, one natural way to prevent sensitive attribute leakage and alleviate discrimination is to mask the sensitive features as well as their highly-correlated non-sensitive features. However, feature propagation in GNNs could change the feature distributions of different channels and consequently vary feature correlations, as shown in Figure 2(c), where we visualize the sensitive correlations of the first 8 feature channels on German after a certain number of propagations. We see that correlations between the sensitive features and other channels change after propagation. For example, some feature channels that are originally irrelevant to the sensitive one become highly-correlated after propagation and hence encode more sensitive information.
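The effect can be reproduced on a toy homophilous graph: after one step of row-normalized propagation, a channel that is weakly correlated with the sensitive channel becomes strongly correlated (the 4-node graph and its features below are hypothetical):

```python
import numpy as np

def corr(a, b):
    """Pearson correlation between two vectors."""
    return ((a - a.mean()) * (b - b.mean())).mean() / (a.std() * b.std())

# Two homophilous pairs: nodes 0-1 share s=1, nodes 2-3 share s=0.
A = np.array([[0., 1., 0., 0.],
              [1., 0., 0., 0.],
              [0., 0., 0., 1.],
              [0., 0., 1., 0.]])
A_hat = A + np.eye(4)                          # add self-loops
P = A_hat / A_hat.sum(axis=1, keepdims=True)   # row-normalized propagation
X = np.array([[1., 0.3],                       # channel 0 = sensitive feature s
              [1., 0.9],
              [0., 0.2],
              [0., 0.8]])
X_prop = P @ X                                 # one propagation step

rho_before = corr(X[:, 1], X[:, 0])            # weakly correlated (~0.16)
rho_after = corr(X_prop[:, 1], X_prop[:, 0])   # fully correlated (~1.0)
```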
Table 1: Model utility and fairness of different masking strategies on German and Credit.

Encoder  Strategy | German: AUC  F1  Δ_SP  Δ_EO | Credit: AUC  F1  Δ_SP  Δ_EO
MLP      S        | 71.98  82.32  29.26  19.43  | 74.46  81.64  11.85   9.61
         S_orig   | 69.89  81.37   8.25   4.75  | 73.49  81.50  11.50   9.20
         S_prop   | 70.54  81.44   6.58   3.24  | 73.49  81.50  11.50   9.20
GCN      S        | 74.11  82.46  35.17  25.17  | 73.86  81.92  12.86  10.63
         S_orig   | 73.78  81.65  11.39   9.60  | 72.92  81.84  12.00   9.70
         S_prop   | 72.75  81.70   8.29   6.91  | 72.92  81.84  12.00   9.70
GIN      S        | 72.71  82.78  13.56   9.47  | 74.36  82.28  14.48  12.35
         S_orig   | 71.66  82.50   3.01   1.72  | 73.44  83.23  14.29  11.79
         S_prop   | 70.77  83.53   1.46   2.67  | 73.28  83.27  13.96  11.34

* S: training using the original feature matrix without any masking.

* S_orig / S_prop: training with masking the top-ranked channels based on the rank of |ρ| computed from the original features / the propagated features, respectively.
After observing that feature propagation could vary feature correlations and cause sensitive attribute leakage, we devise two simple but effective masking strategies to highlight the importance of considering correlation variation and sensitive attribute leakage in alleviating discrimination. Specifically, we first compute the sensitive correlation of each feature channel according to 1) the original features and 2) the propagated features. Then, we manually mask the top-ranked feature channels according to the absolute values of the correlations given by the original and the propagated features (denoted S_orig and S_prop), respectively, and train MLP/GCN/GIN on the German/Credit datasets, as shown in Table 1. Detailed experimental settings are presented in Section 5. From Table 1, we have the following insightful observations: (1) Within the same encoder, masking the sensitive and its related feature channels (S_orig, S_prop) alleviates the discrimination while downgrading the model utility compared with no masking (S). (2) GCN achieves better model utility but causes more bias compared with MLP on German and Credit. This implies that graph structures also encode bias, and leveraging them could aggravate discrimination in predictions, which is consistent with recent work (Dai and Wang, 2021; Dong et al., 2022a). (3) Most importantly, S_prop achieves lower bias than S_orig for both MLP and GCN on German because the rank of sensitive correlations changes after feature propagation and masking according to S_prop leads to better fairness, which highlights the importance of considering feature propagation in determining which feature channels are more sensitive-correlated and required to be masked. Applying S_prop achieves the same utility/bias as S_orig on Credit due to less correlation variation, as shown in Figure 2(d).
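A sketch of the two strategies (function name and signature are ours): rank channels by |ρ| computed either on the original features or on one-step propagated features, then zero out the top-k:

```python
import numpy as np

def topk_mask(X, s_idx, k, A_norm=None):
    """Zero out the k channels most correlated with the sensitive channel.
    If A_norm (a normalized adjacency matrix) is given, correlations are
    computed on the propagated features A_norm @ X instead of X."""
    F = X if A_norm is None else A_norm @ X
    rho = np.array([np.corrcoef(F[:, j], F[:, s_idx])[0, 1]
                    for j in range(X.shape[1])])
    masked = np.argsort(-np.abs(rho))[:k]      # indices of top-k channels by |rho|
    X_masked = X.copy()
    X_masked[:, masked] = 0.0
    return X_masked, masked
```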
To this end, we argue that it is necessary to consider feature propagation in masking feature channels in order to alleviate discrimination. However, the correlation variation heavily depends on the propagation mechanism of GNNs. To tackle this challenge, we formulate our problem as:
Given an attributed network $\mathcal{G}$ with labels $\mathbf{Y}$ for a subset of nodes $\mathcal{V}_L\subseteq\mathcal{V}$, we aim to learn a fair view generator with the expectation of simultaneously preserving task-related information and discarding sensitive information, such that the downstream node classifier trained on the generated views could achieve a better trade-off between model utility and fairness.

4. Framework
In this section, we give a detailed description of FairVGNN (shown in Figure 2), which includes the generative adversarial debiasing module and the adaptive weight clamping module. In the first module, we learn a generator that generates different fair views of features to obfuscate the sensitive discriminator such that the encoder can obtain fair node representations for downstream tasks. In the second module, we propose to clamp weights of the encoder based on the learned fair feature views, and provide a theoretical justification of its equivalence to minimizing an upper bound on the difference of representations between two different sensitive groups. Next, we introduce the details of each component.
4.1. Generative Adversarial Debiasing
This module includes a fair view generator, a GNN-based encoder, a sensitive discriminator, and a classifier, each with its own learnable parameters. We assume the view generator to be a learnable latent distribution from which we sample different masks and generate corresponding views. The latent distribution is updated towards generating less-biased views, and the stochasticity of each view enhances the model's generalizability. Each of these views is then fed to the encoder together with the network topology to learn node representations for the downstream classifier. Meanwhile, the learned node representations are used by the sensitive discriminator to predict nodes' sensitive features. This paves the way for adopting adversarial learning to obtain the optimal fair view generator, where the generated views encode as much task-relevant information while discarding as much bias-relevant information as possible. We begin by introducing the fairness-aware view generator.
4.1.1. Fairness-aware View Generator
As observed in Table 1, discrimination can be traced back to the sensitive features as well as their highly-correlated non-sensitive features. Therefore, we propose to learn a view generator that automatically identifies and masks these features. More specifically, we assume the view generator to be a conditional distribution $P_{\theta_g}(\tilde{\mathcal{G}}\mid\mathcal{G})$ parametrized by $\theta_g$. Since bias originates from the node features and is further varied by the graph topology, this conditional distribution can be expressed as a joint distribution of the attribute generator and the topological generator, i.e., $P_{\theta_g}(\tilde{\mathcal{G}}\mid\mathcal{G}) = P_{\theta_g}(\tilde{\mathbf{X}}\mid\mathcal{G})\,P_{\theta_g}(\tilde{\mathbf{A}}\mid\mathcal{G})$. Since our sensitive discriminator is directly trained on the node representations learned by the GNN-based encoder, as described in Section 4.1.3, the proximity-induced bias is already considered in alleviating discrimination, and hence the network topology is assumed to be fixed here, i.e., $\tilde{\mathbf{A}}=\mathbf{A}$. We leave the joint generation of fair feature and topological views as future work.

Instead of generating $\tilde{\mathbf{X}}$ from scratch, which would completely lose information critical for GNN predictions, we generate $\tilde{\mathbf{X}}$ conditioned on the original node features $\mathbf{X}$. Following the preliminary experiments, we model the generation of $\tilde{\mathbf{X}}$ as identifying and masking the sensitive features and their highly-correlated features in $\mathbf{X}$. One natural way is to select features according to their correlations to the sensitive features as defined in Eq. (3). However, as shown by Figure 2(c), feature propagation in GNNs triggers correlation variation. Thus, instead of masking according to initial correlations that might change after feature propagation, we train a learnable mask for feature selection in a data-driven fashion. Denote our mask as $\mathbf{m}\in\{0,1\}^{d}$, so that:

(4) $\tilde{\mathbf{X}} = \mathbf{X}\,\text{diag}(\mathbf{m})$,
then learning the conditional distribution of the feature generator is transformed to learning a sampling distribution of the masker $\mathbf{m}$. We assume the probability of masking each feature channel independently follows a Bernoulli distribution, with each feature channel $j$ kept with a learnable probability $\mu_j$ (and masked with probability $1-\mu_j$). In this way, we can learn which feature channels should be masked to achieve less discrimination through gradient-based techniques. Since the generator aims to obfuscate the discriminator, which predicts the sensitive features based on the already-propagated node representations from the encoder, the generated fair feature view considers the effect of correlation variation caused by feature propagation rather than blindly following the order of the sensitive correlations computed from the original features $\mathbf{X}$. Generating the fair feature view and forwarding it through the encoder and the classifier to make predictions involves sampling masks from the categorical Bernoulli distribution, which is non-differentiable due to the discreteness of the masks. Therefore, we apply the Gumbel-Softmax trick (Jang et al., 2017) to approximate the categorical Bernoulli distribution. Assuming that for each channel $j$ we have learnable sampling scores $\pi_j^{1}$ for keeping and $\pi_j^{0}$ for masking the channel, the categorical distribution is softened by:

(5) $m_j = \frac{\exp((\log\pi_j^{1}+g_1)/\tau)}{\sum_{k\in\{0,1\}}\exp((\log\pi_j^{k}+g_k)/\tau)}$
where $g_k\sim\text{Gumbel}(0,1)$ and $\tau$ is the temperature factor controlling the sharpness of the Gumbel-Softmax distribution. Then, to generate $\tilde{\mathbf{X}}$ after sampling masks based on these probabilities, we could either directly multiply each feature channel by its soft probability or append the gradient of the soft probability to the sampled hard mask (the straight-through estimator), both of which are differentiable and can be trained end-to-end. After approximating the generator via Gumbel-Softmax, we next model the GNN-based encoder to capture the information of both node features and network topology.
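A NumPy sketch of the sampling in Eq. (5); the real generator would use a differentiable implementation (e.g., a deep learning framework's Gumbel-Softmax) so the scores receive gradients, which this illustration omits:

```python
import numpy as np

def sample_mask(scores, tau=1.0, hard=True, seed=0):
    """Sample a per-channel keep/mask decision via Gumbel-Softmax.
    scores: (d, 2) array of log-scores [mask, keep] for each channel."""
    rng = np.random.default_rng(seed)
    g = rng.gumbel(size=scores.shape)              # Gumbel(0, 1) noise
    y = np.exp((scores + g) / tau)
    y /= y.sum(axis=1, keepdims=True)              # softened categorical sample
    keep_prob = y[:, 1]                            # soft probability of keeping
    return (keep_prob > 0.5).astype(float) if hard else keep_prob
```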
4.1.2. GNN-based Encoder
In order to learn from both the graph topology and node features, we employ an $L$-layer GNN as our encoder backbone to obtain node representations $\mathbf{H}$. Different graph convolutions adopt different propagation mechanisms, resulting in different variations of feature correlations. Here we select GCN (Kipf and Welling, 2017), GraphSAGE (Hamilton et al., 2017), and GIN (Xu et al., 2019) as our encoder backbones. In order to account for the variation induced by the propagation of GNN-based encoders, we apply the discriminator and classifier on top of the node representations obtained from the GNN-based encoders. Since both the classifier and the discriminator make predictions, one towards sensitive groups and the other towards class labels, their model architectures are similar, and therefore we introduce them together next.
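For reference, one propagation step of the GCN backbone (Kipf and Welling, 2017) can be sketched as a dense NumPy computation; a practical encoder would use sparse operations and a nonlinearity between layers:

```python
import numpy as np

def gcn_layer(A, X, W):
    """H = D^{-1/2} (A + I) D^{-1/2} X W: symmetric-normalized propagation
    followed by a linear transformation."""
    A_hat = A + np.eye(A.shape[0])             # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    P = (A_hat * d_inv_sqrt[:, None]) * d_inv_sqrt[None, :]
    return P @ X @ W
```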
4.1.3. Classifier and Discriminator
Given node representations $\mathbf{H}$ obtained from any $L$-layer GNN-based encoder, the classifier and the discriminator predict node labels and sensitive attributes as:

(6) $\hat{\mathbf{y}} = \sigma(\text{MLP}_c(\mathbf{H})),\quad \hat{\mathbf{s}} = \sigma(\text{MLP}_d(\mathbf{H}))$

where we use two different multilayer perceptrons (MLPs) for the classifier and the discriminator, and $\sigma$ is the sigmoid operation. After introducing the fairness-aware view generator, the GNN-based encoder, and the MLP-based classifier and discriminator, we collect them together and adversarially train them with the following objective function.

4.1.4. Adversarial Training
Our goal is to learn fair views of the original graph that encode as much task-relevant information while discarding as much sensitive-relevant information as possible. Therefore, we aim to optimize the whole framework from both the fairness and the model utility perspectives. According to statistical parity, to optimize the fairness metric, a fair feature view should guarantee equivalent predictions between sensitive groups:

(7) $P(\hat{y}\mid s=0) = P(\hat{y}\mid s=1)$

where $P(\hat{y}\mid s)$ is the predicted distribution given the sensitive feature. Assuming the prediction and the sensitive feature are conditionally independent given the learned representations (Kamishima et al., 2011), to reach the global minimum of Eq. (7), we leverage adversarial training and compute the losses of the discriminator and the generator as:

(8) $\mathcal{L}_d = -\frac{1}{|\mathcal{V}|}\sum_{v_i\in\mathcal{V}}\big(s_i\log\hat{s}_i + (1-s_i)\log(1-\hat{s}_i)\big)$

(9) $\mathcal{L}_g = -\frac{1}{|\mathcal{V}|}\sum_{v_i\in\mathcal{V}}\big(\tfrac{1}{2}\log\hat{s}_i + \tfrac{1}{2}\log(1-\hat{s}_i)\big) + \alpha\,\mathcal{R}(\mathbf{m})$

where $\mathcal{R}(\mathbf{m})$ regularizes the mask to be dense, which avoids masking out sensitive-uncorrelated but task-critical information, and $\alpha$ is the hyperparameter. Intuitively, Eq. (8) encourages our discriminator to correctly predict the sensitive features of each node under each generated view, and Eq. (9) requires our generator to generate fair feature views that enforce the well-trained discriminator to randomly guess the sensitive features. In Theorem 1, we show that the global minimum of Eqs. (8)-(9) is equivalent to the global minimum of Eq. (7):
Theorem 1.
Proof.
Based on Proposition 1 in (Goodfellow et al., 2014) and Proposition 4.1 in (Dai and Wang, 2021), the optimal discriminator outputs exactly the probability of randomly guessing the sensitive features. We then substitute it into Eq. (9), and the optimal generator is achieved at that point. Then we have:
which is obviously the global minimum of Eq. (7). ∎
Note that node representations have already been propagated in the GNN-based encoder, and therefore the optimal discriminator can identify sensitive-related features after correlation variation. Besides the adversarial training loss that ensures the fairness of the generated views, the classification loss for training the classifier is used to guarantee the model utility:

(10) $\mathcal{L}_c = -\frac{1}{|\mathcal{V}_L|}\sum_{v_i\in\mathcal{V}_L}\big(y_i\log\hat{y}_i + (1-y_i)\log(1-\hat{y}_i)\big)$
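Putting the three objectives together, the losses can be sketched as below; the exact form of the mask-density regularizer is our assumption, hedged from the description above:

```python
import numpy as np

def bce(p, target):
    """Binary cross-entropy between predicted probabilities and targets."""
    p = np.clip(p, 1e-7, 1.0 - 1e-7)
    return -(target * np.log(p) + (1.0 - target) * np.log(1.0 - p)).mean()

def discriminator_loss(s_hat, s):
    """Eq. (8): predict the sensitive feature correctly under each view."""
    return bce(s_hat, s)

def generator_loss(s_hat, keep_prob, alpha=1.0):
    """Eq. (9): push the discriminator toward random guessing (target 1/2),
    plus a density term (assumed form) that discourages over-masking."""
    fool = bce(s_hat, np.full_like(s_hat, 0.5))
    density = (1.0 - keep_prob).mean()         # fraction of masked channels
    return fool + alpha * density

def classifier_loss(y_hat, y):
    """Eq. (10): standard BCE on labeled nodes for model utility."""
    return bce(y_hat, y)
```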
4.2. Adaptive Weight Clamping
Although the generator is theoretically guaranteed to achieve its global minimum under adversarial training, in practice the generated views may still encode sensitive information and the corresponding classifier may still make discriminatory decisions. This is because of the instability of the adversarial training process (Goodfellow et al., 2014) and its entanglement with training the classifier.
To alleviate the above issue, we propose to adaptively clamp weights of the encoder based on the learned masking probability distribution from the generator. After adversarial training, only the sensitive features and their highly-correlated features have a higher probability of being masked; therefore, declining their contributions by clamping their corresponding weights in the encoder discourages the encoder from capturing these features and hence alleviates the discrimination. Concretely, within each training epoch after the adversarial training, we estimate the probability of keeping each feature channel by sampling masks and calculating their mean $\bar{\mu}_k$. Then, assuming the weights of the first layer in the encoder are $\mathbf{W}$, we clamp them by:

(11) $\mathbf{W}_{k} \leftarrow \text{sign}(\mathbf{W}_{k})\odot\min(|\mathbf{W}_{k}|,\ \bar{\mu}_k\,\epsilon)$

where $\epsilon$ is a prefix cutting threshold selected by hyperparameter tuning and $\text{sign}(\cdot)$ takes the sign of $\mathbf{W}_k$. Intuitively, feature channels masked with higher probability (kept with lower probability $\bar{\mu}_k$) have a lower threshold in weight clamping, and hence their contributions to the representations are weakened. Next, we theoretically rationalize this adaptive weight clamping by demonstrating its equivalence to minimizing the upper bound of the difference of representations between two sensitive groups:

Theorem 2.
Given a 1-layer GNN encoder with the row-normalized adjacency matrix as PROP and the weight matrix $\mathbf{W}$ as TRAN, and further assuming that features of nodes from the two sensitive groups independently and identically follow two different Gaussian distributions, i.e., $\mathbf{x}_i\sim\mathcal{N}(\boldsymbol{\mu}^{1},\boldsymbol{\Sigma}^{1})$ and $\mathbf{x}_j\sim\mathcal{N}(\boldsymbol{\mu}^{0},\boldsymbol{\Sigma}^{0})$, then the difference of representations between the two groups also follows a Gaussian, with the 2-norm of its mean bounded as:

(12)

where $s$ and $ns$ index the sensitive and non-sensitive feature channels, and $h$ is the network homophily.
Proof.
Substituting the row-normalized adjacency matrix into the encoder, for any pair of nodes coming from two different sensitive groups, we have:

(13)

If the network homophily is $h$ and we further assume that neighboring nodes strictly obey the network homophily, i.e., among the neighboring nodes of a center node, a fraction $h$ of them come from the same feature distribution as the center node while a fraction $1-h$ of them come from the other feature distribution, then symmetrically we have:

(14)

Combining Eq. (14) and Eq. (13), the distribution of their difference is also a Gaussian, where:

(15)

(16)

Taking the 2-norm of the mean, splitting the channels into sensitive ones and non-sensitive ones, and expanding along the input channels, we have:

(17)

where $\mathbf{W}_{kj}$ represents the weight of the encoder from feature channel $k$ to hidden neuron $j$. Since the clamped weights satisfy $|\mathbf{W}_{kj}|\le\bar{\mu}_k\epsilon$, we substitute this upper bound into Eq. (17) and finally end up with Eq. (12). ∎
The left side of Eq. (12) is the difference of representations between the two sensitive groups; if it is large, the predictions between these two groups also differ greatly, which reflects more discrimination in terms of group fairness. Additionally, Theorem 2 indicates that the upper bound on the group-fairness gap between two sensitive groups depends on the network homophily $h$, the initial feature difference, and the masking probability. As the network homophily decreases, more neighboring nodes come from the other sensitive group, and aggregating information from these neighborhoods smooths node representations between different sensitive groups and reduces the bias. To the best of our knowledge, this is the first work relating fairness to network homophily. Furthermore, Eq. (12) shows that clamping weights of the encoder tightens the upper bound on the group-fairness gap.
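The adaptive clamping of Eq. (11) amounts to a per-channel magnitude budget on the first-layer weights; a minimal sketch under our reading of the rule (channel k's budget is its mean keep probability times ε):

```python
import numpy as np

def clamp_first_layer(W, keep_prob, eps):
    """Clamp first-layer encoder weights (Eq. 11).
    W: (d, h) weights; keep_prob: (d,) mean keep probability per channel
    estimated from sampled masks; eps: prefix cutting threshold."""
    budget = keep_prob[:, None] * eps              # smaller budget for masked channels
    return np.sign(W) * np.minimum(np.abs(W), budget)
```

Channels the generator tends to mask (low keep probability) end up with near-zero weights, so the encoder cannot exploit them.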
4.3. Training Algorithm
Here we present the holistic algorithm of the proposed FairVGNN. In comparison to vanilla adversarial training, the additional computational requirements of FairVGNN come from generating different masks. However, since within each training epoch we can precompute the masks (Step 4) before the adversarial training, and the total number of views is constant compared with the whole time used for adversarial training (Steps 6-14), the time complexity is still linearly proportional to the size of the whole graph. The total model complexity includes the parameters of the feature masker, the discriminator/classifier, and the encoder, which is of the same order as any other $L$-layer GNN backbone.
5. Experiments
In this section, we conduct extensive experiments to evaluate the effectiveness of FairVGNN.
5.1. Experimental Settings
5.1.1. Datasets
We validate the proposed approach on three benchmark datasets (Agarwal et al., 2021; Dong et al., 2022a) with their statistics shown in Table 2.


Dataset    German   Credit      Bail
#Nodes     1,000    30,000      18,876
#Edges     22,242   1,436,858   321,308
#Features  27       13          18
Sens.      Gender   Age         Race
Label      Good/bad credit   Default/no-default payment   Bail/no bail
Encoder | Method | German: AUC (↑)  F1 (↑)  ACC (↑)  Δ_SP (↓)  Δ_EO (↓) | Credit: AUC (↑)  F1 (↑)  ACC (↑)  Δ_SP (↓)  Δ_EO (↓) | Bail: AUC (↑)  F1 (↑)  ACC (↑)  Δ_SP (↓)  Δ_EO (↓) | Avg. Rank
GCN | Vanilla | 74.11±0.37  82.46±0.89  73.44±1.09  35.17±7.27  25.17±5.89 | 73.87±0.02  81.92±0.02  73.67±0.03  12.86±0.09  10.63±0.13 | 87.08±0.35  79.02±0.74  84.56±0.68  7.35±0.72  4.96±0.62 | 9.17
GCN | NIFTY | 68.78±2.69  81.40±0.54  69.92±1.14  5.73±5.25  5.08±4.29 | 71.96±0.19  81.72±0.05  73.45±0.06  11.68±0.07  9.39±0.07 | 78.20±2.78  64.76±3.91  74.19±2.57  2.44±1.29  1.72±1.08 | 9.69
GCN | EDITS | 69.41±2.33  81.55±0.59  71.60±0.89  4.05±4.48  3.89±4.23 | 73.01±0.11  81.81±0.28  73.51±0.30  10.90±1.22  8.75±1.21 | 86.44±2.17  75.58±3.77  84.49±2.27  6.64±0.39  7.51±1.20 | 9.89
GCN | FairGNN | 67.35±2.13  82.01±0.26  69.68±0.30  3.49±2.15  3.40±2.15 | 71.95±1.43  81.84±1.19  73.41±1.24  12.64±2.11  10.41±2.03 | 87.36±0.90  77.50±1.69  82.94±1.67  6.90±0.17  4.65±0.14 | 9.17
GCN | FairVGNN | 72.41±2.10  82.14±0.42  70.16±0.86  1.71±1.68  0.88±0.58 | 71.34±0.41  87.08±0.74  78.04±0.33  5.02±5.22  3.60±4.31 | 85.68±0.37  79.11±0.33  84.73±0.46  6.53±0.67  4.95±1.22 | 5.67
GIN | Vanilla | 72.71±1.44  82.78±0.50  73.84±0.54  13.56±5.23  9.47±4.49 | 74.36±0.21  82.28±0.64  74.02±0.73  14.48±2.44  12.35±2.86 | 86.14±0.25  76.49±0.57  81.70±0.67  8.55±1.61  6.99±1.51 | 9.56
GIN | NIFTY | 67.61±4.88  80.46±3.06  69.92±3.64  5.26±3.24  5.34±5.67 | 70.90±0.24  84.05±0.82  75.59±0.66  7.09±4.62  6.22±3.26 | 82.33±4.61  70.64±6.73  74.46±9.98  5.57±1.11  3.41±1.43 | 8.56
GIN | EDITS | 69.35±1.64  82.80±0.22  72.08±0.66  0.86±0.76  1.72±1.14 | 72.35±1.11  82.47±0.85  74.07±0.98  14.11±14.45  15.40±15.76 | 80.19±4.62  68.07±5.30  73.74±5.12  6.71±2.35  5.98±3.66 | 11.36
GIN | FairGNN | 72.95±0.82  83.16±0.56  72.24±1.44  6.88±4.42  2.06±1.46 | 68.66±4.48  79.47±5.29  70.33±5.50  4.67±3.06  3.94±1.49 | 86.14±0.89  73.67±1.17  77.90±2.21  6.33±1.49  4.74±1.64 | 7.64
GIN | FairVGNN | 71.65±1.90  82.40±0.14  70.16±0.32  0.43±0.54  0.34±0.41 | 71.36±0.72  87.44±0.23  78.18±0.20  2.85±2.01  1.72±1.80 | 83.22±1.60  76.36±2.20  83.86±1.57  5.67±0.76  5.77±1.26 | 5.44
SAGE | Vanilla | 75.74±0.69  81.25±1.72  72.24±1.61  24.30±6.93  15.55±7.59 | 74.58±1.31  83.38±0.77  75.28±0.83  15.65±1.30  13.34±1.34 | 90.71±0.69  80.99±0.55  86.72±0.48  2.16±1.53  0.84±0.55 | 7.31
SAGE | NIFTY | 72.05±2.15  79.20±1.19  69.60±1.50  7.74±7.80  5.17±2.38 | 72.89±0.44  82.60±1.25  74.39±1.35  10.65±1.65  8.10±1.91 | 92.04±0.89  77.81±6.03  84.11±5.49  5.74±0.38  4.07±1.28 | 8.06
SAGE | EDITS | 69.76±5.46  81.04±1.09  71.68±1.25  8.42±7.35  5.69±2.16 | 75.04±0.12  82.41±0.52  74.13±0.59  11.34±6.36  9.38±5.39 | 89.07±2.26  77.83±3.79  84.42±2.87  3.74±3.54  4.46±3.50 | 11.36
SAGE | FairGNN | 65.85±9.49  82.29±0.32  70.64±0.74  7.65±8.07  4.18±4.86 | 70.82±0.74  83.97±2.00  75.29±1.62  6.17±5.57  5.06±4.46 | 91.53±0.38  82.55±0.98  87.68±0.73  1.94±0.82  1.72±0.70 | 5.83
SAGE | FairVGNN | 73.84±0.52  81.91±0.63  70.00±0.25  1.36±1.90  1.22±1.49 | 74.05±0.20  87.84±0.32  79.94±0.30  4.94±1.10  2.39±0.71 | 91.56±1.71  83.58±1.88  88.41±1.29  1.14±0.67  1.69±1.13 | 2.92
5.1.2. Baselines
Several state-of-the-art fair node representation learning models are compared with our proposed FairVGNN. We divide them into two categories: (1) Augmentation-based: this type of method alleviates discrimination via graph augmentation, where sensitive-related information is removed by modifying the graph topology or node features. NIFTY (Agarwal et al., 2021) simultaneously achieves counterfactual fairness and stability by contrastive learning. EDITS (Dong et al., 2022a) approximates the inputs' discrimination via the Wasserstein distance and directly minimizes it between sensitive and non-sensitive groups by pruning the graph topology and node features. (2) Adversarial-based: these methods enforce the fairness of node representations by alternately training the encoder to fool the discriminator and the discriminator to predict the sensitive attributes. FairGNN (Dai and Wang, 2021) deploys an extra sensitive feature estimator to increase the amount of sensitive information available for adversarial training. Since different GNN backbones may cause different levels of sensitive attribute leakage, we equip each of the above bias-alleviating methods with three GNN backbones: GCN (Kipf and Welling, 2017), GIN (Xu et al., 2019), and GraphSAGE (Hamilton et al., 2017); e.g., GCN-NIFTY represents the GCN encoder with NIFTY.

5.1.3. Setup
Our proposed FairVGNN is implemented using PyTorch Geometric (Fey and Lenssen, 2019). For EDITS (https://github.com/yushundong/edits), NIFTY (https://github.com/chirag126/nifty), and FairGNN (https://github.com/EnyanDai/FairGNN), we use the original code from the authors' GitHub repositories. We aim to provide a rigorous and fair comparison between different models on each dataset by tuning hyperparameters for all models individually; the detailed hyperparameter configuration of each baseline is in Appendix B.2. Following (Agarwal et al., 2021) and (Dong et al., 2022a), we use 1-layer GCN and GIN convolutions and a 2-layer GraphSAGE convolution, respectively, as our encoder, and use 1 linear layer as our classifier and discriminator. The detailed GNN architecture is described in Appendix B.1. We fix the number of hidden units of the encoder as 16, the dropout rate as 0.5, and the number of generated fair feature views per training epoch. The learning rates and the training epochs of the generator, the discriminator, the classifier, and the encoder, the prefix cutting threshold $\epsilon$ in Eq. (11), and the total number of training epochs are searched over predefined grids (Appendix B.2). We use the default data splitting following (Agarwal et al., 2021; Dong et al., 2022a), and experimental results are averaged over five repeated executions with five different seeds to remove any potential initialization bias.

5.2. Node Classification
5.2.1. Performance comparison
The model utility and fairness of each baseline are shown in Table 3. We observe that FairVGNN consistently performs the best compared with other bias-alleviating methods in terms of the average rank across all datasets and evaluation metrics, which indicates the superiority of our model in achieving a better trade-off between model utility and fairness. Since no fairness regularization is imposed on GNN encoders equipped with vanilla methods, they generally achieve better model utility. However, for this reason, sensitive-related information is also completely free to be encoded in the learned node representations and hence causes higher bias. To alleviate such discrimination, all other methods propose different regularizations to constrain sensitive-related information in the learned node representations, which also removes some task-related information and hence sacrifices model utility, as expected in Table 3. However, we do observe that our model can yield less-biased predictions with less utility sacrifice, which is mainly ascribed to two reasons: (1) We generate different fair feature views by randomly sampling masks from the learned Gumbel-Softmax distribution and make predictions. This can be regarded as a data augmentation technique that adds noise to node features, which decreases the population risk and enhances the model's generalizability (Shorten and Khoshgoftaar, 2019) by creating novel mappings from augmented training points to the label space. (2) The weight clamping module clamps weights of the encoder based on feature correlations to the sensitive feature channel, which adaptively removes/keeps the sensitive/task-relevant information.

Encoder | Model Variants | German | Credit | Bail
(per dataset) AUC (↑)  F1 (↑)  ACC (↑)  Δ_SP (↓)  Δ_EO (↓)
GCN | FairV | 72.69±1.67  81.86±0.49  69.84±0.41  0.77±0.39  0.46±0.34 | 71.34±0.41  87.08±0.74  78.04±0.33  5.02±5.22  3.60±4.31 | 85.68±0.37  79.11±0.33  84.73±0.46  6.53±0.67  4.95±1.22
GCN | FairV w/o fm | 73.63±1.14  82.28±0.28  70.88±1.09  5.56±3.89  4.41±3.59 | 72.51±0.32  86.15±2.18  77.83±2.15  6.94±2.86  4.64±2.73 | 86.98±0.32  78.08±0.53  84.59±0.29  7.24±0.26  5.75±0.68
GCN | FairV w/o wc | 72.08±1.83  82.72±0.50  71.04±1.23  3.19±3.51  0.59±1.12 | 71.80±0.47  87.27±0.47  78.47±0.34  9.05±4.55  5.94±3.61 | 85.93±0.38  79.22±0.29  85.38±0.25  6.61