
FairNorm: Fair and Fast Graph Neural Network Training

Graph neural networks (GNNs) have been demonstrated to achieve state-of-the-art performance on a number of graph-based learning tasks, which has led to a rise in their employment in various domains. However, it has been shown that GNNs may inherit and even amplify bias within training data, which leads to unfair results towards certain sensitive groups. Meanwhile, training of GNNs introduces additional challenges, such as slow convergence and possible instability. Faced with these limitations, this work proposes FairNorm, a unified normalization framework that reduces the bias in GNN-based learning while also providing provably faster convergence. Specifically, FairNorm employs fairness-aware normalization operators over different sensitive groups with learnable parameters to reduce the bias in GNNs. The design of FairNorm is built upon analyses that illuminate the sources of bias in graph-based learning. Experiments on node classification over real-world networks demonstrate the effectiveness of the proposed scheme in improving fairness in terms of statistical parity and equal opportunity compared to fairness-aware baselines. In addition, it is empirically shown that the proposed framework leads to faster convergence compared to the naive baseline where no normalization is employed.



1 Introduction

Graphs are powerful structures in modeling complex systems and the relations within them. Hence, they are widely employed to represent various real-world systems, such as gene networks, traffic networks and social networks to name a few. Such expressiveness has led to a rise in the attention towards learning over graphs, and it has been shown that graph neural networks (GNNs) achieve the state-of-the-art for several tasks over graphs Gori et al. (2005); Scarselli et al. (2008); Hamilton et al. (2017); Kipf and Welling (2017); Veličković et al. (2018); Xu et al. (2018). GNNs create node representations by repeatedly aggregating information from the neighbors, which can be employed on ensuing tasks such as traffic forecasting Opolka et al. (2019), crime forecasting Jin et al. (2020), and recommendation systems Ying et al. (2018).

Due to their success, machine learning (ML) models are widely used in our everyday lives to make life-changing decisions, which makes it essential to prevent any discriminatory behaviour in these models towards under-represented groups. However, several studies have demonstrated that ML models propagate the historical bias within the training data Dwork et al. (2012); Beutel et al. (2017) and lead to discriminatory results in ensuing applications. Particular to GNNs, it has been shown that, in addition to propagating the already existing bias, GNN-based learning may even amplify it due to the utilization of biased graph topologies Dai and Wang (2021). This motivates the study of fairness-aware GNN-based learning.

Normalization operations are introduced to shift and scale the hidden representations created in deep neural networks (DNNs) in order to accelerate the optimization process in training Ioffe and Szegedy (2015); Ulyanov et al. (2016); Ba et al. (2016); Salimans and Kingma (2016); Xiong et al. (2020); Miyato et al. (2018); Wu and He (2018); Santurkar et al. (2018). While other aspects of GNN-based learning, such as generalization Scarselli et al. (2018); Xu et al. (2019) and expressiveness Xu et al. (2018); Loukas (2020); Ying et al. (2021), have been investigated theoretically, the optimization of GNNs remains analytically under-explored. In practice, training GNNs generally has a slow convergence rate and is accompanied by instability issues Xu et al. (2018). Inspired by this, Cai et al. (2021) investigate the effect of a shift operation on a simple GNN-based learning environment and propose a normalization framework that is suitable for GNNs. The proposed framework, GraphNorm Cai et al. (2021), is demonstrated to be more effective in improving convergence speed over graphs compared to normalization strategies previously presented in other domains.

It has been shown in Balunovic et al. (2021) that the distribution discrepancy among different sensitive groups is one of the leading factors of bias in general ML algorithms. For GNNs, fairness analyses have also shown that the distributions of the representations of different sensitive groups affect the resulting bias Li et al. (2020); Kose and Shen (2022). In the meantime, normalization layers learn the parameters that manage the sample mean and variance of these hidden representations. Herein, we emphasize that the normalization layer inherently provides an ability to manipulate said parameters in order to mitigate bias, while also improving the convergence. Motivated by this, this study proposes a unified framework that mitigates bias in GNN-based learning, while also providing faster convergence through the employment of a normalization layer. Overall, our contributions in this paper can be summarized as follows:

c1) We propose a framework that can reduce bias while providing a higher convergence speed for a GNN-based learning environment. To the best of our knowledge, FairNorm is the first attempt to improve fairness and convergence speed in a unified framework.
c2) The effect of the fairness-aware shift operations on convergence rate is investigated in a simple GNN-based learning framework. It is analytically demonstrated that the proposed shift operations can improve the convergence rate for node classification compared to the case where no shift is employed.
c3) For fairness considerations, two fairness-aware regularizers are introduced for the trainable parameters of normalization layers. The designs of said regularizers are based on the theoretical understanding regarding the sources of bias in GNN-based learning.
c4) Empirical results are obtained over real-world networks in terms of utility and fairness metrics for node classification. It is demonstrated that compared to fairness-aware baselines, FairNorm leads to an improvement in fairness metrics while providing comparable utility. Meanwhile, it is shown that applying FairNorm enhances the convergence speed with respect to the no-normalization baseline.

2 Related Work

Fairness-aware learning over graphs: In fairness-aware learning over graphs, Rahman et al. (2019) serves as a seminal work based on random walks. In addition, Dai and Wang (2021); Bose and Hamilton (2019); Fisher et al. (2020) propose to use adversarial regularization to reduce bias in GNNs. Another approach is a Bayesian one, where the sensitive information is modeled in the prior distribution to enhance fairness over graphs Buyl and De Bie (2020). Furthermore, Ma et al. (2021) performs a PAC-Bayesian analysis and links the notion of subgroup generalization to accuracy disparity, and Zeng et al. (2021) proposes several strategies, including GNN-based ones, to reduce bias for the representations of heterogeneous information networks. Specifically for fairness-aware link prediction, while Buyl and De Bie (2021) introduces a regularizer, Li et al. (2020); Laclau et al. (2021) propose strategies that alter the adjacency matrix. With a specific consideration of individual fairness over graphs, Dong et al. (2021) proposes a ranking-based framework. Another research direction in fairness-aware graph-based learning is to modify the graph structure to combat bias resulting from the graph connectivity Agarwal et al. (2021); Spinelli et al. (2021); Kose and Shen (2022); Köse and Shen (2021). Differing from all previous works, this study proposes a unified framework that can mitigate bias in GNN-based learning together with an enhanced convergence speed.

Normalization: Batch Normalization (BatchNorm) Ioffe and Szegedy (2015) is the pioneering study that proposes to shift and scale the hidden representations in a batch to accelerate the convergence of training for DNNs. As a follow-up work, Instance Normalization (InstanceNorm) Ulyanov et al. (2016) is proposed for real-time image generation, which applies normalization over individual images instead of the samples in a batch. For permutation-equivariant data processing, adaptations of InstanceNorm Yi et al. (2018); Sun et al. (2020) have also been presented. Specifically for GNNs, Xu et al. (2018) employs BatchNorm within the framework of graph isomorphism networks, while a prior version of Dwivedi et al. (2020) normalizes node features based on the graph size. A size-agnostic normalization for graphs, GraphNorm Cai et al. (2021), is further proposed, which improves InstanceNorm for graphs with a learnable shift to prevent degradation in expressiveness. However, none of the aforementioned normalization schemes consider fairness.

3 Preliminaries

This study develops a unified training scheme for GNNs that can improve fairness while at the same time enhancing the convergence speed, given an input graph $\mathcal{G} := (\mathcal{V}, \mathcal{E})$, where $\mathcal{V} := \{v_1, v_2, \dots, v_N\}$ denotes the node set, and $\mathcal{E} \subseteq \mathcal{V} \times \mathcal{V}$ is the edge set. Matrices $\mathbf{X}$ and $\mathbf{A}$ are the feature and adjacency matrices, respectively, where $A_{ij} = 1$ if and only if $(v_i, v_j) \in \mathcal{E}$. Degree matrix $\mathbf{D}$ is defined to be a diagonal matrix with the $i$th diagonal entry denoting the degree of node $v_i$. In this study, the sensitive attributes of the nodes are denoted by $\mathbf{s} \in \{0,1\}^{N}$, where the existence of a single, binary sensitive attribute is considered. Furthermore, $\mathcal{S}_0$ and $\mathcal{S}_1$ denote the set of nodes whose sensitive attributes are $0$ and $1$, respectively. Node representations at the $l$th layer are represented by $\mathbf{H}^{l}$, where $\mathbf{h}_i^{l}$ denotes the representation of node $v_i$ and $h_{i,j}^{l}$ is the $j$th feature of $\mathbf{h}_i^{l}$. Vectors $\mathbf{x}_i$ and $s_i$ will be used to denote the feature vector and the sensitive attribute of node $v_i$. Throughout the paper, $\max(\cdot,\cdot)$ outputs the element-wise maximum vector of its argument vectors, and $\mathrm{mean}(\cdot)$ denotes the sample mean operator.

GNNs produce node embeddings by repeatedly aggregating information from neighbors. Different GNN structures based on different aggregation strategies have been proposed so far Kipf and Welling (2017); Veličković et al. (2018); Xu et al. (2018). A general formulation of GNNs in matrix form follows as:

$$\mathbf{H}^{l+1} = \sigma\left(\mathbf{W}^{l}\mathbf{H}^{l}\mathbf{Q}\right),$$

where $\mathbf{W}^{l}$ represents the weight matrix of the GNN at the $l$th layer and $\sigma(\cdot)$ denotes the activation function. In this formulation, the matrix $\mathbf{Q}$ specifies the information aggregation process from neighbors, which changes in different GNN frameworks. For example, $\mathbf{Q} = \hat{\mathbf{D}}^{-1/2}\hat{\mathbf{A}}\hat{\mathbf{D}}^{-1/2}$ for Graph Convolutional Networks (GCN) Kipf and Welling (2017), where $\hat{\mathbf{A}} = \mathbf{A} + \mathbf{I}$ with $\mathbf{I}$ denoting the identity matrix, and $\hat{\mathbf{D}}$ is the degree matrix corresponding to $\hat{\mathbf{A}}$. Finally, the representations created after one aggregation process are denoted by . Note that the superscript for the layer number is dropped in the remainder of the paper, as the proposed framework is applicable to every layer in the same way.
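As a concrete illustration of the GCN aggregation step described above, the following minimal NumPy sketch builds the symmetrically normalized aggregation matrix from an adjacency matrix with added self-loops. The function name `gcn_aggregation_matrix` and the toy path graph are illustrative assumptions, not part of the original work.

```python
import numpy as np

def gcn_aggregation_matrix(adj: np.ndarray) -> np.ndarray:
    """Return D_hat^{-1/2} (A + I) D_hat^{-1/2} for a symmetric 0/1 adjacency matrix."""
    a_hat = adj + np.eye(adj.shape[0])   # add self-loops
    deg = a_hat.sum(axis=1)              # degrees of A + I (always >= 1)
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    # Scale row i by deg_i^{-1/2} and column j by deg_j^{-1/2}.
    return a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

# Toy example (assumption): path graph 0 - 1 - 2.
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
Q = gcn_aggregation_matrix(A)
print(Q)
```

With self-loops, the end nodes have degree 2 and the middle node has degree 3, so the diagonal entries are 1/2, 1/3, 1/2 and the entries for connected pairs are 1/sqrt(6).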

3.1 Normalization for GNNs

For deep neural networks (DNNs), normalization methods have been shown to accelerate training through shifting and scaling the hidden representations Ioffe and Szegedy (2015); Ulyanov et al. (2016); Ba et al. (2016); Salimans and Kingma (2016); Xiong et al. (2020); Miyato et al. (2018); Wu and He (2018); Santurkar et al. (2018). Normalization methods differ in the set of features over which the normalization is applied. For example, Layer normalization (LayerNorm) normalizes the feature vectors at each instance in a sequence Ba et al. (2016), while BatchNorm executes the normalization over individual features across different samples in a batch. In general, different normalization methods are proposed for different domains, and there is no universal normalization strategy that suits every domain Cai et al. (2021). For example, while LayerNorm is presented for natural language processing Ba et al. (2016), InstanceNorm seeks specifically to improve the optimization for style transfer tasks Ulyanov et al. (2016). For GNNs, Cai et al. (2021) demonstrates that mean normalization can degrade the expressiveness of the neural networks, as mean statistics incorporate graph structural information. Motivated by this, Cai et al. (2021) proposes GraphNorm, which employs a learnable shift to preserve the mean statistics to a certain extent. The study reports that GraphNorm consistently achieves superior convergence speed and training stability on graph classification for GNNs over other normalization strategies.

3.2 Bias in GNNs

ML models can lead to discriminative results towards certain under-represented groups, as they propagate the bias within the training data Dwork et al. (2012); Beutel et al. (2017). It has been demonstrated that the utilization of graph structure in GNNs amplifies the already existing bias Dai and Wang (2021). Thus, understanding the sources of bias in graph structure is crucial to develop a remedy for it. Motivated by this, Li et al. (2020) and Kose and Shen (2022) investigate the sources of bias in GNN-based learning. In Li et al. (2020), the representation discrepancy between different sensitive groups is examined, whereas in Kose and Shen (2022), the bias analysis is based on the correlation between the aggregated representations and sensitive attributes . Though through different approaches, both analyses in (Li et al., 2020, Theorem 4.1) and (Kose and Shen, 2022, Theorem 3.1) demonstrate the parallelism between the terms , and bias in GNN-based learning. Here, and are the sample means of node representations respectively across each sensitive group, where , and stands for the maximal deviations of hidden representations, that is and . The superscript in is utilized to specify the sensitive group index. Specifically, the hidden representation corresponds to a node .

As the analyses in Li et al. (2020); Kose and Shen (2022) suggest that the distributions of hidden representations corresponding to the different sensitive groups influence the resulting bias in GNNs, a tool that can shift these group-wise distributions can effectively decrease bias-related terms, and hence the overall bias.

4 FairNorm: A Fair and Fast Training Framework for GNNs

This section presents the proposed unified framework that achieves fairness improvement together with faster convergence speed for GNN-based learning.

4.1 Group-wise Normalization

It has been demonstrated in Li et al. (2020); Kose and Shen (2022) that decreasing the distance between the group-wise sample means and the norms of the group-wise maximal deviations of the representations can effectively reduce bias in GNN-based learning. Note that both terms are affected by the distributions of representations from different sensitive groups. On the other hand, the mean and standard deviation of the hidden representations, and in turn their distributions, are affected by the learnable parameters of a normalization layer. Thus, employing such a layer can enable manipulating said distributions, which can be used to improve fairness. Inspired by this, the proposed framework, FairNorm, first applies normalizations to different sensitive groups individually, which results in individual learnable parameters affecting the sample mean of each sensitive group, as well as the difference between these means. For any input matrix whose columns can be divided into the two sensitive groups, the corresponding group-wise normalization operations can be mathematically described as:


where , and are learnable parameters. The superscript in specifies that the representation corresponds to a node from the sensitive group . Considering that mean normalization can degrade the expressiveness of GNNs Cai et al. (2021), the proposed framework employs the learnable parameter that manages the amount of mean normalization.
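The group-wise normalization described above can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: it assumes a GraphNorm-style operator of the form `gamma * (h - alpha * mu) / sigma + beta`, applied separately to the columns (nodes) of each sensitive group, with `sigma` taken as the root mean square of the shifted values. The function name `group_wise_norm`, the dictionary layout of the parameters, and `eps` are hypothetical.

```python
import numpy as np

def group_wise_norm(h, s, alpha, gamma, beta, eps=1e-5):
    """Normalize each sensitive group separately (assumed GraphNorm-style form).

    h: (F, N) representations, columns are nodes.
    s: (N,) binary sensitive attributes.
    alpha, gamma, beta: dicts mapping group id -> (F,) parameter vectors.
    """
    out = np.empty_like(h, dtype=float)
    for g in (0, 1):
        cols = (s == g)
        hg = h[:, cols]
        mu = hg.mean(axis=1, keepdims=True)                 # per-feature group mean
        shifted = hg - alpha[g][:, None] * mu               # learnable amount of mean shift
        sigma = np.sqrt((shifted ** 2).mean(axis=1, keepdims=True) + eps)
        out[:, cols] = gamma[g][:, None] * shifted / sigma + beta[g][:, None]
    return out

# Toy data (assumption): 3 features, 10 nodes, 4 nodes in group 0 and 6 in group 1.
rng = np.random.default_rng(0)
h = rng.normal(loc=5.0, scale=2.0, size=(3, 10))
s = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
ones = {0: np.ones(3), 1: np.ones(3)}
zeros = {0: np.zeros(3), 1: np.zeros(3)}
out = group_wise_norm(h, s, alpha=ones, gamma=ones, beta=zeros)
```

With `alpha = 1`, `gamma = 1`, and `beta = 0`, each group ends up with zero mean and approximately unit variance per feature. Choosing `alpha` below 1 keeps part of the group mean, which is the role the text assigns to the learnable shift parameter.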

It is demonstrated in Cai et al. (2021) that applying a shift operation over the whole graph can speed up the convergence for graph classification. However, as the proposed framework herein applies multiple shift operations individually over subgraphs corresponding to different sensitive groups and considers the node classification task, the effect of the proposed strategy on the convergence speed becomes unclear. Hence, the analysis in Cai et al. (2021) cannot be directly applied to this case. Motivated by this, this study analytically examines the influence of group-wise shifts on the convergence speed.

Shift operations over different sensitive groups can be applied in matrix forms via the matrices and , where for . In this formulation, is created such that if , and otherwise. Therefore, for any vector , . Hence, the group-wise shift operations applied to hidden representations can be written as:


The following lemma demonstrates that acts as a preconditioner of , whose proof is presented in Appendix A.

Lemma 1.


be the singular values of

. We have as the two of the singular values of . Let the remaining singular values of be . Then, the following holds:


where or , only if have a right singular vector such that and .
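The matrix form of the group-wise shift can be checked numerically. The sketch below assumes the shift matrix has the form `N = I - sum_g (1/|S_g|) e_g e_g^T`, where `e_g` is the 0/1 indicator vector of sensitive group `g`; this form, the name `group_shift_matrix`, and the random stand-in for the aggregation matrix are assumptions made for illustration.

```python
import numpy as np

def group_shift_matrix(s):
    """Assumed form: identity minus the per-group averaging projections."""
    n = s.shape[0]
    shift = np.eye(n)
    for g in np.unique(s):
        e = (s == g).astype(float)          # indicator vector of group g
        shift -= np.outer(e, e) / e.sum()   # subtract the group-mean projection
    return shift

s = np.array([0, 0, 1, 1, 1])
H = np.array([[1.0, 3.0, 10.0, 20.0, 30.0]])   # one feature, five nodes (columns)
N = group_shift_matrix(s)
shifted = H @ N                                 # each entry minus its own group's mean
sv_N = np.linalg.svd(N, compute_uv=False)

# Random stand-in for an aggregation matrix (assumption, for illustration only).
rng = np.random.default_rng(0)
Q = rng.normal(size=(5, 5))
sv_QN = np.linalg.svd(Q @ N, compute_uv=False)

print(shifted)   # group means are 2 and 20, so entries become -1, 1, -10, 0, 10
print(sv_N)
```

Under this assumed form, `N` is a symmetric projection whose null space is spanned by the two group indicator vectors, so `N` has exactly two zero singular values and any product `Q @ N` has at least two.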

In other domains, such as DNNs or iterative algorithms, a similar preconditioning is considered to help the training Kingma and Ba (2014); Axelsson (1985). Such a preconditioning of the aggregation matrix is also demonstrated to accelerate the optimization of GNNs Cai et al. (2021). In order to theoretically investigate such an effect in our setting, we consider a basic linear GNN model for node classification that is optimized via gradient descent, and present its convergence analysis in Theorem 1. Appendix B presents all assumptions and learning settings employed in Theorem 1 in detail, as well as its proof.

Theorem 1.

For a linear GNN model, let the parameters of the model at time with applied shift operations through be denoted by

. Then, with high probability,

converges to the optimal parameters linearly with a rate :


The same also holds for the parameters of the model without any shift, , for convergence rate :


Thus, it concludes that the shift operations applied through lead to faster convergence with high probability compared to the scheme where no shift is applied.

Theorem 1 demonstrates that the individual shift operations applied over different sensitive groups indeed improve the convergence rate compared to the naive baseline. While the result of Theorem 1 seems to be similar to the result of (Cai et al., 2021, Proposition 3.1), the analysis in Cai et al. (2021) cannot be easily extended to our proof due to the employment of group-wise shifts in this work and the fact that we consider node classification instead of graph classification.

4.2 Fairness-aware Regularizers

Consider the conventional case where the normalization is applied after linear transformations Ioffe and Szegedy (2015); Xiong et al. (2020); Cai et al. (2021). For this case, the hidden representations can be expressed in matrix form as:


In Equation (6), denotes the submatrix consisting the columns of whose corresponding nodes are in . Furthermore, as the proposed strategy applies normalizations individually over different sensitive groups, these group-wise normalization layers are differentiated by the superscript . Let denote the sample mean of representations after normalization for the sensitive group . In the proposed framework, recalling from Subsection 4.1, individual normalization layers are employed to create individual learnable parameters for the distributions of different sensitive groups, so that the bias-related terms derived in Li et al. (2020); Kose and Shen (2022) can be reduced. However, in order to manipulate and for possible bias reduction, the relationship between and ’s should be investigated. To this end, we present the following theorem, the proof of which can be found in Appendix C.

Theorem 2.

Let be Lipschitz continuous with Lipschitz constant , and let denote the normalized representations in group . Then, is bounded above by


Here, is the maximal deviation of from (i.e., we have ).

Theorem 2 demonstrates that decreasing the distance between the sample means of the normalized representations of the two sensitive groups results in a decreased upper bound for the corresponding distance after the activation, which can possibly reduce its actual value. Based on this result, as a second step after applying individual normalization layers over different sensitive groups, FairNorm proposes the use of a regularizer term on this distance to decrease bias in GNN-based learning. We note that many commonly used activation functions, such as ReLU, sigmoid, and tanh, are Lipschitz continuous with a Lipschitz constant of at most $1$.

Furthermore, Theorem 2 shows that the upper bound for can also be decreased by reducing the norms of maximal deviations and . Inspired by this finding, is also introduced as a regularizer to reduce the norms of maximal deviations of the normalized representations. Hence, the overall learning objective for the considered node classification task can be written as:


where is the classification loss, for M-Norm in (1). , and for M-Norm defined in (1) , where denotes the representations input to the normalization layer. GNN parameters are denoted by , and and

are hyperparameters specifying the focus on the fairness regularizers.
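The two fairness regularizers can be illustrated with a small NumPy sketch. It is a sketch under assumptions rather than the paper's exact objective: it computes the quantities empirically from normalized representations (the text defines them through the M-Norm parameters), takes the mean-gap term as the Euclidean distance between the two group means, and takes the deviation term as the sum of the Euclidean norms of the per-group element-wise maximal absolute deviations. The names `fairness_regularizers`, `kappa`, and `tau`, and the example loss value, are hypothetical.

```python
import numpy as np

def fairness_regularizers(z, s):
    """z: (F, N) normalized representations (columns are nodes); s: (N,) binary attribute."""
    means, devs = [], []
    for g in (0, 1):
        zg = z[:, s == g]
        mu = zg.mean(axis=1)
        means.append(mu)
        # Element-wise maximal absolute deviation from the group mean.
        devs.append(np.abs(zg - mu[:, None]).max(axis=1))
    reg_mean = np.linalg.norm(means[0] - means[1])
    reg_dev = np.linalg.norm(devs[0]) + np.linalg.norm(devs[1])
    return reg_mean, reg_dev

z = np.array([[0.0, 2.0, 4.0, 4.0],
              [1.0, 1.0, 3.0, 5.0]])
s = np.array([0, 0, 1, 1])
reg_mean, reg_dev = fairness_regularizers(z, s)

# Hypothetical weighting of the two terms in an overall objective.
cls_loss, kappa, tau = 0.7, 0.1, 0.05
total = cls_loss + kappa * reg_mean + tau * reg_dev
```

In this toy example the group means are (1, 1) and (4, 4), so the mean-gap term is sqrt(18), and each group has a maximal-deviation vector of unit norm, so the deviation term is 2.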

Remark 1 (Order of normalization and activation). Although the proposed fairness regularizers are designed for the conventional case where the normalization is used before nonlinear activation, it can be demonstrated that they can also reduce bias when the normalization is applied after activation, where


In this case, it holds that and for . Therefore, the employment of the proposed fairness regularizers can naturally be extended to the case where the normalization is utilized after activation, as the analyses in Li et al. (2020); Kose and Shen (2022) demonstrate that reducing and can help mitigate bias in GNN-based learning.

Remark 2 (Non-binary sensitive attributes). It is worth mentioning that, while the fairness analyses in Li et al. (2020); Kose and Shen (2022) are carried out only for a single and binary sensitive attribute, the proposed framework can be extended to non-binary sensitive attributes as well. For non-binary attributes, the normalization layers can still be applied independently for each sensitive group. Afterwards, via regularizers, the sample means for different groups can be brought closer based on distance measures of choice (e.g., the maximum of the norm distances between all pairs of groups), and the norms of the maximal deviations of the normalized representations corresponding to different sensitive groups can be decreased.
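One possible reading of the pairwise distance measure mentioned in Remark 2 is sketched below. The choice of the Euclidean norm and the function name `max_pairwise_mean_gap` are assumptions, since the remark leaves the distance measure open.

```python
import numpy as np
from itertools import combinations

def max_pairwise_mean_gap(z, s):
    """Largest Euclidean distance between the mean representations of any two groups.

    z: (F, N) representations (columns are nodes); s: (N,) group labels.
    """
    means = {g: z[:, s == g].mean(axis=1) for g in np.unique(s)}
    return max(np.linalg.norm(means[a] - means[b]) for a, b in combinations(means, 2))

# Three groups with mean representations (0, 0), (3, 0), and (0, 4).
z = np.array([[0.0, 0.0, 3.0, 3.0, 0.0, 0.0],
              [0.0, 0.0, 0.0, 0.0, 4.0, 4.0]])
s = np.array([0, 0, 1, 1, 2, 2])
gap = max_pairwise_mean_gap(z, s)
print(gap)  # pairwise distances are 3, 4, and 5, so the maximum is 5.0
```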

Remark 3 (Applicability to other normalization methods). Note that the proposed FairNorm framework can be readily utilized together with other normalization techniques where the distribution of the normalized representations depends on learnable parameters, e.g., BatchNorm Ioffe and Szegedy (2015).

5 Experiments

In this section, experimental results obtained on real-world datasets for a supervised node classification task are presented. The performance of the proposed framework, FairNorm, is compared with baseline schemes in terms of node classification accuracy and fairness metrics. Furthermore, the influence of the proposed fairness-aware normalization strategy on convergence speed is examined.

5.1 Datasets and Settings

Datasets. In the experiments, three real-world networks are used: Pokec-z, Pokec-n Dai and Wang (2021), and the Recidivism graph Jordan and Freiburger (2015). Pokec-z and Pokec-n are created by sampling the anonymized 2012 version of Pokec Takac and Zabovsky (2012), which is a Facebook-like social network used in Slovakia Dai and Wang (2021). In Pokec networks, the region information is utilized as the sensitive attribute, where the nodes of these graphs are the users living in two major regions. Labels to be used in node classification are assigned to be the binarized working field of the users. The information of defendants (corresponding to nodes) who were released on bail at the U.S. state courts during 1990-2009 Jordan and Freiburger (2015) is utilized to build the Recidivism graph, where the edges are formed based on the similarity of past criminal records and demographics. Race is used as the sensitive attribute for this graph, and the node classification task classifies defendants into bail (i.e., the defendant is not likely to commit a violent crime if released) or no bail (i.e., the defendant is likely to commit a violent crime if released) Agarwal et al. (2021). Further statistical information for the datasets is presented in Table 2 in Appendix D.

Evaluation Metrics. Accuracy is used as the utility measure for node classification. Two quantitative measures of group fairness are also reported in terms of statistical parity: $\Delta_{SP} = |P(\hat{y}=1 \mid s=0) - P(\hat{y}=1 \mid s=1)|$ and equal opportunity: $\Delta_{EO} = |P(\hat{y}=1 \mid y=1, s=0) - P(\hat{y}=1 \mid y=1, s=1)|$, where $y$ is the ground truth label, and $\hat{y}$ denotes the predicted label. Lower values for $\Delta_{SP}$ and $\Delta_{EO}$ signify better fairness performance Dai and Wang (2021).
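The two group fairness metrics can be computed from binary predictions as in the following minimal sketch, which uses the standard definitions of statistical parity and equal opportunity differences for a binary sensitive attribute. The function names and the toy labels are illustrative assumptions.

```python
import numpy as np

def delta_sp(y_pred, s):
    """|P(y_hat = 1 | s = 0) - P(y_hat = 1 | s = 1)| for binary predictions."""
    y_pred, s = np.asarray(y_pred), np.asarray(s)
    return abs(y_pred[s == 0].mean() - y_pred[s == 1].mean())

def delta_eo(y_pred, y_true, s):
    """|P(y_hat = 1 | y = 1, s = 0) - P(y_hat = 1 | y = 1, s = 1)|."""
    y_pred, y_true, s = map(np.asarray, (y_pred, y_true, s))
    pos = (y_true == 1)
    return abs(y_pred[pos & (s == 0)].mean() - y_pred[pos & (s == 1)].mean())

# Toy example (assumption): eight nodes, four per sensitive group.
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
s      = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]

sp = delta_sp(y_pred, s)          # positive rates 0.5 vs 0.25
eo = delta_eo(y_pred, y_true, s)  # true positive rates 1.0 vs 0.5
print(sp, eo)
```

Both functions assume every group (and, for equal opportunity, every group's positive class) is non-empty; otherwise the mean of an empty slice is undefined.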

Implementation details. To comparatively evaluate our proposed framework, node classification is utilized in a supervised setting. A two-layer GCN Kipf and Welling (2017) followed by a linear layer is employed for the classification task, which is identical to the experimental setting used in Dai and Wang (2021). A normalization layer follows every GNN layer, where the normalization is applied after linear transformations and before the non-linear activation, as suggested in Ioffe and Szegedy (2015); Cai et al. (2021). For the hyperparameter selection of the GCN model, see Appendix E. This experimental framework is kept the same for all baselines. Furthermore, training of the model is executed over of the nodes, while the remaining nodes are equally divided to be used as the validation and test sets. For each experiment, results for five random data splits are obtained, and their average is presented together with the standard deviations. The hyperparameters of the proposed fairness-aware framework and all other baselines are tuned via a grid search on cross-validation sets; see Appendix E for the utilized hyperparameter values. A sensitivity analysis is also provided in Appendix F for the hyperparameters of the proposed FairNorm framework.

Baselines. This work aims to mitigate bias via employing fairness-aware regularizers, as well as to provide faster convergence through its normalization layers. We note that, similar to the proposed regularizers, any other fairness-aware regularizer can be employed together with a normalization layer for these same purposes. In order to demonstrate the performance improvement of the proposed regularizers over said alternatives, we compare the proposed framework with other fairness-aware regularizers. For improving fairness in a supervised setting, FairGNN Dai and Wang (2021) employs adversarial debiasing and a covariance-based regularizer (the absolute covariance between the sensitive attribute and the estimated label). The results for these regularizers are obtained both individually and together, where the framework that utilizes both regularizers is called FairGNN Dai and Wang (2021). Furthermore, the hyperbolic tangent relaxation of the difference of demographic parity (DDP) that is proposed in Padh et al. (2021) is utilized as another baseline. Note that, as DDP is not differentiable, its relaxations are used as fairness-aware regularizers for gradient-based optimization. It is worth emphasizing that the fairness regularizers proposed in this study are also applicable to an unsupervised setting, while the covariance-based (also FairGNN) and DDP-based regularizers can only be used in a supervised framework.

5.2 Experimental Results

Table 1: Comparative Results with Baselines for Different Activation Function Selections. The table reports accuracy (Acc) and the two fairness metrics on Pokec-z, Pokec-n, and Recidivism, separately for ReLU and sigmoid activations, for the following schemes: NoNorm, M-Norm, Covariance, Adversarial, FairGNN, and FairNorm. (Numerical entries are not reproduced here.)

The results of node classification are presented in Table 1 in terms of fairness and utility metrics for both the proposed framework and baselines. The results are obtained for two commonly utilized activation functions: ReLU and sigmoid, in order to demonstrate the efficacy of the proposed framework over different activations. In Table 1, “NoNorm” denotes the scheme where no normalization layer is employed. “M-Norm” stands for the proposed framework where only individual normalizations are applied to the nodes belonging to different sensitive groups, without using the proposed fairness regularizers. Furthermore, “Covariance” is for the covariance-based regularizer Dai and Wang (2021), “Adversarial” stands for the adversarial regularizer Dai and Wang (2021), and “” denotes hyperbolic tangent relaxation of the difference of demographic parity Padh et al. (2021). It should be noted that the results for baselines are obtained with the best performing normalization layer framework (individual normalizations over different sensitive groups vs. normalization over all nodes in the graph) in terms of fairness measures.

The results in Table 1 demonstrate that FairNorm achieves superior fairness performance, together with similar utility, compared to all baselines on all datasets, for both of the utilized activation functions. Compared to its natural baseline “M-Norm”, FairNorm achieves approximately improvement in all fairness measures on Pokec-z. Furthermore, on the Recidivism graph with sigmoid activation, while the improvement in fairness metrics is accompanied by a decrease in accuracy for the baselines, FairNorm achieves better fairness performance without a deterioration in utility. Overall, the results in Table 1 show the efficacy of the proposed fairness regularizers in reducing bias while providing similar utility on different real-world networks. Note that, in addition to their superior fairness performance, the proposed regularizers of FairNorm can be flexibly applied to both supervised and unsupervised settings, whereas some of the baselines (“Covariance”, “FairGNN”, “”) require predicted labels for their regularizer designs.

Remark 4 (Ablation study). An ablation study is also provided in Appendix G in order to demonstrate the influences of and independently. Overall, the ablation study signifies that while has a greater effect on fairness improvement compared to , the utilization of both regularizers typically leads to the largest improvement in fairness measures.

The proposed framework herein aims to mitigate bias while also providing a faster convergence speed. The results in Table 1 confirm that the proposed fairness regularizers within FairNorm do provide said bias reduction. In order to evaluate the convergence speed of FairNorm's group-wise normalizations, Figure 7 is presented. The baselines in Figure 7 consist of GraphNorm Cai et al. (2021) and the framework where no normalization is applied. We note that in Figure 7, FairNorm is employed with both its individual normalizations and its fairness regularizers.

The results on both Pokec datasets and the Recidivism network confirm that the employed normalization can indeed lead to a faster convergence in training compared to NoNorm. Figure 7 also demonstrates that compared to GraphNorm, the convergence improvement of FairNorm is slightly less on Pokec-z, whereas it provides approximately the same improvement on Pokec-n and Recidivism.

(a) Convergence for Pokec-n (ReLU)
(b) Convergence for Pokec-z (ReLU)
(c) Convergence for Recidivism (ReLU)
(d) Convergence for Pokec-n (Sigmoid)
(e) Convergence for Pokec-z (Sigmoid)
(f) Convergence for Recidivism (Sigmoid)
Figure 7: Convergence for different graph datasets when normalization is not applied (NoNorm) and when it is applied with/without fairness consideration (FairNorm/GraphNorm).

6 Conclusions and Limitations

This study proposes a unified framework, FairNorm, that mitigates bias in GNN-based learning and provides faster convergence in training. FairNorm applies normalization independently over different sensitive groups, and employs two novel fairness regularizers that manipulate the parameters of these normalization layers. The designs of these regularizers are based on theoretical fairness analyses on GNNs. Experimental results on real-world social networks show the fairness improvement of FairNorm over fairness-aware baselines in terms of statistical parity and equal opportunity, as well as its similar utility performance in node classification tasks. Furthermore, it is demonstrated that FairNorm improves the convergence speed of the naive baseline where no normalization is used.

The present framework considers only a single sensitive attribute in its design of normalization layers and fairness regularizers. One possible future direction of this study is the extension of the current design to a case with multiple sensitive attributes, which may be essential in certain applications. Furthermore, while the experimental results are obtained in a supervised setting for fair comparison with certain baselines, FairNorm is also applicable to an unsupervised setting. Thus, another future direction is to investigate the performance of FairNorm in an unsupervised learning setting.


References

  • C. Agarwal, H. Lakkaraju, and M. Zitnik (2021) Towards a unified framework for fair and stable graph representation learning. In Uncertainty in Artificial Intelligence (UAI), pp. 2114–2124. Cited by: §2, §5.1.
  • O. Axelsson (1985) A survey of preconditioned iterative methods for linear systems of algebraic equations. BIT Numerical Mathematics 25 (1), pp. 165–187. Cited by: §4.1.
  • J. L. Ba, J. R. Kiros, and G. E. Hinton (2016) Layer normalization. arXiv preprint arXiv:1607.06450. Cited by: §1, §3.1.
  • M. Balunovic, A. Ruoss, and M. Vechev (2021) Fair normalizing flows. In Proc. International Conference on Learning Representations (ICLR), Cited by: §1.
  • A. Beutel, J. Chen, Z. Zhao, and E. H. Chi (2017) Data decisions and theoretical implications when adversarially learning fair representations. arXiv preprint arXiv:1707.00075. Cited by: §1, §3.2.
  • A. Bose and W. Hamilton (2019) Compositional fairness constraints for graph embeddings. In Proc. International Conference on Machine Learning (ICML), pp. 715–724. Cited by: §2.
  • M. Buyl and T. De Bie (2020) Debayes: a bayesian method for debiasing network embeddings. In Proc. International Conference on Machine Learning (ICML), pp. 1220–1229. Cited by: §2.
  • M. Buyl and T. De Bie (2021) The kl-divergence between a graph model and its fair i-projection as a fairness regularizer. arXiv preprint arXiv:2103.01846. Cited by: §2.
  • T. Cai, S. Luo, K. Xu, D. He, T. Liu, and L. Wang (2021) Graphnorm: a principled approach to accelerating graph neural network training. In Proc. International Conference on Machine Learning (ICML), pp. 1204–1215. Cited by: Appendix B, Table 4, Table 6, §1, §2, §3.1, §4.1, §4.1, §4.1, §4.1, §4.2, §5.1, §5.2.
  • W. Cui, X. Zhang, and Y. Liu (2019) Covariance matrix estimation from linearly-correlated gaussian samples. IEEE Transactions on Signal Processing 67 (8), pp. 2187–2195. Cited by: Lemma 3.
  • E. Dai and S. Wang (2021) Say no to the discrimination: learning fair graph neural networks with limited sensitive attribute information. In Proc. 14th ACM International Conference on Web Search and Data Mining (WSDM), pp. 680–688. Cited by: Appendix E, §1, §2, §3.2, §5.1, §5.1, §5.1, §5.1, §5.2.
  • Y. Dong, J. Kang, H. Tong, and J. Li (2021) Individual fairness for graph neural networks: a ranking based approach. In Proc. ACM Conference on Knowledge Discovery & Data Mining (SIGKDD), pp. 300–310. Cited by: §2.
  • V. P. Dwivedi, C. K. Joshi, T. Laurent, Y. Bengio, and X. Bresson (2020) Benchmarking graph neural networks. arXiv preprint arXiv:2003.00982. Cited by: §2.
  • C. Dwork, M. Hardt, T. Pitassi, O. Reingold, and R. Zemel (2012) Fairness through awareness. In Proc. Innovations in Theoretical Computer Science (ITCS), pp. 214–226. Cited by: §1, §3.2.
  • J. Fisher, A. Mittal, D. Palfrey, and C. Christodoulopoulos (2020) Debiasing knowledge graph embeddings. In Proc. Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 7332–7345. Cited by: §2.
  • X. Glorot and Y. Bengio (2010) Understanding the difficulty of training deep feedforward neural networks. In Proc. International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 249–256. Cited by: Appendix E.
  • M. Gori, G. Monfardini, and F. Scarselli (2005) A new model for learning in graph domains. In Proc. IEEE International Joint Conference on Neural Networks (IJCNN), Vol. 2, pp. 729–734. Cited by: §1.
  • A. Hagberg, P. Swart, and D. S Chult (2008) Exploring network structure, dynamics, and function using networkx. In Proc. Python in Science Conference (SciPy), Cited by: Appendix H.
  • W. Hamilton, Z. Ying, and J. Leskovec (2017) Inductive representation learning on large graphs. Advances in Neural Information Processing Systems (NeurIPS) 30. Cited by: §1.
  • R. A. Horn and C. R. Johnson (2012) Matrix analysis. Cambridge university press. Cited by: Appendix B, Lemma 2.
  • S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In Proc. International Conference on Machine Learning (ICML), pp. 448–456. Cited by: §1, §2, §3.1, §4.2, §4.2, §5.1.
  • G. Jin, Q. Wang, C. Zhu, Y. Feng, J. Huang, and J. Zhou (2020) Addressing crime situation forecasting task with temporal graph convolutional neural network approach. In Proc. International Conference on Measuring Technology and Mechatronics Automation (ICMTMA), pp. 474–478. Cited by: §1.
  • K. L. Jordan and T. L. Freiburger (2015) The effect of race/ethnicity on sentencing: examining sentence type, jail length, and prison length. Journal of Ethnicity in Criminal Justice 13 (3), pp. 179–196. Cited by: §5.1.
  • D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: Appendix E, §4.1.
  • T. N. Kipf and M. Welling (2017) Semi-supervised classification with graph convolutional networks. In Proc. International Conference on Learning Representations (ICLR), Cited by: §1, §3, §5.1.
  • O. D. Kose and Y. Shen (2022) Fair node representation learning via adaptive data augmentation. arXiv preprint arXiv:2201.08549. Cited by: §1, §2, §3.2, §3.2, §4.1, §4.2, §4.2, §4.2.
  • Ö. D. Köse and Y. Shen (2021) Fairness-aware node representation learning. arXiv preprint arXiv:2106.05391. Cited by: §2.
  • C. Laclau, I. Redko, M. Choudhary, and C. Largeron (2021) All of the fairness for edge prediction with optimal transport. In Proc. International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 1774–1782. Cited by: §2.
  • P. Li, Y. Wang, H. Zhao, P. Hong, and H. Liu (2020) On dyadic fairness: exploring and mitigating bias in graph connections. In Proc. International Conference on Learning Representations (ICLR), Cited by: §1, §2, §3.2, §3.2, §4.1, §4.2, §4.2, §4.2.
  • A. Loukas (2020) How hard is to distinguish graphs with graph neural networks?. Advances in Neural Information Processing Systems (NeurIPS) 33, pp. 3465–3476. Cited by: §1.
  • J. Ma, J. Deng, and Q. Mei (2021) Subgroup generalization and fairness of graph neural networks. arXiv preprint arXiv:2106.15535. Cited by: §2.
  • T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida (2018) Spectral normalization for generative adversarial networks. In Proc. International Conference on Learning Representations (ICLR), Cited by: §1, §3.1.
  • F. L. Opolka, A. Solomon, C. Cangea, P. Veličković, P. Liò, and R. D. Hjelm (2019) Spatio-temporal deep graph infomax. arXiv preprint arXiv:1904.06316. Cited by: §1.
  • K. Padh, D. Antognini, E. Lejal-Glaude, B. Faltings, and C. Musat (2021) Addressing fairness in classification with a model-agnostic multi-objective algorithm. In Uncertainty in Artificial Intelligence (UAI), pp. 600–609. Cited by: §5.1, §5.2.
  • A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al. (2019) PyTorch: an imperative style, high-performance deep learning library. In Proc. International Conference on Neural Information Processing Systems (NeurIPS), Vol. 32. Cited by: Appendix H.
  • T. A. Rahman, B. Surma, M. Backes, and Y. Zhang (2019) Fairwalk: towards fair graph embedding. In Proc. International Joint Conference on Artificial Intelligence (IJCAI), pp. 3289–3295. Cited by: §2.
  • T. Salimans and D. P. Kingma (2016) Weight normalization: a simple reparameterization to accelerate training of deep neural networks. Advances in Neural Information Processing Systems (NeurIPS) 29. Cited by: §1, §3.1.
  • S. Santurkar, D. Tsipras, A. Ilyas, and A. Madry (2018) How does batch normalization help optimization?. Advances in Neural Information Processing Systems (NeurIPS) 31. Cited by: §1, §3.1.
  • F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini (2008) The graph neural network model. IEEE Transactions on Neural Networks 20 (1), pp. 61–80. Cited by: §1.
  • F. Scarselli, A. C. Tsoi, and M. Hagenbuchner (2018) The vapnik–chervonenkis dimension of graph and recursive neural networks. Neural Networks 108, pp. 248–259. Cited by: §1.
  • I. Spinelli, S. Scardapane, A. Hussain, and A. Uncini (2021) FairDrop: biased edge dropout for enhancing fairness in graph representation learning. IEEE Transactions on Artificial Intelligence. Cited by: §2.
  • W. Sun, W. Jiang, E. Trulls, A. Tagliasacchi, and K. M. Yi (2020) Acne: attentive context normalization for robust permutation-equivariant learning. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 11286–11295. Cited by: §2.
  • L. Takac and M. Zabovsky (2012) Data analysis in public social networks. In International Scientific Conference and International Workshop. ’Present Day Trends of Innovations’, Vol. 1. Cited by: §5.1.
  • D. Ulyanov, A. Vedaldi, and V. Lempitsky (2016) Instance normalization: the missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022. Cited by: §1, §2, §3.1.
  • P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio (2018) Graph attention networks. In Proc. International Conference on Learning Representations (ICLR), Cited by: §1, §3.
  • Y. Wu and K. He (2018) Group normalization. In Proc. European conference on computer vision (ECCV), pp. 3–19. Cited by: §1, §3.1.
  • R. Xiong, Y. Yang, D. He, K. Zheng, S. Zheng, C. Xing, H. Zhang, Y. Lan, L. Wang, and T. Liu (2020) On layer normalization in the transformer architecture. In Proc. International Conference on Machine Learning (ICML), pp. 10524–10533. Cited by: §1, §3.1, §4.2.
  • K. Xu, W. Hu, J. Leskovec, and S. Jegelka (2018) How powerful are graph neural networks?. In Proc. International Conference on Learning Representations (ICLR), Cited by: §1, §2, §3.
  • K. Xu, C. Li, Y. Tian, T. Sonobe, K. Kawarabayashi, and S. Jegelka (2018) Representation learning on graphs with jumping knowledge networks. In Proc. International Conference on Machine Learning (ICML), pp. 5453–5462. Cited by: §1.
  • K. Xu, J. Li, M. Zhang, S. S. Du, K. Kawarabayashi, and S. Jegelka (2019) What can neural networks reason about?. In Proc. International Conference on Learning Representations (ICLR), Cited by: §1.
  • K. M. Yi, E. Trulls, Y. Ono, V. Lepetit, M. Salzmann, and P. Fua (2018) Learning to find good correspondences. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2666–2674. Cited by: §2.
  • C. Ying, T. Cai, S. Luo, S. Zheng, G. Ke, D. He, Y. Shen, and T. Liu (2021) Do transformers really perform badly for graph representation?. Advances in Neural Information Processing Systems (NeurIPS) 34. Cited by: §1.
  • R. Ying, R. He, K. Chen, P. Eksombatchai, W. L. Hamilton, and J. Leskovec (2018) Graph convolutional neural networks for web-scale recommender systems. In Proc. ACM International Conference on Knowledge Discovery & Data Mining (SIGKDD), pp. 974–983. Cited by: §1.
  • Z. Zeng, R. Islam, K. N. Keya, J. Foulds, Y. Song, and S. Pan (2021) Fair representation learning for heterogeneous information networks. arXiv preprint arXiv:2104.08769. Cited by: §2.

Appendix A Proof of Lemma 1

First, we present the following Lemma, as it will be utilized in the proof of Lemma 1.

Lemma 2.

(Cauchy Interlace Theorem, Horn and Johnson (2012)). Let be a Hermitian matrix of order , and let be a principal submatrix of of order , such that . If

lists the eigenvalues of

and the eigenvalues of , then:


where only when there is a nonzero such that and ; if then there is nonzero such that .
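The interlacing property stated in Lemma 2 can be checked numerically. The sketch below is only an illustration under stated assumptions: it uses a random real symmetric matrix (a special case of a Hermitian matrix) and the simplest principal submatrix, obtained by deleting the last row and column. The variable names are hypothetical and do not follow the notation of the lemma.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6

# Random real symmetric matrix of order n.
m = rng.normal(size=(n, n))
a = (m + m.T) / 2

# Principal submatrix of order n - 1 (drop the last row and column).
b = a[:-1, :-1]

# eigvalsh returns eigenvalues in ascending order.
lam = np.linalg.eigvalsh(a)   # lam[0] <= ... <= lam[n-1]
mu = np.linalg.eigvalsh(b)    # mu[0]  <= ... <= mu[n-2]

# Interlacing: lam[i] <= mu[i] <= lam[i+1] for every i.
tol = 1e-12
interlaced = bool(np.all(lam[:-1] <= mu + tol) and np.all(mu <= lam[1:] + tol))
```

The check holds for any symmetric input, since each eigenvalue of the submatrix is sandwiched between two consecutive eigenvalues of the full matrix. The proof below applies this sandwiching repeatedly to bound the eigenvalues of the shifted matrices.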

The shift operations over different sensitive groups are defined to be:


where for . Let be the singular values of . Then, eigenvalues of are . Let denote the eigenvalues of .
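The structural properties of the group-wise shift operations that this proof relies on can be verified on a small example. The sketch assumes that each shift operation has the usual centering form, the identity minus the outer product of the group indicator vector with itself divided by the group size; this form is an assumption made for illustration, and the names `N0`, `N1`, `e0`, and `e1` are hypothetical.

```python
import numpy as np

# Toy sensitive attribute (hypothetical): three nodes in group 0, two in group 1.
s = np.array([0, 0, 0, 1, 1])
n = s.size

# Indicator vectors of the two (disjoint) sensitive groups.
e0 = (s == 0).astype(float)
e1 = (s == 1).astype(float)

# Assumed group-wise shift (centering) operators.
N0 = np.eye(n) - np.outer(e0, e0) / e0.sum()
N1 = np.eye(n) - np.outer(e1, e1) / e1.sum()
N = N0 @ N1

checks = {
    "symmetric": np.allclose(N0, N0.T) and np.allclose(N1, N1.T),
    "idempotent": np.allclose(N0 @ N0, N0) and np.allclose(N1 @ N1, N1),
    "commute": np.allclose(N0 @ N1, N1 @ N0),
    "product_is_projection": np.allclose(N @ N, N),
    "indicators_in_null_space": np.allclose(N @ e0, 0) and np.allclose(N @ e1, 0),
}
```

All checks pass because the two indicator vectors are orthogonal: the groups are disjoint, so the two rank-one corrections do not interact. As a result, the combined operator is itself a symmetric projection whose null space contains both indicator vectors, which is why two zero eigenvalues appear in the analysis.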

is a projection matrix, for which the following holds:


as both and are symmetric projection matrices onto the orthogonal complement spaces of the subspaces spanned by and , respectively, and commutes. Then, the following decomposition can be written: , where . Note that creates a diagonal matrix with th diagonal entry being equal to

. This decomposition implies that the eigenvalues corresponding to the eigenvectors

are zero, which can be shown as:


The same analysis also holds for . Let and denote zero eigenvalues, and hold. Based on the decomposition of , the following can be written:


where , if the eigenvalues of and are the same. Furthermore, denote by :



denotes an all-zeros matrix with dimensions

. Let denote .


As has the eigenvalues , and , (16) shows that has the eigenvalues .

Let denote , then has the eigenvalues , as .


Further, define matrix such that:


together with eigenvalues . Then, utilizing Lemma 2, (17) and (18), we can conclude that


where or , if and only if there is a right singular vector of such that . The proof of the condition for or is presented below.

Proof: The Cauchy interlace theorem states that the inequalities in (10) become equalities if there is a nonzero such that and or if there is a nonzero such that and . Therefore, for the result in (19), the inequalities become equalities if there is a nonzero such that:


Note that we dropped the subscript of in (20), as it suffices for these conditions to hold for any of the s to turn one of the inequalities in (19) into an equality.


Let , where forms an orthogonal basis for . Therefore, . The conditions presented in (20) can be rewritten based on this definition:


The second condition in (22) demonstrates that, for the inequalities in (19) to become equalities, should lie in the orthogonal complement space of , which is spanned by . Therefore, if the second condition in (22) is satisfied, there exists a vector such that:


In this case, the first condition in (22) becomes:


Therefore, the following equality should hold to meet both criteria in (20):


Equation (25) demonstrates that the conditions in (20) are met if is the eigenvector of associated with eigenvalue . This eigenvector lies in the orthogonal complement space of , and the eigenvector of is the right singular vector of . Therefore, the inequalities in (19) become equalities if there is a right singular vector of such that , which concludes the proof.

is created in the following way:


together with eigenvalues . Furthermore, (16) shows that has the eigenvalues . Again, we can apply the Cauchy interlace theorem presented in Lemma 2 to conclude that:


where or , if and only if there is a right singular vector of such that and . For these conditions leading to equalities or , the proof follows in the same manner as the previous one.

Finally, by unifying the results of (19) and (27), Lemma 1 can be proved, such that:


where or , only if has a right singular vector such that and . Note that s and s are defined to be non-negative; thus, we can omit the powers of in the final result.

Appendix B Learning Environment and Proof for Theorem 1

We first introduce the following Lemma that will be utilized in the proof.

Lemma 3.

(Theorem 1, Cui et al. (2019)) Let be independent Gaussian vectors, where is an real positive definite matrix. Let be a fixed symmetric real matrix. Consider the compound Wishart matrix with . Then for any , the following event