
Fair Interpretable Learning via Correction Vectors

by Mattia Cerrato et al.

Neural network architectures have been extensively employed in the fair representation learning setting, where the objective is to learn a new representation for a given vector which is independent of sensitive information. Various "representation debiasing" techniques have been proposed in the literature. However, as neural networks are inherently opaque, these methods are hard to comprehend, which limits their usefulness. We propose a new framework for fair representation learning centered around the learning of "correction vectors", which have the same dimensionality as the given data vectors. The corrections are then simply added to the original features, and can therefore be analyzed as an explicit penalty or bonus for each feature. We show experimentally that constraining a fair representation learning problem in this way does not impact performance.



1 Introduction

The issue of fairness in machine learning relates to analyzing the outcomes of automated decision systems which may impact people's well-being. In group fairness, one is dealing with statistically disparate outcomes for individuals belonging to different groups (e.g., women and men, black and white people) (Zafar et al. (2017)). One case which attracted much attention is the COMPAS software for recidivism prediction, which can be seen as biased against black people (for a complete view of the Northpointe/ProPublica debate, we refer the reader to the original report by Angwin et al. (2016) and Northpointe's rebuttal by Dieterich et al. (2016)). Focusing on the fact that biased models derive from biased data, many authors have focused on learning fair representations for individuals (Zemel et al. (2013); Cerrato et al. (2020); Moyer et al. (2018); Chouldechova (2017); Louizos et al. (2015)). In this setting, a representation algorithm such as a feedforward neural network trained via backpropagation is paired with an explicit fairness objective. Previous proposals have employed Maximum Mean Discrepancy (Louizos et al. (2015); Gretton et al. (2012)), adversarial learning (Cerrato et al. (2020); Ganin et al. (2016); Xie et al. (2017)) and Mutual Information bounds (Moyer et al. (2018)).

One issue with fair representation learning algorithms based on neural networks is their opaqueness. These methods project the original data space into a latent space whose dimensions are incomprehensible to humans, as they are non-linear combinations of the original features. This is a noteworthy problem especially in the context of the “right to an explanation” as required in the EU by the GDPR, Recital 71. Therefore, these methodologies might be inapplicable in the real world.

In this context, we propose a new fair representation learning framework which learns feature corrections instead of an entirely new space of opaque parameters. In practice, this is akin to a pre-processing technique which changes the original features so as to balance them between individuals belonging to different groups. This guarantees a "right to an explanation" in the sense that it is always possible to extract the "fair correction" that has been applied to each individual's data. Furthermore, as the correction is computed via neural networks, our framework still enjoys all the benefits of the universal approximation theorems (see Cybenko (1989)) and may therefore compute any debiasing function. Our framework is flexible, as it imposes only architectural constraints on the neural network without impacting the training objective: therefore, all neural debiasing methodologies may be extended so as to fit our framework.

Our contributions can be summarized as follows:

  • We develop a new family of fair representation learning algorithms based on neural networks. Our framework is interpretable as it relies on computing “correction vectors”, which are simply added to the original representation.

  • We discuss how to modify various existing algorithms so that they may belong in our framework.

  • We show that extending a state-of-the-art fair representation learning algorithm to be interpretable does not negatively affect performance on either relevance or debiasing.

2 The Interpretable Fair Framework

In this section we describe our framework for interpretable fair representation learning. Our framework makes interpretability possible by means of computing correction vectors. Commonly, the learning of fair representations is achieved by learning a new feature space Z starting from the input space X. To this end, a parameterized function f : X → Z is trained on the data, and some debiasing component which looks at the sensitive data is included. After training, debiased data is available by simply applying the learned function f. Any off-the-shelf model can then be employed on the debiased vectors. Various authors have investigated techniques based on different base algorithms.

The issue with the aforementioned strategy is one of interpretability. While it is possible to guarantee invariance to the sensitive attribute – with much effort – by training classifiers on the debiased data to predict the sensitive attribute, it is unknown what each of the dimensions of Z represents. Depending on the relevant legislation, this can severely limit the applicability of fair representation learning techniques in industry. Our proposal is to mitigate this issue by instead learning fair corrections for each of the dimensions in X. Fair corrections are then added to the original vectors so that the semantics of the algorithm are as clear as possible. For each feature, an individual will have a clear penalty or bonus depending on the sign of the correction. Thus, we propose to learn the latent feature space Z by learning fair corrections w = f(x) and setting z = x + w.

It is very practical to modify existing neural network architectures so that they fit in the aforementioned framework. While there are some architectural constraints that have to be enforced, the learning objectives and training algorithms may be left unchanged. The main restriction is that only "autoencoder-shaped" architectures may belong in our framework. Plainly put, the depth of the network is still a free parameter, just as the number of neurons in each hidden layer. However, to make interpretability possible, the last layer in the network must have the same number of neurons as there are features in the dataset. In a regular autoencoder architecture, this makes it possible to train the network with a "reconstruction loss" which aims to minimize the difference between the original input x and the output f(x), where f is a neural network. This is not necessarily the case in our framework. On top of this restriction, we also add a parameter-less "sum layer" which adds the output of the network to its input, the original features: z = x + f(x). Another way to think about the required architecture under our framework is as a skip-connection in the fashion of ResNets (He et al. (2016)) between the input and the reconstruction layer (see Figure 1).
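As an illustration, the constrained architecture can be sketched in a few lines of NumPy. The layer widths, initialization, and the untrained network f below are hypothetical stand-ins, not the authors' implementation; the sketch only shows the shape constraint and the parameter-less sum layer:

```python
import numpy as np

rng = np.random.default_rng(0)

def init_correction_net(widths):
    # "Autoencoder-shaped": widths[0] == widths[-1] == number of features;
    # the hidden widths (and depth) remain free parameters.
    return [(rng.normal(0.0, 0.1, (m, n)), np.zeros(n))
            for m, n in zip(widths[:-1], widths[1:])]

def forward(params, x):
    # Compute f(x) with ReLU hidden layers and a tanh last layer,
    # then apply the parameter-less "sum layer": z = x + f(x).
    h = x
    for i, (W, b) in enumerate(params):
        h = h @ W + b
        h = np.tanh(h) if i == len(params) - 1 else np.maximum(h, 0.0)
    return x + h, h  # corrected vectors z, correction vectors w

params = init_correction_net([5, 16, 16, 5])
x = rng.normal(size=(3, 5))
z, w = forward(params, x)  # w is directly readable as per-feature corrections
```

Because the last layer matches the input dimensionality, w can be inspected feature by feature: a negative entry is a penalty, a positive one a bonus.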

Figure 1: A gradient reversal-based neural network constrained for interpretability so as to belong in our interpretable framework. The vector w matches x in size, and can therefore be summed with the original representation and analyzed for interpretability. This architectural constraint can be applied to other neural architectures.

Constraining the architecture in the aforementioned way has the effect of making it possible to interpret the neural activations of the last layer in feature space. As mentioned above, our framework is flexible in the sense that many representation learning algorithms can be constrained so as to enjoy interpretability properties. To provide a running example, we start from the debiasing models based on the Gradient Reversal Layer of Ganin et al. (2016), originally introduced in the domain adaptation context and then employed in fairness by various authors (e.g., McNamara et al. (2017); Xie et al. (2017)). The debiasing effect here is enforced by training a sub-network to predict the sensitive attribute s and inverting its gradient when backpropagating it through the main network. Another sub-network learns to predict the target y. Both networks are connected to a main "feature extractor" f. The two models are pitted against one another in extracting useful information for utility purposes (estimating y) and removing information about s (which can be understood as minimizing the mutual information I(z; s), see Cerrato et al. (2020)). Here no modification is needed to the learning algorithm, while the architecture has to be restricted so that the length of the vector z is the same as that of the original features x.
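The gradient reversal mechanism itself is small enough to sketch directly. The class below is an illustrative stand-in (the names and the scaling factor lam are our own), showing only the forward/backward contract rather than a full autograd integration:

```python
import numpy as np

class GradientReversalLayer:
    """Identity in the forward pass; during backpropagation the incoming
    gradient is flipped (and scaled by lam), so the feature extractor is
    pushed to remove information useful for predicting the sensitive
    attribute."""

    def __init__(self, lam=1.0):
        self.lam = lam

    def forward(self, x):
        return x  # features pass through unchanged

    def backward(self, grad_output):
        return -self.lam * grad_output  # reversed, scaled gradient

grl = GradientReversalLayer(lam=0.5)
x = np.array([1.0, -2.0, 3.0])
y = grl.forward(x)            # identical to x
g = grl.backward(np.ones(3))  # each gradient entry becomes -0.5
```

In a real framework this sits between the feature extractor and the sensitive-attribute head, so gradient descent on the head becomes gradient ascent on the extractor.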

One concerning factor is whether the neural activations can really be interpreted in feature space, as features can take arbitrary values or be non-continuous (e.g., categorical). We circumvent this issue by coupling the commonly employed feature normalization step with the activation function of the last neural layer. More specifically, the two functions must map to two coherent intervals of values. As an example, employing standard scaling (feature mean normalized to 0, standard deviation to 1) requires a hyperbolic tangent activation in the last layer. The model is then able to learn a negative or positive correction depending on the sign of the neural activation. It is still possible to use sigmoid activations when the features are normalized to [0, 1] by means of min-max normalization (lowest value for the feature is 0 and highest is 1). Summing up, the debiasing architecture of Ganin et al. (2016) can be modified via the following steps:

  1. Normalize the original input features via some normalization function n.

  2. Set up the neural architecture so that the length of the output f(x) is equal to the length of the input x.

  3. Add a skip-connection between the input and the reconstruction layer.

After training, the corrected vectors z and the correction vectors w can be interpreted in feature space by computing the inverse normalization n^{-1}(z) and the per-feature correction n^{-1}(z) − x.
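The three steps, together with the mapping back to feature space, can be sketched as follows. The correction function f here is an untrained stand-in for the debiased network, and the data and constants are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(10.0, 4.0, size=(100, 3))  # raw, unnormalized features

# Step 1: normalization n (standard scaling, paired with tanh below).
mu, sigma = X.mean(axis=0), X.std(axis=0)
n = lambda v: (v - mu) / sigma
n_inv = lambda v: v * sigma + mu

# Step 2: a correction network whose output matches the feature count;
# tanh keeps each correction in (-1, 1), coherent with standard scaling.
def f(x_norm):
    return np.tanh(0.1 * x_norm)  # stand-in for a trained network

# Step 3: skip-connection ("sum layer") in normalized space.
X_norm = n(X)
w = f(X_norm)
z = X_norm + w

# Interpretation: map back to feature space and read per-feature corrections.
Z_feat = n_inv(z)
w_feat = Z_feat - X  # penalty if negative, bonus if positive
```

Note that with standard scaling, a correction of w in normalized space corresponds to w * sigma in original feature units, which is what w_feat recovers.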

Other neural algorithms can be modified similarly so as to belong in the interpretable fair framework; similar steps can be applied, e.g., to the Variational Fair Autoencoder by Louizos et al. (2015) and the variational bound-based objective of Moyer et al. (2018). In our experiments, we will however focus on the state-of-the-art fair ranking model of Cerrato et al. (2020), which is based on the gradient reversal layer by Ganin et al. (2016).

3 Experiments

In the experiments we constrain the fair ranking model of Cerrato et al. (2020) to belong in our framework. This ranker employs the gradient reversal concept introduced in Ganin et al. (2016). Therefore, as explained in depth in Section 2, it is sufficient to constrain its architecture to extract features which have the same dimensionality as x. After adding the skip-connection between the first layer and the last feature extraction layer, no other changes are needed to the training algorithm, which we leave unchanged from the original work. Therefore, we train the model employing SGD and select hyperparameters (the number of hidden layers, the fairness-relevance trade-off parameter, and the learning rate for SGD) employing nested cross-validation. We relied on the Bayesian Optimization implementation provided by Weights & Biases (Biewald (2020)) and stopped after 200 model fits. To evaluate the models, we computed their nDCG, rND (a disparate impact fairness metric defined in Yang & Stoyanovich (2017)) and GPA (a disparate mistreatment metric defined in Narasimhan et al. (2020)), reported in Table 1. We selected the best models under the assumption that each metric has equal importance.

                      Fair DR             Interpretable Fair DR
COMPAS   1-rND        0.841411 ± 0.073    0.822243 ± 0.065
         1-GPA        0.927383 ± 0.034    0.939985 ± 0.036
         nDCG@500     0.474789 ± 0.085    0.526513 ± 0.067
Bank     1-rND        0.813426 ± 0.004    0.812811 ± 0.023
         1-GPA        0.918763 ± 0.002    0.925992 ± 0.008
         nDCG@500     0.671236 ± 0.005    0.652672 ± 0.014
Table 1: Results for the experimentation performed on the COMPAS and Bank datasets. For all the metrics, higher is better. We observe that the performance of the Fair DirectRanker (Fair DR) by Cerrato et al. (2020) is not impacted meaningfully when constrained for interpretability.
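For reference, the relevance metric in Table 1 can be computed as below. This is a generic nDCG@k sketch (exponential gain form), not necessarily the exact variant used in the original evaluation:

```python
import numpy as np

def dcg_at_k(rels, k):
    # Discounted cumulative gain over the top-k relevance labels.
    rels = np.asarray(rels, dtype=float)[:k]
    discounts = np.log2(np.arange(2, rels.size + 2))
    return np.sum((2.0 ** rels - 1.0) / discounts)

def ndcg_at_k(rels_in_ranked_order, k):
    # Normalize by the DCG of the ideal (descending-relevance) ordering.
    ideal = np.sort(np.asarray(rels_in_ranked_order))[::-1]
    ideal_dcg = dcg_at_k(ideal, k)
    if ideal_dcg == 0.0:
        return 0.0
    return dcg_at_k(rels_in_ranked_order, k) / ideal_dcg

score = ndcg_at_k([3, 2, 1, 0], k=4)  # ideally ordered list -> 1.0
```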

3.1 The COMPAS dataset

We focus our discussion on the COMPAS dataset, one of the most popular datasets in fair classification and ranking. This dataset was published by ProPublica (Angwin et al. (2016)) after a long-term evaluation of the COMPAS tool, short for "Correctional Offender Management Profiling for Alternative Sanctions". This tool is made available to US judges, who may employ it when deciding whether to allow an individual to be released on parole. The rationale here is that an individual who is evaluated as "low risk" could be allowed to pay bail and avoid incarceration while awaiting trial; the opposite follows for individuals who are deemed "high risk". ProPublica found that the tool is biased against black people in the sense that it mis-assigns high risk scores to black people at a higher rate than to white people. As widely done in the literature, we employ the 10 COMPAS classes as relevance classes for our ranking algorithm.

On top of evaluating relevance (via nDCG) and fairness (via rND), we analyze the correction vectors focusing on the "priors_count" feature. This feature represents the number of previous crimes committed by an individual, and investigating how a fair model changes this feature over the two groups available (white and black people) can provide insights into the model's reasoning. We provide the average correction value for our best model in Table 2. In the table, we observe that making the two groups more similar translates into disparate corrections which impact black people more than white people. Here, one could make the argument that changing the attributes of individuals is equivalent to rewriting their personal history, and could be seen as unlawful. This objection merits attention and needs to be investigated further. At this time, we would posit that this issue is common to all fair representation learning algorithms, with the difference that elsewhere the correction is computed by projecting individuals' data into non-readable latent dimensions. The benefit of our framework is that it is possible to investigate this transformation, and possibly refuse the decisions if they are seen as problematic.

                         Black              White              Avg. Difference
priors_count, original   2.406494           1.578894           0.8276
priors_count, corrected  2.211587 (-8.1%)   1.486147 (-5.8%)   0.72544 (-12.4%)
Table 2: Average values for the priors_count feature in the COMPAS dataset over the two ethnicity groups. We observe disparate corrections, i.e. black individuals receive a stronger negative correction.
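The per-group analysis behind Table 2 amounts to comparing group means of a feature before and after correction. A minimal sketch on synthetic (not COMPAS) data:

```python
import numpy as np

def group_correction_report(orig, corrected, groups):
    # Mean original value, mean corrected value, and relative change (%)
    # of one feature, computed separately for each group.
    report = {}
    for g in np.unique(groups):
        mask = groups == g
        o, c = orig[mask].mean(), corrected[mask].mean()
        report[g] = (o, c, 100.0 * (c - o) / o)
    return report

# Tiny synthetic example; values are illustrative only.
groups = np.array(["black", "black", "white", "white"])
priors_orig = np.array([4.0, 2.0, 2.0, 1.0])
priors_corr = np.array([3.5, 1.7, 1.9, 0.9])
report = group_correction_report(priors_orig, priors_corr, groups)
# e.g. report["black"] holds (mean before, mean after, % change); here the
# relative correction for "black" is more negative than for "white".
```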

4 Conclusions and Future Work

In this paper, we presented a new framework for interpretable fair representation learning, which computes correction vectors. Our experimentation shows that losses in performance and fairness metrics are negligible when constraining a state-of-the-art fair ranker for interpretability. One point that needs further investigation is whether learning corrections is a desirable property. While we argue that all representation algorithms learn corrections – usually by projecting into a human-unreadable latent space – we are currently investigating this matter, collaborating with experts in IT law and equality rights.


  • Angwin et al. (2016) Julia Angwin, Jeff Larson, Surya Mattu, and Lauren Kirchner. Machine bias. ProPublica, 2016.
  • Biewald (2020) Lukas Biewald. Experiment tracking with Weights and Biases, 2020. Software available from wandb.com.
  • Cerrato et al. (2020) M. Cerrato, M. Köppel, A. Segner, R. Esposito, and S. Kramer. Fair pairwise learning to rank. In 2020 IEEE 7th International Conference on Data Science and Advanced Analytics (DSAA), pp. 729–738, 2020. doi: 10.1109/DSAA49011.2020.00083.
  • Cerrato et al. (2020) Mattia Cerrato, Roberto Esposito, and Laura Li Puma. Constraining deep representations with a noise module for fair classification. In ACM SAC, 2020.
  • Chouldechova (2017) Alexandra Chouldechova. Fair prediction with disparate impact: A study of bias in recidivism prediction instruments. Big data, 5(2):153–163, 2017.
  • Cybenko (1989) George Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4):303–314, 1989.
  • Dieterich et al. (2016) William Dieterich, Christina Mendoza, and Tim Brennan. COMPAS risk scales: Demonstrating accuracy equity and predictive parity. Northpointe Inc., 2016.
  • Ganin et al. (2016) Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. Domain-adversarial training of neural networks. J. Mach. Learn. Res., 17(1):2096–2030, 2016. ISSN 1532-4435.
  • Gretton et al. (2012) Arthur Gretton, Karsten M Borgwardt, Malte J Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel two-sample test. Journal of Machine Learning Research, 13(Mar):723–773, 2012.
  • He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.
  • Louizos et al. (2015) Christos Louizos, Kevin Swersky, Yujia Li, Max Welling, and Richard Zemel. The variational fair autoencoder. arXiv preprint arXiv:1511.00830, 2015.
  • McNamara et al. (2017) Daniel McNamara, Cheng Soon Ong, and Robert C Williamson. Provably fair representations. arXiv preprint arXiv:1710.04394, 2017.
  • Moyer et al. (2018) Daniel Moyer, Shuyang Gao, Rob Brekelmans, Aram Galstyan, and Greg Ver Steeg. Invariant representations without adversarial training. Advances in Neural Information Processing Systems, 31:9084–9093, 2018.
  • Narasimhan et al. (2020) Harikrishna Narasimhan, Andrew Cotter, Maya Gupta, and Serena Wang. Pairwise fairness for ranking and regression. In AAAI, 2020.
  • Xie et al. (2017) Qizhe Xie, Zihang Dai, Yulun Du, Eduard Hovy, and Graham Neubig. Controllable invariance through adversarial feature learning. In NIPS, 2017.
  • Yang & Stoyanovich (2017) Ke Yang and Julia Stoyanovich. Measuring fairness in ranked outputs. In SSDBM ’17, New York, NY, USA, 2017. Association for Computing Machinery. ISBN 9781450352826.
  • Zafar et al. (2017) Muhammad Bilal Zafar, Isabel Valera, Manuel Gomez Rodriguez, and Krishna P Gummadi. Fairness beyond disparate treatment & disparate impact: Learning classification without disparate mistreatment. In WWW, 2017.
  • Zemel et al. (2013) Richard Zemel, Yu Wu, Kevin Swersky, Toni Pitassi, and Cynthia Dwork. Learning fair representations. In ICML, 2013.