1 Introduction
Recent breakthrough results in artificial intelligence and machine learning, in particular with deep neural networks (DNNs), have also pushed their usage into tasks where the wellbeing of individuals is at stake, such as college admissions or loan assignment. In this situation, one needs to be concerned with the fairness of a neural network, i.e. whether it relies on sensitive, law-protected information such as ethnicity or gender to make its decisions. The usual approach here is either to constrain the output of the network deltr; fair_pair_metric so that it is fair under some definition, or to remove information about the sensitive attribute from the model’s internal representations xie2017controllable; cerrato2020pairwise; mcnamara2017provably; Louizos2016TheVF. The latter techniques might be referred to as “fair representation learning” zemel2013learning. These methodologies learn a projection into a latent space where it can be shown that the information about the sensitive attribute is minimal cerrato2020constraining. This can be evaluated experimentally to correlate with what is known as group fairness under different definitions and metrics cerrato2020pairwise; Louizos2016TheVF. In group fairness, one is concerned with e.g. avoiding situations where a machine learning model might assign positive outcomes at different rates to individuals belonging to different groups (disparate impact zafar2017fairness). Another problematic scenario can be observed when a model displays different error rates over different groups of people (disparate mistreatment zafar2017fairness).
However, one issue in the area of fair representation learning is interpretability. The projection into a latent space makes it hard to investigate why the decisions have been made. This is in open contradiction with recent EU legislation, which calls for a “right to an explanation” for individuals who are subject to automatic decision systems (General Data Protection Regulation, Recital 71 malgieri2020gdpr). In this context, neural networks that are fair might still not be trustworthy enough to fulfill increasingly strict legal requirements.
The interpretability issues in deep neural networks may be summarized in three points interpretabilitysurvey1:

Complexity. Employing models which are simply deeper and “bigger” has enabled performance breakthroughs in both Computer Vision (ResNets he2016deep) and Natural Language Processing (the GPT family of models gpt3). However, these models have extremely high parameter counts (about 40 million in ResNets, over 100 billion in GPT-3). This fact, coupled with the complex nonlinearities usually involved in deep networks, makes them unreadable by humans.

Algorithmic opacity. The training objectives involved are highly nonconvex, and multiple solutions of similar quality may be obtained while keeping the same architecture and training data.

Non-decomposability. Complex engineering systems are usually understood in terms of their fundamental parts and the interactions between them. In machine learning, decision trees are one model that is clearly decomposable: each node represents a decision over a feature based on a well-defined criterion (e.g., the Gini index). While some interpretation of the layers of neural networks is possible, it is still hard to summarize the functionality of complex neural networks over a larger number of layers in a human-understandable fashion.
In this paper, we present a novel framework to constrain fair representation learning so that it is decomposable and therefore human-readable. Our framework is centered around the concept of a correction vector, i.e. a vector of features which is interpretable in feature space and represents the “fairness correction” each data point is subject to so that the results will be statistically fair. We assume vectorial data of not very high dimensionality: image/video data, data from signal processing, text, or high-dimensional omics data, for example, are currently outside the scope of the method. In general, our framework computes a function and achieves decomposability by “mapping back” from the space of latent representations into the original feature space. Therefore, the overall fair representation learning is clearly divided into two stages: a fair preprocessing stage which computes feature corrections and a classification (or ranking) stage which optimizes for a given task. We introduce the computation of fair correction vectors in two different ways:

Explicit Computation. We propose to add a small set of architectural constraints to commonly used fair representation learning models. These constraints, discussed in Section 3.1, allow these architectures to explicitly compute a correction for each data point instead of a projection into a non-interpretable latent space.

Implicit Computation. We leverage invertible neural network architectures (i.e., Normalizing Flows nice; dinh2017realnvp; normflows) and present an algorithm which can map individuals belonging to different groups into a single one. The final result of this computation is a set of new feature values for each individual, as if they belonged to the same group. Here, a correction may still be computed so as to understand what changes have been determined to be necessary to make individuals from different groups indistinguishable from one another. This methodology is discussed in Section 3.2.
Our contributions can be summarized as follows: (i) We propose a new theoretical framework for fair representation learning based around the concept of correction vectors (Section 3). (ii) This framework may be exploited to constrain existing fair representation learning approaches so that they explicitly compute correction vectors. We describe the relevant techniques in Section 3.1. We constrain four different classification and ranking models in such a way and show that losses on fairness and relevance metrics with respect to their non-interpretable versions are negligible. (iii) We show how to implicitly compute correction vectors by employing a pair of Normalizing Flow models. This method achieves state-of-the-art results in fair classification (and ranking) and is described in Section 3.2. We show our experimental results in Section 4. This section also presents the analysis of actual correction vectors. (iv) We discuss the legal standing of the current state-of-the-art for fair representation learning, especially in light of the recent developments in EU legislation, in Section 5.
2 Related Work
Fairness in Machine Learning.
The study of the concept of “fairness” in Machine Learning and Data Mining has a relatively long history, dating back to the 90s when Friedman and Nissenbaum friedman1996bias first expressed concern about automatic decision-making performed by “machines”. The reasoning here is that automatic decision-making should still give a reasonable chance to appeal or discuss the decision. Furthermore, Friedman and Nissenbaum posit that there is a concrete risk of discrimination and “unfairness” which should be defined with particular attention to systemic discrimination and moral reasoning. Since then, various authors (see e.g. Verma and Rubin verma2018fairness; Mehrabi et al. mehrabi2021survey for a survey) have coalesced around a definition involving protected and unprotected groups. Individuals belong to a “protected group” if their innate characteristics have been the subject of systemic discrimination in the past, and an ML algorithm can be said to be group-fair if e.g. it assigns positive outcomes in a balanced fashion between groups (see e.g. Zafar et al. zafar2017fairness for a discussion of actionable group fairness definitions). For this reason, group fairness techniques have also been characterized as “algorithmic affirmative action” by some authors affirmativeaction; zehlike2018reducing. Research on this topic has historically focused on preprocessing methods kamiran2009classifying, fair classification techniques via regularization kamishima2012fairness; zafar2017fairness; dwork2012fairness and rule-based models ruggieri2010data. Fair ranking has also attracted more attention in recent years, in which both regularization techniques zehlike2018reducing; cerrato2020pairwise; fair_pair_metric and post-processing zehlike2017fa have been considered.
More theoretical research has focused on the incompatibility of fairness definitions and the calibration property pleiss2017calibration, while another recent family of approaches strives to ground the group fairness concept in causal and counterfactual reasoning kusner2017counterfactual; wu2019counterfactual.
Fair Representation Learning.
The field of fair representation learning focuses on the learning of representations which are invariant to a sensitive attribute. One of the first contributions to the field by Zemel et al. zemel2013learning employed probabilistic modelling. Representation learning techniques based on neural networks have also become popular. Various authors here have taken inspiration from a contribution by Ganin et al. ganin2016jmlr that focuses on domain adaptation. Ganin et al. propose a “gradient reversal layer”, which may be employed to eliminate information about a source domain when the task is to perform well on a separate but related target domain. When applying this methodology in fair machine learning, the concept of “domains” is adapted to be different values of a sensitive attribute. A line of research has focused on gradient reversal-based models by developing theoretical grounding xie2017controllable; mcnamara2017provably, architectural extensions cerrato2020constraining, and adaptations to fair ranking scenarios cerrato2020pairwise. Other fair representation learning strategies are based on Gretton et al.’s Maximum Mean Discrepancy (MMD) gretton2012kernel, a kernel-based methodology to test whether two data samples have been sampled from different distributions. The relevance of this test to the fair representation learning setting is that it may be formulated in a differentiable form: as such, it has been employed in both domain adaptation tzeng2014deep and fairness Louizos2016TheVF.
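To make the MMD statistic mentioned above concrete, the following is a minimal sketch of a biased estimate of the squared MMD with a Gaussian kernel. The kernel choice, the bandwidth value and the toy samples are assumptions made for illustration only; they are not taken from the cited works.

```python
import math

def rbf_kernel(a, b, bandwidth=1.0):
    """Gaussian (RBF) kernel between two feature vectors (assumed kernel choice)."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))

def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased estimate of the squared MMD between the samples xs and ys."""
    def mean_kernel(us, vs):
        total = sum(rbf_kernel(u, v, bandwidth) for u in us for v in vs)
        return total / (len(us) * len(vs))
    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)

# Hypothetical toy samples for two values of a sensitive attribute.
group_a = [[0.0, 0.1], [0.2, -0.1], [-0.1, 0.0]]
group_b = [[2.0, 2.1], [1.9, 2.2], [2.1, 1.8]]

same = mmd_squared(group_a, group_a)       # identical samples
different = mmd_squared(group_a, group_b)  # well-separated samples
```

Every operation in the estimate is differentiable, which is what allows a quantity of this kind to be used as a training penalty in the approaches cited above.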
Real NVP.
The real NVP (real-valued Non-Volume Preserving transformations) architecture used in this work to obtain fair representations from biased data is a special kind of normalizing flow, a class of learning algorithms designed to perform transformations of probability densities. Recent work has shown that such transformations can be learned by deep neural networks, see e.g. NICE nice and Autoregressive Flows autoregressive. Furthermore, it has been shown that evaluating likelihoods via the change-of-variables formula can be done efficiently by employing the real NVP architecture, which is based on invertible coupling layers dinh2017realnvp. In domain adaptation, the recently developed AlignFlow alignflow, which is a latent variable generative framework that uses normalizing flows normflows; nice; dinh2017realnvp, has been used to transform samples from one domain to another. The data from each domain is modeled with an invertible generative model having a single latent space over all domains. In the context of domain adaptation, the domain class is also known during testing, which is however not the case for fairness. Nevertheless, the general idea of training two Real NVP models is used for AlignFlow as well as for the model proposed in this paper.
3 The Correction Vector Framework
In this section we describe our framework for interpretable fair representation learning. Our framework makes interpretability possible by means of computing correction vectors. Commonly, the learning of fair representations is achieved by learning a new feature space $Z$ starting from the input space $X$. To this end, a parameterized function $f_\theta$ is trained on the data, and some debiasing component is included that takes care of the sensitive data $s$. After training, debiased data is available by simply applying the learned function $f_\theta$. Any off-the-shelf model can then be employed on the debiased vectors. Various authors have investigated techniques based on different base algorithms.
The issue with the aforementioned strategy is one of interpretability. While it is possible to guarantee invariance to the sensitive attribute (with much effort) by showing that classifiers trained on the debiased data would not be able to predict the sensitive attribute, there is still no interpretation for each dimension in the latent representation. Depending on the relevant legislation, this can severely limit the applicability of fair representation learning techniques in industry and society. Our proposal is to mitigate this issue by instead learning fair corrections for each of the dimensions in $X$. Fair corrections are then added to the original vectors so that the semantics of the algorithm are as clear as possible. For each feature, one can obtain a clear understanding of how that feature has been changed so as to counteract the bias in the data. Thus, we propose to learn the latent feature space $Z$ by learning fair corrections $w$: $z = x + w$ and $w = f_\theta(x)$, where $x$ is an original vector and $z$ its corrected counterpart.
In our framework, correction vectors may be learned with two different methodologies. Explicit computation requires constraining a neural network architecture so that it does not leave feature space and is presented in Section 3.1. Different neural fairness methodologies may be constrained in such a way. Implicit computation relies on a pair of invertible normalizing flow models to map individuals belonging to different groups into a single one. Here a correction vector may still be computed, as the new representations are still interpretable in feature space. This methodology is explained in Section 3.2.
3.1 Explicit Computation of Correction Vectors with Feedforward Networks
It is very practical to modify existing neural network architectures so that they can belong in the aforementioned framework. While there are some architectural constraints that have to be enforced, the learning objectives and training algorithms may be left unchanged. The main restriction is that only “autoencoder-shaped” architectures may belong in our framework. Plainly put, the depth of the network is still a free parameter, just as the number of neurons in each hidden layer. However, to make interpretability possible, the last layer in the network must have the same number of neurons as there are features in the dataset. In an autoencoding architecture, this makes it possible to train the network with a reconstruction loss that aims for the minimization of the difference between the original input and the output of the neural network model. However, our framework only introduces the aforementioned architectural constraint and is not restricted to a specific training objective. On top of this restriction, we also add a parameterless “sum layer” which adds the output of the network to its input, the original features. Another way to think about the required architecture under our framework is as a skip-connection in the fashion of ResNets he2016deep between the input and the reconstruction layer (see Figure 2).
Constraining the architecture in the aforementioned way has the effect of making it possible to interpret the neural activations of the last layer in feature space. As mentioned above, our framework is flexible in the sense that many representation learning algorithms can be constrained so as to enjoy interpretability properties. To provide a running example, we start from the debiasing models based on the Gradient Reversal Layer ganin2016jmlr originally introduced in the domain adaptation context and then employed in fairness by various authors mcnamara2017provably; xie2017controllable. The debiasing effect here is enforced by training a subnetwork to predict the sensitive attribute. Another subnetwork learns to predict the target label. Both networks are connected to a main “feature extractor”. The two models are pitted against one another in extracting useful information for e.g. classification purposes (estimating the target label) and removing information about the sensitive attribute (which can be understood as minimizing the information that the learned representation shares with it, see cerrato2020constraining) by inverting the gradient of the sensitive-attribute loss when backpropagating through the main feature extractor network. Here no modification is needed to the learning algorithm, while the architecture has to be restricted so that the length of the output vector is the same as that of the original features. One concerning factor is whether the neural activations can really be interpreted in feature space, as features can take arbitrary values or be non-continuous (e.g., categorical). We circumvent this issue by coupling the commonly employed feature normalization step and the activation functions of the last neural layer. More specifically, the two functions must map to two coherent intervals of values. As an example, employing standard scaling (feature mean is normalized to 0, standard deviation is normalized to 1) requires a hyperbolic tangent activation function in the last layer. The model is then able to learn a negative or positive correction depending on the sign of the neural activation. It is still possible to use sigmoid activations when the features are normalized to $[0, 1]$ by means of a min-max normalization (lowest value for the feature is 0 and highest is 1). Summing up, the debiasing architecture by Ganin et al. can be constrained for explicit computation of correction vectors via the following steps:
Normalize the original input features via some normalization function.

Set up the neural architecture so that the length of the network output is equal to the length of the input vector.

Add a skip-connection between the input and the reconstruction layer.
After training, the corrected vectors and the correction vectors can be interpreted in feature space by applying the inverse normalization to each of them.
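The steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the architecture used in the experiments: the network has a single hidden layer with random, untrained weights, no debiasing objective is attached, the toy data are made up, and all function names are hypothetical. Expressing the correction in original units as the difference between the de-normalized corrected vector and the raw input is also an assumption of this sketch.

```python
import math
import random

def fit_standard_scaler(rows):
    """Per-feature mean and (population) standard deviation."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    stds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / n) or 1.0
            for j in range(d)]
    return means, stds

def normalize(x, means, stds):
    return [(xj - m) / s for xj, m, s in zip(x, means, stds)]

def denormalize(x_norm, means, stds):
    return [xj * s + m for xj, m, s in zip(x_norm, means, stds)]

def dense(x, weights, biases, activation):
    """Fully-connected layer: one row of weights per output neuron."""
    return [activation(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, biases)]

def init_layer(n_in, n_out, rng):
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def corrected_vector(x_norm, hidden, output):
    """Skip connection: the tanh output layer is added to the normalized input."""
    h = dense(x_norm, hidden[0], hidden[1], math.tanh)
    w = dense(h, output[0], output[1], math.tanh)  # correction, same length as x
    z = [xi + wi for xi, wi in zip(x_norm, w)]     # corrected vector
    return z, w

rng = random.Random(0)
data = [[25.0, 1200.0, 3.0], [40.0, 300.0, 0.0],
        [31.0, 5000.0, 1.0], [58.0, 50.0, 7.0]]   # hypothetical raw features
means, stds = fit_standard_scaler(data)            # step 1: normalization
hidden = init_layer(3, 8, rng)                     # width and depth are free
output = init_layer(8, 3, rng)                     # step 2: output size = #features

x = data[0]
z_norm, w_norm = corrected_vector(normalize(x, means, stds), hidden, output)  # step 3
z_features = denormalize(z_norm, means, stds)      # corrected vector, original units
w_features = [zf - xf for zf, xf in zip(z_features, x)]  # correction, original units
```

With standard scaling, the correction in original units reduces to the tanh activation multiplied by the feature's standard deviation, so each correction is bounded by one standard deviation in either direction.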
Other neural algorithms can be modified similarly so as to belong in the interpretable fair framework, and similar steps can be applied to, e.g., the Variational Fair Autoencoder Louizos2016TheVF and the variational bound-based objective of a related approach moyer2018invariant. As previously mentioned, our framework does not require a specific training objective and is therefore flexible. We show this by focusing our experimental validation on extending four different fair representation learning approaches so that they may compute correction vectors. Our experimentation covers a state-of-the-art fair ranking model (AdvDR cerrato2020pairwise), a fair classifier (AdvCls xie2017controllable; cerrato2020constraining; mcnamara2017provably), the aforementioned Variational Fair Autoencoder (VFAE Louizos2016TheVF) and a listwise ranker (DELTR deltr).
3.2 Implicit Computation of Correction Vectors with Normalizing Flows
This model builds upon the real NVP (real-valued Non-Volume Preserving transformations) architecture dinh2017realnvp. For a brief introduction to this method, see Section 7.1 in the supplementary material.
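For readers unfamiliar with coupling layers, the following is a minimal sketch of a single affine coupling layer of the kind real NVP is built from. The `scale_and_shift` function is a hypothetical stand-in for the learned scale and translation networks, and it returns one scalar scale and shift shared by all transformed entries; an actual real NVP layer learns one value per entry.

```python
import math

def scale_and_shift(x_fixed):
    """Hypothetical stand-in for the learned scale and translation networks."""
    total = sum(x_fixed)
    return math.tanh(total), 0.5 * total

def coupling_forward(x, split):
    """Affine coupling: the first `split` entries pass through unchanged,
    the remaining ones are rescaled and shifted based on the first part."""
    x_fixed, x_moved = x[:split], x[split:]
    s, t = scale_and_shift(x_fixed)
    y_moved = [xi * math.exp(s) + t for xi in x_moved]
    log_det = s * len(x_moved)  # log |det Jacobian| of the triangular map
    return x_fixed + y_moved, log_det

def coupling_inverse(y, split):
    """Exact inverse: the unchanged part lets us recompute the same s and t."""
    y_fixed, y_moved = y[:split], y[split:]
    s, t = scale_and_shift(y_fixed)
    x_moved = [(yi - t) * math.exp(-s) for yi in y_moved]
    return y_fixed + x_moved

x = [0.3, -1.2, 0.8, 2.0]
y, log_det = coupling_forward(x, split=2)
x_back = coupling_inverse(y, split=2)
```

Because the Jacobian of such a layer is triangular, its log-determinant is a simple sum, which is what makes the change-of-variables likelihood cheap to evaluate.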
Our proposed method makes use of this methodology to create fair representations that make it hard to distinguish different sensitivity groups, which are identified by their corresponding value of the sensitive attribute. The learned transformation maps back into feature space and can therefore be employed to implicitly compute correction vectors.
In the following, let $X$ be the feature space of a dataset $D$ biased towards the sensitive attribute $s$, which can take $k$ different values. We assume that the dataset is drawn from an unknown distribution $P_X$, with each sensitivity group being drawn from its respective distribution $P_i$ on $X$, $i = 1, \dots, k$.
In order to obfuscate the information on the sensitive attribute, we want to perform a transformation $T: (X, P_X) \to (X, P_p)$, where $p$ is any of the sensitivity groups represented in the data. Since the distributions in question are not generally known, but only sampled with the given dataset, we split the transformation into two transformations $f_X: X \to Z$ and $f_p^{-1}: Z \to X$, where $Z$ is an intermediate latent space on which the transformed data are distributed according to a Gaussian distribution
(1) $f_X(x) \sim \mathcal{N}(0, I)$ for $x \sim P_X$, and $f_p(x) \sim \mathcal{N}(0, I)$ for $x \sim P_p$.
This is simple to train with the loss function
(2) $L_f = -\log p_Z(f(x)) - \log \left| \det J_f(x) \right|$ for $f \in \{f_X, f_p\}$,
with $p_Z$ being the density of the Gaussian in (1) and $J_f$ being the Jacobian of $f$. The target transformation is then constructed as
(3) $T = f_p^{-1} \circ f_X$
as depicted in Figure 3.
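The construction above can be illustrated with a deliberately simplified sketch. Instead of real NVP models, it uses one-dimensional affine flows, for which the change-of-variables negative log-likelihood under a standard normal is minimized in closed form by the sample mean and standard deviation. The class `AffineFlow`, the toy data, the standard-normal target and the choice of fitting the first flow on all data and the second on a pivot group are assumptions of this sketch.

```python
import math

class AffineFlow:
    """One-dimensional invertible map z = (x - mu) / sigma (toy stand-in for a flow)."""

    def __init__(self, samples):
        # For this family, the average negative log-likelihood below is
        # minimized by the sample mean and population standard deviation.
        self.mu = sum(samples) / len(samples)
        self.sigma = math.sqrt(sum((x - self.mu) ** 2 for x in samples) / len(samples))

    def forward(self, x):
        return (x - self.mu) / self.sigma

    def inverse(self, z):
        return z * self.sigma + self.mu

    def nll(self, x):
        """-log N(f(x); 0, 1) - log |df/dx|: the change-of-variables loss."""
        z = self.forward(x)
        return 0.5 * z * z + 0.5 * math.log(2.0 * math.pi) + math.log(self.sigma)

all_data = [1.0, 2.0, 3.0, 4.0, 10.0, 12.0, 14.0, 16.0]  # hypothetical, all groups
pivot_data = [10.0, 12.0, 14.0, 16.0]                     # hypothetical pivot group

flow_all = AffineFlow(all_data)
flow_pivot = AffineFlow(pivot_data)

def to_pivot(x):
    """Composition: first flow forward, then the inverse of the pivot flow."""
    return flow_pivot.inverse(flow_all.forward(x))

mapped = [to_pivot(x) for x in all_data]
mapped_mean = sum(mapped) / len(mapped)
mapped_std = math.sqrt(sum((m - mapped_mean) ** 2 for m in mapped) / len(mapped))
```

In this toy setting the mapped data take on the mean and standard deviation of the pivot group while remaining in the original feature space, which is the property that later allows corrections to be read off feature by feature.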
The fair representation constructed this way, however, might not be useful for a given task and does not take advantage of ground truth information which may be available. In the case that the dataset is a classification or ranking dataset, we add another classification or ranking model into the chain that takes the fair representation as input and predicts a target value or ranking based on these transformed data. In order to achieve good performance on both the fairness and the prediction objectives, we train the whole chain at once with an additional loss function for the prediction objective. This is done by evaluating the gradients of the two flow losses and of the prediction loss separately, where each flow loss is computed using only its own flow and the prediction loss is evaluated using the complete chain. The parameters of each of the two flow models are then updated with a combination of the gradients of its own flow loss and of the prediction loss, while those of the classification or ranking model are only updated with the gradients of the prediction loss. A tuning parameter in this combination controls the tradeoff between prediction and fairness.
This procedure, however, has a caveat: since the learned transformation is diffeomorphic, information on the sensitive attribute is technically not lost, but only obscured. We find in our experiments, discussed in Section 4, that different values of the sensitive attribute are reasonably hard to distinguish in a representation learned in this fashion. Nonetheless, a further modification can improve the fairness still. To this end, we introduce a projection function $\pi$ which sets the first component of the latent representation to a constant. This reduces the degrees of freedom of the modified transformation
(4) $T_\pi = f_p^{-1} \circ \pi \circ f_X$,
where $f_X$ and $f_p$ denote the two flows, and which is not invertible. The goal is now to maximize the overlap of the sensitive attribute with the lost degree of freedom to make sure that as little information on the sensitive attribute is present in the fair representation as possible. This is done by adding another loss term to the training objective that is aimed at predicting the sensitive attribute in the removed dimension of the latent space.
This leads to three different models, the first one being the base model, which only consists of the normalizing flow followed by a prediction network, which in the following we call FairNF. The other two models are the FairNF with the projection included (FairNF+), and the FairNF+ with a binary cross entropy (BCE) loss for predicting the sensitive attribute, called FairNF+BCE (in general, there is no restriction on the loss, but for convenience we consider BCE in the following). If the predictive model is a ranker, we call the complete model FairNF and in the case of classification we call it FairNFCls.
Finally, if the feature space is a vector space (as is usually the case), these three models can be used to compute correction vectors. That is, since the fair representation lives in a linear subspace of the feature space, a correction vector $w$ for any vector $x$ with fair representation $T(x)$ is given by
(5) $w = T(x) - x$.
Therefore, these models do not explicitly compute corrections but still map back into the original feature space. We call this technique implicit computation of correction vectors.
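As a small illustration of implicit computation, the sketch below derives correction vectors as the difference between a transformed vector and the original one. The transformation used here is a hypothetical per-feature affine alignment onto a pivot group, chosen only because it maps back into feature space; it stands in for the learned normalizing flow chain and is not the model evaluated in this paper. The toy groups are made up.

```python
import statistics

def fit_moments(rows):
    """Per-feature mean and population standard deviation."""
    columns = list(zip(*rows))
    return ([statistics.fmean(c) for c in columns],
            [statistics.pstdev(c) for c in columns])

def make_transformation(source_rows, pivot_rows):
    """Toy stand-in for the learned map: align each feature onto the pivot group."""
    src_mu, src_sd = fit_moments(source_rows)
    piv_mu, piv_sd = fit_moments(pivot_rows)

    def transform(x):
        return [pm + ps * (xj - sm) / ss
                for xj, sm, ss, pm, ps in zip(x, src_mu, src_sd, piv_mu, piv_sd)]
    return transform

def correction_vector(x, transform):
    """Difference between the fair representation and the original features."""
    return [tj - xj for tj, xj in zip(transform(x), x)]

group_a = [[2.0, 30.0], [4.0, 50.0], [6.0, 40.0]]  # hypothetical non-pivot group
group_b = [[5.0, 60.0], [7.0, 80.0], [9.0, 70.0]]  # hypothetical pivot group
transform = make_transformation(group_a, group_b)
corrections = [correction_vector(x, transform) for x in group_a]
```

Each entry of a correction vector is expressed in the units of the corresponding feature, so it can be read directly as "this feature was raised or lowered by this amount".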
After learning the latent representation, it is possible to chain either a classifier or a ranker depending on the data at hand. The model may then be trained in an end-to-end fashion. In the classification experiments reported in Section 4, we employed a simple neural classifier trained via cross-entropy. When dealing with ranking data, we instead employed a pairwise strategy inspired by the DirectRanker approach koppel2019pairwise. This model is able to learn a total quasi-order in feature space and is competitive with more complex listwise approaches both in performance and in fairness cerrato2020pairwise. Our chained ranking model is therefore defined as follows:
(6) 
where the network component is a fully-connected neural network and the output function is an antisymmetric, sign-conserving function such as the hyperbolic tangent. The ranking loss is then simply the squared error with respect to the indicator function:
(7) 
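The following is a minimal sketch of a pairwise ranker in this spirit. It assumes a DirectRanker-style form in which the output is the hyperbolic tangent of the difference between the scores of two inputs, and a target of +1, -1 or 0 encoding which of the two labels is larger. Both the exact model form and the target encoding are assumptions of this sketch, and the linear `score` function is a hypothetical stand-in for the fully-connected network.

```python
import math

def score(x, weights):
    """Hypothetical stand-in for the fully-connected network: one linear unit."""
    return sum(w * xi for w, xi in zip(weights, x))

def pairwise_output(x1, x2, weights):
    """Antisymmetric comparison: positive when x1 is ranked above x2."""
    return math.tanh(score(x1, weights) - score(x2, weights))

def ranking_loss(x1, x2, y1, y2, weights):
    """Squared error against a +1 / -1 / 0 target for the label order."""
    target = (y1 > y2) - (y1 < y2)
    return (target - pairwise_output(x1, x2, weights)) ** 2

weights = [0.5, -1.0]          # hypothetical, untrained parameters
doc_a, doc_b = [1.0, 0.0], [0.0, 1.0]

out_ab = pairwise_output(doc_a, doc_b, weights)
out_ba = pairwise_output(doc_b, doc_a, weights)
loss_agree = ranking_loss(doc_a, doc_b, 1, 0, weights)     # labels agree with the model
loss_disagree = ranking_loss(doc_a, doc_b, 0, 1, weights)  # labels contradict the model
```

Swapping the two inputs flips the sign of the output, and comparing a document with itself gives zero, which is what lets a model of this kind induce a consistent ordering.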
To the end of better understanding which FairNF architecture behaves the best with respect to fairness and invariant representations, we have performed experimentation on a simple synthetic dataset. For space and presentation reasons, we include our findings in the supplementary material (Section 7.2).
4 Experiments
To evaluate the different models, we performed experiments on ranking and classification datasets commonly used in the fairness literature. We evaluated each model using different relevance and fairness metrics, and ran a 3 × 3 fold grid search to find the best hyperparameter setting. In the following we describe our experimentation in detail, including the employed evaluation metrics and datasets.
4.1 Evaluation Metrics
Besides the well-known AUROC, which we will call AUC in the following, different fairness and performance measures are used to evaluate the different algorithms. In the following, we will give a short overview of the metrics used in this work.
4.1.1 Ranking Metrics
The NDCG Metric.
The normalized discounted cumulative gain of the top $k$ documents retrieved (or NDCG@$k$) is a commonly used measure for performance in the field of learning to rank. Based on the discounted cumulative gain of the top $k$ documents (DCG@$k$), the NDCG@$k$ can be computed by dividing the DCG@$k$ by the ideal (maximum) discounted cumulative gain of the top $k$ documents retrieved (IDCG@$k$):
$\mathrm{NDCG@}k = \frac{\mathrm{DCG@}k}{\mathrm{IDCG@}k}$,
where the DCG@$k$ is computed over the list of documents sorted by the model with respect to a single query, using the relevance label of each document.
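A minimal sketch of this computation for a single query is given below. It assumes the widely used exponential-gain form of the DCG, with a gain of 2^rel - 1 and a logarithmic position discount; the exact variant used in the experiments is not recoverable from the text, so this choice is an assumption.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain of the first k relevance labels
    (exponential gain, log2 position discount; assumed variant)."""
    return sum((2 ** rel - 1) / math.log2(position + 2)
               for position, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances_in_model_order, k):
    """DCG@k of the model's ordering divided by the DCG@k of the ideal ordering."""
    ideal_order = sorted(relevances_in_model_order, reverse=True)
    idcg = dcg_at_k(ideal_order, k)
    return dcg_at_k(relevances_in_model_order, k) / idcg if idcg > 0 else 0.0

perfect = ndcg_at_k([3, 2, 1, 0], k=4)   # model order equals the ideal order
reversed_ = ndcg_at_k([0, 1, 2, 3], k=4)  # model order is the worst possible
```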
rND.
To the end of measuring fairness in our models, we employ the rND metric ke2017measuring. This metric is used to measure group fairness and is defined as follows:
(8) $\mathrm{rND} = \frac{1}{Z} \sum_{i \in \{10, 20, \dots\}}^{N} \frac{1}{\log_2 i} \left| \frac{|S^{+}_{1 \dots i}|}{i} - \frac{|S^{+}|}{N} \right|$,
where $N$ is the number of documents, $S^{+}$ is the protected group and $S^{+}_{1 \dots i}$ is the set of protected documents among the top $i$. The goal of this metric is to measure the difference between the ratio of the protected group in the top $i$ documents and in the overall population. The maximum value of this metric is given by $Z$, which is also used as normalization factor. This value is computed by evaluating the metric with a dummy list, where the protected group is placed at the end of the list. This biased ordering represents the situation of “maximal discrimination”.
This metric also penalizes if protected individuals at the top of the list are overrepresented compared to their overall representation in the population.
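The metric can be sketched as follows. The cutoff step of 10 and the function names are assumptions of this sketch; the normalization follows the description above, evaluating the same sum on a dummy list with the protected group placed at the end.

```python
import math

def rnd_unnormalized(is_protected, step=10):
    """Sum over cutoffs of the discounted gap between the protected ratio
    in the top-i documents and in the whole list."""
    n = len(is_protected)
    overall_ratio = sum(is_protected) / n
    total = 0.0
    for i in range(step, n + 1, step):
        top_ratio = sum(is_protected[:i]) / i
        total += abs(top_ratio - overall_ratio) / math.log2(i)
    return total

def rnd(is_protected, step=10):
    """rND normalized by its value on the maximally discriminating ordering."""
    n_protected = sum(is_protected)
    worst = [0] * (len(is_protected) - n_protected) + [1] * n_protected
    z = rnd_unnormalized(worst, step)
    return rnd_unnormalized(is_protected, step) / z if z > 0 else 0.0

# 1 marks a protected individual, listed in the order produced by the ranker.
balanced = rnd([0, 1] * 20)                  # protected spread evenly
protected_last = rnd([0] * 20 + [1] * 20)    # maximal discrimination
protected_first = rnd([1] * 20 + [0] * 20)   # protected over-represented at the top
```

In this toy example the ordering with all protected individuals at the top scores about as badly as the one with all of them at the bottom, which illustrates the over-representation penalty noted above.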
Groupdependent Pairwise Accuracy.
Let $G_1, \dots, G_K$ be a set of protected groups such that every instance inside the dataset belongs to one of these groups. The group-dependent pairwise accuracy $A_{G_i > G_j}$ fair_pair_metric is then defined as the accuracy of a ranker on instances which are labeled more relevant belonging to group $G_i$ and instances labeled less relevant belonging to group $G_j$. Since a fair ranker should not discriminate against protected groups, the difference $|A_{G_i > G_j} - A_{G_j > G_i}|$ should be close to zero. In the following, we call the Group-dependent Pairwise Accuracy GPA. We also note that this metric may be employed in classification experiments by considering a classifier’s accuracy when computing $A_{G_i > G_j}$ and $A_{G_j > G_i}$.
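A minimal sketch of this metric for two groups is shown below. The function name, the handling of an empty pair set (returning NaN) and the toy scores are assumptions made for illustration.

```python
def group_pairwise_accuracy(scores, labels, groups, group_hi, group_lo):
    """Fraction of correctly ordered pairs in which the more relevant instance
    belongs to group_hi and the less relevant one to group_lo."""
    correct, total = 0, 0
    for i in range(len(scores)):
        for j in range(len(scores)):
            if (groups[i] == group_hi and groups[j] == group_lo
                    and labels[i] > labels[j]):
                total += 1
                correct += scores[i] > scores[j]
    return correct / total if total else float("nan")

# Hypothetical ranker scores, relevance labels and group memberships.
scores = [0.9, 0.5, 0.4, 0.7]
labels = [1, 0, 1, 0]
groups = ["a", "a", "b", "b"]

acc_a_over_b = group_pairwise_accuracy(scores, labels, groups, "a", "b")
acc_b_over_a = group_pairwise_accuracy(scores, labels, groups, "b", "a")
gpa_gap = abs(acc_a_over_b - acc_b_over_a)
```

Here the ranker orders the relevant instance of group "a" correctly but not that of group "b", so the gap takes its maximal value of 1.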
4.1.2 Classification Metrics
Area Under Discrimination Curve.
We take the discrimination as a measure of the bias with respect to the sensitive feature in the classification zemel2013learning, which is given by:
(9) $\mathrm{yDiscrim} = \left| \frac{\sum_{n: s_n = 1} \hat{y}_n}{\sum_{n: s_n = 1} 1} - \frac{\sum_{n: s_n = 0} \hat{y}_n}{\sum_{n: s_n = 0} 1} \right|$,
where $n: s_n = 1$ denotes that the $n$-th example has a value of $s$ equal to 1 and $\hat{y}_n$ is its predicted class. This measure can be seen as a statistical parity which measures the difference of the proportion of the two different groups for a positive classification. Similar to the AUC, we can evaluate this measure for different classification thresholds and calculate the area under this curve. Using different thresholds, dependencies on the yDiscrim measure can be taken into account. In the following, we call this metric AUDC (Area Under the Discrimination Curve). One issue with this metric is that it may hide high discrimination values on certain thresholds as they will be “averaged away”. We show plots analyzing the accuracy/discrimination tradeoff of our models at various thresholds in Figure 16 in the supplementary material. These were obtained by computing the discrimination and accuracy of our models at 20 different thresholds.
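The two quantities can be sketched as follows. The evenly spaced thresholds on the unit interval and the trapezoidal rule for the area are assumptions of this sketch, as are the function names and toy inputs.

```python
def y_discrim(predictions, sensitive):
    """Absolute difference in positive-prediction rate between the two groups."""
    group_1 = [p for p, s in zip(predictions, sensitive) if s == 1]
    group_0 = [p for p, s in zip(predictions, sensitive) if s == 0]
    return abs(sum(group_1) / len(group_1) - sum(group_0) / len(group_0))

def audc(probabilities, sensitive, n_thresholds=20):
    """Area under the discrimination curve over evenly spaced thresholds
    (threshold grid and trapezoidal rule are assumed)."""
    thresholds = [i / (n_thresholds - 1) for i in range(n_thresholds)]
    values = [y_discrim([int(p >= t) for p in probabilities], sensitive)
              for t in thresholds]
    return sum((values[i] + values[i + 1]) / 2 * (thresholds[i + 1] - thresholds[i])
               for i in range(n_thresholds - 1))

sensitive = [1, 1, 0, 0]
parity = audc([0.2, 0.8, 0.2, 0.8], sensitive)    # same scores in both groups
disparity = audc([0.9, 0.9, 0.1, 0.1], sensitive)  # scores follow the group
```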
Absolute Distance to Random Guess.
To evaluate the invariance of the representation with respect to the sensitive attribute, we report classifier accuracy as the absolute distance from a random guess (the majority class ratio in the dataset), which we call ADRG in the following.
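As a small sketch of ADRG, the function below compares the accuracy of a classifier that tries to recover the sensitive attribute from the learned representation against the majority class ratio. The function name and toy labels are hypothetical.

```python
def adrg(predicted_sensitive, true_sensitive):
    """Absolute distance between accuracy and the majority class ratio."""
    n = len(true_sensitive)
    accuracy = sum(p == t for p, t in zip(predicted_sensitive, true_sensitive)) / n
    majority_ratio = max(true_sensitive.count(v) for v in set(true_sensitive)) / n
    return abs(accuracy - majority_ratio)

true_s = [1, 1, 1, 0]
uninformative = adrg([1, 1, 1, 1], true_s)  # classifier can only guess the majority
leaky = adrg([1, 1, 1, 0], true_s)          # representation reveals the attribute
```

A value close to zero indicates that the classifier does no better than always predicting the majority group, which is the desired outcome for an invariant representation.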
4.2 Datasets
We focused our evaluation on four real-world datasets commonly employed in the fairness literature. The COMPAS dataset was released as part of an investigative journalism effort in tackling automated discrimination machine_bias. The dataset contains two ground truth values: first, a risk score (from 0 to 10) of how likely a person is to commit a crime, and second, whether that person committed a crime in the future. For the ranking experiments, we took the risk score as the target value, while for the classification ones, we tried to predict whether the person committed a crime in the future. For the correction vector analysis (see Section 4.7), we evaluated the feature “prior crimes” and how much it was changed by the interpretable models to make the representations fair.
Moreover, the Adult dataset is used, where the ground truth represents whether an individual’s annual salary is over 50K USD or not adult. It is commonly used in fair classification, since it is biased with respect to gender Louizos2016TheVF; zemel2013learning; cerrato2020constraining. For the correction vector analysis, the feature “capital gain” was analyzed.
The third dataset used in our experiments is the Bank Marketing Data Set banks, where the classification goal is whether a client will subscribe to a term deposit. The dataset is biased against people under 25 or over 65 years. The “employment variation rate” was studied for the correction vector analysis.
The last dataset we used is the Law Students dataset, which contains information relating to 21,792 US-based, first-year law students and was collected to the end of understanding whether the Law School Admission Test (LSAT) in the US is biased against ethnic minorities law_student. As done previously zehlike2018reducing, we subsampled 10% of the total samples while maintaining the distribution of gender and ethnicity, respectively. We used ethnicity as the sensitive attribute, while the ground truth of the dataset is to sort students based on their predicted academic performance. Here, we analysed the change in LSAT score during our correction vector analysis.
4.3 Experimental Setup
To overcome statistical fluctuations during the experiments, we split the datasets into 3 internal and 3 external folds. On the 3 internal folds, a Bayesian optimization technique maximizing the fairness measure (1 - rND for the ranking datasets and 1 - AUDC for the classification datasets) is used to find the best hyperparameter setting. The best setting is then evaluated on the 3 external folds. We relied on the Weights & Biases platform for an implementation of Bayesian optimization and overall experiment tracking wandb.
We experiment with a state-of-the-art algorithm called Fair Adversarial DirectRanker (AdvDR in the rest of this paper), which showed good results on commonly used fairness datasets cerrato2020pairwise, a Debiasing Classifier (AdvCls) based on gradient reversal ganin2016jmlr; cerrato2020constraining; xie2017controllable; mcnamara2017provably, a fair listwise ranker (DELTR deltr) and the Variational Fair Autoencoder (VFAE) Louizos2016TheVF (code was taken from https://github.com/yevgeniintegrateai/VFAE and included into our experimental framework). All these methods were also constrained for explicit computation of correction vectors. In the following we will indicate the models constrained in such a way with “Int” before the name of the model (IntAdvDR, IntVFAE, etc.).
Since, to the best of our knowledge, it is not clear from a theoretical point of view how exactly a real NVP model treats discrete features, the implicit correction vector computation technique (FairNF for ranking and FairNFCls for classification) was trained on continuous features only. For our implementation of the models, the experimental setup, and the datasets used, see https://zenodo.org/record/5572596.
4.4 Model Results
We report our experimentation results for fair ranking in Figure 4. Results for fair classification are included in Figure 5. In all the figures we plot the line of optimal tradeoff as found by a model included in the experimentation. The optimal tradeoff is defined as the smallest value of with . The line for the models already present in the literature (AdvDR, DELTR, AdvCls, VFAE) is drawn dashed; the line for the models we introduced (the explicitly constrained IntAdvDR, IntDELTR, IntAdvCls, IntVFAE; the implicit correction vector computation model FairNF/FairNFCls) is dotted. We also test the relevance/fairness tradeoff in both disparate impact (via the rND and AUDC metrics for ranking and classification respectively) and disparate mistreatment (via GPA) situations.
Overall, we find that the proposed models can be as good as or better than the models we compare against when considering the optimal tradeoff defined above. The performance of the IntAdvDR model is especially impressive, showing higher fairness and relevance than its non-interpretable counterpart in all considered scenarios but one (Figure 4(b)). The fair classifier IntAdvCls shows slightly reduced fairness on average when compared to the AdvCls model; the difference, however, is within the margin of error. Results for DELTR and IntDELTR appear to suffer from larger errors than the other models we considered. Our interpretable model compares favorably on COMPAS (Figures 4(a) and 4(d)) and Banks (Figures 4(c) and 4(f)) in terms of fairness; however, it had issues converging to a fair solution on the LawRace dataset (Figures 4(d) and 4(a)). We hypothesize that this is due to overfitting. The implicit correction vector model displays impressive fair classification performance, finding the optimal tradeoff in all the datasets and metrics we considered (Figures 5(a) through 5(f), FairNFCls). It is also able to find consistently fair rankings (Figures 4(a) through 4(f), FairNF) at the cost of slightly reduced relevance when compared to other methodologies.
4.5 Direct Comparison of Interpretable vs. Non-Interpretable Models
To understand the fairness impact of the architectural constraints we imposed to allow for explicit computation of fairness vectors (Section 3.1), we perform further experimentation. Each of the architectures that was constrained for direct computation of correction vectors was compared to its non-interpretable version, trained as described in the respective paper introducing it. We trained each interpretable/non-interpretable pair of architectures 100 times while searching for the best hyperparameters. Each dataset is split into 3 internal folds which are employed for hyperparameter selection purposes, while 15 external folds are kept separate until test time. We then analyze the average performance on the relevant fairness metrics (1-GPA, 1-rND for ranking; 1-AUDC for classification) on the external folds. We correct the variance estimation as described by Nadeau and Bengio nadeau2003inference and perform a significance t-test. Here the number of runs averaged is the number of test (external) folds, following the recommendations by Nadeau and Bengio, who found little benefit in higher values in terms of statistical power. We report the pairwise comparisons between interpretable and non-interpretable models in Figure 10. Each comparison also reports the p-value obtained by the application of the t-test. We found that the data does not support the presence of statistically significant differences between the fairness performance of our interpretable models and the non-interpretable ones. One model/dataset combination is an exception, however: the LawRace dataset when learning AdvDR ranking models. In Figure 6(b) we observe a statistically significant advantage in GPA when employing the interpretable model, whereas this trend is reversed when considering rND (Figure 6(c)). We hypothesize here that our hyperparameter search, which seeks to find models with the best values of , has found an impressively fair interpretable model for 1-GPA while sacrificing rND performance. The interpretable AdvDR does not seem to suffer much in terms of rND on other datasets, which suggests that it has no issue optimizing rND in general. We conclude that our framework for explicit computation of correction vectors warrants strong consideration in situations where transparency is needed to comply with existing laws and regulations.
4.6 Fair Representations
In fair representation learning, the ultimate objective is removing information about the sensitive attribute from the obtained representation. We investigate this matter by training off-the-shelf rankers and classifiers on the transformed data. The task is to classify the sensitive attribute, i.e. to recover information about it. We show results for this task in Figure 14 and Figure 15 of the supplementary material. The employed models are plain Linear Regression (LR) and Random Forests (RF). We evaluate the absolute difference in accuracy from the random guess (ADRG cerrato2020constraining; cerrato2020pairwise; Louizos2016TheVF). As explained above, this metric is defined as the absolute value of the difference between the accuracy of an external classifier and the accuracy of a classifier which always predicts the majority class. Here, the rationale is that a perfectly fair representation will force a classifier to always predict the majority class, as there is no other way to obtain a higher accuracy value. Therefore, lower scores are better. We also computed the AUC of the employed classifiers. We observe only minimal differences between the algorithms we propose in this paper and the ones already present in the literature. One issue we observed with the VFAE algorithm is that it degrades external classifier performance to the point that the accuracy becomes lower than the random guess. In a binary classification setting, one could invert the decisions undertaken by the model and obtain a higher performance than random guessing: therefore, this model displays fairly high ADRG.
As an aside, gains in representation invariance with respect to the sensitive attribute are very small when considering AUC. To the best of our knowledge, this is the first time that AUC has been considered for this kind of evaluation, with previous literature focusing mostly on accuracy-based measures such as ADRG (see e.g. Louizos2016TheVF; zemel2013learning; cerrato2020constraining). We conclude that employing AUC should be strongly considered in the future when evaluating fair representation learning algorithms.
4.7 Correction Vector Analysis
In this section, we offer a qualitative analysis of the obtained correction vectors for all the models introduced in the paper. To provide a more visual example, we also provide pre- and post-correction visualizations for a simple two-Gaussians dataset, which can be found in Figure 11. This dataset consists of samples from two separate Gaussian distributions which only differ in their mean. More specifically, the mean vectors are and , forming two point clouds which partially overlap, as we opted for a unit-valued variance for both. We employ this dataset in training an interpretable AdvCls cerrato2020constraining; xie2017controllable, which performs explicit corrections (see Section 3.1), and a FairNF model (see Section 3.2). What we observe in the plots is the AdvCls model performing rather intuitive translations, which push the two Gaussian distributions closer in feature space after correction. The FairNF model instead opts to push both distributions into a line-like manifold. We reason that this is a side effect of our strategy to break the bijection property of Normalizing Flows by discarding one of the features in the latent space . We do note that the examples are centered around after correction, meaning that the second Gaussian has been pushed onto the first. It is noteworthy, however, that the original distribution shapes were not kept equal, even if the variance appears to be similar.
After having built a visual intuition about how explicit and implicit corrections might look and differ, we now move on to analyzing the corrections on the real-world datasets described in Section 4.2. Our analysis was performed as follows. For each dataset, we pick a feature for which we expect correction vectors to make individuals belonging to different groups more similar. We then compute the average correction for each group and compute the average difference in the chosen feature in absolute and relative terms. A table with all average corrections for each model and dataset can be found in Figure 13 in the Supplementary Material. What stands out from these results is that different models often disagree about both the sign and the intensity of the correction that should be applied. For instance, on the COMPAS dataset, we would expect a correction to reduce the difference in the average amount of previous crimes committed between white and black people. While our implicit computation methodology does just that, the VFAE constrained for explicit computation of correction vectors opts to widen that gap. Nonetheless, it obtains an overall fair result (see Figures 5(a), 5(d), 6(d) and 6(b)). Similar extreme corrections, which are definitely not intuitive, were computed by the interpretable AdvCls and FairNF models on the Adult dataset. One noteworthy trend is that simpler architectures computed more intuitive corrections. To be more specific, both the AdvDR and DELTR models employ only a single neural layer after computing the corrected features. In both models this layer has a single neuron which outputs the ranking scores ( for the pairwise AdvDR and for the listwise strategy DELTR). This severely constrains the complexity of the transformation learned by the models after feature vector computation and is in stark contrast with the other models, which have no such restrictions.
DELTR and AdvDR are able to clearly close the gap in average feature values in all considered datasets. The other models were also able to output reasonable corrections. For instance, our FairNF model opted to almost nullify the gap in prior crimes committed on the COMPAS dataset. However, these models appear to be less consistent in this regard: FairNF instead opted to amplify the average difference in employment variation rate on the Adult dataset. Interestingly, this model is the strongest tradeoff performer on that same dataset (Figure 5(b)). As previously discussed, we hypothesize that the absence of restrictions on how many layers may be included after the computation of the correction vectors may limit the intuitiveness of some of the corrections. This issue could be addressed in the future by further constraining the maximum computable value for correction vectors. These matters notwithstanding, the opportunity to peek inside the “black box” of fair representation learning might also guide how a practitioner would select models: a model which computes corrections that are deemed counterintuitive or excessive may be discarded during model selection. Therefore, our framework adds a new layer of opportunities in developing models that can be trustworthy and transparent in addition to fair.
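The per-group analysis described above can be sketched as follows. This is a minimal illustration for a single feature and two groups, not the code used to produce our tables; the function names and the choice of reporting the change in the absolute gap are assumptions made for this example.

```python
from statistics import mean


def group_mean(values, groups, g):
    """Mean of the values belonging to group g."""
    return mean(v for v, gi in zip(values, groups) if gi == g)


def correction_summary(original, corrected, groups, g_a, g_b):
    """Average correction per group and change of the between-group gap."""
    corrections = [c - o for o, c in zip(original, corrected)]
    gap_before = group_mean(original, groups, g_a) - group_mean(original, groups, g_b)
    gap_after = group_mean(corrected, groups, g_a) - group_mean(corrected, groups, g_b)
    # Negative values mean that the gap between the groups has shrunk.
    abs_change = abs(gap_after) - abs(gap_before)
    rel_change = abs_change / abs(gap_before) if gap_before != 0 else float("nan")
    return {
        "mean_correction": {g: group_mean(corrections, groups, g) for g in (g_a, g_b)},
        "gap_before": gap_before,
        "gap_after": gap_after,
        "absolute_change": abs_change,
        "relative_change": rel_change,
    }
```

A correction that widens the gap, as observed for some models, would show up here as a positive `absolute_change`.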
5 Correction Vectors: a Legal Perspective
Our proposal has to be discussed in the context of the GDPR GDPR and the more recent regulation of AI in the EU aiact. As an expression of Article 8 of the EU Charter of Fundamental Rights fundrights and Article 16 of the Treaty on the Functioning of the EU eutreaty, the GDPR protects the right to informational self-determination, according to which everyone should be able to decide which personal data they wish to disclose and by whom it may be used. Article 22 (1) of the GDPR stipulates a basic right for data subjects “not to be subject to a decision based solely on automated processing”. Paragraph 2 contains exceptions to the general prohibition of automated decision-making. These exceptions are linked to some of the key transparency requirements of the GDPR that are outlined in Articles 13-15 GDPR and constitute a data subject’s right to be informed. Following these transparency requirements, data subjects are to be provided with “meaningful information about the logic involved” [Footnote 5: Recital 63 of the GDPR; Articles 13 (2)(f), 14 (2)(g), 15 (1)(h) of the GDPR GDPR.].
With this legal background, two potential problems arise in the context of our proposal: firstly, which content requirements are to be met when having to disclose “the logic involved” and, secondly, how they can be realised in practice [Footnote 6: See Hoeren and Niehoff, 2018 hoeren2018ki 1 RW 47, 55.]. In general, only the principles [Footnote 7: The French version of the GDPR speaks explicitly of revealing only the logic that forms the basis: “la logique qui sous-tend leur éventuel traitement automatisé”. This is also the case with the Dutch version: “welke logica er ten grondslag ligt”.] behind a decision have to be stated under the GDPR, not the algorithm formula per se [Footnote 8: See Recital 63 of the GDPR GDPR; BGHZ 200, 28; Hoeren and Niehoff, 2018 hoeren2018ki 1 RW 47, 56.]. The foundation of the decision only needs to be comprehensible, not recalculable [Footnote 9: See Hoeren and Niehoff, 2018 hoeren2018ki 1 RW 47, 57; and OLG Nürnberg ZD 2013, 26.]. Areas of possible concern when legally assessing the scope of the transparency requirement lie in the application to deep neural networks, where the model is trained to develop its own set of decision-making rules. As these rules are accessible but, for now, not easily interpretable by humans, the relevant fundamentals of the decision-making process cannot be disclosed.
It becomes clear that this “black box” problem and the way it limits legal compliance in practice had not been considered when the GDPR was brought to life [Footnote 10: See Hoeren and Niehoff, 2018 hoeren2018ki 1 RW 47, 58; and DFK Bitkom, 2017 dfk.]. Thus, the transparency requirements of the GDPR must be interpreted retrospectively and with regard to the functioning of deep neural networks. In contrast to other state-of-the-art fair representation algorithms, where all underlying rules for a decision remain unrecognisable, the introduction of the presented correction vector in the decision-making process offers an alternative approach to transparency when using deep neural networks. While the way this set of rules develops remains part of the “black box”, the correction vector itself becomes “human-readable”, which may satisfy the GDPR requirement of disclosing the “logic involved”. This assumption will have to be legally assessed in more detail.
A second area to be considered in the legal context of our proposal is anti-discrimination rights. In addition to striving for transparency, the introduction of the correction vector aims to meet an even further-reaching fairness concept. Models that learn from past data tend to perpetuate existing structures and thus solidify them. The modeling approach taken here is to counteract this trend by trying to remove biased information. The general objective is to promote equal opportunities and distributive justice by explicitly penalizing privileged groups on the one hand and by improving unprivileged groups on the other, in order to make them more similar and comparable to one another. The way this process is set up may, however, affect anti-discrimination rights. The EU Commission’s recently proposed “first ever legal framework on AI” euproposal focuses on the possible impact on fundamental rights when AI technology is applied. The approach of the EU Commission is not to deal directly with AI techniques as such but rather with their applications and associated risks. Four categories of risk are introduced in the proposal: unacceptable risk, high risk, limited risk, and minimal risk. The principle behind the categories is: “the higher the risk of a specific type of use, the stricter the rules” times. With this proportionate and risk-based approach, the EU aims to create clearer rules for legally compliant and trustworthy use of AI, even if full disclosure of the underlying rules for the decision-making process cannot be ensured times. As these recent developments also concern deep neural networks, it will have to be clarified if and how the implementation of our proposed correction vector affects fundamental rights such as anti-discrimination rights.
These potential legal issues will have to be determined in more detail. Nevertheless, by taking into account the need for more transparency while adding fairness aspects to automated decision-making processes through a correction vector, our framework for interpretable fair representation learning should be suitable for contributing to a legally secure and trustworthy use of AI in the EU, and may inspire further discussion.
6 Conclusions and Future Work
In this paper we presented a framework for interpretable fair learning based around the computation of correction vectors. Our experimental results show that existing methodologies may be constrained for explicit computation of correction vectors with negligible losses in performance. Furthermore, our implicit methodology for correction vector computation showed the overall best accuracy/fairness tradeoffs in the considered datasets. While our overall framework is decomposable and allows for inspection of the learned representations, it still relies on two different black-box components: a future development of our methodology could look into employing neural network models which are themselves explainable. The intuitiveness of the corrections might also benefit from added constraints on their absolute value, so as to limit maximum corrections. Another direction for future research could explore extending our framework so that it could handle non-vectorial data such as text. While the models presented in this paper have no specific properties that limit their applicability to tabular data, their architecture has not been optimized for handling other types of data yet. One could explore, as an example, including convolutional layers when analyzing image data or a recurrent mechanism for text. In this field, some authors have focused on learning fair word vectors zhang2020hurtful so that generative language models do not display problematic associations such as “black” and “violence”. On the other hand, we are currently not aware of image datasets which lend themselves to group fairness tasks. In this regard, we plan to explore whether the techniques included in this paper might be generalized to interpretable transformations of images. Domain adaptation tasks could be the target of this line of research (see, e.g., AlignFlow alignflow).
Lastly, in the light of recent developments in regulation at the EU level, we reason that our correction vector framework is able to open up the black box of fair DNNs. The legal impact of explicitly computing corrections is, however, not trivial: some individuals belonging to non-protected groups might be explicitly penalized in their features, which might in turn affect anti-discrimination rights. It stands to reason, however, that non-interpretable fair models also compute penalties on complex nonlinear combinations of the input features. Our framework, on the other hand, lets human analysts peek inside the “black box” of representation learning and see for themselves what the model has learned in a human-readable format. As such, it adds further possibilities for scrutiny of fair models beyond performance on standard metrics, adding to their transparency and trustworthiness.
7 Supplementary Material
7.1 An Introduction to the Real NVP Architecture
In the following we present a brief introduction to the real NVP architecture dinh2017realnvp in order to make the present paper as self-contained as possible.
Given a $D$-dimensional feature space, the real NVP architecture implements a function $f: \mathbb{R}^D \to \mathbb{R}^D$ and achieves invertibility by employing so-called affine coupling layers, which leave part of their input invariant while transforming the remaining degrees of freedom. To be more specific, for $x \in \mathbb{R}^D$, $d < D$, an affine coupling layer performs the transformation given by
$$ y_{1:d} = x_{1:d}, \qquad y_{d+1:D} = x_{d+1:D} \odot \exp\big(s(x_{1:d})\big) + t(x_{1:d}), \tag{10} $$
where $\odot$ is the Hadamard product. $s$ and $t$ would usually be implemented by neural network structures of arbitrary kind. This transformation can easily be inverted via the transformation
$$ x_{1:d} = y_{1:d}, \qquad x_{d+1:D} = \big(y_{d+1:D} - t(y_{1:d})\big) \odot \exp\big(-s(y_{1:d})\big). \tag{11} $$
Since an affine coupling layer only transforms part of its input, a real NVP usually consists of several such layers that transform different components of the input in tandem, so that all components are eventually transformed.
Observing that the transformations (10) and (11) are differentiable, the function $f$ defined by a real NVP is in fact a diffeomorphism. This fact allows for the interpretation of $f$ as a transformation between probability distributions. Given data distributed according to a probability density function (pdf) $p_X$ on a feature space $X$, the transformed data is distributed according to the pdf $p_Y$ on the image $Y = f(X)$ of $X$ under $f$, given by
$$ p_Y(y) = p_X\big(f^{-1}(y)\big)\, \big|\det J_f\big(f^{-1}(y)\big)\big|^{-1}, \tag{12} $$
with $|\det J_f|$ being the modulus of the Jacobian determinant of $f$, which can easily be calculated as the exponential of the sum of all output neurons of all $s$ layers within the net.
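To make the mechanics of a single affine coupling layer concrete, the following minimal sketch implements the forward pass, the inverse pass and the log-determinant of the Jacobian in plain Python. It follows the standard formulation of Dinh et al. dinh2017realnvp; the scale and translation functions `toy_s` and `toy_t` are hypothetical stand-ins for the neural networks that would be used in practice.

```python
import math


def coupling_forward(x, d, s, t):
    """Affine coupling layer: the first d components pass through unchanged,
    the remaining ones are scaled and shifted. Returns (y, log|det J|)."""
    x1, x2 = list(x[:d]), list(x[d:])
    scale, shift = s(x1), t(x1)
    y2 = [xi * math.exp(si) + ti for xi, si, ti in zip(x2, scale, shift)]
    # The Jacobian is triangular, so its log-determinant is the sum of the scales.
    return x1 + y2, sum(scale)


def coupling_inverse(y, d, s, t):
    """Exact inverse of coupling_forward; s and t are never inverted themselves."""
    y1, y2 = list(y[:d]), list(y[d:])
    scale, shift = s(y1), t(y1)
    x2 = [(yi - ti) * math.exp(-si) for yi, si, ti in zip(y2, scale, shift)]
    return y1 + x2


# Hypothetical toy scale/translation functions for a 3-dimensional input, d = 1.
def toy_s(x1):
    return [0.5 * x1[0], -0.25 * x1[0]]


def toy_t(x1):
    return [x1[0] + 1.0, 2.0 * x1[0]]
```

Note that invertibility does not depend on the form of the scale and translation functions, which is why arbitrary networks can be used for them.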
7.2 Developing the FairNF Architecture
As described extensively in Section 3.2, pairing two real NVP models might not be enough to obtain fairness. While it is, in principle, possible to transform samples from one distribution to another, the transformation does not remove information about the sensitive data since it is a bijection. We recall here our two proposals to solve this issue: (i) break the bijection by using a function which sets the first component of the latent feature vector to 0; (ii) maximize the overlap between and by employing its value, before deletion, to predict the sensitive attribute via a cross-entropy loss. In this section we refer to the base real NVP model as FairNF; the model implementing (i) is referred to as FairNF+; the model implementing (ii) is referred to as FairNF+BCE.
To evaluate whether the transformed distributions of the three different FairNF models are mixing the two sensitive groups well enough, we propose the following procedure. We choose a sensitivity group whose value of the sensitive attribute we take to be . For each instance we determine the set of its nearest neighbors and count the number of these nearest neighbors belonging to the same sensitivity group as . Furthermore we calculate the average ratio
(13) 
In the case of a well mixed dataset this ratio should converge to the overall dataset ratio of when increasing . From this, we define a mixture metric by averaging over different values for up to a predefined threshold :
(14) 
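The procedure described above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than our exact implementation: neighbors are found by brute-force Euclidean distance, the instance itself is excluded from its own neighborhood, and the final metric is a plain average over neighborhood sizes 1 up to the threshold. The function names are hypothetical.

```python
import math


def same_group_ratio(points, groups, target_group, k):
    """Average fraction of the k nearest neighbors that share the target group,
    taken over all instances belonging to the target group."""
    ratios = []
    for i, p in enumerate(points):
        if groups[i] != target_group:
            continue
        others = [j for j in range(len(points)) if j != i]
        others.sort(key=lambda j: math.dist(p, points[j]))
        same = sum(1 for j in others[:k] if groups[j] == target_group)
        ratios.append(same / k)
    return sum(ratios) / len(ratios)


def mixture_metric(points, groups, target_group, k_max):
    """Average of the same-group ratio over neighborhood sizes 1..k_max."""
    total = sum(
        same_group_ratio(points, groups, target_group, k)
        for k in range(1, k_max + 1)
    )
    return total / k_max
```

For two well-separated clusters the value is close to 1, whereas for a well-mixed dataset it approaches the overall share of the chosen group.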
To evaluate the proposed metric and quantify which of the three FairNF algorithms achieves the better mixing of the target data, we generate two sets of synthetic data by randomly sampling two features from different normal distributions. In the first case, both sensitive groups of the generated data have the same mean and standard deviation. In the second case we choose different means for the two distributions, representing two separate groups with different values of a sensitive attribute ( and ). We consider the first synthetic dataset as well mixed and use it in the following as a fair baseline, while the second dataset is considered to be unfair and will be further transformed by the three different kinds of FairNF models. We plot in Figure 8 over different NN values. It is clear that the value for FairNF+BCE is closer to the fair baseline, while the other two FairNF models are not closer to the unfair dataset. For higher NN values, the FairNF curve drops below the values of the FairNF+ model. This behavior shows that naively removing one component of the latent space does not lead to fairer results. Nevertheless, the approach of the FairNF+BCE model seems to be able to mix the sensitive attributes in the target space well.
A representation of how this simple two-Gaussians dataset is transformed after application of the three different FairNF models may be found in Figure 9. The axes display feature values. In Figure 9(a) the fair dataset is shown, while Figure 9(b) shows the unfair data which will be transformed. In Figure 9(c), Figure 9(d) and Figure 9(e) the transformed data is shown. The first one shows the data transformed by the FairNF algorithm. The second figure was obtained using FairNF+, while the third figure shows the result of the FairNF+BCE model. For both the FairNF+ and the FairNF+BCE, we observe a strong mixing of the two sensitivity groups, whereas the base FairNF still keeps the groups apart in the latent representation. After gathering this evidence, we decided to employ the FairNF+BCE model in all our experiments, which are reported in Section 4 in the main body of the paper.
7.3 Datasets
In the following we give an extended description of the four datasets we employed in our experimentation. We also clarify how they might be employed in classification and ranking settings. The COMPAS dataset was released as part of an investigative journalism effort in tackling automated discrimination machine_bias. We employed the original risk score in our ranking experiments; an indicator variable representing whether an individual committed a crime in the following two years was employed in our classification experiments. In terms of correction vectors analysis (see Section 4.7) we evaluated the feature “prior crimes” and how much it was changed by the interpretable models to make the representations fair. Moreover, the Adult dataset is used, where the ground truth represents whether an individual’s annual salary is over $50K or not adult. It is commonly used in fair classification, since it is biased with respect to gender Louizos2016TheVF; zemel2013learning; cerrato2020constraining. For the correction vectors analysis, the feature “capital gain” was analyzed. The third dataset used in our experiments is the Bank Marketing Data Set banks, where the classification goal is to predict whether a client will subscribe to a term deposit. The dataset is biased against people under 25 or over 65 years. The “employment variation rate” was studied for the correction vectors analysis. The last dataset we used is the Law Students dataset, which contains information relating to 21,792 US-based, first-year law students law_student. As done previously zehlike2018reducing, we subsampled 10% of the total samples while maintaining the distribution of gender and ethnicity. We used ethnicity as the sensitive attribute while sorting students based on their predicted academic performance. Here, we analysed the change in LSAT score during our correction vectors analysis.
7.4 Metrics
7.4.1 Ranking Metrics
The NDCG Metric.
The normalized discounted cumulative gain of the top $k$ documents retrieved (or NDCG@$k$) is a commonly used measure of performance in the field of learning to rank. Based on the discounted cumulative gain of the top $k$ documents (DCG@$k$), the NDCG@$k$ can be computed by dividing the DCG@$k$ by the ideal (maximum) discounted cumulative gain of the top $k$ documents retrieved (IDCG@$k$):
$$ \mathrm{NDCG@}k = \frac{\mathrm{DCG@}k}{\mathrm{IDCG@}k}, \qquad \mathrm{DCG@}k = \sum_{i=1}^{k} \frac{2^{r(d_i)} - 1}{\log_2(i + 1)}, $$
where $d_1, d_2, \ldots, d_n$ is the list of documents sorted by the model with respect to a single query and $r(d_i)$ is the relevance label of document $d_i$.
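As an illustration, the NDCG computation can be sketched as follows. The sketch assumes the common exponential-gain form of the DCG, with gain $2^{r} - 1$ and a logarithmic position discount; other variants exist in the literature, and the function names are hypothetical.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain of the first k relevance labels,
    assuming the exponential gain 2^r - 1 and a log2 position discount."""
    return sum(
        (2 ** r - 1) / math.log2(i + 2)  # i is 0-based, so position = i + 1
        for i, r in enumerate(relevances[:k])
    )


def ndcg_at_k(relevances, k):
    """DCG of the model's ordering divided by the DCG of the ideal ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0
```

Here `relevances` holds the relevance labels in the order produced by the model for a single query.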
rND.
To measure fairness in our models, we employ the rND metric ke2017measuring. This metric is used to measure group fairness and is defined as follows:
$$ \mathrm{rND} = \frac{1}{Z} \sum_{i \in \{10, 20, \ldots\}}^{N} \frac{1}{\log_2 i} \left| \frac{|S^{+}_{1 \ldots i}|}{i} - \frac{|S^{+}|}{N} \right| \tag{15} $$
Here $|S^{+}_{1 \ldots i}|$ is the number of protected individuals among the top $i$ documents, $|S^{+}|$ is the number of protected individuals overall, and $N$ is the total number of documents. The goal of this metric is to measure the difference between the ratio of the protected group in the top $i$ documents and in the overall population. The maximum value of this metric is given by $Z$, which is also used as normalization factor. This value is computed by evaluating the metric with a dummy list, where the protected group is placed at the end of the list. This biased ordering represents the situation of “maximal discrimination”.
This metric also penalizes rankings in which protected individuals at the top of the list are over-represented compared to their overall representation in the population.
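A minimal sketch of the rND computation is given below. It assumes the usual cut-off points at every tenth position and obtains the normalization factor from the dummy ordering described above, with all protected individuals placed at the end of the list; the function names and the `step` parameter are assumptions of this example.

```python
import math


def _rnd_unnormalized(is_protected, step=10):
    """Sum of discounted differences between the protected ratio in the
    top-i prefix and in the whole list, for i = step, 2*step, ..."""
    n = len(is_protected)
    overall = sum(is_protected) / n
    total = 0.0
    for i in range(step, n + 1, step):
        top_ratio = sum(is_protected[:i]) / i
        total += abs(top_ratio - overall) / math.log2(i)
    return total


def rnd(is_protected, step=10):
    """rND of a ranking given as a list of 0/1 protected-group flags,
    ordered from the top of the ranking to the bottom."""
    n_protected = sum(is_protected)
    # Dummy list of "maximal discrimination": protected group at the end.
    worst = [0] * (len(is_protected) - n_protected) + [1] * n_protected
    z = _rnd_unnormalized(worst, step)
    return _rnd_unnormalized(is_protected, step) / z if z > 0 else 0.0
```

Because the metric uses an absolute difference, a ranking with all protected individuals at the top is penalized in the same way as one with all of them at the bottom.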
Group-dependent Pairwise Accuracy.
Let $G_1, \ldots, G_K$ be a set of protected groups such that every instance inside the dataset belongs to one of these groups. The group-dependent pairwise accuracy $A_{G_i > G_j}$ fair_pair_metric is then defined as the accuracy of a ranker on pairs of instances in which the instance labeled more relevant belongs to group $G_i$ and the instance labeled less relevant belongs to group $G_j$. Since a fair ranker should not discriminate against protected groups, the difference $|A_{G_i > G_j} - A_{G_j > G_i}|$ should be close to zero. In the following, we call the Group-dependent Pairwise Accuracy GPA. We also note that this metric may be employed in classification experiments by considering a classifier’s accuracy when computing $A_{G_i > G_j}$ and $A_{G_j > G_i}$.
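The following sketch illustrates the computation for two groups. It is a brute-force illustration under the assumption that a pair counts as correctly ranked when the more relevant instance receives a strictly higher score; the function names are hypothetical.

```python
def pairwise_accuracy(scores, labels, groups, g_hi, g_lo):
    """Accuracy on pairs where the more relevant instance belongs to g_hi
    and the less relevant instance belongs to g_lo."""
    correct = 0
    total = 0
    for i in range(len(scores)):
        if groups[i] != g_hi:
            continue
        for j in range(len(scores)):
            if groups[j] != g_lo or labels[i] <= labels[j]:
                continue
            total += 1
            if scores[i] > scores[j]:
                correct += 1
    return correct / total if total else float("nan")


def gpa(scores, labels, groups, g_a, g_b):
    """Absolute difference between the two cross-group pairwise accuracies."""
    acc_ab = pairwise_accuracy(scores, labels, groups, g_a, g_b)
    acc_ba = pairwise_accuracy(scores, labels, groups, g_b, g_a)
    return abs(acc_ab - acc_ba)
```

A value of zero indicates that the ranker orders cross-group pairs equally well in both directions.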
7.4.2 Classification Metrics
Area Under Discrimination Curve.
We take the discrimination as a measure of the bias with respect to the sensitive feature $s$ in the classification zemel2013learning, which is given by:
$$ \mathrm{yDiscrim} = \left| \frac{\sum_{n: s_n = 1} \hat{y}_n}{\sum_{n: s_n = 1} 1} - \frac{\sum_{n: s_n = 0} \hat{y}_n}{\sum_{n: s_n = 0} 1} \right| \tag{16} $$
where $s_n = 1$ denotes that the $n$-th example has a value of $s$ equal to 1 and $\hat{y}_n$ is the predicted class of the $n$-th example. This measure can be seen as a statistical parity measure, i.e. the difference between the proportions of positive classifications in the two groups. Similar to the AUC, we can evaluate this measure for different classification thresholds and calculate the area under this curve. By using different thresholds, the dependence of the yDiscrim measure on the threshold can be taken into account. In the following, we call this metric AUDC (Area Under the Discrimination Curve). One issue with this metric is that it may hide high discrimination values at certain thresholds, as they will be “averaged away”. We show plots analyzing the accuracy/discrimination tradeoff of our models at various thresholds in Figure 16. These were obtained by computing the discrimination and accuracy of our models at 20 different thresholds on the interval . Overall we find that our models find sensible fairness/accuracy tradeoffs for all the thresholds we considered.
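A minimal sketch of both quantities follows. It assumes a binary sensitive attribute, equally spaced thresholds on the interval from 0 to 1, and trapezoidal integration of the discrimination curve; these choices and the function names are assumptions of this example and not necessarily those of our implementation.

```python
def y_discrim(y_pred, s):
    """Absolute difference in positive-prediction rate between the groups
    with sensitive attribute 1 and 0."""
    pos = [y for y, si in zip(y_pred, s) if si == 1]
    neg = [y for y, si in zip(y_pred, s) if si == 0]
    return abs(sum(pos) / len(pos) - sum(neg) / len(neg))


def audc(probs, s, n_thresholds=20):
    """Area under the discrimination curve over equally spaced thresholds
    in [0, 1] (assumed interval), using the trapezoidal rule."""
    thresholds = [i / (n_thresholds - 1) for i in range(n_thresholds)]
    discs = [
        y_discrim([1 if p >= th else 0 for p in probs], s) for th in thresholds
    ]
    return sum(
        (thresholds[i + 1] - thresholds[i]) * (discs[i] + discs[i + 1]) / 2
        for i in range(n_thresholds - 1)
    )
```

Inspecting the per-threshold discrimination values alongside the area helps to detect the high values that the averaging may otherwise hide.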
Absolute Distance to Random Guess.
To evaluate the invariance of the representation with respect to the sensitive attribute, we report classifier accuracy as the absolute distance from a random guess (the majority class ratio in the dataset), which we call ADRG in the following.
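The definition above translates directly into code. The sketch below uses hypothetical names and takes the true sensitive attribute values together with the predictions of an external classifier trained on the learned representation.

```python
from collections import Counter


def adrg(s_true, s_pred):
    """Absolute distance between the external classifier's accuracy on the
    sensitive attribute and the majority-class ratio (the 'random guess')."""
    n = len(s_true)
    accuracy = sum(t == p for t, p in zip(s_true, s_pred)) / n
    majority_ratio = Counter(s_true).most_common(1)[0][1] / n
    return abs(accuracy - majority_ratio)
```

Since the distance is absolute, a classifier whose accuracy falls below the majority-class ratio also yields a high ADRG, which is consistent with the observation that such predictions could simply be inverted in the binary case.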
7.5 Statistical Comparison of Interpretable vs. Non-Interpretable Models
To understand the fairness impact of the architectural constraints we imposed to allow for explicit computation of fairness vectors (Section 3.1), we perform further experimentation. Each of the architectures that was constrained for direct computation of correction vectors was compared to its non-interpretable version, trained as described in the respective paper introducing it. We trained each interpretable/non-interpretable pair of architectures 100 times while searching for the best hyperparameters. Each dataset is split into 3 internal folds which are employed for hyperparameter selection purposes, while 15 external folds are kept separate until test time. We then analyze the average performance on the relevant fairness metrics (1-GPA, 1-rND for ranking; 1-AUDC for classification) on the external folds. We correct the variance estimation as described by Nadeau and Bengio nadeau2003inference and perform a significance t-test. Here the number of runs averaged is the number of test (external) folds, following the recommendations by Nadeau and Bengio, who found little benefit in higher values in terms of statistical power.
We report the pairwise comparisons between interpretable and non-interpretable models in Figure 10. Each comparison also reports the p-value obtained by the application of the t-test. We found that the data does not support the presence of statistically significant differences between the fairness performance of our interpretable models and the non-interpretable ones. One model/dataset combination is an exception, however: the LawRace dataset when learning AdvDR ranking models. In Figure 10(b) we observe a statistically significant advantage in GPA when employing the interpretable model, whereas this trend is reversed when considering rND (Figure 10(c)). We hypothesize here that our hyperparameter search, which seeks to find models with the best values of , has found an impressively fair interpretable model for 1-GPA while sacrificing rND performance. The interpretable AdvDR does not seem to suffer much in terms of rND on other datasets, which suggests that it has no issue optimizing rND in general. We conclude that our framework for explicit computation of correction vectors warrants strong consideration in situations where transparency is needed to comply with existing laws and regulations.
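The variance correction mentioned above can be sketched as follows. The sketch implements the commonly used corrected resampled t-statistic attributed to Nadeau and Bengio, in which the naive factor 1/J (with J the number of test folds) is replaced by 1/J plus the ratio of test to training set size. The function name and argument layout are assumptions, and the p-value lookup from the t-distribution is left out.

```python
import math
from statistics import mean, variance


def corrected_resampled_t(diffs, n_train, n_test):
    """Corrected resampled t-statistic.

    diffs: per-fold differences in the metric between the two models.
    n_train, n_test: number of training / test samples per fold.
    Returns (t statistic, degrees of freedom).
    """
    j = len(diffs)
    if j < 2:
        raise ValueError("at least two folds are required")
    s2 = variance(diffs)  # unbiased sample variance of the differences
    # The n_test / n_train term accounts for overlapping training sets.
    corrected_var = (1.0 / j + n_test / n_train) * s2
    if corrected_var == 0.0:
        raise ValueError("differences have zero variance")
    return mean(diffs) / math.sqrt(corrected_var), j - 1
```

The resulting statistic is compared against a Student t-distribution with the returned number of degrees of freedom to obtain the p-value.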
7.6 Correction Vectors Analysis
7.6.1 Synthetic data
To give a more visual example, we provide pre- and post-correction visualizations for a simple two-Gaussians dataset in Figure 11. This dataset consists of samples from two separate Gaussian distributions which differ only in their mean. More specifically, the mean vectors are and , forming two point clouds which partially overlap, as we opted for a unit variance for both. We employ this dataset to train an interpretable AdvCls cerrato2020constraining; xie2017controllable, which performs explicit corrections (see Section 3.1), and a FairNF model (see Section 3.2). In the plots, the AdvCls model performs rather intuitive translations, which push the two Gaussian distributions closer together in feature space after correction. The FairNF model instead pushes both distributions onto a line-like manifold. We reason that this is a side effect of our strategy of breaking the bijectivity of Normalizing Flows by discarding one of the features in the latent space . We do note that the examples are centered around after correction, meaning that the second Gaussian has been pushed onto the first. It is noteworthy, however, that the original distribution shapes were not preserved, even if the variance appears to be similar.
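The setup above can be sketched as follows. This is an idealized illustration only: the two means are placeholder values (the ones used in our experiments are not reproduced here), and the single shared translation stands in for the per-sample corrections that a trained AdvCls model would compute. All function names are hypothetical.

```python
import random
from statistics import mean


def sample_gaussian_cloud(center, n, rng):
    """Draw n 2-D points with unit variance per coordinate around center."""
    return [(rng.gauss(center[0], 1.0), rng.gauss(center[1], 1.0)) for _ in range(n)]


def centroid(points):
    """Coordinate-wise mean of a list of 2-D points."""
    return (mean(p[0] for p in points), mean(p[1] for p in points))


def translate(points, shift):
    """Apply the same correction vector (shift) to every point."""
    return [(x + shift[0], y + shift[1]) for x, y in points]


rng = random.Random(0)
# Placeholder means, chosen only so that the two clouds partially overlap.
mu_a, mu_b = (0.0, 0.0), (2.0, 2.0)
group_a = sample_gaussian_cloud(mu_a, 500, rng)
group_b = sample_gaussian_cloud(mu_b, 500, rng)

# Idealized explicit correction: one translation that moves the centroid
# of the second cloud onto the centroid of the first.
ca, cb = centroid(group_a), centroid(group_b)
shift = (ca[0] - cb[0], ca[1] - cb[1])
corrected_b = translate(group_b, shift)
cb_after = centroid(corrected_b)
```

A pure translation of this kind leaves the shape of each cloud untouched, which is precisely what distinguishes it from the line-like collapse observed for FairNF.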
7.6.2 Real-world data
Having built a visual intuition for how explicit and implicit corrections might look and differ, we now move on to analyzing the corrections on the real-world datasets described in Section 4.2. Our analysis was performed as follows. For each dataset, we pick a feature for which we expect correction vectors to make individuals belonging to different groups more similar. We then compute the average correction for each group, as well as the average difference in the chosen feature in absolute and relative terms. A figure with the average corrections for the best performing models over all datasets is given in Figure 12. A table with all average corrections for each model and dataset can be found in Table 13. What stands out from these results is that different models often disagree about both the sign and the intensity of the correction that should be applied. For instance, on the COMPAS dataset, we would expect a correction to reduce the difference in the average number of prior crimes committed between white and black people. While our implicit computation methodology does just that, the VFAE constrained for explicit computation of correction vectors widens that gap. Nonetheless, it obtains an overall fair result (see Figures 6(d) and 6(b)). Similarly extreme and unintuitive corrections were computed by the interpretable AdvCls and FairNF models on the Adult dataset. One noteworthy trend is that simpler architectures computed more intuitive corrections. To be more specific, both the AdvDR and DELTR models employ only a single neural layer after computing the corrected features. In both models this layer has a single neuron which outputs the ranking scores ( for the pairwise AdvDR and for the listwise strategy DELTR). This severely constrains the complexity of the transformation learned by the models after feature vector computation, in stark contrast with the other models, which have no such restrictions.
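The per-group analysis described above can be sketched as follows, assuming that the chosen feature is available both before and after correction together with a group label per individual. The function and key names, as well as the toy values, are hypothetical.

```python
from statistics import mean


def gap_report(original, corrected, groups, group_a, group_b):
    """Summarize how corrections change the between-group gap of one feature.

    original, corrected: feature values before and after correction.
    groups: group label of each individual (same order as the values).
    """
    def group_mean(values, g):
        return mean(v for v, lab in zip(values, groups) if lab == g)

    # Average correction applied to the feature, per group.
    avg_correction = {
        g: mean(c - o for o, c, lab in zip(original, corrected, groups) if lab == g)
        for g in (group_a, group_b)
    }
    gap_before = group_mean(original, group_a) - group_mean(original, group_b)
    gap_after = group_mean(corrected, group_a) - group_mean(corrected, group_b)
    abs_change = abs(gap_after) - abs(gap_before)  # negative: the gap narrowed
    rel_change = abs_change / abs(gap_before) if gap_before != 0 else float("nan")
    return {
        "avg_correction": avg_correction,
        "gap_before": gap_before,
        "gap_after": gap_after,
        "abs_change": abs_change,
        "rel_change": rel_change,
    }


# Toy feature values for two individuals per group (illustrative only).
report = gap_report(
    original=[4.0, 6.0, 1.0, 3.0],
    corrected=[3.0, 5.0, 2.0, 4.0],
    groups=["a", "a", "b", "b"],
    group_a="a",
    group_b="b",
)
```

In this toy case the gap shrinks from 3 to 1, which is the kind of correction we would call intuitive. A positive `abs_change` would instead flag a model that widens the gap.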
DELTR and AdvDR are able to clearly close the gap in average feature values on all considered datasets. The other models were also able to output reasonable corrections. For instance, our FairNF model almost nullified the gap in prior crimes committed on the COMPAS dataset. However, these models appear to be less consistent in this regard: FairNF instead amplified the average difference in employment variation rate on the Adult dataset. Interestingly, this model is the strongest trade-off performer on that same dataset (Figures 6(b) and 6(d)). As previously discussed, we hypothesize that the absence of restrictions on how many layers may be included after the computation of the correction vectors may limit the intuitiveness of some of the corrections. This issue could be addressed in the future by further constraining the maximum computable value for correction vectors. These matters notwithstanding, the opportunity to peek inside the “black box” of fair representation learning might also guide how a practitioner selects models: a model which computes corrections that are deemed counterintuitive or excessive may be discarded during model selection. Therefore, our framework opens up new opportunities for developing models that are trustworthy and transparent in addition to fair.
We report in Figure 13 a table displaying the average correction vectors for all datasets and models considered. These models were obtained via hyperparameter selection as described in Section 4. We take a single feature from each dataset for which we expect our models to close the gap between historically privileged and underprivileged groups. We observe that simpler models, which include only a single layer after correction vector computation, display more intuitive corrections, whereas even very fair models might learn corrections that are not as easy to understand. We refer to Section 4.7 for an in-depth discussion of these matters.
7.7 Fair Representations
In fair representation learning, the ultimate objective is to remove information about the sensitive attribute from the obtained representation. We investigate this matter by training off-the-shelf rankers and classifiers on the transformed data. The task is to classify the sensitive attribute, i.e. to recover information about it. We show results for this task in Figure 14 and Figure 15 of the supplementary material. The employed models are plain Linear Regression (LR) and Random Forests (RF). We evaluate the absolute difference in accuracy from the random guess (ADRG cerrato2020constraining; cerrato2020pairwise; Louizos2016TheVF). As explained above, this metric is defined as the absolute value of the difference between the accuracy of an external classifier and the accuracy of a classifier which always predicts the majority class. The rationale is that a perfectly fair representation will force a classifier to always predict the majority class, as there is no other way to obtain a higher accuracy. Therefore, lower scores are better. We also computed the AUC of the employed classifiers. We observe only minimal differences between the algorithms we propose in this paper and the ones already present in the literature. One issue we observed with the VFAE algorithm is that it degrades external classifier performance to the point that the accuracy becomes lower than that of the random guess. In a binary classification setting, one could invert the decisions taken by the model and obtain a higher performance than random guessing; therefore, this model displays a fairly high ADRG.
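The ADRG definition above translates directly into a few lines of code. The following is a minimal sketch with a hypothetical function name and toy labels, which also illustrates the below-baseline behaviour discussed for VFAE.

```python
from collections import Counter


def adrg(s_true, s_pred):
    """Absolute difference between accuracy and the majority-class rate."""
    if not s_true or len(s_true) != len(s_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    accuracy = sum(t == p for t, p in zip(s_true, s_pred)) / len(s_true)
    # Accuracy of a classifier that always predicts the majority class.
    majority_rate = Counter(s_true).most_common(1)[0][1] / len(s_true)
    return abs(accuracy - majority_rate)


# Toy sensitive attribute: 6 individuals in group 0, 4 in group 1.
s_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]

# Always predicting the majority class yields an ADRG of zero.
adrg_majority = adrg(s_true, [0] * 10)

# A classifier with accuracy 0.3, below the 0.6 baseline, still scores 0.3.
s_pred_bad = [1, 1, 1, 1, 0, 0, 0, 0, 0, 1]
adrg_bad = adrg(s_true, s_pred_bad)

# Inverting its binary decisions gives accuracy 0.7 and an ADRG of 0.1.
adrg_inverted = adrg(s_true, [1 - p for p in s_pred_bad])
```

Because of the absolute value, an external classifier that falls below the majority-class baseline is penalized just like one that exceeds it, reflecting that the representation still carries recoverable information about the sensitive attribute.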
As an aside, the gains in invariance of the representation with respect to the sensitive attribute are very small when measured by AUC. To the best of our knowledge, this is the first time that AUC has been considered for this kind of evaluation, with previous literature focusing mostly on accuracy-based measures such as ADRG (see e.g. Louizos2016TheVF; zemel2013learning; cerrato2020constraining). We conclude that employing AUC should be strongly considered in the future when evaluating fair representation learning algorithms.
7.8 Accuracy-Discrimination tradeoffs
We report in Figure 16 the accuracy/discrimination tradeoffs for our classification models. This analysis was performed by taking 20 different thresholds in the interval and computing accuracy and discrimination metrics. The highest discrimination value (lower is better, see Section 4.1.2 and 7.4.2) we find is .
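A threshold sweep of this kind can be sketched as follows. Two assumptions are made for the purpose of illustration and should not be read as our exact protocol: discrimination is taken to be the largest difference in positive prediction rates between groups, and the threshold interval is passed in by the caller (the endpoints 0 and 1 used in the example are placeholders). All names and toy values are hypothetical.

```python
def tradeoff_curve(scores, labels, groups, lo, hi, n_thresholds=20):
    """Return (threshold, accuracy, discrimination) for evenly spaced thresholds."""
    step = (hi - lo) / (n_thresholds - 1)
    curve = []
    for k in range(n_thresholds):
        thr = lo + k * step
        preds = [1 if s >= thr else 0 for s in scores]
        acc = sum(p == y for p, y in zip(preds, labels)) / len(labels)
        # Assumed discrimination measure: gap in positive prediction rates.
        rates = []
        for g in set(groups):
            members = [p for p, lab in zip(preds, groups) if lab == g]
            rates.append(sum(members) / len(members))
        curve.append((thr, acc, max(rates) - min(rates)))
    return curve


# Toy model scores, target labels, and group memberships (illustrative only).
scores = [0.9, 0.8, 0.4, 0.2, 0.7, 0.3, 0.1, 0.6]
labels = [1, 1, 0, 0, 1, 0, 0, 1]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

curve = tradeoff_curve(scores, labels, groups, lo=0.0, hi=1.0)
worst_discrimination = max(d for _, _, d in curve)
```

At the two extremes of the sweep every individual receives the same decision, so discrimination is zero there and any disparity shows up at intermediate thresholds.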