Ethical Adversaries: Towards Mitigating Unfairness with Adversarial Machine Learning

05/14/2020 ∙ by Pieter Delobelle, et al. ∙ UNamur

Machine learning is being integrated into a growing number of critical systems with far-reaching impacts on society. Unexpected behaviour and unfair decision processes are coming under increasing scrutiny due to this widespread use and also due to theoretical considerations. Individuals, as well as organisations, notice, test, and criticize unfair results to hold model designers and deployers accountable. This requires transparency and the possibility to describe, measure and, ideally, prove the 'fairness' of a system. This involves concepts such as fairness, transparency and accountability that will hopefully make machine learning more amenable to criticism and improvement proposals towards the fulfilment of societal goals. We concentrate on fairness, taking into account that both the transparency of the neural networks and accountability of actors and systems will require further methods. We offer a new framework that assists in mitigating unfair representations in the dataset used for training. Our framework relies on adversaries to improve fairness. First, it evaluates a model for unfairness w.r.t. protected attributes and ensures that an adversary cannot guess such attributes for a given outcome, by optimizing the model's parameters for fairness while limiting utility losses. Second, the framework leverages evasion attacks from adversarial machine learning to perform adversarial retraining with new examples unseen by the model. These two steps are iteratively applied until a significant improvement in fairness is obtained. We evaluated our framework on well-studied datasets in the fairness literature, including COMPAS, where it can surpass other approaches concerning demographic parity, equality of opportunity and also the model's utility. We also illustrate our findings on the subtle difficulties when mitigating unfairness and highlight how our framework can help model designers.





1 Introduction

Machine learning eases the deployment of systems that tackle various tasks: spam filtering, image recognition, gesture recognition, etc. One of the most trendy applications is decision support. After collecting data on people and their context, these systems give recommendations on who should get a loan, predict who may commit subsequent offences, etc. However, this support can have detrimental consequences. Well-studied examples include the COMPAS system that predicts the recidivism of pre-trial inmates [angwinMachineBiasThere2016, chouldechovaFairPredictionDisparate2017], systems that decide on credit applications, and, more recently, the issues with Apple’s credit card that resulted in vastly lower spending limits for women. Such systems may amplify the prevalent situation by imposing more expensive loans on African-American people, who then fail to repay them more often [overdorfQuestioningAssumptionsFairness2018, ensign2017]. These “positive” feedback loops should be detected and mitigated.

Training a machine learning model can be costly, is sensitive to the data quality, and may result in a complex model. Hence, the decision process may fail to be transparent, which ushers in discrimination or unfair treatment for protected groups. But how can this be assessed when decisions are often neither interpretable nor intuitive? Researchers have focused on providing quantitative assessments (e.g. demographic parity [dworkFairnessAwareness2012], equalized odds [hardtEqualityOpportunitySupervised2016], statistical parity [feldman2015certifying, zemel2013learning], disparate impact [chouldechovaFairPredictionDisparate2017, feldman2015certifying], Darlington criterion [darlington1971], threshold testing [pmlr-v84-pierson18a]), each covering a specific fairness aspect.

To tackle this problem, researchers assume that a protected attribute (e.g., race or gender) exists while it should not be predictable, despite existing dependencies between this attribute and others (ZIP code, …). Thus, only removing the protected attribute, sometimes referred to as fairness through unawareness, is known to be insufficient [pedreschi2008discrimination, calders2013unbiased].

Because of the far-reaching consequences—being encoded in legal obligations—of these machine learning systems, state-of-the-art methods employ more advanced approaches to mitigate unfairness issues with these models inspired by other machine learning domains, like domain adaptation. One technique is not enough to harness this complex problem. Here, we propose to use two kinds of adversarial machine learning techniques, which we motivate through the following scenarios.

1.1 Motivating Scenarios

Our goal is to develop a notion of ethical adversaries based on adversarial machine learning techniques, initially designed to fool machine learning classifiers, to improve their fairness. Our scenarios take place in the context of an ethics assessment activity while designing a new machine learning-based system for a fictional company called “Fancy-Fair AI”.

The Feeder: Black-box External Attacks. Alice is a specialist in adversarial machine learning (advML) attacks. She is hired for a mission at Fancy-Fair AI to assess and improve the dataset in order to later increase the performance of the trained machine learning model. However, she is only given the dataset and not the inner details of the already trained black-box machine learning model. Therefore, she has to train a surrogate classifier on the dataset and will feed the machine learning model with new instances. This process, known as evasion attacks [biggio2013evasion], starts from an existing instance and creates a new one by modifying feature values along the gradient in yet unseen zones of the feature space (i.e., where the prediction confidence is low) through successive displacements. Alice ensures the validity of modified features and tunes the attacks’ parameters to improve the system via retraining. The examples Alice crafts are also black-box with respect to fairness evaluation, as they are not tuned to optimise a particular metric. Yet, such examples can help the Reader to alleviate misrepresentations as they will provide new feature value combinations.

The Reader: White-box Adversarial Fairness. Bob is an ethics assessment officer at Fancy-Fair AI. He leans on a set of carefully selected fairness metrics assuming that fair decisions should not depend on some protected attributes (race, gender, etc.). Therefore Bob wants to ensure that an insider adversary, striving to predict the value of a protected attribute (e.g., gender) given an outcome (e.g., credit limit), fails to do so. This approach, called adversarial fairness, has been applied for autoencoders [louizosVariationalFairAutoencoder2015, edwards2015, madrasLearningAdversariallyFair2018] and for both classification and regression networks [zhangMitigatingUnwantedBiases2018, raff2018, adel2019]. It relies on a gradient reversal [ganin2016domain] to update the weights of the adversary so that the chance of predicting the protected attribute is no better than random. This relies on diminishing the dependencies between the protected attribute and the other attributes by backpropagating the gradients with gradient ascent.

Feed, Read and Fix: Grey-box Fairness. This last scenario, which illustrates the main contribution of this paper, reunites Alice's and Bob's approaches as depicted in Figure 1. This integrated architecture thus works both at the data level (by providing new instances) and at the model level (by preventing it from guessing protected attributes). Fancy-Fair AI assesses the effectiveness of the integrated solution by monitoring demographic parity and equal opportunity as well as the impact on the utility (accuracy, $F_1$) of the decision system.

Figure 1: Our ethical adversaries framework: the feeder (Alice) improves the target Y by creating new instances X, while the reader (Bob) prevents from guessing A. The modified prediction due to model adaptation is shown bottom-right.

1.2 Contributions and Organisation

In this paper, we propose a new framework implementing the grey-box fairness scenario. We make the following contributions:

  • The definition of the behaviour of our ethical adversaries in terms of evasion attacks and gradient reversal analyses;

  • Demonstration of an undesired side-effect of gradient-based fairness models;

  • An implementation of our framework in Python, which is publicly available;

  • Evaluation on three datasets: COMPAS, German Credit and Adult; showing state-of-the-art level results in demographic parity and equal opportunity metrics while globally improving the model’s utility.

This paper is organised as follows: Section 2 discusses related work on adversarial machine learning techniques as well as on measuring and mitigating unfairness. Section 3 investigates a problem of gradient reversal methods. Section 4 presents our new framework, followed by its evaluation on the COMPAS, German Credit and Adult datasets in Section 5. Section 6 concludes and gives an outlook on future work.

2 Background and related work

2.1 Adversarial machine learning

Adversarial machine learning aims at finding or creating examples that are problematic for a machine learning model, e.g., [papernot2016, papernot2017, biggio2013evasion]. biggio2018 synthesised a decade of research in adversarial machine learning. These techniques follow the same process: probe an existing target machine learning based system to gain information about it, copy an existing example, and apply an adversarial technique that modifies the example depending on the desired goal. Modified examples show an interesting behavior: they remain similar to the original ones while being misclassified by the trained model. Various models can be attacked, including support vector machines (SVMs), linear models or even (deep) neural networks (NNs). While adversarial machine learning begins with a model of attackers’ possibilities, it also enables the design of defenses against attacks. In particular, adversarial retraining has been very popular with the emergence of generative adversarial nets (GANs) proposed by goodfellowGenerativeAdversarialNets2014a. Other approaches to robustify neural networks exist and use existing examples: e.g., edwards2015 used an adversary to force an encoder-decoder network to learn domain-independent representations [edwards2015, madrasLearningAdversariallyFair2018, zhangMitigatingUnwantedBiases2018].

2.2 Fair Representations with Neural Networks

Several works [raff2018, adel2019, madrasLearningAdversariallyFair2018, ganin2016domain] aim at training models to obtain internal representations that are fair. The embeddings produced by these models cannot be used to predict the protected attribute $A$. Such works integrate an adversary with a new goal: trying to predict the protected attribute (and not degrading the model’s performance anymore).

A new model is created but with two goals: (i) predicting the main attribute $Y$ (which we will refer to as the utility of the model); (ii) not being able to predict the protected attribute $A$. These goals can be formally defined as a minimax problem (Equation 1) [edwards2015] between an adversary and an encoder, each with its own parameters. We use the encoder's representation to predict both $Y$ and $A$ via an adversarial network. ganin2016domain, raff2018 and adel2019 all proposed to optimize a variant of a joint loss function (Equation 2) that combines the loss for predicting $A$ from the representation with the loss for the target prediction $Y$, also from the representation, weighted by a hyper-parameter $\lambda$.
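One form consistent with this description is the saddle-point objective that ganin2016domain use for domain adaptation, with the protected attribute taking the role of the domain label. The symbols below are assumptions chosen for illustration, not necessarily the paper's own notation: $\theta_h$ for the parameters of the shared encoder, $\theta_Y$ and $\theta_A$ for the target and adversarial branches, $\mathcal{L}_Y$ and $\mathcal{L}_A$ for their losses, and $\lambda$ for the trade-off hyper-parameter.

```latex
\begin{aligned}
E(\theta_h, \theta_Y, \theta_A) &= \mathcal{L}_Y(\theta_h, \theta_Y) - \lambda\, \mathcal{L}_A(\theta_h, \theta_A), \\
(\hat{\theta}_h, \hat{\theta}_Y) &= \arg\min_{\theta_h,\, \theta_Y} E(\theta_h, \theta_Y, \hat{\theta}_A), \\
\hat{\theta}_A &= \arg\max_{\theta_A} E(\hat{\theta}_h, \hat{\theta}_Y, \theta_A).
\end{aligned}
```

Under this sketch, the adversarial branch minimizes $\mathcal{L}_A$ (it tries to predict $A$), while the encoder minimizes $\mathcal{L}_Y$ and at the same time maximizes $\mathcal{L}_A$.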

zemel2013learning learn fair representations of the original input features $X$. The idea is to remove existing dependencies between the representation and the protected attribute $A$, making its prediction impossible for adversaries. This would also make the practice of red-lining impossible, as these dependencies can no longer be correlated with $A$. We consider this goal a good proxy for fairness, and this approach has been further investigated [raff2018, adel2019, madrasLearningAdversariallyFair2018].

2.3 Fairness through a Gradient Reversal Layer (GRL)

ganin2016domain introduced a gradient reversal layer (GRL) originally for domain adaptation. Both raff2018 and adel2019 treated the protected attribute as a domain label. The gradient reversal strategy assumes that multiplying the gradient by a negative sign will increase the loss of the adversarial branch and yield a representation that is maximally invariant to changes in $A$ [adel2019, raff2018]. For a model with two target outputs and a hidden internal representation, Equation 2 applies. In our framework, the Reader (see Figure 1) reuses this approach to mitigate the ability to predict $A$.

2.4 Adversarial attacks on model inputs

Our framework uses a second kind of adversarial machine learning, known as evasion attacks [biggio2013evasion], to diversify the training set. The goal of the evasion attacks is to generate new examples that do not follow the same distribution as the original set. Generated examples combine different characteristics, initially under-represented. These examples can be added to the training set to perform adversarial retraining.

Evasion attacks are gradient-based and use a step size parameter to converge towards a local optimum. An attack: (i) chooses a starting example for which the classifier’s decision is known; (ii) computes the gradient directed towards the separating function; (iii) applies this direction to the example’s position, scaled by the step size; (iv) repeats until a stopping criterion is met (a maximum number of iterations or a plateau is reached). This algorithm has been implemented and made publicly available in the Python secML package, which we use in our experiments.
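The four steps can be illustrated with a minimal, standard-library sketch. It is not the secML implementation: it assumes a linear surrogate (so the gradient of the decision function is simply its weight vector), clips features to a hypothetical valid range `[lower, upper]`, and additionally stops once the decision boundary is crossed. The names `score` and `evasion_attack` are introduced here for illustration only.

```python
import math


def score(w, b, x):
    """Decision function of a linear surrogate; its sign is the predicted class."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def evasion_attack(x0, w, b, step=0.1, max_iter=100, lower=0.0, upper=1.0):
    """Move x0 towards (and just across) the boundary of the linear surrogate."""
    x = list(x0)
    side = 1.0 if score(w, b, x) >= 0 else -1.0   # (i) known starting decision
    norm = math.sqrt(sum(wi * wi for wi in w))    # assumes w is not all zeros
    # (ii) for a linear surrogate the gradient of the score is w; moving
    # against the current side points towards the separating function.
    direction = [-side * wi / norm for wi in w]
    for _ in range(max_iter):                     # (iv) iteration budget
        if side * score(w, b, x) < 0:             # boundary crossed: stop
            break
        # (iii) displace the example by the step size, then clip each feature
        # so that the modified example stays within the valid range.
        moved = [min(max(xi + step * di, lower), upper)
                 for xi, di in zip(x, direction)]
        if moved == x:                            # (iv) plateau: no progress
            break
        x = moved
    return x


w, b = [1.0, -1.0], 0.0
x_start = [0.8, 0.2]
x_adv = evasion_attack(x_start, w, b)
```

With a non-linear surrogate such as an RBF SVM, the direction in step (ii) would have to be recomputed at every iteration instead of once.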

demontis2019transfer showed the transferability potential of such attacks. In particular, from the attacker’s point of view, building the exact same model is not necessary, as long as the data distributions used to train both models are similar. Hence, one can approximate any complex or non-differentiable ML model with a simpler one and still generate relevant examples to influence the original model while retraining. The Feeder of our framework aims at providing adversarial examples for retraining that will mitigate unfairness.

2.5 Discrimination-Aware Data Mining

In works on discrimination-aware data mining (DADM) and fairness in machine learning, modifications to the data, the learning algorithms, or the resulting patterns and models [DBLP:journals/tkde/HajianD13] have been developed and applied. pedreschi2008discrimination introduced an approach to tackle discrimination by extracting classification rules and ranking them based on a measure. DADM focuses on discovering discrimination as well as preventing it, whether direct or indirect; indirect discrimination is the reason why simply removing protected attributes is not effective (see Section 1). Our framework performs both steps in an integrated manner by generating new examples and tuning the model to prevent discrimination.

3 Why gradient reversal is not a silver bullet

As described in Section 2.3, GRL is a currently popular approach, also known as ‘adversarial fairness’. We also use this technique, and like the authors who used it to learn ‘fair(er) representations’ [adel2019, raff2018], we find that it can mitigate unfairness in classification/prediction tasks.

In the remainder of this section, we formulate and prove a problem with this approach and illustrate it in Figure 2.

The introduction of a gradient reversal layer by ganin2016domain targeted domain adaptation. adel2019, raff2018 continued on this by viewing the protected attribute as a domain label. However, ganin2016domain offered no guarantees as to how the domain was represented internally. In this section, we argue that the adversarial branch achieves its goal by learning specifically to predict the protected attribute, rather than obfuscating it.

The gradient reversal strategy assumes that multiplying the gradient by a negative sign will increase the loss of the adversarial branch, which then yields a representation that is maximally invariant to changes in $A$ [adel2019]. This is intuitive, but there is no guarantee that gradient descent with flipped gradients actually achieves this maximal invariance.

Lemma 1.

Gradient reversal equates to performing gradient ascent on the shared layers with respect to the protected attribute $A$, whilst simultaneously performing gradient descent on the dedicated branch for the attribute $A$.

Consider the final layer with two independent branches, one predicting the target $Y$ and one predicting the protected attribute $A$, each governed by its own parameters, and the shared penultimate layer. The shared loss function for both is stated earlier in Equation 2. For the branch that predicts the protected attribute $A$, the loss for $A$ gives rise to the weight updates of that branch, which follow ordinary gradient descent on this loss.

The branch giving the target label $Y$ is updated similarly. The shared penultimate layer’s weights rely on the shared loss: their update combines the gradient of the loss for $Y$ with the sign-flipped, scaled gradient of the loss for $A$.

The weights of the adversarial branch are updated following gradient descent with respect to the loss for $A$, thus minimizing the loss for this branch. However, gradient reversal simultaneously performs gradient ascent with respect to the same loss on all weights of the shared layers. The shared layers still perform gradient descent with regard to the loss for the target label $Y$.
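The updates described in this argument can be written out as a sketch. The notation is assumed for illustration and is not necessarily the paper's: $\theta_A$ and $\theta_Y$ are the parameters of the two branches, $\theta_h$ those of the shared layers, $\mathcal{L}_A$ and $\mathcal{L}_Y$ the corresponding losses, $\eta$ a learning rate and $\lambda$ the scaling of the reversed gradient.

```latex
\begin{aligned}
\theta_A &\leftarrow \theta_A - \eta\, \frac{\partial \mathcal{L}_A}{\partial \theta_A}, \\
\theta_Y &\leftarrow \theta_Y - \eta\, \frac{\partial \mathcal{L}_Y}{\partial \theta_Y}, \\
\theta_h &\leftarrow \theta_h - \eta \left( \frac{\partial \mathcal{L}_Y}{\partial \theta_h} - \lambda\, \frac{\partial \mathcal{L}_A}{\partial \theta_h} \right)
          = \theta_h - \eta\, \frac{\partial \mathcal{L}_Y}{\partial \theta_h} + \eta \lambda\, \frac{\partial \mathcal{L}_A}{\partial \theta_h}.
\end{aligned}
```

The last line makes the two opposing effects on the shared layers explicit: a descent step on $\mathcal{L}_Y$ and an ascent step on $\mathcal{L}_A$, while the first line is a plain descent step on $\mathcal{L}_A$ for the adversarial branch.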

We have shown that, with respect to the loss for the protected attribute, the shared penultimate layer does not perform gradient descent, but gradient ascent. This is in accordance with the implicit definition of maximal invariance [adel2019, ganin2016domain, raff2018], under which the shared layers maximize the loss for the protected attribute.

This fits in the larger minimax problem from Equation 1 and results in a saddle point [ganin2016domain]. However, the end result is not guaranteed to be a maximally invariant representation. In the worst case, maximizing this loss can even result in the opposite optimum for the shared trunk with regard to $A$. This means that the model is not necessarily maximally invariant on $A$. We need to emphasize that this is a theoretical result, but it calls for caution when adversarial fairness is to be used to, for example, publish or re-use a supposedly ‘fair’ data representation. We illustrate this issue on the COMPAS dataset in Figure 2.

(a) Naive model
(b) Model trained with a GRL
(c) Model trained with our framework
Figure 2: t-SNE dimensionality reduction of the activations in the last hidden layer on the held-out COMPAS test set. Distinct colors are used for the reported race of individuals in the dataset: either African-American (blue) or Caucasian (orange).

For each individual in the COMPAS test set, all three models derive a representation in the last hidden layer, on which we applied a t-SNE dimensionality reduction for a two-dimensional visualisation.

The model without fairness constraints (a) shows a slight separation with regard to the protected attribute, but the attribute is clearly separable in the representation from the model trained with a GRL (b). This is also shown by retraining a one-layer perceptron on these representations. The model that was originally trained to predict only recidivism could already be used to classify the protected attribute race. And although the adversary of the original GRL reported that it could not predict race, Lemma 1 tells us that this adversary cannot be trusted. This is the case here, as an independent perceptron recovers race from the GRL representation. elazar-goldberg-2018-adversarial made an empirical observation on leakage of protected attributes specifically for text-based classifiers that can also be traced back to this.

Here, we demonstrated that the hidden representation obtained by gradient reversal not only still contains information about the protected attribute, but contains a stronger signal. Our architecture that joins ‘adversarial fairness’ and ‘adversarial learning’ (see Section 1.1 and Fig. 1) leverages utility- and fairness-focused methods in a better way than the modification of the model alone. By injecting noise with the adversarial Feeder, our framework makes the protected attribute a useless predictor, as shown in (c). Our results, discussed in Section 5.3, confirm this expectation.

4 Ethical Adversaries Framework

In this section, we present how the two adversarial attacks interact in our framework. The first attacks the inputs of the model whereas the second tries to predict the protected attribute as part of the model. We join both adversaries in a single system to address issues discussed in Section 2.2 and Section 2.3, ultimately resulting in a fairer model.

Figure 1 shows how these two adversaries are incorporated. Our network follows the architecture with a GRL (discussed in Section 2.3 and used by the Reader on the right part of the figure). The external adversary (the Feeder on the left part of the figure) performs evasion attacks as discussed in Section 2.4. We discuss both parts in this section, including the hyperparameters and complexity they introduce to our architecture.

4.1 Adversarial reader

We augment the original model by adding a second branch with reversed gradients that will predict the protected attribute $A$. We follow the training setup from raff2018, discussed in Section 2.3. The model will thus be trained with the joint loss of the original prediction target and the protected attribute. During the backward pass, the signs of the gradients from the adversarial branch are flipped and scaled by a hyperparameter $\lambda$.
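The effect of flipping and scaling the gradients can be sketched on a deliberately tiny model with hand-written gradients. This is an illustration under stated assumptions and not the released PyTorch implementation: the model has a single shared weight `w`, a target branch weight `u`, an adversarial branch weight `v`, sigmoid outputs and binary cross-entropy losses; all names, including `grl_step` and `lam` (standing for $\lambda$), are hypothetical.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def bce(p, target):
    """Binary cross-entropy for a single prediction p in (0, 1)."""
    return -(target * math.log(p) + (1 - target) * math.log(1 - p))


def adversary_loss(w, v, x, a):
    """Loss of the adversarial branch, which predicts A from h = w * x."""
    return bce(sigmoid(v * (w * x)), a)


def grl_step(w, u, v, x, y, a, lr=0.1, lam=1.0):
    """One training step with a gradient reversal in front of the A-branch."""
    h = w * x                      # shared representation
    err_y = sigmoid(u * h) - y     # dL_Y / d(logit) for sigmoid + BCE
    err_a = sigmoid(v * h) - a     # dL_A / d(logit) for sigmoid + BCE

    # Branch weights: ordinary gradient descent on their own loss.
    new_u = u - lr * err_y * h
    new_v = v - lr * err_a * h

    # Shared weight: descent on L_Y, while the gradient flowing back from the
    # A-branch is multiplied by -lam (the reversal), i.e. ascent on L_A.
    grad_w_from_y = err_y * u * x
    grad_w_from_a = err_a * v * x
    new_w = w - lr * (grad_w_from_y - lam * grad_w_from_a)
    return new_w, new_u, new_v


w, u, v = 0.5, 0.0, 1.0            # u = 0 isolates the effect of the reversal
x, y, a = 1.0, 1, 1
new_w, new_u, new_v = grl_step(w, u, v, x, y, a)

before = adversary_loss(w, v, x, a)
after_shared_update = adversary_loss(new_w, v, x, a)   # shared layer only
after_branch_update = adversary_loss(w, new_v, x, a)   # adversary only
```

In this toy setting the shared update raises the adversary's loss and the branch update lowers it, which is the pair of opposing forces that Lemma 1 describes. In an automatic differentiation framework the same behaviour is usually obtained with an identity layer whose backward pass negates and scales the incoming gradient.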

4.2 Adversarial feeder

As presented in Section 2.4, the feeder needs a starting point that is an approximation of the target model, i.e., a surrogate model (see Section 2).

The evasion attack runs as presented in Section 2.4, and newly generated examples can be included in the training set for adversarial retraining. Note that adversarial retraining may drastically increase convergence time to compute a separating function since included adversarial examples make the separation more difficult to find. Generally, defining the ideal size of batches for training remains an open issue [li2014efficient].

4.3 Complexity analysis

Our architecture consists of three elements: the model under attack, the Reader and the Feeder. The adversarial reader is trained in conjunction with the model under attack. The time complexity of the attacked model depends in part on the chosen model; for neural networks, it is architecture-dependent. After training the model with the adversarial reader, a surrogate is trained, in our case an SVM with time complexity $O(\max(n, d)\,\min(n, d)^2)$, with $n$ the number of data points and $d$ the number of features [chapelle2007training]. The time complexity of our entire system scales linearly with the number of adversarial attacks. While not the focus of this paper, there are ways to learn SVMs faster, and integrating them is subject to future work.

5 Evaluation

We evaluate our model on three popular datasets: COMPAS [angwinMachineBiasThere2016], German Credit, and Adult Census [kohavi1996scaling]. The COMPAS dataset was originally a sample of outcomes from the COMPAS system that predicted the risk of recidivism. This caused a debate about whether or not this score was disadvantaging African Americans [angwinMachineBiasThere2016, dieterichCOMPASRiskScales2016, chouldechovaFairPredictionDisparate2017, corbett-davies2018]. The dataset, therefore, includes the race of individuals.

In line with other research [adel2019, angwinMachineBiasThere2016, zafarParityPreferencebasedNotions2017], we will only use individuals of Caucasian or African-American descent, as there is much less data on other groups (e.g., only 31 instances for people of Asian descent), which poses issues during training and evaluation. This implies that there are minorities that are excluded from many studies; more datasets would be needed to study whether patterns of unfairness are similar and mitigation measures can be transferred, or whether these affect different demographics differently.

COMPAS is composed of 5,278 instances and represented by 12 features. The target variable is whether a person has recidivated within two years. The race is used as a protected attribute. The Adult dataset gathers 32,000 instances represented by 9 features. We use gender as a protected attribute and the binary target variable is income, whether someone earns more than 50,000 USD. German Credit is the smallest dataset, with only 1,000 instances and 20 features. There is a class imbalance, with 70% of all samples good credits and only 30% bad credits. The protected attribute is age, with a threshold at 25 years.

For reproducibility purposes, we have publicly released our code and provided users with a template that they can incorporate in their projects. It is compatible with all PyTorch models with only minor modifications, i.e., adding an adversarial branch and replacing the training loop. We recall that we have used the secML package (v0.11) for running evasion attacks.

5.1 Training setup

The model under attack.

We start from a neural network of 3 hidden layers with 32 hidden units for COMPAS and German Credit and 128 for Adult, due to its larger encoded input. Each of the hidden units has a ReLU activation. This activation function is computationally efficient and mitigates the issue of vanishing gradients since it does not saturate for positive inputs, which makes it one of the most popular activation functions. For the output units, a softmax activation was used to get the classification, and a linear activation for COMPAS. The network, as well as the adversarial reader, is trained with the Adam optimizer and an initial learning rate that is reduced by a constant factor when reaching a plateau.

The adversarial reader. The adversarial reader is part of the model under attack and therefore follows the same training regime. The joint loss follows Equation 2 by including the GRL. The individual losses for both $Y$ and $A$ are binary cross-entropy losses, except for COMPAS. In that case, the risk score is predicted as a regression problem with the MSE loss and then thresholded (low vs medium and high risk).

The adversarial feeder.

In our setting, we can use the same training set for both the feeder and reader since they are part of the same, unique architecture. Relying on the earlier discussed transferability of attacks, we also approximate the attacked model by an SVM with a radial basis function kernel. We set the hyperparameters $C$ and $\gamma$ with a grid search over a reduced number of values, using 10-fold cross-validation.

5.2 Evaluating fairness

Since the architecture we proposed in Section 4 aims at mitigating unfairness, we will have to evaluate this aspect in our experiments. There exist several measures of fairness in the literature. In this subsection, we discuss some of the most popular ones for different aspects of fairness.

We define all measures via the predicted values $\hat{Y}$ of the classifier and the protected attribute $A$, whose values identify a disadvantaged group and a privileged group. The similarities of predictions are described for the positive prediction $\hat{Y} = 1$. Since the focus of most fairness measures is on the disadvantaged group having fewer (desired) opportunities, $\hat{Y} = 1$ is generally the desired outcome.

One set of measures expresses the requirement that the predicted values of the classifier conditioned on the protected attribute be equal [calders2010] or the difference to be within an acceptable range.

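Definition 1.

Demographic parity (DP). DP is the equality or similarity of prediction outcomes as an absolute difference [dworkFairnessAwareness2012, raff2018]:

$$\mathrm{DP} = \left| P(\hat{Y} = 1 \mid A = \text{disadvantaged}) - P(\hat{Y} = 1 \mid A = \text{privileged}) \right|$$

Definition 2.

Demographic parity ratio (DPR). DPR is the equality or similarity of prediction outcomes as a ratio:

$$\mathrm{DPR} = \frac{P(\hat{Y} = 1 \mid A = \text{disadvantaged})}{P(\hat{Y} = 1 \mid A = \text{privileged})}$$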

Requiring $\mathrm{DP} = 0$ or $\mathrm{DPR} = 1$ would require exactly equal predicted outcomes for both groups. This is unrealistic for most data, so real-world usage of such measures is less restrictive. For instance, in a legal setting, the US Equal Employment Opportunity Commission (EEOC) uses the DP ratio with a threshold of 0.8 (the “80% rule” [feldman2015certifying]), stating that disparate impact caused by employment-related decisions or structures can only be ascertained if $\mathrm{DPR} < 0.8$.

Demographic parity has received some criticisms, since (i) it can meaninglessly reduce the utility of the classifier and, more worryingly, (ii) it does not necessarily measure what many would define as fairness [dworkFairnessAwareness2012]. The first issue is due to possible correlations between the protected attribute $A$ and the true outcome $Y$. Since we expect equality of the classifier concerning the protected attribute, it cannot operate as a perfect classifier.

The second issue stems from ignoring both the true outcome and individual merits. For instance, consider a selection procedure with two subgroups with different values for the protected attribute $A$. One subgroup can be composed of qualified individuals (i.e., with high chances for a positive true outcome $Y$), but another subgroup can consist of random individuals. This still satisfies demographic parity, but these token individuals do not guarantee fairness since qualified individuals from the protected subgroup are still mistreated.

Addressing the criticisms of demographic parity, hardtEqualityOpportunitySupervised2016 presented two other metrics that extend the aforementioned ones. By including the true outcome $Y$, the authors show that this variable can serve as a justification for the predicted outcome. For example, in the case of COMPAS, this is the recidivism rate as measured by violent crimes in a two-year window. Conditioning by the true outcome is a justification that the authors consider to be a suitable interpretation of the task-specific similarity measure from dworkFairnessAwareness2012, which can otherwise be difficult to come up with. This is also very similar to disparate mistreatment [zafarFairnessDisparateTreatment2017, barocas-hardt-narayanan].


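Definition 3.

Equal opportunity (EO). EO requires an independence of $\hat{Y}$ and $A$ conditioned on the true outcome $Y = 1$. Expressed as a difference, this yields:

$$\mathrm{EO} = \left| P(\hat{Y} = 1 \mid A = \text{disadvantaged}, Y = 1) - P(\hat{Y} = 1 \mid A = \text{privileged}, Y = 1) \right|$$

“Equality of opportunity” is satisfied if $\mathrm{EO} = 0$, and larger values are indicative of unfairness in the model or data.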
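As an illustration, the three measures can be computed from binary predictions with a few lines of standard-library Python. This is a sketch rather than the evaluation code of the paper: the function names are hypothetical, the encoding of the protected attribute (0 for the disadvantaged group, 1 for the privileged group) is an assumption made here, the toy data is invented for the example, and a ratio with a zero denominator is returned as NaN.

```python
def positive_rate(y_pred, keep):
    """Share of positive predictions among the entries where keep is True."""
    selected = [p for p, k in zip(y_pred, keep) if k]
    return sum(selected) / len(selected)   # assumes the selection is non-empty


def demographic_parity(y_pred, a, disadvantaged=0, privileged=1):
    rate_d = positive_rate(y_pred, [g == disadvantaged for g in a])
    rate_p = positive_rate(y_pred, [g == privileged for g in a])
    return abs(rate_d - rate_p)


def demographic_parity_ratio(y_pred, a, disadvantaged=0, privileged=1):
    rate_d = positive_rate(y_pred, [g == disadvantaged for g in a])
    rate_p = positive_rate(y_pred, [g == privileged for g in a])
    return rate_d / rate_p if rate_p > 0 else float("nan")


def equal_opportunity(y_true, y_pred, a, disadvantaged=0, privileged=1):
    # Same difference as DP, but restricted to truly positive individuals.
    rate_d = positive_rate(
        y_pred, [g == disadvantaged and t == 1 for g, t in zip(a, y_true)])
    rate_p = positive_rate(
        y_pred, [g == privileged and t == 1 for g, t in zip(a, y_true)])
    return abs(rate_d - rate_p)


# Toy data: four individuals per group.
a      = [0, 0, 0, 0, 1, 1, 1, 1]
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 0, 1, 1, 0, 0]

dp = demographic_parity(y_pred, a)            # |0.25 - 0.5|
dpr = demographic_parity_ratio(y_pred, a)     # 0.25 / 0.5
eo = equal_opportunity(y_true, y_pred, a)     # |0.5 - 1.0|
violates_80_percent_rule = dpr < 0.8
```

On this toy data the disadvantaged group receives positive predictions half as often as the privileged group, so the ratio falls below the 0.8 threshold of the 80% rule.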

5.3 Results

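| Dataset | Model | Accuracy | $F_1$ | DP | DPR | EO |
|---|---|---|---|---|---|---|
| Adult | Baseline without fairness constraints | 0.839 ± 0.009 | 0.763 | 0.173 | 0.296 | 0.096 |
| Adult | GRL | 0.612 ± 0.012 | 0.518 | 0.059 | 1.931 | 0.061 |
| Adult | NBF (NB) † [calders2010] | 0.773 | | 0.000 | | |
| Adult | NBF (EM) † [calders2010] | 0.801 | | 0.001 | | |
| Adult | Grad-Pred † [raff2018] | 0.754 | | 0.000 | | |
| Adult | FF † [raff2018fair] | 0.753 | | 0.000 | | |
| Adult | LFR † [zemel2013learning] | 0.702 | | 0.001 | | |
| Adult | Ours | 0.814 ± 0.009 | 0.689 | 0.031 | 0.784 | 0.179 |
| German Credit | Baseline without fairness constraints | 0.705 ± 0.063 | 0.624 | 0.018 | 0.929 | 0.198 |
| German Credit | GRL | 0.710 ± 0.063 | 0.415 | 0.000 | * | 0.000 |
| German Credit | Grad-Pred † [raff2018] | 0.675 | | 0.001 | | |
| German Credit | FF † [raff2018fair] | 0.700 | | 0.000 | | |
| German Credit | LFR † [zemel2013learning] | 0.591 | | 0.004 | | |
| German Credit | Ours | 0.730 ± 0.062 | 0.640 | 0.006 | 0.971 | 0.175 |
| COMPAS | Baseline without fairness constraints | 0.715 | 0.709 | 0.466 | 2.192 | 0.449 |
| COMPAS | GRL | 0.567 | 0.549 | 0.057 | 0.926 | 0.114 |
| COMPAS | COMPAS | 0.655 ± 0.029 | 0.654 | 0.289 | 1.829 | 0.000 |
| COMPAS | Preference-based fairness † [zafarParityPreferencebasedNotions2017] | 0.675 | | 0.380 | | |
| COMPAS | Ours | 0.794 | 0.793 | 0.026 | 0.840 | 0.008 |

Table 1: Results on the three datasets. An obelisk (†) marks results reported by the original papers. Results of classifiers without fairness constraints are reported as a baseline. Best results are in bold typeface. An asterisk (*) indicates a division by zero.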

Table 1 presents our results on the three datasets. We compare them with (i) a naive baseline, i.e., the same architecture without any particular control on fairness aspects, (ii) a re-implementation of the GRL [raff2018, adel2019, ganin2016domain] and (iii) the reported results from other works that incorporate fairness and cover a wide range of learning algorithms: Naive Bayes [calders2010], random forests [raff2018fair], SVMs [zafarParityPreferencebasedNotions2017] and neural networks [raff2018, zemel2013learning]. The models’ utility was evaluated by binary classification accuracy and macro-averaged $F_1$ score; the latter highlights some issues when dealing with class imbalances, as is the case for German Credit. Fairness is evaluated with demographic parity, both as an absolute difference (DP) and as a ratio (DPR), and equal opportunity (EO).

Adel et al. [adel2019] also report results on both COMPAS and Adult but use a different setup for the Adult dataset. For COMPAS, their reported results (as well as their unfair baseline) are significantly higher than in our experiments; we could replicate them only when classifying high-risk individuals. To make a meaningful comparison, we also include our replication of FAD [adel2019] as GRL.

The utility of our framework is the highest on the German Credit and COMPAS datasets, even surpassing the baseline model. On Adult, we achieve the highest utility of any model with fairness constraints. These results show that our model has only a very limited impact on the utility of the classifier, and it can even contribute to the training, as is also visualised in Figure 3. Note that on German Credit, a majority classifier would already achieve 70% accuracy, hence the inclusion of the F1 score.
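The remark about the majority classifier can be made concrete with a short calculation. The sketch below assumes a 700/300 class split, matching the 70% figure above, and adopts the convention that a class which is never predicted gets an F1 of zero; both the helper names and that convention are assumptions of this example.

```python
def f1_from_counts(tp, fp, fn):
    """F1 for one class from its confusion counts; 0.0 if the class is never hit."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def majority_classifier_scores(n_major, n_minor):
    """Accuracy and macro-averaged F1 of a model that always predicts the majority class."""
    accuracy = n_major / (n_major + n_minor)
    # Majority class: every majority instance is a true positive and every
    # minority instance is a false positive.
    f1_major = f1_from_counts(tp=n_major, fp=n_minor, fn=0)
    # Minority class: never predicted, so every minority instance is a false negative.
    f1_minor = f1_from_counts(tp=0, fp=0, fn=n_minor)
    return accuracy, (f1_major + f1_minor) / 2

if __name__ == "__main__":
    acc, macro_f1 = majority_classifier_scores(700, 300)
    print(f"accuracy={acc:.3f}, macro F1={macro_f1:.3f}")
```

With this split, the accuracy is 0.70 while the macro-averaged F1 is only about 0.41, which is why accuracy alone would overstate the quality of a degenerate classifier on an imbalanced dataset.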

Regarding fairness evaluation, our framework gives the best results for COMPAS when considering DP. It also increases fairness as measured by DPR, which is the only one of the considered measures that indicates the “direction” of unfairness. More fairness is sometimes given by an increase towards parity (DPR=1) for the disadvantaged group: for the German Credit dataset, their chances of getting a loan increase. In COMPAS, the “bias against blacks” [angwinMachineBiasThere2016] decreases from a probability of recidivism prediction that is more than twice as high as for white people. Here, the near-equality of the GRL (DPR of 0.926) appears fairer than the “opposite unfairness” of our further reduced DPR value (0.840).

Figure 3: Fairness and utility measures after each attack iteration on COMPAS (batch size of 1024, epochs=100, 50 adversarial points per iteration)

Figure 3 also highlights the effect of the adversarial fraction in the training dataset on COMPAS. When adversarial examples (equivalent to 25% of the training set size) are added to the training set, the utility is maximal. With higher fractions, the utility decreases and the development of the DP ratio fluctuates. This could stem from the minimax formulation, where a small fraction (i.e., 25%) helps optimize better for this saddle point, but higher fractions only add noise.
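To illustrate what an adversarial fraction means in practice, the following sketch augments a training set with a number of adversarial examples equal to a chosen fraction of its original size. It is only an illustration under stated assumptions: the function `augment_with_adversarial`, its parameters and the random sampling from a pre-computed pool are hypothetical, and the attack that produces the pool is not shown.

```python
import numpy as np

def augment_with_adversarial(X, y, X_adv, y_adv, fraction=0.25, seed=0):
    """Append adversarial examples amounting to `fraction` of the original set size.

    X, y         : original training features and labels
    X_adv, y_adv : pool of adversarial examples (assumed to be generated elsewhere)
    fraction     : adversarial examples to add, relative to len(X)
    """
    rng = np.random.default_rng(seed)
    # Never request more examples than the pool contains.
    n_add = min(int(round(fraction * len(X))), len(X_adv))
    idx = rng.choice(len(X_adv), size=n_add, replace=False)
    X_new = np.concatenate([X, X_adv[idx]])
    y_new = np.concatenate([y, y_adv[idx]])
    return X_new, y_new

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X, y = rng.normal(size=(100, 3)), rng.integers(0, 2, size=100)
    X_adv, y_adv = rng.normal(size=(40, 3)), rng.integers(0, 2, size=40)
    X_new, y_new = augment_with_adversarial(X, y, X_adv, y_adv, fraction=0.25)
    print(X_new.shape, y_new.shape)
```

Retraining on the augmented set for several values of `fraction` and recording utility and fairness after each run would yield curves of the kind discussed for Figure 3.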

6 Summary, conclusions and future work

In this paper, we presented a novel architecture for integrating fairness constraints in machine learning models. Our architecture consists of two adversaries: (i) an adversarial reader that evaluates fairness constraints during model training and attempts to enforce them, and (ii) an adversarial feeder that performs iterative evasion attacks to discover previously uncovered regions in the input space. We evaluated our architecture on three well-studied datasets and showed that it can deliver high utility to models while satisfying fairness constraints. On COMPAS, we illustrated that our architecture yields a model that surpasses an unfair baseline regarding utility (accuracy and F1 score), whilst giving better fairness guarantees. We provide evidence that gradient reversal alone is not sufficient (it might even be detrimental) but that our combination of adversaries leads to intrinsically fairer models.

There is room for future work. First, we may optimize the runtime execution of the technique via faster learning of surrogate models. Second, we could use the target model directly instead of a surrogate classifier to support adversarial attacks and assess if transferability properties hold for fairness constraints. This requires heavyweight modification of the secML framework to allow multiple output values in neural networks. Third, while we do not generate invalid instances, one could define constraints involving multiple features: e.g., a 4-year-old child cannot have a Ph.D. Enforcing these domain-specific constraints during attack generation raises questions on the representation of the feature space and optimal convergence of the algorithms. Finally, we would like to generate the most dissimilar examples possible to ensure good coverage of the unseen feature space with a minimal number of attacks.


Pieter Delobelle was supported by the Research Foundation - Flanders under EOS No. 30992574 and received funding from the Flemish Government under the “Onderzoeksprogramma Artificiële Intelligentie (AI) Vlaanderen” programme. Gilles Perrouin is an FNRS Research Associate. This research was partly supported by EOS Verilearn project grant no. O05518F-RG03. We also want to thank the secML developers from the PRALab (Pattern Recognition and Applications Laboratory, University of Cagliari, Sardegna, Italy) for having answered our numerous questions and helping us in using their newly developed library.