Measuring Fairness in Generative Models

07/16/2021 · by Christopher T. H. Teo, et al.

Deep generative models have made much progress in improving training stability and quality of generated data. Recently there has been increased interest in the fairness of deep-generated data. Fairness is important in many applications, e.g. law enforcement, as biases will affect efficacy. Central to fair data generation are the fairness metrics for the assessment and evaluation of different generative models. In this paper, we first review fairness metrics proposed in previous works and highlight potential weaknesses. We then discuss a performance benchmark framework along with the assessment of alternative metrics.




1 Introduction

Generative models have been well researched since the introduction of the Variational Autoencoder (VAE) (Kingma and Welling, 2014) and the Generative Adversarial Network (GAN) (Goodfellow et al., 2014; Goodfellow, 2017). Focusing on GANs, much research has targeted improving model architecture and performance, e.g. StyleGAN (Karras et al., 2019) and BigGAN (Brock et al., 2019), or finding solutions to stability issues, e.g. balancing the discriminator and generator and avoiding mode collapse (Metz et al., 2017; Gulrajani et al., 2017; Salimans et al., 2016; Tran et al., 2018, 2019), improving data efficiency (Tran et al., 2021), and the detectability of deep-generated images (Chandrasegaran et al., 2021). However, to our knowledge, little research has gone into addressing biases. This is an important factor to consider, as it limits the potential of generative models. For instance, in a GAN application for the facial composition of criminal profiles (Jalan et al., 2020), gender bias may result in wrongful profiling.

In this work, we take a closer look at the fairness metrics for evaluating deep generative models. A deep generative model G produces synthetic data x, where x follows the model distribution imposed by the generator network. In many cases, x can be biased w.r.t. some targeted attribute a. Here, a is a one-hot vector representation of the targeted attribute, and k is the cardinality of the attribute, e.g. k = 2 if a corresponds to gender, or k = 4 if a is a compound attribute corresponding to gender and two different hair colours.

In many cases, a is a latent attribute. Therefore, to evaluate the fairness of the generated data w.r.t. a, one would need to determine the attribute value through an attribute classifier C (Grover et al., 2019; Choi et al., 2020; Tan et al., 2020). In particular, for observed samples x, the attribute classifier C produces a soft output p̂ over the k attribute values. Then, some discrepancy measure D between p̂ and the uniform probability vector p̄ = [1/k, ..., 1/k] can be used to quantify the fairness of the generated data. As per previous works (Choi et al., 2020; Xu et al., 2018; Tan et al., 2020), we utilise a uniform distribution to determine if the model has the ability to achieve statistical parity (Caton and Haas, 2020), i.e. equal probability is given to each outcome. For example, in (Choi et al., 2020), the L2 norm is used to measure the fairness discrepancy (FD) and is given by:

    f = ||p̂ - p̄||_2    (1)

If f = 0, then the generative model is considered to be perfectly fair w.r.t. the targeted attribute a. On the other hand, taking a second look at (1), one may realise that f is highly dependent on C. Specifically, error in C is inevitable in practice. However, its effect on f has not been investigated. Even with a perfectly fair generator, we may not obtain f = 0 due to error in C.
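As a minimal sketch of (1) (the classifier outputs below are illustrative stand-ins, not the paper's ResNet-18 predictions), the FD score can be computed as:

```python
import numpy as np

def fairness_discrepancy(soft_outputs, k):
    """FD score of Eq. (1): the L2 distance between the average soft
    classifier output p_hat and the uniform vector p_bar = [1/k, ..., 1/k]."""
    p_hat = np.asarray(soft_outputs, dtype=float).mean(axis=0)
    p_bar = np.full(k, 1.0 / k)
    return float(np.linalg.norm(p_hat - p_bar))

# A perfectly fair batch: soft outputs are uniform on average.
print(fairness_discrepancy([[0.6, 0.4], [0.4, 0.6]], k=2))  # ~0.0
# An absolutely biased batch: every sample is assigned the first class.
print(fairness_discrepancy([[1.0, 0.0], [1.0, 0.0]], k=2))  # ~0.707
```
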

Related work: A few other works have also tried to quantify fairness in GANs. (Xu et al., 2018) measures statistical parity with an additional discriminator. (Tan et al., 2020), on the other hand, utilises a similar method to (Choi et al., 2020), with the addition of determining whether the contextual attributes of the generated images are maintained w.r.t. the population distribution. Because the method of (Xu et al., 2018) depends on the model's architecture, we focus on the works of (Choi et al., 2020; Tan et al., 2020), whose use of an auxiliary classifier is advantageous due to its ease of deployment.

To understand how different choices of D may affect the validity of the fairness score under errors in C, we conduct an empirical study in this work. The main idea of our study is as follows. Given a deep generative model (biased or fair w.r.t. a) and an attribute classifier C, we examine different choices of discrepancy measure D. We compute the score f = D(p̂, p̄) and compare it with the ground-truth discrepancy value f* = D(p, p̄), where p is the attribute distribution obtained with an attribute classifier C* of perfect accuracy. The difference between f and f* indicates the validity of the selected D under attribute classification errors in C.

Our contributions are:

  • Identification of the effects of inaccuracies in C on fairness metrics.

  • A methodological framework to analyse and quantify the differences in fairness metrics for generative models, and a recommendation of a robust metric.

2 Problem Setup and Analysis

2.1 Experiment Setup

In our experiments, we utilise ResNet-18 (He et al., 2015) as our attribute classifier C, training it on different attribute and k configurations, with Adam as our optimiser. Suppose that for a dataset, the ground-truth distribution of some attributes is denoted by p. In order to evaluate the effect of inaccuracies in C on a given discrepancy measure D in a controlled setting, we first need to simulate a generator that generates data with an assumed attribute distribution p. To this end, we simply construct a dataset by sampling an available labelled dataset (e.g. CelebA (Liu et al., 2015)). The sampling lets us avoid concerning ourselves with the quality or diversity of the generated data. For example, if a corresponds to gender (k = 2) and p = [0.9, 0.1], given that the dataset has 100 samples for each of male and female, we randomly sample 90 males and 10 females. Then we apply the attribute classifier C and calculate the approximated distribution p̂, which can be used to calculate f = D(p̂, p̄).

Lastly, to mimic the perfect classifier C*, we forgo the classification and calculate the distribution directly from the ground-truth labels, i.e. p̂* = p, and hence f* = D(p, p̄). The difference |f - f*| thus indicates the deviation of the fairness score as a result of inaccuracies in C.
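A small simulation in this spirit (a sketch under assumed settings: a synthetic label-corrupting classifier stands in for the trained ResNet-18, and the L2-based FD of (1) stands in for D):

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_fd(p, accuracy, n=20_000):
    """Draw attribute labels from the assumed distribution p, corrupt a
    (1 - accuracy) fraction with uniform random guesses, and return the
    measured score f (imperfect C) alongside the ideal score f* (perfect C*)."""
    k = len(p)
    p = np.asarray(p, dtype=float)
    labels = rng.choice(k, size=n, p=p)          # ground-truth attribute values
    wrong = rng.random(n) > accuracy             # samples the classifier corrupts
    noise = rng.integers(0, k, size=n)           # its (random) guesses
    preds = np.where(wrong, noise, labels)       # imperfect classifier C
    p_hat = np.bincount(preds, minlength=k) / n  # approximated distribution
    p_bar = np.full(k, 1.0 / k)
    f = np.linalg.norm(p_hat - p_bar)            # score under imperfect C
    f_star = np.linalg.norm(p - p_bar)           # score under perfect C*
    return f, f_star

f, f_star = simulate_fd([0.9, 0.1], accuracy=0.8)
print(abs(f - f_star))  # deviation caused purely by classifier error
```

Consistent with the pinching effect discussed in Section 2.2, the noisy classifier pulls the biased distribution towards uniform, so f underestimates f*.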

2.2 Analysis of Previous Works

We emphasise a few key requirements that are necessary for effective FD measurement: 1) the attribute a has to be well defined such that a fair representation is achievable; 2) C has to have relatively high accuracy, with low accuracy variability between the various attribute values; 3) an appropriate metric D has to be utilised such that the FD score is robust against noise, such as inaccuracies in C and the selection of hyper-parameters, e.g. k. We focus on criterion 3 in this paper and highlight a few properties that need to be addressed.

We first clarify a few notations and terms. The extreme points (EP) of the attribute distribution include the ideal worst-case bias scenario, which we call absolutely biased (AB-EP), and the ideal fair scenario (Fair-EP). Using the previous gender example (k = 2), there are two AB-EPs, [1, 0] and [0, 1], and the Fair-EP is [0.5, 0.5]. Following this, the largest possible fairness discrepancy score achievable with D is denoted by D_max, which is the value of D at the AB-EPs. Utilising our previous example, D_max = D(p_AB, p̄), where p_AB = [1, 0]. Note that D(p̄, p̄) = 0. With that, we highlight a few weaknesses of past works with reference to FD (1) and propose solutions that would mitigate these weaknesses.

a) Scale: The current metrics do not have a consistent upper bound, thereby making experimental comparison difficult. For example, as the attribute size k increases, D_max changes, e.g. for L1, D_max = 0.5 when k = 2 but D_max = 0.375 when k = 4. Hence, we require a metric that has a fixed scale which does not vary with hyper-parameter changes.
Solution: As an easy fix, we propose to normalise each of our proposed metrics with its D_max, such that the FD score lies in [0, 1]. As per the previous example, given a measured raw score f, our normalised score is f / D_max. Note that each metric at each k has a different normalisation factor D_max, as per Annex A.1 Table 2. This normalisation fixes all metrics' scales to [0, 1], allowing comparison between them. We thus utilise this normalisation technique for all subsequent FD scores, where all f and f* can be assumed to have been normalised.
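The normalisation step can be sketched as follows (using the normalised Manhattan distance as D; its D_max values match Annex A.1 Table 2):

```python
import numpy as np

def l1_dist(p, q):
    """Normalised Manhattan distance (L1)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.abs(p - q).sum() / len(p)

def normalised_fd(metric, p_hat):
    """Divide the raw score D(p_hat, p_bar) by D_max = D(AB-EP, p_bar) so
    that every metric, at every k, reports a score in [0, 1]."""
    k = len(p_hat)
    p_bar = np.full(k, 1.0 / k)
    ab_ep = np.zeros(k)
    ab_ep[0] = 1.0                    # an absolutely biased extreme point
    d_max = metric(ab_ep, p_bar)      # normalisation factor (Table 2)
    return metric(p_hat, p_bar) / d_max

print(normalised_fd(l1_dist, [1.0, 0.0]))   # 1.0 at AB-EP
print(normalised_fd(l1_dist, [0.25] * 4))   # 0.0 at Fair-EP
```
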

b) Inaccuracy in C: The imperfect C presents a challenging problem, as its inaccuracies are propagated to the FD scores, making the measure unreliable. We demonstrate this by comparing the normalised FD scores of our proposed metrics, introduced later in Section 3, on four different C of various accuracies and configurations. These were trained on the CelebA dataset on the attribute sets: 1) gender; 2) youth; 3) male and black hair; 4) young and smiling.
Pinching effect: In Fig. 1 and 2 we observe that a decrease in the accuracy of C results in a "pinching effect" whereby the score at Fair-EP increases and the score at AB-EP decreases. This causes deviation from the theoretical optimal scores of 0 and 1. This is in line with our intuition that a less accurate C tends towards random classification, resulting in p̂ having a more uniform distribution.
Internal variability: Next, there exists internal variability, where different attribute values have different classification accuracies; e.g. in Fig. 1, the first classifier on the left has accuracies of 0.98 and 0.95 for attributes [1,0] and [0,1] respectively. These varying accuracies form a bias that propagates to the FD score. Thus distributions of the same shape but on different supports may produce different scores; e.g. in Fig. 1, each AB-EP measures a different score even though all are equally biased. This internal variability worsens as k increases, due to the increased difficulty of training C (see Annex A). Hence, we require a metric that is robust to these inaccuracies and whose measurements are close to f*.

Figure 1: The effect of the accuracies of C on different fairness metrics at AB-EP. X-axis sections: different C as per (2.2b). Row 1: AB-EPs with different k. Row 2: accuracies.

Figure 2: The effect of the accuracies of C on fairness metrics at Fair-EP. X-axis sections: different C as per (2.2b). Row 1: Fair-EPs with different k. Row 2: accuracies.

3 Proposed Metrics

Generative models cannot utilise the traditional fairness measurements for classifiers, e.g. Equalised Odds, Equalised Opportunity (Hardt et al., 2016) and Demographic Parity (Feldman et al., 2015), as a result of their different objectives. Instead, we can evaluate the fairness metric as a problem of measuring similarity between the probability distributions p̂ and p̄, i.e. D(p̂, p̄). Similarity refers to how the shapes of the distributions resemble one another. In this section, we explore the various D utilised to describe this similarity. (Charfi et al., 2020) differentiate these similarity measures into two categories, amorphic and morphic.

We use the following notation: p_i and q_i denote the probability of the i-th outcome in the distributions p and q, e.g. if p = [0.9, 0.1], then p_1 = 0.9 is the probability of the first outcome. Amorphic metrics are direct "point-to-point" measurements between distributions, e.g. the Normalised Manhattan Distance (L1), the Normalised Euclidean Distance (L2) and the Wasserstein Distance (WD)/Earth-mover Distance:

    L1(p, q) = (1/k) * sum_i |p_i - q_i|    (2)

    L2(p, q) = (1/k) * sqrt( sum_i (p_i - q_i)^2 )    (3)

    WD(p, q) = sum_i |P_i - Q_i|, where P_i and Q_i are the cumulative sums of p and q    (4)

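A sketch of these amorphic measures (the 1/k scaling on L1 and L2 follows the normalisation factors listed in Annex A.1 Table 2; the WD form is the standard one for 1-D discrete distributions on an ordered support):

```python
import numpy as np

def l1(p, q):
    """Normalised Manhattan distance: mean absolute point-wise gap."""
    return np.abs(np.asarray(p) - np.asarray(q)).sum() / len(p)

def l2(p, q):
    """Normalised Euclidean distance."""
    return np.sqrt(((np.asarray(p) - np.asarray(q)) ** 2).sum()) / len(p)

def wd(p, q):
    """1-D Wasserstein / earth-mover distance between discrete distributions
    on an ordered support: total gap between the cumulative sums."""
    return np.abs(np.cumsum(p) - np.cumsum(q)).sum()

p_ab = np.array([1.0, 0.0])    # absolutely biased extreme point, k = 2
p_bar = np.array([0.5, 0.5])   # uniform vector (Fair-EP)
print(l1(p_ab, p_bar))   # 0.5
print(l2(p_ab, p_bar))   # ~0.3536
print(wd(p_ab, p_bar))   # 0.5
```
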
On the other hand, morphic metrics describe the shape of a distribution by quantifying relevant information in it. For instance, the specificity measurement (5) (Charfi et al., 2020) describes the variability in a distribution, and (6) then gives the similarity measure as the difference between the two distributions' variability. For simplicity, we refer to this metric as specificity:

    Sp(p) = p_(1) - (1/(k-1)) * sum_{i=2..k} p_(i), where p_(1) >= ... >= p_(k) are the entries of p in descending order    (5)

    Spec(p, q) = |Sp(p) - Sp(q)|    (6)

A hybrid measure also exists that takes a combination of morphic and amorphic metrics: information specificity (IS) (7) (Charfi et al., 2020) utilises a combination of L1's amorphic point-to-point measurement, which determines the displacement between the two distributions, and specificity, which measures the difference in the distributions' variability. We set α = 0.5 for (7) in our experiments.

    IS(p, q) = α * L1(p, q) + (1 - α) * Spec(p, q)    (7)

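A sketch of the morphic and hybrid measures. The specificity formula below is a Yager-style reconstruction (an assumption on our part), chosen because it reproduces the normalisation factors of Annex A.1 Table 2 (D_max = 1 for specificity at every k, and 0.6875 for IS at k = 4 with α = 0.5):

```python
import numpy as np

def specificity(p):
    """Yager-style specificity (our reconstruction): 1 for a one-hot
    (fully specific) vector, 0 for the uniform vector."""
    s = np.sort(np.asarray(p, dtype=float))[::-1]   # descending order
    return s[0] - s[1:].sum() / (len(s) - 1)

def spec_dist(p, q):
    """Morphic similarity measure: gap between the two specificities."""
    return abs(specificity(p) - specificity(q))

def info_spec(p, q, alpha=0.5):
    """Hybrid information specificity: alpha-weighted mix of the amorphic
    L1 term and the morphic specificity term."""
    l1 = np.abs(np.asarray(p) - np.asarray(q)).sum() / len(p)
    return alpha * l1 + (1 - alpha) * spec_dist(p, q)

p_ab = np.array([1.0, 0.0, 0.0, 0.0])
p_bar = np.full(4, 0.25)
print(spec_dist(p_ab, p_bar))   # 1.0    (matches Table 2, k = 4)
print(info_spec(p_ab, p_bar))   # 0.6875 (matches Table 2, k = 4)
```
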
We did not consider the KL-divergence used in (Tan et al., 2020), as it resulted in computational problems when the supports of p̂ and p̄ were different, e.g. at AB-EP. In addition, its non-symmetric nature poses other problems beyond the scope of this paper.

3.1 Performance Benchmark

We utilise the following benchmarks to identify the ideal metric, i.e. one that is robust against inaccuracies in C and hyper-parameter changes.

Mean extreme point error (MEPE), (8) and (9), measures the metrics' deviations from the theoretical boundaries, 0 and 1 respectively. Note that the error is averaged across multiple AB-EPs. S denotes a set of approximated AB-EP/Fair-EP distributions classified by C, where |S| is the number of Fair-EPs/AB-EPs in the set. For example, for k = 2 and 4, when calculating MEPE_AB we utilise S = {[1,0],[0,1]} and S = {[1,0,0,0],[0,1,0,0],[0,0,1,0],[0,0,0,1]}. We mix the errors across different k in order to find a metric independent of k.

    MEPE_fair = (1/|S|) * sum_{p̂ in S} |f(p̂) - 0|    (8)

    MEPE_AB = (1/|S|) * sum_{p̂ in S} |1 - f(p̂)|    (9)

Extreme point variability (EPV) (10) helps indicate the overall stability of the metrics by measuring the variability of the scores at Fair-EP/AB-EP, where f̄ denotes the average FD score.

    EPV = sqrt( (1/|S|) * sum_{p̂ in S} (f(p̂) - f̄)^2 )    (10)

Mean error measurement (MEM) (11) determines how far, on average, the approximated score utilising C deviates from the theoretical ideal score. S and S* denote the sets of approximated and ground-truth sample distributions respectively. This metric thus indicates the overall effect that external factors, e.g. the classifier's accuracy, have on the FD score.

    MEM = (1/|S|) * sum_i |f(p̂_i) - f*(p_i)|    (11)

Note that (8), (9) and (10) are calculated across k = {2, 4, 8, 16} in our experiments.
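A minimal sketch of the three benchmarks, assuming they take the plain mean-absolute-error and standard-deviation forms described above (the scores below are illustrative, not the paper's measurements):

```python
import numpy as np

def mepe(scores, boundary):
    """Mean extreme point error: mean absolute deviation of normalised FD
    scores from the theoretical boundary (0 at Fair-EP, 1 at AB-EP)."""
    return np.abs(np.asarray(scores) - boundary).mean()

def epv(scores):
    """Extreme point variability: standard deviation of the FD scores
    measured at the Fair-EPs/AB-EPs."""
    return np.asarray(scores).std()

def mem(approx_scores, ideal_scores):
    """Mean error measurement: mean absolute gap between scores under the
    imperfect C and under the perfect C*."""
    return np.abs(np.asarray(approx_scores) - np.asarray(ideal_scores)).mean()

ab_scores = [0.95, 0.88, 0.91]          # illustrative normalised AB-EP scores
print(mepe(ab_scores, boundary=1.0))    # mean shortfall from the ideal 1
print(epv(ab_scores))                   # spread across the AB-EPs
print(mem([0.12, 0.05], [0.0, 0.0]))    # 0.085
```
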

4 Experiments

4.1 Experiment Setup

Next, we conduct the experimental evaluation of the various metrics. We again utilise CelebA to train the next set of classifiers C for the remaining experiments. The attributes {gender, black hair, smiling, bangs} are incrementally combined to train four different C of increasing k. Without loss of generality, the attributes are binary, so k scales exponentially after permutations. In 4.2 and 4.3, we measure the fairness metrics' response to varying k = {2, 4, 8, 16}. We first analyse the fairness metrics at their EPs, and then vary the distribution from AB-EP to Fair-EP, generated with Algorithm 1.

Figure 3: k = 8. Normalised fairness score against a sweeping distribution from AB-EP to Fair-EP

Figure 4: k = 8. Normalised error between the ideal score f* and the measured score f, against a sweeping distribution from AB-EP to Fair-EP


while current_dist != uniform_dist do
       if current_dist[index] != uniform_dist[index] then
              move a step of probability mass at current_dist[index] towards uniform_dist[index]
       end if
end while
Algorithm 1 Distribution sweep from AB-EP to Fair-EP
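One simple realisation of Algorithm 1 (a sketch: the exact step size is not specified here, so we linearly interpolate between the two extreme points, which preserves the endpoints and makes each epoch more uniform than the last):

```python
import numpy as np

def distribution_sweep(k, epochs):
    """Yield one distribution per fairness epoch, sweeping from the AB-EP
    one-hot [1, 0, ..., 0] to the Fair-EP uniform vector."""
    ab_ep = np.zeros(k)
    ab_ep[0] = 1.0
    uniform = np.full(k, 1.0 / k)
    for t in np.linspace(0.0, 1.0, epochs):
        yield (1.0 - t) * ab_ep + t * uniform   # each epoch is more uniform

sweep = list(distribution_sweep(k=4, epochs=5))
print(sweep[0])    # [1. 0. 0. 0.]           (AB-EP)
print(sweep[-1])   # [0.25 0.25 0.25 0.25]   (Fair-EP)
```
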
Benchmark                    L2      L1      IS      Specificity  WD
Mean Extreme Point Error
  (Fair-EP)                  0.0588  0.0851  0.0361  0.0276       0.0851
  (AB-EP)                    0.2106  0.1884  0.2273  0.2338       0.1884
Extreme Point Variability
  (Fair-EP)                  0.0349  0.0657  0.0207  0.0168       0.0657
  (AB-EP)                    0.1058  0.0698  0.1149  0.1212       0.0698
Mean Error Measurement
  (2-attr sweep)             0.0184  0.0184  0.0184  0.0184       0.0184
  (4-attr sweep)             0.0884  0.0623  0.0919  0.1034       0.0623
  (8-attr sweep)             0.1386  0.0798  0.1535  0.1727       0.0798
  (16-attr sweep)            0.1761  0.1155  0.19    0.2031       0.1155
Table 1: Summary of the metric performances according to our performance benchmark. A lower score is better for all benchmarks. Green indicates the best results and red the worst.

4.2 Extreme Points Analysis

Fair-EP vs AB-EP: Overall, we observe as per Table 1 that Fair-EP scores are closer to the theoretical value, i.e. have lower error, than the AB-EP scores. This occurs due to trend-line effects, which we discuss in 4.3, as well as the internal variability discussed in 2.2, where a particular attribute value may have poorer accuracy and hence larger error in its AB-EP score (e.g. Fig. 1). However, when measuring Fair-EP, any poor estimation by C is averaged across all permutations of a, since the uniform distribution is sampled. This contributes to the lower error measurement at Fair-EP, and hence a better estimation of the Fair-EP, with lower error than at AB-EP, as per Table 1. We call this the internal variability effect (see Annex D).

Figure 5: Mean normalised score for different at AB-EP. Theoretical optimum score=1, hence a larger score is better

Figure 6: Overall variability of normalised scores at AB-EP for k = {2, 4, 8, 16}

AB-EP Analysis: We use quantitative analysis to study the metrics' FD scores and their variability at AB-EP, as seen in Fig. 5 and 6 respectively. Specificity performs the worst, with the largest deviation from the theoretical score regardless of k. It also has the highest variability, making it the least consistent fairness metric at AB-EP. Conversely, WD and L1 performed the best, with the lowest error and variability. As the k used is relatively small, WD and L1 are generally identical to one another when the scores are rounded to 4 d.p.

We identify the advantages of the L1 metric in comparison to L2, specificity and, by extension, IS. L1's lower variability is attributed to the simplicity of the metric's linear scale, making it less susceptible to noise from volatility in C. Specificity, on the other hand, measures variability in the distribution and is therefore more sensitive to the noise prevalent at AB-EP. This drives the larger variance in its score, as well as its score deviations from the theoretical optimum. This volatility is further amplified by the decrease in C's accuracy as k increases. To validate this, Annex A Fig. 11 shows specificity having the greatest increase in error as k increases.

Fair-EP Analysis: Utilising Fig. 7 and 8, the metric performances at Fair-EP are the converse of AB-EP: L1/WD perform the worst and specificity the best. Amorphic metrics have smaller D_max values, so their fairness scores are largely scaled up during normalisation, resulting in relatively larger fairness scores for L1, L2 and WD, and hence larger error. This scaling effect is amplified with increasing k; see Annex A.1 Table 2 for all the normalisation factors. Specificity, on the other hand, has D_max = 1 regardless of k, hence no scaling occurs, thereby attaining the lowest error. Furthermore, as per the internal variability effect, average accuracy is generally higher at the Fair-EP, making specificity more stable and thereby attaining the lowest variability.

Figure 7: Mean normalised score for different at Fair-EP. Theoretical optimum score=0, hence a smaller score is better

Figure 8: Overall variability of normalised scores at Fair-EP for k = {2, 4, 8, 16}

4.3 Metric Trend line Analysis

The trend-line analysis studies the behaviour of the metrics as the distribution becomes fairer. Each increment in the fairness epoch tends towards a more uniform distribution. Fig. 3 shows that the metrics' general trends follow our expectation: the highest score is seen at the beginning and decreases with each epoch. Overall, L1/WD performed the best with the lowest MEM score, and specificity the worst. A few interesting observations are seen in Fig. 4: 1) all metrics begin with a large error that decreases with each fairness epoch; and 2) L1 shows a momentary increase in fairness score from epoch 38 to 52.

General reduction in error: The initial large error seen in Fig. 4 at epoch 1 is the result of inaccuracy in C. This is similar to the observation discussed in the AB-EP analysis of 4.2. Additionally, we observed a decrease in the trend-line gradient, i.e. the gradient deviates from that of the ideal trend, as seen in Annex C Fig. 14. The change in gradient then results in gradual convergence towards the ideal scores with each increment in fairness epoch, i.e. a reduction in error. The same trend holds for k = 4 and 16 (see Annex C), where the former experiences less deviation as a result of its more accurate C, and the converse for the latter. Furthermore, we note that when starting at different AB-EPs the convergence rates differ: the AB-EP whose attribute has the lowest classification accuracy converges the slowest (see Annex D). Lastly, we notice that not all metrics converge at the same rate; we address this in the following.

L1 trend abnormality: The increase in L1's fairness score is the result of the metric's ability to aggressively correct its error. We attribute this to L1's simple linear form, which is least susceptible to noise caused by inaccuracies in C and hence the quickest to correct. This inference is supported by the observation that specificity, the metric most sensitive to poor accuracy in C, has the largest error and slowest convergence, followed by IS, which is influenced by specificity, and finally L2, whose quadratic calculation creates larger deviations from the true measurement, as seen previously in 4.2. L1's aggressive correction makes it the most robust metric to poor accuracy in C. These trends are similarly observed for the other k, as per Annex C.

5 Future works, Conclusion and Metric Recommendation

Future works: our work defines fairness as a uniform distribution, with the motivation of data augmentation, where an even representation of, e.g., hair colour is ideal for balanced learning. However, this definition may vary: fairness could instead mean "following the population distribution", e.g. the racial demographics of a country, which may not be uniform. Future work could explore this domain.

To conclude, we have presented the existing problems in the current FD score and introduced a variety of morphic, amorphic and hybrid fairness metrics to help mitigate them. We introduced performance benchmarks to determine the ideal metric. Through experimentation, we observed that the L1/WD metric performed best at AB-EP and specificity at Fair-EP, as per Table 1. Furthermore, L1/WD proved to be the most robust, having the lowest MEM. As such, there is no single optimal metric that favours all components of our benchmark. Thus, we recommend the hybrid metric, IS, as the ideal middle ground. We also point out that it is unlikely that the generated distribution will tend towards AB-EP; hence, emphasis should be placed on Fair-EP, where IS performs well.


We would like to thank Milad Abdollahzadeh for his constructive comments on the manuscript.


  • A. Brock, J. Donahue, and K. Simonyan (2019) Large Scale GAN Training for High Fidelity Natural Image Synthesis. arXiv:1809.11096 [cs, stat] (en). External Links: 1809.11096 Cited by: §1.
  • S. Caton and C. Haas (2020) Fairness in Machine Learning: A Survey. arXiv:2010.04053 [cs, stat] (en). External Links: 2010.04053 Cited by: §1.
  • K. Chandrasegaran, N. Tran, and N. Cheung (2021) A closer look at fourier spectrum discrepancies for cnn-generated images detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7200–7209. Cited by: §1.
  • A. Charfi, S. Ammar Bouhamed, E. Bosse, I. K. Kallel, W. Bouchaala, B. Solaiman, and N. Derbel (2020) Possibilistic Similarity Measures for Data Science and Machine Learning Applications (en). Cited by: §3, §3, §3.
  • K. Choi, A. Grover, T. Singh, R. Shu, and S. Ermon (2020) Fair Generative Modeling via Weak Supervision. arXiv:1910.12008 [cs, stat]. External Links: 1910.12008 Cited by: §1, §1.
  • M. Feldman, S. Friedler, J. Moeller, C. Scheidegger, and S. Venkatasubramanian (2015) Certifying and removing disparate impact. arXiv:1412.3756 [cs, stat]. External Links: 1412.3756 Cited by: §3.
  • I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In Advances in Neural Information Processing Systems, Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K. Q. Weinberger (Eds.), Vol. 27, pp. 2672–2680. External Links: Link Cited by: §1.
  • I. Goodfellow (2017) NIPS 2016 Tutorial: Generative Adversarial Networks. arXiv:1701.00160 [cs] (en). External Links: 1701.00160 Cited by: §1.
  • A. Grover, J. Song, A. Agarwal, K. Tran, A. Kapoor, E. Horvitz, and S. Ermon (2019) Bias Correction of Learned Generative Models using Likelihood-Free Importance Weighting. arXiv:1906.09531 [cs, stat]. External Links: 1906.09531 Cited by: §1.
  • I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. Courville (2017) Improved Training of Wasserstein GANs. arXiv:1704.00028 [cs, stat] (en). External Links: 1704.00028 Cited by: §1.
  • M. Hardt, E. Price, and N. Srebro (2016) Equality of Opportunity in Supervised Learning. arXiv:1610.02413 [cs]. External Links: 1610.02413 Cited by: §3.
  • K. He, X. Zhang, S. Ren, and J. Sun (2015) Deep Residual Learning for Image Recognition. arXiv:1512.03385 [cs] (en). External Links: 1512.03385 Cited by: §2.1.
  • H. J. Jalan, G. Maurya, C. Corda, S. Dsouza, and D. Panchal (2020) Suspect face generation. In 2020 3rd International Conference on Communication System, Computing and IT Applications (CSCITA), Vol. , pp. 73–78. External Links: Document Cited by: §1.
  • T. Karras, S. Laine, and T. Aila (2019) A Style-Based Generator Architecture for Generative Adversarial Networks. arXiv:1812.04948 [cs, stat] (en). External Links: 1812.04948 Cited by: §1.
  • D. P. Kingma and M. Welling (2014) Auto-Encoding Variational Bayes. arXiv:1312.6114 [cs, stat]. External Links: 1312.6114 Cited by: §1.
  • Z. Liu, P. Luo, X. Wang, and X. Tang (2015) Deep Learning Face Attributes in the Wild. arXiv:1411.7766 [cs]. External Links: 1411.7766 Cited by: §2.1.
  • L. Metz, B. Poole, D. Pfau, and J. Sohl-Dickstein (2017) Unrolled Generative Adversarial Networks. arXiv:1611.02163 [cs, stat] (en). External Links: 1611.02163 Cited by: §1.
  • T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen (2016) Improved Techniques for Training GANs. arXiv:1606.03498 [cs] (en). External Links: 1606.03498 Cited by: §1.
  • S. Tan, Y. Shen, and B. Zhou (2020) Improving the Fairness of Deep Generative Models without Retraining. arXiv:2012.04842 [cs]. External Links: 2012.04842 Cited by: §1, §1, §3.
  • N. Tran, T. Bui, and N. Cheung (2018) Dist-gan: an improved gan using distance constraints. In ECCV, Cited by: §1.
  • N. Tran, V. Tran, B. Nguyen, L. Yang, and N. Cheung (2019) Self-supervised gan: analysis and improvement with multi-class minimax game. In Advances in Neural Information Processing Systems, H. Wallach, H. Larochelle, A. Beygelzimer, F. dAlché-Buc, E. Fox, and R. Garnett (Eds.), Vol. 32. External Links: Link Cited by: §1.
  • N. Tran, V. Tran, N. Nguyen, T. Nguyen, and N. Cheung (2021) On data augmentation for gan training. IEEE Transactions on Image Processing 30 (), pp. 1882–1897. External Links: Document Cited by: §1.
  • D. Xu, S. Yuan, L. Zhang, and X. Wu (2018) FairGAN: Fairness-aware Generative Adversarial Networks. arXiv:1805.11202 [cs, stat] (en). External Links: 1805.11202 Cited by: §1, §1.

Appendix A Metric Study

The following graph shows the variability of the accuracy between AB-EPs as the dimension k increases. As previously discussed, the variability begins to increase, in addition to the reduction in accuracy, as k increases. This indicates that C has an internal bias towards certain attributes.

Figure 9: Accuracy Variability among different on different attributes

With this internal bias, we further observe in Fig. 10 and 11 that the extreme-point errors increase with k. Additionally, we begin to see a clear distinction between the metrics. At Fair-EP, specificity has the least error regardless of k, whereas WD begins diverging with increased error. The converse is seen for AB-EP.

Figure 10: Extreme points Fair Error
Figure 11: Extreme points AB Error

a.1 Normalisation factor

The normalisation factors are given in Table 2, where the value in each cell represents the theoretical ceiling of each metric. The normalisation factor is an important addition to ensure that the fairness scores lie in [0, 1], thereby allowing the metrics to be compared directly to one another. Without it, the metric would have little meaning, as the theoretical ceiling changes with attribute size; this implies that the fairness score could be synthetically improved simply by increasing k.

Nonetheless, we are aware that the normalisation factor does create some problems. For example, at Fair-EP we observe that L1 achieves a lower raw score than specificity. However, as a result of the large difference in their normalisation factors, after normalisation L1's score is greater than specificity's, owing to L1's much smaller normalisation factor.

On the other hand, when the difference in normalisation factors is small, such as between L2 and L1, we observe less significant effects: L1's normalised score remains larger than L2's regardless of L2's smaller normalisation factor.

Attributes (k)   L2           L1         IS          Specificity  WD
2                0.353553391  0.5        0.75        1            0.5
4                0.216506351  0.375      0.6875      1            0.375
8                0.116926793  0.21875    0.609375    1            0.21875
16               0.060515365  0.1171875  0.55859375  1            0.1171875
Table 2: Normalisation factors

Appendix B Classifier’s attributes

Attributes                            Dimension of a (k)   Accuracy
Gender                                2                    0.98
Youth                                 2                    0.81
Male, black hair                      4                    0.83
Young, smiling                        4                    0.72
Table 3: Set 1 classifiers for accuracy analysis

Attributes                            Dimension of a (k)   Accuracy
Gender                                2                    0.98
Gender, black hair                    4                    0.86
Gender, black hair, smiling           8                    0.78
Gender, black hair, smiling, bangs    16                   0.66
Table 4: Set 2 classifiers for attribute-increment analysis

Appendix C Trendline Sweep

Figure 12: Normalised fairness score with sweeping distribution from AB-EP to Fair-EP at k=2
Figure 13: Error with sweeping distribution from AB-EP to Fair-EP at k=2
Figure 14: Ideal normalised fairness score with sweeping distribution from AB-EP to Fair-EP at k=4
Figure 15: Normalised fairness score with sweeping distribution from AB-EP to Fair-EP at k=4
Figure 16: Error with sweeping distribution from AB-EP to Fair-EP at k=4
Figure 17: Normalised fairness score with sweeping distribution from AB-EP to Fair-EP at k=16
Figure 18: Error with sweeping distribution from AB-EP to Fair-EP at k=16

Appendix D Internal variability

As mentioned in the trend-line sweep, we observe that the metrics generally trend downwards with increasing fairness in the distribution. We ran the trend-line sweep starting from each respective AB-EP, e.g. [1,0,0,0], [0,1,0,0], [0,0,1,0] and [0,0,0,1], and measured the standard deviation. As the fairness epoch increases, there is a decrease in the variability of the metric, which we deem the internal-variability effect. This observation is in addition to the general decrease in the fairness score on all metrics. The accompanying figures show the standard deviations and the individual metrics' normalised scores at each fairness epoch.

The internal-variability effect is further demonstrated by the error converging at a lower point for the AB-EP with the highest accuracy (0.87) compared to the lowest (0.65). Furthermore, looking at the L1 metric at the two extreme accuracies, we see the different impacts that the internal-variability effect has on the various extreme points: the AB-EP with the worst accuracy (shown in orange) exhibits a steep improvement in its error, whereas the AB-EP with the highest accuracy exhibits a slower, gradual improvement.

Figure 19: Standard deviation starting from 4 different AB-EPs to the uniform distribution
Figure 20: 4 extreme points trend line, L1 metric
Figure 21: 4 extreme points trend line, L2 metric
Figure 22: 4 extreme points trend line, Specificity metric
Figure 23: 4 extreme points trend line, IS metric
Figure 24: Standard deviation starting from 8 different AB-EPs to the uniform distribution
Figure 25: Max and min accuracy attribute sweep from AB-EP to Fair-EP
Figure 26: Max and min accuracy attribute sweep from AB-EP to Fair-EP for L1
Figure 27: Max and min accuracy attribute sweep from AB-EP to Fair-EP for L1