Fair Federated Learning for Heterogeneous Face Data

09/06/2021 ∙ by Samhita Kanaparthy, et al. ∙ IIIT Hyderabad

We consider the problem of achieving fair classification in Federated Learning (FL) under data heterogeneity. Most of the approaches proposed for fair classification require diverse data that represent the different demographic groups involved. In contrast, it is common for each client to own data that represents only a single demographic group. Hence the existing approaches cannot be adopted for fair classification models at the client level. To resolve this challenge, we propose several aggregation techniques. We empirically validate these techniques by comparing the resulting fairness metrics and accuracy on CelebA, UTK, and FairFace datasets.




1 Introduction

Federated Learning (FL), popularized by Google [8, 17], is gaining momentum. FL distributes the training process of a machine learning (ML) task across individual clients such that each client trains a local ML model for the task on its private dataset. A central aggregator combines these local models, through heuristics, to derive a generalizable global model [26]. This distribution of the training process has several advantages, including but not limited to: (i) minimizing data collection, (ii) reduction in overall training time and power consumption, and (iii) compliance for devices with lower computation capabilities. These advantages enable FL to facilitate various smartphone features such as automatic text completion, voice recognition, and face recognition. Recently, Apple Inc. has also improved its popular ‘Face ID’ feature through FL [21]. Motivated by these applications, this paper focuses on different classification tasks based on attractiveness, gender, and race in an FL setting with face images as the input data.

It is natural to assume that FL will soon be the popular choice for automation in various sectors. Currently, ML models find use in job recruitment [3], recidivism prediction [18], recommender systems [24], among others. Unfortunately, these models suffer from biased predictions. E.g., the authors in [9] highlight that an ML model tasked with ranking job applications tends to rank less qualified males higher than more qualified females. These biased, or unfair, predictions are undesirable and may even be catastrophic in some applications. The unfairness of an ML model is often correlated to the inherent bias in the training data. This bias corresponds to the lack of data samples belonging to particular demographic groups such as gender, ethnicity, age, etc. We refer to such a group as the sensitive attribute. What is more, training the models for competitive accuracies often aggravates the unfairness.

To quantitatively assert that an ML model is fair, researchers introduce notions of group fairness. These include Equalized Odds (EO) [6], Equality of Opportunity (EOpp) [2], and Accuracy Parity (AP) [32]. In the context of classifying the ‘attractiveness’ (or gender) of a face image with gender (or race) as the sensitive attribute, EO states that both the probability of predicting an unattractive face as attractive and the probability of predicting an attractive face as unattractive must be independent of gender. EOpp requires only that the probability of predicting an attractive face as unattractive is the same across genders. Lastly, AP states that the overall classification error must be equal across genders.

In the literature, several approaches satisfy these notions in the non-FL, centralized ML setting [15, 23, 25] and in the FL setting [30, 11, 13, 4, 19]. These methods reduce the fairness loss for a particular notion over training iterations. A critical assumption of these approaches is that the training data contains samples for every category of the sensitive attribute. If not, the training loss will not be bounded.

While the assumption is reasonable for classical ML models, the client’s data is often limited in an FL setting. It is highly likely that a client only has data samples belonging to a particular demographic group. E.g., a mobile application used in a particular geographical area will receive data representing that area. Such heterogeneous data may further amplify bias since it may consist of samples for a particular race (sensitive attribute). We believe that such data heterogeneity w.r.t. the sensitive attribute is a significant challenge when striving towards fairness-aware FL – as such, achieving fair classification in an FL setting with heterogeneous data forms the basis of this work.

Figure 1: Our Approach

Our Approach. We remark that the existing approaches cannot ensure fairness with heterogeneous data in an FL setting. This paper argues that fairness may be ensured through an appropriate aggregation heuristic for such an extreme case. More concretely, we propose four novel heuristics to control the fairness and accuracy trade-off. Our heuristics require that the aggregator has access to a small fraction of data, referred to as the validation set. This requirement is reasonable as typically the aggregator has limited starter data [5, 28]. With this, we propose the following heuristics. For these, we derive the accuracy values and fairness loss for a given fairness notion over the validation set.

  1. FairBest. Aggregator sends the model which provides the least fairness loss for a given fairness notion.

  2. FairAvg. Aggregator sends the weighted average of the first $k$ models (for a chosen $k$), sorted in increasing order of fairness loss.

  3. FairAccRatio. Aggregator sends the model with the highest ratio of accuracy to fairness loss.

  4. FairAccDiff. Aggregator sends the model with the highest value for the weighted difference between accuracy and fairness loss.

Figure 1 provides an overview of our approach. In summary, the following are our contributions.


  1. To the best of our knowledge, we are the first to observe that adopting existing approaches for fairness-aware FL is not feasible under data heterogeneity. This is because the client-specific fairness losses become unbounded (Section 4).

  2. We argue that altering the aggregation heuristic may ensure fairness while simultaneously maximizing the accuracy of the model in the FL setting. We provide four novel heuristics for the same (Section 5).

  3. We empirically validate the performance of our proposed heuristics, in terms of accuracy and the fairness notions EO, EOpp and AP (Section 6). More concretely, we consider the following classification tasks:

    • CelebA [14]: We consider the task of predicting whether the input face image is attractive or not. We let gender be the sensitive attribute.

    • UTK [31] and FairFace [7]: For these datasets, we consider the task of predicting the gender of an input face image. Here, we choose ethnicity (race) as the sensitive attribute.

2 Related Work

Federated Learning. Federated Learning has gained much attention in recent times. Google first introduced it for smartphone applications like automatic word prediction [28]. Many works in ML have since applied FL to address issues such as computational efficiency, data collection, and data privacy [30, 4, 10, 33]. FL has a wide range of applications in domains such as telecommunications [27], mobile applications [5], automotive [34], and IoT [22]. For most FL settings, FedAvg [16] has become the state of the art. However, FedAvg does not tackle the statistical heterogeneity inherent in FL and has been shown to diverge empirically. In [12], the authors propose an FL optimization framework to handle statistical data heterogeneity. However, fairness guarantees in the FL setting under data heterogeneity remain unexplored.

Fairness in ML. There are two primary directions towards achieving fairness-aware learning. The first type of work includes (i) pre-processing the training data to remove information about the sensitive attribute or (ii) post-processing the classifier to achieve fair prediction. For instance, the work in [7] studies the accuracy disparity of the prediction on a sensitive class by constructing a novel face dataset removing racial bias. The authors in [23] apply generative adversarial neural networks to generate fair data from the original training data and use the generated data to train the model. The second type of work incorporates fairness constraints into the classification model during optimization [25, 20, 29]. Popular among these is the Lagrangian Multiplier Method (LMM) [15, 25]. However, in this work, we observe that LMM cannot be adopted in FL settings with data heterogeneity. Towards this, we propose different aggregation heuristics to tackle fairness issues in FL under data heterogeneity.

3 Preliminaries

We consider a binary classification problem with the universal instance space $\mathcal{X}$, which is non-sensitive information; output label space $\mathcal{Y} = \{0, 1\}$; and sensitive attribute $\mathcal{A} = \{1, \dots, p\}$ for some finite $p$. The sensitive attribute represents demographic groups like ethnicity, gender or age. Each attribute has multiple finite categories. E.g., gender may comprise male and female, and ethnicity can have multiple categories like White, Black, Asian, etc. We aim to learn a neural network based classifier (or model) $h_\theta : \mathcal{X} \to [0, 1]$, where $\theta$ represents the model parameters. The parameters are trained to learn an accurate and fair model. In a non-FL setting, there is a global training set $D$ sampled from $\mathcal{X} \times \mathcal{A} \times \mathcal{Y}$, and the network $h_\theta$ is trained on this set for a well-defined loss till convergence. To avoid collecting and training over a large number of samples, researchers propose FL for distributed data.

3.1 Federated Learning (FL) Setting

Unlike classical ML, where a centralized aggregator collects all the local data to train, in FL the data is distributed across multiple clients. The data with these clients is private and inaccessible by others including the aggregator. More formally, FL includes the following two major parties:

  • A set of clients $C = \{1, \dots, m\}$, wherein each client $i \in C$ owns a finite and limited set of samples $D_i$. Let $n_i = |D_i|$ be the number of samples. Each client trains an individual model $h_{\theta_i}$ on its private data $D_i$.

  • A unique aggregator which does not have access to the $D_i$s, but has access to the model parameters $\theta_i$. It is also standard for the aggregator to have a small starter data [5, 28]. We call this the validation set $D_V$. Note that $D_V$ may be too small to be used for training in a non-FL setting.

To ensure good performance on the universal data, FL proposes the following iterative process where aggregator-client interactions are repeated for multiple rounds. It involves the following three stages:

  1. Initialization: The aggregator communicates the initial (often random) model parameters $\theta^0$ to the clients, i.e., at epoch $t = 0$.

  2. Local training: Each client $i$ initializes its local model with $\theta^0$ and trains the model on $D_i$. At the end of each epoch $t$, the client obtains $\theta_i^t$.

  3. Aggregation: All the clients communicate their locally updated model parameters to the aggregator after every fixed number of epochs. The aggregator combines the local models to obtain a global model using certain heuristics [26]. The most common heuristic is the weighted average, defined as follows,

    $$\theta_g^t = \sum_{i \in C} \frac{n_i}{n}\, \theta_i^t, \quad \text{where } n = \sum_{i \in C} n_i. \tag{1}$$

    After aggregation, the model is broadcast back to the clients. The clients initialize their local model with these parameters and train further. This back and forth process is repeated multiple times till convergence.
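The weighted-average aggregation step can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes each client's parameters are given as a dictionary of flat lists of floats (the `Params` type and the function name `weighted_average` are hypothetical), and it weights each client by its number of samples.

```python
from typing import Dict, List

# Hypothetical representation: parameter name -> flat list of values.
Params = Dict[str, List[float]]


def weighted_average(client_params: List[Params], num_samples: List[int]) -> Params:
    """Combine client parameters, weighting each client by its sample count."""
    if not client_params or len(client_params) != len(num_samples):
        raise ValueError("need exactly one sample count per client")
    total = sum(num_samples)
    if total <= 0:
        raise ValueError("total number of samples must be positive")
    aggregated: Params = {}
    for name in client_params[0]:
        length = len(client_params[0][name])
        aggregated[name] = [
            sum(n * p[name][j] for p, n in zip(client_params, num_samples)) / total
            for j in range(length)
        ]
    return aggregated


# Two toy clients; the second holds three times as many samples.
clients = [{"w": [1.0, 2.0]}, {"w": [3.0, 6.0]}]
global_params = weighted_average(clients, num_samples=[1, 3])
print(global_params)  # {'w': [2.5, 5.0]}
```

In a real system the parameters would be tensors rather than lists, but the weighting logic is the same.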

3.2 Fairness Notions

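In this section, we formally define the fairness notions, which are usually defined in terms of the False Positive Rate (FPR) and False Negative Rate (FNR) of a classifier. Throughout, $\hat{Y}$ denotes the label predicted by the classifier.

Equality of Opportunity (EOpp) [2]: A classifier $h_\theta$ satisfies EOpp for a distribution $\mathcal{F}$ over $\mathcal{X} \times \mathcal{A} \times \mathcal{Y}$ if,

$$\Pr_{\mathcal{F}}[\hat{Y} = 0 \mid A = a, Y = 1] = \Pr_{\mathcal{F}}[\hat{Y} = 0 \mid Y = 1] \quad \forall a \in \mathcal{A}. \tag{2}$$

Informally, the FNR is the same across all categories of the sensitive attribute.

Equalized Odds (EO) [6]: A classifier $h_\theta$ satisfies EO for a distribution $\mathcal{F}$ over $\mathcal{X} \times \mathcal{A} \times \mathcal{Y}$ if,

$$\Pr_{\mathcal{F}}[\hat{Y} \neq y \mid A = a, Y = y] = \Pr_{\mathcal{F}}[\hat{Y} \neq y \mid Y = y] \quad \forall a \in \mathcal{A},\ y \in \{0, 1\}. \tag{3}$$

EO ensures that both the FPR and the FNR are equal across all categories.

Accuracy Parity (AP) [32]: A classifier $h_\theta$ satisfies AP for a distribution $\mathcal{F}$ over $\mathcal{X} \times \mathcal{A} \times \mathcal{Y}$ if,

$$\Pr_{\mathcal{F}}[\hat{Y} \neq Y \mid A = a] = \Pr_{\mathcal{F}}[\hat{Y} \neq Y] \quad \forall a \in \mathcal{A}. \tag{4}$$

AP ensures that the overall error rate, i.e., FPR+FNR, is equal across all categories.

In general, ensuring equal error rates across categories is impossible [2]. Hence, we aim to minimize the difference in these rates while maintaining the highest possible accuracy. For this, we represent the violation in EO by classifier $h_\theta$ on dataset $D$ as $l_{EO}(h_\theta, D)$, defined as the maximum disparity in FPR and FNR across categories, i.e.,

$$l_{EO}(h_\theta, D) = \max\left\{ \max_{a \in \mathcal{A}} |FPR_a - FPR|,\ \max_{a \in \mathcal{A}} |FNR_a - FNR| \right\}.$$

Here, $FPR_a$ represents the FPR only within category $a$ and $FPR$ is determined over the entire data; likewise for $FNR_a$ and $FNR$. The violations in EOpp and AP are analogous, i.e., one can similarly derive $l_{EOpp}(h_\theta, D)$ and $l_{AP}(h_\theta, D)$, respectively.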
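The violation measures described above can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the authors' code: the function names are hypothetical, the EO violation follows the "maximum disparity in FPR and FNR" description, the EOpp violation is taken as the FNR disparity alone, and the AP violation is assumed to be the disparity in FPR+FNR, following the text's description of AP.

```python
from typing import Dict, Hashable, List, Sequence, Tuple


def rates(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float]:
    """Return (FPR, FNR) for binary labels; both classes must be present."""
    negatives = [p for t, p in zip(y_true, y_pred) if t == 0]
    positives = [p for t, p in zip(y_true, y_pred) if t == 1]
    fpr = sum(1 for p in negatives if p == 1) / len(negatives)
    fnr = sum(1 for p in positives if p == 0) / len(positives)
    return fpr, fnr


def fairness_violations(
    y_true: Sequence[int], y_pred: Sequence[int], groups: Sequence[Hashable]
) -> Dict[str, float]:
    """Disparity between per-category rates and the rates over all data."""
    fpr, fnr = rates(y_true, y_pred)
    fpr_gaps: List[float] = []
    fnr_gaps: List[float] = []
    sum_gaps: List[float] = []
    for g in set(groups):
        idx = [i for i, a in enumerate(groups) if a == g]
        fpr_g, fnr_g = rates([y_true[i] for i in idx], [y_pred[i] for i in idx])
        fpr_gaps.append(abs(fpr_g - fpr))
        fnr_gaps.append(abs(fnr_g - fnr))
        # Assumption: AP violation measured on FPR + FNR.
        sum_gaps.append(abs((fpr_g + fnr_g) - (fpr + fnr)))
    return {
        "EO": max(max(fpr_gaps), max(fnr_gaps)),
        "EOpp": max(fnr_gaps),
        "AP": max(sum_gaps),
    }


y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 0, 1, 1, 1, 0]
groups = ["a"] * 4 + ["b"] * 4
violations = fairness_violations(y_true, y_pred, groups)
print(violations)  # {'EO': 0.25, 'EOpp': 0.25, 'AP': 0.0}
```

In this toy example category "a" has only false negatives and category "b" only false positives, so EO and EOpp are violated while the summed error rate is identical across categories.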

3.3 Lagrangian Multiplier Method

Lagrangian Multiplier Method (LMM) is one of the in-processing methods widely used to obtain a fair and accurate classifier [15, 25, 20]. The method is proposed for the non-FL setting. LMM proposes a loss function which combines both fairness and accuracy. It minimizes the cross entropy loss $l_{CE}$ (Equation 6) and also the violation in the fairness constraint $l_f$, for a fairness notion $f \in \{EO, EOpp, AP\}$. Formally, the loss structure for a classifier $h_\theta$ is given by,

$$L_{LMM}(h_\theta, \lambda) = l_{CE}(h_\theta) + \lambda\, \hat{l}_f(h_\theta). \tag{5}$$

Here, $\lambda \geq 0$ is the Lagrangian multiplier and $\hat{l}_f$ is the soft version of $l_f$ as defined in [20]. During training, the parameters $\theta$ of $h_\theta$ and the multiplier $\lambda$ are learnt.

4 Problem Framework: FL

The overall goal is to train a classifier that minimizes the violation of fairness while maximizing accuracy. Maximizing accuracy is equivalent to minimizing the cross entropy loss, formally given by,

$$l_{CE}(h_\theta) = \mathbb{E}_{(x, a, y)}\left[ -y \log h_\theta(x) - (1 - y) \log\left(1 - h_\theta(x)\right) \right]. \tag{6}$$

The overall optimization is given by the following, for a small $\epsilon$ which denotes the permitted violation in fairness,

$$\min_{\theta}\ l_{CE}(h_\theta) \quad \text{subject to} \quad l_f(h_\theta) \leq \epsilon.$$

Informally, we maximize accuracy while minimizing the fairness loss over the total samples available, i.e., the union of the datasets held by the individual clients. The distribution of data among the clients affects the performance of any approach used to solve the above optimization. We discuss the two extreme cases: well-balanced data and heterogeneous data.

4.1 Balanced Data (DB)

In most FL settings, it is common to assume that each client has data representing all the categories of the sensitive attribute. We distinguish this scenario by referring to it as DB. For example, when race is the sensitive attribute, all the clients have an approximately equal number of samples for the White, Black and Asian categories.

Towards solving our overall optimization with DB, we can train each client’s model with LMM and then use weighted aggregation. With DB, it is possible to compute the loss given by Equation 5, where the second term requires the computation of the fairness loss $\hat{l}_f$. The loss function used to train the model $h_{\theta_i}$ by client $i$ is given by,

$$L_i(h_{\theta_i}, \lambda) = l_{CE}(h_{\theta_i}) + \lambda\, \hat{l}_f(h_{\theta_i}), \quad \text{computed over } D_i.$$

The overall parameters are obtained after aggregation using Equation 1, which is the weighted average. We refer to the final classifier obtained after multiple interactions between the clients and the aggregator as FedAvg-. Note that this approach works only with DB.

4.2 Data Heterogeneity (DH)

We believe the assumption of DB is not practical; in reality, each client may possess samples of only a particular category (for instance, depending on its geographical location). We refer to this scenario as DH, where each client $i$ owns samples of only a single category, i.e., every sample in $D_i$ has the same sensitive attribute value $a_i \in \mathcal{A}$, where $a_i$ may represent either the White, Black, or Asian race.

With DH, for the client $i$, the fairness loss $\hat{l}_f$ cannot be computed since the error rates for the categories not present in the data are undefined. Hence the LMM loss cannot be computed at the client level. We propose the following pipeline to overcome this issue,

  1. Client Loss. Each client trains the model only for maximizing accuracy, i.e., it minimizes the cross entropy loss $l_{CE}$ over $D_i$.

  2. Aggregation. Training only for accuracy compromises fairness [1]. Hence, a simple weighted average does not ensure a reduction in fairness violations. Towards this, we propose certain aggregation heuristics using the validation set $D_V$ owned by the aggregator. We assume that $D_V$ is too small to be used for training but is balanced data.
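To illustrate why per-category error rates break down on a single-category client, the following sketch computes the FNR per category on a toy client dataset. The helper `per_group_fnr` and the toy data are hypothetical, and `None` is used here simply to mark a rate that cannot be estimated because the client holds no positive samples of that category.

```python
from typing import Dict, List, Optional, Sequence


def per_group_fnr(
    y_true: Sequence[int],
    y_pred: Sequence[int],
    groups: Sequence[str],
    all_groups: Sequence[str],
) -> Dict[str, Optional[float]]:
    """FNR per category; None where the data has no positives of that category."""
    result: Dict[str, Optional[float]] = {}
    for g in all_groups:
        positives: List[int] = [
            i for i, (t, a) in enumerate(zip(y_true, groups)) if a == g and t == 1
        ]
        if not positives:
            result[g] = None  # 0/0: nothing to estimate the rate from
        else:
            misses = sum(1 for i in positives if y_pred[i] == 0)
            result[g] = misses / len(positives)
    return result


# A client under data heterogeneity: every sample belongs to one category.
client_rates = per_group_fnr(
    y_true=[1, 1, 0, 1],
    y_pred=[1, 0, 0, 1],
    groups=["White"] * 4,
    all_groups=["White", "Black", "Asian"],
)
print(client_rates["Black"], client_rates["Asian"])  # None None
```

Any fairness loss that compares rates across categories has nothing to compare on such a client, which is why the fairness evaluation is moved to the aggregator's balanced validation set.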

We discuss the aggregation heuristics in the next section.

5 Heuristics for Fair FL

Since sophisticated techniques which require the computation of fairness loss will not work for FL with data heterogeneity at the client level, we look at constructing different aggregation heuristics for achieving fairness.

The aggregator has access to a small validation set $D_V$. Firstly, the aggregator evaluates the accuracy and fairness of the individual client models over $D_V$. Then, based on the accuracy and fairness values, it uses certain heuristics to derive the global model. The different types of heuristics for the aggregation are defined next.

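  1. FairBest. In this, the aggregator selects the model, from the set of local models, which provides the best fairness on $D_V$. That is, the global aggregation parameter at an epoch $t$, $\theta_g^t$, is defined as,

    $$\theta_g^t = \theta_{i^*}^t, \quad i^* = \arg\min_{i \in C}\ l_f(h_{\theta_i^t}, D_V).$$

    Here, $l_f(h_{\theta_i^t}, D_V)$ is the fairness loss for client $i$’s model on $D_V$ at an epoch $t$.

  2. FairAvg. For this approach, the aggregator sorts the local models in increasing order of the fairness loss and then takes the weighted average of the first $k$ local models. Let $S$ denote the set comprising these $k$ models. We have,

    $$\theta_g^t = \sum_{i \in S} \frac{n_i}{\sum_{j \in S} n_j}\, \theta_i^t.$$

  3. FairAccRatio. The aggregator picks the model parameters from the local model which gives the best ratio of accuracy to fairness loss on $D_V$. That is,

    $$\theta_g^t = \theta_{i^*}^t, \quad i^* = \arg\max_{i \in C}\ \frac{acc_i^t}{l_f(h_{\theta_i^t}, D_V)}.$$

    Here, $acc_i^t$ is the accuracy observed for client $i$’s model over $D_V$ at an epoch $t$.

  4. FairAccDiff. In this, the aggregator picks the model parameters from the local model which gives the best weighted difference, for a weight $\alpha$, of the accuracy and the fairness loss on $D_V$. That is,

    $$\theta_g^t = \theta_{i^*}^t, \quad i^* = \arg\max_{i \in C}\ \left( acc_i^t - \alpha\, l_f(h_{\theta_i^t}, D_V) \right).$$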
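The four heuristics can be sketched as selection rules over per-client validation results. This is a hedged sketch, not the authors' implementation: `ClientReport`, the flat parameter vector, the parameter names `k` and `alpha`, weighting FairAvg by sample count, and applying the weight to the fairness loss in FairAccDiff are all assumptions made for illustration.

```python
from typing import List, NamedTuple


class ClientReport(NamedTuple):
    """Hypothetical per-client record, evaluated on the validation set."""
    params: List[float]    # flattened model parameters
    num_samples: int       # size of the client's training data
    accuracy: float        # accuracy on the validation set
    fairness_loss: float   # fairness loss on the validation set


def fair_best(reports: List[ClientReport]) -> List[float]:
    """Pick the model with the least fairness loss."""
    return min(reports, key=lambda r: r.fairness_loss).params


def fair_avg(reports: List[ClientReport], k: int) -> List[float]:
    """Sample-weighted average of the k models with the least fairness loss."""
    top = sorted(reports, key=lambda r: r.fairness_loss)[:k]
    total = sum(r.num_samples for r in top)
    dim = len(top[0].params)
    return [sum(r.num_samples * r.params[j] for r in top) / total for j in range(dim)]


def fair_acc_ratio(reports: List[ClientReport], eps: float = 1e-12) -> List[float]:
    """Pick the model with the highest accuracy / fairness-loss ratio."""
    # eps guards against division by a zero fairness loss.
    return max(reports, key=lambda r: r.accuracy / max(r.fairness_loss, eps)).params


def fair_acc_diff(reports: List[ClientReport], alpha: float) -> List[float]:
    """Pick the model with the highest accuracy minus alpha * fairness loss."""
    return max(reports, key=lambda r: r.accuracy - alpha * r.fairness_loss).params


reports = [
    ClientReport([0.0], 100, accuracy=0.90, fairness_loss=0.20),
    ClientReport([1.0], 100, accuracy=0.70, fairness_loss=0.05),
    ClientReport([2.0], 300, accuracy=0.75, fairness_loss=0.10),
]
print(fair_best(reports))           # [1.0]
print(fair_avg(reports, k=2))       # [1.75]
print(fair_acc_ratio(reports))      # [1.0]
print(fair_acc_diff(reports, 1.0))  # [0.0]
```

On this toy input the heuristics disagree: FairBest and FairAccRatio pick the fairest model, while FairAccDiff with a small weight favors the most accurate one, which is the trade-off the heuristics are meant to expose.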


6 Experiments

In this section, we empirically validate our proposed heuristics. Firstly, for an appropriate comparison, we define four relevant baselines. We then explain our implementation in terms of our network architecture and dataset details. Further, we describe our training approach for the FL settings, followed by our results. We conclude the section with a comprehensive discussion of the results obtained as compared to the baselines. We provide our complete code-base as supplementary material along with this submission.

6.1 Baselines

To validate our proposed heuristics, we compare their performance against standard settings. The evaluation is in terms of accuracy and violation of the fairness notions mentioned in Section 3.2. We create the following baselines.

  1. NonFed. This is the classical ML setting. The central aggregator has access to all the training data. The aggregator trains the model for maximizing accuracy using the cross entropy loss function $l_{CE}$ (Eq. 6). Note that the training is independent of any fairness constraint.

  2. FedAvg [16]. This corresponds to the state-of-the-art FL setting wherein the aggregator uses a weighted average to combine all local models. Critically, in this baseline, the training data is balanced, i.e., each client has data samples of all demographic groups (in a similar ratio). For the training, we again use the loss function $l_{CE}$. Finally, global model parameters are aggregated as the weighted average (Eq. 1). Similar to NonFed, this baseline is also independent of any fairness constraint.

  3. FedAvg-. In contrast to FedAvg, in this baseline, we train the local models for both (maximizing) accuracy and (minimizing) fairness loss. To incorporate the fairness constraint, we adopt the state-of-the-art LMM approach. More formally, we use the LMM loss function (Eq. 5) for training each local model. Again, the weighted average (Eq. 1) is used for global model aggregation.

  4. FedAvg-DH. This corresponds to our novel FL setting with data heterogeneity. That is, each client has data associated with a single demographic group. As sophisticated approaches which require the computation of fairness losses (e.g., LMM) will not work in this setting, we train the individual local models for only accuracy using $l_{CE}$. We then aggregate them using the weighted average (Eq. 1).

6.2 Implementation and Setup Details

We now provide the datasets and architecture details.

Datasets. We conduct experiments on the following three datasets: CelebA [14], UTK [31], FairFace [7].

  • In CelebA, there are 180K train samples where we use face images of input size to predict ‘attraction’ attribute, while ‘gender’ is the sensitive attribute. Here, gender is either male or female, with the dataset containing 42% of male and 58% of female samples.

  • In UTK, there are 20K samples. Here, we use face images of input size to predict the attribute ‘gender’ while ‘ethnicity’ is the sensitive attribute. Ethnicity comprises the following five classes: namely White (42%), Black (19%), Asian (15%), Indian (17%), and others (7%).

  • In FairFace, there are 85k train samples where face images of input size are used to predict ‘gender’ while the attribute ‘race’ is sensitive. FairFace has seven different races: East Asian (42%), White (19%), Latino Hispanic (15%), Southeast Asian (17%), Black (7%), Indian (17%), and Middle Eastern (7%).


We use PyTorch’s implementation of the standard ResNet-18 architecture for the base model [15]. We use SGD optimization with a learning rate of 0.01 to train on each dataset. The batch size used for CelebA and FairFace is 256, and 64 for UTK.

6.3 Training Details

For the FL setting, for each baseline and heuristics, we consider 50 clients. We randomly distribute the training data such that each client has an equal number of data samples. Each client’s local model is trained for 5-10 epochs on the private data before every aggregation. The global model aggregation is performed periodically till the training gets stable. The training details specific to each dataset follow next.

Figure 2: Training of different models in CelebA
Figure 3: Training of different models in UTK
Figure 4: Training of different models in FairFace

CelebA. For CelebA, we consider ‘gender’ as the sensitive attribute with ‘attractive’ as the label to be predicted. To ensure data heterogeneity, we randomly distribute the training data such that among the 50 clients, 21 have access only to ‘male’ data samples. In contrast, the remaining 29 have access only to ‘female’ data samples. This distribution among the clients aims to mimic the original 42%-58% gender distribution.

Each client also has 3K training samples. We run the training of each local model for 75 epochs, and the aggregation is performed 15 times. Figure 2 shows the validation loss which is saturated after 70 epochs for all the models.

UTK. For UTK, we use ‘ethnicity’ as the sensitive attribute, with ‘gender’ as the predicting label. We consider five different groups for ethnicity. We then distribute the data such that 42% of clients have access only to White, 18% to Black, 14% to Asian, 16% to Indian, and 10% to others. This configuration ensures data heterogeneity and also mimics the original distribution. Each client has 500 training samples. We train the local models for five epochs before every aggregation and perform aggregation 20 times. So, overall, each client trains its local model for 100 epochs. From Figure 3, the loss becomes stable around 80 epochs.
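The client allocations described for CelebA and UTK can be checked with a small worked example. The helper `clients_per_group` is hypothetical; it converts the stated percentage of clients per category into client counts for the 50 clients used in the experiments, using integer arithmetic so that no rounding is involved.

```python
from typing import Dict


def clients_per_group(percent_of_clients: Dict[str, int], num_clients: int) -> Dict[str, int]:
    """Turn integer percentages of clients into whole client counts."""
    counts: Dict[str, int] = {}
    for group, pct in percent_of_clients.items():
        whole, remainder = divmod(pct * num_clients, 100)
        if remainder:
            raise ValueError(f"{pct}% of {num_clients} clients is not a whole number")
        counts[group] = whole
    if sum(counts.values()) != num_clients:
        raise ValueError("percentages must add up to 100")
    return counts


celeba = clients_per_group({"male": 42, "female": 58}, num_clients=50)
utk = clients_per_group(
    {"White": 42, "Black": 18, "Asian": 14, "Indian": 16, "others": 10},
    num_clients=50,
)
print(celeba)  # {'male': 21, 'female': 29}
print(utk)     # {'White': 21, 'Black': 9, 'Asian': 7, 'Indian': 8, 'others': 5}
```

Each client is then given samples from exactly one category, which is what makes the setting heterogeneous with respect to the sensitive attribute.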

FairFace. In FairFace, similar to UTK, we consider ‘ethnicity’ as the sensitive attribute and ‘gender’ as the predicting label. We also have seven different groups for ethnicity. Likewise, we distribute the data among the clients such that each client has access to only a single group. The training data for each client comprises 7% of a single group, i.e., each client has 1K training samples.

We train the local models for 120 epochs, such that the aggregation is performed after every eight epochs. From Figure 4, we see that the loss becomes stable around 100 epochs.

6.3.1 Validation Loss over Epochs

Our overall optimization, as discussed in Section 4, is to ensure that the final aggregated model has the least classification loss and fairness violation. We therefore track the classification loss and fairness violation of the aggregated models across epochs. Figures 2, 3, and 4 show that these reduce as training progresses.

Baselines. For each of the baselines except NonFed, we plot their losses calculated on the validation set after every aggregation. The loss for FedAvg and FedAvg-DH is the cross entropy loss $l_{CE}$; for FedAvg- it is the LMM loss (Eq. 5). Since the baseline NonFed is for the non-federated setting, we simply plot the validation loss of the single model.

Note that since the loss for FedAvg- includes both classification and fairness loss, its values are higher than the rest, as observed in Figures 2, 3, 4 (left) for CelebA, UTK and FairFace respectively. We also observe that NonFed and FedAvg converge clearly as training proceeds and the final loss obtained is significantly lower, since the data is balanced. For FedAvg-DH, with heterogeneous data, we observe that although the validation loss is low in the initial epochs, it does not reduce much with training. These results agree with the final accuracies given in Tables 1, 2 and 3.

Heuristics. For validating each of the heuristics, we plot the losses as calculated on the validation set $D_V$. We show the results for the cross entropy loss $l_{CE}$. The results for the fairness losses $l_{EOpp}$, $l_{EO}$ and $l_{AP}$ will be provided in the supplement.

We observe that $l_{CE}$ reduces with training for all the heuristics, as shown in Figures 2, 3, and 4 (right) for CelebA, UTK and FairFace respectively. Considering accuracy alone, we find that no heuristic is dominant across all three datasets. For CelebA and FairFace, FairAvg performs best, while for UTK, FairBest has the best accuracy.

| Heuristic (CelebA) | Accuracy | $l_{EOpp}$ | $l_{EO}$ | $l_{AP}$ |
|---|---|---|---|---|
| NonFed | 78.731% | 18.178 | 23.406 | 41.584 |
| FedAvg | 79.852% | 20.482 | 27.042 | 47.524 |
| FedAvg- | 68.255% | 3.061 | 11.012 | 14.073 |
| FedAvg-DH | 72.939% | 12.883 | 31.589 | 44.472 |
| FairBest | 64.843% | 2.736 | 13.927 | 16.663 |
| FairAvg | 69.853% | 6.540 | 19.176 | 25.716 |
| FairAccRatio | 70.959% | 4.379 | 11.710 | 16.089 |
| FairAccDiff | 69.266% | 7.901 | 12.224 | 20.125 |

Table 1: Accuracy and Fair Losses for CelebA.

| Heuristic (UTK) | Accuracy | $l_{EOpp}$ | $l_{EO}$ | $l_{AP}$ |
|---|---|---|---|---|
| NonFed | 87.900% | 13.265 | 18.645 | 31.910 |
| FedAvg | 86.800% | 14.948 | 25.164 | 40.112 |
| FedAvg- | 84.600% | 6.718 | 17.453 | 24.171 |
| FedAvg-DH | 85.300% | 9.851 | 23.055 | 32.906 |
| FairBest | 71.600% | 0.255 | 2.930 | 3.185 |
| FairAvg | 80.600% | 4.976 | 4.976 | 9.847 |
| FairAccRatio | 80.800% | 0.364 | 4.828 | 5.192 |
| FairAccDiff | 79.600% | 7.801 | 7.801 | 14.053 |

Table 2: Accuracy and Fair Losses for UTK.

| Heuristic (FairFace) | Accuracy | $l_{EOpp}$ | $l_{EO}$ | $l_{AP}$ |
|---|---|---|---|---|
| NonFed | 92.974% | 3.265 | 23.326 | 26.591 |
| FedAvg | 75.898% | 16.323 | 22.651 | 38.974 |
| FedAvg- | 74.168% | 12.941 | 12.941 | 21.571 |
| FedAvg-DH | 74.250% | 10.505 | 17.583 | 28.088 |
| FairBest | 70.649% | 16.356 | 16.356 | 32.036 |
| FairAvg | 70.503% | 8.991 | 15.976 | 24.967 |
| FairAccRatio | 71.674% | 14.197 | 14.197 | 27.14 |
| FairAccDiff | 65.538% | 17.936 | 17.936 | 27.394 |

Table 3: Accuracy and Fair Losses for FairFace.

6.4 Results and High-level Trends

From Tables 1, 2 and 3, we observe the following trends.

  • The baselines NonFed and FedAvg achieve the best accuracy but, in turn, suffer from greater fairness loss.

  • For all three datasets, the baseline FedAvg- provides better fairness guarantees than the other baseline models.

  • Our novel heuristic, FairBest, provides good fairness guarantees but at the cost of accuracy.

  • The heuristics FairAvg and FairAccDiff show a much more desirable trade-off between accuracy and fairness. One can further tune their hyperparameters, the number of averaged models in FairAvg and the weight in FairAccDiff, to achieve a desirable trade-off between fairness and accuracy.

  • We observe that FairAccRatio ensures better or at worst similar fairness compared to FedAvg-, which was trained on balanced data for fairness using the state-of-the-art LMM approach.

  • More significantly, FairAccRatio also has accuracies comparable to FedAvg-DH, which is the state-of-the-art approach for FL but trained on heterogeneous data for an appropriate comparison.

6.5 Fairness Improvements

In the last subsection, we listed the high-level trends observed for the baselines and our heuristics. We now discuss the results in detail. We derive the numbers from Tables 1, 2 and 3.

CelebA. Compared to baseline models FedAvg and FedAvg-DH, our novel heuristics FairBest, FairAvg, FairAccRatio and FairAccDiff, quantitatively improve the fairness guarantees. Tables 4-5 give the improvements observed in different fairness notions (i.e., Fairness loss in Baseline model - Fairness loss in the respective heuristic).

UTK. With UTK, our heuristics provide significantly better fairness even when compared to FedAvg- which is trained for both accuracy and fairness using LMM. Tables 6-8 give the improvements observed in fairness notions compared to baseline models FedAvg-, FedAvg and FedAvg-DH.

FairFace. Similarly for FairFace, in comparison to Baseline models FedAvg and FedAvg-DH, Tables 9-10 give the improvements observed in different fairness notions.
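As a worked example of how an improvement is derived (fairness loss of the baseline minus fairness loss of the heuristic), the snippet below recomputes the FairBest row of Table 4 from the FedAvg and FairBest rows of Table 1. The three values per row are the three fairness-loss columns in table order; the variable names are illustrative.

```python
# Fairness losses for CelebA, copied from Table 1 (three columns, in table order).
fedavg_losses = (20.482, 27.042, 47.524)
fairbest_losses = (2.736, 13.927, 16.663)

# Improvement = baseline loss - heuristic loss, rounded to the table's precision.
improvement = [round(b - h, 3) for b, h in zip(fedavg_losses, fairbest_losses)]
print(improvement)  # [17.746, 13.115, 30.861]
```

A positive value means the heuristic has a lower fairness loss than the baseline; a negative value means the heuristic is less fair on that notion.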

| Heuristic (CelebA) | Improvement in $l_{EOpp}$ | Improvement in $l_{EO}$ | Improvement in $l_{AP}$ |
|---|---|---|---|
| FairBest | 17.746 | 13.115 | 30.861 |
| FairAvg | 13.942 | 7.866 | 21.808 |
| FairAccRatio | 16.103 | 15.332 | 31.435 |
| FairAccDiff | 12.581 | 14.818 | 27.399 |

Table 4: Fairness Improvements in comparison to FedAvg for CelebA.

| Heuristic (CelebA) | Improvement in $l_{EOpp}$ | Improvement in $l_{EO}$ | Improvement in $l_{AP}$ |
|---|---|---|---|
| FairBest | 10.1594 | 17.662 | 27.809 |
| FairAvg | 6.343 | 12.413 | 18.756 |
| FairAccRatio | 8.504 | 19.879 | 28.383 |
| FairAccDiff | 4.982 | 19.365 | 24.347 |

Table 5: Fairness Improvements in comparison to FedAvg-DH for CelebA.

| Heuristic (UTK) | Improvement in $l_{EOpp}$ | Improvement in $l_{EO}$ | Improvement in $l_{AP}$ |
|---|---|---|---|
| FairBest | 6.463 | 14.523 | 20.986 |
| FairAvg | 1.742 | 12.477 | 14.324 |
| FairAccRatio | 6.354 | 12.625 | 18.979 |
| FairAccDiff | -1.083 | 9.652 | 10.118 |

Table 6: Fairness Improvements in comparison to FedAvg- for UTK.

| Heuristic (UTK) | Improvement in $l_{EOpp}$ | Improvement in $l_{EO}$ | Improvement in $l_{AP}$ |
|---|---|---|---|
| FairBest | 14.693 | 22.234 | 36.927 |
| FairAvg | 9.972 | 20.188 | 30.265 |
| FairAccRatio | 14.584 | 20.336 | 34.92 |
| FairAccDiff | 7.147 | 17.363 | 26.059 |

Table 7: Fairness Improvements in comparison to FedAvg for UTK.

| Heuristic (UTK) | Improvement in $l_{EOpp}$ | Improvement in $l_{EO}$ | Improvement in $l_{AP}$ |
|---|---|---|---|
| FairBest | 9.596 | 20.125 | 29.721 |
| FairAvg | 4.875 | 18.079 | 23.059 |
| FairAccRatio | 9.487 | 18.227 | 27.714 |
| FairAccDiff | 2.05 | 15.254 | 18.853 |

Table 8: Fairness Improvements in comparison to FedAvg-DH for UTK.

| Heuristic (FairFace) | Improvement in $l_{EOpp}$ | Improvement in $l_{EO}$ | Improvement in $l_{AP}$ |
|---|---|---|---|
| FairBest | -0.033 | 6.295 | 6.938 |
| FairAvg | 7.332 | 6.675 | 14.007 |
| FairAccRatio | 2.126 | 8.454 | 11.834 |
| FairAccDiff | -1.613 | 4.715 | 11.58 |

Table 9: Fairness Improvements in comparison to FedAvg for FairFace.

| Heuristic (FairFace) | Improvement in $l_{EOpp}$ | Improvement in $l_{EO}$ | Improvement in $l_{AP}$ |
|---|---|---|---|
| FairBest | -5.851 | 1.227 | -3.948 |
| FairAvg | 1.514 | 1.607 | 3.121 |
| FairAccRatio | -3.692 | 3.386 | 0.948 |
| FairAccDiff | -7.431 | -0.353 | 0.694 |

Table 10: Fairness Improvements in comparison to FedAvg-DH for FairFace.

Remark. From the above, one can observe that our proposed aggregation heuristics indeed remarkably improve the fairness of models in FL with data heterogeneity, when compared to FedAvg, FedAvg-DH. This highlights the significance of our novel heuristics. The comparable improvement, in percentage, over state-of-the-art approaches such as LMM, provides further validation.

Figure 5: Accuracies and Fairness Loss

6.6 Results: Discussion

In this subsection, with Figure 5, we provide an overview of the performances of our novel heuristics, simultaneously over all the three datasets. We also provide a detailed comparison of the heuristics across datasets.

Comparison: FairBest and FairAvg. While FairBest achieves good fairness consistently across datasets, it suffers from poor accuracy when compared to FedAvg-DH. In contrast, FairAvg (for the chosen number of averaged models) provides accuracies which are similar to FedAvg-DH, but with a slightly smaller improvement in fairness guarantee. One may further choose an appropriate number of averaged models for a desirable trade-off between accuracy and fairness.

Comparison: FairAvg and FairAccDiff. Similar to FairAvg, FairAccDiff (for the chosen weight) also achieves decent accuracies but with a marginal improvement in fairness guarantees. However, tweaking the weight parameter may further allow for a better trade-off of fairness with accuracy.

Comparison: FairAvg and FairAccRatio. From Figure 5, we observe that FairAvg (for the chosen number of averaged models) and FairAccRatio have similar accuracies for all three datasets. What is more, their accuracies are comparable to the state-of-the-art model FedAvg-DH in FL under DH. We also know that FedAvg- is the baseline model for fairness in the standard FL setting with balanced data. However, as stated above, FairAccRatio outperforms FedAvg-, guaranteeing the best fairness on the same dataset. Thus, we believe that our heuristics are significant. A user may choose an appropriate heuristic for their desired accuracy and fairness trade-off.

7 Conclusion

In this paper, we focussed on the fair classification problem in Federated Learning under data heterogeneity. First, we observed that existing approaches for fairness in FL are not feasible under data heterogeneity. To address this, we proposed four alternative aggregation heuristics, based on the fairness and accuracy assured by the local client models, that aim to ensure fairness while maximizing the model's accuracy. Further, we have shown that these heuristics perform better than the standard baseline methods by empirically evaluating them on three vision datasets: CelebA, UTK, and FairFace.


  • [1] A. Chouldechova (2017) Fair prediction with disparate impact: a study of bias in recidivism prediction instruments. Big data 5 2, pp. 153–163. Cited by: item 2.
  • [2] A. Chouldechova (2017) Fair prediction with disparate impact: a study of bias in recidivism prediction instruments. Big data 5 (2), pp. 153–163. Cited by: §1, §3.2, §3.2.
  • [3] S. C. Geyik, Q. Guo, B. Hu, C. Ozcaglar, K. Thakkar, X. Wu, and K. Kenthapadi (2018) Talent search and recommendation systems at linkedin: practical challenges and lessons learned. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, pp. 1353–1354. Cited by: §1.
  • [4] M. Habib ur Rehman, A. Mukhtar Dirir, K. Salah, and D. Svetinovic (2020) FairFed: cross-device fair federated learning. In

    2020 IEEE Applied Imagery Pattern Recognition Workshop (AIPR)

    pp. 1–7. External Links: Document Cited by: §1, §2.
  • [5] A. Hard, C. M. Kiddon, D. Ramage, F. Beaufays, H. Eichner, K. Rao, R. Mathews, and S. Augenstein (2018) Federated learning for mobile keyboard prediction. External Links: Link Cited by: §1, §2, 2nd item.
  • [6] M. Hardt, E. Price, and N. Srebro (2016)

    Equality of opportunity in supervised learning

    Advances in neural information processing systems 29, pp. 3315–3323. Cited by: §1, §3.2.
  • [7] K. Karkkainen and J. Joo (2021) FairFace: face attribute dataset for balanced race, gender, and age for bias measurement and mitigation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 1548–1558. Cited by: 2nd item, §2, §6.2.
  • [8] J. Konečný, H. B. McMahan, D. Ramage, and P. Richtárik (2016) Federated optimization: distributed machine learning for on-device intelligence. External Links: 1610.02527 Cited by: §1.
  • [9] P. Lahoti, G. Weikum, and K. Gummadi (2019) IFair: learning individually fair data representations for algorithmic decision making. 2019 IEEE 35th International Conference on Data Engineering (ICDE), pp. 1334–1345. Cited by: §1.
  • [10] L. Li, Y. Fan, M. Tse, and K. Lin (2020) A review of applications in federated learning. Computers & Industrial Engineering 149, pp. 106854. External Links: ISSN 0360-8352 Cited by: §2.
  • [11] T. Li, S. Hu, A. Beirami, and V. Smith (2021) Ditto: fair and robust federated learning through personalization. In International Conference on Machine Learning, pp. 6357–6368. Cited by: §1.
  • [12] T. Li, A. K. Sahu, M. Zaheer, M. Sanjabi, A. Talwalkar, and V. Smith FEDERATED optimization in heterogeneous networks. Cited by: §2.
  • [13] T. Li, M. Sanjabi, A. Beirami, and V. Smith (2020) Fair resource allocation in federated learning. In International Conference on Learning Representations, External Links: Link Cited by: §1.
  • [14] Z. Liu, P. Luo, X. Wang, and X. Tang (2015) Deep learning face attributes in the wild. In Proceedings of the IEEE international conference on computer vision, pp. 3730–3738. Cited by: 1st item, §6.2.
  • [15] V. S. Lokhande, A. K. Akash, S. N. Ravi, and V. Singh (2020) Fairalm: augmented lagrangian method for training fair models with little regret. In European Conference on Computer Vision, pp. 365–381. Cited by: §1, §2, §3.3, §6.2.
  • [16] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas (2017) Communication-efficient learning of deep networks from decentralized data. In Artificial intelligence and statistics, pp. 1273–1282. Cited by: §2, item 2.
  • [17] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas (2017) Communication-efficient learning of deep networks from decentralized data. In Artificial intelligence and statistics, pp. 1273–1282. Cited by: §1.
  • [18] H. Mehta, S. Shah, N. Patel, and P. Kanani (2020-06) Classification of criminal recidivism using machine learning techniques. pp. 5110–5122. Cited by: §1.
  • [19] M. Padala, S. Damle, and S. Gujar (2021) Federated learning meets fairness and differential privacy. arXiv e-prints, pp. arXiv–2108. Cited by: §1.
  • [20] M. Padala and S. Gujar (2020) FNNC: achieving fairness through neural networks. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence,IJCAI-20, International Joint Conferences on Artificial Intelligence Organization, Cited by: §2, §3.3, §3.3.
  • [21] M. Paulik, M. Seigel, H. Mason, D. Telaar, J. Kluivers, R. van Dalen, C. W. Lau, L. Carlson, F. Granqvist, C. Vandevelde, et al. (2021) Federated evaluation and tuning for on-device personalization: system design & applications. arXiv preprint arXiv:2102.08503. Cited by: §1.
  • [22] S. Samarakoon, M. Bennis, W. Saad, and M. Debbah (2018) Federated learning for ultra-reliable low-latency v2v communications. In 2018 IEEE Global Communications Conference (GLOBECOM), pp. 1–7. Cited by: §2.
  • [23] P. Sattigeri, S. C. Hoffman, V. Chenthamarakshan, and K. R. Varshney (2019)

    Fairness gan: generating datasets with fairness properties using a generative adversarial network

    IBM Journal of Research and Development 63 (4/5), pp. 3–1. Cited by: §1, §2.
  • [24] B. Smith and G. Linden (2017) Two decades of recommender systems at amazon.com. IEEE Internet Computing 21 (3), pp. 12–18. External Links: Document Cited by: §1.
  • [25] C. Tran, F. Fioretto, and P. Van Hentenryck (2020) Differentially private and fair deep learning: a lagrangian dual approach. arXiv preprint arXiv:2009.12562. Cited by: §1, §2, §3.3.
  • [26] O. Wahab, A. Mourad, H. Otrok, and T. Taleb (2021-02) Federated machine learning: survey, multi-level classification, desirable criteria and future directions in communication and networking systems. IEEE Communications Surveys & Tutorials 23, pp. . External Links: Document Cited by: §1, item 3.
  • [27] X. Wang, Y. Han, C. Wang, Q. Zhao, X. Chen, and M. Chen (2019) In-edge ai: intelligentizing mobile edge computing, caching and communication by federated learning. IEEE Network 33 (5), pp. 156–165. Cited by: §2.
  • [28] T. Yang, G. Andrew, H. Eichner, H. Sun, W. Li, N. Kong, D. Ramage, and F. Beaufays (2018) Applied federated learning: improving google keyboard query suggestions. arXiv preprint arXiv:1812.02903. Cited by: §1, §2, 2nd item.
  • [29] M. B. Zafar, I. Valera, M. G. Rogriguez, and K. P. Gummadi (2017) Fairness constraints: mechanisms for fair classification. In Artificial Intelligence and Statistics, pp. 962–970. Cited by: §2.
  • [30] D. Y. Zhang, Z. Kou, and D. Wang (2020) FairFL: a fair federated learning approach to reducing demographic bias in privacy-sensitive classification models. In 2020 IEEE International Conference on Big Data (Big Data), pp. 1051–1060. External Links: Document Cited by: §1, §2.
  • [31] Z. Zhang, Y. Song, and H. Qi (2017)

    Age progression/regression by conditional adversarial autoencoder

    In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5810–5818. Cited by: 2nd item, §6.2.
  • [32] H. Zhao and G. Gordon (2019) Inherent tradeoffs in learning fair representations. Advances in neural information processing systems 32, pp. 15675–15685. Cited by: §1, §3.2.
  • [33] Z. Zheng, Y. Zhou, Y. Sun, Z. Wang, B. Liu, and K. Li (2021) Applications of federated learning in smart cities: recent advances, taxonomy, and open challenges. Connection Science, pp. 1–28. Cited by: §2.
  • [34] W. Zhou, Y. Li, S. Chen, and B. Ding (2018) Real-time data processing architecture for multi-robots based on differential federated learning. In 2018 IEEE SmartWorld, Ubiquitous Intelligence & Computing, Advanced & Trusted Computing, Scalable Computing & Communications, Cloud & Big Data Computing, Internet of People and Smart City Innovation (SmartWorld/SCALCOM/UIC/ATC/CBDCom/IOP/SCI), pp. 462–471. Cited by: §2.

Appendix A Training Models: Validation Loss and Fairness Violations

A.1 Fairness over Epochs

Baselines. For each baseline, we plot the validation loss and the fairness losses , , and , calculated on the validation set after every aggregation.

Note that only FedAvg- is trained using , which minimizes fairness loss. Thus, its fairness loss values are lower than those of the other baseline models, as observed in Figure 6 (rows 2, 3, & 4). We also observe that the fairness loss values in FedAvg- show a clear convergence as training proceeds, and the final loss obtained is significantly lower. Such a trend is not apparent in the other models.
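As an illustration of how a fairness violation can be measured on a validation set after an aggregation round, the sketch below computes two widely used group-fairness gaps for binary predictions and a binary sensitive attribute: the demographic parity gap and the equalized odds gap. These are assumptions chosen for illustration and may differ from the exact losses plotted in Figure 6; the function names and toy data are hypothetical.

```python
import numpy as np

def demographic_parity_gap(y_pred, group):
    """|P(y_hat = 1 | a = 0) - P(y_hat = 1 | a = 1)| for binary predictions."""
    y_pred, group = np.asarray(y_pred), np.asarray(group)
    return abs(y_pred[group == 0].mean() - y_pred[group == 1].mean())

def equalized_odds_gap(y_true, y_pred, group):
    """Largest between-group gap in positive-prediction rate, taken over
    the two true labels. Assumes every (label, group) cell is non-empty."""
    y_true, y_pred, group = map(np.asarray, (y_true, y_pred, group))
    gaps = []
    for label in (0, 1):
        mask = y_true == label
        rate_a0 = y_pred[mask & (group == 0)].mean()
        rate_a1 = y_pred[mask & (group == 1)].mean()
        gaps.append(abs(rate_a0 - rate_a1))
    return max(gaps)

# Toy validation set (made-up values): 4 samples per demographic group.
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 0, 1, 1, 1, 0]
group  = [0, 0, 0, 0, 1, 1, 1, 1]

dp_gap = demographic_parity_gap(y_pred, group)
eo_gap = equalized_odds_gap(y_true, y_pred, group)
```

Evaluating such gaps on the aggregated model after every round would yield one fairness curve per metric, analogous to the per-row curves described above.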

Heuristics. We plot the validation and fairness losses as calculated on the validation set to validate each of the proposed heuristics. We show the results for the losses , , and (rows 2, 3, & 4).

From Figure 7, we observe that the fairness loss values for all the heuristics decrease with training, and that the decrease is gradual and smooth for FairBest and FairAccRatio. For CelebA and UTK, FairBest performs best, while for UTK, FairAvg assures better fairness. For all three datasets, we can see that FairAccRatio has a better accuracy and fairness trade-off.

Figure 6: Training of Baseline Models across Datasets: Accuracies and Fairness Loss
Figure 7: Training of Heuristic Models across Datasets: Accuracies and Fairness Loss