1 Introduction
Machine learning models are being increasingly adopted to make decisions in a range of domains, such as finance, insurance, medical diagnosis, recruitment, and many more [10.1145/3376898]. Therefore, we are often confronted with the need – sometimes imposed by regulatory bodies – to ensure that such machine learning models do not lead to decisions that discriminate against individuals from a certain demographic group.
The development of machine learning models that are fair across different (demographic) groups has been well studied in traditional learning setups where there is a single entity responsible for learning a model based on a local dataset holding data from individuals of the various groups. However, there are settings where the data representing different demographic groups is spread across multiple entities rather than concentrated on a single entity/server. For example, consider a scenario where various hospitals wish to learn a diagnostic machine learning model that is fair (or performs reasonably well) across different demographic groups but each hospital may only contain training data from certain groups because – in view of its geolocation – it serves predominantly individuals of a given demographic [cui2021addressing]. This new setup along with the conventional centralized one are depicted in Figure 1.
These emerging scenarios, however, bring about various challenges. The first challenge relates to the fact that each individual entity may not be able to learn a fair machine learning model locally by itself because it may hold little or no data from certain demographic groups. The second challenge relates to the fact that each individual entity may also not be able to directly share its own data with other entities due to legal or regulatory constraints such as GDPR [EUdataregulations2018]. Therefore, the conventional machine learning fairness ansatz – relying on the fact that the learner has access to the overall data – does not generalize from the centralized data setup to the new distributed one.
It is possible to address these challenges by adopting federated learning (FL) approaches. These learning approaches enable multiple entities (or clients, i.e., different user devices, organisations or even geo-distributed datacenters of a single company [advancesFL]; in this manuscript we use the terms participants, clients, and entities interchangeably) coordinated by a central server to iteratively learn in a decentralized manner a single global model to carry out some task [DBLP:journals/corr/KonecnyMRR16, DBLP:journals/corr/KonecnyMYRSB16]. The clients do not share data with one another or with the server; instead the clients only share focused updates with the server, the server then updates a global model, and distributes the updated model to the clients, with the process carried out over multiple rounds or iterations. This learning approach enables different clients with limited local training data to learn better machine learning models.
However, with the exception of some recent works such as [cui2021addressing, zhang2021unified], which we will discuss later, federated learning is not typically used to learn models that exhibit performance guarantees for different demographic groups served by a client (i.e., group fairness guarantees); instead, it is primarily used to learn models that exhibit specific performance guarantees for each client involved in the federation (i.e., client fairness guarantees). Importantly, in view of the fact that a machine learning model that is client fair is not necessarily group fair (as we formally demonstrate in this work), it becomes crucial to understand how to develop new federated learning techniques leading up to models that are also fair across different demographic groups.
This work develops a new federated learning algorithm that can be adopted by multiple entities coordinated by a single server to learn a global minimax group fair model. We show that our algorithm leads to the same (minimax) group fairness performance guarantees of centralized approaches such as [diana2020convergent, martinez2020minimax], which are exclusively applicable to settings where the data is concentrated in a single client. Interestingly, this also applies to scenarios where certain clients do not hold any data from some of the demographic groups.
The rest of the paper is organized as follows: Section 2 overviews related work. Section 3 formulates our proposed distributed group fairness problem. Section 4 formally demonstrates that traditional federated learning approaches such as [DBLP:journals/corr/abs210808435, DRFA, Li2020Fair, AFL] may not always solve group fairness. In Section 5 we propose a new federated learning algorithm to collaboratively learn models that are minimax group fair. Section 6 illustrates the performance of our approach in relation to other baselines. Finally, Section 7 draws various conclusions.
2 Related Work
Fairness in Machine Learning. The development of fair machine learning models in the standard centralized learning setting – where the learner has access to all the data – is underpinned by fairness criteria. One popular criterion is individual fairness [dwork2011fairness], which dictates that the model is fair provided that people with similar characteristics/attributes are subject to similar model predictions/decisions. Another family of criteria – known as group fairness – requires the model to perform similarly on different demographic groups. Popular group fairness criteria include equality of odds, equality of opportunity [hardt2016equality], and demographic parity [louizos2017variational], which are usually imposed as a constraint within the learning problem. More recently, [martinez2020minimax] introduced minimax group fairness; this criterion requires the model to optimize the prediction performance of the worst demographic group without unnecessarily impairing the performance of other demographic groups (also known as no-harm fairness) [diana2020convergent, martinez2020minimax]. In this work we leverage the minimax group fairness criterion to learn a model that is (demographic) group fair across any groups included in the clients’ distribution in federated learning settings. However, the overall concepts introduced here can also be extended to other fairness criteria.
Fairness in Federated Learning. The development of fair machine learning models in federated learning settings has been building upon the group fairness literature. The majority of these works have concentrated predominantly on client-fairness, which targets the development of algorithms leading to models that exhibit similar performance across different clients [Li2020Fair].
One such approach is agnostic federated learning (AFL) [AFL], whose aim is to learn a model that optimizes the performance of the worst performing client. Extensions of AFL [DRFA, afedavg] improve its communication efficiency by enabling clients to perform multiple local optimization steps. Another FL approach, proposed in [Li2020Fair], uses an extra fairness constraint to flexibly control performance disparities across clients. Similarly, tilted empirical risk minimization [li2021tilted] uses a hyperparameter called tilt to enable fairness or robustness by magnifying or suppressing the impact of individual client losses. FedMGDA [fedmgda+] is an algorithm that combines minimax optimization with Pareto efficiency [micro_theory] and gradient normalization to ensure fairness across users and robustness against malicious clients. The works in [horvath2021fjord, munir2021fedprune] enable fairness across clients with different hardware computational capabilities by allowing any participant to train a submodel of the original deep neural network (DNN) in order to contribute to the global model. The authors in [Wang2021FederatedLW] observe that unfairness across clients is caused by conflicting gradients that may significantly reduce the performance of some clients and therefore propose an algorithm for detecting and mitigating such conflicts. Finally, GIFAIR-FL [DBLP:journals/corr/abs21080274] uses a regularization term to penalize the spread in the aggregated loss to enforce uniform performance across the participating entities. Our work naturally departs from these fair federated learning approaches since, as we prove in Section 4, client-fairness ensures fairness across all demographic groups included across clients’ datasets only under some special conditions.
Another fairness concept in federated learning is collaborative fairness [fan2021improving, 9378043, Lyu2020, Nagalapatti_Narayanam_2021], which proposes that each client’s compensation correspond to its contribution to the utility task of the global model. Larger rewards to high-contributing clients motivate their participation in the federation while lower rewards prevent free-riders [Lyu2020]. However, such approaches might further penalize clients that have access to the worst performing demographic groups, resulting in an even more unfair global model.
There are some recent complementary works that consider group fairness within client distributions. Group distributional robust optimization (GDRFA) [zhang2021unified] aims to optimize for the worst performing group by learning a weighting coefficient for each local group, even if there are shared groups across clients. In our work, we combine the statistics received from the clients sharing the same groups to learn a global model, since, as we experimentally show in Section 6, considering duplicates of the same group might lead to worse generalization in some FL scenarios. FCFL [cui2021addressing] focuses on improving the worst performing client while ensuring a level of local group fairness defined by each client, by employing gradient-based constrained multi-objective optimization. Our primary goal is to learn a model solving (demographic) group fairness across any groups included in the clients’ distribution, independently of the groups’ representation in a particular client.
Finally, some recent approaches study the effects of (demographic) group fairness in FL using metrics such as demographic parity and/or equality of opportunity [DBLP:journals/corr/abs201005057, cui2021addressing, rodriguezgalvez2021enforcing, zeng2021improving, ezzeldin2021fairfed, DBLP:journals/corr/abs210511570, chu2021fedfair]. Compared to these methods, our approach can support scenarios with multiple group attributes and targets without any modifications to the optimization procedure. Also, even though comparing different fairness metrics is out of the scope of this work (there are various studies discussing the effects of different fairness metrics; see for example [10.1145/3287560.3287589]), the aforementioned methods enforce some type of zero risk disparity across groups (the risk considered in the fairness constraints differs across fairness definitions) and thus degrade the performance of the well-performing groups. In this work, we consider the minimax group fairness criterion [martinez2020minimax, diana2020convergent], and due to its no-unnecessary-harm property, we do not disadvantage any demographic groups except if absolutely necessary, making it suitable for applications such as healthcare and finance. Our formulation is complemented by theoretical results connecting minimax client and minimax group fairness and by a provably convergent optimization algorithm.
Robustness in Federated Learning. Works dealing with robustness to distributional shifts in user data, such as [scaffold, NEURIPS2020_f5e53608], also relate to group fairness. One work that closely relates to group fairness is FedRobust [NEURIPS2020_f5e53608], which aims to learn a model for the worst-case affine shift by assuming that a client’s data distribution is an affine transformation of a global one. However, it requires each client to have enough data to estimate the local worst-case shift; otherwise the global model’s performance on the worst group suffers [li2021fedbn].
Our Contributions. To recap, our core contributions compared to the literature are:

We formulate minimax group fairness in federated learning settings where some clients might only have access to a subset of the demographic groups during the training phase.

We formally show under what conditions minimax group fairness is equivalent to minimax client fairness, so that optimizing for either of the two notions results in a model that is both group and client fair.

We propose a provably convergent optimization algorithm to collaboratively learn a minimax fair model across any demographic groups included in the federation, that allows clients to have high, low or no representation of a particular group. We show that our federated learning algorithm leads to a global model that is equivalent to a model yielded by a centralized learning algorithm.
3 Problem Formulation
3.1 Group Fairness in Centralized Machine Learning
We first describe the standard minimax group fairness problem in a centralized machine learning setting [diana2020convergent, martinez2020minimax], where there is a single entity/server holding all relevant data and responsible for learning a group fair model (see Figure 1). We concentrate on classification tasks, though our approach also applies to other learning tasks such as regression. Let the triplet of random variables $(X, Y, A) \in \mathcal{X} \times \mathcal{Y} \times \mathcal{A}$ represent input features, target, and demographic groups. Let also $p(X, Y, A) = p(A)\, p(X, Y \mid A)$ represent the joint distribution of these random variables, where $p(A)$ represents the prior distribution of the different demographic groups and $p(X, Y \mid A)$ their data conditional distribution.
Let $\ell: \Delta^{|\mathcal{Y}|-1} \times \Delta^{|\mathcal{Y}|-1} \rightarrow \mathbb{R}_{+}$ be a loss function, where $\Delta$ represents the probability simplex. We now consider that the entity will learn a hypothesis $h$ drawn from a hypothesis class $\mathcal{H}$ that solves the optimization problem given by
\[ \min_{h \in \mathcal{H}} \max_{a \in \mathcal{A}} r_a(h), \qquad r_a(h) = \mathbb{E}_{(X,Y) \sim p(X, Y \mid A = a)}\left[\ell(h(X), Y)\right]. \tag{1} \]
Note that this problem involves the minimization of the expected risk of the worst performing demographic group.
Importantly, under the assumption that the loss is a convex function w.r.t. the hypothesis (this is true for the most common functions in machine learning settings, such as Brier score and cross entropy) and the hypothesis class is a convex set, solving the minimax objective in Eq. 1 is equivalent to solving
\[ \min_{h \in \mathcal{H}} \max_{a \in \mathcal{A}} r_a(h) \;\geq\; \min_{h \in \mathcal{H}} \max_{\boldsymbol{\mu} \in \Delta^{|\mathcal{A}|-1}_{\geq \epsilon}} \sum_{a \in \mathcal{A}} \mu_a r_a(h), \tag{2} \]
where $r_a(h)$ is the expected risk of group $a$ and $\Delta^{|\mathcal{A}|-1}_{\geq \epsilon}$ represents the vectors in the simplex with all of their components at least $\epsilon$. Note that if $\epsilon = 0$ the inequality in Eq. 2 becomes an equality; however, allowing zero-valued coefficients may lead to models that are weakly, but not strictly, Pareto optimal [geoffrion1968proper, miettinen2012nonlinear].
The minimax objective over the linear combination of the sensitive groups can be achieved by alternating between projected gradient ascent or multiplicative weight updates to optimize the weights given the model, and stochastic gradient descent to optimize the model given the weighting coefficients [chen2018, diana2020convergent, martinez2020minimax].
3.2 Group Fairness in Federated Learning
We now describe our proposed group fairness federated learning problem; this problem differs from the previous one because the data is now distributed across multiple clients, and neither the clients nor the server have direct access to the data held by other clients. See also Figure 1.
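In this setting, we incorporate a categorical variable $C \in \mathcal{C}$ into our data tuple $(X, Y, A, C)$ to indicate the clients participating in the federation. The joint distribution of these variables is $p(X, Y, A, C) = p(C)\, p(A \mid C)\, p(X, Y \mid A, C)$, where $p(C)$ represents a prior distribution over clients – which in practice is the fraction of samples that are acquired by client $c$ relative to the total number of data samples –, $p(A \mid C)$ represents the distribution of the groups conditioned on the client, and $p(X, Y \mid A, C)$ represents the distribution of the input and target variables conditioned on the group and client. We assume that the group-conditional distribution is the same across clients, meaning $p(X, Y \mid A, C) = p(X, Y \mid A)$. Note, however, that our model explicitly allows for the distribution of the demographic groups to depend on the client (via $p(A \mid C)$), accommodating the fact that certain clients may have a higher (or lower) representation of certain demographic groups over others.
We now aim to learn a model that solves the minimax fairness problem as presented in Eq. 1, but considering that the group loss estimates are split into estimators associated with each client. We therefore re-express the linear weighted formulation of Eq. 2 using importance weights, which allows us to incorporate the role of the different clients, as follows:
\[ \min_{h \in \mathcal{H}} \max_{\boldsymbol{\mu} \in \Delta^{|\mathcal{A}|-1}_{\geq \epsilon}} \sum_{a \in \mathcal{A}} \mu_a r_a(h) = \min_{h \in \mathcal{H}} \max_{\boldsymbol{\mu} \in \Delta^{|\mathcal{A}|-1}_{\geq \epsilon}} \sum_{c \in \mathcal{C}} p(C = c) \sum_{a \in \mathcal{A}} p(A = a \mid C = c)\, w_a\, r_a(h) = \min_{h \in \mathcal{H}} \max_{\boldsymbol{\mu} \in \Delta^{|\mathcal{A}|-1}_{\geq \epsilon}} \sum_{c \in \mathcal{C}} p(C = c)\, r_c(h, \boldsymbol{w}), \tag{3} \]
where $r_a(h)$ is the expected risk of group $a$, $r_c(h, \boldsymbol{w}) = \sum_{a \in \mathcal{A}} p(A = a \mid C = c)\, w_a\, r_a(h)$ is the expected client risk, and $w_a = \mu_a / p(A = a)$ denotes the importance weight for a particular demographic group.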
There is an immediate nontrivial challenge that arises within this proposed federated learning setting in relation to the centralized one described earlier: we need to devise an algorithm that solves the objective in Eq. 3 under the constraint that the different clients cannot share their local data with the server or with one another, but – in line with conventional federated learning settings [DRFA, Li2020Fair, DBLP:journals/corr/McMahanMRA16, AFL]– only local model updates of a global model (or other quantities such as local risks) are shared with the server. This will be addressed later in this paper by the proposed federated optimization.
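The re-expression of the group-weighted objective as a sum of client risks can be checked numerically. The sketch below is illustrative only: it assumes the importance weight takes the form mu_a / p(A=a) and that every client shares the same group risks, and all names (`group_priors`, `federated_objective`, the toy numbers) are hypothetical rather than taken from the paper.

```python
def group_priors(p_c, p_a_given_c):
    """p(A=a) = sum_c p(C=c) * p(A=a | C=c); p_a_given_c[c][a] holds p(A=a | C=c)."""
    num_groups = len(p_a_given_c[0])
    return [sum(pc * row[a] for pc, row in zip(p_c, p_a_given_c))
            for a in range(num_groups)]


def federated_objective(group_risks, p_c, p_a_given_c, mu):
    """sum_c p(C=c) * r_c, with r_c = sum_a p(A=a | C=c) * w_a * r_a."""
    p_a = group_priors(p_c, p_a_given_c)
    w = [m / p for m, p in zip(mu, p_a)]  # assumed importance weights mu_a / p(A=a)
    client_risks = [sum(row[a] * w[a] * group_risks[a] for a in range(len(mu)))
                    for row in p_a_given_c]
    return sum(pc * rc for pc, rc in zip(p_c, client_risks))


def centralized_objective(group_risks, mu):
    """sum_a mu_a * r_a, the group-weighted risk of the centralized formulation."""
    return sum(m * r for m, r in zip(mu, group_risks))


# Toy federation: three clients, two groups; the last client holds no data from group 0.
p_c = [0.5, 0.3, 0.2]
p_a_given_c = [[0.7, 0.3], [0.4, 0.6], [0.0, 1.0]]
group_risks = [0.25, 0.40]   # hypothetical risks r_a(h) of a fixed model h
mu = [0.1, 0.9]              # hypothetical group weighting coefficients

fed = federated_objective(group_risks, p_c, p_a_given_c, mu)
cen = centralized_objective(group_risks, mu)
print(abs(fed - cen) < 1e-12)
```

The two values agree up to floating-point error even though one client has no samples from group 0, because the importance weights depend only on the global group prior and not on any single client's composition.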
4 Client Fairness vs. Group Fairness in Federated Learning
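Before proposing a federated learning algorithm to solve our proposed group fairness problem, we first reflect on whether a model that solves the more widely used client fairness objective in federated learning settings, given by [AFL]
\[ \min_{h \in \mathcal{H}} \max_{\boldsymbol{\lambda} \in \Delta^{|\mathcal{C}|-1}} \mathbb{E}_{\mathcal{D}_{\boldsymbol{\lambda}}}\left[\ell(h(X), Y)\right], \tag{4} \]
where $\mathcal{D}_{\boldsymbol{\lambda}} = \sum_{c \in \mathcal{C}} \lambda_c\, p(X, Y \mid C = c)$ denotes a joint data distribution over the clients and $\boldsymbol{\lambda}$ is the vector consisting of client weighting coefficients, also solves our proposed minimax group fairness objective given by
\[ \min_{h \in \mathcal{H}} \max_{\boldsymbol{\mu} \in \Delta^{|\mathcal{A}|-1}} \mathbb{E}_{\mathcal{D}_{\boldsymbol{\mu}}}\left[\ell(h(X), Y)\right], \tag{5} \]
where $\mathcal{D}_{\boldsymbol{\mu}} = \sum_{a \in \mathcal{A}} \mu_a\, p(X, Y \mid A = a)$ denotes a joint data distribution over sensitive groups and $\boldsymbol{\mu}$ is the vector of the group weights.
The following lemma illustrates that a model that is minimax fair with respect to the clients is equivalent to a relaxed minimax fair model with respect to the (demographic) groups.
Lemma 1.
Let $\mathbf{P}_{\mathcal{A}}$ denote a matrix whose entry in row $a$ and column $c$ is $p(A = a \mid C = c)$ (i.e., the prior of group $a$ in client $c$). Then, given a solution to the minimax problem across clients
\[ h^{*}, \boldsymbol{\lambda}^{*} \in \arg\min_{h \in \mathcal{H}} \max_{\boldsymbol{\lambda} \in \Delta^{|\mathcal{C}|-1}} \mathbb{E}_{\mathcal{D}_{\boldsymbol{\lambda}}}\left[\ell(h(X), Y)\right], \tag{6} \]
there exists $\boldsymbol{\mu}^{*} = \mathbf{P}_{\mathcal{A}} \boldsymbol{\lambda}^{*}$ that is a solution to the following constrained minimax problem across sensitive groups:
\[ h^{*}, \boldsymbol{\mu}^{*} \in \arg\min_{h \in \mathcal{H}} \max_{\boldsymbol{\mu} \in \mathbf{P}_{\mathcal{A}} \Delta^{|\mathcal{C}|-1}} \mathbb{E}_{\mathcal{D}_{\boldsymbol{\mu}}}\left[\ell(h(X), Y)\right], \tag{7} \]
where the weighting vector $\boldsymbol{\mu}$ is constrained to belong to the simplex subset defined by $\mathbf{P}_{\mathcal{A}} \Delta^{|\mathcal{C}|-1} \subseteq \Delta^{|\mathcal{A}|-1}$. In particular, if the set $\Gamma = \{ \boldsymbol{\mu}' \in \mathbf{P}_{\mathcal{A}} \Delta^{|\mathcal{C}|-1} : \boldsymbol{\mu}' \in \arg\min_{h \in \mathcal{H}} \max_{\boldsymbol{\mu} \in \Delta^{|\mathcal{A}|-1}} \mathbb{E}_{\mathcal{D}_{\boldsymbol{\mu}}}[\ell(h(X), Y)] \} \neq \emptyset$, then $\boldsymbol{\mu}^{*} \in \Gamma$, and the minimax fairness solution across clients is also a minimax fairness solution across demographic groups.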
Lemma 1 proves that being minimax with respect to the clients is equivalent to finding the group minimax model while constraining the weighting vectors to be inside the simplex subset $\mathbf{P}_{\mathcal{A}} \Delta^{|\mathcal{C}|-1} = \{\mathbf{P}_{\mathcal{A}} \boldsymbol{\lambda} : \boldsymbol{\lambda} \in \Delta^{|\mathcal{C}|-1}\}$, i.e., the image of the client simplex under the matrix $\mathbf{P}_{\mathcal{A}}$ of group priors per client. Therefore, if this set already contains a group minimax weighting vector, then the group minimax model is equivalent to the client minimax model. Another way to interpret this result is that being minimax with respect to the clients is the same as being minimax for any group assignment such that linear combinations of the groups’ distributions are able to generate all clients’ distributions, and there is a group minimax weighting vector in $\mathbf{P}_{\mathcal{A}} \Delta^{|\mathcal{C}|-1}$.
Being minimax at the client and group level relies on $\mathbf{P}_{\mathcal{A}} \Delta^{|\mathcal{C}|-1}$ containing the minimax weighting vector. In particular, if for each sensitive group there is a client comprised entirely of this group ($\mathbf{P}_{\mathcal{A}}$ contains a $|\mathcal{A}| \times |\mathcal{A}|$ identity block), then $\mathbf{P}_{\mathcal{A}} \Delta^{|\mathcal{C}|-1} = \Delta^{|\mathcal{A}|-1}$ and group and client level fairness are guaranteed to be fully compatible. Another trivial example is when at least one of the clients’ group priors is equal to a group minimax weighting vector. This result also suggests that client level fairness may differ from group level fairness. This motivates us to develop a new federated learning algorithm to guarantee group fairness that – where the conditions of the lemma hold – also results in client fairness. We experimentally validate the insights deriving from Lemma 1 in Section 6. The proof for Lemma 1 is provided in the supplementary material, Appendix A.
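A small numerical illustration of this gap, for a fixed model only (it does not solve either minimax problem): because each client's distribution is a mixture of group distributions, a client's risk is a convex combination of group risks, so the worst client can look better than the worst group unless some client consists entirely of that group. The function names and numbers below are hypothetical.

```python
def client_risks(group_risks, p_a_given_c):
    """Risk of each client as a mixture of group risks; p_a_given_c[c][a] = p(A=a | C=c)."""
    return [sum(p * r for p, r in zip(row, group_risks)) for row in p_a_given_c]


def induced_group_weights(lam, p_a_given_c):
    """Group weights induced by client weights lam, i.e. the matrix-vector product P_A lam."""
    num_groups = len(p_a_given_c[0])
    return [sum(l * row[a] for l, row in zip(lam, p_a_given_c))
            for a in range(num_groups)]


group_risks = [0.2, 0.5, 0.3]                          # hypothetical r_a(h) for a fixed h
mixed = [[0.6, 0.2, 0.2], [0.3, 0.4, 0.3]]             # every client mixes all groups
single = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # one group per client

worst_group = max(group_risks)
worst_client_mixed = max(client_risks(group_risks, mixed))
worst_client_single = max(client_risks(group_risks, single))

# Putting all client weight on the second mixed client only reaches its group prior.
mu_from_second_client = induced_group_weights([0.0, 1.0], mixed)

print(worst_client_mixed < worst_group)
print(abs(worst_client_single - worst_group) < 1e-12)
```

With mixed clients the adversary over clients cannot place all weight on the worst group, so the client-level worst case underestimates the group-level one; with one group per client the two coincide.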
5 MiniMax Group Fairness Federated Learning Algorithm
We now propose an optimization algorithm – Federated Minimax (FedMinMax) – to solve the group fairness problem in Eq. 3.
We let each client have access to a dataset containing various data points drawn i.i.d according to . We also define three additional sets: (a) is a set containing all data examples associated with group in client ; (b) is the set containing all data examples associated with group across the various clients; and (c) is containing all data examples across groups and across clients. Note again that – in view of our modelling assumptions – it is possible that can be empty for some and some implying that such a client does not have data realizations for such group.
We will also let the model be parameterized via a vector of parameters $\theta$, i.e., $h(\cdot) = h(\cdot; \theta)$ (this vector of parameters could, for example, correspond to the set of weights/biases in a neural network). Then, one can approximate the relevant statistical risks using empirical risks as
(8) 
where , , , , , and . Note that is an estimate of , is an estimate of , and is an estimate of .
We consider the importance weighted empirical risk since the clients do not have access to the data distribution but instead to a dataset with finite samples. Therefore, the clients in coordination with the central server attempt to solve the optimization problem given by:
(9) 
The objective in Eq. 9 can be interpreted as a zero-sum game between two players: the learner aims to minimize the objective by optimizing the model parameters $\theta$, and the adversary seeks to maximize the objective by optimizing the weighting coefficients $\boldsymbol{\mu}$.
We use a non-stochastic variant of the stochastic-AFL algorithm introduced in [AFL]. Our version, provided in Algorithm 1, assumes that all clients are available to participate in each communication round $t$. In each round $t$, the clients receive the latest model parameters, then perform one gradient descent step using all their available data, and then share the updated model parameters along with certain empirical risks with the server. The server (learner) then performs a weighted average of the client model parameters. The server also updates the weighting coefficients using a projected gradient ascent step in order to guarantee that the weighting coefficient updates are consistent with the constraints. We use the Euclidean projection algorithm proposed in [10.1145/1390156.1390191] to implement the projection operation.
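The round described above can be sketched on a toy problem. This is a minimal illustration under stated assumptions, not the paper's Algorithm 1: the model is a single scalar `theta` with squared loss, the importance weights are assumed to be `mu_a * n / n_a`, client averaging uses weights `n_c / n`, group risks are evaluated at the parameters the clients received, and the simplex projection is the standard sort-based Euclidean projection. All function names and data are hypothetical, and for brevity the server-side and client-side computations live in one function.

```python
def project_to_simplex(v):
    """Euclidean projection of v onto the probability simplex (sort-based method)."""
    u = sorted(v, reverse=True)
    cumulative, tau = 0.0, 0.0
    for k, u_k in enumerate(u, start=1):
        cumulative += u_k
        shift = (cumulative - 1.0) / k
        if u_k - shift > 0:  # true for a prefix of the sorted entries; keep the last shift
            tau = shift
    return [max(x - tau, 0.0) for x in v]


def fedminmax_round(theta, mu, clients, lr_theta=0.1, lr_mu=0.1):
    """One synchronous round; `clients` is a list of local datasets of (group, y) pairs."""
    num_groups = len(mu)
    n = sum(len(samples) for samples in clients)
    # Global group counts; every group must appear in at least one client.
    n_a = [sum(g == a for samples in clients for g, _ in samples)
           for a in range(num_groups)]
    w = [mu[a] * n / n_a[a] for a in range(num_groups)]  # assumed importance weights

    new_theta, loss_sums = 0.0, [0.0] * num_groups
    for samples in clients:
        # Client: one full-batch gradient step on its importance-weighted risk.
        grad = sum(w[g] * 2.0 * (theta - y) for g, y in samples) / len(samples)
        theta_c = theta - lr_theta * grad
        # Server: average client parameters with weights n_c / n.
        new_theta += len(samples) / n * theta_c
        # Client: local per-group loss sums at the received parameters.
        for g, y in samples:
            loss_sums[g] += (theta - y) ** 2
    group_risks = [loss_sums[a] / n_a[a] for a in range(num_groups)]
    # Server: projected gradient ascent on the group weighting coefficients.
    new_mu = project_to_simplex([m + lr_mu * r for m, r in zip(mu, group_risks)])
    return new_theta, new_mu, group_risks


# Three clients, two groups; the second client has no data from group 1.
clients = [
    [(0, 0.0), (0, 0.1), (0, -0.1), (1, 1.0)],
    [(0, 0.0), (0, 0.0)],
    [(1, 0.9), (1, 1.1)],
]
theta, mu = 0.0, [5 / 8, 3 / 8]  # start from the empirical group priors
for _ in range(500):
    theta, mu, risks = fedminmax_round(theta, mu, clients)
print(theta, mu, risks)
```

On this toy problem the iterates settle where the two group risks are equal, which is the minimax solution for a scalar predictor here; with plain averaging the majority group would dominate instead.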
We can show that our proposed algorithm can exhibit convergence guarantees.
Lemma 2.
Consider our federated learning setting (Figure 1, right) where each entity $c$ has access to a local dataset $\mathcal{S}_c$, and a centralized machine learning setting (Figure 1, left) where there is a single entity that has access to a single dataset $\mathcal{S} = \bigcup_{c \in \mathcal{C}} \mathcal{S}_c$ (i.e., this single entity in the centralized setting has access to the data of the various clients in the distributed setting). Then, Algorithm 1 (federated) and Algorithm 2 (non-federated, in supplementary material, Appendix B) lead to the same global model provided that learning rates and model initialization are identical.
The proof for Lemma 2 is provided in Appendix A. This lemma shows that our federated learning algorithm inherits any convergence guarantees of existing centralized machine learning algorithms. In particular, assuming that one can model the single gradient descent step using an approximate Bayesian oracle [chen2018], we can show that a centralized algorithm converges and hence our FedMinMax one converges too (under mild conditions on the loss function, hypothesis class, and learning rates). See Theorem 7 in [chen2018].
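The intuition behind this equivalence can be checked on a toy model step: when every client takes one full-batch gradient step from the same parameters and the server averages the results with weights proportional to local dataset sizes, the outcome matches a single gradient step on the pooled data. The sketch below assumes a scalar model with squared loss and fixed per-group importance weights; the names and numbers are hypothetical and it covers only the model update, not the weight update.

```python
def weighted_grad(theta, samples, w):
    """Gradient of the importance-weighted mean squared loss over (group, y) samples."""
    return sum(w[g] * 2.0 * (theta - y) for g, y in samples) / len(samples)


def federated_step(theta, clients, w, lr):
    """Each client takes one local step; the server averages with weights n_c / n."""
    n = sum(len(samples) for samples in clients)
    return sum(len(samples) / n * (theta - lr * weighted_grad(theta, samples, w))
               for samples in clients)


def centralized_step(theta, clients, w, lr):
    """A single entity takes one step on the union of all client datasets."""
    pooled = [sample for samples in clients for sample in samples]
    return theta - lr * weighted_grad(theta, pooled, w)


clients = [
    [(0, 0.0), (0, 0.1), (0, -0.1), (1, 1.0)],
    [(0, 0.0), (0, 0.0)],
    [(1, 0.9), (1, 1.1)],
]
w = [0.8, 1.4]  # hypothetical importance weights, one per group

fed_theta = federated_step(0.3, clients, w, lr=0.1)
cen_theta = centralized_step(0.3, clients, w, lr=0.1)
print(abs(fed_theta - cen_theta) < 1e-12)
```

The agreement relies on clients taking exactly one step per round from identical parameters; with several local steps the client trajectories would drift apart and the averaged model would in general differ from the centralized one.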
6 Experimental Results
In this section we empirically showcase the applicability and competitive performance of the proposed federated learning algorithm. We apply FedMinMax to diverse federated learning scenarios by utilizing common benchmark datasets with multiple targets and sensitive groups. In particular, we perform experiments on the following datasets:

Synthetic. We generated a synthetic dataset for binary classification involving two sensitive groups (i.e., $|\mathcal{A}| = 2$). Let $\mathcal{N}(\mu, \sigma^2)$ be the normal distribution with $\mu$ being the mean and $\sigma^2$ being the variance, and $\mathrm{Ber}(p)$ the Bernoulli distribution with probability $p$. The data were generated assuming the group variable , the input features variable and the target variable , where is the optimal hypothesis for group . We select . As illustrated in Figure 2, left side, the optimal hypothesis is equal to the optimal model for group . 
Adult [adult]. Adult is a binary classification dataset consisting of entries for predicting yearly income based on twelve input features such as age, race, education and marital status. We consider four sensitive groups (i.e., $|\mathcal{A}| = 4$) created by combining the gender labels and the yearly income as follows: {Male w/ income > 50K, Male w/ income ≤ 50K, Female w/ income > 50K, Female w/ income ≤ 50K}.

FashionMNIST [fashionmnist]. FashionMNIST is a grayscale image dataset which includes 60,000 training images and 10,000 testing images. The images consist of $28 \times 28$ pixels and are classified into 10 clothing categories. In our experiments we consider each of the target categories to be a sensitive group too (i.e., $|\mathcal{A}| = 10$).
CIFAR10 [cifar]. CIFAR10 is a collection of 60,000 colour images of $32 \times 32$ pixels. Each image contains one out of 10 object classes. There are 50,000 training images and 10,000 test images. We use all ten target categories, which we assign both as targets and sensitive groups (i.e., $|\mathcal{A}| = 10$).

ACS Employment [ding2021retiring]. ACS Employment is a recent dataset constructed using ACS PUMS data for predicting whether an individual is employed or not. For our experiments we use the 2018 1-Year data for all the US states and Puerto Rico. We combine race and utility labels to generate the following sensitive groups: {Employed White, Employed Black, Employed Other, Unemployed White, Unemployed Black, Unemployed Other} (i.e., $|\mathcal{A}| = 6$). We also conduct experiments where the sensitive class is race using the original 9 labels, which we report in the supplementary material, Appendix D.
We also examine three federated learning settings, that we categorize based on the sensitive group allocation on clients as follows:

Equal access to Sensitive Groups (ESG), where every client has access to all sensitive groups but does not have enough data to train a model individually. Each client in the federation has access to the same amount of the sensitive classes (i.e., and ). Here we examine a case where group and client fairness are not equivalent.

Partial access to Sensitive Groups (PSG), where each participant has access to a subset of the available groups memberships. In particular, the data distribution is unbalanced across participants since the size of local datasets differs (i.e., ). Akin to ESG, this is a scenario where group and client fairness are incompatible. We use this scenario to compare the performances when there is low or no local representation of particular groups.

Access to a Single Sensitive Group (SSG), where each client holds data from one sensitive group, for showcasing the equivalence of the group and client fairness objectives derived from Lemma 1. Similarly to the PSG setting, the size of the local dataset varies across clients.
Note that ESG is an i.i.d. data scenario while PSG and SSG are non-i.i.d. data settings. Also note that each client’s data is unique, meaning that there are no duplicated examples across clients. In all experiments we consider a federation consisting of 40 clients and a single server that orchestrates the training procedure. We benchmark our approach against AFL [AFL], q-FedAvg [Li2020Fair], TERM [li2021tilted] and FedAvg [DBLP:journals/corr/McMahanMRA16]. Further, as a baseline, we also run FedMinMax with one client (akin to centralized ML), which we denote Centralized Minmax Baseline, to confirm Lemma 2. We do not compare to baselines that explicitly employ a different fairness metric (e.g., demographic parity) since this is not the focus of this work. For all the datasets, we compute the means and standard deviations of the accuracies and risks over three runs. We assume that every client is available to participate at each communication round for every method to make the comparison fairer. More details about model architectures and experiments are provided in Appendix C.
We begin by investigating the worst group, the best group and the average utility performance for the Adult, FashionMNIST, CIFAR10 and ACS Employment datasets in Figure 3. We present the mean and standard deviation of the accuracies and risks on the test dataset. FedMinMax enjoys a similar accuracy to the Centralized Minmax Baseline in all settings, as proved in Lemma 2. AFL is similar to FedMinMax and Centralized Minmax Baseline only in SSG, where group fairness is implied by client fairness, in line with Lemma 1. FedAvg has similar best accuracy across federated settings; however, the accuracy of the worst group decreases as the local data becomes more heterogeneous (i.e., in PSG and SSG). In many datasets, q-FedAvg and TERM have superior performance on the worst group compared to AFL and FedAvg in PSG and ESG, but do not achieve minimax group fairness in any of the FL settings. Note that FedMinMax has the best worst group performance in all settings, as expected.
For the numerical values, illustrating the efficiency of the proposed approach for every setting and dataset, see Tables D.2, D.4, D.5, D.7 and D.9, in the supplementary material.
Next we show the final group weighting coefficients for the minimax approaches AFL, FedMinMax, and Centralized Minmax Baseline in Figure 4. Note that the PSG scenario is valid only for datasets where $|\mathcal{A}| > 2$; otherwise it is equivalent to the SSG setting.
The proposed approach yields similar group weights across all settings. FedMinMax also achieves the same weighting coefficients as Centralized Minmax Baseline, in line with Lemma 2. AFL produces weights similar to the group priors in ESG that move towards the minimax weighting coefficients as the heterogeneity w.r.t. the sensitive groups increases. AFL achieves similar weights to FedMinMax and Centralized Minmax Baseline only in the SSG scenario, where each participant has access to exactly one group, following Lemma 1. Note that the group weighting coefficients are updated based on the risks calculated on the training set and might not generalize to the testing set for every dataset. We provide a complete description of the weighting coefficients for each approach in Tables D.1, D.3, D.6, D.8 and D.10, in the supplementary material.
Finally, we illustrate the efficiency of considering global demographics across entities instead of multiple local ones, as in [zhang2021unified]. For these experiments, we repurpose our algorithm – we call the adjusted version LocalFedMinMax – so that the adversary proposes a weighting coefficient for each group located in a client (i.e., ). Recall that the adversary in our proposed algorithm uses a single weighting coefficient for every common demographic group (i.e., ). We provide the detailed description of LocalFedMinMax in Algorithm 3, Appendix E.
In Table 1 we report results for both approaches on two federations consisting of 10 and 40 participants, respectively. LocalFedMinMax and FedMinMax offer similar improvement on the worst group on SSG regardless of the number of clients. We also notice a similar behavior in the smaller federated network for the ESG scenario. In the remaining settings, LocalFedMinMax leads to worse performance as the number of clients increases and the amount of data for each group per client decreases. On the other hand, FedMinMax is not affected by the local group representation since it aggregates the statistics received from each client and updates the weights for (global) demographics, leading to better generalization performance.
Table 1: LocalFedMinMax vs. FedMinMax on federations of 10 and 40 clients (mean ± standard deviation).
Dataset  Method  ESG (10 clients)  PSG (10 clients)  SSG (10 clients)  ESG (40 clients)  PSG (40 clients)  SSG (40 clients)
FashionMNIST  LocalFedMinMax  0.316±0.092  0.331±0.007  0.309±0.013  0.346±0.081  0.331±0.021  0.31±0.005
FashionMNIST  FedMinMax  0.31±0.005  0.308±0.012  0.308±0.003  0.307±0.01  0.31±0.008  0.309±0.011
CIFAR10  LocalFedMinMax  0.358±0.008  0.353±0.042  0.352±0.0  0.381±0.004  0.378±0.005  0.352±0.007
CIFAR10  FedMinMax  0.352±0.02  0.351±0.005  0.351±0.0  0.351±0.002  0.351±0.009  0.351±0.002
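To make the difference between the two adversaries concrete, the sketch below contrasts the statistics each one works with. It assumes that every client reports, per group, a sample count and a mean risk; the array names and all numbers are hypothetical. The global variant pools these into one risk per demographic group, while the local variant keeps one risk per (client, group) pair.

```python
import numpy as np

# Hypothetical statistics reported by 3 clients for 2 groups.
# counts[k, a]: number of samples of group a held by client k
# risks[k, a]:  mean risk of group a on client k (0 where the group is absent)
counts = np.array([[90, 10],
                   [50, 50],
                   [0, 100]])
risks = np.array([[0.20, 0.60],
                  [0.25, 0.40],
                  [0.00, 0.45]])

def global_group_risks(counts, risks):
    """Pool client statistics into one risk per demographic group."""
    totals = counts.sum(axis=0)
    return (counts * risks).sum(axis=0) / totals

def local_group_risks(counts, risks):
    """Keep one risk per (client, group) pair; pairs without data are dropped."""
    return {(k, a): float(risks[k, a])
            for k in range(counts.shape[0])
            for a in range(counts.shape[1])
            if counts[k, a] > 0}

pooled = global_group_risks(counts, risks)
per_client = local_group_risks(counts, risks)
print(pooled)            # one entry per group
print(len(per_client))   # one entry per (client, group) pair that holds data
```

In this toy example the local estimate for the second group on the first client rests on 10 samples, whereas the pooled estimate for that group uses all 160. This is the kind of thinning that the text above associates with a growing number of clients.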
7 Conclusion
In this work, we formulate (demographic) group fairness in federated learning setups where different participating entities may only have access to a subset of the population groups during the training phase (but not necessarily the testing phase), exhibiting minmax fairness performance guarantees akin to those in centralized machine learning settings.
We formally show how our fairness definition differs from the existing fair federated learning works, offering conditions under which conventional client-level fairness is equivalent to group-level fairness. We also provide an optimization algorithm, FedMinMax, to solve the minmax group fairness problem in federated setups that exhibits minmax guarantees akin to those of minmax group fair centralized machine learning algorithms.
We empirically confirm that our method outperforms existing federated learning methods in terms of group fairness in various learning settings and validate the conditions under which the competing approaches yield the same solution as our objective.
References
Appendix A Appendix: Proofs
Lemma 1. Let denote a matrix whose entry in row and column is (i.e., the prior of group in client ). Then, given a solution to the minimax problem across clients
(10) 
that is solution to the following constrained minimax problem across sensitive groups:
(11) 
where the weighting vector is constrained to belong to the simplex subset defined by . In particular, if the set : , then , and the minimax fairness solution across clients is also a minimax fairness solution across demographic groups.
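As an illustration of the constrained set in the lemma, the sketch below builds two hypothetical prior matrices (rows are groups, columns are clients; all numbers are made up) and maps client weights to group weights with a matrix-vector product, which is our reading of the construction above. When every client holds exactly one group, the client simplex covers the whole group simplex. When all clients share the same priors, every client weighting collapses to those priors.

```python
import numpy as np

rng = np.random.default_rng(0)

def group_weights(prior_matrix, client_weights):
    """Map client weights to group weights: mu = P @ lambda."""
    return prior_matrix @ client_weights

# SSG-like: 2 groups, 4 clients, each client holds exactly one group.
P_single = np.array([[1.0, 1.0, 0.0, 0.0],
                     [0.0, 0.0, 1.0, 1.0]])
# ESG-like: every client has the same group priors.
P_equal = np.tile(np.array([[0.7], [0.3]]), (1, 4))

lam = rng.dirichlet(np.ones(4))          # random point in the client simplex
mu_single = group_weights(P_single, lam)
mu_equal = group_weights(P_equal, lam)

# Columns of P sum to one, so mu is again a probability vector.
print(mu_single, mu_single.sum())
print(mu_equal, mu_equal.sum())

# With one group per client, any group weighting is reachable, for example
# by putting all client mass on a client that holds the targeted group.
target = np.array([0.0, 1.0])
lam_target = np.array([0.0, 0.0, 1.0, 0.0])
reached = group_weights(P_single, lam_target)
print(reached)
```

This mirrors the empirical observation that AFL stays close to the group priors in ESG and matches the minimax weights only in SSG.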
Proof. The objective for optimizing the global model for the worst mixture of client distributions is:
(12) 
given that . Since with being the prior of for client , and is the distribution conditioned on the sensitive group , Eq. 12 can be rewritten as
(13) 
where , . Note that this creates the vector . It holds that the set of possible vectors satisfies , since , with and .
Then, from the equivalence in Equation 13 we have that
(14) 
and
(15) 
with have the same minimax risk, that is
(16) 
In particular, if the space contains any group minimax fair weights, meaning that the set : is not empty, then it follows that any (solution to Equation 15) is already minimax fair with respect to the groups , and the client-level minimax solution is also a minimax solution across sensitive groups.
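The chain of equalities behind the proof can be written compactly as follows. The notation is assumed for illustration and may differ from the paper's: $\lambda$ denotes client weights, $\mu$ group weights, $r_k(h)$ and $r_a(h)$ the risks of hypothesis $h$ on client $k$ and on group $a$, and $\mathbf{P}$ the matrix of priors $p(a \mid k)$ from the lemma. The sketch also assumes that the group-conditional distribution $p(x, y \mid a)$ is shared by all clients.

```latex
\begin{align*}
\sum_{k} \lambda_k \, r_k(h)
  &= \sum_{k} \lambda_k \sum_{a} p(a \mid k)\, r_a(h)
     && \text{since } p(x, y \mid k) = \sum_{a} p(a \mid k)\, p(x, y \mid a) \\
  &= \sum_{a} \Big( \sum_{k} p(a \mid k)\, \lambda_k \Big)\, r_a(h) \\
  &= \sum_{a} \mu_a \, r_a(h),
     && \mu = \mathbf{P} \lambda .
\end{align*}
% Each column of P sums to one, so mu is a probability vector whenever lambda is:
\[
\sum_{a} \mu_a = \sum_{k} \lambda_k \sum_{a} p(a \mid k) = \sum_{k} \lambda_k = 1 .
\]
```

Under these assumptions, maximizing over $\lambda$ in the client simplex is the same as maximizing over $\mu$ in the image of that simplex under $\mathbf{P}$, which is the constrained group-level problem of the lemma.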
Lemma 2. Consider our federated learning setting (Figure 1, right) where each entity has access to a local dataset , and a centralized machine learning setting (Figure 1, left) where there is a single entity that has access to a single dataset (i.e., this single entity in the centralized setting has access to the data of the various clients in the distributed setting). Then, Algorithm 1 (federated) and Algorithm 2 (non-federated, in supplementary material, Appendix B) lead to the same global model provided that learning rates and model initialization are identical.
Proof. We will show that FedMinMax (Algorithm 1) is equivalent to the centralized algorithm (Algorithm 2) under the following conditions:
(i) the dataset in centralized MinMax is the union of the client datasets used in FedMinMax; and
(ii) the model initialization, the number of adversarial rounds (in the federated Algorithm 1, we also refer to the adversarial rounds as communication rounds), the learning rate for the adversary, and the learning rate for the learner are identical for both algorithms.
This can be done by showing that lines 3–7 in Algorithm 1 are entirely equivalent to step 3 in Algorithm 2. In particular, note that we can write
(17) 
(18) 
Therefore, the model update
(19) 
associated with the step in line 7 at a given round of Algorithm 1, is entirely equivalent to the model update
(20) 
associated with the step in line 3 at the same round of Algorithm 2, provided that the model entering the update is the same for both algorithms.
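A small numerical check can make this equivalence tangible. The sketch assumes a linear model with squared loss. It also assumes that each client scales a sample's loss by its group weight divided by the global group frequency, and that the server averages client gradients in proportion to client size. This weighting scheme and all names are our assumptions for illustration and are not copied from Algorithm 1.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, n_groups, n_clients = 60, 3, 2, 4

X = rng.normal(size=(n, d))
y = rng.normal(size=n)
group = rng.integers(0, n_groups, size=n)     # sensitive group of each sample
client = rng.integers(0, n_clients, size=n)   # client holding each sample
theta = rng.normal(size=d)                    # shared model parameters
mu = np.array([0.8, 0.2])                     # adversary's group weights

def per_sample_grads(X, y, theta):
    """Gradient of 0.5 * (x @ theta - y) ** 2 for every sample."""
    return (X @ theta - y)[:, None] * X

grads = per_sample_grads(X, y, theta)

# Centralized: gradient of sum_a mu_a * (mean loss of group a).
central = sum(mu[a] * grads[group == a].mean(axis=0) for a in range(n_groups))

# Federated: clients reweight samples by mu_a / rho_a,
# and the server averages client gradients with weights n_k / n.
rho = np.bincount(group, minlength=n_groups) / n   # global group frequencies
sample_w = mu[group] / rho[group]
federated = np.zeros(d)
for k in range(n_clients):
    mask = client == k
    if mask.any():
        local = (sample_w[mask, None] * grads[mask]).mean(axis=0)
        federated += mask.sum() / n * local

print(np.allclose(central, federated))
```

The two gradients agree because the client-size weights cancel the local averaging, leaving a per-group mean gradient weighted by the adversary's coefficients.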
Appendix B Appendix: Centralized Minimax Algorithm
We provide the centralized version of FedMinMax in Algorithm 2.
Appendix C Appendix: Experimental Details
Experimental Setting and Model Architectures. For AFL and FedMinMax the batch size is equal to the number of examples per client, while for TERM, FedAvg and q-FedAvg it is equal to 100 (the value listed for FedAvg and q-FedAvg in Table C.1). For the synthetic dataset, we use an MLP architecture consisting of four hidden layers of size 512. In the experiments for Adult we use a single-layer MLP with 512 neurons. For FashionMNIST we use a CNN architecture with two 2D convolutional layers with kernel size 3, stride 1, and padding 1. Each convolutional layer is followed by a max-pooling layer with kernel size 2, stride 2, dilation 1, and padding 0. For CIFAR10 we use a ResNet18 architecture without batch normalization. Finally, for the ACS Employment dataset we use a single-layer MLP with 512 neurons for the experiments where the sensitive label is the combination of race and employment, and Logistic Regression for the experiments with the original 9 races. For training we use either the cross entropy or the Brier score loss function. We perform a grid search over method-specific hyperparameters (where appropriate), including the tilt and the number of local epochs. We report a summary of the experimental setup in Table C.1. During the training process we tune the hyperparameters based on the validation set for each approach. The mean and standard deviation reported on the results are calculated over three runs. We use 3-fold cross validation to split the data into training and validation for each run.
Table C.1: Summary of the experimental setup.
Dataset  Setting  Method  Learner learning rate  Batch Size  Loss  Hypothesis Type  Epochs  Adversary learning rate

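Synthetic  ESG,SSG  AFL  0.1  examples per client  Brier Score  MLP (4x512)  -  0.1
Synthetic  ESG,SSG  FedAvg  0.1  100  Brier Score  MLP (4x512)  15  -
Synthetic  ESG,SSG  q-FedAvg  0.1  100  Brier Score  MLP (4x512)  15  -
Synthetic  ESG,SSG  FedMinMax (ours)  0.1  examples per client  Brier Score  MLP (4x512)  -  0.1
Synthetic  ESG,SSG  Centralized Minmax  0.1  -  Brier Score  MLP (4x512)  -  0.1
Adult  ESG,SSG,PSG  AFL  0.01  examples per client  Cross Entropy  MLP (512)  -  0.01
Adult  ESG,SSG,PSG  FedAvg  0.01  100  Cross Entropy  MLP (512)  15  -
Adult  ESG,SSG,PSG  q-FedAvg  0.01  100  Cross Entropy  MLP (512)  15  -
Adult  ESG,SSG,PSG  FedMinMax (ours)  0.01  examples per client  Cross Entropy  MLP (512)  -  0.01
Adult  ESG,SSG,PSG  Centralized Minmax  0.01  -  Cross Entropy  MLP (512)  -  0.01
FashionMNIST  ESG,SSG,PSG  AFL  0.1  examples per client  Brier Score  CNN  -  0.1
FashionMNIST  ESG,SSG,PSG  FedAvg  0.1  100  Brier Score  CNN  15  -
FashionMNIST  ESG,SSG,PSG  q-FedAvg  0.1  100  Brier Score  CNN  15  -
FashionMNIST  ESG,SSG,PSG  FedMinMax (ours)  0.1  examples per client  Brier Score  CNN  -  0.1
FashionMNIST  ESG,SSG,PSG  Centralized Minmax  0.1  -  Brier Score  CNN  -  0.1
CIFAR10  ESG,SSG,PSG  AFL  0.1  examples per client  Brier Score  ResNet18 w/o BN  -  0.01
CIFAR10  ESG,SSG,PSG  FedAvg  0.1  100  Brier Score  ResNet18 w/o BN  3  -
CIFAR10  ESG,SSG,PSG  q-FedAvg  0.1  100  Brier Score  ResNet18 w/o BN  3  -
CIFAR10  ESG,SSG,PSG  FedMinMax (ours)  0.1  examples per client  Brier Score  ResNet18 w/o BN  -  0.01
CIFAR10  ESG,SSG,PSG  Centralized Minmax  0.1  -  Brier Score  ResNet18 w/o BN  -  0.01
ACS Employment (6 sensitive groups)  ESG,SSG,PSG  AFL  0.01  examples per client  Cross Entropy  MLP (512)  -  0.01
ACS Employment (6 sensitive groups)  ESG,SSG,PSG  FedAvg  0.01  100  Cross Entropy  MLP (512)  10  -
ACS Employment (6 sensitive groups)  ESG,SSG,PSG  q-FedAvg  0.01  100  Cross Entropy  MLP (512)  10  -
ACS Employment (6 sensitive groups)  ESG,SSG,PSG  FedMinMax (ours)  0.01  examples per client  Cross Entropy  MLP (512)  -  0.01
ACS Employment (6 sensitive groups)  ESG,SSG,PSG  Centralized Minmax  0.01  -  Cross Entropy  MLP (512)  -  0.01
ACS Employment (9 sensitive groups)  ESG,SSG,PSG  AFL  0.01  examples per client  Cross Entropy  Logistic Regression  -  0.01
ACS Employment (9 sensitive groups)  ESG,SSG,PSG  FedAvg  0.01  100  Cross Entropy  Logistic Regression  10  -
ACS Employment (9 sensitive groups)  ESG,SSG,PSG  q-FedAvg  0.01  100  Cross Entropy  Logistic Regression  10  -
ACS Employment (9 sensitive groups)  ESG,SSG,PSG  FedMinMax (ours)  0.01  examples per client  Cross Entropy  Logistic Regression  -  0.01
ACS Employment (9 sensitive groups)  ESG,SSG,PSG  Centralized Minmax  0.01  -  Cross Entropy  Logistic Regression  -  0.01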
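As a quick sanity check of the FashionMNIST architecture described in the experimental setting, the following sketch applies the standard output-size formula for convolution and pooling layers to the stated kernel, stride, padding and dilation values. The 28x28 input is the standard FashionMNIST image size. The helper name is ours, and channel counts are left out because the text does not state them.

```python
def output_size(size, kernel, stride, padding, dilation=1):
    """Spatial output size of a conv or pooling layer (floor convention)."""
    effective_kernel = dilation * (kernel - 1) + 1
    return (size + 2 * padding - effective_kernel) // stride + 1

size = 28  # FashionMNIST images are 28x28
trace = [size]
for _ in range(2):  # two conv layers, each followed by max-pooling
    # convolution: kernel 3, stride 1, padding 1
    size = output_size(size, kernel=3, stride=1, padding=1)
    trace.append(size)
    # max-pooling: kernel 2, stride 2, padding 0, dilation 1
    size = output_size(size, kernel=2, stride=2, padding=0, dilation=1)
    trace.append(size)

print(trace)
```

The convolutions preserve the spatial size and each pooling layer halves it, so the feature map entering any subsequent layers would be 7x7 under these settings.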
Software & Hardware. The proposed algorithms and experiments are written in Python, leveraging PyTorch [NEURIPS2019_9015]. The experiments were carried out using one NVIDIA Tesla V100 GPU.
Appendix D Appendix: Additional Results
Experiments on Synthetic dataset. Recall that we consider two sensitive groups in the synthetic dataset. In the Equal access to Sensitive Groups (ESG) setting, we distribute the two groups on 40 clients, while for the Single access to Sensitive Groups (SSG) case, every client has access to a single group, each group is distributed to 20 clients, and the number of samples in each local dataset varies across clients. There is no Partial access to Sensitive Groups (PSG) setting for binary sensitive group scenarios since it is equivalent to SSG. A comparison of the testing group risks is provided in Table D.2 and the weighting coefficients for the groups are given in Table D.1.
Table D.1: Group weighting coefficients for the Synthetic dataset.
Setting  Method  Worst Group  Best Group
ESG  AFL  0.528  0.472
ESG  FedMinMax (ours)  0.999  0.001
SSG  AFL  0.999  0.001
SSG  FedMinMax (ours)  0.999  0.001
-  Centralized Minmax Baseline  0.999  0.001
Table D.2: Testing group risks for the Synthetic dataset (mean ± standard deviation).
Setting  Method  Worst Group Risk  Best Group Risk
ESG  AFL  0.485±0.0  0.216±0.001
ESG  FedAvg  0.487±0.0  0.214±0.002
ESG  q-FedAvg (q=0.2)  0.479±0.002  0.22±0.002
ESG  q-FedAvg (q=5.0)  0.478±0.002  0.223±0.004
ESG  TERM (tilt=1.0)  0.469±0.0  0.261±0.001
ESG  FedMinMax (ours)  0.451±0.0  0.31±0.001
SSG  AFL  0.451±0.0  0.31±0.001
SSG  FedAvg  0.483±0.002  0.219±0.001
SSG  q-FedAvg (q=0.2)  0.476±0.001  0.221±0.002
SSG  q-FedAvg (q=5.0)  0.468±0.005  0.274±0.004
SSG  TERM (tilt=1.0)  0.461±0.004  0.272±0.001
SSG  FedMinMax (ours)  0.451±0.0  0.309±0.003
-  Centralized Minmax Baseline  0.451±0.0  0.308±0.001
Experiments on Adult dataset. In the Equal access to Sensitive Groups (ESG) setting, we distribute the 4 groups equally on 40 clients. In the Partial access to Sensitive Groups (PSG) setting, 20 clients have access to Males subgroups, and the other 20 to subgroups relating to Females. In the Single access to Sensitive Groups (SSG) setting, every client has access to a single group and each group is distributed to 10 clients. We show the testing group risks in Table D.4 and the group weights in Table D.3.
Table D.3: Group weighting coefficients for the Adult dataset.
Setting  Method  Males earning <= 50K  Males earning > 50K  Females earning <= 50K  Females earning > 50K
ESG  AFL  0.475  0.214  0.284  0.028
ESG  FedMinMax (ours)  0.697  0.301  0.001  0.001
SSG  AFL  0.705  0.293  0.003  0.001
SSG  FedMinMax (ours)  0.697  0.301  0.001  0.001
PSG  AFL  0.500  0.229  0.244  0.027
PSG  FedMinMax (ours)  0.705  0.293  0.001  0.001
-  Centralized Minmax Baseline  0.697  0.301  0.001  0.001
Table D.4: Testing group risks for the Adult dataset (mean ± standard deviation).
Setting  Method  Males earning <= 50K  Males earning > 50K  Females earning <= 50K  Females earning > 50K
ESG  AFL  0.263±0.002  0.701±0.003  0.086±0.002  1.096±0.008
ESG  FedAvg  0.255±0.002  0.697±0.004  0.081±0.001  1.121±0.009
ESG  q-FedAvg  0.263±0.003  0.697±0.004  0.084±0.001  1.1±0.006
ESG  TERM  0.381±0.101  0.607±0.04  0.224±0.06  0.725±0.021
ESG  FedMinMax (ours)  0.414±0.003  0.453±0.003  0.415±0.008  0.347±0.007
SSG  AFL  0.418±0.006  0.452±0.009  0.416±0.002  0.349±0.007
SSG  FedAvg  0.263±0.001  0.704±0.002  0.07±0.0  1.23±0.002
SSG  q-FedAvg  0.261±0.001  0.683±0.002  0.082±0.001  1.117±0.01
SSG  TERM  0.358±0.016  0.579±0.002  0.286±0.031  0.693±0.071
SSG  FedMinMax (ours)  0.413±0.002  0.453±0.005  0.414±0.006  0.348±0.01
PSG  AFL  0.274±0.003  0.757±0.009  0.094±0.002  1.285±0.022
PSG  FedAvg  0.263±0.001  0.7±0.001  0.069±0.001  1.226±0.007
PSG  q-FedAvg  0.263±0.004  0.752±0.014  0.09±0.004  1.239±0.032
PSG  TERM  0.485±0.195  0.581±0.108  0.367±0.316  0.69±0.003
PSG  FedMinMax (ours)  0.411±0.002  0.452±0.006  0.417±0.001  0.346±0.008
-  Centralized Minmax Baseline  0.412±0.004  0.453±0.005  0.416±0.012  0.347±0.004
Experiments on FashionMNIST dataset. For the Equal access to Sensitive Groups (ESG) setting, each client in the federation has access to the same amount of data from each of the 10 classes. In the Partial access to Sensitive Groups (PSG) setting, 20 of the participants have access only to groups T-shirt, Trouser, Pullover, Dress and Coat. The remaining 20 clients own data from groups Sandal, Shirt, Sneaker, Bag and Ankle Boot. Finally, in the Single access to Sensitive Groups (SSG) setting, every group is owned by 4 clients only and all clients have access to just one group. The group risks are provided in Table D.5. We also show the weighting coefficients for each sensitive group in Table D.6.
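The three FashionMNIST partitions can be sketched as a mapping from clients to the groups they are allowed to hold. The sketch only encodes the access pattern stated above (40 clients, 10 groups). Which client index receives which groups is an arbitrary choice made here, and how samples are split among clients that share a group is not modelled. Function and variable names are illustrative.

```python
GROUPS = ["T-shirt", "Trouser", "Pullover", "Dress", "Coat",
          "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot"]
N_CLIENTS = 40

def groups_per_client(setting):
    """Return, for each client index, the list of groups it has access to."""
    if setting == "ESG":   # every client sees all groups
        return {k: list(GROUPS) for k in range(N_CLIENTS)}
    if setting == "PSG":   # 20 clients see the first five groups, 20 the rest
        half = N_CLIENTS // 2
        return {k: GROUPS[:5] if k < half else GROUPS[5:]
                for k in range(N_CLIENTS)}
    if setting == "SSG":   # one group per client, each group held by 4 clients
        per_group = N_CLIENTS // len(GROUPS)
        return {k: [GROUPS[k // per_group]] for k in range(N_CLIENTS)}
    raise ValueError(f"unknown setting: {setting}")

for setting in ("ESG", "PSG", "SSG"):
    access = groups_per_client(setting)
    print(setting, access[0], access[N_CLIENTS - 1])
```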
Table D.5: Testing group risks for the FashionMNIST dataset (mean ± standard deviation).
Setting  Method  T-shirt  Trouser  Pullover  Dress  Coat  Sandal  Shirt  Sneaker  Bag  Ankle boot
ESG  AFL  0.239±0.003  0.046±0.0  0.262±0.001  0.159±0.001  0.252±0.004  0.06±0.0  0.494±0.004  0.067±0.001  0.049±0.0  0.07±0.001
ESG  FedAvg  0.243±0.003  0.046±0.0  0.262±0.001  0.158±0.003  0.253±0.002  0.061±0.0  0.492±0.003  0.068±0.0  0.049±0.0  0.069±0.0
ESG  q-FedAvg  0.268±0.051  0.047±0.005  0.312±0.016  0.164±0.029  0.306±0.052  0.039±0.003  0.477±0.006  0.074±0.001  0.036±0.005  0.056±0.008
ESG  TERM  0.256±0.066  0.048±0.008  0.31±0.083  0.175±0.022  0.294±0.016  0.041±0.012  0.467±0.002  0.066±0.019  0.038±0.011  0.062±0.018
ESG  FedMinMax (ours)  0.261±0.006  0.191±0.016  0.256±0.027  0.217±0.013  0.223±0.031  0.207±0.027  0.307±0.01  0.172±0.016  0.193±0.021  0.156±0.011
SSG  AFL  0.267±0.009  0.194±0.023  0.236±0.013  0.226±0.012  0.262±0.012  0.201±0.026  0.307±0.003  0.178±0.033  0.205±0.025  0.162±0.021
SSG  FedAvg  0.227±0.003  0.039±0.001  0.236±0.004  0.143±0.003  0.232±0.003  0.051±0.001  0.463±0.003  0.067±0.0  0.041±0.0  0.063±0.001
SSG  q-FedAvg  0.24±0.001  0.041±0.008  0.246±0.026  0.142±0.014  0.257±0.028  0.036±0.001  0.425±0.002  0.059±0.014  0.027±0.002  0.042±0.007
SSG  TERM  0.251±0.011  0.034±0.003  0.26±0.017  0.144±0.005  0.242±0.034  0.04±0.004  0.399±0.017  0.05±0.003  0.026±0.001  0.044±0.001
SSG  FedMinMax (ours)  0.269±0.012  0.2±0.026  0.238±0.017  0.231±0.013  0.252±0.034  0.2±0.024  0.309±0.011  0.177±0.03  0.205±0.032  0.169±0.013
PSG  AFL  0.244±0.007  0.032±0.001  0.257±0.066  0.122±0.006  0.209±0.098  0.045±0.002  0.425±0.019  0.059±0.001  0.041±0.001  0.062±0.001
PSG  FedAvg  0.229±0.008  0.039±0.0  0.236±0.004  0.142±0.002  0.232±0.003  0.052±0.001  0.464±0.011  0.067±0.001  0.042±0.001  0.063±0.001
PSG  q-FedAvg  0.278±0.062  0.04±0.013  0.256±0.083  0.16±0.026  0.311±0.044  0.045±0.013  0.453±0.002  0.063±0.02  0.029±0.007  0.047±0.004
PSG  TERM  0.226±0.007  0.037±0.005  0.233±0.004  0.153±0.007  0.255±0.016  0.038±0.0  0.439±0.007  0.053±0.003  0.026±0.001  0.043±0.002
PSG  FedMinMax (ours)  0.263±0.013  0.177±0.026  0.228±0.011  0.21±0.019  0.238±0.025  0.182±0.03  0.31±0.008  0.16±0.027  0.184±0.031  0.154±0.018
-  Centralized Minmax Baseline  0.259±0.01  0.173±0.015  0.239±0.051  0.213±0.008  0.24±0.063  0.182±0.024  0.311±0.006  0.168±0.018  0.18±0.013  0.151±0.012
Table D.6: Group weighting coefficients for the FashionMNIST dataset.
Setting  Method  T-shirt  Trouser  Pullover  Dress  Coat  Sandal  Shirt  Sneaker  Bag  Ankle boot
ESG  AFL  0.099  0.100  0.101  0.101  0.100  0.100  0.099  0.100  0.100  0.100
ESG  FedMinMax (ours)  0.217  0.001  0.241  0.007  0.151  0.001  0.380  0.001  0.001  0.001
SSG  AFL  0.217  0.001  0.241  0.007  0.151  0.001  0.379  0.001  0.001  0.001
SSG  FedMinMax (ours)  0.216  0.001  0.237  0.017  0.155  0.001  0.370  0.001  0.001  0.001
PSG  AFL  0.128  0.064  0.138  0.099  0.129  0.063  0.173  0.069  0.066  0.071
PSG  FedMinMax (ours)  0.216  0.001  0.238  0.014  0.154  0.001  0.372  0.001  0.001  0.001
-  Centralized Minmax Baseline  0.217  0.001  0.240  0.010  0.152  0.001  0.377  0.001  0.001  0.001
Experiments on CIFAR10 dataset. In the Equal access to Sensitive Groups (ESG) setting, the 10 classes are equally distributed across the clients, creating a scenario where each client has access to the same amount of data examples and groups. In the Partial access to Sensitive Groups (PSG) setting, 20 clients own data from groups Airplane, Automobile, Bird, Cat and Deer and the rest hold data from Dog, Frog, Horse, Ship and Truck groups. Finally, in the Single access to Sensitive Groups (SSG) setting, every client owns only one sensitive group and each group is distributed to only 4 clients. We report the risks on the test set in Table D.7 and the final group weighting coefficients in Table D.8.
Table D.7: Testing group risks for the CIFAR10 dataset (mean ± standard deviation).
Setting  Method  Airplane  Automobile  Bird  Cat  Deer  Dog  Frog  Horse  Ship  Truck
ESG  AFL  0.14±0.001  0.104±0.009  0.289±0.011  0.461±0.01  0.243±0.01  0.28±0.016  0.151±0.009  0.14±0.009  0.125±0.012  0.132±0.009
ESG  FedAvg  0.148±0.014  0.108±0.006  0.283±0.011  0.487±0.002  0.237±0.002  0.256±0.002  0.144±0.005  0.148±0.008  0.123±0.003  0.128±0.004
ESG  q-FedAvg  0.178±0.065  0.118±0.047  0.308±0.099  0.507±0.003  0.311±0.054  0.41±0.01  0.179±0.012  0.119±0.013  0.158±0.07  0.182±0.05
ESG  TERM  0.217±0.087  0.115±0.006  0.311±0.057  0.491±0.007  0.274±0.055  0.272±0.026  0.176±0.041  0.166