1 Introduction
Federated learning (FL) [mcmahan2017communication] is a machine learning paradigm where multiple entities (or users) collaborate in training a machine learning model under the coordination of a central server or a service provider. In this setting, only some statistics relevant to the model’s training are shared with the central server, and the user’s raw data is always stored on their device and never leaves it.
In many applications of interest, the data is distributed across many users, is large compared to the model’s size, and is privacy-sensitive. Therefore, this decentralized setting is attractive since (i) it is amenable to parallelizing the computation across multiple devices (which can easily be accommodated on such devices with modern, fast processors) and (ii) it does not depend on a central dataset, which could be susceptible to privacy leaks via re-identification attacks [sweeney2000simple].
The statistics transmitted to the central server contain less information than the raw data by the data processing inequality, and therefore reduce the risk of privacy leaks. However, sufficiently determined adversaries can still extract sensitive information from these updates or from the model itself. For example, even in the extreme case where an adversary only has access to queries of the model (or “black-box” access), they can discover the identity of users present in the training data for a generic task [shokri2017membership] or even reconstruct the faces used to train a face recognition system [fredrikson2015model]. Therefore, in order to provide guarantees of privacy to users, it is customary to ensure that the training of the model is differentially private, see e.g. [bhowmick2018protection; mcmahan2018general; mcmahan2018learning; truex2019hybrid; granqvist2020improving]. Differential privacy (DP) [dwork2006calibrating; dwork2014algorithmic] is a strong privacy standard for algorithms that operate on data. More precisely, DP guarantees that the probability of obtaining a model using all the users’ data is close to the probability of obtaining that same model if any one of the users does not participate in the training. Hence, DP limits the amount of information that any attacker (regardless of their compute power or access to auxiliary information) may obtain about any user’s identity after observing the model.
Deep learning models have seen great success in a wide range of applications such as image classification, speech recognition, or natural language processing [he2015delving; amodei2016deep; xue2021mt5]. For this reason, they are usually the model of choice for FL [mcmahan2017communication]. In fact, deep learning models have been successfully trained with differentially private FL (or PFL) for tasks like next word prediction or speaker verification [mcmahan2017communication; granqvist2020improving]. Nonetheless, these models are prone to perpetuating societal biases existing in the data [caliskan2017semantics] or to discriminating against certain groups even when the data is balanced [buolamwini2018gender]. Moreover, when the training is differentially private [abadi2016deep], the degradation in the performance of these models disproportionately impacts underrepresented groups [bagdasaryan2019differential]. More specifically, the accuracy of the model on minority groups deteriorates to a larger extent than the accuracy on the majority groups.
In the realm of federated learning, there has been some research studying how to achieve individual fairness, i.e. that the performance of the model is similar among devices [hu2020fedmgda+; huang2020fairness; li2019fair; li2021ditto]. However, this notion of fairness falls short of protecting users from underrepresented groups against disproportionate treatment. For example, [castelnovo2021zoo, Section 3.6] shows that a model that performs well on the majority of the users will have a good score on individual fairness metrics, even if all the users suffering from bad performance belong to the same group. Conversely, there is little work proposing solutions to enforce group fairness, i.e. that the performance is similar among users from different groups. The current work in this area [du2021fairness] is devoted to a specific measure of fairness (demographic parity) when using logistic regression models. Moreover, all such prior work focuses on non-private federated learning, and therefore does not consider the adverse effects that differential privacy has on the models trained with PFL.
On the other hand, work studying the tradeoffs between privacy and group fairness proposes solutions that are either limited to simple models such as linear logistic regression [ding2020differentially], require an oracle [cummings2019compatibility], scale poorly for large hypothesis classes [jagielski2019differentially], or only offer privacy protection for the variable determining the group [tran2021differentially; rodriguez2020variational]. Furthermore, in the aforementioned work, the central learning paradigm is the only one considered, and the techniques are not directly adaptable to federated learning.
For all the above reasons, in this paper, we propose an algorithm to train deep learning models with private federated learning while enforcing group fairness. We pose the problem as an optimization problem with fairness constraints and extend the modified method of differential multipliers (MMDM) [platt1987constrained] to solve such a problem with FL and DP. Hence, the resulting algorithm (i) is applicable to any model that can be learned using stochastic gradient descent (SGD) or any of its variants, (ii) can be tailored to enforce the majority of the group fairness metrics, (iii) can consider any number of attributes determining the groups, and (iv) can consider both classification and regression tasks.
The paper is structured as follows: in Section 2 we review the background on differential privacy, private federated learning, the MMDM algorithm, and group fairness; in Section 3 we present our approach for fair private federated learning; in Section 4 we describe our experimental results; and in Section 5 we conclude with a summary and interpretation of our findings.
2 Background
In this section we review the aspects of the theory of differential privacy, private federated learning, the MMDM algorithm, and group fairness that are necessary to develop and understand the proposed algorithm.
2.1 Differential Privacy
In this subsection, we start by describing the definition of privacy that we will adopt, differential privacy (DP) [dwork2006calibrating; dwork2014algorithmic]. Then, we describe some of the details and properties of this standard that are useful to understand the proposed algorithm.
Differential privacy formalizes the maximum amount of information that an “almighty” adversary can obtain from the release of private data. This private data can be anything, from an obfuscated version of the users’ data samples themselves to a noisy function of that data. Formally, the definition of DP establishes a bound on how different the distribution of a randomized function of the data can be if a user’s contribution is included or not in such data.
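Definition 1 ($(\epsilon, \delta)$-Differential Privacy).
A randomized function $M$ satisfies $(\epsilon, \delta)$-DP if for any two adjacent datasets $D$ and $D'$ and for any subset of outputs $S$ it holds that
(1) $\Pr[M(D) \in S] \leq e^{\epsilon} \Pr[M(D') \in S] + \delta.$
Two datasets $D$ and $D'$ are said to be adjacent if dataset $D'$ can be formed by adding or removing all data associated with a user from $D$.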
An alternative definition of differential privacy is given in [mironov2017renyi], where the dissimilarity of the distributions is measured using the Rényi divergence [renyi1961measures]. This definition is attractive since it enjoys similar properties to the original definition, and gives a more efficient privacy analysis of certain iterative algorithms such as differentially private SGD (DP-SGD) [abadi2016deep; wang2019subsampled], even for the same DP budget.
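Definition 2 ($(\alpha, \epsilon)$-RDP).
A randomized function $M$ satisfies $\epsilon$-Rényi differential privacy of order $\alpha$, or $(\alpha, \epsilon)$-RDP, if for any two adjacent datasets $D$ and $D'$ it holds that
(2) $D_{\alpha}\big(M(D) \,\|\, M(D')\big) \leq \epsilon,$
where $D_{\alpha}$ is the Rényi divergence of order $\alpha$.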
A common way to privatize some statistics of the data is to obfuscate them with Gaussian noise. Namely, instead of directly releasing one instead releases , where
(3) 
and is the sensitivity of the statistic. This is known as the Gaussian mechanism [dwork2014algorithmic] and it is RDP for any [mironov2017renyi].
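As a minimal sketch of the Gaussian mechanism described above, the following Python snippet clips a statistic to a chosen L2 bound and then adds isotropic Gaussian noise. The names `clip_l2`, `gaussian_mechanism`, `clip_bound`, and `noise_multiplier` are illustrative assumptions rather than notation from this paper, and the noise standard deviation is assumed to be `noise_multiplier * clip_bound`.

```python
import math
import random


def clip_l2(vector, clip_bound):
    """Scale `vector` so that its L2 norm is at most `clip_bound`."""
    norm = math.sqrt(sum(v * v for v in vector))
    if norm <= clip_bound:
        return list(vector)
    scale = clip_bound / norm
    return [v * scale for v in vector]


def gaussian_mechanism(vector, clip_bound, noise_multiplier, rng):
    """Clip the statistic, then add Gaussian noise to every coordinate.

    Assumption: the noise standard deviation is noise_multiplier * clip_bound.
    """
    clipped = clip_l2(vector, clip_bound)
    std = noise_multiplier * clip_bound
    return [v + rng.gauss(0.0, std) for v in clipped]


rng = random.Random(0)  # fixed seed only to make the sketch reproducible
private_stat = gaussian_mechanism(
    [3.0, 4.0], clip_bound=1.0, noise_multiplier=0.5, rng=rng
)
```

Clipping bounds the L2 norm of each contribution, so under add-or-remove adjacency the sum of clipped contributions changes by at most `clip_bound` when one contribution is added or removed.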
Differential privacy, in either of the two mentioned forms, enjoys several desirable properties for a privacy definition. Among those, we highlight three (in terms of RDP) that are important for the privacy guarantees of the presented algorithm:

Postprocessing: Consider two functions and . If is RDP, then is also RDP. That is, no amount of postprocessing can reduce the privacy guarantees provided by the function [mironov2017renyi].

Adaptive Composition: Consider two functions and . Consider also a function defined as , where . If is RDP, and for all , is RDP, then is RDP. That is, the privacy guarantees of a function degrade gracefully when its realization is used by another function [mironov2017renyi].

Subsampling privacy amplification: Consider an RDP function where . Now consider the function defined as applying to a random subsample (without replacement) of fixed length of . Then is with . That is, the privacy guarantees of a function are amplified by applying that function to a random subsample of the dataset [wang2019subsampled].
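To make the composition property concrete, the sketch below tracks the RDP cost of repeatedly applying a Gaussian mechanism and converts the result to an approximate DP guarantee. It relies on two standard results from [mironov2017renyi]: a Gaussian mechanism whose noise standard deviation is `noise_multiplier` times the sensitivity satisfies RDP of order `alpha` with parameter `alpha / (2 * noise_multiplier**2)`, and an RDP guarantee of order `alpha` with parameter `rdp_epsilon` implies approximate DP with parameter `rdp_epsilon + log(1/delta) / (alpha - 1)`. The function names are hypothetical, and subsampling amplification is deliberately not modeled, so the returned values are looser than those of the accountant in [wang2019subsampled].

```python
import math


def gaussian_rdp(alpha, noise_multiplier):
    """RDP parameter of order `alpha` for one Gaussian mechanism application."""
    return alpha / (2.0 * noise_multiplier ** 2)


def compose_rdp(epsilons):
    """Adaptive composition at a fixed order: the RDP parameters add up."""
    return sum(epsilons)


def rdp_to_dp(alpha, rdp_epsilon, delta):
    """Convert an RDP guarantee of order `alpha` into an approximate DP epsilon."""
    return rdp_epsilon + math.log(1.0 / delta) / (alpha - 1.0)


def best_dp_epsilon(noise_multiplier, steps, delta, orders=range(2, 65)):
    """Smallest approximate DP epsilon over a grid of Renyi orders.

    Assumption: every step uses the full dataset (no subsampling amplification).
    """
    return min(
        rdp_to_dp(a, compose_rdp([gaussian_rdp(a, noise_multiplier)] * steps), delta)
        for a in orders
    )
```

For instance, `best_dp_epsilon(noise_multiplier=4.0, steps=10, delta=1e-5)` is smaller than the same call with `noise_multiplier=2.0`, reflecting that more noise buys a stronger guarantee.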
2.2 Private Federated Learning
The federated learning setting focuses on learning a model that minimizes the expected value of a loss function using a dataset of samples distributed across users. This paper will consider models parametrized by a fixed number of parameters and differentiable loss functions, which includes neural networks.
Since the distribution of the data samples is unknown and each user is the only owner of their private dataset , the goal of FL is to find the parameters that minimize the loss function across the samples of the users. That is, to solve
(4) 
where .
This way, the parameters can be learned with an approximation of SGD. [mcmahan2017communication] suggests that the central server iteratively samples users; they compute and send back to the server an approximation of , where is the learning rate; and finally the server updates the parameters as dictated by gradient descent . If the users send back the exact gradient the algorithm is known as FederatedSGD. A generalisation of this, FederatedAveraging, involves users running several local epochs of minibatch SGD and sending up the difference. However, the method in this paper relies on an additional term in the loss function, so it will extend FederatedSGD.
The above algorithm can be modified so it becomes differentially private by following the structure of the differentially private SGD algorithm [abadi2016deep; wang2019subsampled]. This modification consists of clipping (i.e. restricting the sensitivity to a predefined clipping bound ) and applying the Gaussian mechanism to each user’s gradient approximation before they are used to update the parameters [mcmahan2018general; mcmahan2018learning]. Then, the adaptive composition and subsampling privacy amplification properties of DP ensure that the whole algorithm is differentially private.
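The round structure described above can be sketched as follows for a toy scalar model with squared loss. This is an illustrative simplification under stated assumptions, not the implementation used in this paper: the model is a single weight `w`, each user holds a list of `(x, y)` pairs, the noise is added once to the sum of clipped gradients (as a trusted aggregator would do under central DP), and all function and parameter names are hypothetical.

```python
import random


def user_gradient(w, data):
    """Average gradient of 0.5 * (w * x - y)^2 over one user's samples."""
    return sum((w * x - y) * x for x, y in data) / len(data)


def private_fedsgd_round(w, users, cohort_size, lr, clip_bound, noise_multiplier, rng):
    """One round: sample a cohort, clip each gradient, sum, add noise, step."""
    cohort = rng.sample(users, cohort_size)
    total = 0.0
    for data in cohort:
        g = user_gradient(w, data)
        # For a scalar, L2 clipping reduces to clamping to [-clip_bound, clip_bound].
        total += max(-clip_bound, min(clip_bound, g))
    # Gaussian mechanism applied to the sum of the clipped contributions.
    total += rng.gauss(0.0, noise_multiplier * clip_bound)
    return w - lr * total / cohort_size


rng = random.Random(0)
users = [[(x, 2.0 * x)] for x in (0.5, 1.0, 1.5, 2.0)]  # toy data, true slope 2
w = 0.0
for _ in range(100):
    w = private_fedsgd_round(
        w, users, cohort_size=4, lr=0.5, clip_bound=1.0, noise_multiplier=0.1, rng=rng
    )
```

With `noise_multiplier=0.0` the loop recovers plain FederatedSGD with clipping and converges to the true slope on this toy data; with noise, the iterate fluctuates around it.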
In this architecture, where clients contribute statistics to a server which updates a model, differential privacy can be used in its local or central forms. In local DP [dwork2014algorithmic], the statistics are obfuscated before leaving the device. However, models trained with local DP often suffer from low utility [granqvist2020improving]. Instead, this paper will assume central DP. This can involve trust in the server that updates the model, or a trusted third party separated from the model, which is how differential privacy was originally formulated. In some cases, like FederatedSGD, all the server needs to know is a sum over client contributions. In this case, the trusted third party can be replaced by multiparty computation [goryczka2015comprehensive], such as secure aggregation [bonawitz2017practical].
2.3 The Modified Method of Differential Multipliers
Ultimately, our objective is to find a parametric model that minimizes a differentiable loss function while respecting some fairness constraints. Therefore, in this section we review an algorithm for constrained differential optimization, the modified method of differential multipliers (MMDM) [platt1987constrained]. This algorithm tries to find a solution to the following constrained optimization problem:
(P1) 
where is the function to minimize, is the concatenation of constraint functions, and is the solution subspace.
The algorithm consists of solving the following set of differential equations resulting from the Lagrangian of (P1) with an additional quadratic penalty using gradient descent/ascent. This results in the following iterative update algorithm
(5) 
where is a Lagrange (or dual) multiplier, is a damping parameter, is the learning rate of the model parameters and is the learning rate of the Lagrange multiplier.
Intuitively, these sets of updates gradually fulfill (P1). The parameter updates against enforce the function minimization and the parameter updates against enforce the constraints’ satisfaction. Then, the multiplier and the multiplicative factor control how strongly the constraints’ violations are penalized in the parameters’ update.
A desirable property of MMDM is that for small enough learning rates and large enough damping parameter , there is a region comprising the surroundings of each constrained minimum such that if the parameters’ initialization is in that region and the parameters remain bounded, then the algorithm converges to a constrained minimum [platt1987constrained]. Intuitively, this condition comes from the fact that the term enforces a quadratic shape on the optimization search space on the neighbourhood of the solution subspace, and this local behavior is stronger the larger the damping parameter . Therefore, if the parameters’ initialization is in this locally quadratic region, then the algorithm is guaranteed to converge to the minimum.
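As an illustration of the MMDM updates, the sketch below applies them to a toy problem that is not taken from this paper: minimize x1^2 + x2^2 subject to x1 + x2 - 1 = 0, whose constrained minimum is (0.5, 0.5) with Lagrange multiplier -1. The update rule follows the form given in [platt1987constrained]: gradient descent on the objective plus the multiplier and damping terms, and gradient ascent on the multiplier. The function name, learning rates, and damping value are assumptions chosen for the example.

```python
def mmdm_toy(steps=5000, lr_x=0.05, lr_lambda=0.05, damping=2.0):
    """Run MMDM on: minimize x1^2 + x2^2 subject to x1 + x2 - 1 = 0."""
    x1, x2, lam = 0.0, 0.0, 0.0
    for _ in range(steps):
        g = x1 + x2 - 1.0                 # constraint value at the current point
        grad_f = (2.0 * x1, 2.0 * x2)     # gradient of the objective
        grad_g = (1.0, 1.0)               # gradient of the constraint
        coeff = lam + damping * g         # multiplier term plus quadratic penalty term
        # Descent step on the parameters.
        x1 -= lr_x * (grad_f[0] + coeff * grad_g[0])
        x2 -= lr_x * (grad_f[1] + coeff * grad_g[1])
        # Ascent step on the multiplier, using the constraint value before the step.
        lam += lr_lambda * g
    return x1, x2, lam


x1, x2, lam = mmdm_toy()
```

Setting `damping=0.0` in this sketch removes the quadratic penalty, which corresponds to the basic method of differential multipliers.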
2.4 Group Fairness
In this subsection, we mathematically formalize what group fairness means. To simplify the exposition, we describe this notion in the central setting, i.e. where the data is not distributed across users.
Let us consider a dataset of instances, where each instance belongs to a group . Group fairness considers how differently a model treats the instances belonging to each group. Many fairness metrics can be written in terms of the similarity of the expected value of a function of interest of the model evaluated on the general population with that on the population of each group [agarwal2018reductions; fioretto2020lagrangian]. That is, if we consider a supervised learning problem, where and where the output of the model is an approximation of , we say a model is fair if
(6) 
for all .
Most of the group fairness literature focuses on the case of binary classifiers, i.e. , and on the binary group case, i.e. . However, many of the fairness metrics can be extended to general output spaces and categorical groups . It is common that the function is the indicator function of some logical relationship between the random variables, thus turning (6) into an equality between probabilities. As an example, we describe two common fairness metrics that will be used later in the paper. For a comprehensive survey of different fairness metrics and their interrelationships, please refer to [verma2018fairness; castelnovo2021zoo].
False negative rate (FNR) parity (or equal opportunity) [hardt2016equality]: This fairness metric is designed for binary classification and binary groups. It was originally defined as equal true positive rate between the groups, which is equivalent to an equal FNR between each group and the overall population. That is, if we let then (6) reduces to
(7) for all .
This is usually a good metric when the target variable is something positive such as being granted a loan or being hired for a job, since we want to minimize the group disparity of misclassification among the individuals that deserved such a loan or such a job [hardt2016equality; castelnovo2021zoo].

Accuracy parity (or overall misclassification rate) [zafar2017fairness]: This fairness metric is also designed for binary classification and binary groups. Nonetheless, it applies well to general tasks and categorical groups. Similarly to the previous metric, if we let then (6) reduces to
(8) for all .
This is usually a good metric when there is not a clear positive or negative semantic meaning to the target variable, and also when this variable is not binary.
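The two metrics above can be evaluated empirically as the largest absolute difference between a performance measure on the whole population and the same measure on each group. The following sketch does this for hard predictions; the function names are hypothetical, and it assumes that every group contains at least one sample (and, for the FNR, at least one positive sample).

```python
def false_negative_rate(y_true, y_pred):
    """Fraction of positive samples (label 1) that are predicted as 0."""
    positives = [p for t, p in zip(y_true, y_pred) if t == 1]
    return sum(1 for p in positives if p == 0) / len(positives)


def accuracy(y_true, y_pred):
    """Fraction of samples whose prediction equals the label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


def max_gap(metric, y_true, y_pred, groups):
    """Largest absolute difference between the overall metric and each group's metric."""
    overall = metric(y_true, y_pred)
    gaps = []
    for a in sorted(set(groups)):
        idx = [i for i, g in enumerate(groups) if g == a]
        group_value = metric([y_true[i] for i in idx], [y_pred[i] for i in idx])
        gaps.append(abs(group_value - overall))
    return max(gaps)


y_true = [1, 1, 1, 1, 0, 0, 1, 1]
y_pred = [1, 1, 1, 0, 0, 1, 0, 0]
groups = [0, 0, 0, 0, 0, 0, 1, 1]
fnr_gap = max_gap(false_negative_rate, y_true, y_pred, groups)
acc_gap = max_gap(accuracy, y_true, y_pred, groups)
```

In this toy example the minority group (group 1) has all of its positive samples misclassified, so both gaps are driven by that group.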
3 An Algorithm for Fair and Private Federated Learning (FPFL)
In this section we describe the proposed algorithm to enforce fairness in PFL of parametric models learned with SGD. First, we describe an adaptation of the MMDM algorithm to enforce fairness in standard central learning. Then, we extend that algorithm to PFL.
3.1 Adapting the MMDM Algorithm to Enforce Fairness
Consider a dataset of instances, where each instance belongs to a group . Consider also a supervised learning setting where , the output of the model is an approximation of , and the model is parametrized by the parameters . Finally, we consider we have no information about the distribution of the variables aside from the available samples.
We concern ourselves with the task of finding the model’s parameters that minimize a loss function across the data samples while enforcing a measure of fairness on the model. That is, to solve
(P2) 
where , is a subset of that varies among different fairness metrics, is the number of samples in , is the set of samples in such that , is the number of samples in , is the function employed for the fairness metric definition, and is a tolerance threshold. The subset is chosen based on the function used for the fairness metric: if the fairness function does not involve any conditional expectation (e.g. accuracy parity), then ; if, on the other hand, it involves a conditional expectation, then is the subset of where that condition holds, e.g. when the fairness metric is FNR parity .
Note how the expected values of the fairness constraints in (6) are substituted with empirical averages in (P2). Also, note that the constraints are not strict, meaning that there is a tolerance for how much the function can vary between certain groups and the overall population. The reason for this choice is twofold:

It facilitates the training since the solution subspace is larger.

It is known that some fairness metrics, such as FNR parity, are incompatible with DP and nontrivial accuracy. However, if the fairness metric is relaxed, fairness, accuracy, and privacy can coexist [jagielski2019differentially; cummings2019compatibility].
This way, we may rewrite (P2) in the form of (P1) to solve the problem with MMDM. To do so we let and , where
(9) 
and . Therefore, the parameters are updated according to
(10) 
where we note that and
(11) 
Now, the fairness-enforcing problem (P2) can be solved with gradient descent/ascent or minibatch stochastic gradient descent/ascent, where instead of the full dataset , , and , one considers batches , , and (or subsets) of that dataset. Moreover, it can be learned with DP by adapting the DP-SGD algorithm from [abadi2016deep], where a clipping bound and Gaussian noise are included in both the network parameters’ and multipliers’ individual updates. Nonetheless, there are a series of caveats to doing so:

The batch size should be large enough to, on average, have enough samples of each group so that the difference
(12) is well estimated.

In many situations of interest, such as when we want to enforce FNR parity or accuracy parity, the function employed for the fairness metric is not differentiable and thus does not exist. To solve this issue, we resort to estimating the gradient using a differentiable estimation of the function aggregate . For instance:

Enforcing FNR parity on a neural network (with a Sigmoid output activation function) in a binary classification task. We note that given an input , the raw output of the network is an estimate of the probability that . Hence, the function aggregate can be estimated as
(13) 
Enforcing accuracy parity on a neural network (with softmax output activation function) in a multiclass classification task. We note that given an input , the raw output of the network is an estimate of the probability that . Hence, the function aggregate can be estimated as
(14) 
where is the one-hot encoding vector of .
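The idea behind these differentiable estimates can be sketched as follows. The snippet replaces hard decisions with the probabilities produced by the network, so that the aggregate becomes a smooth function of the logits. It is an assumption-laden illustration rather than a reproduction of (13) and (14): the false negative count is approximated by summing one minus the predicted positive probability over the positive samples, and the accuracy aggregate is assumed to count correct predictions, approximated by the probability assigned to the true class. All function names are hypothetical.

```python
import math


def sigmoid(z):
    """Logistic function mapping a logit to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soft_false_negatives(logits, labels):
    """Smooth estimate of the number of false negatives (binary task).

    Each positive sample contributes 1 - p, where p is the predicted
    probability of the positive class.
    """
    return sum(1.0 - sigmoid(z) for z, y in zip(logits, labels) if y == 1)


def soft_correct(logit_rows, labels):
    """Smooth estimate of the number of correct predictions (multiclass task).

    Each sample contributes the probability assigned to its true class, i.e.
    the inner product of the softmax output with the one-hot label.
    """
    return sum(softmax(row)[y] for row, y in zip(logit_rows, labels))
```

When the logits are large in magnitude the probabilities saturate, and both estimates approach the corresponding hard counts.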

Finally, we conclude this subsection noting the similarities and differences of this work and [tran2021differentially]. Even though their algorithm is derived from first principles on Lagrangian duality, their resulting algorithm is equivalent to an application of the basic method of differential multipliers (BMDM) to solve a problem equivalent to (P2) when . Nonetheless, the two algorithms differ in three main aspects:

The difference between BMDM and MMDM: BMDM is equivalent to MMDM when , that is, when the effect of making the neighbourhood of the solution subspace quadratic is not present. Moreover, the guarantee of achieving a local minimum that respects the constraints does not hold for unless the problem is simple (e.g. quadratic programming).

How they deal with impossibilities in the goal of achieving perfect fairness together with privacy and accuracy. In [tran2021differentially], the authors include a limit to the Lagrange multiplier to avoid floating point errors and reaching trivial solutions, which in our case is taken care of by the tolerance , which, in contrast to , is an interpretable parameter.

In this paper we consider the exact expression for the gradient , see (11), while in [tran2021differentially] the authors employ the following approximation
(15) where the sign of the difference is ignored.
In the next subsection, we extend the algorithm to PFL, which introduces two new differences with [tran2021differentially]. Firstly, the privacy guarantees will be provided for the individuals, and not only to the group to which they belong; and secondly, the algorithm will be tailored to federated learning.
3.2 Extending the Algorithm to Private Federated Learning
In the federated learning setting we now consider that the dataset is distributed across users such that each user maintains a local dataset with samples. Nonetheless, as in the central setting, the task is to find the model’s parameters that minimize the loss function across the data samples of the users while enforcing a measure of fairness to the model. That is, to solve (P2).
To achieve this goal, we might first combine the ideas from FederatedSGD [mcmahan2017communication] and the previous section to extend the developed adaptation of MMDM to FL. In order to perform the model updates dictated by (10), the central server requires the following statistics:
(16) 
However, some of these statistics can be obtained from the others, namely , , and . Moreover, as mentioned in the previous section, one might use a sufficiently large batch of the data instead of the full dataset for each update. Therefore, we consider an iterative algorithm where, at each iteration, the central server samples users that report a vector with the sufficient statistics for the update, that is
(17) 
This way, if we define the batch as the sum of the users’ local datasets and the batch analogously, then the aggregation of each user’s vectors results in
(18) 
which contains all the sufficient statistics for the parameters’ update.
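The reason why summing the users' vectors is sufficient can be seen in a simplified sketch. Here each user reports only per-group sums of the fairness function and per-group sample counts; the gradient entries of the actual reported vector are omitted, and the function names and vector layout are assumptions made for illustration. Because every entry is a sum over samples, adding the vectors of the cohort yields the same statistics as computing them on the pooled batch.

```python
def user_statistics(f_values, groups, num_groups):
    """Per-group sums of f followed by per-group counts for one user."""
    sums = [0.0] * num_groups
    counts = [0] * num_groups
    for f, a in zip(f_values, groups):
        sums[a] += f
        counts[a] += 1
    return sums + counts  # flat vector: [sum_0, ..., sum_{G-1}, n_0, ..., n_{G-1}]


def aggregate(vectors):
    """Coordinate-wise sum of the users' vectors (what secure aggregation reveals)."""
    return [sum(column) for column in zip(*vectors)]


def fairness_gaps(aggregated, num_groups):
    """Difference between each group's mean of f and the overall mean of f."""
    sums = aggregated[:num_groups]
    counts = aggregated[num_groups:]
    overall = sum(sums) / sum(counts)
    return [sums[a] / counts[a] - overall for a in range(num_groups)]


user_a = user_statistics([1.0, 0.0], [0, 0], num_groups=2)  # only group 0 data
user_b = user_statistics([1.0, 1.0], [1, 1], num_groups=2)  # only group 1 data
batch = aggregate([user_a, user_b])
gaps = fairness_gaps(batch, num_groups=2)
```

Note that neither user in this example could compute a meaningful gap locally, since each holds data from a single group; the gaps only become available after aggregation.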
Finally, the resulting algorithm, termed Fair PFL or FPFL and described in Algorithm 1, is inspired by the ideas from [mcmahan2018learning; truex2019hybrid; granqvist2020improving; bonawitz2017practical] and guarantees the users’ privacy as follows:

It makes sure that the aggregation of the users’ sent vectors is done securely by a trusted third party, e.g. with secure aggregation [bonawitz2017practical].

It clips the vectors with a clipping bound to restrict their sensitivity, i.e. it replaces by . Then, it ensures DP by adding Gaussian noise with variance to the clipped vector. The parameter is calculated according to the refined moments accountant privacy analysis from [wang2019subsampled], taking into account the number of iterations (or communication rounds) the algorithm will be run, the number of users (or cohort size) that are used per iteration, the total number of users (or population size) , and the privacy parameters and . Note that even if it later performs several post-processing computations to extract the relevant information from the vector and update the Lagrangian and the parameters , the privacy guarantees do not change thanks to the post-processing property of DP.
3.2.1 Local updates and batch size
The proposed algorithm extends the MMDM algorithm from Section 3.1 to FL by adapting FederatedSGD, where each user uses all their data, computes the necessary statistics for a model update, and sends them to the third party.
A natural question is why FederatedAveraging is not adapted instead. That is, why not perform several stochastic gradient descent/ascent updates of the model’s parameters and the Lagrange multipliers locally, and send the difference between the updated and the original versions, i.e. send the vector . This way, the size of the communicated vector would be reduced from to just and a larger part of the computation would be done locally, increasing the convergence speed. Moreover, the clipping bound could be reduced, thus decreasing the noise necessary for the DP guarantees.
Unfortunately, for the proposed MMDM algorithm, this option could lead to catastrophic effects. Imagine for instance a situation where each user only has data points belonging to one group, say or . Then in their local dataset the general population is equivalent to the population of their group and thus , implying that locally . Therefore, the Lagrange multipliers will never be locally updated and the weight updates will be equivalent to those without the fairness constraints. That is, using this approach one would (i) recover the same algorithm as standard PFL and (ii) communicate a vector of size instead of size .
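This degenerate case is easy to check numerically. In the hypothetical sketch below, `local_gap` computes the difference between the mean of the fairness function on one group and its mean on all of a user's local samples; when the user only holds samples from that group, the two means coincide and the difference is exactly zero.

```python
def local_gap(f_values, groups, group):
    """Mean of f on `group` minus the mean of f on all local samples.

    Assumption: the user holds at least one sample from `group`.
    """
    in_group = [f for f, a in zip(f_values, groups) if a == group]
    return sum(in_group) / len(in_group) - sum(f_values) / len(f_values)


# A user whose samples all belong to group 1: the local gap vanishes.
single_group_gap = local_gap([1.0, 0.0, 1.0], [1, 1, 1], group=1)

# A user with samples from both groups: the local gap is informative.
mixed_gap = local_gap([1.0, 0.0], [0, 1], group=0)
```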
Another question is whether one could use the algorithm as described in Section 3.2 but with only a fraction of the users’ data in each update, i.e. using a batch of their local dataset . This is possible. Nonetheless, (i) it is convenient to delegate as much computation to the user as possible and (ii) it is desirable to use as much of the users’ data as possible to have a good approximation of the performance metrics , which are needed to enforce fairness.
4 Experimental Results
We study the performance of the algorithm in two different classification tasks. The first task is a binary classification based on some demographic data from the publicly available Adult dataset [Dua:2019]. The fairness metric considered for this task is FNR parity. The second task is a multiclass classification where there are three different attributes. This task uses a modification of the publicly available FEMNIST dataset [caldas2018leaf] and the fairness metric considered is accuracy parity.
For the first task we first compare the performance of the MMDM algorithm to vanilla SGD centrally. After that, for both tasks, we confirm how FederatedSGD deteriorates the performance of the model for the underrepresented classes when clipping and noise (DP) are introduced. Finally, we demonstrate how FPFL can, under the appropriate circumstances, level the performance of the model across the groups without largely decreasing the overall performance of the model. In all our experiments, the fairness metrics are defined as the maximum difference between the value of a measure of performance on the general testing data and the value of that performance measure for each of the groups described by the sensitive attribute.
4.1 Results on the Adult dataset
Adult dataset, from the UCI Machine Learning Repository [Dua:2019].
This dataset consists of 32,561 training and 16,281 testing samples of demographic data from the US census. Each datapoint contains various demographic attributes. Though the particular task, predicting individuals’ salary ranges, is not itself of interest, this dataset serves as a proxy for tasks with inequalities in the data. A reason this dataset is often used in the literature on ML fairness is that the fraction of individuals in the higher salary range is 30% for the men and only 10% for the women. The experiments in this paper will aim to stop this imbalance from entering into the model by balancing the false negative rate [castelnovo2021zoo].
Federated Adult dataset.
To generate a version of the Adult dataset suitable for federated learning, it must be partitioned into individual contributions. In the experiments, differential privacy will be guaranteed per contribution. In this paper, the number of datapoints per contribution is Poisson-distributed with a mean of 2.
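One way such a partition could be generated is sketched below. It is not the procedure used in this paper, whose details are not given here: sizes are drawn with Knuth's multiplication method for Poisson sampling, draws of size zero are skipped (so the realized sizes follow a zero-truncated Poisson distribution), and the last contribution is truncated when the data runs out. The function names are hypothetical.

```python
import math
import random


def sample_poisson(mean, rng):
    """Draw a Poisson variate with Knuth's method (adequate for small means)."""
    threshold = math.exp(-mean)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def partition_into_users(num_samples, mean_size, rng):
    """Split shuffled sample indices into contributions of random size."""
    indices = list(range(num_samples))
    rng.shuffle(indices)
    users, start = [], 0
    while start < num_samples:
        size = sample_poisson(mean_size, rng)
        if size == 0:
            continue  # assumption: contributions without data are skipped
        users.append(indices[start:start + size])  # last one may be shorter
        start += size
    return users


rng = random.Random(0)
users = partition_into_users(1000, mean_size=2.0, rng=rng)
```

Every sample index is assigned to exactly one contribution, which is what allows differential privacy to be accounted for per contribution.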
Privacy and fairness parameters.
For all our experiments, we considered the privacy parameters and and the fairness tolerance .
Data preprocessing.
The 7 categorical variables were one-hot encoded and the 6 numerical variables were normalized with the training mean and variance. There is an underlying assumption that these means and variances can be learned at a low privacy cost. Hence, to be precise, the private models are DP, where is a small constant representing the privacy budget for learning said parameters for the normalization.
Models considered.
We experimented with two different fully connected networks. The first network, from now on the shallow network, has one hidden layer with 10 hidden units and a ReLU activation function. The second network, henceforth the deep network, has three hidden layers with 16, 8, and 8 hidden units respectively and all with ReLU activation functions. Both networks ended with a fully connected layer to a final output unit with a Sigmoid activation function.
Hyperparameters.
For all the experiments, the learning rate was for the network parameters and for the Lagrange multipliers. The damping parameter was . The batch size for the experiments learned centrally was and the cohort sizes studied for the federated experiments were and . Finally, the clipping bounds for the shallow and the deep networks were, respectively and depending on whether the training algorithm was PFL or FPFL, or and or . These hyperparameters were not selected with a private hyperparameter search and were just set as an exemplary configuration. If one desires to find the best hyperparameters, one can do so at an additional privacy budget cost following e.g. [abadi2016deep, Appendix D].
We start our experiments by comparing the performance and fairness of the models trained with vanilla SGD and the MMDM adaptation. We trained the shallow and deep networks with these algorithms, where we tried to enforce FNR parity with a tolerance . The results after 1,000 iterations are displayed in Table 1, where we also consider the gap in other common measures of fairness such as the equalized odds, the demographic parity, or the predictive parity, see e.g. [castelnovo2021zoo]. The MMDM algorithm reduces the FNR gap from 7% to the targeted 2% for both the shallow and deep networks, while reducing the accuracy of the model by less than 0.5%, thus succeeding in its objective. Similarly, the gap in the equalized odds, which is a stronger fairness notion than the FNR parity, also decreases from around 7% to 3%. Moreover, the demographic parity gap, which considers the probability of predicting one or the other target class, also improves. In terms of predictive parity, which uses the precision as the performance function, the MMDM algorithm did not improve the parity among groups: the gap widened from 0.6% to 7.4% for the deep network and from 1.6% to 4.4% for the shallow network.
Model    Algorithm  Accuracy  FNR gap  EO gap  DemP gap  PP gap
Deep     SGD        0.858     0.070    0.070   0.117     0.006
Shallow  SGD        0.857     0.071    0.071   0.113     0.016
Deep     MMDM       0.854     0.020    0.026   0.087     0.074
Shallow  MMDM       0.855     0.019    0.029   0.092     0.044
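The gaps reported in Table 1 can be computed from binary predictions along the following lines. This is a sketch under assumptions, not the evaluation code behind the table: it takes the gap of a metric to be the largest absolute difference between a group's value and the value on the whole dataset, and the equalized odds gap to be the larger of the FNR and FPR gaps. All function names are hypothetical.

```python
def _rate(numerator, denominator):
    # Convention for this sketch: an undefined rate is reported as 0.
    return numerator / denominator if denominator else 0.0


def metrics(y_true, y_pred):
    """FNR, FPR, positive prediction rate, and precision for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "fnr": _rate(fn, fn + tp),
        "fpr": _rate(fp, fp + tn),
        "positive_rate": _rate(tp + fp, len(pairs)),
        "precision": _rate(tp, tp + fp),
    }


def fairness_gaps(y_true, y_pred, groups):
    """Largest |group metric - overall metric| for each fairness notion."""
    overall = metrics(y_true, y_pred)
    gaps = {name: 0.0 for name in overall}
    for group in set(groups):
        idx = [i for i, a in enumerate(groups) if a == group]
        m = metrics([y_true[i] for i in idx], [y_pred[i] for i in idx])
        for name in gaps:
            gaps[name] = max(gaps[name], abs(m[name] - overall[name]))
    return {
        "fnr_gap": gaps["fnr"],
        "eo_gap": max(gaps["fnr"], gaps["fpr"]),  # assumed definition
        "demp_gap": gaps["positive_rate"],
        "pp_gap": gaps["precision"],
    }


# Tiny made-up example with two groups of four samples each.
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
print(fairness_gaps(y_true, y_pred, groups))
```

Under these definitions the equalized odds gap can never be smaller than the FNR gap, which agrees with the ordering of the two columns in the table.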
The second experiment studies how clipping and DP degrade the performance for the underrepresented groups, thus increasing the fairness gap on the different fairness metrics. For that, we trained the same shallow and deep networks with FederatedSGD and with versions of this algorithm where only clipping was performed and where both clipping and DP were included. They were trained with a cohort size of for iterations and the model with the best training cohort accuracy was selected. The results are displayed in Table 2. First, we note how the performance and fairness of the deep network do not change much when going from the central to the federated setting. The shallow network, on the other hand, becomes less fair under all the metrics considered. The introduction of clipping largely increases the unfairness of the models, reaching an FNR gap of 16% or more. The addition of DP does not increase the unfairness of the models much further. These observations are in line with [bagdasaryan2019differential], which notes that the underrepresented groups usually have the larger loss gradients and thus clipping affects them more than the majority groups.
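The effect of clipping on users with large updates can be seen in a minimal sketch of a clipped, noised average. The sketch assumes the common recipe of clipping each user's update to a fixed L2 norm and adding Gaussian noise to the sum; the function names, the toy updates, and the parameter values are hypothetical and do not reproduce the configuration of the experiments.

```python
import math
import random


def clip(update, bound):
    """Scale `update` so that its L2 norm is at most `bound`."""
    norm = math.sqrt(sum(v * v for v in update))
    scale = min(1.0, bound / norm) if norm > 0 else 1.0
    return [v * scale for v in update]


def private_mean(updates, bound, noise_multiplier, rng):
    """Average of clipped updates with Gaussian noise added to the sum."""
    total = [0.0] * len(updates[0])
    for update in updates:
        for i, v in enumerate(clip(update, bound)):
            total[i] += v
    sigma = noise_multiplier * bound  # noise scale tied to the clipping bound
    noisy = [t + rng.gauss(0.0, sigma) for t in total]
    return [v / len(updates) for v in noisy]


# Nine majority users with small updates, one minority user with a large one.
cohort = [[0.5, 0.0]] * 9 + [[0.0, 5.0]]
rng = random.Random(0)
print("no clipping :", [sum(col) / len(cohort) for col in zip(*cohort)])
print("clipping    :", private_mean(cohort, 1.0, 0.0, rng))
print("clipping+DP :", private_mean(cohort, 1.0, 1.0, rng))
```

In this toy cohort the update of the single minority user has norm 5 and is scaled down by a factor of 5, while the majority updates pass through unchanged, which is the kind of mechanism described in [bagdasaryan2019differential]. Because the noise is added to the sum before dividing by the cohort size, its standard deviation on the averaged update is `noise_multiplier * bound / len(updates)` in this sketch.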
After that, we repeated the above experiment with FPFL. However, we noted that FPFL converged to a solution faster than FederatedSGD, and thus the models were trained for only iterations. Here, the model with the best training cohort accuracy that respected the fairness condition on the training cohort data was selected. Note that the fairness condition is evaluated with the noisy statistics of the cohort users recovered from the aggregation done by the third party, so a model may be deemed fair while slightly violating the desired constraints. The results are also included in Table 2 to aid the comparison. We note how, similarly to the central case, models trained with FPFL succeed in enforcing the fairness constraints while keeping a similar accuracy. Moreover, clipping does not seem to greatly affect the performance of FPFL, since the fairness enforcement compensates for the clipping of the loss gradients. Finally, the addition of noise to guarantee DP is not a concern for the shallow network, but it deteriorates the performance of the deep network. This is largely because the noise is large enough that the sign of the constraints’ gradient, see (11), is sometimes mistaken.
Finally, we repeated the experiments with PFL and FPFL with a larger cohort size, , to see whether a smaller relative noise would aid the training with PFL or with FPFL. The results with PFL were almost identical, with similar levels of accuracy and unfairness. On the other hand, the larger signal-to-DP-noise ratio helped the models trained with FPFL to maintain the desired levels of FNR gap and a lower unfairness as measured by every other metric. Moreover, the accuracy of these models, which now work better for the underrepresented group, is in fact slightly higher than that of the models trained with PFL.
Model    Algorithm   Accuracy  FNR gap  EO gap  DemP gap  PP gap
Deep     FL          0.853     0.078    0.078   0.125     0.015
Shallow  FL          0.851     0.121    0.121   0.122     0.036
Deep     FFL         0.854     0.001    0.030   0.093     0.048
Shallow  FFL         0.855     0.036    0.039   0.108     0.033
Deep     FL + Clip   0.848     0.160    0.160   0.131     0.056
Shallow  FL + Clip   0.844     0.169    0.169   0.129     0.051
Deep     FFL + Clip  0.852     0.008    0.023   0.081     0.031
Shallow  FFL + Clip  0.853     0.018    0.029   0.090     0.016
Deep     PFL         0.849     0.123    0.123   0.126     0.057
Shallow  PFL         0.844     0.167    0.167   0.129     0.530
Deep     FPFL        0.804     0.079    0.080   0.045     0.092
Shallow  FPFL        0.840     0.001    0.028   0.080     0.020
Deep     PFL         0.847     0.167    0.167   0.132     0.043
Shallow  PFL         0.847     0.148    0.148   0.126     0.041
Deep     FPFL        0.848     0.027    0.027   0.080     0.026
Shallow  FPFL        0.851     0.001    0.027   0.087     0.002
4.2 Results on the modified FEMNIST dataset
FEMNIST dataset [caldas2018leaf].
This dataset is an adaptation of the Extended MNIST dataset [cohen2017emnist], which collects more than 800,000 samples of digits and letters distributed across 3,550 users. The task considered is to predict which of the 10 digits or 26 letters (upper or lower case) is depicted in the image, so it is a multiclass classification with 62 possible classes.
Unfair FEMNIST dataset.
We considered the FEMNIST dataset with only the digit samples. This restriction consists of 3,383 users spanning 343,099 training and 39,606 testing samples. The task now is to predict which of the 10 digits is depicted in the image, so it is a multiclass classification with 10 possible classes. Since the dataset does not contain clear sensitive groups, we artificially create three groups of users (see Fig. 1):

Users that write with a black pen on a white sheet. These users represent the first (lexicographical) 45% of the users, i.e. 1,522 users. These users contain 146,554 (42.7%) training and 16,689 (42.1%) testing samples. The images belonging to this group are unchanged.

Users that write with a blue pen on a white sheet. These users represent the second (lexicographical) 45% of the users, i.e. 1,522 users as well. These users contain 159,902 (46.6%) training and 18,672 (47.1%) testing samples. The images belonging to this group are modified so that the digit strokes are blue instead of black.

Users that write with white chalk on a blackboard. These users represent the remaining 10% of the users, i.e. 339 users. These users contain 36,643 (10.7%) training and 4,245 (10.7%) testing samples. The images belonging to this group are modified so that the digit strokes are white and the background is black. Moreover, to make the task more unfair, we simulated the blurry effect that chalk leaves on a blackboard. For this purpose, we added Gaussian-blurred noise to the image, and then blended the result with a further Gaussian blur. To be precise, if is the image normalized to , the blackboard effect is the following.
(19)
where is Gaussian noise of the size of the image, and are Gaussian kernels with standard deviation 1 and 2, respectively (we used the Gaussian filter implementation from SciPy [2020SciPyNMeth]), and represents the convolution operation. Moreover, the images are rotated 90 degrees, simulating pictures taken with the device in horizontal mode due to the usual shape of blackboards.
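The assignment of users to the three groups described above can be sketched as follows. The user identifiers, the group names, and the function `split_users` are hypothetical, and rounding the 45% shares down is an assumption that happens to be consistent with the user counts reported above.

```python
def split_users(user_ids, fractions=(0.45, 0.45)):
    """Split users, sorted lexicographically, into three consecutive groups.

    The first two groups take the given fractions of the users (rounded
    down, an assumption) and the third group takes the remainder.
    """
    ordered = sorted(user_ids)
    first = int(fractions[0] * len(ordered))
    second = int(fractions[1] * len(ordered))
    return {
        "black_pen": ordered[:first],
        "blue_pen": ordered[first:first + second],
        "blackboard": ordered[first + second:],
    }


# Made-up identifiers standing in for the 3,383 digit-only FEMNIST users.
user_ids = [f"user_{i:04d}" for i in range(3383)]
groups = split_users(user_ids)
print({name: len(members) for name, members in groups.items()})
```

With 3,383 identifiers the sketch yields groups of 1,522, 1,522, and 339 users, matching the counts listed for the three groups.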
Privacy and fairness parameters.
For all our experiments, we considered the privacy parameters and . However, for the last experiment, we consider the hypothetical scenario where we had a larger number of users and , thus decreasing the privacy parameter to and and reducing the added noise in the analysis from [wang2019subsampled].
Model considered.
We experimented with a network with 2 convolution layers with kernel of size , stride of 2, ReLU activation function, and 32 and 64 filters respectively. These layers are followed by a fully connected layer with 100 hidden units and a ReLU activation function, and a fully connected output layer with 10 units and a softmax activation function. From now on, this model will be referred to as the convolutional network.
Hyperparameters.
For all the experiments, the learning rate was for the network parameters and for the Lagrange multipliers. The damping parameter was . Note that we set a larger damping parameter to enforce the constraints more strongly, given that the task is harder than before. The cohort sizes considered were and . Finally, the clipping bound for the convolutional network was if the training algorithm was PFL and if it was FPFL. As previously, these hyperparameters were not selected with a private hyperparameter search and were simply set as an example configuration. If one wishes to find the best hyperparameters, one can do so at an additional privacy budget cost following e.g. [abadi2016deep, Appendix D].
We start our experiments by confirming again the hypothesis and findings from [bagdasaryan2019differential], which state that clipping and DP disproportionately affect underrepresented groups. For that, we trained a convolutional network with FederatedSGD and with versions of this algorithm where only clipping was performed and where both clipping and DP were included. They were trained with a cohort size of for iterations and the model with the best training cohort accuracy was selected. The results are displayed in Table 3. Similarly to before, we see how clipping increases the accuracy gap from 13% to almost 17%. In this case, since the number of users is small, the necessary DP noise standard deviation is large compared to the norm of the statistics sent by the users, and thus both the accuracy and the accuracy gap are severely affected by the addition of DP. Namely, the accuracy drops from more than 94% with clipping to 80.7% when DP is also included, and the accuracy gap increases to more than 40%.
Algorithm   Population  Accuracy  Accuracy gap
FL                      0.960     0.134
FFL                     0.950     0.047
FL + Clip               0.946     0.166
FFL + Clip              0.954     0.053
PFL                     0.807     0.409
FPFL                    0.093     0.015
PFL                     0.951     0.157
FPFL                    0.903     0.074
PFL                     0.951     0.153
FPFL                    0.927     0.073
The second experiment tests whether FPFL can remedy the unfairness without deteriorating the accuracy too much. We trained the same convolutional network also for iterations, and the model with the best training cohort accuracy that respected the fairness condition on the training cohort data was selected. We see how, when DP noise is not included, FPFL manages to reduce the accuracy gap with respect to FederatedSGD by around 9% while keeping the accuracy within 1%. We note how, as before, clipping does not greatly affect the ability of FPFL to enforce fairness. However, note that since the data is more non-iid than before (i.e. there are more differences between the distribution of each user and the general data distribution), the models that are deemed fair in the training cohort may not be as fair in the general population, and now we see a larger gap between the desired tolerance and the accuracy gap obtained by FPFL without noise (0.047 and 0.053 without and with clipping).
When DP is included, the noise is too large for FPFL to function properly and the sign of the constraints’ gradient, see (11), is often flipped. Note that in the estimation of the performance function, i.e. , both the numerator and denominator are obtained from a noisy vector, which increases the variance of the estimate and makes it more sensitive to noise than the estimators for FederatedSGD.
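The sensitivity of such a ratio to noise can be illustrated with a small simulation. The sketch below is not the estimator used by FPFL: the counts, the noise levels, and the function names are made up, and the noise on the numerator and on the denominator is assumed to be independent and Gaussian with the same standard deviation. It compares the empirical spread of the noisy ratio with its first-order (delta method) approximation.

```python
import math
import random
import statistics


def noisy_ratio_std(numerator, denominator, sigma, trials=20000, seed=0):
    """Empirical std of (numerator + noise) / (denominator + noise)."""
    rng = random.Random(seed)
    samples = []
    for _ in range(trials):
        noisy_num = numerator + rng.gauss(0.0, sigma)
        noisy_den = denominator + rng.gauss(0.0, sigma)
        samples.append(noisy_num / noisy_den)
    return statistics.pstdev(samples)


def first_order_std(numerator, denominator, sigma):
    """Delta-method approximation of the same standard deviation."""
    ratio = numerator / denominator
    return sigma / denominator * math.sqrt(1.0 + ratio * ratio)


# Hypothetical group statistics: 30 errors out of 100 relevant samples.
for sigma in (1.0, 5.0, 10.0):
    empirical = noisy_ratio_std(30.0, 100.0, sigma)
    approx = first_order_std(30.0, 100.0, sigma)
    print(f"sigma={sigma:5.1f}  empirical={empirical:.4f}  first-order={approx:.4f}")
```

In the first-order approximation the spread of the ratio grows linearly with the noise standard deviation and inversely with the denominator, so under these assumptions groups with few samples in a cohort yield the least reliable estimates.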
For this reason, we considered the scenario where the number of users was 100 and 1,000 times larger, i.e. and , which is a conservative assumption for federated learning settings [apple_privacy_scale]. Then, we repeated the experiment with DP FederatedSGD and FPFL where the DP noise was calculated assuming this larger number of users and where we increased the cohort size to . In this scenario, DP FederatedSGD maintained an accuracy gap of more than 15%, while FPFL reduced this gap to less than half in both cases. Nonetheless, the accuracy deteriorated slightly more than before, with a reduction of around 5% and 2% with respect to DP FederatedSGD when the hypothetical population was increased by factors of 100 and 1,000, respectively.
5 Conclusions
In this paper, we studied and proposed a solution for the often overlooked problem of group fairness in private federated learning. For this purpose, we adapted the modified method of differential multipliers (MMDM) [platt1987constrained] to empirical loss minimization with fairness constraints, which in itself serves as an algorithm for enforcing fairness in central learning. Then, we extended this algorithm to private federated learning.
Through experiments on the Adult [Dua:2019] dataset and a modified version of the FEMNIST [caldas2018leaf] dataset, we first corroborate previous knowledge that DP disproportionately affects the performance for underrepresented groups [bagdasaryan2019differential], with the further observation that this is true for many different fairness metrics, and not only for accuracy parity. Moreover, we demonstrate how the proposed FPFL algorithm is able to remedy this unfairness even in the presence of DP.
Limitations.
The FPFL algorithm is more sensitive to DP noise than other algorithms for PFL. In our experiments, this usually requires either increasing the cohort size or ensuring that enough users take part in the model’s training. Nonetheless, for the present experiments, the number of users required is still more than an order of magnitude lower than the usual number of users available in professional federated learning settings [apple_privacy_scale].
6 Acknowledgments
The authors would like to thank Kamal Benkiran and Áine Cahill for their helpful discussions.