A key learning scenario in large-scale applications is that of federated learning. In that scenario, a centralized model is trained based on data originating from a large number of clients, which may be mobile phones, other mobile devices, or sensors (Konečný, McMahan, Yu, Richtárik, Suresh, and Bacon, 2016b, a). The training data typically remains distributed over the clients, each with possibly unreliable or relatively slow network connections.
Federated learning raises several types of issues and has been the topic of multiple research efforts. These include systems, networking, and communication bottleneck problems due to frequent exchanges between the central server and the clients. To deal with such problems, McMahan et al. (2017) suggested an averaging technique that consists of transmitting the central model to a subset of clients, training it with the data locally available, and averaging the local updates. Smith et al. (2017) proposed to further leverage the relationship between clients, assumed to be known, and cast the problem as an instance of multi-task learning to derive local client models benefiting from other similar ones.
The optimization task in federated learning, which is a principal problem in this scenario, has also been the topic of multiple research efforts. These include the design of more efficient communication strategies (Konečný, McMahan, Yu, Richtárik, Suresh, and Bacon, 2016b, a; Suresh, Yu, Kumar, and McMahan, 2017), the design of efficient distributed optimization methods benefiting from differential privacy guarantees (Agarwal, Suresh, Yu, Kumar, and McMahan, 2018), as well as recent guarantees for parallel stochastic optimization with a dependency graph (Woodworth, Wang, Smith, McMahan, and Srebro, 2018).
Another key problem in federated learning, which appears more generally in distributed machine learning and other learning setups, is that of fairness. In many instances in practice, the resulting learning models may be biased or unfair: they may discriminate against some protected groups (Bickel, Hammel, and O’Connell, 1975; Hardt, Price, Srebro, et al., 2016). As a simple example, a regression algorithm predicting a person’s salary could be using that person’s gender. This is a key problem in modern machine learning that does not seem to have been specifically studied in the context of federated learning.
While many problems related to federated learning have been extensively studied, the key objective of learning in that context seems not to have been carefully examined. We are also not aware of statistical guarantees derived for learning in this scenario. A crucial reason for such questions to emerge in this context is that the target distribution for which the centralized model is learned is unspecified. Which expected loss is federated learning seeking to minimize? Most centralized models for standard federated learning are trained on the aggregate training sample obtained from the subsamples drawn from the clients. Thus, if we denote by $\mathcal{D}_k$ the distribution associated to client $k$, by $m_k$ the size of the sample available from that client, and by $m = \sum_{k=1}^{p} m_k$ the total sample size over the $p$ clients, then, intrinsically, the centralized model is trained to minimize the loss with respect to the uniform distribution
$$\overline{\mathcal{U}} = \sum_{k=1}^{p} \frac{m_k}{m} \, \mathcal{D}_k.$$
But why should $\overline{\mathcal{U}}$ be the target distribution of the learning model? Is $\overline{\mathcal{U}}$ the distribution that we expect to observe at test time? What guarantees can be derived for the deployed system?
Notice that, in practice, in federated learning, the probability that an individual data source participates in training depends on various factors such as whether the mobile device is connected to the internet or whether it is being charged. Thus, the training data may not truly reflect the usage of the learned model in inference. Additionally, these uncertainties may also affect the size $m_k$ of the sample acquired from each client, which directly affects the definition of the uniform distribution $\overline{\mathcal{U}}$.
We argue that in many common instances, the uniform distribution is not the natural objective distribution and that seeking to minimize the expected loss with respect to the specific distribution $\overline{\mathcal{U}}$ is risky. This is because the target distribution may be in general quite different from $\overline{\mathcal{U}}$. In many cases, that can result in a suboptimal or even a detrimental performance. For example, imagine a plausible scenario of federated learning where the learner has access to a large population of expensive mobile phones, which are more commonly adopted by software engineers or other technical users than by other users, and a small population of other mobile phones, which are used less by technical users and significantly more often by other users. The centralized model would then be essentially based on the uniform distribution over the expensive clients. But, clearly, such a model would not be adapted to the wide general target domain formed by the majority of phones, whose population consists mostly of general rather than technical users. Many other realistic examples of this type can help illustrate the learning problem resulting from a mismatch between the target distribution and $\overline{\mathcal{U}}$. In fact, it is not clear why minimizing the expected loss with respect to $\overline{\mathcal{U}}$ could be beneficial for the clients, whose distributions are the individual client distributions $\mathcal{D}_k$.
Thus, we put forward a new framework of agnostic federated learning (AFL), where the centralized model is optimized for any possible target distribution formed by a mixture of the client distributions. Instead of optimizing the centralized model for a specific distribution, with the high risk of a mismatch with the target, we define an agnostic and more risk-averse objective. We show that, for some target mixture distributions, the cross-entropy loss of the hypothesis obtained by minimization with respect to the uniform distribution can be worse, by a constant additive term, than that of the hypothesis obtained in AFL, even if the learner has access to an infinite sample size (Section 3.2).
We further show that our AFL framework naturally yields a notion of fairness, which we refer to as good-intent fairness (Section 3.3). Indeed, the predictor solution of the optimization problem for our AFL framework treats all protected categories similarly. Beyond federated learning, our framework and solution also cover related problems in cloud-based learning services, where customers may not have any training data at their disposal or may not be willing to share that data with the cloud. In that case too, the server needs to train a model without access to the training data. Our framework and algorithm can also be of interest to other learning scenarios such as domain adaptation, drifting, and other contexts where the training and test distributions do not coincide.
The rest of the paper is organized as follows. In Section 2, we give an extensive discussion of related work, including connections with the broad literature of domain adaptation. In Section 3, we give a formal description of the learning scenario of federated learning and the formulation of the problem as AFL. Next, we give a detailed theoretical analysis of learning in the AFL framework, including data-dependent Rademacher complexity generalization bounds (Section 4). These bounds lead to a natural learning algorithm with a regularization term based on a skewness term that we define (Section 5). We also present an efficient convex optimization algorithm for solving the optimization problem defining our algorithm (Section 5.2). Our algorithm is a stochastic gradient-descent solution for minimax problems, for which we give a detailed analysis, including the proof of convergence in terms of the variances of the stochastic gradients. In Section 6, we present a series of experiments comparing our AFL algorithm and solution with existing federated learning solutions. In Section 7, we discuss several extensions of AFL.
2 Related work
Here, we briefly discuss several learning scenarios and work related to our study of federated learning.
The problem of federated learning is closely related to other learning scenarios where there is a mismatch between the source distribution and the target distribution. This includes the problem of transfer learning or domain adaptation from a single source to a known target domain (Ben-David, Blitzer, Crammer, and Pereira, 2006; Mansour, Mohri, and Rostamizadeh, 2009b; Cortes and Mohri, 2014; Cortes, Mohri, and Muñoz Medina, 2015), either through unsupervised adaptation techniques (Gong et al., 2012; Long et al., 2015; Ganin and Lempitsky, 2015; Tzeng et al., 2015), or via lightly supervised ones (some amount of labeled data from the target domain) (Saenko et al., 2010; Yang et al., 2007; Hoffman et al., 2013; Girshick et al., 2014). This also includes previous applications in natural language processing (Dredze et al., 2007; Blitzer et al., 2007; Jiang and Zhai, 2007; Raju et al., 2018), speech recognition (Legetter and Woodland, 1995; Gauvain and Chin-Hui, 1994; Pietra et al., 1992; Rosenfeld, 1996; Jelinek, 1998; Roark and Bacchiani, 2003), and computer vision (Martínez, 2002).
A problem more closely related to that of federated learning is that of multiple-source adaptation, first formalized and analyzed theoretically by Mansour, Mohri, and Rostamizadeh (2009c, a) and later studied for various applications such as object recognition (Hoffman et al., 2012; Gong et al., 2013a, b). Recently, Zhang et al. (2015) studied a causal formulation of this problem for a classification scenario, using the same combination rules as Mansour et al. (2009c, a). The problem of domain generalization (Pan and Yang, 2010; Muandet et al., 2013; Xu et al., 2014), where knowledge from an arbitrary number of related domains is combined to perform well on a previously unseen domain, is also very closely related to that of federated learning, though the assumptions about the information available to the learner and the availability of unlabeled data may differ.
In the multiple-source adaptation problem studied by Mansour, Mohri, and Rostamizadeh (2009c, a) and Hoffman, Mohri, and Zhang (2018), each domain $k$ is defined by the corresponding distribution $\mathcal{D}_k$ and the learner has only access to a predictor $h_k$ for each domain and no access to labeled training data drawn from these domains. The authors show that it is possible to define a predictor $h$ whose expected loss $\mathcal{L}_{\mathcal{D}}(h)$ with respect to any distribution $\mathcal{D}$ that is a mixture of the source domains $\mathcal{D}_k$ is at most the maximum expected loss of the source predictors: $\mathcal{L}_{\mathcal{D}}(h) \le \max_{k} \mathcal{L}_{\mathcal{D}_k}(h_k)$. They also provide an algorithm for determining $h$.
Our learning scenario differs from the one adopted in that work since we assume access to labeled training data from each domain $\mathcal{D}_k$. Furthermore, the predictor determined by the algorithm of Hoffman, Mohri, and Zhang (2018) belongs to a specific hypothesis set $\mathcal{H}'$, which is that of distribution-weighted combinations of the domain predictors $h_k$, while, in our setup, the objective is to determine the best predictor in some global hypothesis set $\mathcal{H}$, which may include $\mathcal{H}'$ as a subset, and which does not depend on domain-specific predictors.
Our optimization solution also differs from the work of Farnia and Tse (2016) and Lee and Raginsky (2017) on local minimax results, where samples are drawn from a single source distribution, and where the generalization error is minimized over a set of locally ambiguous distributions defined around the empirical distribution. The authors propose this metric for statistical robustness. In our work, we obtain samples from multiple unknown distributions, and the set of distributions over which we optimize the expected loss is fixed and independent of the samples. Furthermore, the source distributions can differ arbitrarily and need not be close to each other. Conversely, we note that our stochastic algorithm can be used to minimize the loss functions proposed in (Farnia and Tse, 2016; Lee and Raginsky, 2017).
3 Learning scenario
In this section, we introduce the learning scenario of agnostic federated learning that we consider. We then argue, first, that the uniform solution commonly adopted in standard federated learning may not be an adequate solution, thereby further justifying our agnostic model. Second, we show the benefit of our model for fairness in learning.
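We start with some general notation and definitions used throughout the paper. Let $\mathcal{X}$ denote the input space and $\mathcal{Y}$ the output space. We will primarily discuss a multi-class classification problem where $\mathcal{Y}$ is a finite set of classes, but many of our results can be extended straightforwardly to regression and other problems. The hypotheses we consider are of the form $h \colon \mathcal{X} \to \Delta_{\mathcal{Y}}$, where $\Delta_{\mathcal{Y}}$ stands for the simplex over $\mathcal{Y}$. Thus, $h(x)$ is a probability distribution over the classes or categories that can be assigned to $x \in \mathcal{X}$. We will denote by $\mathcal{H}$ a family of such hypotheses $h$. We also denote by $\ell$ a loss function defined over $\Delta_{\mathcal{Y}} \times \mathcal{Y}$ and taking non-negative values. The loss of $h \in \mathcal{H}$ for a labeled sample $(x, y) \in \mathcal{X} \times \mathcal{Y}$ is given by $\ell(h(x), y)$. One key example in applications is the cross-entropy loss, which is defined as follows: $\ell(h(x), y) = -\log \big( \Pr_{y' \sim h(x)}[y' = y] \big)$. We will denote by $\mathcal{L}_{\mathcal{D}}(h)$ the expected loss of a hypothesis $h$ with respect to a distribution $\mathcal{D}$ over $\mathcal{X} \times \mathcal{Y}$:
$$\mathcal{L}_{\mathcal{D}}(h) = \mathbb{E}_{(x, y) \sim \mathcal{D}} \big[ \ell(h(x), y) \big],$$
and by $h_{\mathcal{D}}$ its minimizer: $h_{\mathcal{D}} = \operatorname{argmin}_{h \in \mathcal{H}} \mathcal{L}_{\mathcal{D}}(h)$.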
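The definitions above can be made concrete with a small sketch. It assumes a finite distribution represented as a dictionary from (x, y) pairs to probabilities; the names `cross_entropy`, `expected_loss`, and the toy data are hypothetical and chosen only for illustration.

```python
import math

def cross_entropy(probs, y):
    """Cross-entropy loss -log h(x)[y] for a predicted distribution `probs` over classes."""
    return -math.log(probs[y])

def expected_loss(h, dist):
    """Expected loss of hypothesis `h` under a finite distribution.

    `dist` maps (x, y) pairs to probabilities; `h` maps x to a list of class probabilities.
    """
    return sum(p * cross_entropy(h(x), y) for (x, y), p in dist.items() if p > 0)

# Hypothetical toy setup: a single input "x0" and two classes 0 and 1.
dist = {("x0", 0): 0.25, ("x0", 1): 0.75}
h_match = lambda x: [0.25, 0.75]   # predicts the true conditional distribution
h_flat = lambda x: [0.5, 0.5]      # predicts uniform class probabilities

loss_match = expected_loss(h_match, dist)
loss_flat = expected_loss(h_flat, dist)
```

In this toy case the hypothesis matching the label distribution attains the smaller expected loss, as expected for the cross-entropy loss.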
3.1 Agnostic federated learning
We consider a learning scenario where the learner receives $p$ samples $S_1, \ldots, S_p$, with each $S_k = ((x_{k,1}, y_{k,1}), \ldots, (x_{k,m_k}, y_{k,m_k}))$ of size $m_k$ drawn i.i.d. from a different domain or distribution $\mathcal{D}_k$. The learner’s objective is to determine a hypothesis $h$ that performs well on some target distribution. We will also denote by $\widehat{\mathcal{D}}_k$ the empirical distribution associated to the sample $S_k$ of size $m_k$ drawn from $\mathcal{D}_k$.
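This scenario coincides with that of federated learning where training is done with the uniform distribution over the union of all samples $S_k$, that is $\widehat{\mathcal{U}} = \sum_{k=1}^{p} \frac{m_k}{m} \widehat{\mathcal{D}}_k$ with $m = \sum_{k=1}^{p} m_k$, and where the underlying assumption is that the target distribution is $\overline{\mathcal{U}} = \sum_{k=1}^{p} \frac{m_k}{m} \mathcal{D}_k$. We will not adopt that assumption since it is rather restrictive and since, as discussed later, it can lead to solutions that are disadvantageous to domain users. Instead, we will consider an agnostic federated learning (AFL) scenario where the target distribution can be modeled as an unknown mixture of the distributions $\mathcal{D}_k$, $k = 1, \ldots, p$, that is $\mathcal{D}_\lambda = \sum_{k=1}^{p} \lambda_k \mathcal{D}_k$ for some $\lambda \in \Delta_p$, where $\Delta_p$ denotes the simplex of mixture weights over the $p$ domains. Since the mixture weight $\lambda$ is unknown, here, the learner must come up with a solution that is favorable for any $\lambda$ in the simplex, or any $\lambda$ in a subset $\Lambda \subseteq \Delta_p$. Thus, we define the agnostic loss (or agnostic risk) $\mathcal{L}_{\mathcal{D}_\Lambda}(h)$ associated to a predictor $h \in \mathcal{H}$ as
$$\mathcal{L}_{\mathcal{D}_\Lambda}(h) = \max_{\lambda \in \Lambda} \mathcal{L}_{\mathcal{D}_\lambda}(h).$$
We will extend our previous definitions and denote by $h_{\mathcal{D}_\Lambda}$ the minimizer of this loss:
$$h_{\mathcal{D}_\Lambda} = \operatorname{argmin}_{h \in \mathcal{H}} \mathcal{L}_{\mathcal{D}_\Lambda}(h).$$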
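In practice, the learner has access to the distributions $\mathcal{D}_k$ only via the finite samples $S_k$. Thus, for any $\lambda \in \Delta_p$, instead of the mixture $\mathcal{D}_\lambda$, only the $\lambda$-mixture of empirical distributions, $\overline{\mathcal{D}}_\lambda = \sum_{k=1}^{p} \lambda_k \widehat{\mathcal{D}}_k$, is accessible. (Note that $\overline{\mathcal{D}}_\lambda$ is distinct from an empirical distribution $\widehat{\mathcal{D}}_\lambda$, which would be based on a sample drawn from $\mathcal{D}_\lambda$; $\overline{\mathcal{D}}_\lambda$ is based on samples drawn from the distributions $\mathcal{D}_k$.) This leads to the definition of $\mathcal{L}_{\overline{\mathcal{D}}_\Lambda}(h)$, the agnostic empirical loss of a hypothesis $h \in \mathcal{H}$ for a subset $\Lambda$ of the simplex:
$$\mathcal{L}_{\overline{\mathcal{D}}_\Lambda}(h) = \max_{\lambda \in \Lambda} \mathcal{L}_{\overline{\mathcal{D}}_\lambda}(h).$$
We will denote by $h_{\overline{\mathcal{D}}_\Lambda}$ the minimizer of this loss: $h_{\overline{\mathcal{D}}_\Lambda} = \operatorname{argmin}_{h \in \mathcal{H}} \mathcal{L}_{\overline{\mathcal{D}}_\Lambda}(h)$. In the next section, we will present generalization bounds relating the expected and empirical agnostic losses $\mathcal{L}_{\mathcal{D}_\Lambda}(h)$ and $\mathcal{L}_{\overline{\mathcal{D}}_\Lambda}(h)$ for all $h \in \mathcal{H}$.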
Notice that the domains $\mathcal{D}_k$ discussed thus far need not coincide with the clients. In fact, when the number of clients is very large and $\Lambda$ is the full simplex, $\Lambda = \Delta_p$, it is typically preferable to consider instead domains defined by clusters of clients, as discussed in Section 7. On the other hand, if $p$ is small or $\Lambda$ more restrictive, then the model may not perform well on certain domains of interest. We mitigate the effect of large $p$ values using a suitable regularization term derived from our theory.
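As an illustration of the agnostic empirical loss, the following sketch computes per-domain empirical losses, the loss under the uniform (sample-size weighted) mixture, and the maximum over a finite set of mixture weights. Because the mixture loss is linear in the mixture weight, the maximum over the full simplex is attained at a vertex, so the Dirac weights suffice in that case. The data, the predictor, and all function names are hypothetical.

```python
import math

def empirical_loss(h, sample):
    """Average cross-entropy loss of h on one domain sample [(x, y), ...]."""
    return sum(-math.log(h(x)[y]) for x, y in sample) / len(sample)

def mixture_loss(domain_losses, lam):
    """Loss under the lambda-mixture: sum_k lam[k] * loss_k (linear in lam)."""
    return sum(l * w for l, w in zip(domain_losses, lam))

def agnostic_loss(domain_losses, Lambda):
    """Agnostic loss: maximum mixture loss over a finite set Lambda of mixture weights."""
    return max(mixture_loss(domain_losses, lam) for lam in Lambda)

# Hypothetical data: two domains of very different sizes, binary labels, one input.
samples = [
    [("x0", 1)] * 90 + [("x0", 0)] * 10,   # large domain, mostly class 1
    [("x0", 0)] * 8 + [("x0", 1)] * 2,     # small domain, mostly class 0
]
h = lambda x: [0.2, 0.8]  # a fixed predictor favoring class 1

losses = [empirical_loss(h, s) for s in samples]
m = [len(s) for s in samples]
uniform = [mk / sum(m) for mk in m]        # weights m_k / m of the uniform mixture
vertices = [[1.0, 0.0], [0.0, 1.0]]        # Dirac weights on each domain

uniform_loss = mixture_loss(losses, uniform)
worst_case = agnostic_loss(losses, vertices)
```

Here the uniform mixture loss is dominated by the large domain, while the agnostic loss reports the loss on the small domain, on which this predictor performs poorly.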
3.2 Comparison with federated learning
Here, we further argue that the uniform solution commonly adopted in federated learning may not provide a satisfactory performance compared with a solution of the agnostic problem. This further motivates our AFL model.
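As already discussed, since the target distribution is unknown, the natural method for the learner is to select a hypothesis minimizing the agnostic loss $\mathcal{L}_{\mathcal{D}_\Lambda}$. Does the predictor minimizing the agnostic loss coincide with the solution $h_{\overline{\mathcal{U}}}$ of standard federated learning? How poor can the performance of standard federated learning be? We first show that the loss of $h_{\overline{\mathcal{U}}}$ can be higher than the optimal loss achieved by $h_{\mathcal{D}_\Lambda}$ by a constant, even if the number of samples tends to infinity, that is, even if the learner has access to the distributions $\mathcal{D}_k$ and uses the predictor $h_{\overline{\mathcal{U}}}$. Similar results are known for universal compression, where the goal is to compress a sequence of random variables without knowledge of the generating distribution (Grünwald, 2007). Let $\ell$ be the cross-entropy loss. Then, there exist $\Lambda$, $\mathcal{H}$, and $\mathcal{D}_k$, $k \in [p]$, such that the following inequality holds:
$$\mathcal{L}_{\mathcal{D}_\Lambda}(h_{\overline{\mathcal{U}}}) \ge \mathcal{L}_{\mathcal{D}_\Lambda}(h_{\mathcal{D}_\Lambda}) + \log \frac{2}{\sqrt{3}}.$$
Consider the following two distributions with support reduced to a single element $x$ and two classes $\mathcal{Y} = \{0, 1\}$: $\mathcal{D}_1(x, 0) = 0$, $\mathcal{D}_1(x, 1) = 1$, $\mathcal{D}_2(x, 0) = \frac{1}{2}$, and $\mathcal{D}_2(x, 1) = \frac{1}{2}$. Let $\Lambda = \{\delta_1, \delta_2\}$, where $\delta_k$, $k = 1, 2$, denotes the Dirac measure on index $k$. We will consider the case where the sample sizes $m_k$ are all equal, that is $\overline{\mathcal{U}} = \frac{1}{2}(\mathcal{D}_1 + \mathcal{D}_2)$. Let $h_0$ denote the probability that $h$ assigns to class $0$ and $h_1$ the one it assigns to class $1$. Then, the cross-entropy loss of a predictor $h$ can be expressed as follows:
$$\mathcal{L}_{\overline{\mathcal{U}}}(h) = \frac{1}{4} \log \frac{1}{h_0} + \frac{3}{4} \log \frac{1}{h_1} = \mathsf{KL}\Big( \big(\tfrac{1}{4}, \tfrac{3}{4}\big) \,\Big\|\, (h_0, h_1) \Big) + \mathsf{H}\big(\tfrac{1}{4}\big) \ge \mathsf{H}\big(\tfrac{1}{4}\big),$$
where $\mathsf{H}(\tfrac{1}{4})$ denotes the entropy of the distribution $(\tfrac{1}{4}, \tfrac{3}{4})$ and where the last inequality follows from the non-negativity of the relative entropy. Furthermore, equality is achieved when $(h_0, h_1) = (\tfrac{1}{4}, \tfrac{3}{4})$, which defines $h_{\overline{\mathcal{U}}}$, the minimizer of $\mathcal{L}_{\overline{\mathcal{U}}}$. In view of that, $\mathcal{L}_{\mathcal{D}_\Lambda}(h_{\overline{\mathcal{U}}})$ is given by the following:
$$\mathcal{L}_{\mathcal{D}_\Lambda}(h_{\overline{\mathcal{U}}}) = \max\big\{ \mathcal{L}_{\mathcal{D}_1}(h_{\overline{\mathcal{U}}}), \mathcal{L}_{\mathcal{D}_2}(h_{\overline{\mathcal{U}}}) \big\} = \max\Big\{ \log \frac{4}{3}, \; \frac{1}{2} \log 4 + \frac{1}{2} \log \frac{4}{3} \Big\} = \log \frac{4}{\sqrt{3}}.$$
We now compute the loss of $h_{\mathcal{D}_\Lambda}$:
$$\mathcal{L}_{\mathcal{D}_\Lambda}(h_{\mathcal{D}_\Lambda}) = \min_{h} \max\Big\{ \log \frac{1}{h_1}, \; \frac{1}{2} \log \frac{1}{h_0} + \frac{1}{2} \log \frac{1}{h_1} \Big\} = \log 2,$$
since $(h_0, h_1) = (\tfrac{1}{2}, \tfrac{1}{2})$ is the solution of the convex optimization in $h$, in view of $\mathcal{L}_{\mathcal{D}_2}(h) \ge \log 2$ for all $h$ and $\mathcal{L}_{\mathcal{D}_1}(h) > \log 2$ for $h_1 < \tfrac{1}{2}$. The difference between the two losses is $\log \frac{4}{\sqrt{3}} - \log 2 = \log \frac{2}{\sqrt{3}}$.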
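The constant gap in this comparison can be checked numerically. The sketch below assumes one specific two-domain instance (an assumption of this example): a single input, two classes, a first domain that puts all its mass on class 1, a second domain that is uniform over the two classes, equal sample sizes, and a set of mixture weights reduced to the two Dirac measures. A grid search stands in for the exact minimization.

```python
import math

def loss_d1(q):
    """Cross-entropy loss under D1 (all mass on class 1) when h assigns probability q to class 1."""
    return -math.log(q)

def loss_d2(q):
    """Cross-entropy loss under D2 (classes 0 and 1 equally likely)."""
    return -0.5 * math.log(1.0 - q) - 0.5 * math.log(q)

def agnostic(q):
    """Agnostic loss for Lambda = {delta_1, delta_2}: the worse of the two domain losses."""
    return max(loss_d1(q), loss_d2(q))

# With equal sample sizes, the uniform mixture puts mass 3/4 on class 1,
# so the minimizer of the uniform loss predicts q = 3/4.
q_uniform = 0.75

# Grid search over q for the agnostic minimizer (a numerical stand-in for the exact argument).
grid = [i / 10000 for i in range(1, 10000)]
q_agnostic = min(grid, key=agnostic)

gap = agnostic(q_uniform) - agnostic(q_agnostic)
```

Under these assumptions, the agnostic minimizer predicts probability 1/2 for each class, and the gap equals log(2/√3), independently of the sample size.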
3.3 Good-intent fairness in learning
Here, we further discuss the relationship between our model of AFL and fairness in learning.
Fairness in machine learning has received much attention in the recent past (Bickel et al., 1975; Hardt et al., 2016). There is now a broad literature on the topic with a variety of definitions of the notion of fairness. In a typical scenario, there is a protected class $c$ among $p$ classes $c_1, \ldots, c_p$. While there are many definitions of fairness, the main objective of a fairness algorithm is to reduce bias and ensure that the model is fair to all the protected categories, under some definition of fairness. The most common reasons for bias in machine learning algorithms are training data bias and overfitting bias. We first provide a brief explanation and illustration for both:
The training data is biased: consider the regression task where the goal is to predict the salary of a person based on features such as education, location, age, and gender. Let gender be the protected class. If, in the training data, there is a consistent discrimination against women irrespective of their education, e.g., their salary is lower, then we can conclude that the training data is inherently biased.
The training procedure is biased: consider an image recognition task where the protected category is race. If the model is heavily trained on images of certain races, then the resulting model will be biased because of over-fitting.
Our model of AFL can help define a notion of good-intent fairness, where we reduce the bias in the training procedure. Furthermore, if training procedure bias exists, it naturally highlights it.
Suppose we are interested in a classification problem and there is a protected feature class $c$, which can be one of $p$ values $c_1, \ldots, c_p$. Then, we define $\mathcal{D}_k$ as the conditional distribution with the protected class being $c_k$. If $\mathcal{D}$ is the true underlying distribution, then
$$\mathcal{D}_k(x, y) = \mathcal{D}\big(x, y \mid c(x, y) = c_k\big).$$
Let $\Lambda = \{\delta_k \colon k \in [p]\}$ be the collection of Dirac measures over the indices $k$ in $[p]$. With this definition, we define a good-intent fairness algorithm as one seeking to minimize the agnostic loss $\mathcal{L}_{\mathcal{D}_\Lambda}$. Thus, the objective of the algorithm is to minimize the maximum loss incurred on any of the underlying protected classes, and hence it does not overfit the data to any particular class at the cost of others. Furthermore, it does not degrade the performance of the other classes so long as it does not affect the loss of the most-sensitive protected category. We further note that our approach does not reduce bias in the training data and is useful only for mitigating the training procedure bias.
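A minimal sketch of the good-intent fairness objective follows, assuming records of the form (x, y, group) and the cross-entropy loss. With the set of mixture weights reduced to Dirac measures, the agnostic loss is simply the largest per-group loss. The records, the two predictors, and the function names are hypothetical.

```python
import math
from collections import defaultdict

def group_losses(h, records):
    """Average cross-entropy loss per protected group.

    `records` is a list of (x, y, group) triples; `h` maps x to class probabilities.
    """
    totals, counts = defaultdict(float), defaultdict(int)
    for x, y, g in records:
        totals[g] += -math.log(h(x)[y])
        counts[g] += 1
    return {g: totals[g] / counts[g] for g in totals}

def good_intent_objective(h, records):
    """Agnostic loss with Lambda = Dirac measures: the loss of the worst-off group."""
    return max(group_losses(h, records).values())

# Hypothetical records: group "a" is over-represented and mostly labeled 1,
# group "b" is small and mostly labeled 0.
records = (
    [("x0", 1, "a")] * 80 + [("x0", 0, "a")] * 20
    + [("x0", 0, "b")] * 7 + [("x0", 1, "b")] * 3
)

fit_majority = lambda x: [0.23, 0.77]   # roughly tracks the pooled label frequencies
balanced = lambda x: [0.5, 0.5]

worst_majority = good_intent_objective(fit_majority, records)
worst_balanced = good_intent_objective(balanced, records)
```

In this toy example, the predictor fitted to the pooled data incurs a much larger loss on the small group than the balanced predictor does, which is what the max-over-groups objective penalizes.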
4 Learning bounds
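In this section, we present learning guarantees for agnostic federated learning. Let $\mathcal{G}$ denote the family of the losses associated to a hypothesis set $\mathcal{H}$: $\mathcal{G} = \{(x, y) \mapsto \ell(h(x), y) \colon h \in \mathcal{H}\}$. Our learning bounds are based on the following notion of weighted Rademacher complexity, which is defined for any hypothesis set $\mathcal{H}$, vector of sample sizes $\mathbf{m} = (m_1, \ldots, m_p)$, and mixture weight $\lambda \in \Delta_p$, by the following expression:
$$\mathfrak{R}_{\mathbf{m}}(\mathcal{G}, \lambda) = \mathbb{E}_{S_k \sim \mathcal{D}_k^{m_k}, \, \sigma} \left[ \sup_{h \in \mathcal{H}} \sum_{k=1}^{p} \frac{\lambda_k}{m_k} \sum_{i=1}^{m_k} \sigma_{k,i} \, \ell(h(x_{k,i}), y_{k,i}) \right],$$
where $S_k = ((x_{k,1}, y_{k,1}), \ldots, (x_{k,m_k}, y_{k,m_k}))$ is a sample of size $m_k$ and $\sigma = (\sigma_{k,i})_{k \in [p], i \in [m_k]}$ a collection of Rademacher variables, that is, uniformly distributed random variables taking values in $\{-1, +1\}$. We also define the minimax weighted Rademacher complexity for a subset $\Lambda \subseteq \Delta_p$ by
$$\mathfrak{R}_{\mathbf{m}}(\mathcal{G}, \Lambda) = \max_{\lambda \in \Lambda} \mathfrak{R}_{\mathbf{m}}(\mathcal{G}, \lambda).$$
Let $\overline{\mathbf{m}} = \frac{\mathbf{m}}{m} = \big(\frac{m_1}{m}, \ldots, \frac{m_p}{m}\big)$ denote the empirical distribution in $\Delta_p$ defined by the sample sizes $m_k$, where $m = \sum_{k=1}^{p} m_k$. We define the skewness of $\Lambda$ with respect to $\overline{\mathbf{m}}$ by
$$\mathfrak{s}(\Lambda \,\|\, \overline{\mathbf{m}}) = \max_{\lambda \in \Lambda} \chi^2(\lambda \,\|\, \overline{\mathbf{m}}) + 1,$$
where, for any two distributions $\mathsf{p}$ and $\mathsf{q}$ in $\Delta_p$, the chi-squared divergence $\chi^2(\mathsf{p} \,\|\, \mathsf{q})$ is given by $\chi^2(\mathsf{p} \,\|\, \mathsf{q}) = \sum_{k=1}^{p} \frac{(\mathsf{p}_k - \mathsf{q}_k)^2}{\mathsf{q}_k}$. We will also denote by $\Lambda_\epsilon$ a minimum $\epsilon$-cover of $\Lambda$ in $\ell_1$ distance, that is,
$$\Lambda_\epsilon = \operatorname{argmin}_{\Lambda' \in C(\Lambda, \epsilon)} |\Lambda'|,$$
where $C(\Lambda, \epsilon)$ is the family of sets of distributions $\Lambda'$ such that for every $\lambda \in \Lambda$, there exists $\lambda' \in \Lambda'$ such that $\sum_{k=1}^{p} |\lambda_k - \lambda'_k| \le \epsilon$.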
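The skewness can be computed directly for a finite set of mixture weights. The sketch below assumes the skewness is the maximum chi-squared divergence to the sample-size distribution, plus one (stated here as an assumption of the example); the sample sizes and function names are hypothetical. It also checks numerically the identity relating the weighted sum of squared mixture weights to the skewness.

```python
def chi_squared(p, q):
    """Chi-squared divergence sum_k (p_k - q_k)^2 / q_k between two distributions."""
    return sum((pk - qk) ** 2 / qk for pk, qk in zip(p, q))

def skewness(Lambda, sizes):
    """Skewness of a finite set Lambda with respect to the sample-size distribution m_k / m."""
    m = sum(sizes)
    m_bar = [mk / m for mk in sizes]
    return max(chi_squared(lam, m_bar) for lam in Lambda) + 1.0

sizes = [900, 90, 10]                    # hypothetical per-domain sample sizes
m = sum(sizes)
m_bar = [mk / m for mk in sizes]

s_uniform = skewness([m_bar], sizes)     # Lambda reduced to the sample-size distribution
s_vertices = skewness([[1, 0, 0], [0, 1, 0], [0, 0, 1]], sizes)

# Identity: sum_k lam_k^2 / m_k equals skewness(lam) / m.
lam = [0.2, 0.3, 0.5]
lhs = sum(l * l / mk for l, mk in zip(lam, sizes))
rhs = skewness([lam], sizes) / m
```

When the only mixture weight is the sample-size distribution, the skewness is one; when the Dirac weights are included, it grows to the ratio of the total sample size to the smallest domain size.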
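Our first learning guarantee is presented in terms of the weighted Rademacher complexity $\mathfrak{R}_{\mathbf{m}}(\mathcal{G}, \lambda)$, the skewness parameter $\mathfrak{s}(\lambda \,\|\, \overline{\mathbf{m}})$, and the $\epsilon$-cover $\Lambda_\epsilon$.
Assume that the loss $\ell$ is bounded by $M > 0$. Fix $\epsilon > 0$ and $\mathbf{m} = (m_1, \ldots, m_p)$. Then, for any $\delta > 0$, with probability at least $1 - \delta$ over the draw of samples $S_k \sim \mathcal{D}_k^{m_k}$, the following inequality holds for all $h \in \mathcal{H}$ and $\lambda \in \Lambda$:
$$\mathcal{L}_{\mathcal{D}_\lambda}(h) \le \mathcal{L}_{\overline{\mathcal{D}}_\lambda}(h) + 2 \mathfrak{R}_{\mathbf{m}}(\mathcal{G}, \lambda) + M \epsilon + M \sqrt{\frac{\mathfrak{s}(\lambda \,\|\, \overline{\mathbf{m}})}{2m} \log \frac{|\Lambda_\epsilon|}{\delta}}.$$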
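Fix $\lambda$ and, for any sample $S = (S_1, \ldots, S_p)$, define $\Psi(S) = \sup_{h \in \mathcal{H}} \big( \mathcal{L}_{\mathcal{D}_\lambda}(h) - \mathcal{L}_{\overline{\mathcal{D}}_\lambda}(h) \big)$. Let $S' = (S'_1, \ldots, S'_p)$ be a sample differing from $S$ only by point $x'_{k,i}$ in $S'_k$ and $x_{k,i}$ in $S_k$, and let $\overline{\mathcal{D}}'_\lambda$ denote the corresponding mixture of empirical distributions. Then, since the difference of suprema over the same set is bounded by the supremum of the differences, we can write
$$\Psi(S') - \Psi(S) \le \sup_{h \in \mathcal{H}} \big( \mathcal{L}_{\overline{\mathcal{D}}_\lambda}(h) - \mathcal{L}_{\overline{\mathcal{D}}'_\lambda}(h) \big) = \sup_{h \in \mathcal{H}} \frac{\lambda_k}{m_k} \big[ \ell(h(x_{k,i}), y_{k,i}) - \ell(h(x'_{k,i}), y'_{k,i}) \big] \le \frac{\lambda_k M}{m_k}.$$
Thus, by McDiarmid’s inequality, for any $\delta > 0$, the following inequality holds with probability at least $1 - \delta$ for any $h \in \mathcal{H}$:
$$\mathcal{L}_{\mathcal{D}_\lambda}(h) \le \mathcal{L}_{\overline{\mathcal{D}}_\lambda}(h) + \mathbb{E}\Big[ \sup_{h \in \mathcal{H}} \big( \mathcal{L}_{\mathcal{D}_\lambda}(h) - \mathcal{L}_{\overline{\mathcal{D}}_\lambda}(h) \big) \Big] + M \sqrt{\sum_{k=1}^{p} \frac{\lambda_k^2}{2 m_k} \log \frac{1}{\delta}}.$$
Therefore, by the union bound over $\Lambda_\epsilon$, with probability at least $1 - \delta$, for any $h \in \mathcal{H}$ and $\lambda \in \Lambda_\epsilon$, the following holds:
$$\mathcal{L}_{\mathcal{D}_\lambda}(h) \le \mathcal{L}_{\overline{\mathcal{D}}_\lambda}(h) + \mathbb{E}\Big[ \sup_{h \in \mathcal{H}} \big( \mathcal{L}_{\mathcal{D}_\lambda}(h) - \mathcal{L}_{\overline{\mathcal{D}}_\lambda}(h) \big) \Big] + M \sqrt{\sum_{k=1}^{p} \frac{\lambda_k^2}{2 m_k} \log \frac{|\Lambda_\epsilon|}{\delta}}.$$
By definition of $\Lambda_\epsilon$, for any $\lambda \in \Lambda$, there exists $\lambda' \in \Lambda_\epsilon$ such that $\mathcal{L}_{\mathcal{D}_\lambda}(h) - \mathcal{L}_{\overline{\mathcal{D}}_\lambda}(h) \le \mathcal{L}_{\mathcal{D}_{\lambda'}}(h) - \mathcal{L}_{\overline{\mathcal{D}}_{\lambda'}}(h) + M \epsilon$. In view of that, with probability at least $1 - \delta$, for any $h \in \mathcal{H}$ and $\lambda \in \Lambda$, the following holds:
$$\mathcal{L}_{\mathcal{D}_\lambda}(h) \le \mathcal{L}_{\overline{\mathcal{D}}_\lambda}(h) + \mathbb{E}\Big[ \sup_{h \in \mathcal{H}} \big( \mathcal{L}_{\mathcal{D}_\lambda}(h) - \mathcal{L}_{\overline{\mathcal{D}}_\lambda}(h) \big) \Big] + M \epsilon + M \sqrt{\sum_{k=1}^{p} \frac{\lambda_k^2}{2 m_k} \log \frac{|\Lambda_\epsilon|}{\delta}}.$$
The expectation appearing on the right-hand side can be bounded following standard proofs for Rademacher complexity upper bounds (see for example (Mohri et al., 2018)), leading to
$$\mathbb{E}\Big[ \sup_{h \in \mathcal{H}} \big( \mathcal{L}_{\mathcal{D}_\lambda}(h) - \mathcal{L}_{\overline{\mathcal{D}}_\lambda}(h) \big) \Big] \le 2 \mathfrak{R}_{\mathbf{m}}(\mathcal{G}, \lambda).$$
The sum $\sum_{k=1}^{p} \frac{\lambda_k^2}{m_k}$ can be expressed in terms of the skewness of $\lambda$, using the following equalities:
$$m \sum_{k=1}^{p} \frac{\lambda_k^2}{m_k} = \sum_{k=1}^{p} \frac{\lambda_k^2}{m_k / m} = \sum_{k=1}^{p} \frac{(\lambda_k - m_k/m)^2}{m_k / m} + 1 = \chi^2(\lambda \,\|\, \overline{\mathbf{m}}) + 1 = \mathfrak{s}(\lambda \,\|\, \overline{\mathbf{m}}).$$
This completes the proof. It can be proven that the skewness parameter appears in a lower bound on the generalization bound. We will include that result in the final version of this paper. The theorem yields immediately upper bounds for agnostic losses by taking the maximum over $\lambda \in \Lambda$: for any $\delta > 0$, with probability at least $1 - \delta$, for any $h \in \mathcal{H}$,
$$\mathcal{L}_{\mathcal{D}_\Lambda}(h) \le \mathcal{L}_{\overline{\mathcal{D}}_\Lambda}(h) + 2 \mathfrak{R}_{\mathbf{m}}(\mathcal{G}, \Lambda) + M \epsilon + M \sqrt{\frac{\mathfrak{s}(\Lambda \,\|\, \overline{\mathbf{m}})}{2m} \log \frac{|\Lambda_\epsilon|}{\delta}}.$$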
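To give a sense of how the skewness enters the guarantee, the following sketch evaluates a deviation term of the form M·sqrt(s/(2m)·log(N/δ)), where s is the skewness, m the total sample size, and N the size of the cover. The exact form of the term is an assumption of this example, as are the function name and the numerical values.

```python
import math

def deviation_term(M, skew, m, cover_size, delta):
    """Evaluate M * sqrt(skew / (2 m) * log(cover_size / delta))."""
    return M * math.sqrt(skew / (2.0 * m) * math.log(cover_size / delta))

# Hypothetical values for the loss bound, total sample size, confidence, and cover size.
M, m, delta, cover_size = 1.0, 100_000, 0.05, 50

balanced_term = deviation_term(M, 1.0, m, cover_size, delta)    # skewness equal to 1
skewed_term = deviation_term(M, 100.0, m, cover_size, delta)    # skewness equal to 100
```

Under this form, multiplying the skewness by 100 multiplies the deviation term by 10, which is the same effect as dividing the total sample size by 100.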
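The following result shows that, for a family of functions taking values in $\{-1, +1\}$, the Rademacher complexity can be bounded in terms of the VC-dimension and the skewness of $\Lambda$.
Let $\ell$ be a loss function taking values in $\{-1, +1\}$ and such that the family of losses $\mathcal{G}$ admits VC-dimension $d$. Then, the following upper bound holds for the weighted Rademacher complexity of $\mathcal{G}$:
$$\mathfrak{R}_{\mathbf{m}}(\mathcal{G}, \Lambda) \le \sqrt{2 \, \mathfrak{s}(\Lambda \,\|\, \overline{\mathbf{m}}) \, \frac{d}{m} \log \Big[ \frac{e m}{d} \Big]}.$$
For any $\lambda \in \Lambda$, define the set of vectors $A_\lambda$ in $\mathbb{R}^m$ by
$$A_\lambda = \Big\{ \Big[ \frac{\lambda_k}{m_k} \, \ell(h(x_{k,i}), y_{k,i}) \Big]_{(k, i) \in [p] \times [m_k]} \colon h \in \mathcal{H} \Big\}.$$
For any $\mathbf{a} \in A_\lambda$, $\|\mathbf{a}\|_2 = \sqrt{\sum_{k=1}^{p} \frac{\lambda_k^2}{m_k}} = \sqrt{\frac{\mathfrak{s}(\lambda \,\|\, \overline{\mathbf{m}})}{m}}$. Then, by Massart’s lemma, for any $\lambda \in \Lambda$, the following inequalities hold:
$$\mathfrak{R}_{\mathbf{m}}(\mathcal{G}, \lambda) \le \mathbb{E}\left[ \sqrt{\frac{\mathfrak{s}(\lambda \,\|\, \overline{\mathbf{m}})}{m}} \sqrt{2 \log |A_\lambda|} \right] \le \sqrt{\frac{2 \, \mathfrak{s}(\Lambda \,\|\, \overline{\mathbf{m}})}{m}} \; \mathbb{E}\left[ \sqrt{\log |A_\lambda|} \right].$$
By Sauer’s lemma, the following holds for $m \ge d$: $|A_\lambda| \le \big( \frac{e m}{d} \big)^d$. Plugging in the right-hand side in the inequality above completes the proof. Both Lemma 4 and the generalization bound of Theorem 4 can thus be expressed in terms of the skewness parameter $\mathfrak{s}(\Lambda \,\|\, \overline{\mathbf{m}})$. Note that, modulo the skewness parameter, the results look very similar to standard generalization bounds (Mohri et al., 2018). Furthermore, when $\Lambda$ contains only one distribution and that distribution is the average distribution, that is $\Lambda = \{\overline{\mathbf{m}}\}$, then the skewness is equal to one and the results coincide with the standard guarantees in supervised learning.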
Theorem 4 and Lemma 4 also provide guidelines for choosing the domains and $\Lambda$. When $p$ is large and $\Lambda = \Delta_p$, the number of samples per domain could be small; the skewness parameter $\mathfrak{s}(\Lambda \,\|\, \overline{\mathbf{m}})$ would then be large and the generalization guarantees for the model would become weaker. We suggest some guidelines for choosing domains in Section 7. We further note that, for a given $p$, if $\Lambda$ contains only distributions that are close to $\overline{\mathbf{m}}$, then the model generalizes well.
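As a worked illustration of this remark, and assuming the skewness is defined as the maximum over $\lambda \in \Lambda$ of $\chi^2(\lambda \,\|\, \overline{\mathbf{m}}) + 1$ with $\overline{\mathbf{m}} = (m_1/m, \ldots, m_p/m)$ (notation assumed for this example), the case of the full simplex can be computed in closed form:

```latex
\begin{align*}
\chi^2(\lambda \,\|\, \overline{\mathbf{m}}) + 1
  &= \sum_{k=1}^{p} \frac{\lambda_k^2}{m_k / m}
  && \text{(expand the square; } \textstyle\sum_k \lambda_k = \sum_k m_k / m = 1\text{)} \\
\max_{\lambda \in \Delta_p} \sum_{k=1}^{p} \frac{\lambda_k^2}{m_k / m}
  &= \max_{1 \le k \le p} \frac{m}{m_k}
  && \text{(a convex function on the simplex is maximized at a vertex)} \\
  &= \frac{m}{\min_{k} m_k}.
\end{align*}
```

With the full simplex, the skewness is therefore governed by the smallest domain: a single domain with very few samples inflates it, which is consistent with the suggestion to group clients into larger domains.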
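The corollary above can be straightforwardly extended to cover the case where the test samples are drawn from some distribution $\mathcal{D}$, instead of $\mathcal{D}_\lambda$. Define $\ell_1(\mathcal{D}, \mathcal{D}_\Lambda)$ by $\ell_1(\mathcal{D}, \mathcal{D}_\Lambda) = \min_{\lambda \in \Lambda} \ell_1(\mathcal{D}, \mathcal{D}_\lambda)$. Then, the following result holds.
Assume that the loss function $\ell$ is bounded by $M$. Then, for any $\epsilon \ge 0$ and $\delta > 0$, with probability at least $1 - \delta$, the following inequality holds for all $h \in \mathcal{H}$:
$$\mathcal{L}_{\mathcal{D}}(h) \le \mathcal{L}_{\overline{\mathcal{D}}_\Lambda}(h) + 2 \mathfrak{R}_{\mathbf{m}}(\mathcal{G}, \Lambda) + M \, \ell_1(\mathcal{D}, \mathcal{D}_\Lambda) + M \epsilon + M \sqrt{\frac{\mathfrak{s}(\Lambda \,\|\, \overline{\mathbf{m}})}{2m} \log \frac{|\Lambda_\epsilon|}{\delta}}.$$
One straightforward choice of the parameter $\epsilon$ is $\epsilon = \frac{1}{\sqrt{m}}$, but, depending on $|\Lambda_\epsilon|$ and other terms of the bound, more favorable choices may be possible. We conclude this section by adding that alternative learning bounds can be derived for this problem, as discussed in Appendix A.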
5 Algorithm
In this section, we introduce a learning algorithm for agnostic federated learning using the guarantees proven in the previous section and discuss in detail an optimization solution.
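The learning guarantees of the previous section suggest minimizing the sum of the empirical AFL term $\mathcal{L}_{\overline{\mathcal{D}}_\Lambda}(h)$, a term controlling the complexity of $\mathcal{H}$, and a term depending on the skewness parameter. Observe that, since $\mathcal{L}_{\overline{\mathcal{D}}_\lambda}(h)$ is linear in $\lambda$, the following equality holds:
$$\mathcal{L}_{\overline{\mathcal{D}}_\Lambda}(h) = \mathcal{L}_{\overline{\mathcal{D}}_{\operatorname{conv}(\Lambda)}}(h), \qquad (5)$$
where $\operatorname{conv}(\Lambda)$ is the convex hull of $\Lambda$. Assume that $\mathcal{H}$ is a vector space that can be equipped with a norm $\|\cdot\|$, as with most hypothesis sets used in learning applications. Then, given $\Lambda$ and the regularization parameters, our learning guarantees suggest minimizing a regularized loss that adds to $\mathcal{L}_{\overline{\mathcal{D}}_\Lambda}(h)$ a term based on a suitable norm controlling the complexity of $\mathcal{H}$ and a term based on the skewness. This can be equivalently formulated as a minimization problem, problem (6), which involves a hyperparameter. This defines our algorithm for AFL.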
Assume that $\ell$ is a convex function of its first argument. Then, $\mathcal{L}_{\overline{\mathcal{D}}_\lambda}(h)$ is a convex function of $h$. Since $\|h\|$ is a convex function of $h$ for any choice of the norm, for a fixed $\lambda$, the objective is a convex function of $h$. The maximum over $\lambda$ (taken in any set) of a family of convex functions is convex. Thus, the objective maximized over $\lambda$ is a convex function of $h$ and, when the hypothesis set $\mathcal{H}$ is convex, (6) is a convex optimization problem. In the next subsection, we present an efficient optimization solution for this problem, for which we prove convergence guarantees.
5.2 Optimization algorithm
When the loss function $\ell$ is convex, the AFL minimax optimization problem above can be solved using projected gradient descent or other instances of the generic mirror descent algorithm (Nemirovski and Yudin, 1983). However, for large datasets, that is, when $p$ and $m$ are large, this can be computationally costly and typically slow in practice. Juditsky, Nemirovski, and Tauvel (2011) proposed a stochastic Mirror-Prox algorithm for solving stochastic variational inequalities, which would be applicable in our context. We present a simplified version of their algorithm for the AFL problem that admits a more straightforward analysis and that is also substantially easier to implement.
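Our optimization problem is over two sets of parameters, the hypothesis $h \in \mathcal{H}$ and the mixture weight $\lambda \in \Lambda$. In what follows, we will denote by $w \in \mathcal{W} \subset \mathbb{R}^N$ a vector of parameters defining a predictor $h$ and will rewrite losses and optimization solutions only in terms of $w$, instead of $h$. We will use the following notation:
$$\mathsf{L}(w, \lambda) = \sum_{k=1}^{p} \lambda_k \, \mathsf{L}_k(w), \qquad (7)$$
where $\mathsf{L}_k(w)$ stands for $\mathcal{L}_{\widehat{\mathcal{D}}_k}(h)$, the empirical loss of hypothesis $h \in \mathcal{H}$ (corresponding to $w$) on domain $k$:
$$\mathsf{L}_k(w) = \frac{1}{m_k} \sum_{i=1}^{m_k} \ell(h(x_{k,i}), y_{k,i}).$$
Since the regularization terms do not make the optimization problem harder, to simplify the discussion, we will consider the unregularized version of problem (6). Thus, we will study the following problem given by the set of variables $w$:
$$\min_{w \in \mathcal{W}} \max_{\lambda \in \Lambda} \mathsf{L}(w, \lambda). \qquad (8)$$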
Observe that problem (8) admits a natural game-theoretic interpretation as a two-player game, where nature selects $\lambda$ to maximize the objective, while the learner seeks $w$ minimizing the loss. We are interested in finding the equilibrium of this game, which is attained for some $w^*$, the minimizer of Equation 8, and $\lambda^*$, the hardest domain mixture weights. At the equilibrium, neither player can improve by deviating unilaterally: moving $w$ away from $w^*$ does not decrease the objective function, and moving $\lambda$ away from $\lambda^*$ does not increase it. Hence, $\lambda^*$ can be viewed as the center of $\Lambda$ in the manifold imposed by the loss function $\mathsf{L}$, whereas $\overline{\mathcal{U}}$, the empirical distribution of samples, may lie elsewhere, as illustrated by Figure 2.
By Equation (5), using the set $\operatorname{conv}(\Lambda)$ instead of $\Lambda$ does not affect the solution of the optimization problem. In view of that, in what follows, we will assume, without loss of generality, that $\Lambda$ is a convex set. Observe that, since the objective $\max_{\lambda \in \Lambda} \mathsf{L}(w, \lambda)$ is not an average of functions, standard stochastic gradient descent algorithms cannot be used to minimize this objective. We will present instead a new stochastic gradient-type algorithm for this problem.
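Let $\nabla_w \mathsf{L}(w, \lambda)$ denote the gradient of the loss function with respect to $w$ and $\nabla_\lambda \mathsf{L}(w, \lambda)$ the gradient with respect to $\lambda$. Let $\delta_w \mathsf{L}(w, \lambda)$ and $\delta_\lambda \mathsf{L}(w, \lambda)$ be unbiased estimates of the gradient, that is,
$$\mathbb{E}_\delta\big[ \delta_\lambda \mathsf{L}(w, \lambda) \big] = \nabla_\lambda \mathsf{L}(w, \lambda) \quad \text{and} \quad \mathbb{E}_\delta\big[ \delta_w \mathsf{L}(w, \lambda) \big] = \nabla_w \mathsf{L}(w, \lambda).$$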
We first give an optimization algorithm Stochastic-AFL for the AFL problem, assuming access to such unbiased estimates. The pseudocode of the algorithm is given in Figure 3. At each step, the algorithm computes a stochastic gradient with respect to $\lambda$ and $w$ and updates the model accordingly. It then projects $\lambda$ to $\Lambda$ by computing a value in $\Lambda$ via convex minimization. If $\Lambda$ is the full simplex, then there is a near-linear time algorithm for this projection (Wang and Carreira-Perpiñán, 2013). It then repeats the process for $T$ steps and returns the average of the weights. We provide guarantees for this algorithm in terms of the variance of the stochastic gradients when the loss function is convex and when the set of $w$s, $\mathcal{W}$, is a compact set.
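The procedure described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the pseudocode of Figure 3: it assumes a scalar parameter w, the squared loss 0.5·(w − z)² on scalar data points z, the full simplex for the mixture weights, constant step sizes, and a particular sampling scheme for the unbiased gradient estimates (a domain drawn according to the current mixture weight for the gradient in w, and a domain drawn uniformly for the gradient in the mixture weight). All names, step sizes, and data are hypothetical.

```python
import random

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex (sort-based method)."""
    u = sorted(v, reverse=True)
    css, theta = 0.0, 0.0
    for j, uj in enumerate(u, start=1):
        css += uj
        t = (css - 1.0) / j
        if uj - t > 0:
            theta = t  # threshold of the last index that stays positive
    return [max(x - theta, 0.0) for x in v]

def domain_loss(w, points):
    """Empirical loss of one domain: mean of 0.5 * (w - z)^2 over its points."""
    return sum(0.5 * (w - z) ** 2 for z in points) / len(points)

def stochastic_afl(domains, steps=20000, lr_w=0.01, lr_lam=0.001, seed=0):
    """Stochastic gradient descent in w and projected ascent in lam; returns averaged iterates."""
    rng = random.Random(seed)
    p = len(domains)
    w, lam = 0.0, [1.0 / p] * p
    w_sum, lam_sum = 0.0, [0.0] * p
    for _ in range(steps):
        # Unbiased gradient estimate in w: draw a domain k ~ lam, then one of its points.
        k = rng.choices(range(p), weights=lam)[0]
        grad_w = w - rng.choice(domains[k])        # gradient of 0.5 * (w - z)^2
        # Unbiased gradient estimate in lam: draw a domain uniformly, scale its loss by p.
        j = rng.randrange(p)
        grad_lam = [0.0] * p
        grad_lam[j] = p * 0.5 * (w - rng.choice(domains[j])) ** 2
        # Descent step in w; ascent step in lam followed by projection onto the simplex.
        w -= lr_w * grad_w
        lam = project_simplex([l + lr_lam * g for l, g in zip(lam, grad_lam)])
        w_sum += w
        lam_sum = [s + l for s, l in zip(lam_sum, lam)]
    return w_sum / steps, [s / steps for s in lam_sum]

# Hypothetical data: a large domain centered at 0 and a small domain centered at 4.
domains = [[-1.0, 1.0] * 45, [3.0, 5.0] * 5]
w_afl, lam_afl = stochastic_afl(domains)

all_points = [z for pts in domains for z in pts]
w_uniform = sum(all_points) / len(all_points)      # minimizer of the pooled (uniform) loss
worst = lambda w: max(domain_loss(w, pts) for pts in domains)
```

On this toy problem, the pooled solution sits close to the large domain, while the averaged AFL iterate moves toward the point where the two domain losses are balanced, which lowers the worst-case domain loss.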