Collaborative learning refers to the setting where several agents, each having access to their own personal training data, collaborate in hopes of learning a better model. In most real applications, the data distributions are different for each agent. Each agent therefore faces a key decision, in order to obtain the highest quality model: Should they ignore their peers and simply train a model on their local data alone—or should they collaborate to potentially benefit from additional training data from other agents?
This fundamental question is at the center of two areas that are currently of high interest in both research and industry applications, namely model personalization as well as federated and decentralized learning.
In federated learning (FL) Konečný et al. (2016), training data is distributed over several agents or locations. For instance, these could be several hospitals collaborating on a clinical trial, or billions of mobile phones involved in training a next-word prediction model for a virtual keyboard application. The purpose of FL is to train a global model on the union of all agents’ individual data. The training is coordinated by a central server, while the agents’ local data never leaves its device of origin. Owing to the data locality, FL has become the most prominent collaborative learning approach in recent years towards privacy-preserving machine learning Kairouz et al. (2019); Li et al. (2020). Decentralized learning refers to the analogous setting without a central server, where agents communicate peer-to-peer during training, see e.g. Nedic (2020).
When each agent has a different data distribution (and thus a different learning task), a “one model fits all” approach leads to poor accuracy on individual agents; see Kairouz et al. (2019, Section 3.1) for common types of violations of independence and identical distribution in FL. Instead, a given global model (such as from FL) needs to be personalized, e.g., by additional training on the local data of our agent. Prominent approaches to address this important problem of statistical heterogeneity are additional local training or fine-tuning Wang et al. (2019), or weighted averaging between a global model and a locally trained model, during or after training Mansour et al. (2020); Deng et al. (2020).
In this paper, we address weighted averaging of several models, asking: When can a weighted model average outperform an individual agent’s local model? How much can we gain by using it? How much can we lose? We aim to answer this while making only minimal assumptions on the data distributions.
In the context of FL, an answer to these questions helps users determine under what conditions FL should be preferred to independently training a local model. It is also of interest to the FL server to identify and potentially reward participants with high contributions to the model quality. In decentralized learning, it can be used by agents to select the most compatible peers during training Grimberg et al. (2020). The question also naturally extends to model personalization, as in Deng et al. (2020), who ask “when is personalization better?” and “what degree of personalization is best?” in the different context of model interpolation.
In this work, we analyse the linear combination of two models for arbitrary scalar mean estimation problems. Given a local empirical mean and some other empirical mean , we ask whether the weighted model average is a better estimator of the local true mean , than itself. For instance, could be a global model obtained through federated learning without the training data of .
We calculate the error of the weighted model average with respect to the local true mean, taking the expectation over and . We find the optimal weight to minimize this error, showing that the error of the optimally weighted average is reduced by a fraction (equal to the weight itself), compared to the error of .
In a variant of the bias-variance trade-off, we find that there is always some positive amount of model averaging that reduces the error compared to the local model (provided that has a non-zero variance). Recognizing that the optimal weight depends on quantities that are likely unknown to the experimenter, we quantify the error of a sub-optimally weighted model average with weight . We show that the error is lower than the local model’s if , and that even a small weight can lead to a relatively large improvement. On the other hand, the error is higher than the local model’s if . Thus, choosing an excessively large weight can be harmful in situations where . This could easily happen in practice, if is chosen based on the observed data in the presence of a large sampling bias.
We introduce our model in Section 3, showing how it relates to practical use cases and reflecting on the assumptions underlying our results. We prove our main results in Section 4 and visualize and discuss them in Section 5. Here, we also interpret weighted model averaging as a general form of shrinkage and explain how our results compare to recent related work. We conclude with a brief summary of our results and pointers to open problems in Section 6.
2 Related Work
A properly identified source of heterogeneity between the agents can sometimes be addressed by incorporating context information as additional features in the ML model: location, time, user demographic, etc. Several other solutions have been proposed to tackle unidentified sources of heterogeneity:
- Local fine-tuning,
- Multi-task learning and clustering,
- Gradient-based personalization methods,
- Joint local and global learning,
- Merging already trained models
Theoretical Analysis of Weighted Model Averaging.
Donahue and Kleinberg (2020) investigate FL from a game-theoretic point of view, to test whether self-interested players (i.e., agents) have an incentive to join an FL task. They analyze the formation of clusters (stable coalitions) of players in a linear regression task with specific assumptions on the data generation process. Once each player is allowed to use a weighted average of their local model with their cluster’s global model, the grand coalition (i.e., a federation of all agents) is weakly preferred by all players over their local model. Further, if players can weight each other player’s local model individually rather than selecting a single weight for their cluster’s model, the grand coalition becomes core stable, where no other coalition is preferred over the grand coalition by all players in .
Thus, weighted model averaging seems more promising than clustering-based approaches in the setting under consideration. We expand on this analysis of weighted model averaging, proving that the results about the optimal model averaging weight hold even under minimal assumptions on the data generation process. Their analysis is based on unrealistic assumptions of identical variance and similar means across players. In contrast, we allow arbitrary finite means and variances, only requiring a positive variance for the local model. In return, our analysis is limited to a one-dimensional mean estimation problem with two players (). As shown in Section 5.5, our results are equivalent to theirs when our respective assumptions are applied jointly.
The Bias-Variance Trade-Off.
Weighted model averaging is a form of bias-variance trade-off: It aims to leverage the vast quantity of data on the network to reduce the model’s generalization error, at the expense of increased bias if the local and global distributions match poorly. Theoretical research on the bias-variance trade-off dates back to the surprising results of James and Stein (1961), who constructed a biased empirical estimator that provably dominates the maximum likelihood estimator for a specific higher-dimensional mean estimation problem (cf. Section 5.4). Today, we would call this a shrinkage estimator.
Despite their long history, shrinkage estimators are still relevant and can be found in recent applications. For instance, Su et al. (2020) recently investigated how to shrink importance weights to reduce the mean squared error of doubly-robust estimators in off-policy evaluation. In Section 5.4, we show that our analysis of weighted model averaging includes shrinkage as a special case, when is set to a constant value, such as .
3 Model and Assumptions
To introduce the theoretical setting, we suppose that an agent has drawn independent samples from the random variable, and that wishes to estimate the unknown true mean . For instance, could be a clinical researcher interested in accurately estimating the effect size of a new treatment from the patients participating in a trial at their hospital.
Agent can construct an empirical estimator for based on its own samples. For instance, could use the local empirical mean from Definition 1, which is unbiased and consistent. Alternatively, agent can enlist the help of another agent, , who has calculated the empirical mean of samples drawn from a different random variable . This allows to estimate by a weighted average of and , denoted as , where is the weight of (Definition 2).
Definition 1 (Empirical mean).
Let , . We recall the empirical means and :
Definition 2 (Weighted average).
We define the weighted model average of the two means as:
Examples of Helper Agent .
In our example of a clinical trial, the helper agent could be another researcher conducting a trial of a similar treatment in a different location. Alternatively, could also be a global model obtained through federated learning by several other hospitals conducting treatment trials. As we do not make any assumptions on , it could also simply be an arbitrary constant (cf. Section 5.4). For instance, picking results in the shrinkage estimator .
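As a minimal sketch of Definitions 1 and 2, the snippet below computes two empirical means and their weighted average. The function names, the sample values, and the convention that `alpha` is the weight placed on the helper's mean are illustrative assumptions rather than notation taken from the paper.

```python
from statistics import fmean


def empirical_mean(samples):
    """Plain average of a list of scalar samples."""
    return fmean(samples)


def weighted_average(local_mean, helper_mean, alpha):
    """Convex combination of two means.

    alpha is the weight given to the helper's mean (assumed convention),
    so alpha = 0 returns the local mean and alpha = 1 the helper's mean.
    """
    return (1.0 - alpha) * local_mean + alpha * helper_mean


# Hypothetical data for the local agent and the helper agent.
local_samples = [2.0, 4.0, 6.0]
helper_samples = [1.0, 3.0]

x_bar = empirical_mean(local_samples)    # 4.0
y_bar = empirical_mean(helper_samples)   # 2.0

# Weighted model average with 25% weight on the helper.
averaged = weighted_average(x_bar, y_bar, alpha=0.25)

# Replacing the helper by the constant 0 shrinks the local mean towards 0.
shrunk = weighted_average(x_bar, 0.0, alpha=0.25)

print(averaged, shrunk)
```

With a constant helper, the same function acts as a shrinkage estimator, which is the special case mentioned at the end of the paragraph above.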
3.2 Problem Definition
We assume that wants to use the estimator whose expected squared error (ESE) with respect to is minimal. Optimality is therefore understood throughout this paper in terms of the ESE. Note that the estimator is biased for unless , see Definition 3.
Definition 3 (Expected squared error).
Recall that . By , we denote the ESE of w.r.t. .
This focus on the estimator’s ESE implies that is equally concerned about underestimating the mean as about overestimating it, which need not be the case in practice. Furthermore, by focusing on the squared error, we assume that prefers a high probability of incurring a relatively small error to an times lower probability of incurring an times larger error.
The task of reduces to selecting an optimal weight (Definition 4). Indeed, and can be expressed as and , respectively. The global model, i.e. the empirical mean over the union of and ’s samples, is obtained by selecting .
The global model is optimal if and have the same mean and variance, because it is the weight for which the variance of is minimal. However, a greater weight can be optimal (e.g., if and ), whereas a smaller weight is optimal if the true means of and are very dissimilar or if .
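The link between the global model and a particular weight can be checked numerically. The sketch below assumes that the weight refers to the helper's mean and that the global-model weight is the helper's share of the pooled samples; the function name and the data are hypothetical.

```python
from statistics import fmean


def global_model_weight(n_local, n_helper):
    """Share of the pooled samples that belong to the helper (assumed form)."""
    return n_helper / (n_local + n_helper)


# Hypothetical samples held by the local agent and by the helper.
local_samples = [2.0, 4.0, 6.0]
helper_samples = [1.0, 3.0]

alpha_global = global_model_weight(len(local_samples), len(helper_samples))

# Empirical mean over the union of both sample sets.
pooled_mean = fmean(local_samples + helper_samples)

# Weighted average of the two separate empirical means.
weighted = (1.0 - alpha_global) * fmean(local_samples) \
    + alpha_global * fmean(helper_samples)

print(alpha_global, pooled_mean, weighted)
```

Both routes give the same estimate, so pooling the data is one specific point on the range of weights that the agent can choose from.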
To quantify the optimal weight for any distributions and , we express the ESE of as a function of . We answer the questions: Which weight would we tell to use, if we had perfect knowledge of and ? Exactly how much smaller would the error be, relative to ? How large would the error be if used a different weight ?
Definition 4 (Optimal model averaging weight).
We denote the weight which minimizes the ESE of w.r.t. :
3.3 Assumptions on and
We make no assumptions on the data distributions and —not even that they are related. However, our results in their strongest form depend on being non-zero, and on and having a finite mean and variance. Instead of making specific assumptions on and , we express our results as functions of the unknown quantities , , and . This approach allows us to quantify the potential of weighted model averaging exactly, for any distributions and (where could be the union of all other agents’ data sets in FL). On the other hand, since our results depend on unknown quantities, they cannot be used directly in empirical estimation problems. While it is certain that no empirical estimator based on the weighted averaging of and can do better than the optimal estimator , it is much less clear that this bound can actually be attained in practice.
4 Theoretical Results
In Theorem 1, we find an analytic expression for the optimal model weight . We conclude in Corollary 1.1, that the local empirical mean is dominated unless is deterministic. Subsequently, we give the ESE of the weighted average for any in Theorem 2, finding its minimum in Corollary 2.1 and its maximum in Corollary 2.4.
4.1 Prelude: Estimator Properties
Let , , , and . We recall the properties of the empirical means and :
Recall the nomenclature of Lemma 1. The weighted average has the following properties:
We compute the ESE of the empirical means:
We find the ESE of analytically below. It is convex w.r.t. . If , then it is strictly convex.
4.2 Optimal Model Averaging Weight
In Section 3.2, we defined the goal of agent as selecting an optimal weight to minimize the ESE of w.r.t. . We prove the existence and (under minimal assumptions) uniqueness of in Theorem 1, where we also find its analytical form. This result is illustrated in Figure 2. In Corollary 1.1, we conclude that, unless , the empirical mean is always dominated by some weighted average with . Finally, in Corollary 1.2 we prove an upper bound on that is simpler than Theorem 1.
Proof of Theorem 1.
Recall from Lemma 4 that the error is convex in , from which we conclude that it admits at least one global minimum. If , then the error vanishes irrespective of (by Lemma 4). Thus, any minimizes the error, including all . Otherwise, the error is strictly convex (by Lemma 4) and is uniquely minimized by s.t. . Thus, we find:
In the last line above, we have made use of the assumption that , which implies that or is strictly positive.
To prove that , we rewrite the last equation to obtain (using the non-negativity of the ESE):
Using Theorem 1, we can ask exactly when the empirical mean is optimal, leading to the following corollary:
Let the means and variances of and be finite, and let . Then, the empirical mean is dominated by some with . Formally:
Proof of Corollary 1.1.
Finally, we can prove simpler bounds on to improve our intuition: is only high when the local model has a lot of variance to trade away ( is large) and when has a low variance and is not unreasonably biased ( and are small):
Let . Then:
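The following sketch illustrates the optimal weight and the simpler bound numerically. It assumes that the two empirical means are independent, so that the ESE of the weighted average splits into the local variance term and the helper's variance plus squared mean gap. The closed form, the bound, and all names and numbers are written in assumed notation and are meant as an illustration, not as a restatement of the theorem.

```python
def ese(alpha, var_local_mean, var_helper_mean, mean_gap):
    """ESE of (1 - alpha) * x_bar + alpha * y_bar w.r.t. the local true mean.

    Assumes independent x_bar and y_bar; mean_gap is E[Y] - E[X].
    """
    own_part = (1.0 - alpha) ** 2 * var_local_mean
    helper_part = alpha ** 2 * (var_helper_mean + mean_gap ** 2)
    return own_part + helper_part


def optimal_weight(var_local_mean, var_helper_mean, mean_gap):
    """Minimizer of the quadratic above (requires var_local_mean > 0)."""
    return var_local_mean / (var_local_mean + var_helper_mean + mean_gap ** 2)


# Hypothetical scenario: noisy local mean, helper with a sizeable mean gap.
var_local_mean, var_helper_mean, mean_gap = 0.1, 0.5, 1.0

alpha_star = optimal_weight(var_local_mean, var_helper_mean, mean_gap)

# Cross-check the closed form against a brute-force search on a fine grid.
grid = [i / 10_000 for i in range(10_001)]
alpha_grid = min(grid, key=lambda a: ese(a, var_local_mean, var_helper_mean, mean_gap))

# Simple upper bound: the lower of two ratios that both involve the local variance.
simple_bound = min(var_local_mean / var_helper_mean, var_local_mean / mean_gap ** 2)

# Same mean and same variance per sample (n = 3 local, m = 2 helper samples):
# the optimum coincides with the pooled-data weight m / (n + m).
equal_case = optimal_weight(1.0 / 3, 1.0 / 2, 0.0)

print(alpha_star, alpha_grid, simple_bound, equal_case)
```

In this scenario the optimal weight is small because the helper is both noisy and biased, and the bound is driven by the mean gap, the smaller of the two ratios.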
4.3 Expected Squared Error
Armed with a formula for , we now investigate the effect of (optimally) weighted model averaging on the ESE. Theorem 2, which is illustrated in Figure 1 (Section 5.1), relates the error of any to that of . In Corollaries 2.3, 2.2 and 2.1, we calculate the ESE of , , and . Finally, we prove in Corollary 2.4 that the error of is bounded by and .
If is unique and positive (i.e., if ), then we can express as:
Proof of Theorem 2.
We use our assumption () to express as a function of :
Inserting into Lemma 4 yields:
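As a worked sketch of this substitution, suppose the two empirical means are independent and write $A = \mathrm{Var}(\bar X)$ and $B = \mathrm{Var}(\bar Y) + \delta^2$, where $\delta$ is the gap between the two true means. The symbols $\bar X$, $\bar Y$, $\alpha$, $\alpha^\star$, $\delta$, $A$ and $B$ are assumed notation introduced only for this illustration.

```latex
\begin{align*}
\mathrm{ESE}(\alpha) &= (1-\alpha)^2 A + \alpha^2 B, \\
\alpha^\star = \frac{A}{A+B}
  \quad &\Longrightarrow \quad
  B = A\,\frac{1-\alpha^\star}{\alpha^\star}, \\
\mathrm{ESE}(\alpha)
  &= A\left[(1-\alpha)^2 + \alpha^2\,\frac{1-\alpha^\star}{\alpha^\star}\right]
   = \mathrm{ESE}(0)\left[1 - 2\alpha + \frac{\alpha^2}{\alpha^\star}\right].
\end{align*}
```

Under these assumptions, the bracket equals $1-\alpha^\star$ at $\alpha=\alpha^\star$ and returns to $1$ at $\alpha = 2\alpha^\star$, which agrees with the statements that the optimal weight reduces the error by a fraction equal to the weight itself and that a weight of twice the optimum only matches the local model.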
Inserting into Theorem 2, we obtain the minimum error of weighted model averaging:
Estimating by instead of leads to a reduction in ESE if . Formally:
Proof of Corollary 2.1.
If , then the ESE of (w.r.t. ) is equal to that of . Formally:
If , then the ESE of is given by:
As a counterpart to the global lower bound on found in Corollary 2.1, we prove a global upper bound below:
The ESE of is bounded above by:
Proof of Corollary 2.4.
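The behaviour described by these corollaries can be explored with a few lines of code. The sketch below assumes that the ESE of the weighted average, relative to that of the local mean, takes the form `1 - 2*alpha + alpha**2 / alpha_star`, which follows from the standard bias-variance form under independent empirical means. The function name and the chosen optimal weight are hypothetical.

```python
def ese_ratio(alpha, alpha_star):
    """ESE of the weighted average divided by the ESE of the local mean.

    Assumed form; alpha_star is the optimal weight and must be positive.
    """
    return 1.0 - 2.0 * alpha + alpha ** 2 / alpha_star


alpha_star = 0.25  # hypothetical optimal weight

at_optimum = ese_ratio(alpha_star, alpha_star)         # error reduced by alpha_star
at_half = ese_ratio(alpha_star / 2, alpha_star)        # under-weighting still helps
at_break_even = ese_ratio(2 * alpha_star, alpha_star)  # same error as the local mean
at_one = ese_ratio(1.0, alpha_star)                    # helper's mean used alone

# The ratio is convex in alpha, so on [0, 1] it is largest at an endpoint.
worst_on_unit_interval = max(ese_ratio(0.0, alpha_star), at_one)

print(at_optimum, at_half, at_break_even, at_one, worst_on_unit_interval)
```

With an optimal weight below one half, relying entirely on the helper is worse than staying local, which is the situation in which an overly large weight becomes harmful.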
5 Implications of our Findings
Throughout this section, we make the assumption that . Indeed, if had zero variance, the local empirical mean would neither need, nor permit, any further improvement.
We summarize our main results about the ESE of weighted model averaging in Section 5.1, and those about the optimal weight of in Section 5.2. Then, we present numerical examples in Section 5.3 to illustrate the various dependencies of . In Section 5.4, we show that model averaging generalizes shrinkage estimators. Finally, in Section 5.5, we discuss the similarities and differences between our results and those obtained by Donahue and Kleinberg (2020).
5.1 Quality of the Weighted Model Average
We compute the ESE of the weighted average , for any distributions and and for any weight , in Lemma 4. In Theorem 2, we express it as a function of the weight , the optimal weight , and (the error of the local empirical mean ). This dependency is illustrated in Figure 1.
In Corollary 2.1, we conclude that can have an ESE lower than that of by up to a fraction , and no lower. However, the optimal weight is a function of the (unknown) means and variances of and . It is unclear whether it is possible to achieve the same ESE with a purely empirical model average, whose weight does not depend on unknown distribution parameters.
Nevertheless, even a sub-optimally weighted model average can significantly improve upon the local model if is not too small. Indeed, Figure 1 illustrates that using with reduces the error by more than a fraction of , and that any has . Specifically, all yield an improvement over if (because this implies ).
What Figure 1 cannot show, however, is the tight upper bound to proven in Corollary 2.4. For , is upper-bounded by . Otherwise, it is upper-bounded by , which is related to and by Corollary 2.3. Thus, this upper bound can be added to the figure for a given by drawing a vertical line at . If , then the vertical line crosses at a value below . Corollary 2.3 shows that the error can be increased substantially over that of , especially if is very close to .
5.2 Optimal Model Averaging Weight
In Section 5.1, we show that is an upper bound to the potential benefits of weighted model averaging, and that a small value of opens up the possibility of doing worse than . This central role motivates us to visualize and its dependency on the relevant parameters. In Figure 2, the contour lines help appreciate Corollary 1.2, which states that is bounded above by the lower of the two terms: (on the horizontal axis), and (on the vertical axis).
Indeed, the contour lines are approximately horizontal when (bottom right corner of Figure 2), and vertical when (top left corner). What’s more, the contour lines for and approximately asymptotically approach the and -grid-lines respectively, as the distance from the diagonal increases. Thus, the bound in Corollary 1.2 is approximately tight as the relative difference between and increases. This approximation is particularly good if the bound is much smaller than , and falls apart of course when it is greater than —as exemplified by the contour line at .
As a consequence of this bound, increasing yields diminishing returns: Doubling almost doubles when both and are much smaller than , but is barely affected if one of these terms is much greater than . If the variance of is much greater than that of , however, may already be close to . In this case, a further increase of may no longer produce a big change in , but it may still reduce quite drastically by lowering the term .
Importantly, both terms in Corollary 1.2 feature the variance of (which equals ). Naturally, as the local estimator grows more accurate with increasing , the need for collaboration vanishes along with its potential benefits. Moreover, even for small , model averaging is useful only if is not too small compared to and .
5.3 Numerical Examples
To give some intuition of its dependencies, we calculate for a variety of scenarios in Table 1. We also use the value of to calculate the relative size of the error for three estimators, compared to the error of :
the optimally weighted model average,
the average with a weight of 20% (i.e., ),
the average with a weight of 50% (i.e., ).
We describe each scenario by four easily interpretable quantities (Table 1, left of the vertical bar):
the squared bias of w.r.t. , relative to ,
the number of samples drawn from ,
the variance of , relative to that of ,
the number of samples drawn from , relative to .
However, several configurations of these four variables are equivalent because is fully determined by the terms and (cf. Equation 2). This is why the second, fourth, sixth, and twelfth () rows of Table 1 are equivalent to their respective predecessor.
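The equivalence between configurations can be reproduced with a short sketch. It assumes independent empirical means and describes a scenario by the four quantities listed above; under that assumption the optimal weight depends only on the product of the first two quantities and on the ratio of the last two. The scenarios below are hypothetical and are not rows of Table 1.

```python
def optimal_weight_from_scenario(rel_sq_bias, n_local, var_ratio, sample_ratio):
    """Optimal weight for a scenario described by four relative quantities.

    rel_sq_bias  : squared mean gap divided by the local variance
    n_local      : number of local samples
    var_ratio    : helper variance divided by local variance
    sample_ratio : number of helper samples divided by n_local
    Assumed closed form: 1 / (1 + var_ratio / sample_ratio + rel_sq_bias * n_local).
    """
    return 1.0 / (1.0 + var_ratio / sample_ratio + rel_sq_bias * n_local)


def relative_error(alpha, alpha_star):
    """ESE of the weighted average relative to the local mean (assumed form)."""
    return 1.0 - 2.0 * alpha + alpha ** 2 / alpha_star


base = optimal_weight_from_scenario(0.25, 4, 1.0, 1.0)
# Scaling the variance ratio and the sample ratio by the same constant changes nothing.
scaled = optimal_weight_from_scenario(0.25, 4, 2.0, 2.0)
# Only the product of relative squared bias and local sample size matters.
same_product = optimal_weight_from_scenario(0.5, 2, 1.0, 1.0)

# Relative errors of the three estimators compared in this section.
err_optimal = relative_error(base, base)
err_20 = relative_error(0.2, base)
err_50 = relative_error(0.5, base)

print(base, scaled, same_product, err_optimal, err_20, err_50)
```

All three scenarios share the same optimal weight, so they would also share the same relative errors for any fixed weight.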
The second row demonstrates that setting is equivalent to setting . The fourth and sixth row show that the multiplication of with a constant can be compensated by multiplying with the same constant. Furthermore, the third row of Table 1 embodies the diminishing returns phenomenon described in Section 5.2: raising until already takes most of the way towards its limit for . Finally, the eleventh and twelfth rows (from to ) exemplify the connection between the first two columns, which is that only depends on their product. This also explains the equivalence of the last two rows of Table 1. It also reflects Corollary 1.2, as the influence of is largest when is small, and becomes vanishingly small when the product exceeds (see bottom rows). Another interpretation is that, when (i.e., ), then using as a proxy for is exactly as good as using a single realization of (in terms of ESE).

Footnote 2: Indeed, after drawing one sample , we find:
5.4 Relation to the James-Stein Estimator
Weighted model averaging generalizes the fundamental statistical notion of shrinkage. Conceptually, shrinkage (towards ) simply corresponds to multiplying an empirical estimator with a weight . This reduces the estimator’s variance at the cost of additional bias, thus reducing its ESE if is sufficiently close to Gruber (2017). For , we recover the case of estimator shrinkage towards an arbitrary anchor value (). If we pick , then is simply a shrunken version of : .
We find that the optimal amount of shrinkage is never quite for any combination of finite means, variances, and number of samples. However, we see in Table 1 that it quickly tends to as we shrink towards an increasingly unrelated value (with ), and as the quality of the local estimator increases (with ).
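A small sketch makes this limiting behaviour concrete. It treats the helper as a constant anchor with zero variance, so that under the assumed closed form the optimal weight on the anchor is the variance of the local mean divided by that variance plus the squared distance between the anchor and the true mean. Names and numbers are hypothetical.

```python
def optimal_shrinkage_weight(var_x, n_local, anchor_gap):
    """Optimal weight on a constant anchor (assumed closed form).

    var_x      : variance of a single local sample (must be positive)
    n_local    : number of local samples
    anchor_gap : distance between the anchor and the local true mean
    """
    var_local_mean = var_x / n_local
    return var_local_mean / (var_local_mean + anchor_gap ** 2)


near_anchor = optimal_shrinkage_weight(1.0, 4, 0.5)    # anchor close to the true mean
far_anchor = optimal_shrinkage_weight(1.0, 4, 2.0)     # increasingly unrelated anchor
many_samples = optimal_shrinkage_weight(1.0, 64, 0.5)  # more accurate local estimator

print(near_anchor, far_anchor, many_samples)
```

The weight stays strictly positive in every case but drops quickly once the anchor moves away from the true mean or the local sample grows.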
The first example of shrinkage is the infamous James-Stein (JS) estimator, which dominates the empirical mean for the problem of estimating from just one observation the mean of a spherically symmetrical multivariate normal random variable (RV) with three or more dimensions Gruber (2017); James and Stein (1961). While the JS estimator can be used without prior knowledge of the mean and variance of the RV, it only dominates the empirical mean under the assumptions that all coordinates of the RV (a) have the same variance, and that they are (b) normal and (c) mutually uncorrelated. It is also restricted to estimation in at least three dimensions. By contrast, our results hold for arbitrary one-dimensional estimation problems, with minimal assumptions on and . However, we express them as functions of statistics of and that would not be available in a practical setting.
5.5 Connection to Recent Related Work
Donahue and Kleinberg (2020) prove corresponding results to our Corollaries 2.1, 1.1 and 1 in their Lemmas 6.1 and 6.3, under comparatively more complex assumptions. Our model is simpler in three ways, making it more general for some aspects and more specific for others.
The most important difference is that our assumptions on the input data are minimal and significantly more realistic, as we assume nothing more than finite means and variances, and a non-zero variance for . In contrast, Donahue and Kleinberg (2020) assume that and are drawn independently from the same random variable with variance , and that and have the same variance ().
Secondly, we consider only two nodes (), whereas their analysis includes an arbitrary number of players to study clustering approaches to model personalization. Nevertheless, our two-node model reproduces their coarse-grained federation model (Section 6 in their paper) by setting: . Finally, our analysis is confined to one-dimensional linear regression () compared to theirs in dimensions.
We will demonstrate that these corresponding results become equivalent in the special case when both respective classes of assumptions are applied jointly.
Under the assumption that , denoting by and by , Lemma 6.3 from Donahue and Kleinberg (2020) simplifies to:
Lemma 5.2 is obtained with the substitutions:
6 Conclusion
In this work, we have quantified the improvement of quality (reduction in expected squared error) that can be achieved through weighted model averaging with peers, as compared to using exclusively local data. Our results concern the mean estimation problem for any scalar random variable , and include the case of shrinkage towards a fixed value. While we limit our analysis to averaging between two models and , our results apply to federated learning by letting denote the union of all other agents. We derive the optimal averaging weight for the helper model , proving that it is strictly positive unless . Through examples, we explain that this holds true even for an arbitrarily unsuitable helper model (thus for arbitrary non-IID client data), though the optimal weight converges towards as the difference between the expectations of and increases.
Further, we show that model averaging with the optimal weight reduces the ESE by a fraction equal to , compared to the local model . We then analyse the ESE of the weighted model average, as a function of only the optimal weight and the weight that is actually used. We find that any weight reduces the ESE compared to , by more than a fraction of if , and that any weight of results in a larger ESE, compared to .
Motivated by its central role, we visualize and investigate the dependencies of the optimal model averaging weight . It depends mainly on the lower of two ratios, which both involve the variance of the local estimator, . Thus, can only be large if is significantly larger than the variance of the helper model and the squared difference between the expectations of and .
We prove theorems under realistic assumptions about the distributions of and , in the limited setting of one-dimensional parameter estimation with two agents. Future work could extend our results to multivariate parameter estimation and to an arbitrary number of agents. Further, in addition to interpreting it as shrinkage, we could apply shrinkage to the weighted model average, thus obtaining , where .
Finally, whether our theoretical lower bound on the ESE of the weighted model average can be achieved without perfect knowledge of the bias and variances of and merits investigation. It would thus be interesting to quantify the error reduction that can be achieved in practice, potentially under specific assumptions on and .
References
- Deng et al. (2020). Adaptive Personalized Federated Learning. arXiv:2003.13461 [cs, stat].
- Donahue and Kleinberg (2020). Model-sharing games: analyzing federated learning under voluntary participation. arXiv preprint arXiv:2010.00753.
- Fallah et al. (2020). Personalized federated learning: a meta-learning approach. arXiv preprint arXiv:2002.07948.
- Grimberg et al. (2020). Weight erosion: an update aggregation scheme for personalized collaborative machine learning. In Domain Adaptation and Representation Transfer, and Distributed and Collaborative Learning, pp. 160–169.
- Gruber (2017). Improving efficiency by shrinkage: the James–Stein and ridge regression estimators. Routledge.
- James and Stein (1961). Estimation with quadratic loss. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Contributions to the Theory of Statistics, Berkeley, Calif., pp. 361–379.
- Kairouz et al. (2019). Advances and Open Problems in Federated Learning. arXiv:1912.04977 [cs, stat].
- Konečný et al. (2016). Federated Optimization: Distributed Machine Learning for On-Device Intelligence. arXiv:1610.02527 [cs].
- Li et al. (2020). Federated Learning: Challenges, Methods, and Future Directions. IEEE Signal Processing Magazine 37 (3), pp. 50–60.
- Lin et al. (2020). Ensemble distillation for robust model fusion in federated learning. NeurIPS - Advances in Neural Information Processing Systems 33.
- Mansour et al. (2020). Three Approaches for Personalization with Applications to Federated Learning. arXiv:2002.10619 [cs, stat].
- Nedic (2020). Distributed gradient methods for convex machine learning problems in networks: distributed optimization. IEEE Signal Processing Magazine 37 (3), pp. 92–101.
- Singh and Jaggi (2020). Model fusion via optimal transport. NeurIPS - Advances in Neural Information Processing Systems 33.
- Su et al. (2020). Doubly robust off-policy evaluation with shrinkage. In Proceedings of the 37th International Conference on Machine Learning, H. D. III and A. Singh (Eds.), Proceedings of Machine Learning Research, Vol. 119, Virtual, pp. 9167–9176.
- Wang et al. (2019). Federated evaluation of on-device personalization. arXiv preprint arXiv:1910.10252.
Appendix A Lemmas in Section 4
Proof of Lemma 1.
The proofs for apply analogously to . The unbiasedness of the empirical mean () follows from the linearity property of the expectation. The variance of follows from the independence of the realizations :
Proof of Lemma 2.
The expectation of follows from the linearity property of the expectation. Its variance also results from the properties of the variance:
The result follows from Lemma 1. ∎
Proof of Lemma 3.
To compute the ESE of the empirical means, we make use of the bias-variance decomposition:
Plugging in , and , we obtain:
The result follows from Lemma 1. ∎
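The bias-variance decomposition used in the proof of Lemma 3 can be sanity-checked on simulated estimates. The sketch below is only an illustration: the simulated estimator, its parameters and the seed are hypothetical, and the identity is checked for the empirical distribution of the draws, where it holds up to floating-point error.

```python
import random
from statistics import fmean, pvariance

random.seed(0)

true_mean = 1.0
# Hypothetical biased estimator: draws centred at 1.2 instead of 1.0.
estimates = [random.gauss(1.2, 0.5) for _ in range(10_000)]

# Left-hand side: mean squared error w.r.t. the true mean.
mean_squared_error = fmean([(e - true_mean) ** 2 for e in estimates])

# Right-hand side: variance of the estimates plus their squared bias.
variance = pvariance(estimates)
squared_bias = (fmean(estimates) - true_mean) ** 2

print(mean_squared_error, variance + squared_bias)
```

Because the population variance is taken around the empirical mean, the two printed numbers agree for any list of estimates, not only for this seed.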