# Linear Speedup in Personalized Collaborative Learning

Personalization in federated learning can improve the accuracy of a model for a user by trading off the model's bias (introduced by using data from other users who are potentially different) against its variance (due to the limited amount of data on any single user). In order to develop training algorithms that optimally balance this trade-off, it is necessary to extend our theoretical foundations. In this work, we formalize the personalized collaborative learning problem as stochastic optimization of a user's objective f_0(x) while given access to N related but different objectives of other users {f_1(x), …, f_N(x)}. We give convergence guarantees for two algorithms in this setting (a popular personalization method known as weighted gradient averaging, and a novel bias correction method) and explore conditions under which we can optimally trade off their bias for a reduction in variance and achieve linear speedup w.r.t. the number of users N. Further, we empirically study their performance, confirming our theoretical insights.


## 1 Introduction

In collaborative learning, whether federated, distributed or decentralized, a single user typically has limited data available to them. Instead, they join multiple (say $N$) other users to train a machine learning model on their combined datasets (Kairouz et al., 2019). This vastly increases the amount of data available for training. However, the other users may be heterogeneous, i.e. they may have datasets and objectives which do not match those of the considered user. Combining data from such heterogeneous users can significantly hamper performance, with even worse performance than when training alone (Yu et al., 2020).

Training alone and training on combined data represent two extremes, with the former having no bias but high variance and the latter having low variance but high bias. Alternatively, personalized collaborative learning algorithms (Wang et al., 2019; Mansour et al., 2020) attempt to find ‘in between’ models which trade off some bias against variance. In the best case, we can use the data from the other users to reduce our variance by a factor of $N$ (called linear speedup) while simultaneously not incurring any bias. In this work, we explore under what conditions personalized collaborative learning algorithms can achieve such an optimal linear speedup.

We study this question under the framework of stochastic optimization given its success in machine learning (Bottou et al., 2018). We formulate our goal as optimizing a fixed user’s stochastic function $f_0(x)$, while also given access to stochastic gradients of $N$ other collaborators $\{f_1(x), \dots, f_N(x)\}$. We start with the simple strategy of weighted gradient averaging, which uses a weighted average of the gradient estimates as a pseudo-gradient and then takes an SGD step. We show that while there do exist scenarios where this simple strategy suffices, it can also incur significant bias introduced by the collaborators. This then motivates our main method of bias correction, which uses the past observed gradients to estimate and correct for these biases.

Contributions. Our main contributions include:

• We formalize the collaborative stochastic optimization problem where a user is required to minimize their objective by collaborating with other users.

• We prove convergence of weighted gradient averaging and present and analyze a novel bias correction algorithm.

• We show that under optimal choices of the hyper-parameters, bias correction enjoys a linear speedup with variance reducing as collaborators increase.

## 2 Related Work

Federated Learning (FL). FL (Konecny et al., 2016; McMahan et al., 2017; Mohri et al., 2019) denotes a machine learning setting where training data is distributed over multiple users (also called agents or clients). These users form a ‘federation’ to train a global model on the union of all users’ data. The training is coordinated by a central server, and the users’ local data never leaves its device of origin. Owing to data locality and growing privacy constraints, FL has become prominent for privacy-preserving machine learning (Kairouz et al., 2019; Li et al., 2020a; Wang et al., 2021). Decentralized learning refers to the analogous setting without a central server, where users communicate peer-to-peer during training, see e.g. (Nedic, 2020).

Personalization. Due to device heterogeneity and data heterogeneity, a ‘one model fits all’ approach leads to poor accuracy on individual users. Instead, we need to learn personalized models for each user. Prominent approaches for this include performing additional local adaptation or fine-tuning (Wang et al., 2019; Fallah et al., 2020), or weighted averaging between a global model and a locally trained model (Mansour et al., 2020; Deng et al., 2020). Collins et al. (2020) and Khodak et al. (2019) investigate how such local fine-tuning can improve performance in some simple settings if the users’ optima are close to each other. In another highly relevant line of work, Maurer et al. (2016), Tripuraneni et al. (2020), Koshy Thekumparampil et al. (2021) and Feng et al. (2021) show how a shared representation can be leveraged to perform efficient transfer of knowledge between different tasks (and users). Li et al. (2020b), Mohri et al. (2019) and Yu et al. (2020) investigate how FL distributes the accuracy across users and show that personalization gives a more equitable distribution. We refer to (Kulkarni et al., 2020) for a broader survey of personalization methods.

Unlike most of the above works, we consider the perspective of a single user for the sake of simplicity. Further, while our weighted gradient averaging is closely related to weighted model averaging, the bias correction method is novel and is directly motivated by our theory. Finally, while several of the above works (e.g. Mansour et al., 2020; Deng et al., 2020) also provide theoretical guarantees, they use a statistical learning theory viewpoint whereas we use a stochastic optimization lens.

Perhaps the works closest to ours are (Donahue and Kleinberg, 2020) and (Grimberg et al., 2021), both of which study model averaging. The former uses game theory to investigate whether self-interested players have an incentive to join an FL task. This is true as long as users achieve significantly better performance when training together than when training alone. Their work further highlights the importance of understanding when personalization can improve performance. More recently, Grimberg et al. (2021) consider a weighted model averaging of two users for mean estimation in one dimension. Both these works study only toy settings with restrictive assumptions. Our results are more general and include non-convex optimization.

In the personalized optimization setting, two very recent empirical works propose rules to learn collaboration weights. Beaussart et al. (2021) modify Scaffold (Karimireddy et al., 2019) to use Euclidean distances of the updates between different agents to derive a heuristic for weight definition. It uses both local and global control variates, though without a decay mechanism. Zhang et al. (2021), on the other hand, use an idea from meta learning to learn the collaboration weights, by using a first-order approximation of the objective with respect to these weights. While demonstrating practical performance on deep learning tasks, neither of the two methods comes with convergence guarantees. Our approach, in contrast, chooses collaboration weights so as to optimize the upper bounds for provable convergence.

## 3 Setup and Assumptions

In this section, we formalize personalized collaborative optimization and discuss our assumptions.

### 3.1 Personalized Collaborative Stochastic Optimization

We model collaborative optimization as an environment where $N+1$ users denoted $0, \dots, N$ can communicate with each other. Each user $i$ has only access to its own objective $f_i(x) := \mathbb{E}_{\xi^{(i)}}\big[f_i(x; \xi^{(i)})\big]$ (e.g. a loss function evaluated on their own data), where $\xi^{(i)}$ is a random variable from which we can sample without necessarily knowing its distribution (this covers the online optimization setting as well as optimizing over finite training data sets). The users can collaborate with each other by sharing (stochastic) gradients that they compute on their private loss function $f_i$ on a shared input parameter $x$.

We formalize the personalized collaborative stochastic optimization problem as solving for user $0$’s goal:

$$\min_{x \in \mathbb{R}^d} f_0(x), \tag{1}$$

by exchanging gradients with the other users. This exchange of information between the main user ‘0’ and their collaborators can be done in many ways. In this work, to solve problem (1), user $0$ updates their state $x_t$ by using different variants of a gradient estimate $g(x_t)$ and step size $\eta_t$:

$$x_{t+1} = x_t - \eta_t\, g(x_t). \tag{2}$$

As illustrated in Algorithm 1, each collaborator $i$ computes an unbiased local gradient estimate $g_i(x_t)$ of $\nabla f_i$ at $x_t$, and shares it with the main user $0$. Using these helper gradients as well as its own gradient, user $0$ then forms the final $g(x_t)$ and takes an update step.

The simplest baseline to consider (henceforth called the ‘Alone’ method) is the case where user $0$ ignores the collaborators and decides to work alone by setting $g(x_t) = g_0(x_t)$. In general, $g(x_t)$ can be formed in several different ways, using current gradients as well as past gradients.
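The update rule (2) with the ‘Alone’ baseline can be sketched in a few lines of Python. This is a minimal illustration rather than the paper's Algorithm 1: the quadratic objective, the Gaussian gradient noise, and the names `run_sgd` and `make_alone_estimate` are assumptions introduced here.

```python
import numpy as np

def run_sgd(grad_estimate, x0, step_size, num_steps):
    """Iterate x_{t+1} = x_t - eta * g(x_t) with a constant step size eta."""
    x = np.asarray(x0, dtype=float)
    for _ in range(num_steps):
        x = x - step_size * grad_estimate(x)
    return x

def make_alone_estimate(target, noise_std, rng):
    """'Alone' baseline g(x) = g_0(x) for the assumed toy objective
    f_0(x) = 0.5 * ||x - target||^2 with additive Gaussian gradient noise."""
    def g0(x):
        return (x - target) + noise_std * rng.standard_normal(x.shape)
    return g0

rng = np.random.default_rng(0)
target = np.array([1.0, -2.0])
x_final = run_sgd(make_alone_estimate(target, 0.1, rng), np.zeros(2), 0.1, 500)
```

Any other way of forming the pseudo-gradient can be plugged in through the `grad_estimate` argument without changing the loop.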

### 3.2 Assumptions

We study algorithms that aggregate gradients in a linear fashion. This allows us to reduce the problem of collaboration with $N$ users to collaboration with only a single (virtual) user ‘avg’, who represents the average of the collaborators and can access gradients of the average loss $f_{avg} := \frac{1}{N}\sum_{i=1}^{N} f_i$. The uniform collaboration weights $1/N$ were chosen for simplicity of exposition and do not incur any loss of generality. (Footnote 1: In general, the relative weight of user $i$ would depend on $\sigma_i^2$, the variance of the gradient estimates of $f_i$. Here we suppose for simplicity that all users have the same variance $\sigma_0^2$, which leads to the weights $1/N$.)

Notation. For each user $i$, we denote by $x_i^\star$ a stationary point of $f_i$, and by $f_i^\star$ its corresponding value. We denote by $n_i(x, \xi^{(i)}) := g_i(x) - \nabla f_i(x)$ the gradient noise and by $n_{avg} := \frac{1}{N}\sum_{i=1}^{N} n_i$ the average gradient noise.

We will use the following common assumptions:

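A1 (Smoothness) $f_0$ is $L$-smooth, i.e., $\forall x, y \in \mathbb{R}^d$:

$$f_0(y) \le f_0(x) + \nabla f_0(x)^\top (y - x) + \frac{L}{2}\|y - x\|^2.$$

A2 ($\mu$-PL) $f_0$ satisfies the $\mu$-PL condition, i.e.:

$$\forall x \in \mathbb{R}^d: \quad \|\nabla f_0(x)\|^2 \ge 2\mu\big(f_0(x) - f_0^\star\big).$$

A3 ($\delta$-Bounded Hessian Dissimilarity) or $\delta$-BHD:

$$\forall x \in \mathbb{R}^d: \quad \|\nabla^2 f_{avg}(x) - \nabla^2 f_0(x)\| \le \delta.$$

A4 (Gradient Similarity) $\exists\, m, \zeta^2 \ge 0$ s.t. $\forall x \in \mathbb{R}^d$:

$$\|\nabla f_{avg}(x) - \nabla f_0(x)\|^2 \le m\,\|\nabla f_0(x)\|^2 + \zeta^2.$$

A5 (Bounded variance) $\exists\, \sigma_0^2 \ge 0$ s.t. $\forall x \in \mathbb{R}^d$:

$$\mathbb{E}\big[\|n_0(x, \xi_t^{(0)})\|^2\big] \le \sigma_0^2, \qquad \mathbb{E}\big[\|n_{avg}(x, \xi_t^{(1 \dots N)})\|^2\big] \le \frac{\sigma_0^2}{N}.$$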

A1 is a very generic assumption. A2 will not be assumed in the general non-convex case, but only in the $\mu$-PL cases in our theorems, instead of convexity. A3 is implied by smoothness and is equivalent, up to a multiplicative factor, to (Karimireddy et al., 2020, Assumption A2); it appears for quadratic functions in (Shamir et al., 2014; Reddi et al., 2016; Karimireddy et al., 2019). A4 is also very generic, and coincides with (Ajalloeian and Stich, 2020, Assumption 4). Similar assumptions to bound the bias appeared also in (Bertsekas and Tsitsiklis, 2000, though they require vanishing bias), in (Bertsekas, 2002, pg. 38–39) and more recently in (Karimireddy et al., 2020, 2019; Deng et al., 2020). A5 can be relaxed to allow an additional unbounded variance term which grows with the norm of the estimated gradient. Convergence results under this relaxed assumption are provided in the supplementary material. Our main conclusions are maintained in this generalized case.

Hessian dissimilarity $\delta$: We note that Hessian dissimilarity as in A3 holds with $\delta = 2L$ as a direct consequence of the $L$-smoothness of the users. In practice, if users are similar (and not adversarial) we expect $\delta \ll L$.

Bias parameters $m$ and $\zeta^2$: To showcase the intuition behind the bias parameters $m$ and $\zeta^2$, we can limit ourselves to the case of one collaborator ‘1’. The parameter $\zeta^2$ quantifies translation between $f_0$ and $f_1$, while $m$ quantifies the scaling. To be more precise, if we were collaborating with a translated copy, i.e. $f_1(x) = f_0(x + a)$ for some shift $a$, then by $L$-smoothness A4 holds with $m = 0$ and $\zeta^2 = L^2\|a\|^2$. If we were collaborating with a scaled copy, i.e. $f_1 = s f_0$ for some scalar $s$, then $m = (s - 1)^2$ and $\zeta^2 = 0$. Even more simply, $m$ determines whether the bias is bounded or not: $m = 0$ means the bias can be bounded independently of $x$, which is the simplest case. The constant bias term $\zeta^2$ also quantifies how much the two collaborators’ goals differ. This can be seen by evaluating A4 at $x = x_0^\star$, which gives $\|\nabla f_1(x_0^\star)\|^2 \le \zeta^2$; in other words, $\zeta^2$ reflects how distant the two stationary points, which the users would have found by ignoring each other, are from one another. In particular, $\zeta^2 = 0$ corresponds to the case of $f_0$ and $f_1$ sharing the same optimum.

## 4 Weighted Gradient Averaging

As a first basic algorithm, we here introduce weighted gradient averaging and analyze its convergence in the non-convex case and under the $\mu$-PL condition. We show that for the special case of collaborative mean estimation, when every user has its own distribution, we exactly recover the existing theoretical results of Grimberg et al. (2021). While recovering analogous main ideas, our results are more general, applying to any smooth stochastic optimization problem in arbitrary dimensions with multiple collaborators.

#### WGA Algorithm.

As illustrated in Algorithm 2, at each time step $t$, using the current state $x_t$, each collaborator $i$ computes an unbiased local gradient estimate $g_i(x_t)$ of $\nabla f_i(x_t)$, and sends it to user $0$. Then using these gradient estimates and the collaboration weight $\alpha_t$, the main user forms

$$g(x_t) := (1 - \alpha_t)\, g_0(x_t) + \alpha_t \frac{1}{N}\sum_{k=1}^{N} g_k(x_t),$$

and performs an SGD step with the obtained gradient estimate $g(x_t)$, reaching the new state $x_{t+1}$. We now analyze the convergence rate of Algorithm 2 under heterogeneous data across the users, in the non-convex and $\mu$-PL cases, in the following Theorem 1.
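As a minimal sketch, the WGA pseudo-gradient and the variance of its noise can be written as follows. The helper names are hypothetical, and the variance formula assumes that the noises of all $N+1$ users are independent and each bounded by $\sigma_0^2$, in line with A5.

```python
import numpy as np

def wga_gradient(g0, helper_grads, alpha):
    """WGA pseudo-gradient: (1 - alpha) * g_0 + alpha * (1/N) * sum_k g_k."""
    g_avg = np.mean(np.asarray(helper_grads, dtype=float), axis=0)
    return (1.0 - alpha) * np.asarray(g0, dtype=float) + alpha * g_avg

def wga_noise_variance(alpha, sigma0_sq, num_helpers):
    """Bound on the pseudo-gradient noise variance under the assumptions above:
    (1 - alpha)^2 * sigma0^2 + alpha^2 * sigma0^2 / N."""
    return (1.0 - alpha) ** 2 * sigma0_sq + alpha ** 2 * sigma0_sq / num_helpers
```

Setting `alpha = 0` recovers the ‘Alone’ gradient, while `alpha = N / (N + 1)` weights all $N+1$ users equally and reduces the variance bound to $\sigma_0^2/(N+1)$.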

###### Theorem 1 (Convergence of WGA).

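Under Assumptions A1, A4, A5, Algorithm 2 after $T$ rounds, for constant collaboration weight $\alpha_t := \alpha$ and constant step-size $\eta_t := \eta$, satisfies the following convergence bounds (where $F_t := \mathbb{E}[f_0(x_t)] - f_0^\star$):

Non-convex case. For an appropriate choice of $\eta$:

$$\frac{1 - \alpha^2 m}{2T}\sum_{t=0}^{T-1}\mathbb{E}\big[\|\nabla f_0(x_t)\|^2\big] = \mathcal{O}\left(\frac{L F_0}{T} + \sqrt{\frac{L F_0\, \tilde\sigma^2(\alpha)}{T}} + \alpha^2\zeta^2\right).$$

$\mu$-PL case. If in addition A2 holds, then for an appropriate choice of $\eta$, up to a term that decays exponentially in $T$:

$$F_T = \tilde{\mathcal{O}}\left(\frac{L\,\tilde\sigma^2(\alpha)}{\mu^2 T (1 - \alpha^2 m)^2} + \frac{\alpha^2\zeta^2}{\mu (1 - \alpha^2 m)}\right),$$

where $\tilde{\mathcal{O}}$ suppresses logarithmic factors and we defined $\tilde\sigma^2(\alpha) := (1 - \alpha)^2\sigma_0^2 + \alpha^2\frac{\sigma_0^2}{N}$.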

Similar to (Karimi et al., 2016, Theorem 4), we can get rid of the logarithmic factors in the $\mu$-PL case by choosing a decreasing step-size.

Bias-variance trade-off. Crucially, the collaborative variance $\tilde\sigma^2(\alpha)$ is smaller than the individual variance $\sigma_0^2$ of user $0$’s gradient estimates. However, this decrease in variance is accompanied by an additional bias term $\alpha^2\zeta^2$. Hence we have established a bias-variance trade-off, which motivates the proper choice of the collaboration weight $\alpha$.

#### Application of WGA to collaborative mean estimation.

Weighted gradient averaging generalizes the model averaging problem studied in (Donahue and Kleinberg, 2020; Grimberg et al., 2021). We show how to recover their results here.

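Suppose we want to estimate the mean $\mu_0$ of real-valued random samples $z^{(0)}$ with mean $\mu_0$ and variance $\sigma_0^2$. Consider

$$\min_x\; f_0(x) := \frac{1}{2}(x - \mu_0)^2,$$

with unbiased stochastic gradients given as $g_0(x) = x - z^{(0)}$. Similarly, we define our collaborator $f_1$ with a different mean $\mu_1$ and its stochastic gradients. We have that $f_0$ is $1$-PL, $1$-smooth, $m = 0$, and $\zeta^2 = (\mu_0 - \mu_1)^2$. Let us also use a starting point $x_0$ such that $F_0 = \mathcal{O}(\sigma_0^2)$. Plugging these values into Theorem 1, we get that

$$\mathbb{E}(x_T - \mu_0)^2 \le \tilde{\mathcal{O}}\left(\sigma_0^2 \exp(-T) + \frac{\tilde\sigma^2(\alpha)}{T} + \alpha^2(\mu_0 - \mu_1)^2\right).$$

Note that here $T$ represents the number of stochastic samples of $f_0$ we use. Compare this with (Grimberg et al., 2021), who show a rate of $\mathcal{O}\big(\tilde\sigma^2(\alpha)/T + \alpha^2(\mu_0 - \mu_1)^2\big)$. Thus, we recover their results for a large enough $T$ and ignoring logarithmic factors. These logarithmic factors can also be avoided by using a decreasing step-size as noted earlier.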
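The mean-estimation example can be checked numerically. The sketch below assumes a single collaborator ($N = 1$), Gaussian samples with equal variance for both users, and the decreasing step-size $\eta_t = 1/(t+1)$, under which the iterate is a running average and the mean squared error equals $\tilde\sigma^2(\alpha)/T + \alpha^2(\mu_0 - \mu_1)^2$ exactly. The function names and parameter values are illustrative only.

```python
import numpy as np

def wga_mean_estimation(mu0, mu1, sigma0, alpha, T, runs, seed=0):
    """Monte Carlo estimate of E[(x_T - mu0)^2] for WGA on 1D mean estimation."""
    rng = np.random.default_rng(seed)
    x = np.zeros(runs)  # one independent SGD trajectory per entry
    for t in range(T):
        z0 = mu0 + sigma0 * rng.standard_normal(runs)  # user 0 samples
        z1 = mu1 + sigma0 * rng.standard_normal(runs)  # collaborator samples
        g = (1 - alpha) * (x - z0) + alpha * (x - z1)  # WGA pseudo-gradient
        x = x - g / (t + 1)                            # eta_t = 1 / (t + 1)
    return np.mean((x - mu0) ** 2)

def predicted_mse(mu0, mu1, sigma0, alpha, T):
    """Variance term plus squared bias for N = 1."""
    var = ((1 - alpha) ** 2 + alpha ** 2) * sigma0 ** 2
    return var / T + alpha ** 2 * (mu0 - mu1) ** 2

empirical = wga_mean_estimation(0.0, 0.5, 1.0, 0.3, T=50, runs=2000)
theory = predicted_mse(0.0, 0.5, 1.0, 0.3, T=50)
```

Increasing `T` shrinks only the variance term; the squared bias $\alpha^2(\mu_0 - \mu_1)^2$ remains, which is the trade-off discussed above.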

#### Speedup over training alone.

Due to the bias-variance trade-off in Theorem 1, the best choice of $\alpha$ is

$$\alpha_{opt} = \arg\min_{\alpha \in (0, 1/\sqrt{m})}\; \frac{L\,\tilde\sigma^2(\alpha)}{\mu^2 T (1 - \alpha^2 m)^2} + \frac{\alpha^2\zeta^2}{\mu (1 - \alpha^2 m)}.$$

In particular for $m = 0$, which means the bias is bounded, we have $\alpha_{opt} = \frac{N}{N + 1 + N\mu T\zeta^2/(L\sigma_0^2)}$ and we obtain a speedup of $1/(1 - \alpha_{opt})$ over training alone. The speedup factor is illustrated in Figure 6.

Note that if we have both $m = 0$ and $\zeta^2 = 0$, which means $\nabla f_{avg} = \nabla f_0$ (collaboration with a copy), then $\alpha_{opt} = \frac{N}{N+1}$ and we get a linear speedup in $N$ as expected. However, when the functions are minimized at the same point ($\zeta^2 = 0$) but with unbounded bias ($m > 0$), the collaboration weight is bounded by $1/\sqrt{m}$. In addition, the term $(1 - \alpha^2 m)^2$ in the denominator produces a speedup relative to training alone that is sub-linear.

Finally, when both $m > 0$ and $\zeta^2 > 0$, the speedup gained due to weighted averaging is further limited. In fact, in this case when $T \to \infty$ we need to take $\alpha \to 0$, making the gain vanish. Intuitively, WGA controls for the bias introduced by using the collaborators' gradients by down-weighting them. While this may reduce the bias in a single round, the bias keeps accumulating over multiple rounds. Thus, the benefit of WGA diminishes with increasing $T$. In the next section, we see how we could more directly remove this bias.
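The choice of collaboration weight can be explored numerically. The sketch below evaluates the variance and bias terms of the $\mu$-PL objective minimized above and, for the bounded-bias case $m = 0$, compares a grid search with the closed form obtained by setting the derivative to zero. It assumes $\tilde\sigma^2(\alpha) = (1-\alpha)^2\sigma_0^2 + \alpha^2\sigma_0^2/N$; the function names and the parameter values are hypothetical.

```python
import numpy as np

def wga_bound(alpha, N, T, sigma0_sq, zeta_sq, L=1.0, mu=1.0, m=0.0):
    """Variance term plus bias term of the mu-PL objective for WGA."""
    var = (1 - alpha) ** 2 * sigma0_sq + alpha ** 2 * sigma0_sq / N
    shrink = 1 - alpha ** 2 * m
    return L * var / (mu ** 2 * T * shrink ** 2) + alpha ** 2 * zeta_sq / (mu * shrink)

def alpha_opt_bounded_bias(N, T, sigma0_sq, zeta_sq, L=1.0, mu=1.0):
    """Minimizer of wga_bound over alpha when m = 0."""
    return N / (N + 1 + N * mu * T * zeta_sq / (L * sigma0_sq))

alphas = np.linspace(0.0, 1.0, 10001)
grid_best = alphas[np.argmin(wga_bound(alphas, N=10, T=100, sigma0_sq=1.0, zeta_sq=0.01))]
closed_form = alpha_opt_bounded_bias(N=10, T=100, sigma0_sq=1.0, zeta_sq=0.01)
```

For $m = 0$ the minimized objective equals $(1 - \alpha_{opt})$ times its value at $\alpha = 0$, so the gain over training alone shrinks as $T\zeta^2$ grows.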

#### Remark.

For the optimal choice of $\alpha$ we minimize (with respect to $\alpha$) the bound obtained in the $\mu$-PL case. We could have done the same for the non-convex case; however, we choose to focus solely on the $\mu$-PL case, as the bound obtained can be interpreted as a bound on the loss, which is more meaningful than a bound on the norm of the gradients.

## 5 Bias Correction

In Section 4, bias was identified as the major problem limiting the performance of WGA. Therefore we propose a bias correction algorithm which directly tackles this issue. Our strategy consists of estimating the bias between the gradients of $f_{avg}$ and $f_0$ using past gradients. Then, we subtract this bias from the current gradient estimate from $f_{avg}$. (Footnote 2: This follows a similar strategy as control variates in SCAFFOLD (Karimireddy et al., 2019).) We first demonstrate the utility of such bias correction assuming access to some ideal bias oracle. Then, we show how to use an exponential moving average of past gradients to approximate the oracle.

#### BC Algorithm.

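As usual, at each time $t$, each user $i$ computes their own local gradient estimate $g_i(x_t)$. Then, as illustrated in Algorithm 3, user $0$ uses $c_t$ (an estimate of the bias $\nabla f_{avg}(x_t) - \nabla f_0(x_t)$) and the collaboration weight $\alpha_t$:

$$g(x_t) := (1 - \alpha_t)\, g_0(x_t) + \alpha_t\left(\frac{1}{N}\sum_{k=1}^{N} g_k(x_t) - c_t\right).$$

Then user $0$ updates their parameters using this pseudo-gradient as $x_{t+1} = x_t - \eta_t\, g(x_t)$. We next discuss how to compute this estimate $c_t$.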
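The bias-corrected pseudo-gradient differs from WGA only by the subtraction of the bias estimate. A minimal sketch, with a hypothetical helper name and the bias estimate passed in as a plain array:

```python
import numpy as np

def bc_gradient(g0, helper_grads, c, alpha):
    """Bias-corrected pseudo-gradient:
    (1 - alpha) * g_0 + alpha * ((1/N) * sum_k g_k - c)."""
    g_avg = np.mean(np.asarray(helper_grads, dtype=float), axis=0)
    corrected = g_avg - np.asarray(c, dtype=float)
    return (1.0 - alpha) * np.asarray(g0, dtype=float) + alpha * corrected
```

If the helper gradients equal user 0's gradient plus a fixed offset and `c` equals that offset, the result is user 0's gradient for every value of `alpha`.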

### 5.1 Using a Bias Oracle

As a warm-up, let us suppose we have access to an oracle that gives a noisy unbiased estimate of the true bias

$$c_{oracle,t} = \nabla f_{avg}(x_t) - \nabla f_0(x_t) + n_{oracle,t}.$$

The quantity $n_{oracle,t}$ is the noise of the oracle and is independent from the gradient estimates. Using this, we have that the update satisfies

$$\mathbb{E}\left[\frac{1}{N}\sum_{k=1}^{N} g_k(x_t) - c_{oracle,t}\right] = \nabla f_0(x_t).$$

Hence, this becomes similar to the case where $m = 0$ and $\zeta^2 = 0$ with WGA, enabling linear speedup. Theorem 2 formalizes this intuition.

###### Theorem 2 (Convergence given a bias oracle).

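Under Assumption A1, using an ideal oracle of the mean bias with variance $v^2/N$ (i.e., $v^2$ is the variance of the bias oracle associated to each collaborator), for constant collaboration weight $\alpha_t := \alpha$ and constant step-size $\eta_t := \eta$, we have the following:

Non-convex case. For an appropriate choice of $\eta$:

$$\frac{1}{2T}\sum_{t=0}^{T-1}\mathbb{E}\big[\|\nabla f_0(x_t)\|^2\big] \le \frac{L F_0}{T} + \sqrt{\frac{2 L F_0\, \tilde\sigma^2(\alpha)}{T}}.$$

$\mu$-PL case. If in addition A2 holds, then for an appropriate choice of $\eta$:

$$F_T = \tilde{\mathcal{O}}\left(F_0 \exp\left(-\frac{\mu T}{L}\right) + \frac{L\,\tilde\sigma^2(\alpha)}{\mu^2 T}\right),$$

where $\tilde\sigma^2(\alpha) := (1 - \alpha)^2\sigma_0^2 + \alpha^2\frac{\sigma_0^2 + v^2}{N}$.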
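To see where the shape of the non-convex bound comes from, the following is a sketch of one standard argument. It assumes that the pseudo-gradient is unbiased for $\nabla f_0(x_t)$ with variance at most $\tilde\sigma^2(\alpha)$, that $\eta \le 1/L$, and that $f_0^\star$ is the minimum value of $f_0$ so that $F_T \ge 0$. The specific step-size in the last line is an assumption, chosen because it reproduces the stated constants; the original proof may proceed differently.

```latex
\begin{align*}
\mathbb{E} f_0(x_{t+1})
  &\le \mathbb{E} f_0(x_t) - \eta\,\mathbb{E}\|\nabla f_0(x_t)\|^2
     + \tfrac{L\eta^2}{2}\big(\mathbb{E}\|\nabla f_0(x_t)\|^2 + \tilde\sigma^2(\alpha)\big)
     && \text{(A1, unbiasedness)} \\
  &\le \mathbb{E} f_0(x_t) - \tfrac{\eta}{2}\,\mathbb{E}\|\nabla f_0(x_t)\|^2
     + \tfrac{L\eta^2}{2}\,\tilde\sigma^2(\alpha)
     && (\eta \le 1/L) \\
\frac{1}{2T}\sum_{t=0}^{T-1}\mathbb{E}\|\nabla f_0(x_t)\|^2
  &\le \frac{F_0}{\eta T} + \frac{L\eta}{2}\,\tilde\sigma^2(\alpha)
     && \text{(telescoping, } F_T \ge 0\text{)} \\
  &\le \frac{L F_0}{T} + \sqrt{\frac{2 L F_0\,\tilde\sigma^2(\alpha)}{T}}
     && \Big(\eta = \min\Big\{\tfrac{1}{L},\ \sqrt{\tfrac{2F_0}{L\,\tilde\sigma^2(\alpha)\,T}}\Big\}\Big)
\end{align*}
```

The last step uses $1/\min\{a, b\} \le 1/a + 1/b$ for the first term and $\eta \le \sqrt{2F_0/(L\tilde\sigma^2(\alpha)T)}$ for the second, each contributing $\sqrt{L F_0 \tilde\sigma^2(\alpha)/(2T)}$.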

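Speedup over training alone. First, note that the rate of Theorem 2 when $v = 0$ matches Theorem 1 with $m = 0$ and $\zeta = 0$. We examine two cases.

• If $\sigma_0^2 > 0$: In this case, we choose

$$\alpha_{opt} \in \arg\min_\alpha \tilde\sigma^2(\alpha) = \frac{N}{N + 1 + v^2/\sigma_0^2},$$

giving $\tilde\sigma^2(\alpha_{opt}) = (1 - \alpha_{opt})\,\sigma_0^2 = \frac{\sigma_0^2 + v^2}{N + 1 + v^2/\sigma_0^2}$. For large enough $N$ ($N \gg 1 + v^2/\sigma_0^2$), this simplifies to $\tilde\sigma^2(\alpha_{opt}) \approx \frac{\sigma_0^2 + v^2}{N}$ and a convergence rate of $\mathcal{O}\big(\sqrt{L F_0 (\sigma_0^2 + v^2)/(NT)}\big)$ in the general non-convex case and $\tilde{\mathcal{O}}\big(L(\sigma_0^2 + v^2)/(\mu^2 N T)\big)$ with the $\mu$-PL inequality. Thus, we achieve linear speedup.

• If $\sigma_0^2 = 0$: the baseline here is gradient descent. If $v^2 > 0$ then both the non-convex and $\mu$-PL convergence rates are slower than GD. The best choice of collaboration weight here is $\alpha = 0$.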
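The two cases above can be reproduced with a short sketch. It assumes the oracle variance $\tilde\sigma^2(\alpha) = (1-\alpha)^2\sigma_0^2 + \alpha^2(\sigma_0^2 + v^2)/N$ that is consistent with the stated $\alpha_{opt}$; the function names are hypothetical.

```python
import numpy as np

def oracle_variance(alpha, N, sigma0_sq, v_sq):
    """Pseudo-gradient variance when the bias comes from a noisy oracle."""
    return (1 - alpha) ** 2 * sigma0_sq + alpha ** 2 * (sigma0_sq + v_sq) / N

def oracle_alpha_opt(N, sigma0_sq, v_sq):
    """Minimizer of oracle_variance; alpha = 0 when user 0 has no noise."""
    if sigma0_sq == 0:
        return 0.0
    return N / (N + 1 + v_sq / sigma0_sq)

alpha_star = oracle_alpha_opt(N=4, sigma0_sq=1.0, v_sq=1.0)
best_variance = oracle_variance(alpha_star, N=4, sigma0_sq=1.0, v_sq=1.0)
```

At the optimum the variance equals $(1 - \alpha_{opt})\sigma_0^2$, so the closer $\alpha_{opt}$ is to $1$, the larger the gain over training alone.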

### 5.2 Approximating the Oracle Using EMA

Clearly, the previous discussion shows that given access to a bias oracle, using bias correction gives significant speedup even when we have large bias, i.e. $m$ and $\zeta^2$ are large. Algorithm 3 shows how we can use an exponential moving average of past gradients to estimate this bias without an oracle:

$$c_{t+1} := (1 - \beta_t)\, c_t + \beta_t\big(g_{avg}(x_t) - g_0(x_t)\big).$$

Intuitively, this averages over past independent stochastic bias estimates, reducing the variance of $c_t$. We next examine the effect of replacing our bias oracle with such a $c_t$.
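The EMA update and its effect can be illustrated on a toy problem. The sketch below is an assumption-laden illustration, not the paper's Algorithm 3: it uses $f_0(x) = x^2/2$ and a translated average objective $f_{avg}(x) = (x - \text{shift})^2/2$ in one dimension, initializes the bias estimate at zero, and applies the current estimate before updating it. All names and default values are hypothetical.

```python
import numpy as np

def ema_bias_update(c, g_avg, g0, beta):
    """One EMA step: c <- (1 - beta) * c + beta * (g_avg - g0)."""
    return (1.0 - beta) * c + beta * (g_avg - g0)

def run_bias_correction(alpha, beta, eta, num_steps, shift=1.0,
                        noise_std=0.0, num_helpers=10, x0=3.0, seed=0):
    """Toy 1D run with f_0(x) = x^2 / 2 and f_avg(x) = (x - shift)^2 / 2.

    Returns the final iterate and the final bias estimate. With beta = 0 the
    estimate stays at its initial value 0, so the run reduces to plain WGA.
    """
    rng = np.random.default_rng(seed)
    x, c = x0, 0.0
    for _ in range(num_steps):
        g0 = x + noise_std * rng.standard_normal()
        # Averaging num_helpers independent noises divides the variance by N.
        g_avg = (x - shift) + noise_std / np.sqrt(num_helpers) * rng.standard_normal()
        g = (1.0 - alpha) * g0 + alpha * (g_avg - c)  # uses the current c_t
        c = ema_bias_update(c, g_avg, g0, beta)       # then forms c_{t+1}
        x = x - eta * g
    return x, c

x_bc, c_bc = run_bias_correction(alpha=0.5, beta=0.1, eta=0.1, num_steps=500)
x_wga, c_wga = run_bias_correction(alpha=0.5, beta=0.0, eta=0.1, num_steps=500)
```

In this noise-free setting the true bias is the constant `-shift`; the bias-corrected run approaches user 0's optimum at zero, while the run without correction settles at `alpha * shift`.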

###### Theorem 3 (Convergence of bias correction).

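Under Assumptions A1 and A3–A5, Algorithm 3 for constant collaboration weight $\alpha_t := \alpha$ and constant step-size $\eta_t := \eta$ satisfies the following:

Non-convex case. For an appropriate choice of step-size $\eta$ we have

$$\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\big[\|\nabla f_0(x_t)\|^2\big] = \mathcal{O}\left(\frac{L F_0}{T} + \left(\frac{F_0}{T}\right)^{2/3}\left(\frac{\alpha^2\delta^2\tilde\zeta^2}{\beta^2}\right)^{1/3} + \sqrt{\frac{L F_0\,\tilde\sigma^2(\alpha)}{T}}\right).$$

$\mu$-PL case. If in addition Assumption A2 holds, for an appropriate choice of step-size $\eta$ we have

$$F_T = \tilde{\mathcal{O}}\left(F_0\exp\left(-\frac{\mu T}{L}\right) + \frac{L\,\tilde\sigma^2(\alpha)}{\mu^2 T} + \frac{\alpha^2\delta^2\tilde\zeta^2}{\mu^3\beta^2 T^2}\right),$$

where $\tilde\sigma^2(\alpha) := (1 - \alpha)^2\sigma_0^2 + \alpha^2\left(\frac{1}{N} + \beta\,\frac{N+1}{N}\right)\sigma_0^2$ and $\tilde\zeta^2$ is the bias constant appearing in the bound.

Our rate of convergence now depends on both a choice of $\alpha$ as well as $\beta$. Comparing to Theorem 2, our bias correction algorithm is equivalent up to leading order to a bias oracle with a bias noise $v^2 = (N+1)\beta\sigma_0^2$.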

#### Case 1: If $\sigma_0^2 = 0$.

The baseline is user 0 doing gradient descent alone. In this case, the optimal choice would be to use $\alpha = 0$ (this recovers training alone). Intuitively, this is because our method relies on performing a bias-variance trade-off. If $\sigma_0^2 = 0$, i.e. there is no variance, we are just adding bias with no hope of any reduction in variance.

#### Case 2: If $\sigma_0^2 > 0$.

This is the more interesting (and realistic) setting we consider for the rest of the section. We can choose an optimal $\alpha$ as

$$\alpha_{opt} \in \arg\min_{\alpha \in (0,1)}\; \frac{L\,\tilde\sigma^2(\alpha)}{\mu^2 T} + \frac{\alpha^2\delta^2\tilde\zeta^2}{\mu^3\beta^2 T^2} = \frac{N}{N + 1 + \beta + N\left(\beta + \frac{\delta^2\tilde\zeta^2}{\beta^2\sigma_0^2 T L \mu}\right)}.$$

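With this choice we have:

$$F_T = \tilde{\mathcal{O}}\left(\frac{L\sigma_0^2}{\mu^2 T}\,(1 - \alpha_{opt})\right). \tag{3}$$

Choice of $\beta$. From Equation (3), $\alpha_{opt}$ should be as big as possible. This motivates choosing

$$\beta_{opt} \in \arg\min_{\beta \in (0,1)}\; \beta + N\left(\beta + \frac{\delta^2\tilde\zeta^2}{\beta^2\sigma_0^2 T L \mu}\right) = \left(\frac{2N\delta^2\tilde\zeta^2}{\sigma_0^2 L \mu T (N+1)}\right)^{1/3}.$$

For this choice of $\beta$ we get:

$$\alpha_{opt} = \frac{N}{N + 1 + \frac{3N}{2}\left(\frac{N+1}{N}\right)^{2/3}\left(\frac{2\delta^2\tilde\zeta^2}{\sigma_0^2 L \mu T}\right)^{1/3}}.$$

Now notice that for large enough $T$ (for $T = \Omega\big(N^3\delta^2\tilde\zeta^2/(\sigma_0^2 L \mu)\big)$), we have $1 - \alpha_{opt} = \mathcal{O}(1/N)$. Contrast this with WGA with $\zeta^2 > 0$, which had $\alpha_{opt} \to 0$ as $T \to \infty$. Further, as we next show, this directly translates to BC having a linear speedup in the number of collaborators.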
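The closed forms for $\beta_{opt}$ and $\alpha_{opt}$ can be sanity-checked numerically. In the sketch below the problem constants are lumped into a single hypothetical quantity `K`, standing for $\delta^2\tilde\zeta^2/(\sigma_0^2 T L \mu)$; the function names and test values are illustrative assumptions.

```python
import numpy as np

def bc_alpha_opt(beta, N, K):
    """Optimal collaboration weight for a given EMA parameter beta."""
    return N / (N + 1 + beta + N * (beta + K / beta ** 2))

def bc_beta_opt(N, K):
    """EMA parameter that maximizes bc_alpha_opt."""
    return (2 * N * K / (N + 1)) ** (1 / 3)

def bc_alpha_at_beta_opt(N, K):
    """Closed form of bc_alpha_opt evaluated at beta = bc_beta_opt(N, K)."""
    return N / (N + 1 + 1.5 * N * ((N + 1) / N) ** (2 / 3) * (2 * K) ** (1 / 3))

N, K = 10, 1e-6
betas = np.linspace(1e-4, 1.0, 100001)
grid_beta = betas[np.argmax(bc_alpha_opt(betas, N, K))]
```

As `K` shrinks (longer horizon $T$ or smaller $\delta^2\tilde\zeta^2$), `bc_alpha_at_beta_opt` approaches $N/(N+1)$, the weight that treats all $N+1$ users equally.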

###### Corollary 1 (linear speedup of BC).

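For $\sigma_0^2 > 0$ and $T = \Omega\big(N^3\delta^2\tilde\zeta^2/(\sigma_0^2 L \mu)\big)$, appropriate choices of $\alpha$ and $\beta$ yield:

Non-convex case.

$$\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\big[\|\nabla f_0(x_t)\|^2\big] = \mathcal{O}\left(\sqrt{\frac{L F_0\,\sigma_0^2}{(N+1)\,T}}\right).$$

$\mu$-PL case.

$$F_T = \tilde{\mathcal{O}}\left(\frac{L\sigma_0^2}{\mu^2 (N+1)\, T}\right).$$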

Remarkably, the above corollary proves that BC always achieves the optimal trade-off between bias and variance, reducing the variance by using the stochastic gradients of the other users.

Note that we need $T = \Omega\big(N^3\delta^2\tilde\zeta^2/(\sigma_0^2 L \mu)\big)$. This means that if either the bias is large or if we have a lot of collaborators (large $N$), the linear speedup regime will require longer time horizons. Also, if we do not use the optimal $\beta$ and it is instead kept constant independent of $N$ and $T$, then we have

$$\lim_{N \to \infty,\, T \to \infty}(1 - \alpha_{opt}) = \frac{\beta}{\beta + 1}.$$

This means that if we use a fixed $\beta$, collaborating with more users will quickly yield decreasing returns. This highlights the need for decreasing $\beta$ as $N$ increases.
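The saturation with a fixed $\beta$ can be seen directly from the expression for $\alpha_{opt}$. In this sketch, `K` is a hypothetical shorthand for $\delta^2\tilde\zeta^2/(\sigma_0^2 T L \mu)$, set to zero to mimic the limit $T \to \infty$.

```python
def one_minus_alpha_opt(beta, N, K):
    """Residual factor (1 - alpha_opt) that multiplies the rate in Equation (3)."""
    alpha = N / (N + 1 + beta + N * (beta + K / beta ** 2))
    return 1.0 - alpha

beta = 0.1
# Residual factor for a growing number of collaborators with beta held fixed.
residuals = [one_minus_alpha_opt(beta, N, 0.0) for N in (1, 10, 100, 10 ** 6)]
limit = beta / (beta + 1.0)
```

The entries of `residuals` decrease with `N` but level off near `limit` instead of continuing to shrink like $1/N$.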

#### Conclusion: BC solves the problems of WGA.

From the above discussion, BC solves the problems WGA had with bias. First of all, there is no dependence on the heterogeneity parameter $m$; in particular, the collaboration weight $\alpha$ can range freely in the interval $(0, 1)$. Secondly, with BC, the bias $\zeta^2$ does not accumulate with time. On the contrary, with increasing time our EMA estimates of the bias improve and we more successfully eliminate bias and improve convergence.

## 6 Experiments

To validate our theory we consider the noisy quadratic model, i.e. optimizing a function of the type

$$f_0(x) := \frac{1}{2}(x - m_0^\star)^\top A_0\, (x - m_0^\star), \qquad m_0^\star \sim \mathcal{N}(x_0^\star, \Sigma_0).$$

While simple, this model can serve as an illustrative test for our theory and is often used to test machine learning and federated learning algorithms (Schaul et al., 2013; Wu et al., 2018; Martens and Grosse, 2015; Zhang et al., 2019). One common simplification is to consider both $A_0$ and $\Sigma_0$ to be diagonal (or co-diagonalizable). This assumption makes it possible to optimize the function over each of its dimensions independently. So it suffices to consider a noisy quadratic model in 1D: optimizing $f_0$ by collaborating with $f_{avg}$. Here, $N$ denotes the number of collaborators. The quantity reported in the plots can be interpreted as a test loss (called simply loss in the plots).

Convergence speed. Figure 2 shows convergence curves of the three competing algorithms we have discussed before: working alone, weighted gradient averaging (WGA) and bias correction (BC). In particular, we see that BC reaches a lower error level compared to both other algorithms. This confirms our theory that BC reduces the bias in the algorithm enabling it to reach a lower error level. The initial increase in the loss is also characteristic of BC, and is because during the initial stages our EMA estimate of the bias is quite poor. Eventually, the bias estimate improves and we get fast convergence.

Dependence on data heterogeneity. Figure 3 shows how the bias parameter $\zeta^2$ influences the performance of BC. As predicted by the theory, we see that BC always converges to the same error level, uninfluenced by $\zeta^2$. This bias only affects the time horizon needed for convergence. In contrast, WGA is strongly influenced by the bias, as we see in Figure 4. In fact, the convergence error level of WGA is directly proportional to $\zeta^2$, meaning that we would need to set $\alpha = 0$ (i.e. train alone) to ensure low error. This demonstrates that the bias correcting technique employed by BC indeed succeeds, validating our theory.

Dependence on the number of collaborators. Figure 5 shows how the number of collaborators $N$ influences the convergence of BC. We see that increasing $N$ does have a positive effect on BC and decreases the error level to which it converges. However, the benefit saturates quickly. While there is a substantial improvement for the first few collaborators, further increases only yield negligible improvement. This is because our experiments used a fixed value of $\beta$ independently of $N$. In such a case, our theory predicts a saturation as $N$ increases, exactly as seen here. To avoid saturation, we need to choose the best possible $\beta$ for each $N$.

## 7 Limitations and Extensions

#### Bias Correction in deep learning.

In this work we have employed the idea of gradient bias correction using SGD. Our methods can also be extended to other optimizers such as momentum or Adam. A larger empirical exploration of such algorithms, as well as more real world deep learning experiments would be valuable, but is out of scope for our more theoretical work.

#### Adding local steps.

Currently, the users communicate with each other after every gradient computation. We can develop more communication-efficient schemes by instead allowing multiple local steps before communication such as in FedAvg (McMahan et al., 2017). Similarly, extending our algorithms to allow personalization for all users instead of focusing only on user 0 would improve practicality in the federated learning setting.

#### Fine-grained measures of similarity.

Our choice of algorithms as well as the assumptions use static global measures of dissimilarity. Time-varying adaptive weighting strategies such as cosine similarity between gradients may further improve our algorithms. Similarly, we do not distinguish between our collaborators and combine them all into a single function $f_{avg}$. Instead, it is likely that only a subset of them are similar to us, whereas the rest may be quite different. Thus, using individual user-level similarities such as e.g. in (Grimberg et al., 2020) would also be a fruitful extension. Similarity-based user selection rules are also closely related to Byzantine robust learning, where they are used to exclude malicious participants (Blanchard et al., 2017; Baruch et al., 2019; Karimireddy et al., 2021).

## 8 Conclusion

In this work, we have introduced the collaborative stochastic optimization framework where one “main” user collaborates with a set of willing-to-help collaborators. We considered the simplest method to solve this problem, using SGD with weighted gradient averaging. We discussed in detail the limitations of this idea arising mainly due to the bias introduced by the collaboration. To solve this bias problem, we proposed a second algorithm bias correction. We showed that our bias correction algorithm manages to remove the effect of this bias and under some optimal choices of its parameters leads to a linear speedup as we increase the number of collaborators.

## References

• A. Ajalloeian and S. U. Stich (2020) On the convergence of SGD with biased gradients. arXiv:2008.00051 [cs.LG]. Cited by: §3.2.
• M. Baruch, G. Baruch, and Y. Goldberg (2019) A little is enough: circumventing defenses for distributed learning. arXiv preprint arXiv:1902.06156. Cited by: §7.
• M. Beaussart, F. Grimberg, M. Hartley, and M. Jaggi (2021) WAFFLE: weighted averaging for personalized federated learning. In NeurIPS 2021 Workshop on New Frontiers in Federated Learning, Cited by: §2.
• D. P. Bertsekas and J. N. Tsitsiklis (2000) Gradient convergence in gradient methods with errors. SIAM Journal on Optimization 10 (3), pp. 627–642. Cited by: §3.2.
• D. Bertsekas (2002) Nonlinear programming. Athena scientific. Cited by: §3.2.
• P. Blanchard, E. M. E. Mhamdi, R. Guerraoui, and J. Stainer (2017) Byzantine-tolerant machine learning. arXiv preprint arXiv:1703.02757. Cited by: §7.
• L. Bottou, F. E. Curtis, and J. Nocedal (2018) Optimization methods for large-scale machine learning. Siam Review 60 (2), pp. 223–311. Cited by: §1.
• L. Collins, A. Mokhtari, and S. Shakkottai (2020) Why does maml outperform erm? an optimization perspective. arXiv preprint arXiv:2010.14672. Cited by: §2.
• Y. Deng, M. M. Kamani, and M. Mahdavi (2020) Adaptive personalized federated learning. arXiv:2003.13461 [cs, stat]. Cited by: §2, §2, §3.2.
• K. Donahue and J. Kleinberg (2020) Model-sharing games: analyzing federated learning under voluntary participation. arXiv preprint arXiv:2010.00753. Cited by: §2, §4.
• A. Fallah, A. Mokhtari, and A. Ozdaglar (2020) Personalized federated learning: a meta-learning approach. arXiv preprint arXiv:2002.07948. Cited by: §2.
• Z. Feng, S. Han, and S. S. Du (2021) Provable adaptation across multiway domains via representation learning. arXiv preprint arXiv:2106.06657. Cited by: §2.
• F. Grimberg, M. Hartley, M. Jaggi, and S. P. Karimireddy (2020) Weight erosion: an update aggregation scheme for personalized collaborative machine learning. In MICCAI Workshop on Distributed and Collaborative Learning, pp. 160–169. Cited by: §7.
• F. Grimberg, M. Hartley, S. P. Karimireddy, and M. Jaggi (2021) Optimal model averaging: towards personalized collaborative learning. ICML Workshop on Federated Learning for User Privacy and Data Confidentiality https://fl-icml.github.io/2021/papers/FL-ICML21_paper_56.pdf. Cited by: §2, §4, §4, §4.
• P. Kairouz, H. B. McMahan, B. Avent, A. Bellet, M. Bennis, A. N. Bhagoji, K. Bonawitz, Z. Charles, G. Cormode, R. Cummings, R. G. L. D’Oliveira, S. E. Rouayheb, D. Evans, J. Gardner, Z. Garrett, A. Gascón, B. Ghazi, P. B. Gibbons, M. Gruteser, Z. Harchaoui, C. He, L. He, Z. Huo, B. Hutchinson, J. Hsu, M. Jaggi, T. Javidi, G. Joshi, M. Khodak, J. Konecný, A. Korolova, F. Koushanfar, S. Koyejo, T. Lepoint, Y. Liu, P. Mittal, M. Mohri, R. Nock, A. Özgür, R. Pagh, M. Raykova, H. Qi, D. Ramage, R. Raskar, D. Song, W. Song, S. U. Stich, Z. Sun, A. T. Suresh, F. Tramèr, P. Vepakomma, J. Wang, L. Xiong, Z. Xu, Q. Yang, F. X. Yu, H. Yu, and S. Zhao (2019) Advances and open problems in federated learning. arXiv:1912.04977 [cs, stat]. Cited by: §1, §2.
• H. Karimi, J. Nutini, and M. Schmidt (2016) Linear Convergence of Gradient and Proximal-Gradient Methods Under the Polyak-Łojasiewicz Condition. In ECML - European Conference on Machine Learning and Knowledge Discovery in Databases - Volume 9851, pp. 795–811. Cited by: §4.
• S. P. Karimireddy, L. He, and M. Jaggi (2021) Learning from history for byzantine robust optimization. In International Conference on Machine Learning, pp. 5311–5319. Cited by: §7.
• S. P. Karimireddy, M. Jaggi, S. Kale, M. Mohri, S. J. Reddi, S. U. Stich, and A. T. Suresh (2020) Mime: mimicking centralized stochastic algorithms in federated learning. arXiv:2008.03606 [cs.LG]. Cited by: §3.2.
• S. P. Karimireddy, S. Kale, M. Mohri, S. J. Reddi, S. U. Stich, and A. T. Suresh (2019) SCAFFOLD: stochastic controlled averaging for federated learning. arXiv:1910.06378v4[cs.LG]. Cited by: §2, §3.2, footnote 2.
• M. Khodak, M. Balcan, and A. Talwalkar (2019) Adaptive gradient-based meta-learning methods. arXiv preprint arXiv:1906.02717. Cited by: §2.
• J. Konecny, H. B. McMahan, D. Ramage, and P. Richtarik (2016) Federated optimization: distributed machine learning for on-device intelligence. arxiv.org/abs/1610.02527. Cited by: §2.
• K. Koshy Thekumparampil, P. Jain, P. Netrapalli, and S. Oh (2021) Sample efficient linear meta-learning by alternating minimization. arXiv e-prints, pp. arXiv–2105. Cited by: §2.
• V. Kulkarni, M. Kulkarni, and A. Pant (2020) Survey of personalization techniques for federated learning. arXiv:2003.08673 [cs.LG]. Cited by: §2.
• T. Li, A. K. Sahu, A. Talwalkar, and V. Smith (2020a) Federated learning: challenges, methods, and future directions. IEEE Signal Processing Magazine, 37(3):50–60. Cited by: §2.
• T. Li, M. Sanjabi, A. Beirami, and V. Smith (2020b) Fair resource allocation in federated learning. In ICLR - International Conference on Learning Representations, Cited by: §2.
• Y. Mansour, M. Mohri, and A. T. Suresh (2020) Three approaches for personalization with applications to federated learning. arXiv:2002.10619 [cs, stat]. Cited by: §1, §2, §2.
• J. Martens and R. Grosse (2015) Optimizing neural networks with Kronecker-factored approximate curvature. In International Conference on Machine Learning, pp. 2408–2417. Cited by: §6.
• A. Maurer, M. Pontil, and B. Romera-Paredes (2016) The benefit of multitask representation learning. Journal of Machine Learning Research 17 (81), pp. 1–32. Cited by: §2.
• B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y. Arcas (2017) Communication-efficient learning of deep networks from decentralized data. In Proceedings of AISTATS, pp. 1273–1282. Cited by: §2, §7.
• M. Mohri, G. Sivek, and A. T. Suresh (2019) Agnostic federated learning. arXiv preprint arXiv:1902.00146. Cited by: §2, §2.
• A. Nedic (2020) Distributed gradient methods for convex machine learning problems in networks: distributed optimization. IEEE Signal Processing Magazine, 37(3):92–101. Cited by: §2.
• S. J. Reddi, J. Konečnỳ, P. Richtárik, B. Póczós, and A. Smola (2016) AIDE: fast and communication efficient distributed optimization. arXiv preprint arXiv:1608.06879. Cited by: §3.2.
• T. Schaul, S. Zhang, and Y. LeCun (2013) No more pesky learning rates. In International Conference on Machine Learning, pp. 343–351. Cited by: §6.
• O. Shamir, N. Srebro, and T. Zhang (2014) Communication-efficient distributed optimization using an approximate newton-type method. In International conference on machine learning, pp. 1000–1008. Cited by: §3.2.
• N. Tripuraneni, M. I. Jordan, and C. Jin (2020) On the theory of transfer learning: the importance of task diversity. arXiv preprint arXiv:2006.11650. Cited by: §2.
• J. Wang, Z. Charles, Z. Xu, G. Joshi, H. B. McMahan, M. Al-Shedivat, G. Andrew, S. Avestimehr, K. Daly, D. Data, et al. (2021) A field guide to federated optimization. arXiv preprint arXiv:2107.06917. Cited by: §2.
• K. Wang, R. Mathews, C. Kiddon, H. Eichner, F. Beaufays, and D. Ramage (2019) Federated evaluation of on-device personalization. arXiv preprint arXiv:1910.10252. Cited by: §1, §2.
• Y. Wu, M. Ren, R. Liao, and R. Grosse (2018) Understanding short-horizon bias in stochastic meta-optimization. arXiv preprint arXiv:1803.02021. Cited by: §6.
• T. Yu, E. Bagdasaryan, and V. Shmatikov (2020) Salvaging federated learning by local adaptation. arXiv preprint arXiv:2002.04758. Cited by: §1, §2.
• G. Zhang, L. Li, Z. Nado, J. Martens, S. Sachdeva, G. E. Dahl, C. J. Shallue, and R. Grosse (2019) Which algorithmic choices matter at which batch sizes? Insights from a noisy quadratic model. arXiv preprint arXiv:1907.04164. Cited by: §6.
• M. Zhang, K. Sapra, S. Fidler, S. Yeung, and J. M. Alvarez (2021) Personalized federated learning with first order model optimization. ICLR 2021. Cited by: §2.

## Appendix A Relaxing Noise Assumptions

We start by relaxing our assumptions on the noise. In general we can make the following assumptions:

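First relaxation of A5 (Bounded variance): $\exists\, M_0, M_{\mathrm{avg}}, \sigma_0, \sigma_{\mathrm{avg}} \ge 0$ s.t. $\forall x$:

$$
\begin{cases}
\mathbb{E}\big[\|n_0(x,\xi_t^{(0)})\|^2\big] \le M_0\,\|\nabla_x f_0(x)\|^2 + \sigma_0^2,\\[4pt]
\mathbb{E}\big[\|n_{\mathrm{avg}}(x,\xi_t^{i=(1\ldots N)})\|^2\big] \le M_{\mathrm{avg}}\,\|\nabla_x f_{\mathrm{avg}}(x)\|^2/N + \sigma_{\mathrm{avg}}^2/N.
\end{cases}
$$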

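The quantity $\sigma_{\mathrm{avg}}^2$ is the average variance of the collaborators' gradient estimates at a stationary point of $f_{\mathrm{avg}}$. We combine this new assumption with the gradient dissimilarity assumption:

A4 (Gradient Similarity): $\exists\, m, \zeta \ge 0$ s.t. $\forall x$:

$$
\|\nabla_x f_{\mathrm{avg}}(x) - \nabla_x f_0(x)\|^2 \le m\,\|\nabla_x f_0(x)\|^2 + \zeta^2.
$$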

We get

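$$
\begin{cases}
\mathbb{E}\big[\|n_0(x,\xi_t^{(0)})\|^2\big] \le M_0\,\|\nabla_x f_0(x)\|^2 + \sigma_0^2,\\[4pt]
\mathbb{E}\big[\|n_{\mathrm{avg}}(x,\xi_t^{i=(1\ldots N)})\|^2\big] \le M_{\mathrm{avg}}\,m\,\|\nabla_x f_0(x)\|^2/N + \tilde{\sigma}_{\mathrm{avg}}^2/N,\\[4pt]
\tilde{\sigma}_{\mathrm{avg}}^2 = \sigma_{\mathrm{avg}}^2 + M_{\mathrm{avg}}\,\zeta^2.
\end{cases}
$$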

The quantity $\tilde{\sigma}_{\mathrm{avg}}^2$ measures the average variance of the collaborators' gradient estimates, this time at a stationary point of agent 0. However, this bound can be pessimistic. When the Hessian dissimilarity parameter is zero, i.e. $f_{\mathrm{avg}}$ is a translated copy of $f_0$, the translation does not change the noise from its original level (adding a constant to a random variable does not change its variance). Thus, instead of the additional $M_{\mathrm{avg}}\zeta^2$, we should expect an additional term in the variance of $n_{\mathrm{avg}}$ that is proportional to the Hessian dissimilarity parameter. This motivates the final form of our assumption:
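The translation argument can be checked numerically on a toy problem. The sketch below is only an illustration under assumptions that are not from the paper: a one-dimensional objective $f(x) = \tfrac{1}{2}\mathbb{E}[(x - (\xi + c))^2]$ with Gaussian $\xi$, where the hypothetical `shift` parameter $c$ plays the role of the translation between $f_0$ ($c = 0$) and $f_{\mathrm{avg}}$ ($c \neq 0$). Both objectives have the same Hessian, their gradients differ by the constant $c$ (so $\zeta^2 = c^2$), and yet the gradient noise has the same second moment.

```python
import numpy as np


def grad_noise_second_moment(shift, sigma=0.5, n_samples=200_000, seed=0):
    """Estimate E[n(x, xi)^2] at x = 0 for the toy objective
    f(x) = 0.5 * E[(x - (xi + shift))^2], with xi ~ N(0, sigma^2).

    `shift` is a hypothetical translation parameter (an assumption of
    this sketch): shift = 0 stands in for f_0, shift != 0 for a
    translated copy standing in for f_avg.
    """
    rng = np.random.default_rng(seed)
    xi = rng.normal(0.0, sigma, size=n_samples)
    x = 0.0  # stationary point of the unshifted objective
    stochastic_grads = x - (xi + shift)  # per-sample gradients
    full_grad = x - shift                # exact gradient, since E[xi] = 0
    noise = stochastic_grads - full_grad  # equals -xi, independent of shift
    return float(np.mean(noise ** 2)), full_grad


var_original, grad_original = grad_noise_second_moment(shift=0.0)
var_translated, grad_translated = grad_noise_second_moment(shift=3.0)

# Squared gradient dissimilarity at x = 0 (zeta^2 in this toy model).
zeta_sq = (grad_translated - grad_original) ** 2

print("noise second moment, original:  ", var_original)
print("noise second moment, translated:", var_translated)
print("squared gradient dissimilarity: ", zeta_sq)
```

In this toy model the squared gradient dissimilarity is $9$, while the noise level stays at roughly $\sigma^2 = 0.25$ for both objectives, so a bound that inflates the variance by $M_{\mathrm{avg}}\zeta^2$ would be loose here.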

Final form of A5 (Bounded variance) s.t. :

 ⎧⎪ ⎪ ⎪⎨⎪ ⎪ ⎪⎩E[∥n0(x,ξ(0)t)∥2]≤M0∥∇xf0(x)∥2+σ20,E[∥navg(x,ξi=(1…N)t