# FedSplit: An algorithmic framework for fast federated optimization

Motivated by federated learning, we consider the hub-and-spoke model of distributed optimization in which a central authority coordinates the computation of a solution among many agents while limiting communication. We first study some past procedures for federated optimization, and show that their fixed points need not correspond to stationary points of the original optimization problem, even in simple convex settings with deterministic updates. In order to remedy these issues, we introduce FedSplit, a class of algorithms based on operator splitting procedures for solving distributed convex minimization with additive structure. We prove that these procedures have the correct fixed points, corresponding to optima of the original optimization problem, and we characterize their convergence rates under different settings. Our theory shows that these methods are provably robust to inexact computation of intermediate local quantities. We complement our theory with some simple experiments that demonstrate the benefits of our methods in practice.


## 1 Introduction

Federated learning is a rapidly evolving application of distributed optimization for estimation and learning problems in large-scale networks of remote clients [13]. These systems present new challenges, as they are characterized by heterogeneity in computational resources and data across the network, unreliable communication, massive scale, and privacy constraints [16]. A typical application is for developers of cell phones and cellular applications to model the usage of software and devices across millions or even billions of users.

Distributed optimization has a rich history and extensive literature (e.g., see the sources [2, 5, 8, 31, 15, 24] and references therein), and federated learning has led to a flurry of interest in the area. A number of different procedures have been proposed for federated learning and related problems, using methods based on stochastic gradient methods or proximal procedures. Notably, McMahan et al. [18] introduced the FedSGD and FedAvg algorithms, which both adapt the classical stochastic gradient method to the federated setting, considering the possibility that clients may fail and may only be subsampled on each round of computation. Another recent proposal has been to use regularized local problems to mitigate possible issues that arise with device heterogeneity and failures [17]. These authors propose the FedProx procedure, an algorithm that applies averaged proximal updates to solve federated minimization problems.

The convergence theory and correctness of these methods is currently lacking, and practitioners have documented failures of convergence in certain settings (e.g., see Figure 3 and related discussion in the work [18]). Our first contribution in this paper is to analyze the deterministic analogues of these procedures, in which the gradient or proximal updates are performed using the full data at each client; such updates can be viewed as the idealized limit of a minibatch update based on the entire local dataset. Even in this especially favorable setting, we show that most versions of these algorithms fail to preserve the fixed points of the original optimization problem: that is, even if they converge, the resulting fixed points need not be stationary. Since the stochastic updates implemented in current practice are randomized versions of the underlying deterministic procedures, they also fail to preserve the correct fixed points in general.

In order to address this issue, we show how operator splitting techniques [5, 28, 7, 1] can be exploited to permit the development of provably correct and convergent procedures for solving federated problems. Concretely, we propose a new family of federated optimization algorithms, which we refer to as FedSplit. These procedures allow us to solve distributed convex minimization problems of the form

$$\text{minimize} \quad F(x) := \sum_{j=1}^{m} f_j(x), \qquad (1)$$

where $f_j : \mathbb{R}^d \to \mathbb{R}$ are cost functions that each client $j$ assigns to the optimization variable $x \in \mathbb{R}^d$. In machine learning applications, the vector $x$ is a parameter of a statistical model. In this paper, we focus on the case when the functions $f_j$ are finite convex functions with Lipschitz continuous gradients. While such problems are pervasive in data-fitting applications, this focus necessarily precludes the immediate application of our methods and guarantees to constrained, nonsmooth, and nonconvex problems. We leave the analysis and development of such methods to future work.

As previously mentioned, distributed optimization is not a new discipline, with work dating back to the 1970s and 1980s [2]. Over the past decade, there has been a resurgence of research on distributed optimization, specifically for learning problems [5, 8, 31, 15, 24]. This line of work builds upon even earlier study of distributed first- and second-order methods designed for optimization in the “data center” setting [2]. In these applications, the devices that carry out the computation are high performance computing clusters with computational resources that are well-known. This is in contrast to the federated setting, where the clients that carry out computation are cell phones or other computationally-constrained mobile devices for which carrying out expensive, exact computations of intermediate quantities may be unrealistic. Therefore, it is important to have methods that permit approximate computation, with varying levels of accuracy throughout the network of participating agents [4].

Our development makes use of a long line of work that adapts the theory of operator-splitting methods to distributed optimization [5, 28, 7, 1]. In particular, the FedSplit procedure developed in this paper is based upon an application of the Peaceman-Rachford splitting [21] to the distributed convex problem (1). This method and its variants have been studied extensively in the general setting of root-finding for the sum of two maximally monotone operators. Recent works have studied such splitting schemes for convex minimization under strong convexity and smoothness assumptions [11, 10, 19]. In this paper, we adapt this general theory to the specific setting of federated learning, and extend it to apply beyond strongly convex losses. Furthermore, we extend this previous work to the setting in which specific intermediate quantities—likely to dominate the computational cost of on-device training—are inexact and cheaply computed.

The remainder of this paper is organized as follows. We begin with a discussion of two previously proposed methods for solving federated optimization problems in Section 2. We show that these methods cannot have a generally applicable convergence theory, as they have fixed points that are not solutions to the federated optimization problems they are designed to solve. In Section 3, we present the FedSplit procedure, and demonstrate that, unlike some other methods in use, it has fixed points that do correspond to optimal solutions of the original federated optimization problem. After presenting convergence results, we present numerical experiments in Section 4. These experiments confirm our theoretical predictions and also demonstrate that our methods enjoy favorable scaling in the problem conditioning. Section 5 is devoted to the proofs of our results. We conclude in Section 6 with future directions suggested by the development in this paper.

### 1.1 Notation

For the reader’s convenience, we collect here our notational conventions.

##### Set and vector arithmetic:

Given vectors $x, y \in \mathbb{R}^d$, we use $\langle x, y \rangle$ to denote their Euclidean inner product, and $\|x\| := \sqrt{\langle x, x \rangle}$ to denote the Euclidean norm. Given two non-empty subsets $S, T \subseteq \mathbb{R}^d$, their Minkowski sum is given by $S + T := \{s + t \mid s \in S, \ t \in T\}$. We also set $x + T := \{x\} + T$ for any point $x \in \mathbb{R}^d$.

For an integer $m \geq 1$, we use the shorthand $[m] := \{1, 2, \ldots, m\}$. Given a block-partitioned vector $x = (x_1, \ldots, x_m) \in \mathbb{R}^{md}$ with $x_j \in \mathbb{R}^d$ for $j \in [m]$, we define the block averaged vector $\bar{x} := \frac{1}{m} \sum_{j=1}^{m} x_j$. Very occasionally, we also slightly abuse notation by defining arithmetic between vectors of dimensions with a common factor. For example, if $x \in \mathbb{R}^d$ and $y = (y_1, \ldots, y_m) \in \mathbb{R}^{md}$, then

$$x + y := (x + y_1, \ldots, x + y_m).$$
##### Regularity conditions:

A differentiable function $f : \mathbb{R}^d \to \mathbb{R}$ is said to be $\ell$-strongly convex if

$$f(y) \geq f(x) + \nabla f(x)^T (y - x) + \frac{\ell}{2} \|y - x\|^2, \quad \text{for all } x, y \in \mathbb{R}^d.$$

It is simply convex if this condition holds with $\ell = 0$. Similarly, a differentiable function $f$ is $L$-smooth if its gradient is $L$-Lipschitz continuous,

$$\|\nabla f(x) - \nabla f(y)\| \leq L \|x - y\|, \quad \text{for all } x, y \in \mathbb{R}^d.$$
##### Operator notation:

Given an operator $T : \mathbb{R}^d \to \mathbb{R}^d$ and a positive integer $n$, we use $T^n$ to denote the composition of $T$ with itself $n$ times—that is, $T^n$ is a new operator that acts on a given $x \in \mathbb{R}^d$ as $T^n(x) := (T \circ \cdots \circ T)(x)$. An operator $T$ is said to be monotone if

$$(Ty - Tx)^T (y - x) \geq 0 \quad \text{for all } x, y \in \mathbb{R}^d. \qquad (2)$$

## 2 Existing algorithms and their fixed points

Prior to proposing our own algorithms, let us discuss the fixed points of the deterministic analogues of some methods recently proposed for federated optimization problems (1). We focus our discussion on two recently proposed procedures—namely, FedSGD [18] and FedProx [17].

In understanding these and other algorithms, it is convenient to introduce the consensus reformulation of the distributed problem (1), which takes the form

$$\text{minimize} \quad F(x) := \sum_{j=1}^{m} f_j(x_j) \quad \text{subject to} \quad x_1 = x_2 = \cdots = x_m. \qquad (3)$$

Although this consensus formulation involves more variables, it is more amenable to the analysis of distributed procedures [5].

### 2.1 Federated gradient algorithms

The recently proposed FedSGD method [18] is based on a multi-step projected stochastic gradient method for solving the consensus problem. Given the iterates at iteration $t$, the method takes some number $e$ of stochastic gradient steps with respect to each loss $f_j$, and then passes the results to the coordinating agent, which computes an average to produce the next iterate in the sequence. When a single stochastic gradient step ($e = 1$) is taken between the averaging steps, this method can be seen as a variant of projected stochastic gradient descent for the consensus problem (3), and by classical theory of convex optimization it enjoys convergence guarantees [14]. On the other hand, when the number of epochs $e$ is strictly larger than 1, it is unclear a priori whether the method retains the same guarantees.

As we discuss here, even without the additional inaccuracies introduced by using stochastic approximations to the local gradients, FedSGD with $e > 1$ will not converge to minima in general. More precisely, let us consider the deterministic version of this method (which can be thought of as the idealized case obtained when the mini-batches at each device are taken to infinity). Given a stepsize $s > 0$, define the gradient mappings

$$G_j(x) := x - s \nabla f_j(x) \quad \text{for } j = 1, \ldots, m. \qquad (4)$$

For a given integer $e \geq 1$, we define the $e$-fold composition

$$G_j^e(x) := (\underbrace{G_j \circ \cdots \circ G_j}_{e \text{ times}})(x), \qquad (5)$$

corresponding to taking $e$ gradient steps from a given point $x$. We also define $G_j^0$ to be the identity operator—that is, $G_j^0(x) = x$ for all $x$.

In terms of these operators, we can define a family of FedGD algorithms, with each algorithm parameterized by a choice of stepsize $s > 0$ and number of gradient rounds $e \geq 1$. Given an initialization $x^{(1)} \in \mathbb{R}^{md}$, the algorithm performs the following updates for $t = 1, 2, \ldots$:

$$x_j^{(t+1/2)} := G_j^e\big(x_j^{(t)}\big), \quad \text{for } j \in [m] := \{1, 2, \ldots, m\}, \quad \text{and} \qquad (6a)$$

$$x_j^{(t+1)} := \overline{x^{(t+1/2)}}, \quad \text{for } j \in [m], \qquad (6b)$$

where the reader should recall that $\overline{x^{(t+1/2)}} = \frac{1}{m} \sum_{j=1}^{m} x_j^{(t+1/2)}$ is the block average.
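The updates (6a) and (6b) are straightforward to sketch in code. The following is a minimal NumPy illustration of the deterministic FedGD recursion, with names and the toy losses of our own choosing (it is not the paper's implementation):

```python
import numpy as np

def fedgd(grads, x0, s=0.01, e=5, T=1000):
    """Deterministic FedGD: each of m devices takes e local gradient
    steps of size s (the operator G_j^e), then the server averages.

    grads : list of m gradient functions, grads[j](x) -> ndarray
    x0    : initial point, shared by all devices
    """
    m = len(grads)
    x = [x0.copy() for _ in range(m)]
    for _ in range(T):
        # local updates, equation (6a): e gradient steps per device
        for j in range(m):
            for _ in range(e):
                x[j] = x[j] - s * grads[j](x[j])
        # global averaging step, equation (6b)
        xbar = sum(x) / m
        x = [xbar.copy() for _ in range(m)]
    return x[0]
```

On two scalar quadratics $f_1(x) = \tfrac{1}{2}x^2$ and $f_2(x) = 2(x-1)^2$, whose true joint minimizer is $0.8$, running this sketch with $e = 1$ recovers $0.8$, while $e = 3$ converges to a nearby but different point (approximately $0.74$), consistent with the fixed point relation (7) below.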

###### Proposition 1.

For any $e \geq 1$ and $s > 0$, the sequence $\{x^{(t)}\}_{t \geq 1}$ generated by the FedGD algorithm in equation (6) has the following properties:

(a) If the sequence is convergent, then the local variables share a common limit $x^\star$, such that $x_j^{(t)} \to x^\star$ as $t \to \infty$ for each $j \in [m]$.

(b) Moreover, the limit $x^\star$ satisfies the fixed point relation

$$\sum_{i=1}^{e} \sum_{j=1}^{m} \nabla f_j\big(G_j^{i-1}(x^\star)\big) = 0. \qquad (7)$$

See Section 5.3.1 for the proof of this claim.

Unpacking this claim slightly, suppose first that $e = 1$, meaning that a single gradient update is performed at each device between the global averaging steps. In this case, recalling that $G_j^0$ is the identity mapping, the relation (7) reduces to $\sum_{j=1}^{m} \nabla f_j(x^\star) = 0$, so that if the sequence has a limit $x^\star$, it must satisfy the relations

$$x_1 = x_2 = \cdots = x_m \quad \text{and} \quad \sum_{j=1}^{m} \nabla f_j(x_j) = 0.$$

Consequently, provided that the losses $f_j$ are convex, Proposition 1 implies that the limit of the sequence $\{x^{(t)}\}$, when it exists, is a minimizer of the consensus problem (3).

On the other hand, when $e > 1$, a limit $x^\star$ of the iterate sequence must satisfy equation (7), which in general causes the method to have limit points that are not minimizers of the consensus problem. For example, when $e = 2$, a fixed point $x^\star$ satisfies the condition

$$\sum_{j=1}^{m} \Big\{ \nabla f_j(x^\star) + \nabla f_j\big(x^\star - s \nabla f_j(x^\star)\big) \Big\} = 0.$$

This is not equivalent to $x^\star$ being a minimizer of the distributed problem (1) or its consensus reformulation (3), in general.

It is worth noting a very special case in which FedGD does preserve the correct fixed points, even when $e > 1$. In particular, suppose that all of the local cost functions share a common minimizer $x^\star$, so that $\nabla f_j(x^\star) = 0$ for $j \in [m]$. Under this assumption, we have $G_j(x^\star) = x^\star$ for all $j$, and hence by arguing inductively, we have $G_j^i(x^\star) = x^\star$ for all $i \geq 0$. Consequently, the fixed point relation (7) reduces to

$$\sum_{i=1}^{e} \sum_{j=1}^{m} \nabla f_j(x^\star) = 0,$$

showing that $x^\star$ is optimal for the original federated problem. However, the assumption that all the local cost functions share a common optimum $x^\star$, either exactly or approximately, is not realistic in practice. In fact, if this assumption were to hold in practice, then there would be little point in sharing data between devices by solving the federated learning problem.

Returning to the general setting in which the fixed points need not be preserved, let us make our observation concrete by specializing the discussion to a simple class of distributed least squares problems.

##### Incorrectness for least squares problems:

For $j \in [m]$, suppose that we are given a design matrix $A_j \in \mathbb{R}^{n_j \times d}$ and a response vector $b_j \in \mathbb{R}^{n_j}$ associated with a linear regression problem (so that our goal is to find a weight vector $x \in \mathbb{R}^d$ such that $A_j x \approx b_j$). The least squares regression problem defined by all the devices takes the form

$$\text{minimize} \quad F(x) := \frac{1}{2} \sum_{j=1}^{m} \|A_j x - b_j\|^2. \qquad (8)$$

This problem is a special case of our general problem (1) with the choices

$$f_j(x) = \frac{1}{2} \|A_j x - b_j\|^2, \quad j = 1, \ldots, m.$$

Note that these functions are convex and differentiable. For simplicity, let us assume that the problem is nondegenerate, meaning that the design matrices $A_j$ have full rank. In this case, the solution to this problem is unique, given by

$$x^\star_{\mathrm{ls}} = \Big( \sum_{j=1}^{m} A_j^T A_j \Big)^{-1} \sum_{j=1}^{m} A_j^T b_j. \qquad (9)$$

Now suppose that we apply the FedGD procedure to the least-squares problem (8). Some straightforward calculations yield

$$\nabla f_j(x) = A_j^T (A_j x - b_j) \quad \text{and} \quad G_j(x) = (I - s A_j^T A_j) x + s A_j^T b_j, \quad j = 1, \ldots, m.$$

Thus, in order to guarantee that the sequence $\{x^{(t)}\}$ converges, it suffices to choose the stepsize $s > 0$ small enough so that $s \, \sigma_{\max}^2(A_j) < 1$ for $j \in [m]$, where $\sigma_{\max}(\cdot)$ denotes the maximum singular value of a matrix. Given the structure of the least-squares problem, the iterated operator $G_j^k$ takes on a special form—namely:

$$G_j^k(x) = (I - s A_j^T A_j)^k x + s \Big( \sum_{\ell=0}^{k-1} (I - s A_j^T A_j)^\ell \Big) A_j^T b_j = (I - s A_j^T A_j)^k x + (A_j^T A_j)^{-1} \big( I - (I - s A_j^T A_j)^k \big) A_j^T b_j.$$

Hence, we conclude that if the sequence generated by the federated gradient recursion (6a) and (6b) converges for the least squares problem (8), then the limit takes the form

$$x^\star_{\mathrm{FedGD}} = \Big( \sum_{j=1}^{m} A_j^T A_j \Big\{ \sum_{k=0}^{e-1} (I - s A_j^T A_j)^k \Big\} \Big)^{-1} \Big( \sum_{j=1}^{m} \Big\{ \sum_{k=0}^{e-1} (I - s A_j^T A_j)^k \Big\} A_j^T b_j \Big). \qquad (10)$$

Comparing this with the optimal solution $x^\star_{\mathrm{ls}}$ from equation (9), we see that, as previously mentioned, when $e = 1$ the federated solution agrees with the optimal solution—that is, $x^\star_{\mathrm{FedGD}} = x^\star_{\mathrm{ls}}$. However, when using a number of epochs $e > 1$ and a number of devices $m > 1$, the fact that the coefficients in braces in display (10) are nontrivial implies that in general $x^\star_{\mathrm{FedGD}} \neq x^\star_{\mathrm{ls}}$. Thus, in this setting, federated gradient methods do not actually have the correct fixed points, even in the idealized deterministic limit of full mini-batches. See Section 2.3 for numerical results that confirm this observation.
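The gap between the two closed forms (9) and (10) can be checked directly. The following sketch evaluates both on a small random instance of our own construction (the dimensions and seed are illustrative, not those of the paper's experiments):

```python
import numpy as np

rng = np.random.default_rng(0)
m, d, n, s, e = 3, 5, 20, 1e-3, 5
A = [rng.standard_normal((n, d)) for _ in range(m)]
b = [rng.standard_normal(n) for _ in range(m)]

# optimal least-squares solution, equation (9)
x_ls = np.linalg.solve(sum(Aj.T @ Aj for Aj in A),
                       sum(Aj.T @ bj for Aj, bj in zip(A, b)))

def geom_sum(Aj):
    # S_j = sum_{k=0}^{e-1} (I - s A_j^T A_j)^k, the bracketed factor in (10)
    M = np.eye(d) - s * Aj.T @ Aj
    S, P = np.eye(d), np.eye(d)
    for _ in range(e - 1):
        P = P @ M
        S = S + P
    return S

lhs = sum(Aj.T @ Aj @ geom_sum(Aj) for Aj in A)
rhs = sum(geom_sum(Aj) @ Aj.T @ bj for Aj, bj in zip(A, b))
x_fedgd = np.linalg.solve(lhs, rhs)   # FedGD fixed point, equation (10)
gap = np.linalg.norm(x_fedgd - x_ls)  # strictly positive whenever e > 1
```

With $e = 1$, `geom_sum` returns the identity and the two solutions coincide; for $e > 1$, the device-dependent weights $S_j$ produce a nonzero gap.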

### 2.2 Federated proximal algorithms

Another recently proposed algorithm is FedProx [17], which can be seen as a distributed method loosely based on the classical proximal point method [26]. Let us begin by recalling some classical facts about proximal operators and the Moreau envelope; see Rockafellar [25] for more details. For a given stepsize $s > 0$, the proximal operator of a function $f : \mathbb{R}^d \to \mathbb{R}$ is given by

$$\mathrm{prox}_{sf}(z) := \operatorname*{argmin}_{x \in \mathbb{R}^d} \Big\{ f(x) + \frac{1}{2s} \|z - x\|^2 \Big\}. \qquad (11)$$

It is a regularized minimization of $f$ around $z$. The interpretation of the parameter $s$ as a stepsize remains appropriate in this context: as the stepsize grows, the penalty for moving away from $z$ decreases, and thus the proximal update will be farther away from $z$. When $f$ is convex, the existence of such a (unique) minimizer follows immediately, and in this context the regularized problem itself carries importance:

$$M_{sf}(z) := \inf_{x \in \mathbb{R}^d} \Big\{ f(x) + \frac{1}{2s} \|z - x\|^2 \Big\}.$$

This function is known as the Moreau envelope of $f$ with parameter $s$ [27, 20, chap. 1.G].
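For a quadratic $f$, the proximal operator (11) has a closed form, and the standard identity $\nabla M_{sf}(z) = \frac{1}{s}(z - \mathrm{prox}_{sf}(z))$ (used in Proposition 2 below) can be verified numerically. The following is a small self-contained check on an instance of our own construction:

```python
import numpy as np

rng = np.random.default_rng(1)
d, s = 4, 0.5
A = rng.standard_normal((6, d))
b = rng.standard_normal(6)

def f(x):
    return 0.5 * np.sum((A @ x - b) ** 2)

def prox(z):
    # closed-form prox_{sf}(z) for f(x) = 0.5*||Ax - b||^2:
    # stationarity of (11) gives (I + s A^T A) x = z + s A^T b
    return np.linalg.solve(np.eye(d) + s * A.T @ A, z + s * A.T @ b)

def envelope(z):
    # Moreau envelope M_{sf}(z), with the infimum attained at prox(z)
    x = prox(z)
    return f(x) + np.sum((z - x) ** 2) / (2 * s)

z = rng.standard_normal(d)
g_exact = (z - prox(z)) / s          # gradient via the standard identity
eps = 1e-6                           # central finite differences
g_numeric = np.array([
    (envelope(z + eps * e_i) - envelope(z - eps * e_i)) / (2 * eps)
    for e_i in np.eye(d)
])
```

The finite-difference gradient of the envelope matches the closed-form expression to high precision.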

With these definitions in place, we can now study the behavior of the FedProx method [17]. In order to bring the relevant issues into sharp focus, let us consider a simplified deterministic version of FedProx, in which we remove any inaccuracies introduced by stochastic approximations of the gradients (or subsampling of the devices). For a given initialization $x^{(1)} \in \mathbb{R}^{md}$, we perform the following steps for iterations $t = 1, 2, \ldots$:

$$x_j^{(t+1/2)} := \mathrm{prox}_{s f_j}\big(x_j^{(t)}\big), \quad \text{for } j \in [m], \quad \text{and} \qquad (12a)$$

$$x_j^{(t+1)} := \overline{x^{(t+1/2)}}, \quad \text{for } j \in [m]. \qquad (12b)$$

The following result characterizes the fixed points of this method:

###### Proposition 2.

For any stepsize $s > 0$, the sequence $\{x^{(t)}\}_{t \geq 1}$ generated by the FedProx algorithm (see equations (12a) and (12b)) has the following properties:

(a) If the sequence is convergent, then the local variables share a common limit $x^\star$, such that $x_j^{(t)} \to x^\star$ as $t \to \infty$ for each $j \in [m]$.

(b) The limit $x^\star$ satisfies the fixed point relation

$$\sum_{j=1}^{m} \nabla M_{s f_j}(x^\star) = 0. \qquad (13)$$

See Section 5.3.2 for the proof of this claim.

Hence, we see that the limit of this algorithm will typically be a zero of the sum of the gradients of the Moreau envelopes $M_{sf_j}$, rather than a zero of the sum of the gradients of the functions $f_j$ themselves. In general, these two fixed point relations are different.

As with federated gradient schemes, one very special case in which FedProx preserves the correct fixed points is when the cost functions at all devices share a common minimizer $x^\star$. Under this assumption, the vector $x^\star$ satisfies relation (13), because the minimizers of $f_j$ and its Moreau envelope $M_{sf_j}$ coincide, and hence we have $\nabla M_{sf_j}(x^\star) = 0$ for all $j \in [m]$. Thus, under strong regularity assumptions about the shared structure of the device costs, it is possible to provide theoretical guarantees for FedProx. However, as noted previously, the assumption that the cost functions all share a common optimum $x^\star$, either exactly or approximately, is not realistic in practice. In contrast, the FedSplit algorithm to be described in the next section retains correct fixed points in this setting without any such additional assumptions.

##### Incorrectness for least squares problems:

In order to illustrate the fixed point relation (13) from Proposition 2 in a concrete setting, let us return to our running example of least squares regression. In this setting, recall that $f_j(x) = \frac{1}{2} \|A_j x - b_j\|^2$. Thus, we see that for any $s > 0$, we have

$$\nabla M_{s f_j}(x) = \frac{1}{s} \Big( x - (I + s A_j^T A_j)^{-1} (x + s A_j^T b_j) \Big), \quad j = 1, \ldots, m.$$

Thus, according to Proposition 2, limits of the federated proximal recursion given by (12a) and (12b) have the form

$$x^\star_{\mathrm{FedProx}} = \Big( \sum_{j=1}^{m} \big\{ I - (I + s A_j^T A_j)^{-1} \big\} \Big)^{-1} \Big( \sum_{j=1}^{m} \big( A_j^T A_j + (1/s) I \big)^{-1} A_j^T b_j \Big).$$

Hence, comparing with $x^\star_{\mathrm{ls}}$ as defined in equation (9), in general we will have $x^\star_{\mathrm{FedProx}} \neq x^\star_{\mathrm{ls}}$. See Section 2.3 for numerical results that confirm this observation.
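The mismatch is easy to reproduce by running the deterministic recursion (12a)-(12b) directly on a small random least squares instance of our own construction (closed-form quadratic prox, names illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
m, d, n, s = 3, 4, 12, 1.0
A = [rng.standard_normal((n, d)) for _ in range(m)]
b = [rng.standard_normal(n) for _ in range(m)]

def prox_j(j, z):
    # exact prox_{s f_j} for f_j(x) = 0.5*||A_j x - b_j||^2
    return np.linalg.solve(np.eye(d) + s * A[j].T @ A[j],
                           z + s * A[j].T @ b[j])

# deterministic FedProx (12a)-(12b): local prox steps, then global averaging
x = np.zeros(d)
for _ in range(1000):
    x = sum(prox_j(j, x) for j in range(m)) / m

# the true least-squares solution, equation (9)
x_ls = np.linalg.solve(sum(Aj.T @ Aj for Aj in A),
                       sum(Aj.T @ bj for Aj, bj in zip(A, b)))
```

The iteration converges (it is an average of contractive affine maps), but its limit is separated from $x^\star_{\mathrm{ls}}$ by a clearly nonzero gap.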

### 2.3 Illustrative simulation

It is instructive to perform a simple numerical experiment to see that even in the simplest deterministic setting considered here, the FedProx [17] and FedSGD [18] procedures, as specified in Sections 2.2 and 2.1 respectively, need not converge to the minimizer of the original function .

For the purposes of this illustration, we simulate an instance of our running least squares example. Suppose that for each device $j \in [m]$, the response vector $b_j$ is related to the design matrix $A_j$ via the standard linear model

$$b_j = A_j x_0 + v_j,$$

where $x_0 \in \mathbb{R}^d$ is the unknown parameter vector to be estimated, and the noise vectors $v_j$ are independently distributed as $v_j \sim N(0, \sigma^2 I)$ for some $\sigma > 0$. For our experiments reported here, we constructed a random instance of such a problem with

$$m = 25, \quad d = 100, \quad n_j \equiv 500, \quad \text{and} \quad \sigma^2 = 0.25.$$

We generated the design matrices $A_j$ with i.i.d. entries of the form $(A_j)_{ik} \sim N(0, 1)$, for $i \in [n_j]$ and $k \in [d]$. The aspect ratios satisfy $n_j > d$ for all $j \in [m]$; thus, by construction, the matrices $A_j$ are full rank with probability 1.

Figure 1 shows the results of applying the (deterministic) versions of FedProx and FedSGD, with varying numbers of local epochs $e$, to the least squares minimization problem (8). As expected, we see that FedProx and multi-step, deterministic FedSGD fail to converge to the correct fixed point for this problem. Although the presented deterministic variant of FedSGD converges to the optimal solution when a single local gradient step is taken between communication rounds (i.e., when $e = 1$), it fails to do so as soon as $e > 1$.

## 3 A splitting framework and convergence guarantees

We now turn to the description of a framework that allows us to provide a clean characterization of the fixed points of iterative algorithms and to propose algorithms with convergence guarantees. Throughout our development, we assume that each function $f_j$ is convex and differentiable.

### 3.1 An operator-theoretic view

We begin by recalling the consensus formulation (3) of the problem, stated in terms of a block-partitioned vector $x = (x_1, \ldots, x_m) \in \mathbb{R}^{md}$, the function $F$ given by $F(x) := \sum_{j=1}^{m} f_j(x_j)$, and the constraint set $E := \{x \in \mathbb{R}^{md} \mid x_1 = x_2 = \cdots = x_m\}$, which is the feasible subspace for problem (3). By appealing to the first-order optimality conditions for problem (3), it is equivalent to find a vector $x \in E$ such that $-\nabla F(x)$ belongs to the normal cone of the constraint set $E$. Equivalently, if we define a set-valued operator $N_E$ as

$$N_E(x) := \begin{cases} E^\perp, & x_1 = x_2 = \cdots = x_m, \\ \emptyset, & \text{otherwise}, \end{cases} \qquad (14)$$

then it is equivalent to find a vector $x$ that satisfies the inclusion condition

$$0 \in \nabla F(x) + N_E(x), \qquad (15)$$

where $\nabla F(x) = \big( \nabla f_1(x_1), \ldots, \nabla f_m(x_m) \big)$.

When the loss functions $f_j$ are convex, both $\nabla F$ and $N_E$ are monotone operators on $\mathbb{R}^{md}$, as defined in equation (2). Thus, the display (15) is a monotone inclusion problem. Methods for solving monotone inclusions have a long history of study within the applied mathematics and optimization literatures [28, 7]. We now use this framework to develop and analyze algorithms for solving the federated problems of interest.

### 3.2 Splitting procedures for federated optimization

As discussed above, the original distributed minimization problem can be reduced to finding a vector that satisfies the monotone inclusion (15). We now describe a method, derived from splitting the inclusion relation, whose fixed points do correspond with global minima of the distributed problem. It is an instantiation of the Peaceman-Rachford splitting, which we refer to as the FedSplit algorithm in this distributed setting.

• **Algorithm 1** [FedSplit] Splitting scheme for solving federated problems of the form (1). Given initialization $z^{(1)} \in \mathbb{R}^{md}$ and proximal solvers $\mathrm{prox\_update}_j$, initialize $\bar{x}^{(1)} = \bar{z}^{(1)}$; for $t = 1, 2, \ldots$:
  1. for $j \in [m]$:
     a. Local prox step: set $z_j^{(t+1/2)} = \mathrm{prox\_update}_j\big( 2 \bar{x}^{(t)} - z_j^{(t)} \big)$
     b. Local centering step: set $z_j^{(t+1)} = z_j^{(t)} + 2 \big( z_j^{(t+1/2)} - \bar{x}^{(t)} \big)$
     end for
  2. Compute global average: set $\bar{x}^{(t+1)} = \bar{z}^{(t+1)} = \frac{1}{m} \sum_{j=1}^{m} z_j^{(t+1)}$.
end for

As laid out in Algorithm 1, at each time $t$, the FedSplit procedure maintains and updates a parameter vector $z_j^{(t)}$ for each device $j$. The central server maintains a parameter vector $\bar{x}^{(t)}$, which is the average of the parameter estimates at the individual machines.

The local update at device $j$ is defined in terms of a proximal solver $\mathrm{prox\_update}_j$. In the ideal setting, this proximal solver corresponds to an exact evaluation of the proximal operator $\mathrm{prox}_{s f_j}$ for some stepsize $s > 0$. However, in practice, these proximal operators will not be evaluated exactly, so it is convenient to state the algorithm more generally in terms of proximal solvers with the property that

$$\mathrm{prox\_update}_j(x) \approx \mathrm{prox}_{s f_j}(x), \quad \text{for all } x \in \mathbb{R}^d,$$

for a suitably chosen stepsize $s > 0$. We make the sense of this approximation precise in Section 3.3, where we give convergence results under access to both exact and approximate proximal oracles.
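The structure of Algorithm 1 can be sketched compactly. The following is a minimal NumPy rendering with generic per-device proximal solvers (the function names and the toy least squares check are our own, not the paper's implementation):

```python
import numpy as np

def fedsplit(prox_update, z0, T=1500):
    """FedSplit (Algorithm 1): local prox + centering steps, then averaging.

    prox_update : list of m callables, prox_update[j](x) ~ prox_{s f_j}(x)
    z0          : list of m initial local vectors z_j^{(1)}
    """
    z = [zj.copy() for zj in z0]
    for _ in range(T):
        xbar = sum(z) / len(z)                  # server: global average
        for j, pj in enumerate(prox_update):
            half = pj(2 * xbar - z[j])          # local prox step
            z[j] = z[j] + 2 * (half - xbar)     # local centering step
    return sum(z) / len(z)

# sanity check on a small least squares instance, using exact prox solvers
rng = np.random.default_rng(3)
m, d, n, s = 3, 4, 12, 0.5
A = [rng.standard_normal((n, d)) for _ in range(m)]
b = [rng.standard_normal(n) for _ in range(m)]
prox_ops = [
    # exact prox_{s f_j} for f_j(x) = 0.5*||A_j x - b_j||^2
    (lambda x, Aj=Aj, bj=bj: np.linalg.solve(np.eye(d) + s * Aj.T @ Aj,
                                             x + s * Aj.T @ bj))
    for Aj, bj in zip(A, b)
]
x_hat = fedsplit(prox_ops, [np.zeros(d) for _ in range(m)], T=1500)
```

Unlike the FedGD and FedProx sketches earlier, this iteration recovers the true least squares solution (9), in line with Proposition 3 below.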

An immediate advantage to the scheme above is that it preserves the correct fixed points for the distributed problem:

###### Proposition 3.

Given any $s > 0$, suppose that $z^\star = (z_1^\star, \ldots, z_m^\star)$ is a fixed point for the FedSplit procedure (Algorithm 1), meaning that

$$z_j^\star = z_j^\star + 2 \Big( \mathrm{prox}_{s f_j}\big( 2 \bar{z}^\star - z_j^\star \big) - \bar{z}^\star \Big), \quad \text{for all } j \in [m]. \qquad (16)$$

Then the average $x^\star := \bar{z}^\star$ is an optimal solution to the original distributed problem—that is,

$$\sum_{j=1}^{m} f_j(x^\star) = \inf_{x \in \mathbb{R}^d} \sum_{j=1}^{m} f_j(x).$$

See Section 5.2 for the proof of this claim.

Note that Proposition 3 does not say anything about the convergence of the FedSplit scheme. Instead, it merely guarantees that if the iterates of the method converge, then they converge to optimal solutions of the problem being solved. This is to be contrasted with Propositions 1 and 2, which show the incorrectness of other proposed algorithms. The focus of the next section is to derive conditions under which we can guarantee convergence of the FedSplit scheme.

### 3.3 Convergence results

In this section, we give convergence guarantees for the FedSplit procedure (Algorithm 1). By appealing to classical first-order convex optimization theory, we are also able to give iteration complexities under various proximal operator implementations.

#### 3.3.1 Strongly convex and smooth losses

We begin by considering the case when the losses $f_j$ are $\ell_j$-strongly convex and $L_j$-smooth. We define the quantities

$$\ell_* := \min_{j=1,\ldots,m} \ell_j, \quad L^* := \max_{j=1,\ldots,m} L_j, \quad \text{and} \quad \kappa := \frac{L^*}{\ell_*}. \qquad (17)$$

Note that $\ell_*$ corresponds to the smallest strong convexity parameter, $L^*$ to the largest smoothness parameter, and $\kappa$ to the induced condition number of such a problem.

The following result demonstrates that in this setting, our method enjoys geometric convergence to the optimum, even with inexact proximal implementations.

###### Theorem 1.

Suppose that the local proximal updates of Algorithm 1 (Step 1a) are possibly inexact, with errors bounded as

$$\big\| \mathrm{prox\_update}_j(z) - \mathrm{prox}_{s f_j}(z) \big\| \leq b \quad \text{for all } j \in [m] \text{ and all } z \in \mathbb{R}^d. \qquad (18)$$

Then for any initialization $z^{(1)} \in \mathbb{R}^{md}$, the FedSplit algorithm with stepsize $s = 1/\sqrt{\ell_* L^*}$ satisfies the bound

$$\|x^{(t+1)} - x^\star\| \leq \Big( 1 - \frac{2}{\sqrt{\kappa} + 1} \Big)^t \frac{\|z^{(1)} - z^\star\|}{\sqrt{m}} + (\sqrt{\kappa} + 1) \, b, \quad \text{for all } t = 1, 2, \ldots. \qquad (19)$$

We prove Theorem 1 in Section 5.1 as a consequence of a more general result that allows for different proximal evaluation error at each round, as opposed to the uniform bound (18) assumed here.

##### Exact proximal evaluations:

In the special (albeit unrealistic) case when the proximal evaluations are exact, the uniform bound (18) holds with $b = 0$, and the bound (19) simplifies to

$$\|x^{(t+1)} - x^\star\| \leq \Big( 1 - \frac{2}{\sqrt{\kappa} + 1} \Big)^t \frac{\|z^{(1)} - z^\star\|}{\sqrt{m}}.$$

Consequently, given some initial vector $z^{(1)}$, if we want to obtain a solution that is $\varepsilon$-accurate (i.e., with $\|x^{(t+1)} - x^\star\| \leq \varepsilon$), it suffices to take

$$T(\varepsilon, \kappa) = c \sqrt{\kappa} \, \log\Big( \frac{\|z^{(1)} - z^\star\|}{\varepsilon \sqrt{m}} \Big)$$

iterations of the overall procedure, where $c > 0$ is a universal constant.

In practice, the FedSplit algorithm will be implemented using an approximate prox-solver; here we consider doing so by using a gradient method on each device $j$. Recall that the proximal update at device $j$ at round $t$ takes the form:

$$\mathrm{prox}_{s f_j}\big( x_j^{(t)} \big) = \operatorname*{argmin}_{u \in \mathbb{R}^d} \underbrace{\Big\{ s f_j(u) + \frac{1}{2} \big\| u - x_j^{(t)} \big\|_2^2 \Big\}}_{=: \, h_j(u)}.$$

A natural way to compute an approximate minimizer is to run $e$ rounds of gradient descent on the function $h_j$. (To be clear, this is not the same as running multiple rounds of gradient descent on $f_j$, as in the FedGD procedure.) Concretely, at round $t$, we initialize the gradient method at the point $x_j^{(t)}$ and run gradient descent on $h_j$ with a stepsize $\alpha > 0$, thereby generating the sequence

$$u^{(t+1)} = u^{(t)} - \alpha \nabla h_j\big( u^{(t)} \big) = u^{(t)} - \alpha \Big( s \nabla f_j\big( u^{(t)} \big) + \big( u^{(t)} - x_j^{(t)} \big) \Big). \qquad (20)$$

We define $\mathrm{prox\_update}_j\big( x_j^{(t)} \big)$ to be the output of this procedure after $e$ steps.
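The inner loop (20) is a few lines of code. Below is an illustrative sketch (names are ours); the accompanying check compares it against the closed-form quadratic prox, confirming that the approximation error decays geometrically in $e$, as Corollary 1 predicts:

```python
import numpy as np

def approx_prox(grad_f, x, s, alpha, e):
    """Approximate prox_{s f}(x): run e gradient-descent steps on
    h(u) = s*f(u) + 0.5*||u - x||^2, started at u = x (equation (20))."""
    u = x.copy()
    for _ in range(e):
        u = u - alpha * (s * grad_f(u) + (u - x))
    return u
```

For a quadratic $f_j$, the exact prox is available in closed form, so the per-round error in the bound (18) can be measured directly.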

###### Corollary 1 (FedSplit convergence with inexact proximal updates).

Consider the FedSplit procedure run with the proximal stepsize $s$ of Theorem 1, using approximate proximal updates based on $e$ rounds of gradient descent, initialized (in round $t$) at the previous iterate $x_j^{(t)}$. Then the bound (18) holds at round $t$ with error at most

$$b \leq \Big( 1 - \frac{1}{\sqrt{\kappa} + 1} \Big)^e \big\| x_j^{(t)} - \mathrm{prox}_{s f_j}\big( x_j^{(t)} \big) \big\|_2. \qquad (21)$$

Given the exponential decay in the number of rounds $e$ exhibited in the bound (21), in practice it suffices to take a relatively small number of gradient steps. For instance, in our experiments reported in Section 4, we find that a handful of local gradient steps per round suffices to track the evolution of the algorithm with exact proximal updates up to relatively high precision.

It should be noted that the guarantees provided in Theorem 1 and Corollary 1 both depend on stepsize choices that involve knowledge of the smoothness parameter $L^*$ and/or the strong convexity parameter $\ell_*$, as defined in equation (17). With reference to the gradient updates in Corollary 1, we can adapt standard theory (e.g., [6]) to show that if the gradient stepsize $\alpha$ were chosen with a backtracking line search, we would obtain the same error bound (21), up to a multiplicative pre-factor. As for the proximal stepsize $s$, we are not currently aware of standard procedures for setting it that are guaranteed to preserve the convergence bound of Theorem 1. However, we believe that this should be possible, and this is an interesting direction for future research.

#### 3.3.2 Smooth but not strongly convex losses

We now consider the case when the losses $f_j$ are $L_j$-smooth and convex, but not necessarily strongly convex. Given these assumptions, the consensus objective $F$ is an $L^*$-smooth function on the product space $\mathbb{R}^{md}$. So as to avoid degeneracies, we assume that the federated objective is bounded below and achieves its minimum.

Our approach to solving such a problem is to apply the FedSplit procedure to a suitably regularized version of the original problem. More precisely, given some initial vector $x^{(1)} \in \mathbb{R}^d$ and a regularization parameter $\lambda > 0$, let us define the function

$$F_\lambda(z) := \sum_{j=1}^{m} \Big\{ f_j(z_j) + \frac{\lambda}{2} \big\| z_j - x^{(1)} \big\|^2 \Big\}. \qquad (22)$$

We see that $F_\lambda$ is a $\lambda$-strongly convex and $(L^* + \lambda)$-smooth function. The next result shows that for any $\varepsilon > 0$, minimizing the function $F_\lambda$ up to an error of order $\varepsilon$, using a carefully chosen $\lambda$, yields an $\varepsilon$-cost-suboptimal minimizer of the original objective function $F$.

###### Theorem 2.

Given some $\varepsilon > 0$ and any initialization $x^{(1)} \in \mathbb{R}^d$, suppose that we run the FedSplit procedure (Algorithm 1) on the regularized objective $F_\lambda$, using exact prox steps with a suitably chosen stepsize. Then the FedSplit algorithm outputs a vector $\hat{x}$ that is $\varepsilon$-cost-suboptimal for $F$ after at most

$$\widetilde{O}\Bigg( \sqrt{\frac{L^* \, \|x^{(1)} - x^\star\|^2}{\varepsilon}} \Bigg) \qquad (23)$$

iterations.

See Section 5.2.4 for the proof of this result.

To be clear, the notation $\widetilde{O}(\cdot)$ in the bound (23) denotes the presence of constant and polylogarithmic factors that are not dominant.

## 4 Experiments

In this section, we present some simple numerical results for FedSplit. We begin by presenting results for least squares problems in Section 4.1, followed by logistic regression in Section 4.2. The section concludes with a comparison of the performance of FedSplit versus federated gradient procedures in Section 4.3. All of the experiments here were conducted on a machine running Mac OS 10.14.5, with a 2.6 GHz Intel Core i7 processor, in Python 3.7.3. In order to implement the proximal operators for the logistic regression experiments, we used CVXPY, a modeling language for disciplined convex programs [9].

### 4.1 Least squares

As an initial object of study, we consider the least squares problem (8). In order to have full control over the conditioning of the problem, we consider instances defined by randomly generated datasets. Given a collection of design matrices $A_j$ for $j \in [m]$, we generate random response vectors according to a linear measurement model

$$b_j = A_j x_0 + v_j, \quad \text{for } j \in [m] = \{1, \ldots, m\},$$

where the noise vector $v_j$ is generated as $v_j \sim N(0, \sigma^2 I)$. For this experiment, we set

$$d = 500, \quad m = 25, \quad n_j \equiv 5000, \quad \text{and} \quad \sigma^2 = 0.25.$$

We also generate random versions of the design matrices $A_j$, from one of two possible ensembles:

• Isotropic ensemble: each design matrix is generated with i.i.d. entries $(A_j)_{ik} \sim N(0, 1)$, for all $i \in [n_j]$ and $k \in [d]$. In the regime $n_j \gg d$ considered here, known results in non-asymptotic random matrix theory (e.g., [29]) guarantee that the matrix $A_j$ will be well-conditioned.

• Spiked ensemble: in order to illustrate how algorithms depend on conditioning, we also generate design matrices according to the procedure described in Section 4.3, which yields a problem with a much larger condition number.

Finally, we construct a random parameter vector $x_0 \in \mathbb{R}^d$ by sampling from the standard Gaussian distribution $N(0, I_d)$.

We solve this federated problem using various methods: the FedGD procedure (6) with $e = 1$, which is the only setting guaranteed to preserve the fixed points; the exact form of the FedSplit procedure, in which the proximal updates are computed exactly; and inexact versions of the FedSplit procedure that use the gradient method (see Corollary 1) to approximate the proximal updates, with a constant number of gradient rounds per machine.

Figure 2 shows the results of these experiments, plotting the log optimality gap versus the iteration number for these algorithms; see the caption for discussion of the behavior.

### 4.2 Logistic regression

Moving beyond the setting of least squares regression, we now explore the behavior of various algorithms for solving federated binary classification. In this problem, we again have fixed design matrices $A_j \in \mathbb{R}^{n_j \times d}$, but the response vectors take the form of label vectors $b_j \in \{-1, 1\}^{n_j}$. Here the rows of $A_j$, denoted by $a_{ij} \in \mathbb{R}^d$ for $i \in [n_j]$, are collections of features, each associated with a class label $b_{ij}$, an entry of $b_j$. We assume that for $j \in [m]$ and an unknown parameter vector $x_0 \in \mathbb{R}^d$, the conditional probability of observing a positive class label is given by

$$P\{b_{ij} = 1\} = \frac{e^{a_{ij}^T x_0}}{1 + e^{a_{ij}^T x_0}}, \quad \text{for } i = 1, \ldots, n_j. \qquad (24)$$

Given observations of this form, the maximum likelihood estimate for is then a solution to the convex program

$$\text{minimize} \quad \sum_{j=1}^{m} \sum_{i=1}^{n_j} \log\big( 1 + e^{-b_{ij} a_{ij}^T x} \big), \qquad (25)$$

with variable $x \in \mathbb{R}^d$. This problem is referred to as logistic regression.

Since the function $t \mapsto \log(1 + e^{-t})$ has a bounded, positive second derivative, it is straightforward to verify that the objective function in problem (25) is smooth and convex. The local cost functions are given by the corresponding sums over the local datasets—that is,

$$f_j(x) := \sum_{i=1}^{n_j} \log\big( 1 + e^{-b_{ij} a_{ij}^T x} \big).$$