# Distributed Fixed Point Methods with Compressed Iterates

We propose basic and natural assumptions under which iterative optimization methods with compressed iterates can be analyzed. This problem is motivated by the practice of federated learning, where a large model stored in the cloud is compressed before it is sent to a mobile device, which then proceeds with training based on local data. We develop standard and variance reduced methods, and establish communication complexity bounds. Our algorithms are the first distributed methods with compressed iterates, and the first fixed point methods with compressed iterates.


### 1 Introduction

Communication efficiency and memory issues often arise in machine learning algorithms dealing with large models. This is especially true in federated learning, where a network of computing agents is required to jointly solve a machine learning task [Konecny16, caldas2018expanding]. In this case, dealing with large models implies a high communication time, which is known to be the bottleneck in distributed applications [alistarh2017qsgd].

To overcome this issue, standard techniques compress the gradients of an SGD-like (Stochastic Gradient Descent) training algorithm, as in [alistarh2017qsgd, WenTernGrad17, bernstein2018signsgd]. We consider the case where the iterates themselves need to be compressed. This case is relevant even when there is only one computing agent, provided that the model is too large to keep in memory and needs to be compressed. Training with compressed iterates was only considered in one recent work [GDCI], which introduced the gradient descent algorithm with compressed iterates. In this paper we improve the results of [GDCI] and generalize them in two ways. First, in the case $n=1$ (a single computing agent), we consider iterate compression in any algorithm that can be formulated as a (stochastic) fixed point iteration. This covers gradient descent and stochastic gradient descent, among others. Second, we consider the distributed case $n>1$, where the network has to jointly find a fixed point of some map, in a distributed manner over the nodes, and using iterate compression. This distributed fixed point problem covers many applications of federated learning, including distributed minimization or distributed saddle point problems.

To address these problems we first study a naive approach that compresses the iterates after each iteration. This iterate compression introduces an extra source of variance in the algorithms. We then propose a variance reduced approach that removes the variance induced by the compression.

In summary, we make the following contributions:

• We propose new distributed algorithms (non variance reduced and variance reduced) to learn with compressed iterates in the fixed point framework, which we show captures gradient descent as well as a variety of other methods.

• We derive non asymptotic convergence rates for these methods. Our theory allows improved rates when specialized to gradient descent compared to prior work, and captures other variants of gradient descent that have not been previously studied.

• We experiment numerically with the developed algorithms on synthetic and real datasets and report our findings.

#### 1.1 Related Work

Communication-Efficiency. In distributed optimization, the communication cost is the bottleneck. In order to reduce it, many methods have been suggested, including the use of intermittent communication and decentralization [Wang18], as well as exchanging only compressed or quantized information between the computing units [tsitsiklis1987communication]. The exchanged information is usually some compressed gradient [seide20141, alistarh2017qsgd, bernstein2018signsgd] or compressed model update [Basu2019, reisizadeh2019fedpaq] in a distributed master-worker setting. We note that in the setting of gradient compression, various methods have also been developed to reduce the noise from gradient compression, and the method we develop is similar in spirit to some of them, such as [DIANA2, mishchenko2019distributed]. As noted before, iterate compression is also used in Federated Learning, see e.g. [Konecny16, caldas2018expanding], where concerns of communication efficiency and memory usage are particularly important.

Decentralized Methods. In decentralized settings, one can distinguish between exact methods and inexact methods. Among inexact methods, some provide algorithms that compute (sub)gradients at compressed iterates: [nedic2008distributed, rabbat2005quantized]. Exact methods usually exchange compressed gradients or compressed iterates [koloskova2019decentralized, doan2018accelerating, reisizadeh2018quantized, berahas2019nested, zhang2019compressed, lee2018finite]. Our focus in this work is on centralized methods, and we leave an extension to decentralized settings to future work.

In addition to concerns of communication efficiency, because compression operators satisfying Assumption 3 are not all necessarily quantization operators, our method (in the case $n=1$) can also be seen as an analysis of fixed point methods with perturbed iterates, in a similar spirit to [Devolder2014], who analyze gradient descent methods given access to an inexact oracle.

The remainder is organized as follows. In the next section we provide some background on distributed fixed point problems and compression operators, and make our assumptions. In Section 3, we consider the case where there is only one computing unit. We describe our algorithms, state the main results and instantiate the algorithms to practical (stochastic) fixed point iterations. This case is generalized in Section 4, where a network of computing units is considered. We describe our distributed algorithms, state the main results and instantiate the distributed algorithms to practical distributed (stochastic) fixed point iterations. Finally, simulations on a federated learning task are provided in Section 5. The proofs of our theorems are postponed to the appendix.

### 2 Background

#### 2.1 Distributed fixed point

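Let $T_1,\dots,T_n$ be operators on $\mathbb{R}^d$, i.e., $T_i:\mathbb{R}^d\to\mathbb{R}^d$. Denoting

$$T(x) \coloneqq \frac{1}{n}\sum_{i=1}^n T_i(x), \tag{1}$$

our goal is to find a fixed point of $T$, i.e., a point $x^\star$ such that

$$T(x^\star) = x^\star. \tag{2}$$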

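Consider a probability space and a family $(s_i)_{i=1}^n$ of random variables defined on it, with values in some measurable space. Denote by $\mu_i$ the distribution of $s_i$ and by $\mu$ the distribution of $s \coloneqq (s_1,\dots,s_n)$. We allow each $T_i$ to have the following stochastic representation:

$$T_i(x) = \mathbb{E}_{s_i}\left[T_i(x,s_i)\right], \tag{3}$$

where, with a small abuse of notation, $T_i(x,\cdot)$ denotes a $\mu_i$-integrable function for every $x\in\mathbb{R}^d$. We also denote, for every $x\in\mathbb{R}^d$,

$$T(x,s) \coloneqq \frac{1}{n}\sum_{i=1}^n T_i(x,s_i). \tag{4}$$

Note that $T(x,\cdot)$ is $\mu$-integrable and that $T(x)=\mathbb{E}_s\left[T(x,s)\right]$.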

We assume the following contraction property for the stochastic map $T(\cdot,s)$.

###### Assumption 1.

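There exist $x^\star\in\mathbb{R}^d$, $\rho>0$ and $B\geq 0$ such that for every $x\in\mathbb{R}^d$,

$$\mathbb{E}_s\left[\|T(x,s)-x^\star\|^2\right] \leq (1-\rho)\|x-x^\star\|^2 + B. \tag{5}$$

This assumption is satisfied by many maps describing (stochastic) optimization algorithms under some strong convexity / smoothness assumption; see Sections 3 and 4. We shall also use the expected Lipschitz continuity of $T_i$, defined as follows.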

###### Assumption 2.

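For every $i\in\{1,\dots,n\}$, there exists $c_i\geq 0$ such that for every $x,y\in\mathbb{R}^d$:

$$\mathbb{E}_{s_i}\left[\|T_i(x,s_i)-T_i(y,s_i)\|^2\right] \leq c_i^2\|x-y\|^2, \tag{6}$$

and we denote

$$c^2 \coloneqq \frac{1}{n}\sum_{i=1}^n c_i^2.$$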

#### 2.2 Compression operator

In order to overcome communication issues, we apply a compression operator to the iterates.

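Consider a family $(\xi_i)_{i=1}^n$ of random variables defined on the same probability space. If $n=1$, we shall prefer the notation $\xi$ for $\xi_1$. We consider a measurable map $C$ such that for every $x\in\mathbb{R}^d$ and every $i$,

$$x = \mathbb{E}_{\xi_i}\left[C(x;\xi_i)\right]. \tag{7}$$

The map $C$ is called a compression operator. We make the following assumption on $C$.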

###### Assumption 3.

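There exists $\omega\geq 0$ such that for every $x\in\mathbb{R}^d$ and every $i\in\{1,\dots,n\}$,

$$\mathbb{E}_{\xi_i}\left[\|C(x;\xi_i)-x\|^2\right] \leq \omega\|x\|^2. \tag{8}$$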

Assumption 3 has been used before, either in this general form or in special cases, in the analysis of gradient methods with compressed gradients [koloskova2019decentralized, DIANA2] and compressed iterates [GDCI]. Many practical compression operators satisfy this assumption; e.g., natural compression and natural dithering, standard dithering, sparsification, and quantization [horvath2019, DIANA2, GDCI, Stich18].
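As an illustration of (7) and (8), the following minimal sketch implements random sparsification ("rand-k"), one of the operator families mentioned above. The function names and the test vector are hypothetical choices made here. For this operator, which keeps `keep` coordinates out of $d$ and rescales them by $d/\text{keep}$, the bound (8) holds with $\omega = d/\text{keep} - 1$. The sketch checks both properties exactly by enumerating all supports.

```python
import itertools
import random

def rand_k(x, keep, rng):
    """Keep `keep` coordinates chosen uniformly at random, rescaled by d/keep."""
    d = len(x)
    scale = d / keep
    kept = set(rng.sample(range(d), keep))
    return [x[j] * scale if j in kept else 0.0 for j in range(d)]

def exact_moments(x, keep):
    """Enumerate every support of size `keep` to get E[C(x)] and E||C(x) - x||^2."""
    d = len(x)
    scale = d / keep
    supports = list(itertools.combinations(range(d), keep))
    mean = [0.0] * d
    second_moment = 0.0
    for support in supports:
        c = [x[j] * scale if j in support else 0.0 for j in range(d)]
        mean = [m + cj / len(supports) for m, cj in zip(mean, c)]
        second_moment += sum((cj - xj) ** 2 for cj, xj in zip(c, x)) / len(supports)
    return mean, second_moment

x = [1.0, -2.0, 3.0, 0.5]          # arbitrary test vector (assumption)
keep = 2
omega = len(x) / keep - 1          # variance parameter of rand-k
mean, second_moment = exact_moments(x, keep)
sq_norm = sum(v * v for v in x)
sample = rand_k(x, keep, random.Random(0))
```

Here `mean` coincides with `x` (unbiasedness) and `second_moment` equals `omega * sq_norm`, so (8) holds with equality for this operator.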

### 3 Results in the case n=1

In this section, we present two algorithms to solve (2) in the case $n=1$ and state two theorems related to these algorithms.

Consider stochastic fixed point iterations of the form

$$x^{k+1} = T(x^k,s^k), \tag{9}$$

where $(s^k)_k$ is a sequence of i.i.d. copies of $s$. Our first algorithm, Algorithm 1, compresses all iterates: as used in the proof in Appendix B (with $n=1$), its update reads $x^{k+1} = C(T(x^k,s^k);\xi^k)$.
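A minimal sketch of the compressed iteration $x^{k+1} = C(T(x^k,s^k);\xi^k)$, which is the $n=1$ case of the update analyzed in Appendix B. Everything concrete here is an assumption made for illustration: $T$ is a deterministic gradient step on a diagonal quadratic, $C$ is rand-k sparsification, and all names are hypothetical.

```python
import random

def rand_k(x, keep, rng):
    """Unbiased rand-k sparsification: keep `keep` coordinates, rescale by d/keep."""
    d = len(x)
    scale = d / keep
    kept = set(rng.sample(range(d), keep))
    return [x[j] * scale if j in kept else 0.0 for j in range(d)]

def gd_map(x, a, b, gamma):
    """T(x) = x - gamma * grad F(x) for F(x) = 0.5 * sum_j a_j * (x_j - b_j)^2."""
    return [xj - gamma * aj * (xj - bj) for xj, aj, bj in zip(x, a, b)]

def compressed_fixed_point(x0, a, b, gamma, keep, iters, seed=0):
    """Iterate x <- C(T(x)) and record the squared error r^k = ||x^k - x*||^2."""
    rng = random.Random(seed)
    x = list(x0)
    errors = []
    for _ in range(iters):
        x = rand_k(gd_map(x, a, b, gamma), keep, rng)
        errors.append(sum((xj - bj) ** 2 for xj, bj in zip(x, b)))
    return errors

d = 10
a = [0.5 + 0.5 * j / (d - 1) for j in range(d)]   # curvatures: mu = 0.5, L = 1
b = [1.0] * d                                      # minimizer x* = b
x0 = [0.0] * d
exact = compressed_fixed_point(x0, a, b, gamma=1.0, keep=d, iters=200)           # omega = 0
compressed = compressed_fixed_point(x0, a, b, gamma=1.0, keep=d - 1, iters=200)  # omega = 1/9
```

With `keep = d` the compressor is the identity and the error vanishes. With `keep = d - 1` every iterate has at least one zeroed coordinate, so the squared error never drops below 1 for this choice of `b`: the iterates stay in a neighbourhood of $x^\star$ rather than converging to it.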

Theorem 1 states the convergence result obtained for Algorithm 1.

###### Theorem 1.

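Suppose that Assumptions 1, 2 and 3 hold. Let $r^k \coloneqq \|x^k-x^\star\|^2$. Then the iterates defined by Algorithm 1 satisfy

$$\mathbb{E}\left[r^k\right] \leq \left(1-\rho+2\omega c^2\right)^k r^0 + \frac{B+2\omega\sigma^2}{\rho-2\omega c^2},$$

where $\sigma^2 \coloneqq \mathbb{E}_s\left[\|T(x^\star,s)\|^2\right]$.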

The convergence of $\mathbb{E}\left[r^k\right]$ is linear up to a ball of squared radius

$$\frac{B}{\rho-2\omega c^2} + \frac{2\omega\sigma^2}{\rho-2\omega c^2}. \tag{10}$$

The first term comes from Assumption 1. The value of $B$ is usually zero for deterministic fixed point maps $T$, see the next subsection. In this case, the first term of (10) is zero. The presence of the second term is mainly a consequence of the variance of the compression operator. If $\omega=0$ (no compression), then the second term is equal to zero. (Having $\sigma^2=0$ is hopeless, except in very particular cases like $T$ deterministic and $x^\star=0$.)

In order to remove this variance term, we develop a variance reduced version of Algorithm 1 to solve (2).

The improved convergence rate of Algorithm 2 is stated by the next theorem.

###### Theorem 2.

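Let $\Psi^k$ be the following Lyapunov function:

$$\Psi^k \coloneqq \|x^k-x^\star\|^2 + \frac{4\eta^2\omega}{\alpha}\,\mathbb{E}_s\left[\|h^k-T(x^\star,s^k)\|^2\right].$$

Suppose that Assumptions 1, 2 and 3 hold. Then the iterates defined by Algorithm 2 satisfy

$$\mathbb{E}\left[\Psi^k\right] \leq \left(1-\frac{\min\{\alpha,\eta\rho\}}{2}\right)^k \mathbb{E}\left[\Psi^0\right] + \frac{2\eta B}{\min\{\alpha,\eta\rho\}}, \tag{11}$$

if the stepsizes satisfy

$$\alpha \leq \frac{1}{\omega+1} \quad\text{and}\quad \eta = \min\left\{1,\frac{\rho}{12\omega c^2}\right\}.$$

Therefore, Algorithm 2 converges linearly if $B=0$, and allows for an arbitrarily large compression variance $\omega$.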
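The following sketch illustrates the variance reduced iteration in the case $n=1$. The update rules are read off from the proofs in Appendix C ($\delta^k = C(T(x^k,s^k)-h^k;\xi^k)$, $x^{k+1} = (1-\eta)x^k + \eta(\delta^k+h^k)$, $h^{k+1} = h^k + \alpha\delta^k$), so this is a reconstruction under that reading rather than a verbatim copy of Algorithm 2. The quadratic objective, the rand-k compressor, the constants $\rho=\gamma\mu$, $c^2=1$ and all names are assumptions made for illustration.

```python
import random

def rand_k(x, keep, rng):
    """Unbiased rand-k sparsification with omega = d/keep - 1."""
    d = len(x)
    scale = d / keep
    kept = set(rng.sample(range(d), keep))
    return [x[j] * scale if j in kept else 0.0 for j in range(d)]

def gd_map(x, a, b, gamma):
    """T(x) = x - gamma * grad F(x) for F(x) = 0.5 * sum_j a_j * (x_j - b_j)^2."""
    return [xj - gamma * aj * (xj - bj) for xj, aj, bj in zip(x, a, b)]

def vr_compressed_fixed_point(x0, a, b, gamma, keep, alpha, eta, iters, seed=0):
    """Compress the difference T(x) - h instead of T(x) itself."""
    rng = random.Random(seed)
    x = list(x0)
    h = [0.0] * len(x0)                      # running estimate of T(x*)
    errors = []
    for _ in range(iters):
        t = gd_map(x, a, b, gamma)
        delta = rand_k([tj - hj for tj, hj in zip(t, h)], keep, rng)
        x = [(1 - eta) * xj + eta * (dj + hj) for xj, dj, hj in zip(x, delta, h)]
        h = [hj + alpha * dj for hj, dj in zip(h, delta)]
        errors.append(sum((xj - bj) ** 2 for xj, bj in zip(x, b)))
    return errors

d = 10
a = [0.5 + 0.5 * j / (d - 1) for j in range(d)]   # mu = 0.5, L = 1
b = [1.0] * d                                      # fixed point x* = b
omega = d / (d - 1) - 1
rho, c2 = 0.5, 1.0                                 # gamma * mu and a valid c^2 for gamma = 1
alpha = 1 / (omega + 1)
eta = min(1.0, rho / (12 * omega * c2))
errors = vr_compressed_fixed_point([0.0] * d, a, b, 1.0, d - 1, alpha, eta, 2000)
```

Because the compressed quantity $T(x^k)-h^k$ shrinks as the iterates approach $x^\star$, the compression noise vanishes and the error goes to zero, in contrast with the plain compressed iteration.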

#### 3.1 Examples

We now give some instances of Algorithms 1 and 2 by particularizing the map $T$.

##### Gradient Descent

Consider an $L$-smooth $\mu$-strongly convex objective function $F$ and a step-size $\gamma\in(0,1/L]$. Then

$$T_{\mathrm{GD}} : x \mapsto x-\gamma\nabla F(x) \tag{12}$$

satisfies Assumption 1 with $\rho=\gamma\mu$ and $B=0$, and Assumption 2 with $c=1$ [bau-com-livre11]. As a result, for any compression operator satisfying Assumption 3, Theorem 1 states that

$$\mathbb{E}\left[r^k\right] \leq \left(1-\gamma\mu+2\omega\right)^k r^0 + \frac{2\omega}{\gamma\mu-2\omega}\|x^\star\|^2.$$

This result improves upon the result obtained in [GDCI] by relaxing the condition required on $\omega$ while still guaranteeing convergence (here, $2\omega<\gamma\mu$ suffices). Moreover, by Theorem 2, $\mathbb{E}\left[r^k\right]$ converges linearly to zero, rather than to a neighbourhood of the solution, if Algorithm 2 is applied.

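##### Stochastic Gradient Descent

Consider a $\mu$-strongly convex objective function $F$ and an unbiased stochastic estimate $g(\cdot,s)$ of its gradient ($\mathbb{E}_s\left[g(x,s)\right]=\nabla F(x)$). Assume that there exists $L\geq 0$ such that

$$\mathbb{E}_s\left[\|g(x,s)-g(y,s)\|\right] \leq L\|x-y\|.$$

Then, a simple calculation shows that Assumption 2 is satisfied by the map

$$T_{\mathrm{SGD}} : (x,s) \mapsto x-\gamma g(x,s).$$

It is also known that Assumption 1 is satisfied, with $B\neq 0$ in general, see e.g. [Gower2019].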

##### Proximal SGD

One can generalize the previous example to the map

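$$T_{\text{prox-SGD}} : (x,s) \mapsto \operatorname{prox}_{\gamma H}\left(x-\gamma g(x,s)\right),$$

where $H$ is a convex, lower semicontinuous and proper function, and $\operatorname{prox}_{\gamma H}$ is the proximity operator of $\gamma H$, defined as

$$\operatorname{prox}_{\gamma H}(x) \coloneqq \operatorname*{argmin}_{y\in\mathbb{R}^d}\left\{\frac{1}{2}\|x-y\|^2+\gamma H(y)\right\}.$$

The map $T_{\text{prox-SGD}}$ also satisfies Assumptions 1 and 2 [atc-for-mou-14, atc-for-mou-17]. A fixed point of its deterministic counterpart $x\mapsto\operatorname{prox}_{\gamma H}\left(x-\gamma\nabla F(x)\right)$ is a minimizer of $F+H$.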
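To make the proximity operator concrete, here is a minimal sketch for the particular choice $H=\|\cdot\|_1$, which is an assumption made here for illustration (the text does not fix $H$). In that case the prox has the well-known closed form of coordinate-wise soft-thresholding. The helper `prox_objective` evaluates the function minimized in the definition above, so the closed form can be compared against nearby points.

```python
def soft_threshold(x, gamma):
    """prox_{gamma * ||.||_1}(x): shrink every coordinate towards zero by gamma."""
    return [max(abs(v) - gamma, 0.0) * (1.0 if v > 0 else -1.0) for v in x]

def prox_objective(y, x, gamma):
    """0.5 * ||x - y||^2 + gamma * ||y||_1, the objective defining the prox."""
    quad = 0.5 * sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return quad + gamma * sum(abs(yi) for yi in y)

x = [3.0, -0.5, 1.0]
gamma = 1.0
p = soft_threshold(x, gamma)
best = prox_objective(p, x, gamma)

# Perturb each coordinate of p in both directions; none should do better than p.
perturbed_values = []
for j in range(len(p)):
    for step in (-0.1, 0.1):
        y = list(p)
        y[j] += step
        perturbed_values.append(prox_objective(y, x, gamma))
```

Since the prox objective is strongly convex, its minimizer is unique, which is why every perturbed point has a larger objective value than `p`.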

##### Davis-Yin splitting

Davis-Yin splitting [davis2017three] is an optimization algorithm to minimize a sum of three convex functions. It is a generalization of the Gradient Descent, Proximal Gradient Descent, and Douglas-Rachford algorithms [boyd2011distributed, bau-com-livre11], and it takes the form of fixed point iterations $x^{k+1}=T(x^k)$. The map $T$ satisfies Assumptions 1 and 2 with $B=0$ under suitable strong convexity and smoothness conditions on the three functions [davis2017three]. Therefore Algorithm 2 converges linearly in this case.

##### Vu-Condat splitting

Vu-Condat splitting [con-jota13, vu2013splitting] is an optimization algorithm to minimize a sum of three convex functions, one of which is composed with a matrix. It is a generalization of the Gradient Descent, Proximal Gradient Descent, Douglas-Rachford, ADMM and Chambolle-Pock algorithms [chambolle2011first], and it takes the form of fixed point iterations $x^{k+1}=T(x^k)$. The map $T$ satisfies Assumptions 1 and 2 with $B=0$ under suitable strong convexity and smoothness conditions. Therefore Algorithm 2 converges linearly in this case.

##### Gradient Descent Ascent

Consider a $\mu$-strongly convex-concave function $F:(x,y)\mapsto F(x,y)$ (strongly convex in $x$ and strongly concave in $y$) with $L$-Lipschitz continuous gradient. Then, the map

$$T_{\mathrm{GDA}} : (x,y) \mapsto (x,y)^\top - \gamma\left(\nabla_x F(x,y),\,-\nabla_y F(x,y)\right)^\top \tag{13}$$

satisfies Assumption 2, and Assumption 1 with $B=0$, if $\gamma$ is small enough. In this case, Algorithm 2 will converge linearly to a saddle point of $F$. This example can be generalized to the case where the gradient is replaced by an unbiased estimate with the expected Lipschitz continuity property, in which case Assumption 1 holds with $B\neq 0$ in general.

### 4 The case n>1

We now consider the case where $n$ computing agents are required to compute a fixed point of $T$, under the restriction that each node $i$ only has access to the "local" random map $T_i(\cdot,s_i)$. We solve this problem in a distributed master/worker setting, where each iteration is divided into a computation step and a communication step. During the computation step, every node $i$ uses $T_i$ to update some "local" variable. Then, during the communication step, each node sends its local variable to the master node of the network, which aggregates the variables and sends the result back to the other nodes. We extend Algorithm 1 (resp. Algorithm 2) to this setting, as well as Theorem 1 (resp. Theorem 2). The distributed (non variance reduced) fixed point algorithm, Algorithm 3, is summarized in Table 3. Its update, used in Appendix B, reads

$$x^{k+1} = \frac{1}{n}\sum_{i=1}^n C\left(T_i(x^k,s_i^k);\xi_i^k\right). \tag{14}$$
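A minimal sketch of one round of the distributed (non variance reduced) scheme, following the update $x^{k+1} = \frac{1}{n}\sum_{i=1}^n C(T_i(x^k,s_i^k);\xi_i^k)$ analyzed in Appendix B. The local quadratics, the rand-k compressor and every name below are assumptions chosen for illustration; the local maps are deterministic gradient steps.

```python
import random

def rand_k(x, keep, rng):
    """Unbiased rand-k sparsification with omega = d/keep - 1."""
    d = len(x)
    scale = d / keep
    kept = set(rng.sample(range(d), keep))
    return [x[j] * scale if j in kept else 0.0 for j in range(d)]

def local_map(x, a, b_i, gamma):
    """T_i(x) = x - gamma * grad f_i(x), f_i(x) = 0.5 * sum_j a_j * (x_j - b_ij)^2."""
    return [xj - gamma * aj * (xj - bj) for xj, aj, bj in zip(x, a, b_i)]

def distributed_round(x, a, b, gamma, keep, rng):
    """Workers compress their local update; the master averages the messages."""
    n, d = len(b), len(x)
    messages = [rand_k(local_map(x, a, b_i, gamma), keep, rng) for b_i in b]
    return [sum(m[j] for m in messages) / n for j in range(d)]

def run(keep, iters=200, seed=0):
    rng = random.Random(seed)
    d, n = 6, 4
    a = [0.5 + 0.5 * j / (d - 1) for j in range(d)]
    b = [[float(i + j) for j in range(d)] for i in range(n)]
    x_star = [sum(b_i[j] for b_i in b) / n for j in range(d)]   # fixed point of T
    x = [0.0] * d
    for _ in range(iters):
        x = distributed_round(x, a, b, 1.0, keep, rng)
    return max(abs(xj - sj) for xj, sj in zip(x, x_star))

error_exact = run(keep=6)        # keep = d: no compression
error_compressed = run(keep=5)   # omega = 0.2: iterates only reach a neighbourhood
```

Without compression the averaged iteration converges to the fixed point of $T$; with compression the final error is a random quantity that does not vanish in general.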

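The convergence rate of this method is a direct generalization of Theorem 1.

###### Theorem 3.

Let Assumptions 1, 2 and 3 hold. Assume moreover that $\xi_1,\dots,\xi_n$ are independent. Let $r^k \coloneqq \|x^k-x^\star\|^2$. Then the iterates defined by Algorithm 3 satisfy

$$\mathbb{E}\left[r^k\right] \leq \left(1-\rho+\frac{2\omega c^2}{n}\right)^k r^0 + \frac{B+\frac{2\omega}{n}\sigma^2}{\rho-\frac{2\omega c^2}{n}},$$

where $\sigma^2 \coloneqq \frac{1}{n}\sum_{i=1}^n \mathbb{E}_{s_i}\left[\|T_i(x^\star,s_i)\|^2\right]$.

Once again, the rate suffers from the variance term, which is removed by our variance reduced approach, Algorithm 4, summarized in Table 4.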

Finally, the next theorem is the analogue of Theorem 2 in the distributed setting.

###### Theorem 4.

Define the Lyapunov function

$$\Psi^k \coloneqq \|x^k-x^\star\|^2 + \frac{4\eta^2\omega}{\alpha n^2}\sum_{i=1}^n \mathbb{E}_{s_i}\left[\|h_i^k-T_i(x^\star,s_i^k)\|^2\right].$$

Suppose that Assumptions 1, 2 and 3 hold. Assume moreover that $\xi_1,\dots,\xi_n$ are independent. Then the iterates defined by Algorithm 4 satisfy

$$\mathbb{E}\left[\Psi^k\right] \leq \left(1-\frac{\min\{\alpha,\eta\rho\}}{2}\right)^k \mathbb{E}\left[\Psi^0\right] + \frac{2\eta B}{\min\{\alpha,\eta\rho\}}, \tag{15}$$

if the stepsizes satisfy

$$\alpha \leq \frac{1}{\omega+1} \quad\text{and}\quad \eta = \min\left\{\frac{\rho n}{12\omega c^2},1\right\}.$$

Algorithm 4 converges linearly if $B=0$. We further note that, in a special case, the algorithm reduces to quantizing the model update in expectation, which is already common practice. This further mirrors the result of [DIANA2], where quantizing gradient differences rather than gradients brings several benefits. The message of Theorem 4, therefore, is that quantizing iterate differences rather than iterates also leads to better convergence properties.
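The sketch below illustrates the distributed variance reduced scheme. The update rules are read off from the proofs in Appendix C ($\delta_i^k = C(T_i(x^k,s_i^k)-h_i^k;\xi_i^k)$, $x^{k+1} = (1-\eta)x^k + \frac{\eta}{n}\sum_i(\delta_i^k+h_i^k)$, $h_i^{k+1} = h_i^k+\alpha\delta_i^k$), so this is a reconstruction under that reading. The local quadratics, the rand-k compressor, the constants $\rho=0.5$, $c^2=1$ and all names are assumptions made for illustration.

```python
import random

def rand_k(x, keep, rng):
    """Unbiased rand-k sparsification with omega = d/keep - 1."""
    d = len(x)
    scale = d / keep
    kept = set(rng.sample(range(d), keep))
    return [x[j] * scale if j in kept else 0.0 for j in range(d)]

def local_map(x, a, b_i, gamma):
    """T_i(x) = x - gamma * grad f_i(x), f_i(x) = 0.5 * sum_j a_j * (x_j - b_ij)^2."""
    return [xj - gamma * aj * (xj - bj) for xj, aj, bj in zip(x, a, b_i)]

def vr_round(x, h, a, b, gamma, keep, alpha, eta, rng):
    """Each worker compresses T_i(x) - h_i; the master averages delta_i + h_i."""
    n, d = len(b), len(x)
    deltas = []
    for i in range(n):
        t_i = local_map(x, a, b[i], gamma)
        deltas.append(rand_k([t - hi for t, hi in zip(t_i, h[i])], keep, rng))
    avg = [sum(deltas[i][j] + h[i][j] for i in range(n)) / n for j in range(d)]
    new_x = [(1 - eta) * xj + eta * vj for xj, vj in zip(x, avg)]
    new_h = [[hij + alpha * dij for hij, dij in zip(h[i], deltas[i])] for i in range(n)]
    return new_x, new_h

rng = random.Random(0)
d, n, keep = 6, 4, 5
a = [0.5 + 0.5 * j / (d - 1) for j in range(d)]
b = [[float(i + j) for j in range(d)] for i in range(n)]
x_star = [sum(b[i][j] for i in range(n)) / n for j in range(d)]
omega = d / keep - 1
rho, c2 = 0.5, 1.0                       # valid constants for gamma = 1 on this problem
alpha = 1 / (omega + 1)
eta = min(rho * n / (12 * omega * c2), 1.0)
x, h = [0.0] * d, [[0.0] * d for _ in range(n)]
for _ in range(2000):
    x, h = vr_round(x, h, a, b, 1.0, keep, alpha, eta, rng)
error = max(abs(xj - sj) for xj, sj in zip(x, x_star))
```

Each $h_i$ tracks $T_i(x^\star)$, so the messages $T_i(x^k)-h_i^k$ shrink to zero and the compression noise disappears, even though the individual $T_i(x^\star)$ differ from $x^\star$.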

#### 4.1 Examples

Consider a $\mu$-strongly convex objective function $F$ expressed as an empirical mean

$$F(x) = \frac{1}{n}\sum_{i=1}^n f_i(x),$$

where each $f_i$ is $L$-smooth and convex. Then it is easy to check that the map $T_{\mathrm{GD}}$ defined in (12) takes the form (1) and that Assumptions 1 and 2 are satisfied by this map if $\gamma$ is small enough, see e.g. [Gower2019]. Algorithm 4 is then a distributed gradient descent algorithm with iterate compression that converges linearly. If the $f_i$ are themselves written as expectations and have the expected Lipschitz continuity property and convexity, one can check that Assumptions 1 and 2 are also satisfied.

##### Distributed (Stochastic) Gradient Descent Ascent

The example (13) can be extended to the case where $F$ is expressed as an empirical mean, if each term has a Lipschitz continuous gradient. In this case, the distributed Algorithm 4 still converges linearly to a saddle point of $F$.

### 5 Empirical results

Here we present very preliminary numerical results.

###### Experiment.

We minimize a regularized loss of a linear regression problem using gradient descent and natural compression [horvath2019]. We carry out this experiment for different condition numbers.

###### Parametrisation.

We use an artificial regression dataset, which gives us control over the conditioning of the loss function (by controlling the singular values of the feature matrix). We use the stepsize

and in the variance reduced iteration we chose and (in the case of natural compression ).

###### Results.

In the following we plot the evolution for GD (Gradient Descent), GDCI (Gradient Descent with Compressed Iterates), VR-GDCI (Variance Reduced Gradient Descent with Compressed Iterates).

### Appendix A Basic Facts

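We recall the following fact about the variance of a random variable: given a fixed $a\in\mathbb{R}^d$ and a random variable $X$ with values in $\mathbb{R}^d$, we have

$$\mathbb{E}\left[\|X-a\|^2\right] = \mathbb{E}\left[\|X-\mathbb{E}[X]\|^2\right] + \|\mathbb{E}[X]-a\|^2. \tag{16}$$

If $X_1,\dots,X_n$ are independent random variables, then

$$\mathbb{E}\left[\left\|\sum_{i=1}^n \left(X_i-\mathbb{E}[X_i]\right)\right\|^2\right] = \sum_{i=1}^n \mathbb{E}\left[\|X_i-\mathbb{E}[X_i]\|^2\right]. \tag{17}$$

We also recall the following inequality from linear algebra: for any $a,b\in\mathbb{R}^d$ we have

$$\|a+b\|^2 \leq 2\|a\|^2+2\|b\|^2. \tag{18}$$

We will also use the following fact, which follows from the convexity of the squared Euclidean norm: for $a,b\in\mathbb{R}^d$ and $\eta\in[0,1]$ we have

$$\|\eta a+(1-\eta)b\|^2 \leq \eta\|a\|^2+(1-\eta)\|b\|^2. \tag{19}$$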

Moreover, we shall use the following lemma without mention.

###### Lemma 1.

Let $A\in[0,1)$ and $B\geq 0$, and let $(r_k)_{k\geq 0}$ be a sequence of real numbers satisfying the recursion

$$r_{k+1} \leq A r_k + B.$$

Then

$$r_k \leq A^k r_0 + \frac{B}{1-A}.$$
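For completeness, the lemma follows by unrolling the recursion and summing a geometric series. The derivation below assumes $A\in[0,1)$ and $B\geq 0$, which is what makes each step valid:

```latex
\begin{aligned}
r_k &\le A\,r_{k-1} + B
     \le A^2 r_{k-2} + AB + B
     \le \dots
     \le A^k r_0 + B\sum_{j=0}^{k-1} A^j \\
    &= A^k r_0 + B\,\frac{1-A^k}{1-A}
     \le A^k r_0 + \frac{B}{1-A}.
\end{aligned}
```

For instance, a recursion with contraction factor $A = 1-\rho+2\omega c^2$ and additive term $B+2\omega\sigma^2$ yields the limiting radius $\frac{B+2\omega\sigma^2}{\rho-2\omega c^2}$ appearing in Theorem 1.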

### Appendix B Proof of Theorem 3

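Since Theorem 1 is a particular case of Theorem 3, we only prove Theorem 3. From (14) and (16), we have conditionally on $x^k$ and $s^k$,

$$
\begin{aligned}
\mathbb{E}\left[\|x^{k+1}-x^\star\|^2\right]
&= \mathbb{E}\left[\left\|\frac{1}{n}\sum_{i=1}^n C\left(T_i(x^k,s_i^k);\xi_i^k\right)-x^\star\right\|^2\right] \\
&= \mathbb{E}\left[\left\|\frac{1}{n}\sum_{i=1}^n C\left(T_i(x^k,s_i^k);\xi_i^k\right)-\frac{1}{n}\sum_{i=1}^n T_i(x^k,s_i^k)\right\|^2\right]+\left\|\frac{1}{n}\sum_{i=1}^n T_i(x^k,s_i^k)-x^\star\right\|^2 \\
&= \frac{1}{n^2}\mathbb{E}\left[\left\|\sum_{i=1}^n \left(C\left(T_i(x^k,s_i^k);\xi_i^k\right)-T_i(x^k,s_i^k)\right)\right\|^2\right]+\|T(x^k,s^k)-x^\star\|^2.
\end{aligned} \tag{20}
$$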

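The first term in (20) can be bounded using the independence of the $\xi_i^k$ together with (17), Assumption 3, inequality (18), and Assumption 2 (the latter being understood in expectation over $s^k$):

$$
\begin{aligned}
\frac{1}{n^2}\mathbb{E}\left[\left\|\sum_{i=1}^n \left(C\left(T_i(x^k,s_i^k);\xi_i^k\right)-T_i(x^k,s_i^k)\right)\right\|^2\right]
&\leq \frac{\omega}{n^2}\sum_{i=1}^n \|T_i(x^k,s_i^k)\|^2 \\
&\leq \frac{2\omega}{n^2}\sum_{i=1}^n \|T_i(x^k,s_i^k)-T_i(x^\star,s_i^k)\|^2+\frac{2\omega}{n^2}\sum_{i=1}^n \|T_i(x^\star,s_i^k)\|^2 \\
&\leq \frac{2\omega}{n^2}\sum_{i=1}^n c_i^2\|x^k-x^\star\|^2+\frac{2\omega}{n^2}\sum_{i=1}^n \|T_i(x^\star,s_i^k)\|^2 \\
&= \frac{2\omega c^2}{n}\|x^k-x^\star\|^2+\frac{2\omega}{n^2}\sum_{i=1}^n \|T_i(x^\star,s_i^k)\|^2.
\end{aligned} \tag{21}
$$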

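Plugging (21) into (20),

$$\mathbb{E}\left[\|x^{k+1}-x^\star\|^2\right] \leq \frac{2\omega c^2}{n}\|x^k-x^\star\|^2+\frac{2\omega}{n^2}\sum_{i=1}^n \|T_i(x^\star,s_i^k)\|^2+\|T(x^k,s^k)-x^\star\|^2.$$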

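Therefore, conditionally on $x^k$ and using Assumption 1,

$$
\begin{aligned}
\mathbb{E}\left[\|x^{k+1}-x^\star\|^2\right]
&\leq \frac{2\omega c^2}{n}\|x^k-x^\star\|^2+\frac{2\omega}{n^2}\sum_{i=1}^n \mathbb{E}\left[\|T_i(x^\star,s_i^k)\|^2\right]+\mathbb{E}\left[\|T(x^k,s^k)-x^\star\|^2\right] \\
&\leq \left(\frac{2\omega c^2}{n}+1-\rho\right)\|x^k-x^\star\|^2+B+\frac{2\omega}{n^2}\sum_{i=1}^n \mathbb{E}\left[\|T_i(x^\star,s_i^k)\|^2\right].
\end{aligned}
$$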

Finally taking unconditional expectations yields the theorem’s claim.

### Appendix C Proof of Theorem 4

Since Theorem 2 is a particular case of Theorem 4, we only prove Theorem 4.

###### Lemma 2.

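Under Assumption 3, if $\alpha\leq\frac{1}{\omega+1}$, then for every $i\in\{1,\dots,n\}$ the iterates of Algorithm 4 satisfy, conditionally on $x^k$ and $h_i^k$:

$$\mathbb{E}\left[\|h_i^{k+1}-T_i(x^\star,s_i^k)\|^2\right] \leq (1-\alpha)\,\mathbb{E}\left[\|h_i^k-T_i(x^\star,s_i^k)\|^2\right]+\alpha\,\mathbb{E}\left[\|T_i(x^k,s_i^k)-T_i(x^\star,s_i^k)\|^2\right]. \tag{22}$$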
###### Proof.

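Conditionally on $x^k$, $h_i^k$ and $s_i^k$, and using $\mathbb{E}[\delta_i^k]=T_i(x^k,s_i^k)-h_i^k$ and $\mathbb{E}\left[\|\delta_i^k\|^2\right]\leq(\omega+1)\|T_i(x^k,s_i^k)-h_i^k\|^2$ (both consequences of (7) and Assumption 3), we have

$$
\begin{aligned}
\mathbb{E}\left[\|h_i^{k+1}-T_i(x^\star,s_i^k)\|^2\right]
&= \mathbb{E}\left[\|h_i^k-T_i(x^\star,s_i^k)+\alpha\delta_i^k\|^2\right] \\
&= \|h_i^k-T_i(x^\star,s_i^k)\|^2+2\alpha\left\langle h_i^k-T_i(x^\star,s_i^k),\,T_i(x^k,s_i^k)-h_i^k\right\rangle+\alpha^2\,\mathbb{E}\left[\|\delta_i^k\|^2\right] \\
&\leq \|h_i^k-T_i(x^\star,s_i^k)\|^2+2\alpha\left\langle h_i^k-T_i(x^\star,s_i^k),\,T_i(x^k,s_i^k)-h_i^k\right\rangle+\alpha^2(\omega+1)\|T_i(x^k,s_i^k)-h_i^k\|^2 \\
&\leq \|h_i^k-T_i(x^\star,s_i^k)\|^2+2\alpha\left\langle h_i^k-T_i(x^\star,s_i^k),\,T_i(x^k,s_i^k)-h_i^k\right\rangle+\alpha\|T_i(x^k,s_i^k)-h_i^k\|^2 \\
&= \|h_i^k-T_i(x^\star,s_i^k)\|^2+\alpha\left\langle 2h_i^k-2T_i(x^\star,s_i^k)+T_i(x^k,s_i^k)-h_i^k,\,T_i(x^k,s_i^k)-h_i^k\right\rangle.
\end{aligned}
$$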

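For the inner product in the last line, we have

$$
\begin{aligned}
&\left\langle 2h_i^k-2T_i(x^\star,s_i^k)+T_i(x^k,s_i^k)-h_i^k,\,T_i(x^k,s_i^k)-h_i^k\right\rangle \\
&\quad= \left\langle h_i^k-T_i(x^\star,s_i^k)+T_i(x^k,s_i^k)-T_i(x^\star,s_i^k),\,T_i(x^k,s_i^k)-T_i(x^\star,s_i^k)-\left(h_i^k-T_i(x^\star,s_i^k)\right)\right\rangle \\
&\quad= \|T_i(x^k,s_i^k)-T_i(x^\star,s_i^k)\|^2-\|h_i^k-T_i(x^\star,s_i^k)\|^2.
\end{aligned}
$$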

Using this in the previous inequality, we get

$$\mathbb{E}\left[\|h_i^{k+1}-T_i(x^\star,s_i^k)\|^2\right] \leq (1-\alpha)\|h_i^k-T_i(x^\star,s_i^k)\|^2+\alpha\|T_i(x^k,s_i^k)-T_i(x^\star,s_i^k)\|^2.$$

It remains to take expectation with respect to the randomness in $s_i^k$.

###### Lemma 3.

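Under Assumptions 1 and 3, the iterates of Algorithm 4 satisfy

$$
\begin{aligned}
\mathbb{E}\left[\|x^{k+1}-x^\star\|^2\right] \leq{}& (1-\eta\rho)\|x^k-x^\star\|^2+\eta B \\
&+\frac{2\eta^2\omega}{n^2}\sum_{i=1}^n \mathbb{E}\left[\|T_i(x^k,s_i^k)-T_i(x^\star,s_i^k)\|^2\right]+\frac{2\eta^2\omega}{n^2}\sum_{i=1}^n \mathbb{E}\left[\|T_i(x^\star,s_i^k)-h_i^k\|^2\right].
\end{aligned} \tag{23}
$$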
###### Proof.

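Conditionally on $x^k$, $h^k$ and $s^k$, using (16), (17) and Assumption 3, we have

$$
\begin{aligned}
\mathbb{E}\left[\|x^{k+1}-x^\star\|^2\right]
&= \mathbb{E}\left[\left\|(1-\eta)x^k+\frac{\eta}{n}\sum_{i=1}^n\left(\delta_i^k+h_i^k\right)-x^\star\right\|^2\right] \\
&= \|(1-\eta)x^k+\eta T(x^k,s^k)-x^\star\|^2+\frac{\eta^2}{n^2}\mathbb{E}\left[\left\|\sum_{i=1}^n\left(\delta_i^k-\mathbb{E}[\delta_i^k]\right)\right\|^2\right] \\
&= \|(1-\eta)x^k+\eta T(x^k,s^k)-x^\star\|^2+\frac{\eta^2}{n^2}\sum_{i=1}^n\mathbb{E}\left[\|\delta_i^k-\mathbb{E}[\delta_i^k]\|^2\right] \\
&\leq \|(1-\eta)x^k+\eta T(x^k,s^k)-x^\star\|^2+\frac{\eta^2\omega}{n^2}\sum_{i=1}^n\|T_i(x^k,s_i^k)-h_i^k\|^2.
\end{aligned}
$$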

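We now take expectation with respect to the randomness in $s^k$, conditionally on $x^k$ and $h^k$:

$$\mathbb{E}\left[\|x^{k+1}-x^\star\|^2\right] \leq \mathbb{E}\left[\|(1-\eta)x^k+\eta T(x^k,s^k)-x^\star\|^2\right]+\frac{\eta^2\omega}{n^2}\sum_{i=1}^n\mathbb{E}\left[\|T_i(x^k,s_i^k)-h_i^k\|^2\right]. \tag{24}$$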

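To bound the first term in (24) we use the convexity of the squared norm (19) and Assumption 1 as follows:

$$
\begin{aligned}
\mathbb{E}\left[\|(1-\eta)x^k+\eta T(x^k,s^k)-x^\star\|^2\right]
&= \mathbb{E}\left[\|(1-\eta)(x^k-x^\star)+\eta\left(T(x^k,s^k)-x^\star\right)\|^2\right] \\
&\leq (1-\eta)\|x^k-x^\star\|^2+\eta\,\mathbb{E}\left[\|T(x^k,s^k)-x^\star\|^2\right] \\
&\leq \left(1-\eta+\eta(1-\rho)\right)\|x^k-x^\star\|^2+\eta B \\
&= (1-\eta\rho)\|x^k-x^\star\|^2+\eta B.
\end{aligned} \tag{25}
$$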

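For the second term in (24) we have, by (18),

$$\mathbb{E}\left[\|T_i(x^k,s_i^k)-h_i^k\|^2\right] \leq 2\,\mathbb{E}\left[\|T_i(x^k,s_i^k)-T_i(x^\star,s_i^k)\|^2\right]+2\,\mathbb{E}\left[\|T_i(x^\star,s_i^k)-h_i^k\|^2\right]. \tag{26}$$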

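It remains to substitute (25) and (26) in (24):

$$
\begin{aligned}
\mathbb{E}\left[\|x^{k+1}-x^\star\|^2\right] \leq{}& (1-\eta\rho)\|x^k-x^\star\|^2+\eta B \\
&+\frac{2\eta^2\omega}{n^2}\sum_{i=1}^n \mathbb{E}\left[\|T_i(x^k,s_i^k)-T_i(x^\star,s_i^k)\|^2\right]+\frac{2\eta^2\omega}{n^2}\sum_{i=1}^n \mathbb{E}\left[\|T_i(x^\star,s_i^k)-h_i^k\|^2\right].
\end{aligned}
$$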

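We now prove Theorem 4. By Lemmas 2 and 3, conditionally on $x^k$ and $(h_i^k)_{i=1}^n$,

$$
\begin{aligned}
\mathbb{E}\left[\Psi^{k+1}\right]
={}& \mathbb{E}\left[\|x^{k+1}-x^\star\|^2\right]+\frac{4\eta^2\omega}{\alpha n^2}\sum_{i=1}^n\mathbb{E}\left[\|h_i^{k+1}-T_i(x^\star,s_i^k)\|^2\right] \\
\leq{}& (1-\eta\rho)\|x^k-x^\star\|^2+\frac{6\eta^2\omega}{n^2}\sum_{i=1}^n\mathbb{E}\left[\|T_i(x^k,s_i^k)-T_i(x^\star,s_i^k)\|^2\right] \\
&+\frac{4\eta^2\omega}{\alpha n^2}\left(1-\frac{\alpha}{2}\right)\sum_{i=1}^n\mathbb{E}\left[\|h_i^k-T_i(x^\star,s_i^k)\|^2\right]+\eta B \\
\leq{}& (1-\eta\rho)\|x^k-x^\star\|^2+\frac{6\eta^2\omega}{n^2}\sum_{i=1}^n c_i^2\cdot\|x^k-x^\star\|^2 \\
&+\frac{4\eta^2\omega}{\alpha n^2}\left(1-\frac{\alpha}{2}\right)\sum_{i=1}^n\mathbb{E}\left[\|h_i^k-T_i(x^\star,s_i^k)\|^2\right]+\eta B.
\end{aligned} \tag{27}
$$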