Communication efficiency and memory issues often arise in machine learning algorithms dealing with large models. This is especially true in federated learning, where a network of computing agents is required to jointly solve a machine learning task [Konecny16, caldas2018expanding]. In this case, dealing with large models implies a high communication time, which is known to be the bottleneck in distributed applications [alistarh2017qsgd].
To overcome this issue, standard techniques compress the gradients of an SGD-like (Stochastic Gradient Descent) training algorithm, as in [alistarh2017qsgd, WenTernGrad17, bernstein2018signsgd]. We consider instead the case where the iterates themselves need to be compressed. This case is relevant even when there is only one computing agent, provided that the model is too large to keep in memory and needs to be compressed. Training with compressed iterates was considered in only one recent work [GDCI], which introduced the gradient descent algorithm with compressed iterates. In this paper we improve the results of [GDCI] and generalize them in two ways. First, in the single-agent case, we consider iterate compression in any algorithm that can be formulated as a (stochastic) fixed point iteration. This covers gradient descent and stochastic gradient descent, among others. Second, we consider the distributed case, where the network has to jointly find a fixed point of some map, in a distributed manner over the nodes, and using iterate compression. This distributed fixed point problem covers many applications of federated learning, including distributed minimization and distributed saddle point problems.
To address these problems we first study a naive approach that compresses the iterates after each iteration. This iterate compression introduces an extra source of variance in the algorithms. We then propose a variance-reduced approach that removes the variance induced by the compression.
In summary, we make the following contributions:
We propose new distributed algorithms (non-variance-reduced and variance-reduced) to learn with compressed iterates in the fixed point framework, which we show captures gradient descent as well as a variety of other methods.
We derive nonasymptotic convergence rates for these methods. Our theory yields improved rates when specialized to gradient descent compared to prior work, and captures other variants of gradient descent that have not been previously studied.
We experiment numerically with the developed algorithms on synthetic and real datasets and report our findings.
1.1 Related Work
Communication-Efficiency. In distributed optimization, the communication cost is the bottleneck. In order to reduce it, many methods have been suggested, including the use of intermittent communication and decentralization [Wang18], as well as exchanging only compressed or quantized information between the computing units [tsitsiklis1987communication]. The exchanged information is usually some compressed gradient [seide20141, alistarh2017qsgd, bernstein2018signsgd] or compressed model update [Basu2019, reisizadeh2019fedpaq] in a distributed master-worker setting. We note that in the setting of gradient compression, various methods have also been developed to reduce the noise from gradient compression, and the method we develop is similar in spirit to some of them, such as [DIANA2, mishchenko2019distributed]. As noted before, iterate compression is also used in Federated Learning, see e.g. [Konecny16, caldas2018expanding], where concerns of communication efficiency and memory usage are particularly important.
Decentralized Methods. In decentralized settings, one can distinguish between exact methods and inexact methods. Among inexact methods, some provide algorithms that compute (sub)gradients at compressed iterates: [nedic2008distributed, rabbat2005quantized]. Exact methods usually exchange compressed gradients or compressed iterates [koloskova2019decentralized, doan2018accelerating, reisizadeh2018quantized, berahas2019nested, zhang2019compressed, lee2018finite]. Our focus in this work is on centralized methods, and we leave an extension to decentralized settings to future work.
Beyond communication efficiency, compression operators satisfying Assumption 3 are not necessarily quantization operators. Our method (in the single-agent case) can therefore also be seen as an analysis of fixed point methods with perturbed iterates, in a similar spirit to [Devolder2014], who analyze gradient descent methods given access to an inexact oracle.
The remainder of the paper is organized as follows. In the next section we provide some background on distributed fixed point problems and compression operators, and state our assumptions. In Section 3, we consider the case where there is only one computing unit. We describe our algorithms, state the main results and instantiate the algorithms to practical (stochastic) fixed point iterations. This case is generalized in Section 4, where a network of computing units is considered. We describe our distributed algorithms, state the main results and instantiate the distributed algorithms to practical distributed (stochastic) fixed point iterations. Finally, simulations on a federated learning task are provided in Section 5. The proofs of our theorems are deferred to the appendix.
2.1 Distributed fixed point
Let be operators on , i.e., . Denoting
our goal is to find a fixed point of , i.e., a point such that
Consider a probability space, a family
of random variables defined onwith values in some measurable space . Denote the distribution of and the distribution (over ) of . We allow each to have the following stochastic representation:
where, with a small abuse of notation, denotes an -integrable function for every . We also denote for every ,
Note that is -integrable and that .
We assume the following contraction property for the stochastic map .
There exist , and such that for every ,
This assumption is satisfied by many maps describing (stochastic) optimization algorithms under some strong convexity / smoothness assumption; see Sections 3 and 4. We shall also use the expected Lipschitz continuity of defined as follows.
For every , there exists such that for every :
and we denote
2.2 Compression operator
In order to overcome communication issues, we apply a compression operator to the iterates.
Consider a family of random variables defined on with values . If , we shall prefer the notation for . We consider a measurable map such that for every ,
The map is called a compression operator. We make the following assumption on .
There exists such that for every and every ,
Assumption 3 has been used before, either in this general form or in special cases, in the analysis of gradient methods with compressed gradients [koloskova2019decentralized, DIANA2] and compressed iterates [GDCI]. Many practical compression operators satisfy this assumption; e.g., natural compression and natural dithering, standard dithering, sparsification, and quantization [horvath2019, DIANA2, GDCI, Stich18].
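As a hedged illustration of what such an operator looks like, the sketch below implements random sparsification (often called rand-k) and checks it empirically. It assumes that Assumption 3 has the usual unbiased, bounded-relative-variance form found in the cited works, namely that the compressed vector equals x in expectation and that its expected squared deviation from x is at most omega times the squared norm of x. The exact statement is not reproduced in this text, and the names `rand_k`, `omega` and the problem sizes are hypothetical choices for illustration. For rand-k the variance constant is d/k - 1.

```python
import numpy as np

def rand_k(x, k, rng):
    """Keep k uniformly chosen coordinates of x, rescaled by d/k so that E[C(x)] = x."""
    d = x.size
    idx = rng.choice(d, size=k, replace=False)
    out = np.zeros_like(x)
    out[idx] = x[idx] * (d / k)
    return out

rng = np.random.default_rng(0)
d, k = 20, 5
omega = d / k - 1.0          # variance constant of rand-k (assumed form of Assumption 3)
x = rng.standard_normal(d)

# Monte Carlo estimates of the bias and of the relative variance
samples = np.stack([rand_k(x, k, rng) for _ in range(20000)])
mean_err = np.linalg.norm(samples.mean(axis=0) - x) / np.linalg.norm(x)
rel_var = np.mean(np.sum((samples - x) ** 2, axis=1)) / np.sum(x ** 2)

print(f"relative bias ~ {mean_err:.3f}, relative variance ~ {rel_var:.3f}, omega = {omega}")
```

Smaller k means fewer transmitted coordinates but a larger omega, which is the trade-off the convergence results are stated in terms of.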
3 Results in the case
In this section, we present two algorithms to solve (2) in the case when and state two theorems related to these algorithms.
Consider stochastic fixed point iterations of the form
where is a sequence of i.i.d. copies of . Our first algorithm compresses all iterates for .
The convergence rate of is linear up to a ball of squared radius
The first term comes from Assumption 1. The value of is usually zero for deterministic fixed point maps , see the next subsection. In this case, the first term of (10) is zero. The presence of the second term is mainly a consequence of the variance of the compression operator. If (no compression), then the second term is equal to zero. (Footnote 2: Having is hopeless, except in very particular cases like deterministic and .)
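The noise floor caused by compressing the iterates can be observed on a toy problem. The following is a minimal sketch under assumptions that are not taken from the paper: the update is taken to be a relaxed step x ← (1 − eta) x + eta C(T(x)), the compression operator is rand-k sparsification, and T is the gradient descent map of a diagonal quadratic. The function names, the relaxation parameter `eta` and all constants are hypothetical.

```python
import numpy as np

def rand_k(x, k, rng):
    """Unbiased rand-k sparsifier: keep k random coordinates, rescale by d/k."""
    d = x.size
    idx = rng.choice(d, size=k, replace=False)
    out = np.zeros_like(x)
    out[idx] = x[idx] * (d / k)
    return out

def run(compress, iters=2000, eta=0.1, seed=0):
    rng = np.random.default_rng(seed)
    d = 20
    a = np.linspace(1.0, 4.0, d)     # curvatures: strong convexity 1, smoothness 4
    b = np.ones(d)                   # minimizer of f, hence the fixed point of T
    gamma = 1.0 / a.max()
    T = lambda x: x - gamma * a * (x - b)   # gradient descent map (deterministic)
    x = np.zeros(d)
    errs = []
    for _ in range(iters):
        y = T(x)
        if compress:
            y = rand_k(y, d // 2, rng)      # variance constant d/k - 1 = 1
        x = (1 - eta) * x + eta * y         # assumed relaxed update
        errs.append(np.sum((x - b) ** 2))
    return np.array(errs)

err_plain = run(compress=False)
err_naive = run(compress=True)
print("no compression, final squared error:", err_plain[-1])
print("naive compression, mean squared error over last 500 steps:", err_naive[-500:].mean())
```

In this sketch the uncompressed iteration reaches the fixed point to machine precision, while the naively compressed one keeps fluctuating around it, because the compression noise is proportional to the size of the compressed vector, which does not vanish at the fixed point.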
The improved convergence rate of Algorithm 2 is stated by the next theorem.
Therefore, Algorithm 2 converges linearly if and allows for arbitrarily large compression variance.
Consider an -smooth -strongly convex objective function and a step-size . Then
This result improves upon the result obtained in [GDCI] by requiring rather than while still guaranteeing convergence. Moreover, using Theorem 2, converges linearly to zero, rather than to a neighbourhood of the solution, if Algorithm 2 is applied.
Stochastic Gradient Descent (SGD)
One can generalize the previous example to the map
where is a convex, lower semicontinuous and proper function , and is the proximity operator of defined as
The map also satisfies the Assumptions [atc-for-mou-14, atc-for-mou-17]. A fixed point of is a minimizer of .
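To make the fixed point characterization concrete, here is a small sketch with a deterministic gradient and an l1 penalty. It uses the standard definition of the proximity operator, prox of gamma·g at x being the minimizer over y of g(y) + ||y − x||² / (2·gamma), which for g = lam·||·||₁ is soft-thresholding. The objective f(x) = ½||x − b||², the vector `b` and the constants `lam` and `gamma` are hypothetical choices for illustration, not the paper's setting.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximity operator of t * ||.||_1 (coordinate-wise shrinkage)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

b = np.array([3.0, -0.5, 1.2, -2.0, 0.1])
lam, gamma = 1.0, 0.5

def T(x):
    # gradient step on f(x) = 0.5 * ||x - b||^2, followed by the prox of gamma * lam * ||.||_1
    return soft_threshold(x - gamma * (x - b), gamma * lam)

# closed-form minimizer of f(x) + lam * ||x||_1 for this separable problem
x_star = soft_threshold(b, lam)

# plain fixed point iteration x <- T(x)
x = np.zeros_like(b)
for _ in range(100):
    x = T(x)

print("minimizer:        ", x_star)
print("T(minimizer):     ", T(x_star))
print("iterate after 100:", x)
```

The minimizer is left unchanged by the map, and the iteration started from zero converges to it, which is the property the compressed-iterate algorithms rely on.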
Davis-Yin splitting [davis2017three] is an optimization algorithm to minimize a sum of three convex functions . It is a generalization of the Gradient Descent, Proximal Gradient Descent, and Douglas-Rachford algorithms [boyd2011distributed, bau-com-livre11] and it takes the form of fixed point iterations . The map satisfies Assumptions 1 and 2 with if at least one of or is strongly convex and at least one of or is smooth [davis2017three]. Therefore Algorithm 2 converges linearly in this case.
Vu-Condat splitting [con-jota13, vu2013splitting] is an optimization algorithm to minimize a sum of three convex functions where is a matrix. It is a generalization of the Gradient Descent, Proximal Gradient Descent, Douglas-Rachford, ADMM and Chambolle-Pock algorithms [chambolle2011first] and it takes the form of fixed point iterations . The map satisfies Assumptions 1 and 2 with if is strongly convex and is smooth. Therefore Algorithm 2 converges linearly in this case.
(Stochastic) Gradient Descent Ascent
Consider a -strongly convex-concave function defined by , (strongly convex in and strongly concave in ) with -Lipschitz continuous gradient. Then, the map
satisfies Assumption 2 and Assumption 1 with if is small enough. In this case, Algorithm 4 will converge linearly to a saddle point of . This example can be generalized to the case where the gradient is replaced by an unbiased estimate with the expected Lipschitz continuity property, in which case Assumption 1 holds with in general.
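The contraction property of the deterministic gradient descent ascent map can be checked numerically on a quadratic example. The sketch below is an illustration under assumptions: the function F(x, y) = (mu/2)||x||² + xᵀBy − (mu/2)||y||², the matrix `B`, and the step-size `gamma` are hypothetical choices. Since the map is linear here, its Lipschitz constant is the spectral norm of its matrix, which equals the square root of (1 − gamma·mu)² + gamma²·sigma², with sigma the largest singular value of B.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 3
mu = 1.0
B = rng.standard_normal((p, p))
B *= 2.0 / np.linalg.norm(B, 2)      # rescale so the largest singular value is 2

# F(x, y) = mu/2 ||x||^2 + x^T B y - mu/2 ||y||^2, with saddle point (0, 0)
def gda_map(z, gamma):
    x, y = z[:p], z[p:]
    grad_x = mu * x + B @ y          # gradient of F in x
    grad_y = B.T @ x - mu * y        # gradient of F in y
    return np.concatenate([x - gamma * grad_x, y + gamma * grad_y])

gamma = 0.2
# build the matrix of the (linear) map column by column and measure its norm
M = np.column_stack([gda_map(e, gamma) for e in np.eye(2 * p)])
rho = np.linalg.norm(M, 2)

z = rng.standard_normal(2 * p)
for _ in range(200):
    z = gda_map(z, gamma)

print("contraction factor:", rho)
print("distance to saddle point after 200 steps:", np.linalg.norm(z))
```

With these constants the squared contraction factor is 0.64 + 0.04·4 = 0.8. Increasing `gamma` beyond 2·mu/(mu² + sigma²) = 0.4 makes the factor exceed one, which illustrates why the step-size has to be small enough.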
4 The case
We now consider the case where computing agents are required to compute a fixed point of , under the restriction that each node only has access to the "local" random map . We solve this problem in a distributed master/worker setting, where each iteration is divided into a computation step and a communication step. During the computation step, every node uses to update some "local" variable. Then, during the communication step, each node sends its local variable to the master node of the network, which aggregates the variables and sends the result back to the other nodes. We extend Algorithm 1 (resp. Algorithm 2) to this setting, as well as Theorem 1 (resp. Theorem 2). The distributed (non-variance-reduced) fixed point algorithm is summarized in Table 3.
The convergence rate of this method is a direct generalization of Theorem 1.
Once again, the rate suffers from the variance term which is removed by our variance reduced approach summarized in Table 4.
Finally, the next theorem is the analogue of Theorem 2 in the distributed setting.
Algorithm 4 converges linearly if . We further note that in the special case the algorithm reduces to quantizing the model update in expectation, an approach that is already common in practice. This further mirrors the result of [DIANA2], where quantizing gradient differences rather than gradients offers several benefits. The message of Theorem 4, therefore, is that quantizing iterate differences rather than iterates also leads to better convergence properties.
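The idea of compressing iterate differences can be sketched in the style of the DIANA gradient-difference scheme of [DIANA2]. The code below is not the paper's Algorithm 4, whose statement is not reproduced in this text; it is an assumed variant in which each worker i keeps a shift `h[i]`, compresses the difference between its new local iterate and that shift, and moves the shift towards the local iterate with a rate `alpha`. The local maps are gradient descent maps of hypothetical diagonal quadratics, and `rand_k`, `alpha`, `eta` and all constants are illustrative assumptions.

```python
import numpy as np

def rand_k(x, k, rng):
    """Unbiased rand-k sparsifier: keep k random coordinates, rescale by d/k."""
    d = x.size
    idx = rng.choice(d, size=k, replace=False)
    out = np.zeros_like(x)
    out[idx] = x[idx] * (d / k)
    return out

rng = np.random.default_rng(0)
n, d, k = 4, 20, 10
omega = d / k - 1.0
alpha = 1.0 / (1.0 + omega)      # assumed rate for updating the shifts
eta, gamma = 0.1, 0.25           # assumed relaxation parameter and step-size

a = rng.uniform(1.0, 4.0, size=(n, d))   # local curvatures
b = rng.standard_normal((n, d))          # local minimizers (they differ across workers)

def T_local(i, x):
    # gradient descent map of the local quadratic 0.5 * sum_j a[i, j] * (x_j - b[i, j])^2
    return x - gamma * a[i] * (x - b[i])

# fixed point of the averaged map, i.e. the minimizer of the average objective
x_star = (a * b).sum(axis=0) / a.sum(axis=0)

x = np.zeros(d)
h = np.zeros((n, d))             # one shift per worker; the master can track their mean
for _ in range(2000):
    # workers: compress the difference between the local iterate and the shift
    delta = np.stack([rand_k(T_local(i, x) - h[i], k, rng) for i in range(n)])
    # master: unbiased estimate of the averaged map at x, built from compressed messages
    estimate = (h + delta).mean(axis=0)
    h += alpha * delta           # shifts drift towards the local iterates
    x = (1 - eta) * x + eta * estimate

err = np.sum((x - x_star) ** 2)
print("squared distance to the fixed point:", err)
```

As the shifts approach the local maps evaluated at the fixed point, the compressed differences shrink to zero and so does the compression noise, which is why this sketch converges to the fixed point itself rather than to a neighbourhood of it.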
Distributed (Stochastic) Gradient Descent
Consider a -strongly convex objective function expressed as an empirical mean
where each is -smooth and convex. Then it is easy to check that the map defined in (12) takes the form (1) and that Assumptions 1 and 2 are satisfied by this map if is small enough, see e.g. [Gower2019]. Algorithm 4 is then a distributed gradient descent algorithm with iterates compression that converges linearly. If the are themselves written as expectations and have the expected Lipschitz continuity property and convexity, one can check that Assumptions 1 and 2 are also satisfied.
Distributed (Stochastic) Gradient Descent Ascent
5 Empirical results
Here we present very preliminary numerical results.
We minimize an regularized loss of a linear regression problem using gradient descent and natural compression [horvath2019]. We carry out this experiment for different condition numbers.
In the following we plot the evolution for GD (Gradient Descent), GDCI (Gradient Descent with Compressed Iterates), VR-GDCI (Variance Reduced Gradient Descent with Compressed Iterates).
Appendix A Basic Facts
We recall the following fact about the variance of a random variable: Given a fixed and a random variable , we have
If are independent random variables then
We also recall the following inequality from linear algebra: for any we have,
We will also use the following fact: which follows from the convexity of the squared Euclidean norm: for we have,
Moreover, we shall use the following lemma without mention.
Let and and let be a sequence of real numbers with satisfying the recursion
Appendix B Proof of Theorem 3
Therefore, conditionally on ,
Finally taking unconditional expectations yields the theorem’s claim.
Appendix C Proof of Theorem 4
Conditionally on we have
For the inner product in the last inequality, we have
Using this in the previous inequality, we get
It remains to take expectation with respect to the randomness in .