The vast size of modern machine learning problems is shifting the operating regime of optimization from centralized to distributed algorithms. This makes computations manageable but creates huge communication overheads for high-dimensional problems (dean2012large; seide20141; alistarh2017qsgd), because distributed optimization algorithms hinge on frequent transmissions of gradients between compute nodes. These gradients are typically large, since their size is proportional to the model size and state-of-the-art models often have millions of parameters. To get a sense of the communication costs, transmitting a single gradient or stochastic gradient in single precision (32 bits per entry) requires 40 MB for a model with 10 million parameters (which is not uncommon). This means that if we use 4G, then we can expect to transmit roughly one gradient per second. These communication costs easily overburden training on collocated servers and become infeasible for federated learning and learning on IoT or edge devices.
To counter these communication overheads, much recent research has focused on compressed gradient methods. These methods achieve communication efficiency by using only the most informative parts of the gradients at each iteration. We may, for example, sparsify the gradient and use only the most significant entries at each iteration, and set the rest to be zero (alistarh2017qsgd; alistarh2018convergence; stich2018sparsified; wen2017terngrad; wang2018atomo; khirirat2018gradient; wangni2018gradient). We may also quantize the gradients or do some mix of quantization and sparsification (alistarh2017qsgd; khirirat2018distributed; magnusson2017convergence; wangni2018gradient; zhu2016trained; rabbat2005quantized).
The above references show that compressed gradient methods can achieve huge communication improvements for specific training problems. However, to reap these communication benefits we usually need to carefully tune the level of accuracy of each compressor before training. For example, to sparsify the gradient we need to decide how many gradient components we will use. We cannot expect there to be a universally good compressor that works well on all problems, as suggested by the worst case communication complexity of optimization in (tsitsiklis1987communication). There is generally a delicate problem-specific balance between compressing too much or too little. Striking this balance can be achieved by hyper-parameter tuning. However, hyper-parameter tuning is expensive and the resulting tuning parameters will be problem specific. We take another approach and adaptively tune the level of accuracy by adapting to each communicated gradient.
Contributions: We propose Communication-aware Adaptive Tuning (CAT) for general compression schemes. The main idea is to find the optimal tuning for each communicated gradient by maximizing the objective function improvement achieved per bit. We illustrate these ideas on three state-of-the-art compression schemes: a) sparsification, b) sparsification with quantization and c) stochastic sparsification. In all cases, we first derive descent lemmas specific to the compression, relating the function improvement to the tuning parameter. Using these results we can find the tuning that optimizes the communication efficiency measured in descent per communicated bit. Our tuning is communication-aware, meaning that it achieves optimal efficiency for general communication schemes, e.g., the communication standards or models used. Even though most of our theoretical results are for a single node, we illustrate the efficiency of CAT for all three compression schemes in large-scale simulations in multi-node settings. Moreover, for stochastic sparsification we prove convergence for stochastic gradient methods in multi-node settings.
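Notation: We let $\mathbb{N}$, $\mathbb{N}_0$, and $\mathbb{R}$ be the set of natural numbers, the set of natural numbers including zero, and the set of real numbers, respectively. The set $\{a, a+1, \ldots, b\}$ is denoted by $[a,b]$ for $a, b \in \mathbb{N}_0$ and $a \le b$. For $x \in \mathbb{R}^d$, $\|x\|_q$ is the $\ell_q$ norm with $q \in [1,\infty]$, and $\|x\|$ without a subscript denotes the Euclidean norm. A continuously differentiable function $F:\mathbb{R}^d \to \mathbb{R}$ is $L$-smooth if for all $x, y \in \mathbb{R}^d$
$$\|\nabla F(x) - \nabla F(y)\| \le L \|x - y\|,$$
and is $\mu$-strongly convex if
$$F(y) \ge F(x) + \langle \nabla F(x), y - x \rangle + \frac{\mu}{2}\|y - x\|^2.$$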
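The main focus of this paper is empirical risk minimization
$$\underset{x \in \mathbb{R}^d}{\text{minimize}} \quad F(x) = \frac{1}{|\mathcal{D}|} \sum_{z \in \mathcal{D}} \ell(x; z),$$
where $\mathcal{D}$ is a set of data points and $\ell(x; z)$ is a loss function.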
2.1 Gradient Compression
We study compressed gradient methods
$$x^{i+1} = x^i - \gamma^i Q_T(\nabla F(x^i)),$$
where $Q_T(\cdot)$ is some compression operator and $T$ is a parameter that controls the level of compression. The goal of compression is to achieve communication efficiency by using only the most informative parts of the gradient. We might, for example, use only the $T$ most significant gradient components at each iteration, i.e., sparsify the gradient. In this case $Q_T$ is the sparsifying operator
$$[Q_T(g)]_j = \begin{cases} g_j & \text{if } j \in I_T(g), \\ 0 & \text{otherwise}, \end{cases} \tag{4}$$
where $I_T(g)$ is the index set for the $T$ components of $g$ with largest magnitude. Sparsification together with quantization has been shown to give good practical performance (alistarh2017qsgd). In this case, we communicate only the gradient magnitude and the sparsity structure of the gradient, where
$$[Q_T(g)]_j = \begin{cases} \|g\| \, \mathrm{sign}(g_j) & \text{if } j \in I_T(g), \\ 0 & \text{otherwise}. \end{cases} \tag{5}$$
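The two deterministic compressors described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function names `top_indices`, `sparsify` and `sparsify_quantize` are hypothetical, ties between equal magnitudes are broken by index, and the quantized variant is read as replacing every kept entry by the Euclidean norm of the gradient with the sign of that entry.

```python
import math

def top_indices(g, T):
    """Indices of the T entries of g with largest magnitude (ties: lower index first)."""
    order = sorted(range(len(g)), key=lambda j: (-abs(g[j]), j))
    return set(order[:T])

def sparsify(g, T):
    """Deterministic sparsification: keep the T largest-magnitude entries, zero the rest."""
    keep = top_indices(g, T)
    return [g[j] if j in keep else 0.0 for j in range(len(g))]

def sparsify_quantize(g, T):
    """Sparsification with quantization (assumed form): kept entries become
    sign(g_j) * ||g||_2, so one float plus the support and signs describe the output."""
    keep = top_indices(g, T)
    norm = math.sqrt(sum(v * v for v in g))
    return [math.copysign(norm, g[j]) if j in keep and g[j] != 0 else 0.0
            for j in range(len(g))]

g = [0.1, -3.0, 0.5, 4.0]
print(sparsify(g, 2))           # keeps only the entries -3.0 and 4.0
print(sparsify_quantize(g, 2))  # same support, magnitudes replaced by the norm of g
```

Sorting costs O(d log d) per call; a partial selection would be cheaper for large vectors, but the full sort keeps the sketch short.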
It is sometimes advantageous to use stochastic sparsification. In this case, instead of sending the top entries, we send on average components. We can achieve this by setting
Ideally, we would like to represent the magnitude of , so that if is large relative to the other entries then
should also be large. There are many heuristic methods to choose. For example, if we set with , , and then we get, respectively, the stochastic sparsifications in (alistarh2017qsgd) with , the TernGrad in (wen2017terngrad), and -quantization in (wang2018atomo). We can also find the optimal choice of , see (wang2018atomo) and Section 5 for details.
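As a rough illustration of stochastic sparsification, the sketch below keeps each entry independently with some probability and rescales the kept entries so that the output is unbiased. The probability rule in `proportional_probs` is a hypothetical heuristic chosen only for the example (probabilities proportional to the entry magnitudes, capped at 1); it is not claimed to be any of the schemes cited above, and after capping the probabilities may sum to less than the target.

```python
import random

def proportional_probs(g, T):
    """Hypothetical heuristic: p_j = min(1, T * |g_j| / ||g||_1).
    Without capping the p_j sum to T; capping can make the sum smaller."""
    total = sum(abs(v) for v in g)
    if total == 0:
        return [0.0] * len(g)
    return [min(1.0, T * abs(v) / total) for v in g]

def stochastic_sparsify(g, p, rng):
    """Keep entry j with probability p_j and rescale it by 1/p_j.
    The output is unbiased whenever p_j > 0 for every nonzero g_j."""
    out = []
    for v, pj in zip(g, p):
        if pj > 0 and rng.random() < pj:
            out.append(v / pj)
        else:
            out.append(0.0)
    return out

rng = random.Random(0)
g = [1.0, -2.0, 4.0]
p = proportional_probs(g, 2)
print(p)
print(stochastic_sparsify(g, p, rng))
```

Averaging many independent draws recovers the original vector, which is the property that makes the stochastic variant convenient for analysis with stochastic gradients.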
Experimental results have shown that compressed gradient methods can save a lot of communication in large-scale machine learning (shi2019distributed; ijcai2019-473). Nevertheless, we can easily create pedagogical examples where they are no more communication efficient than full gradient descent. For sparsification, consider the function $F(x) = \|x\|^2/2$. For this function, gradient descent with step-size $\gamma = 1$ converges in one iteration. That results in communication of $d$ floating points (one gradient $\nabla F(x) \in \mathbb{R}^d$) to reach any $\epsilon$-accuracy. On the other hand, with the $T$-sparsified gradient method (where $T$ divides $d$) we need $d/T$ iterations, which also results in $d$ communicated floating points. In fact, the sparsified method is even worse, because it requires additional communication of $T\lceil \log_2(d) \rceil$ bits per iteration, i.e., $d\lceil \log_2(d) \rceil$ bits in total, to indicate the sparsification pattern.
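The worst-case example can be checked numerically. The sketch below assumes the quadratic is half the squared Euclidean norm, so that the gradient equals the iterate and a unit step zeroes every coordinate that is transmitted; the function name and the starting point are illustrative choices.

```python
def sparsified_gd_on_quadratic(x0, T, tol=1e-12):
    """Run T-sparsified gradient descent with step size 1 on F(x) = 0.5 * ||x||^2.
    The gradient is x itself, so each step zeroes the T largest-magnitude entries.
    Returns (iterations, floats_sent)."""
    x = list(x0)
    iterations = 0
    floats_sent = 0
    while max(abs(v) for v in x) > tol:
        top = sorted(range(len(x)), key=lambda j: -abs(x[j]))[:T]
        for j in top:
            x[j] -= x[j]  # x_j <- x_j - 1 * grad_j, with grad_j = x_j
        iterations += 1
        floats_sent += T
    return iterations, floats_sent

d = 12
x0 = [float(j + 1) for j in range(d)]
for T in (1, 3, 12):
    print(T, sparsified_gd_on_quadratic(x0, T))
```

For every choice that divides the dimension the iteration count changes but the number of transmitted floating points stays at 12, which matches the argument that sparsification alone gives no saving on this problem.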
This means that the benefits of sparsification are not seen on worst-case problems, and that traditional worst-case analysis (e.g. khirirat2018gradient) is unable to guarantee improvements in communication complexity. Rather, sparsification is useful for exploiting potential structure that appears in real-world problems. The key to taking advantage of this structure is to choose the correct sparsity level $T$ at each iteration. In this paper we illustrate how to choose $T$ dynamically to optimize the communication efficiency of sparsification.
The compressors discussed above have a tuning parameter $T$, which controls the level of sparsity of the compressed gradient. Our goal is to tune $T$ adaptively to optimize the communication efficiency. To explain this we first need to discuss the communication involved. Let $C(T)$ denote the total number of bits communicated per iteration as a function of the parameter $T$. The value of $C(T)$ can be split into payload (actual data) and communication overhead. The payload is the number of bits needed to communicate the compressed gradient. For the sparsification in Equation (4) the payload consumes
$$P^{S}(T) = T \times \left( \lceil \log_2(d) \rceil + \text{FPP} \right) \ \text{bits}, \tag{7}$$
since we need to communicate $T$ floating points and indicate $T$ indices in a $d$-dimensional gradient vector. Here FPP is our floating point precision, e.g., $\text{FPP} = 32$ or $\text{FPP} = 64$ if we use, respectively, single or double precision floating points. For the sparsification with quantization in Equation (5) the payload consumes
since only one floating point is sent per iteration. Our simplest communication model accounts only for the payload,
$$C(T) = P(T).$$
We call this the payload model. In real-world networks, however, each communication also includes overhead and set-up costs. A more realistic model is therefore affine
$$C(T) = c_1 P(T) + c_0, \tag{9}$$
where is the payload. Here is the communication overhead while is the cost of transmitting a single payload byte. For example, if we just count transmitted bits (), a single UDP packet transmitted over Ethernet requires an overhead of bits and can have a payload of up to bytes. In the wireless standard IEEE 802.15.4, the overhead ranges from - bytes, leaving bytes of payload before the maximum packet size of bytes is reached (KoS:17). Another possibility is to consider a fixed cost per packet
$$C(T) = c_1 \left\lceil \frac{P(T)}{P_{\max}} \right\rceil + c_0, \tag{10}$$
where $P_{\max}$ is the number of payload bits per packet. The term $\lceil P(T)/P_{\max} \rceil$ counts the number of packets required to send the $P(T)$ payload bits, $c_1$ is the cost per packet, and $c_0$ is the cost of initiating the communication. These are just two examples; ideally, $C(T)$ should be tailored to the specific communication standard in use.
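The communication models above are simple enough to write down as code. The following sketch assumes that an index into a vector of a given dimension costs the ceiling of its base-2 logarithm in bits and that every transmitted value costs a fixed number of precision bits; all function and parameter names are illustrative and not taken from any library.

```python
def index_bits(d):
    """Bits needed to address one of d coordinates, i.e. ceil(log2(d)), in exact integers."""
    return (d - 1).bit_length()

def payload_sparsification(T, d, fpp=32):
    """Payload in bits for sending T values and T indices of a d-dimensional vector."""
    return T * (index_bits(d) + fpp)

def affine_cost(payload_bits, c1=1, c0=0):
    """Affine model: a per-bit (or per-byte) cost c1 plus a fixed overhead c0."""
    return c1 * payload_bits + c0

def packet_cost(payload_bits, bits_per_packet, c1=1, c0=0):
    """Packet model: a cost c1 per packet needed for the payload, plus a set-up cost c0."""
    packets = -(-payload_bits // bits_per_packet)  # integer ceiling division
    return c1 * packets + c0

d = 1024
for T in (10, 24):
    bits = payload_sparsification(T, d)
    print(T, bits, affine_cost(bits, c1=1, c0=100), packet_cost(bits, 1000))
```

Setting the overhead to zero and the per-bit cost to one in `affine_cost` recovers the pure payload model, so one function can cover both cases.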
2.3 Key Idea: Communication-aware Adaptive Tuning
When communicating the compressed gradients we would like to use each bit as efficiently as possible. In optimization terms, we would like the objective function improvement we get for each communicated bit to be as large as possible. In other words, we want to maximize the ratio
$$\text{Efficiency}(T) = \frac{\text{Improvement}(T)}{C(T)}, \tag{11}$$
where $\text{Improvement}(T)$ is the improvement in the objective function when we use $T$-sparsification with the given compressor and $C(T)$ is the number of bits communicated. We will demonstrate how the value of $\text{Improvement}(T)$ can be obtained from novel descent lemmas and derive dynamic sparsification policies which, at each iteration, find the $T$ that optimizes $\text{Efficiency}(T)$. We work out the details for the three gradient compression families and the two communication models introduced above. However, we believe that this idea is general and can help improve the communication efficiency of many other optimization algorithms and compression techniques.
3 Dynamic Sparsification
We now describe how to use our Communication-aware Adaptive Tuning (CAT) for the sparsified gradient method. The main idea is to find, at each iteration, the sparsity level $T$ that gives the biggest improvement in the objective function per communicated bit. The objective function improvement is captured by the following measure (we give a formal proof in Subsection 3.1)
$$\alpha^i(T) = \frac{\|Q_T(\nabla F(x^i))\|^2}{\|\nabla F(x^i)\|^2},$$
where $Q_T$ is the $T$-sparsification operator defined in Equation (4). Then our CAT sparsified gradient method follows the iterations
$$T^i = \underset{T \in [1,d]}{\arg\max} \ \frac{\alpha^i(T)}{C(T)}, \tag{12}$$
$$x^{i+1} = x^i - \frac{1}{L} Q_{T^i}(\nabla F(x^i)). \tag{13}$$
In the first step, described by Equation (12), the algorithm finds the sparsification parameter $T^i$ that optimizes the communication efficiency (cf. Equation (11)). In the second step (Equation (13)), the algorithm performs a standard sparsification using the $T^i$ found in the previous step.
To find $T^i$ at each iteration we need to solve the optimization problem in Equation (12). This is a one-dimensional optimization problem and can hence be solved efficiently. In particular, as we will show, under the affine communication model the efficiency is quasi-concave and its optimum is easily found, while under the packet-based model the optimal $T$ is a multiple of the number of gradient entries that fit in one packet. Moreover, $\alpha^i(T)$ is concave when extended to the continuous interval $[0,d]$ with linear interpolation (we show this below), meaning that we may use tools from convex optimization.
3.1 A Measure of Function Improvement
We now illustrate how $\alpha(T) = \|Q_T(\nabla F(x))\|^2 / \|\nabla F(x)\|^2$ captures the guaranteed improvement in the objective function for a given $T$. This is seen from the following descent lemma.
Lemma 1. Suppose that $F$ is (possibly non-convex) $L$-smooth and $\gamma = 1/L$. Then for any $x, x^+ \in \mathbb{R}^d$ with $x^+ = x - \gamma Q_T(\nabla F(x))$
$$F(x^+) \le F(x) - \frac{\alpha(T)}{2L} \|\nabla F(x)\|^2.$$
Moreover, there are $L$-smooth functions for which the inequality is tight for every $T \in [1,d]$.
This lemma is in the category of descent lemmas, which are standard tools to study the convergence of convex and non-convex functions. In fact, Lemma 1 is a generalization of the standard descent lemma for $L$-smooth functions (see for example Proposition A.24 in (nonlinear_bertsekas)). In particular, if the gradient is $T$-sparse (or $T = d$) then $\alpha(T) = 1$ and Lemma 1 gives the standard descent
$$F(x^+) \le F(x) - \frac{1}{2L} \|\nabla F(x)\|^2.$$
Lemma 1 is at the heart of our dynamic sparsification. It allows us to choose $T$ adaptively at each iteration to optimize the ratio between the descent in the objective function and the communicated bits. In particular, Lemma 1 implies that the descent in the objective function for a given $T$ is
$$F(x) - F(x^+) \ge \frac{\alpha(T)}{2L} \|\nabla F(x)\|^2.$$
Since the factor $\|\nabla F(x)\|^2/(2L)$ is independent of $T$, the descent per communicated bit is maximized by choosing the $T$ that maximizes the ratio $\alpha(T)/C(T)$ (as in Equation (12)). The descent always increases with $T$ and is bounded as follows.
Lemma 2. For $g \in \mathbb{R}^d$ the function
$$\alpha(T) = \frac{\|Q_T(g)\|^2}{\|g\|^2}$$
is increasing and concave when extended to the continuous interval $[0,d]$. Moreover, $\alpha(T) \ge T/d$ for all $T \in [0,d]$, and there exists an $L$-smooth function $F$ such that $\|Q_T(\nabla F(x))\|^2 / \|\nabla F(x)\|^2 = T/d$ for all $x \in \mathbb{R}^d$.
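The properties stated in this lemma are easy to check numerically. The sketch below takes the improvement measure to be the fraction of the squared gradient norm retained by the largest-magnitude entries (an assumption about the formula, which is the natural reading of the surrounding text); the two test vectors are made up for illustration.

```python
def alpha(g, T):
    """Fraction of the squared norm of g carried by its T largest-magnitude entries."""
    squares = sorted((v * v for v in g), reverse=True)
    return sum(squares[:T]) / sum(squares)

g_skewed = [10.0, 1.0, 0.5, 0.1]   # energy concentrated in one entry
g_flat = [1.0, -1.0, 1.0, -1.0]    # all entries equally important (worst case)
d = 4
for T in range(d + 1):
    print(T, round(alpha(g_skewed, T), 4), alpha(g_flat, T), T / d)
```

For the flat vector the measure coincides with the lower bound at every sparsity level, while for the skewed vector a single entry already captures almost all of the energy, which is the regime where sparsification pays off.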
We will explore several consequences of this lemma in the next subsection, but first we make the following observation:
Let be increasing and concave. If , then is quasi-concave and has a unique maximum on . When , on the other hand, attains its maximum for a which is an integer multiple of .
The proposition demonstrates that the optimization in Equation (12) is easy to solve. For the affine communication model, one can simply sort the elements in decreasing magnitude, initialize and increase until decreases. In the packet model, the search for the optimal is even more efficient, as one can increase in steps of .
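A minimal sketch of the selection step is given below. For simplicity it scans every sparsity level instead of stopping at the first decrease or stepping packet by packet, so it does not rely on quasi-concavity and works for any positive cost function. The names `alpha_curve` and `cat_select` and the cost parameters in the example are assumptions made for illustration.

```python
def alpha_curve(g):
    """alpha[T] = energy of the T largest-magnitude entries / total energy, for T = 0..d."""
    squares = sorted((v * v for v in g), reverse=True)
    total = sum(squares)
    if total == 0:
        raise ValueError("gradient is zero; nothing to transmit")
    alpha, running = [0.0], 0.0
    for s in squares:
        running += s
        alpha.append(running / total)
    return alpha

def cat_select(g, cost):
    """Return the T in 1..d maximizing alpha(T) / cost(T); ties go to the smallest T.
    `cost` maps T to a positive communication cost."""
    alpha = alpha_curve(g)
    best_T, best_eff = 1, alpha[1] / cost(1)
    for T in range(2, len(g) + 1):
        eff = alpha[T] / cost(T)
        if eff > best_eff:
            best_T, best_eff = T, eff
    return best_T

g = [float(16 - j) for j in range(16)]           # d = 16, so 4 index bits per entry
bits_per_entry = 4 + 32                          # index bits + single precision
payload = lambda T: bits_per_entry * T           # payload model
packets = lambda T: -(-bits_per_entry * T // 72) + 10  # 2 entries per packet, set-up cost 10
print(cat_select(g, payload), cat_select(g, packets))
```

Under the payload model the scan returns a single entry, while under the packet model with a set-up cost it returns a larger, even sparsity level, since a half-filled packet costs as much as a full one.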
To better see the benefits of adaptively choosing it is best to look at a concrete example, which we do next.
3.2 The Many Benefits of Dynamic Sparsification
Before illustrating the practical benefits of dynamic sparsification, we consider its theoretical guarantees in terms of iteration and communication complexity. To this end, Table 1 compares the iteration complexity of Gradient Descent (GD) and $T$-Sparsified Gradient Descent ($T$-SGD) with constant $T$ for strongly-convex, convex, and non-convex problems. The table shows how many iterations are needed to reach an accuracy with . The results for gradient descent are well known and found, e.g., in (nesterov2018lectures), and the worst-case analysis appeared in (khirirat2018gradient). The results for $T$-sparsified gradient descent can be derived similarly, using our descent Lemma 1 instead of the standard descent lemma. We provide the proofs in the supplementary material.
Comparing rows 3 and 5 in the table, we see that the worst-case analysis does not guarantee any improvement in the number of communicated floating points. Although $T$-SGD only communicates $T$ out of $d$ gradient entries in each round, we need to perform $d/T$ times more iterations with $T$-SGD than with GD, so the two approaches will need to communicate the same number of floating points. In fact, $T$-SGD will be worse in terms of communicated bits, since it requires communication of $T \lceil \log_2(d) \rceil$ additional bits per iteration to indicate the sparsity pattern (see Equation (7)).
Let us now turn our attention to our novel analysis shown in row 4 of Table 1. Here, the parameter is a lower bound on over every iteration, that is
Unfortunately, this lower bound is not useful for algorithm development: we know from Lemma 2 that it can be as low as $T/d$, and it is not easy to compute a tight data-dependent bound off-line, since it depends on the iterates produced by the algorithm. However, it explains why gradient sparsification is communication efficient. In practice, the majority of the gradient energy tends to be concentrated in a few top entries, so the fraction of gradient energy in the top $T$ entries grows rapidly for small values of $T$ and is much larger than $T/d$. To illustrate the benefits of sparsification, let us look at the concrete example of logistic regression on the RCV1 data set (a standard ML benchmark with and data points). Figure 1(a) depicts this lower bound computed after running 1000 iterations of Gradient Descent and compares it to the worst-case bound $T/d$. The results show a dramatic difference between these two measures. We quantify this difference in terms of their ratio
Note that this measure is the ratio between rows 4 and 5 in Table 1 and hence tells us the hypothetical speedup of sparsification, i.e., the ratio between the number of communicated floating points needed by GD and $T$-SGD to reach $\epsilon$-accuracy. The figure shows a drastic speedup: for small values of $T$ it is three orders of magnitude (we confirm this in experiments below).
Interestingly, the speedup decreases with $T$ and the optimal speedup is obtained at $T = 1$. There is an intuitive explanation for this. Doubling $T$ means doubling the amount of communicated bits, while the additional gradient components that are communicated are less significant. Thus, the communication efficiency gets worse as we increase $T$. This suggests that if we optimize the communication efficiency without considering overhead then we should always take $T = 1$. In the context of the dynamic algorithm in Equations (12) and (13), this leads to the following result:
Figures 1(b) and 1(c) depict, respectively, the hypothetical and true values of the total number of bits needed to reach an $\epsilon$-accuracy for different communication models. In particular, Figure 1(b) depicts the ratio (compare with Table 1) and Figure 1(c) depicts the experimental results of running $T$-SGD for different values of $T$. We consider: a) the payload model in Equation (7) with (dashed lines) and b) the packet model in Equation (10) with bytes, bytes and bytes (solid lines). In both cases, the floating point precision is . We compare the results with GD (blue lines). (For a fair comparison, we let the payload for gradient descent be $d \times \text{FPP}$ bits per iteration.) As expected, the results show that if we ignore overheads then $T = 1$ is optimal, and the improvement compared to GD is about three orders of magnitude. For the packet model, there is a delicate balance between choosing $T$ too small or too big. For general communication models it is difficult to find the right value of $T$ a priori, and the cost of choosing a bad $T$ can be many orders of magnitude. To find a good $T$ we could do hyper-parameter optimization, perhaps by first estimating the improvement measure from data and then using the estimate to find the optimal $T$. However, this is going to be expensive and, moreover, such an estimate might not be a good estimate of the improvement measure we get at each iteration.
In contrast, our communication-aware adaptive tuning rule finds the optimal $T$ at each iteration without any hyper-parameter optimization. In Figure 1(c) we show the number of communicated bits needed to reach $\epsilon$-accuracy with our algorithm. The results show that for both communication models, our algorithm achieves the same communication efficiency as if we had chosen the optimal $T$.
4 Dynamic Sparsification + Quantization
We now describe how our CAT tuning rule can be extended to improve the communication efficiency of compressed gradient methods when we use the sparsification together with quantization, i.e., with given in Equation (5). alistarh2017qsgd provide a heuristic rule for choosing dynamically at each iteration . Specifically, they choose so that is the smallest subset such that . We show below that this rule works quite well if we only consider the payload, but that it is suboptimal for general communication models.
4.1 Descent Lemma for Sparsification + Quantization
As before, our goal is to choose dynamically by maximizing the communication efficiency per iteration, i.e., the function improvement per bit. To that end, we first need a similar descent lemma for this compression as Lemma 1 was for the sparsification in the last section. By similar arguments as in Lemma 1, we obtain the following result.
Suppose that is (possibly non-convex) -smooth. Then for any with
where is as defined in Equation (5) and where
This lemma differs from Lemma 1 in a few ways, which is natural since the two compressors affect the descent in different ways. In particular, both the descent measure and the step-size differ from those in Lemma 1. Unlike in Lemma 1, does not converge to 1 as goes to . In fact, is not even an increasing function, and does not converge to for increasing .
Nevertheless, is non-negative, increasing and concave. Under the affine communication model, is non-negative and convex, which implies that is quasi-concave. The optimal can thus be efficiently found similarly to what was done for the CAT-sparsification in § 3.1.
4.2 Dynamic Algorithm and Illustrations
Using the descent Lemma 3 we can apply CAT for this compression similarly as we did for the sparsification in the previous section. In particular, if we set
then our CAT sparsification + quantization is given by
We illustrate the algorithm on the RCV1 data set in Figure 2. We compare CAT to the dynamic tuning introduced in (alistarh2017qsgd). The black curves illustrate the results when we only communicate the payload, i.e., the payload defined in Equation (8). The blue lines are the results when the communication cost follows the packet model in Equation (10) with bytes, bytes and bytes. The results show that if we only count the payload, then the two methods are comparable. Our CAT tuning rule outperforms (alistarh2017qsgd) by only a small margin. This suggests that the heuristic in (alistarh2017qsgd) is quite communication efficient in the simplest case when only payload is communicated. However, the heuristic rule is agnostic to the actual communication model that is used. Therefore, we should not expect it to perform well for general communication models. The blue lines show that CAT is roughly two times more communication efficient than the dynamic tuning rule in (alistarh2017qsgd) for the packet communication model.
5 Dynamic Stochastic Sparsification
We finally illustrate how the CAT tuning rule can improve the communication efficiency of stochastic sparsification methods. One of the advantages of stochastic sparsification is its favorable properties that allow us to generalize our theoretical results to stochastic gradient methods and to multi node settings; we illustrate this in Subsection 5.3.
5.1 Descent Lemma for Stochastic Sparsification
Our goal is to choose and dynamically for the stochastic sparsification in Equation (6), maximizing the communication efficiency per iteration, i.e., the function improvement per bit. To this end, we need the following descent lemma, similar to the ones we proved for non-stochastic sparsifications in the last two sections.
Suppose that is (possibly non-convex) -smooth. Then for any with
where is defined in (6) and where
Similarly as before, we optimize the descent and the communication efficiency by maximizing, respectively, and . To optimize we can use that
It is illustrated in (wang2018atomo) how we can obtain the optimal for a given efficiently. That is, for fixed it provides us with the optimal solution to
In the rest of this section we always use and omit when in and .
5.2 Dynamic Algorithm and Illustrations
We can now design a dynamic tuning to optimize the communication efficiency, based on the descent lemma. In particular, we get the dynamic algorithm
In Figure 3 we evaluate the performance of our algorithm with the optimal probability scheme from (wang2018atomo) on the RCV1 data set. The communications are averaged over three Monte Carlo runs. As for the deterministic sparsification, in the payload model it is best to choose the sparsity level $T$ small. However, for the packet model we need to carefully tune $T$ so that it is neither too big nor too small. Our CAT rule adaptively finds the best value of $T$ in both cases.
5.3 Stochastic Gradient Descent and Multi-Node Setting
We can extend our results to stochastic gradient methods and to the multi-node setting. Suppose that we have $n$ nodes and
where the objective function is kept by node (in our setting, is the empirical loss of the data which resides in node ). Then we may use the distributed compressed gradient method
where is the stochastic sparsifier. Here is the stochastic gradient at . We assume that
is unbiased and satisfies a bounded variance assumption, i.e.
where the expectation is with respect to a distribution of local data stored in node . These conditions are standard for analysis of first-order algorithms in machine learning (feyzmahdavian2016asynchronous; lian2015asynchronous). We have the following convergence result
Suppose that is -smooth for each . Let be the iterates of Algorithm (17) and suppose that there exists such that for all and . Set and . Then
then we find in
(Convex) If is convex and
then we find
(Strongly convex) If is -strongly convex and
then we find in
Theorem 1 establishes the convergence of stochastic sparsification for stochastic gradient descent in multi-node setting. When , our iteration complexity is similar to classical results for stochastic gradient descent; see e.g., (needell2014stochastic). More generally, Theorem 1 shows that the iteration complexity is similar as if we do not use sparsification factored by . Therefore, captures the sparsification gain, similarly as did for deterministic sparsification (see in Table 1 in Section 3.2). In fact, we have a similar lower bound on as proved in Lemma 2 for deterministic sparsification (proof in supplementary materials).
For we have the lower bound
With these results, we can translate many of the conclusions for deterministic sparsification to stochastic sparsification.
6 Experimental Results for Multiple Nodes
We evaluate the performance of the CAT tuning rule for all three compressors discussed in this paper in the multi-node setting on the RCV1 data set. We compare the results to gradient descent and to the sparsification with quantization from Section 4.2, but using the dynamic tuning rule in (alistarh2017qsgd) instead of CAT. We implement all algorithms in Julia, and run them on nodes using MPI by splitting the data evenly between the nodes. In all cases we use the packet communication model in Equation (10) with bytes, bytes and bytes. The results are shown in Figure 4. Our CAT with sparsification together with quantization outperforms all other compression schemes. In particular, CAT for this compression is roughly 4 times more communication efficient than the dynamic rule in (alistarh2017qsgd) for the same compression scheme (compare number of bits needed to reach ).
We have proposed communication-aware adaptive tuning to optimize the communication efficiency of gradient sparsification. The adaptive tuning relies on a data-dependent measure of objective function improvement, and adapts the compression level to maximize the descent per communicated bit. Unlike existing heuristics, our tuning rules are guaranteed to save communication in realistic communication models. In particular, our rules are more communication efficient when communication overhead or packet transmissions are accounted for. In addition to the centralized analysis, our tuning strategies are shown to reduce the number of communicated bits also in distributed scenarios.
Appendix A Proofs of Lemmas and Propositions
A.1 Proof of Lemma 1
By the -smoothness of and the iterate where , from Lemma 1.2.3. of (nesterov2018lectures) we have
It can be verified that
for all and, therefore, if then we have
By the definition of we have , which yields the result.
Next, we prove that there exist -smooth functions where the inequality is tight. Consider . Then, is -smooth, and also satisfies
Since by the definition and , we have
Since , by the definition of
where is the index set of elements with the highest absolute magnitude. Therefore,
A.2 Proof of Lemma 2
Take and, without the loss of generality, we let and (otherwise we may re-order ). To prove that is increasing we rewrite the definition of equivalently as
Notice that when . Clearly, is also increasing with since each term of the sum is increasing.
We prove that is concave by recalling the slope of
for and . Since , the slope of is non-increasing when increases. Therefore, is concave.
We prove the second statement by writing in the form
where is the index set of elements with lowest absolute magnitude. Applying the fact that for and that into the main inequality, we have
By the definition of , we get
Finally, we prove the last statement by setting where . Then is -smooth and its gradient is
A.3 Proof of Proposition 1
The ratio between a non-negative concave function and a positive affine function is quasi-concave and semi-strictly quasi-concave (Schaible2013; Avriel2010), meaning that every local maximal point is globally maximal.
Next, we consider when If , then due to monotonicity of and , meaning that . This implies that maximizes for and that we can obtain the maximum of for some integers .
A.4 Proof of Proposition 2
Take and, without the loss of generality, we let and (otherwise we may re-order ). Since where , we have
Since , the solution from Equation (12) is for all .
A.5 Proof of Lemma 3
By using the -smoothness of (Lemma 1.2.3. of (nesterov2018lectures)) and the iterate where , we have
If has non-zero elements, then we can easily prove that
where is defined in Equation (14). Plugging these equations into the above inequality yields
Setting completes the proof.
A.6 Proof of Lemma 4
By using the -smoothness of (Lemma 1.2.3. of (nesterov2018lectures)) and the iterate where , we have
Taking the expectation, recalling the definition of in Equation (15), and using the unbiased property of we get
Now taking completes the proof.
A.7 Proof of Lemma 5
If in Equation (6) has for all , then
In we use the optimal that minimizes . This implies that .
Appendix B Proof of Theoretical Results for Table 1
Proof of Non-convex Optimization
By recursively applying the inequality from Lemma 1 with and , we have