Implementation of Compressed SGD with Compressed Gradients in Pytorch
Modern large-scale machine learning applications require stochastic optimization algorithms to be implemented on distributed compute systems. A key bottleneck of such systems is the communication overhead for exchanging information across the workers, such as stochastic gradients. Among the many techniques proposed to remedy this issue, one of the most successful is the framework of compressed communication with error feedback (EF). EF remains the only known technique that can deal with the error induced by contractive compressors which are not unbiased, such as Top-K. In this paper, we propose a new and theoretically and practically better alternative to EF for dealing with contractive compressors. In particular, we propose a construction which can transform any contractive compressor into an induced unbiased compressor. Following this transformation, existing methods able to work with unbiased compressors can be applied. We show that our approach leads to vast improvements over EF, including reduced memory requirements, better communication complexity guarantees and fewer assumptions. We further extend our results to federated learning with partial participation following an arbitrary distribution over the nodes, and demonstrate the benefits thereof. We perform several numerical experiments which validate our theoretical findings.
We consider distributed optimization problems of the form
\[\min_{x \in \mathbb{R}^d} f(x) := \frac{1}{n} \sum_{i=1}^n f_i(x),\]
where $x \in \mathbb{R}^d$ represents the weights of a statistical model we wish to train, $n$ is the number of nodes, and $f_i$ is a smooth differentiable loss function composed of data stored on worker $i$. In a classical distributed machine learning scenario, $f_i$ is the expected loss of model $x$ with respect to the local data distribution $\mathcal{D}_i$, of the form $f_i(x) = \mathbb{E}_{\xi \sim \mathcal{D}_i}[f_\xi(x)]$, where $f_\xi(x)$ is the loss on the single data point $\xi$. This definition allows for different distributions on each node, which means that the functions $f_i$ can have different minimizers. This framework covers
Stochastic Optimization, when either $n = 1$ or all $f_i$ are identical,
Empirical Risk Minimization (ERM), when $f_i$ can be expressed as a finite average, i.e., $f_i(x) = \frac{1}{m} \sum_{j=1}^m f_{ij}(x)$ for some $m$,
Federated Learning (FL) (Kairouz et al., 2019) where each node represents a client.
In distributed training, model updates (or gradient vectors) have to be exchanged in each iteration. Due to the size of the communicated messages for commonly considered deep models (Alistarh et al., 2016), this represents a significant bottleneck of the whole optimization procedure. To reduce the amount of data that has to be transmitted, several strategies have been proposed.
One of the most popular strategies is to incorporate local steps and communicate updates only every few iterations (Stich, 2019a; Lin et al., 2018a; Stich and Karimireddy, 2020; Karimireddy et al., 2019a; Khaled et al., 2020). Unfortunately, despite their practical success, local methods are poorly understood and their theoretical foundations are currently lacking. Almost all existing error guarantees are dominated by a simple baseline, minibatch SGD (Woodworth et al., 2020).
In this work, we focus on another popular approach: gradient compression. In this approach, instead of transmitting the full-dimensional gradient vector $g$, one transmits a compressed vector $\mathcal{C}(g)$, where $\mathcal{C}$ is a (possibly random) operator chosen such that $\mathcal{C}(g)$ can be represented using fewer bits, for instance by using limited bit representation (quantization) or by enforcing sparsity. A particularly popular class of quantization operators is based on random dithering (Goodall, 1951; Roberts, 1962); see (Alistarh et al., 2016; Wen et al., 2017; Zhang et al., 2017; Horváth et al., 2019a; Ramezani-Kebrya et al., 2019). Much sparser vectors can be obtained by random sparsification techniques that randomly mask the input vectors and only preserve a constant number of coordinates (Wangni et al., 2018; Konečný and Richtárik, 2018; Stich et al., 2018; Mishchenko et al., 2019b; Vogels et al., 2019). There is also a line of work (Horváth et al., 2019a; Basu et al., 2019) in which a combination of sparsification and quantization was proposed to obtain a more aggressive effect. We will not further distinguish between sparsification and quantization approaches, and refer to all of them as compression operators hereafter.
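To make the two families concrete, here is a minimal NumPy sketch (our own illustration, not the paper's code) of the two most common sparsifiers discussed below: the greedy Top-K operator and the unbiased Rand-K operator.

```python
import numpy as np

def top_k(x, k):
    """Greedy (biased) Top-K: keep the k largest-magnitude coordinates."""
    out = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[-k:]  # indices of the k largest |x_i|
    out[idx] = x[idx]
    return out

def rand_k(x, k, rng):
    """Unbiased Rand-K: keep k uniformly random coordinates,
    rescaled by d/k so that E[rand_k(x)] = x."""
    d = x.size
    out = np.zeros_like(x)
    idx = rng.choice(d, size=k, replace=False)
    out[idx] = x[idx] * (d / k)
    return out
```

Top-K transmits the most informative coordinates but introduces a systematic bias; Rand-K is unbiased at the price of higher variance.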
Considering both practice and theory, compression operators can be split into two groups: biased and unbiased. An unbiased compressor is required to produce $\mathcal{C}(g)$ that is an unbiased estimator of the update $g$. Once this requirement is lifted, extra tricks are necessary for Distributed Compressed Stochastic Gradient Descent (DCSGD) utilizing such a compressor to work, even if the full gradient is computed by each node. Indeed, a naive approach can lead to divergence (Beznosikov et al., 2020), and Error Feedback (EF) (Seide et al., 2014; Karimireddy et al., 2019b) is the only known mechanism able to remedy the situation and lead to a convergent method.
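The EF mechanism referenced here can be sketched in a few lines. The single-node toy loop below is our own illustration (the quadratic objective and all constants are ours): the compressor is applied to the gradient plus an accumulated residual, and whatever the compressor drops is stored for the next step.

```python
import numpy as np

def top_k(x, k):
    """Greedy Top-K: keep the k largest-magnitude coordinates."""
    out = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[-k:]
    out[idx] = x[idx]
    return out

def ef_sgd(grad, x0, lr, k, steps):
    """SGD with Error Feedback: compress (gradient + residual), and keep
    whatever the compressor dropped in the residual e for the next step."""
    x = x0.astype(float).copy()
    e = np.zeros_like(x)          # the extra vector EF must keep in memory
    for _ in range(steps):
        g = grad(x)
        c = top_k(g + e, k)       # only c is transmitted
        e = g + e - c             # residual re-injected at the next step
        x = x - lr * c
    return x
```

For $f(x) = \tfrac{1}{2}\|x - a\|^2$ this loop drives $x$ to $a$ even with $k = 1$, at the price of storing $e$ -- the memory cost that can be avoided with unbiased compressors, as argued below.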
Our contributions can be summarized as follows:
We provide a theoretical analysis of Compressed SGD under weak and general assumptions. If $f$ is $\mu$-quasi convex (not necessarily convex) and the local functions are smooth in a suitably weakened sense of standard $L$-smoothness, we obtain a convergence rate in which $\omega$, the parameter bounding the second moment of the compression operator, and $T$, the number of iterations, appear explicitly. This rate is strictly better than the best-known rate for Compressed SGD with EF. Moreover, the latter requires extra assumptions. In addition, our theory guarantees convergence in both the iterates and the functional value. For EF, the best known rates (Karimireddy et al., 2019b; Beznosikov et al., 2020) are expressed in terms of functional values only. Another practical implication of our findings is the reduction of memory requirements by half; this is because in Compressed SGD, one does not need to store the error vector.
We propose a construction that can transform any biased compressor into an unbiased one (Section 3). We argue that using such an induced compressor within Compressed SGD is superior, both in theory and practice, to using the original biased compressor in conjunction with EF.
We further extend our results to the multi-node scenario and show that the resulting method, Distributed Compressed SGD (DCSGD), improves linearly with the number of nodes, which is not the case for EF. Moreover, we obtain the first convergence guarantee for partial participation with arbitrary distributions over nodes, which plays a key role in Federated Learning.
Finally, we provide an experimental evaluation on an array of classification tasks with the MNIST and CIFAR10 datasets, corroborating our theoretical findings.
In this section we first introduce the notions of unbiased and general compression operators, and then compare Distributed Compressed SGD (DCSGD) without (Algorithm 1) and with (Algorithm 2) Error Feedback.
A randomized mapping $\mathcal{C}: \mathbb{R}^d \to \mathbb{R}^d$ is an unbiased compression operator (unbiased compressor) if there exists $\omega \ge 0$ such that
\[\mathbb{E}[\mathcal{C}(x)] = x, \qquad \mathbb{E}\big[\|\mathcal{C}(x) - x\|^2\big] \le \omega \|x\|^2 \quad \text{for all } x \in \mathbb{R}^d. \tag{2}\]
If this holds, we will for simplicity write $\mathcal{C} \in \mathbb{U}(\omega)$.
A (possibly) randomized mapping $\mathcal{C}: \mathbb{R}^d \to \mathbb{R}^d$ is a general compression operator (general compressor) if there exists $\delta \ge 1$ such that
\[\mathbb{E}\big[\|\mathcal{C}(x) - x\|^2\big] \le \left(1 - \tfrac{1}{\delta}\right)\|x\|^2 \quad \text{for all } x \in \mathbb{R}^d. \tag{3}\]
If this holds, we will for simplicity write $\mathcal{C} \in \mathbb{B}(\delta)$.
To link these two definitions, we include the following simple lemma (see, e.g. Beznosikov et al. (2020)).
If $\mathcal{C} \in \mathbb{U}(\omega)$, then (3) holds for the scaled operator $(\omega+1)^{-1}\mathcal{C}$ with $\delta = \omega + 1$, i.e., $(\omega+1)^{-1}\mathcal{C} \in \mathbb{B}(\omega+1)$. That is, $\mathbb{U}(\omega) \subset \mathbb{B}(\omega+1)$ up to scaling.
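Reading the contraction inequality (3) as $\mathbb{E}\|\mathcal{C}(x) - x\|^2 \le (1 - \delta^{-1})\|x\|^2$ (our assumption about the stripped formula), the lemma can be sanity-checked numerically. For Rand-$K$ one has $\omega = d/K - 1$, and the scaled operator $\mathcal{C}/(\omega+1)$ reduces to a plain random mask, which satisfies the contraction bound with $\delta = \omega + 1$ (with equality, in fact). The Monte Carlo check below is our own illustration under these assumed definitions.

```python
import numpy as np

rng = np.random.default_rng(1)
d, k = 10, 3
omega = d / k - 1                       # Rand-K lies in U(omega)
x = rng.standard_normal(d)

def rand_k(v):
    """Unbiased Rand-K: keep k random coordinates, scale by d/k."""
    out = np.zeros_like(v)
    idx = rng.choice(d, size=k, replace=False)
    out[idx] = v[idx] * (d / k)
    return out

# Monte Carlo estimate of E || C(x)/(omega+1) - x ||^2 for the scaled operator
mse = np.mean([np.sum((rand_k(x) / (omega + 1) - x) ** 2)
               for _ in range(20000)])
# contraction bound with delta = omega + 1
bound = (1 - 1 / (omega + 1)) * np.sum(x ** 2)
```

Since equality holds for Rand-$K$, the estimate `mse` concentrates around `bound`.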
Note that the opposite inclusion to that established in the above lemma does not hold. For instance, the Top-$K$ operator belongs to $\mathbb{B}(\delta)$, but does not belong to $\mathbb{U}(\omega)$ for any $\omega$. In the next section we develop a procedure for transforming any mapping (and in particular, any general compressor) into a closely related induced unbiased compressor.
In the rest of this section, we compare the convergence rates for Distributed Compressed SGD (Algorithm 1) and Distributed Compressed SGD with Error Feedback (Algorithm 2). We do this comparison under standard assumptions (Karimi et al., 2016; Bottou et al., 2018; Necoara et al., 2019; Gower et al., 2019; Stich, 2019b; Stich and Karimireddy, 2020), listed next.
First, we assume throughout that $f$ has a unique minimizer $x^\star$, and let $f^\star = f(x^\star)$.
$f$ is $\mu$-quasi convex, i.e.,
\[ f^\star \ge f(x) + \langle \nabla f(x), x^\star - x \rangle + \tfrac{\mu}{2}\|x^\star - x\|^2 \quad \text{for all } x \in \mathbb{R}^d. \]
Note that this assumption implies .
each function $f_i$ is smooth in the following generalized sense: there exist constants $L > 0$ and $\sigma^2 \ge 0$ such that
\[ \|\nabla f_i(x)\|^2 \le 2L\big(f_i(x) - f_i^\star\big) + \sigma^2 \quad \text{for all } x \in \mathbb{R}^d, \]
where $f_i^\star$ is the minimum functional value of $f_i$.
This assumption can be seen as a generalization of standard $L$-smoothness. For more details and discussion, see e.g. (Gower et al., 2019; Stich, 2019b). Equipped with these assumptions, we are ready to proceed with the convergence theory.
Note that the statistical term does not depend on compression and matches the optimal rate for SGD (Stich, 2019b), including the constant. The other important aspect to consider is the first term. It guarantees linear convergence if $\sigma^2 = 0$, which holds for commonly used over-parameterized networks (Vaswani et al., 2019), as one can reach zero training loss. Comparing our results to the best-known result for Error Feedback (Stich and Karimireddy, 2020), our theory allows for larger stepsizes. Moreover, our convergence guarantee (7) for unbiased compressors implies convergence for both the functional values and the last iterate, rather than for functional values only. In addition, while the rate of DCSGD as captured by Theorem 2 and the rate of DCSGD with Error Feedback (Stich and Karimireddy, 2020) are the same in $\mathcal{O}$-notation, our rate has strictly better constants and does not contain any hidden polylogarithmic factors. Another practical advantage is that there is no need to store an extra vector for the error, which reduces the storage costs by a factor of two, making Algorithm 1 a viable choice for Deep Learning models with millions of parameters. Finally, one does not need to assume standard smoothness in order to prove convergence, while, on the other hand, standard smoothness is an important building block in the convergence proofs for general compressors due to the presence of the error (Stich and Karimireddy, 2020; Beznosikov et al., 2020). Putting everything together, this suggests that standard DCSGD (Algorithm 1) is preferable, in theory, to DCSGD with Error Feedback (Algorithm 2) for unbiased compressors.
In the previous section, we showed that DCSGD with an unbiased compressor is theoretically preferable to DCSGD with Error Feedback. Unfortunately, not every compressor of practical interest is unbiased, an example being the Top-$K$ compressor (Alistarh et al., 2018; Stich et al., 2018), which operates by keeping only the $K$ largest coordinates in magnitude and setting the rest to zero. This compressor belongs to $\mathbb{B}(\delta)$, but does not belong to $\mathbb{U}(\omega)$ for any $\omega$. On the other hand, multiple unbiased alternatives to Top-$K$ have been proposed in the literature, including gradient sparsification (Wangni et al., 2018) and adaptive random sparsification (Beznosikov et al., 2020).
We now propose a new way of constructing an unbiased compressor from any compressor $\mathcal{C}_1 \in \mathbb{B}(\delta)$. We shall argue that using this induced compressor within DCSGD is preferable, in both theory and practice, to using the original compressor within DCSGD + Error Feedback.
For $\mathcal{C}_1 \in \mathbb{B}(\delta)$, choose $\mathcal{C}_2 \in \mathbb{U}(\omega)$ and define the induced compressor via
\[ \mathcal{C}(x) := \mathcal{C}_1(x) + \mathcal{C}_2\big(x - \mathcal{C}_1(x)\big). \]
The induced operator $\mathcal{C}$ satisfies $\mathcal{C} \in \mathbb{U}(\omega')$ for a parameter $\omega'$ depending on $\delta$ and $\omega$.
To get some intuition about this procedure, first recall the structure used in Error Feedback. The gradient estimator is first compressed with $\mathcal{C}_1$, and the error $e = x - \mathcal{C}_1(x)$ is computed and stored in memory. In our proposed approach, instead of storing the error $e$, we compress it with an unbiased compressor $\mathcal{C}_2$ and communicate both compressed vectors. Note that this procedure results in extra variance, as we do not work with the exact error, but only with its unbiased estimate. On the other hand, there is no bias. In addition, due to our construction, at least the same amount of information is sent as for the plain biased compressor. The only drawback is the necessity to send two compressed vectors instead of one. Theorem 3 provides freedom in generating the induced compressor through the choice of the unbiased compressor $\mathcal{C}_2$. In practice, it makes sense to choose $\mathcal{C}_2$ with a compression factor similar to (or smaller than) that of the compressor we are transforming, as this way the total communication complexity per iteration is preserved, up to a factor of two.
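The construction admits a very short implementation. The sketch below is our own code, with Top-$K$ as $\mathcal{C}_1$ and Rand-$K$ as $\mathcal{C}_2$ purely for illustration; it makes the unbiasedness mechanical, since the expectation of the second term restores exactly the error dropped by the first.

```python
import numpy as np

def top_k(x, k):
    """Biased C1: keep the k largest-magnitude coordinates."""
    out = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[-k:]
    out[idx] = x[idx]
    return out

def rand_k(x, k, rng):
    """Unbiased C2: keep k random coordinates, scale by d/k."""
    d = x.size
    out = np.zeros_like(x)
    idx = rng.choice(d, size=k, replace=False)
    out[idx] = x[idx] * (d / k)
    return out

def induced(x, k, rng):
    """Induced unbiased compressor: transmit C1(x) and C2(x - C1(x)).
    E[C1(x) + C2(x - C1(x))] = C1(x) + (x - C1(x)) = x."""
    c1 = top_k(x, k)
    return c1 + rand_k(x - c1, k, rng)
```

Both `top_k(x, k)` and the compressed error are transmitted, so the per-round cost is roughly doubled, matching the discussion above.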
In the light of the results in Section 2, we argue that one should always prefer unbiased compressors to biased ones as long as their variances and communication complexities are the same, e.g., Rand-$K$ over Top-$K$. Contrary to the theory, greedy compressors are often observed to perform better in practice due to their lower empirical variance (Beznosikov et al., 2020).
These considerations give practical significance to Theorem 3, as we demonstrate on the following example. Let us consider two compressors, one biased $\mathcal{C}_1 \in \mathbb{B}(\delta)$ and one unbiased $\mathcal{C}_2 \in \mathbb{U}(\omega)$, having identical communication complexity, e.g., Top-$K$ and Rand-$K$. The induced compressor then belongs to $\mathbb{U}(\omega')$, where $\omega'$ is as given by Theorem 3.
Based on the construction of the induced compressor, one might expect that extra memory is needed, since "the error" has to be stored, albeit during computation only. This is not an issue, as compressors for DNNs are always applied layer-wise (Dutta et al., 2019), and hence the size of the extra memory is negligible. The same does not help EF, as there the error needs to be stored at all times for each layer.
We now develop several extensions of Algorithm 1 relevant to distributed optimization in general, and to Federated Learning in particular. This is all possible due to the simplicity of our approach. Note that in the case of Error Feedback, these extensions have either not been obtained yet, or similarly to Section 2, the results are worse when compared to our derived bounds for unbiased compressors.
We begin with the general multi-node case. The following theorem provides the convergence rate of Algorithm 1.
Inspecting the convergence rate, observe that Theorem 2 arises as a special case of Theorem 4 for $n = 1$. Similar arguments and comments can be made as those in the discussion after Theorem 2. However, now we need to compare with the complexity results of Beznosikov et al. (2020), who analyzed Algorithm 2 in the multi-node case. In addition, the multi-node scenario reduces the effect of the variance constant by a factor of $n$, which is not the case for EF.
In this section, we extend our results to a variant of DCSGD utilizing partial participation, which is of key relevance to Federated Learning. In this framework, only a subset of all nodes communicates with the master node in each communication round. In this work, we consider a very general partial participation framework: we assume that the subset of participating clients is determined by a fixed but otherwise arbitrary random set-valued mapping $S$ (a "sampling") with values in $2^{[n]}$, where $[n] := \{1, 2, \ldots, n\}$. To the best of our knowledge, this is the first partial participation result where an arbitrary distribution over the nodes is considered.
On the other hand, this is not the first work which makes use of the arbitrary sampling paradigm; this was used before in other contexts, e.g., for obtaining importance sampling guarantees for coordinate descent (Qu et al., 2015), primal-dual methods (Chambolle et al., 2018), and variance reduction (Horváth and Richtárik, 2019).
Note that the sampling $S$ is uniquely defined by assigning probabilities to all $2^n$ subsets of $[n]$. With each sampling $S$ we associate a probability matrix $\mathbf{P} \in \mathbb{R}^{n \times n}$ defined by $\mathbf{P}_{ij} := \mathrm{Prob}(\{i, j\} \subseteq S)$. The probability vector associated with $\mathbf{P}$ is the vector $p = (p_1, \ldots, p_n)$ composed of the diagonal entries of $\mathbf{P}$: $p_i = \mathrm{Prob}(i \in S)$. We say that $S$ is proper if $p_i > 0$ for all $i$. It is easy to show that $\mathbb{E}[|S|] = \sum_i p_i$, and hence $\sum_i p_i$ can be seen as the expected number of clients participating in each communication round.
There are two algorithmic changes due to this extension: the loop in Algorithm 1 does not iterate over every node, but only over nodes $i \in S_k$, where $S_k \sim S$, and the aggregation step is adjusted to lead to an unbiased estimator of the gradient, which gives $g^k = \frac{1}{n} \sum_{i \in S_k} \frac{g_i^k}{p_i}$.
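The reweighted aggregation can be checked to be unbiased by direct enumeration. The toy sampling below (three nodes and an arbitrary distribution over three subsets; all numbers are our own) computes the marginal probabilities $p_i$ and verifies that the estimator's expectation equals the full gradient average exactly.

```python
import numpy as np

# A toy arbitrary sampling over n = 3 nodes: explicit distribution over subsets.
n = 3
subsets = [(0,), (1, 2), (0, 1, 2)]
probs = [0.5, 0.3, 0.2]                 # P(S = subset), sums to 1

# marginal inclusion probabilities p_i = P(i in S)
p = np.zeros(n)
for S, q in zip(subsets, probs):
    for i in S:
        p[i] += q

g = np.array([[1., 0.], [0., 2.], [3., 1.]])  # per-node gradients g_i
full = g.mean(axis=0)                          # full-participation average

# exact expectation of the reweighted estimator (1/n) * sum_{i in S} g_i / p_i
est = np.zeros(2)
for S, q in zip(subsets, probs):
    est += q * sum(g[i] / p[i] for i in S) / n
```

Because $\mathbb{E}\big[\sum_{i \in S} g_i / p_i\big] = \sum_i g_i$, the estimator `est` matches `full` exactly, whatever the distribution over subsets.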
To prove convergence, we exploit the following lemma.
Let $\zeta_1, \ldots, \zeta_n$ be vectors in $\mathbb{R}^d$ and let $\bar{\zeta} = \frac{1}{n}\sum_{i=1}^n \zeta_i$ be their average. Let $S$ be a proper sampling. Then there exists $v \in \mathbb{R}^n_+$ such that
\[ \mathbb{E}\Big[\Big\|\tfrac{1}{n}\textstyle\sum_{i \in S} \tfrac{\zeta_i}{p_i} - \bar{\zeta}\Big\|^2\Big] \le \tfrac{1}{n^2} \sum_{i=1}^n \tfrac{v_i}{p_i} \|\zeta_i\|^2, \]
where the expectation is taken over the sampling $S$.
The following theorem establishes the convergence rate for Algorithm 1 with partial participation.
For the case $S = [n]$ with probability $1$, one can show that Lemma 5 holds with $v = 0$, and hence we exactly recover the results of Theorem 4. In addition, we can quantify the slowdown factor with respect to the full participation regime (Theorem 4). While in our framework we assume the distribution of $S$ to be fixed, using the results of Eichner et al. (2019), one could extend this result to a block-cyclic structure with each block having an arbitrary distribution.
Note that in all the previous theorems, we can only guarantee a sublinear convergence rate. A linear rate is obtained in the special case when $\sigma^2 = 0$ (in which case the statistical term vanishes). This is satisfied if there is no noise at the optimum, which is the case for over-parameterized models. Furthermore, a linear rate can be obtained using compression of gradient differences, as pioneered in the DIANA algorithm (Mishchenko et al., 2019a). Both of these scenarios were already considered in Horváth et al. (2019b) in the framework of Theorem 4 with full participation. These results can be easily extended to partial participation using our proof technique for Theorem 6. Note that this reduction is not possible for Error Feedback, as the analysis of the DIANA algorithm relies heavily on the unbiasedness property. This points to another advantage of the induced compressor framework introduced in Section 3.
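For intuition, one round of DIANA-style gradient-difference compression can be sketched as follows. This is our own simplification with Rand-$K$ as the unbiased compressor; variable names and constants are ours, not those of Mishchenko et al. (2019a). Because each node transmits a compressed difference between its gradient and a learned shift $h_i$, the transmitted vectors, and hence the compression variance, shrink as the method approaches the optimum.

```python
import numpy as np

def rand_k(x, k, rng):
    """Unbiased Rand-K compressor."""
    out = np.zeros_like(x)
    idx = rng.choice(x.size, size=k, replace=False)
    out[idx] = x[idx] * (x.size / k)
    return out

def diana_step(x, h, grads, lr, alpha, k, rng):
    """One DIANA-style round: compress gradient *differences* g_i - h_i."""
    deltas = [rand_k(g - hi, k, rng) for g, hi in zip(grads, h)]   # transmitted
    g_hat = np.mean([hi + d for hi, d in zip(h, deltas)], axis=0)  # server estimate
    h = [hi + alpha * d for hi, d in zip(h, deltas)]               # shift updates
    return x - lr * g_hat, h
```

With the standard choice $\alpha = 1/(\omega+1)$ (here $k/d$), the shifts $h_i$ track the local gradients at the optimum, which is what enables linear rates in the noiseless setting.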
As the last comparison, we discuss the combination of compression and acceleration/momentum. This setting is important to consider, as essentially all state-of-the-art methods for training deep learning models, including Adam (Kingma and Ba, 2015; Reddi et al., 2018), rely on the use of momentum in one form or another. One can treat the unbiased compressed gradient as a stochastic gradient (Gorbunov et al., 2020), and the theory for momentum SGD (Yang et al., 2016; Gadat et al., 2018; Loizou and Richtárik, 2017) would be applicable with an extra smoothness assumption. Moreover, it is possible to remove the variance caused by stochasticity and obtain linear convergence with an accelerated rate (Li et al., 2020). Similarly to our previous discussion, both of these techniques are heavily dependent on the unbiasedness property. It is an intriguing question, but beyond the scope of this paper, to investigate the combined effect of momentum and Error Feedback and see whether these techniques are compatible theoretically.
To be fair, we always compare methods with the same communication complexity per iteration. We report the number of epochs (passes over the dataset) against training loss, testing loss, and testing accuracy. These are obtained by evaluating the best model in terms of the validation error on the test dataset. The validation error is computed on a randomly selected subset of the training data. Similarly, we tune the step-size using the same validation set. For every experiment, we randomly distribute the training dataset among the workers; each worker computes its local gradient based on its own dataset, using the same batch size. All the provided figures display the mean performance with one standard error over independent runs. For a fair comparison, we use the same random seed for the compared methods. Our experimental results are based on a Python implementation of all the methods running in PyTorch. All reported quantities are independent of the system architecture and network bandwidth. Our implementation is freely available on GitHub: https://github.com/SamuelHorvath/Compressed_SGD_PyTorch.
We evaluate on two datasets, MNIST and CIFAR10. For MNIST, we consider a small neural network model with two fully connected (FC) layers, with 512 neurons in the second layer. The step-size is tuned over a grid of candidate values. For CIFAR10, we consider the VGG11 (Simonyan and Zisserman, 2015) and ResNet18 (He et al., 2016) models, with the step-size tuned in the same fashion. Some of the plots are displayed in the supplementary material, Appendix A.
In our first experiment, we compare the effect of Error Feedback in the case when an unbiased compressor is used. Note that unbiased compressors are theoretically guaranteed to work with both Algorithm 1 and Algorithm 2. We can see from Figure 1 that adding Error Feedback can hurt the performance; we use natural compression (Horváth et al., 2019a) and TernGrad (Wen et al., 2017) (which coincides with QSGD (Alistarh et al., 2016) and with natural dithering (Horváth et al., 2019a) using the infinity norm and one level) as compressors. This agrees with our theoretical findings. In addition, for sparsification techniques such as Random Sparsification or Gradient Sparsification (Wangni et al., 2018), we observed that when the sparsity is set to 10%, Algorithm 1 converges for all the selected step-sizes, while Algorithm 2 diverges and a smaller step-size needs to be used. This is an important observation, as many practical works (Li et al., 2014; Wei et al., 2015; Aji and Heafield, 2017; Hsieh et al., 2017; Lin et al., 2018b; Lim et al., 2018) use the sparsification techniques mentioned in this section together with EF, while our work shows that exploiting unbiasedness leads not only to better convergence but also to memory savings.
In this section, we investigate candidates for unbiased compressors that can compete with Top-$K$, one of the most frequently used compressors. Theoretically, Top-$K$ is not guaranteed to work by itself and might lead to divergence (Beznosikov et al., 2020) unless Error Feedback is applied. One would usually compare the performance of Top-$K$ with EF to Rand-$K$, which keeps $K$ randomly selected coordinates and then scales the output by $d/K$ to preserve unbiasedness. Rather than naively comparing to Rand-$K$, we propose to use different unbiased approaches, more closely related to the Top-$K$ compressor. The first one is the gradient sparsification method of Wangni et al. (2018), which we refer to as Rand-$K$ (Wangni et al.), where the probability of keeping each coordinate scales with its magnitude and the communication budget. As the second alternative, we propose to use our induced compressor, where $\mathcal{C}_1$ is Top-$K$ and the unbiased part $\mathcal{C}_2$ is Rand-$K$ (Wangni et al.) with a matching communication budget. It should be noted that the split of the communication budget can be considered a hyperparameter to tune; for our experiments, we fix it for simplicity. Figure 2 suggests that both of the proposed techniques can outperform Top-$K$ with EF, as can be seen for CIFAR10 with VGG11. Moreover, they do not require extra memory to store the error vector. In addition, our unbiased induced compressor further improves over Rand-$K$ (Wangni et al.). Finally, Top-$K$ without EF suffers a significant decrease in performance, which stresses the necessity of error correction.
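For concreteness, a one-pass version of the Wangni et al. (2018) sparsifier can be sketched as follows. This is our own simplified variant: the original tunes the keep-probabilities iteratively to meet the budget exactly, while we cap them in a single pass.

```python
import numpy as np

def wangni_sparsify(x, k, rng):
    """One-pass magnitude-weighted sparsification: keep coordinate i with
    probability p_i proportional to |x_i| (capped at 1, about k coordinates
    kept in expectation), and rescale by 1/p_i so the output is unbiased."""
    a = np.abs(x)
    if a.sum() == 0:
        return np.zeros_like(x)
    p = np.minimum(1.0, k * a / a.sum())   # keep-probabilities, sum <= k
    mask = rng.random(x.size) < p
    out = np.zeros_like(x)
    out[mask] = x[mask] / p[mask]          # inverse-probability rescaling
    return out
```

Unbiasedness follows from $\mathbb{E}[\text{out}_i] = p_i \cdot x_i / p_i = x_i$; large coordinates are kept with high probability, which is what makes this scheme behave closer to Top-$K$ than plain Rand-$K$ does.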
As the next experiment, we look at the effect of momentum on DCSGD with and without EF. We consider the same setup as in the previous subsections. Based on our discussion of acceleration, we know that unbiased compressors are compatible with momentum and one can obtain theoretical guarantees, while for biased compressors with EF, this is not clear. Figure 3 shows that in terms of the training loss, Top-$K$ with EF performs slightly worse than its unbiased alternative. Similarly to the previous experiment, the performance of Top-$K$ is significantly degraded without EF. As observed in the first experiment, adding EF has a negative impact on the convergence of TernGrad.
In this experiment, we revisit the example considered in Beznosikov et al. (2020), which was used as a counterexample to show that some form of error correction is needed in order for biased compressors to work/provably converge. We run experiments on their construction and show that while Error Feedback fixes the divergence, it is still significantly dominated by unbiased non-uniform sparsification (NU Rand-$1$), which works by keeping only one non-zero coordinate, sampled with probability $|x_i| / \|x\|_1$, where $|\cdot|$ denotes the element-wise absolute value, as can be seen in Figure 4.
Consider and define the following smooth and strongly convex quadratic functions
where . Then, with the initial point
Using the Top- compressor, we get
The next iterate of DCGD is
Repeated application gives , which diverges exponentially fast to since .
In our experiments, we use the starting point defined above and choose the step size $\gamma = 1/L$, where $L$ is the smoothness parameter of $f$. Note that the zero vector is the unique minimizer of $f$.
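To reproduce the qualitative behavior, the following self-contained instance is in the spirit of the construction above, but with our own choice of functions and constants (not necessarily those of Beznosikov et al. (2020)): DCGD with Top-1 and no error correction moves geometrically away from the minimizer, even though the uncompressed average gradient points toward it.

```python
import numpy as np

def top1(v):
    """Top-1: keep only the largest-magnitude coordinate."""
    out = np.zeros_like(v)
    i = np.argmax(np.abs(v))
    out[i] = v[i]
    return out

# f_i(x) = 0.5 * <a_i, x>^2 with a_i the cyclic shifts of (-3, 2, 2);
# f = (1/3) * sum_i f_i is a strongly convex quadratic with unique minimizer 0.
A = np.array([[-3., 2., 2.],
              [2., -3., 2.],
              [2., 2., -3.]])

x = np.ones(3)
gamma = 0.1
for _ in range(20):
    grads = [a * (a @ x) for a in A]                   # exact local gradients
    g = np.mean([top1(gr) for gr in grads], axis=0)    # Top-1, no error correction
    x = x - gamma * g                                  # DCGD step
# Along the direction (1,1,1), each node's Top-1 keeps its -3 coordinate,
# so g = -x and x_{k+1} = (1 + gamma) x_k: the iterates grow geometrically,
# while the uncompressed average gradient at x is x/3, a valid descent direction.
```

This is exactly the failure mode that EF, or the induced unbiased compressor of Section 3, is meant to repair.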
In this paper, we argue that if compressed communication is required for distributed training due to communication overhead, it is better to use unbiased compressors. We show that this leads to strictly better convergence guarantees with fewer assumptions. In addition, we propose a new construction for transforming any compressor into an unbiased one using a compressed EF-like approach. Besides the theoretical superiority, the use of unbiased compressors enjoys lower memory requirements. Our theoretical findings are corroborated by empirical evaluation.
S. Vaswani, F. Bach, and M. Schmidt. Fast and faster convergence of SGD for over-parameterized models and an accelerated perceptron. In Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics (AISTATS), Naha, Okinawa, Japan, 2019.
In this section, we include extra experiments which complement the figures in the main paper. Figure 5 corresponds to the same settings as Figure 1. Analogously, Figure 6 corresponds to Figure 2 and Figure 7 to Figure 3. Essentially, the same can be concluded as we argue in the main paper.
We follow (2), which holds for $\mathcal{C} \in \mathbb{U}(\omega)$.
which concludes the proof.
For the case $n = 1$, Algorithm 1 reduces to its single-node counterpart; thus the update
We start with
Taking full expectation and , we obtain
The rest of the analysis closely follows that of Stich [2019b], with extra adjustments so that it can accommodate compression, represented by the parameter $\omega$. We would like to point out that results similar to Stich [2019b] were also present in [Lacoste-Julien et al., 2012, Stich et al., 2018, Grimmer, 2019].
We first rewrite the previous inequality to the form
where , , , , and .
We proceed with lemmas that establish a convergence guarantee for every recursion of type (10).
Let , be as in (10) for and for constant stepsizes , . Then it holds for all :
This follows by relaxing (10) and unrolling the recursion
Let , as in (10) for and for decreasing stepsizes , , with parameter , and weights . Then
We start by re-arranging (10) and multiplying both sides with
where the equality follows from the definition of and and the inequality from . Again we have a telescoping sum:
and for .
By applying these two estimates we conclude the proof. ∎
The convergence can be obtained as the combination of these two lemmas.
Let the sequences be as in (10). Then there exist stepsizes and weights such that
For integer , we choose stepsizes and weights as follows
for and . We will now show that these choices imply the claimed result.
We start with the case . For this case, the choice gives
If , then we obtain from Lemma 7 that
From Lemma 8 we have for the second half of the iterates:
Now we observe that the restart condition satisfies:
This concludes the proof.
We have to show that our new compressor is unbiased and has bounded variance. We start with the first property:
where the first equality follows from the tower property and the second from the unbiasedness of $\mathcal{C}_2$. For the second property, we also use the tower property