Huge-scale machine learning problems are nowadays tackled by distributed optimization algorithms, i.e. algorithms that leverage the compute power of many devices for training. The communication overhead is a key bottleneck that hinders perfect scalability. Various recent works proposed to use quantization or sparsification techniques to reduce the amount of data that needs to be communicated, for instance by only sending the most significant entries of the stochastic gradient (top-$k$ sparsification). Whilst such schemes showed very promising performance in practice, they have eluded theoretical analysis so far. In this work we analyze Stochastic Gradient Descent (SGD) with $k$-sparsification or compression (for instance top-$k$ or random-$k$) and show that this scheme converges at the same rate as vanilla SGD when equipped with error compensation (keeping track of accumulated errors in memory). That is, communication can be reduced by a factor of the dimension of the problem (sometimes even more) whilst still converging at the same rate. We present numerical experiments to illustrate the theoretical findings and the better scalability for distributed applications.
Stochastic Gradient Descent (SGD) [28] and variants thereof (e.g. [9, 15]) are among the most popular optimization algorithms in machine learning and deep learning [5]. SGD consists of iterations of the form

$$x_{t+1} := x_t - \eta_t g_t, \tag{1}$$

for iterates $x_t, x_{t+1} \in \mathbb{R}^d$, stepsize (or learning rate) $\eta_t > 0$, and stochastic gradient $g_t$ with the property $\mathbb{E}[g_t] = \nabla f(x_t)$, for a loss function $f \colon \mathbb{R}^d \to \mathbb{R}$. SGD addresses the computational bottleneck of full gradient descent, as the stochastic gradients $g_t$ can in general be computed much more efficiently than a full gradient $\nabla f(x_t)$. However, note that in general both $g_t$ and $\nabla f(x_t)$ are dense vectors¹ of size $d$, i.e. SGD does not address the communication bottleneck of gradient descent, which occurs as a roadblock both in distributed as well as parallel training. In the setting of distributed training, communicating the stochastic gradients to the other workers is a major limiting factor for many large scale (deep) learning applications, see e.g. [32, 3, 43, 20]. The same bottleneck can also appear for parallel training, e.g. in the increasingly common setting of a single multicore machine or device, where locking and bandwidth of memory write operations for the common shared parameter $x_t$ often form the main bottleneck, see e.g. [24, 13, 17].

¹ Note that the stochastic gradients $g_t$ are dense vectors for the setting of training neural networks. The $g_t$ themselves can be sparse for generalized linear models under the additional assumption that the data is sparse.

A remedy to address these issues seems to be to enforce applying smaller and more efficient updates $\mathrm{comp}(g_t)$ instead of $g_t$, where $\mathrm{comp} \colon \mathbb{R}^d \to \mathbb{R}^d$ generates a compression of the gradient, such as by lossy quantization or sparsification. We discuss different schemes below. However, too aggressive compression can hurt the performance, unless it is implemented in a clever way: 1Bit-SGD [32, 36] combines gradient quantization with an error compensation technique, which is a memory or feedback mechanism. In this work we leverage this key mechanism but apply it within the more general setting of SGD. We will now sketch how the algorithm uses feedback to correct for errors accumulated in previous iterations. Roughly speaking, the method keeps track of a memory vector $m$ which contains the sum of the information that has been suppressed thus far, i.e. $m_{t+1} := m_t + g_t - \mathrm{comp}(g_t)$, and injects this information back in the next iteration, by transmitting $\mathrm{comp}(m_{t+1} + g_{t+1})$ instead of only $\mathrm{comp}(g_{t+1})$. Note that updates of this kind are not unbiased (even if $\mathrm{comp}(g_{t+1})$ would be) and there is also no control over the delay after which the single coordinates are applied. These are some (technical) reasons why there exists no theoretical analysis of this scheme up to now.
In this paper we give a concise convergence rate analysis for SGD with memory and $k$-compression operators², such as (but not limited to) top-$k$ sparsification. Our analysis also supports ultra-sparsification operators for which $k < 1$, i.e. where less than one coordinate of the stochastic gradient is applied on average in (1). We not only provide the first convergence result of this method, but the result also shows that the method converges at the same rate as vanilla SGD.

² See Definition 2.1.
There are several ways to reduce the communication in SGD, for instance by simply increasing the amount of computation before communication, i.e. by using large minibatches (see e.g. [42, 11]), or by designing communication-efficient schemes [44]. These approaches are largely orthogonal to the methods we consider in this paper, which focus on quantization or sparsification of the gradient.
Several papers consider approaches that limit the number of bits to represent floating point numbers [12, 23, 30]. Recent work proposes adaptive tuning of the compression ratio [7]. Unbiased quantization operators not only limit the number of bits, but quantize the stochastic gradients in such a way that they are still unbiased estimators of the gradient [3, 40]. The ZipML framework also applies this technique to the data [43]. Sparsification methods reduce the number of non-zero entries in the stochastic gradient [3, 39].

A very aggressive sparsification method is to keep only very few coordinates of the stochastic gradient by considering only the coordinates with the largest magnitudes [8, 1]. In contrast to the unbiased schemes it is clear that such methods can only work by using some kind of error accumulation or feedback procedure, similar to the one that we have already discussed [32, 36], as otherwise certain coordinates could simply never be updated. However, in certain applications no feedback mechanism is needed [37]. More elaborate sparsification schemes have also been introduced [20].
Asynchronous updates provide an alternative way to hide the communication overhead to a certain extent [18]. However, those methods usually rely on a sparsity assumption on the updates [24, 30], which is not realistic e.g. in deep learning. We would like to advocate that combining gradient sparsification with those asynchronous schemes seems to be a promising approach, as it combines the best of both worlds. Other scenarios that could profit from sparsification are heterogeneous systems or specialized hardware, e.g. accelerators [10, 43].
Convergence proofs for SGD [28] typically rely on averaging the iterates [29, 26, 22], though convergence of the last iterate can also be proven [33]. For our convergence proof we rely on averaging techniques that give more weight to more recent iterates [16, 33, 27], as well as the perturbed iterate framework from Mania et al. [21] and techniques from [17, 35].
Concurrently with our work, [4, 38] propose related schemes at NIPS 2018. We will update this discussion once those papers become available. Another concurrent analysis [41] at ICML 2018 is restricted to unbiased gradient compression. This scheme also critically relies on an error compensation technique, but in contrast to our work the analysis is restricted to quadratic functions and the scheme introduces two additional hyperparameters that control the feedback mechanism.
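We consider finite-sum convex optimization problems $f \colon \mathbb{R}^d \to \mathbb{R}$ of the form

$$f(x) := \frac{1}{n}\sum_{i=1}^{n} f_i(x), \qquad x^\star := \operatorname*{arg\,min}_{x \in \mathbb{R}^d} f(x), \qquad f^\star := f(x^\star), \tag{2}$$

where each $f_i$ is $L$-smooth³ and $f$ is $\mu$-strongly convex⁴. We consider a sequential sparsified SGD algorithm with error accumulation technique and prove convergence for $k$-compression operators, $0 < k \le d$ (for instance the sparsification operators top-$k$ or random-$k$). For appropriately chosen stepsizes and an averaged iterate $\bar{x}_T$ after $T$ steps we show convergence

$$\mathbb{E} f(\bar{x}_T) - f^\star = \mathcal{O}\Bigl(\frac{G^2}{\mu T}\Bigr) + \mathcal{O}\Bigl(\frac{d^2}{k^2}\,\frac{G^2 \kappa}{\mu T^2}\Bigr) + \mathcal{O}\Bigl(\frac{d^3}{k^3}\,\frac{G^2}{\mu T^3}\Bigr), \tag{3}$$

for $\kappa = \frac{L}{\mu}$ and $G^2 \ge \mathbb{E}\|\nabla f_i(x_t)\|^2$. Not only is this, to the best of our knowledge, the first convergence result for sparsified SGD with memory, but the result also shows that for sufficiently large $T$ the first term is dominating and the convergence rate is the same as for vanilla SGD.

³ $f_i(y) \le f_i(x) + \langle \nabla f_i(x), y - x \rangle + \frac{L}{2}\|y - x\|^2$, $\forall x, y \in \mathbb{R}^d$, $i \in [n]$.

⁴ $f(y) \ge f(x) + \langle \nabla f(x), y - x \rangle + \frac{\mu}{2}\|y - x\|^2$, $\forall x, y \in \mathbb{R}^d$.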
We introduce the method formally in Section 2 and show a sketch of the convergence proof in Section 3. In Section 4 we include a few numerical experiments for illustrative purposes. The experiments highlight that top-$k$ sparsification yields a very effective compression method and does not hurt convergence. Our multicore simulations demonstrate that SGD with memory scales better than asynchronous SGD thanks to the enforced sparsity of the updates. It also drastically decreases the communication cost without sacrificing the rate of convergence. We would like to stress that the effectiveness of SGD variants with sparsification techniques has already been demonstrated in practice [8, 1, 20, 36, 32].
Although we do not yet provide convergence guarantees for parallel and asynchronous variants of the scheme, this is the main application of this method. For instance, we would like to highlight that asynchronous SGD schemes [24, 2] could profit from the gradient sparsification. To demonstrate this use case, we include in Section 4 a set of experiments for a multicore implementation.
In this section we present the sparse SGD algorithm with memory. First we introduce the sparsification operators that we use to drastically reduce the communication cost in comparison with vanilla SGD.
We consider compression operators that satisfy the following contraction property:

Definition 2.1 ($k$-contraction). For a parameter $0 < k \le d$, a $k$-contraction operator is a (possibly randomized) operator $\mathrm{comp} \colon \mathbb{R}^d \to \mathbb{R}^d$ that satisfies the contraction property

$$\mathbb{E}\,\|x - \mathrm{comp}(x)\|^2 \le \Bigl(1 - \frac{k}{d}\Bigr)\|x\|^2, \qquad \forall x \in \mathbb{R}^d. \tag{4}$$

The contraction property is sufficient to obtain all mathematical results that are derived in this paper. However, note that (4) does not imply that $\mathrm{comp}(x)$ is necessarily a sparse vector. Dense vectors can also satisfy (4). One of the main goals of this work is to derive communication-efficient schemes, thus we are particularly interested in operators that also ensure that $\mathrm{comp}(x)$ can be encoded much more efficiently than the original $x$.
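The following two operators are examples of $k$-sparsification operators, that is, $k$-contraction operators with the additional property that $\mathrm{comp}(x)$ is a sparse vector:

Definition 2.2. For a parameter $1 \le k \le d$, the operators $\mathrm{top}_k \colon \mathbb{R}^d \to \mathbb{R}^d$ and $\mathrm{rand}_k \colon \mathbb{R}^d \times \Omega_k \to \mathbb{R}^d$, where $\Omega_k = \binom{[d]}{k}$ denotes the set of all $k$-element subsets of $[d]$, are defined for $x \in \mathbb{R}^d$ as

$$(\mathrm{top}_k(x))_{\pi(i)} := \begin{cases} (x)_{\pi(i)}, & \text{if } i \le k, \\ 0, & \text{otherwise}, \end{cases} \qquad (\mathrm{rand}_k(x, \omega))_i := \begin{cases} (x)_i, & \text{if } i \in \omega, \\ 0, & \text{otherwise}, \end{cases} \tag{5}$$

where $\pi$ is a permutation of $[d]$ such that $(|x|)_{\pi(i)} \ge (|x|)_{\pi(i+1)}$ for $i = 1, \dots, d-1$. We abbreviate $\mathrm{rand}_k(x)$ whenever the second argument is chosen uniformly at random, $\omega \sim_{\text{u.a.r.}} \Omega_k$.

It is easy to see that both operators satisfy Definition 2.1 of being a $k$-contraction. For completeness the proof is included in Appendix A.1.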
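As an illustration, the two sparsification operators can be sketched in NumPy as follows. This is a minimal sketch, not the reference implementation: the function names `top_k` and `rand_k` and the test vector are chosen here for illustration, and the contraction bound is assumed to have the form $(1 - k/d)\|x\|^2$.

```python
import numpy as np

def top_k(x, k):
    """Keep the k entries of x with largest magnitude, zero out the rest."""
    out = np.zeros_like(x)
    if k > 0:
        idx = np.argpartition(np.abs(x), -k)[-k:]  # indices of the k largest |x_i|
        out[idx] = x[idx]
    return out

def rand_k(x, k, rng):
    """Keep k entries of x chosen uniformly at random, zero out the rest."""
    out = np.zeros_like(x)
    idx = rng.choice(x.size, size=k, replace=False)
    out[idx] = x[idx]
    return out

rng = np.random.default_rng(0)
d, k = 10, 3
x = rng.standard_normal(d)
bound = (1 - k / d) * np.dot(x, x)     # assumed right-hand side of the contraction property

# top_k is deterministic, rand_k is averaged over many random draws
err_top = np.sum((x - top_k(x, k)) ** 2)
err_rand = np.mean([np.sum((x - rand_k(x, k, rng)) ** 2) for _ in range(5000)])
```

In this sketch `err_top` never exceeds `bound`, while the Monte Carlo estimate `err_rand` fluctuates around it, since random-$k$ meets the bound with equality in expectation.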
We note that our setting is more general than simply measuring sparsity in terms of the cardinality, i.e. the non-zero elements of vectors in $\mathbb{R}^d$. Instead, Definition 2.1 can also be considered for quantization or e.g. floating point representation of each entry of the vector. In this setting we would for instance measure sparsity in terms of the number of bits that are needed to encode the vector.
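We would like to highlight that many other operators satisfy Definition 2.1, not only the two examples given in Definition 2.2. A notable variant is to pick a random coordinate of a vector with probability $\frac{k}{d}$, for $0 < k \le 1$; property (4) holds even if $k < 1$. That is, it suffices to transmit on average less than one coordinate per iteration (this would then correspond to a minibatch update).

Before introducing SGD with memory we first discuss a motivating example. Consider the following variant of SGD, where $(d - k)$ random coordinates of the stochastic gradient are dropped:

$$x_{t+1} := x_t - \eta_t g_t, \qquad g_t := \frac{d}{k} \cdot \mathrm{rand}_k(\nabla f_i(x_t)), \tag{6}$$

where $i \sim_{\text{u.a.r.}} [n]$. It is important to note that the update is unbiased, i.e. $\mathbb{E}\, g_t = \nabla f(x_t)$. For carefully chosen stepsizes $\eta_t$ this algorithm converges at rate $\mathcal{O}\bigl(\frac{\sigma^2}{\mu t}\bigr)$ on strongly convex and smooth functions $f$, where $\sigma^2$ is an upper bound on the variance, see for instance [45]. We have

$$\sigma^2 = \mathbb{E}\,\Bigl\|\frac{d}{k}\,\mathrm{rand}_k(\nabla f_i(x)) - \nabla f(x)\Bigr\|^2 \le \mathbb{E}\,\Bigl\|\frac{d}{k}\,\mathrm{rand}_k(\nabla f_i(x))\Bigr\|^2 = \frac{d}{k}\,\mathbb{E}_i \|\nabla f_i(x)\|^2 \le \frac{d}{k}\, G^2,$$

where we used the variance decomposition $\mathbb{E}\|X - \mathbb{E}X\|^2 = \mathbb{E}\|X\|^2 - \|\mathbb{E}X\|^2$ and the standard assumption $\mathbb{E}_i\|\nabla f_i(x)\|^2 \le G^2$. Hence, when $k$ is small this algorithm requires $\frac{d}{k}$ times more iterations to achieve the same error guarantee as vanilla SGD with $k = d$.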
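The variance blow-up can be checked exactly on a small example. The sketch below assumes that the estimator in (6) is the rescaled random sparsification $\frac{d}{k}\,\mathrm{rand}_k(g)$ and enumerates all $k$-subsets, so the expectations are exact rather than sampled. The helper `scaled_rand_k` and the example vector are illustrative choices, not taken from the paper.

```python
import itertools
import numpy as np

def scaled_rand_k(g, subset, k):
    """(d/k) * rand_k(g, subset): keep the coordinates in `subset`, rescale by d/k."""
    out = np.zeros_like(g)
    idx = list(subset)
    out[idx] = g[idx]
    return (g.size / k) * out

d, k = 6, 2
g = np.arange(1.0, d + 1.0)                      # arbitrary example gradient
subsets = list(itertools.combinations(range(d), k))

# Exact expectations: every k-subset is equally likely.
mean_est = np.mean([scaled_rand_k(g, s, k) for s in subsets], axis=0)
second_moment = np.mean([np.sum(scaled_rand_k(g, s, k) ** 2) for s in subsets])
```

Here `mean_est` recovers `g` (the estimator is unbiased), while `second_moment` equals $\frac{d}{k}\|g\|^2$, which is the factor that enters the variance bound.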
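It is well known that by using minibatches the variance of the gradient estimator can be reduced. If we consider in (6) the estimator $g_t := \frac{d}{k} \cdot \mathrm{rand}_k\bigl(\frac{1}{\tau}\sum_{i \in I_\tau} \nabla f_i(x_t)\bigr)$ for $\tau = \lceil \frac{d}{k} \rceil$, and $I_\tau \sim_{\text{u.a.r.}} \binom{[n]}{\tau}$ instead, we have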
(7) 
This shows that, when using minibatches of appropriate size, the sparsification of the gradient does not hurt convergence. However, by increasing the minibatch size, we increase the computation by a factor of $\frac{d}{k}$.
These two observations seem to indicate that the factor $\frac{d}{k}$ is inevitably lost, either by increased number of iterations or increased computation. However, this is no longer true when the information in (6) is not dropped, but kept in memory. To illustrate this, assume $k = 1$ and that index $i$ has not been selected by the $\mathrm{rand}_1$ operator in iterations $t = t_0, \dots, t_{s-1}$, but is selected in iteration $t_s$. Then the memory $m_{t_s} \in \mathbb{R}^d$ contains this past information $(m_{t_s})_i = \sum_{t = t_0}^{t_{s-1}} (\nabla f_{i_t}(x_t))_i$. Intuitively, we would expect that the variance of this estimator is now reduced by a factor of $s$ compared to the naïve estimator in (6), similar to the minibatch update in (7). Indeed, SGD with memory converges at the same rate as vanilla SGD, as we will demonstrate below.
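We consider the following algorithm for parameter $0 < k \le d$, using a compression operator $\mathrm{comp}_k \colon \mathbb{R}^d \to \mathbb{R}^d$ which is a $k$-contraction (Definition 2.1)

$$x_{t+1} := x_t - g_t, \qquad g_t := \mathrm{comp}_k\bigl(m_t + \eta_t \nabla f_{i_t}(x_t)\bigr), \qquad m_{t+1} := m_t + \eta_t \nabla f_{i_t}(x_t) - g_t, \tag{8}$$

where $i_t \sim_{\text{u.a.r.}} [n]$, $m_0 := 0$, and $\{\eta_t\}_{t \ge 0}$ denotes a sequence of stepsizes. The pseudocode is given in Algorithm 1. Note that the gradients get multiplied with the stepsize $\eta_t$ at the timestep $t$ when they are put into memory, and not when they are (partially) retrieved from the memory.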
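The update with memory can be sketched in a few lines of NumPy. This is a hedged illustration rather than the authors' implementation: it assumes the update applies a compressed version of "memory plus stepsize times stochastic gradient" and stores the remainder, uses top-$k$ as the compression operator, and runs on a toy least-squares problem with a constant stepsize. All names (`mem_sgd`, `grad`, `virtual`) and all numeric values are illustrative assumptions.

```python
import numpy as np

def top_k(x, k):
    """Keep the k largest-magnitude entries of x, zero out the rest."""
    out = np.zeros_like(x)
    idx = np.argpartition(np.abs(x), -k)[-k:]
    out[idx] = x[idx]
    return out

def mem_sgd(grad_fn, x0, n, stepsize, k, T, rng):
    """Sparsified SGD with memory. Returns (x_T, m_T, virtual_T)."""
    x = x0.astype(float).copy()
    m = np.zeros_like(x)          # memory of suppressed updates, m_0 = 0
    virtual = x.copy()            # plain SGD on the same gradients, for checking only
    for t in range(T):
        i = rng.integers(n)                     # sample a data point uniformly
        scaled = stepsize(t) * grad_fn(x, i)    # stepsize is applied before storing
        v = m + scaled
        g = top_k(v, k)                         # only k coordinates are applied
        x = x - g
        m = v - g                               # the suppressed part stays in memory
        virtual = virtual - scaled
    return x, m, virtual

# Toy least-squares problem: f_i(x) = 0.5 * (a_i^T x - b_i)^2
rng = np.random.default_rng(0)
n, d, k = 200, 20, 2
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d)
grad = lambda x, i: (A[i] @ x - b[i]) * A[i]

x_T, m_T, virtual_T = mem_sgd(grad, np.zeros(d), n, lambda t: 0.002, k, 1000, rng)
```

The extra `virtual` sequence is not needed to run the method. It is tracked here only to make visible that the iterate minus the memory coincides with an uncompressed SGD trajectory driven by the same gradients.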
We state the precise convergence result for Algorithm 1 in Theorem 2.4 below. In Remark 2.6 we give a simplified statement in big-$\mathcal{O}$ notation for a specific choice of the stepsizes $\eta_t$.
Let be smooth, be strongly convex, , for , where are generated according to (8) for stepsizes and shift parameter . Then for such that , with , it holds
(9) 
where , for , and .
Theorem 2.4 says that for any shift there is a parameter such that (9) holds. However, for the choice one has to set such that and the last term in (9) will be of order , thus requiring steps to yield convergence. For we have and the last term is only of order instead. However, this requires typically a large shift. Observe , i.e. setting is enough. We like to stress that in general it is not advisable to set as the first two terms in (9) depend on . In practice, it often suffices to set , as we will discuss in Section 4.
As discussed in Remark 2.5 above, setting and is feasible. With this choice, equation (9) simplifies to
(10) 
for . To estimate the second term in (9) we used the property for strongly convex , as derived in [27, Lemma 2]. We observe that for the first term is dominating, and Algorithm 1 converges at rate , the same rate as vanilla SGD [16].
We now give an outline of the proof. The proofs of the lemmas are given in Appendix A.2.
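One way to read the role of the memory, sketched here under the assumption that update (8) has the form $x_{t+1} = x_t - g_t$, $m_{t+1} = m_t + \eta_t \nabla f_{i_t}(x_t) - g_t$ with $m_0 = 0$, is through a virtual sequence $\tilde{x}_t := x_t - m_t$ (notation introduced here for illustration) in the spirit of the perturbed iterate framework [21]:

```latex
\begin{align*}
\tilde{x}_{t+1} &= x_{t+1} - m_{t+1} \\
                &= (x_t - g_t) - \bigl(m_t + \eta_t \nabla f_{i_t}(x_t) - g_t\bigr) \\
                &= \tilde{x}_t - \eta_t \nabla f_{i_t}(x_t), \qquad \tilde{x}_0 = x_0 .
\end{align*}
```

Under this assumption the virtual sequence performs plain SGD steps with gradients evaluated at $x_t$, and the gap $x_t - \tilde{x}_t = m_t$ is exactly the memory, so bounding the memory is the natural place for the contraction property (4) to enter.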
Similarly to [16, 33, 27] we have to define a suitable averaging scheme for the iterates to get the optimal convergence rate. In contrast to [16], which uses linearly increasing weights, we use quadratically increasing weights, as for instance in [33, 35].
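A weighted average with quadratically increasing weights can be maintained in a single pass without storing all iterates. The sketch below assumes weights of the form $w_t = (a + t)^2$ for a shift $a$, which is an assumption made here for illustration; the function name `weighted_average` and the example iterates are hypothetical.

```python
import numpy as np

def weighted_average(iterates, a):
    """Average x_0, ..., x_{T-1} with weights w_t = (a + t)^2 in one pass."""
    x_bar = np.zeros_like(iterates[0], dtype=float)
    weight_sum = 0.0
    for t, x in enumerate(iterates):
        w = (a + t) ** 2
        weight_sum += w
        x_bar += (w / weight_sum) * (x - x_bar)   # running weighted mean
    return x_bar

iterates = [np.array([1.0, 2.0]), np.array([3.0, -1.0]), np.array([0.5, 4.0])]
a = 2.0
x_bar = weighted_average(iterates, a)
```

The running update gives the same result as forming the weighted sum explicitly, and later iterates receive more weight than earlier ones.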
Let , , , , be sequences satisfying
(15) 
for and constants , , . Then
(16) 
for and .
The proof of the theorem immediately follows from the three lemmas that we have presented in this section and convexity of , i.e. we have in (16), for constants and .
We present numerical experiments to illustrate the excellent convergence properties and communication efficiency of MemSGD. As the usefulness of SGD with sparsification techniques has already been shown in practical applications [8, 1, 20, 36, 32] we focus here on a few particular aspects. First, we verify the impact of the initial learning rate that appears in the statement of Theorem 2.4. We then compare our method with QSGD [3], which decreases the communication cost in SGD by using random quantization operators, but without memory. Finally, we show the performance of the parallel SGD depicted in Algorithm 2 in a multicore setting with shared memory and compare the speedup to asynchronous Hogwild! [24].
The experiments focus on the performance of MemSGD applied to logistic regression. The associated objective function is $\frac{1}{n}\sum_{i=1}^{n} \log\bigl(1 + \exp(-b_i a_i^\top x)\bigr) + \frac{\lambda}{2}\|x\|^2$, where $a_i \in \mathbb{R}^d$ and $b_i \in \{-1, +1\}$ are the data samples, and we employ a standard $L_2$-regularizer. The regularization parameter is set to $\lambda = 1/n$ for both datasets following [31].

We consider a dense dataset, epsilon [34], as well as a sparse dataset, RCV1 [19], where we train on the larger test set. Statistics on the datasets are listed in Table 1 below:


We use Python3 and the numpy library [14]. Our code is open-source and publicly available at: github.com/epfml/sparsifiedSGD. We emphasize that our high level implementation is not optimized for speed per iteration but for readability and simplicity. We only report convergence per iteration and relative speedups, but not wall-clock time because unequal efforts have been made to speed up the different implementations. Plots additionally show the baseline computed with the standard optimizer LogisticSGD of scikit-learn [25]. Experiments were run on an Ubuntu 16.04 machine with a 24-core processor Intel® Xeon® CPU E5-2680 v3 @ 2.50GHz.
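For concreteness, the regularized logistic regression objective used in the experiments can be sketched with NumPy as below. The sketch assumes labels in $\{-1, +1\}$ and an $L_2$ penalty $\frac{\lambda}{2}\|x\|^2$; the function names, the synthetic data and the value of `lam` are illustrative assumptions and do not come from the released code.

```python
import numpy as np

def logistic_loss(x, A, b, lam):
    """Mean logistic loss with L2 regularization; labels b are in {-1, +1}."""
    margins = b * (A @ x)
    return np.mean(np.logaddexp(0.0, -margins)) + 0.5 * lam * np.dot(x, x)

def stochastic_grad(x, a_i, b_i, lam):
    """Gradient of the i-th summand (loss of one sample plus the regularizer)."""
    margin = b_i * np.dot(a_i, x)
    sigma = 0.5 * (1.0 - np.tanh(0.5 * margin))   # equals 1 / (1 + exp(margin)), overflow-safe
    return -b_i * sigma * a_i + lam * x

# Small synthetic problem (illustrative data only)
rng = np.random.default_rng(0)
n, d = 50, 5
A = rng.standard_normal((n, d))
b = rng.choice([-1.0, 1.0], size=n)
lam = 1.0 / n
x = rng.standard_normal(d)

# Averaging the stochastic gradients over all samples gives the full gradient
full_grad = np.mean([stochastic_grad(x, A[i], b[i], lam) for i in range(n)], axis=0)
```

A function like `stochastic_grad` is what a sparsified SGD loop would call once per iteration on a uniformly sampled index.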
We study the convergence of the method using the stepsizes and hyperparameters set as in Table 2. We compute the final estimate as a weighted average of all iterates with weights as indicated by Theorem 2.4. The results are depicted in Figure 2. We use different values of $k$ for epsilon and for RCV1 to increase the difference with large number of features. The top-$k$ variant consistently outperforms random-$k$ and sometimes outperforms vanilla SGD, which is surprising and might come from feature characteristics of the datasets. We also evaluate the impact of the delay in the learning rate: removing it, instead of choosing it of the order suggested by the theory, dramatically hurts the memory and requires time to recover from the high initial learning rate (labeled “without delay” in Figure 2).
We experimentally verified the convergence properties of MemSGD for different sparsification operators and stepsizes but we want to further evaluate its fundamental benefits in terms of sparsity enforcement and reduction of the communication bottleneck. The gain in communication cost of SGD with memory is very high for dense datasets—using the strategy on epsilon dataset improves the amount of communication by compared to SGD. For the sparse dataset, SGD can readily use the given sparsity of the gradients. Nevertheless, the improvement for on RCV1 is of approximately an order of magnitude.
Now we compare MemSGD with the QSGD compression scheme [3], which reduces communication cost by random quantization. The accuracy (and the compression ratio) in QSGD is controlled by a parameter corresponding to the number of quantization levels. Ideally, we would like to set the quantization precision in QSGD such that the number of bits transmitted by QSGD and MemSGD are identical and compare their convergence properties. However, even for the lowest precision, QSGD needs to send the sign and index of $\mathcal{O}(\sqrt{d})$ coordinates. It is therefore not possible to reach the compression level of sparsification operators such as top-$k$ or random-$k$, that only transmit a constant number of bits per iteration (up to logarithmic factors).⁵ Hence, we did not enforce this condition and resorted to picking reasonable levels of quantization in QSGD. Note that the reported number of bits is the number of bits used to encode the quantization levels, but the actual number of bits transmitted in QSGD can be reduced using Elias coding. As a fair comparison in practice, we chose a standard learning rate [6] and tuned the hyperparameter on a subset of each dataset (see Appendix B). Figure 3 shows that MemSGD with top-$k$ sparsification on epsilon and RCV1 converges as fast as QSGD in terms of iterations for 8 and 4 bits. As shown in the bottom of Figure 3, we are transmitting two orders of magnitude fewer bits with the top-$k$ sparsifier, concluding that sparsification offers a much more aggressive and performant strategy than quantization.

⁵ Encoding the indices of the top-$k$ or random-$k$ elements can be done with additional $\mathcal{O}(k \log d)$ bits.
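A back-of-the-envelope comparison of the communication cost per iteration can be written down directly. The sketch below assumes 32-bit values and $\lceil \log_2 d \rceil$ bits per transmitted index; these encodings, the function names and the example sizes are assumptions made for illustration and are not the accounting used for the figures.

```python
import math

def dense_bits(d, value_bits=32):
    """Bits to send a dense d-dimensional vector of fixed-width values."""
    return d * value_bits

def sparse_bits(k, d, value_bits=32):
    """Bits to send k (index, value) pairs; each index costs ceil(log2 d) bits."""
    index_bits = math.ceil(math.log2(d)) if d > 1 else 0
    return k * (value_bits + index_bits)

d, k = 2000, 1          # illustrative sizes, not taken from the paper's tables
ratio = dense_bits(d) / sparse_bits(k, d)
```

With these illustrative numbers a single sparse coordinate costs 43 bits against 64000 bits for the dense vector, a reduction of roughly three orders of magnitude.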
We implement a parallelized version of MemSGD, as depicted in Algorithm 2. The enforced sparsity allows us to do the update in shared memory using a lock-free mechanism as in [24]. For this experiment we evaluate the final iterate instead of the weighted average above, and we also investigate a constant learning rate, which turns out to work well in practice for epsilon. We use a constant learning rate for epsilon and reuse the parameters from Table 2 for RCV1.
Figure 4 shows the speedup obtained when increasing the number of cores. We see that the speedup is almost linear up to 10 cores. This is especially remarkable for the dense epsilon dataset, as the dense gradient updates would usually imply many update conflicts for classic SGD and Hogwild! [24] (i.e. hurting convergence because of gradients computed on stale iterates) or require the use of locking (i.e. hurting speed). We did not use atomic updates of the parameter in the shared memory, allowing some workers to overwrite the progress of others, which might contribute to the slowdown for a higher number of workers. The experiment is run on a single machine, hence no inter-node communication is used. The colored area depicts the best and worst results of 3 independent runs for each dataset.
Our method consistently scales better than vanilla parallel SGD with dense updates (this scheme can be seen as a naïve implementation of Hogwild!) on a non-sparse problem (i.e. logistic regression on a dense dataset). In this asynchronous setup, SGD with memory computes gradients on stale iterates that differ only by a few coordinates. It encounters fewer inconsistent read/write operations than lock-free asynchronous SGD and exhibits better scaling properties. The top-$k$ operator performs better than random-$k$ in the sequential setup, but this is not the case in the parallel setup. A reason for this could be that, due to the deterministic nature of the top-$k$ operator, the cores are more prone to update the same set of coordinates, and more collisions appear.
We provide the first concise convergence analysis of sparsified SGD [36, 32, 8, 1]. This extremely communication-efficient variant of SGD enforces sparsity of the applied updates by only updating a constant number of coordinates in every iteration. This way, the method overcomes the communication bottleneck of SGD, while still enjoying the same convergence rate in terms of stochastic gradient computations.
Our experiments verify the drastic reduction in communication cost by demonstrating that MemSGD requires one to two orders of magnitude fewer bits to be communicated than QSGD [3] while converging to the same accuracy. The experiments show an advantage for top-$k$ sparsification over random-$k$ sparsification in the serial setting, but not in the multicore shared memory implementation. There, both schemes are on par, and show much better scaling than a simple shared memory implementation that just writes the dense updates in lock-free asynchronous fashion (like Hogwild! [24]).
Despite the fact that our analysis does not yet comprise the parallel and distributed setting, we feel that those are the domains where sparsified SGD might have the largest impact. It has already been shown in practice that gradient sparsification can be efficiently applied to bandwidth memory limited systems such as multi-GPU training for neural networks [8, 1, 20, 36, 32]. By enforcing sparsity, the scheme is not only communication-efficient, it also becomes more eligible for asynchronous implementations, as limiting sparsity assumptions can be revoked.
We would like to thank Dan Alistarh for insightful discussions in the early stages of this project and Frederik Künstner for his useful comments on the various drafts of this manuscript.
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 440–445. Association for Computational Linguistics, 2017.

Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, New Orleans, Louisiana, USA, February 2–7, 2018. AAAI Press, 2018.

Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, and K. Q. Weinberger, editors, NIPS - Advances in Neural Information Processing Systems 24, pages 451–459. Curran Associates, Inc., 2011.

On-chip training of recurrent neural networks with limited numerical precision. 2017 International Joint Conference on Neural Networks (IJCNN), pages 3716–3723, 2009.

Lemma A.1. For $x \in \mathbb{R}^d$, $1 \le k \le d$, and operator $\mathrm{comp}_k \in \{\mathrm{top}_k, \mathrm{rand}_k\}$ it holds
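$$\mathbb{E}\,\|x - \mathrm{comp}_k(x)\|^2 \le \Bigl(1 - \frac{k}{d}\Bigr)\|x\|^2. \tag{17}$$

From the definition of the operators, for all $x$ in $\mathbb{R}^d$ and $\omega \in \Omega_k$ we have

$$\|x - \mathrm{top}_k(x)\|^2 \le \|x - \mathrm{rand}_k(x, \omega)\|^2, \tag{18}$$

and we apply the expectation

$$\mathbb{E}_\omega \|x - \mathrm{rand}_k(x, \omega)\|^2 = \frac{1}{|\Omega_k|} \sum_{\omega \in \Omega_k} \sum_{i=1}^{d} x_i^2\, \mathbb{1}\{i \notin \omega\} = \Bigl(1 - \frac{k}{d}\Bigr)\|x\|^2, \tag{19}$$

which concludes the proof. ∎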
Let , for . Then .
Observe
(20) 
where the inequality follows from
(21) 
∎
To upper bound the third term, we use the same estimates as in [17, Appendix C.3]: By strong convexity, for , hence
(24) 
and with we further have
(25) 
Putting these two estimates together, we can bound (23) as follows:
(26) 
where . We now estimate the last term. As each is smooth also is smooth, i.e. satisfies . Together with we have
(27)  
(28)  
(29) 
Combining with (26) we have
(30) 
and the claim follows with (12). ∎
Let , , , be a sequence satisfying
(35) 
for a sequence with , for , . Then
(36) 
for .
The claim holds for .
Let , i.e. . (Note that for any it holds .) Suppose the claim holds for . Observe,
(37) 
for . This follows from Lemma A.2 with . By induction,
(38)  
(39) 
where we used (and the observation just above) for the last inequality.
Assume , otherwise the claim follows from the part above. We have
(40) 
where we used
(41) 
for . For we have