Modern deep learning models [resnet] are almost invariably trained in parallel or distributed environments, which is necessitated by the enormous size of the data sets and by the dimension and complexity of the models required to obtain state-of-the-art performance. In our work, the focus is on the data-parallel paradigm, in which the training data is split across several workers capable of operating in parallel [bekkerman2011scaling; recht2011hogwild]. Formally, we consider optimization problems of the form
$$\min_{x \in \mathbb{R}^d} f(x) := \frac{1}{n} \sum_{i=1}^{n} f_i(x), \qquad (1)$$
where $x \in \mathbb{R}^d$ represents the parameters of the model, $n$ is the number of workers, and $f_i : \mathbb{R}^d \to \mathbb{R}$ is a loss function composed of data stored on worker $i$. Typically, $f_i$ is modelled as a function of the form $f_i(x) := \mathbb{E}_{\zeta \sim \mathcal{D}_i}[f_\zeta(x)]$, where $\mathcal{D}_i$ is the distribution of data stored on worker $i$, and $f_\zeta(x)$ is the loss of model $x$ on data point $\zeta$. The distributions $\mathcal{D}_1, \dots, \mathcal{D}_n$ can be different on every node, which means that the functions $f_1, \dots, f_n$ may have different minimizers. This framework covers i) stochastic optimization when either $n = 1$ or all $\mathcal{D}_i$ are identical, and ii) empirical risk minimization when $f_i(x)$ can be expressed as a finite average, i.e., $f_i(x) = \frac{1}{m_i} \sum_{j=1}^{m_i} f_{ij}(x)$ for some functions $f_{ij}$.
Distributed Learning. Typically, problem (1) is solved by distributed stochastic gradient descent (SGD) [SGD], which works as follows: i) given the model $x^k$ maintained on each node, machine $i$ computes a random vector $g_i(x^k)$ whose mean is $\nabla f_i(x^k)$ (i.e., a stochastic gradient), ii) all stochastic gradients $g_i(x^k)$ are sent to a master node (footnote 1), which performs update aggregation $g^k = \sum_{i} g_i(x^k)$, iii) the aggregated gradient $g^k$ is sent back to the workers, and finally iv) all workers perform a single step of SGD: $x^{k+1} = x^k - \frac{\eta^k}{n} g^k$, where $\eta^k > 0$ is a step size. This iterative process is repeated until a model of suitable properties is found.
Footnote 1: There are several alternatives to this, all logically identical, but differing in the implementation. For instance, one may dispense with the master node and instead let all workers broadcast their gradient updates directly to their peers in an all-to-all fashion. Aggregation is then performed by each node separately. In the theoretical part of the paper we work with an abstract method allowing for multiple specific implementations.
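Steps i) to iv) above can be sketched in a few lines of NumPy. This is a minimal single-process simulation and not the paper's implementation: the function name `distributed_sgd`, the list-of-callables interface for the workers' stochastic gradients, the constant step size, and the toy quadratic losses are all assumptions made for illustration.

```python
import numpy as np

def distributed_sgd(stochastic_grads, x0, step_size, num_iters):
    """Simulate distributed SGD with a master node in a single process.

    stochastic_grads: list of callables, one per worker (hypothetical
        interface); each maps the current model x to a stochastic gradient.
    """
    x = np.array(x0, dtype=float)
    n = len(stochastic_grads)
    for _ in range(num_iters):
        # i) every worker computes a stochastic gradient at the shared model
        local_grads = [g(x) for g in stochastic_grads]
        # ii) the master aggregates (sums) the workers' gradients
        aggregated = np.sum(local_grads, axis=0)
        # iii) + iv) the aggregate is broadcast and every worker takes the same step
        x = x - (step_size / n) * aggregated
    return x


# Toy usage: worker i holds f_i(x) = 0.5 * ||x - b_i||^2, so grad f_i(x) = x - b_i
# and the minimizer of the average loss is the mean of the b_i.
rng = np.random.default_rng(0)
targets = [np.array([1.0, 0.0]), np.array([3.0, 2.0]), np.array([5.0, 4.0])]
workers = [lambda x, b=b: x - b + 0.01 * rng.standard_normal(2) for b in targets]
x_final = distributed_sgd(workers, x0=np.zeros(2), step_size=0.1, num_iters=500)
```

In this simulation the "communication" is just a Python list; in a real deployment the `local_grads` and `aggregated` vectors are exactly the messages whose size the compression operators in this paper aim to reduce.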
A key bottleneck of the above algorithm, and of its many variants (e.g., variants utilizing mini-batching [goyal2017accurate], importance sampling [horvath2018nonconvex; katharopoulos2018not], momentum [nesterov2013introductory], or variance reduction [johnson2013accelerating; defazio2014saga]), is the cost of communication of the typically dense gradient vector, and in a parameter-server implementation with a master node, also the cost of broadcasting the aggregated gradient. These are $d$-dimensional vectors of floats, with $d$ being very large in modern deep learning (e.g. some deep learning applications communicate MB for each worker [alistarh2018convergence]). It is well-known [1bit; qsgd; zipml; deep] that in many practical applications with common computing architectures, communication takes much more time than computation, creating a bottleneck of the entire training system.
Several solutions were suggested in the literature as a remedy to this problem. In one strain of work, the issue is addressed by giving each worker “more work” to do, which results in a better communication-to-computation ratio. For example, one may use mini-batching to construct more powerful gradient estimators [Goyal2017:large], define local problems for each worker to be solved by a more advanced local solver [Shamir2014:approxnewton; Hydra; Reddi:2016aide], or reduce communication frequency (e.g., by communicating only once [Mann2009:parallelSGD; Zinkevich2010:parallelSGD] or once every few iterations [Stich2018:localsgd]). An orthogonal approach to the above efforts aims to reduce the size of the communicated vectors instead [1bit; qsgd; terngrad; tonko; hubara2017quantized; grishchenko2018asynchronous] using various lossy (and often randomized) compression mechanisms, commonly known in the literature as quantization techniques. In their most basic form, these schemes decrease the number of bits used to represent the floating point numbers forming the communicated $d$-dimensional vectors [Gupta:2015limited; FEDLEARN; Na2017:limitedprecision], thus reducing the size of the communicated message by a constant factor. Another possibility is to apply randomized sparsification masks to the gradients [Suresh2017; RDME; Alistarh2018:topk; stich2018sparse], or to rely on coordinate/block descent update rules, which are sparse by design [Hydra; Hydra2; 99percent].
One of the most important considerations in the area of compression operators is the compression-variance trade-off [RDME; qsgd; diana2]. For instance, while random dithering approaches attain up to $\mathcal{O}(\sqrt{d})$ compression [1bit; qsgd; terngrad], the most aggressive schemes reach $\mathcal{O}(d)$ compression by sending only a constant number of bits per iteration [Suresh2017; RDME; Alistarh2018:topk; stich2018sparse]. However, the more compression is applied, the more information is lost, and the more the quantized vector differs from the original vector we want to communicate, increasing its statistical variance. Higher variance implies slower convergence [qsgd; mishchenko2019distributed], i.e., more communication rounds. So, ultimately, compression approaches offer a trade-off between the communication cost per iteration and the number of communication rounds. Remarkably, for carefully constructed compression strategies and appropriately designed training algorithms, this trade-off, while generally favouring some level of compression in practice, can also be captured theoretically [99percent].
2 Summary of Contributions
New Compression Operators. We construct a new “natural compression” operator (see Sec 3, Def 1) based on a randomized rounding scheme in which each float of the compressed vector is rounded to a (positive or negative) power of 2. As a by-product, natural compression can get away with communicating the exponents and signs of the original floats only, which can be done with no computation effort beyond disposing of the mantissa and possibly incrementing the exponent. Importantly, natural compression enjoys a provably small variance of $1/8$ (see Thm 1), which means that theoretical convergence results of SGD-type methods are essentially unaffected. At the same time, substantial savings are obtained in the number of communicated bits per iteration for both float32 and float64 (see Sec 3.2). In addition, we utilize these insights and develop a new random dithering operator, natural dithering, which is exponentially better than standard random dithering (see Thm 5). Finally, our new compression techniques can be combined with existing compression and sparsification operators for a more dramatic combined effect (see Thm 3.3).
Computation-Free Simple Low-Level Implementation. As we show in Sec 3.2, apart from a randomization procedure (which is inherent in all unbiased compression operators), natural compression is computation-free. Natural compression essentially amounts to trimming the mantissa and possibly increasing the exponent by one. This is the first compression mechanism with such a “natural” compatibility with binary floating point types.
Proof-of-Concept System with In-Network Aggregation (INA). The recently proposed SwitchML switchML system alleviates the communication bottleneck via in-network aggregation (INA) of gradients. However, since current programmable network switches are only capable of adding (not even averaging) integers, new update compression methods are needed which can supply outputs in an integer format. Our natural compression mechanism is the first that is provably able to operate in the SwitchML framework as it communicates integers only: the sign, plus the bits forming the exponent of a float. Moreover, having bounded (and small) variance, it is compatible with existing distributed training methods.
Theory of general quantized SGD. We provide a convergence theory for a distributed SGD method (see Algorithm 1), allowing for compression both at the worker and master side. Moreover, the compression operators compatible with our theory form a large family (operators $\mathcal{C} \in \mathbb{B}(\omega)$ for some finite $\omega \ge 0$; see Definition 2). This enables safe experimentation with existing compression operators and facilitates the development of new ones fine-tuned to specific deep learning model architectures. Our convergence result (Thm 1) applies to smooth and non-convex functions, and our rates predict linear speed-up with respect to the number of machines.
Experiments. We observe superior behavior compared to the state-of-the-art.
3 Natural Compression
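We define a new (randomized) compression technique, which we call natural compression. This is fundamentally a function mapping $t \in \mathbb{R}$ to a random variable $\mathcal{C}_{\mathrm{nat}}(t) \in \mathbb{R}$. In case of vectors $x = (x_1, \dots, x_d) \in \mathbb{R}^d$ we apply it in an element-wise fashion: $(\mathcal{C}_{\mathrm{nat}}(x))_i = \mathcal{C}_{\mathrm{nat}}(x_i)$.
Natural compression performs a randomized rounding of its input $t$ to one of the two closest integer powers of 2. Given nonzero $t$, let $\alpha \in \mathbb{R}$ be such that $|t| = 2^{\alpha}$ (i.e., $\alpha = \log_2 |t|$). Then $2^{\lfloor \alpha \rfloor} \le |t| = 2^{\alpha} \le 2^{\lceil \alpha \rceil}$ and we round $t$ to either $\mathrm{sign}(t)\, 2^{\lfloor \alpha \rfloor}$, or to $\mathrm{sign}(t)\, 2^{\lceil \alpha \rceil}$. When $t = 0$, we set $\mathcal{C}_{\mathrm{nat}}(0) = 0$. The probabilities are chosen so that $\mathcal{C}_{\mathrm{nat}}(t)$ is an unbiased estimator of $t$, i.e., $\mathbb{E}[\mathcal{C}_{\mathrm{nat}}(t)] = t$ for all $t$.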
For instance, will be rounded to either or (since ), and will be rounded to either or (since ). As a consequence, if is an integer power of 2, then will leave unchanged. See Fig 1 for a graphical illustration.
Definition 1 (Natural compression).
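Natural compression is a random function $\mathcal{C}_{\mathrm{nat}} : \mathbb{R} \to \mathbb{R}$ defined as follows. We set $\mathcal{C}_{\mathrm{nat}}(0) = 0$. If $t \neq 0$, we let
$$\mathcal{C}_{\mathrm{nat}}(t) := \begin{cases} \mathrm{sign}(t) \cdot 2^{\lfloor \log_2 |t| \rfloor}, & \text{with probability } p(t), \\ \mathrm{sign}(t) \cdot 2^{\lceil \log_2 |t| \rceil}, & \text{with probability } 1 - p(t), \end{cases} \qquad (3)$$
where $p(t) := \frac{2^{\lfloor \log_2 |t| \rfloor + 1} - |t|}{2^{\lfloor \log_2 |t| \rfloor}}$.
Alternatively, (3) can be written as $\mathcal{C}_{\mathrm{nat}}(t) = \mathrm{sign}(t) \cdot 2^{\lfloor \log_2 |t| \rfloor} (1 + \lambda(t))$, where $\lambda(t) \sim \mathrm{Bernoulli}(1 - p(t))$; that is, $\lambda(t) = 1$ with prob. $1 - p(t)$ and $\lambda(t) = 0$ with prob. $p(t)$. The key properties of any (unbiased) compression operator are variance, ease of implementation, and compression level. We next characterize the remarkably low variance of $\mathcal{C}_{\mathrm{nat}}$ in Sec 3.1, and describe an (almost) effortless and natural implementation, together with the compression it offers, in Sec 3.2.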
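The randomized rounding in Definition 1 can be sketched in NumPy as follows. This is an illustrative sketch rather than the authors' implementation: the function name `natural_compression` and the test inputs (2.75 and 4/3) are assumptions chosen here, and `np.frexp` is used in place of an explicit `floor(log2(.))` only because it is exact for floating point inputs.

```python
import numpy as np

def natural_compression(t, rng):
    """Round each entry of t to a signed power of two, without bias."""
    t = np.atleast_1d(np.asarray(t, dtype=np.float64))
    out = np.zeros_like(t)
    nz = t != 0                       # zero is mapped to zero
    a = np.abs(t[nz])
    # frexp writes a = mant * 2**exp with mant in [0.5, 1), so
    # 2**(exp - 1) equals 2**floor(log2(a)) exactly.
    _, exp = np.frexp(a)
    low = np.ldexp(1.0, exp - 1)
    p_low = 2.0 - a / low             # probability of rounding down
    round_up = rng.random(a.shape) >= p_low
    out[nz] = np.sign(t[nz]) * np.where(round_up, 2.0 * low, low)
    return out


rng = np.random.default_rng(0)
samples = natural_compression(np.full(200_000, 2.75), rng)
empirical_mean = samples.mean()       # should be close to the input 2.75
# Relative second moment E[C(t)^2] / t^2 at t = 4/3 (an arbitrary probe point
# chosen here because the ratio is largest there).
probe = natural_compression(np.full(200_000, 4.0 / 3.0), rng)
relative_second_moment = np.mean(probe ** 2) / (4.0 / 3.0) ** 2
```

Inputs that are already powers of two have `p_low = 1` and are therefore returned unchanged, matching the remark before the definition.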
3.1 Natural compression has a negligible variance: $\omega = 1/8$
We identify natural compression as belonging to a large class of unbiased compression operators with bounded second moment jiang; khirirat2018distributed; diana2, defined below.
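Definition 2 (Compression operators).
A function $\mathcal{C} : \mathbb{R}^d \to \mathbb{R}^d$ mapping a deterministic input to a random vector is called a compression operator if there exists $\omega \ge 0$ such that $\mathbb{E}[\mathcal{C}(x)] = x$ and $\mathbb{E}\|\mathcal{C}(x)\|^2 \le (\omega + 1)\|x\|^2$ for all $x \in \mathbb{R}^d$. In this case we write $\mathcal{C} \in \mathbb{B}(\omega)$.
Note that $\omega = 0$ implies $\mathcal{C}(x) = x$ almost surely. It is easy to see (footnote 2) that the variance of $\mathcal{C} \in \mathbb{B}(\omega)$ is bounded as: $\mathbb{E}\|\mathcal{C}(x) - x\|^2 \le \omega \|x\|^2$. If this holds, we say that “$\mathcal{C}$ has variance $\omega$”. The importance of $\mathbb{B}(\omega)$ stems from two observations. First, operators from this class are known to be compatible with several optimization algorithms [khirirat2018distributed; diana2]. Second, this class includes many compression operators used in practice, including those of [qsgd; terngrad; tonko; mishchenko2019distributed]. In general, the larger $\omega$ is, the higher the achievable compression level, and the worse the impact of compression on the convergence speed.
Footnote 2: Using the identity $\mathbb{E}\|X - \mathbb{E}X\|^2 = \mathbb{E}\|X\|^2 - \|\mathbb{E}X\|^2$, which holds for any random vector $X$.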
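The main result of this section says that the natural compression operator $\mathcal{C}_{\mathrm{nat}}$ has variance $1/8$.
Theorem 1. $\mathcal{C}_{\mathrm{nat}} \in \mathbb{B}(1/8)$, where $\mathbb{B}(\omega)$ denotes the class of compression operators from Definition 2.
Consider now an unbiased randomized rounding operator similar to $\mathcal{C}_{\mathrm{nat}}$, but one that rounds to one of the two nearest integers (as opposed to integer powers of 2). We call it $\mathcal{C}_{\mathrm{int}}$. At first sight, this may seem like a reasonable alternative to $\mathcal{C}_{\mathrm{nat}}$. However, as we show next, the second moment of $\mathcal{C}_{\mathrm{int}}$ cannot be bounded by any finite multiple of $\|x\|^2$, and the operator is hence incompatible with existing optimization methods.
Theorem 2. There is no $\omega \ge 0$ such that $\mathcal{C}_{\mathrm{int}} \in \mathbb{B}(\omega)$.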
3.2 Natural compression: from 32 to 9 bits, with lightning speed
We now explain that performing natural compression of a real number in a binary floating point format is computationally cheap. In particular, excluding the randomization step, $\mathcal{C}_{\mathrm{nat}}$ amounts to simply dispensing with the mantissa in the binary representation. The most common computer formats for real numbers, binary32 (resp. binary64) of the IEEE 754 standard, represent each number with 32 (resp. 64) bits, where the first bit represents the sign, 8 (resp. 11) bits are used for the exponent, and the remaining 23 (resp. 52) bits are used for the mantissa. A scalar $t \in \mathbb{R}$ is represented in the form $(s, e_7, e_6, \dots, e_0, m_1, m_2, \dots, m_{23})$, where $s, e_i, m_j \in \{0, 1\}$ are bits, via the relationship
$$t = (-1)^s \times 2^{e - 127} \times (1 + m), \qquad e = \sum_{i=0}^{7} e_i 2^i, \qquad m = \sum_{j=1}^{23} m_j 2^{-j},$$
where $s$ is the sign, $e$ is the exponent and $m$ is the mantissa.
A binary representation of is visualized in Fig 2. In this case, , , and hence .
Observe that $\mathrm{sign}(t) \cdot 2^{\lfloor \log_2 |t| \rfloor}$ is obtained from $t$ by setting the mantissa $m$ to zero, and keeping both the sign $s$ and exponent $e$ unchanged. Similarly, $\mathrm{sign}(t) \cdot 2^{\lfloor \log_2 |t| \rfloor + 1}$ is obtained from $t$ by setting the mantissa $m$ to zero, keeping the sign $s$, and increasing the exponent $e$ by one, which amounts to incrementing the integer stored in the exponent bits. Hence, both values can be computed from $t$ essentially without any computation.
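The bit-level view above can be illustrated with NumPy by reinterpreting binary32 values as 32-bit unsigned integers. The sketch below is an assumption-laden illustration, not the paper's C++ code: the function name `natural_compression_bits` and the mask constants are introduced here, and inputs are assumed to be finite (no NaN or infinity, and no value so large that raising the exponent would overflow).

```python
import numpy as np

SIGN_EXP_MASK = np.uint32(0xFF800000)   # 1 sign bit + 8 exponent bits
MANTISSA_MASK = np.uint32(0x007FFFFF)   # 23 mantissa bits

def natural_compression_bits(x, rng):
    """Natural compression of finite float32 values using only bit operations."""
    x = np.ascontiguousarray(x, dtype=np.float32)
    bits = x.view(np.uint32)
    # The mantissa, read as a fraction m in [0, 1), is exactly the probability
    # of rounding up: |t| = 2**(e - 127) * (1 + m), so (|t| - low) / low = m.
    prob_up = (bits & MANTISSA_MASK).astype(np.float64) / 2.0 ** 23
    round_up = (rng.random(x.shape) < prob_up).astype(np.uint32)
    # Drop the mantissa, then add one to the exponent field when rounding up.
    out_bits = (bits & SIGN_EXP_MASK) + (round_up << np.uint32(23))
    return out_bits.view(np.float32)


rng = np.random.default_rng(1)
x = np.array([-2.75, 0.75, 0.0, 8.0], dtype=np.float32)   # arbitrary test inputs
y = natural_compression_bits(x, rng)
```

Every output has an all-zero mantissa, so only the sign bit and the 8 exponent bits need to be transmitted. The only non-trivial cost is drawing the random numbers, which is consistent with the overhead discussed in Appendix A.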
Communication savings. In summary, in the case of binary32, the output of natural compression is encoded using the 8 bits of the exponent and an extra bit for the sign, i.e., 9 bits instead of 32. This is $32/9 \approx 3.56\times$ less communication. In the case of binary64, we only need 11 bits for the exponent and 1 bit for the sign, i.e., 12 bits instead of 64, and this is $64/12 \approx 5.33\times$ less communication.
3.3 Compatibility with other compression techniques
We start with a simple but useful observation about composition of compression operators.
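Theorem 3. If $\mathcal{C}_1 \in \mathbb{B}(\omega_1)$ and $\mathcal{C}_2 \in \mathbb{B}(\omega_2)$, then $\mathcal{C}_1 \circ \mathcal{C}_2 \in \mathbb{B}(\omega_{12})$, where $\omega_{12} = \omega_1 \omega_2 + \omega_1 + \omega_2$, and $\mathcal{C}_1 \circ \mathcal{C}_2$ is the composition defined by $(\mathcal{C}_1 \circ \mathcal{C}_2)(x) = \mathcal{C}_1(\mathcal{C}_2(x))$.
Combining this result with Thm 1, we observe that for any $\mathcal{C} \in \mathbb{B}(\omega)$, we have $\mathcal{C}_{\mathrm{nat}} \circ \mathcal{C} \in \mathbb{B}\left(\frac{9\omega}{8} + \frac{1}{8}\right)$. Since $\mathcal{C}_{\mathrm{nat}}$ offers substantial communication savings with only a negligible effect on the variance of $\mathcal{C}$, a key use for natural compression beyond applying it as the sole compression strategy (e.g., for SwitchML [switchML]) is to deploy it with other effective techniques as a final compression mechanism (e.g., with the optimized sparsifiers [RDME; tonko], or with [Alistarh2018:topk; stich2018sparse]), boosting the performance of the system even further. Moreover, our technique is also useful as a post-compression mechanism for compressions that do not belong to $\mathbb{B}(\omega)$ (e.g., the TopK sparsifier [stich2018sparse; Alistarh2018:topk]). The same comments apply to the natural dithering operator, defined in the next section.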
4 Natural Dithering
Motivated by the natural compression introduced in Sec 3, here we propose a new random dithering operator which we call natural dithering. However, it will be useful to introduce a more general dithering operator, one generalizing both the natural and the standard dithering operators. For $1 \le p \le \infty$, let $\|x\|_p$ be the $p$-norm: $\|x\|_p := \left(\sum_i |x_i|^p\right)^{1/p}$.
Definition 3 (General dithering).
The general dithering operator with respect to the norm and with levels , denoted , is defined as follows. Let . If , we let . If , we let for all . Assuming for some , we let
where for some and is a random variable equal to with probability , and to with probability . Note that .
Standard (random) dithering, , Goodall1951:randdithering; Roberts1962:randdithering is obtained as a special case of general dithering (which is also novel) for a linear partition of the unit interval, , , …, and equal to the identity operator. Natural dithering—a novel compression operator introduced in this paper—arises as a special case of general dithering for and a binary geometric partition of the unit interval: , , …, .
A comparison of the operators for the standard and natural dithering with levels applied to can be found in Fig 3.
When natural dithering is used to compress gradients, each worker communicates the norm (1 float), the vector of signs ($d$ bits) and an efficient encoding of the effective levels for each entry $i = 1, \dots, d$. Note that natural dithering is essentially an application of natural compression to all normalized entries of $x$, with two differences: i) we also communicate the compressed norm, ii) in natural compression the interval between zero and the smallest nonzero level is subdivided further, to machine precision, and in this sense natural dithering can be seen as a limited precision variant of natural compression. As is the case with natural compression, the mantissa is ignored, and one communicates exponents only. The norm compression is particularly useful on the master side since multiplication by a naturally compressed norm is just summation of the exponents.
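A minimal sketch of natural dithering in NumPy is given below. Several details are assumptions made for illustration only: the function name `natural_dithering`, the level set $\{0, 2^{1-s}, \dots, 1/2, 1\}$ used as the binary geometric partition of the unit interval, and the choice to leave the norm uncompressed (as is done in the variance experiments of Appendix B.2) instead of applying natural compression to it. The input is assumed to be a 1-D vector.

```python
import numpy as np

def natural_dithering(x, s, p=2, rng=None):
    """Randomly round |x_i| / ||x||_p to a neighbouring level, without bias.

    Assumed levels: {0, 2**(1-s), ..., 1/2, 1}; the norm is sent uncompressed.
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x, dtype=np.float64)
    norm = np.linalg.norm(x, ord=p)
    if norm == 0:
        return np.zeros_like(x)
    y = np.abs(x) / norm                                    # normalized entries in [0, 1]
    levels = np.concatenate(([0.0], 2.0 ** np.arange(1 - s, 1)))
    # Index of the smallest level that is >= y (at least 1, so a lower level exists).
    hi_idx = np.clip(np.searchsorted(levels, y, side="left"), 1, len(levels) - 1)
    lo, hi = levels[hi_idx - 1], levels[hi_idx]
    prob_hi = (y - lo) / (hi - lo)                          # makes the rounding unbiased
    xi = np.where(rng.random(y.shape) < prob_hi, hi, lo)
    return norm * np.sign(x) * xi


rng = np.random.default_rng(0)
x = np.array([3.0, -4.0, 0.0])                              # arbitrary test vector, ||x||_2 = 5
outs = np.stack([natural_dithering(x, s=3, rng=rng) for _ in range(20_000)])
```

Under these assumptions each entry is described by its sign and one of a small number of level indices, plus a single float for the norm, which mirrors the communication pattern described above.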
The main result of this section establishes natural dithering as belonging to the class :
, where and .
To illustrate the strength of this result, we now compare natural dithering to standard dithering and show that natural dithering is exponentially better than the standard dithering. In particular, for the same level of variance, uses only levels while uses levels. Note also that the levels used by form a subset of the levels used by (see Fig 14).
Fixing , natural dithering has times smaller variance than standard dithering . Fixing , if , then implies that .
[Table 1. Overall worker-to-master speedup of the compression techniques. Columns: Approach | No. iterations | Bits per iter. | Speedup]
5 Distributed SGD with Bidirectional Compression
There are several stochastic gradient-based methods SGD; bubeck2015convex; ghadimi2013stochastic; mishchenko2019distributed for solving (1) that are compatible with compression operators , and hence also with our natural compression () and natural dithering () techniques. However, as none of them support compression at the master node, we propose a distributed SGD algorithm that allows for bidirectional compression (Algorithm 1).
We assume repeated access to unbiased stochastic gradients with bounded variance for every worker . We also assume node similarity represented by constant , and that is -smooth (gradient is -Lipschitz). Formal definitions as well as detailed explanation of Algorithm 1 can be found in Appendix D. We denote , and
where is the compression operator used by the master node, are the compression operators used by the workers and . The main theorem follows:
The above theorem has some interesting consequences. First, notice that (9) posits a convergence of the gradient norm to the value , which depends linearly on . In view of (8), the more compression we perform, the larger this value. More interestingly, assume now that the same compression operator is used at each worker: . Let and be the compression on master side. Then, is its iteration complexity. In the special case of equal data on all nodes, i.e., , we get and . If no compression is used, then and . So, the relative slowdown of Algorithm 1 used with compression compared to Algorithm 1 used without compression is given by
The upper bound is achieved for (or for any and ), and the lower bound is achieved in the limit as . So, the slowdown caused by compression on the worker side decreases with the number of workers. More importantly, the savings in communication due to compression can outweigh the iteration slowdown, which leads to an overall speedup. See Table 1 for the computation of the overall worker-to-master speedup achieved by our compression techniques (see also Appendix D.6 for similar comparisons under different cost/speed models). Note, however, that standard sparsification techniques do not necessarily improve the overall running time; they can make it worse. Our methods have the nice property of significantly raising the minimal speedup compared to their “non-natural” versions. The minimal speedup matters more in practice, as the number of nodes is usually not very large.
6 System Evaluation
To verify the theoretical properties of our approach in practice, we built a proof-of-concept system and provide evaluation results. In particular, we focus on illustrating convergence behavior, training throughput improvement, and transmitted data reduction.
Experimental setup. Our implementation builds upon the concept of In-Network Aggregation [switchML]. Appendix A describes the implementation details. We run the workers on 8 machines configured with 1 NVIDIA P100 GPU, dual CPU Intel Xeon E5-2630 v4 at 2.20GHz, and 128 GB of RAM. The machines run Ubuntu (Linux kernel 4.4.0-122) and CUDA 9.0. Following [switchML], we balance the workers with 8 aggregators running on machines configured with dual Intel Xeon Silver 4108 CPU at 1.80 GHz. Each machine uses a 10 GbE network interface and has CPU frequency scaling disabled. The chunks of compressed gradients sent by workers are uniformly distributed across all aggregators. This setup ensures that workers can fully utilize their network bandwidth and match the performance of a programmable switch. We leave the switch-based implementation for future work.
Our experiments execute the standard CNN benchmark [tfbench]. We summarize the hyperparameter settings in Appendix B.1.1. We further present results for two more variations of our implementation: one without compression (providing the baseline for In-Network Aggregation), and the other with deterministic rounding to the nearest power of 2.
Results. We first illustrate the convergence behavior by training ResNet110 and AlexNet models on CIFAR10 dataset. Fig 6 shows the test accuracy over time. We note that natural compression lowers training time by and , resp., compared to using no compression, while the accuracy matches the results in resnet with the same hyperparameters setting. Moreover, compression using deterministic rounding (not shown) instead of stochastic rounding does not further reduce training time. In addition, our approaches do not affect the convergence speed in terms of training loss as predicted by theory, even when we use fewer levels for w.r.t. ; see Appendix B.3.
Next, we report the speedup measured in average training throughput while training benchmark CNN models on the ImageNet dataset for one epoch. The throughput is calculated as the total number of images processed divided by the time elapsed. Fig 6 shows the speedup normalized by the training throughput of the baseline, that is, TensorFlow + Horovod using the NCCL communication library. We further break down the speedup by showing the relative speedup of In-Network Aggregation, which performs no compression but reduces the volume of data transferred (shown below). We also show the effects of deterministic rounding on throughput. Because deterministic rounding does not compute random numbers, it provides some additional speedup. However, it may affect convergence. These results represent the speedups that would be attainable if the overhead of randomization were low, for instance, when using a simple lookup of pre-computed randomness. We observe that the communication-intensive models (VGG, AlexNet) benefit more from quantization than the computation-intensive models (GoogleNet, Inception, ResNet). These observations are consistent with prior work [qsgd].
To quantify the data reduction benefits of natural compression, we measure the total volume of data transferred during training. Fig 6 shows that data transferred grows linearly over time, as expected. Natural compression saves 84% of data, which greatly reduces communication time.
Further details and additional experiments are presented in Appendix B.
Appendix A Implementation Details
We implement the natural compression operator within the Gloo communication library [gloo], as a drop-in replacement for the ring all-reduce routine. Our implementation is in C++. We integrate our communication library with Horovod and, in turn, with TensorFlow. We follow the same communication strategy introduced in SwitchML [switchML], which aggregates the deep learning model’s gradients using In-Network Aggregation on programmable network switches. We choose this strategy because natural compression is a good fit for the capabilities of this class of modern hardware, which only supports basic integer arithmetic, simple logical operations and limited storage.
A worker applies the natural compression operator to quantize gradient values and sends them to the aggregator component. As in SwitchML, an aggregator is capable of aggregating a fixed-length array of gradient values at a time. Thus, the worker sends a stream of network packets, each carrying a chunk of compressed values. For a given chunk, the aggregator awaits all values from every worker; then, it restores the compressed values as integers, aggregates them and applies compression to quantize the aggregated values. Finally, the aggregator multicasts back to the workers a packet of aggregated values.
For implementation expedience, we prototype the In-Network Aggregation as a server-based program implemented atop DPDK [dpdk] for fast I/O performance. We leave to future work a complete P4 implementation for programmable switches; however, we note that all operations (bit shifting, masking, and random bits generation) needed for our compression operator are available on programmable switches.
Difference in Implementation. We carefully optimize our implementation using modern x86 vector instructions (AVX512) to minimize the overhead of compression. To fit the byte length and access memory more efficiently, we compress 32-bit floating point numbers to an 8-bit representation, where 1 bit is for the sign and 7 bits are for the exponent. The aggregator uses 64-bit integers to store the intermediate results, and we choose to clip the exponents in the range of . As a result, we only use 6 bits for exponents. The remaining one bit is used to represent zeros. Note that it is possible to implement 128-bit integers using two 64-bit integers, but we found that, in practice, the exponent values never exceed the range of (Figure 7).
Despite the optimization effort, we identify non-negligible overheads in doing random number generation used in stochastic rounding, which was also reported in [hubara2017quantized]. We include the experimental results of our compression operator without stochastic rounding as a reference. There could be more efficient ways to deal with stochastic rounding, but we observe that doing deterministic rounding gives nearly the same training curve in practice.
Appendix B Extra Experiments
B.1 Convergence Tests on CIFAR10
In order to validate that natural compression does not incur any loss in performance, we trained various DNNs from the TensorFlow CNN Benchmark [tfbench] on the CIFAR10 dataset with and without natural compression for the same number of epochs, and compared the test set accuracy. As mentioned earlier, the baseline for comparison is the default NCCL setting. We did not tune the hyperparameters. In all of the experiments, we used Batch Normalization, but no Dropout.
B.1.1 AlexNet Hyperparameters:
For AlexNet, we chose the optimizer as SGD with momentum with a momentum of 0.9. We trained on three minibatch sizes: for 200 epochs. The learning rate was initially set to be , which was decreased by a factor of after every epoch.
B.1.2 ResNet Hyperparameters:
All the ResNets followed the training procedure as described in [resnet]. We used a weight decay of and the optimizer was chosen to be SGD with momentum, with a momentum of . The minibatch size was fixed to be 128 for ResNet 20, and 256 for all the others. We train for a total of 64K iterations. We start with an initial learning rate of , and multiply it by at and iterations.
B.1.3 DenseNet Hyperparameters:
We trained DenseNet40 and followed the same training procedure as described in the original paper [huang2017densely]. We used a weight decay of and the optimizer as SGD with momentum, with a momentum of . We trained for a total of 300 epochs. The initial learning rate was , which was decreased by a factor of 10 at 150 and 225 epoch.
B.2 Natural vs. Standard Dithering: Empirical Variance
In this section, we perform experiments to confirm that level selection brings not just theoretical but also practical performance speedup in comparison to . We measure the empirical variance of and . For , we do not compress the norm, so we can compare just variance introduced by level selection. Our experimental setup is the following. We first generate a random vector of size
, with independent entries drawn from a Gaussian distribution with zero mean and unit variance (we tried other distributions; the results were similar, so we report just this one), and then we measure the normalized empirical variance
We provide boxplots, each for 100 randomly generated vectors using the above procedure. We perform this for , and . We report our findings in Fig 9, Fig 10 and Fig 11. These experimental results support our theoretical findings.
B.2.1 Natural dithering has exponentially better variance
In Fig 9, we compare and for , i.e., we use the same number of levels for both compression strategies. In each of the three plots we generated vectors with a different norm. We find that natural dithering has dramatically smaller variance, as predicted by Theorem 5.
B.2.2 Natural dithering needs exponentially fewer levels to achieve the same variance
In Fig 10, we set the number of levels for to . That is, we give standard dithering an exponential advantage in terms of the number of levels (which also means that it will need more bits for communication). We now study the effect of this change on the variance. We observe that the empirical variance is essentially the same for both, as predicted by Theorem 5.
B.2.3 Standard dithering can outperform natural dithering in the regime of many levels
We now remark on the situation when the number of levels is chosen to be very large (see Fig 11). While this is not a practical setting as it does not provide sufficient compression, it will serve as an illustration of a fundamental theoretical difference between and in the limit which we want to highlight. Note that while converges to the identity operator as , which enjoys zero variance, converges to instead, with variance that can’t reduce below . Hence, for large enough , one would expect, based on our theory, the variance of to be around , while the variance of to be closer to zero. In particular, this means that can, in a practically meaningless regime, outperform . In Fig 11 we choose and (this is large). Note that, as expected, the empirical variance of both compression techniques is small, and that, indeed, outperforms .
B.2.4 Compressing gradients
We also performed experiments identical to those reported above, but with a different technique for generating the vectors. In particular, instead of a synthetic Gaussian generation, we used gradients generated by our optimization procedure as applied to the problem of training several deep learning models. Our results were essentially the same as the ones reported above, and hence we do not include them.
B.3 Different Compression Operators
We report additional experiments where we compare our compression operator to previously proposed ones. These results are based on a Python implementation of our methods running in PyTorch, as this enabled rapid direct comparisons against the prior methods. We compare against no compression, random sparsification, and random dithering methods. We compare on the MNIST and CIFAR10 datasets. For MNIST, we use a two-layer fully connected neural network with ReLU activation function. For CIFAR10, we use VGG11 with one fully connected layer as the classifier. We run these experiments with workers and batch size for MNIST and for CIFAR10. The results are averages over 3 runs.
We tune the step size for SGD for a given “non-natural” compression. Then we use the same step size for the “natural” method. Step sizes and parameters are listed alongside the results.
Figures 12 and 13 illustrate the results. Each row contains four plots that illustrate, left to right, (1) the test accuracy vs. the volume of data transmitted (measured in bits), (2) the test accuracy over training epochs, (3) the training loss vs. the volume of data transmitted, and (4) the training loss over training epochs.
One can see that in terms of epochs, we obtain almost the same result in terms of training loss and test accuracy, sometimes even better. On the other hand, our approach has a huge impact on the number of bits transmitted from workers to master, which is the main speedup factor together with the speedup in aggregation if we use In-Network Aggregation (INA). Moreover, with INA we compress updates also from master to nodes, hence we send also fewer bits. These factors together bring significant speedup improvements, as illustrated in Fig 6.
C.1 Proof of Theorem 1
Recall that can be written in the form
where the last step follows since . Hence,
where the last step follows since . This establishes unbiasedness (11).
In order to establish (12), it suffices to show that for all . Since by definition for all , it suffices to show that
The optimal solution of the last maximization problem is , with optimal objective value . This implies that (14) holds with .
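The value $\omega = 1/8$ can be checked directly from Definition 1. The derivation below is a sketch: the normalized variable $\theta$ is notation introduced here for illustration, and it may differ from the parametrization used in the original proof.

```latex
% Fix t \neq 0 and set \theta := |t| / 2^{\lfloor \log_2 |t| \rfloor} \in [1, 2).
% C_nat(t) has magnitude 2^{\lfloor \log_2 |t| \rfloor} with probability 2 - \theta,
% and twice that magnitude with probability \theta - 1.
\begin{align*}
\frac{\mathbb{E}\left[\mathcal{C}_{\mathrm{nat}}(t)^2\right]}{t^2}
  &= \frac{(2 - \theta)\cdot 1 + (\theta - 1)\cdot 4}{\theta^2}
   = \frac{3\theta - 2}{\theta^2}, \\
\frac{d}{d\theta}\,\frac{3\theta - 2}{\theta^2}
  &= \frac{4 - 3\theta}{\theta^3} = 0
   \iff \theta = \tfrac{4}{3}, \\
\max_{\theta \in [1, 2)} \frac{3\theta - 2}{\theta^2}
  &= \frac{3 \cdot \frac{4}{3} - 2}{\left(\frac{4}{3}\right)^2}
   = \frac{9}{8} = 1 + \frac{1}{8}.
\end{align*}
```

Since the ratio equals 1 at $\theta = 1$ and tends to 1 as $\theta \to 2$, the interior critical point is the maximum, so $\mathbb{E}[\mathcal{C}_{\mathrm{nat}}(t)^2] \le \frac{9}{8} t^2$, with equality exactly when $|t|$ is $4/3$ times an integer power of 2.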
C.2 Proof of Theorem 2
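Assume that there exists some $\omega \ge 0$ for which $\mathcal{C}_{\mathrm{int}} \in \mathbb{B}(\omega)$. Unbiased rounding to the nearest integers can be defined in the following way:
$$\mathcal{C}_{\mathrm{int}}(x_i) := \begin{cases} \lfloor x_i \rfloor, & \text{with probability } p(x_i), \\ \lceil x_i \rceil, & \text{with probability } 1 - p(x_i), \end{cases}$$
where $p(x_i) = \lceil x_i \rceil - x_i$. Take a one-dimensional example with $x \in (0, 1)$; then
$$\mathbb{E}\left[\mathcal{C}_{\mathrm{int}}(x)^2\right] = (1 - p(x)) \cdot 1 = x \le (\omega + 1) x^2,$$
which implies $\omega \ge 1/x - 1$. Thus, taking $x \to 0^+$, one obtains $\omega \to \infty$, which contradicts the existence of a finite $\omega$.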
C.3 Proof of Theorem 3
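The main building block of the proof is the tower property of mathematical expectation. The tower property says: if $X$ and $Y$ are random variables, then $\mathbb{E}[X] = \mathbb{E}[\mathbb{E}[X \mid Y]]$. Applying it to the composite compression operator $\mathcal{C}_1 \circ \mathcal{C}_2$, we get
$$\mathbb{E}[(\mathcal{C}_1 \circ \mathcal{C}_2)(x)] = \mathbb{E}\big[\mathbb{E}[\mathcal{C}_1(\mathcal{C}_2(x)) \mid \mathcal{C}_2(x)]\big] = \mathbb{E}[\mathcal{C}_2(x)] = x.$$
For the second moment, we have
$$\mathbb{E}\|(\mathcal{C}_1 \circ \mathcal{C}_2)(x)\|^2 = \mathbb{E}\big[\mathbb{E}[\|\mathcal{C}_1(\mathcal{C}_2(x))\|^2 \mid \mathcal{C}_2(x)]\big] \le (\omega_1 + 1)\, \mathbb{E}\|\mathcal{C}_2(x)\|^2 \le (\omega_1 + 1)(\omega_2 + 1) \|x\|^2,$$
which concludes the proof, since $(\omega_1 + 1)(\omega_2 + 1) = \omega_1 \omega_2 + \omega_1 + \omega_2 + 1$.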
C.4 Proof of Theorem 4
Unbiasedness of is a direct consequence of unbiasedness of .
For the second part, we first establish a bound on the second moment of :