Distributed Learning with Compressed Gradient Differences

Training very large machine learning models requires a distributed computing approach, with communication of the model updates often being the bottleneck. For this reason, several methods based on the compression (e.g., sparsification and/or quantization) of the updates were recently proposed, including QSGD (Alistarh et al., 2017), TernGrad (Wen et al., 2017), SignSGD (Bernstein et al., 2018), and DQGD (Khirirat et al., 2018). However, none of these methods are able to learn the gradients, which means that they necessarily suffer from several issues, such as the inability to converge to the true optimum in the batch mode, inability to work with a nonsmooth regularizer, and slow convergence rates. In this work we propose a new distributed learning method---DIANA---which resolves these issues via compression of gradient differences. We perform a theoretical analysis in the strongly convex and nonconvex settings and show that our rates are vastly superior to existing rates. Our analysis of block-quantization and differences between ℓ_2 and ℓ_∞ quantization closes the gaps in theory and practice. Finally, by applying our analysis technique to TernGrad, we establish the first convergence rate for this method.


1 Introduction

Big machine learning models are typically trained in a distributed fashion. In this paradigm, the training data is distributed across several workers (e.g., nodes of a cluster), all of which compute in parallel an update to the model based on their local data. For instance, they can all perform a single step of Gradient Descent (GD) or Stochastic Gradient Descent (SGD). These updates are then sent to a parameter server, which performs aggregation (typically just averaging of the updates) and then broadcasts the aggregated updates back to the workers. The process is repeated until a good solution is found.
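To make this paradigm concrete, below is a minimal, purely illustrative Python sketch of the compute/aggregate/broadcast loop on synthetic least-squares data. It is sequential (a real system would run the workers in parallel and communicate over the network), and all function and variable names are ours, not part of the paper's code.

```python
import numpy as np

def local_gradient(x, shard):
    """Least-squares gradient on one worker's local data shard (A_i, b_i)."""
    A, b = shard
    return A.T @ (A @ x - b) / len(b)

def parameter_server_gd(shards, x0, lr=0.1, iters=200):
    """Synchronous loop: workers compute local gradients, the server averages
    them (aggregation) and the updated model is 'broadcast' back."""
    x = x0.copy()
    for _ in range(iters):
        grads = [local_gradient(x, shard) for shard in shards]  # in parallel in practice
        x -= lr * np.mean(grads, axis=0)                        # server aggregation + update
    return x

rng = np.random.default_rng(0)
x_true = rng.normal(size=5)
shards = []
for _ in range(4):
    A = rng.normal(size=(50, 5))
    shards.append((A, A @ x_true + 0.01 * rng.normal(size=50)))
print(parameter_server_gd(shards, np.zeros(5)))  # approximately recovers x_true
```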

When doubling the amount of computational power, one usually expects the learning process to finish in half the time. If this is the case, the system is said to scale linearly. For various reasons, however, this does not always happen; the system might even become slower with more resources. At the same time, the surge of big data applications has increased the demand for distributed optimization methods, which often must satisfy new requirements such as the ability to find a sparse solution. It is therefore of great importance to design new methods that are versatile, efficient, and scale linearly with the amount of available resources.

In fact, applications vary a lot in their desiderata. There is rising interest in federated learning [Konečný et al., 2016], where the main concerns include the communication cost and the ability to use local data only, in an attempt to provide a certain level of privacy. In high-dimensional machine learning problems, a nonsmooth $\ell_1$-penalty is often utilized, so one wants support for proximable regularizers. The efficiency of deep learning, in contrast, depends on heavy-ball momentum and nonconvex convergence to a critical point, while sampling from the full dataset might not be an issue. In our work, we try to address all of these questions.

1.1 Communication as the bottleneck

The main aspects of distributed optimization efficiency are computational and communication complexity. In general, evaluating full gradients is intractable due to time and memory restrictions, so computation is made cheap by employing stochastic updates. On the other hand, in typical distributed computing architectures, communication is much slower than a stochastic update (see Figure 1 for our experiments with the communication cost of aggregating and broadcasting), so the design of a training algorithm needs to find a trade-off between the two. Several ways of dealing with this issue have been considered.

One of the early approaches is to have each worker perform a block coordinate descent step  [Richtárik and Takáč, 2016, Fercoq et al., 2014]. By choosing the size of the block, one directly chooses the amount of data that needs to be communicated. An alternative idea is for each worker to do more work between communication rounds (e.g., by employing a more powerful local solver, such as a second order method), so that computation roughly balances out with communication. The key methods in this sphere include CoCoA and its variants [Jaggi et al., 2014, Ma et al., 2015, 2017b, 2017a, Smith et al., 2018], DANE [Shamir et al., 2014], DiSCO [Zhang and Xiao, 2015, Ma and Takáč, 2015], DANCE [Jahani et al., 2018] and AIDE [Reddi et al., 2016].

1.2 Update compression via randomized sparsification and/or quantization

Practitioners have also suggested a number of heuristics to remedy the communication bottleneck. Of special interest to this paper is the idea of compressing SGD updates, proposed by Seide et al. [2014]. Building off of this work, Alistarh et al. [2017] designed a variant of SGD that guarantees convergence with compressed updates. Other works with an SGD update structure include [Konečný and Richtárik, 2016, Bernstein et al., 2018, Khirirat et al., 2018]. Despite proving a convergence rate, [Alistarh et al., 2017] left many new questions open and introduced an additional, unexplained heuristic of quantizing only vector blocks. Moreover, their analysis implicitly assumes that all data is available to each worker, which is hard and sometimes even impossible to satisfy.

In work concurrent with [Alistarh et al., 2017], the TernGrad method was analyzed [Wen et al., 2017] for stochastic updates that in expectation have positive correlation with the vector pointing to the solution. While giving more intuition about the convergence of quantized methods, this work used the $\ell_\infty$ norm for quantization, unlike the $\ell_2$-quantization of [Alistarh et al., 2017]. This raises another question, which we also aim to answer in this work: which norm is better suited for applications?

Figure 1: Typical communication cost of broadcast, reduce and gather for 64-bit and 32-bit floats, using 4 (solid) and 128 (dashed) MPI processes, respectively. See the supplementary material for details about the network.

1.3 The problem

In this paper we focus on the problem of training a machine learning model via regularized empirical risk minimization:

$$\min_{x \in \mathbb{R}^d} f(x) + R(x), \qquad f(x) := \frac{1}{n}\sum_{i=1}^n f_i(x). \qquad (1)$$

Above, $R: \mathbb{R}^d \to \mathbb{R} \cup \{+\infty\}$ is a closed convex regularizer, and $f_i(x)$ is the loss of model $x$ on data points belonging to distribution $\mathcal{D}_i$:

$$f_i(x) := \mathbb{E}_{\zeta \sim \mathcal{D}_i}\left[\ell(x; \zeta)\right].$$

Note that we do not assume any kind of similarity between the distributions $\mathcal{D}_1, \dots, \mathcal{D}_n$. On the contrary, we allow them to be arbitrary, meaning that the functions $f_1, \dots, f_n$ may have completely different minimizers. Furthermore, we will assume that every $f_i$ has its own smoothness parameter $L_i$, so utterly non-uniform distributions are permitted. In the special case $\mathcal{D}_1 = \dots = \mathcal{D}_n$, we recover the framework of a single distribution, consideration of which leads to a purely stochastic method.
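For illustration, here is a small Python sketch of the objective in (1) for logistic regression with an $\ell_1$ regularizer, split across workers whose local data deliberately comes from different distributions. The helper names and synthetic shards are our own and are not taken from the paper.

```python
import numpy as np

def local_loss(x, A_i, b_i):
    """f_i(x): average logistic loss of model x on worker i's local data,
    drawn from its own distribution D_i (labels b_i in {-1, +1})."""
    return np.mean(np.log1p(np.exp(-b_i * (A_i @ x))))

def objective(x, shards, lam=0.01):
    """f(x) + R(x): the average of local losses plus an l1 regularizer."""
    f = np.mean([local_loss(x, A_i, b_i) for A_i, b_i in shards])
    return f + lam * np.sum(np.abs(x))

rng = np.random.default_rng(0)
shards = [(rng.normal(loc=i, size=(30, 5)), rng.choice([-1.0, 1.0], size=30))
          for i in range(4)]  # deliberately different local distributions
print(objective(np.zeros(5), shards))
```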

1.4 Notation

By $\mathrm{sign}(t)$ we denote the sign of $t \in \mathbb{R}$ ($-1$ if $t < 0$, $0$ if $t = 0$ and $1$ if $t > 0$). The $i$-th element of a vector $x \in \mathbb{R}^d$ is denoted $x_i$. For $p \ge 1$ and $x \in \mathbb{R}^d$, the $\ell_p$ norm of $x$ is $\|x\|_p = \left(\sum_{i=1}^d |x_i|^p\right)^{1/p}$, with $\|x\|_\infty = \max_i |x_i|$. Note that $\|x\|_p \le \|x\|_q$ for all $p \ge q \ge 1$. By $\|x\|_0$ we denote the number of nonzero elements of $x$. For a detailed description of the notation, see Table 5 in the appendix.

2 Contributions

method | lin. rate | loc. data | non-smooth | momentum
DIANA (New!)
QSGD, Alistarh et al. [2017]
TernGrad, Wen et al. [2017]
DQGD, Khirirat et al. [2018]
QSVRG, Alistarh et al. [2017]
Table 1: Comparison of DIANA and related methods. Here "lin. rate" means that the authors of the corresponding paper prove linear convergence either to a ball around the optimum or to the optimum itself, "loc. data" indicates whether the authors assume that the data of node $i$ is available at that node only, "non-smooth" means support for a non-smooth regularizer, and "momentum" indicates whether the authors consider momentum in their algorithm.

DIANA. We develop a distributed gradient-type method with compression of gradient differences, which we call DIANA (Algorithm 1).

Rate in the strongly convex case. We show that when applied to a smooth, strongly convex minimization problem with an arbitrary closed convex regularizer, DIANA converges linearly to a ball with center at the optimum (see Section 4.2, Theorem 2 and Corollary 1 for the exact iteration complexity). In the case of decreasing stepsizes we establish a sublinear rate (see Section 4.6, Theorem 5 and Corollary 2 for the details). Unlike in [Khirirat et al., 2018], in the noiseless regime our method converges to the exact optimum, and at a linear rate.

Rate in the non-convex case. We prove that DIANA also works for smooth non-convex problems with an indicator-like regularizer, and we establish the corresponding iteration complexity (see Section 5, Theorem 4 and Corollary 3 for the details).

DIANA with momentum. We study a momentum version of DIANA for the case of a smooth non-convex objective with constant regularizer (see Section J, Theorem 7 and Corollary 7 for the details).

First rate for TernGrad. We provide the first convergence rate for TernGrad and give a new, tight analysis of 1-bit QSGD under less restrictive assumptions, covering both smooth strongly convex objectives with an arbitrary closed convex regularizer and non-convex objectives with an indicator-like regularizer (see Section 3.6 for a detailed comparison). Both of these methods are special cases of our Algorithm 2, which is in turn a special case of Algorithm 1 with $\alpha = 0$ and $h_i^0 = 0$ for all $i$. We show that Algorithm 2 converges to a ball with center at the optimum in the case of a smooth strongly convex minimization problem with an arbitrary closed convex regularizer (see Theorem 10), and we establish its iteration complexity in the case of a non-convex minimization problem with an indicator-like regularizer (see Theorem 8 and Corollary 8).

QSGD and TernGrad with momentum. We study the momentum version of DIANA with $\alpha = 0$ and, in particular, propose momentum versions of (1-bit) QSGD and TernGrad for the case of a smooth non-convex objective with constant regularizer (see Section K.3, Theorem 9 and Corollary 9 for the details).

Optimal norm power. We answer the following question: which $\ell_p$ norm should be used for quantization in order to get the best iteration complexity? All the bounds we propose depend on a quantization variance factor that is monotone in $p$ (see Lemma 2 for the details). For both Algorithm 1 and Algorithm 2 the iteration complexity improves as $p$ grows, so the best iteration complexity for our algorithms is achieved for $p = \infty$. This implies that TernGrad has a better iteration complexity than 1-bit QSGD.

First analysis of block-quantization. We give the first analysis of block-quantization (i.e., bucket-quantization). Alistarh et al. [2017] mentioned that better convergence can be obtained via block-quantization, but did not provide a rigorous analysis; we close this gap.

We summarize the key features of our complexity results in Table 1.

3 The Algorithm

0:  learning rates $\alpha > 0$ and $\gamma_k > 0$, initial vectors $x^0$ and $h_1^0, \dots, h_n^0$, quantization parameter $p \ge 1$, sizes of blocks $d_1, \dots, d_m$, momentum parameter $0 \le \beta < 1$
1:  $h^0 = \frac{1}{n}\sum_{i=1}^n h_i^0$, $v^0 = 0$
2:  for $k = 0, 1, 2, \dots$ do
3:     Broadcast $x^k$ to all workers
4:     for $i = 1, \dots, n$ in parallel do
5:        Sample $g_i^k$ such that $\mathbb{E}[g_i^k \mid x^k] = \nabla f_i(x^k)$
6:        $\Delta_i^k = g_i^k - h_i^k$
7:        Sample $\hat{\Delta}_i^k \sim \mathrm{Quant}_p(\Delta_i^k, \{d_l\})$
8:        $h_i^{k+1} = h_i^k + \alpha \hat{\Delta}_i^k$
9:        $\hat{g}_i^k = h_i^k + \hat{\Delta}_i^k$
10:     end for
11:     $\hat{\Delta}^k = \frac{1}{n}\sum_{i=1}^n \hat{\Delta}_i^k$
12:     $\hat{g}^k = \frac{1}{n}\sum_{i=1}^n \hat{g}_i^k = h^k + \hat{\Delta}^k$
13:     $v^{k+1} = \beta v^k + \hat{g}^k$
14:     $x^{k+1} = \mathrm{prox}_{\gamma_k R}(x^k - \gamma_k v^{k+1})$
15:     $h^{k+1} = h^k + \alpha \hat{\Delta}^k$
16:  end for
Algorithm 1 DIANA ($n$ nodes)

In this section we describe our main method—DIANA. However, we first need to introduce several key concepts and ingredients that come together to make the algorithm.

3.1 Stochastic gradients

In each iteration of DIANA, each node will sample an unbiased estimator of the local gradient. We assume that these gradients have bounded variance.

Assumption 1 (Stochastic gradients).

For every $i = 1, \dots, n$ and every iteration $k$, $\mathbb{E}[g_i^k \mid x^k] = \nabla f_i(x^k)$. Moreover, the variance is bounded:

$$\mathbb{E}\left[\|g_i^k - \nabla f_i(x^k)\|_2^2 \mid x^k\right] \le \sigma_i^2. \qquad (2)$$

Note that $g^k := \frac{1}{n}\sum_{i=1}^n g_i^k$ is an unbiased estimator of $\nabla f(x^k)$:

$$\mathbb{E}[g^k \mid x^k] = \nabla f(x^k). \qquad (3)$$

Moreover, by independence of the random vectors $g_1^k, \dots, g_n^k$, its variance is bounded above by

$$\mathbb{E}\left[\|g^k - \nabla f(x^k)\|_2^2 \mid x^k\right] \le \frac{\bar{\sigma}^2}{n}, \qquad (4)$$

where $\bar{\sigma}^2 := \frac{1}{n}\sum_{i=1}^n \sigma_i^2$.
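As a quick sanity check of (3)-(4) under the reconstruction above, the following Monte Carlo snippet verifies that averaging $n$ independent unbiased estimators with per-coordinate noise level $\sigma$ reduces the total squared error by a factor of $n$. It is purely illustrative; the noise model is our own.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, trials, sigma = 10, 8, 50_000, 2.0

true_grad = rng.normal(size=d)
# each of the n workers returns an unbiased, independently perturbed gradient
g = true_grad + sigma * rng.normal(size=(trials, n, d))
g_avg = g.mean(axis=1)                      # the server's averaged estimator g^k

emp = np.mean(np.sum((g_avg - true_grad) ** 2, axis=1))
print(emp, d * sigma**2 / n)                # empirical variance vs. sigma^2 * d / n
```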

3.2 Quantization

DIANA applies random compression (quantization) to gradient differences, which are then communicated to the parameter server. We now define the random quantization transformations used. Our first quantization operator transforms a vector $x \in \mathbb{R}^d$ into a random vector whose entries belong to the set $\{-t, 0, t\}$ for some $t \ge 0$.

Definition 1 ($p$-quantization).

Let $x \in \mathbb{R}^d$ and let $p \ge 1$. If $x = 0$, we define $\tilde{x} = \mathrm{Quant}_p(x) = 0$. If $x \ne 0$, we define $\tilde{x} = \mathrm{Quant}_p(x)$ by setting

$$\tilde{x}_i = \|x\|_p \, \mathrm{sign}(x_i)\, \xi_i, \qquad (5)$$

where the $\xi_i$ are Bernoulli random variables taking value $1$ with probability $|x_i|/\|x\|_p$ (observe that this quantity is always upper bounded by 1) and $0$ with probability $1 - |x_i|/\|x\|_p$. Note that

$$\tilde{x} = \|x\|_p \, \mathrm{sign}(x) \circ \xi, \qquad (6)$$

where $\mathrm{sign}$ is applied elementwise and $\circ$ denotes the Hadamard (i.e., elementwise) product. We say that $\tilde{x}$ is a $p$-quantization of $x$. When sampling $\tilde{x}$, we shall write $\tilde{x} \sim \mathrm{Quant}_p(x)$.
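The operator of Definition 1 (as reconstructed above: entries equal to $\pm\|x\|_p$ or $0$, kept with probability $|x_i|/\|x\|_p$) is only a few lines of numpy. The function name and interface below are ours, a sketch rather than the paper's released code.

```python
import numpy as np

def quant_p(x, p=np.inf, rng=None):
    """Random p-quantization of x (Definition 1): ||x||_p * sign(x) * xi,
    where xi_i ~ Bernoulli(|x_i| / ||x||_p).  Unbiased: E[quant_p(x)] = x."""
    rng = np.random.default_rng() if rng is None else rng
    norm = np.linalg.norm(x, ord=p)
    if norm == 0.0:
        return np.zeros_like(x)
    xi = rng.random(x.shape) < np.abs(x) / norm   # keep coordinate i w.p. |x_i|/||x||_p
    return norm * np.sign(x) * xi
```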

In addition, we consider a block variant of $p$-quantization operators.

Definition 2 (block-$p$-quantization).

Let $x = (x_{(1)}, \dots, x_{(m)}) \in \mathbb{R}^d$, where $x_{(l)} \in \mathbb{R}^{d_l}$, $d_l \ge 1$ and $\sum_{l=1}^m d_l = d$. We say that $\tilde{x}$ is a $p$-quantization of $x$ with sizes of blocks $d_1, \dots, d_m$, and write $\tilde{x} \sim \mathrm{Quant}_p(x, \{d_l\}_{l=1}^m)$, if $\tilde{x}_{(l)} \sim \mathrm{Quant}_p(x_{(l)})$ for all $l = 1, \dots, m$.

In other words, we quantize the subvectors of the initial vector, called blocks, separately. Note that in the special case $m = 1$ we recover full quantization: $\mathrm{Quant}_p(x, \{d\}) = \mathrm{Quant}_p(x)$. Note that we do not assume independence of the quantization across blocks, nor independence of the random variables $\xi_i$.
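Block-$p$-quantization simply applies the operator per block. A short illustrative helper, reusing quant_p from the previous sketch (again, our own naming, not the paper's code):

```python
import numpy as np

def quant_p_blocks(x, block_sizes, p=np.inf, rng=None):
    """Block p-quantization (Definition 2): split x into consecutive blocks of
    the given sizes and p-quantize each block separately."""
    assert sum(block_sizes) == len(x)
    rng = np.random.default_rng() if rng is None else rng
    pieces, start = [], 0
    for d_l in block_sizes:
        pieces.append(quant_p(x[start:start + d_l], p=p, rng=rng))
        start += d_l
    return np.concatenate(pieces)
```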

The next result states that $\tilde{x}$ is an unbiased estimator of $x$, and gives a formula for its variance.

Lemma 1.

Let $x \in \mathbb{R}^d$ and $\tilde{x} \sim \mathrm{Quant}_p(x, \{d_l\}_{l=1}^m)$. Then

$$\mathbb{E}[\tilde{x}] = x, \qquad (7)$$

$$\mathbb{E}\left[\|\tilde{x} - x\|_2^2\right] = \sum_{l=1}^m \left( \|x_{(l)}\|_p \|x_{(l)}\|_1 - \|x_{(l)}\|_2^2 \right). \qquad (8)$$

Thus, $\tilde{x}$ is an unbiased estimator of $x$. Moreover, since $\|\cdot\|_p$ is non-increasing in $p$, the variance of $\tilde{x}$ is a decreasing function of $p$, and is minimized for $p = \infty$.

3.3 Communication cost

If a fixed number of bits is used to encode a nonzero float, then the number of bits needed to communicate $\mathrm{Quant}_p(x)$ with Elias coding is governed by the number of its nonzero entries (see Theorem 3.3 in [Alistarh et al., 2017]). In our next result, we give an upper bound on the expected communication cost.

Theorem 1 (Expected sparsity).

Let $x = (x_{(1)}, \dots, x_{(m)}) \in \mathbb{R}^d$ and let $\tilde{x} \sim \mathrm{Quant}_p(x, \{d_l\}_{l=1}^m)$ be its $p$-quantization. Then

$$\mathbb{E}\|\tilde{x}\|_0 = \sum_{l=1}^m \frac{\|x_{(l)}\|_1}{\|x_{(l)}\|_p} \qquad (11)$$

(with the convention $0/0 = 0$), and

$$\mathbb{E}\|\tilde{x}\|_0 \le \sum_{l=1}^m d_l^{1 - 1/p}. \qquad (12)$$

All expressions in (11) and (12) are increasing functions of $p$.
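A small empirical check of the expected-sparsity formula as reconstructed in (11) for a single block, reusing quant_p from Section 3.2; the numbers are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=1000)
p = 2

nnz = [np.count_nonzero(quant_p(x, p=p, rng=rng)) for _ in range(5000)]
print(np.mean(nnz))                                   # empirical E ||x~||_0
print(np.linalg.norm(x, 1) / np.linalg.norm(x, p))    # ||x||_1 / ||x||_p  (one block)
print(len(x) ** (1 - 1 / p))                          # dimension-based upper bound
```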

3.4 Proximal step

Given $\gamma > 0$, the proximal operator of the regularizer $R$ is defined as

$$\mathrm{prox}_{\gamma R}(x) = \arg\min_{u \in \mathbb{R}^d} \left\{ \gamma R(u) + \tfrac{1}{2}\|u - x\|_2^2 \right\}.$$

The proximal operator of a closed convex function is nonexpansive. That is, for any $x, y \in \mathbb{R}^d$,

$$\|\mathrm{prox}_{\gamma R}(x) - \mathrm{prox}_{\gamma R}(y)\|_2 \le \|x - y\|_2. \qquad (13)$$
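For example, for $R(u) = \lambda\|u\|_1$ the proximal operator has the familiar closed-form soft-thresholding solution. A minimal sketch (our own helper, not from the paper):

```python
import numpy as np

def prox_l1(x, gamma, lam):
    """prox_{gamma R}(x) for R(u) = lam * ||u||_1:
    argmin_u { gamma*lam*||u||_1 + 0.5*||u - x||^2 } = soft-thresholding of x."""
    return np.sign(x) * np.maximum(np.abs(x) - gamma * lam, 0.0)

print(prox_l1(np.array([-2.0, -0.05, 0.3, 1.5]), gamma=1.0, lam=0.1))
```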

3.5 Diana

In DIANA, each machine first computes a stochastic gradient $g_i^k$ at the current iterate $x^k$. We do not quantize this information and send it off to the parameter server, as that approach would not converge to the exact optimum in general (for instance, in the presence of a nonsmooth regularizer $R$). Instead, each node maintains a memory vector $h_i^k$ (initialized to arbitrary values) and quantizes the difference $\Delta_i^k = g_i^k - h_i^k$ instead. Both the node and the parameter server update $h_i^k$ in an appropriate manner, and a proximal gradient descent step is taken with respect to the momentum direction $v$, updated as $v \leftarrow \beta v + \hat{g}^k$, where $\beta$ is a momentum parameter and $\hat{g}^k$ is an unbiased estimator of the full gradient, assembled from the memory and the transmitted quantized vectors. Note that we allow for block quantization for additional flexibility. In practice, we want the transmitted quantized vectors to be much easier to communicate than the full dimensional vector in $\mathbb{R}^d$, which can be tuned by the choice of $p$ defining the quantization norm and by the choice of blocks.
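Putting these pieces together, below is a sequential, single-process sketch of one DIANA iteration as described above, reusing quant_p and a prox callable. It reflects our reading of Algorithm 1, not a drop-in implementation; in a real deployment the inner loop runs on the workers and only the quantized differences cross the network.

```python
import numpy as np

def diana_step(x, h, h_avg, v, stoch_grads, alpha, gamma, beta, prox,
               p=np.inf, rng=None):
    """One DIANA iteration (sequential sketch).
    x: current iterate; h: list of per-worker memory vectors h_i;
    h_avg: server copy of (1/n) sum_i h_i; v: momentum buffer;
    stoch_grads: sampled local stochastic gradients g_i at x."""
    n = len(h)
    deltas_hat = []
    for i in range(n):
        delta = stoch_grads[i] - h[i]                 # gradient difference
        delta_hat = quant_p(delta, p=p, rng=rng)      # only this is communicated
        h[i] = h[i] + alpha * delta_hat               # worker-side memory update
        deltas_hat.append(delta_hat)
    delta_avg = np.mean(deltas_hat, axis=0)
    g_hat = h_avg + delta_avg                         # unbiased full-gradient estimate
    h_avg = h_avg + alpha * delta_avg                 # server-side memory update (matches workers)
    v = beta * v + g_hat                              # heavy-ball momentum
    x = prox(x - gamma * v)                           # proximal gradient step
    return x, h, h_avg, v
```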

3.6 Relation to QSGD and TernGrad

If the initialization is done with $h_i^0 = 0$ and $\alpha = 0$, our method reduces to 1-bit QSGD and TernGrad with $p = 2$ and $p = \infty$, respectively. We unify them in Algorithm 2. We analyze this algorithm (i.e., DIANA with $\alpha = 0$ and $h_i^0 = 0$) in three cases: 1) smooth strongly convex objective with an arbitrary closed convex regularizer; 2) smooth non-convex objective with constant regularizer; 3) smooth non-convex objective with constant regularizer for the momentum version of the algorithm. We note that the original paper Wen et al. [2017] does not provide a convergence rate for TernGrad, while we obtain convergence rates for the three aforementioned situations as special cases of our results. Moreover, we emphasize that our analysis is new even for 1-bit QSGD, since the original paper Alistarh et al. [2017] considers only the very restrictive case of bounded stochastic gradients and does not provide a rigorous analysis of block-quantization as we do. In contrast, we consider the more general case of block-quantization and assume only that the variance of the stochastic gradients is bounded, which is less restrictive, since a bound on the second moment implies a bound on the variance.

We obtain convergence rates for arbitrary $p \ge 1$ in the three aforementioned cases (see Theorems 8, 9, 10, 11 and Corollaries 8, 9, 10 for the details), and all obtained bounds improve as $p$ grows. This means that TernGrad has a better iteration complexity than QSGD and, more generally, that the best iteration complexity is attained for $\ell_\infty$ norm quantization.
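As a usage illustration of the sketch from Section 3.5 (assuming quant_p and diana_step defined there): setting alpha = 0 and h_i^0 = 0 freezes the memory at zero, so the server simply averages the quantized stochastic gradients, i.e., Algorithm 2. Choosing p = 2 then mimics a (1-bit) QSGD-style step and p = inf a TernGrad-style step.

```python
import numpy as np

rng = np.random.default_rng(3)
d, n = 20, 4
x = np.zeros(d)
grads = [rng.normal(size=d) for _ in range(n)]   # stand-ins for sampled g_i^k

x, h, h_avg, v = diana_step(
    x, h=[np.zeros(d) for _ in range(n)], h_avg=np.zeros(d), v=np.zeros(d),
    stoch_grads=grads, alpha=0.0, gamma=0.1, beta=0.0,
    prox=lambda z: z,                 # no regularizer
    p=np.inf, rng=rng)                # p=2 -> QSGD-like, p=np.inf -> TernGrad-like
print(x)
```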

4 Theory: Strongly Convex Case

4.1 Assumptions

Let us introduce two key assumptions of this section.

Assumption 2 ($L$-smoothness).

We say that a function $f$ is $L$-smooth if

$$\|\nabla f(x) - \nabla f(y)\|_2 \le L\|x - y\|_2 \qquad (14)$$

for any $x, y \in \mathbb{R}^d$.

Assumption 3 ($\mu$-strong convexity).

A function $f$ is called $\mu$-strongly convex if for all $x, y \in \mathbb{R}^d$

$$f(x) \ge f(y) + \langle \nabla f(y), x - y \rangle + \frac{\mu}{2}\|x - y\|_2^2. \qquad (15)$$

4.2 Iteration complexity

For $p \ge 1$, define the quantization variance factor appearing in our bounds via

(16)

Lemma 2.

The factor defined in (16) is increasing in the dimension of the quantized block and decreasing in $p$; in particular, it is minimized for $p = \infty$, and the analogous comparisons hold for all positive integer dimensions and all $p \ge 1$.

Theorem 2.

Assume the functions $f_1, \dots, f_n$ are $L$-smooth and $\mu$-strongly convex. Choose stepsizes $\alpha > 0$ and $\gamma > 0$, block sizes $d_1, \dots, d_m$ with $\sum_{l=1}^m d_l = d$, and a parameter satisfying the following relations:

(17)
(18)

Define the Lyapunov function

(19)

where $x^*$ is the solution of (1) and $h_i^* = \nabla f_i(x^*)$. Then for all $k \ge 0$,

(20)

This implies that as long as , we have

In particular, if we set to be equal to the minimum in (18), then the leading term in the iteration complexity bound is

(21)
Corollary 1.

Let , , , and . Then the conditions (17) and (18) are satisfied, and the leading term in the iteration complexity bound is equal to

(22)

This is a decreasing function of $p$. Hence, from the iteration complexity perspective, $p = \infty$ is the optimal choice.

In Table 2 we calculate the leading term (43) in the complexity of DIANA for $p \in \{1, 2, \infty\}$, each for two condition number regimes: standard and large.

Table 2: The leading term of the iteration complexity of DIANA in the strongly convex case for several values of $p$, according to Theorem 2 (see Corollary 1 and Lemma 2). Logarithmic dependence on the target accuracy is suppressed. Condition number: $\kappa = L/\mu$.

4.3 Matching the rate of gradient descent for quadratic size models

Note that the extra factor due to quantization stays bounded as long as the model size is not too big; in particular, when $d = O(n^2)$, the linear rate of DIANA with $p = \infty$ is $O(\kappa \log(1/\varepsilon))$, which matches the rate of gradient descent.

4.4 Optimal number of nodes

In practice one has access to a finite data set consisting of $N$ data points, where $N$ is very large, and one wishes to solve an empirical risk minimization ("finite-sum") problem of the form

(23)

where each $f_j$ is $L$-smooth and $\mu$-strongly convex. If $n$ compute nodes of a distributed system are available, one may partition the $N$ functions into $n$ groups $G_1, \dots, G_n$, each of size $N/n$, and define $f_i(x) := \frac{n}{N}\sum_{j \in G_i} f_j(x)$. Note that $f(x) = \frac{1}{n}\sum_{i=1}^n f_i(x)$, and that each $f_i$ is also $L$-smooth and $\mu$-strongly convex.

This way, we have fit the original (and large) problem (23) into our framework. One may now ask: how many nodes should we use (other things being equal)? If what we care about is iteration complexity, then insight can be gained by investigating Eq. (43). The optimal choice is to pick $n$ so that the term depending on $n$ becomes (roughly) equal to the term that does not. This gives a formula for the optimal number of nodes and the resulting iteration complexity. Note that the optimal number of nodes is increasing in $d$; hence, it makes sense to use more nodes for larger models (big $d$).

4.5 Optimal block quantization

If the dimension $d$ of the problem is large, it becomes reasonable to quantize subvectors of the parameter vector, also called blocks. For example, if a vector consists of two blocks, each of which is proportional to the vector of all ones, we can transmit just the two scaling factors without any loss of information. In the real world we have a similar situation whenever different parts of the parameter vector have different scales. A straightforward example is deep neural networks, whose layers have pairwise different scales. If we quantized the whole vector at once, we would zero out most of the update for the layer with the smallest scale.

Moreover, our theory says that with $n$ workers the iteration complexity increase due to quantization is about $\sqrt{d}/n$. However, if quantization is applied to blocks of size about $n^2$, this factor becomes 1, implying that the complexity remains the same. Therefore, if one uses about 100 workers and splits the parameter vector into blocks of size about 10,000, the algorithm works as fast as SGD, while communicating bits instead of floats!

4.6 Decreasing stepsizes

We now provide a convergence result for DIANA with decreasing stepsizes, obtaining an $O(1/k)$ rate.

Theorem 3.

Assume that $f$ is $L$-smooth and $\mu$-strongly convex, and that we have access to its gradients with bounded noise. Choose appropriately decreasing stepsizes $\gamma_k$, with constants satisfying the required relations. After $k$ iterations of DIANA we have

where $\sigma$ denotes the standard deviation of the gradient noise (cf. Assumption 1) and the remaining quantities are as defined above.

Corollary 2.

If we choose , , , then there are three regimes:

  1. if , then and to achieve we need at most iterations;

  2. if , then and to achieve we need at most iterations;

  3. if , then and to achieve we need at most iterations.

5 Theory: Nonconvex Case

In this section we consider the non-convex case.

Theorem 4.

Assume that there exists a closed convex set with respect to which $R$ is indicator-like (e.g., $R$ is the indicator function of this set). Also assume that each $f_i$ is $L$-smooth, that the stepsizes and the parameter satisfy the required relations, and that the output is chosen uniformly at random from the produced iterates. If, further, every worker samples from the full dataset, then

Corollary 3.

Set , , , and run the algorithm for iterations. Then, the final accuracy is at most .

Moreover, if the first term in Corollary 3 is the leading one and the quantization factor is $O(1)$, the resulting complexity is the same as that of SGD. For instance, if sufficiently large mini-batches are used, the former condition holds, while for the latter it is enough to quantize vectors in blocks of size $O(n^2)$.

6 Convergence Rate of TernGrad

Here we give convergence guarantees for TernGrad and provide upper bounds for this method. The method coincides with Algorithm 2 for the case $p = \infty$. In the original paper [Wen et al., 2017] no convergence rate was given, and we close this gap.

To maintain consistent notation, we rewrite TernGrad in notation close to the one used for DIANA. In our notation it is easy to see that TernGrad is DIANA with $\alpha = 0$ and $h_i^0 = 0$. Firstly, this means that $h_i^k = 0$ for all $i$ and $k$. Secondly, this observation tells us that Lemma 3 holds for the iterates of TernGrad too. Moreover, in the original paper [Wen et al., 2017] the quantization parameter was fixed to $p = \infty$; we generalize the method and do not restrict our analysis to this choice only.

As in the analysis of DIANA, our proofs for TernGrad work under Assumption 1.

0:  learning rates $\gamma_k > 0$, initial vector $x^0$, quantization parameter $p \ge 1$, sizes of blocks $d_1, \dots, d_m$, momentum parameter $0 \le \beta < 1$
1:  $v^0 = 0$
2:  for $k = 0, 1, 2, \dots$ do
3:     Broadcast $x^k$ to all workers
4:     for $i = 1, \dots, n$ in parallel do
5:        Sample $g_i^k$ such that $\mathbb{E}[g_i^k \mid x^k] = \nabla f_i(x^k)$
6:        Sample $\hat{g}_i^k \sim \mathrm{Quant}_p(g_i^k, \{d_l\})$
7:     end for
8:     $\hat{g}^k = \frac{1}{n}\sum_{i=1}^n \hat{g}_i^k$
9:     $v^{k+1} = \beta v^k + \hat{g}^k$
10:     $x^{k+1} = \mathrm{prox}_{\gamma_k R}(x^k - \gamma_k v^{k+1})$
11:  end for
Algorithm 2 DIANA with $\alpha = 0$ and $h_i^0 = 0$; (1-bit) QSGD for $p = 2$ / TernGrad for $p = \infty$ [Alistarh et al., 2017, Wen et al., 2017]

7 Implementation and Experiments

Following the advice of Alistarh et al. [2017], we encourage the use of blocks when quantizing large vectors. To this effect, a vector can be decomposed into a number of blocks, each of which is then quantized separately. If coordinates have different scales, as is the case in deep learning, this prevents undersampling of the coordinates with typically smaller values. Moreover, our theoretical results predict that applying quantization to blocks or layers will result in superlinear acceleration.
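In code, quantizing "per layer" simply means treating each layer's flattened gradient as its own block. A small illustrative helper, reusing quant_p from Section 3.2 (our own naming and scales, not the paper's implementation):

```python
import numpy as np

def quantize_per_layer(layer_grads, p=np.inf, rng=None):
    """Quantize each layer's gradient as a separate block so that layers with
    small-scale gradients are not zeroed out by larger layers."""
    return [quant_p(g.ravel(), p=p, rng=rng).reshape(g.shape) for g in layer_grads]

rng = np.random.default_rng(4)
grads = [1e-3 * rng.normal(size=(64, 32)), rng.normal(size=(10, 64))]  # very different scales
q = quantize_per_layer(grads, p=np.inf, rng=rng)
print([np.count_nonzero(g) for g in q])  # both layers keep a nontrivial fraction of entries
```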

In our convex experiments, the optimal values of were usually around , where the minimum is taken with respect to blocks and are their sizes.

Finally, higher mini-batch sizes make the sampled gradients less noisy, which in turn leads to more uniform gradient differences and faster convergence.

7.1 Diana with momentum works best

We implement DIANA, QSGD, TernGrad and DQGD in Python (the code will be made available online upon acceptance of this work), using MPI4PY for inter-process communication. The methods are tested on a machine with 24 cores, each an Intel(R) Xeon(R) Gold 6146 CPU @ 3.20GHz. The problem considered is binary classification with logistic loss and a regularization penalty whose weight is of order $1/N$, where $N$ is the total number of data points. We experiment with several choices of $\alpha$, the norm type $p$, and the number of workers, and we search for the optimal block sizes. $h_i^0$ is always set to the zero vector for all $i$. We observe that the optimal block size for the $\ell_\infty$-norm differs significantly from that for the $\ell_2$-norm. A detailed description of the experiments can be found in Section L. Here, however, we provide Figure 2 to show how vast the difference with the other methods is.

Figure 2: Comparison of DIANA with QSGD, TernGrad and DQGD on the logistic regression problem for the "mushrooms" dataset.

7.2 Diana vs MPI

In Figure 3 we compare the performance of DIANA against an MPI reduce operation with 32-bit floats. The computing cluster has a Cray Aries High Speed Network. For DIANA we used 2 bits per dimension and experienced a strange scaling behaviour, which was also documented in Chunduri et al. [2017]. In our case, this affected the speed for alexnet and vgg_a beyond 64 or 32 MPI processes, respectively. For more detailed experiments, see Section L.

7.3 Train and test accuracy on Mnist and Cifar10

In the next experiments, we run QSGD Alistarh et al. [2017], TernGrad Wen et al. [2017], SGD with momentum, and DIANA on the Mnist and Cifar10 datasets for 3 epochs. We selected 8 workers and ran each method with learning rates from a fixed grid. For QSGD, DIANA and TernGrad, we also tried various quantization bucket sizes. For QSGD we fixed the number of quantization levels; for DIANA we chose $\alpha$ and the initial $h^0$; for DIANA and SGD we also ran momentum versions with several momentum parameters; and for DIANA we additionally ran with two choices of norm, $\ell_2$ and $\ell_\infty$. For each experiment we used the softmax cross entropy loss. Mnist-Convex is a simple DNN with no hidden layer, Mnist-DNN is a convolutional NN described at https://github.com/floydhub/mnist/blob/master/ConvNet.py, and Cifar10-DNN is a convolutional DNN described at https://github.com/kuangliu/pytorch-cifar/blob/master/models/lenet.py.

In Figure 4 we show the best runs over all tested parameters for each method. For Mnist-Convex, SGD and DIANA make use of momentum and dominate all other algorithms. For Mnist-DNN the situation is very similar. For Cifar10-DNN, both DIANA and SGD significantly outperform the other methods.

Acknowledgements

The work of Peter Richtárik was supported by the KAUST baseline funding scheme. The work of Martin Takáč was partially supported by the U.S. National Science Foundation, under award numbers NSF:CCF:1618717, NSF:CMMI:1663256 and NSF:CCF:1740796.

Figure 3: Comparison of performance (images/second) for various number of GPUs/MPI processes and sparse communication DIANA (2bit) vs. Reduce with 32bit float (FP32).
Figure 4: Evolution of training (left) and testing (right) accuracy for 3 different problems, using 4 algorithms: DIANA, SGD, QSGD and TernGrad. We have chosen the best runs over all tested hyper-parameters.

References

Appendix A Basic Identities and Inequalities

Smoothness and strong convexity. If $f$ is $L$-smooth and $\mu$-strongly convex, then for any vectors $x, y \in \mathbb{R}^d$ we have

$$\langle \nabla f(x) - \nabla f(y), x - y \rangle \ge \frac{\mu L}{\mu + L}\|x - y\|_2^2 + \frac{1}{\mu + L}\|\nabla f(x) - \nabla f(y)\|_2^2. \qquad (24)$$

Norm of a convex combination. For any $x, y \in \mathbb{R}^d$ and $t \in [0, 1]$, we have

$$\|t x + (1 - t) y\|_2^2 = t\|x\|_2^2 + (1 - t)\|y\|_2^2 - t(1 - t)\|x - y\|_2^2. \qquad (25)$$

Variance decomposition. The (total) variance of a random vector $g$ is defined as the trace of its covariance matrix:

$$\mathrm{Var}(g) := \mathbb{E}\left[\|g - \mathbb{E}g\|_2^2\right] = \mathbb{E}\|g\|_2^2 - \|\mathbb{E}g\|_2^2.$$