Trajectory Normalized Gradients for Distributed Optimization

01/24/2019, by Jianqiao Wangni, et al.

Recently, researchers have proposed various low-precision gradient compression schemes for efficient communication in large-scale distributed optimization. Building on this work, we try to reduce the communication complexity from a new direction. We pursue an ideal bijective mapping between two spaces of gradient distributions, so that the mapped gradient carries greater information entropy after compression. In our setting, all servers share a reference gradient in advance, and they communicate via normalized gradients, which are the difference or quotient between the current gradients and the reference. To obtain a reference vector that yields a stronger signal-to-noise ratio, dynamically in each iteration, we extract and fuse information from the past trajectory in hindsight, and search for an optimal reference for compression. We name this the trajectory-based normalized gradient (TNG). It bridges research from different communities, such as coding, optimization, systems, and learning. It is easy to implement and can be universally combined with existing algorithms. Our experiments on benchmark hard non-convex functions and on convex problems such as logistic regression demonstrate that TNG is more compression-efficient for the communication of distributed optimization of general functions.


1 Introduction

For large-scale machine learning, distributed optimization algorithms make a significant contribution to improving performance and scalability (Bottou et al., 2018). An almost necessary technique for processing massive amounts of data in parallel is to divide the data across different servers within a computation cluster; these servers provide local gradients and perform model synchronization using various communication protocols. In a centralized and synchronous setting, all servers transmit their local gradients to the main server, which then computes an updated parameter value and broadcasts it to the others. As the system grows in size, the synchronization procedure is likely to be slowed by network capacity and latency. A popular way to reduce the communication cost is to transmit compressed gradients, i.e. low-bit representations. There are studies using low-bit quantization (Alistarh et al., 2017), ternary representations (Zhou et al., 2016; Wen et al., 2017), sparsified vectors (Wang et al., 2018; Alistarh et al., 2018; Wangni et al., 2018), the top-K important coordinates (Aji & Heafield, 2017), or even only the signs of gradients (Bernstein et al., 2018). In addition, the compression error from previous iterations can be accumulated (Wu et al., 2018; Stich et al., 2018) to compensate the gradients.

The compression error is still far from rigorously studied. Most of the above works focus on stochastic gradient descent (SGD) (Zhang, 2004; Bottou, 2010) and on training deep neural networks, whose objective functions are naturally robust to optimization with noisy gradients (Jin et al., 2017; Kleinberg et al., 2018). However, for a wider range of problems, optimization of convex and strongly-convex objectives is sensitive to gradient noise, which partially explains why variance-reduced SGD (Johnson & Zhang, 2013) and quasi-Newton methods (Wright & Nocedal, 1999) strongly outperform vanilla SGD. Under these settings, the convergence rate will probably slow down linearly with the compression, and then there are theoretically no savings in terms of communication cost. Therefore, it is imperative to characterize the compression error in more depth.

A natural observation is that the compression error strongly depends on the gradient distribution, in addition to the compression algorithm itself. For example, Huffman coding favors skewed distributions of symbol occurrences (Cormen et al., 2009); frequency-domain image codings are effective since the low-frequency and high-frequency parts have an unbalanced distribution of sensitivity to human eyes (Szeliski, 2010). Perhaps, just like the no-free-lunch theorem, there is no effective compression without further distributional properties to exploit.

Motivated by this, we propose to effectively adjust or normalize the gradient distribution before compressing it, ideally towards a standard Gaussian distribution. The problem differs from communication in the conventional sense in several respects: 1) the ultimate target of these rounds of gradient exchange is to improve the optimization in the outer framework; 2) the information is generated by the optimization algorithm, which can be modified to adapt to the encoding and decoding; 3) the past gradients shared in advance may be used for acceleration, as in quasi-Newton algorithms (Wright & Nocedal, 1999) and Nesterov's momentum (Nesterov, 2013), and they naturally cost no extra communication.

The paper is arranged as follows: we introduce the background and notation, then the motivation for normalized gradients; we give several implementation options for the idea and evaluate it on different problems.

2 Background

Denote $f(x)$ as the objective function, where $x \in \mathbb{R}^d$ is the parameter to be optimized. For convenience, we assume that the objective has a finite-average formulation over $N$ data points, with each loss function denoted as $f_n(x)$:
$$f(x) = \frac{1}{N} \sum_{n=1}^{N} f_n(x).$$
In the $t$-th round, a descent vector $g_t$ is generated based on the current parameter $x_t$. The descent vector $g_t$ has to be an unbiased estimate of the gradient $\nabla f(x_t)$, and must have a bounded variance term to assure convergence. A typical strategy for stochastic gradient descent (SGD) is to sample an index $n$ uniformly from the data set and take a step
$$x_{t+1} = x_t - \eta_t \nabla f_n(x_t),$$
where $\eta_t$ is the step size for this iteration.

In a distributed computation model, we assume $M$ servers are available for the optimization task. Each server has its own share of the whole training dataset, say server $m$ holds the subset $D_m$, and together the servers provide an unbiased estimate of the gradient by averaging,
$$g_t = \frac{1}{M} \sum_{m=1}^{M} g_t^m.$$

In each round, server $m$ calculates an unbiased estimate $g_t^m$ of the gradient from randomly sampled partial data in its memory, then transmits the gradient to the main server for synchronization, during which the main server averages over all gradients, updates the parameter $x_t$, and broadcasts it back to all servers.

Previous research on compressed gradients assumes that there exists a coding strategy $Q(\cdot)$ that compresses the gradient vector into $\mathcal{A}^d$, where $\mathcal{A}$ is the available set for representing a number in low precision. Then each server only needs to transmit its gradient as a compressed vector, and the overall algorithm behaves like
$$x_{t+1} = x_t - \frac{\eta_t}{M} \sum_{m=1}^{M} Q(g_t^m). \qquad (1)$$
Besides, an ideal design of compression should be unbiased, so that $\mathbb{E}[Q(g)] = g$.

2.1 Motivation

Suppose we use the algorithm in Eq. (1), targeting an $L$-smooth loss function $f$, and assume that the compression error is random and independent of the stochastic sampling. The convergence rates of the methods designed above are strongly related to the optimization algorithm, especially the strategy for generating gradients in each iteration, as well as the assumptions made (i.e. smoothness, Lipschitz continuity, convexity). For convenience, we denote the stochastic gradient as $g_t$ and its compressed version as $\tilde g_t = Q(g_t)$.

Assumption 2.1. We suppose that the loss function $f$ is differentiable, $L$-smooth, and $\mu$-strongly convex.

We start with a simple inequality: based on the iterate $x_t$, the expected loss at the next iteration is bounded by
$$\mathbb{E}[f(x_{t+1})] \;\le\; f(x_t) - \eta_t \|\nabla f(x_t)\|^2 + \frac{L \eta_t^2}{2}\, \mathbb{E}\|\tilde g_t\|^2 \;\le\; f(x_t) - \eta_t \|\nabla f(x_t)\|^2 + \frac{L \eta_t^2}{2} \Big( \|\nabla f(x_t)\|^2 + \mathbb{E}\|g_t - \nabla f(x_t)\|^2 + \mathbb{E}\|\tilde g_t - g_t\|^2 \Big),$$
where we applied the smoothness property in the first inequality and decomposed the variance in the second inequality. An optimal compression is supposed to reduce the variance contributed by the compression error $\mathbb{E}\|\tilde g_t - g_t\|^2$.

Although seldom studied in the area of communication-efficient distributed optimization, we notice that the compression error is largely affected by the gradient distribution. Different compression strategies favor different kinds of distributions, whether long-tailed, strongly concentrated like sub-Gaussian, or weakly concentrated like sub-exponential. For example, gradient quantization approaches (Alistarh et al., 2017) favor gradients with uniformly distributed elements within the quantization range; conversely, if one uses the gradient sparsification technique (Wangni et al., 2018) as the compression $Q$, then a strong skewness of the gradients implies that more communication can be saved.
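As a toy illustration of this dependence (our own simulation, not an experiment from the paper), the snippet below applies a simple unbiased max-scaled stochastic compressor to a roughly uniform vector and to a heavy-tailed one; the relative error is much larger in the skewed case.

```python
import numpy as np

def unbiased_max_scaled_compress(v, rng):
    """Unbiased two-level compressor: element i becomes sign(v_i) * max|v|
    with probability |v_i| / max|v|, and 0 otherwise."""
    scale = np.max(np.abs(v))
    keep = rng.random(v.shape) < np.abs(v) / scale
    return np.sign(v) * scale * keep

rng = np.random.default_rng(0)
uniform_like = rng.uniform(-1.0, 1.0, 10000)   # elements spread over the range
heavy_tailed = rng.standard_cauchy(10000)      # a few huge elements dominate
for name, v in [("uniform", uniform_like), ("skewed", heavy_tailed)]:
    err = np.mean((unbiased_max_scaled_compress(v, rng) - v) ** 2)
    print(name, "relative compression error:", err / np.mean(v ** 2))
```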

3 Normalized Gradients

We try to address the problem by adjusting or normalizing the gradient distribution using past trajectories, since they have already been transmitted and thus incur no additional communication cost in this round. We refer to the adjusted gradient as the Trajectory Normalized Gradient (TNG). The communication protocol can be described generally as follows: we let all servers share, in advance, a reference vector $r_t$ that approximates the gradient. To send its gradient, each server transmits the normalized gradient, i.e. the difference $g_t^m - r_t$, in compressed form $Q(g_t^m - r_t)$; upon receiving it, a server decodes the gradient as
$$\hat g_t^m = Q(g_t^m - r_t) + r_t. \qquad (2)$$

A simple way to understand $g_t - r_t$ is to view it as a zero-centered random variable, if $r_t$ approximates the mean of $g_t$, or as a polynomial of higher-order derivatives, if $r_t$ is a gradient at a nearby iterate; in the latter case the range of the normalized gradients is tighter by higher-order continuity. The distributions of $g_t$ and $g_t - r_t$ depend on the model, the data, and the optimization algorithm itself. If they follow the same distribution and differ only in magnitude by a factor smaller than one, then clearly the compression of $g_t - r_t$ yields a smaller error. By taking logarithms of the gradient vectors $g_t$ and $r_t$ before performing the coding above, we get a quotient form,
$$\hat g_t^m = Q(g_t^m \oslash r_t) \odot r_t, \qquad (3)$$
where $\odot$ is the element-wise product and $\oslash$ the element-wise quotient. If these two procedures are combined, we get a normalization of the form
$$\hat g_t^m = Q\big((g_t^m - r_t) \oslash u_t\big) \odot u_t + r_t,$$
where $u_t$ is a second reference vector. We also note that $r_t$ could be shared through a round of broadcast from the main server, or it could be shared implicitly, for example by using a predefined protocol to update it from the gradient vectors that the servers received in previous iterations.
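A minimal sketch of the two normalization forms in Eqs. (2)-(3); Q stands for any unbiased compress-then-decode operator, r for the shared reference, and the small eps guard against division by zero is our own addition.

```python
import numpy as np

def tng_decode_subtractive(g, r, Q):
    """Eq. (2): the sender transmits Q(g - r); the receiver adds the shared reference back."""
    return Q(g - r) + r

def tng_decode_quotient(g, r, Q, eps=1e-12):
    """Eq. (3): the sender transmits Q(g / r) element-wise; the receiver multiplies by r."""
    return Q(g / (r + eps)) * r

# Toy usage with coarse rounding standing in for a real compressor:
g = np.array([0.30, -0.12, 0.05])
r = np.array([0.28, -0.10, 0.06])
print(tng_decode_subtractive(g, r, Q=lambda v: np.round(v, 1)))
```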

3.1 Reference Vectors

The key requirement is to choose the reference $r_t$ appropriately, so that $g_t - r_t$ follows a more normalized distribution than $g_t$ and therefore incurs less compression error from $Q$. General normalization requires the mean vector to be known in advance, which actually causes much trouble: as $x_t$ is updated, the stochastic gradient has a different mean $\nabla f(x_t)$ in each iteration. Computing $\nabla f(x_t)$ exactly can be regarded as practically infeasible, since it takes much more computation (linear in the number of data points) than calculating a gradient from a mini-batch as in SGD. Here we reach an interesting problem of how to approximate the mean of the stochastic gradient so that it is actually normalized.

A simple approach is to take $r_t = \bar g_t \mathbf{1}$, where $\bar g_t$ is the average value of all elements in $g_t$. This reduces the variance of the transmitted vector, by the inequality
$$\mathbb{E}\big[(X - \mathbb{E}[X])^2\big] \;\le\; \mathbb{E}\big[(X - c)^2\big] \qquad (4)$$
for any random variable $X$ and any constant $c$. The only additional cost is to transmit a single scalar $\bar g_t$, which is negligible compared to transmitting a $d$-dimensional vector.
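A quick numerical check of the empirical analogue of inequality (4): among all scalars c, the mean squared deviation of the gradient elements is minimized at their average (hypothetical values, for illustration only).

```python
import numpy as np

g = np.random.default_rng(1).normal(2.0, 0.5, size=1000)  # a gradient with a nonzero mean
for c in [0.0, 1.0, g.mean()]:                            # g.mean() is the single transmitted scalar
    print(f"c = {c:+.3f}   mean squared deviation = {np.mean((g - c) ** 2):.4f}")
```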

The formulation for $r_t$ can be inspired by other areas. For example, the stochastic variance-reduced gradient (SVRG) algorithm (Johnson & Zhang, 2013) gives a better estimate of the gradient and converges linearly on strongly-convex and smooth loss functions, using
$$g_t = \nabla f_n(x_t) - \nabla f_n(\tilde x) + \nabla f(\tilde x),$$
where $\tilde x$ is a reference parameter, generally chosen from a previous iteration. The full gradient $\nabla f(\tilde x)$, although much more costly than a stochastic gradient, is not frequently updated; once it is evaluated, it costs only one round of communication for many subsequent rounds of SGD steps. Based on the same intuition, the stochastic average gradient (SAG) (Schmidt et al., 2017) could be applied here. The difference is that the main server can average gradients from all servers, and these gradients might be the compressed ones from past iterations.

In another area of distributed optimization, the delay-tolerant optimization algorithm (Agarwal & Duchi, 2011) performs updates with a delayed gradient $g_{t-\tau}$,
$$x_{t+1} = x_t - \eta_t \, g_{t-\tau},$$
which converge as long as the staleness $\tau$ of the parameter is bounded. Such a delayed gradient can serve as the reference gradient, since it is a close approximation to the current gradient.

The fourth option is to use a two-stage compression strategy: in each stage, the algorithm generates a compensation vector from the shared vector to complement the first stage. To list the options discussed here: (i) the element-wise mean $\bar g_t \mathbf{1}$; (ii) a full or averaged gradient at a reference parameter, as in SVRG or SAG; (iii) a delayed gradient $g_{t-\tau}$; and (iv) a two-stage compensation vector.

The reference vector can be updated frequently or only occasionally, depending on how easily it can be accessed, e.g. by setting a fixed update interval as in the stale synchronous parallel (SSP) protocol (Ho et al., 2013).

3.2 Gradient Compression

There are many protocols available for compressing the normalized gradient $v = g_t - r_t$, as in the literature introduced above. Here we take a strong compression coding strategy as an example, namely using the sign of each element (Wen et al., 2017; Bernstein et al., 2018). For communication, each server transmits a constant equal to the largest absolute element of $v$, and each compressed element is derived from the corresponding element of $v$. For simplicity, we will often omit the iteration subscript $t$ and the server index $m$. The magnitude information is encoded by the randomization process, so the compressed gradient remains unbiased in expectation.

Denote $s = \max_i |v_i|$ as the largest absolute element of $v$, and let $z \in \{0, 1\}^d$ be a binary vector indicating whether each element of $v$ is compressed to its sign or simply to zero.

An example of compressed TNG is given in Algorithm 1. In the following, we characterize the optimality of the coding strategy above, namely that the sampling probability vector should be proportional to the element magnitudes: for $L$-smooth loss functions, the setting $p_i = |v_i| / s$ proposed above is the optimal sampling probability for the ternary coding of $v$ in Algorithm 1.

1:  Initialize the clock $t = 0$, initialize the weight $x_0$, and set the reference vector $r_0 = 0$.
2:  repeat
3:     Each server $m$ calculates its local gradient $g_t^m$ and the normalized vector $v^m = g_t^m - r_t$.
4:     Randomly sample a binary vector $z^m$ such that $P(z_i^m = 1) = |v_i^m| / s^m$, where $s^m = \max_i |v_i^m|$.
5:     Transmit the compressed gradients $s^m \,\mathrm{sign}(v^m) \odot z^m$ and the scalar $s^m$.
6:     The main server averages over the received (decoded) gradients and broadcasts the update.
7:     Update the reference vector $r_{t+1}$, e.g. through main-server broadcasting, and set $t \leftarrow t + 1$.
8:  until convergence or the number of iterations reaches the maximum setting.
Algorithm 1 Trajectory Normalized Gradients via Ternary Coding and Delayed Gradients
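The following sketch simulates one communication round of Algorithm 1 in a single process; the decoding on the main server, the reference-update rule, and all variable names are our own reading of the pseudo-code above rather than the authors' implementation.

```python
import numpy as np

def ternary_encode(v, rng):
    """Randomized ternary coding of the normalized gradient v = g - r:
    transmit the scalar s = max|v| plus, per element, sign(v_i) with
    probability |v_i| / s and 0 otherwise (unbiased in expectation)."""
    s = np.max(np.abs(v))
    if s == 0.0:
        return 0.0, np.zeros_like(v)
    keep = rng.random(v.shape) < np.abs(v) / s
    return s, np.sign(v) * keep            # ternary entries in {-1, 0, +1}

def tng_round(x, r, server_grads, step_size, rng):
    """One synchronous round: servers send ternary-coded (g_m - r), the main
    server decodes with the shared reference r, averages, and steps."""
    decoded = []
    for g in server_grads:
        s, t = ternary_encode(g - r, rng)
        decoded.append(s * t + r)          # decode: scale * ternary + reference
    avg = np.mean(decoded, axis=0)
    x_next = x - step_size * avg
    r_next = avg                           # e.g. reuse the averaged gradient as the next reference
    return x_next, r_next
```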

3.3 Convergence Analysis

We do not focus on the specific constant of the convergence rate, since it depends on other factors and it is hard to provide a unified theorem that is both informative and tight. Here, we give a simple analysis of how the compression error affects the convergence rate.

We have the following assumption for the variance of the stochastic gradient evaluated at the optimal point $x_*$: there exists a constant $\sigma^2$ such that $\mathbb{E}\|\nabla f_n(x_*)\|^2 \le \sigma^2$. Then, for loss functions that satisfy Assumption 2.1, the variance of the stochastic gradient $g_t$ is bounded by
$$\mathbb{E}\|g_t - \nabla f(x_t)\|^2 \;\le\; 4L \big( f(x_t) - f(x_*) \big) + 2\sigma^2.$$

This lemma gives a better bound on the gradient variance than directly assigning a fixed upper bound, as it decreases as the optimization proceeds. For the compressed normalized gradient in Algorithm 1, we assume that there exists a constant $\beta \le 1$ such that, for the stochastic gradients on all servers,
$$\mathbb{E}\|g_t^m - r_t\|^2 \;\le\; \beta\, \mathbb{E}\|g_t^m\|^2.$$
We can always ensure the proposition above is satisfied. For example, we can set $r_t = 0$ and $\beta = 1$, although this degenerates to the trivial case without normalization. In real applications, this assumption can be satisfied with a much smaller constant, since we have a large pool of available reference vectors that can be shared in many ways, e.g. using reference vectors chosen in hindsight from the past trajectory. As long as there is a need for trading computation for communication, this constant can be searched for; the only additional communication cost is to indicate which reference vector is used in this iteration. We further assume that the coding strategy has bounded compression error for any vector $v$, that
$$\mathbb{E}\|Q(v) - v\|^2 \;\le\; \alpha \, \|v\|^2.$$
We denote $\alpha$ as a compression constant for TNG; a smaller $\alpha$ implies more necessary bits for communication. The variance of the decoded TNG $\hat g_t = Q(g_t - r_t) + r_t$ is bounded as
$$\mathbb{E}\|\hat g_t - \nabla f(x_t)\|^2 \;\le\; 2\alpha\beta\, \mathbb{E}\|g_t\|^2 + 2\, \mathbb{E}\|g_t - \nabla f(x_t)\|^2.$$

Remark: We apply the inequality $\|a + b\|^2 \le 2\|a\|^2 + 2\|b\|^2$ for two variables $a, b$, and decompose the variance using the assumption on compression error into
$$\mathbb{E}\|\hat g_t - \nabla f(x_t)\|^2 \;\le\; 2\,\mathbb{E}\|\hat g_t - g_t\|^2 + 2\,\mathbb{E}\|g_t - \nabla f(x_t)\|^2 \;\le\; 2\alpha\, \mathbb{E}\|g_t - r_t\|^2 + 2\,\mathbb{E}\|g_t - \nabla f(x_t)\|^2.$$
After applying the assumption about the shrinkage of variance from normalization, $\mathbb{E}\|g_t - r_t\|^2 \le \beta\, \mathbb{E}\|g_t\|^2$, we have the lemma.

For loss functions and TNG algorithms that satisfy Assumption 2.1 and the assumptions in Section 3.3, after enough iterations $T$, and with a step size chosen inversely proportional to a constant that behaves like the condition number, the suboptimality $\mathbb{E}[f(x_T)] - f(x_*)$ is guaranteed to decrease geometrically up to a residual term that scales with the step size, the noise level $\sigma^2$, and the compression constant.

This is an adaptation of a general analysis of strongly-convex optimization (Nguyen et al., 2018) to include compression error, and gives us basic intuition about how the compression error affects the convergence rate.

(a) Ackley Function. (b) Booth Function. (c) Rosenbrock Function.
Figure 1: TNG on Benchmarking Nonconvex Functions.
Figure 2: Convergence of SGD Methods. X-axis: communication cost (bits per element); Y-axis: suboptimality.
Figure 3: Convergence of Stochastic Quasi-Newton Methods. X-axis: communication cost (bits per element); Y-axis: suboptimality.
Figure 4: Convergence of Stochastic Quasi-Newton Methods (sensitivity to the number of servers and memory size). X-axis: communication cost (bits per element); Y-axis: suboptimality.

4 Experiments

4.1 Nonconvex Problems

To visualize the efficiency of compressed normalized gradients on some hard non-convex functions, we plot the optimization trajectories in Figure 1. These functions include the Ackley function (global minimum at $(0, 0)$), the Booth function (global minimum at $(1, 3)$), and the Rosenbrock function (global minimum at $(1, 1)$). The stochastic gradient is synthetically generated by adding Gaussian noise to each element, and the step size is fixed through all iterations; we search for the optimal step size separately for each function. Normalized gradients are denoted TNG in the figure and the baseline is denoted SGD. We choose the ternary coding (Wen et al., 2017) of stochastic gradients for both methods; the only difference is with or without trajectory normalization. For each optimizer, we note the current parameter and objective function values below each figure. We make sure that the two approaches use equal communication for a fair comparison, by counting one round of reference-vector communication in 16-bit representation as the equivalent number of iterations of pure ternary coding. The reference vector is updated at a fixed interval of iterations. As non-convex optimization is sensitive to the initialization point, we choose three initialization points and mark the optimizers with a number suffix to indicate the different initializations. In general, the normalized gradient is compression-robust, as it converges faster. The improvement is larger on an oscillating surface like the Ackley function than on a flat surface like the Rosenbrock function, which aligns with our motivation that the compression error depends on the intrinsic distribution of gradients.

4.2 Convex Problems

We study TNG combined with different kinds of gradients, coding strategies, and reference-gradient formulations, with or without second-order information, to demonstrate the generality of the proposed method. We use mini-batch stochastic gradient descent, along with its quasi-Newton adaptation (Byrd et al., 2016). The stochastic quasi-Newton method uses L-BFGS to update the Hessian approximation and stochastic gradients as the first-order gradient. To be specific, we replace the vanilla stochastic gradient $g_t$ with the second-order direction $H_t g_t$, where $H_t$ is an approximate inverse Hessian matrix built from the past trajectory of both parameters and gradients within a memory of size $K$,
$$s_j = x_{j+1} - x_j, \qquad y_j = g_{j+1} - g_j, \qquad j = t-K, \ldots, t-1. \qquad (5)$$
Denoting $\rho_j = 1 / (y_j^\top s_j)$, we initialize the recursion with $H^{(t-K)} = \gamma I$, where $\gamma I$ is a diagonal matrix. Then L-BFGS updates the inverse Hessian as
$$H^{(j+1)} = \big(I - \rho_j s_j y_j^\top\big) H^{(j)} \big(I - \rho_j y_j s_j^\top\big) + \rho_j s_j s_j^\top \qquad (6)$$
for $j = t-K, \ldots, t-1$, and finally generates $H_t = H^{(t)}$.
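A dense, illustrative version of the recursion in Eqs. (5)-(6); a practical implementation would use the two-loop recursion instead of forming the matrix explicitly, and the initial scaling gamma below is the common s^T y / y^T y choice, which is our assumption here.

```python
import numpy as np

def lbfgs_inverse_hessian(s_list, y_list, dim):
    """Dense form of Eqs. (5)-(6): build the approximate inverse Hessian from the
    stored trajectory pairs s_j = x_{j+1} - x_j and y_j = g_{j+1} - g_j."""
    gamma = (s_list[-1] @ y_list[-1]) / (y_list[-1] @ y_list[-1])
    H = gamma * np.eye(dim)                             # H initialized as a scaled identity
    for s, y in zip(s_list, y_list):
        rho = 1.0 / (y @ s)
        V = np.eye(dim) - rho * np.outer(y, s)
        H = V.T @ H @ V + rho * np.outer(s, s)          # BFGS update of the inverse Hessian
    return H                                            # the descent direction is -H @ g
```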

We mainly use $\ell_2$-regularized logistic regression as a representative convex problem to evaluate efficiency; each data point consists of a feature vector and its binary label. We use the same procedure as (Wangni et al., 2018) to generate a large pool of synthetic datasets with different degrees of skewness in the gradient distribution, controlled by two hyperparameters: we sample normalized data vectors whose elements follow a standard Gaussian distribution, and we sample magnitude vectors from a uniform distribution, with the smaller magnitudes shrunk so that the distribution becomes skewed. The features are the element-wise products of the normalized data and the magnitudes, the data is $d$-dimensional, each setting generates a dataset of fixed size, and labels are generated from the features according to a ground-truth model. A smaller shrinkage threshold implies stronger skewness, or sparsity, in the gradient distribution.
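A hedged sketch of such a generator; the threshold, shrinkage factor, and the logistic labeling rule are assumptions on our part, and the exact protocol is described in (Wangni et al., 2018).

```python
import numpy as np

def make_skewed_logreg_data(n=10000, d=100, threshold=0.5, shrink=0.1, seed=0):
    """Features are element-wise products of Gaussian 'normalized data' and uniform
    magnitudes; magnitudes below `threshold` are multiplied by `shrink`, so smaller
    values of these two knobs give a more skewed gradient distribution."""
    rng = np.random.default_rng(seed)
    normalized = rng.standard_normal((n, d))
    magnitudes = rng.uniform(0.0, 1.0, size=(n, d))
    magnitudes = np.where(magnitudes < threshold, magnitudes * shrink, magnitudes)
    features = normalized * magnitudes
    w_true = rng.standard_normal(d)
    prob = 1.0 / (1.0 + np.exp(-features @ w_true))     # logistic labeling rule (assumed)
    labels = np.where(rng.uniform(size=n) < prob, 1, -1)
    return features, labels
```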

First, we simulate a cluster of servers in which the main server does the averaging and broadcasting. We use two kinds of algorithms to calculate the stochastic gradients, SGD and SVRG, with a fixed mini-batch size in all settings. We plot their convergence behavior in Figure 2 in terms of communication, i.e. the product of the number of data passes and the compression rate of the gradient information. Across the rows and columns of the figure, we vary the skewness parameter and the $\ell_2$-regularization strength individually, to test the sensitivity of TNG under different levels of convexity and gradient skewness. We compare our approach with gradient quantization (Alistarh et al., 2017) (noted as QG in the figures), randomized ternary coding (Wen et al., 2017) (noted as TG), and gradient sparsification (Wangni et al., 2018) (noted as SG), three approaches that favor different distributions for compression. We tuned the step size for the fastest convergence speed and found a setting that performs stably for all methods; a larger step size caused divergence in some settings. We noticed that the TG methods have a larger variance than the other two, so we scale their measured variance by a shrinking factor to make plotting easier. In each subfigure, we note the skewness parameter and the regularization value. In Figure 3 we plot the convergence of the stochastic second-order gradient method, with exactly the same settings of convexity, sparsity, etc. as in Figure 2. We also test the sensitivity to settings such as the number of servers and the memory size of the quasi-Newton method in Figure 4: across rows and columns we vary the number of servers and the memory size, and the settings are noted below the subfigures.

Our normalization technique is combined with each of the three codings (noted with the prefix TN in the figures). We initialize the reference vector with a full gradient, and in subsequent iterations the reference is updated to be the averaged compressed TNG from the last iteration. This can be done with a round of broadcasting of the reference vector in a synchronous setting, or the other servers can infer it from the past parameters without additional communication. The trade-off between the accuracy of the reference and its cost needs to be balanced for different problems. When counting bits for each approach, we also choose the better encoding of the vectors, whether in dense or sparse form; the latter suits the case where the distribution of nonzero elements is uneven.

Observing the figures, we see that the normalization clearly improves upon the baselines in essentially all settings, and the size of the improvement depends on the conditions. Since different coding strategies have advantages on different problems, we do not compare them against each other. The SG methods mostly spend their bits on transmitting full-precision values of the important elements, and could be further improved by using low-precision, i.e. quantized, numbers. We found that TNG improves upon the baseline more under stronger convexity and weaker gradient skewness. Comparing different levels of sparsity in the gradient distribution, the different coding methods show slightly different behavior: for example, QG is relatively insensitive to the skewness of gradients compared to SG, and SG performs better under stronger convexity. Besides, observing Figure 4 vertically, a larger number of servers provides a better reference vector; observing it horizontally, increasing the memory size initially improves convergence but gradually becomes ineffective.

5 Related Works

Researchers have proposed protocols from other perspectives to reduce communication. A prevailing method is to average parameters occasionally rather than too frequently (Tsianos et al., 2012; Wang & Joshi, 2018), or to perform just one round of averaging over the final parameters (Zhang et al., 2012). If the problem requires the servers to synchronize frequently, we can use an asynchronous protocol such as parameter servers (Ho et al., 2013; Li et al., 2014a), where each server requests the latest parameter from the main server or contributes its gradients, passively or aggressively, depending on the network condition; decentralized optimization algorithms (Yuan et al., 2016; Lan et al., 2017; Lian et al., 2017) treat every server equally, to avoid communication congestion at the main server, which otherwise handles most of the requests and causes imbalance. Efficiently using a large batch size (Cotter et al., 2011; Li et al., 2014b; Wang & Zhang, 2017; Goyal et al., 2017) or second-order gradients (Shamir et al., 2014; Zhang & Lin, 2015) reduces the overall number of iterations and therefore the communication. The model synchronization can also be formulated as a global consensus problem (Zhang & Kwok, 2014) with a penalty on delay. Besides, the normalization idea has also been used in other areas, such as normalized gradient descent for general convex or quasi-convex optimization (Nesterov, 1984; Hazan et al., 2015); on a different subject, normalization helps to stabilize the feature or gradient distribution in neural networks (Ioffe & Szegedy, 2015; Klambauer et al., 2017; Neyshabur et al., 2015; Salimans & Kingma, 2016).

6 Conclusion

In this paper we propose a simple and general protocol, using trajectory normalized gradients, to reduce the compression error of gradient communication in distributed optimization. We provide insight into how to normalize gradients more accurately, and validate our idea in experiments with various parameters and coding strategies.

References