For large-scale machine learning, distributed optimization algorithms contribute significantly to performance and scalability (Bottou et al., 2018). An almost indispensable technique for processing massive amounts of data in parallel is to divide the data among different servers within a computation cluster; these servers provide local gradients and synchronize the model using various communication protocols. In a centralized and synchronous setting, all servers transmit their local gradients to the main server, which then computes an updated parameter value and broadcasts it to the others. As the system grows in size, the synchronization procedure is likely to be slowed by network capacity and latency. A popular way to reduce the communication cost is to transmit compressed gradients, i.e. a low-bit representation. Existing studies use low-bit quantization (Alistarh et al., 2017), ternary representations (Zhou et al., 2016; Wen et al., 2017), sparsified vectors (Wang et al., 2018; Alistarh et al., 2018; Wangni et al., 2018), the top-K most important coordinates (Aji & Heafield, 2017), or even only the signs of gradients (Bernstein et al., 2018). In addition, the compression error from previous iterations can be accumulated to compensate later gradients (Wu et al., 2018; Stich et al., 2018).
The compression error is still far from rigorously studied. Most of the above works focus on stochastic gradient descent (SGD) (Zhang, 2004; Bottou, 2010) and on training deep neural networks, whose objective functions are naturally robust to optimization with noisy gradients (Jin et al., 2017; Kleinberg et al., 2018). However, for a wider range of problems, optimization of convex and strongly convex objectives is sensitive to gradient noise, which partially explains why variance-reduced SGD (Johnson & Zhang, 2013) and quasi-Newton methods (Wright & Nocedal, 1999) strongly outperform vanilla SGD. Under these settings, the convergence rate will probably slow down linearly with the compression, in which case there are theoretically no savings in communication cost. Therefore, it is imperative to characterize the compression error in more depth.
A natural motivation is that the compression error strongly depends on the gradient distribution, in addition to the compression algorithm itself. For example, Huffman coding benefits from a skewed distribution of symbol occurrences (Cormen et al., 2009); frequency-domain image coding is effective because the human eye is unevenly sensitive to low-frequency and high-frequency components (Szeliski, 2010). Perhaps, just as with the no free lunch theorem, there is no effective compression without further distributional properties to exploit.
Motivated by this, we propose to adjust or normalize the gradient distribution before compression, ideally so that it follows a standard Gaussian distribution. The problem differs from communication in the conventional sense in several respects: 1) the ultimate target of these rounds of gradient exchange is to improve the optimization in the outer framework; 2) the information is generated by the optimization algorithm, which can be modified to adapt to the encoding and decoding; 3) the past gradients shared in advance may be used for acceleration, as in quasi-Newton algorithms (Wright & Nocedal, 1999) and Nesterov’s momentum (Nesterov, 2013), and they naturally cost no extra communication.
The paper is organized as follows: we first introduce the background and notation, then the motivation for normalized gradients; we then give several implementation options for the idea and evaluate it on different problems.
Denote $f(w)$ as the objective function, where $w \in \mathbb{R}^d$ is the parameter to be optimized. For convenience, we assume that the objective has a finite average formulation over $N$ data points, and each loss function is denoted as $f_n(w)$, so that $f(w) = \frac{1}{N} \sum_{n=1}^{N} f_n(w)$. In the $t$-th round, a descent vector $g_t$ is computed based on the current parameter $w_t$. The descent vector $g_t$ has to be an unbiased estimate of the gradient and has to have bounded variance to ensure convergence. A typical strategy for stochastic gradient descent (SGD) is to sample an index $n$ uniformly from the data set and take a step
$$w_{t+1} = w_t - \eta_t \nabla f_n(w_t),$$
where $\eta_t$ is the step size for this iteration.
In a distributed computation model, we assume that multiple servers are available for the optimization task. Each server holds its own share of the whole training dataset, and together the servers provide an unbiased estimate of the gradient by averaging their local estimates.
In each round, every server calculates its unbiased estimate of the gradient from randomly sampled data in its own memory, then transmits the gradient to the main server for synchronization; the main server averages all gradients, updates the parameter, and broadcasts it back to all servers.
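The synchronous round described above can be sketched in a few lines. This is only an illustrative toy, not the setup used in the paper: the least-squares loss, the four shards, the step size of 0.1 and all function names are assumptions made for the example.

```python
import random

def local_gradient(w, shard, rng):
    """Stochastic gradient of 0.5 * (w.x - y)^2 at one sample drawn from this server's shard."""
    x, y = rng.choice(shard)
    residual = sum(wi * xi for wi, xi in zip(w, x)) - y
    return [residual * xi for xi in x]

def synchronous_round(w, shards, step_size, rng):
    """One round: every server sends a local gradient; the main server averages and updates."""
    grads = [local_gradient(w, shard, rng) for shard in shards]
    average = [sum(g[j] for g in grads) / len(grads) for j in range(len(w))]
    return [wi - step_size * gi for wi, gi in zip(w, average)]

def mean_loss(w, shards):
    """Average loss over the union of all shards (used only for monitoring)."""
    samples = [sample for shard in shards for sample in shard]
    losses = [0.5 * (sum(wi * xi for wi, xi in zip(w, x)) - y) ** 2 for x, y in samples]
    return sum(losses) / len(losses)

rng = random.Random(0)
true_w = [1.0, -2.0]                      # assumed ground truth for the toy data
data = []
for _ in range(40):
    x = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
    data.append((x, sum(a * b for a, b in zip(true_w, x))))
shards = [data[i::4] for i in range(4)]   # four simulated servers, ten samples each

w = [0.0, 0.0]
initial = mean_loss(w, shards)
for _ in range(500):
    w = synchronous_round(w, shards, 0.1, rng)
final = mean_loss(w, shards)
```

In this uncompressed form every round sends one full-precision vector per server, which is the cost that gradient compression aims to reduce.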
Previous research on compressed gradient assumes that there exists a coding strategy to compress the gradient vector, where is the available set for representing a number in . Then each server only needs to update its gradient using a compressed vector, and the overall algorithm behaves like
Besides, an ideal design of compression should be unbiased, so that
Suppose we use the algorithm in Eq.(1) on an -smooth loss function , and we assume that the compression error is random and independent of . The convergence rates of the methods above are strongly related to the optimization algorithm, especially the strategy for generating gradients in each iteration, as well as to the assumptions (e.g. smoothness, Lipschitz continuity, convexity). For convenience, we denote and .
We suppose that the loss function is differentiable and -smooth and -strongly convex
We start with a simple inequality: based on the iteration , the expected loss for the next iteration is bounded by
where we applied the smoothness property in the first inequality, and decomposed the variance in the second inequality. An optimal compression is supposed to reduce the variance from compression error in .
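For reference, a standard bound of this kind can be written out as follows. This is a sketch under assumed notation rather than the exact statement used here: $w_t$ is the iterate, $\eta$ the step size, $L$ the smoothness constant, and $\hat{g}_t$ an unbiased (possibly compressed) gradient estimate with $\mathbb{E}[\hat{g}_t] = \nabla f(w_t)$, applied as $w_{t+1} = w_t - \eta \hat{g}_t$.

```latex
\begin{aligned}
\mathbb{E}\, f(w_{t+1})
&\le f(w_t) - \eta \,\langle \nabla f(w_t), \mathbb{E}\,\hat{g}_t \rangle
   + \frac{L\eta^2}{2}\, \mathbb{E}\,\|\hat{g}_t\|^2 \\
&= f(w_t) - \eta\Bigl(1 - \frac{L\eta}{2}\Bigr)\|\nabla f(w_t)\|^2
   + \frac{L\eta^2}{2}\, \mathbb{E}\,\|\hat{g}_t - \nabla f(w_t)\|^2 .
\end{aligned}
```

The first line uses $L$-smoothness; the second uses $\mathbb{E}\|\hat{g}_t\|^2 = \|\nabla f(w_t)\|^2 + \mathbb{E}\|\hat{g}_t - \nabla f(w_t)\|^2$. If the compression is unbiased conditioned on the stochastic gradient, the last term splits further into the stochastic-gradient variance plus the compression error, which is the part a better compression scheme can shrink.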
Although seldom studied in the area of communication-efficient distributed optimization, we notice that the compression error is largely affected by the gradient distribution. Different compression strategies favor different kinds of distribution, whether long-tailed, strongly concentrated like sub-Gaussian, or weakly concentrated like sub-exponential. For example, gradient quantization approaches (Alistarh et al., 2017) favor gradients whose elements are uniformly distributed within the quantization range; in contrast, if one uses the gradient sparsification technique (Wangni et al., 2018) for compression, then a strong skewness of gradients implies that more communication can be saved.
3 Normalized Gradients
We try to address the problem by adjusting or normalizing the gradient distribution using past trajectories; since these have already been transmitted, they incur no additional communication cost in the current round. We refer to the adjusted gradient as the Trajectory Normalized Gradient (TNG). The communication protocol can be generally described as follows: we wish to let all servers share, in advance, a gradient vector that approximates . For sending the gradients, each server transmits the normalized gradients, i.e. the difference between . Each server could send gradients using compressed TNG ; then upon receiving , a server uses the following procedure to decode the gradient as
A simple understanding of is to view it as a zero-centered random variable, if, or a polynomial of the high order derivative, if , and the range for the normalized gradients is tighter by higher-order continuity. The distribution of and depends on the model, the data and the optimization algorithm itself. If they follow the same distribution and differ only in magnitude by a factor of , then clearly the compression on yields a smaller error. By taking logarithms of the gradient vectors and before performing the coding above, we get a form that
where is the element-wise product and takes the element-wise quotient. If these two procedures are combined, we get a normalization form of
where is a second reference vector. We also note that could be shared through a round of broadcast from the main server; it could also be explicitly shared, for example, using a predefined protocol to update from the gradient vectors that these servers received in previous iterations.
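The subtract, compress, and add-back pattern of the protocol can be illustrated with a toy example. The deterministic grid quantizer, the numbers, and the names `quantize`, `encode` and `decode` are assumptions chosen for illustration; they are not the coding scheme of the paper.

```python
def quantize(v, levels=3):
    """Round each entry to the nearest of `levels` evenly spaced points in [-s, s], s = max|v|."""
    scale = max(abs(x) for x in v)
    if scale == 0.0:
        return [0.0] * len(v)
    step = 2.0 * scale / (levels - 1)
    return [-scale + step * round((x + scale) / step) for x in v]

def encode(gradient, reference, levels=3):
    """Sender side: compress the residual between the gradient and the shared reference."""
    residual = [g - r for g, r in zip(gradient, reference)]
    return quantize(residual, levels)

def decode(message, reference):
    """Receiver side: add the shared reference back to the compressed residual."""
    return [m + r for m, r in zip(message, reference)]

def max_error(a, b):
    """Largest elementwise deviation between two vectors."""
    return max(abs(x - y) for x, y in zip(a, b))

gradient = [1.00, 0.90, 1.10, 0.95]
reference = [1.0, 1.0, 1.0, 1.0]   # hypothetical reference already known to every server

direct = quantize(gradient)                                   # compress the raw gradient
normalized = decode(encode(gradient, reference), reference)   # compress the residual instead

direct_error = max_error(direct, gradient)
normalized_error = max_error(normalized, gradient)
```

Because the residual spans a much narrower range than the raw gradient, the same three-level grid is finer, so the decoded vector is closer to the true gradient in this example. The benefit disappears if the reference is a poor approximation of the gradient.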
3.1 Reference Vectors
The key requirement is to choose appropriately so that follows a more normalized distribution than , for less compression error from . General normalization requires the mean vector to be known in advance, which actually causes much trouble, since the mean changes in each iteration as is updated. Computing exactly can be assumed to be basically impossible, as it takes much more computation (linear in the number of data points) than calculating the gradient from a mini-batch as in SGD. Here we reach an interesting problem: how to approximate the mean of the stochastic gradient so that it is actually normalized.
A simple approach is to take where is the average value of all elements in . This will reduce the variance of from an inequality
for any random variable . The only additional cost is to transmit a single scalar , which is negligible compared to transmitting a -dimensional vector.
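A small numeric check makes the effect of subtracting the element average concrete. The vector and the helper names below are made up for the example; the identity being illustrated is that the sum of squared deviations from the mean never exceeds the sum of squares.

```python
def center_by_mean(vector):
    """Return the scalar mean and the vector with that mean subtracted from every element."""
    mean = sum(vector) / len(vector)
    return mean, [x - mean for x in vector]

def energy(vector):
    """Sum of squared elements, i.e. the squared Euclidean norm."""
    return sum(x * x for x in vector)

gradient = [2.0, 3.0, 4.0]
mean, centered = center_by_mean(gradient)      # the sender transmits `mean` as one extra scalar
restored = [c + mean for c in centered]        # the receiver adds the scalar back
```

Here the squared norm drops from 29 to 2 after centering, and the receiver recovers the original vector exactly from the centered vector plus the single scalar.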
The formulation for can be inspired by other areas. For example, the stochastic variance-reduced gradient (SVRG) algorithm (Johnson & Zhang, 2013) gives a better estimate of the gradient and converges linearly on strongly convex and smooth loss functions, where
where is a reference parameter which is generally chosen from a previous iteration. The full gradient , although it costs much more than a stochastic gradient, is not frequently updated. Once the full gradient is evaluated, it costs only one round of communication for many rounds of SGD steps. Based on the same intuition, the stochastic average gradient (SAG) (Schmidt et al., 2017) could be applied here. The difference is that the main server can average gradients from all servers, and the gradients might be the compressed ones from past iterations.
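As a hedged illustration of the SVRG estimator mentioned above, the sketch below uses the standard form from Johnson & Zhang (2013): the per-sample gradient at the current point, minus the same sample's gradient at a snapshot, plus the full gradient at the snapshot. The least-squares loss, the tiny dataset and the function names are assumptions for the example.

```python
def sample_gradient(w, x, y):
    """Gradient of the per-sample loss 0.5 * (w.x - y)^2."""
    residual = sum(wi * xi for wi, xi in zip(w, x)) - y
    return [residual * xi for xi in x]

def full_gradient(w, data):
    """Average of all per-sample gradients (the expensive quantity)."""
    grads = [sample_gradient(w, x, y) for x, y in data]
    return [sum(g[j] for g in grads) / len(grads) for j in range(len(w))]

def svrg_gradient(w, snapshot, snapshot_full_grad, x, y):
    """Variance-reduced estimate: grad_i(w) - grad_i(snapshot) + full_grad(snapshot)."""
    g_now = sample_gradient(w, x, y)
    g_old = sample_gradient(snapshot, x, y)
    return [a - b + c for a, b, c in zip(g_now, g_old, snapshot_full_grad)]

data = [([1.0, 0.0], 1.0), ([0.0, 1.0], -1.0), ([1.0, 1.0], 0.5)]
snapshot = [0.2, -0.1]
mu = full_gradient(snapshot, data)   # computed once, then reused for many cheap steps
w = [0.3, 0.0]

estimates = [svrg_gradient(w, snapshot, mu, x, y) for x, y in data]
average_estimate = [sum(e[j] for e in estimates) / len(estimates) for j in range(2)]
exact = full_gradient(w, data)
at_snapshot = [svrg_gradient(snapshot, snapshot, mu, x, y) for x, y in data]
```

Averaged over all samples, the estimate equals the exact full gradient at the current point, and at the snapshot itself every sample yields the same vector. This is why the snapshot's full gradient is a natural candidate for a reference vector that is refreshed only occasionally.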
In another area of distributed optimization, the delay-tolerant optimization algorithm (Agarwal & Duchi, 2011) performs the following updates
As long as the staleness of the parameter, or , is bounded, the gradient above can serve as the reference gradient, since it is a close approximation to the current gradient.
The fourth option is a two-stage compression strategy in which, in each stage, the algorithm generates a compensating vector with shared vector to complement the first stage and . To list all of them here:
The reference vector can be updated frequently or occasionally depending on how easily it can be accessed, e.g. by setting an update frequency of , as in the stale synchronous parallel (SSP) protocol (Ho et al., 2013).
3.2 Gradient Compression
There are many protocols available for compressing the normalized gradient , as in the literature introduced above. Here we take a strong compression coding strategy as an example, namely using the sign of each element (Wen et al., 2017; Bernstein et al., 2018). For communication, each server transmits a constant as the largest element of , and each compressed element derived from the element of . For simplicity, we will often omit the subscripts and . The magnitude information is encoded by the randomization process, so the compressed gradient remains unbiased in expectation.
Denote as the largest element of , and a binary vector , indicating whether each element of is compressed to its sign or simply to zero.
An example of compressed TNG is in Algorithm 1. In the following, we will characterize the optimality of the coding strategy above, that the probability vector should be proportional to magnitudes. For -smooth loss functions , setting proposed above is the optimal sampling probability for ternary coding of in for optimizing .
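A minimal sketch of randomized ternary coding in the style of Wen et al. (2017), with keep probabilities proportional to element magnitudes, is given below. It is not a transcription of Algorithm 1: the function names, the test vector and the use of the largest absolute value as the scale are assumptions for illustration.

```python
import random

def ternary_encode(vector, rng):
    """Return (scale, codes) with codes in {-1, 0, +1}; keep probability is |v_i| / scale."""
    scale = max(abs(x) for x in vector)
    if scale == 0.0:
        return 0.0, [0] * len(vector)
    codes = []
    for x in vector:
        keep = rng.random() < abs(x) / scale   # Bernoulli draw proportional to magnitude
        sign = 1 if x > 0 else -1
        codes.append(sign if keep else 0)
    return scale, codes

def ternary_decode(scale, codes):
    """Receiver side: rescale the ternary codes by the transmitted constant."""
    return [scale * c for c in codes]

rng = random.Random(0)
vector = [0.5, -1.0, 0.25, 0.0]
trials = 20000
totals = [0.0] * len(vector)
for _ in range(trials):
    scale, codes = ternary_encode(vector, rng)
    for j, value in enumerate(ternary_decode(scale, codes)):
        totals[j] += value
empirical_mean = [t / trials for t in totals]
```

Each element is sent as one of three symbols plus a single shared scale, and the empirical mean of the decoded vectors approaches the original vector, which is the unbiasedness property discussed above.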
3.3 Convergence Analysis
We do not focus on the specific constant of the convergence rate, since it depends on other factors and it is hard to provide a unified theorem that is both informative and tight. Here, we give a simple analysis of how the compression error affects the convergence rate.
We have the following assumption for the variance of stochastic gradient evaluated at the optimal point , . Then for loss functions that satisfy assumption 2.1, the variance of is bounded by
This lemma gives a better bound on the gradient variance than directly assigning an upper bound to the variance, as the bound decreases as the optimization progresses. For the compressed normalized gradient in Algorithm 1, we assume that there exists a constant such that, for stochastic gradients on all servers,
We can always ensure that the assumption above is satisfied. For example, we can set and and get , although this degenerates to a trivial case. In real applications, this assumption can be satisfied much better, since we have a large pool of available reference vectors that can be shared in many ways, e.g. using reference vectors from in hindsight. As long as there is a need for trading computation for communication, this constant can be searched for. The additional communication cost is to indicate which is used for this iteration. We assume that the coding strategy has bounded compression error for , that
We denote as a compression constant for TNG, and implies the necessary bits for communication. The variance of is bounded as,
Remark: We apply an inequality for two variables , , and decompose the variance using Assumption on compression error into
After applying the assumption about shrinkage of variance for normalization, we have the lemma.
here is a constant and behaves like the condition number, then the suboptimality is guaranteed as
This is an adaptation of a general analysis of strongly-convex optimization (Nguyen et al., 2018) to include compression error, and gives us a basic intuition about the factors of compression error affecting the convergence rate.
4.1 Nonconvex Problems
To visualize the efficiency of compressed normalized gradients on some hard non-convex functions, we plot the optimization trajectories in Figure 1. These functions include the Ackley function (, and global minimum at ), the Booth function ( and global minimum at ), and the Rosenbrock function ( and global minimum at ). The stochastic gradient is synthetically generated by adding Gaussian noise, each element of which follows , and the step size is fixed through all iterations. We search for the optimal step size, and set for the Ackley function, for the Booth function and for the Rosenbrock function. Normalized gradients are noted in the figure as TNG and the baseline as SGD. We choose the ternary coding (Wen et al., 2017) of stochastic gradients for both methods; the only difference is whether trajectory normalization is used. For each optimizer, we note the current parameter and objective function values below each figure. We ensure that the two approaches use equal communication for a fair comparison, by counting one round of reference vector communication in 16-bit representation as iterations of pure ternary coding. The reference vector is updated every iterations. As non-convex optimization is sensitive to initialization points, we choose three initialization points, and we note the optimizers with a number suffix to indicate different initializations. In general, the normalized gradient is compression-robust, as it converges faster. The improvement is larger on an oscillating surface like the Ackley function than on a flat surface like the Rosenbrock function, which aligns with our motivation that the compression error depends on the intrinsic distribution of gradients.
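For readers who want to reproduce a similar setup, the commonly used two-dimensional forms of these three test functions are sketched below. Whether the experiments use exactly these variants and constants is an assumption; the finite-difference gradient, the noise helper and its parameter names are also illustrative choices rather than the procedure used here.

```python
import math
import random

def ackley(x, y):
    """Common 2-D Ackley function; global minimum 0 at (0, 0)."""
    radial = -20.0 * math.exp(-0.2 * math.sqrt(0.5 * (x * x + y * y)))
    ripple = -math.exp(0.5 * (math.cos(2 * math.pi * x) + math.cos(2 * math.pi * y)))
    return radial + ripple + math.e + 20.0

def booth(x, y):
    """Booth function; global minimum 0 at (1, 3)."""
    return (x + 2 * y - 7) ** 2 + (2 * x + y - 5) ** 2

def rosenbrock(x, y):
    """Common 2-D Rosenbrock function; global minimum 0 at (1, 1)."""
    return (1 - x) ** 2 + 100 * (y - x * x) ** 2

def noisy_gradient(f, x, y, sigma, rng, h=1e-6):
    """Central-difference gradient plus independent Gaussian noise on each coordinate."""
    gx = (f(x + h, y) - f(x - h, y)) / (2 * h)
    gy = (f(x, y + h) - f(x, y - h)) / (2 * h)
    return gx + rng.gauss(0.0, sigma), gy + rng.gauss(0.0, sigma)

rng = random.Random(0)
clean = noisy_gradient(booth, 1.0, 3.0, 0.0, rng)   # sigma = 0 gives the plain gradient
noisy = noisy_gradient(booth, 1.0, 3.0, 0.1, rng)   # synthetic stochastic gradient
```

The Booth function is a smooth quadratic bowl, the Rosenbrock function has a long flat valley, and the Ackley function has many shallow oscillations, so the three surfaces produce gradients with quite different distributions.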
4.2 Convex Problems
We study the TNG combined with different kinds of gradients, coding strategies, and reference gradient formulations, with or without second-order gradients, to demonstrate the generality of the proposed methods. We use mini-batch stochastic gradient descent, along with its quasi-Newton adaptation (Byrd et al., 2016). The stochastic quasi-Newton method uses L-BFGS to update the Hessian approximation and uses stochastic gradients as the first-order gradient. To be specific, we replace the vanilla stochastic gradient with the second-order gradient , where is an approximate inverse Hessian matrix built from the past trajectory of both parameters and gradients of within the memory (of size )
Denoting , we initialize it with , where is a diagonal matrix. Then L-BFGS updates the inverse Hessian as
for , and finally generates .
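In practice the L-BFGS inverse Hessian is rarely formed explicitly; the standard two-loop recursion applies it to a vector using only the stored parameter and gradient differences. The sketch below is a generic textbook version, not the stochastic variant of Byrd et al. (2016). The scaled-identity initial matrix and the names `s_history` and `y_history` are assumptions.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def lbfgs_direction(gradient, s_history, y_history):
    """Apply the implicit inverse-Hessian approximation to `gradient` (two-loop recursion).

    s_history[i] is a parameter difference and y_history[i] the matching gradient
    difference, ordered oldest first; each pair must satisfy dot(y, s) > 0.
    """
    q = list(gradient)
    coefficients = []
    # First loop: newest pair to oldest pair.
    for s, y in zip(reversed(s_history), reversed(y_history)):
        rho = 1.0 / dot(y, s)
        alpha = rho * dot(s, q)
        q = [qi - alpha * yi for qi, yi in zip(q, y)]
        coefficients.append((rho, alpha))
    # Assumed initial matrix: a scaled identity built from the newest pair.
    gamma = dot(s_history[-1], y_history[-1]) / dot(y_history[-1], y_history[-1])
    r = [gamma * qi for qi in q]
    # Second loop: oldest pair to newest pair.
    for (s, y), (rho, alpha) in zip(zip(s_history, y_history), reversed(coefficients)):
        beta = rho * dot(y, r)
        r = [ri + (alpha - beta) * si for ri, si in zip(r, s)]
    return r

# Pairs generated from the quadratic 0.5 * w^T diag(2, 10) w, so that y = diag(2, 10) s.
s_history = [[1.0, 0.0], [0.5, 1.0]]
y_history = [[2.0, 0.0], [1.0, 10.0]]
recovered = lbfgs_direction(y_history[-1], s_history, y_history)
```

A quick sanity check is the secant condition: applying the approximation to the newest gradient difference should return the newest parameter difference, which is what `recovered` holds up to rounding. The memory size corresponds to the number of stored pairs.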
We will mainly use -regularized logistic regression as a representative convex problem to evaluate the efficiency. We use to represent a feature sample and to represent its label. We use the same procedure as (Wangni et al., 2018) to generate a large pool of synthetic data with different degrees of skewness in the gradient distribution, with two hyperparameters and that control the skewness: we sample normalized data vectors with each element drawn from a standard Gaussian distribution, and meanwhile sample magnitude vectors from a uniform distribution, where the smaller magnitudes are shrunk so that the distribution is skewed. The features are the elementwise products of the normalized data vectors and the magnitudes. The data is -dimensional and each setting generates a dataset of size . A smaller implies a stronger skewness or sparsity in the gradient distribution.
First, we simulated servers, where the main server does the averaging and broadcasting jobs. We use two kinds of algorithms to calculate : SGD and SVRG; the batch size is always set to . We plot their convergence behavior in Figure 2, in terms of communication, i.e. the product of the number of data passes and the compression rate of the gradient information. We use for all settings, and in the row and column, we set and the -regularization to respectively, to test the sensitivity of TNG under different levels of convexity and gradient skewness. We compare our approach with gradient quantization (Alistarh et al., 2017) (noted as QG in the figures), randomized ternary coding (Wen et al., 2017) (noted as TG in the figures) and gradient sparsification (Wangni et al., 2018) (noted as SG in the figures), three approaches that favor different distributions for compression. We tuned the step size for the fastest convergence, and found that under the general principle perform stably for all methods, while a larger step size caused divergence in some settings. We noticed that TG methods have a larger variance than the other two, so we plot their variance with a shrinking factor of for readability. In each subfigure, we note the parameter and regularization (value for showing). We also plot in Figure 3 the convergence of the stochastic second-order gradient method, with exactly the same settings of convexity, sparsity, etc. as in Figure 2. We also test the sensitivity to settings such as the number of servers and the memory size of quasi-Newton methods in Figure 4. In the row and column, we set the number of servers to , and the memory size to , and the settings are noted below the subfigures.
Our normalization technique is combined with each of the three kinds of coding (noted with the prefix TN in the figures). We initialize the reference vector with a full gradient, and in the following iterations, the reference is updated to be the averaged compressed TNG from the last iteration . This can be done with a round of broadcasting of the reference vector in a synchronous setting, or the other servers can infer it from the past parameters without additional communication. The trade-off between the fitness of and its cost needs to be balanced for different problems. When calculating bits for each approach, we also choose the optimal method for coding the vectors, whether in dense vector form or in sparse vector form, the latter of which suits the case where the distribution of is uneven.
By observing the figures, we see that the normalization clearly improves upon the baselines in basically all settings, and the size of the improvement depends on the conditions. Since different coding strategies have advantages in different problems, we do not compare them with each other. The SG methods mainly use their bits for transmitting important elements in full precision, and should improve if low-precision, i.e. quantized, numbers are used. We found that TNG improves upon the baseline more under stronger convexity and weaker gradient skewness. Comparing different levels of sparsity in the gradient distribution, the different coding methods tend to have slightly different performance: for example, QG is relatively insensitive to the skewness of gradients compared to SG, and SG performs better with stronger convexity. Besides, reading Figure 4 vertically, a larger number of servers provides a better reference vector; reading it horizontally, increasing the memory size initially improves convergence but gradually becomes ineffective.
5 Related Works
Researchers have proposed protocols that reduce communication from other perspectives. A prevailing method is to average parameters occasionally, but not too frequently (Tsianos et al., 2012; Wang & Joshi, 2018), or to perform just one round of averaging over the final parameters (Zhang et al., 2012). If the problem requires the servers to be frequently synchronized, we can use an asynchronous protocol like parameter servers (Ho et al., 2013; Li et al., 2014a), where each server requests the latest parameter from the main server or contributes its gradients, passively or aggressively, based on the network condition. Decentralized optimization algorithms (Yuan et al., 2016; Lan et al., 2017; Lian et al., 2017) treat all servers equally, to avoid the communication congestion and imbalance that arise when the main server handles most of the requests. Efficiently using a large batch size (Cotter et al., 2011; Li et al., 2014b; Wang & Zhang, 2017; Goyal et al., 2017) or second-order gradients (Shamir et al., 2014; Zhang & Lin, 2015) reduces the overall number of iterations, and therefore reduces communication. The model synchronization can also be formulated as a global consensus problem (Zhang & Kwok, 2014) with a penalty on delay. Besides, the normalization idea has also been used in other areas, such as normalized gradient descent for general convex or quasi-convex optimization (Nesterov, 1984; Hazan et al., 2015); in a different context, normalization helps to stabilize the feature or gradient distribution in neural networks (Ioffe & Szegedy, 2015; Klambauer et al., 2017; Neyshabur et al., 2015; Salimans & Kingma, 2016).
In this paper we propose a simple and general protocol, the trajectory normalized gradient, to reduce the compression error of gradient communication during distributed optimization. We provide insight into how to normalize gradients more accurately, and validate our idea in various experiments with different parameters and coding strategies.
- Agarwal & Duchi (2011) Agarwal, A. and Duchi, J. C. Distributed delayed stochastic optimization. In Advances in Neural Information Processing Systems, pp. 873–881, 2011.
- Aji & Heafield (2017) Aji, A. F. and Heafield, K. Sparse communication for distributed gradient descent. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 440–445, 2017.
- Alistarh et al. (2017) Alistarh, D., Grubic, D., Li, J., Tomioka, R., and Vojnovic, M. QSGD: Communication-efficient SGD via gradient quantization and encoding. In Advances in Neural Information Processing Systems, pp. 1707–1718, 2017.
- Alistarh et al. (2018) Alistarh, D., Hoefler, T., Johansson, M., Konstantinov, N., Khirirat, S., and Renggli, C. The convergence of sparsified gradient methods. In Advances in Neural Information Processing Systems, pp. 5977–5987, 2018.
- Bernstein et al. (2018) Bernstein, J., Wang, Y.-X., Azizzadenesheli, K., and Anandkumar, A. SignSGD: compressed optimisation for non-convex problems. arXiv preprint arXiv:1802.04434, 2018.
- Bottou (2010) Bottou, L. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT’2010, pp. 177–186. Springer, 2010.
- Bottou et al. (2018) Bottou, L., Curtis, F. E., and Nocedal, J. Optimization methods for large-scale machine learning. SIAM Review, 60(2):223–311, 2018.
- Byrd et al. (2016) Byrd, R. H., Hansen, S. L., Nocedal, J., and Singer, Y. A stochastic quasi-Newton method for large-scale optimization. SIAM Journal on Optimization, 26(2):1008–1031, 2016.
- Cormen et al. (2009) Cormen, T. H., Leiserson, C. E., Rivest, R. L., and Stein, C. Introduction to algorithms. MIT press, 2009.
- Cotter et al. (2011) Cotter, A., Shamir, O., Srebro, N., and Sridharan, K. Better mini-batch algorithms via accelerated gradient methods. In Advances in Neural Information Processing Systems, pp. 1647–1655, 2011.
- Goyal et al. (2017) Goyal, P., Dollár, P., Girshick, R., Noordhuis, P., Wesolowski, L., Kyrola, A., Tulloch, A., Jia, Y., and He, K. Accurate, large minibatch SGD: training imagenet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.
- Hazan et al. (2015) Hazan, E., Levy, K., and Shalev-Shwartz, S. Beyond convexity: Stochastic quasi-convex optimization. In Advances in Neural Information Processing Systems, pp. 1594–1602, 2015.
- Ho et al. (2013) Ho, Q., Cipar, J., Cui, H., Lee, S., Kim, J. K., Gibbons, P. B., Gibson, G. A., Ganger, G., and Xing, E. P. More effective distributed ml via a stale synchronous parallel parameter server. In Advances in Neural Information Processing Systems, pp. 1223–1231, 2013.
- Ioffe & Szegedy (2015) Ioffe, S. and Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pp. 448–456, 2015.
- Jin et al. (2017) Jin, C., Ge, R., Netrapalli, P., Kakade, S. M., and Jordan, M. I. How to escape saddle points efficiently. In International Conference on Machine Learning, pp. 1724–1732, 2017.
- Johnson & Zhang (2013) Johnson, R. and Zhang, T. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pp. 315–323, 2013.
- Klambauer et al. (2017) Klambauer, G., Unterthiner, T., Mayr, A., and Hochreiter, S. Self-normalizing neural networks. In Advances in Neural Information Processing Systems, pp. 971–980, 2017.
- Kleinberg et al. (2018) Kleinberg, R., Li, Y., and Yuan, Y. An alternative view: When does SGD escape local minima? arXiv preprint arXiv:1802.06175, 2018.
- Lan et al. (2017) Lan, G., Lee, S., and Zhou, Y. Communication-efficient algorithms for decentralized and stochastic optimization. arXiv preprint arXiv:1701.03961, 2017.
- Li et al. (2014a) Li, M., Andersen, D. G., Smola, A. J., and Yu, K. Communication efficient distributed machine learning with the parameter server. In Advances in Neural Information Processing Systems, pp. 19–27, 2014a.
- Li et al. (2014b) Li, M., Zhang, T., Chen, Y., and Smola, A. J. Efficient mini-batch training for stochastic optimization. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 661–670. ACM, 2014b.
- Lian et al. (2017) Lian, X., Zhang, C., Zhang, H., Hsieh, C.-J., Zhang, W., and Liu, J. Can decentralized algorithms outperform centralized algorithms? a case study for decentralized parallel stochastic gradient descent. In Advances in Neural Information Processing Systems, pp. 5330–5340, 2017.
- Nesterov (1984) Nesterov, Y. Minimization methods for nonsmooth convex and quasiconvex functions. Matekon, pp. 29:519–531, 1984.
- Nesterov (2013) Nesterov, Y. Introductory lectures on convex optimization: A basic course, volume 87. Springer Science & Business Media, 2013.
- Neyshabur et al. (2015) Neyshabur, B., Salakhutdinov, R. R., and Srebro, N. Path-SGD: Path-normalized optimization in deep neural networks. In Advances in Neural Information Processing Systems, pp. 2422–2430, 2015.
- Nguyen et al. (2018) Nguyen, L. M., Nguyen, P. H., van Dijk, M., Richtárik, P., Scheinberg, K., and Takáč, M. SGD and Hogwild! convergence without the bounded gradients assumption. arXiv preprint arXiv:1802.03801, 2018.
- Salimans & Kingma (2016) Salimans, T. and Kingma, D. P. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. In Advances in Neural Information Processing Systems, pp. 901–909, 2016.
- Schmidt et al. (2017) Schmidt, M., Le Roux, N., and Bach, F. Minimizing finite sums with the stochastic average gradient. Mathematical Programming: Series A and B, 162(1-2):83–112, 2017.
- Shamir et al. (2014) Shamir, O., Srebro, N., and Zhang, T. Communication-efficient distributed optimization using an approximate newton-type method. In International Conference on Machine Learning, pp. 1000–1008, 2014.
- Stich et al. (2018) Stich, S. U., Cordonnier, J.-B., and Jaggi, M. Sparsified SGD with memory. In Advances in Neural Information Processing Systems, pp. 4452–4463, 2018.
- Szeliski (2010) Szeliski, R. Computer vision: algorithms and applications. Springer Science & Business Media, 2010.
- Tsianos et al. (2012) Tsianos, K., Lawlor, S., and Rabbat, M. G. Communication/computation tradeoffs in consensus-based distributed optimization. In Advances in Neural Information Processing Systems, pp. 1943–1951, 2012.
- Wang et al. (2018) Wang, H., Sievert, S., Liu, S., Charles, Z., Papailiopoulos, D., and Wright, S. Atomo: Communication-efficient learning via atomic sparsification. In Advances in Neural Information Processing Systems, pp. 9872–9883, 2018.
- Wang & Joshi (2018) Wang, J. and Joshi, G. Cooperative SGD: A unified framework for the design and analysis of communication-efficient SGD algorithms. arXiv preprint arXiv:1808.07576, 2018.
- Wang & Zhang (2017) Wang, J. and Zhang, T. Improved optimization of finite sums with minibatch stochastic variance reduced proximal iterations. arXiv preprint arXiv:1706.07001, 2017.
- Wangni et al. (2018) Wangni, J., Wang, J., Liu, J., and Zhang, T. Gradient sparsification for communication-efficient distributed optimization. In Advances in Neural Information Processing Systems, pp. 1306–1316, 2018.
- Wen et al. (2017) Wen, W., Xu, C., Yan, F., Wu, C., Wang, Y., Chen, Y., and Li, H. Terngrad: Ternary gradients to reduce communication in distributed deep learning. arXiv preprint arXiv:1705.07878, 2017.
- Wright & Nocedal (1999) Wright, S. and Nocedal, J. Numerical optimization. Springer Science, 35(67-68):7, 1999.
- Wu et al. (2018) Wu, J., Huang, W., Huang, J., and Zhang, T. Error compensated quantized SGD and its applications to large-scale distributed optimization. arXiv preprint arXiv:1806.08054, 2018.
- Yuan et al. (2016) Yuan, K., Ling, Q., and Yin, W. On the convergence of decentralized gradient descent. SIAM Journal on Optimization, 26(3):1835–1854, 2016.
- Zhang & Kwok (2014) Zhang, R. and Kwok, J. Asynchronous distributed admm for consensus optimization. In International Conference on Machine Learning, pp. 1701–1709, 2014.
- Zhang (2004) Zhang, T. Solving large scale linear prediction problems using stochastic gradient descent algorithms. In Proceedings of the twenty-first International Conference on Machine Learning, pp. 116. ACM, 2004.
- Zhang & Lin (2015) Zhang, Y. and Lin, X. Disco: Distributed optimization for self-concordant empirical loss. In International Conference on Machine Learning, pp. 362–370, 2015.
- Zhang et al. (2012) Zhang, Y., Wainwright, M. J., and Duchi, J. C. Communication-efficient algorithms for statistical optimization. In Advances in Neural Information Processing Systems, pp. 1502–1510, 2012.
- Zhou et al. (2016) Zhou, S., Wu, Y., Ni, Z., Zhou, X., Wen, H., and Zou, Y. Dorefa-net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv preprint arXiv:1606.06160, 2016.