I Introduction
Distributed optimization methods are essential when we optimize a data mining problem whose data or model is distributed across multiple machines. When the data are distributed, parameter server [6, 14] or decentralized methods [15, 16] were proposed for parallel computation and linear speedup. When the model is distributed, especially a deep learning model, pipeline-based methods or decoupled backpropagation algorithms [9, 29, 8, 30] parallelize the model update across machines and make full use of computing resources. In this paper, we only consider the case where the data are distributed. As shown in Figure 1, most distributed methods iteratively collect update information from all workers to find the optimal solution [5, 12, 13, 10, 11]. Because communication over the network is slow, it is challenging to obtain accurate solutions within a limited time budget. The total running time of a distributed method is determined by multiple factors. To get accurate solutions, the total running time of a distributed algorithm can be represented as follows:
$$T_{\text{total}} = \sum_{r=1}^{R} \left( t_c + t_p^{(r)} \right) \qquad (1)$$
where $R$ denotes the number of communication rounds the algorithm requires to get accurate solutions and $t_c$ represents the communication time per round, which depends on the dimensionality of the data. $t_p^{(r)}$ indicates the computational time required by the slowest worker at round $r$, such that all workers have completed their jobs at that time. To reduce the total running time $T_{\text{total}}$, we can either decrease the number of communication rounds $R$ or cut down the running time at each round.
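The running-time model above adds, for every communication round, the per-round communication time and the computation time of the slowest worker. The following Python sketch evaluates that sum on made-up timings. The function name `total_running_time` and all numbers are hypothetical and serve only to illustrate how a single straggler dominates the total.

```python
from typing import Sequence


def total_running_time(comm_time: float,
                       compute_times: Sequence[Sequence[float]]) -> float:
    """Sum over rounds of (communication time + slowest worker's compute time).

    compute_times[r][k] is the (hypothetical) compute time of worker k in round r.
    """
    total = 0.0
    for per_worker in compute_times:
        # A synchronous round only finishes when the slowest worker is done.
        total += comm_time + max(per_worker)
    return total


# Three rounds, four workers; the last worker is a straggler.
with_straggler = [[1.0, 1.1, 0.9, 5.0]] * 3
without_straggler = [[1.0, 1.1, 0.9, 1.0]] * 3

print(total_running_time(0.5, with_straggler))
print(total_running_time(0.5, without_straggler))
```

Under these assumed timings the straggler more than triples the total time, even though three of the four workers finish each round in about one second.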
Various methods try to reduce the total running time by reducing the number of communication rounds. Previous communication-efficient methods let workers update locally for a number of iterations and communicate with the server periodically. For example, DSVRG [13], DISCO [34], AIDE [20] and DANE [24] come with proven bounds on the number of communication rounds needed to reach accurate solutions. There are also various distributed methods for dual problems, for example, CoCoA [12, 25], CoCoA+ [18, 26] and DisDCA [31]. These methods also admit linear convergence guarantees with respect to communication rounds. In [18], the authors proved that CoCoA+ is a generalization of CoCoA and showed that CoCoA+ is equivalent to DisDCA under certain conditions. A brief comparison of distributed primal-dual methods is given in Table I. These communication-efficient methods work well when the communication time per round is relatively small and all workers run at similar speeds. However, they suffer from a communication bottleneck when the data are high-dimensional, or from the straggler problem when some machines work far slower than the other workers.
In [1, 3, 27, 17], the authors proposed reducing the communication time and increasing the bandwidth efficiency by compressing or dropping gradients in distributed optimization. There are also several attempts to quantize the gradients so that less data is transmitted over the network [2, 28]. However, these methods are not communication-efficient: they ask workers to send gradient information to the server at every iteration and therefore suffer from a large number of communication rounds. To the best of our knowledge, no existing work reduces the size of the transmitted messages for distributed primal-dual methods.
In this paper, we focus on reducing the running time at each round for distributed data mining. To address the straggler problem and high-dimensional data, we propose a novel straggler-agnostic and bandwidth-efficient distributed primal-dual algorithm. The main contributions of our work are summarized as follows:

We propose a novel primal-dual algorithm in Section III that addresses the straggler problem and the high communication complexity per iteration.

We provide convergence analysis in Section IV and prove that the proposed method guarantees linear convergence to the optimal solution for the convex problem.

We perform experiments with large-scale datasets distributed across multiple machines in Section V. Experimental results verify that the proposed method can be up to 4 times faster than the compared methods.
II Related Work
II-A Stochastic Dual Algorithm
In this paper, we consider optimizing the following regularized empirical loss minimization problem, which arises ubiquitously in supervised machine learning:
(2) 
where denotes data sample and denotes the linear predictor to be optimized. Many applications fall into this formulation, for example classification and regression problems. To solve the primal problem (2), we can optimize its dual problem instead:
(3) 
where is the convex conjugate function to , denotes data matrix and represents dual variables. Stochastic dual coordinate ascent (SDCA) [7, 22] is one of the most successful methods proposed to solve problem (2). In [22], the authors proved that SDCA guarantees linear convergence if the convex function is smooth, which is much faster than stochastic gradient descent (SGD) [4]. At iteration , given sample and variables fixed, we maximize the following subproblem:
(4)
denotes a coordinate vector of size , where element is and other elements are . Another advantage of optimizing the dual problem is that we can monitor the optimization progress by keeping track of the duality gap . The duality gap is defined as: , where and denote objective values of the primal problem and the dual problem respectively. Assuming is the optimal solution to the primal problem (2), is the optimal solution to the dual problem (3), the primal-dual relation is always satisfied such that:
(5)
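To make the role of the duality gap concrete, the following Python sketch runs SDCA with uniform sampling on a tiny ridge regression problem and reports the gap. It assumes the standard formulation of Shalev-Shwartz and Zhang [22] with squared loss $(w^\top x_i - y_i)^2$, regularizer $\frac{\lambda}{2}\|w\|^2$ and the mapping $w = \frac{1}{\lambda n}\sum_i \alpha_i x_i$; the scaling may differ from the equations of this paper, and the data, function names and parameter values are hypothetical.

```python
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def primal(w, X, y, lam):
    """P(w) = (1/n) * sum_i (w.x_i - y_i)^2 + (lam/2) * ||w||^2."""
    n = len(X)
    return sum((dot(x, w) - t) ** 2 for x, t in zip(X, y)) / n + 0.5 * lam * dot(w, w)


def dual(alpha, w, y, lam):
    """D(alpha) = (1/n) * sum_i (alpha_i*y_i - alpha_i^2/4) - (lam/2) * ||w(alpha)||^2."""
    n = len(y)
    return sum(a * t - a * a / 4.0 for a, t in zip(alpha, y)) / n - 0.5 * lam * dot(w, w)


def sdca_ridge(X, y, lam, epochs, seed=0):
    """SDCA with uniform sampling and the closed-form squared-loss update."""
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    alpha = [0.0] * n
    w = [0.0] * d  # kept equal to (1 / (lam * n)) * sum_i alpha_i * x_i
    for _ in range(epochs * n):
        i = rng.randrange(n)
        x = X[i]
        # Exact maximizer of the dual objective along coordinate i.
        delta = (y[i] - dot(x, w) - 0.5 * alpha[i]) / (0.5 + dot(x, x) / (lam * n))
        alpha[i] += delta
        for j in range(d):
            w[j] += delta * x[j] / (lam * n)
    return alpha, w


# Hypothetical toy data with unit-norm samples.
X = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8], [0.8, -0.6]]
y = [1.0, -1.0, 0.5, 2.0]
lam = 0.1

alpha, w = sdca_ridge(X, y, lam, epochs=200)
gap = primal(w, X, y, lam) - dual(alpha, w, y, lam)
print("duality gap:", gap)
```

Because the gap upper-bounds the dual suboptimality, it can serve as a stopping criterion without knowing the optimal objective value.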
II-B Distributed Communication-Efficient Primal-Dual Algorithm
Distributed optimization methods are essential when we train a data mining model on a dataset partitioned over multiple machines. We suppose that the dataset of samples is evenly partitioned across workers. represents the subset of data in the worker , where and . Sample is only stored in the worker , such that it cannot be sampled by any other workers. represents the dimensionality of the dataset. In [12, 18], the authors proposed the communication-efficient distributed dual coordinate ascent algorithm (CoCoA) for distributed optimization of the dual problem. Communication-efficient means that CoCoA allows for more computation on the worker side before communication between workers. Suppose the dataset is partitioned over machines, and all machines are doing computation simultaneously. In each iteration, workers optimize their local subproblems independently as follows:
(6)  
where is the data partition on worker and represents the difficulty of the given data partition. It was proved in Lemma 3 of [18] that the sum of the local subproblems (6) over workers closely approximates the global dual problem (3). The global variable is updated after all workers have obtained an approximate solution to their local subproblems. The authors of [18] claimed that CoCoA shows significant speedups over previous state-of-the-art methods on large-scale distributed datasets. However, the synchronous communication protocol makes CoCoA vulnerable to slow or dead workers. Suppose the normal workers spend seconds completing their computation task while a slow worker needs seconds. In each iteration, all normal workers have to wait seconds for the slow worker, which is a tremendous waste of computation resources.
III Straggler-Agnostic and Bandwidth-Efficient Distributed Primal-Dual Algorithm
In this section, we propose a novel Straggler-Agnostic and Bandwidth-Efficient Distributed Primal-Dual Algorithm (ACPD) for high-dimensional data.
III-A Straggler-Agnostic Server
As shown in Figure 2, previous distributed primal-dual methods need to collect information from all workers before updating, and therefore suffer from the straggler problem whenever some workers are slow. The running time per iteration is entirely determined by the slowest worker. We overcome the straggler problem by allowing the server to update the model as soon as messages from a group of workers have been received. For example, in Figure 2, the server only needs to receive messages from two workers. The server keeps a model update variable for each worker, which stores the update of the server model between two communication iterations of worker . After updating the variables on the server, it sends the model update variable to the corresponding workers for further computation.
Additionally, we need to control the gap between workers, as update information from slow workers may lead to divergence [32]. Because of our group-wise communication protocol, the local models on workers usually have different timestamps. Local models that are too stale could severely degrade the performance of the method. To solve this problem, we make the server collect information from all workers every iterations, such that all workers are guaranteed to be received at least once within iterations. Thus, the maximum time delay between local models is bounded by . A brief description of the procedures in the server is given in Algorithm 1.
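As a minimal sketch of the group-wise protocol described above, the following Python snippet models which workers the server applies at each iteration: normally only the first few arrivals, and all workers at a fixed interval. The names `groupwise_schedule`, `group_size` and `period` are hypothetical and do not come from Algorithm 1, and the arrival orders are supplied by the caller rather than measured.

```python
def groupwise_schedule(arrival_orders, group_size, period):
    """Return, per iteration, the set of workers whose updates the server applies.

    arrival_orders[t] lists worker ids in the order their messages arrive at
    iteration t. Every `period`-th iteration the server waits for all workers;
    otherwise it proceeds after the first `group_size` arrivals.
    """
    num_workers = len(arrival_orders[0])
    schedule = []
    for t, order in enumerate(arrival_orders):
        if (t + 1) % period == 0:
            received = set(range(num_workers))   # forced full synchronization
        else:
            received = set(order[:group_size])   # fastest workers only
        schedule.append(received)
    return schedule


def max_gap(schedule, worker):
    """Largest number of iterations between two consecutive receipts of `worker`."""
    last, worst = -1, 0
    for t, received in enumerate(schedule):
        if worker in received:
            worst = max(worst, t - last)
            last = t
    return worst


# Four workers; worker 3 is always the last to arrive (a straggler).
orders = [[0, 1, 2, 3]] * 10
schedule = groupwise_schedule(orders, group_size=2, period=5)
print([max_gap(schedule, k) for k in range(4)])
```

In this toy run the straggler is only received at the forced synchronizations, so its gap equals `period`, while the two fastest workers are received at every iteration.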
III-B Bandwidth-Efficient Worker
Workers are responsible for most of the complicated computations. There are data on worker , and it is denoted using . Because of our group-wise communication, we assume the probability of worker to be received by the server is . In each iteration, worker solves the local subproblem and obtains an approximate solution , and then it sends the filtered variable to the server. Finally, workers receive the global model from the server for further computation.
III-B1 Subproblem in Worker
First, worker finds an approximate solution to the local subproblem as follows:
(7) 
where and is defined as:
(8) 
At each iteration, we sample randomly from and compute supposing other variables are fixed. We repeat this procedure for iterations. In Algorithm 2, represents the number of iterations before communication, controlling the trade-off between computation and communication. There are many fast solvers for the dual problem (7), such as Stochastic Dual Coordinate Ascent (SDCA) [22] and Accelerated Proximal Stochastic Dual Coordinate Ascent (Accelerated Prox-SDCA) [23]. Sampling techniques can also be used to improve the convergence of the local solver, such as importance sampling [33] and adaptive sampling [19]. In this paper, we only consider SDCA with uniform sampling as the local solver.
III-B2 Sparse Communication on Worker
After obtaining an approximate solution to the subproblem (7) after iterations, we compute the update for the primal variables using:
(9) 
As shown in Figure 2, previous methods send directly to the server. However, when the dimension of the data is large, sending and receiving full-dimensional variables between server and workers is time-consuming. In contrast, ACPD requires workers to pass through a filter first and send the filtered variable to the server. We implement the filter by simply selecting the elements whose absolute values are the top largest. as long as and otherwise . In this way, the major update information is kept in the filtered variables, and less communication bandwidth is required. We can easily compress a sparse vector by storing the locations and values of its elements. For the purpose of theoretical analysis, at lines , we put the filtered-out update information back into the local dual variables . In the end, the worker receives the model update variable from the server. The communication time is also on average. A brief summary of the procedures on workers is given in Algorithm 2.
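The filter described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: `top_k_filter` and `k` are hypothetical names, the sparse message is stored as an index-to-value mapping, and the filtered-out part is simply returned to the caller as a residual rather than being folded back into the dual variables as Algorithm 2 does.

```python
def top_k_filter(update, k):
    """Split `update` into a sparse message and the residual left on the worker.

    The message keeps the k entries with the largest absolute values, stored as
    {index: value}; the residual holds everything that was filtered out.
    """
    # Indices sorted by decreasing magnitude; ties keep the lower index first.
    by_magnitude = sorted(range(len(update)), key=lambda j: abs(update[j]), reverse=True)
    keep = by_magnitude[:k]
    message = {j: update[j] for j in sorted(keep)}
    residual = [0.0 if j in message else v for j, v in enumerate(update)]
    return message, residual


# Hypothetical primal update on one worker.
delta_w = [0.1, -2.0, 0.0, 0.7, -0.3]
message, residual = top_k_filter(delta_w, k=2)
print(message)   # sparse message sent to the server
print(residual)  # part of the update that stays on the worker
```

Only `k` index-value pairs cross the network instead of the full vector, which is where the bandwidth saving comes from when the dimensionality is large.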
Lines 10-12 in Algorithm 2 keep the primal-dual relation (5) satisfied at each iteration, which is important for the theoretical analysis. However, computing a matrix inversion at each iteration is not practical. In practice, we simply replace lines 10-12 with: , where denotes element-wise multiplication. In the experiments, we show that this simplification does not affect the convergence empirically.
IV Convergence Analysis
In this section, we analyze the convergence properties of the proposed method and prove that it guarantees linear convergence to the optimal solution under certain conditions. Because of the group-wise communication, at iteration , local variable in the worker equals from the server, where denotes stale global variable with time stamp . is local dual variable, where if . Because data can only be sampled in the worker , it is always true that . Therefore, the subproblem in the worker at iteration can be written as . First, we suppose that the following assumptions are satisfied throughout the paper:
Assumption 1.
We assume all convex are nonnegative and it holds that . All samples are normalized such that:
Assumption 2.
Function is smooth , such that :
Assumption 3.
Time delay on worker is upper bounded that: In our algorithm, we have .
For any local solver in the workers, we assume that the result of the subproblem after iterations is an approximation of the optimal solutions:
Assumption 4.
We define to be the optimal solution to the local subproblem on worker . For each subproblem, there exists a constant such that:
(10) 
All of the above assumptions are commonly used in [21, 22]. Based on these assumptions, we analyze the convergence rate of the proposed method. First, we analyze the optimization path of the proposed method at each iteration.
Lemma 1.
According to Lemma 1, we analyze the convergence rate of the proposed method as follows.
Theorem 1.
Proof.
For , we get upper bound of such that:
(14)  
where . According to Lemma 1, we have:
where the last inequality follows from that . Then we can get the upper bound of as follows:
(16)  
Summing up the above inequality from to , we have:
Therefore, as long as,
(18) 
the following inequality holds that:
(19) 
As per our algorithm, we have . Applying (19) recursively from to , it holds that: Letting , where . It is easy to note that when , and . Therefore, there exists that (18) always holds. Setting , we have:
(20)  
where the second inequality follows from and from Assumption 1. Therefore, if the suboptimality of dual problem has an upper bound , must be bounded that:
(21) 
∎
Remark 1.
From Theorem 1, when , we have and . Therefore, it is guaranteed that we can always find an appropriate such that and exists.
Theorem 2.
Following notations in Theorem 1, we can prove that to get duality gap: the outer iteration must have a lower bound that:
(22) 
Proof.
To bound the duality gap, as per (IV) we have:
Hence, in order to get , we must make sure that:
(24) 
∎
In summary, we prove that the proposed method guarantees a linear convergence rate for the convex problem as long as the assumptions in this section are satisfied.
V Experiments
In this section, we perform data mining experiments with large-scale datasets in distributed environments. First, we describe the implementation details of our experiments in Section V-A. Then, in Section V-B, we evaluate our method in a simulated distributed environment with the straggler problem. Finally, we conduct experiments in a real distributed environment in Section V-C. Experimental results show that the proposed method can be up to 4 times faster than the compared methods in a real distributed system.
V-A Implementations
In the experiment, we apply the proposed algorithm to solve a ridge regression problem, where is the least-squares loss. The dual problem of the ridge regression problem can be represented as follows:
(25)
where and are labels and feature vectors respectively. We use three binary classification datasets from the LIBSVM dataset collection (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html): RCV1, URL and KDD. Table II shows brief details about these datasets. We compare with the CoCoA+ method in the experiment, as it has been shown to be superior to other related methods [18].

Table II
Dataset | Samples (n) | Features (d) | Size
RCV1 | | | G
URL | | | G
KDD | | | G

We implement the compared methods CoCoA+ and ACPD in C++, where the communication between workers and server is handled by OpenMPI (https://www.open-mpi.org/). We use 'Send' and 'Recv' for point-to-point communication and 'allreduce' for collective communication. All experiments are conducted on Amazon Web Services, where each node is a t2.medium instance with two virtual CPUs.
V-B Experiments in Simulated Distributed Environment
In this section, we perform experiments on RCV1 in a distributed system with a straggler worker. We simulate the straggler problem by forcing worker to sleep at each iteration such that the computational time of worker is times as long as the computational time of other normal workers.
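One simple way to simulate such a straggler is to time the worker's real computation and then sleep for the missing multiple. The Python sketch below illustrates this idea; the helper names `extra_sleep` and `run_with_slowdown` and the parameter `slowdown` are hypothetical, and this is not claimed to be the mechanism used in the C++ implementation of the experiments.

```python
import time


def extra_sleep(elapsed: float, slowdown: float) -> float:
    """Seconds to sleep so that the total time is `slowdown` times `elapsed`."""
    return max(0.0, (slowdown - 1.0) * elapsed)


def run_with_slowdown(task, slowdown: float):
    """Run `task()` and pad its duration to roughly `slowdown` times the original."""
    start = time.perf_counter()
    result = task()
    elapsed = time.perf_counter() - start
    time.sleep(extra_sleep(elapsed, slowdown))
    return result, time.perf_counter() - start


# A tiny stand-in for one round of local computation on the straggler.
result, total = run_with_slowdown(lambda: sum(range(1000)), slowdown=5.0)
print(result, total)
```

Padding relative to the measured time, rather than sleeping for a fixed duration, keeps the straggler's slowdown factor roughly constant even when the amount of local work changes between rounds.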
V-B1 Convergence of Compared Methods
In Figure 3, we present the convergence of the compared methods with respect to communication rounds and elapsed time. The RCV1 dataset is distributed across four workers. We follow the parameter settings in [18] for CoCoA+. For ACPD, we set , and in the experiment. To analyze the effect of straggler-agnosticism and bandwidth-efficiency, we also do ablation studies by setting or . Figures in the first two columns show the results when we set . In this situation, the waiting time for the straggler machine is comparable to the communication time between machines. There are two observations: (1) our method admits a convergence rate nearly similar to that of CoCoA+ when is small; (2) ACPD converges faster than the other compared methods in terms of time. Both straggler-agnosticism and bandwidth-efficiency are thus beneficial for distributed optimization. We then set and plot the results in the third and fourth columns of Figure 3. In this situation, the communication time between machines is negligible compared to the waiting time for the straggler machine, such that the maximum delay is close to . There are three observations: (1) group-wise communication affects the convergence rate when the straggler problem is serious and is large; (2) sparse communication may affect the convergence rate; (3) when the straggler problem is serious, ACPD is much faster than CoCoA+.
V-B2 Effect of Sparsity Constant
In this section, we evaluate the performance of our methods when we vary the value of the sparsity constant . We set , and in the experiment, and the RCV1 dataset is distributed across four workers. In Figure 3(a), we can observe that when we decrease the value of from to , the convergence rate of ACPD is stable if the magnitude of the duality gap is above . The convergence rate of duality gap degrades a little when it is below . In practice, we can get good generalization error when the duality gap is above . Therefore, our method is robust to the selection of sparsity constant .
V-B3 Scaling Up Workers
We evaluate the speedup properties of ACPD and CoCoA+ in a distributed system with the straggler problem. In the experiment, we set , and . We test the performance of the compared methods when . For our method, we let , and . We plot the elapsed time of the compared methods when they reach duality gap in Figure 3(b). From this figure, we can observe that our method always spends much less time than CoCoA+ to reach a similar duality gap. As becomes large, the communication time becomes the bottleneck and constrains CoCoA+ from further speedup. Experimental results show that the group-wise and sparse communication help ACPD reduce the communication time remarkably, such that it can make better use of resources in a cluster.
V-C Experiments in Real Distributed Environment
In this section, we perform large-scale experiments with the KDD and URL datasets in a real distributed environment, where other jobs are running on different workers. We distribute all datasets across eight workers on the AWS platform. For our method, we let , and . Figure 5 presents the performance of the compared methods in terms of computational time. We can observe that our method is much faster than CoCoA+ in the real distributed environment. For example, ACPD is times faster than CoCoA+ to get duality gap according to the first figure in the second row of Figure 5. According to the results in the third column of Figure 5, the proposed method spends much less communication time than the other compared method.
VI Conclusion
In this paper, we propose a novel straggler-agnostic and bandwidth-efficient distributed primal-dual algorithm for high-dimensional data. The proposed method utilizes group-wise and sparse communication to reduce the communication time in the distributed environment. We provide a theoretical analysis of the proposed method for the convex problem and prove that our method guarantees linear convergence to the optimal solution under certain conditions. Finally, we perform large-scale experiments in distributed environments. Experimental results verify that the proposed method can be up to 4 times faster than the compared methods in a real distributed system.
References
[1] Alham Fikri Aji and Kenneth Heafield. Sparse communication for distributed gradient descent. arXiv preprint arXiv:1704.05021, 2017.
[2] Dan Alistarh, Demjan Grubic, Jerry Li, Ryota Tomioka, and Milan Vojnovic. QSGD: Communication-optimal stochastic gradient descent, with applications to training neural networks. arXiv preprint arXiv:1610.02132, 2016.
[3] Dan Alistarh, Demjan Grubic, Jerry Li, Ryota Tomioka, and Milan Vojnovic. QSGD: Communication-efficient SGD via gradient quantization and encoding. In Advances in Neural Information Processing Systems, pages 1709–1720, 2017.
[4] Léon Bottou, Frank E Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning. arXiv preprint arXiv:1606.04838, 2016.
[5] Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, Jonathan Eckstein, et al. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends® in Machine Learning, 3(1):1–122, 2011.
[6] Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Andrew Senior, Paul Tucker, Ke Yang, Quoc V Le, et al. Large scale distributed deep networks. In Advances in Neural Information Processing Systems, pages 1223–1231, 2012.
[7] Cho-Jui Hsieh, Kai-Wei Chang, Chih-Jen Lin, S Sathiya Keerthi, and Sellamanickam Sundararajan. A dual coordinate descent method for large-scale linear SVM. In Proceedings of the 25th International Conference on Machine Learning, pages 408–415. ACM, 2008.
[8] Zhouyuan Huo, Bin Gu, and Heng Huang. Training neural networks using features replay. In Advances in Neural Information Processing Systems, pages 6659–6668, 2018.
[9] Zhouyuan Huo, Bin Gu, Qian Yang, and Heng Huang. Decoupled parallel backpropagation with convergence guarantee. arXiv preprint arXiv:1804.10574, 2018.
[10] Zhouyuan Huo and Heng Huang. Asynchronous mini-batch gradient descent with variance reduction for non-convex optimization. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.
[11] Zhouyuan Huo, Xue Jiang, and Heng Huang. Asynchronous dual free stochastic dual coordinate ascent for distributed data mining. In 2018 IEEE International Conference on Data Mining (ICDM), pages 167–176. IEEE, 2018.
[12] Martin Jaggi, Virginia Smith, Martin Takáč, Jonathan Terhorst, Sanjay Krishnan, Thomas Hofmann, and Michael I Jordan. Communication-efficient distributed dual coordinate ascent. In Advances in Neural Information Processing Systems, pages 3068–3076, 2014.
[13] Jason D Lee, Qihang Lin, Tengyu Ma, and Tianbao Yang. Distributed stochastic variance reduced gradient methods and a lower bound for communication complexity. arXiv preprint arXiv:1507.07595, 2015.
[14] Mu Li, David G Andersen, Alex J Smola, and Kai Yu. Communication efficient distributed machine learning with the parameter server. In Advances in Neural Information Processing Systems, pages 19–27, 2014.
[15] Xiangru Lian, Ce Zhang, Huan Zhang, Cho-Jui Hsieh, Wei Zhang, and Ji Liu. Can decentralized algorithms outperform centralized algorithms? A case study for decentralized parallel stochastic gradient descent. In Advances in Neural Information Processing Systems, pages 5330–5340, 2017.
[16] Xiangru Lian, Wei Zhang, Ce Zhang, and Ji Liu. Asynchronous decentralized parallel stochastic gradient descent. arXiv preprint arXiv:1710.06952, 2017.
[17] Yujun Lin, Song Han, Huizi Mao, Yu Wang, and William J Dally. Deep gradient compression: Reducing the communication bandwidth for distributed training. arXiv preprint arXiv:1712.01887, 2017.
[18] Chenxin Ma, Virginia Smith, Martin Jaggi, Michael I Jordan, Peter Richtárik, and Martin Takáč. Adding vs. averaging in distributed primal-dual optimization. arXiv preprint arXiv:1502.03508, 2015.
[19] Zheng Qu and Peter Richtárik. Stochastic dual coordinate ascent with adaptive probabilities. 2015.
[20] Sashank J Reddi, Jakub Konečný, Peter Richtárik, Barnabás Póczos, and Alex Smola. AIDE: Fast and communication efficient distributed optimization. arXiv preprint arXiv:1608.06879, 2016.
[21] Shai Shalev-Shwartz and Tong Zhang. Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization. arXiv preprint arXiv:1309.2375, 2013.
[22] Shai Shalev-Shwartz and Tong Zhang. Stochastic dual coordinate ascent methods for regularized loss minimization. Journal of Machine Learning Research, 14(Feb):567–599, 2013.
[23] Shai Shalev-Shwartz and Tong Zhang. Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization. In ICML, pages 64–72, 2014.
[24] Ohad Shamir, Nati Srebro, and Tong Zhang. Communication-efficient distributed optimization using an approximate Newton-type method. In International Conference on Machine Learning, pages 1000–1008, 2014.
[25] Virginia Smith, Simone Forte, Chenxin Ma, Martin Takáč, Michael I Jordan, and Martin Jaggi. CoCoA: A general framework for communication-efficient distributed optimization. Journal of Machine Learning Research, 18:230, 2018.
[26] Virginia Smith, Simone Forte, Chenxin Ma, Martin Takáč, Michael I Jordan, and Martin Jaggi. CoCoA: A general framework for communication-efficient distributed optimization. arXiv preprint arXiv:1611.02189, 2016.
[27] Nikko Strom. Scalable distributed DNN training using commodity GPU cloud computing. In Sixteenth Annual Conference of the International Speech Communication Association, 2015.
[28] Wei Wen, Cong Xu, Feng Yan, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. TernGrad: Ternary gradients to reduce communication in distributed deep learning. In Advances in Neural Information Processing Systems, pages 1508–1518, 2017.
[29] An Xu, Zhouyuan Huo, and Heng Huang. Diversely stale parameters for efficient training of CNNs. arXiv preprint arXiv:1909.02625, 2019.
[30] Qian Yang, Zhouyuan Huo, Wenlin Wang, Heng Huang, and Lawrence Carin. Ouroboros: On accelerating training of transformer-based language models. arXiv preprint arXiv:1909.06695, 2019.
[31] Tianbao Yang. Trading computation for communication: Distributed stochastic dual coordinate ascent. In Advances in Neural Information Processing Systems, pages 629–637, 2013.
[32] Huan Zhang and Cho-Jui Hsieh. Fixing the convergence problems in parallel asynchronous dual coordinate descent. In Data Mining (ICDM), 2016 IEEE 16th International Conference on, pages 619–628. IEEE, 2016.
[33] Peilin Zhao and Tong Zhang. Stochastic optimization with importance sampling for regularized loss minimization. 2015.
[34] Yuchen Zhang and Lin Xiao. DiSCO: Distributed optimization for self-concordant empirical loss. In ICML, pages 362–370, 2015.