Straggler-Agnostic and Communication-Efficient Distributed Primal-Dual Algorithm for High-Dimensional Data Mining

10/09/2019, by Zhouyuan Huo, et al., University of Pittsburgh

Recently, reducing the communication time between machines has become the main focus of distributed data mining. Previous methods propose to make workers do more computation locally before aggregating local solutions in the server, so that fewer communication rounds between server and workers are required. However, these methods do not consider reducing the communication time per round, and they perform poorly under certain conditions, for example, when there are stragglers or the dataset is high-dimensional. In this paper, we aim to reduce the communication time per round as well as the number of required communication rounds. We propose a communication-efficient distributed primal-dual method with a straggler-agnostic server and bandwidth-efficient workers. We analyze its convergence properties and prove that the proposed method guarantees a linear convergence rate to the optimal solution for convex problems. Finally, we conduct large-scale experiments in simulated and real distributed systems, and the experimental results demonstrate that the proposed method is much faster than the compared methods.


I Introduction

Distributed optimization methods are indispensable when we optimize a data mining problem whose data or model is distributed across multiple machines. When the data are distributed, parameter-server [6, 14] or decentralized methods [15, 16] were proposed for parallel computation and linear speedup. When the model is distributed, especially for deep learning models, pipeline-based methods or decoupled backpropagation algorithms [9, 29, 8, 30] parallelize the model update across different machines and make full use of computing resources. In this paper, we only consider the case where the data are distributed.

As shown in Figure 1, most distributed methods require collecting update information from all workers iteratively to find the optimal solution [5, 12, 13, 10, 11]. Since communication over the network is slow, it is challenging to obtain accurate solutions within a limited time budget. The total running time of a distributed method is determined by multiple factors. To get an \epsilon-accurate solution, the total running time of a distributed algorithm can be represented as follows:

T_{total} = \sum_{t=1}^{R} \Big( T_{comm} + \max_{k} T_{comp}^{t,k} \Big),    (1)

where R denotes the number of communication rounds the algorithm requires to get an \epsilon-accurate solution, T_{comm} represents the communication time per round, which depends on the dimensionality of the data, and \max_{k} T_{comp}^{t,k} indicates the computational time required by the slowest worker at round t, i.e., the time by which all workers have completed their jobs. To reduce the total running time, we can either decrease the number of communication rounds R or cut down the running time at each round.

Fig. 1: Parameter server structure for distributed data mining.
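To make the trade-off in (1) concrete, the following short Python sketch (our own illustration; the worker counts, timings and round counts are made-up numbers, not measurements from the paper) compares a method that only reduces the number of rounds with one that also reduces the per-round cost when one straggler is present.

```python
# Hypothetical illustration of the cost model in Eq. (1):
# total time = sum over rounds of (communication time + slowest worker's compute time).

def total_time(rounds, t_comm, worker_times):
    """Total running time when every round waits for the slowest worker."""
    return sum(t_comm + max(worker_times) for _ in range(rounds))

# Four workers; worker 0 is a straggler that is 10x slower (made-up numbers).
worker_times = [10.0, 1.0, 1.0, 1.0]          # seconds of compute per round
print(total_time(rounds=100, t_comm=0.5, worker_times=worker_times))  # 1050.0
print(total_time(rounds=20,  t_comm=0.5, worker_times=worker_times))  # 210.0: fewer rounds
# Even with 5x fewer rounds, the straggler still dominates; reducing the
# per-round cost (waiting time and message size) is needed as well.
print(total_time(rounds=20,  t_comm=0.1, worker_times=[1.0, 1.0, 1.0]))  # 22.0: no straggler wait
```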

There are various methods that try to reduce the total running time by reducing the number of communication rounds R. Previous communication-efficient methods make workers update locally for multiple iterations and communicate with the server periodically. For example, DSVRG [13], DISCO [34], AIDE [20] and DANE [24] were proved to require few communication rounds to reach an \epsilon-accurate solution. There are also various distributed methods for dual problems, for example, CoCoA [12, 25], CoCoA+ [18, 26] and DisDCA [31]. These methods also admit linear convergence guarantees with respect to communication rounds. In [18], the authors proved that CoCoA+ is a generalization of CoCoA and showed that CoCoA+ is equivalent to DisDCA under certain conditions. A brief comparison of distributed primal-dual methods is given in Table I. These communication-efficient methods work well when the communication time per round is relatively small and all workers run at similar speeds. However, they suffer from the communication bottleneck when the data are high-dimensional, and from the straggler problem when some machines work far slower than the other normal workers.

In [1, 3, 27, 17], the authors proposed to reduce the communication time and increase bandwidth efficiency by compressing or dropping gradients in distributed optimization. There have also been several attempts to quantize the gradients so that smaller messages are transmitted over the network [2, 28]. However, these methods are not communication-efficient: they ask workers to send gradient information to the server at every iteration, and therefore suffer from a large number of communication rounds. To the best of our knowledge, there is no previous work that reduces the size of the transmitted messages for distributed primal-dual methods.

In this paper, we focus on reducing the running time of each round for distributed data mining. To address the straggler problem and high-dimensional data, we propose a novel straggler-agnostic and bandwidth-efficient distributed primal-dual algorithm. The main contributions of our work are summarized as follows:

  • We propose a novel primal-dual algorithm that addresses the straggler problem and the high communication complexity per iteration in Section III.

  • We provide convergence analysis in Section IV and prove that the proposed method guarantees linear convergence to the optimal solution for convex problems.

  • We perform experiments with large-scale datasets distributed across multiple machines in Section V. Experimental results verify that the proposed method can be up to 4 times faster than the compared methods.

Algorithm | S-A | Communication Rounds
DisDCA [31] | no |
CoCoA [12] | no |
CoCoA+ [18] | no |
ACPD | yes |
TABLE I: Communication costs of distributed primal-dual algorithms. d denotes the size of the model, s denotes the sparsity constant, and s ≪ d. S-A denotes straggler-agnostic.

II Related Work

II-A Stochastic Dual Algorithm

In this paper, we consider optimizing the following regularized empirical loss minimization problem, which arises ubiquitously in supervised machine learning:

\min_{w \in \mathbb{R}^d} \; P(w) := \frac{1}{n} \sum_{i=1}^{n} \phi_i(w^\top x_i) + \frac{\lambda}{2} \|w\|^2,    (2)

where x_i \in \mathbb{R}^d denotes a data sample and w denotes the linear predictor to be optimized. Many applications fall into this formulation, for example, classification and regression problems. To solve the primal problem (2), we can optimize its dual problem instead:

\max_{\alpha \in \mathbb{R}^n} \; D(\alpha) := -\frac{1}{n} \sum_{i=1}^{n} \phi_i^*(-\alpha_i) - \frac{\lambda}{2} \Big\| \frac{1}{\lambda n} X \alpha \Big\|^2,    (3)
Fig. 2: Communication protocols for (a) previous methods such as CoCoA, CoCoA+, and DisDCA, and (b) ACPD. Straggler-Agnostic: previous methods use a synchronous communication protocol, so they suffer from the slowest workers in the cluster. ACPD allows group-wise communication between server and workers, which avoids the straggler problem. Bandwidth-Efficient: previous methods send variables of dimensionality d directly through the network. ACPD first passes the update through a message filter and sends a compressed variable with only s nonzero entries, where s ≪ d. The number of nonzero elements in the returned model update is also of order s on average.

where \phi_i^* is the convex conjugate of \phi_i, X = [x_1, \dots, x_n] \in \mathbb{R}^{d \times n} denotes the data matrix and \alpha \in \mathbb{R}^n represents the dual variables. Stochastic dual coordinate ascent (SDCA) [7, 22] is one of the most successful methods proposed to solve problem (2). In [22], the authors proved that SDCA guarantees linear convergence if the convex function \phi_i is smooth, which is much faster than stochastic gradient descent (SGD) [4]. At iteration t, given a sampled index i and all other variables fixed, we maximize the following subproblem:

\max_{\Delta \alpha_i \in \mathbb{R}} \; D(\alpha + \Delta \alpha_i e_i),    (4)

where e_i denotes a coordinate vector of size n whose i-th element is 1 and whose other elements are 0. Another advantage of optimizing the dual problem is that we can monitor the optimization progress by keeping track of the duality gap, defined as G(w, \alpha) := P(w) - D(\alpha), where P and D denote the objective values of the primal problem and the dual problem respectively. Assuming w^* is the optimal solution to the primal problem (2) and \alpha^* is the optimal solution to the dual problem (3), the primal-dual relation is always maintained such that:

w(\alpha) = \frac{1}{\lambda n} X \alpha.    (5)
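As a minimal illustration (our own sketch, not code from the paper), the snippet below maintains the primal-dual relation (5) and monitors the duality gap for ridge regression, i.e. \phi_i(z) = (1/2)(z - y_i)^2 with conjugate \phi_i^*(-\alpha_i) = \alpha_i^2/2 - \alpha_i y_i; all variable names are our own.

```python
import numpy as np

def primal_obj(X, y, w, lam):
    # P(w) = (1/n) * sum_i 0.5*(w^T x_i - y_i)^2 + (lam/2)*||w||^2
    residual = X.T @ w - y
    return 0.5 * np.mean(residual ** 2) + 0.5 * lam * (w @ w)

def dual_obj(X, y, alpha, lam):
    # D(alpha) = -(1/n) * sum_i (0.5*alpha_i^2 - alpha_i*y_i) - (lam/2)*||X alpha / (lam n)||^2
    n = X.shape[1]
    w = X @ alpha / (lam * n)              # primal-dual relation (5)
    return -np.mean(0.5 * alpha ** 2 - alpha * y) - 0.5 * lam * (w @ w)

# Toy data: d = 5 features, n = 20 samples (columns of X are samples).
rng = np.random.default_rng(0)
X, y, lam = rng.standard_normal((5, 20)), rng.standard_normal(20), 0.1
alpha = np.zeros(20)
w = X @ alpha / (lam * X.shape[1])         # primal iterate induced by alpha via (5)
gap = primal_obj(X, y, w, lam) - dual_obj(X, y, alpha, lam)   # duality gap G(w, alpha) >= 0
print(f"duality gap = {gap:.4f}")
```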

II-B Distributed Communication-Efficient Primal-Dual Algorithm

Distributed optimization methods are needed when we train a data mining model with the dataset partitioned over multiple machines. We suppose that a dataset of n samples is evenly partitioned across K workers. \Omega_k represents the subset of data on worker k, where \Omega_k \cap \Omega_{k'} = \emptyset for k \neq k' and \sum_{k=1}^{K} |\Omega_k| = n. A sample is only stored on its worker k, so it cannot be sampled by any other worker. d represents the dimensionality of the dataset. In [12, 18], the authors proposed the communication-efficient distributed dual coordinate ascent algorithm (CoCoA) for distributed optimization of the dual problem. Communication-efficient means that CoCoA allows more computation on the worker side before communication between workers. Suppose the dataset is partitioned over K machines, and all machines compute simultaneously. In each iteration, workers optimize their local subproblems independently as follows:

(6)

where X_{[k]} is the data partition on worker k and \sigma' represents the difficulty of the given data partition. It was proved in Lemma 3 of [18] that the sum of the local subproblems (6) over the workers closely approximates the global dual problem (3). The global variable is updated after all workers have obtained a \Theta-approximate solution to their local subproblems. The authors in [18] claimed that CoCoA shows significant speedups over previous state-of-the-art methods on large-scale distributed datasets. However, the synchronous communication protocol makes CoCoA vulnerable to slow or dead workers. Suppose the normal workers spend t_{normal} seconds completing their computation task while a slow worker needs t_{slow} \gg t_{normal} seconds. In each iteration, all normal workers have to wait t_{slow} - t_{normal} seconds for the slow worker, which is a tremendous waste of computing resources.

III Straggler-Agnostic and Bandwidth-Efficient Distributed Primal-Dual Algorithm

In this section, we propose a novel Straggler-Agnostic and Bandwidth-Efficient Distributed Primal-Dual Algorithm (ACPD) for high-dimensional data.

III-A Straggler-Agnostic Server

As shown in Figure 2, previous distributed primal-dual methods need to collect information from all workers before updating, so they suffer when stragglers are present: the running time per iteration is entirely determined by the slowest worker. We overcome the straggler problem by allowing the server to update the model as soon as messages from a group of workers have been received. For example, in Figure 2, the server only needs to receive messages from two workers. The server keeps a model update variable for each worker k, which stores the update of the server model between two consecutive communications with worker k. After updating the variables on the server, it sends the model update variable to the corresponding workers for further computation.

1:  Initialize: Global model: ;Model update: ;Condition1: ;Condition2: ;
2:  for  do
3:     Update
4:     for  do
5:        Empty workers set ;
6:        while  Condition1 or Condition2 do
7:           Receive from worker , add to set ;
8:           Update ;
9:        end while
10:        Update ;
11:        Send to worker and set if ;
12:     end for
13:     Update ;
14:  end for
Algorithm 1 Straggler-Agnostic Server

Additionally, we also need to control the staleness gap between workers, as update information from very slow workers may lead to divergence [32]. Because of our group-wise communication protocol, the local models on the workers usually carry different timestamps, and the performance of the method can degrade severely if local models are too stale. To solve this problem, we make the server collect information from all workers every \Gamma iterations, so that every worker is guaranteed to be heard from at least once within \Gamma iterations. Thus, the maximum time delay between local models is bounded by \Gamma. A brief description of the procedure on the server is given in Algorithm 1.
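The following single-process Python sketch mimics the group-wise protocol of Algorithm 1 (our own illustration, not the paper's implementation): the server accepts updates from the first workers that finish until either a group of P workers has reported or a full synchronization is forced, and every worker is collected at least once every Gamma rounds so the staleness stays bounded. The names simulate_server, P and Gamma are ours, and the random "updates" stand in for the real worker messages.

```python
import numpy as np

def simulate_server(d=8, K=4, P=2, Gamma=3, rounds=6, seed=0):
    """Simulate Algorithm 1's group-wise aggregation with bounded staleness."""
    rng = np.random.default_rng(seed)
    w = np.zeros(d)                        # global model on the server
    last_heard = {k: 0 for k in range(K)}  # last round each worker was received

    for t in range(1, rounds + 1):
        # Force a full synchronization if some worker would become too stale.
        full_sync = any(t - last_heard[k] >= Gamma for k in range(K))
        need = K if full_sync else P

        # Pretend workers finish in order of their (random) compute time;
        # worker 0 is a straggler with a much larger expected compute time.
        finish = sorted(range(K), key=lambda k: rng.exponential(1.0 + 9.0 * (k == 0)))
        received = finish[:need]           # wait only for the first `need` workers

        for k in received:
            w += rng.standard_normal(d) * 0.01   # apply worker k's (toy) update
            last_heard[k] = t
        print(f"round {t}: full_sync={full_sync}, received workers {sorted(received)}")
    return w

simulate_server()
```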

III-B Bandwidth-Efficient Worker

Workers are responsible for most of the computation. The data stored on worker k are denoted by \Omega_k. Because of the group-wise communication, we assume that the probability of worker k being received by the server in a given round is p_k. In each iteration, worker k solves its local subproblem, obtains an approximate solution, and sends the filtered update variable to the server. Finally, the worker receives the global model update from the server for further computation.

1:  Initialize: Local model: ; Model update: ;Local dual variable: ;
2:  repeat
3:     ;
4:     Solve subproblem for iterations and output :     
5:     Update ;
6:     Update ;
7:     Find the largest values in as ;
8:     Update mask ;
9:     Send to server;
10:     Compute ;
11:     Update ;
12:     Update ;
13:     Receive from the server;
14:     Update ;
15:  until convergence
Algorithm 2 Bandwidth-Efficient Worker

III-B1 Subproblem in the Worker

First, worker k finds an approximate solution to the local subproblem as follows:

(7)

where the local objective is defined as:

(8)

At each iteration, we sample an index i uniformly at random from \Omega_k and maximize over the corresponding dual coordinate while the other variables are kept fixed. We repeat this procedure for H iterations. In Algorithm 2, H represents the number of local iterations before communication, controlling the trade-off between computation and communication. There are many fast solvers for the dual subproblem (7), such as stochastic dual coordinate ascent (SDCA) [22] and accelerated proximal stochastic dual coordinate ascent (Accelerated Prox-SDCA) [23]. Sampling techniques can also be used to improve the convergence of the local solver, such as importance sampling [33] and adaptive sampling [19]. In this paper, we only consider SDCA with uniform sampling as the local solver.
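As a concrete sketch of what the local solver does, the snippet below runs H uniform-sampled coordinate steps on one worker's partition before anything is communicated. It is our own illustration for the squared loss, whose coordinate maximization has the standard SDCA closed form; it is not claimed to be the paper's exact local subproblem, and all names are ours.

```python
import numpy as np

def local_sdca(X_local, y_local, alpha_local, w, lam, n_total, H, seed=0):
    """Run H SDCA steps on one worker's partition for the squared loss.

    X_local: d x n_k matrix of local samples (columns), y_local: local labels,
    alpha_local: local dual coordinates, w: current local copy of the model,
    n_total: total number of samples across all workers.
    Returns the accumulated dual change and the updated local model.
    """
    rng = np.random.default_rng(seed)
    n_k = X_local.shape[1]
    alpha = alpha_local.copy()
    w = w.copy()
    for _ in range(H):
        i = rng.integers(n_k)                        # uniform sampling on this worker
        x_i, y_i = X_local[:, i], y_local[i]
        # Closed-form coordinate maximization for phi_i(z) = 0.5*(z - y_i)^2.
        grad = y_i - alpha[i] - x_i @ w
        delta = grad / (1.0 + x_i @ x_i / (lam * n_total))
        alpha[i] += delta
        w += delta * x_i / (lam * n_total)           # keep relation (5) locally
    return alpha - alpha_local, w
```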

III-B2 Sparse Communication on the Worker

After obtaining an approximate solution to the subproblem (7) after H iterations, we compute the update for the primal variables using:

\Delta w_k = \frac{1}{\lambda n} X_{[k]} \Delta \alpha_{[k]},    (9)

As shown in Figure 2, previous methods send \Delta w_k directly to the server. However, when the dimensionality of the data is large, sending and receiving full-dimensional variables between server and workers is time-consuming. In contrast, ACPD first passes \Delta w_k through a message filter and sends only the filtered variable to the server. We implement the filter by simply selecting the s elements whose absolute values are the largest: an element is kept as long as it is among the top s in magnitude, and is set to zero otherwise. In this way, most of the update information is kept in the filtered variable, and less communication bandwidth is required. We can easily compress a sparse vector by storing the locations and values of its nonzero elements. For the purpose of theoretical analysis, in lines 10-12 of Algorithm 2 we put the filtered-out update information back into the local dual variables. In the end, the worker receives the model update variable from the server, whose communication time is also proportional to s on average. A brief summary of the procedure on the workers is given in Algorithm 2.

Lines 10-12 in Algorithm 2 keep the primal-dual relation (5) satisfied at every iteration, which is essential for the theoretical analysis. However, computing a matrix inversion at each iteration is not practical. In practice, we simply replace lines 10-12 with the masked update m \odot \Delta w_k, where m is the top-s mask from line 8 and \odot denotes element-wise multiplication. In the experiments, we show that this simplification does not affect the convergence empirically.
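Below is a small sketch (ours, not the paper's code) of the message filter and the simplified masked update: keep the s largest-magnitude entries of the local update, transmit them as (index, value) pairs, and retain the discarded mass locally.

```python
import numpy as np

def top_s_filter(delta_w, s):
    """Keep the s largest-magnitude entries of the update; zero out the rest.

    Returns the (indices, values) pair that would be transmitted and the
    residual that stays on the worker (the "filtered-out" information).
    """
    idx = np.argpartition(np.abs(delta_w), -s)[-s:]   # indices of the top-s entries
    values = delta_w[idx]
    residual = delta_w.copy()
    residual[idx] = 0.0                               # what is NOT sent this round
    return idx, values, residual

# Example: a d = 10 update compressed to s = 3 (index, value) pairs.
delta_w = np.array([0.02, -1.3, 0.4, 0.0, 2.1, -0.1, 0.05, -0.7, 0.3, 0.01])
idx, values, residual = top_s_filter(delta_w, s=3)
print("sent:", sorted(zip(idx.tolist(), values.tolist())))
# The masked update m ⊙ Δw that is applied in the simplified variant:
mask = np.zeros_like(delta_w)
mask[idx] = 1.0
print("masked update:", mask * delta_w)
```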

IV Convergence Analysis

In this section, we analyze the convergence properties of the proposed method and prove that it guarantees linear convergence to the optimal solution under certain conditions. Because of the group-wise communication, at iteration t the local variable on worker k equals a stale global variable from the server with time stamp t - \tau_k, where \tau_k denotes the delay of worker k. The local dual variable has nonzero coordinates only on the worker's own partition: because a data sample can only be drawn on the worker that stores it, coordinates belonging to other workers never change locally. Therefore, the subproblem on worker k at iteration t is the local subproblem (7) evaluated at this stale global variable. First, we suppose that the following assumptions are satisfied throughout the paper:

Assumption 1.

We assume that all convex losses \phi_i are non-negative and that \phi_i(0) \leq 1. All samples are normalized such that \|x_i\| \leq 1 for all i.

Assumption 2.

Each loss function \phi_i is (1/\gamma)-smooth; that is, for all a, b \in \mathbb{R}, |\phi_i'(a) - \phi_i'(b)| \leq \frac{1}{\gamma} |a - b|.

Assumption 3.

The time delay \tau_k on worker k is upper bounded: \tau_k \leq \Gamma. In our algorithm, this holds by construction because the server collects information from all workers every \Gamma iterations.

For any local solver on the workers, we assume that the result of the subproblem after H iterations is a \Theta-approximation of the optimal solution:

Assumption 4.

We define \Delta \alpha_{[k]}^* to be the optimal solution to the local subproblem on worker k. For each subproblem, there exists a constant \Theta \in [0, 1) such that the local solver returns a \Theta-approximate solution:

(10)

All of the above assumptions are commonly used in [21, 22]. Based on these assumptions, we analyze the convergence rate of the proposed method. First, we characterize the progress of the proposed method at each iteration.

Lemma 1.

Suppose all assumptions are satisfied, each \phi_i is (1/\gamma)-smooth and convex, and \alpha^{t+1} and w^{t+1} are computed according to Algorithms 1 and 2. Then, for an appropriate choice of the auxiliary constants, the following inequality holds:

(11)

According to Lemma 1, we analyze the convergence rate of the proposed method as follows.

Theorem 1.

Suppose that all assumptions are satisfied and the algorithm parameters are set appropriately. Through Algorithms 1 and 2, we obtain the duality sub-optimality:

(12)

after T outer iterations, as long as:

(13)

Proof.

For the duality sub-optimality, we first obtain an upper bound such that:

(14)

According to Lemma 1, we have:

(15)

where the last inequality follows from the assumptions above. Then we can get the upper bound as follows:

(16)

Summing up the above inequality over the outer iterations, we have:

(17)

Therefore, as long as:

(18)

the following inequality holds:

(19)

Applying (19) recursively from t = 0 to T, and choosing the parameters appropriately so that condition (18) is always satisfied, we have:

(20)

where the second inequality follows from Assumption 1. Therefore, if the sub-optimality of the dual problem is to be bounded by the target accuracy, the number of outer iterations T must be bounded such that:

(21)

Fig. 3: Convergence of the duality gap for the compared methods with respect to communication rounds and elapsed time. The straggler factor denotes how many times longer the straggler worker's computational time is compared to that of the other normal workers; for example, if the normal computational time is t, the straggler takes the factor times t.
Remark 1.

From Theorem 1, the required constants are well defined when the algorithm parameters lie in the admissible range. Therefore, it is guaranteed that we can always find appropriate parameter values such that the conditions of the theorem are satisfied.

Theorem 2.

Following the notation in Theorem 1, to reach a target duality gap the number of outer iterations must be lower bounded as follows:

(22)

Proof.

To bound the duality gap, using the bound derived in the proof of Theorem 1 we have:

(23)

Hence, in order to reach the target duality gap, we must make sure that:

(24)

In summary, we have proved that the proposed method guarantees a linear convergence rate for convex problems as long as the assumptions in this section are satisfied.

V Experiments

In this section, we perform data mining experiments with large-scale datasets in distributed environments. First, we describe the implementation details of our experiments in Section V-A. Then, in Section V-B, we evaluate our method in a simulated distributed environment with a straggler. Finally, we conduct experiments in a real distributed environment in Section V-C. Experimental results show that the proposed method can be up to 4 times faster than the compared methods in a real distributed system.

V-A Implementations

In the experiments, we apply the proposed algorithm to solve a ridge regression problem, where \phi_i is the least squares loss. The dual problem of the ridge regression problem can be represented as follows:

\max_{\alpha \in \mathbb{R}^n} \; D(\alpha) = -\frac{1}{n} \sum_{i=1}^{n} \Big( \frac{\alpha_i^2}{2} - \alpha_i y_i \Big) - \frac{\lambda}{2} \Big\| \frac{1}{\lambda n} X \alpha \Big\|^2,    (25)

where y_i and x_i are the labels and feature vectors respectively. We use three binary classification datasets from the LIBSVM dataset collection (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html): RCV1, URL and KDD. Table II gives brief details about these datasets. We compare with the CoCoA+ method in the experiments, as it has been shown to be superior to other related methods [18].

Dataset Samples (n) Features (d) Size
RCV1 G
URL G
KDD G
TABLE II: Summary of three real large-scale datasets in the experiment.


We implement the compared methods CoCoA+ and ACPD in C++, where the communication between workers and the server is handled by OpenMPI (https://www.open-mpi.org/). We use 'Send' and 'Recv' for point-to-point communication and 'Allreduce' for collective communication. All experiments are conducted on Amazon Web Services, where each node is a t2.medium instance with two virtual CPUs.
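The actual implementation described above is in C++ with OpenMPI. Purely to illustrate the point-to-point pattern with compressed sparse messages, here is a rough Python analogue using mpi4py (our own sketch, not the authors' code; the ranks, tags and sizes are arbitrary choices).

```python
# Run with: mpirun -n 2 python sparse_send.py  (requires mpi4py and an MPI installation)
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 1:                                    # a worker
    delta_w = np.zeros(1000)
    delta_w[[3, 250, 999]] = [0.5, -1.2, 0.7]    # toy sparse update
    idx = np.nonzero(delta_w)[0]
    comm.send((idx, delta_w[idx]), dest=0, tag=7)    # send (indices, values) only
elif rank == 0:                                  # the server
    idx, vals = comm.recv(source=1, tag=7)
    w = np.zeros(1000)
    w[idx] += vals                               # apply the sparse update
    print("received", len(idx), "nonzeros")
```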

Fig. 4: (a) Duality gap convergence of the proposed method with respect to communication rounds for different values of the sparsity constant s; (b) total running time of the compared methods with different numbers of workers.
Fig. 5: The left two columns show the convergence of the duality gap for the compared methods with respect to elapsed time. The right column presents the computational time and communication time of the compared methods when they reach a similar duality gap during optimization.

V-B Experiments in Simulated Distributed Environment

In this section, we perform experiments on RCV1 in a distributed system with a straggler worker. We simulate the straggler problem by forcing one worker to sleep at each iteration, so that its computational time is a chosen multiple of the computational time of the other normal workers.

V-B1 Convergence of Compared Methods

In Figure 3, we present the convergence behaviour of the compared methods with respect to communication rounds and elapsed time. The RCV1 dataset is distributed across four workers. We follow the parameter settings in [18] for CoCoA+. For ACPD, we fix the group size, the sparsity constant and the synchronization period in this experiment. To analyze the effects of straggler-agnosticism and bandwidth-efficiency separately, we also perform ablation studies by disabling the group-wise communication or the sparse communication. The figures in the first two columns show the results when the straggler is mild. In this situation, the waiting time for the straggler machine is comparable to the communication time between machines. There are two observations: (1) our method admits a convergence rate nearly identical to that of CoCoA+ in this setting; (2) ACPD converges faster than the other compared methods with respect to time. It is obvious that both straggler-agnosticism and bandwidth-efficiency are beneficial for distributed optimization. We then make the straggler more severe and plot the results in the third and fourth columns of Figure 3. In this situation, the communication time between machines is negligible compared to the waiting time for the straggler machine, so the maximum delay is close to its upper bound. There are three observations: (1) group-wise communication affects the convergence rate when the straggler problem is serious and the delay bound is large; (2) sparse communication may affect the convergence rate; (3) when the straggler problem is serious, ACPD is much faster than CoCoA+.

V-B2 Effect of Sparsity Constant

In this section, we evaluate the performance of our method as we vary the value of the sparsity constant s. The other parameters are fixed as in the previous experiment, and the RCV1 dataset is distributed across four workers. In Figure 4(a), we can observe that as we decrease the value of s, the convergence rate of ACPD remains stable while the duality gap is above a moderate accuracy level, and degrades only slightly below that level. In practice, we already obtain a good generalization error at such duality gaps. Therefore, our method is robust to the selection of the sparsity constant s.

V-B3 Scaling Up Workers

We evaluate the speedup properties of ACPD and CoCoA+ in a distributed system with a straggler. In the experiment, we fix the straggler setting and the algorithm parameters, and test the performance of the compared methods as the number of workers K varies. For our method, the group size, sparsity constant and synchronization period are kept fixed. We plot the elapsed time of the compared methods when they reach a target duality gap in Figure 4(b). From this figure, we can observe that our method always spends much less time than CoCoA+ to reach a similar duality gap. As K becomes large, the communication time becomes the bottleneck and prevents CoCoA+ from further speedup. Experimental results show that the group-wise and sparse communication help ACPD reduce the communication time remarkably, so that it can make better use of the resources in a cluster.

V-C Experiments in Real Distributed Environment

In this section, we perform large-scale experiments with the KDD and URL datasets in a real distributed environment, where other jobs are running on different workers. We distribute both datasets over eight workers on the AWS platform. For our method, the group size, sparsity constant and synchronization period are kept fixed. Figure 5 presents the performance of the compared methods in terms of elapsed time. We can observe that our method is much faster than CoCoA+ in the real distributed environment; for example, ACPD is several times faster than CoCoA+ in reaching the target duality gap according to the first figure in the second row of Figure 5. According to the results in the third column of Figure 5, the proposed method also spends much less communication time than the compared method.

VI Conclusion

In this paper, we propose a novel straggler-agnostic and bandwidth-efficient distributed primal-dual algorithm for high-dimensional data. The proposed method utilizes group-wise and sparse communication to reduce the communication time in distributed environments. We provide a theoretical analysis of the proposed method for convex problems and prove that it guarantees linear convergence to the optimal solution under certain conditions. Finally, we perform large-scale experiments in distributed environments. Experimental results verify that the proposed method can be up to 4 times faster than the compared methods in a real distributed system.

References