Straggler-Resilient Distributed Machine Learning with Dynamic Backup Workers

With the increasing demand for large-scale training of machine learning models, consensus-based distributed optimization methods have recently been advocated as alternatives to the popular parameter server framework. In this paradigm, each worker maintains a local estimate of the optimal parameter vector and iteratively updates it by waiting for and averaging the estimates obtained from its neighbors, and then corrects it on the basis of its local dataset. However, the synchronization phase can be time consuming due to the need to wait for stragglers, i.e., slower workers. An efficient way to mitigate this effect is to let each worker wait only for updates from the fastest neighbors before updating its local parameter. The remaining neighbors are called backup workers. To minimize the global training time over the network, we propose a fully distributed algorithm to dynamically determine the number of backup workers for each worker. We show that our algorithm achieves a linear speedup for convergence (i.e., convergence performance increases linearly with the number of workers). We conduct extensive experiments on MNIST and CIFAR-10 to verify our theoretical results.


1 Introduction

Highly over-parametrized deep neural networks have shown impressive results in a variety of machine learning (ML) tasks such as computer vision [18], natural language processing [48], speech recognition [2], and many others. Their success depends on the availability of large amounts of training data, which often leads to a dramatic increase in the size, complexity, and computational power of the training systems. In response to these challenges, the need for efficient parallel and distributed algorithms (e.g., data-parallel mini-batch stochastic gradient descent (SGD)) becomes even more urgent for solving large-scale optimization and ML problems [4, 8]. These ML algorithms generally use the parameter server (PS) [47, 39, 14, 19, 27, 28] or ring All-Reduce [16, 17] communication primitive to perform exact averaging of the local mini-batch gradients computed on different subsets of the data by each worker, for the subsequent synchronized model update. However, aggregation with the PS or All-Reduce often incurs extremely high communication overhead (each worker needs to receive the aggregate of updates from all other workers to move to the next iteration, where aggregation is performed either by the PS or along the ring through multiple rounds), creating a bottleneck for efficient ML model training.

Recently, an alternative approach has been developed in the ML community [29, 30], where each worker keeps updating a local version of the parameter vector and broadcasts its updates only to its neighbors. This family of algorithms dates back to the seminal work of [46] on distributed gradient methods, and its members are often referred to as consensus-based distributed optimization methods.

Dynamic Backup Workers. Most existing consensus-based algorithms assume full worker participation, that is, all workers participate in every training round. Such exact averaging with consensus-based methods is sensitive to stragglers, i.e., slow tasks, which can significantly reduce the computation speed in a multi-machine setting [3, 23]. In practice, only a small fraction of workers participate in each training round, thus rendering the active worker (neighbor) set stochastic and time-varying across training rounds. An efficient way to mitigate the effect of stragglers is to rely on backup workers [13]: rather than waiting for updates from all of its neighbors, a worker only waits for updates from the fastest neighbors before updating and correcting its local parameter. The remaining neighbors are called backup workers.

Workers are then nodes of a time-varying consensus graph, where an edge indicates that a worker waits for updates from the corresponding neighbor to update and correct its local parameter. A larger number of backup workers reduces the communication overhead and may mitigate the effect of stragglers, since each worker needs to wait for fewer updates from its neighbors; this shortens each iteration. At the same time, the "quality" of updates may suffer, since each worker only receives limited information from its neighbors. As a result, more iterations may be required for convergence, or in some cases the consistency of parameters across all workers cannot even be guaranteed. This raises the following questions: can a large number of backup workers per node significantly reduce the convergence time, or will stragglers still slow down the whole network, and which effect prevails? Apart from some numerical results [31], this paper is perhaps the first attempt to answer these questions.

Main Contribution. Our contribution is three-fold. First, we formulate the consensus-based distributed optimization problem with dynamic backup workers (DyBW), and propose the consensus-based DyBW (cb-DyBW) algorithm, which can dynamically adapt the number of backup workers for each worker during training.

Second, we present the first (to the best of our knowledge) convergence analysis of consensus-based algorithms with DyBW that is cognizant of the training process at each worker. We show that the convergence rate of the cb-DyBW algorithm on both independent and identically distributed (i.i.d.) and non-i.i.d. datasets across workers is $\mathcal{O}(1/\sqrt{NT})$, where $N$ is the total number of workers and $T$ is the total number of communication rounds. This indicates that our algorithm achieves a linear speedup in convergence rate for sufficiently large $T$. The PSGD algorithm [29] achieves the same rate with full worker participation in each communication round, which leads to high communication costs and implementation complexity. In contrast, we have the flexibility to dynamically define the number of backup workers for each node in the system. We then build on the convergence rate in terms of the number of iterations to characterize the wall-clock time required to achieve a given accuracy. We show that cb-DyBW can dramatically reduce the convergence time by shortening each iteration without increasing the number of iterations required for a given accuracy.

Third, we develop a practical algorithm to dynamically determine the number of backup workers for each worker during training. To our knowledge, our paper is the first study to understand and quantify the effect of dynamic backup workers in fully distributed ML systems.

Related Work. Our algorithm belongs to the class of consensus/gossip algorithms. A significant amount of research has focused on developing consensus-based algorithms in various fields [5, 24, 51, 7, 33, 21, 34, 8, 15, 11, 36, 37, 40, 41]. However, most results are developed for full worker participation, which is communication-expensive with large numbers of workers. Efficient control of partial worker participation in consensus-based distributed optimization methods has received little attention, despite arising in problems of practical interest. The few existing works that address this issue through a stale-synchronous model [10, 19] have no convergence guarantees, and the number of backup workers is often configured manually through some preliminary experiments before the start of the actual training process. Adaptive worker selection that is cognizant of the training process at each worker is not yet well understood.

A more relevant line of research on distributed learning is based on the PS/All-Reduce architecture. Most existing work relies on a centralized framework with a synchronous communication scheme, which often leads to communication congestion at the server. Alternatively, the PS can operate asynchronously, updating the parameter vector immediately after it receives updates from a single worker. Although such asynchronous updates can increase system throughput (parameter updates per time unit), some workers may then operate on stale versions of the parameter vector, which in some cases can even prevent convergence to the optimal model [12]. More recently, several PS-based algorithms considering a stale-synchronous model via backup workers have gained increased attention; see [10, 43, 35, 52, 53] for details, among which [31, 52] are perhaps the only other works proposing to dynamically adapt the number of backup workers. However, most of them focus on empirical results without convergence guarantees and without a rigorous analysis of how selection skew affects convergence speed. Moreover, none of them exploit communication among workers, since they assume either that a central server coordinates the workers (e.g., the PS model) or that all workers play an identical role (e.g., All-Reduce). Generalizing the PS schemes to our fully distributed framework is a non-trivial and open problem because of the large-scale, distributed, and heterogeneous nature of the training data. We refer interested readers to [42] and references therein for a comprehensive review of distributed learning.

Notation. Let $N$ be the total number of workers and $T$ the total number of communication rounds. We denote the cardinality of a finite set by $|\cdot|$, and we denote by $\mathbf{I}$, $\mathbf{1}$, and $\mathbf{0}$ the identity, all-one, and all-zero matrices of proper dimensions, respectively. We write $[N]$ for the set of integers $\{1,\ldots,N\}$. We use boldface to denote matrices and vectors, $\|\cdot\|$ to denote the $\ell_2$-norm, and $\mathrm{diag}(\cdot)$ to return the diagonal elements of a matrix.

2 Background and Problem Formulation

In this section, we introduce the background and formulation of the consensus-based distributed optimization problem with dynamic backup workers.

2.1 Consensus-based Distributed Optimization

Supervised learning aims to learn a function that maps an input to an output using examples from a training dataset, where each example is a pair consisting of an input and the associated output. Training an ML model amounts to finding the best statistical model by optimizing a set of parameters, i.e., by solving the following optimization problem

(1)

where each term measures the model error on the corresponding element of the dataset when the given parameter vector is used. The objective function may also include a regularization term that enforces some "simplicity" (e.g., sparseness) of the solution, which can easily be taken into account in our analysis.

Different iterative algorithms have been proposed to solve (1), and we refer interested readers to [9] for a nice introduction. Due to the increasing size of available datasets and the complexity of statistical models, an efficient distributed algorithm for (1) is usually desired in order to determine the parameter vector in a reasonable time. A common approach is to offload the computation to independent workers that jointly determine the optimal parameter of interest through distributed coordination. In other words, (1) can be equivalently reformulated as the minimization of a sum of functions local to each worker

(2)

where each local objective is defined over the local dataset of the corresponding worker. In conventional distributed learning, the dataset is divided among the workers, and the distribution of each worker's local dataset can usually be assumed to be i.i.d. Unfortunately, this assumption may not hold in practice. Instead, we make no such assumption in our model, and our analysis holds for both i.i.d. and non-i.i.d. local datasets across workers.
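For concreteness, a plausible rendering of (1) and (2) is sketched below; the symbols $\mathbf{x}$ (parameter vector), $\mathcal{D}$ (full training dataset), $\mathcal{D}_i$ (worker $i$'s local dataset), $f(\mathbf{x}; s)$ (per-sample loss), and $N$ (number of workers) are our own notational choices rather than the paper's original symbols, and the equivalence assumes the dataset is split evenly across the $N$ workers.

```latex
% A sketch of the global objective (1) and its worker-level decomposition (2),
% under the assumed notation described above.
\[
\min_{\mathbf{x}}\; F(\mathbf{x})
  = \frac{1}{|\mathcal{D}|}\sum_{s\in\mathcal{D}} f(\mathbf{x}; s)
\qquad\Longleftrightarrow\qquad
\min_{\mathbf{x}}\; \frac{1}{N}\sum_{i=1}^{N} F_i(\mathbf{x}),
\quad
F_i(\mathbf{x}) = \frac{1}{|\mathcal{D}_i|}\sum_{s\in\mathcal{D}_i} f(\mathbf{x}; s).
\]
```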

The distributed system can be modeled as a communication graph whose nodes are the workers and whose edges indicate which pairs of workers can communicate with each other. Without loss of generality (w.l.o.g.), we assume the graph is strongly connected, i.e., there exists at least one path between any two workers. We refer to the workers adjacent to a given worker as its neighbors. Each worker maintains a local estimate of the parameter vector at each iteration and broadcasts it to its neighbors. The local estimate is updated as follows:

(3)

where the non-negative matrix is called the consensus matrix and the learning rate can be time-varying. In other words, in each iteration, each worker computes a weighted average (i.e., a consensus component) of the estimates of its neighbors and itself, and then corrects it by taking into account a stochastic subgradient of its local function, i.e.,

(4)

where the stochastic subgradient is computed on a random mini-batch drawn from the worker's local dataset at the current iteration.
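To make the update concrete, here is a minimal NumPy sketch of one synchronous consensus-SGD iteration in the spirit of (3)-(4); the names `W` (consensus matrix), `x` (stacked local estimates), `eta` (learning rate), `local_grad`, and `batches` are our own illustrative choices, not the paper's notation.

```python
import numpy as np

def consensus_sgd_step(x, W, eta, local_grad, batches):
    """One synchronous consensus-SGD iteration, cf. (3)-(4).

    x          : (N, d) array, row i is worker i's local parameter estimate
    W          : (N, N) doubly stochastic consensus matrix
    eta        : learning rate
    local_grad : callable (i, params, batch) -> (d,) stochastic gradient
    batches    : list of length N with the mini-batch sampled by each worker
    """
    N, d = x.shape
    # Consensus component: each worker averages the estimates of its neighbors.
    x_avg = W @ x
    # Correction component: each worker takes a local stochastic gradient step.
    grads = np.stack([local_grad(i, x[i], batches[i]) for i in range(N)])
    return x_avg - eta * grads
```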

2.2 Dynamic Backup Workers Setting

To mitigate the effect of stragglers, at each iteration a worker only waits for the first updates to arrive from its fastest neighbors rather than waiting for all of them as in (3). We refer to this subset as the active neighbors; the remaining neighbors are called the backup workers of that worker at that iteration. As a result, each worker first locally updates its parameter as in (5), and then performs the consensus update only with its active neighbors as in (6).

(5)
(6)

where the consensus matrix is time-varying and stochastic, with zero weight assigned to any neighbor outside the active set at that iteration. Stacking the local gradients into a gradient matrix and denoting the learning rate accordingly, the compact form of (5) and (6) satisfies

(7)

from which we have

(8)

Note that the number of active neighbors, and hence the consensus matrix, in our model is not static but changes dynamically from one iteration to the next, enabling a shorter convergence time. This will be described in detail in Section 3.

We summarize the workflow of the above consensus-based dynamic backup workers scheme in Algorithm 1 and call it cb-DyBW. One key challenge is how to determine the number of backup workers (or, equivalently, the number of active neighbors) for each worker at each iteration so as to minimize the global training (convergence) time over the network. In the following, we first characterize the convergence performance of cb-DyBW in Section 3 and then determine how to dynamically select the backup workers in Section 4.

1:Network , number of iterations .
2:Estimated parameter .
3:for  do
4:     if  then
5:         Set , where is the number of neighbors for worker .      
6:     Compute local gradient and update local parameter as in (5) for ;
7:     Correct local parameter with the received updates from the first neighbors as in (6) for ;
8:     
9:     Update for .
Algorithm 1 Consensus-based Dynamic Backup Workers (cb-DyBW)
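As a minimal sketch of Algorithm 1's per-iteration logic (a centralized simulation rather than a real MPI implementation), the snippet below lets each worker combine only the first few neighbor updates to arrive, using simple Metropolis-style weights restricted to the active neighbors; all names (`compute_times`, `k`, `neighbors`) are illustrative assumptions rather than the paper's notation.

```python
import numpy as np

def cb_dybw_step(x, grads, eta, neighbors, k, compute_times):
    """Simulate one cb-DyBW iteration, cf. (5)-(6).

    x             : (N, d) local parameters, one row per worker
    grads         : (N, d) local stochastic gradients at this iteration
    eta           : learning rate
    neighbors     : dict i -> list of neighbor ids
    k             : dict i -> number of fastest neighbors worker i waits for
    compute_times : (N,) simulated local-computation times at this iteration
    """
    N, d = x.shape
    # (5): local SGD step at every worker.
    x_local = x - eta * grads
    x_new = np.empty_like(x)
    for i in range(N):
        # Worker i waits only for its k[i] fastest neighbors at this iteration.
        fastest = sorted(neighbors[i], key=lambda j: compute_times[j])[:k[i]]
        active = [i] + fastest
        # (6): consensus over the active set with simple Metropolis-style weights.
        w = np.zeros(N)
        for j in fastest:
            w[j] = 1.0 / (1 + max(len(active) - 1, len(neighbors[j])))
        w[i] = 1.0 - w.sum()
        x_new[i] = w[active] @ x_local[active]
    return x_new
```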

3 Convergence Analysis

In this section, we analyze the convergence of cb-DyBW. We show that a linear speedup for convergence (i.e., convergence performance increases linearly with respect to the number of workers) is achievable by cb-DyBW. We relegate all proof details to the appendix.

3.1 Assumptions

We first introduce the assumptions utilized for our convergence analysis.

Assumption 1 (Non-Negative Metropolis Weight Rule).

The Metropolis weights on a time-varying graph with consensus matrix at iteration satisfy

(9)

where the degree counts the number of active neighbors that a node waits for at the given iteration. Given the Metropolis weights, the matrices are doubly stochastic, i.e., each row and each column sums to one.

Assumption 2 (Bounded Connectivity Time).

There exists an integer $B$ such that the union of the communication graphs over any $B$ consecutive iterations is strongly connected.

Assumption 3 (L-Lipschitz Continuous Gradient).

We assume that for each worker, the local loss function is convex and differentiable, and its gradient is Lipschitz continuous with constant $L$, i.e.,

(10)
Assumption 4 (Bounded Variance).

There exist constants such that the variance of the local gradient estimator at each worker is bounded.

Assumptions 1-4 are standard in the related literature. For example, Assumptions 1 and 2 are common in dynamic networks with a finite number of nodes [50, 40]. Specifically, Assumption 1 guarantees that the product of consensus matrices is doubly stochastic, while Assumption 2 guarantees that this product is strictly positive over sufficiently long windows. We define the product of consensus matrices over a window of iterations accordingly; under Assumption 2, [32] proved that over any sufficiently long window this product is a strictly positive matrix. In fact, every entry of it is bounded below by a positive constant determined by the smallest positive entry appearing in the consensus matrices.

Assumption 3 is widely used in convergence results for gradient methods, e.g., [33, 9, 6, 32]. Assumption 4 is also standard [22, 42]. We use the variance bound to quantify the heterogeneity of the non-i.i.d. local datasets across workers; in particular, a zero bound corresponds to i.i.d. datasets. For ease of exposition, we consider a universal bound in our analysis, which can easily be generalized to the non-i.i.d. case. It is worth noting that we do not require a bounded gradient assumption, which is often used in distributed optimization analyses.
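As a sketch of the Metropolis weight rule in Assumption 1 (using the standard formula; the `active_neighbors` interface and the example graph are our own assumptions), the helper below builds the consensus matrix for one iteration and checks that it is doubly stochastic:

```python
import numpy as np

def metropolis_weights(active_neighbors):
    """Build a Metropolis consensus matrix for one iteration.

    active_neighbors : dict i -> set of neighbors whose updates worker i uses;
                       assumed symmetric (j in active_neighbors[i] iff
                       i in active_neighbors[j]) so the result is doubly stochastic.
    """
    N = len(active_neighbors)
    W = np.zeros((N, N))
    deg = {i: len(active_neighbors[i]) for i in range(N)}
    for i in range(N):
        for j in active_neighbors[i]:
            W[i, j] = 1.0 / (1 + max(deg[i], deg[j]))
        W[i, i] = 1.0 - W[i].sum()   # self-weight absorbs the remainder
    return W

# Example: a 4-worker line graph 0-1-2-3 where every link is active.
W = metropolis_weights({0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}})
assert np.allclose(W.sum(axis=0), 1) and np.allclose(W.sum(axis=1), 1)
```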

3.2 Main Convergence Results

3.2.1 Convergence in terms of Iteration

In this subsection, we analyze the convergence of cb-DyBW in terms of iterations. We first show that as the number of iterations increases, the gradients tend to zero.

Theorem 1.

Under Assumptions 1-4 and with a constant learning rate, the local gradient at each worker generated by cb-DyBW satisfies

(11)

for a sufficiently large number of iterations, where the bound involves a constant independent of the iteration index and the expectation is over the local dataset samples among workers.

Proof Sketch. We introduce two auxiliary quantities. Since the loss function has an $L$-Lipschitz continuous gradient, we have

(12)

The proof boils down to bounding the two terms on the right-hand side. We show that for sufficiently large iteration indices the first term converges; the next step is to bound the difference between the two remaining quantities, which we show is controlled once the iteration index is large and above a certain level. Please see Appendix A.1 for the full proof.
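The smoothness inequality invoked here and in the proof of Theorem 2 is the standard descent lemma implied by Assumption 3; written in a generic notation of our own choosing ($F_i$ for the local loss, $L$ for the Lipschitz constant), it reads:

```latex
% Descent lemma implied by an L-Lipschitz gradient (Assumption 3).
\[
F_i(\mathbf{y}) \;\le\; F_i(\mathbf{x})
  + \langle \nabla F_i(\mathbf{x}),\, \mathbf{y}-\mathbf{x} \rangle
  + \frac{L}{2}\,\|\mathbf{y}-\mathbf{x}\|^2
  \qquad \forall\, \mathbf{x}, \mathbf{y}.
\]
```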

Remark 1.

The bound (11) on the gradient consists of two parts: (i) a term that vanishes as the number of iterations increases, and (ii) a term that likewise converges as the number of iterations increases. Together, the bound tends to zero for a sufficiently large number of iterations.

By Theorem 1, there exists a large enough iteration index beyond which the gradients tend to zero. This enables us to consider a truncated model in which the gradient contribution is dropped beyond that point. Our model update in cb-DyBW through (2.2) (or equivalently (5) and (6)) then reduces to

(13)

Clearly, (13) is equivalent to (2.2) once the dropped gradient term vanishes. We now analyze the convergence of cb-DyBW under this truncated model, for which we have the following result:

Theorem 2.

Under Assumptions 1-4 and with a constant learning rate, the sequence of parameters generated by the recursion (13) satisfies

(14)

where the expectation is over the local dataset samples among workers.

Proof Sketch. The proof follows a similar approach to that of Theorem 1. It leverages the inequality induced by the Lipschitz continuous gradient to bound the relevant error terms with respect to (w.r.t.) the loss function. We show that for a large number of iterations, the expected loss converges to the optimal loss with a linear speedup. Please see Appendix A.2 for the full proof.

The following result is an immediate consequence of Theorem 2.

Corollary 1.

Consider the truncated recursion (13) for the weights at each node. We have

(15)
Remark 2.

The convergence bound consists of two parts: a term that vanishes as the number of iterations increases, and a constant term whose value depends on the problem instance parameters and is independent of the number of iterations. This convergence bound with dynamic backup workers has the same structure as that of the typical consensus-based method with full worker participation, but with different variance terms. In other words, the decay rate of the vanishing term matches that of the typical consensus-based method. This implies that dynamic backup workers do not fundamentally change the convergence behavior (in an order sense) in terms of iterations. However, as we will see later, dynamic backup workers can significantly reduce the convergence time in terms of wall-clock time, since they reduce the length of each iteration. We further note that this convergence bound can be better than that of centralized SGD, with less communication at the busiest worker.

Corollary 2.

With a properly chosen learning rate, the convergence rate of Algorithm 1 is $\mathcal{O}(1/\sqrt{NT})$.

Remark 3.

The consensus-based distributed optimization method with dynamic backup workers can still achieve a linear speedup with proper learning rate settings, as shown in Corollary 2. Although many works have achieved this convergence rate asymptotically, e.g., [29] is the first to provide a theoretical analysis of distributed SGD with a convergence rate of $\mathcal{O}(1/\sqrt{NT})$, these results hold only for consensus-based methods with full worker participation. It is non-trivial to achieve our result due to the large-scale, distributed, and heterogeneous nature of training data across workers.

With Corollary 2, we immediately have the following results on the number of iterations required for convergence:

Corollary 3.

With the learning rate of Corollary 2, the number of iterations required to achieve $\epsilon$-accuracy for the loss function, plus the additional number of iterations needed to guarantee the convergence of the parameters according to (15) in Corollary 1, gives the total number of iterations required for convergence.

Note that during these additional iterations, there is no local computation of gradients at any worker; the workers only exchange their parameters to reach a consensus.

3.2.2 Convergence in terms of Wall-Clock Time

In this subsection, we analyze the convergence of cb-DyBW in terms of wall-clock time. We show that dynamic backup workers can dramatically reduce the convergence time compared to full worker participation. The intuition is that since each worker only needs to wait for its fastest neighbors to update its local parameter, the length of one iteration can be significantly reduced, which in turn reduces the convergence time given the results on convergence in iterations in Section 3.2.1. As a result, we need to estimate the time needed for each iteration.

We denote the time taken by a worker to compute its local update at an iteration as a random variable. W.l.o.g., we assume that each worker consumes a different amount of time to compute its local update due to the different sizes of the available local training data. We denote the time a worker needs to collect updates from its fastest neighbors at an iteration as follows:

(16)

Define , which is a subset of . Therefore, the total time needed for all workers to complete updates at iteration satisfies

(17)

We approximate this random time by a deterministic constant such that the mean square error (MSE) is minimized. The MSE resulting from estimating the random time by a constant satisfies

(18)

where the integral is taken with respect to the probability density function (PDF) of the random time. A necessary condition for minimizing the MSE is obtained by taking its derivative w.r.t. the constant and setting it to zero, which yields the MSE estimator

(19)
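A one-line version of this derivation, with $\tau$ denoting the random iteration time, $f_\tau$ its PDF, and $c$ the deterministic constant (symbols ours), shows that the MSE-optimal constant is simply the mean:

```latex
% Minimizing the MSE of a constant estimate c of the random iteration time tau.
\[
\mathrm{MSE}(c) = \int (t-c)^2 f_\tau(t)\,\mathrm{d}t,
\qquad
\frac{\mathrm{d}\,\mathrm{MSE}(c)}{\mathrm{d}c}
  = -2\int (t-c) f_\tau(t)\,\mathrm{d}t = 0
\;\;\Longrightarrow\;\;
c^\star = \int t\, f_\tau(t)\,\mathrm{d}t = \mathbb{E}[\tau].
\]
```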

With this estimator, we will show that the length of one iteration with dynamic backup workers (i.e., partial worker participation) is smaller than that of conventional consensus-based methods with full worker participation. We denote the corresponding length of one iteration as and , respectively. We have the following result:

Corollary 4.

The iteration length under dynamic backup workers is no larger than that under full worker participation almost surely, i.e.,

(20)
Remark 4.

The expected time for one iteration with dynamic backup workers is smaller than that of full worker participation. Combined with results in Section 3.2.1 (Theorem 2), the convergence time of consensus-based distributed optimization methods with dynamic backup workers is reduced compared to that of conventional consensus-based methods. As we will numerically show in Section 5, our Algorithm 1 can dramatically reduce the convergence time.
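The intuition behind Corollary 4 can be checked with a quick Monte Carlo sketch: with backup workers, a worker's waiting time is the $k$-th smallest of its neighbors' compute times rather than the largest. The exponential delay model and all variable names below are our own illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
num_neighbors, k, trials = 10, 4, 100_000

# Simulated per-neighbor compute times for each iteration (exponential delays).
times = rng.exponential(scale=1.0, size=(trials, num_neighbors))
times.sort(axis=1)

wait_partial = times[:, k - 1]   # wait for the k fastest neighbors only
wait_full = times[:, -1]         # wait for all neighbors (full participation)

print(f"mean wait, partial (k={k}): {wait_partial.mean():.3f}")
print(f"mean wait, full           : {wait_full.mean():.3f}")
# The k-th order statistic is never larger than the maximum, matching (20).
assert np.all(wait_partial <= wait_full)
```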

4 Dynamic Backup Worker Selection

Given the convergence results for cb-DyBW, we now discuss how to determine the number of backup workers (equivalently, the number of active neighbors) for each worker at each iteration so that the total convergence time is minimized. From Corollary 3, the total number of required iterations consists of two parts: the first set of iterations performs stochastic gradient descent, while the second set is only needed for the convergence of the product of time-varying consensus matrices. We note that the major time consumption comes from the calculation of gradients, while the communication delay is negligible. Therefore, we only consider the time spent in the first set of iterations.

4.1 Distributed Threshold-Based Update Rule (DTUR)

We propose the following "threshold-based" rule for choosing the set of fastest neighbors from which each worker waits for and collects updated parameters in each iteration. Within each iteration, each worker waits a maximum amount of time for its neighbors to send updates and includes the neighbors heard from within that window in its active set; the remaining neighbors are treated as backup workers for that iteration. The rule is thus fully specified by the choice of the waiting threshold. We now analyze its performance under the following optimization problem

(21)

where for .

The optimization problem (21) is in general hard to solve. In the following, we relax it and propose a distributed algorithm to solve it. For a given communication graph, we first find the shortest path that connects all nodes in the network; let $\ell$ denote its length and consider its set of links. If there exists more than one such shortest path, we randomly select one. W.l.o.g., we take the connectivity window in (21) to be $\ell$, i.e., the dynamic consensus matrices are $\ell$-strongly connected. Our key insight is that (21) can be relaxed to finding the minimal time for every $\ell$ iterations during which all links of the shortest path have been visited at least once, i.e., all nodes in the network have shared information with each other. To this end, such an $\ell$-iteration procedure can be run multiple times until the parameters converge. More importantly, every $\ell$-iteration procedure is independent.

An $\ell$-iteration Procedure. W.l.o.g., we consider one particular $\ell$-iteration procedure, which we call an "epoch" and which consists of $\ell$ iterations. The algorithm aims to establish at least one new path link at each iteration so that the sequence of consensus matrices is $\ell$-strongly connected by the end of the epoch. At the beginning of each epoch, an empty set is created to store the links established during the epoch; it is reset to empty at the end of the epoch.

For simplicity, we index the iterations within an epoch from the beginning of that epoch. At each iteration, all workers start their local updates simultaneously. Once a worker completes its local update, it sends the update to its neighbors and waits to collect updates from its neighbors as well. If two workers successfully exchange their local updates and the corresponding link belongs to the shortest path but has not yet been established in this epoch, the link is added to the set of established links, and all workers move to the next iteration. Otherwise, the link is not added, and the iteration continues until one such link is established. The time of this link establishment is therefore the desired maximum waiting time for the iteration, i.e.,

(22)

Since the set of established links is reset at the end of each independent epoch, we can equivalently minimize the time of each epoch in the optimization problem (21), i.e.,

(23)

We summarize the workflow of the $\ell$-iteration procedure of DTUR in Algorithm 2.

1:Network .
2:Find a shortest path connecting all nodes and its length $\ell$.
3:for  do
4:     Set .
5:     for  do
6:         Find the first established link that is on the shortest path but not yet recorded, and store it in the set of established links;
7:         Define according to (22);
8:               
9:     
Algorithm 2 An $\ell$-iteration Procedure for the Distributed Threshold-Based Update Rule (DTUR)
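A centralized simulation of one epoch of Algorithm 2 is sketched below; a real deployment would run this logic in a distributed fashion over MPI. The random link-establishment times and all names (`path_links`, `sample_time`) are illustrative assumptions rather than the paper's interface.

```python
import random

def dtur_epoch(path_links, all_links, sample_time):
    """Simulate one epoch of the distributed threshold-based update rule.

    path_links  : set of links (i, j) on the shortest path connecting all nodes
                  (assumed to be a subset of all_links)
    all_links   : list of all links in the communication graph
    sample_time : callable (i, j) -> random time until workers i and j would
                  exchange updates in the current iteration
    Returns the per-iteration thresholds tau and the total epoch time, cf. (22)-(23).
    """
    established = set()       # links of the shortest path established this epoch
    thresholds = []
    while len(established) < len(path_links):
        # Fresh local updates each iteration -> fresh exchange times per link.
        times = {link: sample_time(*link) for link in all_links}
        # The iteration ends as soon as a not-yet-established path link appears;
        # earlier non-qualifying exchanges are ignored and the wait continues.
        candidates = {l: t for l, t in times.items()
                      if l in path_links and l not in established}
        link, tau = min(candidates.items(), key=lambda item: item[1])
        established.add(link)
        thresholds.append(tau)   # tau for this iteration, cf. (22)
    return thresholds, sum(thresholds)

# Example: a 4-node line graph 0-1-2-3 with random exchange times per iteration.
links = [(0, 1), (1, 2), (2, 3)]
taus, epoch_time = dtur_epoch(set(links), links, lambda i, j: random.random())
print(taus, epoch_time)
```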
Remark 5.

Algorithm 2 requires each worker to store a local copy of the set of established links, which requires only a small amount of memory. When a desired link is established, the two endpoint workers broadcast this information to the entire network so that each worker can update its copy. In addition, the two endpoint workers need to send a command to the remaining workers to terminate the current iteration. The overall communication overhead is linear in the number of workers.

5 Numerical Results

In this section, we conduct extensive experiments to validate our model and theoretical results. With a slight abuse of notation, we use cb-DyBW to denote the algorithm that uses the distributed threshold-based update rule (Algorithm 2) to determine the number of dynamic backup workers in Algorithm 1. We implement cb-DyBW in TensorFlow [1] using the Network File System (NFS) and MPI backends, where files are shared among workers through NFS and communication between workers is via MPI.

Figure 1: Performance of cb-DyBW and cb-Full for the LRM model under MNIST (top) and CIFAR-10 (bottom). The straight line in (c) corresponds to the average number of backup workers over the iterations.

We compare cb-DyBW with the conventional consensus-based distributed optimization method in which all workers participate throughout training; we call this benchmark cb-Full. cb-Full suffers from the straggler issue since it must wait for the slowest workers in every iteration, which often leads to a longer convergence time than cb-DyBW.

We evaluate cb-DyBW and cb-Full on multi-class classification problems. We use different models, including the Logistic Regression Model (LRM) and a fully-connected neural network with two hidden layers (2NN), on the MNIST [26, 45] and CIFAR-10 [25, 44] datasets. The MNIST dataset contains handwritten digits, with 60,000 samples for training and 10,000 samples for testing. The CIFAR-10 dataset consists of color images in 10 classes, with 50,000 samples for training and 10,000 samples for testing. We consider a network of workers and randomly generate a connected graph for evaluation. The loss function is the cross-entropy loss. For ease of exposition, we relegate some experimental results, including tests of different models and the impact of different network topologies with different numbers of workers, to Appendix B.

To improve training efficiency, we reduce the dimensionality of MNIST (each sample has dimension 784) and CIFAR-10 (each sample has dimension 3,072) through the widely used principal component analysis (PCA) [49]. The learning rate is perhaps the most critical hyperparameter in distributed ML optimization problems. A proper value of the learning rate is important; however, it is in general hard to find the optimal value. The standard recommendation is to set the learning rate proportional to the aggregate batch size under full worker participation [17, 38]. With dynamic backup workers, it is reasonable to set the learning rate adaptively [10, 52]; hence, we adapt the learning rate to the number of participating workers in each iteration. Finally, the batch size is another important hyperparameter, which is limited by the memory and computational resources available at each worker, or determined by the generalization performance of the final model [20]. We test the impact of the batch size on these two datasets and identify a suitable value. For ease of exposition, we relegate the detailed comparisons to Appendix B.

Figure 1 shows the testing error, training loss, and iteration duration of cb-DyBW and cb-Full for the LRM model on the MNIST (top) and CIFAR-10 (bottom) datasets, as well as the number of dynamic backup workers for cb-DyBW. We observe that the number of iterations required for convergence is similar (in an order sense) for cb-DyBW and cb-Full, consistent with our theoretical results in Theorem 2. However, Figure 1(c) shows that cb-DyBW dramatically reduces the duration of one iteration, by 65%-70% on average compared to cb-Full. This is because our proposed framework and algorithm dynamically and adaptively determine the number of backup workers for each worker during training so as to mitigate the effect of stragglers. As a result, cb-DyBW significantly reduces the convergence time compared to cb-Full for a given accuracy. Finally, Figure 1(d) shows that the number of backup workers changes dynamically over the course of training, which further validates our motivation and model.

6 Conclusions

In this paper, we considered mitigating the effect of stragglers in distributed machine learning via dynamic backup workers. We formulated the consensus-based distributed optimization problem with dynamic backup workers to minimize the total training time for a given convergence accuracy. We proposed cb-DyBW and analyzed its convergence. We proved that cb-DyBW achieves a linear speedup for convergence and can dramatically reduce the convergence time compared to conventional consensus-based methods. We further proposed a threshold-based rule to dynamically select backup workers during training. Finally, we provided empirical experiments to validate our theoretical results.

References

  • [1] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al. (2016) Tensorflow: A System for Large-Scale Machine Learning. In Proc. of USENIX OSDI, Cited by: §5.
  • [2] D. Amodei, S. Ananthanarayanan, R. Anubhai, J. Bai, E. Battenberg, C. Case, J. Casper, B. Catanzaro, Q. Cheng, G. Chen, et al. (2016) Deep Speech 2: End-to-End Speech Recognition in English and Mandarin. In Proc. of ICML, Cited by: §1.
  • [3] G. Ananthanarayanan, A. Ghodsi, S. Shenker, and I. Stoica (2013) Effective Straggler Mitigation: Attack of the Clones. In Proc. of USENIX NSDI, Cited by: §1.
  • [4] R. Bekkerman, M. Bilenko, and J. Langford (2011) Scaling Up Machine Learning: Parallel and Distributed Approaches. Cambridge University Press. Cited by: §1.
  • [5] D. P. Bertsekas and J. N. Tsitsiklis (1989) Parallel and Distributed Computation: Numerical Methods. Vol. 23, Prentice hall Englewood Cliffs, NJ. Cited by: §1.
  • [6] L. Bottou, F. E. Curtis, and J. Nocedal (2018) Optimization Methods for Large-Scale Machine Learning. SIAM Review 60 (2), pp. 223–311. Cited by: §A.1, §3.1.
  • [7] S. Boyd, A. Ghosh, B. Prabhakar, and D. Shah (2006) Randomized Gossip Algorithms. IEEE Transactions on Information Theory 52 (6), pp. 2508–2530. Cited by: §1.
  • [8] S. Boyd, N. Parikh, and E. Chu (2011) Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers. Now Publishers Inc. Cited by: §1, §1.
  • [9] S. Bubeck et al. (2015) Convex Optimization: Algorithms and Complexity. Foundations and Trends® in Machine Learning 8 (3-4), pp. 231–357. Cited by: §2.1, §3.1.
  • [10] J. Chen, X. Pan, R. Monga, S. Bengio, and R. Jozefowicz (2016) Revisiting Distributed Synchronous SGD. arXiv preprint arXiv:1604.00981. Cited by: Appendix B, §1, §1, §5.
  • [11] I. Colin, A. Bellet, J. Salmon, and S. Clémençon (2016) Gossip Dual Averaging for Decentralized Optimization of Pairwise Functions. In Proc. of ICML, Cited by: §1.
  • [12] W. Dai, Y. Zhou, N. Dong, H. Zhang, and E. Xing (2018) Toward Understanding the Impact of Staleness in Distributed Machine Learning. In Proc. of ICLR, Cited by: §1.
  • [13] J. Dean and L. A. Barroso (2013) The Tail at Scale. Communications of the ACM 56, pp. 74–80. Cited by: §1.
  • [14] J. Dean, G. S. Corrado, R. Monga, K. Chen, M. Devin, Q. V. Le, M. Z. Mao, M. Ranzato, A. Senior, P. Tucker, et al. (2012) Large Scale Distributed Deep Networks. In Proc. of NIPS, Cited by: §1.
  • [15] J. C. Duchi, A. Agarwal, and M. J. Wainwright (2011) Dual Averaging for Distributed Optimization: Convergence Analysis and Network Scaling. IEEE Transactions on Automatic Control 57 (3), pp. 592–606. Cited by: §1.
  • [16] A. Gibiansky (2017) https://github.com/baidu-research/baidu-allreduce. Cited by: §1.
  • [17] P. Goyal, P. Dollár, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He (2017) Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour. arXiv preprint arXiv:1706.02677. Cited by: Appendix B, §1, §5.
  • [18] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep Residual Learning for Image Recognition. In Proc. of IEEE CVPR, Cited by: §1.
  • [19] Q. Ho, J. Cipar, H. Cui, J. K. Kim, S. Lee, P. B. Gibbons, G. A. Gibson, G. R. Ganger, and E. P. Xing (2013) More Effective Distributed ML Via A Stale Synchronous Parallel Parameter Server. In Proc. of NIPS, Cited by: §1, §1.
  • [20] E. Hoffer, I. Hubara, and D. Soudry (2017) Train Longer, Generalize Better: Closing the Generalization Gap in Large Batch Training of Neural Networks. In Proc. of NIPS, Cited by: Appendix B, §5.
  • [21] B. Johansson, M. Rabi, and M. Johansson (2010) A Randomized Incremental Subgradient Method for Distributed Optimization in Networked Systems. SIAM Journal on Optimization 20 (3), pp. 1157–1170. Cited by: §1.
  • [22] P. Kairouz, H. B. McMahan, B. Avent, A. Bellet, M. Bennis, A. N. Bhagoji, K. Bonawitz, Z. Charles, G. Cormode, R. Cummings, et al. (2019) Advances and Open Problems in Federated Learning. arXiv preprint arXiv:1912.04977. Cited by: §3.1.
  • [23] C. Karakus, Y. Sun, S. Diggavi, and W. Yin (2017) Straggler Mitigation in Distributed Optimization Through Data Encoding. Proc. of NIPS. Cited by: §1.
  • [24] D. Kempe, A. Dobra, and J. Gehrke (2003) Gossip-Based Computation of Aggregate Information. In Proc. of IEEE FOCS, Cited by: §1.
  • [25] A. Krizhevsky, G. Hinton, et al. (2009) Learning Multiple Layers of Features from Tiny Images. Cited by: Appendix B, §5.
  • [26] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner (1998) Gradient-Based Learning Applied to Document Recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324. Cited by: Appendix B, §5.
  • [27] M. Li, D. G. Andersen, J. W. Park, A. J. Smola, A. Ahmed, V. Josifovski, J. Long, E. J. Shekita, and B. Su (2014) Scaling Distributed Machine Learning with The Parameter Server. In Proc. of USENIX OSDI, Cited by: §1.
  • [28] M. Li, D. G. Andersen, A. J. Smola, and K. Yu (2014) Communication Efficient Distributed Machine Learning with the Parameter Server. In Proc. of NIPS, Cited by: §1.
  • [29] X. Lian, C. Zhang, H. Zhang, C. Hsieh, W. Zhang, and J. Liu (2017) Can Decentralized Algorithms Outperform Centralized Algorithms? A Case Study for Decentralized Parallel Stochastic Gradient Descent. In Proc. of NIPS, Cited by: §1, §1, Remark 3.
  • [30] X. Lian, W. Zhang, C. Zhang, and J. Liu (2018) Asynchronous Decentralized Parallel Stochastic Gradient Descent. In Proc. of ICML, Cited by: §1.
  • [31] Q. Luo, J. Lin, Y. Zhuo, and X. Qian (2019) Hop: Heterogeneity-Aware Decentralized Training. In Proc. of ACM ASPLOS, Cited by: §1, §1.
  • [32] A. Nedić, A. Olshevsky, and M. G. Rabbat (2018) Network Topology and Communication-Computation Tradeoffs in Decentralized Optimization. Proceedings of the IEEE 106 (5), pp. 953–976. Cited by: §3.1, §3.1.
  • [33] A. Nedic and A. Ozdaglar (2009) Distributed Subgradient Methods for Multi-Agent Optimization. IEEE Transactions on Automatic Control 54 (1), pp. 48–61. Cited by: §A.4, §1, §3.1, Lemma 2.
  • [34] S. S. Ram, A. Nedić, and V. V. Veeravalli (2010) Distributed Stochastic Subgradient Projection Algorithms for Convex Optimization. Journal of Optimization Theory and Applications 147 (3), pp. 516–545. Cited by: §1.
  • [35] Y. Ruan, X. Zhang, S. Liang, and C. Joe-Wong (2020) Towards Flexible Device Participation in Federated Learning for Non-IID Data. arXiv preprint arXiv:2006.06954. Cited by: §1.
  • [36] K. Scaman, F. Bach, S. Bubeck, Y. T. Lee, and L. Massoulié (2017) Optimal Algorithms for Smooth and Strongly Convex Distributed Optimization in Networks. In Proc. of ICML, Cited by: §1.
  • [37] K. Scaman, F. Bach, S. Bubeck, L. Massoulié, and Y. T. Lee (2018) Optimal Algorithms for Non-Smooth Distributed Optimization in Networks. In Proc. of NeurIPS, Cited by: §1.
  • [38] S. L. Smith, P. Kindermans, C. Ying, and Q. V. Le (2018) Don’t Decay the Learning Rate, Increase the Batch Size. In Proc. of ICLR, Cited by: Appendix B, §5.
  • [39] A. Smola and S. Narayanamurthy (2010) An Architecture for Parallel Topic Models. Proc. of VLDB. Cited by: §1.
  • [40] H. Tang, S. Gan, C. Zhang, T. Zhang, and J. Liu (2018) Communication Compression for Decentralized Training. In Proc. of NeurIPS, Cited by: §1, §3.1.
  • [41] H. Tang, X. Lian, M. Yan, C. Zhang, and J. Liu (2018) Decentralized Training Over Decentralized Data. In Proc. of ICML, Cited by: §1.
  • [42] Z. Tang, S. Shi, X. Chu, W. Wang, and B. Li (2020) Communication-Efficient Distributed Deep Learning: A Comprehensive Survey. arXiv preprint arXiv:2003.06307. Cited by: §1, §3.1.
  • [43] M. Teng and F. Wood (2018) Bayesian Distributed Stochastic Gradient Descent. Proc. of NeurIPS. Cited by: §1.
  • [44] The CIFAR-10 Dataset http://www.cs.toronto.edu/~kriz/cifar.html. Cited by: Appendix B, §5.
  • [45] The MNIST Database http://yann.lecun.com/exdb/mnist/. Cited by: Appendix B, §5.
  • [46] J. Tsitsiklis, D. Bertsekas, and M. Athans (1986) Distributed Asynchronous Deterministic and Stochastic Gradient Optimization Algorithms. IEEE Transactions on Automatic Control 31 (9), pp. 803–812. Cited by: §1.
  • [47] L. G. Valiant (1990) A Bridging Model for Parallel Computation. Communications of the ACM 33 (8), pp. 103–111. Cited by: §1.
  • [48] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017) Attention Is All You Need. In Proc. of NIPS, Cited by: §1.
  • [49] S. Wold, K. Esbensen, and P. Geladi (1987) Principal Component Analysis. Chemometrics and intelligent laboratory systems 2 (1-3), pp. 37–52. Cited by: Appendix B, §5.
  • [50] L. Xiao, S. Boyd, and S. Lall (2006) Distributed Average Consensus with Time-Varying Metropolis Weights. Automatica. Cited by: §A.4, §3.1, Lemma 1.
  • [51] L. Xiao and S. Boyd (2004) Fast Linear Iterations for Distributed Averaging. Systems & Control Letters 53 (1), pp. 65–78. Cited by: §1.
  • [52] C. Xu, G. Neglia, and N. Sebastianelli (2020) Dynamic Backup Workers for Parallel Machine Learning. In Proc. of IFIP Networking, Cited by: Appendix B, §1, §5.
  • [53] H. Yang, M. Fang, and J. Liu (2021) Achieving Linear Speedup with Partial Worker Participation in Non-IID Federated Learning. In Proc. of ICLR, Cited by: §1.

Appendix A Proofs of Main Results

In this section, we provide the proofs of the theoretical results presented in the paper. We first prove that the gradients tend to zero after a large number of iterations of cb-DyBW (Theorem 1), and then provide the proof of the convergence of cb-DyBW (Theorem 2).

A.1 Proof of Theorem 1

Proof.

We define the product of consensus matrices over a window of iterations and index its entries by row and column. Therefore, with the starting iteration fixed, the local estimates can be expressed in terms of this product as

(24)

Next, we define an auxiliary variable satisfying

(25)

for which the following relation holds

(26)

From Assumption 3, we have the following inequality [6]:

(27)

As a result, we need to bound the two terms on the right-hand side of (27). Note that the first of these terms can be bounded as follows

(28)

where the first step follows from (26) and the fact that the stochastic gradient is an unbiased estimator of the true gradient, the second step comes from a standard mathematical manipulation, the subsequent equality follows from an elementary identity, and the last inequality is due to the triangle inequality.

Next, we bound the remaining term as follows

(29)