Dynamic backup workers for parallel machine learning

04/30/2020 ∙ by Chuan Xu, et al. ∙ Inria

The most popular framework for distributed training of machine learning models is the (synchronous) parameter server (PS). This paradigm consists of n workers, which iteratively compute updates of the model parameters, and a stateful PS, which waits and aggregates all updates to generate a new estimate of model parameters and sends it back to the workers for a new iteration. Transient computation slowdowns or transmission delays can intolerably lengthen the time of each iteration. An efficient way to mitigate this problem is to let the PS wait only for the fastest n-b updates, before generating the new parameters. The slowest b workers are called backup workers. The optimal number b of backup workers depends on the cluster configuration and workload, but also (as we show in this paper) on the hyper-parameters of the learning algorithm and the current stage of the training. We propose DBW, an algorithm that dynamically decides the number of backup workers during the training process to maximize the convergence speed at each iteration. Our experiments show that DBW 1) removes the necessity to tune b by preliminary time-consuming experiments, and 2) makes the training up to a factor 3 faster than the optimal static configuration.


1 Introduction

In 2014, Google's Sibyl machine learning (ML) platform was already processing hundreds of terabytes through thousands of cores to train models with hundreds of billions of parameters canini14. At this scale, no single machine can solve these problems in a timely manner, and, as time goes on, the need for efficient distributed solutions becomes even more urgent. These distributed systems are different from those used for traditional applications like transaction processing or data analytics, because of statistical and algorithmic characteristics unique to ML programs, like error tolerance, structural dependencies, and non-uniform convergence of parameters xing16. Currently, their operation requires a number of ad-hoc choices and time-consuming tuning through trial and error, e.g., to decide how to distribute ML programs over a cluster or how to bridge ML computation with inter-machine communication. For this reason, significant research effort (also from the networking community Harlap; wang19; shi19; neglia19infocom; bao19; CChen; Shi) is devoted to designing adaptive algorithms for a more effective use of computing resources for ML training.

Currently, the most popular template for distributed ML training is the parameter server (PS) framework Li:2014. This paradigm consists of workers, which perform the bulk of the computation, and a stateful parameter server, which maintains the current version of the model parameters. Workers use locally available versions of the model to compute "delta" updates of the parameters, e.g., through a gradient descent step. These updates are then aggregated by the PS and combined with its current state to produce a new estimate of the optimal parameter vector.

If the PS waits for all workers before updating the parameter vector (synchronous operation), stragglers, i.e., slow tasks, can significantly reduce computation speed in a multi-machine setting ananthanarayanan13; karakus17; LiKAS18. Transient slowdowns are common in computing systems (especially in shared ones) and have many causes, such as resource contention, background OS activities, garbage collection, and (for ML tasks) stopping criteria calculations. Alternatively, the PS can operate asynchronously, updating the parameter vector as soon as it receives the result of a single worker. While this approach increases system throughput (parameter updates per time unit), some workers may operate on stale versions of the parameter vector, slowing and, in some cases, even preventing convergence to the optimal model DaiZDZX19. A simple solution that does not jeopardize convergence, while mitigating the effect of stragglers, is to rely on backup workers Jianmin2016: instead of waiting for the updates from all the n workers, the PS waits only for the fastest n - b of them before proceeding to the next iteration. The remaining b workers are called backup workers (we stick to the name used in the original paper Jianmin2016, even if it is somewhat misleading: backup workers do not replace other workers when needed; in fact, all workers operate identically, and which workers are the backup ones changes from one iteration to the next depending on their execution times at that specific iteration). Experiments on a Google cluster show that a few backup workers can reduce the training time by 30% in comparison to the synchronous PS and by 20% in comparison to the asynchronous PS Jianmin2016.

The number of backup workers b has a double effect on the convergence speed. The larger b is, the faster each iteration is, because the PS needs to wait for fewer inputs from the workers. At the same time, the PS aggregates less information, so the model update is noisier and more iterations are required to converge. Currently, the number of backup workers is configured manually through some preliminary experiments, before the actual training process starts. However, the optimal static setting is highly sensitive to the cluster configuration (e.g., GPU performance and connectivity) as well as to its instantaneous workload. Both cluster configuration and workload may be unknown to the users (especially in a virtualized cloud setting) and may change as new jobs arrive at or depart from the cluster. Moreover, in this paper we show that the optimal number of backup workers 1) is also affected by the choice of hyper-parameters (a hyper-parameter is a parameter of the learning algorithm, not of the model, but it can still influence the final model learned) like the batch size, and 2) changes during the training itself as the loss function approaches a (local) minimum. Therefore, the static configuration of backup workers not only requires time-consuming experiments, but is also particularly inefficient and fragile.

In this paper we propose the algorithm DBW (Dynamic Backup Workers), which dynamically adapts the number of backup workers during the training process without prior knowledge about the cluster or the optimization problem. Our algorithm identifies the sweet spot between the two contrasting effects of b (reducing the duration of an iteration versus increasing the number of iterations needed for convergence) by maximizing at each iteration the decrease of the loss function per time unit.

The paper is organized as follows. Sect. 2 provides relevant background and introduces the notation. Sect. 3 illustrates the different components of our algorithm DBW together with their preliminary assessment. DBW is then evaluated on ML problems in Sect. 4. The results show that DBW is robust to different cluster environments and different hyper-parameter settings. DBW not only removes the need to configure an additional parameter (the number of backup workers b) through costly experiments, but also reduces the training time by a factor as large as 3 in comparison to the best static configuration. Sect. 5 concludes the paper and discusses future research directions. The code of our implementation is available online github_dbw.

2 Background and notation

Given a dataset D, the training of ML models usually requires finding a parameter vector w minimizing a loss function:

(1)    F(w) = \frac{1}{|D|} \sum_{s \in D} f(w, s),

where f(w, s) is the loss of the model w on the datapoint s. For example, in supervised learning, each point of the dataset is a pair s = (x, y), consisting of an input object x and a desired output value y. In the standard linear regression method, the input-output function is a linear one (\hat{y} = w^\top x) and the loss function is the mean squared error f(w, (x, y)) = (y - w^\top x)^2. More complex models like neural networks look for an input-output mapping in a much larger and more flexible family of functions, but they are trained by solving an optimization problem like (1).

The standard way to solve Problem (1) is to use an iterative gradient method. Let n be the number of workers (e.g., GPUs) available. In a synchronous setting without backup workers, at each iteration t the PS sends the current estimate w_t of the parameter vector to all workers. Each worker then computes a stochastic gradient on a random mini-batch of size B drawn from its local dataset. We assume each worker has access to the complete dataset, as is reasonable in the cluster setting that we consider. Each worker sends the stochastic gradient back to the PS. We denote by g_i(w_t) the i-th worker gradient received by the PS at iteration t, i.e.,

(2)    g_i(w_t) = \frac{1}{B} \sum_{s \in \mathcal{B}_i} \nabla f(w_t, s),

where \mathcal{B}_i is the random mini-batch of size B on which the gradient has been computed. Once the n gradients are received, the PS computes the average gradient

g(w_t) = \frac{1}{n} \sum_{i=1}^{n} g_i(w_t)

and updates the parameter vector as follows:

(3)    w_{t+1} = w_t - \eta \, g(w_t),

where η is called the learning rate.

When b backup workers are used Jianmin2016, the PS only waits for the first k = n - b gradients and then evaluates the average gradient as

(4)    g(w_t) = \frac{1}{k} \sum_{i=1}^{k} g_i(w_t).

In our dynamic algorithm (Sect. 3), the value of k is no longer static, but changes in an adaptive manner from one iteration to the other, ensuring faster convergence. We denote by k_t the number of gradients the PS needs to wait for at iteration t, and by T_t(k_t) the time interval between the update of the parameter vector w_t at the PS and the reception of the k_t-th gradient of w_t.
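To make the mechanics concrete, here is a minimal sketch of one PS iteration with backup-worker semantics, written with Python's concurrent.futures; the helper grad_fn, standing for a worker computing a stochastic gradient on a mini-batch of size B, is an assumption for illustration rather than the paper's implementation:

```python
from concurrent.futures import as_completed
import numpy as np

def ps_iteration(w, k, eta, n, grad_fn, executor):
    """One synchronous iteration with n - k backup workers (illustrative sketch).

    grad_fn(i, w) stands for worker i computing a stochastic gradient g_i(w)
    on a mini-batch of size B; executor can be e.g. a ThreadPoolExecutor.
    """
    futures = [executor.submit(grad_fn, i, w) for i in range(n)]
    fastest = []
    for f in as_completed(futures):      # gradients are collected in order of arrival
        fastest.append(f.result())
        if len(fastest) == k:            # stop waiting: the other n - k workers are the backup workers
            break
    g = np.mean(fastest, axis=0)         # Eq. (4): average of the k fastest gradients
    return w - eta * g                   # Eq. (3): gradient step with learning rate eta
```

In DBW, k additionally changes from one iteration to the next, and the completion times of the remaining n - k workers are still recorded for the estimation described in Sect. 3.2.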

The general backup-workers scheme can be implemented in different ways, with quite different performance. There are two general ways to synchronize the PS and the workers: either the PS pushes the updated parameter vector to the workers, or the workers pull the most updated parameter vector from the PS.

Pull (Pl)

Whenever available to perform a new computation, a worker pulls the most updated parameter vector from the PS. Google's framework for distributed ML, TensorFlow 1.x tensorflow, implements Pl through a shared blocking FIFO queue, where the PS enqueues tokens indicating the corresponding iteration number. Whenever a worker becomes idle, it dequeues a token from the queue and retrieves the parameter vector directly from the PS. (We describe what appears to be an inefficient implementation: the parameter vector retrieved by the worker may correspond to a more recent iteration than the one indicated in the token. Nevertheless, the corresponding gradient is still associated to the old iteration and will then be discarded at the PS. The worker may thus start a computation that is already known to be useless!)

Push & Interrupt (PsI)

After the PS computes the new parameter vector w_{t+1}, it pushes w_{t+1} to all workers, which interrupt any ongoing computation to start computing a new gradient at w_{t+1}. Interrupts can be implemented in different ways. For example, in (teng18neurips, Algo. 2), the main thread at each worker creates a specific thread for each gradient computation and keeps listening for a new parameter vector; once the worker receives the new one from the PS, the computing thread is killed. However, the overhead of creating and destroying threads at run time is not negligible, since it requires run-time memory allocation and de-allocation, which may even slow down the system Ling:2000:AOT:346152.346320. In amiri2018computation, the same thread performs the computation but periodically checks for new parameter vectors from the PS; when the worker receives a new parameter vector, it stops its ongoing computation. The performance of this interrupt mechanism depends on how often workers listen for messages from the PS.

Push & Wait (PsW)

The PS pushes the new parameter vector w_{t+1} to each worker as in PsI, but the worker completes its current computation before dequeuing the most recent parameter vector from a local queue. PsW can be easily implemented using MPI non-blocking communication primitives LiKAS18 or the FIFO queue provided in TensorFlow LuoLZQ19.


Our algorithm works with any of the variants listed above, with minor adaptations. We have implemented and tested it with both PsI and PsW in the PyTorch framework pytorch. Results are similar; therefore, in what follows, we refer only to PsW.
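For illustration, the following sketch shows a PsW worker loop that never interrupts a computation and always keeps only the freshest parameter vector from a local queue; the queue, the gradient routine, and the send primitive are assumed placeholders:

```python
import queue

def psw_worker(param_queue: "queue.Queue", compute_gradient, send_to_ps, stop_event):
    """Push & Wait worker sketch: finish the current gradient computation, then
    dequeue the most recent parameter vector pushed by the PS (all helpers are stubs)."""
    w, t = param_queue.get()                   # initial parameter vector and iteration number
    while not stop_event.is_set():
        g = compute_gradient(w)                # never interrupted, unlike PsI
        send_to_ps(g, t)                       # notify the PS even if the gradient is already stale
        w, t = param_queue.get()               # block until at least one new version is available
        while not param_queue.empty():         # then keep only the most recent one
            w, t = param_queue.get()
```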

To the best of our knowledge, the only other work proposing to dynamically adapt the number of backup workers is teng18neurips. The authors consider a PsI approach. The PS uses a deep neural network to predict the time needed to collect k new gradients, and then greedily chooses k as the value that maximizes the number of collected gradients per time unit. This neural network for time-series forecasting itself needs to be trained in advance for each cluster and each ML model to be learned. No result is provided in teng18neurips about the duration of this additional training phase or its sensitivity to changes in the cluster and/or ML models. Our algorithm DBW also selects k to maximize a similar ratio, but 1) it replaces the numerator by the expected decrease of the loss function, and 2) it uses a simple estimator for the iteration duration that does not require any preliminary training. Moreover, the results in teng18neurips do not show a clear advantage of the proposed mechanism in comparison to the static setting suggested in Jianmin2016 (see (teng18neurips, Fig. 4)). Our experiments in Sect. 4 confirm that considering a gain proportional to k, as in teng18neurips, is indeed too simplistic (and leads to worse results than DBW).

Our approach to estimate the loss decrease as a function of k is inspired by the work balles17uai, which evaluates the loss decrease as a function of the batch size. In fact, aggregating k gradients, each computed on a mini-batch of B samples, is almost equivalent to computing a single gradient on a mini-batch of kB samples.
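The equivalence rests on the usual variance scaling for averages of independent mini-batch gradients; writing σ_1² for the variance of a single-sample gradient (a symbol introduced here only for this remark), we have:

```latex
\operatorname{Var}\!\left[\frac{1}{k}\sum_{i=1}^{k} g_i(w_t)\right]
  = \frac{\sigma_t^2}{k}
  = \frac{\sigma_1^2}{kB},
\qquad \text{since } \sigma_t^2 = \frac{\sigma_1^2}{B},
```

which is exactly the variance of a single gradient computed on a mini-batch of kB samples.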

While our algorithm adapts the number of backup workers given an available pool of n workers, the authors of wang19 propose a reinforcement learning algorithm to adapt the number of workers itself, in order to minimize the training time under a budget constraint. That algorithm and DBW are thus complementary: once the pool of workers is selected with the approach in wang19, DBW can be applied to tune the number of backup workers.

3 Dynamic backup workers

The rationale behind our algorithm DBW is to adaptively select k_t in order to maximize the ratio between the expected decrease of the empirical loss and the expected duration of the iteration, i.e., to greedily maximize the decrease of the empirical loss per time unit. We decide k_t just after the update of w_t (in principle, the choice of k_t could be refined upon the arrival of the first gradients of w_t). In the following subsections, we detail how both numerator and denominator can be estimated, and how they depend on k. The notation is listed in Table 1.

t : iteration number
n : number of workers
w_t : parameter vector at iteration t
F : (global) loss function to minimize
B : batch size
η : learning rate
L : smoothness constant of F (Lipschitz constant of ∇F)
g_i(w_t) : i-th stochastic gradient the PS receives at iteration t
σ_t² : variance of g_i(w_t) (summed over its components)
k_t : number of stochastic gradients the PS waits for at iteration t
g(w_t) : average gradient at iteration t
G_t(k) : gain (expected loss decrease) if the PS receives k gradients
T_t(k) : time between the update of w_t and the reception of the k-th gradient at the PS
T_t(j, m) : time between the update of w_t and the reception of the j-th gradient at the PS, when the PS has waited for m gradients at iteration t - 1
T(j, m) : random variable from which the values T_t(j, m) are assumed to be sampled
S_t(j, m) : set of samples of T(j, m) available up to iteration t
Table 1: Notation

3.1 Empirical Loss Decrease

We assume that the empirical loss function F is L-smooth, i.e., there exists a constant L such that

(5)    \|\nabla F(w) - \nabla F(w')\| \le L \, \|w - w'\|, \quad \text{for all } w, w'.

Smoothness is a standard assumption in convergence results for gradient methods (see for example bubeck15; bottou18). In our experiments we show that DBW reduces the convergence time also when the loss is not a smooth function. From (5) and (3) it follows (see (bottou18, Sect. 4.1) for a proof):

(6)    F(w_{t+1}) \le F(w_t) - \eta \, \nabla F(w_t)^\top g(w_t) + \frac{\eta^2 L}{2} \|g(w_t)\|^2.

In order to select k_t, DBW uses the lower bound on the loss decrease implied by (6) as a proxy for the loss decrease itself. We note, however, that this bound depends on the value of k_t (see (4)) and on the random mini-batches drawn at the workers, so at the moment of deciding k_t it is a random variable. We consider then its expected value (over the possible choices of the mini-batches). We call it the gain and denote it by G_t(k), i.e.:

(7)    G_t(k) = \mathbb{E}\!\left[ \eta \, \nabla F(w_t)^\top g(w_t) - \frac{\eta^2 L}{2} \|g(w_t)\|^2 \right].

Each stochastic gradient is an unbiased estimator of the full gradient, so that \mathbb{E}[g(w_t)] = \nabla F(w_t). Moreover, for any random variable X, it holds that \mathbb{E}[X^2] = \mathbb{E}[X]^2 + \operatorname{Var}(X). Applying this relation to each component of the vector g(w_t), and then summing up, we obtain:

(8)    \mathbb{E}\!\left[ \|g(w_t)\|^2 \right] = \|\nabla F(w_t)\|^2 + \frac{\sigma_t^2}{k},

where σ_t² denotes the sum of the variances of the different components of g_i(w_t). Notice that σ_t² does not depend on i, because each worker has access to the complete dataset. Then, combining (7) and (8), G_t(k) can be rewritten as

(9)    G_t(k) = \eta \, \|\nabla F(w_t)\|^2 - \frac{\eta^2 L}{2} \left( \|\nabla F(w_t)\|^2 + \frac{\sigma_t^2}{k} \right).

When full-batch gradient descent is used (so that σ_t² = 0), the optimal learning rate is η = 1/L, because it maximizes the expected gain. With this choice of the learning rate, Eq. (9) becomes:

(10)    G_t(k) = \frac{1}{2L} \left( \|\nabla F(w_t)\|^2 - \frac{\sigma_t^2}{k} \right).

When the loss is not L-smooth, or the constant L is unknown, the learning rate is selected through some preliminary experiments (details in Sect. 4). We assume that (10) still holds.
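For completeness, the elementary maximization behind the choice η = 1/L in the full-batch case (σ_t² = 0) is:

```latex
\frac{d}{d\eta}\left[\eta\,\|\nabla F(w_t)\|^2 - \frac{\eta^2 L}{2}\,\|\nabla F(w_t)\|^2\right]
  = (1 - \eta L)\,\|\nabla F(w_t)\|^2 = 0
  \quad\Longrightarrow\quad \eta^\star = \frac{1}{L}.
```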

Equation (10) shows that the gain increases as k increases. This corresponds to the fact that the more gradients are aggregated at the PS, the closer the average gradient g(w_t) is to its expected value ∇F(w_t), i.e., to the steepest descent direction for the loss function. We also remark that the sensitivity of the gain to k depends on the relative ratio of ‖∇F(w_t)‖² and σ_t², which keeps changing during the training (see for example Fig. 1). Correspondingly, we can expect the optimal value of k to vary during the training process, even when computation and communication times do not change in the cluster. Experiments in Sect. 4 confirm this point.

Computing the exact value of G_t(k) would require the workers to process the whole dataset, leading to much longer iterations. We rather want to evaluate G_t(k) with limited overhead for the workers. In what follows, we discuss how to estimate σ_t² and ‖∇F(w_t)‖² in order to approximate G_t(k) in (10). We first provide estimators that use information available at the end of iteration t, i.e., after k_t has been selected and the fastest k_t gradients have been received. Then, we build from these estimators new ones that can be computed at the beginning of the iteration and can thus be used to select k_t. Given a quantity to be estimated at iteration t, we denote the first estimator with a hat and the second one with a tilde.

(a) Gradient norm  (b) Gradient variance  (c) Loss decrease
Figure 1: Estimation of the loss decrease (MNIST; estimates computed over a moving window of past iterations).

We start by estimating σ_t² through the usual unbiased estimator of the variance:

(11)    \hat{\sigma}_t^2 = \frac{1}{k_t - 1} \sum_{i=1}^{k_t} \| g_i(w_t) - g(w_t) \|^2.

It would be possible to obtain more precise estimates (even when k_t = 1) if each worker could estimate σ_t² from its own mini-batch. As GPUs' low-level APIs do not provide access to such information, we do not further develop the corresponding formulas here.

Next, we study the estimator of ‖∇F(w_t)‖². First, we can trivially use g(w_t) to estimate ∇F(w_t). Since \mathbb{E}[\|g(w_t)\|^2] = \|\nabla F(w_t)\|^2 + \sigma_t^2/k_t (from (8)), we can estimate ‖∇F(w_t)‖² as follows:

(12)    \widehat{\|\nabla F(w_t)\|^2} = \max\!\left( \|g(w_t)\|^2 - \frac{\hat{\sigma}_t^2}{k_t}, \; 0 \right),

where the max(·, 0) operation guarantees non-negativity of the estimate.

Estimates (11) and (12) cannot be computed at the beginning of iteration t, but it is possible to compute them for earlier iterations and use these past estimates to predict the current values. DBW simply averages the estimates of the last d iterations (or of all the past iterations, if fewer than d are available), where d is the length of a moving window, i.e.,

(13)    \tilde{\sigma}_t^2 = \frac{1}{d} \sum_{j=1}^{d} \hat{\sigma}_{t-j}^2,
(14)    \widetilde{\|\nabla F(w_t)\|^2} = \frac{1}{d} \sum_{j=1}^{d} \widehat{\|\nabla F(w_{t-j})\|^2}.

Combining (10), (13) and (14), the estimate of the gain is

(15)    \hat{G}_t(k) = \frac{1}{2L} \left( \widetilde{\|\nabla F(w_t)\|^2} - \frac{\tilde{\sigma}_t^2}{k} \right).

Note that the constant 1/(2L) is irrelevant for the choice of k_t, as it does not depend on k.
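A compact numpy sketch of the estimators (11) through (15) could look as follows; the moving-average window d and all function names are illustrative choices rather than the exact implementation released with the paper:

```python
import numpy as np

def end_of_iteration_estimates(received_grads):
    """Eqs. (11)-(12): estimates computed after the k_t fastest gradients arrive.
    Requires k_t >= 2 (cf. the remark after (11))."""
    G = np.stack(received_grads)                 # shape (k_t, dim)
    g_bar = G.mean(axis=0)                       # aggregated gradient, Eq. (4)
    k_t = G.shape[0]
    sigma2_hat = np.sum((G - g_bar) ** 2) / (k_t - 1)                   # Eq. (11)
    gradsq_hat = max(0.0, float(g_bar @ g_bar) - sigma2_hat / k_t)      # Eq. (12)
    return sigma2_hat, gradsq_hat

def gain_estimator(past_sigma2, past_gradsq, L, d=10):
    """Eqs. (13)-(15): average the last d end-of-iteration estimates and return
    the estimated gain as a function of k. Assumes at least one past estimate."""
    sigma2_tilde = float(np.mean(past_sigma2[-d:]))                     # Eq. (13)
    gradsq_tilde = float(np.mean(past_gradsq[-d:]))                     # Eq. (14)
    return lambda k: (gradsq_tilde - sigma2_tilde / k) / (2 * L)        # Eq. (15)
```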

In Fig. 1, we show our estimates during one training process on the MNIST dataset (details in Sect. 4), where our algorithm (described in Sect. 3.3) is applied to dynamically choose k_t. The solid lines are the estimates given by (13), (14), and (15); the dashed lines present the exact values (we have instrumented our code to compute them). We can see from Figures 1(a) and 1(b) that the proposed estimates of the gradient norm and of the gradient variance are very accurate. Figure 1(c) compares the loss decrease (observed a posteriori) and the estimated gain. As expected, the gain is a lower bound for the loss decrease, but the two quantities are almost proportional. This is promising because, if the lower bound and the loss decrease were exactly proportional, their maximizers would coincide, and working on the lower bound, as we do, would not be an approximation.

3.2 Iteration Duration

In this subsection, we discuss how to estimate the time T_t(k) the PS needs to receive k gradients of w_t after the update at iteration t. As in Lee, we call round trip time the total (random) time an idle worker needs to 1) retrieve the new parameter vector, 2) compute the corresponding gradient, and 3) send it back to the PS.

When the PS starts a new iteration t, there are k_{t-1} workers ready to compute the new gradient, while the other n - k_{t-1} workers are still computing stale gradients, i.e., gradients relative to past parameter vectors. The time T_t(k) depends not only on the value of k, but also on the value of k_{t-1} and on the residual round trip times (i.e., the remaining times the busy workers need to complete their tasks). We assume that most of this dependence is captured by the number k_{t-1}; this would be exact if round trip times were exponential random variables. Let T_t(j, m) denote the time the PS spends waiting for the j-th gradient of w_t, provided that it has waited for m gradients at iteration t - 1. Under our assumptions, for given values of j and m, the values T_t(j, m) can be seen as samples of the same random variable, which we denote by T(j, m). For estimating the duration of an iteration in which the PS waits for k gradients, we consider E[T(k, k)] (it could seem more appropriate to consider E[T(k, k_{t-1})], but we want to select a value of k that leads to good performance in the long term, i.e., if constantly used; for this reason, we use E[T(k, k)], which corresponds to selecting the same k at each iteration).

Consider iteration t and let m = k_{t-1}. The PS can collect the samples T_t(j, m) for j ≤ k_t (it needs to wait for k_t gradients before moving to the next iteration), but also for j > k_t, because late workers still complete their ongoing calculations. In fact, late workers may terminate the computation and send their (by now stale) gradients to the PS before they receive the new parameter vector. Even if a new parameter vector is already available in their local queue (and then they know their gradient is not needed), in DBW the workers still notify the completion to the PS, providing useful information to estimate the expected times with limited communication overhead.

A first naive approach to estimate E[T(j, m)] is to average the samples obtained over the past history. Actually, there is much more information that can be exploited to improve the estimates if we jointly estimate the complete set of values E[T(j, m)] for j, m ∈ {1, ..., n}. In fact, the following pathwise relation holds for each j and m: T_t(j, m) ≤ T_t(j + 1, m), because the index j denotes the order of arrival of the gradients. As a consequence, E[T(j, m)] ≤ E[T(j + 1, m)]. Moreover, coupling arguments lead to conclude that E[T(j, m + 1)] ≤ E[T(j, m)] and E[T(j, j)] ≤ E[T(j + 1, j + 1)]. These two inequalities express the following intuitive facts: 1) if an iteration starts with more workers available to compute, the PS will collect gradients faster (on average); 2) constantly waiting for a smaller number of gradients leads to faster iterations. These inequalities allow us to couple the estimations of E[T(j, m)] for different pairs (j, m): samples for a given pair can thus contribute not only to the estimation of that pair, but also to the estimations of other pairs. This is useful because the number of samples available for a pair depends on how often the corresponding value of k has been selected: there can be many samples for a given pair and much fewer (even none) for another one.

Let S_t(j, m) be the set of samples of T(j, m) available up to iteration t. We propose to estimate the expected times by solving the following optimization problem:

(16)    \min_{\{\tau(j,m)\}} \; \sum_{j=1}^{n} \sum_{m=1}^{n} \sum_{s \in S_t(j,m)} \big( s - \tau(j,m) \big)^2
        subject to    \tau(j, m) \le \tau(j+1, m),  \quad  \tau(j, m+1) \le \tau(j, m),  \quad  \tau(j, j) \le \tau(j+1, j+1).

Let \tau^*_t(j, m) be the solution of problem (16). We then take \hat{T}_t(j, m) = \tau^*_t(j, m) as our estimate of E[T(j, m)], and in particular \hat{T}_t(k, k) as the estimate of E[T(k, k)]. We observe that, without the constraints, the optimal value \tau^*_t(j, m) is simply the empirical average of the corresponding set S_t(j, m); hence, Problem (16) is a natural way to extend the empirical average estimators while accounting for the constraints. For our application, the quadratic optimization problem (16) can be solved fast through solvers like CVX cvx; gb08 for typical values of n.
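The following sketch solves (16) with cvxpy, used here as a stand-in for the CVX solver mentioned above; the dictionary S mapping each pair (j, m) to its list of observed samples is an assumed data structure:

```python
import cvxpy as cp

def estimate_times(S, n):
    """Constrained least-squares estimation of E[T(j, m)], problem (16) (sketch).

    S[(j, m)] is the list of observed times for receiving the j-th gradient when
    m gradients were waited for at the previous iteration (1-indexed pairs).
    Returns a matrix whose entry [j-1, m-1] estimates E[T(j, m)].
    """
    tau = cp.Variable((n, n))
    # Squared deviations from all available samples (empirical averages if unconstrained)
    terms = [cp.square(s - tau[j - 1, m - 1])
             for (j, m), samples in S.items() for s in samples]
    constraints = []
    for m in range(n):
        for j in range(n - 1):
            constraints.append(tau[j, m] <= tau[j + 1, m])      # later arrivals take longer
    for j in range(n):
        for m in range(n - 1):
            constraints.append(tau[j, m + 1] <= tau[j, m])      # more ready workers, faster collection
    for j in range(n - 1):
        constraints.append(tau[j, j] <= tau[j + 1, j + 1])      # constantly waiting for fewer gradients is faster
    prob = cp.Problem(cp.Minimize(cp.sum(cp.hstack(terms))), constraints)
    prob.solve()
    return tau.value
```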

In Fig. 2, we compare our estimator with the naive one (the empirical average). We observe that the naive method 1) cannot provide estimates for a given value of k before that value is selected for the first time, and 2) often leads to estimates that are in the wrong relative order. By enforcing the inequality constraints, our estimator (16) obtains more precise estimates, in particular for the values of k that are tested less frequently in this experiment.

(a) Values of k_t selected  (b) Empirical average  (c) Constraint-aware estimator
Figure 2: Estimation of the expected times E[T(k, k)].

3.3 Dynamic Choice of k_t

The rationale of DBW is to select the value k_t that maximizes the expected decrease of the loss function per time unit, i.e.:

(17)    k_t = \arg\max_{k \in \{1, \dots, n\}} \frac{\hat{G}_t(k)}{\hat{T}_t(k, k)}.

Note that (17) does not select values of k for which \hat{G}_t(k) \le 0, unless \hat{G}_t(k) \le 0 for all values of k, in which case k_t = n. This behaviour is correct: \hat{G}_t(k) \le 0 indicates that the aggregate batch size kB may be too low to guarantee that the stochastic gradient corresponds to a descent direction, and then it is opportune to increase k (if possible). Our approach then recovers some behaviour of dynamic sample size methods (see (bottou18, Sect. 5.2), de17aistats).

In most of the existing implementations of distributed gradient methods for ML (including PyTorch's), each worker can send to the PS the local average loss computed on its mini-batch. The PS can thus estimate the current loss as

\hat{F}(w_t) = \frac{1}{k_t} \sum_{i=1}^{k_t} \frac{1}{B} \sum_{s \in \mathcal{B}_i} f(w_t, s).

The PS usually exploits this information to evaluate a stopping condition. DBW takes advantage of this readily available information to avoid decreasing k from one iteration to the next when the loss appears to be increasing (and then we need more accurate gradient estimates, rather than noisier ones). We modify (17) to

(18)    k_t = \arg\max_{k \ge k_{t-1} \, \mathbb{1}\left( \hat{F}(w_t) > \alpha \hat{F}(w_{t-1}) \right)} \frac{\hat{G}_t(k)}{\hat{T}_t(k, k)},

where α ≥ 1 is a configurable threshold (we select a fixed value of α in our experiments) and \mathbb{1}(·) denotes the indicator function (equal to 1 iff its argument is true). If the loss has become α times larger since the previous iteration, then (18) forces k_t ≥ k_{t-1}.
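Putting the pieces together, the per-iteration selection of k_t in (17) and (18) can be sketched as follows; G_hat comes from (15), T_hat(k) is the estimate of E[T(k, k)] obtained from (16), and alpha is the threshold mentioned above (all names are illustrative):

```python
def choose_k(G_hat, T_hat, n, k_prev, loss_curr, loss_prev, alpha):
    """Select k_t maximizing the estimated gain per unit time, Eqs. (17)-(18) (sketch).

    If the estimated loss grew by more than a factor alpha since the previous
    iteration, the search is restricted to k >= k_prev, as prescribed by (18).
    """
    k_min = k_prev if loss_curr > alpha * loss_prev else 1
    return max(range(k_min, n + 1), key=lambda k: G_hat(k) / T_hat(k))
```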

4 Experiments

We have implemented DBW in PyTorch pytorch, using the MPI backend for distributed communication. The experiments have been run on a real CPU/GPU cluster platform with different GPUs available (e.g., GeForce GTX 1080 Ti, GeForce GTX Titan X, and Nvidia Tesla V100). In order to have fine control over the round trip times, our code can generate computation and communication times according to different distributions (uniform, exponential, Pareto, etc.) or read them from a trace provided as an input file. The system operates at the maximum speed guaranteed by the underlying cluster, but it maintains a virtual clock to keep track of when events would have happened. Note that the virtual time is not a simple relabeling of the time axis: for example, the virtual time instants at which gradients are received by the PS determine which of them are actually used to update the parameter vector. So the virtual time has an effect on the optimization dynamics. Our code is available online github_dbw.
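To illustrate the virtual-clock bookkeeping, here is a deliberately simplified sketch; it assumes a fresh set of round trip times at every iteration and ignores the carry-over of still-busy workers, which the real harness accounts for:

```python
def virtual_iteration(virtual_now, round_trip_times, k):
    """Assign each worker's gradient a virtual arrival time and keep only the k
    earliest ones; the iteration ends, in virtual time, at the k-th arrival."""
    arrivals = sorted((virtual_now + rtt, worker_id)
                      for worker_id, rtt in enumerate(round_trip_times))
    end_time = arrivals[k - 1][0]                     # virtual duration of the iteration
    aggregated = [wid for _, wid in arrivals[:k]]     # workers whose gradients are used
    return end_time, aggregated
```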

In what follows, we show that the optimal setting for the number of backup workers varies not only with the round trip time distribution, but also with the hyper-parameters of the optimization algorithm, like the batch size B. Moreover, the optimal setting depends as well on the stage of the training process, and hence changes over time, even when the cluster is stationary (i.e., round trip times do not change during the training period).

In all experiments, DBW achieves nearly optimal performance in terms of convergence time, and sometimes it even outperforms the optimal static setting, which is found through an exhaustive offline search over all values k ∈ {1, ..., n}. We also compare DBW with a variant where the gain is not estimated as in (15), but simply equals the number of aggregated gradients k, as proposed in teng18neurips. We call this variant blind DBW (B-DBW), because it is oblivious to the current state of the training. We find that this approach is too simplistic: ignoring the current stage of the optimization problem leads to worse performance than DBW.

We evaluated DBW, B-DBW, and different static settings for k on the classification problem of MNIST mnist, a dataset with images portraying handwritten digits. We trained a neural network with two convolutional layers with 5x5 filters and two fully connected layers. The loss function was the cross-entropy one.

The learning rate is probably the most critical hyper-parameter in ML optimization problems. Ideally, it should be set to the largest value that still guarantees convergence. It is important to note that different static settings for the number of backup workers require different values of the learning rate: the smaller k is, the noisier the aggregate gradient g(w_t) is, and the smaller the learning rate should be. The rule of thumb proposed in the seminal paper Jianmin2016 is to set the learning rate proportional to the number of aggregated gradients, i.e., η(k) ∝ k. This corresponds to the standard recommendation to have the learning rate proportional to the (aggregate) batch size goyal17; smith18iclr: in static settings, aggregating k gradients is equivalent to using a batch size equal to kB, so the learning rate should scale accordingly. An alternative approach is to tune the learning rate independently for each static value of k according to the empirical rule in smith17, which requires running a number of experiments and determining the inflection point of a specific curve; this rule also leads to learning rates that increase with k. We call the two settings respectively the proportional rule and the knee rule. The maximum learning rate for the proportional rule is set equal to the value determined by the knee rule when all gradients are aggregated. The same value is also used as the learning rate for DBW and B-DBW, independently of the specific value they select for k_t. In fact, DBW and B-DBW can safely operate with a large learning rate, because they dynamically increase k_t up to n when they detect that the loss is increasing.
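As a concrete reading of the proportional rule, the learning rate for a static setting that aggregates k gradients could be computed as below; eta_max stands for the value tuned by the knee rule for full aggregation, and the exact scaling used in the paper's experiments may differ:

```python
def proportional_rule(k, n, eta_max):
    """Learning rate proportional to the number of aggregated gradients k,
    reaching eta_max when all n gradients are aggregated (sketch of the rule of thumb)."""
    return eta_max * k / n
```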

Figure 3: Loss versus time (MNIST; proportional learning-rate rule for the static settings; round trip times follow a shifted exponential distribution).
Figure 4: Effect of the round trip time distribution (MNIST; proportional learning-rate rule for the static settings).

Figure 3 shows, for a single run of the training process, the evolution of the loss over time and the corresponding choices of k_t for the two dynamic algorithms; for the static settings, the learning rate follows the proportional rule. We can see that DBW achieves the fastest convergence among all tested configurations, by using different values of k_t in different stages of the training process. In fact, as we have discussed after introducing (10), the effect of k on the gain depends on the module of the gradient and on the variability of the local gradients. In the bottom subplot, the dotted line shows how their ratio varies during the training process. In the early iterations, σ_t² is negligible in comparison to ‖∇F(w_t)‖²; DBW then selects small values for k_t, losing a bit in terms of the gain, but significantly shortening each iteration by waiting only for the fastest workers. As the parameter vector approaches a local minimum, ‖∇F(w_t)‖ approaches zero and the gain becomes more and more sensitive to k, so that DBW progressively increases k_t up to n, as shown by the solid line. On the contrary, B-DBW (the dashed line) selects almost the same value of k_t throughout the training, with some variability due to the randomness of the time estimates.

4.1 Round trip time effect

In this subsection we consider round trip times (see Sect. 3.2) that are i.i.d. according to a shifted exponential random variable; we consider realistic time distributions later. This choice, common to Lee; DuttaJGDN18, allows us to easily tune the variability of the round trip times by changing the relative weight of the exponential component. When this component vanishes, all gradients arrive at the same time at the PS, so that the PS should always aggregate all of them. As its weight increases, the variance of the round trip times increases, and waiting for fewer gradients becomes advantageous.

Figure 4 compares the time needed to reach a given training loss for the two dynamic algorithms and for three static settings, each of which is optimal in one of the three scenarios considered. For each configuration, we carried out several independent runs with different seeds. We find that our dynamic algorithm achieves the fastest convergence in all three scenarios, and in two of them it is even several times faster than the optimal static setting. Two factors explain this observation. First, as discussed for Fig. 3, there is no unique optimal value of k to be used across the whole training process, and DBW manages to select the most indicated value in different stages of the training. Second, DBW takes advantage of a larger learning rate. Both factors play a role: for example, focusing on Fig. 4(c), the learning rate for DBW is only twice as large as that of the optimal static setting, while DBW's speed-up is larger than this factor alone would explain, so that adapting k_t achieves an additional improvement. The importance of capturing the dynamics of the optimization process is also evident when comparing DBW with B-DBW: while B-DBW takes advantage of a higher learning rate as well, it performs worse than DBW.

4.2 Batch size effect

The batch size B is another important hyper-parameter. It is often limited by the memory available at each worker, but it can also be determined by the generalization performance of the final model hoffer17. In this subsection we highlight how B also affects the optimal setting for the number of backup workers. These findings confirm that configuring the number of backup workers is indeed a difficult task, and knowing the characteristics of the underlying cluster is not sufficient.

The experiments differ in two additional aspects from those in Fig. 4. First, the distribution of the round trip times (shown in Fig. 5) is taken from a real ML experiment running stochastic gradient descent on a production Spark cluster with sixteen servers managed through Zoe Analytics pace17, each with two 8-core Intel E5-2630 CPUs running at 2.40GHz. Second, learning rates are configured according to the knee rule. We observe that the knee rule leads to a weaker variability of the learning rate in comparison to the proportional rule: the learning rate increases only by a small factor when k goes from its smallest to its largest value, and it increases even less for larger batch sizes.

Figure 5: Empirical distribution of round trip times on a Spark cluster
Figure 6: Effect of the batch size B (MNIST; knee learning-rate rule for the static settings).

Figure 6 shows the results for different batch sizes, comparing the dynamic methods with a few static settings, including the optimal static one, which decreases as the batch size grows. Again, Equation (10) helps to understand this change of the optimal static setting with the batch size: as the batch size increases, the variability of the gradients decreases, so that the numerator of the ratio depends less on k. The advantage of shortening the iteration by selecting a small k can then compensate the corresponding decrease of the gain G_t(k).

Since the learning rates chosen by the knee rule for the static settings are now close to the dynamic ones, DBW does not outperform the optimal static setting, but its performance is quite close, and significantly better than B-DBW's. It is worth stressing that, when running a given ML problem on a specific cluster environment, the user cannot predict the optimal static setting without running preliminary short training experiments for every k. DBW does not need them.

Figure 7: Robustness to slowdowns of the system (MNIST; proportional learning-rate rule for the static settings).

4.3 Robustness to slowdowns

Until now, we have considered a stationary setting where the distribution of the round trip times does not change during the training. Figure 7 shows an experiment in which half of the workers experience a sudden slowdown during the training process. Initially, round trip times are all equal and deterministic, so that the optimal setting is to wait for all workers (k = n). Suddenly, half of the workers in the cluster slow down by a factor 5, and the optimal static configuration becomes to wait only for the gradients of the faster workers. We can see that DBW detects the slowdown in the system and correctly adapts k_t.

5 Conclusions

In this paper, we have shown that the number of backup workers needs to be adapted at run-time and that the correct choice is inextricably bound, not only to the cluster configuration and workload, but also to the hyper-parameters of the learning algorithm and the stage of the training. We have proposed a simple algorithm, DBW, that, without prior knowledge about the cluster or the problem, achieves good performance across a variety of scenarios, and in some cases even outperforms the optimal static setting.

As a future research direction, we want to extend the scope of DBW to dynamic resource allocation, e.g., by automatically releasing computing resources if k_t < n and the fastest k_t gradients always come from the same set of workers. In general, we believe that distributed systems for ML are in need of adaptive algorithms in the same spirit as the utility-based congestion control schemes developed in our community starting from the seminal paper kelly98. As our work points out, it is important to define new utility functions that take the learning process into account. Adaptive algorithms are even more needed in the federated learning scenario konecny15, where ML training is no longer relegated to the cloud, but occurs in the wild over the whole internet. Our paper shows that even simple algorithms can provide significant improvements.

6 Acknowledgement

This work has been carried out in the framework of a common lab agreement between Inria and Nokia Bell Labs (ADR ’Rethinking the Network’). We thank Alain Jean-Marie for having suggested the estimation technique in Sect. 3.2.

References