Distributed Machine Learning through Heterogeneous Edge Systems

11/16/2019 ∙ by Hanpeng Hu, et al. ∙ The University of Hong Kong

Many emerging AI applications request distributed machine learning (ML) among edge systems (e.g., IoT devices and PCs at the edge of the Internet), where data cannot be uploaded to a central venue for model training, due to their large volumes and/or security/privacy concerns. Edge devices are intrinsically heterogeneous in computing capacity, posing significant challenges to parameter synchronization for parallel training with the parameter server (PS) architecture. This paper proposes ADSP, a parameter synchronization scheme for distributed ML with heterogeneous edge systems. Eliminating the significant waiting time occurring with existing parameter synchronization models, the core idea of ADSP is to let faster edge devices continue training, while committing their model updates at strategically decided intervals. We design algorithms that decide time points for each worker to commit its model update, and ensure not only global model convergence but also faster convergence. Our testbed implementation and experiments show that ADSP outperforms existing parameter synchronization models significantly in terms of ML model convergence time, scalability and adaptability to large heterogeneity.


1 Introduction

Many edge-based AI applications have emerged in recent years, where various edge systems (e.g., PCs, smart phones, IoT devices) collect local data, collaboratively train an ML model, and use the model for AI-driven services. For example, smart cameras deployed in surveillance systems [30, 19] capture local images/videos and jointly train a global face recognition model. In Industry AI Operations (AIOps) [23], chillers in a building or an area collect temperature and electricity consumption data from the households, and derive a global COP (Coefficient of Performance) prediction model [3].

A straightforward way of training a global model with data collected from multiple edge systems is to send all data to a central venue, e.g., a cloud data center, and train on the aggregated datasets using an ML framework such as TensorFlow [1], MXNet [4] or Caffe2 [8]. Such a 'data aggregation training' approach may well incur large network bandwidth cost, due to the large data transmission volume and the continuous nature of data generation, and also raises data security and privacy concerns. To alleviate these issues, collaborative, distributed training among edge systems has been advocated [27], where each edge system locally trains on the dataset it collects, and exchanges model parameter updates (i.e., gradients) with others through parameter servers [33, 13, 19] (a.k.a. geo-distributed data-parallel training).

Edge systems are intrinsically heterogeneous: their hardware configurations can be vastly different, leading to different computation and communication capacities. This brings significant new issues to parameter synchronization among the edge workers. In a data center environment, synchronous training (i.e., Bulk Synchronous Parallel (BSP) [9, 15, 17]) is adopted by the majority of production ML jobs (based on our exchanges with large AI cloud operators), given the largely homogeneous worker configuration: each worker trains a mini-batch of input data and commits the computed gradients to the PS; the PS updates the global model after receiving commits from all workers, and then dispatches the updated model parameters to all workers, before each worker can continue training its next mini-batch. In the edge environment, the vastly different training speeds among edge devices call for a more asynchronous parameter synchronization model, to expedite ML model convergence.

Stale Synchronous Parallel (SSP) [9] and Totally Asynchronous Parallel (TAP) [10] are representative asynchronous synchronization models. With TAP, the PS updates the global model upon commit from each individual worker, and dispatches updated model immediately to the respective worker; it has been proven that such complete asynchrony cannot ensure model convergence [10]. SSP enforces bounded asynchronization: fast workers wait for slow workers for a bounded difference in their training progress, in order to ensure model convergence. A few recent approaches have been proposed to further improve convergence speed of asynchronous training [7, 32] (see more in Sec. 2).

We investigate how existing parameter synchronization models work in a heterogeneous edge environment with testbed experiments (Sec. 2.3), and show that the waiting time (overall model training time minus gradient computation time) is still more than 50% of the total training time with the representative synchronization models.

Aiming at minimizing the waiting time and optimizing computing resource utilization, we propose ADSP (ADaptive Synchronous Parallel), a new parameter synchronization model for distributed ML with heterogeneous edge systems. Our core idea is to let faster workers continue with their mini-batch training all the time, while enabling all workers to commit their model updates at the same strategically decided intervals, to ensure not only model convergence but also faster convergence. The highlights of ADSP are summarized as follows:

ADSP is tailored for distributed training in heterogeneous edge systems, which fully exploits individual workers’ processing capacities by eliminating the waiting time.

ADSP actively controls the parameter update rate from each worker to the PS, to ensure that the total number of commits from each worker to the PS is roughly the same over time, no matter how fast or slow each worker performs local model training. Our algorithm exploits a momentum-based online search approach to identify the best cumulative commit number across all workers, and computes the commit rates of individual workers accordingly. ADSP is proven to converge after a sufficient number of training iterations.

We have done a full-fledged implementation of ADSP and evaluated it with real-world edge ML applications. Evaluation results show that it outperforms representative parameter synchronization models significantly in terms of model convergence time, scalability and adaptability to large heterogeneity.

2 Background and Motivation

2.1 SGD in PS Architecture

Stochastic Gradient Descent (SGD) is the widely used algorithm for training neural networks [7, 1]. Let W_t be the set of global parameters of the ML model at step t. A common model update method with SGD is:

W_{t+1} = W_t − η·g_t + μ·(W_t − W_{t−1})    (1)

where g_t is the gradient, η is the learning rate, and μ is the momentum introduced to accelerate the training process, since it accumulates gradients in the right direction to the optimal point [22, 26].

In widely adopted data-parallel training with the parameter server (PS) architecture [5], the SGD update rule can be applied at both the workers and the PS [11]. Each worker holds a local copy of the ML model; its local dataset is divided into mini-batches, and the worker trains its model in an iterative fashion: in each step, the worker calculates gradients of the model parameters using one mini-batch of its data, and may commit its gradients to the PS and pull the newest global model parameters from the PS. The PS updates the global model using Eqn. (1) with the gradients received from the workers and a global learning rate η_g. In case a worker does not synchronize model parameters with the PS in every step, the worker may carry out local model updates using the computed gradients according to Eqn. (1), where the gradients are multiplied by a local learning rate η_l.
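For concreteness, here is a minimal Python sketch of the PS-side update rule in Eqn. (1); the function and variable names are ours for illustration, not from the ADSP implementation.

```python
import numpy as np

def ps_update(w, w_prev, grad, eta=0.01, mu=0.9):
    """One momentum-SGD update at the PS, following Eqn. (1).

    w, w_prev : current and previous global parameter vectors
    grad      : gradient (or accumulated update) committed by a worker
    eta, mu   : learning rate and momentum
    """
    w_new = w - eta * grad + mu * (w - w_prev)
    return w_new, w  # updated parameters, and the new "previous" state

# toy usage on a 3-dimensional parameter vector
w, w_prev = np.zeros(3), np.zeros(3)
g = np.array([0.5, -0.2, 0.1])
w, w_prev = ps_update(w, w_prev, g)
```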

2.2 Existing Parameter Synchronization Models

A parameter synchronization model specifies when each worker commits its gradients to the PS and whether it should be synchronized with updates from other workers; it critically affects the convergence speed of model training. Three representative synchronization models, BSP, SSP and TAP, have been compared in [10], which proves that BSP and SSP guarantee model convergence whereas TAP does not. Training convergence with BSP is significantly slower than with SSP [9], due to BSP’s strict synchronization barriers. Many follow-up studies build on these three synchronization models, aiming to reduce the convergence time by reducing communication contention or overhead [2, 16, 25], adjusting the learning rate [11], and other techniques [34]. ADACOMM [32] allows accumulating local updates before committing to the PS, and adopts a BSP-style synchronization model, i.e., all workers run the same number (τ) of training steps before synchronizing with the PS. It also suggests reducing τ periodically according to the model loss; however, the instability of loss values during training and the rapidly declining commit rate are not ideal for expediting training (according to our experiments).

Aiming at minimizing waiting time among heterogeneous workers, our synchronization model, ADSP, employs an online search algorithm to automatically find the optimal/near-optimal update commit rate for each worker to adopt.

2.3 Impact of Waiting

Figure 1: Training time breakdown with different parameter synchronization models.

We divide the time a worker spends in each training step into two parts: (i) the computation time, to carry out backward propagation to compute gradients/apply model updates and forward propagation to produce output [4]; and (ii) the waiting time, including the time for exchanging gradients/parameters with the PS and the blocked time due to the synchronization barrier (i.e., the time when the worker is doing neither computation nor communication).
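As a rough illustration of how this breakdown can be measured, the sketch below times the two phases of a single step; train_one_minibatch and sync_with_ps are placeholder callables, not part of any framework API.

```python
import time

def timed_step(train_one_minibatch, sync_with_ps):
    """Return (computation_time, waiting_time) for one training step.

    train_one_minibatch: performs forward/backward computation
    sync_with_ps:        covers gradient/parameter exchange and any
                         blocking at a synchronization barrier
    """
    t0 = time.perf_counter()
    train_one_minibatch()
    t1 = time.perf_counter()
    sync_with_ps()
    t2 = time.perf_counter()
    return t1 - t0, t2 - t1  # computation time, waiting time
```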

We experiment with representative synchronization models to investigate the waiting time they incur. We train a convolutional neural network (CNN) model on the Cifar10 dataset [14] with 1 PS and 3 workers of heterogeneous computation capacities (the time ratio to train one mini-batch is 1:1:3). Fig. 1 shows the convergence time (overall training time to model convergence) and the average time spent per training step, incurred with BSP, SSP, and ADACOMM (see Sec. 5 for details). TAP is not compared as it has no convergence guarantee. The computation/waiting time is averaged over all workers. We see that with heterogeneous workers, the waiting time dominates the overall training time with BSP and SSP, and their overall convergence time and time spent per training step are long. With ADACOMM, the waiting time and overall training time are much shorter. Nevertheless, its waiting time is still close to half of the total training time, i.e., the effective time used for model training is only around 50%, due to its relatively conservative approach to local model updates.

Our key question is: what is the limit that we can further reduce the waiting time to, such that time is spent most efficiently on model computation and convergence can be achieved in the most expedited fashion? Our answer, ADSP, allows fast workers to keep training while maintaining approximately the same gradient commit rates among all workers. Fig. 1 shows that the waiting time is minimized to a negligible level with ADSP, as compared to the computation time. As such, almost all training time is effectively used for model computation and fast model convergence is achieved.

3 ADSP Overview

We consider a set of heterogeneous edge systems and a parameter server (PS) located in a datacenter, which together carry out SGD-based distributed training to learn an ML model. ADSP (ADaptive Synchronous Parallel) is a new parameter synchronization model for this distributed ML system. The design of ADSP targets the following goals: (i) make full use of the computation capacity of each worker; (ii) choose a proper commit rate to balance the tradeoff between hardware efficiency (utilization of worker computing capacity) and statistical efficiency (i.e., reduction of loss per training step), in order to minimize the overall time taken to achieve model convergence; (iii) ensure model convergence under various training speeds and bandwidth situations at different workers.

Figure 2: ADSP workflow.

With ADSP, time is divided into equal-sized slots of duration Δ, which we call check periods; we refer to the time points Δ, 2Δ, 3Δ, … as checkpoints. More precisely, we define the process of a worker sending computed gradients to the PS as a commit, and the number of commits from worker i during a check period as its commit rate R_i. ADSP consists of two modules: 1) a novel synchronization model, which allows faster edge systems to perform more training between two consecutive updates to the PS, and ensures that the numbers of commits from all workers remain approximately equal over time; 2) a global commit rate search algorithm, which selects an appropriate commit rate for all workers to pursue, in order to achieve fast convergence.

Let c_i denote the total number of commits from worker i to the PS since the very beginning. At each checkpoint, we compute the target total number of commits, C_target, that each worker is expected to have submitted by the next checkpoint, and adjust the commit rate of each worker i in the next check period to R_i = C_target − c_i.

Fig. 2 shows the workflow of our ADSP model. The data produced/collected at each edge system/worker is stored into training datasets. For each mini-batch in its dataset, an edge system computes a local update of the model parameters, i.e., gradients, using the examples in this mini-batch. After training one mini-batch, it moves on to train the next mini-batch and derives another local update. Worker i pushes its accumulative update (i.e., the sum of all gradients it has produced since its last commit, multiplied by the local learning rate η_l) according to its commit rate R_i. A scheduler adjusts and informs each end system of the target commit rate over time. Upon receiving a commit from worker i, the PS multiplies the accumulated update by the global learning rate η_g [11] and then updates the global model with it; worker i then pulls the updated parameters from the PS and continues training its next mini-batch.
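The checkpoint-time bookkeeping described above amounts to a one-line rule; the sketch below uses our own names (target for C_target, commits for the per-worker totals c_i) and is only meant to make the rule concrete.

```python
def next_commit_rates(target, commits):
    """Commit rate of each worker for the next check period: the gap
    between the common target C_target and the worker's total commits."""
    return {worker: target - c for worker, c in commits.items()}

# toy usage: three workers with different progress so far
print(next_commit_rates(target=12, commits={0: 10, 1: 9, 2: 11}))
# {0: 2, 1: 3, 2: 1} -- workers that have committed less so far get
# higher commit rates, so that the totals equalize by the next checkpoint
```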

4 ADSP Algorithms and Analysis

It is common to have large heterogeneity among edge systems, including different computation power and different network delays to the datacenter hosting the PS. Our core idea in designing ADSP is to adapt to the heterogeneity, i.e., to transform training in heterogeneous settings into training in homogeneous settings using a no-waiting strategy: we allow different workers to process different numbers of mini-batches between two commits according to their training speed, while ensuring that the numbers of commits of all workers are approximately equal at periodic checkpoints. To achieve this, we mainly control one hyper-parameter, the commit rate, making faster workers accumulate more local updates before committing their updates, so as to eliminate the waiting time. By enforcing approximately equal numbers of commits from all the workers over time, we can ensure model convergence.

4.1 The Impact of C_target on Convergence

The target total number of commits to be achieved by each worker by the next checkpoint, C_target, decides the commit rate of each worker i within the next check period, as R_i = C_target − c_i (where c_i is worker i’s current total commit number). The commit rate has a significant impact on the training progress: if C_target is large, a slow worker may fail to achieve that many commits in the next period, due to its limited compute capacity; even if it can hit the target, too many commits may incur high communication overhead, which in turn slows down the training process. On the other hand, if the number of target commits is too small, which implies that each end system commits its gradients only after many steps of mini-batch training using local parameters, large differences and significant staleness exist among the model copies at different workers, which may adversely influence model convergence as well.

Figure 3: (a) the impact of C_target on convergence time; (b) an illustration of the implicit momentum μ_im as a function of C_target; (c) convergence time with different momentum values.

To illustrate this, we train a CNN model on the Cifar10 dataset [14] with 1 PS and 3 workers (the time ratio to train one mini-batch is 1:1:3), where all workers keep training their mini-batches and commit gradients to the PS at the same commit rate over time. We vary the value of C_target in different runs of the experiment. Fig. 3(a) shows that as C_target increases, the model convergence time first decreases and then increases. This is consistent with our discussion above.

We next quantify the effect of the commit rate on model convergence. Suppose that all workers communicate with the PS independently. Let U_t denote the accumulative local update that a worker commits when the global model is W_t, and v_i denote the number of steps that worker i can train per unit time. We have the following theorem.

Theorem 1.

Set the momentum μ in the SGD update formula (1) to zero. The expected SGD update on the global model is then equivalent to a momentum-based update of the form

E[W_{t+1}] = E[W_t] − η·E[U_t] + μ_im·(E[W_t] − E[W_{t−1}]),    (2)

where the implicit momentum μ_im depends on the commit rate and the workers’ training speeds v_i, as characterized in the supplemental file.    (3)

The detailed proof is given in the supplemental file. Compared to the SGD update formula in Eqn. (1), the result is interesting: with our ADSP model, the staleness induced by cumulative local updates can be considered as introducing an extra momentum term into the SGD update equation. To distinguish this term from the original momentum μ in Eqn. (1), we refer to it as the implicit momentum, denoted by μ_im. As we increase C_target, the implicit momentum becomes smaller (Theorem 1).

With the same CNN training experiments as above, Fig. 3(b) illustrates how μ_im varies with C_target. The optimal momentum is derived based on Fig. 3(c), where we vary the value of the momentum term in Eqn. (2) in our experiments, and show how the time taken for model convergence varies with different momentum values. Inspired by these observations, we seek to identify the best commit rate for the workers, which leads to the best μ_im and hence the shortest convergence time.

4.2 The Commit Rate Search Algorithm

We propose a local search method to identify a near-optimal commit rate for the fastest convergence, exploiting the observations that the staleness induced by local updates can be converted into an implicit momentum term in the SGD update, and that the implicit momentum decreases as we increase the commit rate. The algorithm is given in Alg. 1 and is executed by the scheduler (Fig. 2).

In the algorithm, an epoch is a time interval containing multiple check periods, used for commit rate adjustment. At the beginning of each epoch (e.g., 1 hour), the scheduler searches for the optimal commit rates of the workers in this epoch. We start with a small target total commit number C_target, allowing each worker to commit at least once in each check period; in this case, the commit rates R_i are small, the asynchrony-induced implicit momentum is large, and the corresponding point in Fig. 3(b) is located to the left of the optimal momentum. Then the scheduler evaluates the training performance (i.e., the loss decrease speed, to be detailed in Sec. 4.2) induced by the current target and a slightly increased target, by running the system with the commit rates computed from each of the two values for a short period of time (e.g., 1 minute). If the larger target leads to better performance, the scheduler repeats the search with further increased targets; otherwise, the search stops and the commit rates R_i decided by the current C_target are used for the rest of this epoch. The rationale is that the optimal C_target for each epoch is no smaller than the initial value, so we only need to decide whether to increase it or not.

1:function MainFunction
2:     for epoch e = 1, 2, … do
3:         C_target ← max_i c_i + 1 (so that every worker commits at least once in the next check period)
4:         C_target ← DecideCommitRate(C_target)
5:         run ParameterServer and Workers for the remaining time of the epoch
6:     end for
7:end function
8:function DecideCommitRate(C_target)
9:     r_1 ← OnlineEvaluate(C_target)
10:     r_2 ← OnlineEvaluate(C_target + 1)
11:     if r_2 > r_1 then
12:         return DecideCommitRate(C_target + 1)
13:     else
14:         return C_target
15:     end if
16:end function
17:function OnlineEvaluate(C_target)
18:     for each worker i = 1, 2, …, m do
19:         R_i ← C_target − c_i
20:         Send R_i to worker i
21:     end for
22:     train for 1 minute
23:     return reward
24:end function
Algorithm 1 Commit Rate Adjustment at the Scheduler
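For readers who prefer executable code to pseudocode, the loop below mirrors the structure of DecideCommitRate(); the online_evaluate callback and the unit increment of the target are our assumptions, not fixed by the paper.

```python
def decide_commit_rate(target, online_evaluate):
    """Keep increasing the target total commit number while the measured
    reward (loss-decrease speed) improves; otherwise keep the current one.

    online_evaluate(t): assumed helper that briefly runs the system with
    commit rates derived from target t and returns the observed reward.
    """
    reward_cur = online_evaluate(target)
    reward_next = online_evaluate(target + 1)
    while reward_next > reward_cur:
        target += 1
        reward_cur, reward_next = reward_next, online_evaluate(target + 1)
    return target
```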

Online Search and Reward Design.

Traditional search methods are usually offline [7], blocking the whole system when trying out a specific set of variable values, and starting each configuration from the same system state. With an offline search method, one can select the best configuration by comparing the final loss achieved after running different configurations for the same length of time. However, such a search process introduces significant extra delay into the training progress and hence significantly slows down model convergence. In Alg. 1, we instead adopt an online search method (in DecideCommitRate()): we consecutively run each configuration for a specific time (e.g., 1 minute) without blocking the training process.

To compare the performance of configurations that do not start from the same system state, we define a reward as follows. The loss convergence curve of SGD training usually follows the form ℓ(t) ≈ 1/(a·t + b) + c [20]. We collect a few (time t, loss ℓ) pairs while the system is running with a particular configuration, e.g., at the start, middle and end of the 1-minute period, and use them to fit this formula, where a, b, c are parameters. We then obtain the reward as the loss decrease speed, by setting ℓ to a constant and calculating the reciprocal of the corresponding t. The goal of the online search algorithm is to find the commit rate that achieves the maximum reward, i.e., the minimum time to converge to a certain loss.
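A possible implementation of this reward computation, fitting the 1/(a·t + b) + c loss curve with SciPy; the sampling times, loss values and target loss below are arbitrary examples.

```python
import numpy as np
from scipy.optimize import curve_fit

def loss_model(t, a, b, c):
    # SGD loss curves roughly follow 1/(a*t + b) + c
    return 1.0 / (a * t + b) + c

def reward(times, losses, target_loss):
    """Loss-decrease speed: reciprocal of the time at which the fitted
    curve reaches target_loss (target_loss must lie above the fitted
    asymptote c)."""
    (a, b, c), _ = curve_fit(loss_model, np.asarray(times, dtype=float),
                             np.asarray(losses, dtype=float),
                             p0=(0.01, 1.0, min(losses) * 0.5), maxfev=10000)
    t_hit = (1.0 / (target_loss - c) - b) / a  # invert the fitted curve
    return 1.0 / t_hit

# toy usage with three (time, loss) samples from a 1-minute trial
print(reward([0.0, 30.0, 60.0], [2.3, 1.9, 1.7], target_loss=1.2))
```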

1:worker index i = 1, 2, …, m
2:function Worker(i)
3:     for epoch e = 1, 2, … do
4:         receive commit rate R_i from the scheduler
5:         set a timer with a timeout of Δ/R_i − t_i^comm, invoking TimeOut() upon timeout
6:         while model not converged do
7:              train a minibatch to obtain gradient g
8:              U_i ← U_i + η_l · g (η_l is the local learning rate)
9:         end while
10:     end for
11:end function
12:function TimeOut()
13:     commit the accumulated update U_i to the PS and reset U_i to 0
14:     receive updated global model parameters from the PS and update the local model accordingly
15:     restart the timer with a timeout of Δ/R_i − t_i^comm
16:end function
1:initialize the global model W
2:function ParameterServer
3:     while model not converged do
4:         if receive commit U_i from worker i then
5:              W ← W + η_g · U_i
6:              Send W to worker i
7:         end if
8:     end while
9:end function
Algorithm 2 ADSP: Worker and PS Procedures

4.3 Worker and PS Procedures

The procedures at each end system (i.e., worker) and the PS with ADSP are summarized in Alg. 2, where t_i^comm represents the communication time for worker i to commit an update to the PS and pull the updated parameters back. At each worker, we use a timer to trigger the commit of the local accumulative model update to the PS asynchronously, once every Δ/R_i − t_i^comm time.
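A minimal sketch (ours, not the actual ADSP code) of the timer-driven commit logic at a worker, using Python's threading.Timer; the commit_fn callback stands in for pushing the accumulated update and pulling the new global model.

```python
import threading

class CommitTimer:
    """Fires commit_fn every `interval` seconds, independently of how many
    mini-batches the worker manages to train in between."""

    def __init__(self, interval, commit_fn):
        self.interval = interval    # e.g., check_period / commit_rate - comm_time
        self.commit_fn = commit_fn  # pushes accumulated update, pulls new model
        self._timer = None

    def _fire(self):
        self.commit_fn()
        self.start()                # re-arm the timer for the next commit

    def start(self):
        self._timer = threading.Timer(self.interval, self._fire)
        self._timer.daemon = True
        self._timer.start()

    def stop(self):
        if self._timer is not None:
            self._timer.cancel()
```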

4.4 Convergence Analysis

We show that ADSP in Alg. 2 ensures model convergence. We define f_t(W_t) as the objective loss function at step t with global parameter state W_t, where t is the global step index (i.e., the cumulative number of training steps carried out by all workers). Let W_t be the set of global parameters obtained by ADSP right after step t, and W* denote the optimal model parameters that minimize the loss function. We make the following assumptions on the loss function and the learning rate, which are needed for our convergence proof but are not necessarily followed in our experimental settings.

Assumptions:

  1. f_t is convex.

  2. f_t is L-Lipschitz, i.e., |f_t(x) − f_t(y)| ≤ L·‖x − y‖.

  3. The learning rate decreases as η_t = η/√t, t = 1, 2, …, where η is a constant.

Based on the assumptions, we have the following theorem on training convergence of ADSP.

Theorem 2 (Convergence).

ADSP ensures that by each checkpoint, the numbers of update commits submitted by any two different workers i and j are roughly equal, i.e., |c_i − c_j| ≤ ε for a small constant ε. The regret R = Σ_{t=1}^{T} [f_t(W_t) − f_t(W*)] is upper-bounded by O(√T), as T → ∞.

The regret is the accumulative difference between the loss achieved by ADSP and the optimal loss over the training course. Since this accumulative difference is bounded sub-linearly in T (where T is the total number of parameter update steps at the PS), we have R/T → 0 when T is large; hence the average loss of the ADSP iterates approaches the optimal loss as T → ∞, showing that our ADSP model converges to the optimal loss. The detailed proof is given in the supplemental file.
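In symbols (using the notation above), the standard regret-to-convergence step reads:

```latex
R \;:=\; \sum_{t=1}^{T} \bigl[\, f_t(W_t) - f_t(W^{*}) \,\bigr] \;\le\; O(\sqrt{T})
\;\;\Longrightarrow\;\;
\frac{R}{T} \;\le\; O\!\left(\tfrac{1}{\sqrt{T}}\right) \;\longrightarrow\; 0
\quad \text{as } T \to \infty ,
```

i.e., the average loss of the ADSP iterates approaches the optimal loss.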

5 Performance Evaluation

We implement ADSP as a ready-to-use Python library based on TensorFlow [1], and evaluate its performance with testbed experiments.

5.1 Experiment Setup

Testbed. We emulate heterogeneous edge systems following the distribution of hardware configurations of edge devices in a survey [12], using 19 Amazon EC2 instances [31]: 18 instances of four different configurations (7, 5, 4 and 2 instances of each, respectively; see Table 1 in the appendix) as workers, and 1 instance as the PS.

Applications. We evaluate ADSP with three distributed ML applications: (i) image classification on Cifar-10 [14] using a CNN model from the TensorFlow tutorial [28]; (ii) fatigue life prediction of bogies on high-speed trains, training a recurrent neural network (RNN) model with a dataset collected from the China high-speed rail system; (iii) Coefficient of Performance (COP) prediction of chillers, training a global linear SVM model with a chiller dataset.

Baselines. (1) SSP [9], which allows the fastest worker to run ahead of the slowest worker by up to s steps; (2) BSP [29], where the PS strictly synchronizes all workers such that they always perform the same number of training steps; (3) ADACOMM [32], which allows all workers to accumulate τ local updates before synchronizing with the PS and reduces τ periodically; (4) Fixed ADACOMM, a variant of ADACOMM with a fixed τ for all workers.

Default Settings. By default, each mini-batch in our model training includes 128 examples. The check period of ADSP is 60 seconds, and each epoch is 20 minutes long. The global learning rate η_g is set to a value that we find works well through experiments. The local learning rate η_l is initialized to 0.1 and decays exponentially over time.

Figure 4: Comparison of ADSP with baselines in training efficiency: training CNN on the Cifar-10 dataset.

5.2 Experiment Results

All results given in the following are based on CNN training on the Cifar-10 dataset. More experiment results on fatigue life prediction and COP prediction are given in the supplemental file.

Performance of ADSP.

We compare ADSP with the baselines in terms of the wall-clock time and the number of training steps needed to reach model convergence, to validate the effectiveness of ADSP’s no-waiting training. In Fig. 4, the global loss is the loss evaluated on the global model at the PS, and the number of steps is the cumulative number of steps trained at all workers. We stop training, i.e., decide that the model has converged, when the loss variance is below a small threshold for 10 steps. Fig. 4(a) plots the loss curves and Fig. 4(b) correspondingly shows the convergence time with each method. We see that ADSP achieves the fastest convergence, with substantial acceleration compared to BSP, SSP, and Fixed ADACOMM. For ADACOMM, although we have used the optimal hyper-parameters as in [32], it converges quite slowly, which could be due to its instability in tuning τ: τ is tuned periodically based on the current loss, and if the loss does not decrease, τ is simply multiplied by a constant. In Fig. 4(c), we see that ADSP carries out many more training steps within its short convergence time, which may potentially raise a concern about its training efficiency. Fig. 4(d) further reveals that the per-training-step loss decrease achieved by ADSP is slightly lower than that of Fixed ADACOMM, and better than those of the other baselines. The spikes in the ADSP curve at the beginning stage are due to the small commit rates that our search algorithm starts with, which make the loss fluctuate significantly. However, with ADSP, the model eventually converges to a smaller loss than those that the other baselines converge to.

Figure 5: Comparison of ADSP with Fixed ADACOMM at different degrees of heterogeneity and system scales.

Adaptability to Heterogeneity.

We next evaluate ADSP’s adaptability to different levels of end-system heterogeneity. Besides the hardware configuration differences among the workers, we further let each worker sleep for a specific short time after each training step on one mini-batch, and tune the sleep time to adjust the training speeds of the workers. We define the heterogeneity degree among the workers as a function of the workers’ training speeds, where v_i is the number of mini-batches that worker i can process per unit time. A discussion of the heterogeneity degree that also considers communication overhead is given in our supplemental file.

Since BSP, SSP and ADACOMM are significantly slower than ADSP in training convergence, here we only compare ADSP with Fixed ADACOMM. Fig. 5(a)-(d) show that ADSP achieves faster convergence than Fixed ADACOMM (though with more spikes) at different heterogeneity levels. The corresponding convergence times are summarized in Fig. 5(e), which shows that the gap between ADSP and Fixed ADACOMM becomes larger when the workers differ more in training speed, i.e., ADSP achieves a larger convergence speedup over Fixed ADACOMM at higher heterogeneity degrees. The reason is that Fixed ADACOMM still forces faster workers to stop and wait for the slower workers to finish their local updates, so the convergence is significantly influenced by the slowest worker. With ADSP, the heterogeneity degree hardly affects the convergence time, due to its no-waiting strategy. Therefore, ADSP adapts well to heterogeneity in end systems.

System Scalability

We further evaluate ADSP with 36 workers used for model training, whose hardware configurations follow the same distribution as in the 18-worker case. Fig. 5(f) shows that with more workers, both Fixed ADACOMM and ADSP become slower, but ADSP still converges faster than Fixed ADACOMM, and the gap is more obvious than in the case with fewer workers. Intuitively, when the scale of the system grows, the chance increases that workers wait for slower ones to catch up, so more time is wasted with Fixed ADACOMM; ADSP uses this time for more training, and is hence a more scalable solution for large ML training jobs.

The Impact of Network Latency.

Edge systems usually have relatively poor network connectivity [13]; the communication time for each commit is not negligible, and could be even larger than the processing time of each step. Fig. 6 presents the convergence curve of each method as we add different extra delays to the communication module. As we increase the communication delay, the speed-up ratios of ADSP, ADACOMM and Fixed ADACOMM over BSP and SSP become larger. This is because the first three models allow local updates and commit to the PS less frequently, and are consequently less affected by the communication delay than the last two methods. Among the first three models, ADSP still performs the best in terms of convergence speed, regardless of the communication delay.

The rationale is that we can count the communication time when evaluating a worker’s ‘processing capacity’: for worker i, the average processing time per training step is t_i^train + t_i^comm/k_i, where t_i^train is the time to train one mini-batch, t_i^comm is the communication time for each commit, and k_i is the number of local updates between two commits. Therefore, we can extend the scope of heterogeneity in processing capacity to include heterogeneity in communication time as well. ADSP only needs to ensure that the commit rates of all workers are consistent, and can inherently handle this generalized heterogeneity, regardless of which components cause it.
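To make this accounting concrete, here is a one-line helper (the names t_train, t_comm and k correspond to the quantities introduced above):

```python
def effective_step_time(t_train, t_comm, k):
    """Average per-step time when one commit (costing t_comm) is amortized
    over k local updates."""
    return t_train + t_comm / k

# e.g., 0.2 s to train a mini-batch, 1.0 s per commit, 10 local steps per commit
print(effective_step_time(0.2, 1.0, 10))  # 0.3
```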

Figure 6: Comparison of ADSP with baselines with different network delays.

6 Concluding Remarks

This paper presents ADSP, a new parameter synchronization model for distributed ML with heterogeneous edge systems. ADSP lets workers keep training with minimum waiting, and enforces approximately equal numbers of commits from all workers to ensure training convergence. An online search algorithm is carefully devised to identify the near-optimal global commit rate. ADSP maximally exploits the computation resources of heterogeneous workers, targeting training convergence in the most expedited fashion. Our testbed experiments show that ADSP achieves significant convergence acceleration compared with state-of-the-art parameter synchronization models. ADSP also adapts well to different degrees of heterogeneity and to large-scale ML applications.

7 Acknowledgements

This work was supported in part by grants from Hong Kong RGC under the contracts HKU 17204715, 17225516, C7036-15G (CRF), C5026-18G (CRF), in part by WHU-Xiaomi AI Lab, and in part by GRF PolyU 15210119, ITF UIM/363, CRF C5026-18G, PolyU 1-ZVPZ, and a Huawei Collaborative Grant.

References

  • [1] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al. (2016) TensorFlow: a system for large-scale machine learning. In OSDI, Vol. 16, pp. 265–283.
  • [2] C. Chen, W. Wang, and B. Li (2019) Round-robin synchronization: mitigating communication bottlenecks in parameter servers. In IEEE INFOCOM, pp. 532–540.
  • [3] Q. Chen, Z. Zheng, C. Hu, D. Wang, and F. Liu (2019) Data-driven task allocation for multi-task transfer learning on the edge. In IEEE ICDCS.
  • [4] T. Chen, M. Li, Y. Li, M. Lin, N. Wang, M. Wang, T. Xiao, B. Xu, C. Zhang, and Z. Zhang (2015) MXNet: a flexible and efficient machine learning library for heterogeneous distributed systems. arXiv:1512.01274.
  • [5] T. M. Chilimbi, Y. Suzue, J. Apacible, and K. Kalyanaraman (2014) Project Adam: building an efficient and scalable deep learning training system. In OSDI, Vol. 14, pp. 571–582.
  • [6] P. Goyal, P. Dollár, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He (2017) Accurate, large minibatch SGD: training ImageNet in 1 hour. arXiv:1706.02677.
  • [7] S. Hadjis, C. Zhang, I. Mitliagkas, D. Iter, and C. Ré (2016) Omnivore: an optimizer for multi-device deep learning on CPUs and GPUs. arXiv:1606.04487.
  • [8] K. Hazelwood, S. Bird, D. Brooks, S. Chintala, U. Diril, D. Dzhulgakov, M. Fawzy, B. Jia, Y. Jia, A. Kalro, et al. (2018) Applied machine learning at Facebook: a datacenter infrastructure perspective. In HPCA, pp. 620–629.
  • [9] Q. Ho, J. Cipar, H. Cui, S. Lee, J. K. Kim, P. B. Gibbons, G. A. Gibson, G. Ganger, and E. P. Xing (2013) More effective distributed ML via a stale synchronous parallel parameter server. In NIPS, pp. 1223–1231.
  • [10] K. Hsieh, A. Harlap, N. Vijaykumar, D. Konomis, G. R. Ganger, P. B. Gibbons, and O. Mutlu (2017) Gaia: geo-distributed machine learning approaching LAN speeds. In NSDI, pp. 629–647.
  • [11] J. Jiang, B. Cui, C. Zhang, and L. Yu (2017) Heterogeneity-aware distributed parameter servers. In SIGMOD, pp. 463–478.
  • [12] Jkielty (2019) The most popular smartphones in 2018. https://deviceatlas.com/blog/most-popular-smartphones.
  • [13] J. Konečnỳ, H. B. McMahan, F. X. Yu, P. Richtárik, A. T. Suresh, and D. Bacon (2016) Federated learning: strategies for improving communication efficiency. arXiv:1610.05492.
  • [14] A. Krizhevsky and G. Hinton (2010) Convolutional deep belief networks on CIFAR-10. Unpublished manuscript 40 (7), pp. 1–9.
  • [15] M. Li, L. Zhou, Z. Yang, A. Li, F. Xia, D. G. Andersen, and A. Smola (2013) Parameter server for distributed machine learning. In NIPS, Vol. 6, pp. 2.
  • [16] Y. Lin, S. Han, H. Mao, Y. Wang, and W. J. Dally (2017) Deep gradient compression: reducing the communication bandwidth for distributed training. arXiv:1712.01887.
  • [17] Y. Low, D. Bickson, J. Gonzalez, C. Guestrin, A. Kyrola, and J. M. Hellerstein (2012) Distributed GraphLab: a framework for machine learning and data mining in the cloud. PVLDB 5 (8), pp. 716–727.
  • [18] A. K. Mathur (1988) Determining the coefficient of performance of a refrigeration system. US Patent 4,768,346.
  • [19] J. Park, S. Samarakoon, M. Bennis, and M. Debbah (2018) Wireless network intelligence at the edge. arXiv:1812.02858.
  • [20] Y. Peng, Y. Bao, Y. Chen, C. Wu, and C. Guo (2018) Optimus: an efficient dynamic resource scheduler for deep learning clusters. In EuroSys.
  • [21] D. Pollard (1997) Poisson processes. http://www.stat.yale.edu/~pollard/Courses/241.fall97/Poisson.Proc.pdf.
  • [22] B. T. Polyak (1964) Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics 4 (5), pp. 1–17.
  • [23] X. Qu and J. Ha (2017) Next generation of DevOps: AIOps in practice at Baidu. In SREcon17.
  • [24] K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556.
  • [25] P. Sun, Y. Wen, T. N. B. Duong, and S. Yan (2016) Timed dataflow: reducing communication overhead for distributed machine learning systems. In ICPADS, pp. 1110–1117.
  • [26] I. Sutskever, J. Martens, G. Dahl, and G. Hinton (2013) On the importance of initialization and momentum in deep learning. In ICML, pp. 1139–1147.
  • [27] Z. Tao and Q. Li (2018) eSGD: communication efficient distributed deep learning on the edge. In HotEdge.
  • [28] TensorFlow (2019) Use TensorFlow to train a CNN on CIFAR-10. https://github.com/tensorflow/models/tree/master/tutorials/image/cifar10.
  • [29] L. G. Valiant (1990) A bridging model for parallel computation. Commun. ACM 33 (8), pp. 103–111.
  • [30] M. Ved (2019) Artificial intelligence (AI) solutions on edge devices. https://medium.com/@mehulved1503/artificial-intelligence-ai-solutions-on-edge-devices-1cc08d411a7c.
  • [31] G. Wang and T. E. Ng (2010) The impact of virtualization on network performance of Amazon EC2 data center. In IEEE INFOCOM, pp. 1–9.
  • [32] J. Wang and G. Joshi (2018) Adaptive communication strategies to achieve the best error-runtime trade-off in local-update SGD. arXiv:1810.08313.
  • [33] S. Wang, T. Tuor, T. Salonidis, K. K. Leung, C. Makaya, T. He, and K. Chan (2018) Adaptive federated learning in resource constrained edge computing systems. IEEE J-SAC.
  • [34] C. Zhang, H. Tian, W. Wang, and F. Yan (2018) Stay fresh: speculative synchronization for fast distributed machine learning. In ICDCS, pp. 99–109.

Appendix A Proof of Theorem 1

We use t to index the state of the global parameters: W_t denotes the state of the parameters after the t-th commit, and η is the global learning rate. Let U_t denote the accumulative local update (i.e., gradients) that a worker commits when the global model is W_t. To prove Theorem 1, we explain how the global parameter update equation of ADSP is equivalent to the momentum-based form in Eqn. (8) below.

We assume that the time it takes a worker to accumulate local updates for one commit obeys an exponential distribution Exp(λ), where λ is the frequency of commits. We define the number of commits from other workers between a worker’s two consecutive commits as the staleness. Because we assume each commit is independent, the commits from the other workers form a Poisson process, and the number of their commits within a unit of time follows a Poisson distribution; the staleness is then equal to the number of commits from other workers within the worker’s (exponentially distributed) commit interval. According to [21], this combination can be converted to a geometric distribution:

(4)
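As a quick sanity check of this distributional argument, the simulation below (ours, with arbitrary rates) samples exponential commit intervals for several equally fast workers and measures the empirical staleness of one worker's commits; its mean should be close to m − 1, the mean of the corresponding geometric distribution.

```python
import numpy as np

rng = np.random.default_rng(0)
m, lam, horizon = 4, 1.0, 20000.0  # workers, commit frequency, simulated time span

# commit timestamps of each worker: a Poisson process with rate lam
commits = [np.cumsum(rng.exponential(1.0 / lam, size=int(2 * lam * horizon)))
           for _ in range(m)]
others = np.sort(np.concatenate(commits[1:]))

# staleness of worker 0: commits from other workers between two of its commits
t0 = commits[0][commits[0] < horizon]
staleness = np.diff(np.searchsorted(others, t0))

print(staleness.mean(), m - 1)  # empirical mean vs. geometric-distribution mean
```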

Suppose the noisy global parameter update equation can be written as

(5)

where the real state of the parameters on which the c-th commit is calculated is uncertain. Considering that the staleness of each commit follows a geometric distribution, we can compute the expectation of the staleness:

(6)

Then we can compute the expectation of the global parameter update:

(7)

Eqn. (7) can be regarded as the momentum-based SGD update equation (8); thus we have shown that the asynchrony induced by local updates can be converted to an implicit momentum term.

(8)

Appendix B Convergence Proof Under ADSP

We follow a similar method to that used in SSP [9] to formalize our ADSP model. In our proof, we focus on Stochastic Gradient Descent (SGD), which is widely used for finding the optimal parameters of an ML model. Our final analysis result is as follows.

Suppose there are m workers that push updates to the parameter server at different time intervals. For worker i, let W_{i,c,k} denote the noisy local state of the parameters at worker i after c commits and k local updates, and let u_{i,c} denote the local update of worker i after c commits; W_0 is the initial state of the parameters at each worker. For a given W_{i,c,k}, let c_j represent the number of commits of worker j at that point in time. For simplicity, we omit some subscripts in the following analysis when they are clear from the context.

So the noisy state is equal to:

We define a reference sequence of states as follows, to link the noisy state sequence and the optimal state sequence together. The reference sequence is viewed as the ‘true’ sequence of parameters produced by forcing the workers to commit in a round-robin order. In other words, we can consider the reference sequence of states as the one generated by training in the order described in Algorithm 3,

where for each commit,

Here k_i denotes the number of local updates between two commits of worker i; its value can be calculated from the commit rate R_i and the training speed v_i of this machine via the transform equation k_i = v_i · Δ / R_i, where Δ denotes the check period.

1:m is the total number of workers, k_i is the commit interval (number of local updates between two commits) of worker i.
2:for c = 1, 2, … do
3:     for i = 1, 2, …, m do
4:          for k = 1, 2, …, k_i do
5:               train on an example or a mini-batch of examples
6:               calculate an update
7:          end for
8:          commit the accumulative update
9:     end for
10:end for
Algorithm 3 Reference sequence of states

Then we can derive that the reference state is equal to:

(9)

We can relate the noisy sequence of states to the reference one as below:

where we have decomposed the difference between the noisy state and the reference state into the index set of updates that are missing from the noisy state, and the index set of “extra” updates that appear in the noisy state but not in the reference state.

We restate the convergence theorem here.

Theorem 3 (Convergence).

Suppose ADSP is used to train an ML model with SGD, where the objective loss function f is convex and the learning rate decreases as η_t = η/√t, t = 1, 2, …, where η is a constant. Let W_t be the set of global parameters obtained by ADSP right after step t, and W* denote the ideal optimal parameters that minimize the loss function. ADSP ensures that between any two different workers i and j, the commit numbers by each checkpoint are roughly equal, i.e., |c_i − c_j| ≤ ε, where ε is a small constant, and the regret R = Σ_{t=1}^{T} [f_t(W_t) − f_t(W*)] is bounded by O(√T), as T → ∞.

Proof: We define D(W‖W′) := ½‖W − W′‖², and assume that D(W‖W′) ≤ F² for any W, W′ in the possible value space of the parameters, i.e., the optimization problem has bounded diameter. Because we use SGD as our algorithm, our goal is to optimize the convex objective function f by iteratively applying gradient descent, so the update at step t is W_{t+1} = W_t − η_t·∇f_t(W_t). We assume that f_t is L-Lipschitz, so its gradients are bounded by L. Let the learning rate be η_t = η/√t, with t = 1, 2, …, T. Here η, L and F are all constants.

We first expand the regret as follows:

(10)

where the notation is as defined above.

Lemma 1: For any t, the following holds:

(11)

Proof:

(12)

where the coefficient depends on the class of the example at step t. And

(13)

Combining Equation (12) and Equation (13) completes the proof of Lemma 1.

Returning to Equation (10), we use Lemma 1 to continue the proof:

(14)

First, we bound the first term:

(15)

Next, we bound the second term:

(16)

Third, before we bound the remaining term, we give a lemma:

Lemma 2: For any two different workers i and j, as long as |c_i − c_j| ≤ ε, where ε is a constant, the following bound holds:

Proof:

(17)

Since this holds for any two workers i and j, at step t we have:

So we can bound the final term:

(18)

Note that .

Returning to Equation (10), we finally get the result:

(19)

This completes the convergence proof.

Type    vCPUs    Memory (GiB)    Number of Instances
–       2        8               7
–       4        16              5
–       8        32              4
–       8        32              1
–       4        16              2
Table 1: Amazon EC2 instances
Vendor               Geekbench 4.1/4.2 64-Bit Multi-Core Score    Share (USA)
iPhone 6             2759                                         –
iPhone 6S            4459                                         –
iPhone 6S Plus       4459                                         –
iPhone SE            4459                                         –
iPhone 7             5937                                         –
iPhone 7 Plus        5937                                         –
Samsung Galaxy S8    6711                                         –
iPhone 8 Plus        11421                                        –
iPhone X             11421                                        –
iPhone 8             11421                                        –
Table 2: Smart phone market share in the USA during Q2 2018

Appendix C Convergence Speed Analysis

Given m workers, worker i trains v_i steps per second, so each step takes 1/v_i seconds; t_i^comm is the communication overhead when worker i sends updates to the PS.

BSP

The average speed: