Gradient Scheduling with Global Momentum for Non-IID Data Distributed Asynchronous Training

02/21/2019 ∙ by Chengjie Li, et al. ∙ Huazhong University of Science & Technology ∙ New Paltz

Distributed asynchronous offline training has received widespread attention in recent years because of its high performance on large-scale data and complex models. As data processing moves from cloud-centric positions to edge locations, a big challenge for distributed systems is how to handle natively non-independent and identically distributed (non-IID) data during training. Previous asynchronous training methods do not perform satisfactorily on non-IID data because the training process fluctuates greatly, which leads to abnormal convergence. We propose a gradient scheduling algorithm with global momentum (GSGM) for non-IID data distributed asynchronous training. Our key idea is to schedule the gradients contributed by computing nodes based on a white list so that each training node's update frequency remains even. Furthermore, our new momentum method can mitigate the biased gradient problem. GSGM makes the model converge effectively and remain highly available. Experimental results show that for non-IID data training under the same experimental conditions, GSGM on popular optimization algorithms can achieve a 20% increase in training stability with a slight improvement in accuracy on the Fashion-Mnist and CIFAR-10 datasets. Meanwhile, when expanding the distributed scale on the CIFAR-100 dataset, which results in a sparse data distribution, GSGM can achieve a 37% improvement in training stability. Moreover, only GSGM converges well when the number of computing nodes is 30, compared to the state-of-the-art distributed asynchronous algorithms.


1. Introduction

Mobile and IoT devices that provide intelligent services for people have become primary computing resources in recent years due to their increasingly strong computing power. These devices are located at dispersed and widely distributed “edge” locations (Zhang et al., 2017), and generate massive amounts of private data based on user-specific behaviors. In the cloud computing mode, devices at edge positions need to first transmit their respective data to the cloud, where the data can be shuffled and distributed evenly over computing nodes, so that each computing node maintains a random sample from the same distribution, i.e., independent and identically distributed (IID) data points that represent the distribution of the entire dataset (Konecný et al., 2015). However, with the increasing importance of user privacy protection and the limited bandwidth (McGraw et al., 2016) of edge nodes, data urgently need to be processed locally. In this way, we have to face non-independent and identically distributed (non-IID) data on edge nodes, where none of the above assumptions are satisfied (Jeong et al., 2018).

Thanks to the parameter server (Li et al., 2014) architecture, it is easy to handle the current computing scenarios in which the cloud and edge devices are combined. The cloud can be viewed as the parameter server that aggregates the computational results of the various computing devices, while the edge devices act as computing nodes. Thus, the data remain on the edge devices, and the server only needs to exchange the calculations with them. At the same time, the easy decomposability of some optimization algorithms for offline training enables the parameter server architecture to be quickly applied to large-scale distributed systems. The most representative of these are the stochastic gradient descent family of algorithms. Therefore, naturally, the computing nodes contribute gradients calculated on their data during the gradient descent process, while the parameter server aggregates the gradients for parameter updates. Hence, the quality of the training depends on the gradients contributed by the computing nodes, that is, on the data characteristics of these computing nodes. The gradient descent algorithm abstracts the optimization problem in offline training into a process of finding good extreme points in a multi-dimensional space. In the classic gradient descent iteration formula, $w_{t+1} = w_t - \eta \nabla F(w_t)$, the gradient $\nabla F(w_t)$ represents the direction of searching for the extreme point in each iteration, that is, the direction in which the parameters are updated, and the learning rate $\eta$ plays the role of the step size of “walking” in that direction.

Since the gradients are calculated from the data, the characteristics of the data determine the direction that the gradients represent. In the case of an IID data distribution, each data slice can be regarded as a microcosm of the overall data. Thus, the gradients calculated on each IID data slice are roughly an unbiased estimate of the overall update direction. However, in the above-mentioned edge-dominated computing scenario, non-IID data is ubiquitous, and the local data characteristics on each device are very likely to be just a subset of the features of all data participating in training. That is why gradients calculated on these edge devices represent biased directions. Meanwhile, the independence of each device’s calculation, that is, asynchrony, will cause the entire offline training process to proceed in the wrong direction.

While machine learning, especially deep learning, has performed exceptionally well in recent years in areas such as computer vision (Szegedy et al., 2017), speech recognition (Sercu et al., 2016), and natural language processing (Bahdanau et al., 2014), edge devices are also actively involved in training such high-quality models. Under an assumption of IID data (Konecný et al., 2016), methods such as Binary Neural Networks (BNNs) (Courbariaux and Bengio, 2016), quantized SGD methods (Seide et al., 2014; Alistarh et al., 2016) and sparse gradients (Zhao et al., 2018) have achieved excellent results in reducing device traffic and computational complexity. In terms of non-IID data, Federated Learning (McMahan et al., 2017; Zhao et al., 2018) based on synchronization has achieved excellent results. Nevertheless, little work currently considers adding asynchrony to the offline training of non-IID data.

We propose GSGM, a gradient scheduling algorithm with partly averaged gradients and global momentum for non-IID data distributed asynchronous training. Our key idea is to apply global momentum and a local average to the biased gradients after scheduling, which differs from other methods that apply momentum within each learner and use the gradients contributed by learners directly, so that the training process stays steady. We implement the GSGM strategy on two different popular optimization algorithms, and compare them with state-of-the-art asynchronous algorithms on non-IID data to evaluate the availability of GSGM. Moreover, we measure the performance of GSGM under different distributed scales and different degrees of non-IID data.

This paper makes the following contributions:

  1. We show that for non-IID data, global momentum should be used instead of applying momentum methods separately on each computing node. This is the cornerstone of our GSGM approach.

  2. We propose a new gradient update method called partly averaged gradients, used in a gradient scheduling strategy based on a white list. On the one hand, partly averaged gradients make full use of the previous gradients of each computing node to balance the current biased direction. On the other hand, the scheduling method allows the gradients calculated by the various computing nodes to be applied sequentially on the server side, so that the direction of the model update remains unbiased.

  3. Different from the traditional way of applying momentum on the gradients directly, we apply the global momentum to the partly averaged gradients to further stabilize the training process.

This paper is organized as follows. In Section 2, we review the algorithms and architectures associated with distributed training. In Section 3, we explain our GSGM method in detail. In Section 4, we introduce specific implementations of GSGM on two popular algorithms. In Section 5, we show the evaluation methodology and report the experimental results, followed by discussions and conclusions.

2. Background

2.1. Distributed optimization algorithm

Offline training relies on optimization algorithms, and the problem can be summarized as:

(1)   $\min_{w \in \mathbb{R}^d} F(w) := \frac{1}{n} \sum_{i=1}^{n} f_i(w)$

where $f_i$ is the loss function of the $i$-th data point, and its gradient can be computed efficiently using backpropagation (Rumelhart et al., 1988) on the local data that $f_i$ represents.

Stochastic gradient descent (SGD) (Sinha and Griscik, 1971) is a commonly used optimization algorithm. For problem (1), SGD samples a random function $f_{i_t}$ (i.e., a random data-label pair), and then performs the update step:

(2)   $w_{t+1} = w_t - \eta \nabla f_{i_t}(w_t)$

where $\eta$ is the learning rate (stepsize) parameter. It can easily be used in a distributed environment where the computing nodes calculate gradients on their local data and the server performs the update (2).
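
To make the split between learner and server concrete, here is a minimal Python sketch of update (2) in a parameter-server style; grad_fn and the toy data are illustrative assumptions, not the paper's implementation.

    import numpy as np

    def sgd_server_update(w, grad, lr):
        # Equation (2): the server applies one stochastic gradient step.
        return w - lr * grad

    def learner_step(w, x, y, grad_fn):
        # A learner computes the gradient of the loss on one (data, label) pair
        # from its local shard and pushes it to the server.
        return grad_fn(w, x, y)

    # Toy usage with a least-squares loss 0.5 * (w.x - y)^2, so grad = (w.x - y) * x.
    grad_fn = lambda w, x, y: (w @ x - y) * x
    w = np.zeros(3)
    x, y = np.ones(3), 2.0                 # stands in for one local sample
    g = learner_step(w, x, y, grad_fn)
    w = sgd_server_update(w, g, lr=0.1)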

Stochastic variance reduced gradient (SVRG) (Johnson and Zhang, 2013) mitigates the noise variance problem caused by random sampling in SGD. SVRG executes two nested loops. In the outer loop, it computes the full gradient $\tilde{\mu} = \nabla F(\tilde{w})$ of the whole function at a snapshot $\tilde{w}$. In the inner loop, it performs the update step:

(3)   $w_{t+1} = w_t - \eta \left( \nabla f_{i_t}(w_t) - \nabla f_{i_t}(\tilde{w}) + \tilde{\mu} \right)$

where $\eta$ is the learning rate. In the distributed setting, the computing nodes are required to synchronize once to obtain the unbiased full gradient $\tilde{\mu}$, and then perform the iterations (3) of the inner loop in parallel, just like distributed SGD.
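
As a sketch of update (3) (again illustrative, with a toy quadratic loss standing in for the real model):

    import numpy as np

    def svrg_inner_update(w, w_snap, mu_full, grad_fn, sample, lr):
        # Equation (3): variance-reduced step using the snapshot w_snap and
        # the full gradient mu_full computed at that snapshot.
        vr_grad = grad_fn(w, sample) - grad_fn(w_snap, sample) + mu_full
        return w - lr * vr_grad

    # Toy usage: f_i(w) = 0.5 * ||w - x_i||^2, so its gradient is (w - x_i).
    grad_fn = lambda w, x: w - x
    data = [np.ones(3), -np.ones(3)]
    w_snap = np.zeros(3)
    mu_full = np.mean([grad_fn(w_snap, x) for x in data], axis=0)  # outer-loop full gradient
    w = w_snap.copy()
    for x in data:                                                 # inner loop
        w = svrg_inner_update(w, w_snap, mu_full, grad_fn, x, lr=0.1)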

2.2. Momentum method

The momentum method (Polyak, 1964) is designed to speed up learning, especially in the presence of high curvature, small but consistent gradients, or noisy gradients. It accumulates an exponentially decaying moving average of the previous gradients and continues to move in that direction. The hyperparameter $m$ determines how fast the contribution of previous gradients decays. Its update rule is:

(4)   $v_{t+1} = m v_t - \eta g_t, \qquad w_{t+1} = w_t + v_{t+1}$

where $w$ represents the model parameters, $g$ is the gradient calculated by a specific algorithm such as SGD or SVRG, $v$ and $m$ are the velocity and momentum respectively, and $\eta$ is the learning rate. Here, the velocity can be regarded as the cumulative effect of previous gradients. When many successive gradients point in the same direction, the step size is maximized to achieve an acceleration effect.
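
A minimal sketch of the momentum rule (4), using the same sign convention (the velocity accumulates negative gradient steps); the toy gradient is an illustrative assumption.

    import numpy as np

    def momentum_update(w, v, grad, lr, m):
        # Equation (4): v <- m * v - lr * g;  w <- w + v
        v = m * v - lr * grad
        return w + v, v

    w, v = np.zeros(5), np.zeros(5)
    for _ in range(3):
        g = np.ones(5)                 # several consistent gradients in the same direction
        w, v = momentum_update(w, v, g, lr=0.1, m=0.9)
    # The effective step grows each iteration because the velocity keeps accumulating.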

Figure 1. Asynchronous training: each learner calculates gradients and updates parameters independently.

2.3. Distributed training architecture

The data parallel offline training approach works in the context of edge computing because the data is partitioned across computational devices, and each device (called learner here) has a copy of the learning model. Each learner computes gradients on its data shard, and the gradients are combined to update the target model parameters. Different ways of combining gradients result in different training modes.

Synchronous training. The gradients calculated by all learners are averaged or just summed after each iteration. Hence, the faster learners have to wait for the results of the slower learners, which makes this approach less efficient. At every iteration, all learners push their gradients to server, and server applies them to update the global model, and returns the latest model parameters to all learners to continue their calculation for the next iteration.

Asynchronous training. Each learner can access a shared-memory space (called server here), where the global parameters are stored. Each learner calculates gradients on its local data shard, and then uploads them to the server to update global parameters. Then it obtains the updated parameters to continue calculations. The advantage of asynchronous training is that each learner calculates at its own pace, with little or no waiting for other learners. Figure 1 shows how fully asynchronous training works. Server maintains a global time clock and only one learner can update model and get the latest parameters at every clock. That is, each learner works independently.

Stale Synchronous Parallel (SSP) (Cipar et al., 2013) controls the update frequency of each learner, in contrast to fully asynchronous training. It is a pattern of trade-offs between synchronous and asynchronous training. SSP allows each learner to update the model in an asynchronous manner, but adds a limit (threshold) so that the difference between the fastest and the slowest learners’ progress does not grow too large. Figure 2 briefly illustrates how SSP works. Four learners work in an asynchronous way; learners 2 and 4 have just completed three iterations while learner 3 has completed five iterations. Since the limit threshold is one, learner 3 has to be blocked to wait for learners 2 and 4.
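
The blocking rule can be sketched as follows; the dictionary of per-learner iteration counts and the exact threshold semantics are simplifying assumptions rather than the Petuum implementation.

    def ssp_can_proceed(iteration_counts, learner, threshold):
        # A learner may keep running only while its progress stays within
        # `threshold` iterations of the slowest learner; otherwise it is blocked.
        slowest = min(iteration_counts.values())
        return iteration_counts[learner] - slowest <= threshold

    # The situation of Figure 2: learner 3 has done 5 iterations, learners 2 and 4 only 3.
    counts = {1: 4, 2: 3, 3: 5, 4: 3}
    print(ssp_can_proceed(counts, learner=3, threshold=1))   # False: learner 3 must wait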

Figure 2. SSP training: each learner works asynchronously within the threshold and will be blocked when it works too fast.

3. Proposed Method

3.1. Distributed optimization problem

Problem (1) expresses the overall optimization goal of machine learning, while in a data parallel distributed environment, problem (1) is decomposed into:

(5)   $F(w) = \sum_{k=1}^{K} \frac{n_k}{n} F_k(w), \qquad F_k(w) = \frac{1}{n_k} \sum_{i \in \mathcal{P}_k} f_i(w)$

where $K$ is the number of learners, $\mathcal{P}_k$ is the set of indexes of data points on learner $k$, and $n_k = |\mathcal{P}_k|$. This means the loss function $F_k$ on each learner participates in the overall optimization goal and determines the convergence of the global model parameters. In asynchronous parallelism, each update directly uses $\nabla F_k(w)$ as its own estimate of $\nabla F(w)$ instead of aggregating all of the $\nabla F_k(w)$.

1:repeat
2:     if the model parameters $w$ from the server are received then
3:         compute gradients $g$ on the local data shard according to the specific algorithm
4:         push (upload) $g$ to the server
5:     end if
6:until the learner satisfies the end condition
Algorithm 1 GSGM: Learner

3.2. Global momentum for non-IID data

1:learning rate $\eta$, momentum $m$, model parameters $w$, velocity $v$, partly averaged gradients $\bar{g}$
2:$v \leftarrow 0$; $\bar{g} \leftarrow 0$; $u_k \leftarrow 0$ for every learner $k$; $w$ is initialized randomly; a white list $W \leftarrow \{1, \dots, K\}$; a wait queue $Q$ is empty
3:repeat
4:     if gradients $g_k$ calculated by learner $k$ are received then
5:         if $W$ is empty then
6:              $\bar{g} \leftarrow \frac{1}{K} \sum_{j=1}^{K} u_j$
7:              recover $W \leftarrow \{1, \dots, K\}$
8:              for each $(j, g_j)$ in $Q$ in order do
9:                  $v \leftarrow m v - \eta \bar{g}$
10:                  $w \leftarrow w + v - \eta g_j$
11:                  $u_j \leftarrow g_j$
12:                  remove $j$ from $W$
13:                  send $w$ to learner $j$, i.e., $w_j \leftarrow w$
14:              end for
15:              clear $Q$
16:         end if
17:         if $k$ is in $W$ then
18:              $v \leftarrow m v - \eta \bar{g}$
19:              $w \leftarrow w + v - \eta g_k$
20:              $u_k \leftarrow g_k$
21:              send $w$ to learner $k$, i.e., $w_k \leftarrow w$
22:              remove $k$ from $W$
23:         else
24:              add $(k, g_k)$ to $Q$
25:         end if
26:     end if
27:until all learners complete their calculations
Algorithm 2 GSGM: Server

Momentum methods play a crucial role in accelerating and stabilizing the training of machine learning models. In a distributed scenario, there are two ways to apply momentum. One is to apply momentum separately on each computing node (we call this “local momentum” here), so that the central server receives the velocity after momentum acceleration, as in EAMSGD (Zhang et al., 2015), the deep gradient compression algorithm (Lin et al., 2017) and so on. The other is to apply momentum uniformly to the gradients of all computing nodes on the server side (we call this “global momentum” here). In terms of problem (5), for the IID data setting, $\nabla F_k(w)$ on each learner is an unbiased estimate of $\nabla F(w)$, that is, $\mathbb{E}[\nabla F_k(w)] = \nabla F(w)$. In this case, applying local momentum is more intuitive, and the server only needs to perform simple iterative updates. However, for the non-IID data setting, $\nabla F_k(w)$ could be an arbitrarily bad approximation to $\nabla F(w)$. Here, asynchrony makes the direction of the parameter update after each iteration biased towards the direction of the gradients used for that iteration. If local momentum is applied to each learner’s calculation, it will aggravate this situation, causing the convergence process to be biased towards the directions of each learner’s gradients. Naturally, global momentum is more suitable for non-IID data. In each iteration, what is used for updating is the velocity that accumulates all the previous gradients. For this reason, global momentum acts as a correction to the current biased gradients, so that the parameter update proceeds in the proper direction.

Assume that there are two learners, $l_1$ and $l_2$, and each learner performs its calculations twice asynchronously. They contribute their gradients one after another, which is a very likely situation in asynchronous parallelism. We use the standard momentum method given by Equation (4) to illustrate the difference between local and global momentum. Figure 3 shows how the gradients and the updated model parameters change under the local and global momentum methods, where $v_{k,i}$ represents the cumulative velocity after learner $l_k$ performs its $i$-th calculation, while $v_i$ represents the globally accumulated velocity after the $i$-th update in total. The arrows represent the update order of the global parameters. When $l_2$’s first gradients arrive, the parameters updated with local momentum are biased towards $l_2$’s direction, while the update direction under global momentum remains relatively unbiased. It can be expected that the difference between these two methods becomes more and more obvious as the distributed scale increases.

Figure 3. Local momentum (left) vs global momentum (right): local momentum accumulates gradients in each learner’s direction, resulting in the biased update direction while global momentum makes full use of all the previous gradients to keep the update direction unbiased, pulling the biased direction back.
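
For intuition, the two-learner scenario can be sketched numerically as follows (illustrative only; the opposing gradient directions stand in for two skewed data shards):

    import numpy as np

    def momentum_step(w, v, g, lr=0.1, m=0.9):
        # Standard momentum step of Equation (4).
        v = m * v - lr * g
        return w + v, v

    g1, g2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])  # biased directions of learners l1 and l2
    w_loc = w_glo = np.zeros(2)
    v1 = v2 = v = np.zeros(2)

    for g, who in [(g1, 1), (g1, 1), (g2, 2), (g2, 2)]:   # gradients arrive one after another
        if who == 1:                                      # local momentum: one velocity per learner
            w_loc, v1 = momentum_step(w_loc, v1, g)
        else:
            w_loc, v2 = momentum_step(w_loc, v2, g)
        w_glo, v = momentum_step(w_glo, v, g)             # global momentum: one shared velocity

    # With local momentum the last two steps point only along l2's direction,
    # while the shared velocity still carries l1's accumulated direction into them.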

3.3. Gradient scheduling

Figure 3 shows that global momentum can keep the training direction unbiased to a certain extent. However, if the gradients of some learners are applied continuously under asynchronous conditions, the training direction will still become biased, because global momentum would continuously weaken the contribution of the previous gradients. We should avoid the following situations.

  1. One or several learners contribute gradients significantly faster than the other learners. Then the global parameters are updated continuously toward the direction of the gradients contributed by the fast learners. In this way, the effect of the previous gradients is gradually weakened, which still leads to a gradually biased training direction.

  2. We can force the fast learners to wait a little for the slow learners to keep the gap within a certain range, as SSP does. However, within this range, the updates driven by each learner should also proceed in an orderly manner. Global momentum is sensitive to the sequence of gradients submitted to the server. Therefore, when each learner submits gradients in an orderly fashion, the velocity accumulated by global momentum will be closest to an unbiased estimate of the global optimization direction.

We use a gradient scheduling strategy based on a white list. The central idea of this strategy is that once the gradients submitted by a fast learner are used for updating, the learner is removed from the white list. Its gradients cannot be applied to updating again until the white list is empty and then restored to the initial setting. This white list prevents fast learners from continuously updating the global parameters, and guarantees the balance and orderliness of each learner’s update.
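
A bare-bones sketch of this white-list bookkeeping (the data structures and the apply_update callback are assumptions for illustration, not the authors' code):

    def schedule(white_list, wait_queue, all_learners, learner, grads, apply_update):
        # A fast learner cannot drive two updates within the same scheduling round.
        if learner in white_list:
            apply_update(learner, grads)
            white_list.discard(learner)
        else:
            wait_queue.append((learner, grads))
        if not white_list:                       # a round is over:
            white_list.update(all_learners)      # restore the white list, then
            while wait_queue:                    # drain the queued gradients in order
                j, g = wait_queue.pop(0)
                apply_update(j, g)
                white_list.discard(j)

    # Usage sketch: white_list = set(range(K)); wait_queue = []; apply_update applies
    # one server-side update and replies to the learner with the latest parameters.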

3.4. Partly averaged gradients

Besides, we propose partly averaged gradients as the object to which global momentum is applied. Specifically, when the white list is empty, which means all the learners have performed an iteration and a round of scheduling is over, we calculate and save the average of the gradients used in each learner's last update as the partly averaged gradients for the next round of scheduling. We apply global momentum to the partly averaged gradients in the following iterations, rather than directly accumulating the previous gradients. In summary, our proposed GSGM algorithm is shown as Algorithms 1 and 2.

In Algorithm 1, learners calculate gradients on their local data shards according to the specific algorithm, such as SGD or SVRG, and then upload the calculated gradients to the server, waiting for the update. After that, learners get the latest model parameters and continue their calculations. Learners keep doing so until the end condition is satisfied. What the learners do is no different from other asynchronous methods.

In Algorithm 2, the server maintains a list of learners to be updated as a white list. When gradients calculated by a learner arrive, the server first queries whether this learner is in the white list. If so, the server applies its gradients to update the global parameters, and then returns the latest parameters to this learner, so that it can continue calculating the next gradients. This is the asynchronous nature of training. In the meantime, this learner is removed from the list. If the result of the query is negative, the gradients are added to a wait queue and this learner is blocked. When the list is empty, the wait queue is traversed, applying the gradients in it in turn to update the global model. Then the list is restored to the initial setting. The server repeats the above process until all the learners complete their calculations.

In terms of the specific update iterations on the server, there are two kinds of gradients: biased gradients $g_k$ and partly averaged gradients $\bar{g}$. The biased gradients $g_k$ are calculated by learner $k$ based on its local data shard; they represent the biased direction of the skewed data. The partly averaged gradients $\bar{g}$ are obtained when the white list is empty; they are calculated from the gradients of all the learners used in their last updates. We use $\bar{g}$ to drive the velocity in the standard momentum method, in order to retain and accumulate the correct direction, which “pulls” the biased direction of $g_k$ back (lines 18-20 and 9-11 of Algorithm 2).

It is worth noting that the partly averaged gradients are the average of the recent gradients of all the learners. They can be considered an approximately unbiased estimate of the overall update direction, similar to the averaged gradients of each iteration in synchronous training. Meanwhile, the partly averaged gradients carry information about the recent updates of all the learners, which is why they complement stale gradients to some extent.

To sum up, our gradient scheduling algorithm using partly averaged gradients with global momentum allows a certain amount of asynchrony to be introduced when training non-IID data. Compared with distributed asynchronous optimization methods with local momentum, the correction of global momentum and the complement of partly averaged gradients make the training direction gradually stable and unbiased. Figure 4 illustrates the difference between our method and other methods.
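
As a minimal sketch of one server-side update step (consistent with Equation (4) and the update lines of Algorithm 2; names are illustrative, not a verbatim transcription): the server keeps one global velocity driven by the partly averaged gradients and still applies the current learner's biased gradients.

    import numpy as np

    class GSGMServer:
        def __init__(self, dim, num_learners, lr=1e-3, momentum=0.9):
            self.w = np.zeros(dim)                       # global model parameters
            self.v = np.zeros(dim)                       # global (shared) velocity
            self.g_bar = np.zeros(dim)                   # partly averaged gradients
            self.last = {k: np.zeros(dim) for k in range(num_learners)}
            self.lr, self.m = lr, momentum

        def apply(self, learner, grads):
            # Global momentum accumulates the partly averaged direction ...
            self.v = self.m * self.v - self.lr * self.g_bar
            # ... while the learner's (possibly biased) gradients are still applied.
            self.w = self.w + self.v - self.lr * grads
            self.last[learner] = grads                   # remember this learner's latest gradients
            return self.w                                # sent back to the learner

        def end_of_round(self):
            # Called when the white list empties: average each learner's last-used gradients.
            self.g_bar = np.mean(list(self.last.values()), axis=0)

In this sketch, apply plays the role of the update lines inside Algorithm 2, and end_of_round computes the partly averaged gradients when the white list is refilled.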

1:local model parameters $w_k$; snapshot parameters $\tilde{w}$ and full gradient $\tilde{\mu}$ from the latest outer loop
2:repeat
3:     receive the model parameters $w$ from the server.
4:     $w_k \leftarrow w$
5:     if this is an update task then
6:         pick a mini-batch subset $I_k$ randomly from the local data shard.
7:         compute the gradient $g_k = \nabla f_{I_k}(w_k) - \nabla f_{I_k}(\tilde{w}) + \tilde{\mu}$ as in Equation (3).
8:         push $g_k$ to the server.
9:     else
10:         $\tilde{w} \leftarrow w_k$; compute the local full gradient $\mu_k = \sum_{i \in \mathcal{P}_k} \nabla f_i(\tilde{w})$.
11:         push $\mu_k$ to the server.
12:     end if
13:until the learner satisfies the end condition
Algorithm 3 GSGM-SVRG: Learner
Figure 4. Distributed asynchronous optimization with local momentum (left) and gradient scheduling using partly averaged gradients with global momentum (right) for non-IID data: the latter gradually stabilizes the training process.

4. Implementation

We implement our GSGM method in distributed asynchronous SGD and SVRG algorithms. For SGD, our prototype is Downpour SGD (Dean et al., 2012); for SVRG, our prototype is Asynchronous Stochastic Variance Reduction (referred to as ASVRG) (Reddi et al., 2015).

In asynchronous SGD, GSGM is applied on the server. Intuitively, each learner calculates its gradients and submits them to the server, as in Algorithm 1, while the server carries out Algorithm 2.

For asynchronous SVRG, we apply GSGM on the server to schedule the parameter updates of the inner loop, while learners calculate gradients as in ASVRG (Algorithms 3 and 4). It is important to note that there are two parallel processes in asynchronous SVRG. When the full gradient $\tilde{\mu}$ is calculated, learners compute and accumulate gradients directly on their local data shards, and the server then averages them. This parallel process does not need GSGM because it requires only one communication round between the learners and the server; it is an inevitable synchronous operation for the learners. We use GSGM in SVRG's inner loop, which can be asynchronous for the learners.

1:learning rate $\eta$, momentum $m$, model parameters $w$, velocity $v$
2:same as Algorithm 2
3:repeat
4:     if this is an update task then
5:         receive $g_k$ from learner $k$.
6:         schedule the gradients and update the model parameters using Algorithm 2.
7:         push $w$ to learner $k$ as an update task.
8:     else
9:         compute $\tilde{\mu} = \frac{1}{K} \sum_{k=1}^{K} \mu_k$ from the local full gradients pushed by the learners.
10:         push $\tilde{\mu}$ to all learners as a common task.
11:     end if
12:until all learners complete their calculations
Algorithm 4 GSGM-SVRG: Server

5. Evaluation

Figure 5. GSGM-SGD experiments on MnistNet: (a)-(c) model accuracy and (d)-(f) training stability for 10, 20, and 30 learners.

5.1. Experimental Setup

We implement all the algorithms using the open source framework PyTorch (https://pytorch.org/). Note that our main goal is to control the parametric variables to illustrate the advantages of our approach over other methods, rather than to achieve state-of-the-art results on each selected dataset. Meanwhile, since edge devices are the main computational power in the context of our algorithm, GSGM mainly targets computation on CPUs, and the models we select are suitable for being trained on CPUs. We use the following datasets in our experiments.

  1. Fashion-Mnist (https://github.com/zalandoresearch/fashion-mnist). Fashion-Mnist (Xiao et al., 2017) consists of a training set of 60,000 examples and a test set of 10,000 examples. Each example is a 28x28 grayscale image, associated with a label from 10 classes.

  2. CIFAR-10 and CIFAR-100 (https://www.cs.toronto.edu/~kriz/cifar.html). The CIFAR-10 dataset is a labeled dataset consisting of 60,000 32x32 colour images in 10 classes, with 6,000 images per class. There are 50,000 training images and 10,000 test images. The CIFAR-100 dataset is just like CIFAR-10, except that it has 100 classes containing 600 images each. It is much more difficult to train models on the CIFAR-100 dataset.

Figure 6. GSGM-SGD experiments on LeNet: (a)-(c) model accuracy and (d)-(f) training stability for 10, 20, and 30 learners.

For non-IID data, we first arrange the training images in the order of their category labels, and then distribute them equally to each learner in that order, similar to what is done in the FedAvg algorithm (McMahan et al., 2017). Thus, each learner contains only one or a few categories of images. This is a pathological non-IID partition of the data, which illustrates the difficulty of training on highly non-IID data (a minimal sketch of this partition follows the model list). We use the following models during training.

  1. MnistNet on Fashion-Mnist (MnistNet). MnistNet is a convolutional neural network with two convolutional layers and two fully connected layers from the PyTorch Tutorials (https://github.com/pytorch/examples). There are 21k parameters in MnistNet.

  2. LeNet-5 on CIFAR-10 (LeNet). We classify images in the CIFAR-10 dataset using a convolutional neural network called LeNet-5 (LeCun et al., 1998). LeNet-5 consists of two convolutional layers and three fully connected layers, with about 62k parameters to be trained.

  3. ConvNet on CIFAR-100 (ConvNet). For the CIFAR-100 dataset, we build a larger convolutional neural network with three convolutional layers and two fully connected layers (called ConvNet here), which is modified from an open source project (https://github.com/simoninithomas/cifar-10-classifier-pytorch). ConvNet contains about 586k parameters.
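
As referenced above, the pathological non-IID partition can be sketched as follows (a simplified stand-in, assuming the labels are available as a NumPy array):

    import numpy as np

    def pathological_non_iid_split(labels, num_learners):
        # Sort example indices by class label, then hand out equal contiguous chunks,
        # so each learner ends up with only one or a few classes.
        order = np.argsort(labels, kind="stable")
        return np.array_split(order, num_learners)

    # Toy usage: 60 examples over 10 classes split across 20 learners (~1 class per shard).
    labels = np.repeat(np.arange(10), 6)
    shards = pathological_non_iid_split(labels, 20)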

We use mini-batches of size 100, and the other important experimental settings used in training are shown in Table 1. Note that the decay epoch $d$ and decay factor $\gamma$ indicate that the learning rate is multiplied by $\gamma$ at the $d$-th epoch (a short sketch of this schedule follows Table 1). In particular, for the SVRG algorithm, the decay value $d$ refers to the $d$-th outer loop. For SGD-based algorithms, we train 100 epochs on each dataset, whereas 20 outer loops with 5 inner loops each (100 epochs in total) are performed for SVRG-based algorithms. For brevity in our experiments, we refer to local momentum as “LM”. When the momentum method is not applied in an algorithm, we adjust the initial learning rate to 10 times the original setting for fairness.

Algorithm   Model      Initial learning rate   Decay epoch   Decay factor   Momentum
SGD         MnistNet   5e-4                    75            0.5            0.9
SGD         LeNet      1e-3                    50, 80        0.5            0.9
SVRG        ConvNet    0.0025                  12            0.5            0.9
Table 1. Experimental setup
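
For concreteness, the decay rule in Table 1 can be read as the following schedule (a sketch; the numbers are the LeNet row of the table):

    def decayed_lr(initial_lr, epoch, decay_epochs, decay_factor):
        # Multiply the learning rate by the decay factor once for each decay epoch reached.
        lr = initial_lr
        for d in decay_epochs:
            if epoch >= d:
                lr *= decay_factor
        return lr

    lr_at_60 = decayed_lr(1e-3, 60, [50, 80], 0.5)   # 5e-4 after the first decay
    lr_at_90 = decayed_lr(1e-3, 90, [50, 80], 0.5)   # 2.5e-4 after both decays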

In this section, the following distributed asynchronous algorithms are compared:

  1. Downpour SGD (Dean et al., 2012) (DSGD), our prototype when applying GSGM in asynchronous SGD.

  2. Petuum SGD (Ho et al., 2013; Dai et al., 2013) (PSGD), the state-of-the-art implementation of asynchronous SGD. We use PSGD-$s$ to indicate that the limit threshold (staleness) in PSGD is set to $s$.

  3. Asynchronous Stochastic Variance Reduction (Reddi et al., 2015) (ASVRG), our prototype when applying GSGM in asynchronous SVRG.

  4. Distr-vr-sgd (Zhang et al., 2016) (DVRG), the state-of-the-art implementation of the distributed asynchronous variance-reduced method. The picked computing nodes are all the learners in the system. We use DVRG-$s$ to indicate that the limit threshold (decay parameter) in DVRG is set to $s$.

5.2. GSGM-SGD experiments

Model accuracy. We use two different models on the Fashion-Mnist and CIFAR-10 datasets to evaluate the performance of GSGM in distributed asynchronous SGD algorithms, as shown in Figures 5(a)-(c) and 6(a)-(c). We compare GSGM with asynchronous algorithms including DSGD, DSGD with local momentum (DSGD-LM), PSGD with a threshold of 1, with and without local momentum (PSGD-1, PSGD-1-LM), and PSGD with a threshold of 2 (PSGD-2, PSGD-2-LM). Note that we do not compare with PSGD when its threshold is 0, because PSGD-0 is a synchronous algorithm. It can be seen from the figures that GSGM achieves a slightly higher classification accuracy than the other asynchronous algorithms on the test datasets, and also reaches an acceptable classification accuracy faster. First of all, fully asynchronous DSGD (DSGD and DSGD-LM) cannot converge, while GSGM converges smoothly and normally. Secondly, the comparison of GSGM with the other algorithms is given in Table 2. The values in the table are the differences between the peak model accuracy reached by GSGM and by the other asynchronous algorithms during training (we measure the accuracy on the server at the end of each epoch). Under different distributed scales, GSGM improves model accuracy for non-IID data.

Compared algorithms: PSGD-1, PSGD-1-LM, PSGD-2, PSGD-2-LM
MnistNet   10 learners   +0.85%
           20 learners   +0.64%
           30 learners   +0.35%
LeNet      10 learners   +1.34%
           20 learners   +2.23%
           30 learners   +2.64%
Table 2. Improved accuracy of GSGM
Figure 7. Training accuracy on ConvNet for 10, 20, and 30 learners.

Training stability. Stability is especially important in offline training. When it is necessary to decide whether training should be terminated, a stable training process allows an accurate judgment, while a curve with large oscillations cannot tell whether the current model has reached the target. We treat the training stability as the standard deviation of the model accuracy values. Therefore, in Figures 5(d)-(f) and 6(d)-(f), the vertical axis represents the standard deviation of the model accuracy on a logarithmic scale. The smaller the value, the smaller the variance of the training process, i.e., the smaller the training oscillation. These figures show that under different datasets, models and distributed scales, the variance of the convergence process generated by GSGM is the smallest, which means the oscillation caused by biased directions during non-IID asynchronous training is effectively suppressed. More specifically, Table 3 quantifies the improvement of GSGM in stability for non-IID data training. Compared with the other algorithms, GSGM achieves a significant improvement in training stability, up to about 20%.

Compared algorithms: PSGD-1, PSGD-1-LM, PSGD-2, PSGD-2-LM
MnistNet   10 learners   20.42%
           20 learners   11.46%
           30 learners   17.06%
LeNet      10 learners   8.62%
           20 learners   15.33%
           30 learners   8.45%
Table 3. Improved stability of GSGM

5.3. GSGM-SVRG experiments

We evaluate the performance of GSGM based on the SVRG algorithm on the CIFAR-100 dataset, which has many more categories. According to our method of generating non-IID data, as the number of learners increases, the data on each learner become more and more sparse and skewed. In particular, when the number of learners is 30, there are merely about 3-4 categories of data on each learner, compared to the 100 categories of the entire dataset.

Figure 8. Training accuracy (a)-(c) and training stability (d)-(f) under different levels of non-IID data.
Figure 9. Training stability on ConvNet for 10 and 20 learners.

Figures 7 and 9 show the accuracy and stability of training on the CIFAR-100 dataset. With 10 and 20 learners, GSGM is still better than the other algorithms in training stability (Figures 9(a) and 9(b)), achieving faster convergence and slightly higher accuracy (Figures 7(a) and 7(b)). With 30 learners, only GSGM reaches a smooth training process and a highly available model, while the other algorithms fail to converge normally (Figure 7(c)). Quantitatively, the relevant improvements are displayed in Table 4; the results in the table are computed against the algorithms that eventually converge. For this kind of sparse non-IID data, the improvement in training stability by GSGM is even more pronounced, exceeding 37% compared to the other asynchronous algorithms.

Compared algorithms: DVRG-1, DVRG-1-LM, DVRG-2, DVRG-2-LM
Accuracy    10 learners   +0.94%
            20 learners   +0.47%
Stability   10 learners   +37.51%
            20 learners   +30.72%
Table 4. The improved values of GSGM

5.4. Partial non-IID data experiments

We also evaluate the robustness of GSGM for non-extreme non-IID data distributions, i.e., different degrees of non-IID data. To generate a given fraction of non-IID data, the entire dataset is thoroughly shuffled, the IID portion of the data is first assigned to each learner, and the remaining portion is then distributed to the learners by category label. We select ConvNet as our model, using SGD-based algorithms under a fixed distributed scale, and the related hyperparameters are the same as above.
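
A rough sketch of this partially non-IID generation, where frac denotes the non-IID fraction (the value used below is only a toy example, not one of the settings reported in Figure 8):

    import numpy as np

    def partial_non_iid_split(labels, num_learners, frac, seed=0):
        # Shuffle everything, give each learner an IID share of (1 - frac) of the data,
        # then distribute the remaining frac of the data by class label.
        rng = np.random.default_rng(seed)
        idx = rng.permutation(len(labels))
        cut = int(round((1.0 - frac) * len(labels)))
        iid_part, skewed_part = idx[:cut], idx[cut:]
        iid_shards = np.array_split(iid_part, num_learners)
        skewed = skewed_part[np.argsort(labels[skewed_part], kind="stable")]
        skewed_shards = np.array_split(skewed, num_learners)
        return [np.concatenate([a, b]) for a, b in zip(iid_shards, skewed_shards)]

    labels = np.repeat(np.arange(10), 100)
    shards = partial_non_iid_split(labels, 10, frac=0.5)   # 50% is just an illustrative value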

Figure 8 shows GSGM's performance on different levels of non-IID data. We test three degrees of non-IID data distribution (Figures 8(a) and 8(d), 8(b) and 8(e), and 8(c) and 8(f), respectively). For model accuracy, GSGM is always slightly higher than the other algorithms used for comparison, and for training stability, GSGM's accuracy standard deviation is always the lowest. Therefore, GSGM keeps the offline training process stable and effective in different situations; that is, GSGM is robust to multiple data distributions.

6. Related Work

Under the IID data setting, the state-of-the-art implementations of asynchronous SGD and SVRG are Downpour SGD (Dean et al., 2012) and Petuum SGD (Ho et al., 2013; Dai et al., 2013), and Asynchronous Stochastic Variance Reduction (Reddi et al., 2015) and Distr-vr-sgd (Zhang et al., 2016), respectively. There is no barrier or blocking between learners in Downpour SGD and Asynchronous Stochastic Variance Reduction, while Petuum SGD and Distr-vr-sgd restrict the update frequency gap between learners. Building on these ideas, more recent algorithms such as the learning rate scheduling algorithm (Dutta et al., 2018) and DC-ASGD (Zheng et al., 2017) are dedicated to alleviating the problems caused by inconsistencies in asynchronous systems. However, these works still assume IID data.

Regarding non-IID data, the FSVRG algorithm (Konecný et al., 2016, 2015) is an improvement of the synchronous version of the native SVRG algorithm. In large-scale distributed computing scenarios where data volumes are unbalanced and data label distributions are inconsistent, the convergence efficiency of FSVRG is as good as that of the GD algorithm. The FedAvg algorithm (McMahan et al., 2017) modifies the distributed synchronous SGD algorithm. In the context of federated learning, it reduces the number of communication rounds while ensuring the accuracy of the model. Both of these distributed algorithms still need synchronous operations at the end of each training epoch.

Our work differs from the above in that we focus on the adverse effects of asynchronous features in distributed systems on non-IID data training. GSGM introduces asynchrony for non-IID data offline training.

7. Conclusion

In scenarios where the cloud and edge devices are combined, offline distributed training often has to face non-IID data. Existing asynchronous algorithms find it difficult to perform well on non-IID data because of their asynchronous nature. This paper proposes a gradient scheduling strategy (GSGM) that applies global momentum to partly averaged gradients, instead of using momentum directly in each computing node, for non-IID data training. The core idea of GSGM is to perform orderly scheduling of the gradients contributed by the computing nodes, so that the update direction of the global model remains unbiased. Meanwhile, GSGM makes full use of previous gradients to steady the training process through partly averaged gradients and by applying the momentum method globally. Experiments show that in asynchronous SGD and SVRG, for both densely and sparsely distributed non-IID data, GSGM makes a significant improvement in training stability while increasing model accuracy slightly. Besides, GSGM is robust to different degrees of non-IID data.

References