1. Introduction
Mobile and IoT devices that provide intelligent services for people have become the primary computing resources in recent years due to their increasingly strong computing power. These devices are located at dispersed and widely distributed “edge” locations (Zhang et al., 2017), and generate massive amounts of private data based on user-specific behaviors. In the cloud computing mode, devices at the edge first need to transmit their data to the cloud, where the data can be shuffled and distributed evenly over computing nodes, so that each computing node maintains a random sample from the same distribution, i.e. independent and identically distributed (IID) data points that represent the distribution of the entire dataset (Konecný et al., 2015). However, with the increasing importance of user privacy protection and the limited bandwidth (McGraw et al., 2016) of edge nodes, data increasingly need to be processed locally. We therefore have to face non-independent and identically distributed (non-IID) data on edge nodes, where none of the above assumptions are satisfied (Jeong et al., 2018).
Thanks to the parameter server (Li et al., 2014) architecture, it is easy to handle computing scenarios in which the cloud and edge devices are combined. The cloud can be viewed as the parameter server that exchanges computational results with the various computing devices, while the edge devices act as computing nodes. Thus, the data remains on the edge devices, and the server only needs to exchange calculation results with them. At the same time, some optimization algorithms for offline training are easy to split, which enables the parameter server architecture to be quickly applied to large-scale distributed systems. The most representative optimization algorithms are the family of stochastic gradient descent algorithms. Naturally, the computing nodes contribute gradients calculated on their data during the gradient descent process, while the parameter server aggregates the gradients for parameter updates. Hence, the quality of the training depends on the gradients contributed by the computing nodes, that is, on the characteristics of the data on these computing nodes. The gradient descent algorithm abstracts the optimization problem in offline training into a process of finding good extreme points in multidimensional data. In the classic gradient descent iteration formula, $w_{t+1} = w_t - \eta \nabla f(w_t)$, the gradient $\nabla f(w_t)$ determines the direction of the search for the extreme point in each iteration, that is, the direction in which the parameter is updated, and the learning rate $\eta$ plays the role of the step size of “walking” in this direction.
Since the gradients are calculated from the data, the characteristics of the data determine the direction that the gradients represent. In the case of IID data distribution, each data slice can be regarded as a microcosm of the overall data. Thus, the gradients calculated on each IID data slice are roughly an unbiased estimate of the overall update direction. However, in the above-mentioned edge-dominated computing scene, non-IID data is ubiquitous, and the local data characteristics on each device are very likely to be just a subset of the features of all data participating in training. That is why gradients calculated on these edge devices represent biased directions. Meanwhile, the independence of each device’s calculation, that is, asynchrony, can cause the entire offline training process to proceed in the wrong direction.
While machine learning, especially deep learning, has performed exceptionally well in recent years in areas such as computer vision (Szegedy et al., 2017), speech recognition (Sercu et al., 2016), and natural language processing (Bahdanau et al., 2014), edge devices are also actively involved in training such high-quality models. Under the assumption of IID data (Konecný et al., 2016), methods such as Binary Neural Networks (BNNs) (Courbariaux and Bengio, 2016), quantization SGD methods (Seide et al., 2014; Alistarh et al., 2016) and sparse gradients (Zhao et al., 2018) have achieved excellent results in reducing device traffic and computational complexity. For non-IID data, Federated Learning (McMahan et al., 2017; Zhao et al., 2018) based on synchronization has achieved excellent results. Nevertheless, few works currently consider adding asynchrony to the offline training of non-IID data.
We propose GSGM, a gradient scheduling algorithm with partly averaged gradients and global momentum for distributed asynchronous training on non-IID data. Our key idea is to apply global momentum and local averaging to the biased gradients after scheduling, so that the training process remains steady. This differs from other methods, which apply momentum on each learner and use the gradients contributed by learners directly. We implement the GSGM strategy in two popular optimization algorithms, and compare them with state-of-the-art asynchronous algorithms on non-IID data to evaluate the effectiveness of GSGM. Moreover, we measure the performance of GSGM under different distributed scales and different degrees of non-IID data.
This paper makes the following contributions:

We argue that, for non-IID data, global momentum should be used instead of applying momentum methods separately on each computing node. This is the cornerstone of our GSGM approach.

We propose a new gradient update method called partly averaged gradients used in gradient scheduling strategy based on a white list. On the one hand, partly averaged gradients make full use of the previous gradients of each computing node to balance the current biased direction. On the other hand, the scheduling method allows the gradients calculated by various computing nodes to be applied sequentially on the server side, so that the direction of model update can keep unbiased.

Different from the traditional way of applying momentum on the gradients directly, we apply the global momentum to the partly averaged gradients to further stabilize the training process.
This paper is organized as follows. In Section 2, we review the algorithms and architectures associated with distributed training. In Section 3, we explain our GSGM method in detail. In Section 4, we introduce specific implementations of GSGM on two popular algorithms. In Section 5, we show the evaluation methodology and report the experimental results, followed by discussions and conclusions.
2. Background
2.1. Distributed optimization algorithm
Offline training relies on optimization algorithms, and the problem can be summarized as:
\[ \min_{w \in \mathbb{R}^d} f(w) = \frac{1}{n} \sum_{i=1}^{n} f_i(w) \tag{1} \]
where $f_i(w)$ is the loss function on the $i$-th data point, and the gradients $\nabla f_i(w)$ can be computed efficiently using backpropagation (Rumelhart et al., 1988) on the local dataset.
Stochastic gradient descent (SGD) (Sinha and Griscik, 1971) is a commonly used optimization algorithm. For problem (1), SGD samples a random function $f_i$ (i.e., a random data-label pair), and then performs the update step:
\[ w_{t+1} = w_t - \eta \nabla f_i(w_t) \tag{2} \]
where $\eta$ is the learning rate (step size) parameter. It can be easily used in a distributed environment where the computing nodes calculate $\nabla f_i(w_t)$ on their local data and the server performs the update (2).
Stochastic variance reduced gradient (SVRG) (Johnson and Zhang, 2013) reduces the gradient variance caused by random sampling in SGD. SVRG executes two nested loops. In the outer loop, it computes the full gradient $\nabla f(\tilde{w})$ of the whole function $f$ at a snapshot $\tilde{w}$ of the parameters. In the inner loop, it performs the update step:
\[ w_{t+1} = w_t - \eta \left[ \nabla f_i(w_t) - \nabla f_i(\tilde{w}) + \nabla f(\tilde{w}) \right] \tag{3} \]
where $\eta$ is the learning rate. In the distributed setting, the computing nodes are required to synchronize once to obtain the unbiased full gradient $\nabla f(\tilde{w})$, and then perform iterations (3) in the inner loop in parallel, just like distributed SGD.
2.2. Momentum method
Momentum method (Polyak, 1964) is designed to speed up learning, especially in the face of high curvature, small but consistent gradients, or noisy gradients. It accumulates an exponentially decaying moving average of the previous gradients and continues to move in that direction. The hyperparameter $m$ determines how fast the contribution of previous gradients decays. Its update rule is:
\[ v_{t+1} = m v_t - \eta g_t, \qquad w_{t+1} = w_t + v_{t+1} \tag{4} \]
where $w$ represents the model parameters, $g$ is the gradient calculated by a specific algorithm like SGD or SVRG, $v$ and $m$ are the velocity and momentum respectively, and $\eta$ is the learning rate. Here, momentum can be regarded as the cumulative effect of previous gradients. When many successive gradients point in the same direction, the step size grows toward its maximum, which produces an acceleration effect.
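The acceleration effect described above can be checked numerically. The following is a minimal sketch, assuming the classical form of the momentum rule (velocity `v <- m*v - lr*g`, parameters `w <- w + v`); the function names and the constant gradient are illustrative assumptions, not part of the original method.

```python
def momentum_step(w, v, g, lr=0.1, m=0.9):
    """One step of the classical momentum rule (assumed form)."""
    v_new = m * v - lr * g   # decay the old velocity, add the new gradient
    return w + v_new, v_new


def run(grads, lr=0.1, m=0.9):
    """Apply momentum to a sequence of scalar gradients; return per-step moves."""
    w, v = 0.0, 0.0
    moves = []
    for g in grads:
        w_next, v = momentum_step(w, v, g, lr, m)
        moves.append(abs(w_next - w))
        w = w_next
    return moves


# Fifty identical gradients: every gradient points in the same direction.
moves = run([1.0] * 50)
print(moves[0], moves[-1])
```

With a constant gradient the size of each move grows from `lr` toward `lr / (1 - m)`, which is ten times larger for `m = 0.9`.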
2.3. Distributed training architecture
The data parallel offline training approach works in the context of edge computing because the data is partitioned across computational devices, and each device (called learner here) has a copy of the learning model. Each learner computes gradients on its data shard, and the gradients are combined to update the target model parameters. Different ways of combining gradients result in different training modes.
Synchronous training. The gradients calculated by all learners are averaged or just summed after each iteration. Hence, the faster learners have to wait for the results of the slower learners, which makes this approach less efficient. At every iteration, all learners push their gradients to server, and server applies them to update the global model, and returns the latest model parameters to all learners to continue their calculation for the next iteration.
Asynchronous training. Each learner can access a shared-memory space (called server here), where the global parameters are stored. Each learner calculates gradients on its local data shard, and then uploads them to the server to update the global parameters. It then obtains the updated parameters to continue its calculations. The advantage of asynchronous training is that each learner calculates at its own pace, with little or no waiting for other learners. Figure 1 shows how fully asynchronous training works. The server maintains a global clock, and only one learner can update the model and get the latest parameters at each clock tick. That is, each learner works independently.
The Stale Synchronous Parallel (SSP) (Cipar et al., 2013) model controls the update frequency of each learner, in contrast to completely asynchronous parallel training. It is a trade-off between synchronous and asynchronous training. SSP allows each learner to update the model in an asynchronous manner, but adds a limit (threshold) so that the difference between the progress of the fastest and the slowest learners is not too large. Figure 2 briefly illustrates how SSP works. Four learners work asynchronously, and learners 2 and 4 have just completed three iterations while learner 3 has completed five iterations. Since the limit threshold is one, learner 3 has to be blocked to wait for learners 2 and 4.
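The blocking rule in the Figure 2 example can be sketched as a single check. This is an illustrative sketch only: the function name `may_proceed`, the exact comparison (a learner proceeds while its lead over the slowest learner does not exceed the threshold), and the iteration count assumed for learner 1 are assumptions, since the text does not state them.

```python
def may_proceed(clocks, learner, threshold):
    """Return True if `learner` may start its next iteration under SSP.

    `clocks` maps each learner to the number of iterations it has completed.
    """
    slowest = min(clocks.values())
    return clocks[learner] - slowest <= threshold


# Learners 2 and 4 have finished 3 iterations, learner 3 has finished 5.
# The count for learner 1 is not given in the text and is assumed here.
clocks = {1: 4, 2: 3, 3: 5, 4: 3}

blocked = [k for k in clocks if not may_proceed(clocks, k, threshold=1)]
print(blocked)
```

Under these assumptions only learner 3 is blocked, and raising the threshold to 2 would let it continue.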
3. Proposed Method
3.1. Distributed optimization problem
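Problem (1) is the overall optimization goal of machine learning, while in a data-parallel distributed environment, problem (1) is decomposed into:
\[ f(w) = \sum_{k=1}^{K} \frac{n_k}{n} F_k(w), \qquad F_k(w) = \frac{1}{n_k} \sum_{i \in \mathcal{P}_k} f_i(w) \tag{5} \]
where $K$ is the number of learners, $\mathcal{P}_k$ is the set of indexes of data points on learner $k$, and $n_k = |\mathcal{P}_k|$. This means the loss function $F_k$ on each learner participates in the overall optimization goal, and it determines the convergence of the global model parameters. In asynchronous parallel training, the server directly uses each $\nabla F_k(w)$ as its own estimate of $\nabla f(w)$ instead of aggregating the gradients of all learners.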
3.2. Global momentum for nonIID data
Momentum methods play a crucial role in accelerating and stabilizing the training process of machine learning models. In a distributed scenario, there are two ways of applying momentum. One is to apply momentum separately on each computing node (we call this “local momentum” here), so that the central server receives the velocity after momentum acceleration, as in EAMSGD (Zhang et al., 2015), the deep gradient compression algorithm (Lin et al., 2017) and so on. The other is to apply momentum uniformly to the gradients of all computing nodes on the server side (we call it “global momentum” here). In terms of problem (5), for the IID data setting, the local objective $F_k(w)$ on each learner is an unbiased estimate of the global objective $f(w)$, that is, $\mathbb{E}[F_k(w)] = f(w)$. In this case, applying local momentum is more intuitive, and the server only needs to perform plain iterative updates. However, for the non-IID data setting, $F_k$ could be an arbitrarily bad approximation to $f$. The asynchronous nature of training then biases the parameter update after each iteration towards the direction of the gradients used for that iteration. Applying local momentum to each learner’s calculation makes this worse, because the convergence process is pulled toward the direction of each learner’s own gradients. Global momentum is therefore more suitable for non-IID data. In each iteration, what is used for updating is the velocity that accumulates all the previous gradients. For this reason, global momentum acts as a correction to the current biased gradients, so that the parameter update proceeds in the normal direction.
Assume that there are 2 learners, and respectively, and each learner performs calculations twice asynchronously. They contribute their gradients one by one which is a very likely situation in asynchronous parallelism. We use the standard momentum method shown by Equation (4) to illustrate the difference between local and global momentum. Figure 3 shows the gradients and updated model parameters changed by local and global momentum methods where represents cumulative velocity after learner performs the th calculation while represents global accumulated velocity after total ith calculation. The arrows represent the update order of the global parameters. When ’s first gradients arrives, the parameters applying local momentum are biased towards ’s direction while ’s direction remains unbiased relatively due to applying global momentum. It can be speculated that the difference between these two methods will become more and more obvious when the scale of distribution increases.
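The two-learner scenario can be reproduced with a small numerical sketch. It assumes the classical momentum form (`v <- m*v - lr*g`), two hypothetical learners `A` and `B` whose gradients are fixed and point along different axes (an extreme stand-in for non-IID data), and an alternating arrival order; none of these values come from Figure 3.

```python
def local_momentum_steps(arrivals, grads, lr=0.1, m=0.9):
    """Each learner keeps its own velocity; the server applies it as-is."""
    velocity = {k: [0.0, 0.0] for k in grads}
    steps = []
    for k in arrivals:
        velocity[k] = [m * v - lr * g for v, g in zip(velocity[k], grads[k])]
        steps.append(list(velocity[k]))
    return steps


def global_momentum_steps(arrivals, grads, lr=0.1, m=0.9):
    """The server keeps a single velocity shared by all learners."""
    velocity = [0.0, 0.0]
    steps = []
    for k in arrivals:
        velocity = [m * v - lr * g for v, g in zip(velocity, grads[k])]
        steps.append(list(velocity))
    return steps


# Illustrative gradients: learner A only "sees" axis 0, learner B only axis 1.
grads = {"A": [1.0, 0.0], "B": [0.0, 1.0]}
arrivals = ["A", "B", "A", "B"]

local = local_momentum_steps(arrivals, grads)
glob = global_momentum_steps(arrivals, grads)
print("update applied when B first arrives (local): ", local[1])
print("update applied when B first arrives (global):", glob[1])
```

In this sketch the local-momentum update that follows B's first gradient moves only along B's axis, whereas the global-momentum update still carries a component from A's earlier gradient.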
3.3. Gradient scheduling
Figure 3 shows that global momentum can keep training direction unbiased to a certain extent. However, if gradients of some learners are updated continuously under asynchronous conditions, the training direction will still be biased because global momentum would continuously weaken the contribution of the previous gradients. We should avoid the following situations.

One or several learners contribute gradients significantly faster than other learners. Then the global parameters will be updated continuously toward the direction of gradients contributed by the fast learners. In this way, the effect of the previous gradients is gradually weakened, which still leads to the gradually biased training direction.

We can force the fast learners to wait a little for the slow learners to keep the gap within a certain range, as the SSP does. However, within this range, the updates driven by each learner should also proceed in an orderly manner. Global momentum is sensitive to the gradient sequence submitted to the server. Therefore, when each learner submits gradients orderly, the velocity accumulated by global momentum will be closest to the unbiased estimation of the global optimization direction.
We use a gradient scheduling strategy based on a white list. The central idea of this strategy is that once the gradients submitted by a fast learner are used for updating, the learner is removed from the white list. Its gradients cannot be applied to updating again until the white list is empty and then restored to the initial setting. This white list prevents fast learners from continuously updating the global parameters, and guarantees the balance and orderliness of each learner’s update.
3.4. Partly averaged gradients
In addition, we propose partly averaged gradients as the quantity to which global momentum is applied. Specifically, when the white list is empty, which means that all the learners have performed an iteration and a round of scheduling is over, we calculate and save the average of the gradients used in each learner’s last update as the partly averaged gradients for the next round of scheduling. We apply global momentum to the partly averaged gradients in the following iterations, rather than directly accumulating previous gradients. In summary, our proposed GSGM algorithm is shown as Algorithms 1 and 2.
In Algorithm 1, learners calculate gradients on their local data shards according to the specific algorithms like SGD or SVRG, and then upload the calculated gradients to server, waiting for update. After that, learners get the latest model parameters and continue their calculations. Learners keep doing so until the end condition is satisfied. What learners do has no difference with other asynchronous methods.
In Algorithm 2, the server maintains a list of learners to be updated as a white list. When gradients calculated by a learner arrive, the server first checks whether this learner is in the white list. If so, the server applies its gradients to update the global parameters, and then returns the latest parameters to this learner, so that it can continue calculating the next gradients. This is the asynchronous nature of training. In the meantime, this learner is removed from the list. If the learner is not in the white list, the gradients are added to a wait queue and this learner is blocked. When the list is empty, the wait queue is traversed, applying the gradients in it in turn to update the global model. Then the list is restored to the initial setting. The server repeats the above process until all the learners complete their calculations.
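The white-list bookkeeping described above can be sketched as follows. This is a hedged illustration of the scheduling order only, read literally from the prose: the class name `WhiteListScheduler`, its methods, and the string placeholders used for gradients are assumptions, and the actual parameter update is replaced by recording which gradient would be applied.

```python
from collections import deque


class WhiteListScheduler:
    """Server-side ordering of updates with a white list and a wait queue."""

    def __init__(self, learners):
        self.all_learners = list(learners)
        self.white_list = set(self.all_learners)
        self.wait_queue = deque()
        self.applied = []  # order in which gradients reach the global model

    def submit(self, learner, grad):
        if learner in self.white_list:
            # Learner is allowed to update: apply and remove it from the list.
            self._apply(learner, grad)
            self.white_list.discard(learner)
        else:
            # Learner already updated in this round: queue it (it is blocked).
            self.wait_queue.append((learner, grad))
        if not self.white_list:
            # Round over: drain the wait queue in order, then reset the list.
            while self.wait_queue:
                self._apply(*self.wait_queue.popleft())
            self.white_list = set(self.all_learners)

    def _apply(self, learner, grad):
        # Stand-in for the real parameter update.
        self.applied.append((learner, grad))


scheduler = WhiteListScheduler(["A", "B", "C"])
# Fast learner A submits twice before B and C submit once.
for learner, grad in [("A", "a1"), ("A", "a2"), ("B", "b1"), ("C", "c1")]:
    scheduler.submit(learner, grad)
print(scheduler.applied)
```

In this run the second gradient from the fast learner `A` is held back until `B` and `C` have each contributed, so the applied order is `a1`, `b1`, `c1`, `a2`.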
In terms of the specific update iterations in the server, there are two kinds of gradients: biased gradients and partly averaged gradients.
The biased gradients are calculated by a learner based on its local data shard. They represent the biased direction of the skewed data.
The partly averaged gradients are obtained when the white list is empty. They are calculated from the gradients of all the learners used in their last updates. We use the partly averaged gradients as the velocity in the standard momentum method to retain and accumulate the correct direction, which “pulls” the update back from the biased direction toward the unbiased one (lines 18-20 and 9-11).
It is worth noting that partly averaged gradients are the average of the recent gradients of all the learners. They can be considered an approximately unbiased estimate of the overall update direction, similar to the average gradients of each iteration in synchronous training. Meanwhile, partly averaged gradients carry information about the recent updates of all the learners. This is why they complement the stale gradients to some extent.
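As a small illustration of the averaging step, the sketch below computes the element-wise mean of the last applied gradient of every learner. The function name `partly_averaged` and the two example gradients are hypothetical, and how the result enters the momentum update is deliberately left out because the text does not spell out the formula.

```python
def partly_averaged(last_grads):
    """Element-wise mean of the most recent applied gradient of each learner."""
    vectors = list(last_grads.values())
    return [sum(column) / len(vectors) for column in zip(*vectors)]


# Two learners whose gradients point along different axes (skewed data).
last_grads = {"A": [1.0, 0.0], "B": [0.0, 1.0]}
avg = partly_averaged(last_grads)
print(avg)
```

Each individual gradient is biased toward one axis, while their average points along the diagonal between them.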
To sum up, our gradient scheduling algorithm using partly averaged gradients with global momentum allows a certain amount of asynchrony to be introduced when training on non-IID data. Compared with distributed asynchronous optimization methods with local momentum, the correction of global momentum and the complement of partly averaged gradients make the training direction gradually stable and unbiased. Figure 4 illustrates the difference between our method and other methods.
4. Implementation
We implement our GSGM method in distributed asynchronous SGD and SVRG algorithms. For SGD, our prototype is Downpour SGD (Dean et al., 2012); for SVRG, our prototype is Asynchronous Stochastic Variance Reduction (referred to as ASVRG) (Reddi et al., 2015).
In asynchronous SGD, GSGM is applied to the server. Each learner calculates its gradients and submits them to the server, as in Algorithm 1. The server carries out Algorithm 2.
For asynchronous SVRG, we apply GSGM to the server for scheduling the parameter update process in the inner loop while learners calculate gradients as in ASVRG (Algorithms 4 and 3). It is important to note that there are two parallel processes in asynchronous SVRG. When we calculate the full gradient, learners calculate and accumulate gradients directly on their local data shards, and the server then averages them. This parallel process does not need GSGM because it requires only one communication between the learners and the server. This is an inevitable synchronous operation for learners. We use GSGM in SVRG’s inner loop, which can be asynchronous for learners.
5. Evaluation
5.1. Experimental Setup
We implement all the algorithms using the open source framework PyTorch (https://pytorch.org/). Note that our main goal is to control variables in order to illustrate the advantages of our approach over other methods, rather than to achieve state-of-the-art results on each selected dataset. Meanwhile, since edge devices are the main computational power in the context of our algorithm, GSGM mainly targets calculations on CPU, and the models we select are suitable for CPU computation. We use the following datasets in our experiments.
FashionMnist (https://github.com/zalandoresearch/fashion-mnist). FashionMnist (Xiao et al., 2017) consists of a training set of 60,000 examples and a test set of 10,000 examples. Each example is a 28x28 grayscale image, associated with a label from 10 classes.

CIFAR-10 and CIFAR-100 (https://www.cs.toronto.edu/~kriz/cifar.html). The CIFAR-10 dataset is a labeled dataset which consists of 60,000 32x32 colour images in 10 classes, with 6,000 images per class. There are 50,000 training images and 10,000 test images. The CIFAR-100 dataset is just like CIFAR-10, except that it has 100 classes containing 600 images each. It is much more difficult to train models on the CIFAR-100 dataset.
For non-IID data, we first arrange the training images in the order of their category labels, and then distribute them equally to the learners in that order, similar to what has been done for the FedAvg algorithm (McMahan et al., 2017). Thus, each learner contains only one or a few categories of images. This is a pathological non-IID partition of the data, which makes training on it particularly difficult. We use the following models during training.
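The label-sorted partition described above can be sketched in a few lines. This is a minimal illustration, assuming the labels are available as a plain list and that any remainder left by an uneven split is dropped; the function name `partition_by_label` and the synthetic labels are not from the original experiments.

```python
def partition_by_label(labels, num_learners):
    """Sort sample indices by label, then cut them into equal contiguous shards."""
    order = sorted(range(len(labels)), key=lambda i: labels[i])
    shard_size = len(order) // num_learners  # remainder, if any, is dropped
    return [order[k * shard_size:(k + 1) * shard_size]
            for k in range(num_learners)]


# Synthetic dataset: 1,000 samples, 10 balanced classes.
labels = [i % 10 for i in range(1000)]

shards = partition_by_label(labels, num_learners=10)
classes_per_learner = [len({labels[i] for i in shard}) for shard in shards]
print(classes_per_learner)
```

With 10 balanced classes and 10 learners, every learner ends up holding a single class, which is the most extreme form of the skew described in the text.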

MnistNet on FashionMnist (MnistNet). MnistNet is a convolutional neural network with two convolutional layers and two fully connected layers from the PyTorch Tutorials (https://github.com/pytorch/examples). There are 21k parameters in MnistNet.

ConvNet on CIFAR-100 (ConvNet). For the CIFAR-100 dataset, we build a larger convolutional neural network with three convolutional layers and two fully connected layers (called ConvNet here), which is modified from an open source project (https://github.com/simoninithomas/cifar10classifierpytorch). ConvNet contains about 586k parameters.
We use mini-batches of size 100, and other important experimental settings used in training are shown in Table 1. Note that the decay and decay factor columns indicate that the learning rate is multiplied by the decay factor at the listed epoch(s). In particular, for the SVRG algorithm, the decay value refers to the index of the outer loop. For SGD-based algorithms, we train for 100 epochs on each dataset, whereas 20 outer loops with 5 inner loops each (100 epochs in total) are performed for SVRG-based algorithms. For the sake of brevity in our experiments, we simply refer to local momentum as “LM”. When the momentum method is not applied in an algorithm, we increase the initial learning rate to 10 times the original setting for fairness.

Algorithm  Model     Learning rate  Decay (epoch)  Decay factor  Momentum
SGD        MnistNet  5e-4           75             0.5           0.9
SGD        LeNet     1e-3           50, 80         0.5           0.9
SVRG       ConvNet   0.0025         12             0.5           0.9
In this section, the following distributed asynchronous algorithms are compared:

Downpour SGD (Dean et al., 2012) (DSGD), our prototype when applying GSGM in asynchronous SGD.

Asynchronous Stochastic Variance Reduction (Reddi et al., 2015) (ASVRG), our prototype when applying GSGM in asynchronous SVRG.

Distrvrsgd (Zhang et al., 2016) (DVRG), the state-of-the-art implementation of a distributed asynchronous variance-reduced method. The picked computing nodes are all the learners in the system. We append the value of the limit threshold (decay parameter) to the name, e.g. DVRG1 when the threshold is set to 1.
5.2. GSGMSGD experiments
Model accuracy. We use two different models on the FashionMnist and CIFAR-10 datasets to evaluate the performance of GSGM in distributed asynchronous SGD algorithms, as shown in panels (a), (b) and (c) of the two corresponding figures. We compare GSGM with asynchronous algorithms including DSGD, DSGD with local momentum (DSGDLM), PSGD with a threshold of 1, without and with local momentum (PSGD1, PSGD1LM), and PSGD with a threshold of 2 (PSGD2, PSGD2LM). Note that we do not compare with PSGD at a threshold of 0 because PSGD0 is effectively a synchronous algorithm. The figures show that GSGM achieves a slightly higher classification accuracy than the other asynchronous algorithms on the test datasets, and also reaches an acceptable accuracy faster. First of all, fully asynchronous DSGD (DSGD and DSGDLM) cannot converge, while GSGM converges smoothly. Secondly, the detailed comparison of GSGM with the other algorithms is given in Table 2. The values in the table are the differences between the peak model accuracy reached during training by GSGM and by each of the other asynchronous algorithms (we measure the accuracy on the server at the end of each epoch). Under different distributed scales, GSGM improves model accuracy for non-IID data.
Model  PSGD1  PSGD1LM  PSGD2  PSGD2LM  
MnistNet  10  +0.85%  
20  +0.64%  
30  +0.35%  
LeNet  10  +1.34%  
20  +2.23%  
30  +2.64% 
Training stability. Stability is especially important in offline training. When deciding whether to terminate training, a stable training process supports an accurate judgment, while a strongly oscillating curve makes it hard to tell whether the current model has reached the target. We intuitively measure training stability as the standard deviation of the model accuracy values. Therefore, in panels (d), (e) and (f) of the corresponding figures, the vertical axis represents the standard deviation of model accuracy on a logarithmic scale. The smaller the value, the smaller the variance of the training process, i.e., the smaller the training oscillation. These figures show that across datasets, models and distributed scales, the variance of the convergence process generated by GSGM is the smallest, which means the oscillation caused by biased directions during asynchronous training on non-IID data is effectively suppressed. More specifically, Table 3 quantifies the stability improvement of GSGM for non-IID data training. Compared with the other algorithms, GSGM achieves a significant improvement in training stability.
Model  PSGD1  PSGD1LM  PSGD2  PSGD2LM

MnistNet  10  20.42%  
20  11.46%  
30  17.06%  
LeNet  10  8.62%  
20  15.33%  
30  8.45% 
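The stability measure used here, the standard deviation of the accuracy values, can be illustrated with a short sketch. The use of the population standard deviation over the whole curve is an assumption (the text does not say whether a sample estimate or a window is used), and both accuracy curves are made-up numbers, not experimental results.

```python
import statistics


def training_stability(accuracies):
    """Standard deviation of per-epoch accuracy; smaller means more stable."""
    return statistics.pstdev(accuracies)


# Hypothetical per-epoch test accuracies for two training runs.
smooth_run = [0.80, 0.81, 0.82, 0.81]
oscillating_run = [0.70, 0.85, 0.65, 0.84]

print(training_stability(smooth_run), training_stability(oscillating_run))
```

A curve that oscillates strongly yields a larger value than a smooth one, which is why lower values in the stability panels indicate a steadier training process.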
5.3. GSGMSVRG experiments
We evaluate the performance of GSGM based on the SVRG algorithm on the CIFAR-100 dataset, which has many more categories. According to our method of generating non-IID data, as the number of learners increases, the data on each learner becomes more and more sparse and skewed. At the largest distributed scale in particular, there are merely about 3 to 4 categories of data on each learner, compared to the 100 categories of the entire dataset.
Figures 7 and 9 show the accuracy and stability of training on the CIFAR-100 dataset. With 10 and 20 learners, GSGM is still better than the other algorithms in training stability (panels (a) and (b) of the stability figure), and achieves faster convergence and slightly higher accuracy (panels (a) and (b) of the accuracy figure). At the largest scale, only GSGM achieves a smooth training process and a usable model, while the other algorithms fail to converge normally (panel (c)). Quantitatively, the improvements are listed in Table 4, where GSGM is compared only against the algorithms that eventually converge. For this kind of sparse non-IID data, the improvement in training stability brought by GSGM over other asynchronous algorithms is even more pronounced.
DVRG1  DVRG1LM  DVRG2  DVRG2LM  

Acc  10  +0.94%  
20    +0.47%  
Stability  10  +37.51%  
20    +30.72% 
5.4. Partial nonIID data experiments
We also evaluate the robustness of GSGM in the case of non-extreme non-IID data distributions, i.e. different degrees of non-IID data. To generate data with a given degree of non-IID skew, the entire dataset is thoroughly shuffled, the IID portion of the data is first assigned to each learner, and after that the remaining part is distributed to the learners by category label. We select ConvNet as our model and use SGD-based algorithms at one distributed scale, and the related hyperparameters are the same as above.
Figure 8 shows GSGM’s performance on different levels of non-IID data. We test three degrees of non-IID data distribution (Figures 8(a) and 8(d), Figures 8(b) and 8(e), and Figures 8(c) and 8(f), respectively). For model accuracy, GSGM is always slightly higher than the other algorithms used for comparison, and for training stability, GSGM’s accuracy standard deviation is always the lowest. Therefore, GSGM keeps the offline training process stable and effective in different situations. That is, GSGM is robust to multiple data distributions.
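The mixing procedure can be sketched as below. It is a minimal illustration under stated assumptions: `noniid_fraction` is a hypothetical parameter giving the share of data distributed by label, the IID share is dealt round-robin after shuffling, and any remainder of the label-sorted share is dropped; the specific degrees used in the experiments are not reproduced here.

```python
import random


def partial_noniid_partition(labels, num_learners, noniid_fraction, seed=0):
    """Split sample indices so that only `noniid_fraction` of them is label-skewed."""
    indices = list(range(len(labels)))
    random.Random(seed).shuffle(indices)

    # First part stays shuffled (IID share), second part is sorted by label.
    n_iid = int(round(len(indices) * (1.0 - noniid_fraction)))
    iid_part, skewed_part = indices[:n_iid], indices[n_iid:]
    skewed_part.sort(key=lambda i: labels[i])

    # Deal the IID share round-robin, then append one contiguous skewed shard.
    shards = [iid_part[k::num_learners] for k in range(num_learners)]
    size = len(skewed_part) // num_learners  # remainder, if any, is dropped
    for k in range(num_learners):
        shards[k].extend(skewed_part[k * size:(k + 1) * size])
    return shards


labels = [i % 10 for i in range(1000)]
half = partial_noniid_partition(labels, num_learners=10, noniid_fraction=0.5)
full = partial_noniid_partition(labels, num_learners=10, noniid_fraction=1.0)
print([len(s) for s in half])
```

Setting `noniid_fraction` to 1.0 recovers the fully label-sorted partition, while smaller values mix a shuffled share into every learner's shard.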
6. Related Work
Under the IID data setting, the state-of-the-art implementations of asynchronous SGD and SVRG are Downpour SGD (Dean et al., 2012) and Petuum SGD (Ho et al., 2013; Dai et al., 2013), and Asynchronous Stochastic Variance Reduction (Reddi et al., 2015) and Distrvrsgd (Zhang et al., 2016), respectively. There is no barrier or blocking between learners in Downpour SGD and Asynchronous Stochastic Variance Reduction, while Petuum SGD and Distrvrsgd restrict the update frequency between learners. Based on these ideas, more recent algorithms such as the learning rate scheduling algorithm (Dutta et al., 2018) and DC-ASGD (Zheng et al., 2017) are dedicated to alleviating the problems caused by inconsistencies in asynchronous systems. However, these works still assume IID data.
Regarding non-IID data, the FSVRG algorithm (Konecný et al., 2016, 2015) is an improvement on the synchronous version of the native SVRG algorithm. In large-scale distributed computing scenarios where the data volume is unbalanced and the data label distribution is inconsistent, the convergence efficiency of the FSVRG algorithm is as good as that of the GD algorithm. The FedAvg algorithm (McMahan et al., 2017) modifies the distributed synchronous SGD algorithm. In the context of federated learning, it reduces the number of communication rounds while preserving the accuracy of the model. Both of these distributed algorithms still need synchronous operations at the end of each training epoch.
Our work differs from the above in that we focus on the adverse effects of asynchrony in distributed systems on non-IID data training. GSGM introduces asynchrony into offline training on non-IID data.
7. Conclusion
In scenarios where the cloud and edge devices are combined, offline distributed training often has to face non-IID data. Existing asynchronous algorithms struggle to perform well on non-IID data because of their asynchronous nature. This paper proposes a gradient scheduling strategy (GSGM) which, for non-IID data training, applies global momentum to partly averaged gradients instead of using momentum directly on each computing node. The core idea of GSGM is to schedule the gradients contributed by computing nodes in an orderly manner, so that the update direction of the global model stays unbiased. Meanwhile, GSGM makes full use of previous gradients to steady the training process through partly averaged gradients and globally applied momentum. Experiments show that in asynchronous SGD and SVRG, for both densely and sparsely distributed non-IID data, GSGM significantly improves training stability while slightly increasing model accuracy. In addition, GSGM is robust to different degrees of non-IID data.
References
 Alistarh et al. (2016) Dan Alistarh, Jerry Li, Ryota Tomioka, and Milan Vojnovic. 2016. QSGD: Randomized Quantization for Communication-Optimal Stochastic Gradient Descent. CoRR abs/1610.02132 (2016).
 Bahdanau et al. (2014) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural Machine Translation by Jointly Learning to Align and Translate. CoRR abs/1409.0473 (2014).
 Cipar et al. (2013) James Cipar, Qirong Ho, Jin Kyu Kim, Seunghak Lee, Gregory R. Ganger, Garth Gibson, Kimberly Keeton, and Eric P. Xing. 2013. Solving the Straggler Problem with Bounded Staleness. In Proceedings of the 14th Workshop on Hot Topics in Operating Systems (HotOS XIV), Santa Ana Pueblo, New Mexico, USA.
 Courbariaux and Bengio (2016) Matthieu Courbariaux and Yoshua Bengio. 2016. BinaryNet: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1. CoRR abs/1602.02830 (2016).
 Dai et al. (2013) Wei Dai, Jinliang Wei, Xun Zheng, Jin Kyu Kim, Seunghak Lee, Junming Yin, Qirong Ho, and Eric P. Xing. 2013. Petuum: A Framework for Iterative-Convergent Distributed ML. CoRR abs/1312.7651 (2013).
 Dean et al. (2012) Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Quoc V. Le, Mark Z. Mao, Marc’Aurelio Ranzato, Andrew W. Senior, Paul A. Tucker, Ke Yang, and Andrew Y. Ng. 2012. Large Scale Distributed Deep Networks. In Proceedings of the 26th Annual Conference on Neural Information Processing Systems (NIPS 2012), Lake Tahoe, Nevada, United States. 1232–1240.

 Dutta et al. (2018) Sanghamitra Dutta, Gauri Joshi, Soumyadip Ghosh, Parijat Dube, and Priya Nagpurkar. 2018. Slow and Stale Gradients Can Win the Race: Error-Runtime Trade-offs in Distributed SGD. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS 2018), Playa Blanca, Lanzarote, Canary Islands, Spain. 803–812.
 Ho et al. (2013) Qirong Ho, James Cipar, Henggang Cui, Seunghak Lee, Jin Kyu Kim, Phillip B. Gibbons, Garth A. Gibson, Gregory R. Ganger, and Eric P. Xing. 2013. More Effective Distributed ML via a Stale Synchronous Parallel Parameter Server. In Proceedings of the 27th Annual Conference on Neural Information Processing Systems (NIPS 2013), Lake Tahoe, Nevada, United States. 1223–1231.
 Jeong et al. (2018) Eunjeong Jeong, Seungeun Oh, Hyesung Kim, Jihong Park, Mehdi Bennis, and Seong-Lyun Kim. 2018. Communication-Efficient On-Device Machine Learning: Federated Distillation and Augmentation under Non-IID Private Data. CoRR abs/1811.11479 (2018).
 Johnson and Zhang (2013) Rie Johnson and Tong Zhang. 2013. Accelerating Stochastic Gradient Descent using Predictive Variance Reduction. In Proceedings of the 27th Annual Conference on Neural Information Processing Systems (NIPS 2013), Lake Tahoe, Nevada, United States. 315–323.
 Konecný et al. (2015) Jakub Konecný, Brendan McMahan, and Daniel Ramage. 2015. Federated Optimization: Distributed Optimization Beyond the Datacenter. CoRR abs/1511.03575 (2015).
 Konecný et al. (2016) Jakub Konecný, H. Brendan McMahan, Daniel Ramage, and Peter Richtárik. 2016. Federated Optimization: Distributed Machine Learning for On-Device Intelligence. CoRR abs/1610.02527 (2016).
 LeCun et al. (1998) Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. 1998. Gradient-based learning applied to document recognition. Proc. IEEE 86, 11 (1998), 2278–2324.
 Li et al. (2014) Mu Li, David G. Andersen, Jun Woo Park, Alexander J. Smola, Amr Ahmed, Vanja Josifovski, James Long, Eugene J. Shekita, and Bor-Yiing Su. 2014. Scaling Distributed Machine Learning with the Parameter Server. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 2014). 583–598.
 Lin et al. (2017) Yujun Lin, Song Han, Huizi Mao, Yu Wang, and William J. Dally. 2017. Deep Gradient Compression: Reducing the Communication Bandwidth for Distributed Training. CoRR abs/1712.01887 (2017).
 McGraw et al. (2016) Ian McGraw, Rohit Prabhavalkar, Raziel Alvarez, Montse Gonzalez Arenas, Kanishka Rao, David Rybach, Ouais Alsharif, Hasim Sak, Alexander Gruenstein, Françoise Beaufays, and Carolina Parada. 2016. Personalized speech recognition on mobile devices. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2016), Shanghai, China. 5955–5959.
 McMahan et al. (2017) Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Agüera y Arcas. 2017. Communication-Efficient Learning of Deep Networks from Decentralized Data. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS 2017), Fort Lauderdale, FL, USA. 1273–1282.
 Polyak (1964) B. T. Polyak. 1964. Some Methods of Speeding Up the Convergence of Iteration Methods. USSR Computational Mathematics and Mathematical Physics 4, 5 (1964), 1–17.
 Reddi et al. (2015) Sashank J. Reddi, Ahmed Hefny, Suvrit Sra, Barnabás Póczos, and Alexander J. Smola. 2015. On Variance Reduction in Stochastic Gradient Descent and its Asynchronous Variants. In Proceedings of the Annual Conference on Neural Information Processing Systems (NIPS 2015), Montreal, Quebec, Canada. 2647–2655.
 Rumelhart et al. (1988) David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. 1988. Learning Representations by Back-propagating Errors. Nature 323, 6088 (1988), 696–699.
 Seide et al. (2014) Frank Seide, Hao Fu, Jasha Droppo, Gang Li, and Dong Yu. 2014. 1-bit Stochastic Gradient Descent and its Application to Data-parallel Distributed Training of Speech DNNs. In Proceedings of the 15th Annual Conference of the International Speech Communication Association (INTERSPEECH 2014), Singapore. 1058–1062.
 Sercu et al. (2016) Tom Sercu, Christian Puhrsch, Brian Kingsbury, and Yann LeCun. 2016. Very Deep Multilingual Convolutional Neural Networks for LVCSR. In Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2016), Shanghai, China. 4955–4959.
 Sinha and Griscik (1971) Naresh K. Sinha and Michael P. Griscik. 1971. A Stochastic Approximation Method. IEEE Trans. Systems, Man, and Cybernetics 1, 4 (1971), 338–344.

 Szegedy et al. (2017) Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A. Alemi. 2017. Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI 2017), San Francisco, California, USA. 4278–4284.
 Xiao et al. (2017) Han Xiao, Kashif Rasul, and Roland Vollgraf. 2017. Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms. CoRR abs/1708.07747 (2017).
 Zhang et al. (2016) Ruiliang Zhang, Shuai Zheng, and James T. Kwok. 2016. Asynchronous Distributed Semi-Stochastic Gradient Optimization. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence (AAAI 2016), Phoenix, Arizona, USA. 2323–2329.
 Zhang et al. (2015) Sixin Zhang, Anna Choromanska, and Yann LeCun. 2015. Deep learning with Elastic Averaging SGD. In Proceedings of the Annual Conference on Neural Information Processing Systems (NIPS 2015), Montreal, Quebec, Canada. 685–693.
 Zhang et al. (2017) Yundong Zhang, Naveen Suda, Liangzhen Lai, and Vikas Chandra. 2017. Hello Edge: Keyword Spotting on Microcontrollers. CoRR abs/1711.07128 (2017).
 Zhao et al. (2018) Yue Zhao, Meng Li, Liangzhen Lai, Naveen Suda, Damon Civin, and Vikas Chandra. 2018. Federated Learning with Non-IID Data. CoRR abs/1806.00582 (2018).
 Zheng et al. (2017) Shuxin Zheng, Qi Meng, Taifeng Wang, Wei Chen, Nenghai Yu, Zhiming Ma, and Tie-Yan Liu. 2017. Asynchronous Stochastic Gradient Descent with Delay Compensation. In Proceedings of the 34th International Conference on Machine Learning (ICML 2017), Sydney, NSW, Australia. 4120–4129.