Owing to its wide application in cluster-based large-scale learning, federated learning (Konečný et al., 2016; Kairouz et al., 2019), edge computing (Shi et al., 2016), and related settings, distributed learning has recently become a hot research topic (Zinkevich et al., 2010; Yang, 2013; Jaggi et al., 2014; Shamir et al., 2014; Zhang and Kwok, 2014; Ma et al., 2015; Lee et al., 2017; Lian et al., 2017; Zhao et al., 2017; Sun et al., 2018; Wangni et al., 2018; Zhao et al., 2018; Zhou et al., 2018; Yu et al., 2019a, b). Most existing distributed learning methods are based on stochastic gradient descent (SGD) and its variants (Bottou, 2010; Xiao, 2010; Duchi et al., 2011; Johnson and Zhang, 2013; Shalev-Shwartz and Zhang, 2013; Zhang et al., 2013; Lin et al., 2014; Schmidt et al., 2017; Zhao et al., 2018; Yu et al., 2019b). Furthermore, most existing distributed learning methods assume that no error or attack occurs on the workers.
However, in real distributed learning applications with multiple networked machines (nodes), different kinds of hardware or software errors may occur. Representative errors include bit-flipping in the communication media and in the memory of some workers (Xie et al., 2019b). In this case, a small error on some machines (workers) might cause a distributed learning method to fail. In addition, malicious attacks should not be neglected in an open network where the manager (or server) generally has little control over the workers, as in edge computing and federated learning. Some malicious workers may behave arbitrarily or even adversarially. Hence, Byzantine learning (BL), which refers to distributed learning with attack or error, has recently attracted much attention (Blanchard et al., 2017; Alistarh et al., 2018; Damaskinos et al., 2018; Xie et al., 2019b).
Existing BL methods can be divided into two main categories: synchronous BL (SBL) methods and asynchronous BL (ABL) methods. In SBL methods, the learning information of all workers, such as the gradients in SGD, is aggregated synchronously. In contrast, in ABL methods the learning information of workers is aggregated asynchronously. Existing SBL methods mainly take two different approaches to achieve resilience against Byzantine workers, i.e., workers subject to attack or error. One approach is to replace the simple averaging aggregation with a more robust aggregation operation, such as median, trimmed-mean (Yin et al., 2018), or Krum (Blanchard et al., 2017). The other approach is to filter out suspicious learning information (gradients) before averaging. Representative examples include ByzantineSGD (Alistarh et al., 2018) and Zeno (Xie et al., 2019b). The advantage of SBL methods is that they are relatively simple and easy to implement. However, SBL methods converge slowly when workers are heterogeneous, since each synchronous step must wait for the slowest worker. Furthermore, in some applications like federated learning and edge computing, synchronization often cannot be performed at all because some workers (clients or edge servers) are offline. Hence, ABL is more general and practical than SBL.
To the best of our knowledge, there exist only two ABL methods: Kardam (Damaskinos et al., 2018) and Zeno++ (Xie et al., 2019a). Kardam introduces two filters to drop suspicious learning information (gradients), and it can still achieve good performance under heavy communication delay. However, Xie et al. (2019a) find that, under malicious attack, Kardam also drops most of the correct gradients in order to filter out all faulty (error) gradients. Hence, Kardam cannot resist malicious attack. Zeno++ scores each received gradient and decides whether to accept it according to the score. But Zeno++ needs to store some training instances on the server for scoring. In practical applications, storing data on the server increases the risk of privacy leakage and may even incur legal risk. Therefore, under the general setting where the server has no access to any training instances, no existing ABL method can resist malicious attack.
In this paper, we propose a novel method, called buffered asynchronous stochastic gradient descent (BASGD), for BL. The main contributions of BASGD are listed as follows:
- BASGD is an asynchronous method, and hence is more general and practical than existing SBL methods.
- BASGD does not need to store any training instances on the server, and hence can preserve privacy in ABL.
- BASGD is theoretically proved to be able to resist error and malicious attack.
- BASGD has a theoretical convergence rate similar to that of vanilla asynchronous SGD (ASGD), with an extra constant variance.
- Empirical results show that BASGD significantly outperforms vanilla ASGD and other ABL baselines when there are errors or malicious attacks on workers. In particular, BASGD still converges under malicious attacks in which ASGD and other ABL methods fail.
This section presents the preliminaries of this paper, including the distributed learning framework used in this paper and the definition of Byzantine workers.
2.1 Distributed Learning Framework
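We consider the following finite-sum optimization problem:
$$\min_{\mathbf{w}\in\mathbb{R}^d} F(\mathbf{w}) = \frac{1}{n}\sum_{i=1}^{n} f(\mathbf{w}; z_i), \qquad (1)$$
where $\mathbf{w}$ is the parameter to learn, $d$ is the dimension of the parameter, $n$ is the number of training instances, and $f(\mathbf{w}; z_i)$ is the empirical loss on the training instance $z_i$. The goal of this work is to solve the optimization problem in (1) by designing distributed learning algorithms on multiple networked machines.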
Although many distributed learning frameworks have been proposed, in this paper we focus on the widely used Parameter Server (PS) framework (Li et al., 2014), which is shown in Figure 1. In a PS framework, there are several workers and one or more servers. Each worker can only communicate with the server(s). There may exist more than one server in a PS framework, but for the problem studied in this paper the servers can be logically regarded as a single unit. Without loss of generality, we assume there is only one server in this paper. Training instances are disjointly distributed across $m$ workers. Let $\mathcal{D}_k$ denote the index set of training instances on worker_k; we have $\cup_{k=1}^{m}\mathcal{D}_k = \{1, 2, \ldots, n\}$ and $\mathcal{D}_k \cap \mathcal{D}_{k'} = \emptyset$ if $k \neq k'$. In this paper, we assume that the server has no access to any training instances.
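The disjoint distribution of training instances across workers can be illustrated with a minimal sketch. The function name `partition_indices`, the 0-based indexing, and the random shuffling are assumptions made here for illustration and are not taken from the paper.

```python
import random

def partition_indices(n, m, seed=0):
    """Randomly split instance indices 0..n-1 into m disjoint index sets."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    # Worker k takes every m-th shuffled index, starting at position k.
    return [sorted(indices[k::m]) for k in range(m)]

# Hypothetical sizes: 10 training instances spread over 3 workers.
parts = partition_indices(n=10, m=3)
union = sorted(i for part in parts for i in part)
```

By construction, every index belongs to exactly one worker, so the union of the index sets covers all instances and the sets are pairwise disjoint.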
One popular asynchronous method to solve the problem in (1) under the PS framework is ASGD, which is presented in Algorithm 1. Here, each worker samples one instance for gradient computation each time. Each worker could instead sample a mini-batch of instances for gradient computation each time. The effect of batch size is not the focus of this work, and the analysis of this paper can be easily adapted to the mini-batch case. Hence, in this paper we do not separately discuss the mini-batch case.
In PS-based ASGD, the server is responsible for updating and maintaining the latest parameter. We use the number of iterations that the server has already executed as the current iteration number of the server. At the very beginning, the iteration number $t = 0$. Each time an SGD step is executed, $t$ is increased by $1$ immediately. The iteration number $t$ can also be seen as the version of the parameter, and we denote the parameter after $t$ iterations as $\mathbf{w}^t$.
The server may have executed several SGD steps between the time when a worker receives parameter $\mathbf{w}^{t'}$ and the time when the worker sends back the gradient computed based on $\mathbf{w}^{t'}$. We use $t$ to denote the iteration number on the server when receiving the gradient computed based on $\mathbf{w}^{t'}$. The delay is defined as $\tau = t - t'$.
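The notions of iteration number, parameter version, and delay can be illustrated with a small single-process sketch of the server side of ASGD. The class `ASGDServer`, its methods `pull` and `push`, and the toy quadratic loss are hypothetical names introduced here; this is not Algorithm 1 itself, only a simulation under those assumptions.

```python
class ASGDServer:
    """Toy parameter server: applies every received gradient immediately."""

    def __init__(self, w0, lr):
        self.w = list(w0)
        self.lr = lr
        self.t = 0  # number of SGD steps executed so far (parameter version)

    def pull(self):
        # A worker receives the current parameter together with its version.
        return list(self.w), self.t

    def push(self, grad, version):
        # Delay: server iteration at receipt minus the version the gradient used.
        delay = self.t - version
        self.w = [wi - self.lr * gi for wi, gi in zip(self.w, grad)]
        self.t += 1
        return delay

def grad_quadratic(w):
    # Gradient of the toy loss 0.5 * ||w||^2 (assumed for illustration).
    return list(w)

server = ASGDServer([1.0, -2.0], lr=0.1)
w_a, v_a = server.pull()  # worker A pulls version 0
w_b, v_b = server.pull()  # worker B pulls version 0
delay_b = server.push(grad_quadratic(w_b), v_b)  # B returns first
delay_a = server.push(grad_quadratic(w_a), v_a)  # A returns one step later
```

In this run, worker B's gradient arrives with delay 0, while worker A's gradient, computed on the same stale parameter, arrives with delay 1.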
2.2 Byzantine Worker
Between two iterations on the server, a worker may send nothing, send a gradient only once, or send gradients more than once. Although the last case is impossible in ASGD (see Algorithm 1), it may happen in BASGD (see Section 3).
For workers that have sent gradients to the server at iteration $t$, we call worker_k a loyal worker if it has finished all its tasks without any fault and each sent gradient is correctly received by the server with delay $\tau \le \tau_{\max}$, where $\tau_{\max}$ is a constant. Otherwise, worker_k is called a Byzantine worker. If worker_k is a Byzantine worker, the gradient received from worker_k is not credible and may take an arbitrary value.
We use $\mathcal{I}^t$ to denote the index set of workers from which the server has received gradients at iteration $t$, namely, between the $t$-th SGD step and the $(t+1)$-th SGD step. For all $k \in \mathcal{I}^t$, we denote the gradient received from worker_k at iteration $t$ as $\mathbf{g}_k^t$.
Please note that a worker may not be always loyal or always Byzantine. For example, a loyal worker at iteration $t$ may suffer from a bit-flipping at iteration $t+1$, so it will be identified as a Byzantine worker at iteration $t+1$. Also, a malicious worker may sometimes behave like a loyal one to hide itself, and will be seen as loyal at those normally working iterations.
Furthermore, we define the index set of loyal workers at iteration $t$ as follows:
$$\mathcal{L}^t = \{k \in \mathcal{I}^t : \text{worker}_k \text{ is loyal at iteration } t\}.$$
Thus, worker_k is Byzantine at iteration $t$ if $k \in \mathcal{I}^t \setminus \mathcal{L}^t$. Then, we have:
$$\mathbf{g}_k^t = \nabla f(\mathbf{w}^{t'}; z_i), \quad \forall k \in \mathcal{L}^t,$$
where $0 \le t - t' \le \tau_{\max}$, and $i$ is randomly sampled from $\mathcal{D}_k$.
Our definition of Byzantine worker and loyal worker is consistent with most previous works (Blanchard et al., 2017; Xie et al., 2019b) under the setting of synchronous Byzantine learning, which corresponds to the case $\tau_{\max} = 0$. But our definition is more general since it includes the cases with time delays, i.e., $\tau_{\max} > 0$, which cannot be neglected in an asynchronous method. In particular, there are mainly two types of Byzantine workers in ABL:
Workers with malicious attack: Workers of this type are controlled or hacked by an adversarial party. They may deliberately send wrong or malicious gradients to the server in an attempt to make the learning method fail. Such workers can appear in applications with open networks, such as edge computing and federated learning, where the manager (or server) generally has little control over the workers.
Workers with accidental error: Although not necessarily malicious, workers of this type may go wrong during the learning process due to accidental errors such as bit-flipping and network failure. In such cases, the gradient received by the server might be too stale or wrongly transmitted. Although unintentional, stale or faulty (error) gradients will slow down convergence or even cause learning methods to fail.
3 Buffered Asynchronous SGD
In synchronous BL, all gradients are received at the same time for updating parameters. During this process, we can compare the gradients with each other, and then filter suspicious ones, or use more robust aggregation rules such as median and trimmed-mean for updating. However, in asynchronous BL, only one gradient is received by the server at a time. Without any training instances stored on the server, it is difficult for the server to identify whether a received gradient is credible or not.
In order to deal with this problem in asynchronous BL, we propose a novel ABL method called buffered asynchronous SGD (BASGD). BASGD introduces $B$ buffers ($0 < B \le m$) on the server, and the gradient used for updating parameters is aggregated from these buffers. The learning procedure of BASGD is presented in Algorithm 2. We can find that BASGD degenerates to vanilla ASGD when the buffer number $B = 1$.
In the following content of this section, we will introduce the details of the two key components of BASGD: buffer and aggregation function.
3.1 Buffer
In BASGD, the workers do the same job as in ASGD, while the updating rule on the server is modified. More specifically, there are $B$ buffers ($0 < B \le m$) on the server. When the server receives a gradient $\mathbf{g}$ from worker_s, the parameter will not be updated immediately. The gradient will be stored in buffer $b$ temporarily, where $b = s \bmod B$. A concrete example is illustrated in Figure 2. Only when all buffers have been updated since the last SGD step will a new SGD step be executed.
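For each buffer $b$, more than one gradient may have been received between two iterations. We store the average of these gradients, denoted by $\mathbf{h}_b$, in buffer $b$. Assume that there are already $(N-1)$ gradients $\mathbf{g}_1, \ldots, \mathbf{g}_{N-1}$ which should be stored in buffer $b$, and
$$\mathbf{h}_b^{(old)} = \frac{1}{N-1}\sum_{i=1}^{N-1}\mathbf{g}_i.$$
When the $N$-th gradient $\mathbf{g}_N$ is received, the new average value in buffer $b$ should be:
$$\mathbf{h}_b^{(new)} = \frac{1}{N}\sum_{i=1}^{N}\mathbf{g}_i = \frac{N-1}{N}\,\mathbf{h}_b^{(old)} + \frac{1}{N}\,\mathbf{g}_N.$$
This is the updating rule for each buffer when a gradient is received. After the parameter is updated, all buffers are zeroed out at once.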
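The running-average rule for a single buffer can be sketched as follows. The class name `Buffer` and its methods are hypothetical and chosen here only for illustration; the sketch stores just the current average and a counter, so no individual gradient needs to be kept.

```python
class Buffer:
    """Keeps the running average of the gradients received since the last reset."""

    def __init__(self, dim):
        self.dim = dim
        self.avg = [0.0] * dim
        self.count = 0

    def add(self, grad):
        # Incremental mean: new = (N-1)/N * old + 1/N * g_N.
        self.count += 1
        n = self.count
        self.avg = [(n - 1) / n * a + g / n for a, g in zip(self.avg, grad)]

    def reset(self):
        # Called after an SGD step: zero out the buffer.
        self.avg = [0.0] * self.dim
        self.count = 0

buf = Buffer(2)
for g in ([1.0, 2.0], [3.0, 6.0], [5.0, 1.0]):
    buf.add(g)
# buf.avg is now (up to rounding) the plain mean of the three gradients.
```

The incremental form gives the same value as averaging all stored gradients at once, while the extra memory per buffer stays at one vector and one counter.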
With the benefit of buffers, the server has access to $B$ candidate gradients when updating the parameter. Thus, a more reliable (robust) gradient can be aggregated from the $B$ gradients of the buffers, if a proper aggregation function is chosen.
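Putting the pieces of this subsection together, the server-side logic can be sketched as below. This is a simplified single-process illustration rather than the paper's Algorithm 2: the class `BASGDServer`, the method `receive`, the worker-to-buffer mapping `worker_id % B`, and the use of a coordinate-wise median as the aggregation step are all assumptions made for this sketch.

```python
import statistics

class BASGDServer:
    """Toy buffered server: updates only when every buffer has received a gradient."""

    def __init__(self, w0, lr, num_buffers):
        self.w = list(w0)
        self.lr = lr
        self.B = num_buffers
        self.t = 0
        self._reset_buffers()

    def _reset_buffers(self):
        d = len(self.w)
        self.avg = [[0.0] * d for _ in range(self.B)]
        self.count = [0] * self.B

    def receive(self, worker_id, grad):
        b = worker_id % self.B  # assumed mapping from worker index to buffer
        self.count[b] += 1
        n = self.count[b]
        self.avg[b] = [(n - 1) / n * a + g / n for a, g in zip(self.avg[b], grad)]
        if all(c > 0 for c in self.count):
            # Aggregate the B candidate gradients coordinate by coordinate.
            agg = [statistics.median(col) for col in zip(*self.avg)]
            self.w = [wi - self.lr * gi for wi, gi in zip(self.w, agg)]
            self.t += 1
            self._reset_buffers()
            return True  # an SGD step was executed
        return False

server = BASGDServer([0.0], lr=1.0, num_buffers=3)
events = [(0, [1.0]), (3, [3.0]), (1, [2.0]), (2, [100.0])]
stepped = [server.receive(k, g) for k, g in events]
```

In this toy run, workers 0 and 3 share a buffer whose average is 2.0, and the outlier 100.0 from worker 2 occupies a single buffer, so the aggregated gradient is 2.0 and the update is not dominated by the outlier.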
3.2 Aggregation Function
When an SGD step is ready to be executed, there are $B$ buffers providing candidate gradients. An aggregation function is needed to get the final gradient for updating. A simple choice is to take the mean of all candidate gradients. However, the mean is sensitive to outliers, which are common in BL.
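For designing proper aggregation functions, we first define the $q$-Byzantine Robust ($q$-BR) condition to quantitatively describe the Byzantine resilience of an aggregation function.
Definition 1 ($q$-Byzantine Robust).
For an aggregation function $Aggr(\cdot)$: $[\mathbf{h}_1, \ldots, \mathbf{h}_B] \mapsto \mathbf{G}$, where $\mathbf{h}_b = [h_{b1}, \ldots, h_{bd}]^T$ and $\mathbf{G} = [G_1, \ldots, G_d]^T$, we call $Aggr(\cdot)$ $q$-Byzantine Robust ($q \in \mathbb{Z}$, $0 \le q < B/2$), if it satisfies the following two properties:
a). $Aggr([\mathbf{h}_1 + \mathbf{h}', \ldots, \mathbf{h}_B + \mathbf{h}']) = Aggr([\mathbf{h}_1, \ldots, \mathbf{h}_B]) + \mathbf{h}'$, $\forall \mathbf{h}_1, \ldots, \mathbf{h}_B, \mathbf{h}' \in \mathbb{R}^d$;
b). $\min_{s \in \mathcal{S}} \{h_{sj}\} \le G_j \le \max_{s \in \mathcal{S}} \{h_{sj}\}$, $\forall j \in \{1, \ldots, d\}$, $\forall \mathcal{S} \subset \{1, \ldots, B\}$ with $|\mathcal{S}| = B - q$.
Intuitively, property a) in Definition 1 says that if the same vector $\mathbf{h}'$ is added to all candidate gradients, then $\mathbf{h}'$ is also added to the aggregated gradient. Property b) says that for each coordinate $j$, the aggregated value $G_j$ will be between the $(q+1)$-th smallest value and the $(q+1)$-th largest value among the $j$-th coordinates of all candidate gradients. Thus, the gradient aggregated by a $q$-BR function is insensitive to at least $q$ outliers.
We can find that the $q$-BR condition gets stronger when $q$ increases. In other words, if $Aggr(\cdot)$ is $q$-BR, then for any $0 \le q' < q$, $Aggr(\cdot)$ is also $q'$-BR.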
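Property b) of Definition 1 can be checked numerically on a given set of candidates. The sketch below reads the bound as "between the (q+1)-th smallest and the (q+1)-th largest value of each coordinate" with 0 <= q < B/2; this reading, as well as the helper name `satisfies_property_b`, is an assumption made here for illustration.

```python
def satisfies_property_b(candidates, aggregated, q):
    """Check that every coordinate of `aggregated` lies between the (q+1)-th
    smallest and the (q+1)-th largest value of that coordinate."""
    B = len(candidates)
    if not 0 <= q < B / 2:
        raise ValueError("q must satisfy 0 <= q < B/2")
    for j, value in enumerate(aggregated):
        column = sorted(h[j] for h in candidates)
        low, high = column[q], column[B - 1 - q]
        if not low <= value <= high:
            return False
    return True

# One-dimensional toy candidates with a single large outlier.
candidates = [[0.0], [1.0], [2.0], [3.0], [100.0]]
mean = [sum(h[0] for h in candidates) / len(candidates)]
median = [2.0]

mean_ok_q0 = satisfies_property_b(candidates, mean, 0)      # within [0, 100]
mean_ok_q1 = satisfies_property_b(candidates, mean, 1)      # must lie in [1, 3]
median_ok_q2 = satisfies_property_b(candidates, median, 2)  # must equal 2
```

On these candidates the mean passes the check only for q = 0, while the median still passes for q = 2, which matches the intuition that a larger q corresponds to a stronger robustness requirement.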
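It is not hard to find that when $B > 1$, the mean function is not $q$-Byzantine Robust for any $q > 0$. We illustrate this by a simple one-dimensional example: let $h_1, \ldots, h_{B-1}$ be bounded values, while $h_B$ is a sufficiently large outlier. Then $\frac{1}{B}\sum_{b=1}^{B} h_b > \max_{b \le B-1} h_b$. Namely, the mean is larger than any of the first $B-1$ values, which violates property b) in Definition 1.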
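We find that the following two aggregation functions satisfy the Byzantine Robust condition.
Definition 2 (Coordinate-wise median (Yin et al., 2018)).
For candidate gradients $\mathbf{h}_b \in \mathbb{R}^d$, $\mathbf{h}_b = [h_{b1}, \ldots, h_{bd}]^T$, $b = 1, \ldots, B$, the coordinate-wise median is defined as:
$$Med([\mathbf{h}_1, \ldots, \mathbf{h}_B]) = [Med(h_{\cdot 1}), \ldots, Med(h_{\cdot d})]^T,$$
where $Med(h_{\cdot j})$ is the scalar median of the $j$-th coordinates of all candidate gradients, $j = 1, \ldots, d$.
Definition 3 (Coordinate-wise $q$-trimmed-mean (Yin et al., 2018)).
For any positive integer $q < B/2$ and candidate gradients $\mathbf{h}_b \in \mathbb{R}^d$, $\mathbf{h}_b = [h_{b1}, \ldots, h_{bd}]^T$, $b = 1, \ldots, B$, the coordinate-wise $q$-trimmed-mean is defined as:
$$Trm([\mathbf{h}_1, \ldots, \mathbf{h}_B]) = [Trm(h_{\cdot 1}), \ldots, Trm(h_{\cdot d})]^T,$$
where $Trm(h_{\cdot j})$ is the scalar $q$-trimmed-mean:
$$Trm(h_{\cdot j}) = \frac{1}{B - 2q}\sum_{x \in \mathcal{M}_j} x,$$
and $\mathcal{M}_j$ is the subset of $\{h_{bj}\}_{b=1}^{B}$ obtained by removing the $q$ largest elements and the $q$ smallest elements, $j = 1, \ldots, d$.
In the following content, coordinate-wise median and coordinate-wise $q$-trimmed-mean are also called median and trmean, respectively. Proposition 1 shows the $q$-BR property of these two functions.
Proposition 1.
With $B$ candidate gradients, the coordinate-wise $q$-trimmed-mean is $q$-BR, and the coordinate-wise median is $\lfloor (B-1)/2 \rfloor$-BR.
Here, $\lfloor x \rfloor$ represents the maximum integer not larger than $x$. According to Proposition 1, either median or trmean is a proper choice for the aggregation function in BASGD.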
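The two aggregation functions can be written in a few lines of plain Python. This is a minimal sketch for small inputs, assuming candidates are given as lists of floats; the function names are chosen here for illustration, and a practical implementation would typically use a tensor library instead.

```python
import statistics

def coordinate_mean(candidates):
    """Plain averaging, shown only as a non-robust reference."""
    return [sum(col) / len(col) for col in zip(*candidates)]

def coordinate_median(candidates):
    """Coordinate-wise median of the candidate gradients."""
    return [statistics.median(col) for col in zip(*candidates)]

def coordinate_trimmed_mean(candidates, q):
    """Coordinate-wise q-trimmed-mean: drop the q largest and q smallest
    values of each coordinate, then average the remaining B - 2q values."""
    B = len(candidates)
    if not 0 <= q < B / 2:
        raise ValueError("q must satisfy 0 <= q < B/2")
    result = []
    for col in zip(*candidates):
        kept = sorted(col)[q:B - q]
        result.append(sum(kept) / len(kept))
    return result

# Four well-behaved candidates and one outlier in both coordinates.
candidates = [[1.0, -1.0], [2.0, 0.0], [3.0, 1.0], [4.0, 2.0], [1000.0, -1000.0]]
med = coordinate_median(candidates)
trm = coordinate_trimmed_mean(candidates, q=1)
avg = coordinate_mean(candidates)
```

Here both median and trmean return `[3.0, 0.0]`, whereas the plain mean is pulled to roughly `[202.0, -199.6]` by the single outlier.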
The time complexity for computing the average value of all buffers in each iteration is . If trmean or median is chosen as , the time complexity for each iteration is and for trmean and median, respectively. Hence, the total time complexity is and for trmean and median respectively, where is the total number of iterations. For space complexity, buffers are introduced in BASGD. Hence, the extra space complexity of BASGD is .
In this section, we theoretically prove the convergence and resilience of BASGD against attack or error. Here we only present the main Lemmas and Theorems.
We make the following assumptions, which also have been widely used in stochastic optimization methods like SGD-based methods. Please note that we do not give any assumption about the behavior of Byzantine workers, which may behave arbitrarily.
Assumption 1 (Lower bound).
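The global loss function $F(\mathbf{w})$ is bounded below: $\exists F^* \in \mathbb{R}$ such that $F(\mathbf{w}) \ge F^*$, $\forall \mathbf{w} \in \mathbb{R}^d$.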
Assumption 2 (Unbiased estimation).
Any loyal worker can use its locally stored training instances to obtain an unbiased estimate of the gradient of the global loss function: $\mathbb{E}[\nabla f(\mathbf{w}; z_i) \mid \mathbf{w}] = \nabla F(\mathbf{w})$, $\forall \mathbf{w} \in \mathbb{R}^d$.
Assumption 3 (Limited second order moment).
The gradient received from any loyal worker has limited second order moment: $\mathbb{E}[\|\nabla f(\mathbf{w}; z_i)\|^2 \mid \mathbf{w}] \le D^2$, $\forall \mathbf{w} \in \mathbb{R}^d$.
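Assumption 4 ($L$-smoothness).
The global loss function $F(\mathbf{w})$ is differentiable and $L$-smooth: $\|\nabla F(\mathbf{w}) - \nabla F(\mathbf{w}')\| \le L\|\mathbf{w} - \mathbf{w}'\|$, $\forall \mathbf{w}, \mathbf{w}' \in \mathbb{R}^d$.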
Assumption 5 (Limited number of Byzantine workers).
The number of Byzantine workers at each iteration is not larger than $r$.
Please note that we do not explicitly assume limited delay here, because it can be guaranteed by the definition of loyal workers. Workers with too heavy delay would be seen as Byzantine workers in our analysis.
Please also note that we do not give any assumption about convexity. The analysis in this section is suitable for both convex and non-convex models in machine learning, such as logistic regression and deep neural networks.
Before formally giving theoretical results about convergence, we define a type of constant , which will be used in our theoretical results.
, constant is defined as:
If is -Byzantine Robust, and there are no more than Byzantine workers , then:
If is -Byzantine Robust, then:
If is -Byzantine Robust and , taking learning rate , we have:
We can find that BASGD has a similar theoretical convergence rate as that of vanilla ASGD, with an extra constant variance which corresponds to the constant .
Then we have the following conclusions:
When and are fixed, the upper bound of will increase when (number of Byzantine workers) increases. Namely, the upper bound will be larger if there are more Byzantine workers.
When and are fixed, measures the Byzantine Robust degree of aggregation function . The factor is monotonically decreasing with respect to , when . Since , the upper bound will decrease when increases. Also, decreases when increases. Namely, the upper bound will be smaller if has a stronger -BR property.
In the worst case (), the upper bound of is linear to . Even in the best case (), the denominator is about and the upper bound of is linear to . That is to say, larger buffer number might result in slower convergence and higher loss. Hence, unless necessary, we should choose as small as possible.
In this section, we empirically evaluate the performance of BASGD and baselines. Our experiments are conducted on a distributed platform with dockers. Each docker is bound to an NVIDIA Tesla V100 (32G) GPU. In all experiments, we choose dockers as workers, and one extra docker as the server. All algorithms are implemented with PyTorch 1.3.
5.1 Experimental Setting
The algorithms are evaluated on the CIFAR-10 image classification dataset (Krizhevsky et al., 2009) with the deep learning model ResNet-20 (He et al., 2016). Each worker is manually set to have a delay, which is randomly sampled from a standard normal distribution truncated to a bounded interval.
We use cross-entropy loss on training set (training loss) and top-1 accuracy on test set to quantitatively measure the performance. In an asynchronous algorithm, the epoch number on different workers may differ. Hence, we use the average cross-entropy loss and average top-1 accuracy on all workers w.r.t. epochs as the final metrics.
We set initial learning rate for each algorithm, and multiply by 0.1 at the -th epoch and the -th epoch respectively. The weight decay is set to . We run each algorithm for epochs, but only the results of the first epochs will be taken into account because some workers may finish earlier than others. Training set is randomly and equally distributed to different workers, and the batch size on each worker is set to .
Because the focus of this paper is on ABL, SBL methods cannot be directly compared with BASGD. The ABL method Zeno++ (Xie et al., 2019a) cannot be directly compared with BASGD either, because Zeno++ needs to store some training instances on the server. Hence, in our experiments, we compare BASGD with vanilla ASGD and the ABL baseline Kardam (Damaskinos et al., 2018). For Kardam, we use the dampening function suggested in (Damaskinos et al., 2018).
5.2 Cases without Byzantine Workers
We compare the performance of different methods when there are no Byzantine workers. Experimental results with median and trmean aggregation functions are illustrated in Figure 3(a) and Figure 3(b), respectively.
We can find that ASGD achieves the best performance. BASGD ($B > 1$) and Kardam have a convergence rate similar to that of ASGD, but both sacrifice a little accuracy. Furthermore, the performance of BASGD gets worse when the buffer number $B$ increases, which is consistent with the theoretical results. Please note that ASGD is a degenerate case of BASGD with $B = 1$. Hence, in the cases without attack or error, BASGD can achieve the same performance as ASGD by setting $B = 1$.
5.3 Cases with Byzantine Workers
We compare the performance of different methods under two types of attack: negative gradient attack (NG-attack) and random disturbance attack (RD-attack). In NG-attack, Byzantine workers will send to the server, where is the correctly computed gradient based on its training data. In RD-attack, Byzantine workers will send to the server, where is a random vector with each coordinate randomly sampled from a normal distribution . We set for NG-attack, and for RD-attack. NG-attack is a typical kind of malicious attack, while RD-attack can be seen as an accidental error with expectation . For each type of attack, we conduct two experiments in which there are and Byzantine workers, respectively. We respectively set and buffers for BASGD in these two experiments.
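The two attack types can be mimicked with a short sketch. The exact forms and constants used in the experiments are not reproduced here: the functions below assume that NG-attack sends a negated, scaled copy of the correct gradient and that RD-attack adds zero-mean Gaussian noise to it. The names `ng_attack`, `rd_attack`, `scale`, and `sigma`, and the values used in the example, are hypothetical.

```python
import random

def ng_attack(grad, scale):
    """Negative gradient attack (assumed form): send -scale * g."""
    return [-scale * g for g in grad]

def rd_attack(grad, sigma, rng):
    """Random disturbance attack (assumed form): add N(0, sigma^2) noise
    independently to every coordinate of the correct gradient."""
    return [g + rng.gauss(0.0, sigma) for g in grad]

rng = random.Random(0)
correct = [1.0, -2.0]
malicious = ng_attack(correct, scale=10.0)        # points away from descent
disturbed = rd_attack(correct, sigma=0.5, rng=rng)
undisturbed = rd_attack(correct, sigma=0.0, rng=rng)
```

Under these assumptions, an NG-attack gradient pushes the parameter in the ascent direction, while an RD-attack gradient is correct in expectation but noisy, which is why the latter behaves more like an accidental error.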
Figure 4(a) (for 3 Byzantine workers) and Figure 4(b) (for 6 Byzantine workers) illustrate the average top-1 test accuracy w.r.t. epochs. Figure 5(a) and Figure 5(b) illustrate the average training loss w.r.t. epochs. In Figure 5, some curves do not appear because the value of the loss function is extremely large or even not a number (NaN) due to the Byzantine attack.
We can find that BASGD significantly outperforms ASGD and Kardam under both RD-attack (accidental error) and NG-attack (malicious attack). Although ASGD and Kardam can still converge under the less harmful RD-attack, they both suffer a significant loss in accuracy. Under the NG-attack, even if we have set the number of assumed Byzantine workers to the maximum value for Kardam (), neither ASGD nor Kardam can converge. Hence, neither ASGD nor Kardam can resist malicious attack. In contrast, both types of attack have little effect on the performance of BASGD. Furthermore, in our experiments we find that Kardam filters more than of the gradients, which means that Kardam also filters most of the correct gradients in order to filter the faulty (error) gradients. This might explain why Kardam performs poorly under malicious attack.
In this paper, we propose a novel method called BASGD for Byzantine learning. BASGD is asynchronous, and hence has more practical applications than synchronous methods. Furthermore, BASGD does not need to store any training instances on the server, which provides a potential solution for preserving privacy in distributed learning. In addition, BASGD is theoretically proved to be able to resist error and malicious attack. Empirical results show that BASGD significantly outperforms vanilla ASGD and other asynchronous Byzantine learning baselines when there are errors or attacks on workers.
- Alistarh et al. (2018). Byzantine stochastic gradient descent. In Advances in Neural Information Processing Systems, pp. 4613–4623.
- Blanchard et al. (2017). Machine learning with adversaries: Byzantine tolerant gradient descent. In Advances in Neural Information Processing Systems, pp. 119–129.
- Bottou (2010). Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT’2010, pp. 177–186.
- Damaskinos et al. (2018). Asynchronous Byzantine machine learning (the case of SGD). In Proceedings of the International Conference on Machine Learning, pp. 1145–1154.
- Duchi et al. (2011). Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research 12 (Jul), pp. 2121–2159.
- He et al. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
- Jaggi et al. (2014). Communication-efficient distributed dual coordinate ascent. In Advances in Neural Information Processing Systems, pp. 3068–3076.
- Johnson and Zhang (2013). Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pp. 315–323.
- Kairouz et al. (2019). Advances and open problems in federated learning. arXiv:1912.04977.
- Konečný et al. (2016). Federated learning: strategies for improving communication efficiency. arXiv:1610.05492.
- Krizhevsky et al. (2009). Learning multiple layers of features from tiny images.
- Lee et al. (2017). Distributed stochastic variance reduced gradient methods by sampling extra data with replacement. The Journal of Machine Learning Research 18 (1), pp. 4404–4446.
- Li et al. (2014). Communication efficient distributed machine learning with the parameter server. In Advances in Neural Information Processing Systems, pp. 19–27.
- Lian et al. (2017). Can decentralized algorithms outperform centralized algorithms? a case study for decentralized parallel stochastic gradient descent. In Advances in Neural Information Processing Systems, pp. 5330–5340.
- Lin et al. (2014). An accelerated proximal coordinate gradient method. In Advances in Neural Information Processing Systems, pp. 3059–3067.
- Ma et al. (2015). Adding vs. averaging in distributed primal-dual optimization. In Proceedings of the International Conference on Machine Learning, pp. 1973–1982.
- Schmidt et al. (2017). Minimizing finite sums with the stochastic average gradient. Mathematical Programming 162 (1-2), pp. 83–112.
- Shalev-Shwartz and Zhang (2013). Stochastic dual coordinate ascent methods for regularized loss minimization. Journal of Machine Learning Research 14 (Feb), pp. 567–599.
- Shamir et al. (2014). Communication-efficient distributed optimization using an approximate newton-type method. In Proceedings of the International Conference on Machine Learning, pp. 1000–1008.
- Shi et al. (2016). Edge computing: vision and challenges. IEEE Internet of Things Journal 3 (5), pp. 637–646.
- Sun et al. (2018). Slim-dp: a multi-agent system for communication-efficient distributed deep learning. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, pp. 721–729.
- Wangni et al. (2018). Gradient sparsification for communication-efficient distributed optimization. In Advances in Neural Information Processing Systems, pp. 1299–1309.
- Xiao (2010). Dual averaging methods for regularized stochastic learning and online optimization. Journal of Machine Learning Research 11 (Oct), pp. 2543–2596.
- Xie et al. (2019a). Zeno++: robust asynchronous SGD with arbitrary number of Byzantine workers. arXiv:1903.07020.
- Xie et al. (2019b). Zeno: distributed stochastic gradient descent with suspicion-based fault-tolerance. In Proceedings of the International Conference on Machine Learning, pp. 6893–6901.
- Yang (2013). Trading computation for communication: distributed stochastic dual coordinate ascent. In Advances in Neural Information Processing Systems, pp. 629–637.
- Yin et al. (2018). Byzantine-robust distributed learning: towards optimal statistical rates. In Proceedings of the International Conference on Machine Learning, pp. 5650–5659.
- Yu et al. (2019a). On the linear speedup analysis of communication efficient momentum SGD for distributed non-convex optimization. In Proceedings of the International Conference on Machine Learning, pp. 7184–7193.
- Yu et al. (2019b). Parallel restarted SGD with faster convergence and less communication: demystifying why model averaging works for deep learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 5693–5700.
- Zhang et al. (2013). Linear convergence with condition number independent access of full gradients. In Advances in Neural Information Processing Systems, pp. 980–988.
- Zhang and Kwok (2014). Asynchronous distributed ADMM for consensus optimization. In Proceedings of the International Conference on Machine Learning, pp. 1701–1709.
- Zhao et al. (2017). SCOPE: scalable composite optimization for learning on spark. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, pp. 2928–2934.
- Zhao et al. (2018). Proximal SCOPE for distributed sparse learning. In Advances in Neural Information Processing Systems, pp. 6551–6560.
- Zhou et al. (2018). Distributed proximal gradient algorithm for partially asynchronous computer clusters. The Journal of Machine Learning Research 19 (1), pp. 733–764.
- Zinkevich et al. (2010). Parallelized stochastic gradient descent. In Advances in Neural Information Processing Systems, pp. 2595–2603.