I Introduction
With the rapid development of information technologies, the volume of distributed data grows explosively. Every day, numerous distributed devices, including sensors, cellphones, computers, and vehicles, generate huge amounts of data, which are often forwarded to datacenters for further processing and learning tasks. However, collecting data from distributed devices and storing them in datacenters raises major privacy concerns [1, 2, 3]. To address these concerns, federated learning has been advocated as a privacy-preserving, decentralized data processing and machine learning framework [4]. Data in federated learning are kept private, and local computations are carried out at the distributed devices. Updates of local variables (such as stochastic gradients, corrected stochastic gradients, and model parameters) are computed from per-device private data, while the datacenter aggregates the local variables and disseminates the aggregated result to the distributed devices.

Even though privacy is preserved, the distributed nature of federated learning makes it vulnerable to errors and adversarial attacks. Devices can become unreliable in either computing or communicating, or they can even be hacked by adversaries. As a result, compromised devices may send malicious messages to the datacenter, thus misleading the learning process [5, 6, 7]. We will henceforth focus on the class of malicious attacks known as Byzantine attacks [8]. Robustifying federated learning against Byzantine attacks is of paramount importance for secure processing and learning.
To cope with Byzantine attacks in federated learning, several robust aggregation rules have been developed in recent years, mainly to improve the distributed stochastic gradient descent (SGD) solver of the underlying optimization task. By aggregating stochastic gradients with the geometric median [9, 10], median [11], trimmed mean [12], or iterative filtering [13], stochastic algorithms are able to tolerate a small number of devices attacked by Byzantine adversaries. Other aggregation rules include Krum [14], which selects the stochastic gradient having the minimal cumulative squared distance from a given number of nearest stochastic gradients, and RSA [15], which aggregates model parameters rather than stochastic gradients by penalizing the differences between the local and global model parameters. Related works also include adversarial learning in distributed principal component analysis [16], escaping from saddle points in non-convex distributed learning under Byzantine attacks [17], and leveraging redundant gradients to improve robustness [18, 19].

Although robust SGD iterates can ensure convergence to a neighborhood of the attack-free optimal solution, this neighborhood can be large when Byzantine attacks are carefully crafted [20]. Essentially, SGD suffers from the sizeable approximation error (noise) associated with stochastic gradients. This leads to the challenge of distinguishing malicious messages sent by Byzantine attackers from the noisy stochastic gradients sent by honest devices.
In the face of this challenge, we pose the following question: Is it possible to better distinguish the malicious messages from the stochastic gradients by reducing the stochastic gradient-induced noise? Our answer turns out to be in the affirmative. Intuitively, if the stochastic gradient noise is small, the malicious messages should be easy to identify; see also the illustrative example in Section II-D. This intuition suggests combining variance reduction techniques with robust aggregation rules to handle Byzantine attacks in federated learning.
Existing variance reduction techniques in stochastic optimization include mini-batching [21] and methods such as SAG [22], SVRG [23], SAGA [24], SDCA [25], SARAH [26], and Katyusha [27], to name a few. Among these, we are particularly interested in SAGA, which has proven effective in finite-sum optimization. SAGA can also be implemented in a distributed manner [28, 29, 30], and hence fits well the federated learning applications, where each device deals with a finite number of data samples.
Our proposed novel Byzantine attack resilient distributed (Byrd-) SAGA combines SAGA’s variance reduction with robust aggregation to deal with the malicious attacks in federated finite-sum optimization setups. Instead of the mean employed by distributed SAGA, the datacenter in Byrd-SAGA relies on the geometric median to aggregate the corrected stochastic gradients sent by distributed devices. Through reducing the stochastic gradient-induced noise, Byrd-SAGA turns out to outperform the Byzantine attack resilient distributed SGD. When less than half of the workers are Byzantine attackers, the robustness of geometric median to outliers enables Byrd-SAGA to achieve provably linear convergence to a neighborhood of the optimal solution, and the asymptotic learning error is solely determined by the number of Byzantine workers. Numerical tests demonstrate the robustness of Byrd-SAGA to various Byzantine attacks.
II Problem Statement
We start this section by specifying the federated finite-sum optimization problem in the presence of Byzantine attacks. We then elaborate on the limitations of Byzantine attack resilient distributed SGD algorithms, which motivate our subsequent development of Byrd-SAGA.
II-A Federated finite-sum optimization in the presence of Byzantine attacks
Consider a network with one master node (datacenter) and $W$ workers (devices), among which $B$ workers are Byzantine attackers whose identities are unknown to the master node. Let $\mathcal{W}$ be the set of all workers and $\mathcal{B}$ that of Byzantine attackers, with respective cardinalities $W$ and $B$. The data samples are evenly distributed across the honest workers $w \in \mathcal{W} \setminus \mathcal{B}$. Each honest worker has $J$ data samples, and $f_{w,j}(x)$ denotes the loss of the $j$-th data sample at honest worker $w$ with respect to the model parameter $x \in \mathbb{R}^d$. We are interested in the finite-sum optimization problem
$$x^* = \arg\min_{x \in \mathbb{R}^d} f(x) := \frac{1}{W-B} \sum_{w \in \mathcal{W} \setminus \mathcal{B}} f_w(x), \tag{1}$$
where
$$f_w(x) := \frac{1}{J} \sum_{j=1}^{J} f_{w,j}(x). \tag{2}$$
The main challenge in solving (1) is that the Byzantine attackers can collude and send arbitrary malicious messages to the master node so as to bias the optimization process. We aspire to develop a robust distributed stochastic algorithm to address this issue. Intuitively, when a majority of workers are Byzantine attackers, it is difficult to obtain a reasonable approximate solution to (1). For this reason, we will assume $B < W/2$ throughout, and prove that the proposed Byzantine attack resilient algorithm is able to tolerate attacks from up to half of the workers.
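To make the setup concrete, the following sketch builds the objective in (1)-(2) on synthetic data, using a least-squares loss as a stand-in for $f_{w,j}$; the data, loss, and all constants are illustrative assumptions rather than the paper's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)
W, B, J, d = 10, 2, 20, 5          # workers, Byzantine workers, samples per worker, dimension

# Synthetic per-worker data for a least-squares stand-in loss
# f_{w,j}(x) = 0.5 * (a_{w,j}^T x - b_{w,j})^2
A = rng.normal(size=(W, J, d))
b = rng.normal(size=(W, J))

def local_loss(w, x):
    """f_w(x): average loss over worker w's J local samples, as in (2)."""
    r = A[w] @ x - b[w]
    return 0.5 * np.mean(r ** 2)

def global_loss(x, honest):
    """f(x): average of f_w over the honest workers only, as in (1)."""
    return np.mean([local_loss(w, x) for w in honest])

honest = range(B, W)               # pretend workers 0..B-1 are the Byzantine ones
x = np.zeros(d)
print(global_loss(x, honest))
```

Note that only the honest workers enter the objective: the Byzantine workers hold no trusted data, which is why (1) averages over $\mathcal{W} \setminus \mathcal{B}$.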
II-B Sensitivity of distributed SGD to Byzantine attacks
When all workers are honest, a popular solver of (1) is SGD [31]. At time slot (iteration) $k$, the master node broadcasts the current iterate $x^k$ to the workers. Upon receiving $x^k$, worker $w$ uniformly at random chooses a local data sample with index $i_w^k$ to obtain the stochastic gradient $\nabla f_{w, i_w^k}(x^k)$, which it then communicates back to the master node. Upon collecting the stochastic gradients from all workers, the master node updates the model as
$$x^{k+1} = x^k - \gamma^k \cdot \frac{1}{W} \sum_{w \in \mathcal{W}} \nabla f_{w, i_w^k}(x^k), \tag{3}$$
where $\gamma^k$ is the non-negative step size. Note that distributed SGD can be extended to its mini-batch version, whereby each worker uniformly at random chooses a mini-batch of data samples per iteration and communicates the averaged stochastic gradient back to the master node.
While the honest workers send true stochastic gradients to the master node, the Byzantine ones can send arbitrary malicious messages to the master node in order to perturb (bias) the optimization process. Let $m_w^k$ denote the message worker $w$ sends to the master node at slot $k$, given by
$$m_w^k = \begin{cases} \nabla f_{w, i_w^k}(x^k), & w \in \mathcal{W} \setminus \mathcal{B}, \\ *, & w \in \mathcal{B}, \end{cases} \tag{4}$$
where $*$ denotes an arbitrary vector. Then, the distributed SGD update (3) becomes
$$x^{k+1} = x^k - \gamma^k \cdot \frac{1}{W} \sum_{w \in \mathcal{W}} m_w^k. \tag{5}$$
Even when only one Byzantine attacker is present, distributed SGD may fail. Consider a Byzantine attacker $b$ that sends $m_b^k = -\sum_{w \neq b} m_w^k$ to the master node, which yields $x^{k+1} = x^k$, so that the iterate never moves. In practice, Byzantine attackers can send more sophisticated messages to fool the master node, and thus bias the optimization process.
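This failure mode is easy to reproduce numerically. In the sketch below (synthetic gradients and illustrative dimensions), a single attacker that sends the negated sum of the other messages freezes the mean-aggregated update:

```python
import numpy as np

rng = np.random.default_rng(1)
d, W = 4, 8
grads = [rng.normal(size=d) for _ in range(W - 1)]  # honest stochastic gradients
attack = -sum(grads)                                 # single Byzantine message

x = rng.normal(size=d)
gamma = 0.1
# Mean aggregation over all W messages: the attacker cancels the honest sum exactly.
x_new = x - gamma * np.mean(grads + [attack], axis=0)
print(np.allclose(x_new, x))   # prints True: the update is frozen
```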
II-C Byzantine attack resilient distributed SGD
Recent works often robustify the distributed SGD by incorporating robust aggregation rules when the master node receives messages from the workers. Here, we will adopt and analyze the geometric median, even though alternative robust aggregation rules are also viable [9, 10].
With $\{z_w : w \in \mathcal{W}\}$ denoting a subset of vectors in a normed space, the geometric median of $\{z_w\}$ is defined as
$$\mathrm{geomed}\{z_w : w \in \mathcal{W}\} := \arg\min_{z} \sum_{w \in \mathcal{W}} \|z - z_w\|. \tag{6}$$
Using (6), the distributed SGD in (5) can be modified to its Byzantine attack resilient form as
$$x^{k+1} = x^k - \gamma^k \cdot \mathrm{geomed}\{m_w^k : w \in \mathcal{W}\}. \tag{7}$$
In essence, the geometric median chooses a reliable vector to represent the received messages through majority voting. When the number of Byzantine workers satisfies $B < W/2$, the geometric median approximates reasonably well the mean of the honest messages $\{m_w^k : w \in \mathcal{W} \setminus \mathcal{B}\}$. This property enables the Byzantine attack resilient distributed SGD to converge to a neighborhood of the optimal solution [9, 10].
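As a quick illustration of why the geometric median in (6)-(7) helps, the sketch below compares mean and geometric median aggregation on synthetic messages with two large outliers. The geometric median is computed by plain Weiszfeld iterations; this is an illustrative implementation, not the paper's code.

```python
import numpy as np

def geomed(Z, iters=200):
    """Geometric median of the rows of Z via Weiszfeld iterations."""
    z = Z.mean(axis=0)
    for _ in range(iters):
        dist = np.maximum(np.linalg.norm(Z - z, axis=1), 1e-12)  # avoid division by zero
        wts = 1.0 / dist
        z = (wts[:, None] * Z).sum(axis=0) / wts.sum()
    return z

rng = np.random.default_rng(2)
honest = rng.normal(size=(7, 3))                  # 7 honest messages near the origin
msgs = np.vstack([honest, np.full((2, 3), 1e3)])  # 2 Byzantine outliers far away

print(np.linalg.norm(msgs.mean(axis=0)))          # mean: dragged far away by the outliers
print(np.linalg.norm(geomed(msgs)))               # geometric median: stays near the honest cluster
```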
II-D Impact of stochastic gradient noise on robust aggregation
In distributed SGD, the stochastic gradients evaluated by honest workers are noisy because of the randomness in choosing data samples. Due to this stochastic gradient noise, however, it is not always easy to distinguish the malicious messages from the stochastic gradients using robust aggregation rules alone, e.g., the geometric median. Several existing works have recognized this issue. With carefully crafted Byzantine attacks, the outputs of several Byzantine attack resilient SGD algorithms can be far away from the optimal solution [20]. In [10] and [18], the workers are divided into several groups, with averages taken within groups and the geometric median computed across groups. This approach reduces the variance and thus enhances the ability to distinguish malicious messages. In [14], it is explicitly assumed that the ratio of the variance of the stochastic gradients to the distance between the iterate and the optimal solution is upper-bounded.
Fig. 1 shows the impact of stochastic gradient noise on geometric median-based robust aggregation. When the stochastic gradients sent by honest workers have small variance, the gap between the true mean and the aggregated value is also small; that is, the same Byzantine attacks are less effective. We will quantify this statement in our analysis of Section IV-A.
Prompted by this observation, our key idea is to reduce the variance of stochastic gradients in order to enhance robustness to Byzantine attacks. In the Byzantine attack-free case, an effective approach to alleviating stochastic gradient noise in SGD is through variance reduction. By compensating for stochastic gradient noise, variance reduction techniques lead to faster convergence than SGD. For specificity, we will focus on SAGA, which reduces stochastic gradient noise for finite-sum optimization [24], and we will show how SAGA can also aid robust aggregation against Byzantine attacks.
III Algorithm Development
In this section, we first introduce distributed SAGA with mean aggregation. Then, we propose Byrd-SAGA, which replaces mean aggregation by geometric median-based robust aggregation.
III-A Distributed SAGA with mean aggregation
In distributed SAGA, each worker maintains a table of stochastic gradients for all of its local data samples [28, 29]. As in distributed SGD, the master node at slot $k$ sends $x^k$ to the workers, and every worker $w$ uniformly at random chooses a local data sample with index $i_w^k$ to find the stochastic gradient $\nabla f_{w, i_w^k}(x^k)$. However, worker $w$ does not send $\nabla f_{w, i_w^k}(x^k)$ back to the master node. Instead, it corrects $\nabla f_{w, i_w^k}(x^k)$ by first subtracting the previously stored stochastic gradient of the $i_w^k$-th data sample, and then adding the average of the stored stochastic gradients across the local data samples. Then, worker $w$ sends this corrected stochastic gradient to the master node, and stores $\nabla f_{w, i_w^k}(x^k)$ as the stochastic gradient of the $i_w^k$-th data sample in the table. After collecting the corrected stochastic gradients from all workers, the master node updates the model.
To better describe distributed SAGA, let
$$\phi_{w,j}^{k+1} = \begin{cases} x^k, & j = i_w^k, \\ \phi_{w,j}^k, & j \neq i_w^k, \end{cases} \tag{8}$$
where $\phi_{w,j}^{k+1}$ is the iterate at which the most recent $\nabla f_{w,j}$ is evaluated when slot $k$ ends. Then, $\nabla f_{w,j}(\phi_{w,j}^k)$ refers to the previously stored stochastic gradient of the $j$-th data sample prior to slot $k$ on worker $w$, and
$$g_w^k = \nabla f_{w, i_w^k}(x^k) - \nabla f_{w, i_w^k}(\phi_{w, i_w^k}^k) + \frac{1}{J} \sum_{j=1}^{J} \nabla f_{w,j}(\phi_{w,j}^k)$$
is the corrected stochastic gradient of worker $w$ at slot $k$. The model update of distributed SAGA is hence
$$x^{k+1} = x^k - \gamma \cdot \frac{1}{W} \sum_{w \in \mathcal{W}} g_w^k, \tag{9}$$
where $\gamma$ is the constant step size.
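A single worker's bookkeeping for (8)-(9) can be sketched as follows, again with a least-squares stand-in loss; the gradient table and its running average are the only extra state SAGA needs per worker (all data and sizes here are illustrative assumptions).

```python
import numpy as np

rng = np.random.default_rng(3)
J, d = 50, 4
A = rng.normal(size=(J, d))
b = rng.normal(size=J)                     # one worker's local data

def grad(j, x):
    """Stand-in for the per-sample gradient of f_{w,j} at x."""
    return (A[j] @ x - b[j]) * A[j]

table = np.zeros((J, d))                   # stored gradients, one row per local sample
table_avg = table.mean(axis=0)             # running average of the table

def saga_corrected(x):
    """One worker's SAGA step: sample j, correct the fresh gradient with the table."""
    global table_avg
    j = rng.integers(J)
    g_new = grad(j, x)
    g_corr = g_new - table[j] + table_avg  # corrected stochastic gradient g_w^k
    table_avg += (g_new - table[j]) / J    # keep the running average in sync
    table[j] = g_new                       # overwrite the stored gradient for sample j
    return g_corr

x = np.zeros(d)
print(saga_corrected(x).shape)
```

Maintaining `table_avg` incrementally keeps the per-iteration cost of the correction at $O(d)$ rather than $O(Jd)$.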
III-B Distributed SAGA with geometric median aggregation
Here, it is useful to recall that Byzantine workers may send malicious messages, other than the corrected stochastic gradients, to the master node. To account for this, the message sent from worker $w$ to the master node at slot $k$ is expressed as
$$m_w^k = \begin{cases} g_w^k, & w \in \mathcal{W} \setminus \mathcal{B}, \\ *, & w \in \mathcal{B}, \end{cases} \tag{10}$$
where $*$ denotes an arbitrary vector. Similar to distributed SGD, distributed SAGA is also sensitive to Byzantine attacks. Our robust aggregation rule here is again the geometric median. This leads to the proposed Byzantine attack resilient distributed (Byrd) form of the SAGA update (9), given by
$$x^{k+1} = x^k - \gamma \cdot \mathrm{geomed}\{m_w^k : w \in \mathcal{W}\}. \tag{11}$$
The proposed Byzantine attack resilient distributed SAGA, abbreviated as Byrd-SAGA, is listed step-by-step under Algorithm 1 and illustrated in Fig. 2. There are various implementations of distributed SAGA. For example, [29] proposed to store the tables of stochastic gradients at the master node. The workers only need to upload the stochastic gradients and their indices, while the master node performs the aggregation. This setup is also vulnerable to Byzantine attacks, since the Byzantine attackers may upload incorrect stochastic gradients, and the proposed robust aggregation rule can be applied therein as well.
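Putting the pieces together, a minimal end-to-end sketch of the Byrd-SAGA iteration (11) is given below on a synthetic least-squares problem in which every honest worker shares the same optimum, so the iterate should approach it despite the Byzantine messages. The worker counts, attack strength, and step size are illustrative assumptions, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(4)
W, Bz, J, d, gamma = 10, 2, 30, 5, 0.01
A = rng.normal(size=(W, J, d))
x_true = rng.normal(size=d)
b = np.einsum('wjd,d->wj', A, x_true)        # noiseless labels: every worker's optimum is x_true

def geomed(Z, iters=100):
    """Geometric median of the rows of Z via Weiszfeld iterations."""
    z = Z.mean(axis=0)
    for _ in range(iters):
        wts = 1.0 / np.maximum(np.linalg.norm(Z - z, axis=1), 1e-12)
        z = (wts[:, None] * Z).sum(axis=0) / wts.sum()
    return z

tables = np.zeros((W, J, d))                 # per-worker SAGA gradient tables

def worker_message(w, x):
    if w < Bz:                               # Byzantine worker: arbitrary large message
        return rng.normal(scale=100.0, size=d)
    j = rng.integers(J)
    g_new = (A[w, j] @ x - b[w, j]) * A[w, j]
    g_corr = g_new - tables[w, j] + tables[w].mean(axis=0)   # SAGA correction
    tables[w, j] = g_new
    return g_corr

x = np.zeros(d)
for k in range(3000):
    msgs = np.vstack([worker_message(w, x) for w in range(W)])
    x = x - gamma * geomed(msgs)             # robust aggregation at the master node
print(np.linalg.norm(x - x_true))            # small: close to the attack-free optimum
```

With mean aggregation in place of `geomed`, the same loop is driven away from `x_true` by the large Byzantine messages.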
Robust aggregation rules other than the geometric median are available, including the median [11], Krum [14], marginal trimmed mean [12], and iterative filtering [13]. In the median, for instance, the aggregation outputs the element-wise median of $\{m_w^k : w \in \mathcal{W}\}$, while in Krum, the aggregation outputs the message with the minimal cumulative squared distance from its $W - B - 2$ nearest neighbors among $\{m_w^k : w \in \mathcal{W}\}$. Note that Krum needs to know $B$, the number of Byzantine attackers, in advance. In addition, other variance reduction techniques, such as mini-batching [21], SAG [22], SVRG [23], SDCA [25], SARAH [26], and Katyusha [27], are also available to alleviate the gradient noise. Here we opted for the combination of the geometric median and SAGA. Extending the current work to other robust aggregation rules and variance reduction techniques is on our future research agenda.
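For comparison, the two alternative rules mentioned above can be sketched as follows; these are illustrative implementations, with the Krum variant following the standard $W - B - 2$ nearest-neighbor scoring.

```python
import numpy as np

def coordinate_median(msgs):
    """Element-wise (marginal) median of the received messages."""
    return np.median(msgs, axis=0)

def krum(msgs, n_byz):
    """Krum: return the message with the smallest summed squared distance
    to its W - n_byz - 2 nearest neighbors (requires n_byz in advance)."""
    W = len(msgs)
    n_near = W - n_byz - 2
    dists = np.linalg.norm(msgs[:, None, :] - msgs[None, :, :], axis=2) ** 2
    scores = np.array([np.sort(np.delete(dists[i], i))[:n_near].sum()
                       for i in range(W)])
    return msgs[np.argmin(scores)]

rng = np.random.default_rng(5)
msgs = np.vstack([rng.normal(size=(8, 3)), np.full((2, 3), 50.0)])  # 8 honest + 2 outliers
print(coordinate_median(msgs))
print(krum(msgs, n_byz=2))
```

Both rules discard the two outliers here, but note that Krum returns one of the received messages verbatim, whereas the median and the geometric median may output a vector no worker actually sent.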
Remark 1.
Computing the geometric median involves solving an optimization problem of the form (6). Since it is costly to obtain the exact geometric median, one is typically satisfied with an $\epsilon$-approximate value [32]. We say that $z_\epsilon$ is an $\epsilon$-approximate geometric median of $\{z_w : w \in \mathcal{W}\}$ if
$$\sum_{w \in \mathcal{W}} \|z_\epsilon - z_w\| \le (1+\epsilon) \min_{z} \sum_{w \in \mathcal{W}} \|z - z_w\|. \tag{12}$$
We shall show that the $\epsilon$-approximation only slightly affects the convergence of Byrd-SAGA.
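A practical way to obtain such an approximate geometric median is to run Weiszfeld iterations [32] until the objective in (6) stops improving; the sketch below uses a relative-progress stopping rule, which is an illustrative criterion rather than the paper's exact implementation.

```python
import numpy as np

def weiszfeld_eps(Z, eps=1e-6, max_iters=1000):
    """Weiszfeld iterations, stopped once the per-step decrease of the
    geometric-median objective becomes negligible."""
    obj = lambda z: np.linalg.norm(Z - z, axis=1).sum()
    z = Z.mean(axis=0)
    prev = obj(z)
    for _ in range(max_iters):
        wts = 1.0 / np.maximum(np.linalg.norm(Z - z, axis=1), 1e-12)
        z = (wts[:, None] * Z).sum(axis=0) / wts.sum()
        cur = obj(z)
        if prev - cur <= eps * max(cur, 1.0):   # relative-progress stopping rule
            break
        prev = cur
    return z

rng = np.random.default_rng(6)
Z = rng.normal(size=(9, 4))
z_eps = weiszfeld_eps(Z)
print(np.linalg.norm(Z - z_eps, axis=1).sum())  # approximate minimum of (6)
```

Each Weiszfeld step never increases the objective, so the returned point is at least as good as the initial mean.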
IV Theoretical Analysis
In this section, we theoretically justify the intuitive idea that reducing stochastic gradient noise helps identify malicious messages in robust aggregation, specifically for the geometric median considered in this paper. We prove that Byrd-SAGA converges to a neighborhood of the optimal solution at a linear rate under Byzantine attacks, and that the asymptotic learning error is determined by the number of Byzantine attackers. Due to the page limit, proofs are delegated to the full version of this paper, available at https://github.com/MrFive5555/Byrd-SAGA/blob/master/Full.pdf.
IV-A Importance of reducing stochastic gradient noise
Here, we quantify the effect of stochastic gradient noise on the geometric median aggregation. Towards this objective, consider the set of messages $\{m_w^k\}$ sent by all workers in $\mathcal{W}$, which contains the malicious messages sent by the Byzantine attackers in $\mathcal{B}$. Further, let $\nabla f(x^k)$ denote the true gradient, given by the ensemble average of the honest workers' stochastic gradients. Using these definitions, the ensuing lemma bounds the mean-square error of the geometric median relative to the true gradient.
Lemma 1.
(Concentration property) Let $\{m_w^k : w \in \mathcal{W}\}$ be a subset of random vectors distributed in a normed vector space. If $B \le \alpha W$ with $\alpha \in [0, 1/2)$, then it holds that
$$\mathbb{E}\left\|\mathrm{geomed}\{m_w^k : w \in \mathcal{W}\} - \nabla f(x^k)\right\|^2 \le \frac{C_\alpha^2}{W-B} \sum_{w \in \mathcal{W} \setminus \mathcal{B}} \left( \mathbb{E}\left\|m_w^k - \nabla f_w(x^k)\right\|^2 + \left\|\nabla f_w(x^k) - \nabla f(x^k)\right\|^2 \right), \tag{13}$$
where $C_\alpha = \frac{2-2\alpha}{1-2\alpha}$.
The left-hand side of (13) is the mean-square error of the geometric median relative to the true gradient, while the right-hand side is a sum of two terms. The first is determined by the variances of the local stochastic gradients sent by the honest workers (inner variation), while the second is determined by the deviations of the local gradients at the honest workers from the true gradient (outer variation). In Byzantine attack resilient SGD, this upper bound can be large due to the large stochastic gradient noise of SGD. By reducing the stochastic gradient noise in terms of either inner or outer variation, we can attain improved accuracy under malicious attacks.
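The effect quantified by Lemma 1 is easy to observe numerically: in the sketch below, the same fixed-strength attack is applied to honest messages with large and with small variance, and the aggregation error of the geometric median shrinks with the honest variance (all scales here are illustrative assumptions).

```python
import numpy as np

def geomed(Z, iters=200):
    """Geometric median of the rows of Z via Weiszfeld iterations."""
    z = Z.mean(axis=0)
    for _ in range(iters):
        wts = 1.0 / np.maximum(np.linalg.norm(Z - z, axis=1), 1e-12)
        z = (wts[:, None] * Z).sum(axis=0) / wts.sum()
    return z

rng = np.random.default_rng(7)
d, W, B = 5, 10, 2
g_true = np.ones(d)                          # stand-in for the true gradient

def geomed_error(sigma, trials=200):
    """Average aggregation error under a fixed-strength attack, for honest
    messages with standard deviation sigma around the true gradient."""
    errs = []
    for _ in range(trials):
        honest = g_true + sigma * rng.normal(size=(W - B, d))
        byz = rng.normal(scale=50.0, size=(B, d))    # fixed-strength Byzantine messages
        errs.append(np.linalg.norm(geomed(np.vstack([honest, byz])) - g_true))
    return float(np.mean(errs))

print(geomed_error(1.0))    # large honest variance: larger aggregation error
print(geomed_error(0.01))   # small honest variance: the same attack barely moves the geomed
```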
IV-B Convergence of Byrd-SAGA and comparison with Byzantine attack resilient SGD
Here, we establish the convergence of Byrd-SAGA, and theoretically justify that, by reducing the impact of inner variation, Byrd-SAGA enjoys superior robustness to Byzantine attacks. We begin with several needed assumptions on the local functions $f_{w,j}$.
Assumption 1.
(Strong convexity and Lipschitz continuity of gradients) Each function $f_{w,j}$ is $\mu$-strongly convex and has $L$-Lipschitz continuous gradients, which amounts to requiring that for any $x, y \in \mathbb{R}^d$, it holds that
$$f_{w,j}(y) \ge f_{w,j}(x) + \langle \nabla f_{w,j}(x), y - x \rangle + \frac{\mu}{2} \|y - x\|^2 \tag{14}$$
and
$$\|\nabla f_{w,j}(x) - \nabla f_{w,j}(y)\| \le L \|x - y\|. \tag{15}$$
Assumption 2.
(Bounded outer variation) For any $x \in \mathbb{R}^d$, the variation of the aggregated gradients at the honest workers with respect to the overall gradient is upper-bounded as
$$\frac{1}{W-B} \sum_{w \in \mathcal{W} \setminus \mathcal{B}} \|\nabla f_w(x) - \nabla f(x)\|^2 \le \delta^2. \tag{16}$$
Assumption 3.
(Bounded inner variation) For every honest worker $w \in \mathcal{W} \setminus \mathcal{B}$ and any $x \in \mathbb{R}^d$, the variation of its stochastic gradients with respect to its aggregated local gradient is upper-bounded as
$$\mathbb{E}\|\nabla f_{w, i_w}(x) - \nabla f_w(x)\|^2 \le \sigma^2. \tag{17}$$
Assumption 1 is standard in convex analysis. Assumptions 2 and 3 bound the variation of the gradients and of the stochastic gradients within the honest workers, respectively [33]. For instance, most existing Byzantine attack resilient SGD algorithms assume that the stochastic gradients at the honest workers are independent and identically distributed (i.i.d.) with finite variance, in which case the outer variation in Assumption 2 is proportional to $\sigma^2 / J$ and the inner variation in Assumption 3 is finite. In the analysis of Byzantine attack resilient SGD, both the outer and inner variations must be bounded. Interestingly, the inner variation will turn out not to impact Byrd-SAGA, and Assumption 3 will no longer be necessary in its analysis.
To simplify notation, we will henceforth use $\mathbb{E}$ to represent the expectation with respect to all random variables $\{i_w^k\}$.

The presence of the geometric median makes the analysis of Byrd-SAGA challenging. Specifically, for every honest worker $w \in \mathcal{W} \setminus \mathcal{B}$, the corrected stochastic gradient $g_w^k$ is an unbiased estimate of the local gradient $\nabla f_w(x^k)$, meaning
$$\mathbb{E}\left[g_w^k \mid x^k\right] = \nabla f_w(x^k). \tag{18}$$
Averaging (18) over all honest workers, we have
$$\mathbb{E}\Big[\frac{1}{W-B} \sum_{w \in \mathcal{W} \setminus \mathcal{B}} g_w^k \,\Big|\, x^k\Big] = \nabla f(x^k). \tag{19}$$
From (19), we observe that the mean of $\{g_w^k\}$ over all the honest workers is an unbiased estimate of $\nabla f(x^k)$. Nevertheless, the geometric median of $\{g_w^k\}$, even when taken only over the honest workers and computed exactly, is a biased estimate of $\nabla f(x^k)$. This is the main challenge in adapting the proof of SAGA to that of Byrd-SAGA.
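The unbiasedness in (18) can be checked directly: enumerating the uniformly chosen index in the SAGA correction shows that the table terms cancel in expectation (least-squares stand-in loss and illustrative sizes).

```python
import numpy as np

rng = np.random.default_rng(8)
J, d = 20, 3
A = rng.normal(size=(J, d))
b = rng.normal(size=J)

grad = lambda j, x: (A[j] @ x - b[j]) * A[j]                  # per-sample gradient
full_grad = lambda x: np.mean([grad(j, x) for j in range(J)], axis=0)

x = rng.normal(size=d)
phi = rng.normal(size=(J, d))                                 # arbitrary stored past iterates
table = np.array([(A[j] @ phi[j] - b[j]) * A[j] for j in range(J)])  # stored gradients
table_avg = table.mean(axis=0)

# E_j[ grad(j, x) - table[j] + table_avg ] over a uniform index j:
expectation = np.mean([grad(j, x) - table[j] + table_avg for j in range(J)], axis=0)
print(np.allclose(expectation, full_grad(x)))   # prints True: the corrected gradient is unbiased
```

The cancellation holds for any stored table, which is what makes the SAGA correction safe: it changes the variance, never the mean.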
The following theorem asserts that Byrd-SAGA converges to a neighborhood of the optimal solution at a linear rate, with the asymptotic learning error determined by the number of Byzantine attackers.
Theorem 1.
In (20), the constant of the convergence rate is close to 1 when $J$ (the number of data samples at each worker) and the condition number of the functions are large. Observe that this constant is monotonically increasing in the fraction of Byzantine attackers; hence, (20) shows that Byrd-SAGA converges more slowly as the number of Byzantine attackers grows. Correspondingly, the theoretical upper bound on the step size is small when $J$ and the condition number are large. The asymptotic learning error in (22) is also monotonically increasing in $\alpha$ (and hence in the number of Byzantine attackers).
To demonstrate the superior robustness of Byrd-SAGA, we also establish the convergence of Byzantine attack resilient SGD with constant step size as a benchmark. As in Theorem 1, the convergence of Byzantine attack resilient SGD is in the mean-square error sense. This is different from [10], where convergence is asserted in the high-probability sense.
Theorem 2.
Let us ignore the approximation error in computing the geometric median by setting $\epsilon = 0$, and compare the two asymptotic learning errors. Observe that the asymptotic learning error of Byzantine attack resilient SGD is proportional to the sum of the inner and outer variations. When all honest workers hold the same data sample, both variations are zero, and the asymptotic learning error vanishes because the geometric median aggregation takes effect and attains the true gradient. However, when each honest worker has the same set of multiple distinct data samples, the outer variation is zero but the inner variation is not, and the asymptotic learning error of Byzantine attack resilient SGD can be large. In contrast, Byrd-SAGA effectively eliminates the impact of the inner variation, and is thus able to achieve a smaller learning error.
V Numerical Experiments
Here we present numerical experiments on convex and nonconvex learning problems; the code is available at https://github.com/MrFive5555/Byrd-SAGA. For each problem, we evenly distribute the dataset across the honest workers unless indicated otherwise. To account for malicious attacks, we additionally launch Byzantine workers. We test the performance of the proposed Byrd-SAGA under three typical Byzantine attacks: Gaussian, sign-flipping, and zero-gradient attacks [15, 34]. For a Gaussian attack, a Byzantine attacker draws its message from a Gaussian distribution. For a sign-flipping attack, a Byzantine attacker sets its message to a negatively scaled version of the true message. And for a zero-gradient attack, a Byzantine attacker sends a message chosen so that all the messages at the master node sum up to zero. We use the algorithm in [32] to obtain the $\epsilon$-approximate geometric median.

V-A $\ell_2$-regularized logistic regression
Consider the $\ell_2$-regularized logistic regression cost, where each summand $f_{w,j}$ is given by
$$f_{w,j}(x) = \ln\left(1 + \exp\left(-b_{w,j} \, a_{w,j}^\top x\right)\right) + \frac{\lambda}{2} \|x\|^2,$$
with $a_{w,j}$ being the feature vector, $b_{w,j} \in \{-1, +1\}$ the label, and $\lambda$ a regularization constant. We use the IJCNN1 and COVTYPE datasets (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets). IJCNN1 contains 49,990 training data samples of dimension 22. COVTYPE contains 581,012 training data samples of dimension 54.
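For reference, the summand above and its gradient can be sketched as follows, with a finite-difference check of the gradient; the value of λ and the data are illustrative.

```python
import numpy as np

lam = 0.01                                   # regularization weight (illustrative)

def loss(x, a, b):
    """ℓ2-regularized logistic loss for one sample (a, b), b ∈ {−1, +1}."""
    return np.log1p(np.exp(-b * (a @ x))) + 0.5 * lam * (x @ x)

def grad(x, a, b):
    s = 1.0 / (1.0 + np.exp(b * (a @ x)))    # sigmoid of the negative margin
    return -b * s * a + lam * x

# Finite-difference check of the gradient
rng = np.random.default_rng(10)
a = rng.normal(size=5)
b = 1.0
x = rng.normal(size=5)
eps = 1e-6
num = np.array([(loss(x + eps * e, a, b) - loss(x - eps * e, a, b)) / (2 * eps)
                for e in np.eye(5)])
print(np.allclose(num, grad(x, a, b), atol=1e-5))   # prints True
```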
We first compare SGD, mini-batch (B)SGD, and SAGA, using the mean and geometric median aggregation rules. Compared to SGD, BSGD enjoys smaller stochastic gradient noise, but incurs a higher computational cost. In comparison, SAGA also reduces the stochastic gradient noise, but its computational cost is of the same order as that of SGD. For each algorithm, we adopt a constant step size, tuned to achieve the best optimality gap in the Byzantine-free scenario. The performance of these algorithms on the IJCNN1 and COVTYPE datasets is depicted in Fig. 3 and Fig. 4, respectively. Under Byzantine attacks, all three algorithms using mean aggregation fail. Among the three using geometric median aggregation, Byrd-SAGA markedly outperforms the other two, while BSGD is better than SGD. This demonstrates the importance of variance reduction in handling Byzantine attacks. Regarding the variance of the honest messages on the IJCNN1 dataset in particular, Byrd-SAGA attains the smallest, followed by Byzantine attack resilient BSGD and then Byzantine attack resilient SGD. For the COVTYPE dataset, Byrd-SAGA and Byzantine attack resilient BSGD have the same order of variance with respect to the honest messages. In this case, Byrd-SAGA achieves a similar optimality gap to Byzantine attack resilient BSGD, but converges faster because it is able to use a larger step size.
Theorem 1 establishes that when the outer variation is zero, the asymptotic learning error of Byrd-SAGA is zero, no matter how large the inner variation is. In contrast, according to Theorem 2, the asymptotic learning error of Byzantine attack resilient SGD remains proportional to the inner variation. To validate these theoretical results, we conducted a second set of numerical experiments, in which every honest worker holds the whole IJCNN1 dataset. Therefore, the outer variation vanishes while the inner variation remains the same as in the first set of experiments. We compare SGD, BSGD, and SAGA, all using the geometric median aggregation rule. The results depicted in Fig. 5 corroborate the theoretical findings: the asymptotic learning error of Byrd-SAGA vanishes, while those of Byzantine attack resilient SGD and BSGD remain the same as those shown in Fig. 3.
In the third set of numerical experiments, we compare the use of different aggregation rules in distributed SAGA: mean, geometric median, median, and Krum. As shown in Fig. 6, distributed SAGA using mean aggregation is the best in terms of the optimality gap when there are no Byzantine attacks. However, it fails under all kinds of attacks. With Gaussian attacks, Byrd-SAGA using geometric median achieves the best performance. With sign-flipping and zero-gradient attacks, Byrd-SAGA using Krum is the best, while that using geometric median also performs well. Note that Krum has to know the exact number of Byzantine attackers in advance, while geometric median and median do not need this prior knowledge.
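The three attack models used throughout this section can be sketched as follows; the scale parameters are illustrative assumptions, since the exact values are tuned per experiment.

```python
import numpy as np

rng = np.random.default_rng(9)

def gaussian_attack(honest_msgs, scale=30.0):
    """Draw the malicious message from a Gaussian centered at the honest mean
    (center and scale are illustrative choices)."""
    mu = honest_msgs.mean(axis=0)
    return mu + scale * rng.normal(size=mu.shape)

def sign_flip_attack(honest_msgs, magnitude=3.0):
    """Send the negated, scaled honest mean."""
    return -magnitude * honest_msgs.mean(axis=0)

def zero_gradient_attack(honest_msgs, n_byz):
    """Each of the n_byz attackers sends the same vector so that all
    messages at the master node sum to zero."""
    return -honest_msgs.sum(axis=0) / n_byz

honest = rng.normal(loc=1.0, size=(8, 4))
byz = zero_gradient_attack(honest, n_byz=2)
print(np.allclose(honest.sum(axis=0) + 2 * byz, 0.0))   # prints True
```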
V-B Neural network training
Here we test training a neural network with one hidden layer of "tanh"-activated neurons, for multi-class classification on the MNIST dataset (http://yann.lecun.com/exdb/mnist), which comprises 60,000 training samples, each of dimension 784. We compare SGD, BSGD, and SAGA, each with a constant step size. We run the algorithms for a fixed number of iterations, and report the final accuracy in Table 1. With mean aggregation, all algorithms yield low accuracy in the presence of Byzantine attacks. With the help of geometric median aggregation, BSGD and SAGA are both robust and outperform SGD. Note that Byrd-SAGA exhibits a much lower per-iteration computational cost relative to Byzantine attack resilient BSGD.

| attack | algorithm | mean acc (%) | geomed acc (%) |
|---|---|---|---|
| without | SGD | 97.0 | 92.3 |
|  | BSGD | 98.6 | 98.0 |
|  | SAGA | 96.5 | 96.3 |
| Gaussian | SGD | 36.3 | 92.5 |
|  | BSGD | 36.3 | 98.0 |
|  | SAGA | 14.5 | 96.4 |
| sign-flipping | SGD | 0.11 | 0.03 |
|  | BSGD | 0.16 | 90.3 |
|  | SAGA | 0.12 | 86.4 |
| zero-gradient | SGD | 9.94 | 26.2 |
|  | BSGD | 9.89 | 81.5 |
|  | SAGA | 9.88 | 92.4 |
VI Conclusions
The present paper developed a novel Byzantine attack resilient distributed (Byrd-) SAGA approach to federated finite-sum optimization in the presence of Byzantine attacks. As in SAGA, Byrd-SAGA corrects the stochastic gradients through variance reduction: per iteration, the distributed workers compute their corrected stochastic gradients before uploading them to the master node. Different from SAGA, though, the master node in Byrd-SAGA aggregates the received messages using the geometric median rather than the mean. This robust aggregation markedly enhances the resilience of Byrd-SAGA in the presence of Byzantine attacks. It was established that Byrd-SAGA converges linearly to a neighborhood of the optimal solution, with the asymptotic learning error determined solely by the number of Byzantine workers.
References
- [1] R. Agrawal and R. Srikant, “Privacy-preserving data mining,” Proceedings of SIGMOD, Portland, Oregon, USA, May 2000.
- [2] J. Duchi, M. J. Wainwright, and M. I. Jordan, “Local privacy and minimax bounds: Sharp rates for probability estimation,” Proceedings of NIPS, Stateline, Nevada, USA, Dec. 2013.
- [3] L. Zhou, K. Yeh, G. Hancke, Z. Liu, and C. Su, “Security and privacy for the industrial Internet of Things: An overview of approaches to safeguard endpoints,” IEEE Signal Processing Magazine, vol. 35, no. 5, pp. 76–87, Sep. 2018.
- [4] J. Konecny, H. B. McMahan, D. Ramage, and P. Richtarik, “Federated optimization: Distributed machine learning for on-device intelligence,” arXiv Preprint arXiv:1610.02527, Oct. 2016.
- [5] A. Vempaty, L. Tong, and P. K. Varshney, “Distributed inference with Byzantine data: State-of-the-art review on data falsification attacks,” IEEE Signal Processing Magazine, vol. 30, no. 5, pp. 65–75, Aug. 2013.
- [6] Y. Chen, S. Kar, and J. M. F. Moura, “The Internet of Things: Secure distributed inference,” IEEE Signal Processing Magazine, vol. 35, no. 5, pp. 64–75, Sep. 2018.
- [7] Z. Yang, A. Gang, and W. U. Bajwa, “Adversary-resilient inference and machine learning: From distributed to decentralized,” arXiv Preprint arXiv:1908.08649, Aug. 2019.
- [8] L. Lamport, R. Shostak, and M. Pease, “The Byzantine generals problem,” ACM Transactions on Programming Languages and Systems, vol. 4, no. 3, pp. 382–401, Jul. 1982.
- [9] S. Minsker, “Geometric median and robust estimation in Banach spaces,” Bernoulli, vol. 21, no. 4, pp. 2308-2335, Nov. 2015.
- [10] Y. Chen, L. Su, and J. Xu, “Distributed statistical machine learning in adversarial settings: Byzantine gradient descent,” Proceedings of SIGMETRICS, Phoenix, Arizona, USA, Jun. 2019.
- [11] C. Xie, O. Koyejo, and I. Gupta, “Generalized Byzantine-tolerant SGD,” arXiv Preprint arXiv:1802.10116, Feb. 2018.
- [12] D. Yin, Y. Chen, K. Ramchandran, and P. Bartlett, “Byzantine-robust distributed learning: Towards optimal statistical rates,” Proceedings of ICML, Stockholm, Sweden, Jul. 2018.
- [13] L. Su and J. Xu, “Securing distributed machine learning in high dimensions,” arXiv Preprint arXiv:1804.10140, Apr. 2018.
- [14] P. Blanchard, E. M. El Mhamdi, R. Guerraoui, and J. Stainer, “Machine learning with adversaries: Byzantine tolerant gradient descent,” Proceedings of NIPS, Long Beach, California, USA, Dec. 2017.
- [15] L. Li, W. Xu, T. Chen, G. B. Giannakis, and Q. Ling, “RSA: Byzantine-robust stochastic aggregation methods for distributed learning from heterogeneous datasets,” Proceedings of AAAI, Honolulu, Hawaii, USA, Jan. 2019.
- [16] J. Feng, H. Xu, and S. Mannor, “Distributed robust learning,” arXiv Preprint arXiv:1409.5937, Sep. 2014.
- [17] D. Yin, Y. Chen, K. Ramchandran, and P. Bartlett, “Defending against saddle point attack in Byzantine-robust distributed learning,” arXiv Preprint arXiv:1806.05358, Jun. 2018.
- [18] L. Chen, H. Wang, Z. Charles, and D. Papailiopoulos, “DRACO: Byzantine-resilient distributed training via redundant gradients,” arXiv Preprint arXiv:1803.09877, Mar. 2018.
- [19] S. Rajput, H. Wang, Z. Charles, and D. Papailiopoulos, “DETOX: A redundancy-based framework for faster and more robust gradient aggregation,” arXiv Preprint arXiv:1907.12205, Jul. 2019.
- [20] C. Xie, O. Koyejo, and I. Gupta, “Fall of empires: Breaking Byzantine-tolerant SGD by inner product manipulation,” arXiv Preprint arXiv:1903.03936, Mar. 2019.
- [21] P. Goyal, P. Dollar, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He, “Accurate, large minibatch SGD: Training imagenet in 1 hour,” arXiv Preprint arXiv:1706.02677, Jun. 2017.
- [22] M. W. Schmidt, N. Le Roux, and F. R. Bach, “Minimizing finite sums with the stochastic average gradient,” Mathematical Programming, vol. 162, no. 1–2, pp. 83–112, Mar. 2017.
- [23] R. Johnson and T. Zhang, “Accelerating stochastic gradient descent using predictive variance reduction,” Proceedings of NIPS, Stateline, Nevada, USA, Dec. 2013.
- [24] A. Defazio, F. R. Bach, and S. Lacoste-Julien, “SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives,” Proceedings of NIPS, Montreal, Canada, Dec. 2014.
- [25] S. Shalev-Shwartz and T. Zhang, “Stochastic dual coordinate ascent methods for regularized loss minimization,” Journal of Machine Learning Research, vol. 14, no. 2, pp. 567–599, Feb. 2013.
- [26] L. M. Nguyen, J. Liu, K. Scheinberg, and M. Takac, “SARAH: A novel method for machine learning problems using stochastic recursive gradient,” Proceedings of ICML, Sydney, Australia, Aug. 2017.
- [27] Z. Allen-Zhu, “Katyusha: The first direct acceleration of stochastic gradient methods,” Journal of Machine Learning Research, vol. 18, no. 1, pp. 8194–8244, Jun. 2017.
- [28] C. Calauzenes and N. Le Roux, “Distributed SAGA: Maintaining linear convergence rate with limited communication,” arXiv Preprint arXiv:1705.10405, May 2017.
- [29] S. De and T. Goldstein, “Efficient distributed SGD with variance reduction,” Proceedings of ICDM, Barcelona, Spain, Dec. 2016.
- [30] S. J. Reddi, A. Hefny, S. Sra, B. Poczos, and A. J. Smola, “On variance reduction in stochastic gradient descent and its asynchronous variants,” Proceedings of NIPS, Barcelona, Spain, Dec. 2015.
- [31] L. Bottou, “Large-scale machine learning with stochastic gradient descent,” Proceedings of COMPSTAT, Paris, France, Aug. 2010.
- [32] E. Weiszfeld and F. Plastria, “On the point for which the sum of the distances to given points is minimum,” Annals of Operations Research, vol. 167, no. 1, pp. 7–41, Mar. 2009.
- [33] H. Tang, X. Lian, M. Yan, C. Zhang, and J. Liu, “D2: Decentralized training over decentralized data,” Proceedings of ICML, Stockholm, Sweden, Jul. 2018.
- [34] F. Lin, Q. Ling, and Z. Xiong, “Byzantine-resilient distributed large-scale matrix completion,” Proceedings of ICASSP, Brighton, UK, May 2019.
- [35] W. Ben-Ameur, P. Bianchi, and J. Jakubowicz, “Robust distributed consensus using total variation,” IEEE Transactions on Automatic Control, vol. 61, no. 6, pp. 1550–1564, Jun. 2016.
- [36] Z. Yang and W. U. Bajwa, “BRIDGE: Byzantine-resilient decentralized gradient descent,” arXiv Preprint arXiv:1908.08098, Aug. 2019.
- [37] Y. Nesterov, Introductory Lectures on Convex Optimization: A Basic Course, Springer, 2013.
Appendix A Proof of Lemma 1
The proof of Lemma 1 relies on the following lemma.
Lemma 2.
Let $\{z_w : w \in \mathcal{W}\}$ be a subset of random vectors distributed in a normed vector space, and let $z$ be any reference vector. If $B \le \alpha W$ with $\alpha \in [0, 1/2)$, then it holds that
$$\mathbb{E}\left\|\mathrm{geomed}\{z_w : w \in \mathcal{W}\} - z\right\|^2 \le \frac{C_\alpha^2}{W-B} \sum_{w \in \mathcal{W} \setminus \mathcal{B}} \mathbb{E}\|z_w - z\|^2, \tag{26}$$
where $C_\alpha = \frac{2-2\alpha}{1-2\alpha}$.
Proof.
Denote $z_* = \mathrm{geomed}\{z_w : w \in \mathcal{W}\}$. By the triangle inequality, for every honest worker $w \in \mathcal{W} \setminus \mathcal{B}$ we have $\|z_* - z\| \le \|z_* - z_w\| + \|z_w - z\|$, while for every $w \in \mathcal{B}$ we have $\|z_* - z_w\| \ge \|z_w - z\| - \|z_* - z\|$. Then, summing the former over all $w \in \mathcal{W} \setminus \mathcal{B}$ yields
$$(W-B)\|z_* - z\| \le \sum_{w \in \mathcal{W} \setminus \mathcal{B}} \|z_* - z_w\| + \sum_{w \in \mathcal{W} \setminus \mathcal{B}} \|z_w - z\|. \tag{27}$$
According to the definition of the geometric median, it holds that
$$\sum_{w \in \mathcal{W}} \|z_* - z_w\| \le \sum_{w \in \mathcal{W}} \|z - z_w\|. \tag{28}$$
Combining the two inequalities, we arrive at
$$\|z_* - z\| \le \frac{2}{W-2B} \sum_{w \in \mathcal{W} \setminus \mathcal{B}} \|z_w - z\| = \frac{C_\alpha}{W-B} \sum_{w \in \mathcal{W} \setminus \mathcal{B}} \|z_w - z\|, \tag{29}$$
and upon squaring both sides of the latter and applying the Cauchy–Schwarz inequality, we find
$$\|z_* - z\|^2 \le \frac{C_\alpha^2}{W-B} \sum_{w \in \mathcal{W} \setminus \mathcal{B}} \|z_w - z\|^2. \tag{30}$$
Taking expectations on both sides yields (26), which completes the proof. ∎
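The deterministic bound in (29) can be sanity-checked numerically: for random honest and Byzantine vectors, the geometric median satisfies $\|z_* - z\| \le \frac{2}{W-2B} \sum_{w \in \mathcal{W}\setminus\mathcal{B}} \|z_w - z\|$; the bound, the Weiszfeld implementation, and all scales below are stated as assumptions of this sketch.

```python
import numpy as np

def geomed(Z, iters=500):
    """Geometric median of the rows of Z via Weiszfeld iterations."""
    z = Z.mean(axis=0)
    for _ in range(iters):
        wts = 1.0 / np.maximum(np.linalg.norm(Z - z, axis=1), 1e-12)
        z = (wts[:, None] * Z).sum(axis=0) / wts.sum()
    return z

rng = np.random.default_rng(11)
W, B, d = 9, 2, 3
for _ in range(20):
    honest = rng.normal(size=(W - B, d))
    byz = rng.normal(scale=20.0, size=(B, d))      # arbitrary Byzantine vectors
    Z = np.vstack([honest, byz])
    z = rng.normal(size=d)                         # arbitrary reference point
    lhs = np.linalg.norm(geomed(Z) - z)
    rhs = 2.0 / (W - 2 * B) * np.linalg.norm(honest - z, axis=1).sum()
    assert lhs <= rhs * 1.01                       # small slack for the approximate median
print("bound holds on all trials")
```

Note that the right-hand side involves only the honest vectors: however large the Byzantine vectors are made, the bound is unchanged, which is the essence of the concentration property.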
Appendix B Lemma 3 and Its Proof
Since computing the exact geometric median is difficult, we consider the $\epsilon$-approximate geometric median in this paper. The following lemma is the $\epsilon$-approximate counterpart of Lemma 2.
Lemma 3.
Let $\{z_w : w \in \mathcal{W}\}$ be a subset of random vectors distributed in a normed vector space, and let $z$ be any reference vector. If $B \le \alpha W$ with $\alpha \in [0, 1/2)$, it holds that
(33)
where $z_\epsilon$ is an $\epsilon$-approximate geometric median of $\{z_w : w \in \mathcal{W}\}$.
Appendix C Lemma 4 and its proof
As indicated in Section IV-B, the main challenge in the analysis of Byrd-SAGA is that the geometric median of $\{m_w^k\}$ is a biased estimate of the gradient $\nabla f(x^k)$. To handle the bias, the following lemma characterizes the error between an $\epsilon$-approximate geometric median of $\{m_w^k\}$ and $\nabla f(x^k)$ at each slot $k$.
Lemma 4.
Proof.
We begin by upper-bounding the mean-square error $\mathbb{E}\|m_w^k - \nabla f_w(x^k)\|^2$ of an honest worker's message. Using the definition of $m_w^k$ in (10), we have for any $w \in \mathcal{W} \setminus \mathcal{B}$ that
(42)
where the second equality is due to the variance decomposition, while the last inequality comes from Assumption 1.
To further upper-bound the mean-square error, we have that
(43)
Appendix D Lemma 5 and its proof
In Lemma 4, the upper bound contains a time-varying term. The following lemma characterizes the evolution of this term.
Lemma 5.
Proof.
For the expectation of this term, we have that