Federated Variance-Reduced Stochastic Gradient Descent with Robustness to Byzantine Attacks

12/29/2019
by Zhaoxian Wu, et al.
Sun Yat-sen University

This paper deals with distributed finite-sum optimization for learning over networks in the presence of malicious Byzantine attacks. To cope with such attacks, most resilient approaches so far combine stochastic gradient descent (SGD) with different robust aggregation rules. However, the sizeable SGD-induced stochastic gradient noise makes it challenging to distinguish malicious messages sent by the Byzantine attackers from noisy stochastic gradients sent by the 'honest' workers. This motivates us to reduce the variance of stochastic gradients as a means of robustifying SGD in the presence of Byzantine attacks. To this end, the present work puts forth a Byzantine attack resilient distributed (Byrd-) SAGA approach for learning tasks involving finite-sum optimization over networks. Rather than the mean employed by distributed SAGA, the novel Byrd-SAGA relies on the geometric median to aggregate the corrected stochastic gradients sent by the workers. When less than half of the workers are Byzantine attackers, the robustness of the geometric median to outliers enables Byrd-SAGA to attain provably linear convergence to a neighborhood of the optimal solution, with the asymptotic learning error determined by the number of Byzantine workers. Numerical tests corroborate the robustness to various Byzantine attacks, as well as the merits of Byrd-SAGA over Byzantine attack resilient distributed SGD.


I Introduction

With the rapid development of information technologies, the volume of distributed data increases explosively. Every day, numerous distributed devices, including sensors, cellphones, computers, and vehicles, generate huge amounts of data, which are often forwarded to datacenters for further processing and learning tasks. However, collecting data from distributed devices and storing them in datacenters raise major privacy concerns [1, 2, 3]. Accounting for these concerns, federated learning has been advocated to provide a privacy-preserving, decentralized data processing and machine learning framework [4]. Data in federated learning are kept private, and local computations are carried out at the distributed devices. Updates of local variables (such as stochastic gradients, corrected stochastic gradients, and model parameters) are computed using per-device private data, while the datacenter aggregates the local variables and disseminates the aggregated result to the distributed devices.

Even though privacy is preserved, the distributed nature of federated learning makes it vulnerable to errors and adversarial attacks. Devices can become unreliable in either computing or communicating, or they can even be hacked by adversaries. As a result, compromised devices may send malicious messages to the datacenter, thus misleading the learning process [5, 6, 7]. We will henceforth focus on the class of malicious attacks known as Byzantine attacks [8]. Robustifying federated learning against Byzantine attacks is of paramount importance for secure processing and learning.

To cope with Byzantine attacks in federated learning, several robust aggregation rules have been developed in recent years, mainly towards improving the distributed stochastic gradient descent (SGD) solver of the underlying optimization task. Through aggregating stochastic gradients with the geometric median [9, 10], median [11], trimmed mean [12], or iterative filtering [13], stochastic algorithms have been able to tolerate a small number of devices attacked by Byzantine adversaries. Other aggregation rules include Krum [14], which selects the stochastic gradient having the minimal cumulative squared distance from a given number of nearest stochastic gradients, and RSA [15], which aggregates models rather than stochastic gradients by penalizing the differences between the local and global model parameters. Related works also include adversarial learning in distributed principal component analysis [16], escaping from saddle points in non-convex distributed learning under Byzantine attacks [17], and leveraging redundant gradients to improve robustness [18, 19].

Although robust SGD iterates can ensure convergence to a neighborhood of the attack-free optimal solution, this neighborhood size can be large when Byzantine attacks are carefully crafted [20]. Essentially, SGD suffers from the sizeable approximation error (noise) associated with stochastic gradients. This leads to the challenge of distinguishing malicious messages sent by Byzantine attackers from the noisy stochastic gradients sent by ‘honest’ devices.

In the face of this challenge, we pose the following question: Is it possible to better distinguish the malicious messages from the stochastic gradients by reducing the stochastic gradient-induced noise? Our answer will turn out to be in the affirmative. Intuitively, if the stochastic gradient noise is small, the malicious messages should be easy to identify; see also the illustrative example in Section II-D. This intuition suggests combining variance reduction techniques with robust aggregation rules to handle Byzantine attacks in federated learning.

Existing variance reduction techniques in stochastic optimization include mini-batching [21], as well as SAG [22], SVRG [23], SAGA [24], SDCA [25], SARAH [26], and Katyusha [27], to list a few. Among these, we are particularly interested in SAGA, which has been proven effective in finite-sum optimization. SAGA can also be implemented in a distributed manner [28, 29, 30], and hence it fits well the federated learning applications, where each device deals with a finite number of data samples.

Our proposed novel Byzantine attack resilient distributed (Byrd-) SAGA combines SAGA’s variance reduction with robust aggregation to deal with the malicious attacks in federated finite-sum optimization setups. Instead of the mean employed by distributed SAGA, the datacenter in Byrd-SAGA relies on the geometric median to aggregate the corrected stochastic gradients sent by distributed devices. Through reducing the stochastic gradient-induced noise, Byrd-SAGA turns out to outperform the Byzantine attack resilient distributed SGD. When less than half of the workers are Byzantine attackers, the robustness of geometric median to outliers enables Byrd-SAGA to achieve provably linear convergence to a neighborhood of the optimal solution, and the asymptotic learning error is solely determined by the number of Byzantine workers. Numerical tests demonstrate the robustness of Byrd-SAGA to various Byzantine attacks.

II Problem Statement

We start this section by specifying the federated finite-sum optimization problem in the presence of Byzantine attacks. We then elaborate on the limitations of Byzantine attack resilient distributed SGD algorithms, which motivate our subsequent development of Byrd-SAGA.

II-A Federated finite-sum optimization in the presence of Byzantine attacks

Consider a network with one master node (the datacenter) and a number of workers (devices), among which some workers are Byzantine attackers whose identities are unknown to the master node. The data samples are evenly distributed across the honest workers, so that each honest worker holds the same number of local data samples; each per-sample loss quantifies the fit of one local data sample to the model parameter. We are interested in the finite-sum optimization problem

(1)

where the local cost of each honest worker is the average of its per-sample losses:

(2)

The main challenge in solving (1) is that the Byzantine attackers can collude and send arbitrary malicious messages to the master node so as to bias the optimization process. We aspire to develop a robust distributed stochastic algorithm to address this issue. Intuitively, when a majority of workers are Byzantine attackers, it is difficult to obtain a reasonable approximate solution to (1). For this reason, we will assume throughout that fewer than half of the workers are Byzantine, and prove that the proposed Byzantine attack resilient algorithm is able to tolerate attacks from up to half of the workers.

II-B Sensitivity of distributed SGD to Byzantine attacks

When all workers are honest, a popular solver of (1) is SGD [31]. At each time slot (iteration), the master node broadcasts the current model to the workers. Upon receiving it, every worker uniformly at random chooses a local data sample, computes the corresponding stochastic gradient, and communicates it back to the master node. Upon collecting the stochastic gradients from all workers, the master node updates the model as

(3)

where the step size is non-negative. Note that distributed SGD can be extended to its mini-batch version, whereby each worker uniformly at random chooses a mini-batch of data samples per iteration and communicates the averaged stochastic gradient back to the master node.
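To make the update (3) concrete, the snippet below is a minimal NumPy sketch of one synchronous round of attack-free distributed SGD with mean aggregation; the least-squares per-sample loss, dimensions, and variable names are illustrative assumptions, not the paper's notation.

```python
import numpy as np

rng = np.random.default_rng(0)
num_workers, samples_per_worker, dim = 10, 50, 5
step_size = 0.02

# Synthetic local datasets: worker w holds pairs (A[w, j], b[w, j]) and the
# per-sample loss is the least-squares fit 0.5 * (a^T x - b)^2 (illustrative).
A = rng.normal(size=(num_workers, samples_per_worker, dim))
b = rng.normal(size=(num_workers, samples_per_worker))

def stochastic_gradient(x, w, j):
    """Gradient of the j-th sample loss on worker w at model x."""
    a = A[w, j]
    return (a @ x - b[w, j]) * a

x = np.zeros(dim)
for k in range(1000):
    # Each worker picks one local sample uniformly at random and
    # sends its stochastic gradient to the master node.
    messages = [stochastic_gradient(x, w, rng.integers(samples_per_worker))
                for w in range(num_workers)]
    # Attack-free master node: aggregate by the mean, as in update (3).
    x = x - step_size * np.mean(messages, axis=0)
```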

While the honest workers send true stochastic gradients to the master node, the Byzantine workers can send arbitrary malicious messages to the master node in order to perturb (bias) the optimization process. The message that a worker sends to the master node at a given slot is thus

(4)

where the message of a Byzantine worker can be an arbitrary vector. Then, the distributed SGD update (3) becomes

(5)

Even when only one Byzantine attacker is present, distributed SGD may fail. For instance, if a Byzantine attacker sends the negative of the sum of all the other messages, the aggregated update is zero and the iterate never moves. In practice, Byzantine attackers can send more sophisticated messages to fool the master node, and thus bias the optimization process.

II-C Byzantine attack resilient distributed SGD

Recent works often robustify the distributed SGD by incorporating robust aggregation rules when the master node receives messages from the workers. Here, we will adopt and analyze the geometric median, even though alternative robust aggregation rules are also viable [9, 10].

With $\{z_1, \ldots, z_n\}$ denoting a subset of vectors in a normed space, the geometric median of $\{z_1, \ldots, z_n\}$ is

$\mathrm{geomed}\{z_1, \ldots, z_n\} := \arg\min_{z} \sum_{i=1}^{n} \|z - z_i\|.$   (6)
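The geometric median in (6) has no closed form; a standard way to approximate it is the Weiszfeld fixed-point iteration sketched below. This is a generic scheme offered for illustration, not necessarily the specific solver of [32] used in the experiments.

```python
import numpy as np

def geometric_median(points, num_iters=100, tol=1e-8):
    """Weiszfeld iteration: approximate argmin_z sum_i ||z - z_i||."""
    points = np.asarray(points, dtype=float)
    z = points.mean(axis=0)               # initialize at the mean
    for _ in range(num_iters):
        dists = np.linalg.norm(points - z, axis=1)
        dists = np.maximum(dists, 1e-12)  # avoid division by zero
        weights = 1.0 / dists
        z_new = weights @ points / weights.sum()
        if np.linalg.norm(z_new - z) <= tol:
            break
        z = z_new
    return z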

Using (6), the distributed SGD in (5) can be modified to its Byzantine attack resilient form as

(7)

In essence, the geometric median chooses a reliable vector to represent the received messages through majority voting. When the number of Byzantine workers is less than half of the total, the geometric median approximates the mean of the messages sent by the honest workers reasonably well. This property enables the Byzantine attack resilient distributed SGD to converge to a neighborhood of the optimal solution [9, 10].

Fig. 1: Impact of stochastic gradient noise on geometric median-based robust aggregation. Blue dots denote stochastic gradients sent by the honest workers. Red dots denote malicious messages sent by the Byzantine workers. Plus signs denote the outputs of geometric median-based robust aggregation. Pentagrams denote the means of the stochastic gradients sent by the honest workers. The variance of the stochastic gradients from the honest workers is large in the left panel and small in the right panel.

II-D Impact of stochastic gradient noise on robust aggregation

In distributed SGD, the stochastic gradients evaluated by honest workers are noisy because of the randomness in choosing data samples. Due to the stochastic gradient noise, however, it is not always easy to distinguish the malicious messages from the stochastic gradients using robust aggregation rules alone, e.g., the geometric median. Several existing works have recognized this issue. With carefully crafted Byzantine attacks, outputs of several Byzantine attack resilient SGD algorithms can be far away from the optimal solution [20]. In [10] and [18], the workers are divided into several groups, with averages taken within groups and the geometric median obtained across groups. This approach leads to reduced variance and thus an enhanced ability to distinguish malicious messages. In [14], it is explicitly assumed that the ratio of the variance of stochastic gradients to the distance between the iterate and the optimal solution is upper-bounded.

Fig. 1 shows the impact of stochastic gradient noise on geometric median-based robust aggregation. When the stochastic gradients sent by honest workers have small variance, the gap between the true mean and the aggregated value is also small; that is, the same Byzantine attacks are less effective. We will quantify this statement in our analysis of Section IV-A.
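The following toy simulation, a sketch under assumed Gaussian gradient noise and a fixed assumed Byzantine message, reproduces the message of Fig. 1: shrinking the variance of the honest stochastic gradients shrinks the gap between their mean and the geometric median of all received messages, even though the Byzantine messages are unchanged. It reuses the `geometric_median` helper sketched in Section II-C.

```python
import numpy as np

rng = np.random.default_rng(1)
dim, num_honest, num_byz = 5, 15, 5
true_gradient = np.ones(dim)
malicious = 10.0 * np.ones(dim)          # fixed Byzantine message (assumed attack)

for sigma in (5.0, 0.1):                 # large vs. small honest-gradient noise
    honest = true_gradient + sigma * rng.normal(size=(num_honest, dim))
    received = np.vstack([honest, np.tile(malicious, (num_byz, 1))])
    agg = geometric_median(received)     # helper from the Section II-C sketch
    gap = np.linalg.norm(agg - honest.mean(axis=0))
    print(f"sigma={sigma:4.1f}  gap between geomed and honest mean = {gap:.3f}")
```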

Prompted by this observation, our key idea is to reduce the variance of stochastic gradients in order to enhance robustness to Byzantine attacks. In the Byzantine attack-free case, an effective approach to alleviating stochastic gradient noise in SGD is through variance reduction. By compensating for stochastic gradient noise, variance reduction techniques lead to faster convergence than SGD. For specificity, we will focus on SAGA, which reduces stochastic gradient noise for finite-sum optimization [24], and we will show how SAGA can also aid robust aggregation against Byzantine attacks.

III Algorithm Development

In this section, we first introduce distributed SAGA with mean aggregation. Then, we propose Byrd-SAGA, which replaces mean aggregation by geometric median-based robust aggregation.

III-A Distributed SAGA with mean aggregation

In distributed SAGA, each worker maintains a table of stochastic gradients for all of its local data samples [28, 29]. As in distributed SGD, at every slot the master node sends the current model to the workers, and every worker uniformly at random chooses a local data sample to compute the corresponding stochastic gradient. However, the worker does not send this stochastic gradient back to the master node directly. Instead, it corrects the stochastic gradient by first subtracting the previously stored stochastic gradient of the chosen data sample, and then adding the average of the stored stochastic gradients across its local data samples. The worker then sends this corrected stochastic gradient to the master node, and stores the fresh stochastic gradient in the table entry of the chosen data sample. After collecting the corrected stochastic gradients from all workers, the master node updates the model.

To better describe distributed SAGA, let

(8)

where, for each local data sample, the stored quantity is the stochastic gradient most recently evaluated at that sample by the end of the slot. The previously stored stochastic gradient of the chosen data sample is then subtracted from the fresh stochastic gradient, and the average of the stored stochastic gradients is added, yielding the corrected stochastic gradient of the worker at that slot. The model update of distributed SAGA is hence

(9)

where the step size is a positive constant.
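A minimal per-worker sketch of the SAGA correction just described is given below: the freshly sampled gradient is corrected by subtracting the stored gradient of the same sample and adding the running average of the stored table. The class layout and names are illustrative assumptions, not the paper's notation.

```python
import numpy as np

class SAGAWorker:
    """One honest worker maintaining a table of per-sample stochastic gradients."""

    def __init__(self, grad_fn, num_samples, dim, rng):
        self.grad_fn = grad_fn                      # grad_fn(x, j) -> gradient of sample j
        self.table = np.zeros((num_samples, dim))   # stored per-sample gradients
        self.table_mean = np.zeros(dim)             # running average of the table
        self.num_samples = num_samples
        self.rng = rng

    def corrected_gradient(self, x):
        j = self.rng.integers(self.num_samples)     # sample uniformly at random
        g = self.grad_fn(x, j)                      # fresh stochastic gradient
        corrected = g - self.table[j] + self.table_mean
        # Update the table and its running average with the fresh gradient.
        self.table_mean += (g - self.table[j]) / self.num_samples
        self.table[j] = g
        return corrected
```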

III-B Distributed SAGA with geometric median aggregation

Here, it is useful to recall that Byzantine workers may send malicious messages to the master node instead of the corrected stochastic gradients. To account for this, the message sent from a worker to the master node at a given slot is expressed as

(10)

where the message of a Byzantine worker can be an arbitrary vector. Similar to distributed SGD, distributed SAGA is also sensitive to Byzantine attacks. Our robust aggregation rule here is the geometric median. This leads to the proposed Byzantine attack resilient distributed (Byrd-) form of the SAGA update in (9), given by

(11)
Fig. 2: Illustration of Byzantine attack resilient distributed SAGA. For ease of illustration, the honest workers are indexed before the Byzantine attackers; in practice, the identities of the Byzantine attackers are unknown to the master node.

The proposed Byzantine attack resilient distributed SAGA, abbreviated as Byrd-SAGA, is listed step by step in Algorithm 1, and illustrated in Fig. 2. There are various implementations of distributed SAGA. For example, [29] proposed to store the tables of stochastic gradients at the master node. The workers then only need to upload the stochastic gradients and their indexes, while the master node performs the aggregation. This setup is also vulnerable to Byzantine attacks, since the Byzantine attackers may upload incorrect stochastic gradients; the proposed robust aggregation rule can also be applied therein.

Input: step size; number of workers; number of data samples on each honest worker
  Master node and honest workers initialize the model
  for all honest workers do
     for every local data sample do
        Initialize the stored gradient of that sample at the initial model
     end for
     Initialize the average of the stored gradients
     Send the average to the master node
  end for
  Master node updates the model using the received messages
  for every slot do
     Master node broadcasts the current model to all workers
     for all honest worker nodes do
        Sample a local data sample uniformly at random
        Compute the corrected stochastic gradient
        Send the corrected stochastic gradient to the master node
        Update the average of the stored gradients
        Store the fresh stochastic gradient of the sampled data sample
     end for
     Master node updates the model via the geometric median of the received messages
  end for
Algorithm 1 Byzantine Attack Resilient Distributed SAGA
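Putting the pieces together, the following compact simulation sketch mirrors Algorithm 1. It assumes the `SAGAWorker` class and `geometric_median` helper from the earlier sketches, an illustrative least-squares per-sample loss, and a crude Byzantine model in which attackers send a large constant vector; none of these choices are prescribed by the paper.

```python
import numpy as np

rng = np.random.default_rng(2)
num_workers, num_byz, samples_per_worker, dim = 10, 3, 50, 5
step_size = 0.02

A = rng.normal(size=(num_workers, samples_per_worker, dim))
b = rng.normal(size=(num_workers, samples_per_worker))

def make_grad_fn(w):
    # Per-sample gradient of the illustrative least-squares loss on worker w.
    return lambda x, j: (A[w, j] @ x - b[w, j]) * A[w, j]

# Gradient tables start at zero, a simplification of Algorithm 1's initialization.
honest = [SAGAWorker(make_grad_fn(w), samples_per_worker, dim, rng)
          for w in range(num_workers - num_byz)]

x = np.zeros(dim)
for k in range(1000):
    messages = [worker.corrected_gradient(x) for worker in honest]
    messages += [100.0 * np.ones(dim)] * num_byz        # assumed Byzantine messages
    # Byrd-SAGA update (11): geometric median instead of the mean.
    x = x - step_size * geometric_median(np.asarray(messages))
```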

Robust aggregation rules other than the geometric median are available, including the coordinate-wise median [11], Krum [14], marginal trimmed mean [12], and iterative filtering [13]. The median, for instance, outputs the element-wise median of the received messages, while Krum outputs the received message with the minimal cumulative squared distance to a given number of its nearest neighbors among the received messages. Note that Krum needs to know the number of Byzantine attackers in advance, while the geometric median and the median do not. In addition, other variance reduction techniques, such as mini-batching [21], SAG [22], SVRG [23], SAGA [24], SDCA [25], SARAH [26], and Katyusha [27], are also available to alleviate the stochastic gradient noise. Here we opted for the combination of the geometric median and SAGA. Extending the current work to other robust aggregation rules and variance reduction techniques is part of our future research agenda.
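For comparison, below are minimal sketches of two of the alternative aggregation rules mentioned above, the coordinate-wise median and Krum. The Krum score follows its commonly stated definition (sum of squared distances to the closest neighbors); the details are assumptions for illustration rather than exact reproductions of [11] and [14].

```python
import numpy as np

def coordinate_median(messages):
    """Element-wise (coordinate-wise) median of the received messages."""
    return np.median(np.asarray(messages), axis=0)

def krum(messages, num_byzantine):
    """Return the message with the smallest Krum score.

    The score of message i is the sum of its squared distances to its
    closest (num_messages - num_byzantine - 2) neighbors.
    """
    msgs = np.asarray(messages)
    n = len(msgs)
    num_neighbors = n - num_byzantine - 2
    dists = np.linalg.norm(msgs[:, None, :] - msgs[None, :, :], axis=2) ** 2
    scores = []
    for i in range(n):
        others = np.delete(dists[i], i)          # distances to the other messages
        scores.append(np.sort(others)[:num_neighbors].sum())
    return msgs[int(np.argmin(scores))]
```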

Remark 1.

Computing the geometric median amounts to solving an optimization problem of the form (6). Since obtaining the exact geometric median is costly, one is typically satisfied with an ε-approximate value [32]. We say that $z_\epsilon$ is an ε-approximate geometric median of $\{z_1, \ldots, z_n\}$ if

$\sum_{i=1}^{n} \|z_\epsilon - z_i\| \le (1+\epsilon) \min_{z} \sum_{i=1}^{n} \|z - z_i\|.$   (12)

We shall show that the ε-approximation only slightly affects the convergence of Byrd-SAGA.
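As a sketch of the criterion in (12), the helper below checks whether a candidate point is an ε-approximate geometric median by comparing its summed distance to the messages against a (1 + ε) multiple of a reference cost; here the reference is itself computed with the Weiszfeld helper from Section II-C, which is an approximation and an assumption of this sketch.

```python
import numpy as np

def is_eps_approx_geomed(candidate, points, eps, reference=None):
    """Check the (1 + eps)-factor condition in (12) for a candidate point."""
    points = np.asarray(points)
    if reference is None:
        reference = geometric_median(points)   # Weiszfeld helper from Section II-C
    cand_cost = np.linalg.norm(points - candidate, axis=1).sum()
    ref_cost = np.linalg.norm(points - reference, axis=1).sum()
    return cand_cost <= (1.0 + eps) * ref_cost
```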

IV Theoretical Analysis

In this section, we theoretically justify the intuition that reducing stochastic gradient noise helps identify malicious messages in robust aggregation, specifically under the geometric median considered in this paper. We prove that Byrd-SAGA converges to a neighborhood of the optimal solution at a linear rate under Byzantine attacks, with the asymptotic learning error determined by the number of Byzantine attackers. Due to the page limit, proofs are delegated to the full version of this paper, available at https://github.com/MrFive5555/Byrd-SAGA/blob/master/Full.pdf.

IV-A Importance of reducing stochastic gradient noise

Here, we quantify the role of stochastic gradient noise in the geometric median aggregation. Towards this objective, consider the set of messages sent by all workers and, within it, the subset of malicious messages sent by the Byzantine attackers. Further, let the true gradient be given by the ensemble average of the stochastic gradients. Using these definitions, the ensuing lemma bounds the mean-square error of the geometric median relative to the true gradient.

Lemma 1.

(Concentration property) Let be a subset of random vectors distributed in a normed vector space. If and , then it holds that

(13)

where

while , and .

The left-hand side of (13) is the mean-square error of the geometric median relative to the true gradient, while the right-hand side is the sum of two terms. The first is determined by the variances of the local stochastic gradients sent by the honest workers (inner variation), while the second term is determined by the variations of the local gradients at the honest workers with respect to the true gradient (outer variation). In the Byzantine attack resilient SGD, the upper bound can be large due to the large stochastic gradient noise of SGD. Through reducing the stochastic gradient noise in terms of either inner variation or outer variation, we are able to attain improved accuracy under malicious attacks.

IV-B Convergence of Byrd-SAGA and comparison with Byzantine attack resilient SGD

Here, we establish convergence of Byrd-SAGA, and theoretically justify that, by reducing the impact of inner variation, Byrd-SAGA enjoys superior robustness to Byzantine attacks. We begin with several assumptions on the local cost functions.

Assumption 1.

(Strong convexity and Lipschitz continuity of gradients) Each cost function is strongly convex and has Lipschitz continuous gradients, which amounts to requiring that, for any pair of model parameters, it holds that

(14)

and

(15)
Assumption 2.

(Bounded outer variation) For any model parameter, the variation of the aggregated local gradients at the honest workers with respect to the overall gradient is upper-bounded by

(16)
Assumption 3.

(Bounded inner variation) For every honest worker and any model parameter, the variation of its stochastic gradients with respect to its aggregated local gradient is upper-bounded by

(17)

Assumption 1 is standard in convex analysis. Assumptions 2 and 3 bound the variation of the local gradients and the variation of the stochastic gradients within the honest workers, respectively [33]. For instance, most existing Byzantine attack resilient SGD algorithms assume that the stochastic gradients at the honest workers are independent and identically distributed (i.i.d.) with finite variance, such that the outer variation in Assumption 2 is bounded and the inner variation in Assumption 3 is finite. In the analysis of Byzantine attack resilient SGD, both outer and inner variations must be bounded. Interestingly, the inner variation will turn out not to impact Byrd-SAGA, and Assumption 3 will no longer be necessary in its analysis.

To simplify notation, we will henceforth use the expectation operator to denote the expectation with respect to all random variables involved.

The presence of the geometric median makes the analysis of Byrd-SAGA challenging. Specifically, for every honest worker, the corrected stochastic gradient is an unbiased estimate of the worker's local gradient, meaning

(18)

Averaging (18) over all honest workers, we have

(19)

From (19), we observe that the mean of the corrected stochastic gradients over all the honest workers is an unbiased estimate of the overall gradient. Nevertheless, the geometric median of the corrected stochastic gradients, even when taken only over the honest workers and computed exactly, is a biased estimate of the overall gradient. This is the main challenge in adapting the proof of SAGA to that of Byrd-SAGA.

The following theorem asserts that Byrd-SAGA converges to a neighborhood of the optimal solution at a linear rate, with the asymptotic learning error determined by the number of Byzantine attackers.

Theorem 1.

Under Assumptions 1 and 2, if the Byzantine attackers constitute less than half of the workers and the step size is sufficiently small, then for Byrd-SAGA with ε-approximate geometric median aggregation, it holds that

(20)

where

(21)
(22)

In (20), the constant of convergence rate is given by

which is close to when (the number of data samples at each worker) and (the condition number of functions) are large. Observe that is monotonically increasing when the portion of Byzantine attackers increases. Therefore, (20) shows that Byrd-SAGA converges slower as the number of Byzantine attackers grows. Correspondingly, the theoretical upper bound of step size is small when and are large. The asymptotic learning error in (22) is also monotonically increasing when (and hence the number of Byzantine attackers) increases.

To demonstrate the superior robustness of Byrd-SAGA, we also establish the convergence of Byzantine attack resilient SGD with constant step size as a benchmark. As in Theorem 1, the convergence of Byzantine attack resilient SGD is in the mean-square error sense. This is different from [10], where convergence is asserted in the high probability sense.

Theorem 2.

Under Assumptions 1, 2, and 3, if the Byzantine attackers constitute less than half of the workers and the step size is sufficiently small, then for Byzantine attack resilient SGD with ε-approximate geometric median aggregation, it holds that

(23)

where

(24)
(25)


Let us ignore the approximation error in computing the geometric median by setting it to zero, and compare the asymptotic learning errors of Byrd-SAGA and of Byzantine attack resilient SGD under their respective step size choices.

Observe that the asymptotic learning error of Byzantine attack resilient SGD is proportional to the sum of the inner and outer variations. When all honest workers hold the same single data sample, both variations vanish; in this case, the asymptotic learning error is zero because the geometric median aggregation attains the true gradient. However, when each honest worker holds the same set of multiple distinct data samples, the outer variation is still zero but the inner variation is not, and the asymptotic learning error can be large. In contrast, Byrd-SAGA effectively reduces the impact of the inner variation, and is able to achieve a smaller learning error.

Fig. 3: Performance of the distributed SGD, mini-batch (B)SGD and SAGA, with mean and geometric median (geomed) aggregation rules on IJCNN1 dataset. The step sizes are 0.02, 0.01 and 0.02, respectively. SAGA geomed stands for the proposed Byrd-SAGA. From top to bottom: optimality gap and variance of honest messages. From left to right: without attack, Gaussian attack, sign-flipping attack, and zero-gradient attack.
Fig. 4: Performance of the distributed SGD, mini-batch (B)SGD and SAGA, with mean and geometric median (geomed) aggregation rules on COVTYPE dataset. The step sizes are 0.01, 0.005 and 0.01, respectively. SAGA geomed stands for the proposed Byrd-SAGA. From top to bottom: optimality gap and variance of honest messages. From left to right: without attack, Gaussian attack, sign-flipping attack, and zero-gradient attack.

V Numerical Experiments

Here we present numerical experiments on convex and nonconvex learning problems (the codes are available at https://github.com/MrFive5555/Byrd-SAGA). For each problem, we evenly distribute the dataset over the honest workers unless indicated otherwise. To account for malicious attacks, we additionally launch a number of Byzantine workers. We test the performance of the proposed Byrd-SAGA under three typical Byzantine attacks: Gaussian, sign-flipping, and zero-gradient attacks [15, 34]. In a Gaussian attack, a Byzantine attacker draws its message from a Gaussian distribution with prescribed mean and large variance. In a sign-flipping attack, a Byzantine attacker sends a negatively scaled version of its true message, with a prescribed magnitude used in the numerical experiments. In a zero-gradient attack, the Byzantine attackers send messages such that all messages received at the master node sum up to zero. We use the algorithm in [32] to obtain the ε-approximate geometric median.
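Below are minimal sketches of the three attack models just described. The Gaussian variance, the sign-flipping magnitude, and the exact reference point of each attack are assumed values for illustration, since the specific constants used in the experiments are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(3)

def gaussian_attack(honest_mean, dim, std=30.0):
    """Gaussian attack: draw the malicious message from a Gaussian centered
    at an assumed reference point (here, the mean of the honest messages)."""
    return honest_mean + std * rng.normal(size=dim)

def sign_flipping_attack(true_message, magnitude=3.0):
    """Sign-flipping attack: send a negatively scaled copy of the true message."""
    return -magnitude * true_message

def zero_gradient_attack(honest_messages, num_byzantine):
    """Zero-gradient attack: each Byzantine worker sends the same vector so
    that all received messages sum to zero at the master node."""
    return -np.sum(honest_messages, axis=0) / num_byzantine
```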

V-A ℓ2-regularized logistic regression

Consider the ℓ2-regularized logistic regression cost, where each per-sample summand is the logistic loss of one (feature vector, label) pair plus an ℓ2-regularization term weighted by a constant. We use the IJCNN1 and COVTYPE datasets (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets). IJCNN1 contains 49,990 training data samples, and COVTYPE contains 581,012 training data samples.
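As an illustration, here is a sketch of a per-sample ℓ2-regularized logistic loss and its gradient, assuming labels in {-1, +1} and a hypothetical regularization constant `lam`; the actual constant used in the paper's experiments is not specified here.

```python
import numpy as np

def logistic_loss(x, feature, label, lam=0.01):
    """l2-regularized logistic loss of one sample; label is +1 or -1."""
    margin = label * feature @ x
    return np.log1p(np.exp(-margin)) + 0.5 * lam * x @ x

def logistic_grad(x, feature, label, lam=0.01):
    """Gradient of the per-sample l2-regularized logistic loss."""
    margin = label * feature @ x
    sigma = 1.0 / (1.0 + np.exp(margin))       # = 1 - sigmoid(margin)
    return -sigma * label * feature + lam * x
```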

We first compare SGD, mini-batch (B)SGD, and SAGA, using the mean and geometric median aggregation rules. Compared to SGD, BSGD enjoys smaller stochastic gradient noise but incurs a higher computational cost per iteration. SAGA also reduces stochastic gradient noise, while its computational cost is of the same order as that of SGD. For each algorithm, we adopt a constant step size, tuned to achieve the best optimality gap in the Byzantine-free scenario. The performance of these algorithms on the IJCNN1 and COVTYPE datasets is depicted in Fig. 3 and Fig. 4, respectively. Under Byzantine attacks, all three algorithms using mean aggregation fail. Among the three using geometric median aggregation, Byrd-SAGA markedly outperforms the other two, while BSGD is better than SGD. This demonstrates the importance of variance reduction in handling Byzantine attacks. Regarding the variance of the honest messages in particular, on the IJCNN1 dataset Byrd-SAGA yields the smallest variance, followed by Byzantine attack resilient BSGD and then Byzantine attack resilient SGD. On the COVTYPE dataset, Byrd-SAGA and Byzantine attack resilient BSGD have variances of the same order; in this case, Byrd-SAGA achieves a similar optimality gap to that of Byzantine attack resilient BSGD, but converges faster because it is able to use a larger step size.

Theorem 1 establishes that when the outer variation is zero, the asymptotic learning error of Byrd-SAGA is zero, no matter how large the inner variation is. In contrast, according to Theorem 2, the asymptotic learning error of Byzantine attack resilient SGD remains proportional to the inner variation. To validate these theoretical results, we conducted a second set of numerical experiments, in which every honest worker holds the whole IJCNN1 dataset. Therefore, the outer variation is zero, while the inner variation remains the same as in the first set of experiments. We compare SGD, BSGD, and SAGA, all using the geometric median aggregation rule. The results depicted in Fig. 5 corroborate the theoretical findings: the asymptotic learning error of Byrd-SAGA vanishes, while those of Byzantine attack resilient SGD and BSGD remain the same as those shown in Fig. 3.

Fig. 5: Performance of the distributed SGD, mini-batch (B)SGD and SAGA, with geometric median (geomed) aggregation rule. Every honest worker has the whole IJCNN1 dataset. The step sizes are 0.0004, 0.0002 and 0.0004, respectively. SAGA geomed stands for the proposed Byrd-SAGA. From top to bottom: optimality gap and variance of honest messages. From left to right: without attack, Gaussian attack, sign-flipping attack, and zero-gradient attack.

In the third set of numerical experiments, we compare the use of different aggregation rules in distributed SAGA: mean, geometric median, median, and Krum. As shown in Fig. 6, distributed SAGA using mean aggregation is the best in terms of the optimality gap when there are no Byzantine attacks. However, it fails under all kinds of attacks. With Gaussian attacks, Byrd-SAGA using geometric median achieves the best performance. With sign-flipping and zero-gradient attacks, Byrd-SAGA using Krum is the best, while that using geometric median also performs well. Note that Krum has to know the exact number of Byzantine attackers in advance, while geometric median and median do not need this prior knowledge.

Fig. 6: Optimality gaps of distributed SAGA with different aggregation rules: mean, geometric median, median and Krum. The step sizes are 0.02 and 0.01 for the IJCNN1 and COVTYPE datasets, respectively. Curves of geometric median correspond to the proposed Byrd-SAGA. From top to bottom: on IJCNN1 dataset and on COVTYPE dataset. From left to right: without attacks, with Gaussian attacks, with sign-flipping attacks, and with zero-gradient attacks.

V-B Neural network training

Here we test training a neural network with a single hidden layer and the tanh activation function, for multi-class classification on the MNIST dataset (http://yann.lecun.com/exdb/mnist). We compare SGD, BSGD, and SAGA, each with a constant step size, run the algorithms for a fixed number of iterations, and report the final accuracy in Table I. With mean aggregation, all algorithms yield low accuracy in the presence of Byzantine attacks. With the help of geometric median aggregation, BSGD and SAGA are both robust and outperform SGD. Note that Byrd-SAGA exhibits a much lower per-iteration computational cost than Byzantine attack resilient BSGD.

attack algorithm mean acc (%) geomed acc (%)
without SGD 97.0 92.3
BSGD 98.6 98.0
SAGA 96.5 96.3
Gaussian SGD 36.3 92.5
BSGD 36.3 98.0
SAGA 14.5 96.4
sign-flipping SGD 0.11 0.03
BSGD 0.16 90.3
SAGA 0.12 86.4
zero-gradient SGD 9.94 26.2
BSGD 9.89 81.5
SAGA 9.88 92.4
TABLE I: Accuracy of SGD, mini-batch (B)SGD and SAGA, with mean and geometric median (geomed) aggregation rules. SAGA geomed stands for the proposed Byrd-SAGA.
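For completeness, here is a sketch of the one-hidden-layer tanh classifier used in this experiment; the hidden width, initialization, and loss wiring are assumptions for illustration, since only the architecture and activation are stated above.

```python
import numpy as np

rng = np.random.default_rng(4)
input_dim, hidden_dim, num_classes = 784, 50, 10   # hidden width is an assumed value

params = {
    "W1": 0.01 * rng.normal(size=(input_dim, hidden_dim)),
    "b1": np.zeros(hidden_dim),
    "W2": 0.01 * rng.normal(size=(hidden_dim, num_classes)),
    "b2": np.zeros(num_classes),
}

def forward(params, features):
    """One hidden layer with tanh activation, followed by a softmax output."""
    hidden = np.tanh(features @ params["W1"] + params["b1"])
    logits = hidden @ params["W2"] + params["b2"]
    logits -= logits.max(axis=1, keepdims=True)     # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)

def cross_entropy(params, features, labels):
    """Average cross-entropy loss over a batch; labels are integer class ids."""
    probs = forward(params, features)
    return -np.mean(np.log(probs[np.arange(len(labels)), labels] + 1e-12))
```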

VI Conclusions

The present paper developed a novel Byzantine attack resilient distributed (Byrd-) SAGA approach to federated finite-sum optimization in the presence of Byzantine attacks. On par with SAGA, Byrd-SAGA corrects stochastic gradients through variance reduction. Per iteration, distributed workers obtain their corrected stochastic gradients before uploading to the master node. Different from SAGA though, the master node in Byrd-SAGA aggregates the received messages using the geometric median rather than the mean. This robust aggregation markedly enhances robustness of Byrd-SAGA in the presence of Byzantine attacks. It was established that Byrd-SAGA converges linearly to a neighborhood of the optimal solution, with the asymptotic learning error determined solely by the number of Byzantine workers.

As confirmed by numerical tests, combinations with other robust aggregation rules also exhibit satisfactory robustness. Our future research agenda includes their analysis, as well as the development and analysis of Byzantine attack resilient algorithms over fully decentralized networks [35, 36].

References

Appendix A Proof of Lemma 1

The proof of Lemma 1 relies on the following lemma.

Lemma 2.

Let be a subset of random vectors distributed in a normed vector space. If and , then it holds that

(26)

where and .

Proof.

With and , it holds that ; and for all , we have . Then, summing up over all yields

(27)

According to the definition of the geometric median, it holds that

(28)

Combining the two inequalities, we arrive at

(29)

and upon squaring both sides of the latter, we find

(30)

Taking expectations on both sides yields (26), which completes the proof. ∎

With Lemma 2, the proof of Lemma 1 is straightforward.

Proof.

It follows readily from Lemma 2 that

(31)

Applying the inequality of to (31), yields

(32)

which completes the proof. ∎

Appendix B Lemma 3 and Its Proof

Since computing the exact geometric median is difficult, we consider the ε-approximate geometric median in this paper. The following lemma is the ε-approximate counterpart of Lemma 2.

Lemma 3.

Let be a subset of random vectors distributed in a normed vector space. If and , it holds that

(33)

where , , and is an -approximate geometric median of .

Proof.

Because is an -approximate geometric median, it follows that

(34)

Notice that (27) remains valid here. Hence, we have

(35)

Squaring both sides of (35) leads to

(36)
(37)
(38)

Taking expectations on both sides yields (33), which completes the proof. ∎

Appendix C Lemma 4 and Its Proof

As indicated in Section IV-B, the main challenge in the proof of Byrd-SAGA is that the geometric median of the corrected stochastic gradients is a biased estimate of the overall gradient. To handle the bias, the following lemma characterizes the per-slot error between an ε-approximate geometric median of the received messages and the overall gradient.

Lemma 4.

Consider Byrd-SAGA with ε-approximate geometric median aggregation. Under Assumptions 1 and 2, if the Byzantine attackers constitute less than half of the workers, then an ε-approximate geometric median of the received messages satisfies

(39)

where

(40)

while is defined as

(41)
Proof.

We begin with upper bounding the mean-square error , where . Using the definition of in (10), we have for any that

(42)

where the second equality is due to the variance decomposition, and the last inequality comes from Assumption 1.

To further upper bound the mean-square error , we have that

(43)

where the last inequality relies on (42) and Assumption 2.

Next, we will derive an upper bound on . According to (33) in Lemma 3 and (43), it holds that

(44)

which completes the proof. ∎

Appendix D Lemma 5 and Its Proof

In Lemma 4, the upper bound contains a time-varying term. The following lemma characterizes the evolution of this term.

Lemma 5.

Consider Byrd-SAGA with ε-approximate geometric median aggregation. Under Assumption 1, it holds that

(45)

where is defined in (41).

Proof.

For the expectation of , we have that