Secure Distributed On-Device Learning Networks With Byzantine Adversaries

06/03/2019 ∙ by Yanjie Dong, et al. ∙ IEEE ∙ The University of British Columbia

A privacy concern exists when the central server holds copies of the datasets. Hence, there is a paradigm shift for learning networks from centralized in-cloud learning to distributed on-device learning. Benefiting from parallel computing, on-device learning networks have a lower bandwidth requirement than in-cloud learning networks. Moreover, on-device learning networks have several desirable characteristics such as privacy preservation and flexibility. However, on-device learning networks are vulnerable to malfunctioning terminals across the networks. The worst-case malfunctioning terminals are the Byzantine adversaries, which can perform arbitrary harmful operations to compromise the learned model based on full knowledge of the networks. Hence, the design of secure learning algorithms has become an emerging topic in on-device learning networks with Byzantine adversaries. In this article, we present a comprehensive overview of the prevalent secure learning algorithms for two promising on-device learning networks: Federated-Learning networks and decentralized-learning networks. We also review several future research directions in Federated-Learning and decentralized-learning networks.




I Introduction

In the big data era, high-volume datasets improve the accuracy of learning networks. However, the ever-increasing dimension of data samples and size of datasets introduce new challenges to centralized in-cloud learning networks. For example, in-cloud learning networks experience accuracy degradation since the limited bandwidth of the current networking infrastructure may not satisfy the demand for transmitting high-volume datasets to the central cloud [1]. Moreover, both the central cloud and the terminals hold copies of the datasets. Hence, there is a privacy concern whenever either the terminals or the central cloud are hacked by malicious users, e.g., the unexpected leakage of photos of celebrities from iCloud in 2014.

The soaring demands on bandwidth and the privacy concerns have shifted the research focus from centralized in-cloud learning networks to distributed on-device learning networks. Moreover, on-device learning networks also benefit from the improved storage volume and computational power of mobile terminals. The objective of on-device learning networks is to obtain the optimal model parameter, which provides an optimal mapping between the data samples and the labels of a dataset. For example, in the MNIST dataset, the optimal model parameter is used to map the bits of an image to the digit shown in the image.

Among the on-device learning networks, the Federated-Learning networks (FLNs) [1] and decentralized-learning networks (DLNs) [2] are two categories of promising candidates, as shown in Figs. 1(a) and 1(b), respectively. As an on-device learning network recently proposed by Google, the FLN consists of a parameter server and multiple terminals, as shown in Fig. 1(a). In the FLNs, the terminals perform distributed parallel computation based on their local datasets. The parameter server updates the model parameter based on the results of the terminals. Due to the parallel computing, the FLNs converge as fast as their centralized counterpart, i.e., the in-cloud learning networks. When the parameter server experiences an unprecedented outage (e.g., an energy outage at the location of the server or a hardware failure of the server), the DLN becomes a promising on-device learning network since the DLN does not require a central parameter server [2]. In the DLNs, the terminals need to communicate with their neighbors to exchange the distributed training results. Hereinafter, two terminals are neighbors when they have a one-hop connection with each other. For example, the first and the second terminals are neighbors of each other in Fig. 1(b). When each terminal is a neighbor of all the remaining terminals, the DLN is called fully-connected; otherwise, it is called partially-connected. The advantages and disadvantages of in-cloud learning networks and on-device learning networks are summarized in the table of Fig. 1(c).

(a) A typical FLN with terminals.
(b) A typical DLN with five terminals.
(c) Comparisons of in-cloud learning networks with FLNs and DLNs.
Fig. 1: An illustration of FLNs and DLNs, and the comparisons with in-cloud learning networks.

The FLNs and DLNs have several desirable characteristics. For example, the datasets are kept at the terminals such that the FLNs and DLNs do not hold multiple copies of the privacy-related datasets. Hence, the probability of privacy leakage is reduced. Another desirable characteristic is that the terminals are flexible to join and leave the FLNs (or DLNs). In order to achieve these desirable characteristics, several challenges need to be solved, such as the uneven distribution of training datasets, the redundant exchange of model parameters, and secrecy over malfunctioning terminals [1, 3, 4]. Among these challenges, secrecy over malfunctioning terminals is the most important one since a failure to preserve secrecy degrades the accuracy of FLNs and DLNs. However, the current literature on the FLNs (or DLNs) mostly assumes that the terminals can reliably upload (or exchange) the local results to the parameter server (or neighbors). Therefore, these works cannot be used to secure the FLNs and DLNs over malfunctioning terminals (see [1, 3, 4] and references therein).

Several recent research works [3, 4, 5, 6, 7, 8, 9, 10, 11] have reported that secure learning algorithms can be used to protect the security of FLNs and DLNs with multiple Byzantine adversaries. In the FLNs and DLNs, the Byzantine adversaries are the malfunctioning terminals that can 1) obtain the full malicious action space based on the full knowledge of the networks; and 2) choose an arbitrarily bad action from the full malicious action space to compromise the prediction accuracy of the networks [12]. Since the action space of any other malfunctioning terminal is a subset of the full malicious action space, the Byzantine adversaries are considered the worst-case malfunctioning terminals. When a learning algorithm preserves secrecy over the Byzantine adversaries, it is robust to the full malicious action space of the networks. Thus, the learning algorithm is also robust to any other malfunctioning terminals. Hereinafter, we focus on reviewing the state-of-the-art research progress on secure learning algorithms in the FLNs and DLNs with Byzantine adversaries. Our contributions are summarized as follows.

  • In the FLNs with Byzantine adversaries, we classify the current secure Federated-Learning algorithms (SFLAs) into four categories: 1) aggregation rule based SFLAs; 2) preprocess based SFLAs; 3) model based SFLAs; and 4) adversarial detection based SFLAs. We also provide qualitative comparisons of current SFLAs.

  • Since the design of secure decentralized-learning algorithms (SDLAs) is at an early stage, few works have investigated the SDLAs. Therefore, we review several exemplary works on SDLAs.

  • We discuss several future research directions for secure learning algorithms in the FLNs and DLNs with Byzantine adversaries.

To the best of the authors’ knowledge, this is the first comprehensive overview on the secure learning algorithms in the on-device learning networks with Byzantine adversaries.

II On-Device Learning Networks With Byzantine Adversaries: Basics

II-A Functions of Devices in the On-Device Learning Networks

In the on-device learning networks with Byzantine adversaries, secure learning algorithms have two objectives: convergence and optimality [3, 4, 5, 6]. In order to achieve both goals, the following functions need to be implemented at different devices of the FLNs and DLNs.

  • Parameter server: In the FLNs, the updating and broadcasting of model parameter are performed at the parameter server.

  • Reliable terminals: In the FLNs, the terminals upload the local gradients to the parameter server [3, 4, 8, 6, 5, 7, 9]. In the DLNs, the terminals exchange the local model parameter with the neighbors [10, 11].

  • Byzantine adversaries: In the FLNs and DLNs, the Byzantine adversaries perform arbitrarily bad actions. Different Byzantine adversaries can collude to compromise the prediction accuracy of learning algorithms [3, 4, 6].

II-B Description of Federated-Learning Networks

(a) The $(R+1)$-th to the $N$-th terminals are Byzantine adversaries, where $R$ of the $N$ terminals are reliable.
(b) The implementation procedures of a typical SFLA with Byzantine adversaries.
Fig. 2: General setup of FLNs and SFLAs.

We consider an FLN with a parameter server and $N$ terminals, where the first $R$ terminals are reliable and the remaining $N-R$ terminals are Byzantine adversaries, as shown in Fig. 2(a). Let $\xi_n$ denote the randomly-distributed data at the $n$-th terminal. The distribution of $\xi_n$ is unknown, and the distributions of data at different terminals can be different. Let $F_n(x) = \mathbb{E}[f(x, \xi_n)]$ denote the average loss function with respect to the model parameter $x$ at the $n$-th terminal. Here, the loss function $f(x, \xi_n)$ quantifies the prediction accuracy associated with the model parameter $x$ at the $n$-th terminal. Leveraging the parallel computation of terminals, SFLAs in the FLN can obtain the optimal model parameter that minimizes the sum-of-average-loss-function (SoALF) without impairment by the Byzantine adversaries.

The detailed procedures of a typical SFLA in the FLN with Byzantine adversaries are shown in Fig. 2(b). At the beginning of the $t$-th slot, the parameter server broadcasts the global model parameter $x^t$ to all the terminals. After receiving the model parameter $x^t$, the reliable terminals calculate the local gradients of their average loss functions based on a data sample (or a mini-batch of samples, i.e., a subset of the data samples in a dataset). The Byzantine adversaries generate harmful local gradients. Then, the terminals synchronously upload the local gradients to the parameter server. Finally, the parameter server obtains a global gradient via the aggregation rule. Here, an aggregation rule defines a method for the parameter server to obtain the global gradient based on the uploaded local gradients. Using the global gradient, the parameter server updates the model parameter for the next iteration, and the procedure repeats until the SoALF converges.
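The broadcast, upload, aggregate and update steps above can be sketched as a short loop. This is only a minimal illustration under stated assumptions: the names `server_round`, `mean_rule` and `make_quadratic_terminal` are hypothetical, the model is one-dimensional, and each reliable terminal uses a toy quadratic loss that does not come from the article. The mean-value rule is used as the aggregation rule to show how a single adversary can bias the update.

```python
from typing import Callable, List, Sequence

Vector = List[float]
GradientFn = Callable[[Vector], Vector]
AggregationRule = Callable[[Sequence[Vector]], Vector]


def mean_rule(grads: Sequence[Vector]) -> Vector:
    """Mean-value aggregation: the classical, non-robust baseline."""
    n = len(grads)
    return [sum(g[j] for g in grads) / n for j in range(len(grads[0]))]


def server_round(x: Vector, reliable: Sequence[GradientFn],
                 byzantine: Sequence[GradientFn],
                 aggregate: AggregationRule, lr: float) -> Vector:
    """One slot of a generic SFLA, seen from the parameter server."""
    # Steps 1-2: broadcast x; reliable terminals return local gradients,
    # Byzantine adversaries may return arbitrary vectors.
    uploads = [fn(list(x)) for fn in reliable]
    uploads += [fn(list(x)) for fn in byzantine]
    # Step 3: the aggregation rule turns the uploads into one global gradient.
    global_grad = aggregate(uploads)
    # Step 4: gradient step on the global model parameter.
    return [xi - lr * gi for xi, gi in zip(x, global_grad)]


def make_quadratic_terminal(center: float) -> GradientFn:
    """Hypothetical reliable terminal with loss 0.5 * (x - center)^2."""
    return lambda x: [x[0] - center]


reliable_terminals = [make_quadratic_terminal(c) for c in (1.0, 2.0, 3.0)]

# Without adversaries, the mean rule drives x toward the minimizer 2.0.
x_clean = [0.0]
for _ in range(200):
    x_clean = server_round(x_clean, reliable_terminals, [], mean_rule, 0.1)

# One adversary uploading a large constant gradient drags the update away.
x_attacked = server_round([2.0], reliable_terminals,
                          [lambda x: [1000.0]], mean_rule, 0.1)
```

Swapping `mean_rule` for a robust aggregation rule changes only the `aggregate` argument, which is why the aggregation step is a natural place to add protection.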

II-C Description of Decentralized-Learning Networks

(a) The second and the fourth terminals are Byzantine adversaries.
(b) The implementation procedures of a typical SDLA with Byzantine adversaries.
Fig. 3: General setup of DLNs and SDLAs.

We consider a DLN with five terminals. As shown in Fig. 3(a), the second and the fourth terminals are Byzantine adversaries among the five terminals. The first, the third and the fifth terminals are reliable terminals (i.e., there are three reliable terminals). Different from the FLNs, the terminals in the DLN obtain the optimal model parameter that minimizes the SoALF in a decentralized way. During the operation of an SDLA, each terminal exchanges its local model parameter with its neighbor terminals. We take the third terminal as an example, and present the detailed procedures of a typical SDLA in the DLN with Byzantine adversaries, as shown in Fig. 3(b).

At the start of each slot, the third reliable terminal computes the value of the third average loss function based on its current local model parameter. Then, the third reliable terminal updates the local model parameter and obtains the local model parameter for the next slot. Finally, the third reliable terminal broadcasts the updated local model parameter to its neighbor terminals, e.g., the second, the fourth and the fifth terminals. The Byzantine adversaries broadcast arbitrary adversarial model parameters to their neighbor terminals. All terminals repeat the above process until the SoALF converges.

II-D Stochastic Gradient, Batch Gradient and Mini-Batch Gradient

Since the distribution of data is unknown, it is challenging to obtain the exact local gradient due to the high-dimensional integration over the unknown data distribution. Therefore, several estimators of the local gradient are used, which differ in how they use the data samples of terminals. With $x^t$ denoting the model parameter in the $t$-th slot and $f$ denoting the loss function, the batch gradient is obtained as

$$\frac{1}{K_n} \sum_{k=1}^{K_n} \nabla f(x^t, \xi_{n,k}^t),$$

where $K_n$ and $\xi_{n,k}^t$ respectively denote the size of data samples and the $k$-th data sample of the $n$-th terminal at the beginning of the $t$-th slot. Since the batch gradient needs to evaluate the gradient of the loss function over all data samples at the $n$-th terminal, the computational cost at the $n$-th terminal increases with the size of data samples. By sacrificing the convergence rate, the computational cost can be reduced via the stochastic gradient. The stochastic gradient is obtained by calculating the gradient of the loss function over one randomly-drawn data sample as $\nabla f(x^t, \xi_{n,k}^t)$ for a randomly-drawn index $k$. In order to balance the computational cost and the convergence rate, the mini-batch gradient can be used by evaluating the gradient of the loss function over a mini-batch of data samples at the $n$-th terminal as $\frac{1}{M} \sum_{k \in \mathcal{M}_n^t} \nabla f(x^t, \xi_{n,k}^t)$, with $\mathcal{M}_n^t$ as the randomly-drawn index set of the mini-batch and $M$ as the size of mini-batch data samples.
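The three estimators differ only in how many samples enter the average, which a small sketch makes concrete. The squared-error loss and the helper names below are assumptions chosen for illustration; the article does not prescribe a particular loss in this subsection.

```python
import random
from typing import List, Sequence, Tuple

Vector = List[float]
Sample = Tuple[Vector, float]  # (feature vector a, label b)


def sample_gradient(x: Vector, sample: Sample) -> Vector:
    """Gradient of the assumed loss 0.5 * (a . x - b)^2 at one sample."""
    a, b = sample
    residual = sum(ai * xi for ai, xi in zip(a, x)) - b
    return [residual * ai for ai in a]


def average_gradient(x: Vector, samples: Sequence[Sample]) -> Vector:
    grads = [sample_gradient(x, s) for s in samples]
    return [sum(g[j] for g in grads) / len(grads) for j in range(len(x))]


def batch_gradient(x: Vector, data: Sequence[Sample]) -> Vector:
    """Uses every local sample: most accurate, cost grows with len(data)."""
    return average_gradient(x, data)


def stochastic_gradient(x: Vector, data: Sequence[Sample],
                        rng: random.Random) -> Vector:
    """Uses one randomly drawn sample: cheapest, noisiest."""
    return sample_gradient(x, rng.choice(list(data)))


def mini_batch_gradient(x: Vector, data: Sequence[Sample], size: int,
                        rng: random.Random) -> Vector:
    """Uses `size` samples drawn without replacement: a middle ground."""
    return average_gradient(x, rng.sample(list(data), size))


data = [([1.0, 0.0], 1.0), ([0.0, 1.0], 2.0)]
x0 = [0.0, 0.0]
rng = random.Random(0)
full = batch_gradient(x0, data)
single = stochastic_gradient(x0, data, rng)
mini = mini_batch_gradient(x0, data, len(data), rng)
```

When the mini-batch size equals the local dataset size, the mini-batch gradient coincides with the batch gradient; with size one it reduces to the stochastic gradient.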

III Prevalent Secure Distributed Learning Algorithms

III-A Aggregation Rule Based Algorithms

Aggregation of gradients is an important step for an SFLA in the FLNs with Byzantine adversaries. In their seminal work [3], Blanchard et al. reported that the mean-value aggregation rule outputs a sequence of global gradients that can be arbitrarily biased by a single Byzantine adversary. Hence, learning algorithms with the mean-value aggregation rule either do not converge or converge to an ineffective model parameter. For the case where the fraction of Byzantine adversaries is below a prescribed threshold, Blanchard et al. proposed a secure aggregation rule named Krum [3]. The objective of Krum is to approximate the optimal global gradient, which is defined as the average of local gradients when no Byzantine adversary exists. In each iteration, Krum selects one of the local gradients as the global gradient. Since the selected global gradient has the smallest sum of Euclidean distances to its closest local gradients, the Euclidean distance between the selected global gradient and the optimal global gradient is bounded. Therefore, the impact of Byzantine adversaries is mitigated. Moreover, Blanchard et al. also demonstrated that the selected sequence of global gradients converges almost surely [3].
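A minimal sketch of the Krum selection step follows. It assumes the scoring used in [3] as we understand it: each upload is scored by the sum of squared Euclidean distances to its n − f − 2 nearest uploads, where n is the number of uploads and f the assumed number of Byzantine adversaries, and n ≥ 2f + 3 is required. The function names are illustrative.

```python
from typing import List, Sequence

Vector = List[float]


def squared_distance(u: Vector, v: Vector) -> float:
    return sum((a - b) ** 2 for a, b in zip(u, v))


def krum(grads: Sequence[Vector], num_byzantine: int) -> Vector:
    """Return the upload whose nearest neighbors are collectively closest."""
    n = len(grads)
    if n < 2 * num_byzantine + 3:
        raise ValueError("Krum needs at least 2f + 3 uploads")
    neighbors = n - num_byzantine - 2
    best_index, best_score = 0, float("inf")
    for i, g in enumerate(grads):
        # Distances from upload i to every other upload, nearest first.
        dists = sorted(squared_distance(g, h)
                       for j, h in enumerate(grads) if j != i)
        score = sum(dists[:neighbors])
        if score < best_score:
            best_index, best_score = i, score
    return list(grads[best_index])


uploads = [
    [1.0, 1.0], [1.1, 0.9], [0.9, 1.1], [1.0, 1.05],  # reliable terminals
    [100.0, -100.0],                                   # one adversary
]
selected = krum(uploads, num_byzantine=1)
```

Because the output is always one of the uploaded vectors, an outlier is selected only if it sits close to many other uploads, which limits what a minority of adversaries can achieve.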

Compared with the mean value of a sequence, the median gives a reliable estimate as long as more than half of the sequence is correct. Hence, several aggregation rules have been proposed based on variants of the median [4, 5]. For multi-dimensional local gradients, Chen et al. defined the geometric median (GeoMed) as the minimizer of the sum of distances to all local gradients [4]. With the GeoMed, the parameter server outputs a global gradient that secures the distributed learning algorithms. When a convex SoALF is used, Chen et al. proved that the SFLAs with the GeoMed aggregation rule are Byzantine-robust when the number of Byzantine adversaries is less than the number of reliable terminals [4]. In addition, Chen et al. also quantified the convergence rate of each iteration [4].
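The geometric median has no closed form in more than one dimension, so it is computed iteratively. The sketch below uses Weiszfeld's fixed-point iteration, a standard numerical method that is chosen here for illustration and is not necessarily the solver used in [4]; the iteration count and the small constant `eps` are assumptions.

```python
import math
from typing import List, Sequence

Vector = List[float]


def geometric_median(points: Sequence[Vector], iterations: int = 200,
                     eps: float = 1e-12) -> Vector:
    """Approximate the minimizer of the sum of Euclidean distances."""
    dim = len(points[0])
    # Start from the plain mean of the uploads.
    estimate = [sum(p[j] for p in points) / len(points) for j in range(dim)]
    for _ in range(iterations):
        # Each point is weighted by the inverse of its current distance,
        # so far-away outliers contribute little to the next estimate.
        weights = []
        for p in points:
            dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(p, estimate)))
            weights.append(1.0 / max(dist, eps))
        total = sum(weights)
        estimate = [sum(w * p[j] for w, p in zip(weights, points)) / total
                    for j in range(dim)]
    return estimate


uploads = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0],
           [1000.0, 1000.0]]  # the last upload is an outlier
median = geometric_median(uploads)
mean = [sum(p[j] for p in uploads) / len(uploads) for j in range(2)]
```

In this toy example the mean is pulled to roughly (200, 200) by the single outlier, while the geometric median stays inside the cluster of the four remaining points.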

For the case where the fraction of Byzantine adversaries is below a prescribed threshold, Yin et al. proposed two aggregation rules [5]: component-wise median (CwMed) and component-wise trimmed mean (CwTM). The CwMed constructs a global gradient where each entry is the median of the entries of the local gradients at the same coordinate [5, Def. 1]. With CwMed, each entry of the global gradient is not polluted by the Byzantine adversaries. Different from CwMed, the CwTM first removes, coordinate by coordinate, a given fraction of the largest and of the smallest entries of the local gradients [5, Def. 2]. Then, the CwTM constructs a global gradient where each entry is the mean value of the remaining entries of the local gradients. The CwTM needs to know the number of Byzantine adversaries such that the faulty entries of Byzantine adversaries are removed with high probability. Yin et al. also quantified the convergence rate of SFLAs with the CwMed or CwTM aggregation rules for strongly convex, non-strongly convex and smooth non-convex SoALFs [5].
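Both rules operate independently on each coordinate, as the following sketch shows. The function names and the parameter `trim_fraction` (the fraction removed from each end of a coordinate) are illustrative assumptions.

```python
import statistics
from typing import List, Sequence

Vector = List[float]


def cw_median(grads: Sequence[Vector]) -> Vector:
    """Component-wise median of the uploaded local gradients."""
    return [statistics.median(column) for column in zip(*grads)]


def cw_trimmed_mean(grads: Sequence[Vector], trim_fraction: float) -> Vector:
    """Drop the largest and smallest entries per coordinate, then average."""
    n = len(grads)
    k = int(trim_fraction * n)  # entries removed from each end
    if 2 * k >= n:
        raise ValueError("trim_fraction removes every entry")
    result = []
    for column in zip(*grads):
        kept = sorted(column)[k:n - k]
        result.append(sum(kept) / len(kept))
    return result


uploads = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0],
           [1000.0, -1000.0]]  # the last upload is adversarial
med = cw_median(uploads)
trimmed = cw_trimmed_mean(uploads, trim_fraction=0.2)
```

The trimmed mean only discards the adversarial entries if `trim_fraction` is at least the true fraction of adversaries, which reflects the remark above that CwTM needs to know the number of Byzantine adversaries.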

Since the GeoMed aggregation rule requires solving a convex optimization problem to obtain the global gradient, its computational complexity increases exponentially with the dimension of the model parameter. Since Krum, CwMed and CwTM only involve linear algebraic operations to obtain the global gradient, these three aggregation rules have lower computational complexity than GeoMed.

III-B Preprocessing Based Algorithms

In addition to the aggregation rule, several preprocessing methods utilizing the local gradients have been investigated to design the SFLAs [6, 7]. For example, Chen et al. proposed a preprocess based SFLA [7], known as DRACO, in the FLNs. DRACO leverages coding theory to remove the negative effect (i.e., degradation of prediction accuracy) of Byzantine adversaries. From the information-theoretical perspective, the optimal gradient can be recovered when the reliable terminals report sufficient information to the parameter server. Specifically, Chen et al. demonstrated that DRACO secures the learning algorithms when the number of Byzantine adversaries is less than half of the average number of gradients computed on each data sample (i.e., the redundancy ratio is more than twice the number of Byzantine adversaries) [7]. Moreover, the authors illustrated two practical coding schemes to implement DRACO. Since DRACO secures the learning algorithms via redundant storage of data samples, DRACO does not impose a limitation on the fraction of Byzantine adversaries and can secure the traditional learning algorithms. Besides, the convergence behavior of DRACO highly depends on the learning procedure used.
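The redundancy idea can be illustrated with the simplest possible code, a repetition code decoded by majority vote. This is a sketch under stated assumptions rather than the exact schemes of [7]: every group of data samples is assumed to be processed by several terminals, reliable terminals are assumed to return bit-identical gradients for the same group, and the function names are hypothetical.

```python
from collections import Counter
from typing import List, Sequence

Vector = List[float]


def majority_decode(replicas: Sequence[Vector]) -> Vector:
    """Recover a group's gradient if a strict majority of replicas agree."""
    counts = Counter(tuple(g) for g in replicas)
    value, votes = counts.most_common(1)[0]
    if 2 * votes <= len(replicas):
        raise ValueError("no strict majority: too many corrupted replicas")
    return list(value)


def repetition_aggregate(groups: Sequence[Sequence[Vector]]) -> Vector:
    """Decode every group by majority vote, then sum the decoded gradients."""
    decoded = [majority_decode(replicas) for replicas in groups]
    return [sum(g[j] for g in decoded) for j in range(len(decoded[0]))]


groups = [
    # Group 1: three terminals computed the same gradient, one lied.
    [[1.0, 2.0], [1.0, 2.0], [9.0, 9.0]],
    # Group 2: again one corrupted replica out of three.
    [[3.0, 4.0], [-5.0, 0.0], [3.0, 4.0]],
]
recovered = repetition_aggregate(groups)
```

With three replicas per group, one corrupted replica per group is tolerated, in line with the condition that the redundancy must exceed twice the number of adversaries. The price is that every gradient is computed several times.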

El Mhamdi et al. proposed a preprocess method [6]: Bulyan. Bulyan uses Krum [3] to obtain a subset of the uploaded local gradients. The parameter server then constructs a global gradient by taking the component-wise average of the refined subset of local gradients. We observe that Bulyan only adds extra linear operations to Krum. Hence, Bulyan converges for both convex and non-convex SoALFs. Bulyan guarantees that the prediction accuracy is not affected by the dimension of data samples. Since Bulyan is a refinement of Krum that removes several unnecessary local gradients, it has a stricter limitation on the number of Byzantine adversaries. More specifically, the number of Byzantine adversaries must be less than 25% of the number of terminals in the FLNs.

III-C Model Based Algorithms

The SFLAs can also be designed during the formulation of the problem model in the FLNs. Li et al. proposed a problem model based SFLA [8], which is named the Byzantine-robust stochastic aggregation (BRSA) algorithm. In the BRSA algorithm, a regularization term is introduced to the SoALF such that the uploaded local gradients of reliable terminals and Byzantine adversaries are discretized into a finite set. Using the discrete local gradients, the BRSA algorithm converges to a suboptimal model parameter. Since the problem model used by BRSA is secured, the BRSA algorithm converges without a limitation on the fraction of Byzantine adversaries [8]. For the convex SoALF, Li et al. quantitatively analyzed the relation between the accuracy of the BRSA algorithm and the number of Byzantine adversaries [8]. By selecting different regularization terms, the convergence rate can be traded for a reduction of the gap between the optimal model parameter and the convergent model parameter. Besides, the BRSA algorithm allows the data samples at different terminals to have different distributions.
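The discretization effect can be sketched as follows, assuming an ℓ1-norm penalty between the server model and each terminal model as the regularization term (one possible choice; the exact formulation in [8] may differ). The server's own regularizer is omitted, and the names `rsa_server_step`, `rsa_terminal_step` and `lam` are hypothetical.

```python
from typing import List, Sequence

Vector = List[float]


def sign(value: float) -> int:
    return (value > 0) - (value < 0)


def rsa_server_step(x0: Vector, uploads: Sequence[Vector],
                    lr: float, lam: float) -> Vector:
    """Server update: every upload contributes only a value in {-1, 0, 1}."""
    return [
        x0[j] - lr * lam * sum(sign(x0[j] - z[j]) for z in uploads)
        for j in range(len(x0))
    ]


def rsa_terminal_step(x: Vector, x0: Vector, local_grad: Vector,
                      lr: float, lam: float) -> Vector:
    """Reliable terminal: local gradient step plus a pull toward x0."""
    return [
        x[j] - lr * (local_grad[j] + lam * sign(x[j] - x0[j]))
        for j in range(len(x))
    ]


x0 = [0.0]
honest = [[1.0], [2.0]]
# However extreme the adversarial upload is, its influence is one sign.
after_mild = rsa_server_step(x0, honest + [[-1e9]], lr=0.1, lam=0.5)
after_wild = rsa_server_step(x0, honest + [[-1e30]], lr=0.1, lam=0.5)
```

Because each upload enters the server update only through a sign, the per-coordinate influence of any single terminal is bounded by `lr * lam`, regardless of the value it sends.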

III-D Adversarial Detection Based Algorithms

While the previous works mainly focused on designing Byzantine-resilient learning algorithms, an extra step to detect Byzantine adversaries can be included before the aggregation of local gradients. When the SoALF is convex and the number of Byzantine adversaries is less than that of reliable terminals, Alistarh et al. proposed the Byzantine stochastic gradient descent (SGD) algorithm to detect the Byzantine adversaries [9]. This two-threshold approach is motivated by the facts that the reliable terminals introduce: 1) a limited variation of the time-average local gradients, and 2) a limited fluctuation of the time-average inner products of the local gradient and the variation of the model parameter. When either one of the facts is violated, the terminal is detected as a Byzantine adversary. For a Byzantine adversary that does not violate the two thresholds, Alistarh et al. demonstrated that the Byzantine adversary cannot compromise the convergence and accuracy of the Byzantine SGD algorithm [9]. Based on the detection of Byzantine adversaries, the number of local gradients is reduced to one such that the computational complexity at the terminals is reduced. Theoretical analysis shows that the proposed detection method achieves the optimal number of gradient exchanges between the parameter server and the terminals [9, Theorems 3.4 and 3.5].

III-E Qualitative Comparisons

While GeoMed, Byzantine SGD and BRSA are effective for convex loss functions, Krum, CwTM, CwMed, Bulyan and DRACO work for both convex and non-convex loss functions. Since Bulyan is a refinement of Krum, it has a stricter limitation than Krum on the number of Byzantine adversaries. Table I presents a coarse-grained comparison among the state-of-the-art SFLAs. Note that the complexity in Table I indicates the computational complexity to obtain the global gradient at the parameter server. The computational complexity scales with the number of Byzantine adversaries, the dimension of the model parameter, and the number of terminals.

Scheme            | Loss Function      | % of Byzantine Adversaries | Sample Redundancy | Complexity
Krum [3]          | Convex, non-convex |                            |                   | Medium
GeoMed [4]        | Convex             |                            |                   | High
CwTM [5]          | Convex, non-convex |                            |                   | Low to Medium
CwMed [5]         | Convex, non-convex |                            |                   | Low to Medium
BRSA [8]          | Convex             | N/A                        |                   | Low
Bulyan [6]        | Convex, non-convex |                            |                   | Medium to High
DRACO [7]         | Convex, non-convex | N/A                        |                   | Low
Byzantine SGD [9] | Convex             |                            |                   | Low
TABLE I: Coarse-Grained Comparison of SFLAs

III-F A Case Study

The loss function takes the form of the regularized softmax regression. We consider 10 terminals, two of which are Byzantine adversaries. All the results are obtained on the training and test samples of the MNIST dataset. We consider two types of attacks: 1) reverse attacks; and 2) Gaussian attacks. In the reverse attacks, the Byzantine adversaries upload local gradients scaled by a negative value to the parameter server. In the Gaussian attacks, the Byzantine adversaries upload local gradients where each entry follows a Gaussian distribution with mean zero and a Gaussian-distributed variance. The model parameter is updated by a mini-batch SGD method with a given mini-batch size, and the simulations are run for a given number of iterations with a given learning rate.
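The two attacks can be sketched in a few lines. The scale, the inner variance and the function names are placeholders, since the values used in the simulations are not reproduced here. The use of the magnitude of a Gaussian draw as the random variance is also an assumption of this sketch, made so that the variance is never negative.

```python
import math
import random
from typing import List

Vector = List[float]


def reverse_attack(honest_gradient: Vector, scale: float) -> Vector:
    """Upload the true local gradient multiplied by a negative scale."""
    if scale >= 0:
        raise ValueError("a reverse attack needs a negative scale")
    return [scale * g for g in honest_gradient]


def gaussian_attack(dim: int, inner_variance: float,
                    rng: random.Random) -> Vector:
    """Upload zero-mean Gaussian noise whose variance is itself random."""
    upload = []
    for _ in range(dim):
        # Assumption: the variance is the magnitude of a zero-mean
        # Gaussian draw whose own variance is `inner_variance`.
        variance = abs(rng.gauss(0.0, math.sqrt(inner_variance)))
        upload.append(rng.gauss(0.0, math.sqrt(variance)))
    return upload


reversed_upload = reverse_attack([1.0, -2.0], scale=-4.0)
noisy_upload = gaussian_attack(dim=5, inner_variance=200.0,
                               rng=random.Random(0))
```

A reverse attack pushes the model in the ascent direction, whereas a Gaussian attack injects noise that carries no information about the loss.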

(a) Classical learning algorithm without attack.
(b) Classical learning algorithm under reverse attacks.
(c) Classical learning algorithm under Gaussian attacks.
Fig. 4: An illustration of the impact of Byzantine adversaries on the prediction accuracy. The title of each image shows the predicted value of the image, and the incorrect predicted values are highlighted.

Figures 4(a)-4(c) illustrate that the Byzantine adversaries can compromise the prediction accuracy of the classical learning algorithms. As shown in Fig. 4(a), the model parameter obtained by the classical learning algorithm without attack returns two errors. When the Byzantine adversaries are included (e.g., reverse attacks in Fig. 4(b) and Gaussian attacks in Fig. 4(c)), the prediction accuracy of the classical learning algorithm is significantly compromised.

(a) Convergence of SFLAs with reverse attacks.
(b) Convergence of SFLAs with Gaussian attacks.
Fig. 5: Convergence behaviors of SFLAs under different attacks.

Figures 5(a) and 5(b) show the accuracy of four aggregation rules: 1) Krum [3], 2) GeoMed [4], 3) CwMed [5] and 4) CwTM [5]. The benchmark scheme is a classical learning algorithm where the model parameter is updated by the mini-batch SGD method and all terminals are reliable. The Krum, GeoMed and CwTM algorithms perform similarly when the reverse attacks are present. When there are Gaussian attacks, the GeoMed algorithm obtains a near-optimal model parameter and outperforms the remaining aggregation rules. However, the GeoMed algorithm requires the parameter server to solve an optimization problem to update the model parameter during each iteration. The Krum, CwMed and CwTM algorithms rely on simple algebraic manipulations; therefore, they have lower computational complexities than the GeoMed algorithm. Hence, we conclude that different aggregation rules provide different tradeoffs between the accuracy and the computational complexity.

IV Prevalent Secure Decentralized Learning Algorithms

Before proceeding to discuss the SDLAs, we first define the trim operation used in DLNs with Byzantine adversaries. Performing the trim operation on a sequence of scalar values means that a prescribed number of the largest and of the smallest values are removed from the sequence. In this section, we present two exemplary works for fully-connected DLNs [10] and partially-connected DLNs [11] with a scalar model parameter. In the DLNs, all terminals need to converge to the same local model parameter, i.e., the consensus model parameter. However, the Byzantine adversaries can broadcast different fake model parameters to their neighbors such that the convergent model parameter is biased from the consensus model parameter. As a result, the prediction accuracy of the convergent model parameter is significantly degraded.

In the proposed SDLA for fully-connected DLNs, each terminal exchanges local model parameter and local gradient with the neighbors [10]. Different from [10], the SDLA for partially-connected DLNs [11] only requires each terminal to exchange the local gradient with the neighbors. Therefore, each terminal maintains a local-gradient sequence [11] (or a local-model-parameter sequence and a local-gradient sequence [10]). In each iteration, each terminal performs the trim operation to the local-gradient sequence [11] (or the local-model-parameter sequence and local-gradient sequence [10]). Then, each terminal updates the model parameter based on the local-gradient sequence [11] (or the local-model-parameter sequence and local-gradient sequence [10]).
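The trim operation and one trimmed update of a scalar model parameter can be sketched as below. Averaging the surviving values and the names `trim` and `trimmed_step` are assumptions of this sketch; the exact combination rules in [10] and [11] may differ.

```python
from typing import List, Sequence


def trim(values: Sequence[float], num_removed: int) -> List[float]:
    """Remove the `num_removed` largest and smallest scalar values."""
    if len(values) <= 2 * num_removed:
        raise ValueError("not enough values left after trimming")
    ordered = sorted(values)
    return ordered[num_removed:len(ordered) - num_removed]


def trimmed_step(models: Sequence[float], gradients: Sequence[float],
                 num_removed: int, lr: float) -> float:
    """One local update from trimmed model and gradient sequences."""
    kept_models = trim(models, num_removed)
    kept_grads = trim(gradients, num_removed)
    consensus = sum(kept_models) / len(kept_models)
    direction = sum(kept_grads) / len(kept_grads)
    return consensus - lr * direction


# Values seen by one terminal; the last entry comes from an adversary.
models = [1.0, 2.0, 3.0, 1e6]
gradients = [0.5, 0.5, 0.5, -1e6]
new_model = trimmed_step(models, gradients, num_removed=1, lr=0.1)
```

Trimming discards extreme values whether or not they are adversarial, so some reliable information is lost in every iteration in exchange for bounding the influence of the adversaries.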

In order to obtain the consensus model parameter, the fully-connected DLNs require the number of reliable terminals to be more than twice the number of Byzantine adversaries. For the partially-connected DLNs, the requirements depend on the connectivity of the terminals. When there are Byzantine adversaries in the DLNs, the consensus-based SDLA converges if and only if the partially-connected DLNs satisfy the graph-robustness condition of [11]. When each terminal has at most a bounded number of Byzantine-adversary neighbors, the consensus-based SDLA converges to the consensus model parameter provided that the DLNs satisfy the corresponding robustness condition [11].

V Future Research Directions

V-A Bandwidth Reduction of SFLAs

In the FLNs, frequent uploads of gradients from the terminals to the parameter server are inevitable for the SFLAs. With high-dimensional gradients and a large number of iterations, the bandwidth demand is still high during the learning process of the FLNs. In order to reduce the bandwidth demand, Chen et al. proposed a lazily-aggregated gradient method where the terminals can adaptively skip the gradient calculation and gradient exchange [13]. Moreover, the authors also theoretically demonstrated that the lazily-aggregated gradient method has the same convergence rate as, and a lower communication complexity than, the batch gradient descent method. However, the convergence and optimality of the lazily-aggregated gradient method are unknown when multiple Byzantine adversaries exist in the FLNs. The state-of-the-art SFLAs require the terminals to upload local gradients in each iteration. Hence, the communication complexity becomes a major obstacle to scaling up the SFLAs in the FLNs. In order to reduce the communication complexity of SFLAs, one promising method is to exclude the redundant gradients exchanged by the reliable terminals. Besides, compressing the exchanged gradients in each iteration alleviates the scarcity of bandwidth. In order to compensate for the loss during compression, the SFLAs with gradient compression also need to include methods such as momentum correction and local gradient clipping.
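As one illustration of gradient compression, the sketch below combines norm clipping with top-k sparsification and local residual accumulation. Top-k selection and residual accumulation are assumptions chosen for illustration, not methods prescribed by the article, and momentum correction is not shown.

```python
import math
from typing import List, Tuple

Vector = List[float]


def clip_by_norm(grad: Vector, max_norm: float) -> Vector:
    """Rescale the gradient so that its Euclidean norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm or norm == 0.0:
        return list(grad)
    return [g * max_norm / norm for g in grad]


def top_k_sparsify(grad: Vector, residual: Vector,
                   k: int) -> Tuple[Vector, Vector]:
    """Send only the k largest-magnitude entries; keep the rest locally."""
    corrected = [g + r for g, r in zip(grad, residual)]
    order = sorted(range(len(corrected)),
                   key=lambda j: abs(corrected[j]), reverse=True)
    keep = set(order[:k])
    sparse = [v if j in keep else 0.0 for j, v in enumerate(corrected)]
    new_residual = [0.0 if j in keep else v for j, v in enumerate(corrected)]
    return sparse, new_residual


clipped = clip_by_norm([3.0, 4.0], max_norm=1.0)
upload_1, residual_1 = top_k_sparsify([0.1, -5.0, 2.0], [0.0, 0.0, 0.0], k=1)
# Entries that were not sent are carried over to the next iteration.
upload_2, residual_2 = top_k_sparsify([0.1, 0.0, 0.0], residual_1, k=1)
```

How such compressed uploads interact with robust aggregation rules is exactly the kind of open question raised in this subsection; the sketch makes no claim about Byzantine robustness.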

V-B Multi-Task SFLAs

While the current learning algorithms mainly focus on obtaining a single global model, it is attractive to concurrently compute multiple models when these models are correlated. The correlation can be induced by similar behaviors of different terminals, such as reposting similar news on social media and watching popular video episodes over the Internet. Moreover, the parallel computation of multiple models brings several benefits such as improved learning efficiency and prediction accuracy. Therefore, multi-task learning algorithms have been proposed for the FLNs [14]. In the presence of Byzantine adversaries, one potential research direction is to investigate secure multi-task learning algorithms (SMTLAs) in the FLNs. The SMTLAs need to converge to a near-optimal model parameter without being falsified by Byzantine adversaries, and also need to keep the communication expenditure as low as possible. The SMTLAs can be developed based on the aforementioned aggregation rules and regularization term. However, the convergence property and communication expenditure remain unknown when the aggregation rules and regularization term are used. Moreover, introducing a precoding process for the update of gradients can also secure the multi-task learning algorithms in the FLNs.

V-C Stochastic SDLAs

Since the data samples are locally collected at each terminal in the DLNs, each terminal may have a model parameter that is similar to those of its neighbors but retains a small difference [15]. As pointed out by Koppel et al. in [15], proximity constraints among the neighboring terminals are a promising technique to formulate such small differences in the DLNs. In order to avoid multiple assessments of the average loss functions, the stochastic saddle point algorithm was introduced by allowing each terminal to access its local loss function once in each iteration [15]. However, the convergence and optimality of the stochastic saddle point algorithm can be compromised by multiple Byzantine adversaries in the DLNs. While the algorithms proposed in [10, 11] are secure for DLNs with Byzantine adversaries and a scalar model parameter, it is more practical to learn a high-dimensional model parameter due to the ever-increasing volume of datasets. While the regularization term used in [8] is promising for handling the Byzantine adversaries in the FLNs, it remains an open problem to design the regularization term in the DLNs with Byzantine adversaries. When the heterogeneity of data samples is considered, the design of multi-task SDLAs is also an interesting research problem in the DLNs.

VI Conclusions

Since the current learning algorithms are vulnerable to the Byzantine adversaries, we provided a comprehensive overview of the SFLAs and SDLAs in the FLNs and DLNs, respectively. The Byzantine adversaries are considered since they can act arbitrarily to compromise the classical learning algorithms. Therefore, the secure learning algorithms, which are robust to the Byzantine adversaries, can work under any attacks from the terminals. We presented the signaling-exchange procedures of the secure learning algorithms in both FLNs and DLNs when Byzantine adversaries coexist with the reliable terminals. Numerous state-of-the-art secure learning algorithms were discussed in terms of their main contributions in the FLNs and DLNs. Several future research directions were also discussed for the secure learning algorithms in the FLNs and DLNs.


  • [1] J. Konečný, B. McMahan, and D. Ramage, “Federated optimization: Distributed optimization beyond the datacenter,” arXiv preprint arXiv:1511.03575, 2015.
  • [2] B. Ying, K. Yuan, and A. H. Sayed, “Supervised learning under distributed features,” IEEE Trans. Signal Process., vol. 67, no. 4, pp. 977–992, Feb. 2019.
  • [3] P. Blanchard, E. M. El Mhamdi, R. Guerraoui, and J. Stainer, “Machine learning with adversaries: Byzantine tolerant gradient descent,” in Proc. Advances in Neural Information Processing Systems (NIPS), Long Beach, USA, Dec. 2017, pp. 119–129.
  • [4] Y. Chen, L. Su, and J. Xu, “Distributed statistical machine learning in adversarial settings: Byzantine gradient descent,” Proc. ACM on Measurement and Analysis of Computing Systems (SIGMETRICS), vol. 1, no. 2, pp. 44:1–44:25, Dec. 2017.
  • [5] D. Yin, Y. Chen, R. Kannan, and P. Bartlett, “Byzantine-robust distributed learning: Towards optimal statistical rates,” in Proc. International Conference on Machine Learning (ICML), Stockholmsmässan, Stockholm, Sweden, July 2018, pp. 5650–5659.
  • [6] E. M. El Mhamdi, R. Guerraoui, and S. Rouault, “The hidden vulnerability of distributed learning in Byzantium,” in Proc. International Conference on Machine Learning (ICML), Stockholmsmässan, Stockholm, Sweden, July 2018, pp. 3521–3530.
  • [7] L. Chen, H. Wang, Z. Charles, and D. Papailiopoulos, “DRACO: Byzantine-resilient distributed training via redundant gradients,” in Proc. International Conference on Machine Learning (ICML), Stockholmsmässan, Stockholm, Sweden, July 2018, pp. 903–912.
  • [8] L. Li, W. Xu, T. Chen, G. B. Giannakis, and Q. Ling, “RSA: Byzantine-robust stochastic aggregation methods for distributed learning from heterogeneous datasets,” in Proc. AAAI Conference on Artificial Intelligence, to be published.
  • [9] D. Alistarh, Z. Allen-Zhu, and J. Li, “Byzantine stochastic gradient descent,” in Proc. Advances in Neural Information Processing Systems (NIPS), Montreal, Canada, Dec. 2018, pp. 4614–4624.
  • [10] L. Su and N. H. Vaidya, “Fault-tolerant multi-agent optimization: Optimal iterative distributed algorithms,” in Proc. ACM Symposium on Principles of Distributed Computing, Chicago, Illinois, USA, July 2016, pp. 425–434.
  • [11] S. Sundaram and B. Gharesifard, “Distributed optimization under adversarial nodes,” IEEE Trans. Autom. Control, vol. 64, no. 3, pp. 1063–1076, Mar. 2019.
  • [12] L. Lamport, R. Shostak, and M. Pease, “The Byzantine generals problem,” ACM Transactions on Programming Languages and Systems, vol. 4, no. 3, pp. 382–401, July 1982.
  • [13] T. Chen, G. B. Giannakis, T. Sun, and W. Yin, “LAG: Lazily aggregated gradient for communication-efficient distributed learning,” in Proc. Advances in Neural Information Processing Systems (NIPS), Montreal, Canada, Dec. 2018, pp. 5050–5060.
  • [14] V. Smith, C.-K. Chiang, M. Sanjabi, and A. S. Talwalkar, “Federated multi-task learning,” in Proc. Advances in Neural Information Processing Systems (NIPS), Long Beach, USA, Dec. 2017, pp. 4424–4434.
  • [15] A. Koppel, B. M. Sadler, and A. Ribeiro, “Proximity without consensus in online multiagent optimization,” IEEE Trans. Signal Process., vol. 65, no. 12, pp. 3062–3077, June 2017.