I Introduction
In the big data era, the high volume of datasets improves the accuracy of learning networks. However, the ever-increasing dimension of data samples and size of datasets introduce new challenges to centralized in-cloud learning networks. For example, in-cloud learning networks will experience accuracy degradation, since the limited bandwidth of the current networking infrastructure may not satisfy the demand for transmitting high-volume datasets to the central cloud [1]. Moreover, both the central cloud and the terminals hold copies of the datasets. Hence, privacy is at risk whenever either the terminals or the central cloud are compromised by malicious users, e.g., the unexpected leakage of celebrity photos from iCloud in 2014 (see https://en.wikipedia.org/wiki/ICloud_leaks_of_celebrity_photos).
The soaring demand for bandwidth and the privacy concerns have shifted the research focus from centralized in-cloud learning networks to distributed on-device learning networks. Moreover, on-device learning networks also benefit from the improving storage volume and computational power of mobile terminals. The objective of an on-device learning network is to obtain the optimal model parameter, which provides an optimal mapping between the data samples and the labels of a dataset. For example, in the MNIST dataset (http://yann.lecun.com/exdb/mnist/), the optimal model parameter is used to map the bits of an image to the digit shown in the image.
Among the on-device learning networks, the federated-learning networks (FLNs) [1] and the decentralized-learning networks (DLNs) [2] are two promising categories, as shown in Figs. 1(a) and 1(b), respectively. As an on-device learning network recently proposed by Google, the FLN consists of a parameter server and multiple terminals, as shown in Fig. 1(a). In the FLNs, the terminals perform distributed parallel computation based on their local datasets. The parameter server updates the model parameter based on the results of the terminals. Due to the parallel computing, the FLNs converge as fast as their centralized counterpart, the in-cloud learning networks. When the parameter server experiences an unexpected outage (e.g., an energy outage at the location of the server or a hardware failure of the server), the DLN becomes a promising on-device learning network, since the DLN does not require a central parameter server [2]. In the DLNs, the terminals need to communicate with their neighbors to exchange the distributed training results. Hereinafter, two terminals are neighbors when they have a one-hop connection with each other. For example, the first and the second terminals are neighbors of each other in Fig. 1(b). When each terminal is a neighbor of all the remaining terminals, the DLN is called fully connected; otherwise, it is called partially connected. The advantages and disadvantages of in-cloud learning networks and on-device learning networks are summarized in the table of Fig. 1(c).
The FLNs and DLNs have several desirable characteristics. For example, the datasets are kept at the terminals, so the FLNs and DLNs do not hold multiple copies of the privacy-related datasets. Hence, the probability of privacy leakage is reduced. Another desirable characteristic is that terminals can flexibly join and leave the FLNs (or DLNs). In order to achieve these desirable characteristics, several challenges need to be solved, such as the uneven distribution of training datasets, the redundant exchange of model parameters, and secrecy over malfunctioning terminals [1, 3, 4]. Among these challenges, secrecy over malfunctioning terminals is the most important one, since a failure to preserve secrecy degrades the accuracy of FLNs and DLNs. However, the current literature on the FLNs (or DLNs) mostly assumes that the terminals can reliably upload (or exchange) the local results to the parameter server (or neighbors). Therefore, these works cannot be used to secure the FLNs and DLNs over malfunctioning terminals (see [1, 3, 4] and references therein).
Several recent research works [3, 4, 5, 6, 7, 8, 9, 10, 11] have reported that secure learning algorithms can be used to protect the FLNs and DLNs against multiple Byzantine adversaries. In the FLNs and DLNs, the Byzantine adversaries are the malfunctioning terminals that can 1) obtain the full malicious action space based on full knowledge of the networks; and 2) choose an arbitrarily bad action from the full malicious action space to compromise the prediction accuracy of the networks [12]. Since the action space of any other malfunctioning terminal is a subset of the full malicious action space, the Byzantine adversaries are considered the worst-case malfunctioning terminals. When a learning algorithm preserves secrecy over the Byzantine adversaries, it is robust to the full malicious action space of the networks. Thus, the learning algorithm is robust to any other malfunctioning terminals. Hereinafter, we focus on reviewing the state-of-the-art research progress on secure learning algorithms in the FLNs and DLNs with Byzantine adversaries. Our contributions are summarized as follows.

In the FLNs with Byzantine adversaries, we classify the current secure federated-learning algorithms (SFLAs) into four categories: 1) aggregation rule based SFLAs; 2) preprocess based SFLAs; 3) model based SFLAs; and 4) adversarial detection based SFLAs. We also provide qualitative comparisons of the current SFLAs.

Since the design of secure decentralized-learning algorithms (SDLAs) is at an early stage, few works have investigated the SDLAs. Therefore, we review several exemplary works on SDLAs.

We discuss several future research directions for secure learning algorithms in the FLNs and DLNs with Byzantine adversaries.
To the best of the authors’ knowledge, this is the first comprehensive overview of the secure learning algorithms in the on-device learning networks with Byzantine adversaries.
II On-Device Learning Networks With Byzantine Adversaries: Basics
II-A Functions of Devices in the On-Device Learning Networks
In the on-device learning networks with Byzantine adversaries, secure learning algorithms have two objectives: convergence and optimality [3, 4, 5, 6]. In order to achieve both goals, the following functions need to be implemented at different devices of the FLNs and DLNs.

Parameter server: In the FLNs, the updating and broadcasting of model parameter are performed at the parameter server.
II-B Description of Federated-Learning Networks
We consider an FLN with a parameter server and multiple terminals, where some of the terminals are reliable and the remaining terminals are Byzantine adversaries, as shown in Fig. 2(a). Each terminal holds randomly distributed data. The distribution of these data is unknown, and the distributions of data at different terminals can be different. Each terminal is associated with an average loss function of the model parameter, which quantifies the prediction accuracy associated with the model parameter at that terminal. Leveraging the parallel computation of the terminals, SFLAs in the FLN can obtain the optimal model parameter that minimizes the sum-of-average-loss-function (SoALF) without being impaired by the Byzantine adversaries.
The detailed procedures of a typical SFLA in the FLN with Byzantine adversaries are shown in Fig. 2(b). At the beginning of each slot, the parameter server broadcasts the global model parameter to all the terminals. After receiving the model parameter, the reliable terminals calculate the local gradient of the average loss function based on a data sample (or a mini-batch of samples, i.e., a subset of the data samples in a dataset). The Byzantine adversaries generate harmful local gradients. Then, the terminals synchronously upload the local gradients to the parameter server. Finally, the parameter server obtains a global gradient via the aggregation rule. Here, an aggregation rule defines a method for the parameter server to obtain the global gradient based on the uploaded local gradients. Using the global gradient, the parameter server updates the model parameter for the next iteration, and the process repeats until the SoALF converges.
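The broadcast, upload, aggregation and update cycle described above can be sketched as follows. This is a minimal illustration rather than the procedure of any cited work: the toy quadratic loss, the plain-mean aggregation rule, and the names `mean_rule`, `reliable_gradient` and `run_fln` are assumptions introduced here.

```python
from typing import Callable, List, Optional, Sequence

Vector = List[float]


def mean_rule(gradients: Sequence[Vector]) -> Vector:
    """Plain averaging: a non-robust baseline aggregation rule."""
    count = len(gradients)
    dim = len(gradients[0])
    return [sum(g[d] for g in gradients) / count for d in range(dim)]


def reliable_gradient(model: Vector, target: Vector) -> Vector:
    """Gradient of the toy loss 0.5 * ||model - target||^2 (an assumption)."""
    return [w - t for w, t in zip(model, target)]


def run_fln(num_slots: int,
            aggregate: Callable[[Sequence[Vector]], Vector],
            byzantine_upload: Optional[Vector] = None,
            num_reliable: int = 4,
            learning_rate: float = 0.1) -> Vector:
    """Repeat the broadcast / upload / aggregate / update cycle."""
    target = [1.0, -2.0]  # minimizer of the toy loss
    model = [0.0, 0.0]    # global model parameter held by the server
    for _ in range(num_slots):
        # Broadcast: every reliable terminal evaluates its gradient
        # at the current global model parameter.
        uploads = [reliable_gradient(model, target)
                   for _ in range(num_reliable)]
        # A Byzantine adversary may upload an arbitrary vector instead.
        if byzantine_upload is not None:
            uploads.append(list(byzantine_upload))
        # Aggregate the uploads, then update the model for the next slot.
        global_gradient = aggregate(uploads)
        model = [w - learning_rate * g
                 for w, g in zip(model, global_gradient)]
    return model


clean = run_fln(200, mean_rule)
attacked = run_fln(200, mean_rule, byzantine_upload=[100.0, 100.0])
```

In this toy setting, `clean` approaches the minimizer, whereas a single constant Byzantine upload drags `attacked` far away from it, which is why the aggregation rule is passed in as a replaceable function.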
II-C Description of Decentralized-Learning Networks
We consider a DLN with multiple terminals. As shown in Fig. 3(a), the second and the fourth terminals are Byzantine adversaries among the five terminals. The first, the third and the fifth terminals are reliable terminals. Different from the FLNs, the terminals in the DLN obtain the optimal model parameter that minimizes the SoALF in a decentralized way. During the operation of an SDLA, each terminal exchanges its local model parameter with the neighbor terminals. We take the third terminal as an example, and present the detailed procedures of a typical SDLA in the DLN with Byzantine adversaries, as shown in Fig. 3(b).
At the start of each slot, the third reliable terminal computes the value of its average loss function based on its local model parameter. Then, the third reliable terminal updates its local model parameter. Finally, the third reliable terminal broadcasts the updated local model parameter to its neighbor terminals, e.g., the second, the fourth and the fifth terminals. The Byzantine adversaries broadcast arbitrary adversarial model parameters to their neighbor terminals. All terminals repeat the above process until the SoALF converges.
II-D Stochastic Gradient, Batch Gradient and Mini-Batch Gradient
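A terminal can estimate its local gradient from all of its data samples (batch), from one randomly drawn sample (stochastic), or from a randomly drawn subset (mini-batch). As a minimal sketch of these three estimators, the code below assumes a squared-error loss on a linear model; the loss and all function names are illustrative assumptions and are not taken from the cited works.

```python
import random
from typing import List, Sequence, Tuple

Vector = List[float]
Sample = Tuple[Vector, float]  # (features, label)


def sample_gradient(model: Vector, sample: Sample) -> Vector:
    """Gradient of 0.5 * (model . x - y)^2 for a single data sample."""
    features, label = sample
    residual = sum(w * x for w, x in zip(model, features)) - label
    return [residual * x for x in features]


def average_gradient(model: Vector, samples: Sequence[Sample]) -> Vector:
    """Average of the per-sample gradients over the given samples."""
    grads = [sample_gradient(model, s) for s in samples]
    return [sum(g[d] for g in grads) / len(grads)
            for d in range(len(model))]


def batch_gradient(model: Vector, samples: Sequence[Sample]) -> Vector:
    """Use every local sample: most accurate, most expensive."""
    return average_gradient(model, samples)


def stochastic_gradient(model: Vector, samples: Sequence[Sample],
                        rng: random.Random) -> Vector:
    """Use one randomly drawn sample: cheapest, noisiest."""
    return sample_gradient(model, rng.choice(samples))


def minibatch_gradient(model: Vector, samples: Sequence[Sample],
                       batch_size: int, rng: random.Random) -> Vector:
    """Use a random subset of samples drawn without replacement."""
    return average_gradient(model, rng.sample(samples, batch_size))


data = [([1.0, 0.0], 1.0), ([0.0, 1.0], 2.0), ([1.0, 1.0], 3.0)]
model = [0.0, 0.0]
full = batch_gradient(model, data)
noisy = stochastic_gradient(model, data, random.Random(0))
mini = minibatch_gradient(model, data, 2, random.Random(0))
```

The per-call cost grows with the number of samples that are touched, which is the tradeoff between computational cost and convergence rate discussed in this subsection.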
Since the distribution of data is unknown, it is challenging to obtain the exact local gradient due to the high-dimensional integration over the unknown data distribution. Therefore, several estimators of the local gradient are used, which differ in how they use the data samples of a terminal. The batch gradient is obtained by averaging the gradient of the loss function over all data samples that the terminal holds at the beginning of the slot. Since the batch gradient needs to evaluate the gradient of the loss function over all data samples at the terminal, the computational cost at the terminal increases with the number of data samples. By sacrificing the convergence rate, the computational cost can be reduced via the stochastic gradient. The stochastic gradient is obtained by calculating the gradient of the loss function over one randomly drawn data sample. In order to balance the computational cost and the convergence rate, the mini-batch gradient can be used, which evaluates the gradient of the loss function over a mini-batch of data samples at the terminal.
III Prevalent Secure Distributed Learning Algorithms
III-A Aggregation Rule Based Algorithms
The aggregation of gradients is an important step for an SFLA in the FLNs with Byzantine adversaries. In their seminal work [3], Blanchard et al. reported that the mean-value aggregation rule outputs a sequence of global gradients that can be arbitrarily biased by a single Byzantine adversary. Hence, the learning algorithms with the mean-value aggregation rule either do not converge or converge to an ineffective model parameter. For the case where the fraction of Byzantine adversaries is bounded, Blanchard et al. proposed a secure aggregation rule named Krum [3]. The objective of Krum is to approximate the optimal global gradient, which is defined as the average of the local gradients when no Byzantine adversary exists. In each iteration, Krum selects one of the local gradients as the global gradient. Since the selected global gradient has the smallest sum of Euclidean distances to its closest local gradients, the Euclidean distance between the selected global gradient and the optimal global gradient is bounded. Therefore, the impact of Byzantine adversaries is mitigated. Moreover, Blanchard et al. also demonstrated that the selected sequence of global gradients converges almost surely [3].
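The selection step of Krum can be sketched as follows. The sketch assumes, following the definition in [3], that each candidate is scored by the sum of squared Euclidean distances to its n - f - 2 closest uploads, where n is the number of uploads and f is the assumed number of Byzantine adversaries; the function names are illustrative.

```python
from typing import List, Sequence

Vector = List[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two gradients."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def krum(gradients: Sequence[Vector], num_byzantine: int) -> Vector:
    """Return the upload whose closest neighbors are, in total, nearest."""
    n = len(gradients)
    if n <= 2 * num_byzantine + 2:
        raise ValueError("Krum needs more than 2f + 2 uploads")
    num_closest = n - num_byzantine - 2
    best_index = 0
    best_score = float("inf")
    for i, candidate in enumerate(gradients):
        # Distances from this candidate to every other upload, ascending.
        distances = sorted(squared_distance(candidate, other)
                           for j, other in enumerate(gradients) if j != i)
        score = sum(distances[:num_closest])
        if score < best_score:
            best_index, best_score = i, score
    return list(gradients[best_index])


uploads = [[1.0, 1.0], [1.1, 0.9], [0.9, 1.0], [1.0, 1.1],
           [100.0, -100.0]]  # the last upload plays the Byzantine role
selected = krum(uploads, num_byzantine=1)
naive_mean = [sum(g[d] for g in uploads) / len(uploads) for d in range(2)]
```

Here the plain mean is pulled toward the outlier, while the selected gradient is one of the clustered uploads.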
Compared with the mean value of a sequence, the median gives a reliable estimate when more than half of the sequence is correct. Hence, several aggregation rules have been proposed based on variants of the median [4, 5]. For multi-dimensional local gradients, Chen et al. defined the geometric median (GeoMed) as the minimizer of the sum of distances to all local gradients [4]. With the GeoMed, the parameter server outputs a global gradient that secures the distributed learning algorithms. When a convex SoALF is used, Chen et al. proved that the SFLAs with the GeoMed aggregation rule are Byzantine-robust when the number of Byzantine adversaries is less than the number of reliable terminals [4]. In addition, Chen et al. also quantified the convergence rate of each iteration [4].
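Since the geometric median has no closed form, it must be computed numerically. As a minimal sketch, the code below uses the classical Weiszfeld iteration; this solver is an assumption made for illustration and is not necessarily the one used in [4]. The iteration count and the clamping constant `eps` are likewise illustrative.

```python
import math
from typing import List, Sequence

Vector = List[float]


def geometric_median(points: Sequence[Vector],
                     iterations: int = 200,
                     eps: float = 1e-12) -> Vector:
    """Approximate the minimizer of the sum of Euclidean distances."""
    dim = len(points[0])
    # Start from the coordinate-wise mean of the points.
    estimate = [sum(p[d] for p in points) / len(points) for d in range(dim)]
    for _ in range(iterations):
        # Each point is weighted by the inverse of its distance to the
        # current estimate; eps avoids a division by zero.
        weights = [1.0 / max(math.dist(estimate, p), eps) for p in points]
        total = sum(weights)
        estimate = [sum(w * p[d] for w, p in zip(weights, points)) / total
                    for d in range(dim)]
    return estimate


uploads = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [1000.0, 1000.0]]
median = geometric_median(uploads)
mean = [sum(p[d] for p in uploads) / len(uploads) for d in range(2)]
```

In this example the mean moves to 250 in each coordinate, while the geometric median stays with the majority of the uploads.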
For the case where the fraction of Byzantine adversaries is bounded, Yin et al. proposed two aggregation rules [5]: the component-wise median (CwMed) and the component-wise trimmed mean (CwTM). The CwMed constructs a global gradient in which each entry is the median of the entries of the local gradients at the same coordinate [5, Def. 1]. With CwMed, each entry of the global gradient is not polluted by the Byzantine adversaries. Different from CwMed, the CwTM first removes, coordinate by coordinate, a fraction of the largest and of the smallest entries of the local gradients [5, Def. 2]. Then, the CwTM constructs a global gradient in which each entry is the mean value of the remaining entries of the local gradients. The CwTM needs to know the number of Byzantine adversaries so that the faulty entries of the Byzantine adversaries are removed with high probability. Yin et al. also quantified the convergence rate of SFLAs with the CwMed or CwTM aggregation rules for strongly convex, non-strongly convex and smooth nonconvex SoALFs [5].
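Both component-wise rules can be sketched in a few lines. In the sketch below, the trimming amount is passed as a count `trim_count` (the number of entries removed from each end of every coordinate) rather than as a fraction; this parameterization and the function names are assumptions made for illustration.

```python
import statistics
from typing import List, Sequence

Vector = List[float]


def cw_median(gradients: Sequence[Vector]) -> Vector:
    """Component-wise median of the uploaded gradients."""
    dim = len(gradients[0])
    return [statistics.median(g[d] for g in gradients) for d in range(dim)]


def cw_trimmed_mean(gradients: Sequence[Vector], trim_count: int) -> Vector:
    """Per coordinate, drop the trim_count largest and smallest entries,
    then average the entries that remain."""
    if 2 * trim_count >= len(gradients):
        raise ValueError("trimming would remove every entry")
    dim = len(gradients[0])
    result = []
    for d in range(dim):
        column = sorted(g[d] for g in gradients)
        kept = column[trim_count:len(column) - trim_count]
        result.append(sum(kept) / len(kept))
    return result


uploads = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0],
           [1000.0, -1000.0]]  # the last upload plays the Byzantine role
by_median = cw_median(uploads)
by_trimmed_mean = cw_trimmed_mean(uploads, trim_count=1)
```

Both rules only sort and average the entries of each coordinate, which is why their cost is low compared with solving an optimization problem.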
Since the GeoMed aggregation rule requires solving a convex optimization problem to obtain the global gradient, its computational complexity increases exponentially with the dimension of the model parameter. Since Krum, CwMed and CwTM only include linear algebraic operations to obtain the model parameter, these three aggregation rules have lower computational complexity than GeoMed.
III-B Preprocessing Based Algorithms
In addition to the aggregation rule, several preprocessing methods that utilize the local gradients have been investigated to design the SFLAs [6, 7]. For example, Chen et al. proposed a preprocess based SFLA [7], known as DRACO, in the FLNs. DRACO leverages coding theory to remove the negative effect (i.e., the degradation of prediction accuracy) of Byzantine adversaries. From the information-theoretic perspective, the optimal gradient can be recovered when the reliable terminals report sufficient information to the parameter server. Specifically, Chen et al. demonstrated that DRACO secures the learning algorithms when the number of Byzantine adversaries is less than half of the average number of gradients computed per data sample (i.e., the redundancy ratio is higher than twice the number of Byzantine adversaries) [7]. Moreover, the authors illustrated two practical coding schemes to implement DRACO. Since DRACO secures the learning algorithms via redundant storage of data samples, DRACO does not impose a limitation on the fraction of Byzantine adversaries and can secure the traditional learning algorithms. Besides, the convergence behavior of DRACO highly depends on the learning procedure that is used.
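The redundancy idea can be illustrated with a simple repetition scheme: several terminals compute the gradient of the same data, and the parameter server keeps the value reported by a strict majority. The sketch below is written in the spirit of a repetition code only; the actual encoding and decoding of DRACO in [7] may differ, the grouping of terminals is hypothetical, and reliable replicas are assumed to report bit-identical gradients.

```python
from collections import Counter
from typing import List, Sequence

Vector = List[float]


def majority_vote(copies: Sequence[Vector]) -> Vector:
    """Return the gradient reported by a strict majority of a group."""
    counts = Counter(tuple(c) for c in copies)
    winner, votes = counts.most_common(1)[0]
    if votes <= len(copies) // 2:
        raise ValueError("no strict majority: too many faulty replicas")
    return list(winner)


def decode_groups(groups: Sequence[Sequence[Vector]]) -> Vector:
    """Recover one gradient per group, then average across the groups."""
    recovered = [majority_vote(group) for group in groups]
    dim = len(recovered[0])
    return [sum(g[d] for g in recovered) / len(recovered)
            for d in range(dim)]


# Two groups of three replicas each; one replica per group is Byzantine.
groups = [
    [[1.0, 2.0], [1.0, 2.0], [9.0, 9.0]],
    [[3.0, 4.0], [-5.0, 0.0], [3.0, 4.0]],
]
global_gradient = decode_groups(groups)
```

With three replicas per group, one faulty replica per group is outvoted, which mirrors the requirement that the redundancy exceed twice the number of Byzantine adversaries.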
El Mhamdi et al. proposed a preprocess method named Bulyan [6]. Bulyan uses Krum [3] to obtain a subset of the uploaded local gradients. The parameter server then constructs a global gradient by taking the component-wise average of the refined subset of local gradients. We observe that Bulyan adds extra linear operations to Krum. Hence, Bulyan converges when convex and nonconvex SoALFs are used. Bulyan guarantees that the prediction accuracy is not affected by the dimension of data samples. Since Bulyan refines Krum by removing several unnecessary local gradients, it has a stricter limitation on the number of Byzantine adversaries. More specifically, the number of Byzantine adversaries must be less than 25% of the number of terminals in the FLNs.
III-C Model Based Algorithms
The SFLAs can also be designed during the formulation of the problem model in the FLNs. Li et al. proposed a problem model based SFLA [8], which is named the Byzantine-robust stochastic aggregation (BRSA) algorithm. In the BRSA algorithm, a regularization term is introduced to the SoALF such that the uploaded local gradients of reliable terminals and Byzantine adversaries are discretized into a finite set. Using the discrete local gradients, the BRSA algorithm converges to a suboptimal model parameter. Since the problem model used by BRSA is secured, the BRSA algorithm converges without a limitation on the fraction of Byzantine adversaries [8]. For the convex SoALF, Li et al. quantitatively analyzed the relation between the accuracy of the BRSA algorithm and the number of Byzantine adversaries [8]. By selecting different regularization terms, the convergence rate can be traded for a reduction of the gap between the optimal model parameter and the convergent model parameter. Besides, the BRSA algorithm allows the data samples at different terminals to have different distributions.
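The effect of the regularization term can be sketched as follows, assuming an l1-norm penalty between the server's model and the model received from each terminal. Under this assumption, the subgradient of the penalty is an entry-wise sign, so each terminal contributes only -1, 0 or +1 per coordinate. The server's own regularizer is omitted, and the function names, penalty weight and step size are illustrative rather than taken from [8].

```python
from typing import List, Sequence

Vector = List[float]


def sign(value: float) -> int:
    """Return -1, 0 or +1 according to the sign of value."""
    return (value > 0) - (value < 0)


def server_step(server_model: Vector,
                received_models: Sequence[Vector],
                penalty: float,
                learning_rate: float) -> Vector:
    """Move the server model by the signs of its disagreement with
    the received models (subgradient of the assumed l1 penalty)."""
    updated = []
    for d, w0 in enumerate(server_model):
        pull = sum(sign(w0 - model[d]) for model in received_models)
        updated.append(w0 - learning_rate * penalty * pull)
    return updated


def terminal_step(local_model: Vector, local_gradient: Vector,
                  server_model: Vector,
                  penalty: float, learning_rate: float) -> Vector:
    """Local gradient step plus the sign-based pull toward the server."""
    return [w - learning_rate * (g + penalty * sign(w - w0))
            for w, g, w0 in zip(local_model, local_gradient, server_model)]


honest = [[1.0], [1.0], [1.0]]
mild = server_step([0.0], honest + [[2.0]], penalty=0.5, learning_rate=0.1)
wild = server_step([0.0], honest + [[1e9]], penalty=0.5, learning_rate=0.1)
```

Because only the sign of the disagreement enters the update, an arbitrarily large Byzantine value moves the server model no more than a mildly wrong one.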
III-D Adversarial Detection Based Algorithms
While the previous works mainly focused on designing Byzantine-resilient learning algorithms, an extra step that detects Byzantine adversaries can be included before the aggregation of local gradients. When the SoALF is convex and the number of Byzantine adversaries is less than that of reliable terminals, Alistarh et al. proposed the Byzantine stochastic gradient descent (SGD) algorithm to detect the Byzantine adversaries [9]. This two-threshold approach is motivated by the facts that reliable terminals introduce only: 1) a limited variation of the time-average local gradients, and 2) a limited fluctuation of the time-average inner products of the local gradient and the variation of the model parameter. When either of the two facts is violated, the terminal is detected as a Byzantine adversary. For a Byzantine adversary that does not violate the two thresholds, Alistarh et al. demonstrated that the Byzantine adversary cannot compromise the convergence and accuracy of the Byzantine SGD algorithm [9]. Based on the detection of Byzantine adversaries, the number of local gradients is reduced to one such that the computational complexity at the terminals is reduced. Theoretical analysis shows that the proposed detection method achieves the optimal number of gradient exchanges between the parameter server and the terminals [9, Theorems 3.4 and 3.5].
III-E Qualitative Comparisons
While GeoMed, Byzantine SGD and BRSA are effective for convex loss functions, Krum, CwTM, CwMed, Bulyan and DRACO work for both convex and nonconvex loss functions. Since Bulyan is a refinement of Krum, it has stricter limitations than Krum on the number of Byzantine adversaries. Table I presents a coarse-grained comparison among the state-of-the-art SFLAs. Note that the complexity in Table I indicates the computational complexity of obtaining the global gradient at the parameter server. The computational complexity scales with the number of Byzantine adversaries, the dimension of the model parameter, and the number of terminals.
Scheme | Loss Function | % of Byzantine Adversaries | Sample Redundancy | Complexity
Krum [3] | Convex, nonconvex | | | Medium
GeoMed [4] | Convex | | | High
CwTM [5] | Convex, nonconvex | | | Low-Medium
CwMed [5] | Convex, nonconvex | | | Low-Medium
BRSA [8] | Convex | N/A | | Low
Bulyan [6] | Convex, nonconvex | | | Medium-High
DRACO [7] | Convex, nonconvex | N/A | | Low
Byzantine SGD [9] | Convex | | | Low
III-F A Case Study
The loss function takes the form of the regularized softmax regression. We consider 10 terminals, among which there are two Byzantine adversaries. All the results are obtained on the MNIST dataset with separate training and test samples. We consider two types of attacks: 1) reverse attacks; and 2) Gaussian attacks. In the reverse attacks, the Byzantine adversaries upload to the parameter server local gradients that are scaled by a negative value. In the Gaussian attacks, the Byzantine adversaries upload local gradients in which each entry follows a Gaussian distribution with mean zero and a Gaussian distributed variance. The variance of the inner Gaussian distribution is fixed in our simulations. The model parameter is updated by a mini-batch SGD method. Hereinafter, the simulations are run for a fixed number of iterations with a fixed learning rate.
Figures 4(a)-4(c) illustrate that the Byzantine adversaries can compromise the prediction accuracy of the classical learning algorithms. As shown in Fig. 4(a), the model parameter obtained by the classical learning algorithms returns two errors. When the Byzantine adversaries are included (e.g., reverse attacks in Fig. 4(b) and Gaussian attacks in Fig. 4(c)), the prediction accuracy of the classical learning algorithms is significantly compromised.
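The two attack models used in this case study can be sketched as follows. The scale factor and the standard deviation are left as parameters because no specific values are assumed here, and the Gaussian attack is simplified to a fixed standard deviation; the function names are illustrative.

```python
import random
from typing import List

Vector = List[float]


def reverse_attack(true_gradient: Vector, scale: float) -> Vector:
    """Upload the true local gradient multiplied by a negative scale."""
    if scale >= 0:
        raise ValueError("a reverse attack uses a negative scale")
    return [scale * g for g in true_gradient]


def gaussian_attack(dim: int, std: float, rng: random.Random) -> Vector:
    """Upload zero-mean Gaussian noise in every entry (fixed std here,
    which simplifies the random-variance model of the case study)."""
    return [rng.gauss(0.0, std) for _ in range(dim)]


reversed_upload = reverse_attack([1.0, -2.0], scale=-4.0)
noise_upload = gaussian_attack(5, std=10.0, rng=random.Random(0))
```

Either function can replace the upload of a reliable terminal in a simulation of the FLN to reproduce the kind of degradation shown in Fig. 4.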
Figures 5(a) and 5(b) show the accuracy of four aggregation rules: 1) Krum [3], 2) GeoMed [4], 3) CwMed [5] and 4) CwTM [5]. The benchmark scheme is a classical learning algorithm in which the model parameter is updated by the mini-batch SGD method and all terminals are reliable. The Krum, GeoMed and CwTM algorithms perform similarly under the reverse attacks. Under the Gaussian attacks, the GeoMed algorithm obtains a near-optimal model parameter and outperforms the remaining aggregation rules. However, the GeoMed algorithm requires the parameter server to solve an optimization problem to update the model parameter in each iteration. The Krum, CwMed and CwTM algorithms rely on simple algebraic manipulations; therefore, they have lower computational complexities than the GeoMed algorithm. Hence, we conclude that different aggregation rules provide different tradeoffs between accuracy and computational complexity.
IV Prevalent Secure Decentralized Learning Algorithms
Before proceeding to discuss the SDLAs, we first define the trim operation used in DLNs with Byzantine adversaries. Performing the trim operation on a sequence of scalar values means that a given number of the largest values and of the smallest values are removed from the sequence. In this section, we present two exemplary works for fully connected DLNs [10] and partially connected DLNs [11] with a scalar model parameter. In the DLNs, each terminal needs to converge to the same local model parameter, i.e., the consensus model parameter. However, the Byzantine adversaries can broadcast different fake model parameters to their neighbors such that the convergent model parameter is biased away from the consensus model parameter. As a result, the prediction accuracy of the convergent model parameter is significantly degraded.
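The trim operation defined at the start of this section can be sketched as follows for a sequence of scalars. The parameter `count`, i.e., the number of values removed from each end, and the helper `trimmed_mean` are assumptions made for illustration.

```python
from typing import List, Sequence


def trim(values: Sequence[float], count: int) -> List[float]:
    """Remove the count largest and the count smallest values."""
    if count < 0 or 2 * count >= len(values):
        raise ValueError("count must leave at least one value")
    ordered = sorted(values)
    return ordered[count:len(ordered) - count]


def trimmed_mean(values: Sequence[float], count: int) -> float:
    """Average of the values that survive the trim operation."""
    kept = trim(values, count)
    return sum(kept) / len(kept)


received = [5.0, 1.0, 9.0, 3.0, 7.0]
kept = trim(received, 1)
robust = trimmed_mean([1.0, 2.0, 3.0, 1e9, -1e9], 1)
```

Extreme values broadcast by Byzantine neighbors fall into the removed ends of the sorted sequence, so they do not enter the subsequent update.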
In the proposed SDLA for fully connected DLNs, each terminal exchanges its local model parameter and local gradient with the neighbors [10]. Different from [10], the SDLA for partially connected DLNs [11] only requires each terminal to exchange the local gradient with the neighbors. Therefore, each terminal maintains a local-gradient sequence [11] (or a local-model-parameter sequence and a local-gradient sequence [10]). In each iteration, each terminal performs the trim operation on the local-gradient sequence [11] (or on the local-model-parameter sequence and the local-gradient sequence [10]). Then, each terminal updates the model parameter based on the local-gradient sequence [11] (or on the local-model-parameter sequence and the local-gradient sequence [10]).
In order to obtain the consensus model parameter, the fully connected DLNs require that the number of reliable terminals be more than twice the number of Byzantine adversaries. For the partially connected DLNs, the requirements depend on the connectivity of the terminals. When there are Byzantine adversaries in the DLNs, the consensus-based SDLA converges if and only if the partially connected DLNs are robust [11]. When the number of Byzantine-adversary neighbors of each terminal is bounded, the consensus-based SDLA converges to the consensus model parameter given the robust DLNs [11].
V Future Research Directions
V-A Bandwidth Reduction of SFLAs
In the FLNs, frequent uploads of gradients from the terminals to the parameter server are inevitable for the SFLAs. With high-dimensional gradients and a large number of iterations, the bandwidth demand remains high during the learning process of the FLNs. In order to reduce the bandwidth demand, Chen et al. proposed a lazily aggregated gradient method in which the terminals can adaptively skip the gradient calculation and the gradient exchange [13]. Moreover, the authors also theoretically demonstrated that the lazily aggregated gradient method has the same convergence rate as, and a lower communication complexity than, the batch gradient descent method. However, the convergence and optimality of the lazily aggregated gradient method are unknown when multiple Byzantine adversaries exist in the FLNs. The state-of-the-art SFLAs require the terminals to upload local gradients in each iteration. The communication complexity thus becomes a major obstacle to scaling up the SFLAs in the FLNs. In order to reduce the communication complexity of SFLAs, one promising method is to exclude the redundant gradients exchanged by the reliable terminals. Besides, compressing the gradients exchanged in each iteration alleviates the scarcity of bandwidth. In order to compensate for the loss incurred by compression, the SFLAs with gradient compression also need to include methods such as momentum correction and local gradient clipping.
V-B Multi-Task SFLAs
While the current learning algorithms mainly focus on obtaining a single global model, it is attractive to concurrently compute multiple models when these models are correlated. The correlation can be induced by similar behaviors of different terminals, such as reposting similar news on social media and watching popular video episodes over the internet. Moreover, the parallel computation of multiple models brings several benefits, such as improved learning efficiency and prediction accuracy. Therefore, multi-task learning algorithms have been proposed for the FLNs [14]. In the presence of Byzantine adversaries, one potential research direction is to investigate secure multi-task learning algorithms (SMTLAs) in the FLNs. The SMTLAs need to converge to a near-optimal model parameter without being falsified by Byzantine adversaries, and also need to keep the communication expenditure as low as possible. The SMTLAs can be developed based on the aforementioned aggregation rules and regularization term. However, the convergence property and the communication expenditure remain unknown when the aggregation rules and the regularization term are used. Moreover, introducing a precoding process for the update of gradients can also secure the multi-task learning algorithms in the FLNs.
V-C Stochastic SDLAs
Since the data samples are locally collected at each terminal in the DLNs, each terminal may have a model parameter that is similar to, but slightly different from, those of its neighbors [15]. As pointed out by Koppel et al. in [15], proximity constraints among the neighboring terminals are a promising technique to formulate such small differences in the DLNs. In order to avoid multiple evaluations of the average loss functions, the stochastic saddle point algorithm was introduced, which allows each terminal to access its local loss function once in each iteration [15]. However, the convergence and optimality of the stochastic saddle point algorithm can be compromised by multiple Byzantine adversaries in the DLNs. While the algorithms proposed in [10, 11] are secure for DLNs with Byzantine adversaries and a scalar model parameter, it is more practical to learn a high-dimensional model parameter due to the ever-increasing volume of datasets. While the regularization term used in [8] is promising for handling the Byzantine adversaries in the FLNs, it remains an open problem to design the regularization term in the DLNs with Byzantine adversaries. When the heterogeneity of data samples is considered, the design of multi-task SDLAs is also an interesting research problem in the DLNs.
VI Conclusions
Since the current learning algorithms are vulnerable to Byzantine adversaries, we provided a comprehensive overview of the SFLAs and SDLAs in the FLNs and DLNs, respectively. The Byzantine adversaries are considered because they can act arbitrarily to compromise the classical learning algorithms. Therefore, the secure learning algorithms that are robust to the Byzantine adversaries can work under any attack from the terminals. We presented the signaling-exchange procedures of the secure learning algorithms in both FLNs and DLNs when Byzantine adversaries coexist with the reliable terminals. Numerous state-of-the-art secure learning algorithms were discussed in terms of their main contributions in the FLNs and DLNs. Several future research directions were discussed for the secure learning algorithms in the FLNs and DLNs.
References
[1] J. Konečný, B. McMahan, and D. Ramage, “Federated optimization: Distributed optimization beyond the datacenter,” arXiv preprint arXiv:1511.03575, 2015.
[2] B. Ying, K. Yuan, and A. H. Sayed, “Supervised learning under distributed features,” IEEE Trans. Signal Process., vol. 67, no. 4, pp. 977–992, Feb. 2019.
[3] P. Blanchard, E. M. El Mhamdi, R. Guerraoui, and J. Stainer, “Machine learning with adversaries: Byzantine tolerant gradient descent,” in Proc. Advances in Neural Information Processing Systems (NIPS), Long Beach, USA, Dec. 2017, pp. 119–129.
[4] Y. Chen, L. Su, and J. Xu, “Distributed statistical machine learning in adversarial settings: Byzantine gradient descent,” Proc. ACM on Measurement and Analysis of Computing Systems (SIGMETRICS), vol. 1, no. 2, pp. 44:1–44:25, Dec. 2017.
[5] D. Yin, Y. Chen, R. Kannan, and P. Bartlett, “Byzantine-robust distributed learning: Towards optimal statistical rates,” in Proc. International Conference on Machine Learning (ICML), Stockholmsmässan, Stockholm, Sweden, July 2018, pp. 5650–5659.
[6] E. M. El Mhamdi, R. Guerraoui, and S. Rouault, “The hidden vulnerability of distributed learning in Byzantium,” in Proc. International Conference on Machine Learning (ICML), Stockholmsmässan, Stockholm, Sweden, July 2018, pp. 3521–3530.
[7] L. Chen, H. Wang, Z. Charles, and D. Papailiopoulos, “DRACO: Byzantine-resilient distributed training via redundant gradients,” in Proc. International Conference on Machine Learning (ICML), Stockholmsmässan, Stockholm, Sweden, July 2018, pp. 903–912.
[8] L. Li, W. Xu, T. Chen, G. B. Giannakis, and Q. Ling, “RSA: Byzantine-robust stochastic aggregation methods for distributed learning from heterogeneous datasets,” in Proc. AAAI Conference on Artificial Intelligence, to be published.
[9] D. Alistarh, Z. Allen-Zhu, and J. Li, “Byzantine stochastic gradient descent,” in Proc. Advances in Neural Information Processing Systems (NIPS), Montreal, Canada, Dec. 2018, pp. 4614–4624.
[10] L. Su and N. H. Vaidya, “Fault-tolerant multi-agent optimization: Optimal iterative distributed algorithms,” in Proc. ACM Symposium on Principles of Distributed Computing, Chicago, Illinois, USA, July 2016, pp. 425–434.
[11] S. Sundaram and B. Gharesifard, “Distributed optimization under adversarial nodes,” IEEE Trans. Autom. Control, vol. 64, no. 3, pp. 1063–1076, Mar. 2019.
[12] L. Lamport, R. Shostak, and M. Pease, “The Byzantine generals problem,” ACM Transactions on Programming Languages and Systems, vol. 4, no. 3, pp. 382–401, July 1982.
[13] T. Chen, G. B. Giannakis, T. Sun, and W. Yin, “LAG: Lazily aggregated gradient for communication-efficient distributed learning,” in Proc. Advances in Neural Information Processing Systems (NIPS), Montreal, Canada, Dec. 2018, pp. 5050–5060.
[14] V. Smith, C.-K. Chiang, M. Sanjabi, and A. S. Talwalkar, “Federated multi-task learning,” in Proc. Advances in Neural Information Processing Systems (NIPS), Long Beach, USA, Dec. 2017, pp. 4424–4434.
[15] A. Koppel, B. M. Sadler, and A. Ribeiro, “Proximity without consensus in online multiagent optimization,” IEEE Trans. Signal Process., vol. 65, no. 12, pp. 3062–3077, June 2017.