1 Introduction
We consider the problem of Byzantine fault-tolerance in synchronous parallelized learning that is founded on the parallelized stochastic gradient descent (parallelized-SGD) method.
The system comprises a master, workers, and () data points denoted by a set . The system architecture is shown in Figure 1. Let be a positive integer, and let denote the set of -dimensional real-valued vectors. For a global parameter , each data point has a non-negative loss function . The goal for the master is to learn a parameter that is a minimum point of the average loss evaluated for the data points (a local minimum point if the average loss function is non-convex, or a global minimum point if it is convex). Formally, minimizes in a neighbourhood of . Although may not be the only minimum point, for simplicity denotes a minimum point for the average loss throughout this report.
The optimization framework forms the basis for most contemporary learning methods, including neural networks and support vector machines [4].
1.1 Overview of the parallelized-SGD method
The parallelized-SGD method is an expedited variant of the stochastic gradient descent method, an iterative learning algorithm [24]. In each iteration , the master maintains an estimate of , and updates it using gradients of the loss functions for a certain number of randomly chosen data points at . The details of the algorithm are as follows.
In each iteration , the master randomly chooses a set of data points, denoted by , and assigns data points to the th worker for , such that . Let the data points assigned to the th worker in the th iteration be denoted by . Each worker computes the gradients for the loss functions of its assigned points at ,
and sends a symbol , which is a function of its computed gradients , to the master. The master obtains the average value of the gradients for all the data points in ,
as a function of the symbols received from the workers. For example, if each worker sends symbol
then
Upon obtaining , the master updates the parameter estimate as
(1) 
where is a positive real value commonly referred to as the ‘step-size’. An illustration of the parallelized-SGD method is presented in Figure 1 for the case when .
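As a rough illustration of one iteration of the method described above, the following Python sketch assumes a scalar parameter, a squared loss, and workers that each send the sum of their gradients as their symbol. The loss and the names `loss_gradient`, `worker_symbol` and `master_update` are hypothetical choices made for this sketch and are not taken from this report.

```python
import random

def loss_gradient(w, z):
    """Gradient in w of the assumed loss 0.5 * (w - z)**2 (an illustrative choice)."""
    return w - z

def worker_symbol(w, points):
    """Assumed symbol: the sum of the worker's gradients at the estimate w."""
    return sum(loss_gradient(w, z) for z in points)

def master_update(w, data, n_workers, m, step_size, rng):
    """One parallelized-SGD iteration, assuming every worker is honest."""
    batch = rng.sample(data, m)  # m randomly chosen data points
    # Round-robin split: every chosen point goes to exactly one worker.
    shares = [batch[i::n_workers] for i in range(n_workers)]
    symbols = [worker_symbol(w, share) for share in shares]
    avg_gradient = sum(symbols) / m  # average gradient over the chosen points
    return w - step_size * avg_gradient  # gradient-descent update, as in (1)

if __name__ == "__main__":
    rng = random.Random(0)
    data = [rng.gauss(3.0, 1.0) for _ in range(1000)]
    w = 0.0
    for _ in range(200):
        w = master_update(w, data, n_workers=4, m=20, step_size=0.1, rng=rng)
    print(w)  # expected to settle near the sample mean of the data
```

Since a single worker's symbol enters the average directly, one faulty symbol can move the update arbitrarily, which is the vulnerability discussed next.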
1.2 Vulnerability against Byzantine workers
The above parallelized-SGD method is not robust against Byzantine faulty workers. Byzantine workers need not follow the master’s instructions correctly, and might send malicious incorrect (or faulty) symbols. The identity of the Byzantine workers remains fixed throughout the learning algorithm, and is unknown a priori to the master.
We consider a case where up to () of the workers are Byzantine faulty. Our objective is to design a parallelized-SGD method that has exact fault-tolerance, which is defined as follows.
Definition 1.
A parallelized-SGD method has exact fault-tolerance if the master asymptotically converges to a minimum point exactly, despite the presence of Byzantine workers.
2 Proposed Solutions and Contributions
We propose two coding schemes, one deterministic and the other randomized, for guaranteeing exact fault-tolerance if . Obviously, the master cannot tolerate more than or equal to Byzantine workers [5]. Overviews of each of these schemes are presented below. Before we proceed with the summary of our contributions and the overviews of the proposed coding schemes, let us define the computation efficiency of a coding scheme.
Definition 2.
The computation efficiency of a coding scheme is the ratio of the number of gradients used for parameter update, given in (1), to the number of gradients computed by the workers in total.
For example, in each iteration of the parallelized-SGD method presented above, the total number of gradients computed by the workers is equal to , and the master uses the average of all the gradients to update the parameter estimate (1). Therefore, the computation efficiency of a coding scheme (used for computing the symbols ) in the traditional parallelized-SGD method is equal to 1.
Summary of contributions:
The computation efficiency of our deterministic coding scheme is twice as high as that of the fault-correction-code-based scheme proposed by Chen et al., 2018 [5]. To improve upon the computation efficiency of the deterministic coding scheme, we propose a randomization technique. The computation efficiency of the randomized scheme is optimal in expectation, and compares favorably to any coding scheme for tolerating Byzantine workers in the considered parallelized learning setting.
2.1 Overview of the deterministic scheme
For each iteration , after choosing the data points, the master assigns each data point to workers. Each worker computes gradients for all its data points, and sends a symbol to the master such that the collection of symbols forms a fault-detection code, i.e. the master can detect up to faulty symbols, and the average of the gradients (for all the data points) is a function of the non-faulty symbols. Upon detecting any fault(s), the master imposes reactive redundancy where each data point (or data point specific to the detected fault(s)) is assigned to additional workers. Each worker now computes gradients for the additional data points assigned, and sends symbols that enable the master to identify up to faulty symbols in . Upon identifying the Byzantine workers that sent faulty symbols, the master can recover the correct average of the gradients. Hence, the scheme guarantees exact fault-tolerance.
A simple example illustrating the scheme is presented in Figure 2. A replication code for the generic case is presented in Section 4.1.
We note the following generalizations, and drawback of the scheme.

Generalizations:

In general, any suitable fault-detection code may be used in this scheme; we use a replication code as an example. The choice of the code affects the communication and computation efficiency of the scheme. However, a deterministic scheme that obtains exact fault-tolerance cannot have computation efficiency greater than in all iterations.

Drawback: In the deterministic scheme, each gradient is computed by workers even when all the Byzantine workers send non-faulty (or correct) symbols. In other words, even when all the workers send correct symbols. This unnecessary redundancy can be significantly reduced by using a randomized approach presented below.
2.2 Overview of the randomized scheme
The master checks for faults only in intermittent iterations chosen at random, instead of all the iterations. Alternatively, in each iteration, the master does a fault-check with some non-zero probability less than 1. By doing so, the master significantly reduces the redundancy in gradients’ computations whilst almost surely identifying the Byzantine workers that send faulty symbols eventually. (As the parallelized-SGD method converges to the learning parameter regardless of the initial parameter estimate, a Byzantine worker that eventually stops sending faulty gradients poses no harm to the learning process. Hence, the master only needs to identify Byzantine workers that send faulty gradient(s) eventually.) As in the deterministic scheme, upon detecting any fault(s) the master imposes reactive redundancy to identify the responsible Byzantine worker(s). However, correcting the detected fault(s) is optional. The identified Byzantine worker(s) are eliminated from the subsequent iterations.
An illustration of the scheme is presented in Figure 3. Additional details for the generic case are presented in Section 4.2.
Significant savings on redundancy: By reducing the probability of random fault-checks, the expected computation efficiency of the scheme can be made as close to 1 as desired. Note that a coding scheme that obtains exact fault-tolerance against a non-zero number of Byzantine workers cannot have an expected computation efficiency of 1.
We note the following generalizations, and adaptation of the randomized scheme:

Generalization:

Obviously, as in the deterministic case, the randomized scheme can be easily generalized for compressed gradients.


Adaptation: A lower probability of fault-checks implies a higher probability of using faulty gradients for parameter update, and vice versa. A higher probability of faulty updates means a higher probability of slower convergence of the learning algorithm. To manage the trade-off between the computation efficiency and the rate of learning, we present an adaptive approach in Section 4.3. Essentially, the master may vary the probability of fault-checks depending upon the observed average loss at the current parameter estimate.
3 Related works
There has been some work on coding schemes for Byzantine fault-tolerance in parallelized machine learning, such as [5, 7, 17]. The scheme proposed by Data et al., 2018 [7], however, is only applicable for loss functions whose arguments are linear in the learning parameter. The scheme, named DRACO, by Chen et al., 2018 [5] relies on fault-correction codes and so has a computation efficiency of only . At the expense of exact fault-tolerance, the computation efficiency of DRACO can be improved using gradient-filters [17]. Our randomized scheme has both exact fault-tolerance and favourable computation efficiency.
The fault-tolerance properties of the known gradient-filters – KRUM [3], trimmed-mean [23], median [23], geometric median of means [6], norm clipping [11], SEVER [8], or others [14, 16] – rely on additional assumptions either on the distribution of the data points or the fraction of Byzantine workers. Moreover, the existing gradient-filters do not obtain exact fault-tolerance unless there are redundant data points.
To the best of our knowledge, none of the prior works have proposed the idea of reactive redundancy for tolerating Byzantine workers efficiently in the context of parallelized learning. In other contexts, such as checkpointing and rollback recovery, mechanisms that combine proactive and reactive redundancy have been utilized. For instance, Pradhan and Vaidya [15] propose a mechanism where a small number of replicas are utilized proactively to allow detection of faulty replicas; when a faulty replica is detected, additional replicas are employed to isolate the faulty replicas.
4 Coding Schemes
In this section, we present a specific deterministic scheme for the generic case, and present further details for the randomized scheme.
4.1 Deterministic coding scheme
As an example of the deterministic scheme, we use a replication code. For simplicity, suppose that none of the Byzantine workers have been identified until iteration . Then, the scheme for the iteration is as follows.
The master (randomly) chooses data points, and assigns each data point to workers. Thus, each worker, on average, gets data points. Upon computing the gradients for all its data points (at ), each worker sends a symbol: a tuple of its computed gradients. Consequently, the master receives copies (or replicas) of each data point’s loss function’s gradient. As there are at most Byzantine workers, the master can detect if the received copies of a gradient are faulty by simply comparing them with each other. Suppose that the copies of the gradient of a particular data point are faulty (i.e. they are not unanimous). Then, the master imposes reactive redundancy where it reassigns to additional workers that compute and send additional copies of the gradient for . Upon acquiring copies of ’s gradient, the master can not only obtain the correct gradient by majority voting, but also identify the responsible Byzantine worker(s). Ultimately, the master recovers the correct gradients for all the data points, and updates as (1).
The identified Byzantine worker(s) are eliminated from the subsequent iterations. Upon updating and , the above scheme is repeated for the iteration.
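The detection and reactive-redundancy steps for a single data point can be sketched as follows. This is a minimal sketch that assumes a replication code with f + 1 initial replicas and f additional replicas (so 2f + 1 in total), where f is the symbol assumed here for the maximum number of Byzantine workers. The function names, the toy loss and the worker interface are hypothetical.

```python
from collections import Counter

def true_gradient(w, z):
    """Gradient of the assumed loss 0.5 * (w - z)**2, returned as a 1-tuple."""
    return (w - z,)

def replicated_gradient(w, z, workers, first_group, reserve_group):
    """Return (gradient, identified_byzantine_ids) for one data point z.

    workers: dict mapping worker id -> function (w, z) -> reported gradient.
    first_group: ids of the f + 1 workers initially assigned to z.
    reserve_group: ids of f further workers, used only if a fault is detected.
    Honest workers are assumed to compute bit-identical gradients.
    """
    reports = {i: workers[i](w, z) for i in first_group}
    if len(set(reports.values())) == 1:
        # Unanimous: among f + 1 replicas at least one is honest,
        # so the common value must be the correct gradient.
        return next(iter(reports.values())), set()
    # Fault detected: reactive redundancy raises the total to 2f + 1 replicas.
    for i in reserve_group:
        reports[i] = workers[i](w, z)
    # At least f + 1 of the 2f + 1 replicas are honest, so the majority is correct.
    majority, _ = Counter(reports.values()).most_common(1)[0]
    byzantine = {i for i, g in reports.items() if g != majority}
    return majority, byzantine

if __name__ == "__main__":
    # f = 1: worker 1 is Byzantine and always reports a bogus gradient.
    workers = {0: true_gradient, 1: lambda w, z: (99.0,), 2: true_gradient}
    print(replicated_gradient(0.0, 2.0, workers, first_group=[0, 1], reserve_group=[2]))
```

The reserve workers are queried only when the first f + 1 replicas disagree, which is what distinguishes reactive redundancy from always requesting 2f + 1 replicas.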
Computation efficiency
Let be the number of Byzantine workers identified until the th iteration. If the master does not detect a fault in the th iteration then the computation efficiency of the scheme is . Otherwise, the worst-case computation efficiency is .
As there are at most Byzantine workers, the master will detect faults and impose reactive redundancy in at most iterations. Thus, for iterations, the computation efficiency of the scheme is greater than or equal to for at least iterations. In case , the average computation efficiency of the scheme is effectively greater than or equal to .
Note: We would like to reiterate the fact that a deterministic coding scheme with computation efficiency greater than , in all iterations, cannot have exact fault-tolerance against at most Byzantine workers [13]. However, communication efficiency can be improved using other codes.
4.2 Randomized coding scheme
In the randomized scheme, the master checks for faults (and identifies Byzantine workers if needed) only in randomly chosen intermittent iterations. In each iteration, the master runs the traditional parallelized-SGD method by default. However, before updating the parameter estimate, the master decides to check for faults in the received symbols (or gradients) with probability . Fault-checks and identification of Byzantine workers (if needed) are done using the protocol outlined for the deterministic coding scheme in Section 4.1.
For the purpose of analysis, assume that each Byzantine worker tampers its gradient(s) independently in each iteration with probability at least . Then, remains unidentified by the master after iterations with probability less than or equal to , which approaches as approaches . In other words, gets identified almost surely. This holds for all Byzantine workers that tamper gradient(s) eventually.
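The bound in the preceding paragraph can be evaluated numerically. The sketch below assumes the symbols q for the fault-check probability, p for the lower bound on the tampering probability, and t for the number of iterations. These names are assumptions of this sketch. Since the check and the tampering are independent, a tampering worker is caught in a given iteration with probability at least q·p, so it remains unidentified after t iterations with probability at most (1 − q·p)^t.

```python
def unidentified_probability_bound(q, p, t):
    """Upper bound (1 - q*p)**t on a Byzantine worker staying unidentified.

    q: probability that the master runs a fault-check in an iteration.
    p: lower bound on the per-iteration tampering probability of the worker.
    t: number of iterations elapsed.
    Assumes checks and tampering are independent across iterations and that
    a fault-check always identifies a worker that tampered in that iteration.
    """
    if not (0.0 <= q <= 1.0 and 0.0 <= p <= 1.0) or t < 0:
        raise ValueError("need 0 <= q, p <= 1 and t >= 0")
    return (1.0 - q * p) ** t

if __name__ == "__main__":
    for t in (10, 100, 1000):
        print(t, unidentified_probability_bound(q=0.1, p=0.2, t=t))
```

Even with infrequent checks, the bound decays geometrically in t whenever q·p is positive.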
Computation efficiency
As the master checks for faults with probability in each iteration, the expected computation efficiency of the randomized scheme is greater than or equal to
(2) 
The above lower bound for the expected computation efficiency is computed by assuming the worst-case where the master imposes redundancy for each gradient in the fault-detection phase. The actual computation efficiency will be larger than this lower bound. However, this lower bound suffices to understand the benefits of our coding scheme.
From above, the expected computational efficiency of the randomized coding scheme can be made as close to one as desirable by choosing appropriately. Specifically, for a , let
Then, the expected computational efficiency of the randomized coding scheme is greater than or equal to .
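One plausible reading of this bound can be sketched in code. The sketch assumes that an unchecked iteration has efficiency 1, and that a checked iteration has worst-case efficiency 1/(2f + 1), as when every gradient ends up computed by 2f + 1 workers. The symbols q, f and delta and the function names are assumptions of this sketch, and the exact expression in (2) may differ.

```python
def expected_efficiency_lower_bound(q, f):
    """(1 - q) * 1 + q / (2f + 1): assumed worst-case expected efficiency.

    q: probability of a fault-check in an iteration.
    f: maximum number of Byzantine workers.
    """
    return (1.0 - q) + q / (2 * f + 1)

def max_check_probability(delta, f):
    """Largest q (capped at 1) keeping the bound above at least 1 - delta."""
    if f == 0:
        return 1.0  # no redundancy is ever needed, so checks cost nothing
    return min(1.0, delta * (2 * f + 1) / (2 * f))

if __name__ == "__main__":
    f, delta = 2, 0.1
    q = max_check_probability(delta, f)
    print(q, expected_efficiency_lower_bound(q, f))
```

Under these assumptions, choosing q no larger than delta·(2f + 1)/(2f) keeps the expected efficiency at or above 1 − delta.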
Efficiency versus convergence-rate
Smaller probability of faultchecks implies higher efficiency, as is evident from (2). However, smaller also means higher probability of using faulty gradient(s) for updating the parameter estimate, which could result in slower convergence of the learning algorithm.
Suppose that each Byzantine worker chooses to tamper with its gradient(s) independently with probability . Then the probability of a faulty update in the th iteration (assuming none of the Byzantine workers have been identified yet) equals
(3) 
Therefore, determining an optimal value of is a multi-objective optimization problem: maximize the expected computation efficiency bounded in (2) while minimizing the probability of faulty updates in (3).
Obviously, the above objectives cannot be met simultaneously. That is, there does not exist a that maximizes and minimizes the expected computation efficiency and the probability of faulty updates, respectively, at the same time. This trade-off between the computation efficiency and the reliability (or correctness) of the updates can be managed by the following adaptive approach.
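To make the trade-off concrete, the probability of a faulty update can be tabulated against the fault-check probability. The sketch assumes that an update is faulty exactly when the master does not check (probability 1 − q) and at least one of the f unidentified Byzantine workers tampers (probability 1 − (1 − p)^f). The symbols q, p and f are assumed names, and a checked iteration is assumed never to yield a faulty update.

```python
def faulty_update_probability(q, p, f):
    """(1 - q) * (1 - (1 - p)**f) under the independence assumptions above.

    q: probability of a fault-check in the iteration.
    p: probability that each Byzantine worker tampers, independently.
    f: number of Byzantine workers not yet identified.
    """
    return (1.0 - q) * (1.0 - (1.0 - p) ** f)

if __name__ == "__main__":
    # More frequent checks lower the chance of a faulty update,
    # at the price of more redundant gradient computations.
    for q in (0.0, 0.25, 0.5, 0.75, 1.0):
        print(q, faulty_update_probability(q, p=0.5, f=2))
```

The value decreases linearly in q and reaches zero only at q = 1, where every iteration pays for a fault-check.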
4.3 Adaptive randomized coding
Let and denote the expected computation efficiency and the probability of faulty update in iteration , if the probability of doing a faultcheck equals . Let denote the number of identified Byzantine workers until iteration . By substituting by
Note, maximizing is equivalent to minimizing , and minimizing is equivalent to minimizing . Thus, the probability of fault-check in the iteration, denoted by , is given by the minimum point of the weighted average of and , i.e.,
(4) 
where . A higher value of (greater than ) implies that minimizing takes precedence over maximizing , and vice versa.
Choice of
We note that a suitable value of can be computed using the average loss, denoted by , computed over the chosen data points at the current parameter estimate. Specifically, if denotes the set of data points chosen and denotes the current parameter estimate in the iteration, then
Then,
(5) 
If is given by (5), then for higher observed loss minimizing the probability of faulty updates takes precedence. This is quite intuitive, as the master would prefer the updates to be fault-free when the observed loss is high, for an improved convergence-rate to the learning parameter.
The following boundary conditions further justify the choice of given by (5).

As approaches , approaches . In this extreme case,
Thus, the master checks for faults in almost all iterations when the observed loss is extremely high.

If , i.e. Byzantine workers do not tamper their gradients with certainty,
Obviously, if the gradients received from the Byzantine workers are correct with certainty then there is no need for fault-checks. Similarly, if , i.e. the master has identified all the Byzantine workers, then
Note: To save on the computation cost, the master may use the workers for computing in parallel. However, in this case the master would only be able to obtain an approximation of , instead of the actual value, as up to of the workers are Byzantine. Nevertheless, an approximate suffices for the above adaptation. An approximation of can be computed by taking the truncated or trimmed mean of the average loss evaluated by the workers for their respective data points [22].
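A trimmed mean of the worker-reported losses, as suggested in the note above, might look as follows. The sketch assumes that the f smallest and f largest reported values are discarded, where f is the symbol assumed here for the maximum number of Byzantine workers. This is one common trimming rule and is not necessarily the exact variant intended in [22].

```python
def trimmed_mean(values, f):
    """Mean of `values` after discarding the f smallest and f largest entries.

    values: average losses reported by the workers (one per worker).
    f: assumed upper bound on the number of Byzantine workers.
    """
    if f < 0 or len(values) <= 2 * f:
        raise ValueError("need more than 2f reported values")
    ordered = sorted(values)
    kept = ordered[f:len(ordered) - f]
    return sum(kept) / len(kept)

if __name__ == "__main__":
    # Five workers, at most one Byzantine: the extreme reports are dropped.
    print(trimmed_mean([1.0, 2.0, 3.0, 100.0, -50.0], f=1))
```

With at most f faulty reports, every value that survives the trimming lies between two honest reports, so the approximation cannot be dragged arbitrarily far.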
5 Generalizations of the Randomized Coding Scheme
Our randomized scheme can be generalized as follows.

Self-checks: Instead of imposing reactive redundancy, the master can compute the gradients on its own, and compare them with the gradients received from the workers to check for faults. As above, the master may optimize the additional workload by choosing the probability of fault-checks adaptively, as presented in Section 4.3.

Selective fault-checks: Gradients (or symbols) that are outliers amongst the received gradients (or symbols) should be checked for faults with relatively higher probability. Additionally, the master can assign reliability scores to the workers, as done in the context of reliable crowdsourcing [18]. Symbols from workers with lower reliability scores should be checked for faults with higher probability.

Gradient-filters: The master can further improve on the computation efficiency by combining the randomized coding scheme with light-weight gradient-filters [10, 14, 23]. When using gradient-filters, the master does not have to identify all the Byzantine workers. This idea has been explored in Rajput et al., 2019 [17] for a deterministic coding scheme.

Distributed learning framework: Our randomized scheme can also be used for Byzantine fault-tolerance in the distributed learning framework, where the data points are distributed amongst the workers, i.e. two workers may have different sets of data points [6, 23]. In this case, besides checking for faulty gradient(s), the master must also validate the data points used by the workers for computing the gradients in the first place. As most existing data validation tools are computationally expensive [9, 12, 18, 21], the master may use our randomized scheme to optimize the trade-off between the cost of data validation and the convergence-rate of a distributed learning algorithm.
6 Summary
In this report, we have presented two coding schemes, a deterministic scheme and a randomized scheme, for exact Byzantine fault-tolerance in the parallelized-SGD learning algorithm.
In the deterministic scheme, the master uses a fault-detection code in each iteration. Upon detecting any fault(s), the master imposes reactive redundancy to correct the faults and identify the Byzantine worker(s) responsible for the fault(s).
The randomized scheme improves upon the computation efficiency of the deterministic scheme. Here, the master uses fault-detection codes only in randomly chosen intermittent iterations, instead of all the iterations. By doing so, the master is able to optimize the trade-off between the expected computation efficiency and the convergence-rate of the parallelized learning algorithm.
Acknowledgements
Research reported in this paper was sponsored in part by the Army Research Laboratory under Cooperative Agreement W911NF 1720196, and by National Science Foundation award 1610543. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Laboratory, National Science Foundation or the U.S. Government.
References
[1] Alham Fikri Aji and Kenneth Heafield. Sparse communication for distributed gradient descent. arXiv preprint arXiv:1704.05021, 2017.
[2] Jeremy Bernstein, Yu-Xiang Wang, Kamyar Azizzadenesheli, and Anima Anandkumar. signSGD: Compressed optimisation for non-convex problems. arXiv preprint arXiv:1802.04434, 2018.
[3] Peva Blanchard, Rachid Guerraoui, Julien Stainer, et al. Machine learning with adversaries: Byzantine tolerant gradient descent. In Advances in Neural Information Processing Systems, pages 119–129, 2017.
[4] Léon Bottou, Frank E Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning. SIAM Review, 60(2):223–311, 2018.
[5] Lingjiao Chen, Hongyi Wang, Zachary Charles, and Dimitris Papailiopoulos. DRACO: Byzantine-resilient distributed training via redundant gradients. In International Conference on Machine Learning, pages 903–912, 2018.
[6] Yudong Chen, Lili Su, and Jiaming Xu. Distributed statistical machine learning in adversarial settings: Byzantine gradient descent. Proceedings of the ACM on Measurement and Analysis of Computing Systems, 1(2):44, 2017.
[7] Deepesh Data, Linqi Song, and Suhas Diggavi. Data encoding for Byzantine-resilient distributed gradient descent. In 2018 56th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pages 863–870. IEEE, 2018.
[8] Ilias Diakonikolas, Gautam Kamath, Daniel M Kane, Jerry Li, Jacob Steinhardt, and Alistair Stewart. SEVER: A robust meta-algorithm for stochastic optimization. arXiv preprint arXiv:1803.02815, 2018.
[9] Julie S Downs, Mandy B Holbrook, Steve Sheng, and Lorrie Faith Cranor. Are your participants gaming the system?: screening mechanical turk workers. In Proceedings of the SIGCHI conference on human factors in computing systems, pages 2399–2402. ACM, 2010.
[10] Nirupam Gupta and Nitin H Vaidya. Byzantine fault-tolerant distributed linear regression. arXiv preprint arXiv:1903.08752, 2019.
[11] Nirupam Gupta and Nitin H Vaidya. Byzantine fault-tolerant parallelized stochastic gradient descent for linear regression. 57th Annual Allerton Conference on Communication, Control, and Computing, 2019.
[12] Srikanth Jagabathula, Lakshminarayanan Subramanian, and Ashwin Venkataraman. Identifying unreliable and adversarial workers in crowdsourced labeling tasks. Journal of Machine Learning Research, 18(93):1–67, 2017.
[13] Yehuda Lindell. Introduction to coding theory lecture notes. Department of Computer Science, Bar-Ilan University, Israel, January 25, 2010.
[14] El Mahdi El Mhamdi, Rachid Guerraoui, and Arsany Guirguis. Fast machine learning with byzantine workers and servers. arXiv preprint arXiv:1911.07537, 2019.
[15] Dhiraj K. Pradhan and Nitin H. Vaidya. Roll-forward and rollback recovery: Performance-reliability trade-off. IEEE Trans. Computers, 46(3):372–378, 1997.
[16] Adarsh Prasad, Arun Sai Suggala, Sivaraman Balakrishnan, and Pradeep Ravikumar. Robust estimation via robust gradient estimation. arXiv preprint arXiv:1802.06485, 2018.
[17] Shashank Rajput, Hongyi Wang, Zachary Charles, and Dimitris Papailiopoulos. DETOX: A redundancy-based framework for faster and more robust gradient aggregation. In Advances in Neural Information Processing Systems, pages 10320–10330, 2019.
[18] Vikas C Raykar and Shipeng Yu. Eliminating spammers and ranking annotators for crowdsourced labeling tasks. Journal of Machine Learning Research, 13(Feb):491–518, 2012.
[19] Navjot Singh, Deepesh Data, Jemin George, and Suhas Diggavi. SPARQ-SGD: Event-triggered and compressed communication in decentralized stochastic optimization. arXiv preprint arXiv:1910.14280, 2019.
[20] Hanlin Tang, Xiangru Lian, Tong Zhang, and Ji Liu. DoubleSqueeze: Parallel stochastic gradient descent with double-pass error-compensated compression. arXiv preprint arXiv:1905.05957, 2019.
[21] Jeroen Vuurens, Arjen P de Vries, and Carsten Eickhoff. How much spam can you take? An analysis of crowdsourcing results to increase accuracy. In Proc. ACM SIGIR Workshop on Crowdsourcing for Information Retrieval (CIR’11), pages 21–26, 2011.
[22] Rand R Wilcox. Introduction to robust estimation and hypothesis testing. Academic Press, 2011.
[23] Dong Yin, Yudong Chen, Kannan Ramchandran, and Peter Bartlett. Byzantine-robust distributed learning: Towards optimal statistical rates. In International Conference on Machine Learning, pages 5636–5645, 2018.
[24] Martin Zinkevich, Markus Weimer, Lihong Li, and Alex J Smola. Parallelized stochastic gradient descent. In Advances in Neural Information Processing Systems, pages 2595–2603, 2010.