We consider the problem of Byzantine fault-tolerance in synchronous parallelized learning that is founded on the parallelized stochastic gradient descent (parallelized-SGD) method.
The system comprises a master, workers, and () data points denoted by a set . The system architecture is shown in Figure 1. Let be a positive integer, and let denote the set of
-dimensional real-valued vectors. For a global parameter, each data point
has a non-negative loss function. The goal for the master is to learn a parameter that is a minimum point (a local minimum point if the average loss function is non-convex, or a global minimum point if the average loss function is convex) of the average loss evaluated for the data points. Formally, minimizes
in a neighbourhood of . Although may not be the only minimum point, for simplicity denotes a minimum point for the average loss throughout this report.
1.1 Overview of the parallelized-SGD method
The parallelized-SGD method is an expedited variant of the stochastic gradient descent method, an iterative learning algorithm. In each iteration
, the master maintains an estimate of , and updates it using gradients of the loss functions for a certain number of randomly chosen data points at . The details of the algorithm are as follows.
In each iteration , the master randomly chooses a set of data points, denoted by , and assigns data points to -th worker for , such that . Let the data points assigned to the -th worker in -th iteration be denoted by . Each worker computes the gradients for the loss functions of its assigned points at ,
and sends a symbol , which is a function of its computed gradients , to the master. The master obtains the average value of the gradients for all the data points in ,
as a function of the symbols received from the workers. For example, if each worker sends symbol
Upon obtaining , the master updates the parameter estimate as
where is a positive real value commonly referred to as the ‘step-size’. An illustration of the parallelized-SGD method is presented in Figure 1 for the case when .
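As a minimal sketch of one such iteration, the Python below assumes a squared loss on a linear model and assumes that each worker's symbol is the sum of its computed gradients. The names `gradient`, `worker_symbol` and `sgd_iteration`, the round-robin assignment, and the synthetic data are illustrative choices, not part of the method's specification.

```python
import random

def gradient(w, point):
    """Gradient of the squared loss 0.5 * (w.x - y)^2 (an assumed loss)."""
    x, y = point
    residual = sum(wi * xi for wi, xi in zip(w, x)) - y
    return [residual * xi for xi in x]

def worker_symbol(w, assigned_points):
    """Honest worker: the symbol is assumed to be the sum of its gradients."""
    total = [0.0] * len(w)
    for point in assigned_points:
        g = gradient(w, point)
        total = [t + gi for t, gi in zip(total, g)]
    return total

def sgd_iteration(w, data, n_workers, batch_size, step_size, rng):
    """One synchronous iteration: sample, split, aggregate, update."""
    batch = rng.sample(data, batch_size)
    # Round-robin split of the chosen points among the workers.
    shares = [batch[i::n_workers] for i in range(n_workers)]
    symbols = [worker_symbol(w, share) for share in shares]
    # The master averages the gradients over all chosen points.
    avg = [sum(s[k] for s in symbols) / batch_size for k in range(len(w))]
    return [wk - step_size * gk for wk, gk in zip(w, avg)]

# Synthetic, noise-free data generated from a hypothetical parameter.
rng = random.Random(0)
true_w = [2.0, -1.0]
data = []
for _ in range(200):
    x = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
    data.append((x, sum(a * b for a, b in zip(true_w, x))))

w = [0.0, 0.0]
for _ in range(500):
    w = sgd_iteration(w, data, n_workers=4, batch_size=20, step_size=0.5, rng=rng)
```

With all workers honest, the estimate `w` approaches the parameter used to generate the data. Nothing in this loop checks the symbols, which is the weakness discussed next.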
1.2 Vulnerability against Byzantine workers
The above parallelized-SGD method is not robust against Byzantine faulty workers. Byzantine workers need not follow the master’s instructions correctly, and might send malicious incorrect (or faulty) symbols. The identity of the Byzantine workers remains fixed throughout the learning algorithm, and is unknown a priori to the master.
We consider a case where up to () of the workers are Byzantine faulty. Our objective is to design a parallelized-SGD method that has exact fault-tolerance, which is defined as follows.
A parallelized-SGD method has exact fault-tolerance if the master asymptotically converges to a minimum point exactly, despite the presence of Byzantine workers.
2 Proposed Solutions and Contributions
We propose two coding schemes, one deterministic and the other randomized, for guaranteeing exact fault-tolerance if . Obviously, the master cannot tolerate more than or equal to Byzantine workers . Overviews of each of these schemes are presented below. Before we proceed with the summary of our contributions and the overviews of the proposed coding schemes, let us define the computation efficiency of a coding scheme.
The computation efficiency of a coding scheme is the ratio of the number of gradients used for parameter update, given in (1), to the number of gradients computed by the workers in total.
For example, in each iteration of the parallelized-SGD method presented above, the total number of gradients computed by the workers is equal to , and the master uses the average of all the gradients to update the parameter estimate (1). Therefore, the computation efficiency of a coding scheme (used for computing the symbols ) in the traditional parallelized-SGD method is equal to .
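The definition can be illustrated with a small calculation. The sketch below uses a hypothetical helper, `computation_efficiency`, and made-up counts (12 data points, a replication factor of 3) purely for illustration.

```python
from fractions import Fraction

def computation_efficiency(gradients_used, gradients_computed):
    """Ratio of gradients used in the update to gradients computed in total."""
    return Fraction(gradients_used, gradients_computed)

# Traditional parallelized-SGD: every chosen point's gradient is computed once.
plain = computation_efficiency(gradients_used=12, gradients_computed=12)

# Replication: every chosen point's gradient is computed by 3 workers.
replicated = computation_efficiency(gradients_used=12, gradients_computed=12 * 3)
```

Here `plain` equals 1 and `replicated` equals 1/3: replicating each gradient r times divides the efficiency by r.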
Summary of contributions: The computation efficiency of our deterministic coding scheme is twice as high as that of a fault-correction code based scheme proposed by Chen et al., 2018. To improve upon the computation efficiency of the deterministic coding scheme, we propose a randomization technique. The computation efficiency of the randomized scheme is optimal in expectation, and compares favorably to any coding scheme for tolerating Byzantine workers in the considered parallelized learning setting.
2.1 Overview of the deterministic scheme
For each iteration , after choosing the data points, the master assigns each data point to workers. Each worker computes gradients for all its data points, and sends a symbol to the master such that the collection of symbols forms an fault-detection code, i.e. the master can detect up to faulty symbols, and the average of the gradients (for all the data points) is a function of the non-faulty symbols. Upon detecting any fault(s), the master imposes reactive redundancy, where each data point (or each data point specific to the detected fault(s)) is assigned to additional workers. Each worker now computes gradients for the additional data points assigned to it, and sends symbols that enable the master to identify up to faulty symbols in . Upon identifying the Byzantine workers that sent faulty symbols, the master can recover the correct average of the gradients. Hence, the scheme guarantees exact fault-tolerance.
We note the following generalizations, and drawback of the scheme.
In general, any suitable fault-detection code may be used in this scheme; we use a replication code as an example. The choice of code affects the communication and computation efficiency of the scheme. However, a deterministic scheme that obtains exact fault-tolerance cannot have computation efficiency greater than in all iterations.
Drawback: In the deterministic scheme, each gradient is computed by workers even when all the Byzantine workers send non-faulty (or correct) symbols. In other words,
even when all the workers send correct symbols. This unnecessary redundancy can be significantly reduced by using a randomized approach presented below.
2.2 Overview of the randomized scheme
The master checks for faults only in intermittent iterations chosen at random, instead of all the iterations. Alternatively, in each iteration, the master does a fault-check with some non-zero probability less than . By doing so, the master significantly reduces the redundancy in gradients’ computations whilst almost surely identifying the Byzantine workers that send faulty symbols eventually. (As the parallelized-SGD method converges to the learning parameter regardless of the initial parameter estimate, a Byzantine worker that eventually stops sending faulty gradients poses no harm to the learning process. Hence, the master only needs to identify Byzantine workers that send faulty gradient(s) eventually.) As in the deterministic scheme, upon detecting any fault(s) the master imposes reactive redundancy to identify the responsible Byzantine worker(s). However, correcting the detected fault(s) is optional. The identified Byzantine worker(s) are eliminated from the subsequent iterations.
Significant savings on redundancy: By reducing the probability of random fault-checks, the expected computation efficiency of the scheme can be made as close to 1 as desired. Note that a coding scheme that obtains exact fault-tolerance against a non-zero number of Byzantine workers cannot have an expected computation efficiency of exactly 1.
We note the following generalizations, and adaptation of the randomized scheme:
Obviously, as in the deterministic case, the randomized scheme can be easily generalized for compressed gradients.
Adaptation: A lower probability of fault-checks implies higher probability of using faulty gradients for parameter update, and vice-versa. Higher probability of faulty updates means higher probability of slower convergence of the learning algorithm. To manage the trade-off between the computation efficiency and the rate of learning, we present an adaptive approach in Section 4.3. Essentially, the master may vary the probability of fault-checks – depending upon the observed average loss at the current parameter estimate.
3 Related works
There has been some work on coding schemes for Byzantine fault-tolerance in parallelized machine learning, such as [5, 7, 17]. The scheme proposed by Data et al., 2018, however, is only applicable for loss functions whose arguments are linear in the learning parameter. The scheme, named DRACO, by Chen et al., 2018 relies on fault-correction codes and so has a computation efficiency of only . At the expense of exact fault-tolerance, the computation efficiency of DRACO can be improved using gradient-filters. Our randomized scheme has both exact fault-tolerance and favourable computation efficiency.
The fault-tolerance properties of the known gradient-filters (KRUM, trimmed-mean, median, geometric median of means, norm clipping, SEVER, or others [14, 16]) rely on additional assumptions either on the distribution of the data points or the fraction of Byzantine workers. Moreover, the existing gradient-filters do not obtain exact fault-tolerance unless there are redundant data points.
To the best of our knowledge, none of the prior works have proposed the idea of reactive redundancy for tolerating Byzantine workers efficiently in the context of parallelized learning. In other contexts, such as checkpointing and rollback recovery, mechanisms that combine proactive and reactive redundancy have been utilized. For instance, Pradhan and Vaidya  propose a mechanism where a small number of replicas are utilized proactively to allow detection of faulty replicas; when a faulty replica is detected, additional replicas are employed to isolate the faulty replicas.
4 Coding Schemes
In this section, we present a specific deterministic scheme for the generic case, and present further details for the randomized scheme.
4.1 Deterministic coding scheme
As an example of the deterministic scheme, we use a replication code. For simplicity, suppose that none of the Byzantine workers have been identified until iteration . Then, the scheme for the -iteration is as follows.
The master (randomly) chooses data points, and assigns each data point to workers. Thus, each worker, on average, gets data points. Upon computing the gradients for all its data points (at ), each worker sends a symbol: a tuple of its computed gradients. Consequently, the master receives copies (or replicas) of each data point’s loss function’s gradient. As there are at most Byzantine workers, the master can detect if the received copies of a gradient are faulty by simply comparing them with each other. Suppose that the copies of the gradient of a particular data point are faulty (i.e. they are not unanimous). Then, the master imposes reactive redundancy, where it re-assigns to additional workers that compute and send additional copies of the gradient for . Upon acquiring copies of ’s gradient, the master can not only obtain the correct gradient by majority voting, but also identify the responsible Byzantine worker(s). Ultimately, the master recovers the correct gradients for all the data points, and updates as (1).
The identified Byzantine worker(s) are eliminated from the subsequent iterations. Upon updating and , the above scheme is repeated for the -iteration.
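The detect-then-identify logic for a single data point can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: it assumes the standard replication counts (f + 1 replicas to detect up to f faults, 2f + 1 replicas in total for majority voting), models workers as functions, and uses hypothetical names such as `resolve_gradient` and `make_byzantine`.

```python
from collections import Counter

def honest(grad):
    """An honest worker reports the gradient it computed."""
    return grad

def make_byzantine(offset):
    """A hypothetical Byzantine behaviour: shift every coordinate."""
    def tamper(grad):
        return tuple(g + offset for g in grad)
    return tamper

def resolve_gradient(true_grad, workers, order, f):
    """Return (gradient, ids of workers caught lying) for one data point.

    `workers` maps worker id -> behaviour; `order` lists worker ids in the
    order the master assigns replicas. Assumes at most f Byzantine workers.
    """
    # Proactive phase: f + 1 replicas suffice to detect up to f faults,
    # because at least one of them comes from an honest worker.
    replies = {i: workers[i](true_grad) for i in order[: f + 1]}
    if len(set(replies.values())) == 1:
        return next(iter(replies.values())), set()
    # Reactive phase: f more replicas (2f + 1 in total), then majority vote.
    for i in order[f + 1 : 2 * f + 1]:
        replies[i] = workers[i](true_grad)
    majority, _ = Counter(replies.values()).most_common(1)[0]
    faulty = {i for i, g in replies.items() if g != majority}
    return majority, faulty

f = 2
workers = {0: honest, 1: make_byzantine(1.0), 2: honest,
           3: honest, 4: make_byzantine(-3.0)}
grad, faulty = resolve_gradient((0.5, -1.5), workers, order=[0, 1, 2, 3, 4], f=f)
clean_grad, clean_faulty = resolve_gradient((0.5, -1.5), workers, order=[0, 2, 3], f=f)
```

In the first call the initial replicas disagree, so the additional replicas are requested and workers 1 and 4 are identified. In the second call the replicas are unanimous and no extra computation is triggered.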
Let be the number of Byzantine workers identified until the -th iteration. If the master does not detect a fault in the -th iteration then the computation efficiency of the scheme is . Otherwise, the worst-case computation efficiency is .
As there are at most Byzantine workers, the master will detect faults and impose reactive redundancy in at most iterations. Thus, for iterations, the computation efficiency of the scheme is greater than or equal to for at least iterations. In case , the average computation efficiency of the scheme is effectively greater than or equal to .
Note: We would like to reiterate the fact that a deterministic coding scheme with computation efficiency greater than , in all iterations, cannot have exact fault-tolerance against at most Byzantine workers . However, communication efficiency can be improved using other codes.
4.2 Randomized coding scheme
In the randomized scheme, the master checks for faults (and identifies Byzantine workers if needed) only in randomly chosen intermittent iterations. In each iteration, the master runs the traditional parallelized-SGD method by default. However, before updating the parameter estimate, the master decides to check for faults in the received symbols (or gradients) with probability . Fault-checks and identification of Byzantine workers (if needed) are done using the protocol outlined for the deterministic coding scheme in Section 4.1.
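The control flow of the randomized scheme can be sketched as a small simulation. The function name `run_randomized_checks` and its parameters are hypothetical; the sketch assumes that a fault-check exposes every worker that tampered in that iteration, and it counts an update as faulty whenever tampering goes unchecked.

```python
import random

def run_randomized_checks(num_iterations, q, byzantine_ids, tamper_prob, rng):
    """Simulate which Byzantine workers get identified and eliminated.

    Each remaining Byzantine worker tampers independently with probability
    `tamper_prob`; the master runs a fault-check with probability `q`.
    """
    active = set(byzantine_ids)
    identified_at = {}
    faulty_updates = 0
    for t in range(num_iterations):
        tampering = {b for b in sorted(active) if rng.random() < tamper_prob}
        if rng.random() < q:
            # Fault-check iteration: tamperers are identified and eliminated.
            for b in tampering:
                identified_at[b] = t
            active -= tampering
        elif tampering:
            # Default iteration: tampered symbols reach the update unchecked.
            faulty_updates += 1
    return identified_at, faulty_updates

always, bad_always = run_randomized_checks(10, 1.0, [3, 7], 1.0, random.Random(1))
never, bad_never = run_randomized_checks(10, 0.0, [3, 7], 1.0, random.Random(1))
```

The two calls show the extremes: checking in every iteration identifies both workers immediately with no faulty update, while never checking identifies nobody and every update is faulty.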
For the purpose of analysis, assume that each Byzantine worker tampers its gradient(s) independently in each iteration with probability at least . Then, remains unidentified by the master after iterations with probability less than or equal to , which approaches as approaches . In other words, gets identified almost surely. This holds for all Byzantine workers that tamper gradient(s) eventually.
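The almost-sure identification argument can be checked numerically. The sketch below assumes that the master's check (with probability `q`) and the worker's tampering (with probability `p`) are independent in each iteration, so the worker escapes a given iteration with probability at most 1 - q·p. The names `q`, `p`, `unidentified_bound` and `simulate_unidentified` are labels chosen here, not taken from the text.

```python
import random

def unidentified_bound(q, p, t):
    """Upper bound on the probability of staying unidentified for t iterations."""
    return (1.0 - q * p) ** t

def simulate_unidentified(q, p, t, trials, rng):
    """Monte Carlo estimate for a worker that tampers with probability exactly p."""
    survived = 0
    for _ in range(trials):
        caught = False
        for _ in range(t):
            tampers = rng.random() < p
            checks = rng.random() < q
            if tampers and checks:
                caught = True
                break
        if not caught:
            survived += 1
    return survived / trials

bound = unidentified_bound(0.1, 0.5, 50)
estimate = simulate_unidentified(0.1, 0.5, 50, 20000, random.Random(0))
long_run = unidentified_bound(0.1, 0.5, 1000)
```

The simulated frequency should track the bound closely when the worker tampers with probability exactly `p`, and the bound decays geometrically, so it is already negligible after a thousand iterations for these values.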
As the master checks for faults with probability in each iteration, the expected computation efficiency of the randomized scheme is greater than or equal to
The above lower bound for the expected computation efficiency is computed by assuming the worst-case where the master imposes redundancy for each gradient in the fault-detection phase. The actual computation efficiency will be larger than this lower bound. However, this lower bound suffices to understand the benefits of our coding scheme.
From above, the expected computational efficiency of the randomized coding scheme can be made as close to one as desirable by choosing appropriately. Specifically, for a , let
Then, the expected computational efficiency of the randomized coding scheme is greater than or equal to .
Efficiency versus convergence-rate
Smaller probability of fault-checks implies higher efficiency, as is evident from (2). However, smaller also means higher probability of using faulty gradient(s) for updating the parameter estimate, which could result in slower convergence of the learning algorithm.
If each Byzantine worker chooses to tamper its gradient(s) independently with probability , then the probability of a faulty update in the -th iteration (assuming none of the Byzantine workers have been identified yet) equals
Therefore, determining an optimal value of is a multi-objective optimization problem where;
Obviously, the above objectives cannot be met simultaneously. That is, there does not exist a that maximizes and minimizes the expected computation efficiency and the probability of faulty updates, respectively, at the same time. This trade-off between the computation efficiency and the reliability (or correctness) of the updates can be managed by the following adaptive approach.
4.3 Adaptive randomized coding
Let and denote the expected computation efficiency and the probability of faulty update in iteration , if the probability of doing a fault-check equals . Let denote the number of identified Byzantine workers until iteration . By substituting by
Note, maximizing is equivalent to minimizing , and minimizing is equivalent to minimizing . Thus, the probability of fault-check in the -iteration, denoted by , is given by the minimum point of the weighted average of and , i.e.,
where . Higher value of (greater than ) implies that minimizing takes precedence over maximising , and vice versa.
We note that a suitable value of can be computed using the average loss, denoted by , computed over the chosen data points at the current parameter estimate. Specifically, if denotes the set of data points chosen and denotes the current parameter estimate in the -iteration, then
If is given by (5), then for higher observed loss minimizing the probability of faulty updates takes precedence. This is quite intuitive, as the master would prefer the updates to be fault-free when the observed loss is high, for an improved convergence-rate to the learning parameter.
The following boundary conditions further justify the choice of given by (5).
As approaches , approaches . In this extreme case,
Thus, the master checks for faults in almost all iterations when the observed loss is extremely high.
If , i.e. Byzantine workers do not tamper their gradients with certainty,
Obviously, if the gradients received from the Byzantine workers are correct with certainty then there is no need for fault-checks. Similarly, if , i.e. the master has identified all the Byzantine workers, then
Note: For saving on the computation cost, the master may use the workers for computing in parallel. However, in this case the master would only be able to obtain an approximation of , instead of the actual value, as up to of the workers are Byzantine. Nevertheless, approximate suffices for the above adaptation. An approximation of can be computed by taking the truncated or trimmed mean of the average loss evaluated by the workers for their respective data points .
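A trimmed mean of the worker-reported losses can be sketched as follows. The function name `trimmed_mean` and the reported values are hypothetical; the sketch assumes the master drops the f smallest and f largest reports, so that with at most f Byzantine workers the result lies between the smallest and largest honest reports.

```python
def trimmed_mean(values, f):
    """Drop the f smallest and f largest values, then average the rest."""
    if len(values) <= 2 * f:
        raise ValueError("need more than 2f values to trim f from each end")
    kept = sorted(values)[f : len(values) - f]
    return sum(kept) / len(kept)

# Seven workers report their average loss; two reports are wildly off.
reported = [0.9, 1.1, 1.0, 250.0, 1.2, -40.0, 0.8]
approx_loss = trimmed_mean(reported, f=2)
```

The two extreme reports are discarded along with two honest ones, and the remaining values average to approximately 1.0.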
5 Generalizations of the Randomized Coding Scheme
Our randomized scheme can be generalized as follows.
Self-checks: Instead of imposing reactive redundancy, the master can compute the gradients on its own, and compare them with the gradients received from the workers to check for faults. As above, the master may optimize the additional workload by choosing the probability of fault-checks adaptively as presented in Section 4.3.
Gradients (or symbols) that are outliers amongst the received gradients (or symbols) should be checked for faults with relatively higher probability. Additionally, the master can assign reliability scores to the workers, as done in the context of reliable crowdsourcing. Symbols from workers with lower reliability scores should be checked for faults with higher probability.
Gradient-filters: The master can further improve on the computation efficiency by combining the randomized coding scheme with lightweight gradient-filters [10, 14, 23]. When using gradient-filters, the master does not have to identify all the Byzantine workers. This idea has been explored in Rajput et al., 2019  for a deterministic coding scheme.
Distributed learning framework: Our randomized scheme can also be used for Byzantine fault-tolerance in the distributed learning framework, where the data points are distributed amongst the workers, i.e. two workers may have different sets of data points [6, 23]. In this case, besides checking for faulty gradient(s), the master must also validate the data points used by the workers for computing the gradients in the first place. As most existing data validation tools are computationally expensive [9, 12, 18, 21], the master may use our randomized scheme to optimize the trade-off between the cost of data validation and the convergence-rate of a distributed learning algorithm.
In this report, we have presented two coding schemes, a deterministic scheme and a randomized scheme, for exact Byzantine fault-tolerance in the parallelized-SGD learning algorithm.
In the deterministic scheme, the master uses a fault-detection code in each iteration. Upon detecting any fault(s), the master imposes reactive redundancy to correct the faults and identify the Byzantine worker(s) responsible for the fault(s).
The randomized scheme improves upon the computation efficiency of the deterministic scheme. Here, the master uses fault-detection codes only in randomly chosen intermittent iterations, instead of all the iterations. By doing so, the master is able to optimize the trade-off between the expected computation efficiency, and the convergence-rate of the parallelized learning algorithm.
Research reported in this paper was sponsored in part by the Army Research Laboratory under Cooperative Agreement W911NF-17-2-0196, and by National Science Foundation award 1610543. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Laboratory, National Science Foundation or the U.S. Government.
-  Alham Fikri Aji and Kenneth Heafield. Sparse communication for distributed gradient descent. arXiv preprint arXiv:1704.05021, 2017.
-  Jeremy Bernstein, Yu-Xiang Wang, Kamyar Azizzadenesheli, and Anima Anandkumar. signsgd: Compressed optimisation for non-convex problems. arXiv preprint arXiv:1802.04434, 2018.
-  Peva Blanchard, Rachid Guerraoui, Julien Stainer, et al. Machine learning with adversaries: Byzantine tolerant gradient descent. In Advances in Neural Information Processing Systems, pages 119–129, 2017.
-  Léon Bottou, Frank E Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning. Siam Review, 60(2):223–311, 2018.
-  Lingjiao Chen, Hongyi Wang, Zachary Charles, and Dimitris Papailiopoulos. DRACO: Byzantine-resilient distributed training via redundant gradients. In International Conference on Machine Learning, pages 903–912, 2018.
-  Yudong Chen, Lili Su, and Jiaming Xu. Distributed statistical machine learning in adversarial settings: Byzantine gradient descent. Proceedings of the ACM on Measurement and Analysis of Computing Systems, 1(2):44, 2017.
-  Deepesh Data, Linqi Song, and Suhas Diggavi. Data encoding for Byzantine-resilient distributed gradient descent. In 2018 56th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pages 863–870. IEEE, 2018.
-  Ilias Diakonikolas, Gautam Kamath, Daniel M Kane, Jerry Li, Jacob Steinhardt, and Alistair Stewart. SEVER: A robust meta-algorithm for stochastic optimization. arXiv preprint arXiv:1803.02815, 2018.
-  Julie S Downs, Mandy B Holbrook, Steve Sheng, and Lorrie Faith Cranor. Are your participants gaming the system?: screening mechanical turk workers. In Proceedings of the SIGCHI conference on human factors in computing systems, pages 2399–2402. ACM, 2010.
-  Nirupam Gupta and Nitin H Vaidya. Byzantine fault-tolerant distributed linear regression. arXiv preprint arXiv:1903.08752, 2019.
-  Nirupam Gupta and Nitin H Vaidya. Byzantine fault-tolerant parallelized stochastic gradient descent for linear regression. In 57th Annual Allerton Conference on Communication, Control, and Computing, 2019.
-  Srikanth Jagabathula, Lakshminarayanan Subramanian, and Ashwin Venkataraman. Identifying unreliable and adversarial workers in crowdsourced labeling tasks. Journal of Machine Learning Research, 18(93):1–67, 2017.
-  Yehuda Lindell. Introduction to coding theory lecture notes. Department of Computer Science Bar-Ilan University, Israel January, 25, 2010.
-  El Mahdi El Mhamdi, Rachid Guerraoui, and Arsany Guirguis. Fast machine learning with byzantine workers and servers. arXiv preprint arXiv:1911.07537, 2019.
-  Dhiraj K. Pradhan and Nitin H. Vaidya. Roll-forward and rollback recovery: Performance-reliability trade-off. IEEE Trans. Computers, 46(3):372–378, 1997.
-  Adarsh Prasad, Arun Sai Suggala, Sivaraman Balakrishnan, and Pradeep Ravikumar. Robust estimation via robust gradient estimation. arXiv preprint arXiv:1802.06485, 2018.
-  Shashank Rajput, Hongyi Wang, Zachary Charles, and Dimitris Papailiopoulos. DETOX: A redundancy-based framework for faster and more robust gradient aggregation. In Advances in Neural Information Processing Systems, pages 10320–10330, 2019.
-  Vikas C Raykar and Shipeng Yu. Eliminating spammers and ranking annotators for crowdsourced labeling tasks. Journal of Machine Learning Research, 13(Feb):491–518, 2012.
-  Navjot Singh, Deepesh Data, Jemin George, and Suhas Diggavi. SPARQ-SGD: Event-triggered and compressed communication in decentralized stochastic optimization. arXiv preprint arXiv:1910.14280, 2019.
-  Hanlin Tang, Xiangru Lian, Tong Zhang, and Ji Liu. Doublesqueeze: Parallel stochastic gradient descent with double-pass error-compensated compression. arXiv preprint arXiv:1905.05957, 2019.
-  Jeroen Vuurens, Arjen P de Vries, and Carsten Eickhoff. How much spam can you take? an analysis of crowdsourcing results to increase accuracy. In Proc. ACM SIGIR Workshop on Crowdsourcing for Information Retrieval (CIR’11), pages 21–26, 2011.
-  Rand R Wilcox. Introduction to robust estimation and hypothesis testing. Academic press, 2011.
-  Dong Yin, Yudong Chen, Kannan Ramchandran, and Peter Bartlett. Byzantine-robust distributed learning: Towards optimal statistical rates. In International Conference on Machine Learning, pages 5636–5645, 2018.
-  Martin Zinkevich, Markus Weimer, Lihong Li, and Alex J Smola. Parallelized stochastic gradient descent. In Advances in neural information processing systems, pages 2595–2603, 2010.