# Randomized Reactive Redundancy for Byzantine Fault-Tolerance in Parallelized Learning

This report considers the problem of Byzantine fault-tolerance in synchronous parallelized learning that is founded on the parallelized stochastic gradient descent (parallelized-SGD) algorithm. The system comprises a master, and n workers, where up to f of the workers are Byzantine faulty. Byzantine workers need not follow the master's instructions correctly, and might send malicious incorrect (or faulty) information. The identity of the Byzantine workers remains fixed throughout the learning process, and is unknown a priori to the master. We propose two coding schemes, a deterministic scheme and a randomized scheme, for guaranteeing exact fault-tolerance if 2f < n. The coding schemes use the concept of reactive redundancy for isolating Byzantine workers that eventually send faulty information. We note that the computation efficiencies of the schemes compare favorably with other (deterministic or randomized) coding schemes, for exact fault-tolerance.


## 1 Introduction

We consider the problem of Byzantine fault-tolerance in synchronous parallelized learning that is founded on the parallelized stochastic gradient descent (parallelized-SGD) method.

The system comprises a master, $n$ workers, and $N$ data points denoted by a set $Z$. The system architecture is shown in Figure 1. Let $d$ be a positive integer, and let $\mathbb{R}^d$ denote the set of $d$-dimensional real-valued vectors. For a global parameter $w \in \mathbb{R}^d$, each data point $z \in Z$ has a non-negative loss function $\ell(w, z)$. The goal for the master is to learn a parameter $w^*$ that is a minimum point of the average loss evaluated for the $N$ data points (a local minimum point if the average loss function is non-convex, or a global minimum point if it is convex). Formally, $w^*$ minimizes

$$\frac{1}{N} \sum_{z \in Z} \ell(w, z)$$

in a neighbourhood of $w^*$. Although $w^*$ may not be the only minimum point, for simplicity $w^*$ denotes a minimum point for the average loss throughout this report.

The optimization framework forms the basis for most contemporary learning methods, including neural networks and support vector machines [4].

### 1.1 Overview of the parallelized-SGD method

The parallelized-SGD method is an expedited variant of the stochastic gradient descent method, an iterative learning algorithm [24]. In each iteration $t$, the master maintains an estimate $w^t$ of $w^*$, and updates it using gradients of the loss functions for a certain number of randomly chosen data points at $w^t$. The details of the algorithm are as follows.

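In each iteration $t$, the master randomly chooses a set of $m$ data points, denoted by $Z^t$, and assigns $m_i$ data points to the $i$-th worker for $i = 1, \ldots, n$, such that $\sum_{i=1}^{n} m_i = m$. Let the data points assigned to the $i$-th worker in the $t$-th iteration be denoted by $z^t_{i_1}, \ldots, z^t_{i_{m_i}}$. Each worker $i$ computes the gradients for the loss functions of its assigned points at $w^t$,

$$g^t_{i_j} = \nabla \ell(w, z^t_{i_j}) \big|_{w = w^t}, \quad j = 1, \ldots, m_i,$$

and sends a symbol $c_i$, which is a function of its computed gradients $g^t_{i_1}, \ldots, g^t_{i_{m_i}}$, to the master. The master obtains the average value of the gradients for all the data points in $Z^t$,

$$g^t = \frac{1}{m} \sum_{z \in Z^t} \nabla \ell(w, z) \big|_{w = w^t},$$

as a function of the symbols received from the workers. For example, if each worker $i$ sends the symbol

$$c_i(g^t_{i_1}, \ldots, g^t_{i_{m_i}}) = \frac{1}{m_i} \sum_{j=1}^{m_i} g^t_{i_j},$$

then

$$g^t = \frac{1}{m} \sum_{i=1}^{n} m_i c_i = \frac{1}{m} \sum_{z \in Z^t} \nabla \ell(w, z) \big|_{w = w^t}.$$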

Upon obtaining $g^t$, the master updates the parameter estimate as

$$w^{t+1} = w^t - \eta_t \left( \frac{1}{N} \sum_{i=1}^{N} g^t_i \right), \tag{1}$$

where $\eta_t$ is a positive real value commonly referred to as the 'step-size'. An illustration of the parallelized-SGD method is presented in Figure 1.
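The iteration described above can be sketched in a few lines of Python. This is only an illustrative, fault-free sketch: the function names, the round-robin split of the chosen points, the scalar data points and the squared loss are all assumptions made for the example, and the update uses the averaged gradient $g^t$ recovered from the symbols.

```python
import random
from typing import Callable, List, Sequence

Vector = List[float]


def average(vectors: Sequence[Vector]) -> Vector:
    """Coordinate-wise mean of a non-empty list of equal-length vectors."""
    count = len(vectors)
    return [sum(coords) / count for coords in zip(*vectors)]


def worker_symbol(grad: Callable[[Vector, float], Vector],
                  w: Vector, points: Sequence[float]) -> Vector:
    """Symbol c_i: the average of the worker's gradients evaluated at w."""
    return average([grad(w, z) for z in points])


def sgd_iteration(grad: Callable[[Vector, float], Vector], w: Vector,
                  data: Sequence[float], m: int, n: int,
                  step_size: float, rng: random.Random) -> Vector:
    """One fault-free iteration: sample m points, split them over n workers,
    combine the symbols into g^t, and take one gradient step."""
    chosen = rng.sample(list(data), m)
    # Round-robin split (an assumption): worker i gets m_i points, sum(m_i) = m.
    shares = [chosen[i::n] for i in range(n)]
    shares = [share for share in shares if share]  # skip idle workers
    symbols = [worker_symbol(grad, w, share) for share in shares]
    # g^t = (1/m) * sum_i m_i * c_i
    g = [sum(len(share) * c[k] for share, c in zip(shares, symbols)) / m
         for k in range(len(w))]
    return [wk - step_size * gk for wk, gk in zip(w, g)]


def squared_loss_grad(w: Vector, z: float) -> Vector:
    """Gradient of the hypothetical loss 0.5 * (w - z)^2 for scalar w."""
    return [w[0] - z]
```

With this toy loss, `sgd_iteration(squared_loss_grad, [0.0], [1.0, 2.0, 3.0, 6.0], m=4, n=2, step_size=0.5, rng=random.Random(0))` moves the estimate from 0 halfway towards the mean of the data points, regardless of how the points are split between the two workers.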

### 1.2 Vulnerability against Byzantine workers

The above parallelized-SGD method is not robust against Byzantine faulty workers. Byzantine workers need not follow the master’s instructions correctly, and might send malicious incorrect (or faulty) symbols. The identity of the Byzantine workers remains fixed throughout the learning algorithm, and is unknown a priori to the master.

We consider a case where up to $f$ of the $n$ workers are Byzantine faulty. Our objective is to design a parallelized-SGD method that has exact fault-tolerance, which is defined as follows.

###### Definition 1.

A parallelized-SGD method has exact fault-tolerance if the master asymptotically converges to a minimum point $w^*$ exactly, despite the presence of Byzantine workers.

## 2 Proposed Solutions and Contributions

We propose two coding schemes, one deterministic and the other randomized, for guaranteeing exact fault-tolerance if $2f < n$. The master cannot tolerate $n/2$ or more Byzantine workers [5]. Overviews of these schemes are presented below. Before proceeding with the summary of our contributions and the overviews of the proposed coding schemes, let us define the computation efficiency of a coding scheme.

###### Definition 2.

The computation efficiency of a coding scheme is the ratio of the number of gradients used for parameter update, given in (1), to the number of gradients computed by the workers in total.

For example, in each iteration of the parallelized-SGD method presented above, the total number of gradients computed by the workers is equal to $m$, and the master uses the average of all the $m$ gradients to update the parameter estimate (1). Therefore, the computation efficiency of a coding scheme (used for computing the symbols $c_i$) in the traditional parallelized-SGD method is equal to $1$.

### Summary of contributions:

The computation efficiency of our deterministic coding scheme is twice as high as that of a fault-correction code based scheme proposed by Chen et al., 2018 [5]. To improve upon the computation efficiency of the deterministic coding scheme, we propose a randomization technique. The computation efficiency of the randomized scheme is optimal in expectation, and compares favorably to any coding scheme for tolerating Byzantine workers in the considered parallelized learning setting.

### 2.1 Overview of the deterministic scheme

For each iteration $t$, after choosing the data points, the master assigns each data point to $f+1$ workers. Each worker computes gradients for all its data points, and sends a symbol to the master such that the collection of symbols forms a fault-detection code, i.e. the master can detect up to $f$ faulty symbols, and the average of the gradients (for all the data points) is a function of the non-faulty symbols. Upon detecting any fault(s), the master imposes reactive redundancy, where each data point (or each data point specific to the detected fault(s)) is assigned to $f$ additional workers. Each worker now computes gradients for the additional data points assigned, and sends symbols that enable the master to identify up to $f$ faulty symbols. Upon identifying the Byzantine workers that sent faulty symbols, the master can recover the correct average of the gradients. Hence, the scheme guarantees exact fault-tolerance.

A simple example illustrating the scheme is presented in Figure 2. A replication code for the generic case is presented in Section 4.1.

We note the following generalizations, and drawback of the scheme.

• Generalizations:

• The workers may send symbols that are functions of compressed gradients, proposed for improved communication efficiency in the non-Byzantine case [1, 2, 19, 20], instead of the original gradients.

• In general, any suitable fault-detection code may be used in this scheme; we use a replication code as an example. The choice of the code affects the communication and computation efficiency of the scheme. However, a deterministic scheme that obtains exact fault-tolerance cannot have computation efficiency greater than $1/(f+1)$ in all iterations.

• Drawback: In the deterministic scheme, each gradient is computed by $f+1$ workers even when all the Byzantine workers send non-faulty (or correct) symbols. In other words, the computation efficiency is at most $1/(f+1)$ even when all the workers send correct symbols. This unnecessary redundancy can be significantly reduced by using the randomized approach presented below.

### 2.2 Overview of the randomized scheme

The master checks for faults only in intermittent iterations chosen at random, instead of all the iterations. Alternatively, in each iteration, the master does a fault-check with some non-zero probability less than $1$. By doing so, the master significantly reduces the redundancy in gradient computations whilst almost surely identifying the Byzantine workers that eventually send faulty symbols. (As the parallelized-SGD method converges to the learning parameter regardless of the initial parameter estimate, a Byzantine worker that eventually stops sending faulty gradients poses no harm to the learning process. Hence, the master only needs to identify Byzantine workers that send faulty gradient(s) eventually.) As in the deterministic scheme, upon detecting any fault(s) the master imposes reactive redundancy to identify the responsible Byzantine worker(s). However, correcting the detected fault(s) is optional. The identified Byzantine worker(s) are eliminated from the subsequent iterations.

An illustration of the scheme is presented in Figure 3. Additional details for the generic case are presented in Section 4.2.

Significant savings on redundancy: By reducing the probability of random fault-checks, the expected computation efficiency of the scheme can be made as close to $1$ as desired. Note that a coding scheme that obtains exact fault-tolerance against a non-zero number of Byzantine workers cannot have an expected computation efficiency of $1$.

We note the following generalizations, and adaptation of the randomized scheme:

• Generalization:

• Obviously, as in the deterministic case, the randomized scheme can be easily generalized for compressed gradients.

• Instead of checking for faults for all the workers with equal probability, the master may use different probabilities for different workers. For doing so, workers can be assigned reliability scores as in the context of reliable crowdsourcing [18]. Other generalizations are presented in Section 5.

• Adaptation: A lower probability of fault-checks implies higher probability of using faulty gradients for parameter update, and vice-versa. Higher probability of faulty updates means higher probability of slower convergence of the learning algorithm. To manage the trade-off between the computation efficiency and the rate of learning, we present an adaptive approach in Section 4.3. Essentially, the master may vary the probability of fault-checks – depending upon the observed average loss at the current parameter estimate.

## 3 Related works

There has been some work on coding schemes for Byzantine fault-tolerance in parallelized machine learning, such as [5, 7, 17]. The scheme proposed by Data et al., 2018 [7], however, is only applicable for loss functions whose arguments are linear in the learning parameter. The scheme, named DRACO, by Chen et al., 2018 [5] relies on fault-correction codes and so has a computation efficiency of only $1/(2f+1)$. At the expense of exact fault-tolerance, the computation efficiency of DRACO can be improved using gradient-filters [17]. Our randomized scheme has both exact fault-tolerance and favourable computation efficiency.

The fault-tolerance properties of the known gradient filters – KRUM [3], trimmed-mean [23], median [23], geometric median of means [6], norm clipping [11], SEVER [8], or others [14, 16] – rely on additional assumptions either on the distribution of the data points or the fraction of Byzantine workers. Moreover, the existing gradient-filters do not obtain exact fault-tolerance unless there are redundant data points.

To the best of our knowledge, none of the prior works have proposed the idea of reactive redundancy for tolerating Byzantine workers efficiently in the context of parallelized learning. In other contexts, such as checkpointing and rollback recovery, mechanisms that combine proactive and reactive redundancy have been utilized. For instance, Pradhan and Vaidya [15] propose a mechanism where a small number of replicas are utilized proactively to allow detection of faulty replicas; when a faulty replica is detected, additional replicas are employed to isolate the faulty replicas.

## 4 Coding Schemes

In this section, we present a specific deterministic scheme for the generic case, and present further details for the randomized scheme.

### 4.1 Deterministic coding scheme

As an example of the deterministic scheme, we use a replication code. For simplicity, suppose that none of the Byzantine workers have been identified until iteration $t$. Then, the scheme for the $t$-th iteration follows the overview in Section 2.1, with replication as the fault-detection code.

The identified Byzantine worker(s) are eliminated from the subsequent iterations. After the corresponding updates, the above scheme is repeated for the $(t+1)$-th iteration.
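A replication code of this kind can be sketched for a single data point as follows. This is a minimal sketch based on the overview in Section 2.1, not the authors' exact protocol: the function name, the representation of a gradient as a hashable tuple, and the use of a majority vote in the reactive phase are assumptions. It also assumes that honest workers return bit-identical gradients for the same data point.

```python
from collections import Counter
from typing import Callable, Hashable, List, Set, Tuple

# Hypothetical interface: worker id -> gradient reported for one data point.
Report = Callable[[int], Hashable]


def check_data_point(report: Report, workers: List[int],
                     f: int) -> Tuple[Hashable, Set[int]]:
    """Return (gradient, identified Byzantine workers) for one data point.

    Detection phase: f + 1 workers compute the gradient. With at most f
    Byzantine workers at least one report is honest, so unanimous reports
    are all correct.
    Reactive phase: on disagreement, f more workers compute it. Among
    2f + 1 reports at least f + 1 are honest, so the majority value is
    correct and every worker that deviates from it is Byzantine.
    """
    if len(workers) < 2 * f + 1:
        raise ValueError("need at least 2f + 1 workers")
    first = {i: report(i) for i in workers[:f + 1]}
    if len(set(first.values())) == 1:
        return next(iter(first.values())), set()
    extra = {i: report(i) for i in workers[f + 1:2 * f + 1]}
    reports = {**first, **extra}
    majority, _ = Counter(reports.values()).most_common(1)[0]
    faulty = {i for i, value in reports.items() if value != majority}
    return majority, faulty
```

In this sketch a gradient is computed $f+1$ times when the replicas agree and $2f+1$ times when they disagree, which is where the two efficiency values discussed below come from.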

#### Computation efficiency

Let $\kappa_t$ be the number of Byzantine workers identified until the $t$-th iteration. If the master does not detect a fault in the $t$-th iteration then the computation efficiency of the scheme is $1/(f - \kappa_t + 1)$. Otherwise, the worst-case computation efficiency is $1/(2(f - \kappa_t) + 1)$.

As there are at most $f$ Byzantine workers, the master will detect faults and impose reactive redundancy in at most $f$ iterations. Thus, for $T$ iterations, the computation efficiency of the scheme is greater than or equal to $1/(f+1)$ for at least $T - f$ iterations. In case $T \gg f$, the average computation efficiency of the scheme is effectively greater than or equal to $1/(f+1)$.
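The averaging argument can be checked numerically. The helper below is a hypothetical illustration that assumes an efficiency of $1/(f+1)$ in fault-free iterations, the worst case $1/(2f+1)$ in at most $f$ iterations with reactive redundancy, and ignores the improvement gained once workers are eliminated.

```python
from fractions import Fraction


def average_efficiency_lower_bound(T: int, f: int) -> Fraction:
    """Lower bound on the average computation efficiency over T iterations,
    assuming reactive redundancy is triggered in at most f of them."""
    if T < 1 or f < 0 or T < f:
        raise ValueError("need T >= 1 and 0 <= f <= T")
    no_fault = Fraction(1, f + 1)        # f + 1 replicas per gradient
    with_fault = Fraction(1, 2 * f + 1)  # 2f + 1 replicas per gradient
    return ((T - f) * no_fault + f * with_fault) / T
```

For $f = 2$ and $T = 1000$ the bound is $1249/3750 \approx 0.3331$, already very close to $1/(f+1) = 1/3$.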

Note: We reiterate that a deterministic coding scheme with computation efficiency greater than $1/(f+1)$ in all iterations cannot have exact fault-tolerance against at most $f$ Byzantine workers [13]. However, communication efficiency can be improved using other codes.

### 4.2 Randomized coding scheme

In the randomized scheme, the master checks for faults (and identifies Byzantine workers if needed) only in randomly chosen intermittent iterations. In each iteration, the master runs the traditional parallelized-SGD method by default. However, before updating the parameter estimate, the master decides to check for faults in the received symbols (or gradients) with probability $q$. Fault-checks and identification of Byzantine workers (if needed) are done using the protocol outlined for the deterministic coding scheme in Section 4.1.

For the purpose of analysis, assume that a Byzantine worker $i$ tampers its gradient(s) independently in each iteration with probability at least $p > 0$. Then, $i$ remains unidentified by the master after $t$ iterations with probability less than or equal to $(1 - pq)^t$, which approaches $0$ as $t$ approaches $\infty$. In other words, $i$ gets identified almost surely. This holds for all Byzantine workers that tamper gradient(s) eventually.
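To get a feel for how quickly a tampering worker is caught, the geometric bound can be evaluated directly. The sketch below assumes the bound has the form $(1 - pq)^t$, i.e. that a worker is identified in any iteration in which it tampers (probability at least $p$) and the master independently runs a fault-check (probability $q$). The function names are hypothetical.

```python
import math


def unidentified_probability_bound(p: float, q: float, t: int) -> float:
    """Upper bound (1 - p*q)**t on the probability that a Byzantine worker
    is still unidentified after t iterations."""
    return (1.0 - p * q) ** t


def iterations_until(p: float, q: float, eps: float) -> int:
    """Smallest t for which the bound above is at most eps."""
    if not (0 < p <= 1 and 0 < q <= 1 and 0 < eps < 1):
        raise ValueError("p and q must lie in (0, 1], eps in (0, 1)")
    if p * q == 1.0:
        return 1  # caught in the first iteration with certainty
    return math.ceil(math.log(eps) / math.log(1.0 - p * q))
```

For example, with $p = 0.5$ and $q = 0.2$, `iterations_until(0.5, 0.2, 0.01)` returns 44, so under these assumptions the worker stays hidden for 44 iterations with probability at most 1%.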

#### Computation efficiency

As the master checks for faults with probability $q$ in each iteration, the expected computation efficiency of the randomized scheme is greater than or equal to

$$(1 - q) \times 1 + q \times \frac{1}{2f + 1} = 1 - q \left( \frac{2f}{2f + 1} \right). \tag{2}$$

The above lower bound for the expected computation efficiency is computed by assuming the worst-case where the master imposes redundancy for each gradient in the fault-detection phase. The actual computation efficiency will be larger than this lower bound. However, this lower bound suffices to understand the benefits of our coding scheme.

From the above, the expected computation efficiency of the randomized coding scheme can be made as close to one as desired by choosing $q$ appropriately. Specifically, for a $\delta > 0$, let

$$q = \delta \left( \frac{2f + 1}{2f} \right) \leq 1.$$

Then, the expected computation efficiency of the randomized coding scheme is greater than or equal to $1 - \delta$.
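The bound in (2) and the corresponding choice of fault-check probability can be evaluated with two small helpers. The function names are hypothetical; the formulas are those stated above.

```python
def expected_efficiency_lower_bound(q: float, f: int) -> float:
    """Right-hand side of (2): (1 - q) * 1 + q / (2f + 1)."""
    return (1.0 - q) + q / (2 * f + 1)


def check_probability_for(delta: float, f: int) -> float:
    """q = delta * (2f + 1) / (2f), for which the bound in (2) is 1 - delta."""
    if f < 1:
        raise ValueError("f must be at least 1")
    q = delta * (2 * f + 1) / (2 * f)
    if not 0.0 <= q <= 1.0:
        raise ValueError("delta is too large for q to be a probability")
    return q
```

For $f = 2$ and $\delta = 0.1$ this gives $q = 0.125$: checking one iteration in eight keeps the expected computation efficiency at or above $0.9$.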

#### Efficiency versus convergence-rate

A smaller probability of fault-checks $q$ implies higher efficiency, as is evident from (2). However, a smaller $q$ also means a higher probability of using faulty gradient(s) for updating the parameter estimate, which could result in slower convergence of the learning algorithm.

Suppose that each Byzantine worker chooses to tamper its gradient(s) independently with probability $p$. Then the probability of a faulty update in the $t$-th iteration (assuming none of the Byzantine workers have been identified yet) equals

$$(\text{probability of faulty gradients}) \times (\text{probability of \underline{not} checking for fault(s)}) = \left(1 - (1 - p)^f\right) \times (1 - q). \tag{3}$$
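Expression (3) can be cross-checked with a short simulation. The sketch below assumes exactly $f$ unidentified Byzantine workers that each tamper independently with probability $p$. The function names, the default seed and the number of trials are arbitrary choices for illustration.

```python
import random


def faulty_update_probability(p: float, q: float, f: int) -> float:
    """Equation (3): some Byzantine worker tampers and no fault-check runs."""
    return (1.0 - (1.0 - p) ** f) * (1.0 - q)


def simulate_faulty_update_rate(p: float, q: float, f: int,
                                trials: int, seed: int = 0) -> float:
    """Empirical frequency of faulty updates over independent iterations."""
    rng = random.Random(seed)
    faulty = 0
    for _ in range(trials):
        tampered = any(rng.random() < p for _ in range(f))
        checked = rng.random() < q
        if tampered and not checked:
            faulty += 1
    return faulty / trials
```

With $p = 0.5$, $f = 2$ and $q = 0.25$ the closed form gives $0.5625$, and the simulated rate should settle near that value. Lowering $q$ raises the bound in (2) but also raises the probability in (3).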

Therefore, determining an optimal value of $q$ is a multi-objective optimization problem with the following objectives:

• Objective 1: maximize the expected computation efficiency, given by (2).

• Objective 2: minimize the probability of faulty updates, given by (3).

These objectives cannot be met simultaneously. That is, there does not exist a $q$ that maximizes the expected computation efficiency and minimizes the probability of faulty updates at the same time. This trade-off between the computation efficiency and the reliability (or correctness) of the updates can be managed by the following adaptive approach.

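Let $\mathrm{comEff}_t(q)$ and $\mathrm{probF}_t(q)$ denote the expected computation efficiency and the probability of a faulty update in iteration $t$, respectively, if the probability of doing a fault-check equals $q$. Let $\kappa_t$ denote the number of identified Byzantine workers until iteration $t$. By substituting $f$ by

$$f_t = f - \kappa_t$$

in (2) and (3), we obtain

$$\mathrm{comEff}_t(q) = \frac{2 f_t (1 - q) + 1}{2 f_t + 1}, \quad \text{and} \quad \mathrm{probF}_t(q) = \left(1 - (1 - p)^{f_t}\right) \times (1 - q).$$

Note that maximizing $\mathrm{comEff}_t(q)$ is equivalent to minimizing $(1 - \mathrm{comEff}_t(q))^2$, and minimizing $\mathrm{probF}_t(q)$ is equivalent to minimizing $(\mathrm{probF}_t(q))^2$. Thus, the probability of fault-check in the $t$-th iteration, denoted by $q^*_t$, is given by the minimum point of the weighted average of $(1 - \mathrm{comEff}_t(q))^2$ and $(\mathrm{probF}_t(q))^2$, i.e.,

$$q^*_t = \arg\min_{q \in [0, 1]} \; (1 - \lambda_t) \left(1 - \mathrm{comEff}_t(q)\right)^2 + \lambda_t \left(\mathrm{probF}_t(q)\right)^2, \tag{4}$$

where $\lambda_t \in [0, 1]$. A higher value of $\lambda_t$ (greater than $1/2$) implies that minimizing $\mathrm{probF}_t(q)$ takes precedence over maximising $\mathrm{comEff}_t(q)$, and vice versa.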
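Under one plausible reading of this weighted objective, namely minimizing $(1 - \lambda_t)(1 - \mathrm{comEff}_t(q))^2 + \lambda_t (\mathrm{probF}_t(q))^2$ over $q \in [0, 1]$, the minimizer has a closed form because the objective is quadratic in $q$. The sketch below assumes that form. The function names are hypothetical, and returning $q = 0$ when the objective is identically zero is a convention chosen here.

```python
def objective(q: float, lam: float, p: float, f_t: int) -> float:
    """Assumed objective: (1 - lam) * (1 - comEff)^2 + lam * probF^2."""
    com_eff = (2 * f_t * (1.0 - q) + 1.0) / (2 * f_t + 1.0)
    prob_f = (1.0 - (1.0 - p) ** f_t) * (1.0 - q)
    return (1.0 - lam) * (1.0 - com_eff) ** 2 + lam * prob_f ** 2


def optimal_check_probability(lam: float, p: float, f_t: int) -> float:
    """Minimizer of the assumed objective over q in [0, 1].

    Writing 1 - comEff = a * q and probF = b * (1 - q), the objective is
    (1 - lam) * a^2 * q^2 + lam * b^2 * (1 - q)^2, whose stationary point
    q = lam * b^2 / ((1 - lam) * a^2 + lam * b^2) always lies in [0, 1].
    """
    a = 2 * f_t / (2 * f_t + 1.0)
    b = 1.0 - (1.0 - p) ** f_t
    denom = (1.0 - lam) * a * a + lam * b * b
    if denom == 0.0:
        return 0.0  # objective is constant, so no fault-checks are needed
    return lam * b * b / denom
```

For instance, with $\lambda_t = 1$ the function returns $1$ (always check), and with $p = 0$ or $f_t = 0$ it returns $0$ (never check).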

#### Choice of $\lambda_t$

We note that a suitable value of $\lambda_t$ can be computed using the average loss, denoted by $\ell_t$, computed over the chosen data points at the current parameter estimate. Specifically, if $Z^t$ denotes the set of data points chosen and $w^t$ denotes the current parameter estimate in the $t$-th iteration, then

$$\ell_t = \frac{1}{|Z^t|} \sum_{z \in Z^t} \ell(w^t, z).$$

Then,

$$\lambda_t = \left(1 - e^{-\ell_t}\right). \tag{5}$$

If $\lambda_t$ is given by (5), then for a higher observed loss $\ell_t$, minimizing the probability of faulty updates takes precedence. This is quite intuitive, as the master would prefer the updates to be fault-free when the observed loss is high, for an improved convergence-rate to the learning parameter.

The following boundary conditions further justify the choice of $\lambda_t$ given by (5).

• As $\ell_t$ approaches $\infty$, $\lambda_t$ approaches $1$. In this extreme case,

$$q^*_t = \arg\min_{q \in [0, 1]} \left(\mathrm{probF}_t(q)\right)^2 = 1.$$

Thus, the master checks for faults in almost all iterations when the observed loss is extremely high.

• If $p = 0$, i.e. Byzantine workers never tamper their gradients,

$$q^*_t = \arg\min_{q \in [0, 1]} \left(1 - \mathrm{comEff}_t(q)\right)^2 = 0.$$

If the gradients received from the Byzantine workers are correct with certainty then there is no need for fault-checks. Similarly, if $f_t = 0$, i.e. the master has identified all the Byzantine workers, then

$$q^*_t = \arg\min_{q \in [0, 1]} \left(1 - \mathrm{comEff}_t(q)\right)^2 = 0.$$

Note: To save on computation cost, the master may use the workers for computing $\ell_t$ in parallel. However, in this case the master would only be able to obtain an approximation of $\ell_t$, instead of the actual value, as up to $f$ of the workers are Byzantine. Nevertheless, an approximate $\ell_t$ suffices for the above adaptation. An approximation of $\ell_t$ can be computed by taking the truncated or trimmed mean of the average loss evaluated by the workers for their respective data points [22].
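The two ingredients of this adaptation, a trimmed mean of the worker-reported losses and the mapping (5) from loss to $\lambda_t$, can be sketched as follows. The sketch assumes that each worker reports a single average loss and that all reports are weighted equally. The function names are hypothetical, and the specific trimming rule (dropping the $f$ smallest and $f$ largest reports) is one common variant rather than necessarily the one used in [22].

```python
import math
from typing import Sequence


def trimmed_mean(values: Sequence[float], f: int) -> float:
    """Drop the f smallest and f largest reports and average the rest."""
    if len(values) <= 2 * f:
        raise ValueError("need more than 2f reports")
    kept = sorted(values)[f:len(values) - f]
    return sum(kept) / len(kept)


def fault_check_weight(avg_loss: float) -> float:
    """lambda_t = 1 - exp(-l_t), as in (5), for a non-negative loss."""
    if avg_loss < 0:
        raise ValueError("the loss is assumed to be non-negative")
    return 1.0 - math.exp(-avg_loss)
```

For example, `trimmed_mean([0.5, 0.7, 0.6, 100.0, 0.0], 1)` discards the two extreme reports and returns approximately $0.6$, so a single Byzantine worker cannot push $\lambda_t$ arbitrarily close to $0$ or $1$.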

## 5 Generalizations of the Randomized Coding Scheme

Our randomized scheme can be generalized as follows.

• Variants of the parallelized-SGD method: We can use the randomized scheme even for different variants of the parallelized-SGD method where workers send compressed or communication-efficient gradients, as proposed in [1, 2, 19, 20].

• Self-checks: Instead of imposing reactive redundancy, the master can compute the gradients on its own, and compare them with the gradients received from the workers to check for faults. Similarly as above, the master may optimize the additional workload by choosing the probability of fault-checks adaptively as presented in Section 4.3.

• Selective fault-checks: Gradients (or symbols) that are outliers amongst the received gradients (or symbols) should be checked for faults with relatively higher probability. Additionally, the master can assign reliability scores to the workers, as done in the context of reliable crowdsourcing [18]. Symbols from workers with lower reliability scores should be checked for faults with higher probability.

• Gradient-filters: The master can further improve on the computation efficiency by combining the randomized coding scheme with lightweight gradient-filters [10, 14, 23]. When using gradient-filters, the master does not have to identify all the Byzantine workers. This idea has been explored in Rajput et al., 2019 [17] for a deterministic coding scheme.

• Distributed learning framework: Our randomized scheme can also be used for Byzantine fault-tolerance in a distributed learning framework, where the data points are distributed amongst the workers, i.e. two workers may have different sets of data points [6, 23]. In this case, besides checking for faulty gradient(s), the master must also validate the data points used by the workers for computing the gradients in the first place. As most existing data validation tools are computationally expensive [9, 12, 18, 21], the master may use our randomized scheme to optimize the trade-off between the cost of data validation and the convergence-rate of a distributed learning algorithm.

## 6 Summary

In this report, we have presented two coding schemes, a deterministic scheme and a randomized scheme, for exact Byzantine fault-tolerance in the parallelized-SGD learning algorithm.

In the deterministic scheme, the master uses a fault-detection code in each iteration. Upon detecting any fault(s), the master imposes reactive redundancy to correct the faults and identify the Byzantine worker(s) responsible for the fault(s).

The randomized scheme improves upon the computation efficiency of the deterministic scheme. Here, the master uses fault-detection codes only in randomly chosen intermittent iterations, instead of all the iterations. By doing so, the master is able to optimize the trade-off between the expected computation efficiency, and the convergence-rate of the parallelized learning algorithm.

## Acknowledgements

Research reported in this paper was sponsored in part by the Army Research Laboratory under Cooperative Agreement W911NF-17-2-0196, and by National Science Foundation award 1610543. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Laboratory, National Science Foundation or the U.S. Government.