Machine learning has received a lot of attention over the past few years. Its applications range from image classification, financial trend prediction, and disease diagnosis to gaming and driving. Most major companies are currently investing in machine learning technologies to support their businesses. Roughly speaking, machine learning consists in giving a computer the ability to improve the way it solves a problem with the quantity and quality of information it can use. In short, the computer has a list of internal parameters, called the parameter vector, which allows the computer to formulate answers to questions such as "is there a cat in this picture?". According to how many correct and incorrect answers are provided, a specific error cost is associated with the parameter vector. Learning is the process of updating this parameter vector in order to minimize the cost.
The increasing amount of data involved, as well as the growing complexity of models, has led to learning schemes that require a lot of computational resources. As a consequence, most industry-grade machine-learning implementations are now distributed. For example, as of 2012, Google reportedly used 16,000 processors to train an image classifier. However, distributing a computation over several machines induces a higher risk of failures, including crashes and computation errors. In the worst case, the system may undergo Byzantine failures, i.e., completely arbitrary behaviors of some of the machines involved. In practice, such failures may be due to stalled processes, or biases in the way the data samples are distributed among the processes.
A classical approach to mask failures in distributed systems is to use a state machine replication protocol, which, however, requires state transitions to be applied by all processes. In the case of distributed machine learning, this constraint can be seen in two ways: either (a) the processes agree on a sample of data based on which they update their local parameter vectors, or (b) they agree on how the parameter vector should be updated. In case (a), the sample of data has to be transmitted to each process, which then has to perform a heavyweight computation to update its local parameter vector. This entails communication and computational costs that defeat the entire purpose of distributing the work. In case (b), the processes have no way to check whether the chosen update for the parameter vector has indeed been computed correctly on real data (a Byzantine process could have proposed the update). Byzantine failures may therefore easily prevent the convergence of the learning algorithm. Neither of these solutions is satisfactory in a realistic distributed machine learning setting.
Stochastic Gradient Descent (SGD) underlies many learning algorithms, whether for training neural networks, regression, or matrix factorization [25]. In all those cases, a cost function, depending on the parameter vector, is minimized based on stochastic estimates of its gradient. Distributed implementations of SGD typically take the following form: a single parameter server is in charge of updating the parameter vector, while worker processes perform the actual update estimation, based on the share of data they have access to. More specifically, the parameter server executes synchronous rounds, during each of which the parameter vector is broadcast to the workers. In turn, each worker computes an estimate of the update to apply (an estimate of the gradient), and the parameter server aggregates their results to finally update the parameter vector. Today, this aggregation is typically implemented through averaging, or variants of it [24, 10, 23].
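The synchronous scheme described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of any particular system: the toy cost $Q(x) = \frac{1}{2}\|x\|^2$, the Gaussian noise model, the constant learning rate, and the names `worker_estimate`, `average` and `run_sgd` are all introduced here for illustration.

```python
import random

def worker_estimate(x, rng, noise=0.1):
    """Noisy, unbiased estimate of the gradient of the assumed toy cost 0.5 * ||x||^2."""
    return [xi + rng.gauss(0.0, noise) for xi in x]

def average(vectors):
    """Aggregation rule: coordinate-wise average of the workers' estimates."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def run_sgd(x0, n_workers=5, rounds=200, lr=0.1, seed=0):
    """Each round: broadcast x, collect one estimate per worker, aggregate, update."""
    rng = random.Random(seed)
    x = list(x0)
    for _ in range(rounds):
        estimates = [worker_estimate(x, rng) for _ in range(n_workers)]
        update = average(estimates)
        x = [xi - lr * ui for xi, ui in zip(x, update)]
    return x

# With only correct workers, the parameter vector approaches the minimizer (the origin).
final = run_sgd([5.0, -3.0, 2.0])
```

The constant learning rate is a simplification; it keeps the sketch short but leaves a small residual noise around the minimizer.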
The question we address in this paper is how a distributed SGD can be devised to tolerate Byzantine processes among the workers.
We first show in this paper that no linear combination of the updates proposed by the workers (the approach used today) can tolerate a single Byzantine worker. Basically, the Byzantine worker can force the parameter server to choose any arbitrary vector, even one that is too large in amplitude or too far in direction from the other vectors. Clearly, the Byzantine worker can prevent any classic averaging-based approach from converging. Choosing the appropriate update from the vectors proposed by the workers turns out to be challenging. A non-linear, distance-based choice function that chooses, among the proposed vectors, the vector "closest to everyone else" (for example by taking the vector that minimizes the sum of the distances to every other vector) might look appealing. Yet, such a distance-based choice tolerates only a single Byzantine worker. Two Byzantine workers can collude, one helping the other to be selected, by moving the barycenter of all the vectors farther from the "correct area".
We formulate a Byzantine resilience property capturing sufficient conditions for the parameter server's choice to tolerate $f$ Byzantine workers. Essentially, to guarantee that the cost will decrease despite Byzantine workers, we require the parameter server's choice (a) to point, on average, to the same direction as the gradient and (b) to have statistical moments (up to the fourth moment) bounded above by a homogeneous polynomial in the moments of a correct estimator of the gradient. One way to ensure such a resilience property is to consider a majority-based approach, looking at every subset of $n - f$ vectors, and considering the subset with the smallest diameter. While this approach is more robust to Byzantine workers that propose vectors far from the correct area, its exponential computational cost is prohibitive. Interestingly, combining the intuitions of the majority-based and distance-based methods, we can choose the vector that is somehow the closest to its $n - f$ neighbors, namely, the one that minimizes a distance-based criterion, but only within its neighbors. This is the main idea behind our choice function, which we call Krum (footnote 1: Krum, in Greek Κρούμος, was a Bulgarian Khan of the end of the eighth century, who undertook offensive attacks against the Byzantine empire. Bulgaria doubled in size during his reign.). Assuming $2f + 2 < n$, we show (using techniques from multi-dimensional stochastic calculus) that our Krum function satisfies the aforementioned resilience property and that the corresponding machine learning scheme converges. An important advantage of the Krum function is that it requires $O(n^2 \cdot d)$ local computation time, where $d$ is the dimension of the parameter vector. (In modern machine learning, the dimension of the parameter vector may take values in the hundreds of billions.) For simplicity of presentation, we first introduce a version of the Krum function that selects only one vector. Then we discuss how this method can be iterated to leverage the contribution of more than one single correct worker.
Section 2 recalls the classical model of distributed SGD. Section 3 proves that linear combinations (solutions used today) are not resilient even to a single Byzantine worker, then introduces the new concept of $(\alpha, f)$-Byzantine resilience. In Section 4, we introduce the Krum function, compute its computational cost and prove its $(\alpha, f)$-Byzantine resilience. In Section 5 we analyze the convergence of a distributed SGD using our Krum function. In Section 6 we discuss how Krum can be iterated to leverage the contribution of more workers. Finally, we discuss related work and open problems in Section 7.
We consider a general distributed system consisting of a parameter server (footnote 2: The parameter server is assumed to be reliable. Classical techniques of state-machine replication can be used to avoid this single point of failure.) and $n$ workers, $f$ of them possibly Byzantine (behaving arbitrarily). Computation is divided into (infinitely many) synchronous rounds. During round $t$, the parameter server broadcasts its parameter vector $x_t \in \mathbb{R}^d$ to all the workers. Each correct worker $p$ computes an estimate $V_p^t = G(x_t, \xi_p^t)$ of the gradient $\nabla Q(x_t)$ of the cost function $Q$, where $\xi_p^t$ is a random variable representing, e.g., the sample drawn from the dataset. A Byzantine worker $b$ proposes a vector $V_b^t$ which can be arbitrary (see Figure 1).
Note that, since the communication is synchronous, if the parameter server does not receive a vector value $V_b^t$ from a given Byzantine worker $b$, then the parameter server acts as if it had received the default value $V_b^t = 0$ instead.
The parameter server computes a vector $F(V_1^t, \dots, V_n^t)$ by applying a deterministic function $F$ to the vectors received. We refer to $F$ as the choice function of the parameter server. The parameter server updates the parameter vector using the following SGD equation:
$$x_{t+1} = x_t - \gamma_t \cdot F(V_1^t, \dots, V_n^t).$$
In this paper, we assume that the correct (non-Byzantine) workers compute unbiased estimates of the gradient. More precisely, in every round $t$, the vectors $V_i^t$ proposed by the correct workers are independent identically distributed random vectors, $V_i^t \sim G(x_t, \xi_i^t)$, with $\mathbb{E}\, G(x_t, \xi_i^t) = \nabla Q(x_t)$. This can be achieved by ensuring that each sample of data used for computing the gradient is drawn uniformly and independently, as classically assumed in the machine learning literature.
The Byzantine workers have full knowledge of the system, including the choice function $F$ and the vectors proposed by the other workers, and can collaborate with each other.
3 Byzantine Resilience
In most SGD-based learning algorithms used today [3, 6, 25, 5], the choice function consists in computing the average of the input vectors. Lemma 1 below states that no linear combination of the vectors can tolerate a single Byzantine worker. In particular, averaging is not robust to Byzantine failures.
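Lemma 1. Consider a choice function $F_{lin}$ of the form
$$F_{lin}(V_1, \dots, V_n) = \sum_{i=1}^{n} \lambda_i \cdot V_i$$
where the $\lambda_i$'s are non-zero scalars. Let $U$ be any vector in $\mathbb{R}^d$. A single Byzantine worker can make $F_{lin}$ always select $U$. In particular, a single Byzantine worker can prevent convergence.
Proof. If the Byzantine worker proposes vector $V_n = \frac{1}{\lambda_n} \cdot U - \sum_{i=1}^{n-1} \frac{\lambda_i}{\lambda_n} V_i$, then $F_{lin} = U$. Note that the parameter server could cancel the effects of the Byzantine behavior by setting, for example, $\lambda_n$ to 0, but this requires means to detect which worker is Byzantine. ∎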
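The attack used in this proof can be checked numerically. The sketch below assumes a choice function that is a weighted sum with non-zero weights $\lambda_i$ (plain averaging in the example), with the Byzantine worker in the last position; the function names and the sample values are illustrative only.

```python
def linear_choice(vectors, weights):
    """Weighted sum of the proposals: sum_i lambda_i * V_i."""
    dim = len(vectors[0])
    return [sum(w * v[k] for w, v in zip(weights, vectors)) for k in range(dim)]

def byzantine_proposal(target, honest, weights):
    """Vector the last worker sends so that the weighted sum equals `target`."""
    lam_n = weights[-1]
    # zip stops at the shorter list, so only the honest workers' weights are used here.
    return [
        (target[k] - sum(w * v[k] for w, v in zip(weights, honest))) / lam_n
        for k in range(len(target))
    ]

honest = [[1.0, 2.0], [1.2, 1.8], [0.9, 2.1]]
weights = [0.25, 0.25, 0.25, 0.25]   # plain averaging over n = 4 workers
target = [-50.0, 7.0]                # arbitrary vector picked by the attacker
attack = byzantine_proposal(target, honest, weights)
forced = linear_choice(honest + [attack], weights)
```

Up to floating-point rounding, `forced` coincides with `target`, whatever the honest workers proposed.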
In the following, we define basic requirements on an appropriate robust choice function. Intuitively, the choice function should output a vector $F$ that is not too far from the "real" gradient $g$, more precisely, the vector that points to the steepest direction of the cost function being optimized. This is expressed as a lower bound (condition (i)) on the scalar product of the (expected) vector $F$ and $g$. Figure 2 illustrates the situation geometrically. If $\mathbb{E}F$ belongs to the ball centered at $g$ with radius $r$, then the scalar product is bounded below by a term involving $\sin\alpha = r / \|g\|$.
Condition (ii) is more technical, and states that the moments of $F$ should be controlled by the moments of the (correct) gradient estimator $G$. The bounds on the moments of $G$ are classically used to control the effects of the discrete nature of the SGD dynamics. Condition (ii) makes it possible to transfer this control to the choice function.
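Definition 1 ($(\alpha, f)$-Byzantine Resilience).
Let $0 \le \alpha < \pi/2$ be any angular value, and $0 \le f \le n$ any integer. Let $V_1, \dots, V_n$ be any independent identically distributed random vectors in $\mathbb{R}^d$, $V_i \sim G$, with $\mathbb{E}G = g$. Let $B_1, \dots, B_f$ be any random vectors in $\mathbb{R}^d$, possibly dependent on the $V_i$'s. Choice function $F$ is said to be $(\alpha, f)$-Byzantine resilient if, for any $1 \le j_1 < \dots < j_f \le n$, the vector
$$F = F(V_1, \dots, \underbrace{B_1}_{j_1}, \dots, \underbrace{B_f}_{j_f}, \dots, V_n)$$
satisfies (i) $\langle \mathbb{E}F, g \rangle \ge (1 - \sin\alpha) \cdot \|g\|^2 > 0$ and (ii) for $r = 2, 3, 4$, $\mathbb{E}\|F\|^r$ is bounded above by a linear combination of terms $\mathbb{E}\|G\|^{r_1} \cdots \mathbb{E}\|G\|^{r_{n-1}}$ with $r_1 + \dots + r_{n-1} = r$.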
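The link between the ball of Figure 2 and the scalar-product bound of condition (i) can be made explicit with a short derivation. The notation is assumed from context: $g$ is the true gradient, $F$ the output of the choice function, $r < \|g\|$ the radius of the ball containing $\mathbb{E}F$, and $\sin\alpha = r/\|g\|$.

```latex
\begin{align*}
\langle \mathbb{E}F,\, g \rangle
  &= \|g\|^2 + \langle \mathbb{E}F - g,\; g \rangle \\
  &\ge \|g\|^2 - \|\mathbb{E}F - g\| \cdot \|g\| && \text{(Cauchy--Schwarz)} \\
  &\ge \|g\|^2 - r\,\|g\|
   = \Bigl(1 - \tfrac{r}{\|g\|}\Bigr)\|g\|^2
   = (1 - \sin\alpha)\,\|g\|^2 .
\end{align*}
```

The bound is strictly positive exactly when $r < \|g\|$, i.e., when the ball does not contain the origin.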
4 The Krum Function
We now introduce Krum, our choice function, which, we show, satisfies the $(\alpha, f)$-Byzantine resilience condition. The barycentric choice function $F_{bary} = \frac{1}{n}\sum_{i=1}^{n} V_i$ can be defined as the vector in $\mathbb{R}^d$ that minimizes the sum of squared distances to the $V_i$'s, $\sum_{i=1}^{n} \|F_{bary} - V_i\|^2$. Lemma 1, however, states that this approach does not tolerate even a single Byzantine failure. One could try to define the choice function in order to select, among the $V_i$'s, the vector $U \in \{V_1, \dots, V_n\}$ that minimizes the sum $\sum_{i} \|U - V_i\|^2$. Intuitively, vector $U$ would be close to every proposed vector, including the correct ones, and thus would be close to the "real" gradient. However, all Byzantine workers but one may propose vectors that are large enough to move the total barycenter far away from the correct vectors, while the remaining Byzantine worker proposes this barycenter. Since the barycenter always minimizes the sum of squared distances, this last Byzantine worker is certain to have its vector chosen by the parameter server. This situation is depicted in Figure 3. In other words, since this choice function takes into account all the vectors, including the very remote ones, the Byzantine workers can collude to force the choice of the parameter server.
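The collusion described above can be reproduced on a small example. The sketch assumes the selection rule "pick the proposed vector that minimizes the sum of squared distances to all proposals"; the function names and numeric values are illustrative. One Byzantine worker sends a remote decoy, and the other sends the mean of the remaining $n - 1$ vectors, which makes its own proposal coincide with the barycenter of all $n$ proposals.

```python
def sq_dist(u, v):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def closest_to_all(vectors):
    """Index of the proposal minimizing the sum of squared distances to all proposals."""
    totals = [sum(sq_dist(u, v) for v in vectors) for u in vectors]
    return totals.index(min(totals))

def mean(vectors):
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

honest = [[1.0, 1.0], [1.1, 0.9], [0.9, 1.1]]
decoy = [1000.0, 1000.0]            # first Byzantine worker drags the barycenter away
colluder = mean(honest + [decoy])   # second Byzantine worker sits on the barycenter
proposals = honest + [decoy, colluder]
chosen = closest_to_all(proposals)  # index 4 is the colluding worker
```

Here the selected vector is roughly $(250.75, 250.75)$, far from the honest proposals clustered around $(1, 1)$.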
Our approach to circumvent this issue is to preclude the vectors that are too far away. More precisely, we define our Krum choice function $Kr(V_1, \dots, V_n)$ as follows. For any $i \ne j$, we denote by $i \to j$ the fact that $V_j$ belongs to the $n - f - 2$ closest vectors to $V_i$. Then, we define for each worker $i$ the score $s(i) = \sum_{i \to j} \|V_i - V_j\|^2$, where the sum runs over the $n - f - 2$ closest vectors to $V_i$. Finally, $Kr(V_1, \dots, V_n) = V_{i_*}$, where $i_*$ refers to the worker minimizing the score, $s(i_*) \le s(i)$ for all $i$ (footnote 3: If two or more workers have the minimal score, we choose the vector of the worker with the smallest identifier.).
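The definition above translates almost directly into code. The following is a minimal, unoptimized sketch: it assumes $n$ proposals of which at most $f$ are Byzantine, with $2f + 2 < n$, scores each proposal by the sum of squared distances to its $n - f - 2$ closest neighbors, and breaks ties by smallest index. The names `krum_scores` and `krum` are illustrative.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def krum_scores(vectors, f):
    """Score of each proposal: sum of squared distances to its n - f - 2 closest vectors."""
    n = len(vectors)
    if not 2 * f + 2 < n:
        raise ValueError("this sketch assumes 2f + 2 < n")
    k = n - f - 2
    scores = []
    for i, vi in enumerate(vectors):
        dists = sorted(sq_dist(vi, vj) for j, vj in enumerate(vectors) if j != i)
        scores.append(sum(dists[:k]))
    return scores

def krum(vectors, f):
    """Return (index, vector) of the proposal with the smallest score.

    list.index returns the first position of the minimum, so ties go to
    the worker with the smallest index.
    """
    scores = krum_scores(vectors, f)
    best = scores.index(min(scores))
    return best, vectors[best]

proposals = [[1.0, 1.0], [1.1, 0.9], [0.9, 1.1], [1.0, 1.05], [100.0, -100.0]]
index, chosen = krum(proposals, f=1)   # selects one of the clustered proposals
```

In this example the remote vector at index 4 gets a very large score, so it cannot be selected; the chosen vector is the clustered proposal whose two nearest neighbors are closest.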
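Lemma 2. The time complexity of the Krum function $Kr(V_1, \dots, V_n)$, where $V_1, \dots, V_n$ are $d$-dimensional vectors, is $O(n^2 \cdot d)$.
Proof. For each $V_i$, the parameter server computes the $n$ squared distances $\|V_i - V_j\|^2$ (time $O(n \cdot d)$). Then the parameter server sorts these distances (time $O(n \cdot \log n)$) and sums the $n - f - 2$ smallest distances to the other vectors (time $O(n)$). Thus, computing the score of all the $V_i$'s takes $O(n^2 \cdot d)$. An additional term $O(n)$ is required to find the minimum score, but is negligible relative to $O(n^2 \cdot d)$. ∎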
Proposition 1 below states that, if $2f + 2 < n$ and the gradient estimator is accurate enough (its standard deviation is relatively small compared to the norm of the gradient), then the Krum function is $(\alpha, f)$-Byzantine resilient, where angle $\alpha$ depends on the ratio of the deviation over the gradient. When the Krum function selects a correct vector (i.e., a vector proposed by a correct worker), the proof of this fact is relatively easy, since the probability distribution of this correct vector is that of the gradient estimator. The core difficulty occurs when the Krum function selects a Byzantine vector (i.e., a vector proposed by a Byzantine worker), because the distribution of this vector is completely arbitrary, and may even depend on the correct vectors. In a very general sense, this part of our proof is reminiscent of the median technique: the median of $n > 2f$ scalar values is always bounded below and above by values proposed by correct workers. Extending this observation to our multi-dimensional setting is not trivial. To do so, we notice that the chosen Byzantine vector $B_k$ has a score not greater than any score of a correct worker. This allows us to derive an upper bound on the distance between $B_k$ and the real gradient. This upper bound involves a sum of distances from correct to correct neighbor vectors, and distances from correct to Byzantine neighbor vectors. As explained above, the first term is relatively easy to control. For the second term, we observe that a correct vector $V_i$ has $n - f - 2$ neighbors (the $n - f - 2$ closest vectors to $V_i$), and $f + 1$ non-neighbors. In particular, the distance from any (possibly Byzantine) neighbor $V_j$ to $V_i$ is bounded above by a correct to correct vector distance. In other words, we manage to control the distance between the chosen Byzantine vector and the real gradient by an upper bound involving only distances between vectors proposed by correct workers.
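The scalar median observation mentioned above is easy to check on a toy example. The sketch assumes five correct values and two Byzantine values (a strict minority); the helper name and the numbers are illustrative.

```python
import statistics

def median_within_honest_range(honest, byzantine):
    """True if the median of all values lies between the honest minimum and maximum."""
    med = statistics.median(honest + byzantine)
    return min(honest) <= med <= max(honest)

honest = [0.9, 1.0, 1.1, 1.2, 1.3]
# Several placements of the two Byzantine values: both high, both low, split, mixed.
cases = [[1e9, 1e9], [-1e9, -1e9], [-1e9, 1e9], [1.05, 1e9]]
results = [median_within_honest_range(honest, b) for b in cases]
```

Every entry of `results` is `True`: with a strict minority of Byzantine values, the median is sandwiched between two correct values.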
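Proposition 1. Let $V_1, \dots, V_n$ be any independent and identically distributed random $d$-dimensional vectors s.t. $V_i \sim G$, with $\mathbb{E}G = g$ and $\mathbb{E}\|G - g\|^2 = d\sigma^2$. Let $B_1, \dots, B_f$ be any random vectors, possibly dependent on the $V_i$'s. If $2f + 2 < n$ and $\eta(n, f) \cdot \sqrt{d} \cdot \sigma < \|g\|$, where
$$\eta(n, f) = \sqrt{2\left(n - f + \frac{f \cdot (n - f - 2) + f^2 \cdot (n - f - 1)}{n - 2f - 2}\right)},$$
then the Krum function Kr is $(\alpha, f)$-Byzantine resilient where $0 \le \alpha < \pi/2$ is defined by
$$\sin\alpha = \frac{\eta(n, f) \cdot \sqrt{d} \cdot \sigma}{\|g\|}.$$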
The condition on the norm of the gradient, $\eta(n, f) \cdot \sqrt{d} \cdot \sigma < \|g\|$, can be satisfied, to a certain extent, by having the (correct) workers compute their gradient estimates on mini-batches. Indeed, averaging the gradient estimates over a mini-batch divides the deviation by the square root of the size of the mini-batch.
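The square-root scaling follows from independence. In the short derivation below, the batch size $b$ and the batch estimator $\bar{G}_b$ are notation introduced here for illustration, and the per-sample estimates $G_1, \dots, G_b$ are assumed independent with mean $g$ and $\mathbb{E}\|G_k - g\|^2 = d\sigma^2$.

```latex
\begin{align*}
\bar{G}_b &= \frac{1}{b}\sum_{k=1}^{b} G_k,
  \qquad \mathbb{E}\,\bar{G}_b = g, \\
\mathbb{E}\,\|\bar{G}_b - g\|^2
  &= \frac{1}{b^2}\sum_{k=1}^{b}\sum_{l=1}^{b}
     \mathbb{E}\,\langle G_k - g,\; G_l - g \rangle
   = \frac{1}{b^2}\sum_{k=1}^{b} \mathbb{E}\,\|G_k - g\|^2
   = \frac{d\,\sigma^2}{b}.
\end{align*}
```

The cross terms vanish because the $G_k - g$ are independent and centered, so the deviation $\sigma$ is replaced by $\sigma/\sqrt{b}$.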
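Proof. Without loss of generality, we assume that the Byzantine vectors $B_1, \dots, B_f$ occupy the last $f$ positions in the list of arguments of Kr, i.e., $Kr = Kr(V_1, \dots, V_{n-f}, B_1, \dots, B_f)$. An index is correct if it refers to a vector among $V_1, \dots, V_{n-f}$. An index is Byzantine if it refers to a vector among $B_1, \dots, B_f$. For each index (correct or Byzantine) $i$, we denote by $\delta_c(i)$ (resp. $\delta_b(i)$) the number of correct (resp. Byzantine) indices $j$ such that $i \to j$. We have
$$\delta_c(i) + \delta_b(i) = n - f - 2, \qquad n - 2f - 2 \le \delta_c(i) \le n - f - 2, \qquad \delta_b(i) \le f.$$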
We focus first on the condition (i) of -Byzantine resilience. We determine an upper bound on the squared distance . Note that, for any correct , . We denote by the index of the vector chosen by the Krum function.
where $\mathbb{I}(P)$ denotes the indicator function (footnote 4: $\mathbb{I}(P)$ equals 1 if the predicate $P$ is true, and 0 otherwise.). We examine the case $i_* = i$ for some correct index $i$.
We now examine the case for some Byzantine index . The fact that minimizes the score implies that for all correct indices
Then, for all correct indices
We focus on the term . Each correct worker has neighbors, and non-neighbors. Thus there exists a correct worker which is farther from than any of the neighbors of . In particular, for each Byzantine index such that , . Whence
Putting everything back together, we obtain
By assumption, , i.e., belongs to a ball centered at with radius . This implies
To sum up, condition (i) of the -Byzantine resilience property holds. We now focus on condition (ii).
Denoting by a generic constant, when , we have for all correct indices
The second inequality comes from the equivalence of norms in finite dimension. Now
Since the ’s are independent, we finally obtain that is bounded above by a linear combination of terms of the form with . This completes the proof of condition (ii). ∎
5 Convergence Analysis
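In this section, we analyze the convergence of the SGD using our Krum function defined in Section 4. The SGD equation is expressed as follows:
$$x_{t+1} = x_t - \gamma_t \cdot Kr(V_1^t, \dots, V_n^t)$$
where at least $n - f$ vectors among the $V_i^t$'s are correct, while the other ones may be Byzantine. For a correct index $i$, $V_i^t = G(x_t, \xi_i^t)$ where $G$ is the gradient estimator. We define the local standard deviation $\sigma(x)$ by
$$d \cdot \sigma^2(x) = \mathbb{E}\,\|G(x, \xi) - \nabla Q(x)\|^2.$$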
The following proposition considers an (a priori) non-convex cost function. In the context of non-convex optimization, even in the centralized case, it is generally hopeless to aim at proving that the parameter vector tends to a local minimum. Many criteria may be used instead. We follow , and we prove that the parameter vector almost surely reaches a “flat” region (where the norm of the gradient is small), in a sense explained below.
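Proposition 2. We assume that (i) the cost function $Q$ is three times differentiable with continuous derivatives, and is non-negative, $Q(x) \ge 0$; (ii) the learning rates satisfy $\sum_t \gamma_t = \infty$ and $\sum_t \gamma_t^2 < \infty$; (iii) the gradient estimator satisfies $\mathbb{E}\,G(x, \xi) = \nabla Q(x)$ and, for all $r \in \{2, \dots, 4\}$, $\mathbb{E}\,\|G(x, \xi)\|^r \le A_r + B_r \|x\|^r$ for some constants $A_r, B_r$; (iv) there exists a constant $0 \le \alpha < \pi/2$ such that for all $x$
$$\eta(n, f) \cdot \sqrt{d} \cdot \sigma(x) \le \|\nabla Q(x)\| \cdot \sin\alpha;$$
(v) finally, beyond a certain horizon, $\|x\|^2 \ge D$, there exist $\epsilon > 0$ and $0 \le \beta < \pi/2 - \alpha$ such that
$$\|\nabla Q(x)\| \ge \epsilon > 0, \qquad \frac{\langle x, \nabla Q(x) \rangle}{\|x\| \cdot \|\nabla Q(x)\|} \ge \cos\beta.$$
Then the sequence of gradients $\nabla Q(x_t)$ converges almost surely to zero.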
Conditions (i) to (iv) are the same conditions as in the non-convex convergence analysis that we follow. Condition (v) is slightly stronger than the corresponding condition in that analysis, and states that, beyond a certain horizon, the cost function $Q$ is "convex enough", in the sense that the direction of the gradient is sufficiently close to the direction of the parameter vector $x$. Condition (iv), however, states that the gradient estimator used by the correct workers has to be accurate enough, i.e., the local standard deviation $\sigma(x)$ should be small relative to the norm of the gradient. Of course, the norm of the gradient tends to zero near, e.g., extremal and saddle points. Actually, the ratio $\eta(n, f) \cdot \sqrt{d} \cdot \sigma(x) / \|\nabla Q(x)\|$ controls the maximum angle between the gradient $\nabla Q(x)$ and the vector chosen by the Krum function. In the regions where $\|\nabla Q(x)\| < \eta(n, f) \cdot \sqrt{d} \cdot \sigma(x)$, the Byzantine workers may take advantage of the noise (measured by $\sigma(x)$) in the gradient estimator $G$ to bias the choice of the parameter server. Therefore, Proposition 2 is to be interpreted as follows: in the presence of Byzantine workers, the parameter vector $x_t$ almost surely reaches a basin around points where the gradient is small ($\|\nabla Q(x)\| \le \eta(n, f) \cdot \sqrt{d} \cdot \sigma(x)$), i.e., points where the cost landscape is "almost flat".
Note that the convergence analysis is based only on the fact that function Kr is $(\alpha, f)$-Byzantine resilient. Due to space limitation, the complete proof of Proposition 2 is deferred to the Appendix.
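The behavior this analysis describes can be observed in a small simulation. The sketch below is an illustration under stated assumptions, not an experiment from the paper: the toy cost $Q(x) = \frac{1}{2}\|x\|^2$, $n = 7$ workers with $f = 2$ colluding Byzantine workers that always send the same large vector, Gaussian gradient noise, and a constant learning rate are all choices made here. It compares a Krum-style choice with plain averaging.

```python
import random

def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def krum(vectors, f):
    """Proposal whose n - f - 2 nearest neighbors are closest (smallest index on ties)."""
    k = len(vectors) - f - 2
    scores = []
    for i, vi in enumerate(vectors):
        dists = sorted(sq_dist(vi, vj) for j, vj in enumerate(vectors) if j != i)
        scores.append(sum(dists[:k]))
    return vectors[scores.index(min(scores))]

def average(vectors):
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def run(choice, rounds=200, n=7, f=2, lr=0.1, seed=0):
    """SGD on the assumed toy cost 0.5 * ||x||^2 with f colluding Byzantine workers."""
    rng = random.Random(seed)
    x = [5.0, -3.0]
    for _ in range(rounds):
        honest = [[xi + rng.gauss(0.0, 0.05) for xi in x] for _ in range(n - f)]
        byzantine = [[100.0, 100.0] for _ in range(f)]  # assumed attack: same large vector
        update = choice(honest + byzantine)
        x = [xi - lr * ui for xi, ui in zip(x, update)]
    return x

def norm(x):
    return sum(c * c for c in x) ** 0.5

x_krum = run(lambda vs: krum(vs, f=2))   # ends close to the minimizer at the origin
x_avg = run(average)                     # dragged far away by the Byzantine proposals
```

With these settings the averaged update settles near $(-40, -40)$, where the honest gradients cancel the Byzantine contribution, while the Krum-based run stays in a small neighborhood of the origin.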
For the sake of simplicity, we write . Before proving the main claim of the proposition, we first show that the sequence is almost surely globally confined within the region .
This becomes an equality when . Applying this inequality to yields
Let denote the -algebra encoding all the information up to round . Taking the conditional expectation with respect to yields
Thanks to condition (ii) of -Byzantine resilience, and the assumption on the first four moments of , there exist positive constants such that
Thus, there exist positive constant such that
When , the first term of the right hand side is null because . When , this first term is negative because (see Figure 4)
We define two auxiliary sequences
Note that the sequence converges because . Then
Consider the indicator of the positive variations of the left-hand side
The right-hand side of the previous inequality is the summand of a convergent series. By the quasi-martingale convergence theorem , this shows that the sequence converges almost surely, which in turn shows that the sequence converges almost surely, .
Let us assume that . When is large enough, this implies that and are greater than . Inequality 1 becomes an equality, which implies that the following infinite sum converges almost surely
Note that the sequence converges to a positive value. In the region , we have
This contradicts the fact that . Therefore, the sequence converges to zero. This convergence implies that the sequence is bounded, i.e., the vector is confined in a bounded region containing the origin. As a consequence, any continuous function of is also bounded, such as, e.g., , and all the derivatives of the cost function . In the sequel, positive constants etc…are introduced whenever such a bound is used.
We proceed to show that the gradient converges almost surely to zero. We define
Using a first-order Taylor expansion and bounding the second derivative with , we obtain
By the properties of -Byzantine resiliency, this implies
which in turn implies that the positive variations of are also bounded
The right-hand side is the summand of a convergent infinite sum. By the quasi-martingale convergence theorem, the sequence converges almost surely, .
Taking the expectation of Inequality 2, and summing on , the convergence of implies that
We now define
Using a Taylor expansion, as demonstrated for the variations of , we obtain
Taking the conditional expectation, and bounding the second derivatives by ,
The positive expected variations of are bounded
The two terms on the right-hand side are the summands of convergent infinite series. By the quasi-martingale convergence theorem, this shows that converges almost surely.
This implies that the following infinite series converge almost surely
Since converges almost surely, and the series diverges, we conclude that the sequence converges almost surely to zero. ∎
So far, for the sake of simplicity, we defined our Krum function so that it selects only one vector among the $n$ vectors proposed. In fact, the parameter server could avoid wasting the contribution of the other workers by selecting $m \le n$ vectors instead. This can be achieved, for instance, by selecting one vector using the Krum function, removing it from the list, and iterating this scheme $m - 1$ times, as long as $n - m > 2f + 2$. We then define accordingly the $m$-Krum function
$$Kr_m(V_1, \dots, V_n) = \frac{1}{m} \sum_{i=1}^{m} V_s^i$$
where the $V_s^i$'s are the $m$ vectors selected as explained above. Note that the 1-Krum function is the Krum function defined in Section 4.
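The iterated selection can be sketched as follows. This is one possible reading of the scheme, with assumptions stated explicitly: each iteration applies the Krum rule to the remaining vectors (so the number of neighbors considered shrinks with the list), the selected vectors are averaged, and the guard $n - m > 2f + 2$ is taken as the applicability condition. The names `krum_index` and `m_krum` are illustrative.

```python
def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def krum_index(vectors, f):
    """Index of the Krum choice among `vectors` (smallest index on ties)."""
    k = len(vectors) - f - 2
    scores = []
    for i, vi in enumerate(vectors):
        dists = sorted(sq_dist(vi, vj) for j, vj in enumerate(vectors) if j != i)
        scores.append(sum(dists[:k]))
    return scores.index(min(scores))

def m_krum(vectors, f, m):
    """Average of m vectors picked by repeatedly applying Krum and removing the winner."""
    if not len(vectors) - m > 2 * f + 2:
        raise ValueError("this sketch assumes n - m > 2f + 2")
    remaining = list(vectors)
    selected = []
    for _ in range(m):
        selected.append(remaining.pop(krum_index(remaining, f)))
    return [sum(column) / m for column in zip(*selected)]

proposals = [
    [1.0, 1.0], [1.1, 0.9], [0.9, 1.1], [1.0, 1.05], [1.05, 1.0], [0.95, 0.95],
    [100.0, -100.0],                       # one remote (Byzantine) proposal
]
aggregated = m_krum(proposals, f=1, m=2)   # average of two clustered proposals
```

With `m=1` the function returns the single Krum choice; larger `m` trades some robustness margin for a lower-variance update.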
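Let $V_1, \dots, V_n$ be any iid random $d$-dimensional vectors, $V_i \sim G$, with $\mathbb{E}G = g$ and $\mathbb{E}\|G - g\|^2 = d\sigma^2$. Let $B_1, \dots, B_f$ be any random vectors, possibly dependent on the $V_i$'s. Assume that $2f + 2 < n - m$ and $\eta(n, f) \cdot \sqrt{d} \cdot \sigma < \|g\|$. Then, for large enough $n$, the $m$-Krum function $Kr_m$ is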