Gradient Coding from Cyclic MDS Codes and Expander Graphs

by Netanel Raviv, et al.

Gradient Descent, and its variants, are a popular method for solving empirical risk minimization problems in machine learning. However, if the size of the training set is large, a computational bottleneck is the computation of the gradient, and hence, it is common to distribute the training set among worker nodes. Doing this in a synchronous fashion faces yet another challenge of stragglers (i.e., slow or unavailable nodes) which might cause a considerable delay, and hence, schemes for mitigation of stragglers are essential. It was recently shown by Tandon et al. that stragglers can be avoided by carefully assigning redundant computations to the worker nodes and coding across partial gradients, and a randomized construction for the coding was given. In this paper we obtain a comparable deterministic scheme by employing cyclic MDS codes. In addition, we propose replacing the exact computation of the gradient with an approximate one; a technique which drastically increases the straggler tolerance, and stems from adjacency matrices of expander graphs.



I Introduction

Data intensive machine learning tasks have become ubiquitous in many real-world applications, and with the increasing size of training data, distributed methods have gained increasing popularity. However, the performance of distributed methods (in synchronous settings) is strongly dictated by stragglers, i.e., nodes that are slow to respond or unavailable. In this paper, we focus on coding theoretic (and graph theoretic) techniques for mitigating stragglers in distributed synchronous gradient descent.

The coding theoretic framework for straggler mitigation called gradient coding was first introduced in [23]. It consists of a system with one master and $n$ worker nodes, in which the data is partitioned into $k$ parts, and one or more parts are assigned to each one of the workers. In turn, each worker computes the partial gradient on each of its assigned parts, linearly combines the results according to some predetermined vector of coefficients, and sends this linear combination back to the master node. By choosing the coefficients at each node judiciously, one can guarantee that the master node is capable of reconstructing the full gradient even if any $s$ machines fail to perform their work. The storage overhead of the system, which is denoted by $d$, refers to the amount of redundant computations, or alternatively, to the number of data parts that are sent to each node (see example in Fig. 1).

The importance of straggler mitigation was demonstrated in a series of recent studies (e.g., [13] and [25]). In particular, it was demonstrated in [23] that stragglers may run up to slower than the typical worker performance ( in [25]) on Amazon EC2, especially for the cheaper virtual machines; such erratic behavior is unpredictable and can significantly delay training. One can, of course, use more expensive instances but the goal here is to use coding theoretic methods to provide reliability out of cheap unreliable workers, overall reducing the cost of training.

Fig. 1: Gradient coding for , , , and  [23]. Each worker node  obtains two parts  of the data set , computes the partial gradients , and sends their linear combination back to the master node . By choosing the coefficients judiciously, the master node  can compute the full gradient from any two responses, providing robustness against any one straggler.
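The three-worker setting of Fig. 1 can be checked numerically. The sketch below uses one possible choice of coefficients (an illustrative assumption, since the figure itself is not reproduced here): each worker combines two partial gradients, and for every pair of responding workers a fixed pair of decoding weights recovers the sum of all three partial gradients. The names `B`, `DECODE`, and `combined_coefficients` are hypothetical.

```python
from itertools import combinations

# Row j holds the coefficients that worker j applies to the partial gradients (g1, g2, g3).
# The values are an illustrative assumption; every row has two nonzero entries.
B = [
    [0.5, 1.0, 0.0],   # worker 1 sends g1/2 + g2
    [0.0, 1.0, -1.0],  # worker 2 sends g2 - g3
    [0.5, 0.0, 1.0],   # worker 3 sends g1/2 + g3
]

# Decoding weights for each pair of responding workers (0-indexed).
DECODE = {
    (0, 1): (2.0, -1.0),
    (0, 2): (1.0, 1.0),
    (1, 2): (1.0, 2.0),
}

def combined_coefficients(responders):
    """Coefficients of (g1, g2, g3) in the master's weighted sum of the responses."""
    weights = DECODE[responders]
    return [sum(w * B[j][i] for w, j in zip(weights, responders)) for i in range(3)]

for pair in combinations(range(3), 2):
    # Every pair should yield coefficients (1, 1, 1), i.e. the full gradient g1 + g2 + g3.
    print(pair, combined_coefficients(pair))
```

Because each pair of responses already determines the full sum, the master never has to wait for the third worker.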

The work of [23] established the fundamental bound $d \ge s+1$ on the storage overhead $d$ required to tolerate $s$ stragglers, provided a deterministic construction which achieves it with equality when $s+1$ divides the number of workers $n$, and a randomized one which applies to all $s$ and $n$. Subsequently, deterministic constructions were also obtained by [6] and [7]. These works have focused on the scenario where $s$ is known prior to the construction of the system. Furthermore, the exact computation of the full gradient is guaranteed if the number of stragglers is at most $s$, but no error bound is guaranteed if this number exceeds $s$.

The contribution of this work is twofold. For the computation of the exact gradient we employ tools from classic coding theory, namely, cyclic MDS codes, in order to obtain a deterministic construction which compares favourably with existing solutions; both in the applicable range of parameters, and in the complexity of the involved algorithms. Some of these gains are a direct application of well known properties of these codes.

Second, we introduce an approximate variant of the gradient coding problem. In this variant, the requirement for exact computation of the full gradient is traded for an approximate one, where the $\ell_2$ deviation of the given solution grows with the number of stragglers. Note that by this approach, the parameter $s$ is not a part of the system construction, and the system can provide an approximate solution for any number of stragglers $s$, whose quality deteriorates gracefully as $s$ increases. In the suggested solution, the coefficients at the worker nodes are based on an important family of graphs called expanders. In particular, it is shown that by setting these coefficients according to a normalized adjacency matrix of an expander graph, a strong bound on the error term of the resulting solution is obtained. Moreover, this approach makes it possible to break the aforementioned barrier $d \ge s+1$, which is a substantial obstacle in gradient coding, and allows the master node to decode using a very simple algorithm.

This paper is organized as follows. Related work regarding gradient coding (and coded computation in general) is listed in Section II. A framework which encapsulates all the results in this paper is given in Section III. Necessary mathematical notions from coding theory and graph theory are given in Section IV. The former is used to obtain an algorithm for exact gradient computation in Section V, and the latter is used for the approximate one in Section VI. Experimental results are given in Section VII.

II Related Work

The work of Lee et al. [12] initiated the use of coding theoretic methods for mitigating stragglers in large-scale learning. This work is focused on linear regression and therefore can exploit more structure compared to the general gradient coding problem that we study here. The work by Li et al. [16] investigates a generalized view of the coding ideas in [12], showing that their solution is a single operating point in a general scheme of trading off latency of computation to the load of communication.

Further closely related work has shown how coding can be used for distributed MapReduce, as well as a similar communication and computation tradeoff [15, 17]. We also mention the work of [10] which addresses straggler mitigation in linear regression by using a different approach, that is not mutually exclusive with gradient coding. In their work, the data is coded rather than replicated at the master node, and the nodes perform their computation on coded data.

The work by [6] generalizes previous work for linear models [12] but can also be applied to general models to yield explicit gradient coding constructions. Our results regarding the exact gradient are closely related to the work by [7, 8] which was obtained independently from our work. In [7], similar coding theoretic tools were employed in a fundamentally different fashion. Both [7] and [6] are comparable in parameters to the randomized construction of [23] and are outperformed by us in a wide range of parameters. A detailed comparison of the theoretical asymptotic behaviour is given in the sequel.

Remark 1.

None of the aforementioned works studies approximate gradient computations. However, we note that subsequent to this work, two unpublished manuscripts [3, 14] study a similar approximation setting and obtain related results albeit using randomized as opposed to deterministic approaches. Furthermore, the exact setting was also discussed subsequent to this work in [26] and [27]. In [26] it was shown that network communication can be reduced by increasing the replication factor, and respective bounds were given. The work of [27] discussed coded polynomial computation with low overhead, and applies to gradient coding whenever the gradient at hand is a polynomial.

III Framework

This section provides a unified framework which accommodates straggler mitigation in both the exact and approximate gradient computations which follow. The unfamiliar reader is referred to [22] for an introduction to machine learning, and in particular, to stochastic gradient descent [22, Sec. 14]. In order to distribute the execution of gradient descent from a master node  to  worker nodes  (Algorithm 1), the training set  is partitioned by  to  disjoint subsets  of size  each (see Footnote 1), that are distributed among , and every node computes the partial gradients  of the empirical risks of the -s which it obtained. In iteration , every node evaluates its gradients in the current model , and sends to  some linear combination of them. After obtaining responses from at least  workers, where  is the number of responses that  should wait for in iteration ,  aggregates them to form the gradient  of the overall empirical risk at . In the exact setting the value of  will be fixed for every , whereas in the approximate setting this value is at the discretion of the master, in correspondence with the required approximation error.

Footnote 1: For simplicity, assume that . The given scheme could be easily adapted to the case . Further, the assumption that the number of partitions equals the number of nodes is a mere convenience, and all subsequent schemes can be adapted to the case where the number of partitions is at most the number of nodes.

To support mitigation of stragglers in this setting, the following notions are introduced. Let  be a matrix whose -th row  contains the coefficients of the linear combination that is sent to  by . Note that the support  contains the indices of the sets  that are to be sent to  by . Given a set of non-stragglers , where  is the set of all nonempty subsets of , a function  provides  with a vector by which the results from  are to be linearly combined to obtain the vector . For convenience of notation, assume that  for all . In most of the subsequent constructions, the matrix  and the function  will be defined over  rather than over .

Algorithm 1 Gradient Coding
Input: Data , number of iterations , learning rate schedule , straggler tolerance parameters , a matrix , and a function .
Initialize .
Partition  and send  to  for every .
for  to  do
     broadcasts  to all nodes.
    Each  sends to .
     computes , where  is the response from  if it responded and  otherwise, for each , and is the set of non-stragglers in the current iteration.
     updates
end for
return .
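The loop of Algorithm 1 can be sketched in a few lines. The example below is a minimal simulation under stated assumptions, not the authors' implementation: it uses a scalar least-squares model, a hypothetical three-worker coefficient matrix `B` with hand-written decoding weights `DECODE`, exactly one randomly chosen straggler per iteration, and it returns the last iterate.

```python
import random

# Illustrative 3-worker scheme tolerating one straggler (the coefficients are an assumption).
B = [[0.5, 1.0, 0.0], [0.0, 1.0, -1.0], [0.5, 0.0, 1.0]]
DECODE = {(0, 1): (2.0, -1.0), (0, 2): (1.0, 1.0), (1, 2): (1.0, 2.0)}

def partial_gradient(part, w):
    """Gradient of the mean of 0.5 * (w * x - y)^2 over one data part."""
    return sum((w * x - y) * x for x, y in part) / len(part)

def gradient_coding_descent(data, iterations, learning_rate, seed=0):
    n = len(B)
    size = len(data) // n
    parts = [data[i * size:(i + 1) * size] for i in range(n)]  # disjoint parts of equal size
    rng = random.Random(seed)
    w = 0.0
    for _ in range(iterations):
        straggler = rng.randrange(n)  # one worker fails to respond in this iteration
        responders = tuple(j for j in range(n) if j != straggler)
        # Each responding worker sends a single coded combination of its partial gradients.
        responses = [
            sum(B[j][i] * partial_gradient(parts[i], w) for i in range(n) if B[j][i] != 0.0) / n
            for j in responders
        ]
        # The master linearly combines the responses it received to recover the full gradient.
        gradient = sum(a * c for a, c in zip(DECODE[responders], responses))
        w -= learning_rate * gradient
    return w

data = [(float(x), 2.0 * x) for x in range(1, 7)]  # y = 2x, so the minimizer is w = 2
w_final = gradient_coding_descent(data, iterations=200, learning_rate=0.05)
print(w_final)
```

Since decoding is exact for every pair of responders, the iterates coincide (up to rounding) with those of ordinary gradient descent, regardless of which worker straggles.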

Different constructions of the matrix  and the function  in Algorithm 1 make it possible to compute the gradient either exactly (which requires the storage overhead  to be at least  for all ) or approximately. In what follows, the respective requirements on, and guarantees from,  and  are discussed. In the following definitions, for an integer  let  be the vector of  ones, where the subscript is omitted if clear from context, and for  let .

Definition 2.

A matrix  and a function  satisfy the Exact Computation (EC) condition if for all  such that , we have .

Definition 3.

For a non-decreasing function  such that , and  satisfy the -Approximate Computation (-AC) condition, if for all , we have  (where  is Euclidean distance).

Notice that the error term  in Definition 3 is a function of the number of stragglers, since it is expected to increase if more stragglers are present. The conditions which are given in Definition 2 and Definition 3 guarantee the exact and approximate computation by the following lemmas. In the upcoming proofs, let  be the matrix of empirical losses.

Lemma 4.

If  and  satisfy the EC condition, then for all  we have .


For a given , let  be the matrix whose -th row  equals  if , and zero otherwise. By the definition of  it follows that , and since  it follows that . Therefore, we have

The next lemma bounds the deviance of  from the gradient of the empirical risk at the current model  by using the function  and the spectral norm  of . Recall that for a matrix  the spectral norm is defined as .

Lemma 5.

For a function  as above, if and  satisfy the -AC condition, then .


As in the proof of Lemma 4, we have that

Due to Lemma 4 and Lemma 5, in the remainder of this paper we focus on constructing  and  that satisfy either the EC condition (Section V) or the -AC condition (Section VI).
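The two lemmas rest on the same short computation, sketched here in LaTeX. The symbol names are assumptions chosen for illustration: $B$ is the coefficient matrix, $a_K$ is the decoding vector for the set $K$ of non-stragglers (zero outside $K$), $N$ is the matrix whose $i$-th row is the partial gradient on the $i$-th data part at the current model $w^{(r)}$, so that the full gradient is $\frac{1}{n}\mathbb{1}N$, and $\epsilon(s)$ is the error function for $s$ stragglers.

```latex
\begin{align*}
v_r &= a_K \cdot \tfrac{1}{n} B N
  && \text{(responses of stragglers are multiplied by zero)} \\
a_K B = \mathbb{1}
  &\;\Longrightarrow\; v_r = \tfrac{1}{n}\,\mathbb{1} N = \nabla L_S\big(w^{(r)}\big)
  && \text{(exact computation)} \\
\| a_K B - \mathbb{1} \|_2 \le \epsilon(s)
  &\;\Longrightarrow\; \big\| v_r - \nabla L_S\big(w^{(r)}\big) \big\|_2
   = \tfrac{1}{n} \big\| (a_K B - \mathbb{1}) N \big\|_2
   \le \tfrac{1}{n}\,\epsilon(s)\,\| N \|_{\mathrm{spec}}
  && \text{(approximate computation)}
\end{align*}
```

The last inequality uses only the defining property of the spectral norm, namely that multiplying a vector by $N$ stretches its Euclidean length by at most $\|N\|_{\mathrm{spec}}$.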

Remark 6.

In some settings [23], it is convenient to partition the data set  to , where . Notice that the definitions of  and  above extend verbatim to this case as well. If  and  satisfy the EC condition, we have that  for every large enough . Hence, by omitting any  columns of  to form a matrix , we have that , and hence a scheme for any partition of  to  parts emerges instantly. This new scheme  is resilient to an identical number of stragglers  and has storage overhead no greater than that of . Similarly, if  and  satisfy the -AC condition for some , then the scheme  has storage overhead no greater than the original one, and an identical error function , since  for any .

IV Mathematical Notions

This section provides a brief overview on the mathematical notions that are essential for the suggested schemes. The exact computation (Sec. V) requires notions from coding theory, and the approximate one (Sec. VI) requires notions from graph theory. The coding theoretic material in this section is taken from [21], which focuses on finite fields, and yet the given results extend verbatim to the real or complex case (see also [19], Sec. 8.4).

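For a field $\mathbb{F}$, an $[n,k]$ (linear) code $\mathcal{C}$ over $\mathbb{F}$ is a subspace of dimension $k$ of $\mathbb{F}^n$. The minimum distance $\delta$ of $\mathcal{C}$ is $\min\{d_H(x,y) : x,y \in \mathcal{C},\ x \ne y\}$, where $d_H$ denotes the Hamming distance $d_H(x,y) = |\{i : x_i \ne y_i\}|$. Since the code is a linear subspace, it follows that the minimum distance of a code is equal to the minimum Hamming weight $w_H(x) = |\{i : x_i \ne 0\}|$ among the nonzero codewords $x \in \mathcal{C}$. The well-known Singleton bound states that $\delta \le n-k+1$, and codes which attain this bound with equality are called Maximum Distance Separable (MDS) codes.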

Definition 7.

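[21, Sec. 8] A code $\mathcal{C}$ is called cyclic if the cyclic shift of every codeword is also a codeword, namely, $(c_1, c_2, \ldots, c_n) \in \mathcal{C}$ implies $(c_n, c_1, \ldots, c_{n-1}) \in \mathcal{C}$.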

The dual code of  is the subspace . Several well-known and easy to prove properties of MDS codes are used throughout this paper.

Lemma 8.

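If $\mathcal{C}$ is an $[n,k]$ MDS code, then

  A1. The dual code $\mathcal{C}^\perp$ is an $[n,n-k]$ MDS code, and hence its minimum Hamming weight is $k+1$.

  A2. For any subset $K \subseteq [n]$ of size $n-k+1$ there exists a codeword in $\mathcal{C}$ whose support (i.e., the set of nonzero indices) is $K$.

  A3. The reverse code $\mathcal{C}^R$, obtained by reversing the coordinate order of every codeword of $\mathcal{C}$, is an $[n,k]$ MDS code.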


For A1 the reader is referred to [21, Prob. 4.1]. The proof of A3 is trivial since permuting the coordinates of a code does not alter its minimum distance. For A2, let  be the generator matrix of , i.e., a matrix whose rows are a basis to . By restricting  to the columns indexed by  we get a  matrix, which has a nonzero vector  in its left kernel, and hence  is a codeword in  which is zero in the entries that are indexed by . Since the minimum distance of  is , it follows that  has nonzero values in entries that are indexed by , and the claim follows. ∎

Two common families of codes are used in the sequel—Reed-Solomon (RS) codes and Bose-Chaudhuri-Hocquenghem (BCH) codes. An  RS code  is defined by a set of  distinct evaluation points  as , where  is the set of polynomials of degree less than  and coefficients from . Alternatively, RS codes can be defined as , where  is a Vandermonde matrix on , i.e., for every . It is widely known that RS codes are MDS codes, and in some cases, they are also cyclic.

In contrast with RS codes, where every codeword is a vector of evaluations of a polynomial, a codeword in a BCH code is a vector of coefficients of a polynomial; that is, a codeword is identified by . For a field  that contains  and a set , a BCH code  is defined as . The set  is called the roots of , or alternatively, is said to be a BCH code on  over . For example, a set of complex numbers  defines a BCH code on  over , which is the set of real vectors whose corresponding polynomials vanish on .

Lemma 9.

[19, 21] If all elements of  are roots of unity of order , then the BCH code  on  over  is cyclic.


If  is a codeword in , then its cyclic shift is given by . Since every  is a root of unity of order , it follows that

and hence  is a codeword in . ∎

Further, the structure of  may also imply a lower bound on the distance of .

Theorem 10.

(The BCH bound) [19, 21] If  contains a subset of  consecutive powers of a primitive root of unity (i.e., a subset of the form , where  is a primitive -th root of unity), then the minimum distance of  is at least .

In the remainder of this section, a brief overview on expander graphs is given. The interested reader is referred to [9] for further details. Let $G = (V,E)$ be a $d$-regular, undirected, and connected graph on $n$ nodes. Let $A_G$ be the adjacency matrix of $G$, i.e., $(A_G)_{i,j} = 1$ if $\{i,j\} \in E$, and $0$ otherwise. Since $A_G$ is a real symmetric matrix, it follows that it has $n$ real eigenvalues $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n$, and denote $\lambda = \max\{|\lambda_2|, |\lambda_n|\}$. It is widely known [9] that $\lambda_1 = d$, and that $\lambda_n \ge -d$, where equality holds if and only if $G$ is bipartite. Further, it also follows from $A_G$ being real and symmetric that it has a basis of orthogonal real eigenvectors $w_1, \ldots, w_n$, and w.l.o.g. assume that $\|w_i\|_2 = 1$ for every $i$. The parameters $\lambda$ and $d$ are related by the celebrated Alon-Boppana Theorem.

Theorem 11.

[9] An infinite sequence of $d$-regular graphs on $n$ vertices satisfies that $\lambda \ge 2\sqrt{d-1} - o_n(1)$, where $o_n(1)$ is an expression which tends to zero as $n$ tends to infinity.

Constant degree regular graphs (i.e., families of graphs with fixed degree $d$ that does not depend on $n$) for which $\lambda$ is small in comparison with $d$ are largely referred to as expanders. In particular, graphs which attain the above bound asymptotically (i.e., $\lambda \le 2\sqrt{d-1}$) are called Ramanujan graphs, and several efficient constructions are known [18, 5].

V Exact Gradient Coding from Cyclic MDS Codes

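For a given $n$ and $s$, let $\mathcal{C}$ be a cyclic $[n,n-s]$ MDS code over $\mathbb{F} \in \{\mathbb{R}, \mathbb{C}\}$ that contains $\mathbb{1}$ (explicit constructions of such codes are given in the sequel). According to Lemma 8, there exists a codeword $c_1 \in \mathcal{C}$ whose support is $\{1, \ldots, s+1\}$. Let $c_2, \ldots, c_n$ be all cyclic shifts of $c_1$, which lie in $\mathcal{C}$ by its cyclic property. Finally, let $B$ be the $n \times n$ matrix whose columns are $c_1, \ldots, c_n$, i.e., $B = (c_1^\top, c_2^\top, \ldots, c_n^\top)$. The following lemma provides some important properties of $B$.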

Lemma 12.

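The matrix $B$ satisfies the following properties.

  B1. $|\mathrm{supp}(b)| = s+1$ for every row $b$ of $B$.

  B2. Every row of $B$ is a codeword in the reverse code $\mathcal{C}^R$.

  B3. The column span of $B$ is the code $\mathcal{C}$.

  B4. Every set of $n-s$ rows of $B$ are linearly independent over $\mathbb{F}$.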


To prove B1 and B2, observe that  is of the following form, where .

To prove B3, notice that the leftmost  columns of  have leading coefficients in different positions, and hence they are linearly independent. Thus, the dimension of the column span of  is at least , and since , the claim follows.

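To prove B4, assume for contradiction that there exists a set of $n-s$ linearly dependent rows. Hence, there exists a nonzero vector $v$ of Hamming weight at most $n-s$ such that $vB = 0$. According to B3, the columns of $B$ span $\mathcal{C}$, and hence the vector $v$ lies in the dual code $\mathcal{C}^\perp$ of $\mathcal{C}$. Since $\mathcal{C}^\perp$ is an $[n,s]$ MDS code by Lemma 8, it follows that the minimum Hamming weight of a codeword in $\mathcal{C}^\perp$ is $n-s+1$, a contradiction. ∎

Since $\mathcal{C}^R$ is of dimension $n-s$, it follows from parts B2 and B4 of Lemma 12 that every set of $n-s$ rows of $B$ are a basis to $\mathcal{C}^R$. Furthermore, since $\mathbb{1} \in \mathcal{C}$ it follows that $\mathbb{1} \in \mathcal{C}^R$. Therefore, there exists a function $A$ such that for any set $K \subseteq [n]$ of size $n-s$ we have that $\mathrm{supp}(A(K)) \subseteq K$ and $A(K) \cdot B = \mathbb{1}$.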

Theorem 13.

The above  and  satisfy the EC condition (Definition 2).

In the remainder of this section, two cyclic MDS codes over the complex numbers and the real numbers are suggested, from which the construction in Theorem 13 can be obtained. These constructions are taken from [19] (Sec. II.B), and are given with a few adjustments to our case. The contribution of these codes is summarized in the following theorem.

Theorem 14.

For any given  and  there exist explicit complex matrices  and  that satisfy the EC-condition with optimal . The respective encoding (i.e., constructing ) and decoding (i.e., constructing  given ) complexities are and , respectively. In addition, for any given  and  such that there exist explicit real matrices  and  that satisfy the EC-condition with optimal . The encoding and decoding complexities are  and , where  is the complexity of inverting a generalized Vandermonde matrix.

V-A Cyclic-MDS Codes Over the Complex Numbers

For a given  and , let , and let be the set of  complex roots of unity of order , i.e., . Let  be a complex Vandermonde matrix over , i.e., for any  and any . Finally, let . It is readily verified that  is an  MDS code that contains , whose codewords may be seen as the evaluations of all polynomials in  on the set .

Lemma 15.

is a cyclic code.


Let  be a codeword, and let be the corresponding polynomial. Consider the polynomial , and notice that . Further, it is readily verified that any  satisfies that , where the indices are taken modulo . Hence, the evaluation of the polynomial  on the set of roots  results in the cyclic shift of the codeword , and lies in  itself. ∎

Corollary 16.

The code  is a cyclic MDS code which contains , and hence it can be used to obtain the matrices  and , as described in Theorem 13.

Given a set  of  non-stragglers, an algorithm for computing the encoding vector  in  operations over  (after a one-time initial computation of ), is given in Appendix A. The complexity of this algorithm is asymptotically smaller than the corresponding algorithm in [6] and [7] whenever . Furthermore, the cyclic structure of the matrix  enables a very simple algorithm for its construction; this algorithm compares favorably with previous works for any , and is given in Appendix A as well.
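The construction of this subsection can be exercised end to end on small parameters. The sketch below is an illustration under stated assumptions, not the algorithm of Appendix A: it takes the codeword of weight $s+1$ to be the evaluation vector of the polynomial that vanishes on the last $n-s-1$ roots of unity, forms the matrix from its cyclic shifts, and finds each decoding vector by plain Gaussian elimination. All function names are hypothetical.

```python
import cmath
from itertools import combinations

def build_b(n, s):
    """n x n complex matrix whose columns are the cyclic shifts of one codeword of weight s+1."""
    roots = [cmath.exp(2j * cmath.pi * i / n) for i in range(n)]
    # Evaluate p(x) = prod_{j=s+1}^{n-1} (x - roots[j]), a polynomial of degree n-s-1,
    # at every root of unity; the values at positions s+1, ..., n-1 are exactly zero.
    c1 = []
    for x in roots:
        value = 1.0 + 0.0j
        for j in range(s + 1, n):
            value *= x - roots[j]
        c1.append(value)
    return [[c1[(i - j) % n] for j in range(n)] for i in range(n)]

def solve(matrix, rhs):
    """Solve a square complex linear system by Gauss-Jordan elimination with partial pivoting."""
    m = len(matrix)
    aug = [list(row) + [b] for row, b in zip(matrix, rhs)]
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(m):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[r][m] / aug[r][r] for r in range(m)]

def decoding_vector(b, responders):
    """Vector a that is zero outside `responders` (n - s workers) and satisfies a * B = all-ones."""
    n, k = len(b), len(responders)
    # Unknowns: one coefficient per responder. Equations: the first k columns of B.
    system = [[b[j][col] for j in responders] for col in range(k)]
    coeffs = solve(system, [1.0 + 0.0j] * k)
    a = [0j] * n
    for j, value in zip(responders, coeffs):
        a[j] = value
    return a

def max_ec_error(n, s):
    """Largest deviation of a * B from the all-ones vector over all sets of n - s responders."""
    b = build_b(n, s)
    worst = 0.0
    for responders in combinations(range(n), n - s):
        a = decoding_vector(b, responders)
        for col in range(n):
            entry = sum(a[j] * b[j][col] for j in range(n))
            worst = max(worst, abs(entry - 1.0))
    return worst

print(max_ec_error(5, 2))  # should be at the level of floating-point rounding
```

Only the first $n-s$ columns are used to solve for the decoding vector; the remaining columns then serve as an independent check that the exact computation condition holds.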

Remark 17.

Note that the use of complex rather than real matrix  may potentially double the required bandwidth, since every complex number contains two real numbers. A simple manipulation of Algorithm 1 which resolves this issue is given in Appendix C. This optimal bandwidth is also attained by the scheme in the next section, which uses a smaller number of multiplication operations. However, it is applicable only if .

V-B Cyclic-MDS Codes Over the Real Numbers

If one wishes to abstain from using complex numbers, e.g., in order to reduce bandwidth, we suggest the following construction, which provides a cyclic MDS code over the reals. This construction relies on [19] (Property 3), with an additional specialized property.

Construction 18.

For a given  and  such that , define the following BCH codes over the reals. In both cases denote .

  1. If  is even and  is odd let , and let  be a BCH code which consists of all polynomials in  that vanish over the set .

  2. If  is odd and  is even let , and let  be a BCH code which consists of all polynomials in  that vanish over the set .

Lemma 19.

The codes  and  from Construction 18 are cyclic  MDS codes that contain .


According to Lemma 9, it is clear that  and  are cyclic. According to the BCH bound (Theorem 10), it is also clear that the minimum distance of  is at least , and the minimum distance of  is at least . Hence, to prove that  and  are MDS codes, it is shown that their code dimensions are .

Since the sets  and  are closed under conjugation (i.e., is in  if and only if the conjugate of  is in ) it follows that the polynomials and have real coefficients. Hence, by the definition of BCH codes it follows that


and hence, and . Let  and  be the minimum distances of  and , respectively, and notice that by the Singleton bound [21] (Sec. 4.1) it follows that

and thus  and  satisfy the Singleton bound with equality, or equivalently, they are MDS codes. To prove that  is in  and , we ought to show that  for every  and , which amounts to showing that . It is well-known that the sum of the ’th to ’th power of any root of unity of order , other than , equals zero. Since

it follows that  and that . Hence, we have that  for every  and , which concludes the claim. ∎
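The key steps of the proof can be observed numerically for the first case of Construction 18. The sketch below assumes that the root set consists of the $s$ consecutive powers of $\alpha = e^{2\pi i/n}$ centered at $\alpha^{n/2} = -1$ (an assumption made here for illustration, as are all function names). It multiplies out the polynomial with exactly these roots, confirms that its coefficients are real because the root set is closed under conjugation, and confirms that the all-ones polynomial vanishes on every root.

```python
import cmath

def generator_coefficients(n, s):
    """Coefficients (lowest degree first) of g(x) = prod (x - alpha^j) over the assumed root set."""
    assert n % 2 == 0 and s % 2 == 1, "this sketch covers only even n and odd s"
    alpha = cmath.exp(2j * cmath.pi / n)
    half = s // 2
    # s consecutive exponents centered at n/2, where alpha^(n/2) = -1.
    exponents = range(n // 2 - half, n // 2 + half + 1)
    coeffs = [1.0 + 0.0j]
    for j in exponents:
        root = alpha ** j
        # Multiply the current polynomial by (x - root).
        shifted = [0j] + coeffs
        scaled = [-root * c for c in coeffs] + [0j]
        coeffs = [a + b for a, b in zip(shifted, scaled)]
    return coeffs, [alpha ** j for j in exponents]

def evaluate(coeffs, x):
    """Horner evaluation of a polynomial given lowest-degree-first coefficients."""
    result = 0j
    for c in reversed(coeffs):
        result = result * x + c
    return result

n, s = 6, 3
g, roots = generator_coefficients(n, s)
largest_imaginary_part = max(abs(c.imag) for c in g)
# Coefficient vector of g, padded with zeros to length n: a real codeword of weight s + 1.
real_codeword = [round(c.real, 9) for c in g] + [0.0] * (n - s - 1)
# The all-ones vector corresponds to 1 + x + ... + x^(n-1), which should vanish on the roots.
all_ones_residual = max(abs(evaluate([1.0] * n, r)) for r in roots)
print(real_codeword, largest_imaginary_part, all_ones_residual)
```

For $n=6$ and $s=3$ the polynomial is $(x+1)(x^2+x+1) = x^3+2x^2+2x+1$, so the coefficient vector has $s+1$ consecutive nonzero entries and can play the role of the codeword whose cyclic shifts form the matrix of Section V.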

Algorithms for computing the matrix  and the vector  for the codes in this subsection are given in Appendix B. The algorithm for constructing  outperforms previous works whenever , and the algorithm for computing  outperforms previous works for a smaller yet wide range of  values.

VI Approximate Gradient Coding from Expander Graphs

Recall that in order to retrieve the exact gradient, one must have $d \ge s+1$, an undesirable overhead in many cases. To break this barrier, we suggest renouncing the exact retrieval of the gradient and settling for an approximation of it. Note that trading the exact gradient for an approximate one is a necessity in many variants of gradient descent (such as the acclaimed stochastic gradient descent [20, Sec. 14.3]), and hence our techniques are aligned with common practices in machine learning.


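Setting $B$ as the identity matrix and $A$ as the function which maps every $K$ to its binary characteristic vector $\mathbb{1}_K$ clearly satisfies the $\epsilon$-AC condition for $\epsilon(s) = \sqrt{s}$, since a vector that is zero on the $s$ stragglers and one elsewhere lies at Euclidean distance $\sqrt{s}$ from the all-ones vector (Eq. (3)).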


It is readily verified that this approach (termed hereafter as “trivial”) amounts to ignoring the stragglers, which is essentially equivalent to the scheme given in [4]. We show that this can be outperformed by setting $B$ to be a normalized adjacency matrix of a connected regular graph on $n$ nodes, which is constructed by the master before dispersing the data, and setting $A$ to be some carefully chosen yet simple function.

The resulting error function  depends on the parameters of the graph, whereas the resulting storage overhead  is given by its degree (i.e., the fixed number of neighbors of each node). The error function is given below for a general connected and regular graph, and particular examples with their resulting errors are given in the sequel. In particular, it is shown that taking the graph to be an expander graph provides an error term  which is asymptotically less than  (Eq. (3)) whenever . In some cases, smaller error term is also obtained for larger values of .

For a given  let  be a connected -regular graph on  nodes, with eigenvalues  and corresponding (column) eigenvectors  as described in Subsection IV. Let , and for a given  of size , define  as

Lemma 20.

For any  of size ,  .


First, observe that is exactly the subspace of all vectors whose sum of entries is zero. This follows from the fact that is an orthogonal basis, hence for every , and from the fact that  are linearly independent. Since the sum of entries of  is zero, the result follows. ∎

Corollary 21.

For any  there exists such that , and .


The first part follows immediately from Lemma 20. The second part follows by computing the  norm of  in two ways, once by its definition (4) and again by using the representation of  as a linear combination of the orthonormal set . ∎

Now, define  as , and observe that  for all . Note that computing  given  is done by a straightforward  algorithm. The error function  is given by the following lemma.

Lemma 22.

For every set  of size , .


Notice that the eigenvalues of  are , and hence  equals . Further, the eigenvectors are identical to those of . Therefore, it follows from Corollary 21 that

and since are orthonormal, it follows that

Corollary 23.

The above  and  satisfy the -AC condition for . The storage overhead of this scheme equals the degree  of the underlying regular graph .
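The guarantee can be checked exhaustively on a small graph. The sketch below is an illustration under stated assumptions: it uses the Petersen graph (10 nodes, 3-regular, non-bipartite, with adjacency spectrum 3, 1, -2), takes the coefficient matrix to be its adjacency matrix divided by the degree, and assumes that the decoding vector equals $n/(n-s)$ on the non-stragglers and zero elsewhere. It then compares the worst observed error with the bound $\frac{\lambda}{d}\sqrt{\frac{ns}{n-s}}$ and with the error $\sqrt{s}$ of the trivial scheme. All names are hypothetical.

```python
import math
from itertools import combinations

def petersen_adjacency():
    """Adjacency matrix of the Petersen graph: 10 nodes, 3-regular, connected, non-bipartite."""
    n = 10
    adj = [[0] * n for _ in range(n)]
    for i in range(5):
        # Outer 5-cycle edge, spoke, and inner pentagram edge.
        edges = [(i, (i + 1) % 5), (i, i + 5), (i + 5, (i + 2) % 5 + 5)]
        for u, v in edges:
            adj[u][v] = adj[v][u] = 1
    return adj

N_NODES, DEGREE = 10, 3
LAMBDA = 2.0  # largest absolute value among the non-trivial eigenvalues 1 and -2

def decoding_error(adj, responders):
    """Euclidean distance between a(K) * B and the all-ones vector, with B = adj / DEGREE."""
    n = len(adj)
    weight = n / len(responders)  # assumed decoding vector: n/(n-s) on K, zero elsewhere
    combined = [sum(weight * adj[j][i] / DEGREE for j in responders) for i in range(n)]
    return math.sqrt(sum((x - 1.0) ** 2 for x in combined))

def error_bound(n, s):
    """Upper bound (lambda / d) * sqrt(n * s / (n - s)) on the decoding error."""
    return (LAMBDA / DEGREE) * math.sqrt(n * s / (n - s))

adj = petersen_adjacency()
for s in (1, 2, 3):
    worst = max(decoding_error(adj, k) for k in combinations(range(N_NODES), N_NODES - s))
    # Columns: stragglers, worst observed error, spectral bound, error of the trivial scheme.
    print(s, round(worst, 4), round(error_bound(N_NODES, s), 4), round(math.sqrt(s), 4))
```

Each worker stores only three data parts here, yet the scheme returns an approximation for any number of stragglers, and for these parameters the bound already lies below the error of the trivial scheme.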

It is evident that in order to obtain a small deviation $\epsilon(s)$, it is essential to have a small $\lambda$ and a large $d$. However, most constructions of expanders have focused on the case where $d$ is constant (i.e., $d = O(1)$). On one hand, constant $d$ serves our purpose well since it implies a constant storage overhead. On the other hand, a constant $d$ does not allow $\lambda/d$ to tend to zero as $n$ tends to infinity due to Theorem 11.

To present the contribution of the suggested scheme, it is compared to the trivial one. Clearly, for any given number of stragglers , it follows from (3) and from Lemma 22 that our scheme outperforms the trivial one if


Since any connected and non-bipartite graph satisfies that , it follows that Eq. (5) holds asymptotically for any . The following example shows the improved error rate for Margulis graphs (given in [9], Sec. 8), that are rather easy to construct.

Example 24.

For any integer  there exists an -regular graph on  nodes with . For example, by using these graphs with the parameters , , , we have that , whereas , an improvement of approximately .

Several additional examples for Ramanujan graphs, which attain much better error rate but are harder to construct, are as follows.

Example 25.

[18] Let  and  be distinct primes such that , , and such that the Legendre symbol is  (i.e.,  is a quadratic residue modulo ). Then, there exists a non-bipartite Ramanujan graph on  nodes and constant degree .

  1. If  and  then  and . E.g., for  we have , whereas , an improvement of approximately .

  2. If  and  then  and . E.g., for  we have , whereas , an improvement of approximately .

Restricting  to be a constant (i.e., not to grow with ) is detrimental to the error term in (5) due to Theorem 11, but allows lower storage overhead. If one wishes a lower error term at the price of higher overhead, the following is useful.

Example 26.

[1] There exists a polynomial algorithm (in ) to produce a graph  with the parameters . For this family of graphs, the relative error term (5) goes to zero as  goes to infinity for .

VI-A Bipartite Expanders

The above approximation scheme can be applied over graphs  that are bipartite. However, such graphs satisfy that , and hence the resulting error function  is superseded by the error function of the trivial scheme (3). However, in what follows it is shown that bipartite graphs on  nodes can be employed in a slightly different fashion, and obtain