approx_gradient_coding
Implementation of https://arxiv.org/abs/1707.03858
Gradient descent and its variants are popular methods for solving empirical risk minimization problems in machine learning. However, if the size of the training set is large, a computational bottleneck is the computation of the gradient, and hence, it is common to distribute the training set among worker nodes. Doing this in a synchronous fashion faces yet another challenge of stragglers (i.e., slow or unavailable nodes), which might cause considerable delays, and hence, schemes for the mitigation of stragglers are essential. It was recently shown by Tandon et al. that stragglers can be avoided by carefully assigning redundant computations to the worker nodes and coding across partial gradients, and a randomized construction for the coding was given. In this paper we obtain a comparable deterministic scheme by employing cyclic MDS codes. In addition, we propose replacing the exact computation of the gradient with an approximate one; a technique which drastically increases the straggler tolerance, and stems from adjacency matrices of expander graphs.
Data intensive machine learning tasks have become ubiquitous in many real-world applications, and with the increasing size of training data, distributed methods have gained increasing popularity. However, the performance of distributed methods (in synchronous settings) is strongly dictated by stragglers, i.e., nodes that are slow to respond or unavailable. In this paper, we focus on coding theoretic (and graph theoretic) techniques for mitigating stragglers in distributed synchronous gradient descent.
The coding theoretic framework for straggler mitigation called gradient coding was first introduced in [23]. It consists of a system with one master and n worker nodes, in which the data is partitioned into n parts, and one or more parts are assigned to each of the workers. In turn, each worker computes the partial gradient on each of its assigned partitions, linearly combines the results according to some predetermined vector of coefficients, and sends this linear combination back to the master node. By choosing the coefficients at each node judiciously, one can guarantee that the master node is capable of reconstructing the full gradient even if any s machines fail to perform their work. The storage overhead of the system, which is denoted by d, refers to the amount of redundant computation, or alternatively, to the number of data parts that are sent to each node (see example in Fig. 1). The importance of straggler mitigation was demonstrated in a series of recent studies (e.g., [13] and [25]). In particular, it was demonstrated in [23] that stragglers may run considerably slower than the typical worker on Amazon EC2 (similar behavior was reported in [25]), especially for the cheaper virtual machines; such erratic behavior is unpredictable and can significantly delay training. One can, of course, use more expensive instances, but the goal here is to use coding theoretic methods to provide reliability out of cheap unreliable workers, overall reducing the cost of training.
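As a toy illustration of this idea (a minimal fractional-repetition-style sketch, not the construction of [23]; the matrix B and the partial gradients g below are made up for the example), consider n = 4 workers tolerating s = 1 straggler with storage overhead d = s + 1 = 2:

```python
import numpy as np
from itertools import combinations

# Workers are split into two groups; both workers in a group hold the same
# two data parts and send the sum of the corresponding partial gradients.
B = np.array([
    [1, 1, 0, 0],   # worker 1: parts 1 and 2
    [1, 1, 0, 0],   # worker 2: parts 1 and 2 (replica)
    [0, 0, 1, 1],   # worker 3: parts 3 and 4
    [0, 0, 1, 1],   # worker 4: parts 3 and 4 (replica)
], dtype=float)

g = np.random.randn(4, 3)          # partial gradients, one row per data part
full = g.sum(axis=0)               # the gradient the master wants

# Any 3 of the 4 workers contain at least one replica per group, so the
# master can always pick one response per group and add them up.
for F in combinations(range(4), 3):
    a = np.zeros(4)
    a[[i for i in F if B[i, 0] == 1][0]] = 1.0   # one worker of group 1
    a[[i for i in F if B[i, 2] == 1][0]] = 1.0   # one worker of group 2
    assert np.allclose(a @ B @ g, full)
```

Each surviving triple of workers thus yields a vector of coefficients whose combination of responses recovers the full gradient exactly.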
The work of [23] established the fundamental bound d ≥ s + 1, provided a deterministic construction which achieves it with equality when s + 1 divides n, and a randomized one which applies to all n and s. Subsequently, deterministic constructions were also obtained by [6] and [7]. These works have focused on the scenario where s is known prior to the construction of the system. Furthermore, the exact computation of the full gradient is guaranteed if the number of stragglers is at most s, but no error bound is guaranteed if this number exceeds s.
The contribution of this work is twofold. First, for the computation of the exact gradient we employ tools from classic coding theory, namely, cyclic MDS codes, in order to obtain a deterministic construction which compares favourably with existing solutions, both in the applicable range of parameters and in the complexity of the involved algorithms. Some of these gains are a direct application of well-known properties of these codes.
Second, we introduce an approximate variant of the gradient coding problem. In this variant, the requirement for exact computation of the full gradient is traded for an approximate one, where the deviation of the given solution is an increasing function of the number of stragglers. Note that by this approach, the parameter s is not a part of the system construction, and the system can provide an approximate solution for any s, whose quality deteriorates gracefully as s increases. In the suggested solution, the coefficients at the worker nodes are based on an important family of graphs called expanders. In particular, it is shown that by setting these coefficients according to a normalized adjacency matrix of an expander graph, a strong bound on the error term of the resulting solution is obtained. Moreover, this approach breaks the aforementioned barrier d ≥ s + 1, which is a substantial obstacle in gradient coding, and allows the master node to decode using a very simple algorithm.
This paper is organized as follows. Related work regarding gradient coding (and coded computation in general) is listed in Section II. A framework which encapsulates all the results in this paper is given in Section III. Necessary mathematical notions from coding theory and graph theory are given in Section IV. The former is used to obtain an algorithm for exact gradient computation in Section V, and the latter is used for the approximate one in Section VI. Experimental results are given in Section VII.
The work of Lee et al. [12] initiated the use of coding theoretic methods for mitigating stragglers in large-scale learning. That work focuses on linear regression, and can therefore exploit more structure than the general gradient coding problem that we study here. The work by Li et al. [16] investigates a generalized view of the coding ideas in [12], showing that their solution is a single operating point in a general scheme that trades off the latency of computation against the load of communication. Further closely related work has shown how coding can be used for distributed MapReduce, exhibiting a similar communication-computation tradeoff [15, 17]. We also mention the work of [10], which addresses straggler mitigation in linear regression using a different approach that is not mutually exclusive with gradient coding. In their work, the data is coded rather than replicated at the master node, and the nodes perform their computation on coded data.
The work by [6] generalizes previous work on linear models [12], but can also be applied to general models to yield explicit gradient coding constructions. Our results regarding the exact gradient are closely related to the works [7, 8], which were obtained independently from ours. In [7], similar coding theoretic tools were employed in a fundamentally different fashion. Both [7] and [6] are comparable in parameters to the randomized construction of [23], and are outperformed by our construction in a wide range of parameters. A detailed comparison of the theoretical asymptotic behaviour is given in the sequel.
None of the aforementioned works studies approximate gradient computations. However, we note that subsequent to this work, two unpublished manuscripts [3, 14] study a similar approximation setting and obtain related results albeit using randomized as opposed to deterministic approaches. Furthermore, the exact setting was also discussed subsequent to this work in [26] and [27]. In [26] it was shown that network communication can be reduced by increasing the replication factor, and respective bounds were given. The work of [27] discussed coded polynomial computation with low overhead, and applies to gradient coding whenever the gradient at hand is a polynomial.
This section provides a unified framework which accommodates straggler mitigation in both the exact and approximate gradient computations that follow. The uninformed reader is referred to [22] for an introduction to machine learning, and in particular, to stochastic gradient descent [22, Sec. 14]. In order to distribute the execution of gradient descent from a master node M to n worker nodes (Algorithm 1), the training set is partitioned by M into n disjoint subsets S_1, ..., S_n of equal size,^1 which are distributed among the workers, and every worker computes the partial gradients of the empirical risks of the S_i's which it obtained. In iteration t, every worker evaluates its gradients at the current model w_t, and sends to M some linear combination of them. After obtaining sufficiently many responses, M aggregates them to form the gradient of the overall empirical risk at w_t. In the exact setting, the number of responses that M waits for is fixed for every t, whereas in the approximate setting this number is at the discretion of the master, in correspondence with the required approximation error.
^1 For simplicity, assume that n divides the size of the training set; the given scheme can easily be adapted otherwise. Further, the assumption that the number of partitions equals the number of nodes is a mere convenience, and all subsequent schemes can be adapted to the case where the number of partitions is at most the number of nodes.
To support mitigation of stragglers in this setting, the following notions are introduced. Let B be an n × n matrix whose i-th row B_i contains the coefficients of the linear combination that is sent to M by worker i. Note that the support supp(B_i) contains the indices of the sets S_j that are sent to worker i. Given a set of non-stragglers F ∈ P([n]), where P([n]) is the set of all nonempty subsets of [n], a function a : P([n]) → ℝ^n provides M with a vector a_F by which the results from the workers in F are to be linearly combined. For convenience of notation, assume that supp(a_F) ⊆ F for all F. In most of the subsequent constructions, the matrix B and the function a will be defined over the real or complex numbers.
Different constructions of the matrix B and the function a in Algorithm 1 enable the master to compute the gradient either exactly (which requires the storage overhead of every node to be at least s + 1) or approximately. In what follows, the respective requirements and guarantees from B and a are discussed. In the following definitions, for an integer n let 1_n be the vector of n ones, where the subscript is omitted if clear from context, and for S ⊆ [n] let 1_S be the characteristic vector of S.
Definition 2. A matrix B and a function a satisfy the Exact Computation (EC) condition if for every F such that |F| ≥ n − s, we have a_F B = 1.
Definition 3. For a non-decreasing function err with err(0) = 0, B and a satisfy the err-Approximate Computation (err-AC) condition if for every F we have ‖1 − a_F B‖₂ ≤ err(n − |F|), where ‖·‖₂ denotes the Euclidean norm.
Notice that the error term in Definition 3 is a function of the number of stragglers, since it is expected to increase if more stragglers are present. The conditions given in Definition 2 and Definition 3 guarantee the exact and approximate computation, respectively, by the following lemmas. In the upcoming proofs, let X be the matrix of partial gradients at the current model w, whose i-th row is the gradient of the empirical loss over the subset S_i, i.e.,

X = (∇L(S_1; w)ᵀ, ..., ∇L(S_n; w)ᵀ)ᵀ.  (1)
Lemma 4. If B and a satisfy the EC condition, then for every F such that |F| ≥ n − s we have a_F B X = 1 X, i.e., the master can compute the full gradient, which is the sum of the partial gradients.
Proof. For a given F, let B_F be the matrix whose i-th row equals B_i if i ∈ F, and is zero otherwise. By the definition of a it follows that a_F B_F = a_F B, and since B and a satisfy the EC condition it follows that a_F B = 1. Therefore, we have a_F B_F X = a_F B X = 1 X, and hence the master obtains the sum of the partial gradients by linearly combining the responses of the workers in F. ∎
The next lemma bounds the deviation of a_F B X from the gradient of the empirical risk at the current model by using the function err and the spectral norm of X. Recall that for a matrix X the spectral norm is defined as ‖X‖₂ = max_{‖v‖₂ = 1} ‖vX‖₂.
Lemma 5. For a function err as above, if B and a satisfy the err-AC condition, then ‖a_F B X − 1 X‖₂ ≤ err(n − |F|) · ‖X‖₂.
Proof. As in the proof of Lemma 4, we have that a_F B X − 1 X = (a_F B − 1) X, and hence ‖a_F B X − 1 X‖₂ ≤ ‖a_F B − 1‖₂ · ‖X‖₂ ≤ err(n − |F|) · ‖X‖₂. ∎
Due to Lemma 4 and Lemma 5, in the remainder of this paper we focus on constructing B and a that satisfy either the EC condition (Section V) or the err-AC condition (Section VI).
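The inequality underlying Lemma 5 is the standard submultiplicativity of the spectral norm, ‖vX‖₂ ≤ ‖v‖₂ · ‖X‖₂; a quick numerical sanity check (the vectors below are arbitrary stand-ins for a_F B − 1 and X):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 4))        # rows play the role of partial gradients
v = rng.standard_normal(6)             # plays the role of a_F B - 1

spec = np.linalg.norm(X, 2)            # spectral norm = largest singular value
assert np.linalg.norm(v @ X) <= np.linalg.norm(v) * spec + 1e-12
```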
In some settings [23], it is convenient to partition the data set into k parts, where k ≤ n. Notice that the definitions of B and a above extend verbatim to this case as well. If B and a satisfy the EC condition, we have that a_F B = 1 for every large enough F. Hence, by omitting any n − k columns of B to form an n × k matrix B', we have that a_F B' = 1_k, and hence a scheme for any partition of the data into k parts emerges instantly. This new scheme is resilient to an identical number of stragglers and has lesser or equal storage overhead than the original one. Similarly, if B and a satisfy the err-AC condition for some err, then the new scheme has lesser or equal storage overhead and an identical error function err, since ‖1_k − a_F B'‖₂ ≤ ‖1 − a_F B‖₂ for any F.
This section provides a brief overview of the mathematical notions that are essential for the suggested schemes. The exact computation (Sec. V) requires notions from coding theory, and the approximate one (Sec. VI) requires notions from graph theory. The coding theoretic material in this section is taken from [21], which focuses on finite fields, and yet the given results extend verbatim to the real or complex case (see also [19, Sec. 8.4]).
For a field 𝔽, an [n, k] (linear) code C over 𝔽 is a subspace of 𝔽^n of dimension k. The minimum distance of C is d(C) = min over distinct codewords c₁, c₂ ∈ C of d_H(c₁, c₂), where d_H denotes the Hamming distance. Since the code is a linear subspace, it follows that the minimum distance of a code equals the minimum Hamming weight among the nonzero codewords in C. The well-known Singleton bound states that d(C) ≤ n − k + 1, and codes which attain this bound with equality are called Maximum Distance Separable (MDS) codes.
[21, Sec. 8] A code C is called cyclic if the cyclic shift of every codeword is also a codeword, namely, (c₁, ..., c_n) ∈ C implies (c_n, c₁, ..., c_{n−1}) ∈ C.
The dual code of C is the subspace C^⊥ = {x ∈ 𝔽^n : x · cᵀ = 0 for all c ∈ C}. Several well-known and easy to prove properties of MDS codes are used throughout this paper.
Lemma 8. If C is an [n, k] MDS code, then:
A1. The dual code C^⊥ is an [n, n − k] MDS code, and hence its minimum Hamming weight is k + 1.
A2. For any subset S ⊆ [n] of size n − k + 1 there exists a codeword in C whose support (i.e., the set of nonzero indices) is S.
A3. The reverse code C^R = {(c_n, ..., c₁) : (c₁, ..., c_n) ∈ C} is an [n, k] MDS code.
For A1 the reader is referred to [21, Prob. 4.1]. The proof of A3 is trivial, since permuting the coordinates of a code does not alter its minimum distance. For A2, let G be the k × n generator matrix of C, i.e., a matrix whose rows are a basis of C. By restricting G to the k − 1 columns indexed by [n] \ S we get a k × (k − 1) matrix, which has a nonzero vector v in its left kernel, and hence vG is a codeword in C which is zero in the entries indexed by [n] \ S. Since the minimum distance of C is n − k + 1, it follows that vG has nonzero values in all entries indexed by S, and the claim follows. ∎
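The argument for A2 can be replayed numerically; the sketch below uses a real Reed-Solomon code as the MDS code (the evaluation points and the target support are arbitrary choices for the example):

```python
import numpy as np

n, k = 6, 3
xs = np.arange(1, n + 1, dtype=float)        # distinct evaluation points
G = np.vstack([xs**i for i in range(k)])     # k x n Vandermonde generator matrix
S = [0, 1, 2, 3]                             # target support, |S| = n - k + 1
comp = [j for j in range(n) if j not in S]   # the k - 1 columns to annihilate

# A nonzero vector in the left kernel of the k x (k-1) submatrix G[:, comp]:
_, _, vh = np.linalg.svd(G[:, comp].T)
v = vh[-1]
c = v @ G                                    # a codeword vanishing outside S

assert np.allclose(c[comp], 0)               # zero outside S ...
assert np.all(np.abs(c[S]) > 1e-9)           # ... and, by distance n-k+1, nonzero on S
```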
Two common families of codes are used in the sequel: Reed-Solomon (RS) codes and Bose-Chaudhuri-Hocquenghem (BCH) codes. An [n, k] RS code is defined by a set of distinct evaluation points α₁, ..., α_n ∈ 𝔽 as the set of codewords (f(α₁), ..., f(α_n)), where f ranges over 𝔽_{<k}[x], the set of polynomials of degree less than k with coefficients from 𝔽. Alternatively, RS codes can be defined as {xV : x ∈ 𝔽^k}, where V is a k × n Vandermonde matrix on α₁, ..., α_n, i.e., V_{i,j} = α_j^{i−1} for every i and j. It is widely known that RS codes are MDS codes, and in some cases, they are also cyclic.
In contrast with RS codes, where every codeword is a vector of evaluations of a polynomial, a codeword in a BCH code is a vector of coefficients of a polynomial; that is, a codeword c = (c₁, ..., c_n) is identified with the polynomial c(x) = c₁ + c₂x + ... + c_n x^{n−1}. For a field 𝕂 that contains 𝔽 and a set R ⊆ 𝕂, a BCH code is defined as {c ∈ 𝔽^n : c(r) = 0 for all r ∈ R}. The set R is called the roots of the code, or alternatively, the code is said to be a BCH code on R over 𝔽. For example, a set R of complex numbers defines a BCH code on R over ℝ, which is the set of real vectors whose corresponding polynomials vanish on R.
Lemma 9. A BCH code C whose roots are roots of unity of order n is cyclic.
Proof. If c is a codeword in C, then its cyclic shift c' satisfies c'(x) = x · c(x) − c_n(x^n − 1). Since every r ∈ R is a root of unity of order n, it follows that c'(r) = r · c(r) − c_n(r^n − 1) = 0, and hence c' is a codeword in C. ∎
Further, the structure of the root set R may also imply a lower bound on the minimum distance of the code; this is the well-known BCH bound (Theorem 10).
In the remainder of this section, a brief overview of expander graphs is given. The interested reader is referred to [9] for further details. Let G be a d-regular, undirected, and connected graph on n nodes. Let A_G be the adjacency matrix of G, i.e., (A_G)_{i,j} = 1 if {i, j} is an edge, and (A_G)_{i,j} = 0 otherwise. Since A_G is a real symmetric matrix, it follows that it has n real eigenvalues λ₁ ≥ λ₂ ≥ ... ≥ λ_n; denote λ = max(|λ₂|, |λ_n|). It is widely known [9] that λ₁ = d, and that |λ_n| ≤ d, where equality holds if and only if G is bipartite. Further, it also follows from A_G being real and symmetric that it has a basis of orthonormal real eigenvectors v₁, ..., v_n, and w.l.o.g. assume that A_G v_i = λ_i v_i for every i, with v₁ = (1/√n) · 1. The parameters λ and d are related by the celebrated Alon-Boppana theorem.
Theorem 11 [9]. An infinite sequence of d-regular graphs on n vertices satisfies λ ≥ 2√(d − 1) − ε(n), where ε(n) is an expression which tends to zero as n tends to infinity.
Constant degree regular graphs (i.e., families of graphs with fixed degree d that does not depend on n) for which λ is small in comparison with d are largely referred to as expanders. In particular, graphs which attain the above bound asymptotically (i.e., λ ≤ 2√(d − 1)) are called Ramanujan graphs, and several efficient constructions are known [18, 5].
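As a concrete instance, the Petersen graph is 3-regular on 10 nodes with λ = 2 ≤ 2√2, hence Ramanujan; this can be checked directly:

```python
import numpy as np

# Petersen graph: outer 5-cycle, inner pentagram, plus spokes.
edges  = [(i, (i + 1) % 5) for i in range(5)]                 # outer cycle
edges += [(5 + i, 5 + (i + 2) % 5) for i in range(5)]         # inner pentagram
edges += [(i, 5 + i) for i in range(5)]                       # spokes

A = np.zeros((10, 10))
for u, v in edges:
    A[u, v] = A[v, u] = 1

eig = np.sort(np.linalg.eigvalsh(A))[::-1]   # eigenvalues, descending
d = 3
lam = max(abs(eig[1]), abs(eig[-1]))
assert np.isclose(eig[0], d)                 # largest eigenvalue equals the degree
assert np.isclose(lam, 2.0)                  # spectrum is {3, 1^5, (-2)^4}
assert lam <= 2 * np.sqrt(d - 1)             # Ramanujan: lambda <= 2*sqrt(d-1)
```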
For a given n and s, let C ⊆ ℂ^n be a cyclic [n, n − s] MDS code that contains the all-ones vector 1 (explicit constructions of such codes are given in the sequel). According to Lemma 8, there exists a codeword c ∈ C whose support is {1, ..., s + 1}. Let c⁽¹⁾, ..., c⁽ⁿ⁾ be all n cyclic shifts of c, which lie in C by its cyclic property. Finally, let B be the n × n matrix whose columns are c⁽¹⁾, ..., c⁽ⁿ⁾. The following lemma provides some important properties of B.
Lemma 12. The matrix B satisfies the following properties.
B1. |supp(B_i)| = s + 1 for every row B_i of B.
B2. Every row of B is a codeword in the reverse code C^R.
B3. The column span of B is the code C.
B4. Every set of n − s rows of B is linearly independent.
To prove B1 and B2, observe that B is a circulant matrix whose columns are consecutive cyclic shifts of c, where supp(c) = {1, ..., s + 1}. Hence every row and every column of B contains exactly s + 1 nonzero entries, and every row of B is a cyclic shift of the reversal of c, i.e., a codeword in the cyclic code C^R.
To prove B3, notice that the leftmost n − s columns of B have leading nonzero entries in different positions, and hence they are linearly independent. Thus, the dimension of the column span of B is at least n − s, and since the columns of B are codewords in C and dim C = n − s, the claim follows.
To prove B4, assume for contradiction that there exists a set of n − s linearly dependent rows. Hence, there exists a nonzero vector v of Hamming weight at most n − s such that vB = 0. According to B3, the columns of B span C, and hence the vector v lies in the dual code C^⊥. Since C^⊥ is an [n, s] MDS code by Lemma 8, it follows that the minimum Hamming weight of a nonzero codeword in C^⊥ is n − s + 1, a contradiction. ∎
Since C^R is of dimension n − s, it follows from parts B2 and B4 of Lemma 12 that every set of n − s rows of B is a basis of C^R. Furthermore, since 1 ∈ C and the reversal of 1 is 1 itself, it follows that 1 ∈ C^R. Therefore, there exists a function a such that for any set F of size n − s we have supp(a_F) ⊆ F and a_F B = 1.
Theorem 13. The above B and a satisfy the EC condition (Definition 2).
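A small numerical sketch of this scheme (using, as in the next subsection, the cyclic MDS code of evaluations of low-degree polynomials at the n-th roots of unity; the sizes are arbitrary choices for the example):

```python
import numpy as np
from itertools import combinations

n, s = 7, 2
k = n - s
w = np.exp(2j * np.pi / n)
V = np.array([[w**(i * j) for j in range(k)] for i in range(n)])  # n x k
                                                                  # evaluation matrix

# A codeword supported on the first s+1 positions: force zeros elsewhere.
zero_rows = list(range(s + 1, n))
_, _, vh = np.linalg.svd(V[zero_rows, :])
m = vh[-1].conj()                      # nonzero polynomial vanishing there
c = V @ m
assert np.allclose(c[zero_rows], 0)

# B's columns are the n cyclic shifts of c; since the code is cyclic, the
# shifted columns stay in the code, whose dual / reverse structure makes
# any n - s rows of B a spanning set containing the all-ones vector.
B = np.column_stack([np.roll(c, t) for t in range(n)])

# For every set F of n - s non-stragglers, solve a_F B = 1 with supp(a_F) in F.
for F in combinations(range(n), n - s):
    x, *_ = np.linalg.lstsq(B[list(F), :].T, np.ones(n), rcond=None)
    assert np.allclose(B[list(F), :].T @ x, np.ones(n))
```

Every subset of n − s workers thus admits a decoding vector, exactly as Theorem 13 asserts.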
In the remainder of this section, two cyclic MDS codes, over the complex numbers and over the real numbers, are suggested, from which the construction in Theorem 13 can be obtained. These constructions are taken from [19, Sec. II.B], and are given with a few adjustments to our case. The contribution of these codes is summarized in the following theorem.
Theorem 14. For any given n and s there exist explicit complex matrices B and a that satisfy the EC condition with optimal storage overhead d = s + 1, together with efficient algorithms for encoding (i.e., constructing B) and decoding (i.e., constructing a_F given F). In addition, for any given n and s of opposite parities there exist explicit real matrices B and a that satisfy the EC condition with optimal storage overhead d = s + 1; here the dominant cost of encoding and decoding is the inversion of a generalized Vandermonde matrix.
For a given n and s, let k = n − s, and let A = {α ∈ ℂ : αⁿ = 1} = {1, ω, ..., ω^{n−1}} be the set of complex roots of unity of order n, where ω = e^{2πi/n}. Let V be the n × k complex Vandermonde matrix over A, i.e., V_{i,j} = α_i^{j−1} for every i ∈ [n] and j ∈ [k]. Finally, let C = {Vx : x ∈ ℂᵏ}. It is readily verified that C is an [n, k] MDS code that contains 1, whose codewords may be seen as the evaluations of all polynomials of degree less than k on the set A.
C is a cyclic code.
Proof. Let c = (f(α₁), ..., f(α_n)) be a codeword, and let f be the corresponding polynomial. Consider the polynomial g(x) = f(ω⁻¹x), and notice that deg g = deg f < k. Further, it is readily verified that every i satisfies g(α_i) = f(α_{i−1}), where the indices are taken modulo n. Hence, the evaluation of the polynomial g on the set A results in the cyclic shift of the codeword c, which therefore lies in C itself. ∎
The code C is a cyclic MDS code which contains 1 (obtained from the constant polynomial f ≡ 1), and hence it can be used to obtain the matrices B and a, as described in Theorem 13.
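The cyclicity argument can be checked numerically: shifting the evaluation vector of f corresponds to evaluating f(x/ω), which has the same degree (the random coefficients below are for illustration only):

```python
import numpy as np

n, k = 7, 5
w = np.exp(2j * np.pi / n)
pts = w ** np.arange(n)                     # the n-th roots of unity
rng = np.random.default_rng(1)
f = rng.standard_normal(k) + 1j * rng.standard_normal(k)   # coefficients, deg < k

c = np.polyval(f[::-1], pts)                # codeword: evaluations of f
g = f * (w ** -np.arange(k))                # g(x) = f(x / w), still of degree < k
shift = np.polyval(g[::-1], pts)            # evaluations of g

# Evaluating g equals cyclically shifting the codeword, so the shift
# is again a codeword.
assert np.allclose(shift, np.roll(c, 1))
```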
Given a set F of non-stragglers, an algorithm for computing the encoding vector a_F (after a one-time initial computation) is given in Appendix A. The complexity of this algorithm is asymptotically smaller than that of the corresponding algorithms in [6] and [7] in a wide range of parameters. Furthermore, the cyclic structure of the matrix B enables a very simple algorithm for its construction; this algorithm compares favorably with previous works, and is given in Appendix A as well.
Note that the use of a complex rather than a real matrix B may potentially double the required bandwidth, since every complex number comprises two real numbers. A simple manipulation of Algorithm 1 which resolves this issue is given in Appendix C. This optimal bandwidth is also attained by the scheme in the next section, which uses a smaller number of multiplication operations; however, it is applicable only if n and s are of opposite parities.
If one wishes to abstain from using complex numbers, e.g., in order to reduce bandwidth, we suggest the following construction, which provides a cyclic MDS code over the reals. This construction relies on [19, Property 3], with an additional specialized property.
Construction 18. For a given n and s such that n and s are of opposite parities, define the following BCH codes over the reals. In both cases denote ω = e^{2πi/n}.
If n is even and s is odd, let D₁ = {ω^j : n/2 − (s − 1)/2 ≤ j ≤ n/2 + (s − 1)/2}, and let C₁ be the BCH code which consists of all polynomials over ℝ that vanish over the set D₁.
If n is odd and s is even, let D₂ = {ω^j : (n + 1 − s)/2 ≤ j ≤ (n − 1 + s)/2}, and let C₂ be the BCH code which consists of all polynomials over ℝ that vanish over the set D₂.
In either case, the root set consists of s consecutive powers of ω and is closed under complex conjugation.
Lemma 19. The codes C₁ and C₂ from Construction 18 are cyclic [n, n − s] MDS codes that contain 1.
Proof. According to Lemma 9, it is clear that C₁ and C₂ are cyclic. According to the BCH bound (Theorem 10), since the roots are s consecutive powers of ω, it is also clear that the minimum distance of C₁ is at least s + 1, and similarly for C₂. Hence, to prove that C₁ and C₂ are MDS codes, it suffices to show that their dimensions are n − s.
Since the sets D₁ and D₂ are closed under conjugation (i.e., a root is in the set if and only if its conjugate is), it follows that the polynomials p₁(x) = Π_{r ∈ D₁}(x − r) and p₂(x) = Π_{r ∈ D₂}(x − r) have real coefficients. Hence, by the definition of BCH codes it follows that

C_i = {m(x) · p_i(x) : m ∈ ℝ_{<n−s}[x]}, for i ∈ {1, 2},  (2)

and hence dim C₁ = n − s and dim C₂ = n − s. Let d₁ and d₂ be the minimum distances of C₁ and C₂, respectively, and notice that by the Singleton bound [21, Sec. 4.1] it follows that d₁ ≤ s + 1 and d₂ ≤ s + 1,
and thus C₁ and C₂ satisfy the Singleton bound with equality; equivalently, they are MDS codes. To prove that 1 is in C₁ and C₂, we ought to show that the polynomial 1(x) = 1 + x + ... + x^{n−1} satisfies 1(r) = 0 for every r ∈ D₁ and r ∈ D₂, which amounts to showing that Σ_{i=0}^{n−1} rⁱ = 0. It is well known that the sum of the 0'th to (n−1)'th powers of any root of unity of order n, other than 1 itself, equals zero. Since 1 ∉ D₁ ∪ D₂, we have that 1(r) = 0 for every r ∈ D₁ and r ∈ D₂, which concludes the claim. ∎
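A small numerical sketch in the spirit of Construction 18 (the root set below is one conjugation-closed choice of s consecutive powers of ω; the exact exponents used in the construction may differ):

```python
import numpy as np

n, s = 8, 3                         # n even, s odd
w = np.exp(2j * np.pi / n)
# s consecutive powers of w, symmetric about w^(n/2), hence conjugation-closed:
roots = w ** np.arange(n // 2 - (s - 1) // 2, n // 2 + (s - 1) // 2 + 1)
p = np.poly(roots)                  # generator polynomial of degree s
assert np.allclose(p.imag, 0)       # conjugation-closed roots => real coefficients
p = p.real

# Codewords are coefficient vectors of m(x) * p(x) with deg m < n - s.
G = np.zeros((n - s, n))
for i in range(n - s):
    G[i, i:i + s + 1] = p[::-1]     # row i holds the coefficients of x^i * p(x)

# 1 + x + ... + x^{n-1} vanishes at every n-th root of unity except 1,
# hence the all-ones vector lies in the code (in the row span of G).
y, *_ = np.linalg.lstsq(G.T, np.ones(n), rcond=None)
assert np.allclose(G.T @ y, np.ones(n))
```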
Algorithms for computing the matrix B and the vector a_F for the codes in this subsection are given in Appendix B. The algorithm for constructing B outperforms previous works in a wide range of parameters, and the algorithm for computing a_F outperforms previous works in a smaller yet wide range of values.
Recall that in order to retrieve the exact gradient, one must have a storage overhead of d ≥ s + 1, which is undesirable in many cases. To break this barrier, we suggest renouncing the exact retrieval of the gradient and settling for an approximation of it. Note that trading the exact gradient for an approximate one is a necessity in many variants of gradient descent (such as the acclaimed stochastic gradient descent [20, Sec. 14.3]), and hence our techniques are aligned with common practices in machine learning.
Setting B as the identity matrix and a as the function which maps every F to its binary characteristic vector 1_F clearly satisfies the err-AC condition for err(s) = √s, since

‖1 − a_F B‖₂ = ‖1 − 1_F‖₂ = √(n − |F|) = √s.  (3)
It is readily verified that this approach (termed hereafter "trivial") amounts to ignoring the stragglers, which is essentially equivalent to the scheme given in [4]. We show that it can be outperformed by setting B to be a normalized adjacency matrix of a connected regular graph on n nodes, which is constructed by the master before dispersing the data, and setting a to be some carefully chosen yet simple function.
The resulting error function depends on the parameters of the graph, whereas the resulting storage overhead is given by its degree (i.e., the fixed number of neighbors of each node). The error function is given below for a general connected and regular graph, and particular examples with their resulting errors are given in the sequel. In particular, it is shown that taking the graph to be an expander graph provides an error term which is asymptotically less than √s (Eq. (3)) for a wide range of values of s. In some cases, a smaller error term is also obtained for larger values of s.
For a given d, let G be a connected d-regular graph on n nodes, with eigenvalues λ₁ ≥ ... ≥ λ_n and corresponding orthonormal (column) eigenvectors v₁, ..., v_n, as described in Section IV. Let B = (1/d) · A_G, and for a given F of size n − s, define w_F ∈ ℝⁿ as

(w_F)_i = n/(n − s) if i ∈ F, and (w_F)_i = 0 otherwise.  (4)
Lemma 20. For any F of size n − s, 1 − w_F ∈ span{v₂, ..., v_n}.
Proof. First, observe that span{v₂, ..., v_n} is exactly the subspace of all vectors whose sum of entries is zero. This follows from the fact that v₁, ..., v_n is an orthogonal basis with v₁ = (1/√n) · 1, hence the sum of entries of v_i is zero for every i ≥ 2, and from the fact that v₂, ..., v_n are linearly independent. Since the sum of entries of 1 − w_F is n − (n − s) · n/(n − s) = 0, the result follows. ∎
Corollary 21. For any F of size n − s there exists w_F such that supp(w_F) ⊆ F, 1 − w_F ∈ span{v₂, ..., v_n}, and ‖1 − w_F‖₂ = √(sn/(n − s)).
Now, define a as a_F = w_F, and observe that supp(a_F) ⊆ F for all F. Note that computing a_F given F is done by a straightforward algorithm. The error function is given by the following lemma.
Lemma 22. For every set F of size n − s, ‖1 − a_F B‖₂ ≤ (λ/d) · √(sn/(n − s)).
Proof. Notice that the eigenvalues of B = (1/d) · A_G are λ₁/d, ..., λ_n/d, and hence the largest one, λ₁/d, equals 1. Further, the eigenvectors of B are identical to those of A_G, and in particular 1B = 1. Therefore, it follows from Corollary 21 that

1 − a_F B = (1 − w_F)B, where 1 − w_F = Σ_{i≥2} β_i v_i for some real coefficients β_i,

and since v₂, ..., v_n are orthonormal, it follows that

‖1 − a_F B‖₂² = Σ_{i≥2} β_i² (λ_i/d)² ≤ (λ/d)² · ‖1 − w_F‖₂² = (λ/d)² · sn/(n − s). ∎
Theorem 23. The above B and a satisfy the err-AC condition for err(s) = (λ/d) · √(sn/(n − s)). The storage overhead of this scheme equals the degree d of the underlying regular graph G.
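A numerical sketch of the approximate scheme on the Petersen graph (3-regular, n = 10, λ = 2), using a rescaled indicator vector for a_F (one natural choice satisfying Corollary 21; illustrative only):

```python
import numpy as np
from itertools import combinations

# Petersen graph adjacency matrix.
edges  = [(i, (i + 1) % 5) for i in range(5)]
edges += [(5 + i, 5 + (i + 2) % 5) for i in range(5)]
edges += [(i, 5 + i) for i in range(5)]
A = np.zeros((10, 10))
for u, v in edges:
    A[u, v] = A[v, u] = 1

n, d, lam, s = 10, 3, 2.0, 2
B = A / d                                   # normalized adjacency; 1 B = 1

bound = (lam / d) * np.sqrt(s * n / (n - s))
for F in combinations(range(n), n - s):
    a = np.zeros(n)
    a[list(F)] = n / (n - s)                # rescaled indicator of F
    err = np.linalg.norm(np.ones(n) - a @ B)
    assert err <= bound + 1e-9              # Lemma 22's bound holds for every F

# Here the bound (2/3)*sqrt(2*10/8) ~ 1.05 beats the trivial scheme's sqrt(2).
assert bound < np.sqrt(s)
```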
It is evident that in order to obtain a small deviation err(s), it is essential to have a small λ and a large d. However, most constructions of expanders have focused on the case where d is constant (i.e., d = O(1)). On the one hand, a constant d serves our purpose well since it implies a constant storage overhead. On the other hand, a constant d does not allow λ/d to tend to zero as n tends to infinity, due to Theorem 11.
To present the contribution of the suggested scheme, it is compared to the trivial one. Clearly, for any given number of stragglers s, it follows from (3) and from Lemma 22 that our scheme outperforms the trivial one if

(λ/d) · √(sn/(n − s)) < √s, i.e., if (λ/d)² · n/(n − s) < 1.  (5)
Since any connected and non-bipartite graph satisfies λ < d, it follows that Eq. (5) holds asymptotically for any s < (1 − (λ/d)²) · n. The following example shows the improved error rate for Margulis graphs [9, Sec. 8], which are rather easy to construct.
For any integer m there exists an 8-regular graph on m² nodes with λ ≤ 5√2 ≈ 7.07. Using these graphs, the left-hand side of Eq. (5) is at most (5√2/8)² · n/(n − s), and hence an improvement over the trivial scheme is obtained already for moderate values of s.
Several additional examples of Ramanujan graphs, which attain a much better error rate but are harder to construct, are as follows.
[18] Let p and q be distinct primes such that p ≡ 1 (mod 4), q ≡ 1 (mod 4), and such that the Legendre symbol (p/q) is 1 (i.e., p is a quadratic residue modulo q). Then, there exists a non-bipartite Ramanujan graph on q(q² − 1)/2 nodes with constant degree p + 1.
Instantiating this construction with small primes p and q already yields a (p + 1)-regular Ramanujan graph with λ ≤ 2√p, for which the error term (λ/d) · √(sn/(n − s)) improves upon the trivial √s by a substantial margin.
Restricting d to be a constant (i.e., not to grow with n) is detrimental to the error term in (5) due to Theorem 11, but allows a lower storage overhead. If one wishes a lower error term at the price of a higher overhead, the following is useful.
The above approximation scheme can be applied over graphs that are bipartite. However, such graphs satisfy λ = d, and hence the resulting error function is superseded by the error function of the trivial scheme (3). Nevertheless, in what follows it is shown that bipartite graphs can be employed in a slightly different fashion, and obtain