# Robust Gradient Descent via Moment Encoding with LDPC Codes

This paper considers the problem of implementing large-scale gradient descent algorithms in a distributed computing setting in the presence of straggling processors. To mitigate the effect of the stragglers, it has been previously proposed to encode the data with an erasure-correcting code and decode at the master server at the end of the computation. We, instead, propose to encode the second-moment of the data with a low density parity-check (LDPC) code. The iterative decoding algorithms for LDPC codes have very low computational overhead and the number of decoding iterations can be made to automatically adjust with the number of stragglers in the system. We show that for a random model for stragglers, the proposed moment encoding based gradient descent method can be viewed as the stochastic gradient descent method. This allows us to obtain convergence guarantees for the proposed solution. Furthermore, the proposed moment encoding based method is shown to outperform the existing schemes in a real distributed computing setup.

## 1 Introduction

The vast volumes of data at our disposal today have made many useful data-driven tasks possible that were otherwise thought to be infeasible. The large scale of the data necessitates working with distributed computing setups, as the computational power available at a single location is not sufficient to meet the strict performance requirements of many real-life systems. Moreover, in many settings, local compute and storage resources simply cannot accommodate the entire data being processed.

As a general principle, large-scale distributed computing setups (e.g., [5, 35]) divide the original problem at hand into many small tasks, which are assigned to many servers, namely workers. The master server then collects the outcomes of local computation at the workers (potentially over multiple rounds) and computes the final result. In practical systems, the process of collecting the outcomes from the workers is prone to unpredictable delays [4]. Such delays arise for various reasons, including slow-downs at the workers and congestion in the communication networks of the system. The workers that cannot provide the outcome of their local computation within a reasonable deadline due to these delays are termed stragglers. The presence of stragglers can significantly degrade the performance of the system. Therefore, it is imperative that we address the variability in the response times of different components of the setup during the design of the computation tasks.

Multiple recent works explore the problem of mitigating the effect of stragglers. The replication schemes assign each task to multiple servers [26, 1, 32]. This ensures that the task gets completed without significant delay if at least one of the servers processing the task is a non-straggler. In [15], Lee et al. explore coding theoretic ideas that go beyond the replication schemes to address the issue of stragglers. In particular, they focus on linear computation, namely a matrix-vector product, and propose to encode the columns of the matrix by a maximum distance separable (MDS) code to obtain a taller encoded matrix. The rows of the encoded matrix are distributed among the workers, who are responsible for computing the inner product of the rows assigned to them with the vector in question. The redundancy among the rows of the encoded matrix allows for computation of the intended matrix-vector product even if some of the servers fail to respond with the computation assigned to them. In [6], Dutta et al. further explore the problem of reliably computing a matrix-vector product while additionally requiring that the rows of the encoded matrix are sparse. This aims at reducing the computation at the workers and the communication between the master and the workers, both of which scale with the row-sparsity of the encoded matrix. Similar ideas for other computational tasks (e.g., matrix-matrix product and convolution between vectors) have been explored in [16, 34, 7]. Bitar et al. propose a scheme to securely compute a matrix-vector product without revealing any information about the matrix to the workers [2]. Another line of work, which aims at minimizing communication during data shuffling by using coding techniques, is presented in [17, 19, 18, 15] and references therein.

#### Our contributions.

We focus on the problem of fitting a structured linear model to the given data. In particular, given the features or data points $x_1,\dots,x_m\in\mathbb{R}^k$ and the associated labels $y_1,\dots,y_m\in\mathbb{R}$, we want to learn the model parameter $\theta^*$ belonging to a structured set $\Theta\subseteq\mathbb{R}^k$ so that $y_i\approx x_i^T\theta^*$ up to small modeling errors. In many applications, the prior knowledge about the structure of the model parameter (such as sparsity and group sparsity) can be expressed with the help of a regularizer $R(\cdot)$ so that $R(\theta^*)\le R$, for some $R>0$. In such settings, the task of recovering $\theta^*$ can be realized by solving the following optimization problem.

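$$\min_{\theta}\ \frac{1}{2}\sum_{i=1}^{m}\left(y_i-x_i^T\theta\right)^2 \quad \text{subject to} \quad \theta\in\Theta=\{\theta'\in\mathbb{R}^k : R(\theta')\le R\}. \tag{1}$$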

Note that the square loss, employed in the optimization problem above, is one of the most pervasive loss functions in machine learning, optimization, and statistics. A large class of estimation problems arising in practice, such as compressed sensing [8], dictionary learning [20], and matrix completion [14], can be solved as special cases of the general optimization problem outlined in (1) [31].

Even though we focus on the constrained optimization problem, our proposed solution easily extends to the unconstrained optimization problem where we incorporate the regularizer in the objective function with the help of a regularization parameter.

We note that our proposed solution can also be employed to recover the structured model parameters for single-index or generalized linear models [12], where each label depends on the inner product $x_i^T\theta^*$ through a possibly unknown nonlinear link function. In this setting, the model parameter can again be recovered by solving the generalized LASSO in (1) [22].

We employ the projected gradient descent (PGD) method to solve the underlying optimization problem. In a distributed computing setup, the iterative optimization procedure is implemented as follows. The master server maintains an estimate for the model parameter. In each step, the master sends the current estimate to the workers. The workers then compute a partial value of the gradient based on the received estimate and send the outcome of their computations to the master. By combining the messages received from the non-straggling workers, the master computes the gradient and updates its current estimate for the model parameter. In this paper, our first contribution is to propose a preprocessing step which encodes the second moment of the data and then distributes the encoded moments among the workers. This way, there is some redundancy among the outcomes of the computation at the workers, which allows the master to obtain a good enough estimate of the gradient even when it does not receive the outcome of the computation assigned to the stragglers.

We employ low-density parity-check (LDPC) codes to encode the second moment of the data points. As a result, the task of calculating the gradient at the master reduces to the task of decoding an LDPC codeword in the presence of erasures, where the erased locations depend on the identities of the stragglers. The reason for working with LDPC codes is that the iterative decoding algorithms for these codes provide a three-fold benefit. First, the decoder has very low computational complexity and automatically adjusts to the number of stragglers: only a small number of decoding iterations is required when there are not too many stragglers. Second, we can use the number of decoding iterations as a tuning parameter. Depending on the number of stragglers, we can run only as many decoding iterations as are sufficient to ensure the desired quality of the estimate of the gradient. In our setup, the number of erased coordinates of the gradient vector serves as a measure of its quality. Note that this measure is a non-increasing function of the number of decoding iterations. Finally, the MDS code based solutions provided in prior literature (such as [15, 34]) suffer from poor noise stability resulting from the large condition number of Vandermonde matrices, which we bypass by considering LDPC matrices.

Furthermore, we show that for a random model for stragglers, the PGD method with the proposed moment encoding scheme can be viewed as the projected stochastic gradient descent (PSGD) method. We then use the convergence analysis for the PSGD method to establish the convergence guarantee for our proposed solution. This analysis clearly characterizes the advantage over non-redundant or replication based gradient descent methods in terms of the number of decoding iterations employed in each step of the method. We also conduct a detailed performance evaluation of our solution on a real-life distributed computing framework (swarm2) at the University of Massachusetts Amherst [29]. The performance results show that, as compared to the existing schemes, our proposed solution requires a smaller number of gradient steps in order to converge to the correct model parameter.

#### Comparison with other relevant works.

In [15], Lee et al. focus on performing the iterative gradient descent method in a distributed manner by repeatedly invoking their solution for coded computation of a matrix-vector product. In this paper, we also rely on the coded computation of a matrix-vector product to realize iterative gradient descent in a straggler tolerant manner. However, we encode the second moment matrix as opposed to the plain data matrix as done in [15]. This leads to fewer communication rounds. Furthermore, this also makes the analysis of the optimization procedure completely different from that in [15]. As another novel contribution, we utilize LDPC codes which, as discussed above, allow for both efficient decoding and control over the quality of the (approximate) gradient computed in each step of the optimization procedure. In [13], Karakus et al. also study the problem of recovering the model parameter of a linear model by solving an alternative optimization problem where they encode both the data points and their labels by matrices with maximally (pairwise) incoherent columns. Again, our approach differs from theirs as we solve the original optimization problem itself and rely on moment encoding as opposed to data encoding.

#### Organization.

We present the exact problem formulation along with the necessary background in Section 2. We present the main contribution of this paper in Section 3 where we describe the moment encoding based optimization scheme along with its convergence analysis. In Section 4, we perform an extensive evaluation of the proposed scheme in a real-life distributed computing setup and compare it with the prior work. We present a list of notations in Table 1 for ease of reading.

## 2 System model and background

Our distributed computing setup has $w$ worker servers and one master server. Performing large-scale computation in this setup involves dividing the desired computation problem into multiple small computation tasks that are assigned to the workers. The master then collects the outcomes of the tasks mapped to the workers and produces the final result. The overall computation may require multiple rounds of communication between the master and the workers.

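We are given $m$ data samples or feature vectors $x_1,\dots,x_m\in\mathbb{R}^k$ and their labels $y_1,\dots,y_m\in\mathbb{R}$. In this paper, we are mainly concerned with learning a structured linear model. In particular, we are interested in learning a vector $\theta\in\Theta=\{\theta'\in\mathbb{R}^k : R(\theta')\le R\}$, for some regularizer $R(\cdot)$, such that the following total empirical loss is minimized.

$$L(\theta)=\frac{1}{2}\|y-X\theta\|_2^2=\frac{1}{2}\sum_{i=1}^{m}\left(y_i-x_i^T\theta\right)^2, \tag{2}$$

where $X=(x_1,\dots,x_m)^T\in\mathbb{R}^{m\times k}$ and $y=(y_1,\dots,y_m)^T\in\mathbb{R}^m$. Note that the gradient of the total empirical loss with respect to $\theta$ has the following form.

$$\nabla_\theta L(\theta)=X^TX\theta-X^Ty. \tag{3}$$

In this paper, we rely on the PGD method to solve the underlying constrained optimization problem, which iteratively updates an estimate of $\theta$. Specifically, at the $t$-th step, the estimate has the form

$$\theta_t=P_\Theta\big(\theta_{t-1}-\eta_t\nabla_\theta L(\theta_{t-1})\big), \tag{4}$$

where $\eta_t$ is the learning rate at the $t$-th step, which may potentially be independent of $t$; and the projection operator $P_\Theta$ is defined to be

$$P_\Theta(\theta)=\arg\min_{\theta'\in\Theta}\|\theta-\theta'\|_2.$$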

###### Remark 1.

In our proposed scheme, the master performs the projection step in (4). Thus, we are mainly interested in the regularizers with computationally efficient projection operations. This is particularly true for decomposable regularizers, such as sparsity constraints.
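The update (4) can be illustrated with a minimal single-machine sketch. The choice of an $\ell_2$ ball as the constraint set, the helper names `project_l2_ball` and `pgd_least_squares`, and the synthetic data are assumptions made purely for illustration; they are not the regularizers or data used in the paper.

```python
import numpy as np

def project_l2_ball(theta, radius):
    """Euclidean projection onto {theta : ||theta||_2 <= radius} (assumed constraint set)."""
    norm = np.linalg.norm(theta)
    return theta if norm <= radius else theta * (radius / norm)

def pgd_least_squares(X, y, radius, eta, num_steps):
    """Projected gradient descent for 0.5 * ||y - X theta||^2 over an l2 ball."""
    M = X.T @ X          # second moment of the data, computed once
    b = X.T @ y          # does not depend on theta, computed once
    theta = np.zeros(X.shape[1])
    for _ in range(num_steps):
        grad = M @ theta - b                      # gradient of the square loss
        theta = project_l2_ball(theta - eta * grad, radius)
    return theta

# Synthetic, noiseless example (hypothetical data).
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
theta_true = np.array([0.5, -0.3, 0.0, 0.2, 0.1])
y = X @ theta_true
eta = 1.0 / np.linalg.norm(X, 2) ** 2             # 1 / largest eigenvalue of X^T X
theta_hat = pgd_least_squares(X, y, radius=1.0, eta=eta, num_steps=500)
```

Only the projection is specific to the constraint set; the gradient step depends on the data solely through $X^TX$ and $X^Ty$.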

#### Preliminary: Linear codes.

In this paper, we rely on linear codes to perform the overall computation on a distributed computing setup in a redundant manner. The redundancy allows the master to realize the original computation task in a straggler tolerant manner. An $[n,K]$ linear code $\mathcal{C}$ is simply a subspace of dimension $K$ belonging to an $n$-length vector space. In this paper, we focus on the vector space defined over the real numbers $\mathbb{R}$. Therefore, an $[n,K]$ linear code forms a $K$-dimensional subspace in $\mathbb{R}^n$. Given a $K$-length message vector $u\in\mathbb{R}^K$, it can be encoded (or mapped) to a codeword from the code $\mathcal{C}$ with the help of a generator matrix $G\in\mathbb{R}^{n\times K}$ as $c=Gu$. Thus, a linear code can be defined as $\mathcal{C}=\{Gu : u\in\mathbb{R}^K\}$. Alternatively, a linear code can also be defined by a parity check matrix $H\in\mathbb{R}^{(n-K)\times n}$ as $\mathcal{C}=\{c\in\mathbb{R}^n : Hc=0\}$. A generator matrix $G$ leads to a systematic encoding if, for each $u\in\mathbb{R}^K$, the message vector $u$ appears exactly as $K$ coordinates of the associated codeword $c=Gu$. The redundancy introduced by mapping a $K$-dimensional vector to an $n$-dimensional vector with $n>K$ allows one to recover $u$ from $c$ even when some of the coordinates of $c$ are missing. In particular, if the code has minimum distance $d_{\min}$, then $u$ can be recovered even if any $d_{\min}-1$ coordinates of $c$ are not available.
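As a small illustration of encoding and erasure recovery over the reals, the sketch below uses a random Gaussian generator matrix. This choice is an assumption made for convenience: with probability one any $K$ rows of such a matrix are linearly independent, so up to $n-K$ erased coordinates can be tolerated. The symbols `G`, `u`, and `c` are illustrative names.

```python
import numpy as np

rng = np.random.default_rng(1)
n, K = 6, 3
G = rng.standard_normal((n, K))     # assumed generic real generator matrix

u = np.array([1.0, -2.0, 0.5])      # message vector
c = G @ u                           # codeword of length n

erased = [0, 2, 5]                  # n - K coordinates are missing
kept = [i for i in range(n) if i not in erased]

# Recover the message from the surviving coordinates by solving G[kept] u = c[kept].
u_hat, *_ = np.linalg.lstsq(G[kept], c[kept], rcond=None)
```

Solving a dense linear system is the generic way to decode; the appeal of structured codes such as LDPC codes is that this step can be replaced by a much cheaper iterative procedure.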

### 2.1 The data coding method of [15] and the gradient coding approach of [30]

An approach to run gradient descent in a distributed system using reliable distributed matrix multiplication as a building block was recently presented by Lee et al. [15]. Note that, in the linear regression problem, computing the gradient of the total empirical loss involves the computation of two matrix-vector products in each iteration (see (3)), namely $X\theta$ and $X^T(X\theta)$. In [15], an MDS-coded distributed algorithm for matrix multiplication was proposed. In this algorithm, to perform the matrix-vector product $X\theta$, the matrix $X$ is premultiplied by the generator matrix $G$ of an MDS code of proper dimensions to get $GX$. Each worker node then performs a single inner product (or a set of inner products) involving a row of $GX$ and $\theta$. The results of these local computations are then sent to the master node. As long as the number of workers that successfully deliver their local computations within the deadline is more than a specified threshold (in other words, as long as the number of stragglers is within the erasure correcting capability of the MDS code, given by $d_{\min}-1$), the product $X\theta$ can be found at the master node. In each iteration of the gradient descent, the above matrix-vector product protocol is applied twice (see [15] for details) to compute $X\theta$ and $X^T(X\theta)$. This facilitates computation of the gradient in each iteration in the presence of the stragglers.
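The coded matrix-vector product can be sketched as follows. The real Vandermonde generator with equispaced nodes is an assumption chosen because any square submatrix formed from distinct nodes is invertible; it is not necessarily the construction used in [15]. The last two lines illustrate how quickly the condition number of such matrices grows with their size.

```python
import numpy as np

rng = np.random.default_rng(5)
m, k, n = 4, 3, 6                       # m rows of X, n workers (hypothetical sizes)
X = rng.standard_normal((m, k))
theta = rng.standard_normal(k)

nodes = np.linspace(-1.0, 1.0, n)       # distinct evaluation points
G = np.vander(nodes, N=m, increasing=True)   # n x m real Vandermonde generator
coded = G @ X                           # worker j stores row j of G X
results = coded @ theta                 # one inner product per worker

alive = [0, 2, 3, 5]                    # any m non-straggling workers suffice
X_theta = np.linalg.solve(G[alive], results[alive])

# Conditioning of Vandermonde generators degrades rapidly with size.
cond_small = np.linalg.cond(np.vander(np.linspace(-1.0, 1.0, 6), N=4, increasing=True))
cond_large = np.linalg.cond(np.vander(np.linspace(-1.0, 1.0, 40), N=20, increasing=True))
```

The decoding step inverts a submatrix of $G$, so a large condition number translates directly into amplified numerical noise.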

In [30], Tandon et al. propose a novel framework to exactly compute the gradient of the underlying loss function in a distributed computation setup. In particular, they consider a generic loss function which takes the additive form $L(\theta)=\sum_{i=1}^{m}\ell((y_i,x_i),\theta)$. For such a loss function, its gradient can be obtained as

$$\nabla_\theta L(\theta)=\sum_{i=1}^{m}\nabla_\theta\ell((y_i,x_i),\theta). \tag{5}$$

In order to compute the gradient in a distributed manner, the samples and the corresponding labels are distributed among the $w$ workers in a redundant manner. For $i\in[w]$, the samples and labels allocated to the $i$-th worker server are indexed by the set $A_i\subseteq[m]$. Given the samples and labels indexed by the set $A_i$, the $i$-th worker can compute the following components of the gradient (cf. (5)).

$$B_i:=\{\nabla_\theta\ell((y_j,x_j),\theta)\}_{j\in A_i}\subset\mathbb{R}^k. \tag{6}$$

Now, the $i$-th worker transmits a linear combination of the blocks in $B_i$ to the master. In particular, the transmitted block can be represented as follows.

$$z_i=\sum_{j\in A_i}b_{i,j}\nabla_\theta\ell((y_j,x_j),\theta)\in\mathbb{R}^k. \tag{7}$$

Equivalently, the transmitted blocks from all $w$ workers can be represented as the following matrix.

$$Z=(z_1,\dots,z_w)^T=B\,\big(\nabla_\theta\ell((y_1,x_1),\theta)\ \cdots\ \nabla_\theta\ell((y_m,x_m),\theta)\big)^T, \tag{8}$$

where $B$ is a $w\times m$ matrix containing the coefficients associated with the transmissions from the $w$ workers (cf. (7)). Note that, for $i\in[w]$, the support of the $i$-th row of the matrix $B$ is contained in the set $A_i$.

Let $S\subseteq[w]$ denote the set of indices of the workers that successfully deliver their local computations within the deadline. Assuming that we have $s$ straggling workers which do not respond with their intended transmission before the deadline, we have $|S|=w-s$. Note that the master has the following information at its disposal.

$$Z_S=B_S\,\big(\nabla_\theta\ell((y_1,x_1),\theta)\ \cdots\ \nabla_\theta\ell((y_m,x_m),\theta)\big)^T, \tag{9}$$

where $Z_S$ and $B_S$ denote the sub-matrices formed by the rows indexed by $S$ in $Z$ and $B$, respectively. In order to be able to obtain the gradient

$$\nabla_\theta L(\theta)=\sum_{i=1}^{m}\nabla_\theta\ell((y_i,x_i),\theta)=(1,\dots,1)\cdot\big(\nabla_\theta\ell((y_1,x_1),\theta)\ \cdots\ \nabla_\theta\ell((y_m,x_m),\theta)\big)^T,$$

we require that the all-ones vector belongs to the subspace spanned by the rows of the matrix $B_S$. Therefore, the design criterion in the gradient coding approach [30] is to find an allocation of the samples $\{A_i\}_{i\in[w]}$ and the associated transmission matrix $B$ such that for every $S\subseteq[w]$ with $|S|=w-s$, the all-ones vector belongs to the row-space of the matrix $B_S$.
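The design criterion can be checked numerically on a toy instance. The $3\times 3$ matrix `B` below is a commonly used small illustration with three workers and one straggler; it is supplied here as an assumption, and the per-sample gradients are random placeholders rather than gradients of an actual loss.

```python
import itertools
import numpy as np

# Row i holds the coefficients b_{i,j} used by worker i (zero outside A_i).
B = np.array([[0.5, 1.0,  0.0],
              [0.0, 1.0, -1.0],
              [0.5, 0.0,  1.0]])
w, s = 3, 1
ones = np.ones(B.shape[1])

def decoding_vector(B_S):
    """Find a with a^T B_S = (1, ..., 1), i.e. the all-ones vector in the row space."""
    a, *_ = np.linalg.lstsq(B_S.T, ones, rcond=None)
    return a

rng = np.random.default_rng(2)
partial_grads = rng.standard_normal((3, 4))   # row j: gradient of sample j (k = 4)
Z = B @ partial_grads                         # row i: block sent by worker i
full_grad = partial_grads.sum(axis=0)

recovered = {}
for S in itertools.combinations(range(w), w - s):
    S = list(S)
    a = decoding_vector(B[S])
    recovered[tuple(S)] = a @ Z[S]            # master's estimate from workers in S
```

Each worker transmits a full $k$-dimensional vector here, which is the communication cost that the moment encoding approach aims to reduce.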

Our computing method crucially differs from both of the schemes of [30] and [15]. Instead of encoding the matrices $X$ and $X^T$ with MDS codes, we use a single code to encode the matrix $X^TX$, the second moment of the data.

## 3 Encoding second moment: Optimization with approximate gradient

We exploit the special structure of the gradient of the square loss (cf. (3)) to devise a scheme to deal with stragglers. The proposed scheme is more efficient as compared to the gradient coding approach [30] and the reliable distributed matrix multiplication based scheme [15] (cf. supplementary material). Recall the gradient of the total empirical loss associated with the square loss function from (3). Note that we need to compute the term $X^Ty$ only once at the beginning of the optimization procedure as it is independent of the optimization parameter $\theta$. By using the notation $M=X^TX$ and $b=X^Ty$, the $t$-th step of the PGD method takes the following form (cf. (4)).

$$\theta_t=P_\Theta\big(\theta_{t-1}-\eta_t\nabla_\theta L(\theta_{t-1})\big)=P_\Theta\big(\theta_{t-1}-\eta_t(M\theta_{t-1}-b)\big), \tag{10}$$

where $\theta_t$ denotes the estimate of $\theta^*$ at the end of the $t$-th step.

### 3.1 Exact computation of gradient in each step

Now, in order to perform the projected gradient descent in a distributed computation setup, we distribute the task of computing the matrix-vector product $M\theta_{t-1}$ among the workers. In particular, we encode the matrix $M$ using a linear code. The encoded matrix can be used to generate redundant tasks for the workers, which subsequently enables us to mitigate the effect of stragglers.

###### Scheme 1 (Exact gradient computation using linear codes).

Given the matrix $M\in\mathbb{R}^{k\times k}$ and an $[n,K]$ linear code $\mathcal{C}$ (for ease of exposition, we assume that $K$ divides $k$), the gradient computation for each step of the optimization procedure is realized as below.

• Let $P_1,\dots,P_{k/K}$ represent a partition of the set $[k]$ of indices of the rows of the matrix $M$ such that $|P_i|=K$ for each $i\in[k/K]$.

• For each $i\in[k/K]$, we encode the matrix $M_{P_i}$, formed by the rows of $M$ indexed by $P_i$, using the linear code $\mathcal{C}$ as $C^{(i)}=GM_{P_i}$, where $G$ is an $n\times K$ generator matrix of $\mathcal{C}$. Note that the columns of the matrix $C^{(i)}$ form codewords of $\mathcal{C}$.

• In the distributed computation setup, for $i\in[k/K]$ and $j\in[w]$, we now allocate the $j$-th row of $C^{(i)}$ to the $j$-th worker. This way, the $j$-th server is assigned the following set of vectors.

$$T_j=\{c^{(1)}_j,\dots,c^{(k/K)}_j\}\subset\mathbb{R}^k, \tag{11}$$

where $c^{(i)}_j$ denotes the $j$-th row of the matrix $C^{(i)}$.

• During the $t$-th step of the gradient descent optimization procedure, the $j$-th worker is tasked with computing the inner product of the rows assigned to it with the current estimate $\theta_{t-1}$, i.e., the $j$-th worker sends $k/K$ inner products to the master.

• Straggler tolerant exact gradient computation: Assuming that the workers indexed by the set $\overline{S}_t\subseteq[w]$ behave as stragglers during the $t$-th step of the optimization procedure, and writing $S_t=[w]\setminus\overline{S}_t$ for the non-straggling workers, the master has access to the following information received from the non-straggling workers.

$$C^{(i)}_{S_t}\theta_{t-1}=G_{S_t}M_{P_i}\theta_{t-1} \quad \text{for all } i\in[k/K]. \tag{12}$$

Since the code generated by $G$ is a linear code, it is straightforward to verify that for each $i\in[k/K]$, $C^{(i)}\theta_{t-1}=GM_{P_i}\theta_{t-1}$ corresponds to a codeword of $\mathcal{C}$. Moreover, the information available at the master (cf. (12)) is equivalent to observing these codewords with some of their coordinates erased. Assuming the code has large enough minimum distance, or equivalently, the matrix $G_{S_t}$ is full rank, the master can recover $M_{P_i}\theta_{t-1}$ for each $i\in[k/K]$ from the information received from the workers indexed by the set $S_t$. This allows the master to construct $M\theta_{t-1}$ and update the estimate for $\theta$ according to (10).
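The steps of Scheme 1 can be sketched end to end on a small instance. The random Gaussian generator matrix, the problem sizes, and the choice of stragglers are assumptions for illustration only; the scheme itself works with any linear code of sufficient minimum distance.

```python
import numpy as np

rng = np.random.default_rng(3)
m, k, K, w = 50, 6, 3, 5             # K divides k; code length equals number of workers
X = rng.standard_normal((m, k))
M = X.T @ X                          # second moment matrix
G = rng.standard_normal((w, K))      # assumed generic generator (any K rows independent)

# Partition the rows of M into blocks P_i of size K and encode each block.
blocks = [np.arange(i * K, (i + 1) * K) for i in range(k // K)]
C = [G @ M[P] for P in blocks]       # C[i] = G M_{P_i}; row j goes to worker j

theta = rng.standard_normal(k)       # current estimate
stragglers = {1, 3}                  # w - K = 2 stragglers can be tolerated here
alive = [j for j in range(w) if j not in stragglers]

M_theta = np.empty(k)
for P, C_i in zip(blocks, C):
    received = C_i[alive] @ theta    # one scalar per block from each non-straggler
    decoded, *_ = np.linalg.lstsq(G[alive], received, rcond=None)
    M_theta[P] = decoded             # recovers M_{P_i} theta
```

Each worker sends only $k/K$ scalars per step, and the master obtains the exact product $M\theta$ despite the missing responses.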

We now state the following result about the performance of Scheme 1, which follows from the description of the scheme in a straightforward manner.

###### Proposition 1.

Assume that the moment encoding based Scheme 1 employs an $[n,K]$ linear code with minimum distance $d_{\min}$. Then, the scheme implements the exact gradient descent method as long as the number of stragglers during each step of the optimization is strictly less than $d_{\min}$.

###### Remark 2.

Note that the length $n$ of the code does not need to be equal to the number of workers $w$. For ease of exposition, we focus on the case $n=w$ here. This choice provides a simple natural allocation of computation tasks to the workers. However, a suitable allocation can also be devised for the setting with $n\neq w$.

Comparison with gradient coding approach [30]. Encoding the second moments offers an immediate advantage over the general gradient coding approach for the underlying optimization problem (cf. (1)). In Scheme 1, during a step of the optimization method, each worker communicates one scalar for each of the rows assigned to it, whereas in gradient coding each worker needs to transmit a $k$-dimensional vector to the master (cf. (7), in the supplementary material). Moreover, as for the local computation at a worker during each step, our approach requires computing a single inner product for every row assigned to the worker. In contrast, in the gradient coding framework, workers have to perform matrix-vector products between rank matrices and $k$-dimensional vectors.

In Scheme 1, we employ linear codes with the objective that the master should be able to compute (decode) the exact gradient during every step of the optimization procedure. This can be achieved by utilizing any linear code with large enough minimum distance. However, for the PGD method to succeed, it is not necessary to compute the exact gradient in every step. In particular, the stochastic gradient descent method is one of the most used versions of the gradient descent methods, where one employs an estimate of the gradient based on a randomly chosen sample and its label [27]. For the problem at hand, the $t$-th step of the projected stochastic gradient descent (PSGD) method is as follows.

 (13)

where the sample index is an integer that is picked uniformly at random from $[m]$. Note that the resulting stochastic gradient indeed gives an unbiased estimate of the true gradient (cf. (3)). Next, we exploit this robustness of the gradient based procedures to the quality of the gradient.

### 3.2 Approximate recovery of gradient in every step

Here, we focus on implementing the gradient based optimization procedure in a distributed computing setup by constructing only an estimate of the true gradient during each step of the optimization procedure. This allows us to employ coding schemes that have low complexity encoding and decoding algorithms, which lowers the overall computational complexity of the coding based approach to mitigate the effect of stragglers. Before we describe our approximate gradient based optimization procedure, we specify the assumptions on the identity of the stragglers during each step of the optimization procedure.

###### Assumption 1 (Straggling behavior of the workers).

Let the indices of the stragglers during the $t$-th step of the optimization be distributed independently of the stragglers in the previous steps. Furthermore, let the distribution of the stragglers in each step be such that each worker independently behaves as a straggler with probability $q_0$.

The analysis of this section can be modified for other random models for the identity of the stragglers. Here, we note that we do not enforce any such random model for the straggling behavior during our experimental evaluations of the proposed scheme in Section 4.

We are now in a position to describe the LDPC code based optimization procedure that relies on an approximate gradient during each step of the optimization procedure.

###### Scheme 2 (LDPC codes based optimization with approximate gradients).

Given the matrix $M$, we take an LDPC code $\mathcal{C}$ with $H$ as its (low-density) parity check matrix (for ease of exposition, in addition to $n=w$, we assume that $K=k$; the proposed scheme can be easily generalized to the setting with $K<k$, as done in Scheme 1, by partitioning the rows of $M$ into blocks of $K$ rows). The approximate gradient based optimization procedure is realized as follows.

• Encode $M$ using a systematic generator matrix of $\mathcal{C}$, say $G$, as $C=GM$, where without loss of generality we assume that $M$ constitutes the first $k$ rows of the matrix $C$. Next, distribute the rows of $C$ among the $w$ workers such that the $j$-th row is assigned to the $j$-th worker.

• During the $t$-th step of the optimization procedure, the $j$-th worker computes the inner product of the row assigned to it with the current estimate $\theta_{t-1}$ and sends it to the master.

• Assuming that the set $\overline{S}_t$ denotes the indices of stragglers during the $t$-th step and $S_t=[w]\setminus\overline{S}_t$ denotes the remaining workers, the information received at the master takes the form:

$$C_{S_t}\theta_{t-1}=G_{S_t}M\theta_{t-1}. \tag{14}$$

Note that $c=C\theta_{t-1}=GM\theta_{t-1}$ is a codeword of $\mathcal{C}$ with $M\theta_{t-1}$ appearing in its first $k$ coordinates.

• Computation of approximate gradient: Given $C_{S_t}\theta_{t-1}$, the master employs $D$ iterations of an iterative erasure correction algorithm for the LDPC code $\mathcal{C}$, where $\overline{S}_t$ denotes the indices of the erased coordinates. Let $\hat{c}^{(t;D)}$ be the estimate for the codeword $c$ after $D$ iterations of the erasure correction algorithm [24]. If a particular coordinate is not recovered by the end of $D$ iterations, we replace the coordinate with $0$. Subsequently, we construct a vector $\hat{b}_t$ from $b$ by setting to $0$ exactly those coordinates whose indices were set to $0$ in this manner. During the $t$-th step, the master updates the current estimate of $\theta$ as

$$\theta_t=P_\Theta\Big(\theta_{t-1}-\eta_t\cdot\big((\hat{c}^{(t;D)}_1,\dots,\hat{c}^{(t;D)}_k)^T-\hat{b}_t\big)\Big). \tag{15}$$
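The decoding and zeroing steps of Scheme 2 can be sketched with a simple iterative erasure decoder over the reals. The toy parity check matrix $H=[A\,|\,I]$ with a sparse $A$, the helper names `peel` and `approximate_gradient`, and the problem sizes are all assumptions for illustration; this is not a designed LDPC ensemble, only a small sparse code that makes the mechanics visible.

```python
import numpy as np

def peel(H, c_received, erased, num_iters):
    """Iterative erasure correction: in each round, every parity check with
    exactly one erased coordinate solves for that coordinate."""
    c = c_received.copy()
    erased = set(erased)
    for _ in range(num_iters):
        recovered = {}
        for row in H:
            support = np.flatnonzero(row).tolist()
            unknown = [j for j in support if j in erased]
            if len(unknown) == 1:
                j = unknown[0]
                known = [i for i in support if i != j]
                recovered[j] = -(row[known] @ c[known]) / row[j]
        if not recovered:            # no check can make progress
            break
        for j, value in recovered.items():
            c[j] = value
            erased.discard(j)
    return c, erased

k = 4
A = np.array([[1., 1., 0., 0.],
              [0., 1., 1., 0.],
              [0., 0., 1., 1.],
              [1., 0., 0., 1.]])
H = np.hstack([A, np.eye(k)])        # toy sparse parity check matrix
G = np.vstack([np.eye(k), -A])       # systematic generator, H @ G = 0

rng = np.random.default_rng(6)
X = rng.standard_normal((30, k))
y = rng.standard_normal(30)
M, b = X.T @ X, X.T @ y
theta = rng.standard_normal(k)
codeword = (G @ M) @ theta           # worker j would return coordinate j

def approximate_gradient(stragglers, num_iters):
    received = codeword.copy()
    received[list(stragglers)] = 0.0
    c_hat, still_erased = peel(H, received, stragglers, num_iters)
    keep = np.ones(k)
    for j in still_erased:
        if j < k:
            keep[j] = 0.0            # zero the same coordinate of c_hat and b
    return keep * (c_hat[:k] - b), still_erased

g_exact, left_exact = approximate_gradient({0, 4}, num_iters=2)
g_approx, left_approx = approximate_gradient({0, 4, 7}, num_iters=5)
```

In the first call two rounds recover every coordinate, so the gradient is exact. In the second call no check ever has a single erased coordinate, so the first gradient coordinate stays erased and is set to zero.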

In what follows, we establish that under Assumption 1, Scheme 2 indeed implements a variant of the PSGD method. As a result, under some natural requirements on the loss function and the initialization $\theta_0$, we obtain a convergence result for Scheme 2 that is similar to those available in the literature for the PSGD method (cf. (13)).

However, before we analyze the convergence of Scheme 2, we need to characterize the quality of the gradient recovered at the end of $D$ iterations of the erasure correction algorithm of the underlying LDPC code $\mathcal{C}$. LDPC codes have been extensively studied in the literature along with the performance of various decoding algorithms for such codes [9, 28, 24]. Under Assumption 1, where each worker independently behaves as a straggler with probability $q_0$, the vector received by the master (cf. (14)) is equivalent to the outcome of an erasure channel. For a specific family of LDPC codes and a fixed iterative erasure correction algorithm, there have been many successful attempts to characterize the likelihood of an initially erased coordinate being recovered after a certain number of iterations. Here, we state a special case of the most prominent result in this direction, which applies to various random ensembles of LDPC codes with sufficiently large length. (In particular, we restrict ourselves to LDPC codes with left and right regular Tanner graphs; we refer the readers to [24] for the general version of the result that applies to LDPC codes with irregular Tanner graphs.) This result is obtained by density evolution analysis [24].

###### Proposition 2.

Consider an ensemble of LDPC codes defined by the random parity check matrix $H$ such that every row of $H$ has the same number of nonzero entries, and likewise every column (there are multiple ways of generating random ensembles of LDPC codes; see, e.g., [24, Ch. 3]). Let each coordinate of a codeword from the ensemble be independently erased with probability $q_0$. Then, the probability $q_d$ that a coordinate of the codeword remains erased after $d$ iterations of the iterative erasure correction satisfies the density evolution recursion of [24, 25]. (This relation is shown to hold with very high probability, which involves an application of a bounded-difference concentration inequality on the random bipartite graphs corresponding to $H$. Given that these are fairly standard results in the coding theory literature, we refer the readers to [25, 24] for the details.)

###### Remark 3.

The key takeaway from Proposition 2 is that the probability of a coordinate of a codeword staying erased is a monotonically non-increasing function of the number of iterations as long as the erasure probability $q_0$ stays below a threshold that is a function of the row and column weights of the random matrix $H$.
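For intuition, the behaviour of the erasure probability across iterations can be explored numerically. The sketch below uses the standard textbook density evolution recursion for regular ensembles on the erasure channel, $x_d=q_0\big(1-(1-x_{d-1})^{r-1}\big)^{l-1}$, where $r$ and $l$ denote the row and column weights. Both the symbols and the assumption that this is the form intended in Proposition 2 are ours; the function name `density_evolution` is hypothetical.

```python
def density_evolution(q0, col_weight, row_weight, num_iters):
    """Erasure probabilities [x_0, ..., x_num_iters] under the standard
    recursion for a regular LDPC ensemble on the erasure channel."""
    probs = [q0]
    for _ in range(num_iters):
        prev = probs[-1]
        # A check node cannot help if any of its other neighbours is erased.
        check_fails = 1.0 - (1.0 - prev) ** (row_weight - 1)
        # A coordinate stays erased if it was erased and all other checks fail.
        probs.append(q0 * check_fails ** (col_weight - 1))
    return probs

# Column weight 3 and row weight 6: one probability below and one above
# the decoding threshold of this ensemble.
below = density_evolution(0.30, col_weight=3, row_weight=6, num_iters=20)
above = density_evolution(0.50, col_weight=3, row_weight=6, num_iters=200)
```

With a straggling probability of 0.30 the sequence decays rapidly towards zero within a handful of iterations, whereas at 0.50 it settles at a strictly positive value, so additional decoding iterations stop helping.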

The following lemma characterizes the quality of the gradient vector obtained at the master after $D$ iterations of the erasure correction algorithm of the underlying LDPC code.

###### Lemma 1.

Let the distribution of stragglers satisfy Assumption 1 and let the master node employ $D$ iterations of the erasure correction algorithm. Then, during the $t$-th step of the optimization procedure, we have

$$\mathbb{E}\Big[(\hat{c}^{(t;D)}_1,\dots,\hat{c}^{(t;D)}_k)^T-\hat{b}_t\Big]=(1-q_D)\cdot\nabla L(\theta_{t-1}),$$

which is a scaled version of the true gradient at $\theta_{t-1}$.

###### Proof.

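Recall that, during the $t$-th step of the optimization procedure, $q_D$ denotes the probability that a particular coordinate of the codeword $c=GM\theta_{t-1}$ is not recovered by the master (cf. Scheme 2). The first $k$ coordinates of this vector equal $M\theta_{t-1}$, which together with $b$ determines the true gradient vector at $\theta_{t-1}$. Therefore, for $i\in[k]$, we have

$$\mathbb{P}\big[\hat{c}^{(t;D)}_i=c_i\big]=1-q_D \quad \text{and} \quad \mathbb{P}\big[\hat{c}^{(t;D)}_i=0\big]=q_D. \tag{16}$$

Similarly, for $i\in[k]$, we have

$$\mathbb{P}\big[\hat{b}_i=b_i\big]=1-q_D \quad \text{and} \quad \mathbb{P}\big[\hat{b}_i=0\big]=q_D. \tag{17}$$

By using (16) and (17), it is straightforward to verify that

$$\mathbb{E}\Big[(\hat{c}^{(t;D)}_1,\dots,\hat{c}^{(t;D)}_k)^T-\hat{b}_t\Big]=(1-q_D)\cdot\big((c_1,\dots,c_k)^T-b\big)\overset{(i)}{=}(1-q_D)\cdot\big(M\theta_{t-1}-X^Ty\big)=(1-q_D)\cdot\nabla L(\theta_{t-1}),$$

where $(i)$ follows from the systematic form associated with the generator matrix $G$. ∎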
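The scaling in Lemma 1 can be sanity-checked by simulation. The sketch below assumes that each of the first $k$ coordinates independently remains erased with probability `q_D` and that the matching coordinates of $c$ and $b$ are zeroed together; the vectors `c` and `b` are random placeholders, and the independence across coordinates is a simplification that does not affect the expectation.

```python
import numpy as np

rng = np.random.default_rng(4)
k, q_D, trials = 5, 0.2, 200_000
c = rng.standard_normal(k)           # stands in for the first k codeword coordinates
b = rng.standard_normal(k)           # stands in for X^T y
grad = c - b                         # true gradient

# recovered[s, i] is True when coordinate i is decoded in trial s.
recovered = rng.random((trials, k)) >= q_D
g_hat = recovered * grad             # c_hat - b_hat with joint zeroing
empirical_mean = g_hat.mean(axis=0)
expected = (1.0 - q_D) * grad
```

The empirical average of the approximate gradient lines up with the true gradient scaled by $1-q_D$, which is what allows the scheme to be analyzed as a stochastic gradient method.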

#### Convergence analysis of Scheme 2

Here, we formally argue that the proposed Scheme 2 enjoys convergence guarantees similar to those available for the typical PSGD method. In fact, the proof of convergence of our scheme relies heavily on the ideas employed in the proof of convergence for the PSGD algorithm as described in [21]. Recall that the total empirical loss associated with the model parameter $\theta$ for a given set of data samples $\{x_i\}_{i=1}^m$ and the corresponding labels $\{y_i\}_{i=1}^m$ takes the form

$$L(\theta)=\sum_{i=1}^{m}\ell\big((y_i,x_i),\theta\big)=\frac{1}{2}\cdot\sum_{i=1}^{m}\big(y_i-x_i^T\theta\big)^2.\tag{18}$$

We now state the convergence result for Scheme 2, which holds under natural assumptions on the loss function and the initialization for the optimization procedure. In what follows, we use $\|\cdot\|$ to denote the $\ell_2$ norm. We also note that the projection operator $P_\Theta$ is non-expanding, i.e.,

$$\|P_\Theta(\theta)-P_\Theta(\theta')\|\le\|\theta-\theta'\|\quad\text{for all }\theta,\theta'\in\mathbb{R}^k.$$

###### Theorem 1.

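Suppose for all and , the loss function satisfies . Moreover, let the initial estimate $\theta_0$ satisfy $\|\theta_0-\theta^*\|\le R$. Then, setting the learning rate as in Scheme 2, with $D$ iterations of LDPC decoding employed during each gradient descent step, ensures the following:

$$\mathbb{E}\big[L(\bar{\theta}_T)\big]-L(\theta^*)\le\frac{RB}{(1-q_D)\cdot\sqrt{T}},\tag{19}$$

where $\bar{\theta}_T$ denotes the average of the iterates $\theta_1,\ldots,\theta_T$ and the expectation is taken over the distribution of the stragglers.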

###### Proof.

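It follows from the convexity of the loss function that

$$L(\bar{\theta}_T)-L(\theta^*)\le\frac{1}{T}\sum_{t=1}^{T}L(\theta_t)-L(\theta^*)\le\frac{1}{T}\sum_{t=1}^{T}\big\langle\nabla L(\theta_t),(\theta_t-\theta^*)\big\rangle.\tag{20}$$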

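Recall from (15) that, for each step $t$, we have

$$\theta_{t+1}=P_\Theta\big(\theta_t-\eta\,g_t(\theta_t)\big),$$

where $g_t(\theta_t)$ denotes the gradient estimate obtained at the master (cf. Lemma 1) and $\eta$ is the learning rate. Now, consider

$$\begin{aligned}
\|\theta_{t+1}-\theta^*\|^2
&\le\big\|P_\Theta\big(\theta_t-\eta\,g_t(\theta_t)\big)-\theta^*\big\|^2\\
&\overset{(i)}{=}\big\|P_\Theta\big(\theta_t-\eta\,g_t(\theta_t)\big)-P_\Theta(\theta^*)\big\|^2\\
&\overset{(ii)}{\le}\|\theta_t-\eta\,g_t(\theta_t)-\theta^*\|^2\\
&=\|\theta_t-\theta^*\|^2-2\eta\cdot\big\langle g_t(\theta_t),(\theta_t-\theta^*)\big\rangle+\eta^2\|g_t(\theta_t)\|^2\\
&\le\|\theta_t-\theta^*\|^2-2\eta\cdot\big\langle g_t(\theta_t),(\theta_t-\theta^*)\big\rangle+\eta^2B^2,
\end{aligned}\tag{21}$$

where (i) follows from the fact that $P_\Theta(\theta^*)=\theta^*$ and (ii) holds as the operator $P_\Theta$ is non-expanding, i.e.,

$$\|P_\Theta(\theta)-P_\Theta(\theta')\|\le\|\theta-\theta'\|\quad\text{for all }\theta,\theta'\in\mathbb{R}^k.$$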

Let $H_t$ denote the history, i.e., the identity of the stragglers, before the $t$-th step of the optimization procedure. Note that it follows from Lemma 1 that

$$\mathbb{E}\big[g_t(\theta_t)\mid H_t\big]=(1-q_D)\cdot\nabla L(\theta_t).\tag{22}$$

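By combining (21) and (22), we obtain that

$$\mathbb{E}\big[\|\theta_{t+1}-\theta^*\|^2\mid H_t\big]\le\|\theta_t-\theta^*\|^2-2\eta\cdot(1-q_D)\cdot\big\langle\nabla L(\theta_t),(\theta_t-\theta^*)\big\rangle+\eta^2B^2.\tag{23}$$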

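Now, taking the expectation on both sides gives us that

$$\mathbb{E}\big[\|\theta_{t+1}-\theta^*\|^2\big]\le\mathbb{E}\big[\|\theta_t-\theta^*\|^2\big]-2\cdot\mathbb{E}\big[\eta\cdot(1-q_D)\cdot\big\langle\nabla L(\theta_t),(\theta_t-\theta^*)\big\rangle\big]+\eta^2B^2,\tag{24}$$

or

$$(1-q_D)\cdot\mathbb{E}\big[\big\langle\nabla L(\theta_t),(\theta_t-\theta^*)\big\rangle\big]\le\frac{1}{2\eta}\cdot\mathbb{E}\big[\|\theta_t-\theta^*\|^2\big]-\frac{1}{2\eta}\cdot\mathbb{E}\big[\|\theta_{t+1}-\theta^*\|^2\big]+\frac{\eta B^2}{2}.\tag{25}$$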

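By taking the average of the aforementioned inequality over $T$ iterations, we obtain that

$$\begin{aligned}
\mathbb{E}\Big[\frac{1}{T}\sum_{t=0}^{T-1}\big\langle\nabla L(\theta_t),(\theta_t-\theta^*)\big\rangle\Big]
&\le\frac{1}{2\eta(1-q_D)}\cdot\Big(\frac{\mathbb{E}\big[\|\theta_0-\theta^*\|^2\big]}{T}-\frac{\mathbb{E}\big[\|\theta_T-\theta^*\|^2\big]}{T}+\eta^2B^2\Big)\\
&\le\frac{\|\theta_0-\theta^*\|^2}{2\eta T(1-q_D)}+\frac{\eta B^2}{2(1-q_D)}\\
&\le\frac{R^2}{2\eta T(1-q_D)}+\frac{\eta B^2}{2(1-q_D)}\\
&\overset{(i)}{\le}\frac{1}{1-q_D}\cdot\frac{RB}{\sqrt{T}},
\end{aligned}\tag{26}$$

where (i) follows from the choice of $\eta$. Now, Theorem 1 follows by combining (20) and (26). ∎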
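The last step of the proof can be verified by direct substitution. As a worked sketch, assume the learning rate $\eta=R/(B\sqrt{T})$; this value is an assumption made here because it balances the two terms, and the value fixed in Scheme 2 is not restated in this section. Then

$$\frac{R^2}{2\eta T(1-q_D)}+\frac{\eta B^2}{2(1-q_D)}=\frac{1}{1-q_D}\cdot\Big(\frac{RB}{2\sqrt{T}}+\frac{RB}{2\sqrt{T}}\Big)=\frac{1}{1-q_D}\cdot\frac{RB}{\sqrt{T}}.$$

By the arithmetic-geometric mean inequality, no other choice of $\eta$ makes the sum of these two terms smaller, so under this assumption the bound holds with equality in the final step.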

## 4 Simulation results

In this section, we conduct a detailed evaluation of our moment encoding based scheme (cf. Scheme 2) for distributed computation. In particular, we perform experiments in a distributed setting to obtain solutions of two problems: 1) least-square estimation, and 2) sparse recovery. Recall that, for least-square estimation, given inputs and the task is to find . Note that this problem does not require a projection step during the optimization procedure. In the sparse recovery problem, one seeks to find a -sparse vector (this means that at most coordinates out of of the vector are nonzero) from linear samples , for some matrix . In this case, the -th step of the projected gradient descent procedure takes the form [10] , where is the gradient of the squared loss and is the thresholding operation that sets all except the largest coordinates in absolute value of to zero. To compute the gradient, we again employ the moment encoding method with LDPC codes as outlined in Scheme 2. Note that the thresholding operation can be easily performed by the master node itself.
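The thresholding operation described above is simple enough to sketch directly. The snippet below is a minimal illustration, not the authors' implementation: the names `hard_threshold`, `thresholded_step`, `s` (the sparsity level) and `eta` (the step size) are hypothetical, and the update is assumed to have the generic form "gradient step followed by thresholding", with the squared-loss gradient written as $M\theta-b$ for $M=X^TX$ and $b=X^Ty$.

```python
import numpy as np


def hard_threshold(v, s):
    """Keep the s largest-magnitude coordinates of v and set the rest to zero."""
    out = np.zeros_like(v)
    if s > 0:
        keep = np.argsort(np.abs(v))[-s:]  # indices of the s largest |v_i|
        out[keep] = v[keep]
    return out


def thresholded_step(theta, M, b, eta, s):
    """One assumed update: gradient step on the squared loss, then thresholding."""
    grad = M @ theta - b  # gradient of the squared loss
    return hard_threshold(theta - eta * grad, s)


v = np.array([0.1, -3.0, 2.0, 0.05, -0.7])
print(hard_threshold(v, 2))
```

In the distributed setting only the product $M\theta$ needs the workers; the thresholding itself is a cheap local operation at the master, which is consistent with the remark above.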

Figure 3 presents the results for the least-square estimation problem. In our experiments, the data samples are randomly generated with the dimensions and the number of total samples . The corresponding labels are created by multiplying the data matrix with a randomly drawn vector . We implement Scheme 2 on a real-life distributed computing framework (swarm2) at the University of Massachusetts Amherst [29] using the mpi4py Python package. The setup involves a cluster of computing nodes ( worker nodes and master nodes). Throughout this section, the plotted results are averaged over trials. We compare our LDPC codes based (rate) moment encoding scheme with the recently proposed data encoding (with MDS/Gaussian matrices) scheme of Karakus et al. (KSDY17 in the figures) [13], as well as with uncoded and replication-based schemes (-replication). (Footnote 6: Here, we do not compare our scheme with the approaches proposed in [30] and [15], as both of these schemes involve significantly different computation and communication requirements (cf. the supplementary material). For example, the gradient coding scheme [30] requires communicating -dimensional vectors, and the approach of [15] involves encoding of two different matrices and two rounds of communication per step of the optimization procedure.) In all cases, we wait for either or workers to respond before the computations at the master node, i.e., the number of stragglers is or , respectively. In order to implement our scheme, we utilize a LDPC code. In the replication-based schemes, we partition the data and repeat each partition of the data twice. We use sub-sampled Hadamard and Gaussian matrices to implement the data encoding method from [13]. We sampled the columns of a Hadamard matrix and generated random Gaussian matrices for the purpose of our experiments. For each case, we record the number of steps until the Euclidean distance of the evaluated parameter from the actual parameter vector is within a small threshold.
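The stopping criterion used in these experiments (the number of steps until the iterate is within a small Euclidean distance of the true parameter) can be mimicked on a single machine. The sketch below is a simplified simulation under stated assumptions, not the MPI implementation: there is no LDPC encoding or decoding, each gradient coordinate is simply assumed to be lost (zeroed) independently with probability `q`, the step size is taken as $1/\lambda_{\max}(M)$, and the sizes `m = 200`, `k = 10` and the tolerance are made-up values.

```python
import numpy as np


def steps_to_converge(X, y, theta_true, q, tol=1e-6, max_steps=5000, seed=0):
    """Count gradient steps until ||theta - theta_true|| < tol, or return None."""
    rng = np.random.default_rng(seed)
    M = X.T @ X                              # second moment of the data
    b = X.T @ y
    eta = 1.0 / np.linalg.eigvalsh(M)[-1]    # assumed step size 1 / lambda_max(M)
    theta = np.zeros(X.shape[1])
    for step in range(1, max_steps + 1):
        grad = M @ theta - b
        recovered = rng.random(grad.shape) >= q  # coordinates that survive
        theta = theta - eta * recovered * grad   # lost coordinates contribute zero
        if np.linalg.norm(theta - theta_true) < tol:
            return step
    return None


rng = np.random.default_rng(1)
m, k = 200, 10
X = rng.standard_normal((m, k))
theta_true = rng.standard_normal(k)
y = X @ theta_true                           # noiseless labels

results = {q: steps_to_converge(X, y, theta_true, q) for q in (0.0, 0.2, 0.4)}
print(results)
```

Because unrecovered coordinates only remove part of the descent direction, every step in this simplified model still decreases the loss, which gives some intuition for why the scheme tolerates stragglers at the cost of extra steps.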

For the sparse recovery problem, we consider both the overdetermined and the underdetermined cases. For , we adopt the same experimental setup as described above with , but restrict ourselves to the dimensions . For each , we consider different sparsities: for , entries in are nonzero. Figure 3 presents the results for the sparse recovery problem in this overdetermined setup. We only plot the number of steps of the optimization procedure. The total computation time shows a similar trend. For , we generate the matrix as a matrix with i.i.d. entries distributed according to the standard normal distribution. The true parameter vector is drawn randomly with sparsity levels . The results obtained from our experiments are presented in Figure 3. As is evident from the plots in Figure 3, our scheme requires a smaller number of steps to converge to the true model parameters. Furthermore, our scheme also leads to a smaller overall computation time.

#### Conclusion and future directions.

In this paper, we have proposed to encode the second moment of the data for the purpose of running a distributed gradient descent algorithm. However, our encoding is tailored to the squared-loss function (indeed, otherwise the second moment of the data would not appear in the gradient). Our approach can be generalized to other loss functions relevant to various machine learning tasks, such as the logarithmic loss or the Poisson loss. An interesting direction for future work is to determine what functional of the data needs to be encoded in those cases so that the computation and communication overheads are minimized.

### Acknowledgements

This work is supported in part by National Science Foundation awards CCF 1642658 (CAREER) and CCF 1618512.

## References

• [1] G. Ananthanarayanan, A. Ghodsi, S. Shenker, and I. Stoica. Effective straggler mitigation: Attack of the clones. In Proceedings of the 10th USENIX Conference on Networked Systems Design and Implementation (NSDI), pages 185–198, 2013.
• [2] R. Bitar, P. Parag, and S. E. Rouayheb. Minimizing latency for secure distributed computing. In Proceedings of 2017 IEEE International Symposium on Information Theory (ISIT), pages 2900–2904, 2017.
• [3] Z. Charles, D. Papailiopoulos, and J. Ellenberg. Approximate gradient coding via sparse random graphs. arXiv preprint arXiv:1711.06771, 2017.
• [4] J. Dean and L. A. Barroso. The tail at scale. Communications of the ACM, 56(2):74–80, 2013.
• [5] J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. Communications of the ACM, 51(1):107–113, Jan. 2008.
• [6] S. Dutta, V. Cadambe, and P. Grover. Short-dot: Computing large linear transforms distributedly using coded short dot products. In Advances in Neural Information Processing Systems, pages 2100–2108, 2016.
• [7] S. Dutta, V. Cadambe, and P. Grover. Coded convolution for parallel and distributed computing within a deadline. arXiv preprint arXiv:1705.03875, 2017.
• [8] M. Figueiredo, R. D. Nowak, and S. J. Wright. Gradient projection for sparse reconstruction: Application to compressed sensing and other inverse problems. IEEE Journal of selected topics in signal processing, 1(4):586–597, 2007.
• [9] R. Gallager. Low-density parity-check codes. IRE Transactions on information theory, 8(1):21–28, 1962.
• [10] R. Garg and R. Khandekar. Gradient descent with sparsification: an iterative algorithm for sparse recovery with restricted isometry property. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 337–344. ACM, 2009.
• [11] W. Halbawi, N. Azizan Ruhi, F. Salehi, and B. Hassibi. Improving distributed gradient descent using Reed-Solomon codes. arXiv preprint arXiv:1706.05436, 2017.
• [12] S. M. Kakade, V. Kanade, O. Shamir, and A. Kalai. Efficient learning of generalized linear and single index models with isotonic regression. In Advances in Neural Information Processing Systems, pages 927–935, 2011.
• [13] C. Karakus, Y. Sun, S. Diggavi, and W. Yin. Straggler mitigation in distributed optimization through data encoding. In Advances in Neural Information Processing Systems, pages 5440–5448, 2017.
• [14] V. Koltchinskii, K. Lounici, and A. B. Tsybakov. Nuclear-norm penalization and optimal rates for noisy low-rank matrix completion. The Annals of Statistics, 39(5):2302–2329, 2011.
• [15] K. Lee, M. Lam, R. Pedarsani, D. Papailiopoulos, and K. Ramchandran. Speeding up distributed machine learning using codes. IEEE Transactions on Information Theory, 64(3):1514–1529, March 2018.
• [16] K. Lee, C. Suh, and K. Ramchandran. High-dimensional coded matrix multiplication. In Proceedings of IEEE International Symposium on Information Theory (ISIT), pages 2418–2422, 2017.
• [17] S. Li, M. A. Maddah-Ali, and A. S. Avestimehr. Coded Mapreduce. In Proceedings of 53rd Annual Allerton Conference on Communication, Control, and Computing (Allerton), pages 964–971, 2015.
• [18] S. Li, M. A. Maddah-Ali, and A. S. Avestimehr. A unified coding framework for distributed computing with straggling servers. In Proceedings of IEEE Globecom Workshops (GC Wkshps), pages 1–6, 2016.
• [19] S. Li, M. A. Maddah-Ali, Q. Yu, and A. S. Avestimehr. A fundamental tradeoff between computation and communication in distributed computing. IEEE Transactions on Information Theory, 64(1):109–128, Jan 2018.
• [20] J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online dictionary learning for sparse coding. In Proceedings of the 26th annual international conference on machine learning, pages 689–696. ACM, 2009.
• [21] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.
• [22] Y. Plan and R. Vershynin. The generalized lasso with non-linear observations. IEEE Transactions on information theory, 62(3):1528–1537, 2016.
• [23] N. Raviv, I. Tamo, R. Tandon, and A. G. Dimakis. Gradient coding from cyclic MDS codes and expander graphs. arXiv preprint arXiv:1707.03858, 2017.
• [24] T. Richardson and R. L. Urbanke. Modern Coding Theory. Cambridge University Press, New York, NY, USA, 2008.
• [25] T. J. Richardson and R. L. Urbanke. The capacity of low-density parity-check codes under message-passing decoding. IEEE Transactions on Information Theory, 47(2):599–618, Feb 2001.
• [26] N. B. Shah, K. Lee, and K. Ramchandran. When do redundant requests reduce latency? IEEE Transactions on Communications, 64(2):715–722, Feb 2016.
• [27] S. Shalev-Shwartz and S. Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, New York, NY, USA, 2014.
• [28] M. Sipser and D. A. Spielman. Expander codes. IEEE Transactions on Information Theory, 42(6):1710–1722, Nov 1996.
• [29] Swarm2. Swarm user documentation. Accessed: 2018-01-05.
• [30] R. Tandon, Q. Lei, A. G. Dimakis, and N. Karampatziakis. Gradient coding: Avoiding stragglers in distributed learning. In Proceedings of the 34th International Conference on Machine Learning (ICML), pages 3368–3376, 2017.
• [31] R. Tibshirani, M. Wainwright, and T. Hastie. Statistical learning with sparsity: the lasso and generalizations. Chapman and Hall/CRC, 2015.
• [32] D. Wang, G. Joshi, and G. Wornell. Using straggler replication to reduce latency in large-scale parallel computing. ACM SIGMETRICS Performance Evaluation Review, 43(3):7–11, 2015.
• [33] Y. Yang, P. Grover, and S. Kar. Coding method for parallel iterative linear solver. arXiv preprint arXiv:1706.00163, 2017.
• [34] Q. Yu, M. A. Maddah-Ali, and A. S. Avestimehr. Polynomial codes: an optimal design for high-dimensional coded matrix multiplication. arXiv preprint arXiv:1705.10464, 2017.
• [35] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica. Spark: Cluster computing with working sets. In Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing (HotCloud), pages 10–10, 2010.