# Redundancy Techniques for Straggler Mitigation in Distributed Optimization and Learning

The performance of distributed optimization and learning systems is bottlenecked by "straggler" nodes and slow communication links, which significantly delay computation. We propose a distributed optimization framework where the dataset is "encoded" to have an over-complete representation with built-in redundancy, and the straggling nodes in the system are dynamically left out of the computation at every iteration; the loss of their updates is compensated by the embedded redundancy. We show that oblivious application of several popular optimization algorithms on encoded data, including gradient descent, L-BFGS, and proximal gradient under data parallelism, and coordinate descent under model parallelism, converges to either approximate or exact solutions of the original problem when stragglers are treated as erasures. These convergence results are deterministic, i.e., they establish sample path convergence for arbitrary sequences of delay patterns or distributions on the nodes, and are independent of the tail behavior of the delay distribution. We demonstrate that equiangular tight frames have desirable properties as encoding matrices, and propose efficient mechanisms for encoding large-scale data. We implement the proposed technique on Amazon EC2 clusters, demonstrate its performance on several learning problems, including matrix factorization, LASSO, ridge regression and logistic regression, and compare the proposed method with uncoded, asynchronous, and data replication strategies.

## 1 Introduction

Solving learning and optimization problems at present scale often requires parallel and distributed implementations to deal with otherwise infeasible computational and memory requirements. However, such distributed implementations often suffer from system-level issues such as slow communication and unbalanced computational nodes. The runtime of many distributed implementations is therefore throttled by that of a few slow nodes, called stragglers, or a few slow communication links, whose delays significantly encumber the overall learning task. In this paper, we propose a distributed optimization framework based on proceeding with each iteration without waiting for the stragglers, and encoding the dataset across nodes to add redundancy in the system in order to mitigate the resulting potential performance degradation due to lost updates.

We consider the master-worker architecture, where the dataset is distributed across a set of worker nodes, which directly communicate with a master node to optimize a global objective. The encoding framework consists of an efficient linear transformation (coding) of the dataset that results in an overcomplete representation, which is then partitioned and distributed across the worker nodes. The distributed optimization algorithm is then performed directly on the encoded data, with all worker nodes oblivious to the encoding scheme, i.e., no explicit decoding of the data is performed, and nodes simply solve the effective optimization problem after encoding. In order to mitigate the effect of stragglers, in each iteration, the master node only waits for the first $k$ updates to arrive from the $m$ worker nodes (where $k \le m$ is a design parameter) before moving on; the remaining $m - k$ node results are effectively erasures, whose loss is compensated by the data encoding.

The framework is applicable to both the data parallelism and model parallelism paradigms of distributed learning, and can be applied to distributed implementations of several popular optimization algorithms, including gradient descent, limited-memory-BFGS, proximal gradient, and block coordinate descent. We show that if the linear transformation is designed to satisfy a spectral condition resembling the restricted isometry property, the iterates resulting from the encoded version of these algorithms deterministically converge to an exact solution for the case of model parallelism, and an approximate one under data parallelism, where the approximation quality only depends on the properties of the encoding and the parameter $k$. These convergence guarantees are deterministic in the sense that they hold for any pattern of node delays, i.e., even if an adversary chooses which nodes to delay at every iteration. In addition, the convergence behavior is independent of the tail behavior of the node delay distribution. Such a worst-case guarantee is not possible for the asynchronous versions of these algorithms, whose convergence rates deteriorate with increasing node delays. We point out that our approach is particularly suited to computing networks with a high degree of variability and unpredictability, where a large number of nodes can delay their computations for arbitrarily long periods of time.

Our contributions are as follows: (i) we propose the encoded distributed optimization framework, and prove deterministic convergence guarantees under this framework for gradient descent, L-BFGS, proximal gradient and block coordinate descent algorithms; (ii) we provide three classes of encoding matrices, discuss their properties, and describe how to efficiently encode large-scale data with such matrices; (iii) we implement the proposed technique on Amazon EC2 clusters and compare its performance to uncoded, replication, and asynchronous strategies for problems such as ridge regression, collaborative filtering, logistic regression, and LASSO. In these tasks we show that, in the presence of stragglers, the technique can result in significant speed-ups in reaching the same test error (specific amounts depend on the underlying system, and examples are provided in Section 5) compared to the uncoded case, in which all workers are waited for in each iteration.

##### Related work.

The approaches to mitigating the effect of stragglers can be broadly classified into three categories: replication-based techniques, asynchronous optimization, and coding-based techniques.

Replication-based techniques consist of either re-launching a certain task if it is delayed, or pre-emptively assigning each task to multiple nodes and moving on with the copy that completes first. Such techniques have been proposed and analyzed in Gardner et al. (2015); Ananthanarayanan et al. (2013); Shah et al. (2016); Wang et al. (2015); Yadwadkar et al. (2016), among others. Our framework does not preclude the use of such system-level strategies, which can still be built on top of our encoded framework to add another layer of robustness against stragglers. However, it is not possible to achieve the worst-case guarantees provided by encoding with such schemes, since it is still possible for both replicas to be delayed.

Perhaps the most popular approach in distributed learning to address the straggler problem is asynchronous optimization, where each worker node asynchronously pushes updates to and fetches iterates from a parameter server independently of other workers, hence the stragglers do not hold up the entire computation. This approach was studied in Recht et al. (2011); Agarwal and Duchi (2011); Dean et al. (2012); Li et al. (2014) (among many others) for the case of data parallelism, and Liu et al. (2015); You et al. (2016); Peng et al. (2016); Sun et al. (2017) for coordinate descent methods (model parallelism). Although this approach has been largely successful, all asynchronous convergence results depend on either a hard bound on the allowable delays on the updates, or a bound on the moments of the delay distribution, and the resulting convergence rates explicitly depend on such bounds. In contrast, our framework allows for completely unbounded delays. Further, as in the case of replication, one can still consider asynchronous strategies on top of the encoding, although we do not focus on such techniques within the scope of this paper.

A more recent line of work that addresses the straggler problem is based on coding-theory-inspired techniques Tandon et al. (2017); Lee et al. (2016); Dutta et al. (2016); Karakus et al. (2017a, b); Yang et al. (2017); Halbawi et al. (2017); Reisizadeh et al. (2017). Some of these works focus exclusively on coding for distributed linear operations, which are considerably simpler to handle. The works in Tandon et al. (2017); Halbawi et al. (2017) propose coding techniques for distributed gradient descent that can be applied more generally. However, the approach proposed in these works requires a redundancy factor of $s+1$ in the code to mitigate $s$ stragglers. Our approach relaxes the exact gradient recovery requirement of these works, consequently reducing the amount of redundancy required by the code.

The proposed technique, especially under data parallelism, is also closely related to randomized linear algebra and sketching techniques in Mahoney et al. (2011); Drineas et al. (2011); Pilanci and Wainwright (2015), used for dimensionality reduction of large convex optimization problems. The main difference between this literature and the proposed coding technique is that the former focuses on reducing the problem dimensions to lighten the computational load, whereas encoding increases the dimensionality of the problem to provide robustness. As a result of the increased dimensions, coding can provide a much closer approximation to the original solution compared to sketching techniques. In addition, unlike these works, our model allows for an arbitrary convex regularizer in addition to the encoded loss term.

## 2 Encoded Distributed Optimization

We will use the notation $[j] = \{i \in \mathbb{Z} : 1 \le i \le j\}$. All vector norms refer to the 2-norm, and all matrix norms refer to the spectral norm, unless otherwise noted. The superscript $c$ will refer to the complement of a subset, i.e., for $A \subseteq [m]$, $A^c = [m] \setminus A$. For a sequence of matrices $\{M_i\}$ and a set $A$ of indices, we will denote by $[M_i]_{i \in A}$ the matrix formed by stacking the matrices $M_i$ vertically. The main notation used throughout the paper is provided in Table 1.

We consider a distributed computing network where the dataset is stored across a set of worker nodes, which directly communicate with a single master node. In practice the master node can be implemented using a fully-connected set of nodes, but this can still be abstracted as a single master node.

It is useful to distinguish between two paradigms of distributed learning and optimization; namely, data parallelism, where the dataset is partitioned across data samples, and model parallelism, where it is partitioned across features (see Figures 2 and 4). We will describe these two models in detail next.

### 2.1 Data parallelism

We focus on objectives of the form

$$f(w) = \frac{1}{2n}\|Xw - y\|^2 + \lambda h(w), \tag{1}$$

where $X \in \mathbb{R}^{n \times p}$ and $y \in \mathbb{R}^{n}$ are the data matrix and data vector, respectively. We assume each row of $X$ corresponds to a data sample, and the data samples and response variables can be horizontally partitioned as $X = [X_1^\top\ X_2^\top\ \cdots\ X_m^\top]^\top$ and $y = [y_1^\top\ y_2^\top\ \cdots\ y_m^\top]^\top$. In the uncoded setting, machine $i$ stores the row-block $X_i$ (Figure 2). We denote the largest and smallest eigenvalues of $X^\top X$ with $M$ and $\mu$, respectively. We assume $\lambda \ge 0$, and $h(w)$ is a convex, extended real-valued function of $w$ that does not depend on data. Since $h(w)$ can take the value $+\infty$, this model covers arbitrary convex constraints on the optimization.

The encoding consists of solving the proxy problem

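$$\tilde f(w) = \frac{1}{2n}\|S(Xw - y)\|^2 + \lambda h(w) = \frac{1}{2n}\sum_{i=1}^{m} \underbrace{\|S_i(Xw - y)\|^2}_{f_i(w)} + \lambda h(w), \tag{2}$$

instead, where $S \in \mathbb{R}^{\beta n \times n}$ is a designed encoding matrix with redundancy factor $\beta \ge 1$, partitioned as $S = [S_1^\top\ S_2^\top\ \cdots\ S_m^\top]^\top$ across $m$ machines. Based on this partition, worker node $i$ stores $(S_i X, S_i y)$, and operates to solve the problem (2) in place of (1) (Figure 2). We will denote $\hat w \in \arg\min_w \tilde f(w)$, and $w^* \in \arg\min_w f(w)$.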
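As a rough illustration of the encoding step, the following sketch forms the encoded blocks $(S_i X, S_i y)$ with NumPy and checks that the per-worker terms $f_i(w)$ add up to $\|S(Xw-y)\|^2$. The function names, the problem sizes, and the scaled Gaussian choice of $S$ are assumptions made only for this example and are not taken from the implementation described in the paper.

```python
import numpy as np

def encode_and_partition(X, y, S, m):
    """Split the rows of S into m blocks and return [(S_i X, S_i y)], one per worker."""
    row_blocks = np.array_split(np.arange(S.shape[0]), m)
    return [(S[rows] @ X, S[rows] @ y) for rows in row_blocks]

def worker_loss(SX_i, Sy_i, w):
    """f_i(w) = ||S_i (X w - y)||^2, computed from the encoded block only."""
    r = SX_i @ w - Sy_i
    return float(r @ r)

# Hypothetical sizes: n samples, p features, m workers, redundancy factor beta.
rng = np.random.default_rng(0)
n, p, m, beta = 12, 3, 4, 2
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)
S = rng.standard_normal((beta * n, n)) / np.sqrt(beta * n)  # assumed encoding matrix

parts = encode_and_partition(X, y, S, m)
w = rng.standard_normal(p)
total = sum(worker_loss(SX_i, Sy_i, w) for SX_i, Sy_i in parts)
full = float(np.sum((S @ (X @ w - y)) ** 2))
print(np.isclose(total, full))
```

Each worker only ever touches its own pair $(S_i X, S_i y)$, which is why no decoding step appears anywhere in the sketch.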

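In general, the regularizer $h$ can be non-smooth. We will say that $h$ is $L$-smooth if $\nabla h(w)$ exists everywhere and satisfies

$$h(w') \le h(w) + \langle \nabla h(w), w' - w \rangle + \frac{L}{2}\|w' - w\|^2$$

for some $L > 0$, for all $w, w'$. The objective $f$ is $\nu$-strongly convex if, for all $x, y$,

$$f(y) \ge f(x) + \langle \nabla f(x), y - x \rangle + \frac{\nu}{2}\|x - y\|^2.$$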

Once the encoding is done and the appropriate data is stored in the nodes, the optimization process works in iterations. At iteration $t$, the master node broadcasts the current iterate $w_t$ to the worker nodes, waits for $k$ gradient updates $\nabla f_i(w_t)$ corresponding to that iteration to arrive, and then chooses a step direction $d_t$ and a step size $\alpha_t$ (based on the algorithm that maps the set of gradient updates to a step) to update the parameters. We will denote $\eta = k/m$. We will also drop the time dependence of $k$ and $\eta$ whenever it is kept constant.

The set of the $k$ fastest nodes to send gradients for iteration $t$ will be denoted as $A_t$. Once $k$ updates have been collected, the remaining nodes, denoted $A_t^c$, are interrupted by the master node. (If the communication is already in progress at the time when faster gradient updates arrive, the communication can be finished without interruption, and the late update can be dropped upon arrival. Otherwise, such interruption can be implemented by having the master node send an interrupt signal, and having one thread at each worker node keep listening for such a signal.) Algorithms 1 and 2 describe the generic mechanism of the proposed distributed optimization scheme at the master node and a generic worker node, respectively.

The intuition behind the encoding idea is that waiting for only $k$ workers prevents the stragglers from holding up the computation, while the redundancy provided by using a tall matrix $S$ compensates for the information lost by proceeding without the updates from stragglers (the nodes in the subset $A_t^c$).

We next describe the three specific algorithms that we consider under data parallelism, to compute $d_t$.

##### Gradient descent.

In this case, we assume that $h$ is $L$-smooth. Then we simply set the descent direction

$$d_t = -\left(\frac{1}{2n\eta}\sum_{i \in A_t} \nabla f_i(w_t) + \lambda \nabla h(w_t)\right).$$

We keep $k$ constant, chosen based on the number of stragglers in the network, or based on the desired operating regime.
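The update above can be simulated on a single machine. The sketch below is only an illustration under stated assumptions: stragglers are modeled by drawing a random subset $A_t$ of size $k$ at every iteration, the regularizer is taken to be $h(w) = \frac{1}{2}\|w\|^2$, and $S$ is chosen with orthonormal columns ($S^\top S = I_n$) so that $k = m$ reproduces ordinary gradient descent on the original problem. None of these choices, nor the function names, come from the paper's implementation.

```python
import numpy as np

def encoded_gd(parts, n, k, lam, alpha, iters, rng):
    """Gradient descent on encoded blocks, using only k of m updates per step.

    parts: list of (S_i X, S_i y); the regularizer is h(w) = 0.5 * ||w||^2.
    """
    m = len(parts)
    eta = k / m
    w = np.zeros(parts[0][0].shape[1])
    for _ in range(iters):
        # Stand-in for "the k fastest workers": a random subset A_t.
        A_t = rng.choice(m, size=k, replace=False)
        # grad f_i(w) = 2 (S_i X)^T (S_i X w - S_i y)
        grad_sum = sum(2.0 * parts[i][0].T @ (parts[i][0] @ w - parts[i][1]) for i in A_t)
        d_t = -(grad_sum / (2.0 * n * eta) + lam * w)
        w = w + alpha * d_t
    return w

def objective(X, y, w, lam):
    """Original objective f(w) with h(w) = 0.5 * ||w||^2."""
    return np.sum((X @ w - y) ** 2) / (2 * X.shape[0]) + lam * 0.5 * (w @ w)

rng = np.random.default_rng(1)
n, p, m, beta, lam = 40, 5, 8, 2, 0.1
X = rng.standard_normal((n, p))
y = X @ rng.standard_normal(p) + 0.01 * rng.standard_normal(n)
# Assumed encoding: orthonormal columns, so S^T S = I_n.
S, _ = np.linalg.qr(rng.standard_normal((beta * n, n)))
blocks = np.array_split(np.arange(beta * n), m)
parts = [(S[b] @ X, S[b] @ y) for b in blocks]

w_star = np.linalg.solve(X.T @ X / n + lam * np.eye(p), X.T @ y / n)
w_all = encoded_gd(parts, n, k=m, lam=lam, alpha=0.3, iters=500, rng=rng)
w_k = encoded_gd(parts, n, k=6, lam=lam, alpha=0.3, iters=500, rng=rng)
print(np.linalg.norm(w_all - w_star), np.linalg.norm(w_k - w_star))
```

With `k=m` the iterates converge to the exact ridge solution; with `k=6` one would expect them to settle near it, with the gap governed by how well the submatrices $S_A$ are conditioned.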

##### Limited-memory-BFGS.

We assume that , and assume . Although L-BFGS is traditionally a batch method, requiring updates from all nodes, its stochastic variants have also been proposed by Mokhtari and Ribeiro (2015); Berahas et al. (2016). The key modification to ensure convergence in this case is that the Hessian estimate must be computed via gradient components that are common in two consecutive iterations, i.e., from the nodes in $A_t \cap A_{t-1}$. We adapt this technique to our scenario. For $t \ge 1$, define $u_t := w_t - w_{t-1}$, and

$$r_t := \frac{m}{2n|A_t \cap A_{t-1}|} \sum_{i \in A_t \cap A_{t-1}} \left(\nabla f_i(w_t) - \nabla f_i(w_{t-1})\right).$$

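Then once the gradient terms $\{\nabla f_i(w_t)\}_{i \in A_t}$ are collected, the descent direction is computed by $d_t = -B_t \tilde g_t$, where $\tilde g_t$ is the gradient estimate at $w_t$ formed from the collected updates, and $B_t$ is the inverse Hessian estimate for iteration $t$, which is computed by

$$B_t^{(\ell+1)} = V_{j_{\ell,t}}^\top B_t^{(\ell)} V_{j_{\ell,t}} + \rho_{j_{\ell,t}} u_{j_{\ell,t}} u_{j_{\ell,t}}^\top, \qquad \rho_j = \frac{1}{r_j^\top u_j}, \qquad V_j = I - \rho_j r_j u_j^\top$$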

with , , and with , where is the L-BFGS memory length. Once the descent direction $d_t$ is computed, the step size is determined through exact line search. (Note that exact line search is not more expensive than backtracking line search for a quadratic loss, since it only requires a single matrix-vector multiplication.) To do this, each worker node computes $S_i X d_t$, and sends it to the master node. Once again, the master node only waits for the fastest $k$ nodes, denoted by $D_t$ (where in general $D_t \ne A_t$), to compute the step size that minimizes the function along $d_t$, given by

$$\alpha_t = -\rho\, \frac{d_t^\top \tilde g_t}{d_t^\top \tilde X_D^\top \tilde X_D d_t}, \tag{3}$$

where $\tilde X_D = [S_i X]_{i \in D_t}$, and $\rho$ is a back-off factor of choice.

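##### Proximal gradient.

Here, we consider the general case of non-smooth $h$. The descent direction $d_t$ is given by

$$d_t = \arg\min_{w} \tilde F_t(w) - w_t,$$

where

$$\tilde F_t(w) := \frac{1}{2\eta n}\sum_{i \in A_t} f_i(w_t) + \left\langle \frac{1}{2\eta n}\sum_{i \in A_t} \nabla f_i(w_t),\, w - w_t \right\rangle + \lambda h(w) + \frac{1}{2\alpha}\|w - w_t\|^2.$$

We keep the step size $\alpha$ and $k$ constant.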
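For a concrete regularizer the minimization defining $d_t$ has a closed form. The sketch below assumes $h(w) = \|w\|_1$ (the LASSO case), for which the minimizer of $\tilde F_t$ is the soft-thresholding of a gradient step. The helper names and the small random blocks are hypothetical and serve only to exercise the formula.

```python
import numpy as np

def soft_threshold(z, tau):
    """Proximal operator of tau * ||.||_1."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def prox_step(parts, A_t, w_t, n, eta, lam, alpha):
    """Return d_t = argmin_w F_t(w) - w_t for h(w) = ||w||_1.

    parts: list of (S_i X, S_i y); A_t: indices of the workers that responded.
    """
    grad_sum = sum(2.0 * parts[i][0].T @ (parts[i][0] @ w_t - parts[i][1]) for i in A_t)
    g = grad_sum / (2.0 * eta * n)
    # Minimizer of <g, w - w_t> + lam * ||w||_1 + ||w - w_t||^2 / (2 * alpha).
    w_next = soft_threshold(w_t - alpha * g, alpha * lam)
    return w_next - w_t

# Hypothetical encoded blocks, used only to exercise the function.
rng = np.random.default_rng(2)
n, p, m, k = 8, 3, 4, 3
parts = [(rng.standard_normal((4, p)), rng.standard_normal(4)) for _ in range(m)]
w_t = rng.standard_normal(p)
A_t = [0, 2, 3]
d_t = prox_step(parts, A_t, w_t, n, eta=k / m, lam=0.5, alpha=0.1)
print(w_t + d_t)
```

Other regularizers would replace `soft_threshold` with their own proximal operator; the rest of the step is unchanged.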

### 2.2 Model parallelism

Under the model parallelism paradigm, we focus on objectives of the form

$$\min_w g(w) := \min_w \phi(Xw) = \min_w \phi\left(\sum_{i=1}^{m} X_i w_i\right), \tag{4}$$

where the data matrix is partitioned as $X = [X_1\ X_2\ \cdots\ X_m]$, the parameter vector is partitioned as $w = [w_1^\top\ w_2^\top\ \cdots\ w_m^\top]^\top$, $\phi$ is convex, and $g$ is $L$-smooth. Note that the data matrix is partitioned column-wise, meaning that the dataset is split across features, instead of data samples (see Figure 4). Common machine learning models, such as any regression problem with generalized linear models, support vector machines, and many other convex problems fit within this model.

We encode the problem (4) by setting $w = S^\top v$, and solving the problem

$$\min_v \tilde g(v) := \min_v \phi(XS^\top v) = \min_v \phi\left(\sum_{i=1}^{m} X S_i^\top v_i\right), \tag{5}$$

where $v = [v_1^\top\ v_2^\top\ \cdots\ v_m^\top]^\top$ and $S = [S_1^\top\ S_2^\top\ \cdots\ S_m^\top]^\top$ (see Figure 4). As a result, worker $i$ stores the column-block $X S_i^\top$, as well as the iterate partition $v_i$. Note that we increase the dimensions of the parameter vector by multiplying the dataset $X$ with a wide encoding matrix $S^\top$ from the right, and as a result we have redundant coordinates in the system. As in the case of data parallelism, such redundant coordinates provide robustness against erasures arising due to stragglers. Such an increase in coordinates means that the problem is simply lifted onto a higher-dimensional space, while preserving the original geometry of the problem. We will denote by $v_{i,t}$ the parameter iterates of worker $i$ at iteration $t$. In order to compute updates to its parameters $v_i$, worker $i$ needs the up-to-date value of the other workers' contribution $\sum_{j \ne i} X S_j^\top v_j$, which is provided by the master node at every iteration.

Let $\mathcal{W}^* = \arg\min_w g(w)$, and given $w$, let $w^*$ be the projection of $w$ onto $\mathcal{W}^*$. We will say that $g$ satisfies $\nu$-restricted-strong convexity (Lai and Yin (2013)) if

$$\langle \nabla g(w), w - w^* \rangle \ge \nu \|w - w^*\|^2$$

for all $w$. Note that this is weaker than (implied by) strong convexity since $w^*$ is restricted to be the projection of $w$, but unlike strong convexity, it is satisfied under the case where $\phi$ is strongly convex, but $X$ has a non-trivial null space, e.g., when it has more columns than rows.

For a given $w_0$, we define the level set of $g$ at $w_0$ as $D_g(w_0) := \{w : g(w) \le g(w_0)\}$. We will say that the level set at $w_0$ has diameter $R$ if

$$\sup\left\{\|w - w'\| : w, w' \in D_g(w_0)\right\} \le R.$$

As in the case of data parallelism, we assume that the master node waits for $k$ updates at every iteration, and then moves on to the next iteration (see Algorithms 3 and 4). We similarly define $A_t$ as the set of the $k$ fastest nodes in iteration $t$, and also define

$$I_{i,t} = \begin{cases} 1, & i \in A_t \\ 0, & i \notin A_t. \end{cases}$$

Under model parallelism, we consider block coordinate descent, described in Algorithm 3, where worker $i$ stores the current values of the partition $v_i$, and performs updates on it, given the latest values of the rest of the parameters. The parameter estimate at time $t$ is denoted by $v_t$, and we also define $w_t = S^\top v_t$. The iterates are updated by

$$v_{i,t} - v_{i,t-1} = \Delta_{i,t} := \begin{cases} -\alpha \nabla_i \tilde g(v_{t-1}), & \text{if } i \in A_t \\ 0, & \text{otherwise,} \end{cases}$$

for a step size parameter $\alpha > 0$, where $\nabla_i$ refers to the gradient only with respect to the variables $v_i$, i.e., $\nabla \tilde g = [\nabla_i \tilde g]_{i \in [m]}$. Note that if $i \notin A_t$ then $v_i$ does not get updated in worker $i$, which ensures the consistency of parameter values across machines. This is achieved by lines 4-8 in Algorithm 3. Worker $i$ learns about this in the next iteration, when $I_{i,t-1}$ is sent by the master node.
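A single-machine sketch of this update rule is given below, under several assumptions that are not from the paper: the loss is least squares, $\phi(z) = \frac{1}{2}\|z - y\|^2$, the set $A_t$ is drawn at random to mimic stragglers, and $S$ has orthonormal columns so that $w = S^\top v$ ranges over the whole original parameter space. All names are hypothetical.

```python
import numpy as np

def encoded_bcd(X, y, S, m, k, alpha, iters, rng):
    """Encoded block coordinate descent for phi(z) = 0.5 * ||z - y||^2.

    S has shape (beta * p, p); worker i holds the column block X S_i^T and v_i.
    """
    blocks = np.array_split(np.arange(S.shape[0]), m)
    XS = [X @ S[b].T for b in blocks]            # column blocks X S_i^T
    v = [np.zeros(len(b)) for b in blocks]       # iterate partitions v_i
    for _ in range(iters):
        z = sum(XS[i] @ v[i] for i in range(m))  # X S^T v_{t-1}, assembled at the master
        A_t = set(rng.choice(m, size=k, replace=False).tolist())
        for i in range(m):
            if i in A_t:                         # stragglers keep v_i unchanged
                v[i] = v[i] - alpha * XS[i].T @ (z - y)
    return S.T @ np.concatenate(v)               # map back: w = S^T v

rng = np.random.default_rng(3)
n, p, m, beta, k = 30, 6, 6, 2, 5
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)
S, _ = np.linalg.qr(rng.standard_normal((beta * p, p)))  # assumed: S^T S = I_p
alpha = 1.0 / np.linalg.eigvalsh(X.T @ X).max()
w_hat = encoded_bcd(X, y, S, m, k, alpha, iters=5000, rng=rng)
w_ls = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.linalg.norm(w_hat - w_ls))
```

All blocks in $A_t$ use the same $v_{t-1}$ within an iteration, matching the definition of $\Delta_{i,t}$; the recovered $w$ is compared against the ordinary least-squares solution.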

## 3 Main Theoretical Results: Convergence Analysis

In this section, we prove convergence results for the algorithms described in Section 2. Note that since we modify the original optimization problem and solve it obliviously to this change, it is not obvious that the solution has any optimality guarantees with respect to the original problem. We show that, it is indeed possible to provide convergence guarantees in terms of the original objective under the encoded setup.

### 3.1 A spectral condition

In order to show convergence under the proposed framework, we require the encoding matrix $S$ to satisfy a certain spectral criterion. Let $S_A$ denote the submatrix of $S$ associated with the subset of machines $A$, i.e., $S_A = [S_i]_{i \in A}$. Then the criterion in essence requires that for any sufficiently large subset $A$, $S_A$ behaves approximately like a matrix with orthogonal columns. We make this precise in the following statement.

Let , and be given. A matrix is said to satisfy the -block-restricted isometry property (-BRIP) if for any with ,

$$(1 - \epsilon) I_n \preceq \frac{1}{\eta} S_A^\top S_A \preceq (1 + \epsilon) I_n. \tag{6}$$

Note that this is similar to the restricted isometry property used in compressed sensing (Candes and Tao (2005)), except that we do not require (6) to hold for every submatrix of $S$ of size $\eta \beta n \times n$. Instead, (6) needs to hold only for the submatrices of the form $S_A = [S_i]_{i \in A}$, which is a less restrictive condition. In general, it is known to be difficult to analytically prove that a structured, deterministic matrix satisfies the general RIP condition. Such difficulty extends to the BRIP condition as well. However, it is known that i.i.d. sub-Gaussian ensembles and randomized Fourier ensembles satisfy this property (Candes and Tao (2006)). In addition, numerical evidence suggests that there are several families of constructions for $S$ whose submatrices have eigenvalues that mostly tend to concentrate around 1. We point out that although the strict BRIP condition is required for the theoretical analysis, in practice the algorithms perform well as long as the bulk of the eigenvalues of $\frac{1}{\eta} S_A^\top S_A$ lie within a small interval $[1 - \epsilon, 1 + \epsilon]$, even though the extreme eigenvalues may lie outside of it (in the non-adversarial setting). In Section 4, we explore several classes of matrices and discuss their relation to this condition.
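For small $m$, the smallest $\epsilon$ for which (6) holds can be computed by brute force. The sketch below enumerates every subset of $k$ blocks, which is only feasible for toy sizes; the function name and the two test matrices (a replication-style code and a random matrix with orthonormal columns) are illustrative assumptions.

```python
import itertools
import numpy as np

def brip_epsilon(S, m, k):
    """Smallest eps such that (6) holds for every block subset A with |A| = k."""
    blocks = np.array_split(np.arange(S.shape[0]), m)
    eta = k / m
    eps = 0.0
    for A in itertools.combinations(range(m), k):
        rows = np.concatenate([blocks[i] for i in A])
        eig = np.linalg.eigvalsh(S[rows].T @ S[rows] / eta)  # ascending order
        eps = max(eps, abs(eig[0] - 1.0), abs(eig[-1] - 1.0))
    return eps

n, m = 4, 5
# Replication-style code: every worker holds a scaled copy of I_n (beta = m).
S_rep = np.vstack([np.eye(n)] * m) / np.sqrt(m)
# Random code with orthonormal columns and beta = 2.5 (two rows per worker).
rng = np.random.default_rng(4)
S_rand, _ = np.linalg.qr(rng.standard_normal((10, n)))

print(brip_epsilon(S_rep, m, 3), brip_epsilon(S_rand, m, 3), brip_epsilon(S_rand, m, m))
```

The replication-style matrix gives $\epsilon = 0$ for every $k$, at the cost of redundancy $\beta = m$; for the lower-redundancy random matrix, $\epsilon$ vanishes only at $k = m$.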

### 3.2 Convergence of encoded gradient descent

We first consider the algorithms described under data parallelism architecture. The following theorem summarizes our results on the convergence of gradient descent for the encoded problem. Let be computed using encoded gradient descent with an encoding matrix that satisfies -BRIP, with step size for some , for all . Let be an arbitrary sequence of subsets of with cardinality for all . Then, for as given in (1),

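1. $$\frac{1}{t}\sum_{\tau=1}^{t} f(w_\tau) - \kappa_1 f(w^*) \le \frac{4\epsilon f(w_0) + \frac{1}{2\alpha}\|w_0 - w^*\|^2}{(1 - 7\epsilon)\,t}$$
2. If $f$ is in addition $\nu$-strongly convex, then

   $$f(w_t) - \frac{\kappa_2^2(\kappa_2 - \gamma)}{1 - \kappa_2 \gamma} f(w^*) \le (\kappa_2 \gamma)^t f(w_0), \qquad t = 1, 2, \ldots,$$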

where , , and , where is assumed to be small enough so that . The proof is provided in Appendix A, which relies on the fact that the solution to the effective “instantaneous” problem corresponding to the subset lies in a bounded set (where depends on the encoding matrix and strong convexity assumption on ), and therefore each gradient descent step attracts the iterate towards a point in this set, which must eventually converge to this set. Theorem 3.2 shows that encoded gradient descent can achieve the standard convergence rate for the general case, and linear convergence rate for the strongly convex case, up to an approximate minimum. For the convex case, the convergence is shown on the running mean of past function values, whereas for the strongly convex case we can bound the function value at every step. Note that although the nodes actually minimize the encoded objective , the convergence guarantees are given in terms of the original objective .

Theorem 3.2 provides deterministic, sample path convergence guarantees under any (adversarial) sequence of active sets , which is in contrast to the stochastic methods, which show convergence typically in expectation. Further, the convergence rate is not affected by the tail behavior of the delay distribution, since the delayed updates of stragglers are not applied to the iterates.

Note that since we do not seek exact solutions under data parallelism, we can keep the redundancy factor fixed regardless of the number of stragglers. Increasing number of stragglers in the network simply results in a looser approximation of the solution, allowing for a graceful degradation. This is in contrast to existing work Tandon et al. (2017) seeking exact convergence under coding, which shows that the redundancy factor must grow linearly with the number of stragglers.

### 3.3 Convergence of encoded L-BFGS

We consider the variant of L-BFGS described in Section 2. For our convergence result for L-BFGS, we need another assumption on the matrix , in addition to (6). Defining for , we assume that for some ,

$$\delta I \preceq \breve S_t^\top \breve S_t \tag{7}$$

for all . Note that this requires that one should wait for sufficiently many nodes to send updates so that the overlap set has more than nodes, and thus the matrix can be full rank. When the columns of are linearly independent, this is satisfied if in the worst-case, and in the case where node delays are i.i.d. across machines, it is satisfied in expectation if . One can also choose adaptively so that . We note that although this condition is required for the theoretical analysis, the algorithm may perform well in practice even when this condition is not satisfied.

We first show that this algorithm results in stable inverse Hessian estimates under the proposed model, under arbitrary realizations of (of sufficiently large cardinality), which is done in the following lemma. Let . Then there exist constants such that for all , the inverse Hessian estimate satisfies . The proof, provided in Appendix A, is based on the well-known trace-determinant method. Using Lemma 3.3, we can show the following convergence result. Let , and let be computed using the L-BFGS method described in Section 2, with an encoding matrix that satisfies -BRIP. Let be arbitrary sequences of subsets of with cardinality for all . Then, for as described in Section 2,

$$f(w_t) - \frac{\kappa^2(\kappa - \gamma)}{1 - \kappa\gamma} f(w^*) \le (\kappa\gamma)^t f(w_0),$$

where , and , where and are the constants in Lemma 3.3. Similar to Theorem 3.2, the proof is based on the observation that the solution of the effective problem at time lies in a bounded set around the true solution . As in gradient descent, coding enables linear convergence deterministically, unlike the stochastic and multi-batch variants of L-BFGS, e.g., Mokhtari and Ribeiro (2015); Berahas et al. (2016).

### 3.4 Convergence of encoded proximal gradient

Next we consider the encoded proximal gradient algorithm, described in Section 2, for objectives with potentially non-smooth regularizers . The following theorem characterizes our convergence results under this setup. Let be computed using encoded proximal gradient with an encoding matrix that satisfies -BRIP, with step size , and where . Let be an arbitrary sequence of subsets of with cardinality for all . Then, for as described in Section 2,

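1. For all $t$,

   $$\frac{1}{t}\sum_{\tau=1}^{t} f(w_\tau) - \kappa f(w^*) \le \frac{4\epsilon f(w_0) + \frac{1}{2\alpha}\|w_0 - w^*\|^2}{(1 - 7\epsilon)\,t},$$
2. For all $t$,

   $$f(w_{t+1}) \le \kappa f(w_t),$$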

where .

As in the previous algorithms, the convergence guarantees hold for arbitrary sequences of active sets . Note that as in the gradient descent case, the convergence is shown on the mean of past function values. Since this does not prevent the iterates from having a sudden jump at a given iterate, we include the second part of the theorem to complement the main convergence result, which implies that the function value cannot increase by more than a small factor of its current value.

### 3.5 Convergence of encoded block coordinate descent

Finally, we consider the convergence of encoded block coordinate descent algorithm. The following theorem characterizes our main convergence result for this case. Let , where is computed using encoded block coordinate descent as described in Section 2. Let satisfy -BRIP, and the step size satisfy . Let be an arbitrary sequence of subsets of with cardinality for all . Let the level set of at the first iterate have diameter . Then, for as described in Section 2, the following hold.

1. If is convex, then

$$g(w_t) - g(w^*) \le \frac{1}{\frac{1}{\pi_0} + Ct},$$

where , and .

2. If is -restricted-strongly convex, then

$$g(w_t) - g(w^*) \le \left(1 - \frac{1}{\xi}\right)^t \left(g(w_0) - g(w^*)\right),$$

where .

Theorem 3.5 demonstrates that the standard rate for the general convex, and linear rate for the strongly convex case can be obtained under the encoded setup. Note that unlike the data parallelism setup, we can achieve exact minimum under model parallelism, since the underlying geometry of the problem does not change under encoding; the same objective is simply mapped onto a higher-dimensional space, which has redundant coordinates. Similar to the previous cases, encoding allows for deterministic convergence guarantees under adversarial failure patterns. This comes at the expense of a small penalty in the convergence rate though; one can observe that a non-zero slightly weakens the constants in the convergence expressions. Still, note that this penalty in convergence rate only depends on the encoding matrix and not on the delay profile in the system. This is in contrast to the asynchronous coordinate descent methods; for instance, in Liu et al. (2015), the step size is required to shrink exponentially in the maximum allowable delay, and thus the guaranteed convergence rate can exponentially degrade with increasing worst-case delay in the system. The same is true for the linear convergence guarantee in Peng et al. (2016).

## 4 Code Design

### 4.1 Block RIP condition and code design

We first discuss two classes of encoding matrices with regard to the BRIP condition; namely equiangular tight frames, and random matrices.

##### Tight frames.

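A unit-norm frame for $\mathbb{R}^n$ is a set of vectors $F = \{a_i\}_{i=1}^{n\beta}$ with $\|a_i\| = 1$, where $\beta \ge 1$, such that there exist constants $\xi_2 \ge \xi_1 > 0$ such that, for any $u \in \mathbb{R}^n$,

$$\xi_1 \|u\|^2 \le \sum_{i=1}^{n\beta} |\langle u, a_i \rangle|^2 \le \xi_2 \|u\|^2.$$

The frame is tight if the above is satisfied with $\xi_1 = \xi_2$. In this case, it can be shown that the constants are equal to the redundancy factor of the frame, i.e., $\xi_1 = \xi_2 = \beta$. If we form $S \in \mathbb{R}^{\beta n \times n}$ by rows that form a tight frame, then we have $S^\top S = \beta I$, which ensures $\|S(Xw - y)\|^2 = \beta \|Xw - y\|^2$. Then for any solution $\hat w$ to the encoded problem (with $k = m$),

$$\nabla \tilde f(\hat w) = X^\top S^\top S (X \hat w - y) = \beta X^\top (X \hat w - y) = \beta \nabla f(\hat w).$$

Therefore, the solution to the encoded problem satisfies the optimality condition for the original problem as well:

$$-\nabla \tilde f(\hat w) \in \partial h(\hat w) \iff -\nabla f(\hat w) \in \partial h(\hat w),$$

and if $f$ is also strongly convex, then $\hat w = w^*$ is the unique solution. This means that for $k = m$, obliviously solving the encoded problem results in the same objective value as in the original problem.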
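The claim that a tight frame leaves the optimum unchanged when all nodes are waited for can be checked numerically in the unregularized case ($\lambda = 0$). In the sketch below, $S$ is taken to be $\sqrt{\beta}$ times a matrix with orthonormal columns, which satisfies $S^\top S = \beta I$ but whose rows are not normalized to unit norm; this construction and the problem sizes are assumptions made only for the example.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p, beta = 20, 4, 2
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

# Assumed tight frame: sqrt(beta) times a matrix with orthonormal columns,
# so that S^T S = beta * I_n.
Q, _ = np.linalg.qr(rng.standard_normal((beta * n, n)))
S = np.sqrt(beta) * Q

w_orig = np.linalg.lstsq(X, y, rcond=None)[0]         # minimizer of ||Xw - y||^2
w_enc = np.linalg.lstsq(S @ X, S @ y, rcond=None)[0]  # minimizer of ||S(Xw - y)||^2
print(np.linalg.norm(w_orig - w_enc))
```

The two least-squares solutions agree up to numerical precision, because the encoded loss is exactly $\beta$ times the original one.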

Define the maximal inner product of a unit-norm tight frame $F = \{a_i\}_{i=1}^{n\beta}$, where $a_i \in \mathbb{R}^n$, by

$$\omega(F) := \max_{\substack{a_i, a_j \in F \\ i \ne j}} \left|\langle a_i, a_j \rangle\right|.$$

A tight frame is called an equiangular tight frame (ETF) if $|\langle a_i, a_j \rangle| = \omega(F)$ for every $i \ne j$.

[Welch (1974)] Let $F = \{a_i\}_{i=1}^{n\beta}$ be a tight frame. Then $\omega(F) \ge \sqrt{\frac{\beta - 1}{n\beta - 1}}$. Moreover, equality is satisfied if and only if $F$ is an equiangular tight frame.

Therefore, an ETF minimizes the correlation between its individual elements, making each submatrix as close to orthogonal as possible. This, combined with the property that tight frames preserve the optimality condition when all nodes are waited for ($k = m$), makes ETFs good candidates for encoding, in light of the required property (6). We specifically evaluate the Paley ETF from Paley (1933) and Goethals and Seidel (1967); Hadamard ETF from Szöllősi (2013) (not to be confused with Hadamard matrix); and Steiner ETF from Fickus et al. (2012) in our experiments.

Although the derivation of tight eigenvalue bounds for subsampled ETFs is a long-standing problem, numerical evidence (see Figures 6, 6) suggests that they tend to have their eigenvalues more tightly concentrated around 1 than random matrices (also supported by the fact that they satisfy Welch bound, Proposition 4.1 with equality).

Note that our theoretical results focus on the extreme eigenvalues due to a worst-case analysis; in practice, most of the energy of the gradient lies on the eigen-space associated with the bulk of the eigenvalues, which the following proposition shows can be identically 1. If the rows of $S$ are chosen to form an ETF with redundancy $\beta$, then for $\eta \ge 1 - \frac{1}{\beta}$, $\frac{1}{\beta} S_A^\top S_A$ has at least $n(1 - \beta(1 - \eta))$ eigenvalues equal to 1. This follows immediately from the Cauchy interlacing theorem, using the fact that $S_A S_A^\top$ and $S_A^\top S_A$ have the same spectra except zeros. Therefore for sufficiently large $\eta$, ETFs have a mostly flat spectrum even for low redundancy, and thus in practice one would expect ETFs to perform well even for small amounts of redundancy. This is also confirmed by Figure 6, as well as our numerical results.
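A very small ETF makes these properties easy to inspect by hand. The sketch below uses three unit vectors in $\mathbb{R}^2$ spaced 120 degrees apart (often called the Mercedes-Benz frame), so that $n = 2$ and $\beta = 1.5$. This particular frame is chosen here only for illustration and is not one of the constructions evaluated in the paper. The code compares the maximal inner product with the Welch bound $\sqrt{(\beta-1)/(n\beta-1)}$ and computes the spectrum of $\frac{1}{\beta} S_A^\top S_A$ after dropping one row.

```python
import numpy as np

angles = 2 * np.pi * np.arange(3) / 3
S = np.column_stack([np.cos(angles), np.sin(angles)])  # rows are the frame vectors
n, beta = 2, 1.5

gram = S @ S.T
coherence = np.max(np.abs(gram - np.diag(np.diag(gram))))  # max |<a_i, a_j>|, i != j
welch = np.sqrt((beta - 1) / (n * beta - 1))

S_A = S[[0, 1]]                                  # one of three rows is a straggler
eig = np.linalg.eigvalsh(S_A.T @ S_A / beta)     # ascending order
print(coherence, welch, eig)
```

Here $\eta = 2/3$, so $n(1 - \beta(1 - \eta)) = 1$, and exactly one of the two eigenvalues of $\frac{1}{\beta} S_A^\top S_A$ equals 1.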

##### Random matrices.

Another natural choice of encoding could be to use i.i.d. random matrices. Although encoding with such random matrices can be computationally expensive and may not have the desirable properties of encoding with tight frames, their eigenvalue behavior can be characterized analytically. In particular, using the existing results on the eigenvalue scaling of large i.i.d. Gaussian matrices from Geman (1980); Silverstein (1985) and a union bound, it can be shown that

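$$P\left( \max_{A : |A| = k} \lambda_{\max}\left( \frac{1}{\beta \eta n} S_A^\top S_A \right) > \left( 1 + \sqrt{\frac{1}{\beta \eta}} \right)^2 \right) \to 0 \tag{8}$$

$$P\left( \min_{A : |A| = k} \lambda_{\min}\left( \frac{1}{\beta \eta n} S_A^\top S_A \right) < \left( 1 - \sqrt{\frac{1}{\beta \eta}} \right)^2 \right) \to 0, \tag{9}$$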

as , if the elements of are drawn i.i.d. from . Hence, for sufficiently large redundancy and problem dimension, i.i.d. random matrices are good candidates for encoding as well. However, for finite , even if , in general the optimum of the original problem is not recovered exactly, for such matrices.
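A quick numerical look at the limits in (8) and (9) is sketched below. It rests on several assumptions that are not taken from the text: the entries are standard normal, subsets are drawn over individual rows rather than over machines, only a handful of random subsets are examined instead of all of them, and the sizes are illustrative. At finite dimension the observed extremes only approximate the asymptotic edges.

```python
import numpy as np

rng = np.random.default_rng(0)
n, beta, eta = 200, 2, 0.75          # illustrative sizes; eta is the kept fraction
num_rows = beta * n
kept_rows = int(eta * num_rows)

# Encoding matrix with i.i.d. standard normal entries (assumed distribution).
S = rng.standard_normal((num_rows, n))

# Asymptotic spectral edges appearing in (8) and (9).
upper = (1 + np.sqrt(1 / (beta * eta))) ** 2
lower = (1 - np.sqrt(1 / (beta * eta))) ** 2

lam_max, lam_min = -np.inf, np.inf
for _ in range(20):                   # a few random subsets, not an exhaustive search
    A = rng.choice(num_rows, size=kept_rows, replace=False)
    eig = np.linalg.eigvalsh(S[A].T @ S[A] / (beta * eta * n))
    lam_max = max(lam_max, eig[-1])
    lam_min = min(lam_min, eig[0])

print("observed extremes:", lam_min, lam_max)
print("asymptotic edges: ", lower, upper)
```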

### 4.2 Efficient encoding

In this section we discuss some practical approaches to encoding. The practical issues involved include the computational complexity of encoding, as well as the loss of sparsity in the data due to the multiplication with , and the resulting increase in time and space complexity. We address these issues in this section.

#### 4.2.1 Efficient distributed encoding with sparse matrices

Let the dataset lie in a database, accessible to each worker node, where each node is responsible for computing their own encoded partitions and . We assume that has a sparse structure. Given , define as the set of indices of the non-zero elements of the th row of . For a set of rows, we define .

Let us partition the set of rows of , , into machines, and denote the partition of machine as , i.e., , where denotes disjoint union. Then the set of non-zero columns of is given by . Note that in order to compute , machine only requires the rows of in the set . In what follows, we will denote this submatrix of by , i.e., if is the th row of , . Similarly , where is the th element of .

Consider the specific computation that needs to be done by worker during the iterations, for each algorithm. Under the data parallelism setting, worker computes the following gradient:

$$\nabla f_k(w) = X^\top S_k^\top S_k (Xw - y) \overset{(a)}{=} \tilde{X}_k^\top S_k^\top S_k (\tilde{X}_k w - \tilde{y}_k) \tag{10}$$

where (a) follows since the rows of that are not in are multiplied by zero. Note that the last expression can be computed without any matrix-matrix multiplication. This gives a natural storage and computation scheme for the workers. Instead of computing offline and storing it, which can result in a loss of sparsity in the data, worker can store in uncoded form, and compute the gradient through (10) whenever needed, using only matrix-vector multiplications. Since is sparse, the overhead associated with multiplications of the form and is small.
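The storage scheme behind (10) can be sketched in a few lines. This is an illustrative example under assumptions: dense NumPy arrays stand in for sparse storage, the sizes and the support of the worker's encoding block are arbitrary, and the variable names (`S_k`, `B_k`, `X_k`, `y_k`) are chosen to mirror the symbols in (10). The worker keeps only the uncoded rows it needs and applies the encoding on the fly through matrix-vector products.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, rows_k = 12, 4, 5               # illustrative sizes

X = rng.standard_normal((n, p))       # uncoded data
y = rng.standard_normal(n)
w = rng.standard_normal(p)            # current iterate

# Worker k's block of the encoding matrix, non-zero only in a few columns.
S_k = np.zeros((rows_k, n))
support = np.array([1, 4, 7])
S_k[:, support] = rng.standard_normal((rows_k, support.size))

# Reference: gradient computed against the full uncoded dataset.
grad_full = X.T @ (S_k.T @ (S_k @ (X @ w - y)))

# Worker-side computation: store only the rows of X and y indexed by the
# non-zero columns of S_k, and evaluate right to left with matrix-vector products.
B_k = np.flatnonzero(np.any(S_k != 0, axis=0))
X_k, y_k, S_small = X[B_k], y[B_k], S_k[:, B_k]
residual = X_k @ w - y_k
grad_local = X_k.T @ (S_small.T @ (S_small @ residual))

print(np.allclose(grad_full, grad_local))
```

Evaluating the product from right to left keeps every intermediate result a vector, so the encoded matrix is never formed.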

Similarly, under model parallelism, the computation required by worker is

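$$\nabla_k \tilde{g}(v) = S_k X^\top \nabla_k \phi\left( X S_k^\top v_k + \tilde{z}_k \right) = S_k \tilde{X}_k^\top \nabla_k \phi\left( \tilde{X}_k S_k^\top v_k + \tilde{z}_k \right), \tag{11}$$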

and as in the data parallelism case, the worker can store uncoded, and compute (11) online through matrix-vector multiplications.

##### Example: Steiner ETF.

We illustrate the described technique through the Steiner ETF, based on the construction proposed in Fickus et al. (2012), using -Steiner systems. Let be a power of 2, let be a real Hadamard matrix, and let be the th column of , for . Consider the matrix , where each column is the incidence vector of a distinct two-element subset of . For instance, for ,

$$V = \begin{bmatrix} 1 & 1 & 1 & 0 & 0 & 0 \\ 1 & 0 & 0 & 1 & 1 & 0 \\ 0 & 1 & 0 & 1 & 0 & 1 \\ 0 & 0 & 1 & 0 & 1 & 1 \end{bmatrix}.$$

Note that each of the rows has exactly non-zero elements. We construct the Steiner ETF as a matrix by replacing each 1 in a row with a distinct column of , and normalizing by . For instance, for the above example, we have

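$$S = \frac{1}{\sqrt{3}} \begin{bmatrix} h_2 & h_3 & h_4 & 0 & 0 & 0 \\ h_2 & 0 & 0 & h_3 & h_4 & 0 \\ 0 & h_2 & 0 & h_3 & 0 & h_4 \\ 0 & 0 & h_2 & 0 & h_3 & h_4 \end{bmatrix}.$$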

We will call a set of rows of that arises from the same row of a block. In general, this procedure results in a matrix with redundancy factor . In full generality, Steiner ETFs can be constructed for larger redundancy levels; we refer the reader to Fickus et al. (2012) for a full discussion of these constructions.
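The construction can be reproduced for the four-row example as a sanity check. The sketch below is a minimal illustration, assuming a Sylvester-type Hadamard matrix (whose first column is all ones) and assuming that the 1s in each row of V are replaced, in order, by the second through last columns of the Hadamard matrix, as in the displayed example. The function names `sylvester_hadamard` and `steiner_etf` are hypothetical.

```python
import itertools
import numpy as np

def sylvester_hadamard(v):
    """Real Hadamard matrix of order v (v a power of 2); its first column is all ones."""
    H = np.array([[1.0]])
    while H.shape[0] < v:
        H = np.kron(H, np.array([[1.0, 1.0], [1.0, -1.0]]))
    return H

def steiner_etf(v):
    """Return the incidence matrix V and the Steiner ETF S built from it."""
    H = sylvester_hadamard(v)
    pairs = list(itertools.combinations(range(v), 2))
    # Each column of V is the incidence vector of one two-element subset.
    V = np.zeros((v, len(pairs)), dtype=int)
    for col, (a, b) in enumerate(pairs):
        V[a, col] = V[b, col] = 1
    # Replace the t-th 1 in each row of V by column t + 1 of H (skipping the all-ones column).
    S = np.zeros((v * v, len(pairs)))
    for r in range(v):
        for t, c in enumerate(np.flatnonzero(V[r])):
            S[r * v:(r + 1) * v, c] = H[:, t + 1]
    return V, S / np.sqrt(v - 1)

V, S = steiner_etf(4)
gram = S @ S.T
off_diag = np.abs(gram[~np.eye(S.shape[0], dtype=bool)])
redundancy = S.shape[0] / S.shape[1]

print(V)
print("shape:", S.shape, "redundancy:", redundancy)
print("tight:", np.allclose(S.T @ S, redundancy * np.eye(S.shape[1])))
print("equiangular:", np.allclose(off_diag, off_diag[0]))
```

For this example the rows are unit-norm, the columns are mutually orthogonal, and every pair of distinct rows has the same absolute inner product.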

We partition the rows of the matrix into machines, so that each machine gets assigned rows of , and thus the corresponding blocks of .

This construction and partitioning scheme is particularly attractive for our purposes for two reasons. First, it is easy to see that for any node , is upper bounded by , which means the memory overhead compared to the uncoded case is limited to a factor of . (In practice, we have observed that the convergence performance improves when the blocks are broken into multiple machines, so one can, for instance, assign half-blocks to each machine.) Second, each block of consists of (almost) a Hadamard matrix, so the multiplication can be efficiently implemented through the Fast Walsh-Hadamard Transform.
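For reference, a textbook iterative Fast Walsh-Hadamard Transform is sketched below. It is a generic illustration rather than the implementation used in the experiments, it assumes the input length is a power of 2, and it uses the natural (Sylvester) ordering of the Hadamard matrix. The function name `fwht` is hypothetical.

```python
import numpy as np

def fwht(x):
    """Unnormalized Walsh-Hadamard transform of a vector whose length is a power of 2."""
    x = np.asarray(x, dtype=float).copy()
    n = x.shape[0]
    h = 1
    while h < n:
        # Butterfly step: combine adjacent blocks of length h.
        for i in range(0, n, 2 * h):
            a = x[i:i + h].copy()
            b = x[i + h:i + 2 * h].copy()
            x[i:i + h] = a + b
            x[i + h:i + 2 * h] = a - b
        h *= 2
    return x

# Compare against an explicit Sylvester Hadamard matrix of order 8.
H8 = np.array([[1.0]])
for _ in range(3):
    H8 = np.kron(H8, np.array([[1.0, 1.0], [1.0, -1.0]]))

rng = np.random.default_rng(0)
vec = rng.standard_normal(8)
print(np.allclose(fwht(vec), H8 @ vec))
```

The transform uses on the order of n log n additions, compared with n squared multiply-adds for the explicit matrix-vector product.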

##### Example: Haar matrix.

Another possible choice of sparse matrix is column-subsampled Haar matrix, which is defined recursively by

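$$H_{2n} = \frac{1}{\sqrt{2}} \begin{bmatrix} H_n \otimes \begin{bmatrix} 1 & 1 \end{bmatrix} \\ I_n \otimes \begin{bmatrix} 1 & -1 \end{bmatrix} \end{bmatrix}, \qquad H_1 = 1,$$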

where denotes the Kronecker product. Given a redundancy level , one can obtain by randomly sampling columns of . It can be shown that in this case, we have , and hence encoding with the Haar matrix increases the memory cost by a logarithmic factor.
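The recursion above translates directly into code. The following sketch is illustrative: the matrix order, the redundancy level, and the function name `haar_matrix` are assumptions made for the example. It builds the Haar matrix, counts the non-zeros per column to show the logarithmic growth, and forms an encoding matrix by sampling columns.

```python
import numpy as np

def haar_matrix(n):
    """Orthonormal Haar matrix of order n (n a power of 2), built by the recursion."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        m = H.shape[0]
        top = np.kron(H, np.array([[1.0, 1.0]]))
        bottom = np.kron(np.eye(m), np.array([[1.0, -1.0]]))
        H = np.vstack([top, bottom]) / np.sqrt(2)
    return H

n_full, beta = 16, 2                  # illustrative order and redundancy
H = haar_matrix(n_full)

# Every column has log2(n_full) + 1 non-zero entries.
nnz_per_column = np.count_nonzero(H, axis=0)

# Column-subsampled Haar matrix: keep n_full / beta randomly chosen columns.
rng = np.random.default_rng(0)
cols = np.sort(rng.choice(n_full, size=n_full // beta, replace=False))
S = H[:, cols]

print("non-zeros per column:", nnz_per_column)
print("orthonormal columns: ", np.allclose(S.T @ S, np.eye(S.shape[1])))
```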

#### 4.2.2 Fast transforms

Another computationally efficient method for encoding is to use fast transforms: the Fast Fourier Transform (FFT), if is chosen as a subsampled DFT matrix, and the Fast Walsh-Hadamard Transform (FWHT), if is chosen as a subsampled real Hadamard matrix. In particular, one can insert rows of zeroes at random locations into the data pair , and then take the FFT or FWHT of each column of the augmented matrix. This is equivalent to a randomized Fourier or Hadamard ensemble, which is known to satisfy the RIP with high probability by Candes and Tao (2006). However, such transforms do not have the memory advantages of the sparse matrices, and thus they are more useful for the setting where the dataset is dense, and the encoding is done offline.
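The zero-insertion procedure can be sketched with NumPy's FFT. The example below is a minimal illustration with assumed sizes and an assumed unitary normalization (`norm="ortho"`). It checks that transforming the zero-padded data column by column matches multiplication by a column-subsampled DFT matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, beta = 8, 3, 2                  # illustrative sizes and redundancy
N = beta * n

X = rng.standard_normal((n, p))

# Insert zero rows at random locations: the original rows keep their order
# and land at the sorted positions below.
positions = np.sort(rng.choice(N, size=n, replace=False))
X_aug = np.zeros((N, p))
X_aug[positions] = X

# Encode by taking the FFT of each column of the augmented matrix.
encoded = np.fft.fft(X_aug, axis=0, norm="ortho")

# Equivalent explicit encoding matrix: the DFT matrix restricted to the chosen columns.
F = np.fft.fft(np.eye(N), axis=0, norm="ortho")
S = F[:, positions]

print(np.allclose(encoded, S @ X))
print(np.allclose(S.conj().T @ S, np.eye(n)))
```

The same pattern applies to the label vector, and a real Hadamard transform can be substituted for the FFT when a real-valued encoding is preferred.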

### 4.3 Cost of encoding

Since encoding increases the problem dimensions, it clearly comes with the cost of increased space complexity. The memory and storage requirement of the optimization still increases by a factor of 2, if the encoding is done offline (for dense datasets), or if the techniques described in the previous subsection are applied (for sparse datasets). (Note that the increase in space complexity is not higher for sparse matrices, since the sparsity loss can be avoided using the technique described in Section 4.2.1.) The added redundancy can come from increasing the number of effective data points per machine, from increasing the number of machines while keeping the load per machine constant, or from a combination of the two. In the first case, the computational load per machine increases by a factor of . Although this can make a difference if the system is bottlenecked by the computation time, distributed computing systems are typically communication-limited, and thus we do not expect this additional cost to dominate the speed-up from the mitigation of stragglers.

## 5 Numerical Results

We implement the proposed technique on four problems: ridge regression, matrix factorization, logistic regression, and LASSO.

### 5.1 Ridge regression

We generate the elements of matrix i.i.d. , and the elements of are generated from and an i.i.d.