I Introduction
As the era of Big Data advances, massive parallelization has emerged as a natural approach to overcome limitations imposed by saturation of Moore’s law (and thereby of single-processor compute speeds). However, massive parallelization leads to computational bottlenecks due to faulty nodes and stragglers [dean2013tail]. Stragglers refer to a few slow or delay-prone processors that can bottleneck the entire computation because one has to wait for all the parallel nodes to finish. The issue of straggling [dean2013tail] and faulty nodes has been a topic of active interest in the emerging area of “coded computation” with several interesting works, e.g. [lee2018speeding, li2015coded, tandon2016gradient, polynomialcodes, joshi2014delay, gauristraggler, gauriefficient, dutta2016short, azian2017consensus, YaoqingAllerton16, Yaoqing2017ITTrans, Yang_ISTC_16, Salman1, Salman2, Salman3, Salman4, GC2, GC3, Emina1, Emina2, Virtualization, heterogeneousclusters, GC4, Suhas1, Suhas2, NIPS17Yaoqing, Ramtin1, multicore_setups, yu2017fft, jeongFFT, baharav2018straggler, suh2017matrix, mallick2018rateless, wang2018coded, wang2018fundamental, severinson2017block, ye2018communication]. Coded computation not only advances on coding approaches in classical works in Algorithm-Based Fault Tolerance (ABFT) [Huang_TC_84, faultbook], but also provides novel analyses of required computation time (e.g. expected time [lee2018speeding] and deadline exponents [SanghamitraISIT2017]). Perhaps most importantly, it brings an information-theoretic lens to the problem by examining fundamental limits and comparing them with existing strategies. A broader survey of results and techniques of coded computation is provided in [NewsletterPaper].
In this paper, we focus on the problem of coded matrix multiplication. Matrix multiplication is central to many modern computing applications, including machine learning and scientific computing. There is considerable interest in the classical ABFT literature (starting from [Huang_TC_84, faultbook]) and more recently in the coded computation literature (e.g. [ProductCodes, polynomialcodes]) in making matrix multiplications resilient to faults and delays. In particular, Yu, Maddah-Ali, and Avestimehr [polynomialcodes] provide novel coded matrix-multiplication constructions called Polynomial codes that outperform classical work from the ABFT literature in terms of the recovery threshold, the minimum number of successful (non-delayed, non-faulty) processing nodes required for completing the computation.
In this work, we consider the standard setup used in [polynomialcodes, ProductCodes] with worker nodes that perform the computation in a distributed manner and a master node that helps coordinate the computation by performing some low-complexity preprocessing on the inputs, distributing the inputs to the workers, and aggregating the results of the workers, possibly performing some low-complexity postprocessing.^{1} We propose MatDot codes that advance on existing constructions in a scaling sense under this setup: when a $1/m$ fraction of each matrix can be stored in each worker node, Polynomial codes have a recovery threshold of $m^2$, while the recovery threshold of MatDot codes is only $2m-1$. However, as we note in Section III-B, this comes at an increased per-worker communication cost. We also propose PolyDot codes that interpolate between MatDot and Polynomial code constructions in terms of recovery thresholds and communication costs.
^{1} In this paper, we introduce a new type of node, “a fusion node”, and delegate the master node’s result aggregation and postprocessing function to the fusion node. Hence, a master node is only responsible for preprocessing and job distribution, while a fusion node aggregates results and performs postprocessing. However, this is only a conceptual separation that makes our explanation easier throughout the paper. One can think of a master node and a fusion node as one physical machine.
Our main contributions in this work are as follows:

We present a systematic version of MatDot codes, where the operations of the first $m$ worker nodes may be viewed as multiplication in uncoded form, in Section IV.

In Section V, we propose “PolyDot codes”, a unified view of MatDot and Polynomial codes that leads to a tradeoff between recovery threshold and communication costs.
We note that following the publication of an initial version of this paper [allerton17], the works of Yu, Maddah-Ali, and Avestimehr [entangledpolycodes] and Dutta, Bai, Jeong, Low and Grover [DNNPaperISIT] obtained constructions that outperform PolyDot codes in tradeoffs between communication cost and recovery threshold (although MatDot codes continue to have the smallest recovery threshold for given storage constraints). Importantly, Yu et al. [entangledpolycodes] also provide interesting converse results that show the optimality of MatDot codes.
II System model and problem statement
II-A System model
The system, illustrated in Fig. 1, consists of three different types of nodes, a master node, multiple worker nodes, and a fusion node. These are defined more formally below.
Definition 1 (Computational system).
A computational system consists of the following:

a master node that receives computational inputs, preprocesses them (e.g., encoding), and distributes them to the worker nodes

memory-constrained worker nodes that perform predetermined computations on their respective inputs in parallel

a fusion node that receives outputs from successful worker nodes and performs postprocessing (e.g., decoding) to recover the final computation output.
For practical utility, it is important that the amount of processing performed by the master node and the fusion node be much smaller than the processing performed at the worker nodes.
We assume that any worker node can fail to complete its computation because of faults or delays. Thus, we define a subset of all workers as the “successful workers.”
Definition 2 (Successful workers).
Workers that finish their computation task successfully and send their output to the fusion node are called successful workers.
Definition 3 (Successful computation).
If the computational system on receiving the inputs produces the correct computational output, the computation is said to be successful.
Definition 4 (Recovery threshold).
The recovery threshold is the worst-case^{2} minimum number of successful workers required by the fusion node to complete the computation successfully.
^{2} The worst case here is over all possible configurations of successful workers.
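We will denote the total number of worker nodes by $P$, and the recovery threshold by $k$. We will use the term “row-block” to denote the submatrices formed when we split a matrix $\mathbf{A}$ horizontally as follows: $\mathbf{A} = \begin{bmatrix} \mathbf{A}_0 \\ \mathbf{A}_1 \\ \vdots \end{bmatrix}$. Similarly, we will use the term “column-block” to denote the submatrices formed when we split a matrix $\mathbf{A}$ vertically into submatrices as follows: $\mathbf{A} = \begin{bmatrix} \mathbf{A}_0 & \mathbf{A}_1 & \cdots \end{bmatrix}$.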
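The two kinds of splits can be illustrated with a minimal NumPy sketch. The helper names `row_blocks` and `column_blocks` and the sizes used here are hypothetical and chosen only for illustration; the sketch assumes the number of blocks divides the matrix dimension.

```python
import numpy as np

def row_blocks(M, m):
    """Split M horizontally into m row-blocks (stacked top to bottom)."""
    return np.vsplit(M, m)

def column_blocks(M, m):
    """Split M vertically into m column-blocks (placed side by side)."""
    return np.hsplit(M, m)

# Hypothetical sizes for illustration; m must divide N.
N, m = 4, 2
M = np.arange(N * N).reshape(N, N)

rows = row_blocks(M, m)      # each block has shape (N/m, N)
cols = column_blocks(M, m)   # each block has shape (N, N/m)
```

Stacking the row-blocks vertically, or the column-blocks horizontally, gives back the original matrix.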
II-B Problem statement
We are required to compute the multiplication of two square matrices $\mathbf{A}, \mathbf{B} \in \mathbb{F}^{N \times N}$ ($|\mathbb{F}| > P$), i.e., $\mathbf{AB}$, using the computational system specified in Section II-A. Both the matrices are of dimension $N \times N$. Each worker can receive at most $2N^2/m$ symbols from the master node, where each symbol is an element of $\mathbb{F}$. For simplicity, we assume that $m$ divides $N$ and that a worker node receives $N^2/m$ symbols from $\mathbf{A}$ and $\mathbf{B}$ each.^{3} The computational complexities of the master and fusion nodes, in terms of the matrix parameter $N$, are required to be negligible in a scaling sense compared with the computational complexity at any worker node.^{4} The goal is to perform this matrix-matrix multiplication utilizing faulty or delay-prone workers with minimum recovery threshold.
^{3} We only consider symmetric distribution of matrices $\mathbf{A}$ and $\mathbf{B}$ in this work. A more general problem formulation where one can distribute different numbers of entries from $\mathbf{A}$ and $\mathbf{B}$ to each worker is an open problem.
^{4} If the master node or the fusion node is allowed to have higher computational complexity, the workers can simply store $\mathbf{A}, \mathbf{B}$ using Maximum Distance Separable (MDS) codes to get a recovery threshold of $m$; the fusion node simply recovers $\mathbf{A}, \mathbf{B}$ and then multiplies them, essentially performing the whole operation.
III MatDot Codes
In this section we describe the distributed matrix-matrix product strategy using MatDot codes, and then examine the computation and communication costs of the proposed strategy. Before proceeding to the detailed construction and analyses of MatDot codes, we first give some motivating examples that contrast MatDot codes with existing techniques.
III-A Motivating examples and summary of previous results
Consider the problem statement described in Section II. We describe three different strategies as possible solutions to the problem: (i) ABFT matrix multiplication [Huang_TC_84] (also called product-coded matrices in [ProductCodes]), (ii) Polynomial codes [polynomialcodes] and then (iii) our proposed construction, MatDot codes, each progressively improving, i.e., reducing the recovery threshold. We will evaluate the straggler tolerance of a strategy by its recovery threshold, $k$. For all the examples, we consider the simplest case with $m=2$. Let us begin by describing the first strategy, namely, ABFT matrix multiplication.
Example 1 (ABFT codes [Huang_TC_84] (, )).
Consider two matrices and that are split as follows:
where are submatrices (row-blocks) of of dimension and are submatrices (column-blocks) of of dimension . Using ABFT, it is possible to compute over nodes such that each node uses linear combination of the entries of and linear combination of the entries of and the overall computation is tolerant to stragglers in the worst case. Thus, any worker nodes suffice to recover .
ABFT codes use the following strategy: processors are arranged in a grid. ABFT codes encode two row-blocks of and two column-blocks of separately using two systematic MDS codes. Then, we distribute the encoded column-block of to all the worker nodes on the th row of the grid, and the th encoded row-blocks to all the worker nodes on the th column of the grid. Note that here the grid indexing is and . An example for is shown in Fig. 2. The worst case arises when all but one worker node in the lower right part of the grid fail. Thus, the worst-case recovery threshold is . For the example given in Fig. 2 where , recovery threshold is .
In the previous example, the recovery threshold was a function of and thus it required more successful worker nodes as we use more processors. However, as we will show in the next example, Polynomial codes [polynomialcodes] provide a superior recovery threshold that does not depend on .
Example 2 (Polynomial codes [polynomialcodes] (, )).
Consider two matrices and that are split as follows:
Polynomial codes compute over nodes such that each node uses linear combination of the entries of and linear combination of the entries of and the overall computation is tolerant to stragglers, i.e., any nodes suffice to recover . Polynomial codes use the following strategy: Node computes so that from any of the nodes, the polynomial can be interpolated. Having interpolated the polynomial, the product can be obtained from the coefficients (matrices) of the polynomial.
Our novel MatDot construction achieves a smaller recovery threshold than Polynomial codes. Unlike ABFT and Polynomial codes, MatDot divides matrix $\mathbf{A}$ vertically into column-blocks and matrix $\mathbf{B}$ horizontally into row-blocks.
Example 3.
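[MatDot codes ($m=2$, $k=3$)]
MatDot codes compute $\mathbf{AB}$ over $P$ nodes such that each node uses $N^2/2$ linear combinations of the entries of $\mathbf{A}$ and $N^2/2$ linear combinations of the entries of $\mathbf{B}$, and the overall computation is tolerant to $P-3$ stragglers, i.e., any $3$ nodes suffice to recover $\mathbf{AB}$. The proposed MatDot codes use the following strategy: Matrix $\mathbf{A}$ is split vertically and $\mathbf{B}$ is split horizontally as follows:
$$\mathbf{A} = \begin{bmatrix} \mathbf{A}_0 & \mathbf{A}_1 \end{bmatrix}, \qquad \mathbf{B} = \begin{bmatrix} \mathbf{B}_0 \\ \mathbf{B}_1 \end{bmatrix}, \tag{1}$$
where $\mathbf{A}_0, \mathbf{A}_1$ are submatrices (or column-blocks) of $\mathbf{A}$ of dimension $N \times N/2$ and $\mathbf{B}_0, \mathbf{B}_1$ are submatrices (or row-blocks) of $\mathbf{B}$ of dimension $N/2 \times N$.
Let $p_A(x) = \mathbf{A}_0 + \mathbf{A}_1 x$ and $p_B(x) = \mathbf{B}_0 x + \mathbf{B}_1$. Let $x_1, x_2, \ldots, x_P$ be distinct real numbers. The master node sends $p_A(x_r)$ and $p_B(x_r)$ to the $r$th worker node, where the $r$th worker node performs the multiplication $p_A(x_r) p_B(x_r)$ and sends the output to the fusion node. The exact computations at each worker node are depicted in Fig. 4. We can observe that the fusion node can obtain the product $\mathbf{AB}$ using the output of any three successful workers as follows: Let the worker nodes $1$, $2$, and $3$ be the first three successful worker nodes; then the fusion node obtains the following three matrices:
$$p_A(x_r) p_B(x_r) = \mathbf{A}_0\mathbf{B}_1 + (\mathbf{A}_0\mathbf{B}_0 + \mathbf{A}_1\mathbf{B}_1)\, x_r + \mathbf{A}_1\mathbf{B}_0\, x_r^2, \qquad r = 1, 2, 3.$$
Since these three matrices can be seen as three evaluations of the matrix polynomial $p_A(x) p_B(x)$ of degree $2$ at three distinct evaluation points $x_1, x_2, x_3$, the fusion node can obtain the coefficients of the powers of $x$ in $p_A(x) p_B(x)$ using polynomial interpolation. This includes the coefficient of $x$, which is $\mathbf{A}_0\mathbf{B}_0 + \mathbf{A}_1\mathbf{B}_1 = \mathbf{AB}$. Therefore, the fusion node can recover the matrix product $\mathbf{AB}$.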
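The two-block example can be checked numerically with a minimal sketch. It assumes the encoding polynomials $p_A(x) = \mathbf{A}_0 + \mathbf{A}_1 x$ and $p_B(x) = \mathbf{B}_0 x + \mathbf{B}_1$, uses real (floating-point) evaluation points, and picks a hypothetical system of five workers of which three finish. The variable names and sizes are illustrative assumptions, not part of the construction.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 4  # hypothetical matrix dimension (must be even for m = 2)
A = rng.integers(-3, 4, size=(N, N)).astype(float)
B = rng.integers(-3, 4, size=(N, N)).astype(float)

A0, A1 = np.hsplit(A, 2)  # column-blocks of A, each N x N/2
B0, B1 = np.vsplit(B, 2)  # row-blocks of B, each N/2 x N

def p_A(x):
    return A0 + A1 * x

def p_B(x):
    return B0 * x + B1

# Hypothetical system with P = 5 workers and distinct evaluation points.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
outputs = {r: p_A(x) @ p_B(x) for r, x in enumerate(xs)}

# Suppose only the workers with (0-based) indices 0, 2 and 4 finish.
done = [0, 2, 4]
V = np.vander([xs[r] for r in done], 3, increasing=True)  # rows [1, x, x^2]
Y = np.stack([outputs[r] for r in done])                  # shape (3, N, N)

# Interpolate every matrix entry at once by solving the Vandermonde system.
coeffs = np.linalg.solve(V, Y.reshape(3, -1)).reshape(3, N, N)
AB_hat = coeffs[1]  # the coefficient of x is A0 B0 + A1 B1
```

Any other choice of three finished workers gives the same `AB_hat` up to floating-point error, since three distinct points determine a degree-2 polynomial.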
In the example, we have seen that for $m=2$, the recovery threshold of MatDot codes is $k=3$, which is lower than that of Polynomial codes as well as ABFT matrix multiplication. The following theorem shows that for any integer $m$, the recovery threshold of MatDot codes is $k=2m-1$.
Theorem 1.
Before we prove Theorem 1, we first describe the construction of MatDot codes.
Construction 1.
[MatDot Codes]
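Splitting of input matrices: The matrix $\mathbf{A}$ is split vertically into $m$ equal column-blocks (of $N^2/m$ symbols each) and $\mathbf{B}$ is split horizontally into $m$ equal row-blocks (of $N^2/m$ symbols each) as follows:
$$\mathbf{A} = \begin{bmatrix} \mathbf{A}_0 & \mathbf{A}_1 & \cdots & \mathbf{A}_{m-1} \end{bmatrix}, \qquad \mathbf{B} = \begin{bmatrix} \mathbf{B}_0 \\ \mathbf{B}_1 \\ \vdots \\ \mathbf{B}_{m-1} \end{bmatrix}, \tag{2}$$
where, for $i \in \{0, \ldots, m-1\}$, $\mathbf{A}_i$ and $\mathbf{B}_i$ are $N \times N/m$ and $N/m \times N$ dimensional submatrices, respectively.
Master node (encoding): Let $x_1, x_2, \ldots, x_P$ be distinct elements in $\mathbb{F}$. Let $p_A(x) = \sum_{i=0}^{m-1} \mathbf{A}_i x^i$ and $p_B(x) = \sum_{j=0}^{m-1} \mathbf{B}_j x^{m-1-j}$. The master node sends to the $r$th worker evaluations of $p_A(x), p_B(x)$ at $x = x_r$, that is, it sends $p_A(x_r), p_B(x_r)$ to the $r$th worker.
Worker nodes: For $r \in \{1, 2, \ldots, P\}$, the $r$th worker node computes the matrix product $p_C(x_r) = p_A(x_r) p_B(x_r)$ and sends it to the fusion node on successful completion.
Fusion node (decoding): The fusion node uses outputs of any $2m-1$ successful workers to compute the coefficient of $x^{m-1}$ in the product $p_C(x) = p_A(x) p_B(x)$ (the feasibility of this step will be shown later in the proof of Theorem 1). If the number of successful workers is smaller than $2m-1$, the fusion node declares a failure.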
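The encode, compute, and decode steps of the construction can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not a reference implementation: it takes $p_A(x) = \sum_i \mathbf{A}_i x^i$ and $p_B(x) = \sum_j \mathbf{B}_j x^{m-1-j}$ as the encoding polynomials, works over the reals in floating point rather than over a general field $\mathbb{F}$, and decodes by solving a Vandermonde system instead of a fast interpolation algorithm. The function names `matdot_encode` and `matdot_decode` and all sizes are hypothetical.

```python
import numpy as np

def matdot_encode(A, B, m, xs):
    """Return the pair (p_A(x), p_B(x)) for every evaluation point x in xs."""
    A_blocks = np.hsplit(A, m)  # m column-blocks of A
    B_blocks = np.vsplit(B, m)  # m row-blocks of B
    shares = []
    for x in xs:
        pA = sum(A_blocks[i] * x**i for i in range(m))
        pB = sum(B_blocks[j] * x**(m - 1 - j) for j in range(m))
        shares.append((pA, pB))
    return shares

def matdot_decode(results, xs_done, m):
    """Recover AB from worker outputs evaluated at the points xs_done."""
    k = 2 * m - 1  # number of evaluations needed for a degree 2m-2 polynomial
    if len(results) < k:
        raise ValueError("too few successful workers")
    V = np.vander(np.asarray(xs_done[:k], dtype=float), k, increasing=True)
    Y = np.stack(results[:k])
    shape = Y.shape[1:]
    coeffs = np.linalg.solve(V, Y.reshape(k, -1))
    return coeffs[m - 1].reshape(shape)  # coefficient of x^(m-1)

# Hypothetical sizes: N = 6, m = 3, P = 8 workers.
rng = np.random.default_rng(1)
N, m, P = 6, 3, 8
A = rng.standard_normal((N, N))
B = rng.standard_normal((N, N))
xs = np.linspace(-1.0, 1.0, P)  # distinct evaluation points

shares = matdot_encode(A, B, m, xs)
outputs = [pA @ pB for pA, pB in shares]  # the work done by each worker

done = [7, 1, 4, 2, 6]  # any 2m - 1 = 5 workers that finished
C_hat = matdot_decode([outputs[r] for r in done], [xs[r] for r in done], m)
```

Real-valued Vandermonde systems become ill-conditioned as $m$ grows, so this sketch is meant only for small $m$; that numerical concern does not arise in the same form over a finite field.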
Notice that in MatDot codes, we have
$$\mathbf{AB} = \sum_{i=0}^{m-1} \mathbf{A}_i \mathbf{B}_i, \tag{3}$$
where $\mathbf{A}_i$ and $\mathbf{B}_i$ are as defined in (2). The simple observation of (3) leads to a different way of computing the matrix product as compared with Polynomial codes based computation. In particular, to compute the product, we only require, for each $i$, the product of $\mathbf{A}_i$ and $\mathbf{B}_i$. We do not require products of the form $\mathbf{A}_i \mathbf{B}_j$ for $i \neq j$, unlike Polynomial codes, where, after splitting the matrices into $m$ parts, all $m^2$ cross-products are required to evaluate the overall matrix product. This leads to a significantly smaller recovery threshold for our construction.
Proof of Theorem 1.
To prove the theorem, it suffices to show that in the MatDot code construction described above, the fusion node is able to construct $\mathbf{AB}$ from the outputs of any $2m-1$ worker nodes. Observe that the coefficient of $x^{m-1}$ in
$$p_C(x) = p_A(x) p_B(x) = \left( \sum_{i=0}^{m-1} \mathbf{A}_i x^i \right) \left( \sum_{j=0}^{m-1} \mathbf{B}_j x^{m-1-j} \right) \tag{4}$$
is $\sum_{i=0}^{m-1} \mathbf{A}_i \mathbf{B}_i = \mathbf{AB}$ (from (3)), which is the desired matrix-matrix product. Thus it is sufficient to compute this coefficient at the fusion node as the computation output for successful computation. Now, because the polynomial $p_C(x)$ has degree $2m-2$, evaluation of the polynomial at any $2m-1$ distinct points is sufficient to compute all of the coefficients of powers of $x$ in $p_A(x) p_B(x)$ using polynomial interpolation. This includes $\mathbf{AB}$, the coefficient of $x^{m-1}$. The next section has a complexity analysis that shows that the master and fusion nodes have a lower computational complexity as compared to the worker nodes. ∎
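As a worked instance of the argument, consider $m = 3$, assuming the encoding polynomials $p_A(x) = \sum_{i} \mathbf{A}_i x^i$ and $p_B(x) = \sum_{j} \mathbf{B}_j x^{m-1-j}$. Expanding the product and collecting powers of $x$ shows that only the middle coefficient collects the matched products $\mathbf{A}_i\mathbf{B}_i$:

```latex
\begin{align*}
p_A(x)\,p_B(x)
  &= \left(\mathbf{A}_0 + \mathbf{A}_1 x + \mathbf{A}_2 x^2\right)
     \left(\mathbf{B}_0 x^2 + \mathbf{B}_1 x + \mathbf{B}_2\right) \\
  &= \mathbf{A}_0\mathbf{B}_2
   + \left(\mathbf{A}_0\mathbf{B}_1 + \mathbf{A}_1\mathbf{B}_2\right) x
   + \underbrace{\left(\mathbf{A}_0\mathbf{B}_0 + \mathbf{A}_1\mathbf{B}_1
       + \mathbf{A}_2\mathbf{B}_2\right)}_{=\,\mathbf{AB}} x^2 \\
  &\quad + \left(\mathbf{A}_1\mathbf{B}_0 + \mathbf{A}_2\mathbf{B}_1\right) x^3
   + \mathbf{A}_2\mathbf{B}_0\, x^4 .
\end{align*}
```

The polynomial has degree $4 = 2m-2$, so $5 = 2m-1$ evaluations at distinct points determine all five coefficient matrices, including the coefficient of $x^{2} = x^{m-1}$.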
III-B Complexity analyses of MatDot codes
Encoding/decoding complexity: Decoding requires interpolating a degree-$(2m-2)$ polynomial for each of $N^2$ elements. Using polynomial interpolation algorithms of complexity $O(k \log^2 k)$ [kung1973fast], where $k = 2m-1$, the decoding complexity per matrix element is $O(m \log^2 m)$. Thus, for $N^2$ elements, the decoding complexity is $O(N^2 m \log^2 m)$.
Encoding for each worker requires performing two additions, each adding $m$ scaled matrices of size $N^2/m$, for an overall encoding complexity for each worker of $O(N^2)$. Thus, the overall computational complexity of encoding for $P$ workers is $O(N^2 P)$.
Each worker’s computational cost: Each worker multiplies two matrices of dimensions $N \times N/m$ and $N/m \times N$, requiring $N^3/m$ operations (using straightforward multiplication algorithms^{5}). Hence, the computational complexity for each worker is $O(N^3/m)$. Thus, as long as (and hence ), encoding and decoding complexity is smaller than per-worker computational complexity.
^{5} More sophisticated algorithms [strassen] also require super-quadratic complexity in $N$, so the conclusion will hold if those algorithms are used at workers as well.
Communication cost:
The master node communicates $O(PN^2/m)$ symbols, and the fusion node receives $O(mN^2)$ symbols from the successful worker nodes. While the master node communication is identical to that in Polynomial codes, the fusion node there only receives $O(N^2)$ symbols.
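The difference in worker-to-fusion communication can be made concrete with a small counting sketch. It assumes a recovery threshold of $2m-1$ with $N \times N$ worker outputs for MatDot codes, and a recovery threshold of $m^2$ with $N/m \times N/m$ worker outputs for Polynomial codes; the function names and the example values of $N$ and $m$ are hypothetical.

```python
def matdot_cost(N, m):
    """(recovery threshold, symbols per worker, total symbols at fusion node)."""
    k = 2 * m - 1
    per_worker = N * N  # each worker returns an N x N matrix
    return k, per_worker, k * per_worker

def polynomial_cost(N, m):
    """Same tuple for Polynomial codes, assuming m divides N."""
    k = m * m
    per_worker = (N // m) ** 2  # each worker returns an N/m x N/m matrix
    return k, per_worker, k * per_worker

# Hypothetical example values.
N, m = 1000, 10
matdot = matdot_cost(N, m)
polynomial = polynomial_cost(N, m)
```

For these values MatDot waits for 19 workers instead of 100, but the fusion node receives 19 times as many symbols as there are entries in the product.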
III-C Why does MatDot exceed the fundamental limits in [polynomialcodes]?
The fundamental limit in [polynomialcodes] concludes that the recovery threshold is at least $m^2$, whereas our recovery threshold is lower: $2m-1$. To understand why this is possible, one needs to carefully examine the derivation of the fundamental limit in [polynomialcodes], which uses a cut-set argument to count the number of bits/symbols required for computing the product $\mathbf{AB}$. In doing so, the authors make the assumption that the number of symbols communicated by each worker to the fusion node is $N^2/m^2$, which is a consequence of a horizontal division of matrix $\mathbf{A}$ and a vertical division of matrix $\mathbf{B}$ (the opposite of the division used here).
The bound does not apply to our construction because each worker now communicates $N^2$ symbols to the fusion node. Note that while the amount of information in each worker’s transmissions is less, i.e., $O(N^2/m)$ (because the matrices communicated by the workers can have rank $N/m$), this is still significantly larger than the assumption made in the fundamental limits in [polynomialcodes].
From a communication viewpoint, MatDot requires communicating a total of $(2m-1)N^2$ symbols, which is larger than the $N^2$ symbols in the product $\mathbf{AB}$. This is suggestive of a tradeoff between the minimal number of workers and minimal (sum-rate) communication from non-straggling workers. Section V describes a unified view of MatDot and Polynomial codes, which describes the tradeoff between worker-fusion communication cost and recovery threshold achieved by our construction.
In practice, whether this increased worker-fusion node communication cost using MatDot codes is worth paying will depend on the computational fabric and system implementation choices. Even in systems where communication costs may be significant, it is possible that more communication from fewer successful workers is less expensive than requiring more successful workers as required in Polynomial codes. Also note that if (e.g. when the system is highly fault-prone or the deadline [SanghamitraISIT2017] is very short), communication complexity at the master node will dominate, and hence MatDot codes may not impose a substantial computing overhead.
IV Systematic Code Constructions
In this section, we provide a systematic code construction for MatDot codes. As the notion of systematic codes in the context of matrix multiplication problem is ambiguous, we will first define systematic codes in our context.
Definition 5 (Systematic code in distributed matrix-matrix multiplication).
For the problem stated in Section IIB computed on the system defined in Definition 1 such that the matrices and are split as in (2), a code is called systematic if the output of the th worker node is the product , for all . We refer to the first worker nodes, that output for , as systematic worker nodes.
Note that the final output can be obtained by summing up the outputs from the systematic worker nodes:
The presented systematic code, named “systematic MatDot code”, is advantageous over MatDot codes in two aspects. Firstly, even though both MatDot and systematic MatDot codes have the same recovery threshold, systematic MatDot codes can recover the output as soon as the $m$ systematic worker nodes successfully finish, unlike MatDot codes, which always require $2m-1$ workers to successfully finish to recover the final result. Furthermore, when the $m$ systematic worker nodes successfully finish first, the decoding complexity using systematic MatDot codes is $O(mN^2)$, which is less than the decoding complexity of MatDot codes, i.e., $O(N^2 m \log^2 m)$. Another advantage of systematic MatDot codes over MatDot codes is that the systematic approach may be useful for backward compatibility with current practice. That is, for systems that are already established and operating with no straggler tolerance, but that perform an $m$-way parallelization, it is easier to apply the systematic approach, as additional worker nodes can be appended to the existing infrastructure without modifying what the first $m$ nodes are doing.
The following theorem shows that there exists a systematic MatDot code construction that achieves the same recovery threshold as MatDot codes.
Theorem 2.
For the matrix-matrix multiplication problem specified in Section II-B computed on the system defined in Definition 1, there exists a systematic code, where the product $\mathbf{AB}$ is the summation of the outputs of the first $m$ worker nodes, that solves this problem with a recovery threshold of $2m-1$, where $m$ is any positive integer that divides $N$.
Before we describe the construction of systematic MatDot codes, that will be used to prove Theorem 2, we first present a simple example to illustrate the idea of systematic MatDot codes.
Example 4.
[Systematic MatDot code, ]
Matrix is split vertically into two submatrices (columnblocks) and , each of dimension and matrix is split horizontally into two submatrices (rowblocks) and , each of dimension as follows:
(5) 
Now, we define the encoding functions and as and , for distinct . Let be elements of such that are distinct, the master node sends and to the th worker node, , where the th worker node performs the multiplication and sends the output to the fusion node. The exact computations at each worker node are depicted in Fig. 5.
We can observe that the outputs of the worker nodes are , respectively, and hence this code is systematic. Let us consider a scenario where the systematic worker nodes, i.e., worker nodes and complete their computations first. In this scenario, the fusion node does not require a decoding step and can obtain the product by simply performing the summation of the two outputs it has received: . Now let us consider a different scenario where worker nodes are the first three successful workers. Then, the fusion node receives three matrices, and . Since these three matrices can be seen as three evaluations of the polynomial of degree at three distinct evaluation points , the coefficients of the polynomial can be obtained using polynomial interpolation. Finally, to obtain the product , we evaluate at and sum them up:
The following describes the general construction of the systematic MatDot codes for matrixmatrix multiplication.
Construction 2.
[Systematic MatDot codes]
Splitting of input matrices: and are split as in (2).
Master node (encoding): Let be arbitrary distinct elements of . Let and where is defined as follows for :
(6) 
The master node sends to the th worker the evaluations of at , i.e., it sends to the th worker for .
Worker nodes: For , the th worker node computes the matrix product and sends it to the fusion node on successful completion.
Fusion node (decoding): For any such that , whenever the outputs of the first successful workers contain the outputs of the systematic worker nodes , i.e., is contained in the set of the first outputs received by the fusion node, the fusion node performs the summation . Otherwise, if is not contained in the set of the first evaluations received by the fusion node, the fusion node performs the following steps: (i) interpolates the polynomial (the feasibility of this step will be shown later in the proof of Theorem 2), (ii) evaluates at , (iii) performs the summation .
If the number of successful worker nodes is smaller than and the first worker nodes are not included in the successful worker nodes, the fusion node declares a failure.
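The two decoding paths of the systematic construction can be sketched as follows. The sketch assumes that the encoding uses Lagrange basis polynomials built on the first $m$ evaluation points, which is one choice that satisfies the property needed here (each basis polynomial equals one at its own point and zero at the other $m-1$ points). It also works over the reals, indexes workers from zero, and uses hypothetical function names and sizes.

```python
import numpy as np

def lagrange_basis(i, x, nodes):
    """L_i(x) = prod_{j != i} (x - nodes[j]) / (nodes[i] - nodes[j])."""
    val = 1.0
    for j, xj in enumerate(nodes):
        if j != i:
            val *= (x - xj) / (nodes[i] - xj)
    return val

def systematic_encode(A, B, m, xs):
    """Encode with p_A(x) = sum_i A_i L_i(x) and p_B(x) = sum_i B_i L_i(x)."""
    A_blocks, B_blocks = np.hsplit(A, m), np.vsplit(B, m)
    nodes = xs[:m]  # the first m points belong to the systematic workers
    shares = []
    for x in xs:
        w = [lagrange_basis(i, x, nodes) for i in range(m)]
        pA = sum(w[i] * A_blocks[i] for i in range(m))
        pB = sum(w[i] * B_blocks[i] for i in range(m))
        shares.append((pA, pB))
    return shares

def systematic_decode(outputs, xs, m):
    """outputs maps a (0-based) worker index to that worker's result."""
    if all(r in outputs for r in range(m)):
        # All systematic workers finished: just add their outputs.
        return sum(outputs[r] for r in range(m))
    k = 2 * m - 1
    done = sorted(outputs)[:k]
    if len(done) < k:
        raise ValueError("too few successful workers")
    V = np.vander([xs[r] for r in done], k, increasing=True)
    Y = np.stack([outputs[r] for r in done])
    shape = Y.shape[1:]
    coeffs = np.linalg.solve(V, Y.reshape(k, -1))  # interpolate p_C
    Vs = np.vander(xs[:m], k, increasing=True)     # re-evaluate at x_0..x_{m-1}
    return (Vs @ coeffs).sum(axis=0).reshape(shape)

# Hypothetical sizes: N = 4, m = 2, P = 5 workers.
rng = np.random.default_rng(2)
N, m = 4, 2
A = rng.standard_normal((N, N))
B = rng.standard_normal((N, N))
xs = [0.0, 1.0, 2.0, 3.0, 4.0]

shares = systematic_encode(A, B, m, xs)
all_outputs = {r: pA @ pB for r, (pA, pB) in enumerate(shares)}

fast = systematic_decode({r: all_outputs[r] for r in (0, 1)}, xs, m)
slow = systematic_decode({r: all_outputs[r] for r in (1, 3, 4)}, xs, m)
```

In the first call the two systematic workers have finished, so no interpolation is needed; in the second call a systematic worker is missing and the product is recovered from three other evaluations.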
The following lemma proves that the construction given here is systematic.
Lemma 1.
Proof of Lemma 1.
The lemma follows from noting that the output of the th worker, for , can be written as
(7) 
where the last equality follows from the property of :
(8) 
for . ∎
Now, we proceed with the proof of Theorem 2.
Proof of Theorem 2.
Since Construction 2 is a systematic code for matrixmatrix multiplication (Lemma 1), in order to prove the theorem, it suffices to show that Construction 2 is a valid construction with a recovery threshold . From (6), observe that the polynomials , , have degrees each. Therefore, each of and has a degree of as well. Consequently, has a degree of . Now, because the polynomial has degree , evaluation of the polynomial at any distinct points is sufficient to interpolate using polynomial interpolation algorithm. Now, since Construction 2 is systematic (Lemma 1), the product is the summation of the outputs of the first m workers, i.e., . Therefore, after the fusion node interpolates , evaluating at , and performing the summation yields the product . ∎
IV-A Complexity analyses of the systematic codes
Apart from the encoding/decoding complexity, the complexity analyses of systematic MatDot codes are the same as those of their MatDot counterparts. In the following, we investigate the encoding/decoding complexity of Construction 2.
Encoding/decoding Complexity: Encoding for each worker first requires performing evaluations of polynomials for all , with each evaluation requiring operations. This gives operations for all polynomial evaluations. Afterwards, two additions are performed, each adding scaled matrices of size , with complexity . Therefore, the overall encoding complexity for each worker is . Thus, the overall computational complexity of encoding for workers is .
For decoding, first, for the interpolation step, we interpolate a degree polynomial for elements. Using polynomial interpolation algorithms of complexity [kung1973fast], where , the interpolation complexity per matrix element is . Thus, for elements, the interpolation complexity is For the evaluation of at , each evaluation involves adding scaled matrices of size with a complexity of . Hence, for all evaluations the complexity is . Finally, the complexity of the final addition of matrices of size is . Hence, the overall decoding complexity is .
V Unifying MatDot and Polynomial Codes: Tradeoff between communication cost and recovery threshold
In this section, we present a code construction, named PolyDot, that provides a tradeoff between communication costs and recovery thresholds. Polynomial codes [polynomialcodes] have a higher recovery threshold of $m^2$, but have a lower communication cost of $O(N^2/m^2)$ per worker node. On the other hand, MatDot codes have a lower recovery threshold of $2m-1$, but have a higher communication cost of $O(N^2)$ per worker node. Here, our goal is to construct a code that bridges the gap between Polynomial codes and MatDot codes so that we can get intermediate communication costs and recovery thresholds, with Polynomial and MatDot codes as two special cases. For this goal, we propose PolyDot codes, which may be viewed as an interpolation of MatDot codes and Polynomial codes: one extreme of the interpolation is MatDot codes and the other extreme is Polynomial codes.
We follow the same problem setup and system assumptions as in Section II-B. The following theorem gives the recovery threshold of PolyDot codes.
Theorem 3.
Before describing the PolyDot code construction and proving Theorem 3, we first introduce the following simple PolyDot code example with and .
Example 5.
[PolyDot codes ()]
Matrix is split into submatrices each of dimension . Similarly, matrix is split into submatrices each of dimension as follows:
(9) 
Notice that, from (9), the product can be written as
(10) 
Now, we define the encoding functions and as
Observe the following:

the coefficient of in is ,

the coefficient of in is ,

the coefficient of in is , and

the coefficient of in is .
Let be distinct elements of . The master node sends and to the th worker node, , and the th worker node performs the multiplication and sends the result to the fusion node.
Let worker nodes be the first worker nodes to send their computation outputs to the fusion node, then the fusion node obtains the matrices for all . Since these matrices can be seen as twelve evaluations of the matrix polynomial of degree at twelve distinct points , the coefficients of the matrix polynomial can be obtained using polynomial interpolation. This includes the coefficients of for all , i.e., for all . Once the matrices for all are obtained, the product is obtained by (10).
The recovery threshold for $m=4$ in Example 5 is $k=12$. This is larger than the recovery threshold of MatDot codes, which is $2m-1=7$, and smaller than the recovery threshold of Polynomial codes, which is $m^2=16$. Hence, we can see that the recovery thresholds of PolyDot codes lie between those of MatDot codes and Polynomial codes.
The following describes the general construction of PolyDot() codes. Note that although two parameters and are sufficient to characterize a PolyDot code, we include in the parameters for better readability.
Construction 3.
[PolyDot() codes]
Splitting of input matrices: and are split both horizontally and vertically:
(11) 
where, for , ’s are submatrices of and ’s are submatrices of . We choose and such that both and divide and .
Master node (encoding): Define the encoding polynomials as:
(12) 
The master node sends to the th worker the evaluations of at where ’s are all distinct for . By this substitution, we are transforming the three-variable polynomial to a single-variable polynomial as follows:^{6}
and evaluate the polynomial at for . In Lemma 2, we show that this transformation is one-to-one.
^{6} An alternate substitution can reduce the recovery threshold further, as mentioned in subsequent works [DNNPaperISIT, entangledpolycodes]. We will clarify this in Remark 1.
Worker nodes: For , the th worker node computes the matrix product and sends it to the fusion node on successful completion.
Fusion node (decoding): The fusion node uses outputs of the first successful workers to compute the coefficient of in . That is, it computes the coefficient of of the transformed singlevariable polynomial. The proof of Theorem 3 shows that this is indeed possible. If the number of successful workers is smaller than , the fusion node declares a failure.
Before we prove the theorem, let us discuss the utility of PolyDot codes. By choosing different and , we can trade off communication cost and recovery threshold. For and , PolyDot code is a MatDot code which has a low recovery threshold but high communication cost. At the other extreme, for and , PolyDot code is a Polynomial code. Now let us consider a code with intermediate and values such as and . A PolyDot code has a recovery threshold of , and the total number of symbols to be communicated to the fusion node is , which is smaller than required by MatDot codes but larger than required by Polynomial codes. This tradeoff is illustrated in Fig. 6 for .
To prove Theorem 3, we need the following lemma.
Lemma 2.
The following function
(13) 
is a bijection.
Proof.
Let us assume, for the sake of contradiction, that for some , . Then and hence . Similarly, gives , and thus (because ). Now, because and , as we just established, from our assumption, it follows that . This contradicts our assumption that . ∎
Proof of Theorem 3.
The product of and can be written as follows: