Matrix multiplication is one of the key operations underlying many data analytics applications in various fields such as machine learning, scientific computing, and graph processing. Many such applications require processing terabytes or even petabytes of data, which needs massive computation and storage resources that cannot be provided by a single machine. Hence, deploying matrix computation tasks on large-scale distributed systems has received wide interest [1, 2, 3, 4].
There is, however, a major performance bottleneck that arises as we scale out computations across many distributed nodes: the stragglers’ delay bottleneck, which is due to the unpredictable latency in waiting for the slowest nodes (or stragglers) to finish their tasks. The conventional approach for mitigating straggler effects involves injecting some form of “computation redundancy” such as repetition (e.g., ). Interestingly, it has been shown recently that coding theoretic concepts can also play a transformational role in this problem, by efficiently creating “computational redundancy” to mitigate the stragglers [7, 8, 9, 10].
In this paper, we consider a general formulation of distributed matrix multiplication, study information-theoretic limits, and develop optimal coding designs for straggler effect mitigation. We consider the canonical master-worker distributed setting, where a group of workers aim to collaboratively compute the product of two large matrices and , and return the result to the master. As shown in Figure 1, the two input matrices are partitioned (arbitrarily) into -by- and -by- blocks of submatrices respectively, where all submatrices within the same input are of equal size. Each worker has a local memory that can be used to store any coded function of each matrix, denoted by ’s and ’s, each with a size equal to that of the corresponding submatrices. The workers then multiply their two stored (coded) submatrices and return the results to the master. By carefully designing the coding functions, the master can decode the final result without having to wait for the slowest workers, which provides robustness against stragglers.
Note that by allowing different values of parameters , , and , we allow flexible partitioning of input matrices, which in return enables different utilization of system resources (i.e., the required amount of storage at each worker and the amount of communication from worker to master). Hence, considering the system constraints on available storage and communication resources, one can choose , , and accordingly. We aim to find optimal coding and computation designs for any choice of parameters , and , to provide optimum straggler effect mitigation for various situations.
With a careful design of the coded submatrices and at each worker, the master only needs results from the fastest workers before it can recover the final output, which effectively mitigates straggler issues. To measure the robustness against straggler effects of a given coding strategy, we use the metric recovery threshold, defined previously in , which is equal to the minimum number of workers that the master needs to wait for in order to compute the output . Given this terminology, our main problem is as follows: What is the minimum possible recovery threshold and the corresponding coding scheme, for any choice of parameters , , , and ?
We propose a novel coding technique, referred to as entangled polynomial code, which achieves the recovery threshold of for all possible parameter values. The construction of the entangled polynomial code is based on the observation that when multiplying an -by- matrix and a -by- matrix, we essentially evaluate a subspace of bilinear functions, spanned by the pairwise product of the elements from the two matrices. Although potentially there are a total of pairs of elements, at most pairs are directly related to the matrix product, which is an order of less. The particular structure of the proposed code entangles the input matrices to the output such that the system almost avoids unnecessary multiplications and achieves a recovery threshold in the order of , while allowing robust straggler mitigation for arbitrarily large systems. This allows orderwise improvement upon conventional uncoded approaches, random linear codes, and MDS-coding type approaches for straggler mitigation [7, 8].
Entangled polynomial code generalizes our previously proposed polynomial code for distributed matrix multiplication , which was designed for the special case of (i.e., allowing only column-wise partitioning of matrices and ). However, as we move to arbitrary partitioning of the input matrices (i.e., arbitrary values of , , and ), a key challenge is to design the coding strategy at each worker such that its computation best aligns with the final computation . In particular, to recover the product , the master needs components that each involve summing products of submatrices of and . Entangled polynomial code effectively aligns the workers’ computations with the master’s need, which is its key distinguishing feature from polynomial code.
We show that entangled polynomial code achieves the optimal recovery threshold among all linear coding strategies in the cases of or . It also achieves the optimal recovery threshold among all possible schemes within a factor of when or .
Furthermore, for all partitionings of input matrices (i.e., all values of , , , and ), we characterize the optimal recovery threshold among all linear coding strategies within a factor of of , which denotes the bilinear complexity of multiplying an -by- matrix and a -by- matrix (see Definition 3 later in the paper). While evaluating bilinear complexity is a well-known challenging problem in the computer science literature (see ), we show that the optimal recovery threshold for linear coding strategies can be approximated within a factor of of this fundamental quantity.
We establish this result by constructing a computation strategy, which achieves a recovery threshold of using a variation of the entangled polynomial code. Specifically, this coding construction exploits the fact that any matrix multiplication problem can be converted into a problem of computing the element-wise product of two arrays of length . Then we show that this augmented computing task can be optimally handled using a variation of the entangled polynomial code, and the corresponding optimal code achieves the recovery threshold .
Finally, we show that the coding construction and converse bounding techniques developed for proving the above results can also be directly extended to several other problems. For example, we show that the converse bounding technique can be extended to the problem of coded convolution, which was originally considered in . We prove that the state-of-the-art scheme we proposed in  for this problem is in fact optimal among all linear coding schemes. These techniques can also be applied in the context of fault tolerance computing, where we provide tight characterizations on the maximum number of detectable or correctable errors.
II System Model and Problem Formulation
We consider a problem of matrix multiplication with two input matrices and , for some integers , , and a sufficiently large field . (Footnote 1: Here we consider the general class of fields, which includes finite fields, the field of real numbers, and the field of complex numbers.) We are interested in computing the product in a distributed computing environment with a master node and worker nodes, where each worker can store fraction of and fraction of , based on some integer parameters , , and (see Fig. 1).
Specifically, each worker can store two coded matrices and , computed based on and respectively. Each worker can compute the product , and return it to the master. The master waits only for the results from a subset of workers before proceeding to recover the final output using certain decoding functions.
Given the above system model, we formulate the distributed matrix multiplication problem based on the following terminology: We define the computation strategy as a collection of encoding functions, denoted by
that are used by the workers to compute each and , and a class of decoding functions, denoted by
that are used by the master to recover given results from any subset of the workers. Each worker stores matrices
and the master can compute an estimate of matrix using results from a subset of the workers by computing
For any integer , we say a computation strategy is -recoverable if the master can recover given the computing results from any workers. Specifically, a computation strategy is -recoverable if for any subset of users, the final output from the master equals for all possible input values. We define the recovery threshold of a computation strategy, denoted by , as the minimum integer such that computation strategy is -recoverable.
We aim to find a computation strategy that requires the minimum possible recovery threshold and allows efficient decoding at the master. Among all possible computation strategies, we are particularly interested in a certain class of designs, referred to as the linear codes and defined as follows:
For a distributed matrix multiplication problem of computing using workers, we say a computation strategy is a linear code given parameters , , and , if there is a partitioning of the input matrices and where each matrix is divided into the following submatrices of equal sizes
such that the encoding functions of each worker can be written as
for some tensors and , and the decoding function given each subset can be written as (Footnote 2: Here denotes the master’s estimate of the subblock of that corresponds to .)
for some tensor . For brevity, we denote the set of linear codes as .
The major advantage of linear codes is that they guarantee both the encoding and the decoding complexities of the scheme scale linearly with respect to the size of the input matrices. Furthermore, as we have proved in , linear codes are optimal for . Given the above terminology, we define the following concept.
For a distributed matrix multiplication problem of computing using workers, we define the optimum linear recovery threshold as a function of the problem parameters , , , and , denoted by , as the minimum achievable recovery threshold among all linear codes. Specifically,
Our goal is to characterize the optimum linear recovery threshold , and to find computation strategies to achieve such optimum threshold. Note that if the number of workers is too small, obviously no valid computation strategy exists even without requiring straggler tolerance. Hence, in the rest of the paper, we only consider the meaningful case where is large enough to support at least one valid computation strategy. Equivalently, has to be at least the bilinear complexity of multiplying an -by- matrix and a -by- matrix.
We are also interested in characterizing the minimum recovery threshold achievable using general coding strategies (including non-linear codes). Similar to , we define this value as the optimum recovery threshold and denote it by .
III Main Results
We state our main results in the following theorems:
For a distributed matrix multiplication problem of computing using workers, with parameters , , and , the following recovery threshold can be achieved by a linear code, referred to as the entangled polynomial code. (Footnote 3: For , we define .)
Furthermore, the entangled polynomial code can be decoded at the master node with at most the complexity of polynomial interpolation given points.
Compared to some other possible approaches, our proposed entangled polynomial code provides orderwise improvement in the recovery threshold (see Fig. 2). One conventional approach (referred to as the uncoded repetition scheme) is to let each worker store and multiply uncoded submatrices. With the additional computation redundancy through repetition, the scheme can robustly tolerate some stragglers. However, its recovery threshold grows linearly with respect to the number of workers. Another approach is to let each worker store two random linear combinations of the input submatrices (referred to as the random linear code). With high probability, this achieves recovery threshold, which does not scale with . However, to calculate , we need the result of sub-matrix multiplications. Indeed, the lack of structure in the random coding forces the system to wait for times more than what is essentially needed. One surprising aspect of the proposed entangled polynomial code is that, due to its particular structure which aligns the workers’ computations with the master’s need, it avoids unnecessary multiplications of submatrices. As a result, it achieves a recovery threshold that does not scale with , and is orderwise smaller than that of the random linear code. Furthermore, it allows efficient decoding at the master, which requires at most an almost linear complexity.
There have been several works in prior literature investigating the case [7, 13, 10]. For this special case, the entangled polynomial code reduces to our previously proposed polynomial code, which achieves the optimum recovery threshold
and orderwise improves upon other designs. On the other hand, there has been some investigation on matrix-by-vector type multiplication [7, 8], which can be viewed as the special case of or in our proposed problem. The short-MDS code (or short-dot) has been proposed, achieving a recovery threshold of , which scales linearly with . Our proposed entangled polynomial code also strictly and orderwise improves upon that (see Fig. 2).
Our second result is the optimality of the entangled polynomial code when or . Specifically, we prove that the entangled polynomial code is optimal in this scenario among all linear codes. Furthermore, if the base field is finite, it also achieves the optimum recovery threshold within a factor of , when non-linear coding strategies are taken into account.
For a distributed matrix multiplication problem of computing using workers, with parameters , , and , if or , we have
Moreover, if the base field is finite,
We prove Theorem 2 by first exploiting the algebraic structure of matrix multiplication to develop a linear algebraic converse for equation (11), and then constructing an information theoretic converse to prove inequality (12). The linear algebraic converse only relies on two properties of the matrix multiplication operation: 1) bilinearity, and 2) uniqueness of zero element. This technique can be extended to any other bilinear operations with similar properties, such as convolution, as mentioned later (see Theorem 4). On the other hand, the information theoretic converse is obtained through a cut-set type argument, which allows a lower bound on the recovery thresholds even for non-linear codes.
Our final result on the main problem is characterizing the optimum linear recovery threshold within a factor of for all possible , , , and , using the fundamental concept of bilinear complexity :
We define the bilinear complexity of multiplying an -by- matrix and a -by- matrix, denoted by , as the minimum number of element-wise multiplications required to complete such an operation. Rigorously, denotes the minimum integer , such that we can find tensors , , and , satisfying
for any input matrices , .
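As a concrete illustration of bilinear complexity, the following Python sketch uses Strassen's well-known identities, which compute a 2-by-2 matrix product with 7 scalar multiplications instead of the naive 8. It therefore shows that the bilinear complexity of the 2-by-2 by 2-by-2 case is at most 7. The function name `strassen_2x2` is hypothetical, and the sketch computes the ordinary product of two matrices rather than the transposed form used elsewhere in this paper.

```python
def strassen_2x2(a, b):
    """Multiply two 2x2 matrices with 7 scalar multiplications (Strassen)."""
    (a11, a12), (a21, a22) = a
    (b11, b12), (b21, b22) = b
    # Each product multiplies one linear combination of entries of `a`
    # by one linear combination of entries of `b`.
    m1 = (a11 + a22) * (b11 + b22)
    m2 = (a21 + a22) * b11
    m3 = a11 * (b12 - b22)
    m4 = a22 * (b21 - b11)
    m5 = (a11 + a12) * b22
    m6 = (a21 - a11) * (b11 + b12)
    m7 = (a12 - a22) * (b21 + b22)
    products = [m1, m2, m3, m4, m5, m6, m7]
    # Every output entry is a fixed linear combination of the 7 products.
    c = [[m1 + m4 - m5 + m7, m3 + m5],
         [m2 + m4, m1 - m2 + m3 + m6]]
    return c, len(products)


def naive_2x2(a, b):
    """Reference product using the 8 multiplications of the textbook formula."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
product, multiplications = strassen_2x2(a, b)
print(product == naive_2x2(a, b), multiplications)
```

The three groups of coefficients in the sketch (those forming the left factors, the right factors, and the output combinations) play the role of the three tensors in the definition above.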
Using this concept, we state our result as follows.
For a distributed matrix multiplication problem of computing using workers, with parameters , , and , the optimum linear recovery threshold is characterized by
where denotes the bilinear complexity of multiplying an -by- matrix and a -by- matrix.
The key idea of proving Theorem 3 is to first convert any matrix multiplication problem into computing the element-wise product of two vectors of length . Then we show that an optimal computation strategy can be developed for this augmented problem using similar ideas we developed for the entangled polynomial code. Similarly, a factor-of- characterization can also be obtained for non-linear codes, as discussed in Section VI.
The techniques we developed in this paper can also be extended to several other problems, such as coded convolution  and fault tolerance computing [14, 15], leading to tight characterizations. For coded convolution, we present our result in the following theorem.
For the distributed convolution problem of computing using workers that can each store fraction of and fraction of , the optimum recovery threshold that can be achieved using linear codes, denoted by , is exactly characterized by the following equation
Theorem 4 is proved based on our previously developed coded computing scheme for convolution, which is a variation of the polynomial code . As mentioned before, we extend the proof idea of Theorem 2 to prove the matching converse. This theorem proves the optimality of the computation scheme in  among all computation strategies where the encoding functions are linear. For detailed problem formulation and proof, see Appendix A.
Our second extension is in the fault tolerance computing setting, where, unlike the straggler effects we studied in this paper, arbitrary errors can be injected into the computation, and the master has no information about which subset of workers are returning errors. We show that the techniques we developed for straggler mitigation can also be applied in this setting to improve robustness against computing failures, and that the optimality of any encoding function in terms of recovery threshold is also preserved when applied in the fault tolerance computing setting. As an example, we present the following theorem, demonstrating this connection.
For a distributed matrix multiplication problem of computing using workers, with parameters , , and , if , or , or , the entangled polynomial code can detect up to
errors, and correct up to
errors. This cannot be improved using any other linear coding strategy.
The proof idea for Theorem 5 is to connect the straggler mitigation problem and the fault tolerance problem by extending the concept of Hamming distance to coded computing. Specifically, we map the straggler mitigation problem to the problem of correcting erasure errors, and the fault tolerance problem to the problem of correcting arbitrary errors. The solutions to these two communication problems are deeply connected by the Hamming distance, and we show that this result extends to coded computing (see Lemma 3 in Appendix B). Since the concept of Hamming distance is not exclusively defined for linear codes, this connection also holds for arbitrary computation strategies. Furthermore, this approach can be easily extended to the hybrid settings where both stragglers and computing errors exist, and similar results can be proved. The detailed formulation and proof can be found in Appendix B.
IV Entangled Polynomial Code
In this section, we prove Theorem 1 by formally describing the entangled polynomial code and its decoding procedure. We start with an illustrating example.
IV-A Illustrating Example
Consider a distributed matrix multiplication task of computing using workers that can each store half of the rows (i.e., and ). We evenly divide each input matrix along the row side into 2 submatrices:
Given this notation, we essentially want to compute
A naive computation strategy is to let the workers compute each uncodedly with repetition. Specifically we can let workers compute and workers compute . However, this approach can only robustly tolerate straggler, achieving a recovery threshold of . Another naive approach is to use random linear codes, i.e., let each worker store a random linear combination of , , and a combination of , . However, the resulting computation result of each worker is a random linear combination of variables , , , and , which also results in a recovery threshold of .
Surprisingly, there is a simple computation strategy for this example that achieves the optimum linear recovery threshold of . The main idea is to instead inject structured redundancy tailored to the matrix multiplication operation. We present this proposed strategy as follows:
Suppose elements of are in . Let each worker store the following two coded submatrices:
To prove that this design gives a recovery threshold of , we need to find a valid decoding function for any subset of workers. We demonstrate this decodability through a representative scenario, where the master receives the computation results from workers , , and , as shown in Figure 3. The decodability for the other possible scenarios can be proved similarly.
According to the designed computation strategy, we have
The coefficient matrix in the above equation is a Vandermonde matrix, which is invertible because its parameters are distinct in . So one decoding approach is to directly invert equation (21), of which the returned result includes the needed matrix . This proves the decodability.
However, as we will explain in the general coding design, directly computing this inverse problem using the classical inversion algorithm might be expensive in some more general cases. Quite interestingly, because of the algebraic structure we designed for the computation strategy (i.e., equation (20)), the decoding process can be viewed as a polynomial interpolation problem (or equivalently, decoding a Reed-Solomon code).
Specifically, in this example each worker returns
which is essentially the value of the following polynomial at point :
Hence, recovering using computation results from workers is equivalent to recovering the linear term coefficient of a quadratic function given its values at points. Later in this section, we will show that by mapping the decoding process to polynomial interpolation, we can achieve almost-linear decoding complexity even for arbitrary parameter values.
IV-B General Coding Design
Now we present the entangled polynomial code, which achieves a recovery threshold for any , , and as stated in Theorem 1. (Footnote 4: For , a recovery threshold of is achievable by definition. Hence we focus on the case where .) First of all, we evenly divide each input matrix into and submatrices according to equations (5) and (6). We then assign each worker an element in , denoted by , and make sure that all ’s are distinct. Under this setting, we define the following class of computation strategies.
Given parameters , we define the -polynomial code as
In an -polynomial code, each worker essentially evaluates a polynomial whose coefficients are fixed linear combinations of the products . Specifically, each worker returns
Consequently, when the master receives results from enough workers, it can recover all these linear combinations using polynomial interpolation. Recall that we aim to recover
where each submatrix is also a fixed linear combination of these products. We design the values of parameters such that all these linear combinations appear in (25) separately as coefficients of terms of different degrees. Furthermore, we want to minimize the degree of the polynomial , in order to reduce the recovery threshold.
One design satisfying these properties is , i.e.,
Hence, each worker returns the value of the following degree polynomial at point :
where each is exactly the coefficient of the -th degree term. Since all ’s are selected to be distinct, recovering given results from any workers is essentially interpolating using distinct points. Because the degree of is , the output can always be uniquely decoded.
In terms of complexity, this decoding process can be viewed as interpolating a degree polynomial for times. It is well known that polynomial interpolation of degree has a complexity of . Therefore, decoding the entangled polynomial code requires at most a complexity of , which is almost linear in the input size of the decoder ( elements). This complexity can be reduced by simply swapping in any faster polynomial interpolation algorithm or Reed-Solomon decoding algorithm. In addition, this decoding complexity can be further improved by exploiting the fact that only a subset of the coefficients are needed for recovering the output matrix.
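The general design can be sketched in Python for scalar blocks. The exponent assignment below is an assumption, since the formula is not shown in the text: block (j, k) of the first input is weighted by x^(j + k·p) and block (j, k) of the second input by x^(p − 1 − j + k·p·m). With this choice, output block (k, k') appears alone as the coefficient of degree p − 1 + k·p + k'·p·m, and the product polynomial has degree p·m·n + p − 2, so p·m·n + p − 1 results suffice. The names `interpolate`, `encode`, and `decode` are hypothetical, and exact rationals stand in for the field.

```python
from fractions import Fraction


def interpolate(xs, ys):
    """Coefficients (lowest degree first) of the polynomial through (xs, ys)."""
    n = len(xs)
    coeffs = [Fraction(0)] * n
    for i in range(n):
        basis, denom = [Fraction(1)], Fraction(1)
        for j in range(n):
            if j == i:
                continue
            # Multiply the running basis polynomial by (x - xs[j]).
            shifted = [Fraction(0)] + basis
            scaled = [xs[j] * c for c in basis] + [Fraction(0)]
            basis = [s - t for s, t in zip(shifted, scaled)]
            denom *= xs[i] - xs[j]
        for k, c in enumerate(basis):
            coeffs[k] += ys[i] * c / denom
    return coeffs


def encode(A, B, x, p, m, n):
    """Assumed exponents: one coded scalar per input for the worker at x."""
    a = sum(A[j][k] * x ** (j + k * p) for j in range(p) for k in range(m))
    b = sum(B[j][k] * x ** (p - 1 - j + k * p * m)
            for j in range(p) for k in range(n))
    return a, b


def decode(xs, ys, p, m, n):
    """Interpolate the product polynomial and read off the output blocks."""
    coeffs = interpolate(xs, ys)
    return [[coeffs[p - 1 + k * p + k2 * p * m] for k2 in range(n)]
            for k in range(m)]


p, m, n = 2, 2, 2
A = [[1, 2], [3, 4]]  # p-by-m grid of scalar blocks
B = [[5, 6], [7, 8]]  # p-by-n grid of scalar blocks
expected = [[sum(A[j][k] * B[j][k2] for j in range(p)) for k2 in range(n)]
            for k in range(m)]

threshold = p * m * n + p - 1
points = list(range(1, 13))  # 12 workers with distinct points
products = []
for x in points:
    a_coded, b_coded = encode(A, B, x, p, m, n)
    products.append(a_coded * b_coded)  # what each worker returns

decoded_first = decode(points[:threshold], products[:threshold], p, m, n)
decoded_last = decode(points[-threshold:], products[-threshold:], p, m, n)
print(decoded_first == expected, decoded_last == expected)
```

The quadratic-time Lagrange routine is used here only for readability; as noted above, any faster interpolation or Reed-Solomon decoding algorithm could be swapped in.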
V-A Matching Converses for Linear Codes
To prove equation (11), we start by developing a converse bound on recovery threshold for general parameter values, then we specialize it to the settings where or . We state this converse bound in the following lemma:
For a distributed matrix multiplication problem with parameters , , , and , we have
To prove Lemma 1, we only need to consider the following two scenarios:
(1) If , then (29) is trivial.
(2) If , then we essentially need to show that for any parameter values , , , and , we have . By definition, if such a linear recovery threshold is achievable, we can find a computation strategy, i.e., tensors , , and a class of decoding functions , such that
for any input matrices and , and for any subset of workers.
We choose the values of and , such that each and satisfies
for some matrices , , and constants , satisfying . Consequently, we have
for all possible values of , , and .
Fixing the value , we can view each subtensor as a vector of length , and each subtensor as a vector of length . For brevity, we denote each such vector by and respectively. Similarly, we can also view matrices and as vectors of length and , and we denote these vectors by and . Furthermore, we can define dot products within these vector spaces following the conventions. Using these notations, (33) can be written as
Given the above definitions, we now prove that within each subset of size , the vectors span the space . Essentially, we need to prove that for a given subset , there does not exist a non-zero such that for all . Assume, to the contrary, that such an exists so that is always , then the LHS of (34) becomes a fixed value. On the other hand, since is non-zero, we can always find different values of such that is variable. Recalling (31) and (32), the RHS of (34) cannot be fixed if is variable, which results in a contradiction.
Now we use this conclusion to prove (29). For any fixed , let be a subset of indices such that form a basis. Recall that we are considering the case where , meaning that we can find a worker . For convenience, we define , and . Obviously, , and . Hence, it suffices to prove that , which only requires that forms a basis of . Equivalently, we only need to prove that any satisfying for any must satisfy .
Recall that forms a basis, so we can find a matrix for each such that . (Footnote 5: Here denotes the discrete delta function, i.e., , and for .) From elementary linear algebra, the vectors also form a basis of . Correspondingly, their matrix version form a basis of .
For any , we define . Note that , so equation (34) also holds for any subset . Note that if , and satisfies for any , then the corresponding LHS of (34) remains fixed. As a result, must also be fixed. Similar to the above discussion, this requires that the value of be fixed. This value has to be because satisfies our stated condition.
Now we have proved that any satisfying must also satisfy for . Because form a basis of , such acting on through matrix product has to be the zero operator, so . As mentioned above, this results in , which completes the proof of Lemma 1 and equation (11).
Note that in the above proof, we never used the fact that the decoding functions are linear. Hence, the converse does not require the linearity of the decoder. This fact will be used later in our discussion regarding the fault tolerance computing in Appendix B.
V-B Information Theoretic Converse for Nonlinear Codes
For a distributed matrix multiplication problem with parameters , , , and , if the base field is finite, we have
Without loss of generality, we assume , and aim to prove . Specifically, we need to show that, for any computation strategy and any subset of workers, if the master can recover given the results from workers in , then we must have .
Suppose the condition in the above statement holds. Given each input , the workers can compute using the encoding functions. On the other hand, for any possible value of , the workers can compute based on and . Hence, if we view as a random variable and as random functions of , we have the following Markov chain:
Because the master can decode as a function of , it can also obtain as a random function if the values of for all are known. This random function is exactly a linear map from to , which returns the matrix product of and the input. For brevity, we denote this function by . Consequently, we have the following extended Markov chain
Given the basic property of matrix multiplication, there is a natural one-to-one correspondence between and . Hence, the input matrix can be exactly determined from , i.e., . Using the data processing inequality, we also have . Now let be uniformly randomly sampled from , and we have bits. On the other hand, each consists of elements, which has an entropy of at most bits. Consequently, we have
VI Factor of Characterization of Optimum Linear Recovery Threshold
In this section, we provide the proof of Theorem 3. Note that the converse bound in this theorem is a natural consequence of Definition 3. (Footnote 6: Rigorously, it also requires the linear independence of the ’s, which can be easily proved.) Hence, we focus on proving the upper bound by developing an achievability scheme. Specifically, we need to provide a computation strategy that achieves a recovery threshold of at most for all possible values of , , , and . As we have explained in Section IV-B, a recovery threshold of is always achievable, so we only need to focus on the scenario where .
Our proof consists of steps. In Step , we show that any matrix multiplication is essentially computing the element-wise product of two vectors of length . Then in Step , we show that we can find an optimal computation strategy for this augmented computing task using similar ideas we developed for entangled polynomial code, which only requires a recovery threshold of .
For Step , recall the definition of bilinear complexity in Section III. We can always find tensors , , and such that any block of the final output can be computed as
where and are linear combinations of the blocks of and , defined as
This essentially converts matrix multiplication into a problem of computing the element-wise product of two “vectors” and , each of length . Specifically, the master only needs for decoding the final output.
Now in Step , we want to find the optimum computation strategy for this augmented computation task. Specifically, given two arbitrary vectors and of length , we want to achieve a recovery threshold of for computing their element-wise product using workers, each of which can multiply two coded vectors of length .
This problem can be solved using similar techniques we developed for entangled polynomial code. The main coding idea is to first view the elements in each vector as values of a degree polynomial at different points. Specifically, given distinct elements in the field , denoted by , we find polynomials and of degree , whose coefficients are matrices, such that
Recall that we want to recover , which is essentially recovering the values of the degree polynomial at these points. Earlier in this paper, we already developed a coding structure that allows us to recover polynomials of this form. We now reuse the idea in this construction.
Let , , …, be distinct elements of . We let each worker store
which are linear combinations of the input submatrices. After computing the product, each worker essentially evaluates . Hence, from the results of any workers, the master can recover , which has degree , and proceed with decoding the output matrix . This construction achieves a recovery threshold of , which proves the upper bound in Theorem 3.
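The second step can be sketched in Python for scalar entries. The sketch assumes vectors of length R, interpolation nodes 0, 1, ..., R − 1, and worker points chosen arbitrarily among distinct values; all names and sample values are hypothetical. Each input vector is viewed as the values of a polynomial of degree R − 1, so the product polynomial has degree 2R − 2 and is determined by any 2R − 1 worker results. Evaluating it back at the nodes yields the element-wise product.

```python
from fractions import Fraction


def interpolate(xs, ys):
    """Coefficients (lowest degree first) of the polynomial through (xs, ys)."""
    n = len(xs)
    coeffs = [Fraction(0)] * n
    for i in range(n):
        basis, denom = [Fraction(1)], Fraction(1)
        for j in range(n):
            if j == i:
                continue
            # Multiply the running basis polynomial by (x - xs[j]).
            shifted = [Fraction(0)] + basis
            scaled = [xs[j] * c for c in basis] + [Fraction(0)]
            basis = [s - t for s, t in zip(shifted, scaled)]
            denom *= xs[i] - xs[j]
        for k, c in enumerate(basis):
            coeffs[k] += ys[i] * c / denom
    return coeffs


def evaluate(coeffs, x):
    """Evaluate a polynomial (lowest degree first) at x using Horner's rule."""
    acc = Fraction(0)
    for c in reversed(coeffs):
        acc = acc * x + c
    return acc


a = [2, -1, 5]
b = [3, 4, -2]
R = len(a)
nodes = list(range(R))  # points where the polynomials take the vector values

f = interpolate(nodes, a)  # degree R - 1, f(nodes[i]) == a[i]
g = interpolate(nodes, b)  # degree R - 1, g(nodes[i]) == b[i]

worker_points = [10, 11, 12, 13, 14, 15, 16]  # 7 workers, distinct points
# Each worker stores f(y) and g(y), which are linear in the entries of a and b,
# and returns their product, i.e. the value of h = f * g at y.
returned = [evaluate(f, y) * evaluate(g, y) for y in worker_points]

threshold = 2 * R - 1
fastest = [6, 0, 3, 5, 2]  # any `threshold` workers may finish first
h = interpolate([worker_points[i] for i in fastest],
                [returned[i] for i in fastest])
recovered = [evaluate(h, x) for x in nodes]
print(recovered == [x * y for x, y in zip(a, b)])
```

In the construction described above the entries are matrices rather than scalars; the same interpolation would then be applied entrywise.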
The computation strategy we developed in Step provides a tight upper bound on the characterization of the optimum linear recovery threshold for computing element-wise product of two arbitrary vectors using machines. Its optimality naturally follows from Theorem 2, given that the element-wise product of two vectors contains all the information needed to compute the dot-product, which is a special case of matrix multiplication. We formally state this result in the following corollary.
Consider the problem of computing the element-wise product of two vectors of length using workers, each of which can store a linearly coded element of each vector and return their product to the master. The optimum linear recovery threshold, denoted as , is given by the following equation: (Footnote 7: Obviously, we need to guarantee the existence of a valid computation strategy.)
Note that Step of this proof does not require the computation strategy to be linear. Hence, using exactly the same coding approach, we can easily extend this result to non-linear codes, and prove a similar factor-of- characterization for the optimum recovery threshold , formally stated in the following corollary.
For a distributed matrix multiplication problem with parameters , , and , let denote the minimum number of workers such that a valid (possibly non-linear) computation strategy exists. Then for all possible values of , we have
VII Concluding Remarks
In this paper, we studied the coded distributed matrix multiplication problem and proposed the entangled polynomial code, which allows optimal straggler mitigation and orderwise improves upon the prior art. Based on our proposed coding idea, we proved a fundamental connection between the optimum linear recovery threshold and the bilinear complexity, which characterizes the optimum linear recovery threshold within a factor of for all possible parameter values. The techniques developed in this paper can be directly applied to many other problems, including coded convolution and fault tolerance computing, providing matching characterizations.
One interesting follow-up direction is to find better characterization of the optimum linear recovery threshold. Although this problem is completely solved for cases including , , or , there is room for improvement in general cases. Another interesting question is whether there exist non-linear coding strategies that strictly out-perform linear codes, especially for the important case where the input matrices are large (