I Introduction
Linear operations, often represented by matrix multiplication, are key tools in many applications such as optimization and machine learning. Many such applications require processing large-scale matrices. For example, in deep neural networks, the convolution layers, which are operations based on matrix multiplication, account for a large fraction of the computational time
[1]. To increase the accuracy of learning, we increase the size of the model by using more layers and more neurons in each layer. This increases the computational complexity and the overall training time. Such heavy computation cannot be completed on a single machine. An inevitable solution to overcome this challenge is to distribute the computation over several machines
[2, 3]. In particular, consider a master-worker setting for distributed computing, in which the master node has access to the two input matrices. The master partitions the task of matrix multiplication into smaller subtasks and assigns each subtask to one of the worker nodes to be executed. In this setting, the execution time is dominated by the speed of the slowest worker nodes, or the stragglers. This is one of the main challenges in distributed computing [4]. Conventionally, the effect of stragglers has been mitigated by using redundancy in computing. This means that every task is executed on more than one machine, and the results are fetched from the fastest ones. More recently, it has been shown that coding can be more effective in coping with stragglers in terms of minimizing the recovery threshold, defined as the total number of worker nodes that we need to wait for to be able to recover the final result [5, 6, 7, 8, 9, 10, 11, 12, 13]. In [5, 6], it is suggested that in matrix multiplication, one or both matrices be coded separately using maximum distance separable (MDS) codes. In [7], polynomial codes have been proposed to code each matrix, such that the result of the multiplications across worker nodes is also MDS-coded. The short-dot technique has been presented in [10], where coded redundancy is added to the computation while keeping the coded matrices sparse, thus reducing the computational load. In [8], an extension of the polynomial codes, known as entangled polynomial codes, has been proposed that admits flexible partitioning of each matrix and minimizes the number of unwanted computations. In [14], coded sparse matrix multiplication is proposed for the case where the result of the multiplication is sparse. In [11, 12], coded schemes have been used to develop multi-party computation schemes that calculate arbitrary polynomials of massive matrices while preserving the privacy of the data.
In [15], a universal coding technique, the Lagrange code, is developed to code across several parallel computations of an arbitrary polynomial function, without communication across worker nodes. To derive an approximate result of a matrix multiplication on a single server, it is proposed in [16] to use the count-sketch method, concatenated with a fast Fourier transform, to reduce complexity. For a review of count-sketch methods, see [17, 18], or Subsection II-A on preliminaries in this paper. In [13], the OverSketch method has been proposed, where some extra count-sketches are used, as redundancy, to mitigate stragglers in a distributed setting.
In this paper, we propose a distributed straggler-resistant computation scheme to achieve an approximation of the multiplication of two large matrices A and B (see Fig. 1). To exploit the fact that only an approximate result is required in order to reduce the recovery threshold, we need to use some sort of pre-compression. However, compression inherently involves some randomness that would destroy the structure of the matrices. On the other hand, exploiting the structure of the matrices is crucial to reducing the recovery threshold. In this paper, we use count-sketch compression on the rows of A and the columns of B, and we use structured codes on the columns of A and the rows of B. This arrangement allows us to enjoy the benefits of both in reducing the recovery threshold. To improve the overall accuracy, we need to use multiple count-sketches. Another layer of structured codes allows us to keep the multiple count-sketches independent. This independence is used to prove a theoretical guarantee on the performance of the final result and to establish an achievable recovery threshold.
Notation: For n ∈ ℕ, the notation [n] represents the set {1, 2, …, n}. The cardinality of a set S is denoted by |S|. In addition, E[X] refers to the expected value of the random variable X. The ith element of a vector a is denoted by a_i, and the (i, j)th entry of a matrix A is denoted by A[i, j].
II Problem Formulation, Notation and Preliminaries
II-A Preliminaries
In this subsection, we review some preliminaries.
Definition 1.
A set of random variables X_1, …, X_n
is k-wise independent if for any subset of k of these random variables X_{i_1}, …, X_{i_k}, and for any values x_1, …, x_k, we have [19]
(1) Pr(X_{i_1} = x_1, …, X_{i_k} = x_k) = Pr(X_{i_1} = x_1) ⋯ Pr(X_{i_k} = x_k)
Definition 2.
A hash function is a function that maps a universe U into a range {0, 1, …, b − 1} for some b ∈ ℕ, i.e., h : U → {0, 1, …, b − 1}. In other words, a hash function operates as a method in which items from the universe are placed into b bins.
The simplest family of hash functions is the family of completely random hash functions, which can be analyzed using a model known as the balls-and-bins model. The hash values h(x_1), …, h(x_n) are considered independent and uniform over the range of the hash function for any collection of data x_1, …, x_n. Since this class of hash functions is expensive in terms of computation and storage cost, it is not very useful in practice. Another family of hash functions is known as the universal family, which satisfies some provable performance guarantees [19].
Definition 3.
The hash family H is defined as a universal family if for any hash function h : U → {0, 1, …, b − 1}, chosen uniformly at random from H, and for any distinct x_1, x_2 ∈ U, we have
(2) Pr(h(x_1) = h(x_2)) ≤ 1/b
In addition, H is defined as a strongly universal family if for any distinct x_1, x_2 ∈ U and any values y_1, y_2 ∈ {0, 1, …, b − 1} we have
(3) Pr(h(x_1) = y_1, h(x_2) = y_2) = 1/b²
This definition implies that for any fixed x ∈ U, h(x)
is uniformly distributed over
{0, 1, …, b − 1}. Also, if h is chosen uniformly at random from a strongly universal family H, then for any distinct x_1, x_2, the values h(x_1) and h(x_2) are independent [19]. Remark 1:
The values h(x_1), …, h(x_n) are 2-wise independent if the hash function h is chosen from a strongly universal family.
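As a concrete illustration (a sketch with hypothetical parameter names, not taken from the paper), the classic Carter-Wegman construction h(x) = ((αx + β) mod p) mod b, with p prime and α, β drawn at random, yields a family that is pairwise independent over {0, …, p − 1}, up to the small bias introduced by the final reduction mod b:

```python
import random

def make_hash(b, p=(1 << 61) - 1, rng=random):
    """Sample h(x) = ((alpha*x + beta) mod p) mod b from the
    Carter-Wegman family; for prime p this family is pairwise
    independent over {0, ..., p-1} (up to the bias of the final mod b)."""
    alpha = rng.randrange(1, p)  # alpha != 0
    beta = rng.randrange(0, p)
    return lambda x: ((alpha * x + beta) % p) % b

h = make_hash(b=8)                # place items into 8 bins
bins = [h(x) for x in range(20)]  # 20 items hashed
assert all(0 <= v < 8 for v in bins)
```

Here p = 2^61 − 1 is a Mersenne prime, a common choice because the modular reduction is cheap.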
Definition 4.
The count-sketch [17, 18] is a method for representing a compact form of data which maps an n-dimensional vector a to a b-dimensional vector c, where b < n. The count-sketch of a can be defined using an iterative process, initialized with the vector c set entirely to zero, as follows: in the jth iteration (j ∈ [n]) we have
(4) c(h(j)) ← c(h(j)) + s(j) a_j
where s : [n] → {−1, +1} is a sign function, a_j is the jth entry of a, and h is a hash function that computes the hash of j.
One count-sketch of a is created by using one hash function and one sign function. To improve the accuracy of this approximation, more than one hash function can be used (say d hash functions). To create d count-sketches, we need d pairwise-independent sign functions and d 2-wise independent hash functions. The output of these count-sketches is a d × b matrix. a_i, the ith entry of a, for each i, can be approximately recovered by taking the median of the values of some entries of this matrix. For details, see Algorithm 1.
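The sketching and median-recovery steps can be sketched as follows (a minimal illustration with hypothetical helper names; the paper's Algorithm 1 is not reproduced here, and simple Carter-Wegman hashes stand in for the 2-wise independent functions):

```python
import random
from statistics import median

P = (1 << 61) - 1  # a large prime for the hash/sign families

def count_sketch(a, d, b, seed=0):
    """Build d count-sketches (rows) of width b for the vector a,
    using per-row hash and sign functions of the form (alpha*j + beta) mod P."""
    rng = random.Random(seed)
    params = [tuple(rng.randrange(1, P) for _ in range(4)) for _ in range(d)]
    C = [[0.0] * b for _ in range(d)]
    for r, (ah, bh, asg, bsg) in enumerate(params):
        for j, aj in enumerate(a):
            h = ((ah * j + bh) % P) % b                      # bin of entry j
            s = 1 if ((asg * j + bsg) % P) % 2 == 0 else -1  # sign of entry j
            C[r][h] += s * aj
    return C, params

def estimate(C, params, b, i):
    """Estimate a_i as the median over rows of sign(i) * C[r][h(i)]."""
    vals = []
    for row, (ah, bh, asg, bsg) in zip(C, params):
        h = ((ah * i + bh) % P) % b
        s = 1 if ((asg * i + bsg) % P) % 2 == 0 else -1
        vals.append(s * row[h])
    return median(vals)

# a sparse 1000-dimensional vector, sketched into a 5 x 32 matrix
a = [0.0] * 1000
a[7], a[123], a[900] = 5.0, -3.0, 2.0
C, params = count_sketch(a, d=5, b=32)
est = estimate(C, params, 32, 7)  # close to a[7] = 5.0 with high probability
```

Because the vector is sparse, collisions in a bucket are rare, and the median over the d rows suppresses the few rows where a collision occurs.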
Theorem 1.
If d = Ω(log(1/δ)) and b = Ω(1/ε²), then for the ith entry of the input vector a, i ∈ [n], we have [17, 20]
(5) E[ã_i] = a_i
(6) Var(ã_i) ≤ ‖a‖₂²/b
and
(7) Pr(|ã_i − a_i| ≥ ε‖a‖₂) ≤ δ
where ã_i
is an estimation of
a_i. Remark 2:
Suppose the length-n vector a has only a few nonzero entries. Then, from Corollary 2, one can see that an approximated version of a with the desired accuracy can be achieved using count-sketches.
Theorem 3.
Remark 3:
According to Theorem 3, if the length-n vector a has only a few nonzero entries, then the output of the count-sketch method computes a exactly with high probability.
Definition 5.
The count-sketch of an n-dimensional vector a can be represented by a polynomial, named the sketch polynomial, as
(10) p(x) = Σ_{i=1}^{n} s(i) a_i x^{h(i)}
where h is the 2-wise independent hash function and s is the 2-wise independent sign function used to develop the count-sketch.
Let us assume that we use h_1 and s_1 to count-sketch a vector a, and h_2 and s_2 to count-sketch a vector c, represented by the sketch polynomials
(11) p_a(x) = Σ_i s_1(i) a_i x^{h_1(i)}
(12) p_c(x) = Σ_j s_2(j) c_j x^{h_2(j)}
then p_a(x) p_c(x) represents a sketch polynomial for the matrix a cᵀ, as
(13) p_a(x) p_c(x) = Σ_{i,j} s(i, j) a_i c_j x^{h(i,j)}
where
(14) s(i, j) = s_1(i) s_2(j)
Also, the hash function on pairs (i, j) is
(15) h(i, j) = h_1(i) + h_2(j)
Remark 4:
Recall that h_1(i) and h_2(j) are two independent random variables with some distributions. Thus the distribution of h(i, j) = h_1(i) + h_2(j) over its output range is the convolution of the distributions of h_1(i) and h_2(j).
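These identities can be checked numerically. The following sketch (with made-up toy hash and sign tables standing in for the 2-wise independent h_1, h_2, s_1, s_2) verifies that multiplying the two sketch polynomials, i.e., convolving their coefficient vectors, gives exactly the count-sketch of the outer product under the pair hash h_1(i) + h_2(j) and pair sign s_1(i) s_2(j):

```python
import numpy as np

rng = np.random.default_rng(0)
n, b = 6, 4
a = rng.standard_normal(n)
c = rng.standard_normal(n)

# toy hash/sign tables standing in for 2-wise independent h1, h2, s1, s2
h1, h2 = rng.integers(0, b, n), rng.integers(0, b, n)
s1, s2 = rng.choice([-1, 1], n), rng.choice([-1, 1], n)

# sketch polynomials: coefficient k collects the signed entries hashed to k
p_a = np.zeros(b)
p_c = np.zeros(b)
np.add.at(p_a, h1, s1 * a)
np.add.at(p_c, h2, s2 * c)

# polynomial product = convolution of the coefficient vectors (degree up to 2b-2)
prod = np.convolve(p_a, p_c)

# direct count-sketch of the outer product a c^T with the pair hash and pair sign
direct = np.zeros(2 * b - 1)
for i in range(n):
    for j in range(n):
        direct[h1[i] + h2[j]] += s1[i] * s2[j] * a[i] * c[j]

assert np.allclose(prod, direct)
```

The convolution in the last comparison is exactly the "convolution of distributions" structure noted in Remark 4.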
II-B Problem Formulation
Consider a distributed system including a master and N worker nodes, where the master node is connected to each worker node. Assume that the master node wants to compute an approximation of the multiplication of two matrices A and B. Our goal is to compute this approximation subject to the following conditions:

Unbiasedness:
(16) 
Accuracy:
We say the result is an accurate approximation of the matrix AB if (17) holds for all entries, where ‖·‖_F is the Frobenius norm of a matrix.
Suppose A and B are partitioned into submatrices of equal size as
(18)
(19)
where the numbers of row and column partitions are positive integers. One of the constraints in this system is the limited storage of each worker. We assume that the size of the storage at each node is equal to a fixed fraction of A plus a fixed fraction of B, determined by the partitioning parameters. Assume f_i and g_i are the encoding functions that are used by the master node to compute the coded inputs for the ith worker node, i ∈ [N]. In other words, the master node encodes the input matrices and sends two coded matrices to the ith worker node as follows:
(20)
(21)
where i ∈ [N]. Each worker node multiplies its two coded matrices and sends the result back to the master node. After receiving the results from a subset of the worker nodes, the master node recovers the approximation of the original result. Let the reconstruction function be a function which operates on a subset of the workers' results and calculates the approximation as follows:
(22)
Definition 6.
The recovery threshold of a distributed system with N worker nodes and one master node is the minimum number of workers that the master node needs to wait for in order to guarantee that it can complete the computation, subject to the space constraint at each worker node and the unbiasedness and accuracy conditions. In this paper, the optimum recovery threshold is denoted by N*.
III Main Results
The following theorems state the main results of the paper.
Theorem 4.
For an accurate approximation of the multiplication of two matrices, we have
(23) 
Remark 5:
To prove Theorem 4, we propose a coding scheme that achieves the following three objectives at the same time:

It exploits the fact that only an approximate result is needed, by using count-sketches to pre-compress the input matrices. This compression is done such that it reduces the recovery threshold.

It relies on the structure of the matrix multiplication to add coded redundancy to the computation. This coding can force the system to perform some unwanted calculations. The coding is designed to reduce the number of unwanted calculations, and thus reduce the recovery threshold.

The proposed scheme, in the end, creates some independent count-sketches of AB, from which we calculate the final result and also establish the theorem. As a side product, some dependent count-sketches are also created. To minimize the recovery threshold, we need to minimize the number of these side products. We use another layer of structured code to reduce the number of these side products and thus reduce the recovery threshold.
We note that between opportunities one and two, the first one injects some randomness into the input matrices, while the second one relies on their structure. To achieve both at the same time, we use count-sketches on the rows of the first matrix and the columns of the second matrix, and use a structured code on the columns of the first matrix and the rows of the second matrix.
Remark 6:
Remark 7:
Depending on the parameters of the problem, this scheme can perform unboundedly better than the entangled polynomial code in terms of the recovery threshold. For example, for one choice of parameters, the recovery threshold of the proposed scheme is about one thousand times smaller than that of the entangled polynomial code. Recall that the entangled polynomial code is designed to calculate the exact result.
Theorem 5.
If the original result is sparse, i.e., only a small fraction of the entries of the matrix AB is nonzero, then the proposed method computes AB exactly, with high probability, with a recovery threshold of
(24) 
Remark 8:
Note that this fraction of the matrix AB is, in size, equivalent to a number of blocks of the partitioned matrix.
IV The Proposed Scheme
In the following, the goal is to compute an approximation of the matrix multiplication over a distributed system using the CodedSketch scheme.
IV-A Motivating Example
We first demonstrate the main idea of our scheme through a simple example. We then generalize this approach in the next section. Consider a distributed system with one master node and N worker nodes which aim to collaboratively compute an approximation of AB, where A and B are two matrices partitioned as follows
(25)
(26)
The result of the multiplication can be computed as a summation of four outer products, AB = Σ_i A_i B_i, where A_i and B_i are the ith column block of A and the ith row block of B, respectively. The proposed scheme is based on the following steps.
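This outer-product identity is easy to verify numerically (a generic check on a small random example, not tied to the particular block sizes of (25) and (26)):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
B = rng.standard_normal((4, 4))

# AB as the sum of four outer products: (ith column of A) x (ith row of B)
C = sum(np.outer(A[:, i], B[i, :]) for i in range(4))
assert np.allclose(C, A @ B)
```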

Step 1. The master node forms the following polynomial matrices based on the columns of A and the rows of B as follows:
(27) and
(28) Then AB can be recovered from the product of these two polynomial matrices if we have its value at seven distinct points. More precisely, each entry of the product is a 6th-degree polynomial. Let us focus on one entry of the product. Then
(29) In this expansion, one can verify that one particular coefficient equals the corresponding entry of AB. If we have the value of this polynomial at seven distinct points, then all of its coefficients
can be calculated using polynomial interpolation. In particular,
the coefficient that equals the desired entry of AB can be calculated.
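Since the exact encodings in (27) and (28) did not survive extraction, the following sketch uses a standard alignment consistent with the description above (an assumption: the four column blocks of A are encoded as coefficients of x^0, …, x^3, and the four row blocks of B in reverse order), so that the product is a 6th-degree matrix polynomial whose coefficient of x^3 equals AB, recoverable from seven evaluations:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4))
B = rng.standard_normal((4, 4))

# column blocks of A and row blocks of B: AB = sum_i A_i B_i
A_blocks = [A[:, i:i+1] for i in range(4)]   # 4 x 1 columns
B_blocks = [B[i:i+1, :] for i in range(4)]   # 1 x 4 rows

def A_poly(x):  # A(x) = sum_i A_i x^i
    return sum(Ai * x**i for i, Ai in enumerate(A_blocks))

def B_poly(x):  # B(x) = sum_j B_j x^(3-j)
    return sum(Bj * x**(3 - j) for j, Bj in enumerate(B_blocks))

# evaluate the degree-6 product at 7 distinct points (as 7 workers would)
xs = np.arange(1, 8)
evals = [A_poly(x) @ B_poly(x) for x in xs]

# interpolate each entry and read off the coefficient of x^3:
# A(x)B(x) = sum_{i,j} A_i B_j x^(i+3-j), and i + 3 - j = 3 exactly when i = j
V = np.vander(xs, 7, increasing=True)        # Vandermonde for degrees 0..6
coeffs = np.linalg.solve(V, np.stack([E.ravel() for E in evals]))
C_hat = coeffs[3].reshape(4, 4)              # coefficient of x^3

assert np.allclose(C_hat, A @ B, atol=1e-6)
```

The reverse ordering of the B blocks is what aligns all the "wanted" products A_i B_i on a single coefficient, the entangled-polynomial idea of [8].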

Step 2. To reduce the dimension of this product, the count-sketch method is used. Assume we construct three count-sketches for A. Let us assume that the sketch polynomials of the rows of A are described as:
(30) (31) (32) where the kth polynomial is the sketch polynomial of the matrix A using the kth hash function. Similarly, assume we have three count-sketches for B. To be specific, assume that the related sketch polynomials are defined as follows:
(33) (34) (35) We note that the product of a sketch polynomial of A and a sketch polynomial of B can be considered as a sketch polynomial for AB. For example, one such product can be written as
(36) where in this expansion,
(37) (38) (39) Each entry of this product is a 6th-degree polynomial in which the coefficient of a particular term is a combination of entries of the original result AB. This can be explained better as follows:
(40) Then according to (27) and (28) and the discussion that followed, we have
(41) (42) (43) In the expansion (40), the terms in (41)–(43) in particular are of interest. The reason the other coefficients are not of interest is that the coefficients we are looking for appear only in these terms. Thus, we have another count-sketch hidden in the count-sketch of AB. These three coefficients (41)–(43) form a count-sketch of the original result of the multiplication, i.e., AB. The reason that these special coefficients of (40) form a count-sketch of AB is the structure used in this scheme. This structure, which we name CodedSketch, is the concatenation of the count-sketch and the entangled polynomial code. The other independent count-sketches are formed similarly.
In the following, the computation of these sketch polynomials over a distributed system is described, and to form the required products efficiently, we use the Lagrange code proposed in [15].
Step 3. The master node creates the following polynomials, which encode the sketch polynomials of A and B using Lagrange coded computing:
(44) (45) These polynomials are linear combinations of the sketch polynomials created using different hash functions. It can be seen that evaluating them at particular points recovers the individual sketch polynomials. Since the extraction of the hidden count-sketches of AB is desired, we choose the evaluation points accordingly. A distinct number x_i is dedicated to the ith worker node, where x_i ≠ x_j if i ≠ j. Therefore, the master node sends the evaluations of the two encoded polynomials at x_i to the ith worker node.
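The Lagrange encoding used here can be sketched generically (hypothetical stand-in chunks F_k; the actual sketch-polynomial coefficients of (44) and (45) are not reproduced):

```python
import numpy as np

rng = np.random.default_rng(2)
d = 3  # three sketches, as in the example
F = [rng.standard_normal((2, 2)) for _ in range(d)]  # stand-ins for the d coded chunks

zs = np.arange(d, dtype=float)  # interpolation points z_0, ..., z_{d-1}

def lagrange_encode(F, zs, z):
    """Evaluate u(z) = sum_k F_k * l_k(z), the Lagrange combination
    satisfying u(z_k) = F_k (the heart of Lagrange coded computing [15])."""
    u = np.zeros_like(F[0])
    for k, Fk in enumerate(F):
        lk = np.prod([(z - zj) / (zs[k] - zj) for j, zj in enumerate(zs) if j != k])
        u = u + Fk * lk
    return u

# each worker i would receive u(x_i) for its own distinct x_i;
# at z = z_k the code reduces to the k-th chunk
assert np.allclose(lagrange_encode(F, zs, zs[1]), F[1])
```

Evaluations at the workers' distinct points behave like codeword symbols, which is what lets the master interpolate from any sufficiently large subset of returned results.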

Step 4. Having received the two coded matrices from the master node, the ith worker node multiplies them. Then it returns the result to the master node. The result calculated at node i can be written as
(46) By substituting the encoded polynomials in (46), one sees that each worker's result is the evaluation, at x_i, of a polynomial in whose coefficients the count-sketch results of AB are located.

Step 5. The master node can recover all of the polynomial's coefficients by receiving the computation results of any 75 worker nodes. That is because the recovery process is equivalent to interpolating a 74th-degree polynomial given its values at 75 points. After interpolation and recovery of the coefficients, a 74th-degree polynomial is available. By a suitable substitution of variables in this polynomial, a bivariate polynomial is obtained, in which we can choose particular values to calculate the three sketch polynomials of the product. According to (41), we need the coefficients of particular terms to find the hidden count-sketches of AB in the sketch polynomial of the product.

Step 6. The relevant coefficients of the three sketch polynomials are shown in Table I. To achieve an approximation of the desired entry
with lower variance, the master node takes the median of these estimations after multiplying them by the corresponding sign functions. For example, to approximate the value of one entry,
the master node does the following:
(47) In other words, the master node takes the median of the following terms:
(48)
TABLE I: The coefficients of the 1st and 2nd count-sketches, indexed by 0, 1, 2.