CodedSketch: A Coding Scheme for Distributed Computation of Approximated Matrix Multiplication

12/26/2018 ∙ by Tayyebeh Jahani-Nezhad, et al. ∙ Sharif Accelerator

In this paper, we propose CodedSketch, a distributed straggler-resistant scheme to compute an approximation of the multiplication of two massive matrices. The objective is to reduce the recovery threshold, defined as the total number of worker nodes that we need to wait for to be able to recover the final result. To exploit the fact that only an approximated result is required in order to reduce the recovery threshold, some form of pre-compression is required. However, compression inherently involves some randomness that would destroy the structure of the matrices. On the other hand, exploiting the structure of the matrices is crucial to reduce the recovery threshold. In CodedSketch, we use count-sketch, a hash-based compression scheme, on the rows of the first and columns of the second matrix, and a structured polynomial code on the columns of the first and rows of the second matrix. This arrangement allows us to exploit the gain of both in reducing the recovery threshold. To increase the accuracy of computation, multiple independent count-sketches are needed. This independence allows us to theoretically characterize the accuracy of the result and establish the recovery threshold achieved by the proposed scheme. To guarantee the independence of the resulting count-sketches in the output, while keeping its cost on the recovery threshold to a minimum, we use another layer of structured codes.

I Introduction

Linear operations, often represented by matrix multiplication, are the key techniques used in many applications such as optimization and machine learning. Many such applications require processing large-scale matrices. For example, in deep neural networks, the convolution layers, which are operations based on matrix multiplication, account for a large fraction of computational time [1]. To increase the accuracy of the learning, we increase the size of the model, by using more layers and more neurons in each layer. This increases the computation complexity and the overall training time. Such heavy computation cannot be completed on a single machine. An inevitable solution to overcome this challenge is to distribute computation over several machines [2, 3]. In particular, consider a master-worker setting for distributed computing, in which the master node has access to the information of two matrices. The master then partitions the task of matrix multiplication into some smaller sub-tasks and assigns each sub-task to one of the worker nodes to be executed. In this setting, the execution time is dominated by the speed of the slowest worker nodes, or the stragglers. This is one of the main challenges in distributed computing [4].

Conventionally, the effect of stragglers has been mitigated by using redundancy in computing. This means that every task is executed over more than one machine, and the results are fetched from the fastest ones. More recently, it has been shown that coding can be more effective in coping with stragglers in terms of minimizing the recovery threshold, defined as the total number of worker nodes that we need to wait for to be able to recover the final result [5, 6, 7, 8, 9, 10, 11, 12, 13]. In [5, 6], it is suggested that in matrix multiplication, one or both matrices are coded separately using maximum distance separable (MDS) codes. In [7], polynomial codes have been proposed to code each matrix, such that the results of the multiplications across worker nodes also become MDS coded. The short-dot technique has been presented in [10], where coded redundancy is added to the computation while keeping the coded matrices sparse, thus reducing the load of the computation. In [8], an extension of the polynomial codes, known as entangled polynomial codes, has been proposed that admits flexible partitioning of each matrix and minimizes the number of unwanted computations. In [14], coded sparse matrix multiplication is proposed for the case where the result of the multiplication is sparse. In [11, 12], coded schemes have been used to develop multi-party computation schemes to calculate arbitrary polynomials of massive matrices, while preserving privacy of the data. In [15], a universal coding technique, Lagrange Code, is developed to code across several parallel computations of an arbitrary polynomial function, without communication across worker nodes. To derive an approximate result of the matrix multiplication over a single server, in [16], it is proposed to use the count-sketch method, concatenated with fast Fourier transform to reduce complexity. To review count-sketch methods, see [17, 18], or Subsection II-A on preliminaries in this paper. In [13], the OverSketch method has been proposed where some extra count-sketches are used, as a redundancy, to mitigate stragglers in a distributed setting.

In this paper, we propose a distributed straggler-resistant computation scheme to achieve an approximation of the multiplication of two large matrices (see Fig. 1). To exploit the fact that only an approximated result is required in order to reduce the recovery threshold, we need some form of pre-compression. However, compression inherently involves some randomness that would destroy the structure of the matrices. On the other hand, exploiting the structure of the matrices is crucial to reduce the recovery threshold. In this paper, we use count-sketch compression on the rows of the first matrix and the columns of the second matrix, and we use structured codes on the columns of the first matrix and the rows of the second matrix. This arrangement allows us to enjoy the benefits of both in reducing the recovery threshold. To improve the overall accuracy, we need to use multiple count-sketches. Another layer of structured codes allows us to keep multiple count-sketches independent. This independence is used to prove a theoretical guarantee on the performance of the final result and to establish an achievable recovery threshold.

Fig. 1: Framework of distributed computation of approximated matrix multiplications.

Notation: For , the notation represents the set . The cardinality of a set is denoted by . In addition, refers to the expected value , respectively. The th element of a vector , is denoted by and the -th entry of a matrix is denoted by .

II Problem Formulation, Notation and Preliminaries

II-A Preliminaries

In this subsection, we review some preliminaries.

Definition 1.

A set of random variables is -wise independent if for any subset of these random variables, , and for any values , we have [19]

(1)
Definition 2.

A hash function is a function that maps a universe into a special range for some , i.e., . In other words, a hash function operates as a method in which items from the universe are placed into bins.

The simplest family of hash functions is the family of completely random hash functions, which can be analyzed using a model known as the balls and bins model. The hash values are considered independent and uniform over the range of the hash function for any collection of data items. Since this class of hash functions is expensive in terms of computing and storage cost, it is not very useful in practice. Another family of hash functions is known as the universal family, which satisfies some provable performance guarantees [19].

Definition 3.

The hash family is defined as a -universal family if for any hash function with , chosen uniformly at random from , and for any collection of , we have

(2)

In addition is defined as a strongly -universal family if for any values we have

(3)

This definition implies that for any fixed , is uniformly distributed over . Also, if is chosen uniformly at random from , for any distinct , the values of are independent [19].

Remark 1:

The values are -wise independent if the hash function is chosen from a strongly -universal family.
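As an illustration of a universal family, the following minimal sketch uses the classical construction h(x) = ((a·x + c) mod p) mod b. This particular family, the prime `P`, the bin count `B`, and the helper names are assumptions made for the example and are not taken from this paper. The family is known to satisfy the usual pairwise collision bound of at most 1/b for two distinct keys; the sketch does not claim that it is strongly universal. The bound is checked exhaustively over every member of the family.

```python
import random

P = 17  # a small prime, larger than every key we hash (illustrative choice)
B = 4   # number of bins

def make_hash(a, c, p=P, b=B):
    """Return h(x) = ((a*x + c) mod p) mod b, one member of the family."""
    return lambda x: ((a * x + c) % p) % b

def collision_probability(x, y, p=P, b=B):
    """Exact Pr[h(x) == h(y)] over all (a, c) with a in 1..p-1 and c in 0..p-1."""
    hits = sum(
        make_hash(a, c, p, b)(x) == make_hash(a, c, p, b)(y)
        for a in range(1, p)
        for c in range(p)
    )
    return hits / ((p - 1) * p)

# Draw one hash function uniformly at random from the family and hash all keys.
rng = random.Random(0)
h = make_hash(rng.randrange(1, P), rng.randrange(P))
bins = [h(x) for x in range(P)]

# Largest collision probability over all pairs of distinct keys.
worst = max(collision_probability(x, y)
            for x in range(P) for y in range(x + 1, P))
```

Storing such a hash function only requires the two integers `a` and `c`, which is the practical advantage over a completely random hash function.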

Definition 4.

The count-sketch [17, 18] is a method for representing a compact form of data which maps an -dimensional vector to a -dimensional vector where . The count-sketch of can be defined using an iterative process, initialized with the vector to be entirely zero, as follows: in the th iteration () we have

(4)

where is a sign function, is th entry of , and is a hash function that computes the hash of .

One count-sketch of is created by using one hash function and one sign function. To improve accuracy in this approximation, more than one hash function can be used (say hash functions). To create count-sketches, we need pairwise independent sign functions and 2-wise independent hash functions. The output of these count-sketches is a matrix. as the entry of , for each , can be approximately recovered by taking the median of the values of some entries of this matrix. For details, see Algorithm 1.

Theorem 1.

If and then for entry of the input vector , , we have [17, 20]

(5)
(6)

and

(7)

where is an estimation of .
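Count-sketch estimates are known to be unbiased over the random choice of signs. The following minimal sketch checks this property exhaustively on a toy input. The vector `a`, the fixed hash table `h`, and the function name `estimate` are hypothetical, and the average is taken over all sign assignments (fully random signs), which is a stronger assumption than the pairwise independence required in the text.

```python
from itertools import product

def estimate(a, h, s, i):
    """Count-sketch estimate of a[i]: s[i] * c[h[i]], with c built from a."""
    b = max(h) + 1
    c = [0.0] * b
    for j, value in enumerate(a):
        c[h[j]] += s[j] * value      # iterative count-sketch update
    return s[i] * c[h[i]]

a = [5.0, -2.0, 3.0, 1.0]
h = [0, 1, 0, 1]   # fixed hash: entries 0 and 2 collide, entries 1 and 3 collide

# Average each estimate over all 2^4 sign assignments.
signs = list(product([-1, 1], repeat=len(a)))
means = [sum(estimate(a, h, s, i) for s in signs) / len(signs)
         for i in range(len(a))]
```

Each individual estimate is off by the signed value of the colliding entry, but these errors cancel in the average, so `means` coincides with `a`.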

Corollary 2.

If there are estimations like , then [17, 20]

(8)

Corollary 2 shows that by choosing with probability at least , we have , where , for .

Remark 2:

Suppose -length vector has only nonzero entries. Then, from Corollary 2, one can see that an approximated version of with accuracy can be achieved by count-sketches.

Theorem 3.

For an dimensional and , if and , then with probability , we have

(9)

where and is the set of indices of the largest entries of  [21, 22, 20].

Remark 3:

According to Theorem 3, if the -length vector has only nonzero entries, then the output of the count-sketch method computes exactly with probability .

Definition 5.

The count-sketch of an -dimensional vector can be represented by a polynomial, named sketch polynomial, as

(10)

where is the 2-wise independent hash function and is the 2-wise independent sign function, used to develop the count-sketch.

Let us assume that we use and to count-sketch vector and and to count-sketch vector , respectively, represented by sketch polynomials

(11)
(12)

then represents a sketch polynomial for matrix , as

(13)

where

(14)

Also, the hash function on pairs is

(15)
Remark 4:

Recall that and are two independent random variables with some distributions. Thus the distribution of over its output range is the convolution of the distributions of and .
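The product rule for sketch polynomials can be checked numerically. The sketch below is a minimal illustration, assuming a sketch polynomial is stored as its coefficient vector and that the pair hash is the sum of the two individual hashes with the pair sign being the product of the two signs. The names `sketch_polynomial`, `u`, `v`, `h1`, `h2`, `s1`, and `s2` are hypothetical, and the hashes and signs are drawn fully at random with NumPy rather than from a 2-wise independent family.

```python
import numpy as np

def sketch_polynomial(v, h, s, b):
    """Coefficients of sum_i s[i] * v[i] * x**h[i], as a length-b array."""
    coeffs = np.zeros(b)
    np.add.at(coeffs, h, s * v)   # accumulates correctly when hashes collide
    return coeffs

rng = np.random.default_rng(0)
n, b = 6, 3
u, v = rng.standard_normal(n), rng.standard_normal(n)
h1, h2 = rng.integers(0, b, n), rng.integers(0, b, n)
s1, s2 = rng.choice([-1, 1], n), rng.choice([-1, 1], n)

# Multiplying two polynomials is a convolution of their coefficient vectors.
product_coeffs = np.convolve(sketch_polynomial(u, h1, s1, b),
                             sketch_polynomial(v, h2, s2, b))

# Direct count-sketch of the outer product u v^T, using the pair hash
# h1[i] + h2[j] and the pair sign s1[i] * s2[j].
direct = np.zeros(2 * b - 1)
for i in range(n):
    for j in range(n):
        direct[h1[i] + h2[j]] += s1[i] * s2[j] * u[i] * v[j]
```

The two coefficient vectors agree, and the pair hash takes values in a range of size 2b - 1, which matches the observation that its distribution is a convolution of the two individual distributions.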

function count-sketch()
    : an -dimensional input vector
    : a matrix, initialized by
    choose from the family of 2-wise independent hash functions
    choose from the family of 2-wise independent sign functions
    for  to  do
        for  to  do

        end for
    end for
end function
function recovering from
    for  to  do
        for  to  do

        end for

    end for
end function
Algorithm 1 The count-sketch algorithm [17, 18]
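The two functions of Algorithm 1 can be sketched in runnable form as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the symbols `n`, `d`, `b`, the array layout, and the function names are chosen for the example, and the hash and sign tables are drawn fully at random with NumPy as stand-ins for 2-wise independent families.

```python
import numpy as np

def count_sketch(a, hashes, signs, b):
    """Build a d x b sketch matrix: row j uses hashes[j] and signs[j]."""
    d = len(hashes)
    C = np.zeros((d, b))
    for j in range(d):
        # C[j] is a view of row j, so the accumulation happens in place.
        np.add.at(C[j], hashes[j], signs[j] * a)
    return C

def recover(C, hashes, signs):
    """Estimate every entry as the median of its d sign-corrected bin values."""
    d, n = hashes.shape
    estimates = signs * C[np.arange(d)[:, None], hashes]   # shape (d, n)
    return np.median(estimates, axis=0)

rng = np.random.default_rng(0)
n, d, b = 1000, 5, 64
a = np.zeros(n)
a[[3, 250, 999]] = [10.0, -7.0, 4.0]        # a sparse vector with three nonzeros
hashes = rng.integers(0, b, size=(d, n))    # assumption: fully random hashes
signs = rng.choice([-1, 1], size=(d, n))    # assumption: fully random signs
C = count_sketch(a, hashes, signs, b)
a_hat = recover(C, hashes, signs)
```

Taking the median over the `d` rows is what makes the estimate robust: a single row in which an entry collides with a heavy entry does not change the median as long as most rows are free of such collisions.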

II-B Problem Formulation

Consider a distributed system including a master and worker nodes, where the master node is connected to each worker node. Assume that the master node wants to compute which is an approximation of the multiplication of two matrices , where and and . Our goal is to compute subject to the following conditions

  1. Unbiasedness:

    (16)
  2. -accuracy:
    We say is an -accurate approximation of the matrix if

    (17)

    for all , where is the Frobenius norm of matrix and is entry of .

Suppose and are partitioned into the sub-matrices of equal size as

(18)
(19)

where and are positive integers. One of the constraints in this system is the limited storage of each worker. We assume that the size of the storage at each node is equal to fraction of plus fraction of for some integers , and . Assume and are the encoding functions that are used by the master node to compute and for th worker node, , respectively. In other words, the master node encodes the input matrices and sends two coded matrices to th worker node as follows:

(20)
(21)

where . Each worker node computes and sends back the result to the master node. After receiving the results from a subset of worker nodes, the master node can recover , the approximation of the original result. Let be a reconstruction function which operates on a subset of the workers’ results and calculates as follows:

(22)
Definition 6.

The recovery threshold of a distributed system with worker nodes and one master node is the minimum number of workers that the master node needs to wait for in order to guarantee that the master node can complete the computation, subject to space constraint at each worker node and unbiasedness and -accuracy conditions. In this paper, the optimum recovery threshold is denoted by .
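To illustrate why a smaller recovery threshold helps against stragglers, the following toy sketch computes the time at which the master has received enough results. The latency values and the function name `completion_time` are hypothetical; the sketch only models waiting for the fastest workers and ignores encoding and decoding costs.

```python
def completion_time(latencies, recovery_threshold):
    """Time at which the master holds results from `recovery_threshold` workers."""
    if not 1 <= recovery_threshold <= len(latencies):
        raise ValueError("recovery threshold must be between 1 and the number of workers")
    # The master finishes when the recovery_threshold-th fastest worker responds.
    return sorted(latencies)[recovery_threshold - 1]

# Hypothetical per-worker response times; two workers are stragglers.
latencies = [1.0, 1.2, 0.9, 7.5, 1.1, 30.0]

wait_for_all = completion_time(latencies, 6)    # dominated by the slowest worker
wait_for_four = completion_time(latencies, 4)   # both stragglers are ignored
```

With these numbers, waiting for all six workers takes 30.0 time units, while a recovery threshold of four finishes after 1.2.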

III Main Results

The following theorems state the main results of the paper.

Theorem 4.

For -accurate approximation of the multiplication of two matrices, we have

(23)

The proof of Theorem 4 is detailed in Section IV.

Remark 5:

To prove Theorem 4, we propose a coding scheme that achieves the following three objectives at the same time:

  1. It exploits the fact that only an approximate result is needed, by using some count-sketches to pre-compress the input matrices. This compression is done such that it reduces the recovery threshold.

  2. It relies on the structure of the matrix multiplication to add coded redundancy in the computation. This coding can force the system to perform some unwanted calculations. The coding is designed to reduce the number of unwanted calculations, and thus reduces the recovery threshold.

  3. The proposed scheme, in the end, creates some independent count-sketches of the original result, from which we calculate the final result and also establish the theorem. As a side product, some dependent count-sketches are also created. To minimize the recovery threshold, we need to minimize the number of these side products. We use another layer of structured code to reduce the number of these side products and thus reduce the recovery threshold.

We note that, between objectives one and two, the first one injects some randomness into the input matrices, while the second one relies on their structure. To achieve both at the same time, we use count-sketches on the rows of the first and columns of the second matrix, and use a structured code on the columns of the first and the rows of the second matrix.

For compression, we use count-sketch; to code the columns of the first and rows of the second matrix, we use the entangled polynomial code [8]. The last layer of the code is motivated by Lagrange code [15].
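The structured-code layer can be illustrated with a toy polynomial code in the spirit of [8]. The sketch below is a simplified illustration under stated assumptions, not the exact construction of the paper: the first matrix is encoded as a polynomial with its columns as coefficients in increasing degree, the second with its rows as coefficients in decreasing degree, and each hypothetical worker evaluates the product at its own point. The matrix sizes, the evaluation points, and the names `F` and `G` are chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
p = 4                                  # number of column/row blocks
A = rng.standard_normal((3, p))        # columns A[:, k] play the role of A_k
B = rng.standard_normal((p, 2))        # rows B[k, :] play the role of B_k

def F(x):
    """sum_k A_k x^k evaluated at a scalar x (a vector of length 3)."""
    return sum(A[:, k] * x**k for k in range(p))

def G(x):
    """sum_k B_k x^(p-1-k) evaluated at a scalar x (a vector of length 2)."""
    return sum(B[k, :] * x**(p - 1 - k) for k in range(p))

# Each worker returns the outer product F(x) G(x) at its own point. The product
# has degree 2p - 2 = 6, so any 2p - 1 = 7 distinct points determine it.
points = np.linspace(-1.0, 1.0, 2 * p - 1)
results = np.stack([np.outer(F(x), G(x)).ravel() for x in points])   # (7, 6)

# Interpolate: solve the Vandermonde system for the polynomial coefficients.
V = np.vander(points, 2 * p - 1, increasing=True)    # V[i, k] = points[i]**k
coeffs = np.linalg.solve(V, results)                 # row k = coefficient of x^k
C = coeffs[p - 1].reshape(3, 2)                      # coefficient of x^(p-1)
```

Only the coefficient of x^(p-1) is of interest: it collects exactly the terms in which a column of the first matrix meets the matching row of the second, that is, the sum of the four outer products. The remaining six coefficients are the unwanted calculations that the code design tries to keep few.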

Remark 6:

The proposed scheme achieves the first term in Theorem 4. The second term is achieved by the entangled polynomial code [8] to calculate the exact result.

Remark 7:

Depending on the parameters of the problem, this scheme can perform unboundedly better than entangled polynomial code, in terms of the recovery threshold. For example, for , , and , the recovery threshold of the proposed scheme is about , while in the entangled polynomial code, it is about , an order of one thousand improvement. Recall that the entangled polynomial code is designed to calculate the exact result.

Theorem 5.

If the original result is -sparse, i.e., only a fraction of the entries of matrix is non-zero, then the proposed method computes exactly with probability and with the recovery threshold of

(24)
Remark 8:

Note that fraction of the matrix is, in size, equivalent of blocks of matrix partitioned as .

The proof of Theorem 5 is detailed in Section IV. In that section, we formally describe the proposed scheme and the decoding procedure.

IV The Proposed Scheme

In the following, the goal is to compute the approximation of the matrix multiplication over a distributed system using the CodedSketch scheme.

IV-A Motivating Example

We first demonstrate the main idea of our scheme through a simple example. We then generalize this approach in the next section. Consider a distributed system with one master node and worker nodes which aim to collaboratively compute as an approximation of where and are two matrices partitioned as follows

(25)
(26)

The result of the multiplication can be computed using summation of four outer products as , where and are the th column of and th row of respectively. The proposed scheme is based on the following steps.

  • Step 1. The master node forms the following polynomial matrices based on columns of and rows of as follows

    (27)

    and

    (28)

    Then can be recovered from if we have the value of for seven distinct . More precisely, each entry of is a 6th-degree polynomial. Let us focus on entry of denoted by . Then

    (29)

    In this expansion, one can verify that . If we have the value of for seven distinct then all the coefficients of can be calculated using polynomial interpolation. In particular which is the coefficient of can be calculated.

  • Step 2. To reduce the dimension of this product, the count-sketch method is used. Assume we construct three count-sketches for . Let us assume that the sketch polynomials of the rows of are described as:

    (30)
    (31)
    (32)

    where is the sketch polynomial of the matrix using the th hash function. Same as before, assume we have three count-sketches for . To be specific, assume that the related sketch polynomials are defined as follows:

    (33)
    (34)
    (35)

    We note that the can be considered as a sketch polynomial for , where . For example, can be written as

    (36)

    where in this expansion,

    (37)
    (38)
    (39)

    Each entry of is a 6th-degree polynomial in which the coefficient of is the combination of entries of original result . This can be explained better as follows

    (40)

    Then according to (27) and (28), and discussion followed we have

    (41)
    (42)
    (43)

    In the expansion (40) particularly, the terms in (41)–(43) are of interest. The reason the other coefficients are not interesting is that the coefficients that we are looking for appear only in these terms. Thus, we have another count-sketch hidden in the count-sketch of . These three coefficients (41)–(43) form a count-sketch of the original result of multiplication, i.e., . The reason that the special coefficients of (40) form a count-sketch of , is the structure used in this scheme. This structure, which we name CodedSketch, is the concatenation of the count-sketch and the entangled polynomial code. Other forms independent count-sketch similarly.
    In the following, the computation of these sketch polynomials over a distributed system is proposed and to form the results of where efficiently, we use the scheme Lagrange code proposed in [15].

  • Step 3. The master node creates the following polynomials and encodes the and using Lagrange coded computing

    (44)
    (45)

    These polynomials are linear combinations of sketch polynomials created using different hash functions. It can be seen that , and and so does . Since the extraction of hidden count-sketch of is desired, we choose and . Let and . The number is dedicated to the th worker node, where if . Therefore, the master node sends and to the th worker node.

  • Step 4. Having received matrices and from the master node, the th worker node multiplies these two matrices. Then it returns the result, i.e., to the master node. The result calculated at node can be written as

    (46)

    By substituting the and in (46), where , the polynomial with coefficients is created in which the count-sketch results of are located.

  • Step 5. The master node can recover all of the polynomials’ coefficients by receiving the computation results of any 75 worker nodes. That is because the recovery process is equivalent to interpolating a 74th-degree polynomial given its value at 75 points. After interpolation and recovering the coefficients, a 74th-degree polynomial is created. Assume that, in this polynomial, all are replaced by . In this case, a bivariate polynomial is achieved in which we can choose to calculate three sketch polynomials , and for respectively. According to (41) we need the coefficients of , and to find the hidden count-sketches of in the sketch polynomial of .

  • Step 6. The coefficients , and of the three sketch polynomials , and are shown in Table I. To achieve an approximation of with lower variance, the master node takes the median of these estimations after multiplying them by the corresponding sign functions. For example, to approximate the value of , the master node does the following

    (47)

    In another form the master node takes the median of the following terms

    (48)
    Index 0 1 2
    1st Count-Sketch
    2nd Count-Sketch