Matrix multiplication has been widely performed in a variety of areas. For example, in image processing, a multiplication of projection matrix and system coefficient matrix is used to reconstruct the original images from the projections GLZeng
. In signal processing, the discrete Fourier transform of a signal is calculated by multiplying the N-by-N DFT matrix with the signal matrixRGLyons . Many other applications include cryptography, computer graphics, economics, physics, electronics, etc, which all involve large scale data processing.
On homogeneous processor platforms, the problem of scheduling matrix multiplication load for parallel processing has been extensively studied, such as Canon’s algorithm Canon , SUMMA Geijn and Solomonik’s 2.5D algorithm solomonik , etc. However, these approaches generally ignore the heterogeneity of processors and links, as well as network topologies. Thus, in distributed computing and heterogeneous platforms, those algorithms fail to apply because they can’t guarantee load balance in those scenarios. To minimize the task completion time without considering the heterogeneity of the system is simply impossible. Therefore, for distributed computing and heterogeneous systems, additional factors need to be taken into account, such as heterogeneous processor speeds, heterogeneous link speeds, network topologies, distributed storage, etc.
To schedule matrix multiplication on heterogeneous processor platforms, researchers usually consider the following questions.
How to allocate the computing load to minimize the total communication volume?
How to minimize the task completion time?
To optimize the total communication volume, many previous research works apply rectangular partition on the result matrix AKalinov -Beaumont2 . Rectangular partition, which adopts the well-known divide and conquer strategy, divides the result matrix into multiple sub-rectangles, and assigns each sub-rectangle’s computing load to different processors respectively. However, approaches based on rectangular partition generally have the following drawbacks:
The restriction of each division’s shape being rectangular brings difficulty to find the optimal partition that minimizes the total communication volume.
The best communication volume of rectangular partition may not be globally optimal.
There have been perspectives using non-rectangular partitionDeFlumere1 DeFlumere2 . However, these approaches only allows one division of the partition to be in random shape, while keeping the majority of the rest still in the shape of a rectangle. Thus, these approaches do not completely resolve the problem brought by the rectangular partition.
Motivated by this, we proposes another approach called layer based partition (LBP). Rather than assigning one rectangular sub-matrix of the final result matrix to a specific processor to process, our algorithm assigns each processor with one layer. Each layer is of the same shape with the result matrix. We will show that this method guarantees the optimality of the total communication volume.
We further study the problem of scheduling matrix multiplication on heterogeneous processor networks with the goal of minimizing the multiplication completion time. While several approaches have been proposed Lastovetsky1 -Nagamochi , none of them have addressed problem in the context of a specific network topology, like a heterogeneous mesh. Besides, most previous works consider the problem in the real number domain such that it is allowed for a processor to get rows, for instance. Comparatively, we consider the problem in the integer domain which is more applicable in real practice. In our paper, we study the problem in two specific networks: star network and mesh, and propose different strategies to minimize their overall finishing time.
1.1 An Example
To better illustrate this matrix multiplication scheduling problem, an example is shown in figure 1. Consider a task of multiplying two matrices(A and B) using four processors , , and . Suppose the computing power of , , and are , , and respectively. How to schedule the computing load onto these four processors to optimize communication volume and multiplication finishing time? Figure 1 provides three scheduling schemes , and . represents a rectangular partition of the result matrix where each of the processor is assigned calculating one of the rectangular sub-matrices. and represent a layer based partition scheme. Specifically, is an evenly divided scheme, in which each processor takes an equal number of rows and columns and computes one layer of the result matrix. Compared to , remains unchanged in overall finishing time, but substantially reduces the total communication volume. also partitions the matrix multiplication load based on layers, but make the number of columns taken by each processor being proportional to its computing power. As a result, keeps a low communication volume like , while optimizing the multiplication finishing time. Note that according to our previous assumption in the introduction part, we only consider the process of getting the distributed results, whereas aggregating the distributed results can be done asynchronously and is out of the scope of this discussion.
The reason that and has much less communication volume compared to is that entries of the matrices are sent only once in and whereas they are sent twice in . For example, in , the first row of matrix A is sent to both and in order to calculate the upper left sub-matrix and upper right sub-matrix respectively. In contrast, in and , each entry of matrix and matrix is sent only once. Therefore, the total communication volume of and is greatly less than .
We can see that when performing matrix multiplication, layer based partition generates much less communication volume compared to rectangular partition. With correct distribution of multiplication loads to processor, we can further optimize the multiplication completion time. However, in practice, the problem of scheduling matrix multiplication load is more sophisticated, when we consider this problem in a specific network environment, in which factors like network topology, communication mode, etc have to be taken into account. It’s even more complex, if we consider the heterogeneity of the system. This paper explores these cases in detail.
1.2 Assumption and Constraints
In this paper we only consider acquiring the distributed results of each layer from the multiplication process, whereas the aggregation of those layers is out of the scope of this discussion. This is because we can exploit distributed storage to store those results in a distributed manner. Since addition processes are of much lighter weight than multiplication processes, we can do asynchronous aggregation afterwards or only when necessary, rather than summing up immediately. Therefore, we assume that we are at a decent state once all the multiplication procedures are done. Then, we store these multiplication results distributively in the memory or hard disk of each processor. In the concept of rectangular partition, that means we consider task completion once each sub-rectangle result is acquired, while the combination of these sub-rectangles are out of the scope of this discussion. In the concept of our layer based partition, that means we regard task as completed once each layer of the result matrix has been calculated and stored, whereas the summation process of these layers are left with asynchronous processes. In either way, once all the distributed multiplication results are acquired, we mark that as the completion of the task. This is an important assumption for the comparison between layer based partition and rectangular partition in the latter part of this paper. We apply this assumption when calculating the total communication volume and task completion time.
The memory limit can be a tight constraint for layer based partition, when the size of the matrix gets enormously large and exceeds memory. For that case, we need to hold the matrices in hard disks, and batch-load the memory and do processing. After processing we write the result back to disk. With distributed storage we can store each layer of the result matrix on the hard disk of each processor. Since these super-large matrices are usually not read intensive, we can do asynchronous sync-up, or only need to read from the disk and add up each layer when there’s an infrequent read request. All in all, it’s a bit tricky for layer based partition to handle matrices larger than memory size. But for other matrices, it’s convenient and bears substantial advantages.
1.3 Our Contributions
Our main contributions are:
We analyze the disadvantages of rectangular partition. First, we show it’s essentially difficult to obtain a communication-optimal partition. Second, we show that the best communication volume achieved by rectangular partition is not globally optimal. We propose the lower bound of matrix multiplication’s communication volume, and we prove rectangular partition consumes a total communication volume larger than the lower bound.
In contrast, we propose layer based partition(LBP) scheme. We show that our scheme is superior to rectangular partition because: A). It is easy to obtain a communication-optimal partition. B). It can reach the lower bound of total communication volume.
We study the load scheduling problem to minimize overall finishing time in star networks. We propose an equality based strategy, and apply this strategy under four communication modes respectively: SCSS, SCCS, PCCS, PCSS.
We study the load scheduling problem to minimize overall finishing time in mesh networks. We formulate this problem as a mixed integer programming problem called MFT-LBP. We propose an algorithms called PMFT-LBP
to solve it, which contains 3 phases. We also provide a heuristic to reduce time complexity.
One thing to notice is that the matrices discussed in this paper are dense matrices with hundreds or thousands of rows/columns, which are very common in current parallel processing environment.
The rest of the paper is organized as follows. Section reviews the related research. Section introduces layer based partition scheme, and analyzes its improvements over rectangular partition. Section studies the load scheduling scheme of LBP in network, and Section studies that in network. Section evaluates the performance of our algorithms through simulations. Finally, section concludes the paper.
2 Related Work
A great amount of research effort has been devoted to the problem of matrix multiplication parallel/distributed processing. In this section, we categorize those most related works into the following parts:
Approaches on homogeneous platforms.
Homogeneous platforms assume that all the computing / comunication resources and environment are identical. Matrix multiplication scheduling on homogeneous platforms have been extensively studied inCanon - Samantray , among which, Canon introduces the first parallel algorithm on homogeneous grids Canon . Fox et al. extend the analysis on two-dimensional mesh and hypercubes Fox . SUMMA overcomes the shortcomings of Cannon’s and Fox’s, and becomes the most widely applied parallel matrix multiplication scheme Geijn . Solomonik et al. solomonik propose a method known as the ‘2.5D Algorithm’, which can achieve asymptotically less communication than Canon’s algorithm and be faster in practise. Malysiak et al. Malysiak present a novel model of distributing matrix multiplication within homogeneous systems with multiple hierarchies. Scheduling of sparse-dense matrix multiplication has been studied in Yzelman Koanantakool .
However, when the computing scale gets larger and larger, heterogeneous computing is more suitable than homogeneous computing for super large distributed and parallel processing. An example is, CPU-GPU heterogeneous processing is inevitable a trend to achieve higher computing performance Mittal . Heterogeneous System Architecture(HSA) is another example that uses multiple processor types on the same integrated circuit to provide the best overall performance Hwu .
Approaches on heterogeneous platforms. Efforts have been devoted to analyze matrix multiplication on heterogeneous platforms AKalinov -Yang . Some researchers try to extend those parallel processing algorithms from homogeneous platforms to heterogeneous platforms. Both Kalinov AKalinov and Quintin et al. Quintin investigate the scalability to modify SUMMA Geijn , to fit into heterogeneous processor platforms. Ohtaki et al. Ohtaki propose a scheme to apply the Strassen’s algorithm on heterogeneous clusters.
However, these approaches only optimize communication volume, yet fail to consider whether multiplication completion time gets optimized as well.
In addition to approaches that try to apply homogeneous parallel algorithms, other researchers seek new perspectives. Alonso et al. Alonso use two strategies to implement parallel solvers for dense linear algebra problems on heterogeneous clusters. Malik et al. Malik propose a topology-aware matrix multiplication algorithm, and they base their hierarchical communication model on irregular 2D rectangular partition. Zhong et al. Zhong utilize functional performance model to balance load on heterogeneous networks of uni-processor computers. Demmel et al. Demmel propose a communication-efficient algorithm for all dimensions of rectangular matrices, apart from square matrices. Beaumont et al. Beaumont1 compares static, dynamic and hybrid resource allocation strategies for matrix multiplication, and analyse the benefit of introducing more static knowledge in runtime libraries.
The most utilized partition method in these approaches is rectangular partition. There exists a significant amount of research digging into the problem of rectangular partition on heterogeneous platforms, as discussed below.
Rectangular partition approaches on heterogeneous platforms. Many approaches have been proposed based on rectangular partition of matrix, and researchers have considered the optimal rectangular partition Lastovetsky1 Lastovetsky2 Clarke . However, even though the optimal rectangular partition problem attracts a great deal of attention, the optimal partition that minimizes the total communication volume remains an open question. Researchers have realized that the problem is complex. Ballard et al. Ballard study the lower bounds of communication volume, and the difficulty of finding communication optimal rectangular partition. Beaumont et al. Beaumont2 prove that given the area of each sub-rectangle, it is a NP-complete problem to find communication optimal rectangular partition of a specific matrix. In regard to this, DeFlumere et al. DeFlumere1 question the optimality of rectangular partition, and propose a perspective of non-rectangular partition. Further, DeFlumere et al. DeFlumere2 show that this non-rectangular scheme outperforms rectangular partition in terms of communication volume, and generalize this algorithm to a three processors’ scenario. Nagamochi et al. Nagamochi propose a recursive partitioning algorithm that dissects a rectangle into rectangles with specified areas. Beaumont et al. Beaumont3 in addition, propose a new approximation algorithm for matrix partitioning by adopting the idea of non-rectangular partition, and the recursive partitioning algorithm proposed by Nagamochi et al. Fuenschuh et al. Fuenschuh
present a polynomial time approximation algorithm that solves a soft rectangle packing problem, and derive an upper bound estimation on its approximation ratio.
In summary, researchers have begun to realize the difficulty in finding the optimal rectangular partition that balance loads while minimize communication volume. Beaumont’s work et al. Beaumont2 explicitly reveals that it is a NP-complete problem. Moreover, though alternative perspectives like non-rectangular partition have been proposed DeFlumere1 DeFlumere2 , those perspectives keep the majority of the partitions still in the shape of rectangle, which do not completely resolve the restriction brought by geometrical shapes. In contrast, in layer based partition scheme, we avoid the NP-complete problem and make it very easy to obtain a communication-optimal partition, which reaches the lower bound of total communication volume.
3 Layer Based Partition(LBP)
In this section we propose a new method - the layer based partition(LBP) scheme to tackle matrix multiplication on heterogeneous processor platforms.
3.1 Scheme Overview
While the goal remains as conducting two matrices’ multiplication, the approach taken by LBP is different from rectangular partition. In LBP, each processor is responsible for calculating one layer of the output matrix. Each layer is of the same dimension of the output matrix, and the output matrix is the aggregation of all layers.
Figure 2 shows an example of this LBP scheme with four processors cooperating on two matrices’ multiplication. Processor takes matrix ’s leftmost columns and ’s upmost rows, and then do multiplication. The result is still a matrix, which is actually the 1st layer of the final output matrix. The same method applies to the rest of the processors, with processor taking charge of 2nd layer, processor taking charge of 3rd layer, and processor taking charge of the last layer. The final output matrix is the sum of these four layers.
3.2 Improvements over Rectangular Partition
The important improvements of LBP scheme over rectangular partition include
It is much easier for LBP to obtain a communication-optimal partition. In fact, any layer based partition is already communication-optimal. But for rectangular partition, it is a NP-complete problem to find communication-optimal partition according to Beaumont Beaumont2 , because of the restriction that each division’s shape has to be rectangular.
Under assumption 1.2, LBP also takes less total volume of data sent by the source than rectangular partition. LBP essentially reaches the communication lower bound, as will be proven below.
In order to ensure an equal base of comparison, we assume in the following part of this paper that, all source nodes do not take part in computation.
(Layer Based Partition Theorem) LBP generates the minimal total communication volume in conducting two square matrices’ multiplication.
To prove the theorem, we firstly present the lower bound of communication volume.
For all scheduling schemes for two matrices’ multiplication, the lower bound of communication volume is , if the source does not take part in processing.
When conducting two matrices’ multiplication, both matrices have to be sent from the source to the computing node, because the source doesn’t take part in computing. Each matrix contributes a communication volume of , so two matrices together are . Since each entry of these two matrices has to be sent at least once, the total communication volume is always greater than or equal to . Thus the lower bound stands.
Back to Theorem 3.1. Suppose there are processors, therefore according to LBP, the task is divided into layers. Processor takes matrix ’s leftmost columns and ’s upmost rows, and thus the communication volume to transfer the necessary processing data from source to processor is . Similarly, the communication volume of processor is , processor is , …etc. The total communication volume is
As shown above, the communication volume of layer based partition scheme reaches the lower bound. Therefore, layer based partition scheme is communication-optimal.
Rectangular partition consumes a total communication volume greater than the lower bound.
In rectangular partition, suppose the output matrix is of size , and each sub-rectangle’s height is , width is , and area is . The total communication volume according to Beaumont2 is:
Because , therefore we have:
In the meanwhile, we have the sum of each sub-rectangle:
Since each is positive and :
Combined with equation (2), we get:
Thus the total communication volume of LBP() is less than rectangular partition().
3.3 Memory Limit Constraint
The memory limit can be a tight constraint for layer based partition, for the case where the result matrix’s size is large enough to exceed memory. That’s a bottleneck. For that case, because we can’t hold the entire result matrix in memory, we need to utilize hard disk and batch processing. We can load the memory of each processor to its limit, do the multiplication and write the result to disk. After that, we continue to load the next batch of data until completing all data sets. This approach is doable with the fast development of hardware like SSD and et cetera.
After the result of one layer is acquired on a single machine, we can leave the result there on the hard disk of that machine. We can do asynchronous processing to aggregate each layer. Alternatively, we can do lazy sync-up. For lazy sync-up we only need to read from the disks and aggregate each layer when there s a read request, which shouldn’t be very frequent for large matrices like this. After all, if a result matrix’s size exceeds the memory limit, after aggregating all the layers, it has to be stored on hard disks anyway. We change this centralized storage manner to a distributed manner, by storing each layer of the result matrix on the hard disk of each processor.
To summarize, layer based partition scheme avoids the NP-complete rectangular partition problem, and has been proven to be communication-optimal. The next step is to discuss how to apply it in different network topologies.
In the following section, we discuss the details of applying the LBP scheme on heterogeneous processor networks. We focus on analyzing LBP’s load balancing strategy, namely, how to schedule load distribution among different processors in the network. Since the LBP scheme already ensures communication volume to be minimal, the optimization goal of the load balancing strategy is to optimize the task completion time, which is another primary and widely discussed target of optimization on heterogeneous processor platforms. Again, as we mentioned in the introduction part, we mark it as the completion of the task once all the distributed multiplication results are acquired on each processor. The total time it takes to compute those multiplication results of all the layers is defined as the task completion time here.
4 LBP in Single-Neighbor Networks
To begin with, we discuss applying LBP in networks where each node can only receive loads from at most one of its neighbors. That node is allowed to further send loads to its other neighbors, except for the one where it obtains data from. We name this type of networks single-neighbor networks in the following parts of our paper. Typical examples of single-neighbor networks include star network, tree network, muti-level tree, etc. Here we use star networks to discuss the load balancing strategy under four scenarios: Sequential Communication Simultaneous Start (SCSS), Sequential Communication Consecutive Start (SCCS), Parallel Communication Simultaneous Start (PCSS), Parallel Communication Consecutive Start (PCCS). Sequential Communication means each node can only send load to its neighbors sequentially, while, Parallel Communication allows the source to transmit simultaneously. Consecutive Start means each non-source processor can only start processing data after receiving all the data it needs, while, Simultaneous Start allows each node to start processing while receiving data.
(Single Source Network Load Balancing Theorem) For single-neighbor networks, the condition to reach load balancing is to have all the nodes finish working at the same time.
This load balancing theorem comes from Bharadwaj etc.’s monograph Bharadwaj . The idea is if not all processors finish processing at the same time, loads can be reduced from busy processors, and assigned to idle processors to speed up the overall process. Consequently, the minimal finishing time is obtained only when all processors finishing working at the same time.
|The number of rows or columns|
|assigned to th processor|
|The inverse of the computing speed of|
|The inverse of the link speed of the th link|
|Computing intensity constant:|
|Unit load on th processor is processed|
|Communication intensity constant:|
|Unit load on th link can be transmitted|
|The total finishing time of the entire network|
In practice, because must be an integer, so it may be impossible to generate a set of integers that makes each processor finish exactly at the same time. Our approach is to relax each as a real number first, solve the relaxed problem, then find the integer solution that is closest to the solution of the relaxed problem. We believe that by making the integer solution as close to the optimal solution of the relaxed problem (in real domain) as possible, it is more likely that we can get the optimal solution of the original problem in which each is an integer.
4.1 Sequential Communication Simultaneous Start
In this subsection, we analyze the SCSS case where the source transmit loads to each processor sequentially, and in the mean time communication can overlap with computing. Figure 3(a) displays the time sequence of this case.
Each processor calculates one layer of the result matrix. Take processor for instance. To work out layer’s data, processor needs to get rows columns of ’s data, plus rows columns of ’s data. Thus its communication volume is , and the corresponding communication time on link is . Furthermore, for each entry of layer, it needs multiplications to get its value. There are as many as entries in this layer. Therefore the total number of multiplication is . The corresponding computation time is . Consider all the timing relationship shown in Figure 3(a), we have the following equations:
The set of equations above consists of equations specifying the time sequence relationship between pairwise processors, and one equation specifying the normalization constraint. Meanwhile, we have unknown variables . Therefore, the set of equations are solvable. Solving them, we get:
where is defined as:
The overall finishing time of the whole network can be obtained through the following equation.
4.2 Sequential Communication Consecutive Start
In this section we discuss the case where the source sequentially assigns load to each processor, and communication can not overlap with computation. According to Figure 3(b), we have the following equations:
These equations contains unknown variables. The solution set is:
where is defined as:
The overall finishing time of the whole network is:
4.3 Parallel Communication Consecutive Start
For the case where source communicates with each processor in a parallel manner, and communication can not overlap with computation, we have the following equations:
Solving the equations, we have:
where is defined as:
The overall finishing time of the whole network is:
4.4 Parallel Communication Simultaneous Start
The case where source can transmit data to all processors at the same time, and communication can overlap with computation is shown in Figure 4(b). All processor start processing and end processing at the same time, and the size of load should be proportional to each processor’s computing speed.
Solving the equations, we have:
4.5 Integer Adjustment
By solving the above equations, we get a set of real numbers that generates the minimal task finishing time. The next step is to find the closest integer solution. We’ll have a deeper discussion of how we obtain this closest integer solution from real number optimal solution, in the next section after we discuss multi-neighbor networks. Because we’ll face the same problem there, too. Here we’ll provide a simple heuristic as shown below.
We can round off the real number solution first, such that a processor gets the whole row/column assignment if it takes more than half of the fractional part of that row/column in the real number optimal solution. After the rounding off process, if the sum of each fails to equal , then we sort the processors in ascending order of their actual finish time . If the sum is less than , then from the processor that has the smallest , we assign an extra row or column to that processor until the sum equals . Otherwise, starting from the processor that has the largest , we remove one row/column from that processor until the sum equals . After the process we obtain the actual integer assignment, we can further update the overall task finishing time.
5 LBP In Multi-Neighbor Networks
In this subsection, we discuss the load balance strategy for another type of network, in which each node is allowed to receive loads from more than one of its neighbors. We call this type of network multi-neighbor networks, in contrast to the single-neighbor networks discussed previously. The target of the load balancing strategy is still minimizing the task finishing time. The definition of variables and constants are listed in the following Table 2 and Table 3.
|The start time of the th node in the network|
|The finish time of the th node in the network|
|The number of columns or rows of the|
|multiplier matrix that are assigned to th node|
|The volume of load transmitted from|
|node to node|
|The mesh network with fixed topology|
|The set of source nodes|
|The inverse of the link speed of the link|
|connecting node and|
|Communication intensity constant|
|The inverse of the computing speed|
|of th processor|
|Computing intensity constant|
|The size of both square multiplier matrices|
|The number of nodes in this mesh network|
|specifies the position relationship|
|of two nodes and in the network.|
|if is in a position that should|
|transmit to , and if otherwise|
|specifies the storage size of node . According|
|to 3.3, this can be memory or hard disk|
Note that in multi-neighbor networks, it is not applicable to apply the equal processing time load balancing strategy such as the one addressed in Theorem 4.1. Zhang et al. Zhangzeming analyze a similar problem. To summarise, if node receives load from only one of its neighbors, says, , we can obtain the following equation, , meaning that node starts processing when it finishes receiving load from node . However, if node also receives load from another neighboring node, say , the above equation may fail to stand, because the two neighbor nodes and might not finish transmitting at the same time, i.e,
and we can not determine node ’s start time.
Theorem 4.1 can optimize task finishing time for single-neighbor networks like tree, multi-level tree, etc. For multi-neighbor networks like multi-root tree, mesh, ring, torus, hypercube, since we can not apply Theorem 4.1, we formulate the load balancing problem as an optimization problem, called Minimize Maximum Finishing Time in Layer Based Partition(MMFT-LBP) problem.
A mesh is a typical multi-neighbor network where one single node can receive data from multiple neighbors. In the following part we discuss solving MMFT-LBP problem in mesh networks, to shed light on solving MMFT-LBP in other multi-neighbor networks.
5.1 MMFT-LBP In Mesh
In this subsection we focus on the MMFT-LBP problem in mesh networks. We denote the mesh network as a graph , shown in figure 5. The graph contains nodes which forms a dimension of , where . Generally speaking, the mesh has better performance in scheduling if the source is closer to geometric center. So we assume that the source is located at the center of this mesh network, and divides the mesh into four quadrants. The specific data flow pattern in each quadrant is shown in Figure 5.
For the limitation of scope of this paper, we only analyze PCCS mode, that is, data is forwarded in the network in the ‘parallel communication and consecutive start’ pattern. ‘Parallel communication’ allows each node to talk to its multiple neighbors at the same time. ‘Consecutive start’ means that each node has to wait until it receives its whole share of data before it can start computing. In short, we utilize PCCS mode as each node’s communication and processing model in our multi-neighbor networks, and leave the other communication patterns for future study.
We present the MMFT-LBP problem in the following.
Variables: , , , ,
The objective of MMFT-LBP(35) is to minimize the maximum finishing time of the mesh network.
Constraint (36) specifies the start time of the source. In this paper’s case, there’s only one node in set S.
Constraint (37) specifies that the start time of those non-source nodes should be the maximum time that they finish receiving all loads from their adjacent neighbors.
is a constant once the mesh network is determined and fixed.
Constraint (38) defines each node’s finishing time. According to LBP, each node’s computing load is the total number of multiplication it conducts .
Constraint (39) shows that because the source does not take part in processing, it sends out all of its load: the two multiplier matrices. And each entry of the matrices is sent only once.
Constraint (40) defines the amount of load taken by those non-source nodes.
The in constraint (43) is the number of columns taken by each node, and should be integers.
Constraint (45) shows that for each node, its storage size should at least be able to store the load from two multiplier matrices (), and the result matrix ().
Constraint (46) is the normalization constraint. The number of columns taken by each node should sum up to be the side length of the multiplier matrix.
Both the objective function and constraints contain maximum form of formulas. To solve it, we firstly reorganize the problem and remove those maximum forms of formulas, and show that both the original and reorganized form have the same optimal solution.
5.2 MFT-LBP In Mesh
The objective function of MMFT-LBP problem contains maximum form of formulas, which makes it hard to solve directly. To obtain the optimal solution of MMFT-LBP problem, the first step is to reorganize the objective function.
We introduce one additional unknown variable , and a set of constraints , which ensure to be no earlier than any node’s finishing time . Then we transform the objective function from (35) to . The optimal solution to the original problem is still the optimal solution to the transformed problem, because when is minimized, .
Similarly, we relax constraint (37) to be the following linear inequality.
This inequality (47) implies processor does not necessarily need to begin processing immediately when it finishes receiving all the data, it can choose to hold the data for a while and then start processing. Therefore, constraint (47) increases the solution space defined by constraint (37) to incorporate the following:
Even if inequality (47) increases the feasible solution space, the optimal solution still remains the same. This is because the target of this problem is to minimize the overall finishing time, the minimal finishing time is achieved always when each node starts processing and forwarding once it completes receiving load from its neighbors, even if it is allowed to hold on for a while before it starts processing and forwarding.
In MMFT-LBP problem statements, if we use constraint (47) to replace constraint (37), we get a new problem called Minimize Finish Time with Layer Based Partition(MFT-LBP). As we have discussed, we have the following theorem.
(Maximization Relaxation Theorem) MMFT-LBP problem has the same optimal solution with MFT-LBP problem.
The full problem statement is presented as follows.
Variables: , , , ,
PMFT-LBP: The PMFT-LBP algorithm contains the following three phases. In Phase I, it relaxes integers to real numbers and solves the relaxed linear programming problem. In Phase II, based on the optimal real number solution obtained in Phase I, PMFT-LBP determines a feasible integer solution for the original PMFT-LBP problem, which is close to the optimal real number solution. In Phase III, starting from the feasible integer solution obtained in Phase II, PMFT-LBP conducts a “neighbor search” and seeks for local optimal feasible integer solution.
Phase I. In this phase, the PMFT-LBP algorithm relaxes condition (57) to be a positive real number
and solves a relaxed version of the MFT-LBP problem, called MFT-LBP-relax.
With this relaxation, all constraints are either linear equality or linear inequality, forming a convex polygon feasible region. Its objective function is also linear defined on this polyhedron. Hence, MFT-LBP is a LP problem and can be solved to get the optimal real number solution , , , , .
Phase II. In this phase, PMFT-LBP calls an algorithm called finds an integer feasible solution(FIFS) based on the optimal real number solution obtained from phase I. we firstly round off each to its closest integer. The intuition here is that if the real number optimal solution assigns the larger portion of a column to one processor, then assigning that processor with the entire column will have a higher chance to get “closer” to optimality. It is vice versa that if the processor is assigned with the smaller portion of a column, we shall remove its share of this column.
However, the rounding off process alone may not guarantee a feasible integer solution, because it may result in the sum of all fail to equal the multiplier matrix’s side length , constraint (60). To resolve this problem, we conduct a subtle adjustment. For cases in which the sum is greater than , meaning that there exists duplicate assignments, we shall reduce work load from some processors. Since the processor with the longest finishing time is the bottleneck affecting the overall finishing time, it has the highest priority to reduce its share of loads. On the contrary, for cases in which the sum is less than , meaning that some rows/columns haven’t been assigned to any processor, processor currently with the shortest running time has the highest priority to take up the responsibility.
We don’t make the adjustment to one processor all at once. Instead, we conduct the adjustment iteratively. Every iteration we only adjust one row/column, then we update each processor’s , , to determine the processor to conduct adjustment on for the next round of iteration. The processor with the longest processing time currently will be the one to remove a row/column from in the next round of iteration, and the processor with the shortest processing time will be the one to take the extra load in the next iteration.
When the sum of all equals , we finally find an integer feasible solution , , , .
Phase III. In this phase, PMFT-LBP conducts a so-called “Neighbor Search” process to optimize the feasible solution obtained in phase II. The reason why this feasible solution still need optimization is that the “Rounding Off” procedure in phase II might bring about bias and take our feasible solution “away” from optimal. Therefore, a further optimization is necessary.
The “Neighbor Search” runs in iterations. For the first iteration, it starts from the feasible integer schedule obtained in phase II, and compares the current overall finishing time with its neighbors’. is defined as one of ’s neighbors if all except for only two dimensions and , where and , such that . If one neighbor offers shorter overall finishing time than and the rest of its neighbors, we change load schedule from to . Then, “Neighbor Search” will use as a new start point and begins a new iteration. If at some stage, no further optimization can be made towards minimizing overall finishing time, which means the schedule at that stage is a local minimal point that has the shortest overall finishing time among all of its neighbors, iteration stops and we say we find our optimal schedule.
Considering the high time complexity of PMFT-LBP algorithm because so many LP-based updates are involved, we propose a heuristic called MFT-LBP-heuristic. This heuristic calculates a good feasible integer solution that is close to the optimal solution with a time complexity which is substantially reduced from PMFT-LBP. The basic idea is that whether one single row/column should be assigned to one processor or another simply doesn’t bring about much improvement of the result. However, simplifying these steps saves time complexity substantially.
Based on this idea, MFT-LBP-heuristic only keeps phase I the same with PMFT-LBP algorithm. In phase II, after obtaining the optimal real-number schedule , the heuristic rounds off each to get slightly differently. The difference is, if the sum of doesn’t equal , the heuristic sorts all processors in ascending order of their finishing time , which forms an array . If the sum of is less than , then from the first element in , the heuristic keeps adding to each processor’s until . Otherwise if the sum of is greater than , the heuristic starts from the last element of and minus from each processor’s one by one until . The adjusting process continues in a circular manner so if it reaches one end of , it jumps to the the other end of the array and continue the process again.
In phase III, PMFT-LBP searches each neighbor of current schedule . The time complexity of searching process is , where is the number of processors. Once is big, the scalability of the algorithm is poor. In order to speed up the local search process, we apply the concept of ‘gradient descent’ here. At each iteration, we only look at the neighbor that has the highest chance to decrease the overall finishing time from current schedule . We know that differentiate from by just two dimensions and . If is the schedule of the processor that currently takes the longest processing time, while is the schedule from the processor that takes the shortest, then stands the highest the chance to be the neighbor that has shortest overall finishing time. We compare ’s with ’s. If ’s is shorter, we use in the next iteration. Otherwise, since even
cannot further decrease the overall finishing time, the other neighbors probably cannot either. Therefore we takeas the optimal schedule we look for.
This heuristic only solves LP problems twice and reduces time complexity of the iterative LP-based update process in PMFT-LBP. The result of this may sacrifice the overall finishing time of the mesh network a little bit, but reduce the algorithm’s time complexity substantially by reducing a lot of time-consuming LP iterative updating processes.
As will be seen in next section, the performance of our heuristic is extremely close, and even in cases equal to PMFT-LBP both in terms of communication volume and finish time.
6 Performance Evaluation
6.1 Performance Evaluation of Star Network
In this subsection, we study the performance of our layer based partition scheme in a star network, which is an instance of single neighbor network. In each iteration, we randomly generate two matrices, and a heterogeneous star network of processors to conduct the matrix multiplication.
Star Network. The star network contains one source and multiple children connected to the source. As mentioned previously, we assume the source node only transmits load, but does not take part in computing. In our simulation, we use a star network containing 16 children, where each child’s unit processing time
is uniformly distributed in the range of (0.0005, 0.0008), and each link’s unit transmission timeis uniformly distributed in the range of (0.0002, 0.0005).
Communication and Processing mode. In order to better compare our algorithm with some other related algorithms, we use PCCS mode as the communication and processing mode here, meaning the source can communicate with each processor in a parallel manner whereas communication cannot overlap with computation.
Matrices. The side length of the matrices in our simulation goes from 100 to 1000. When analyzing the performance of each algorithm over matrix size, each data point is an average of 10 independent experiments over 10 independently different star network.
6.1.1 Evaluation Metrics
Total Communication Volume. Total communication volume is defined as the sum of data volume transmitted on each link.
Task Finishing Time. The task finishing time in star network is defined as the time period from the source starts to send data till the last processor finishes working.
6.1.2 Comparison Algorithms
We compare our layer based partition algorithm to a couple of typical rectangular partition algorithms.
Even-Col. Even-Col is a naive rectangular partition algorithm that simply partitions the matrix into equivalent columns.
PERI-SUM. Beaumont et al. Beaumont2 deal with the geometric problem of partition the unit square into rectangles of given area , , , etc, so as to minimize the the sum of the perimeters(PERI-SUM) of the rectangles, which is proportional to communication volume. Beaumont et al. further propose the communication lower bound, and introduce a -approximation algorithm. We use it in our comparison.
Recursive. Nagamochi et al. Nagamochi introduce a recursive partitioning technique on the basis of PERI-SUM, and improve the approximation ratio from to .
NRRP. Beaumont et al. Beaumont3 combine the idea of non-rectangular partitioning from DeFlumereDeFlumere2 and recursive partitioning algorithm proposed by NagamochiNagamochi . The combination of these two ingredients lead to an improvement of the approximation ratio down to .
Lower Bound. Ballard et al. Ballard proposed the lower bound of communication volume in rectangular partition to be . We use this lower bound to compare with our LBP’s communication volume.
6.1.3 Evaluation Result on Star Network
Communication Volume with Increasing Matrix Size. As one of our most significant contributions, the simulation displays overwhelming superiority of our layer based partition algorithm over rectangular partition algorithms in total communication volume. Figure 6(a) compares the total communication volume of each algorithm. While the total communication volume generally increases along with the expansion of matrix size , LBP generates the smallest total communication volume. Specifically, when the matrix size reaches , the total communication volume of layer based partition reduces from the lower bound of the rectangular partition. Meanwhile, as stated in Beaumont3 , the lower bound of rectangular partition is too optimistic to reach in actual practice, and the real best rectangular partition result is obtained actually in NPPRBeaumont3 , RecursiveNagamochi and PERI-SUMBeaumont2 . Genearally, LBP presents a total communication volume that reduces from NPPR, from Recursive, and from PERI-SUM, respectively. We observe that LBP keeps this ratio over rectangular partition algorithms along with the increase of matrix side length. Moreover, when the star network gets larger, this ratio gain becomes even bigger. We attribute these to the advantage of layer based partition over rectangular partition in communication volume.
Finishing Time with Increasing Matrix Size. Figure 6(b) shows the overall finishing time of each algorithm. We observe that while all algorithm’s finishing time increase with matrix size, LBP, PERI-SUM, Recursive, NPPR present similar curves, with their overall finishing time much smaller than that of Even-Col. Specifically, when the matrix size is , the overall finishing time of those four algorithm is about smaller than that of Even-Col. One reason for this is that those four algorithms, LBP, PERI-SUM, Recursive, NPPR all achieve load balance when scheduling the load. No matter dividing the matrix into layers or rectangles, each of the four algorithms makes sure that each share of load is proportional to that processor’s computing ability.
Summary. In this subsection we compare our layer based partition algorithm with rectangular partition algorithms in terms of total communication volume and task finishing time. The simulation results proves that our layer based partition algorithm generates a total communication volume that is substantially reduced from the state-of-the-art rectangular partition algorithms. In the meanwhile, LBP also reaches load balance and generates a total overall finishing time that is as low as the other algorithms.
6.2 Performance Evaluation of Mesh
In this subsection, we study how the layer based partition scheme performs in mesh networks. In each single run of the simulation, we randomly generate a heterogeneous mesh network, and two square matrices conducting multiplication. Then our LBP algorithm and the other comparing algorithms are called to schedule the matrix multiplication load on the given mesh network.
Mesh Network. The mesh network is heterogeneous, with each link speed and processor speed independently generated. The unit processing time of the processors is uniformly distributed in the range of , while the unit transmission time of the links is uniformly distributed in the range of . In our simulation, we use use three square meshes, which are of dimension 5*5, 7*7 and 9*9. For model simplicity and without loss of generality, we focus on studying the case for one quadrant-the lower right one-in figure 5, and the source node is located at the top left corner. The cases of the other three quadrants are similar. However, SUMMA is an exception, in which no single source exists, and each processor in the mesh takes one block of matrix data. So when evaluating the performance of SUMMA, we divide the matrix data into blocks and store it on corresponding processor.
Matrices. The matrices we analyze are large scale dense matrices. In our simulation, we randomly generate matrices with their side length ranging from 1000 to 2000.
When analyzing the performance of each algorithm over matrix size, each data point in our simulation is an average of 10 independent experiments over 10 independently different mesh network.
6.2.1 Evaluation Metrics
Overall Communication Volume. Overall communication volume is defined as the sum of data volume transmitted on each link. Compared to the total data volume coming out of the source, overall communication volume provides a more direct view of data running on each link.
Task Finishing Time. The task finishing time in mesh network is defined as the time period from the source starts to send data till the last processor finishes working.
Total Number of Iterations to Solve LP. Since we rely on simplex algorithm to solve LP in our LBP and LBP-heuristic, we evaluate both algorithms’ efficiency in terms of average total number of iterations taken by simplex algorithm to solve LP.
6.2.2 Comparison Algorithms
SUMMA. Geijn et al. Geijn propose SUMMA, which is the most widely applied parallel matrix multiplication scheme on homogeneous grid. The algorithm allocates matrix blocks over the grids. In each step, the pivot column of blocks is communicated horizontally and the pivot row of blocks is communicated vertically. Each processor uses the pivot blocks it get to update its rectangle in each step. We apply this algorithm on our heterogeneous mesh network.
Pipeline. The Pipeline algorithm is a classic scheduling method. Starting from the source, each node forwards the entire copy of data along the grids to each of its neighbor in the mesh network. Duplicate copies may be sent to one node, however, it only keep the first received one. Once that node finishes receiving its first copy, it starts processing the data while forwarding. The whole system acts like a pipeline with communication overlaps with computing.
Modified Pipeline. Tan et al. Tan propose an improved pipeline broadcast scheme for distributed matrix multiplication. The non-blocking pipeline scheme takes advantage of tuned chunk size to boost communication performance. We apply the idea to heterogeneous mesh network.
6.2.3 Evaluation Result on Mesh
Overall Communication Volume. Figure 7 displays the overall communication volume when conducting matrix multiplication of two matrices in (a) 5*5 mesh, (b) 7*7 mesh, and (c) 9*9 mesh respectively. The simulation result reveals that while all algorithms’s overall communication volume goes up as matrix size increases, SUMMA, LBP and LBP-heuristic generate almost the same smallest overall communication volume, which is smaller than that of Modified Pipeline and smaller than that of Pipeline. SUMMA is well-known to be communication-optimal on homogeneous mesh network. Though applying SUMMA on heterogeneous mesh may affect its overall finishing time due to the change of processor speed and link speed, its data transmission pattern won’t be affected. So SUMMA is still communication-optimal on heterogeneous mesh. LBP and LBP-heuristic generates almost the same communication volume as SUMMA, consequently, implies that LBP and LBP-heuristic are at least close to communication optimal on heterogeneous mesh. Moreover, we observe that LBP, LBP-heuristic, SUMMA are close to each other as network dimension increases, but their difference ratio with the other algorithms are getting larger and larger.
Task Finishing Time. Figure 8 shows the task finishing time of each algorithm on (a) 5*5 mesh, (b) 7*7 mesh, and (c) 9*9 mesh. Generally, LBP generates the smallest task finishing time than the rest of algorithms. LBP-heuristic gives a task finishing time that is slightly longer than that of LBP, which are more in 5*5 mesh, more in 7*7 mesh, and more in 9*9 mesh, respectively. This tiny difference can entirely be ignored. SUMMA, since it can no longer reach load balance with link speed and processor speed vary, its task finishing time are, respectively, more in 5*5 mesh, more in 7*7 mesh, and more in 9*9 mesh, than that of LBP. Moreover, Modified Pipeline are respectively more in 5*5 mesh, more in 7*7 mesh, and more in 9*9 mesh. Pipeline are respectively more in 5*5 mesh, more in 7*7 mesh, and more in 9*9 mesh. All in all, LBP and LBP-heuristic present the best performance in terms of task finishing time.
Total Number of Iterations to Solve LP. As mentioned previously, we use the simplex algorithm to solve LP in our LBP and LBP-heuristic algorithm. Each time solving the LP costs a certain number of iterations by the simplex algorithm. And according to our algorithm, we may re-solve LP a couple of times due to 1.find real number solution 2. find integer solution 3. local search, etc. Therefore, the total number of iterations to solve LP is a good indication of the efficiency of our algorithms. Figure 9 counts the average total number of iterations in solving LP by LBP and LBP-heuristic on 5*5 mesh, 7*7 mesh, 9*9 mesh. Each point is an average of 10 identical independent experiments. The solid lines represent LBP whereas the dashed lines represent LBP-heuristic. We have the following observations: 1. The solid lines vary dramatically due to the uncertain number of times to re-solve LP by LBP, while the dashed lines are comparatively stable. 2. The total number iterations show no correlation with respect to matrix size, a good evidence indicating that both algorithms are suitable for large scale matrix scheduling. 3. The total number of iterations does show positive correlation with respect to mesh size. 4. For the same mesh size, dashed lines are generally far below solid line, which indicates that LBP-heuristic generally requires much less total number of iterations to find a solution than LBP. In other words, LBP-heuristic is more efficient.