The Square Kilometre Array (SKA) will be the largest radio telescope in the world. The two components of the first phase of SKA (SKA1), SKA-Mid and SKA-Low, will jointly produce large amounts of data at a rate of one terabyte (TB) per second, with the second-phase data rate expected to be at least ten times higher. All this data has to be captured, reduced, processed, and analyzed in near real-time. This poses a great challenge, since the current generation of radio astronomy data processing systems is designed to handle data volumes approximately two to three orders of magnitude smaller than those of SKA1.
To tackle this challenge, we developed the Data Activated Liu Graph Engine (DALiuGE, https://github.com/ICRAR/daliuge) to execute continuous, time-critical, data-intensive workflows in order to produce science-ready data products. Compared to existing astronomical workflow systems, DALiuGE has several advantages such as separation of concerns, data-centric execution, graph-based dataflow scheduling, and native support for streaming processing.
A technical overview of DALiuGE and its operational production systems is given in Wu et al. [21]. In this paper, we focus on the DALiuGE graph scheduling sub-system. In particular, we discuss the technical details of the dataflow partitioning algorithms and their implementations.
2 Related work
The dataflow computation model represents workflows as Directed Acyclic Graphs (DAGs), where vertices are stateless computational tasks (i.e. functions) and edges connect the output of one task with the input of another. Although the dataflow model exploits the parallelism inherent in DAGs through data dependencies, mapping an irregular DAG onto hardware resources for optimal execution is an NP-hard problem. Early work attempted to derive data structures (e.g. an assignment graph or an allocation graph) from the original DAG in order to run tractable search and optimisation algorithms (e.g. using maximum flow solutions). While these algorithms were able to uncover an optimal solution in polynomial time, the size of the assignment graph grows with both the number of vertices in the original DAG and the number of available processors. Therefore, as the DAG size and resource pool grow substantially (e.g. from tens of tasks running on a laptop to millions of tasks running on thousands of processors), these exact optimisation methods quickly become intractable.
A variety of heuristics-based algorithms have been developed for scheduling DAGs on multiprocessors. These heuristics generally fall into two alternative approaches: one-phase or two-phase. In the one-phase approach (e.g., the widely-used HEFT algorithm [18]), DAG scheduling is performed by directly mapping a ranked list of workflow tasks to another ranked list of resource units (e.g. processors or nodes) based on some aggregated run-time workflow profiles and resource statistics. In contrast, the two-phase approach [13, 16] first partitions the DAG into a number of clusters based on heuristics such as load balancing, minimal data movement, etc. In the second phase, these clusters are then mapped onto actual hardware resources for execution. We currently adopt the two-phase approach because the output from the first phase encodes a resource demand abstraction (RDA) derived from intrinsic properties of the DAG. The RDA becomes the input for resource mapping in the second phase. More importantly, the RDA provides a more accurate estimate of resource demand for future capacity planning and observation scheduling for the telescope manager. However, most two-phase algorithms target multiprocessors on a single compute node, where each workflow task consumes exactly one processor. Our workflows need to run across clusters of compute nodes, each consisting of multiple processors. In addition, each workflow task inherently demands multiple, and differing, numbers of processors/cores and differing amounts of memory. Dealing with this kind of complexity in resource demand and multiplicity in resource capabilities is one of our contributions in this paper. Moreover, unlike most existing DAG scheduling/mapping algorithms, our partitioning algorithm aims to reduce the overall resource footprint given these complexities and constraints.
On the other hand, the advantage of the one-phase approach is its flexibility to incorporate run-time resource heterogeneity. We leave for our future work a thorough investigation and application of the one-phase approach to our DAG mapping problem.
Although progress has been made recently in partitioning very large graphs for various social network analysis and machine learning applications, direct application of these graph partitioning algorithms to dataflow partitioning often leads to sub-optimal solutions. This is because the DAG (or general graph) representation of the dataflow does not encode the notion of the workflow execution working set $W_t$: a small set of workflow tasks that are being executed at time $t$. Only tasks in $W_t$ consume resources; other tasks are either waiting for the completion of their “upstream” tasks in $W_t$ or have already completed their executions. Therefore, partitioning the entire graph (e.g. in the order of millions of nodes) for subsequent resource mapping is (1) wasteful given that $W_t$ is much smaller than the full set of tasks, and (2) ill-posed since $W_t$ is time-dependent and is unknown at the time of graph partitioning.
3 Overview of Graph Execution
Following the two-phase approach, the four steps of the graph execution are illustrated in Figure 1. We briefly introduce them in this section. Readers are referred to Wu et al. [21] for a detailed technical discussion on graph execution.
Starting from the top left corner, a staff astronomer composes a logical graph representing high-level data processing capabilities (e.g., “Image deconvolution”) using resource-oblivious dataflow constructs and workflow task components. The first step unrolls the logical graph by expanding all parallel branches and loops, instantiating tasks in all branches and iterations and connecting them with directed edges as per the logical graph definition. The result of unrolling is the Physical Graph Template (PGT) shown in the top right corner. It should be noted that, unlike traditional dataflow graph representations, DALiuGE models data as well as tasks as graph vertices. From a workflow viewpoint, all data items are essentially “data tasks” (shown as parallelograms in Figure 1) that can trigger the execution of their consumer tasks (shown as rectangles).
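As a rough illustration of the unrolling step, the sketch below expands a single parallel construct into explicit task and data vertices. The function name `unroll_scatter`, the vertex names, and the dictionary layout are hypothetical and chosen only for this example; they do not reflect the DALiuGE implementation or its graph format.

```python
def unroll_scatter(num_branches):
    """Expand one hypothetical 'scatter' construct into a tiny PGT.

    Both tasks and data items become vertices, mirroring the idea that
    data items act as "data tasks" that trigger their consumers.
    """
    vertices = {"input": "data", "gather": "task", "result": "data"}
    edges = []  # (producer, consumer) pairs
    for b in range(num_branches):
        task, chunk = f"process_{b}", f"chunk_{b}"
        vertices[task] = "task"
        vertices[chunk] = "data"
        # input data -> branch task -> branch output data -> gather task
        edges += [("input", task), (task, chunk), (chunk, "gather")]
    edges.append(("gather", "result"))
    return vertices, edges


vertices, edges = unroll_scatter(3)
num_tasks = sum(1 for kind in vertices.values() if kind == "task")
num_data = sum(1 for kind in vertices.values() if kind == "data")
print(num_tasks, num_data, len(edges))
```

With three branches, the unrolled graph holds four task vertices, five data vertices, and ten edges; a real logical graph would nest such constructs and loops.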
The second step, i.e. the focus of this paper, divides the PGT into a set of logical partitions such that certain performance requirements (e.g. total completion time, total data movement, etc.) are met under given constraints (e.g. resource footprint, collocation criteria, device locality, etc.). This step outputs the Physical Graph Template Partition (PGTP), which provides the Telescope Manager with an approximate solution to construct the observation scheduling blocks months or weeks prior to observation and compute resource allocation. An example of PGTP is shown at the bottom right of Figure 1, where 19 partitions are produced and one of them is visually expanded with 11 enclosing workflow tasks. Furthermore, a resource reservation that contains 19 nodes can be submitted to the telescope manager weeks before the associated observation takes place.
The third step maps each logical partition of the PG onto a given set of currently available resources in certain optimal ways. In principle, each partition is placed onto a physical compute node in the cluster. Such placement requires real-time information on resource availability, and we currently assume resource pools consisting of nodes with identical computing, storage, and interconnect capabilities. In cases where the number of partitions is greater than the number of available nodes, DALiuGE can be configured to merge the PGT partitions into virtual clusters with the goal of balancing the overall workload (both compute time and memory usage) evenly before mapping.
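One simple way to picture the merging of partitions into balanced virtual clusters is a greedy "largest load first" heuristic, sketched below. This heuristic is our own assumption for illustration and is not necessarily the method DALiuGE uses; the function name `merge_into_clusters` and the combined load metric (normalised compute time plus normalised memory) are likewise hypothetical.

```python
import heapq


def merge_into_clusters(partitions, num_nodes):
    """partitions: dict name -> (compute_time, memory). Returns one list per node."""
    total_t = sum(t for t, _ in partitions.values()) or 1.0
    total_m = sum(m for _, m in partitions.values()) or 1.0
    # Combined load: share of total compute time plus share of total memory.
    load = {p: t / total_t + m / total_m for p, (t, m) in partitions.items()}

    heap = [(0.0, k) for k in range(num_nodes)]  # (current load, cluster index)
    clusters = [[] for _ in range(num_nodes)]
    # Place the heaviest partitions first, always on the least loaded cluster.
    for p in sorted(load, key=load.get, reverse=True):
        current, k = heapq.heappop(heap)
        clusters[k].append(p)
        heapq.heappush(heap, (current + load[p], k))
    return clusters


demo = {"a": (4, 4), "b": (3, 3), "c": (2, 2), "d": (1, 1)}
print(merge_into_clusters(demo, 2))
```

For the four demo partitions and two nodes, the heaviest and lightest partitions end up together, so both virtual clusters carry the same combined load.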
The final step involves optimal execution of tasks that have been allocated to a single node by the previous two steps. DALiuGE currently offloads this step to local schedulers provided by the host OS running on each compute node. We are currently working on the integration of graph-based GPU schedulers for dynamically scheduling GPU-accelerated workflow tasks on a single node with multiple GPUs.
In the following sections, we focus solely on the technical details of the second step — dataflow partitioning.
4 Dataflow partitioning
During graph partitioning, a PGT $G$ of $N$ vertices is decomposed into $M$ partitions, each of which conceptually represents a compute node with a pre-defined resource capacity vector $\mathbf{C}$. The goal of graph partitioning is to obtain an estimate of the minimum number of compute nodes $M^*$ needed to execute the PGT and its corresponding PGT completion time $T^*$. Initially, the partitioning algorithm lets $M = N$, with each vertex being an individual partition. The algorithm then iteratively decreases $M$ through partition merging (line 11 in Algorithm 1). This is equivalent to keeping the PGT completion time monotonically non-increasing, as exemplified in Figure 3. See Theorem 1 for a proof. Therefore, a partition scheme that produces $M^*$ ideally also achieves the minimum PGT completion time $T^*$ under the current graph partitioning algorithm. This follows the “data locality” principle, which suggests that the unit cost of data movement between two partitions is far greater than that within the same partition. Therefore, fewer partitions lead to faster completion with less data movement, lower resource usage, and lower operational cost.
On the other hand, a smaller number of partitions $M$ corresponds to a greater resource demand per partition since more Drops are allocated to each partition. This means the aggregated resource demand from concurrently running Drops in a given partition is more likely to exceed the node capacity $\mathbf{C}$, slowing down the graph execution due to resource over-subscription. An ideal partitioning solution not only obtains an optimal $M$ but also ensures that resource demands in all partitions stay within $\mathbf{C}$ at any point during the graph execution. Satisfying this constraint avoids unpredictable execution delay due to resource over-subscription, thus keeping the completion time predictable. Formally, the graph partitioning is formulated as a constrained optimisation problem:

$$P^* = \arg\min_{P} M(G, P), \qquad T^* = T(G, P^*), \qquad \text{subject to } D_i(t) \le \mathbf{C} \quad \forall i,\ \forall t, \qquad (1)$$

where $M(G, P)$ is a function that outputs the number of partitions given a PGT $G$ and a partition solution $P$, $T(G, P)$ is a function that outputs the completion time given $G$ and $P$, and $D_i(t)$ denotes the aggregated resource demand from all running Drops in partition $i$ at time $t$. We refer to the constraint defined in Equation 1 as the DoP constraint, where “DoP” stands for Degree of Parallelism. Figure 2 exemplifies partitioning solutions that do (not) satisfy the DoP constraint.
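To make the DoP constraint concrete, the following sketch checks it for a single partition under the simplifying assumption that the start and end times of every Drop are already known. That assumption does not hold at partitioning time, which is why a time-independent bound is needed later. The tuple layout `(start, end, demand)` and the function names are hypothetical.

```python
def peak_demand(drops):
    """drops: list of (start, end, demand); demand is a tuple such as (cores, memory_gb).

    Returns the per-resource peak of the aggregated demand over time.
    """
    dims = len(drops[0][2])
    peak = [0] * dims
    # Aggregated demand can only rise when a Drop starts,
    # so inspecting the start times is sufficient to find the peak.
    for t in sorted({start for start, _, _ in drops}):
        running = [d for s, e, d in drops if s <= t < e]
        total = [sum(d[k] for d in running) for k in range(dims)]
        peak = [max(p, x) for p, x in zip(peak, total)]
    return tuple(peak)


def satisfies_dop(drops, capacity):
    """True if the aggregated demand never exceeds the capacity vector."""
    return all(p <= c for p, c in zip(peak_demand(drops), capacity))


drops = [(0, 4, (8, 16)), (2, 6, (4, 8)), (5, 9, (8, 32))]
print(peak_demand(drops))
print(satisfies_dop(drops, (16, 64)), satisfies_dop(drops, (16, 32)))
```

In this example the peak demand is 12 cores and 40 GB, so a node with 16 cores and 64 GB satisfies the constraint while a node with 16 cores and 32 GB does not.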
Once the optimal graph partitioning solution $P^*$ is available, both $P^*$ (known as the Physical Graph Template Partition) and the number of partitions it yields (i.e. $M(G, P^*)$) are used by the telescope manager for the generation of observation and computing resource schedules well before the observation takes place.
4.1 Partitioning Algorithm
The main idea of the partitioning algorithm (Algorithm 1) is to iteratively reduce data movement between inter-node Drops by “merging” them into the same node, where the cost of intra-node communication is negligible. Given a PGT $G$, the algorithm sorts all edges in $G$ by weight in descending order. The edge weight here denotes the volume of data “on the move” from one Drop to the next. Each Drop is initially allocated to a separate node. Then, going through all edges in descending order of their weights, the algorithm merges the two partitions associated with the Drops at both ends of the edge if the merged partition meets the DoP constraint defined in Equation 1. The algorithm is “greedy” since it reduces larger costs before dealing with smaller ones. However, this may not necessarily lead to a globally optimal solution, especially for large graphs. We are currently investigating various local search heuristics to overcome this limitation.
Although the iterative edge zeroing procedure is based on an existing graph clustering algorithm, we made two important changes. First, we allow two existing partitions to merge again in order to further reduce the number of partitions, which in turn reduces the total completion time as suggested in Theorem 1. Second, we evaluate the DoP constraint in order to accept or reject partition merging proposals (line 12). The evaluation of the DoP constraint not only considers each graph vertex’s processing requirement in terms of the maximum number of concurrent threads, memory usage, etc., but also incorporates predefined resource capacities for each partition, including the number of cores, memory capacity, etc.
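The greedy merging loop described in this subsection can be sketched as follows. This is a minimal illustration rather than a transcription of Algorithm 1: the union-find bookkeeping, the function name `greedy_partition`, and the `can_merge` callback (a stand-in for the DoP constraint evaluation) are all assumptions made for the example.

```python
def greedy_partition(num_drops, edges, can_merge):
    """edges: list of (u, v, weight); can_merge(members) -> bool stands in for the DoP check."""
    parent = list(range(num_drops))  # each Drop starts in its own partition

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    members = {i: {i} for i in range(num_drops)}
    # Visit edges from the largest data volume to the smallest.
    for u, v, _ in sorted(edges, key=lambda e: e[2], reverse=True):
        ru, rv = find(u), find(v)
        if ru == rv:
            continue  # both Drops already share a partition
        merged = members[ru] | members[rv]
        if can_merge(merged):  # accept the merge only if the constraint holds
            parent[rv] = ru
            members[ru] = merged
            del members[rv]
    return sorted(sorted(m) for m in members.values())


edges = [(0, 1, 10), (1, 2, 5), (2, 3, 8)]
# Hypothetical constraint: at most two Drops per partition.
print(greedy_partition(4, edges, lambda m: len(m) <= 2))
```

Here the two heaviest edges are zeroed first, and the merge across the lightest edge is rejected by the stand-in constraint, leaving two partitions. In practice the callback would evaluate resource demand against node capacity rather than count Drops.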
Theorem 1. The edge zeroing statement at line 10 in Algorithm 1 ensures that the completion time of the PGT $G$ is monotonically non-increasing.

Proof. Let $e$ be the edge being zeroed and let $p$ be the longest path of $G$, with length $l$, before the zeroing. If $e$ is on $p$, there are two possibilities after $e$’s weight becomes zero: $p$ remains the longest path of $G$, or another path $p'$ becomes the longest path of $G$. In the first case, let $l'$ be the new length of $p$. It is easy to verify that $l' \le l$, since the only change is the removal of $e$’s weight. In the second case, let $l''$ be the length of $p'$. It must be true that $l'' \le l$ because otherwise $p'$ (rather than $p$) would have been the longest path before the edge zeroing takes place.

If $e$ is not on $p$, there are also two possibilities after $e$’s weight becomes zero: $e$ remains off the longest path of $G$, or $e$ becomes part of the “new” longest path $p'$ of $G$. In the first case, since $p$ is not affected whatsoever, the completion time remains the same and is thus non-increasing. In the second case, let $l''$ be the length of $p'$. Zeroing $e$ can only shorten $p'$, so it must be true that $l'' \le l$ because otherwise $p'$ (rather than $p$) would have been the longest path before the edge zeroing takes place.
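The non-increasing property can be checked numerically on a small example. The sketch below assumes that the completion time equals the length of the longest path, counting both task execution costs on vertices and data movement costs on edges; this cost model, the function name `completion_time`, and the sample graph are illustrative assumptions.

```python
from functools import lru_cache


def completion_time(task_cost, edges):
    """Longest path length over vertex costs plus edge (data movement) costs."""
    succ = {v: [] for v in task_cost}
    for (u, v), w in edges.items():
        succ[u].append((v, w))

    @lru_cache(maxsize=None)
    def longest_from(u):
        # Cost of u plus the most expensive continuation, if any.
        return task_cost[u] + max((w + longest_from(v) for v, w in succ[u]), default=0)

    return max(longest_from(u) for u in task_cost)


task_cost = {"a": 2, "b": 3, "c": 1, "d": 2}
edges = {("a", "b"): 5, ("a", "c"): 1, ("b", "d"): 4, ("c", "d"): 6}
before = completion_time(task_cost, edges)
for e in edges:
    zeroed = dict(edges, **{e: 0}) if False else {**edges, e: 0}
    print(e, before, completion_time(task_cost, zeroed))
```

Zeroing an edge on the longest path (for instance the edge from "a" to "b") shortens the completion time, while zeroing an edge off the longest path leaves it unchanged; in no case does it grow.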
4.2 DoP Constraint Evaluation
In this subsection, we discuss the DoP evaluation algorithm defined in the try_merge_partition function called at line 12 in Algorithm 1. As shown in Equation 1, this boils down to efficiently computing the total resource usage $D_i(t)$ summed over all running Drops inside a given partition $i$ at a particular time $t$. To do this, we first establish the equivalence between the set $S_i(t)$ of Drops running in parallel at time $t$ and the concept of an antichain: a set of mutually unreachable vertices of the DAG $G_i$ associated with a given partition.

Theorem 2. If all Drops in $S_i(t)$ are running in a non-streaming mode, $S_i(t)$ is an antichain of $G_i$.

Proof. The non-streaming running mode excludes one possible form of parallelism: pipelining. All other forms of parallelism require the Drops in $S_i(t)$ to be mutually unreachable on $G_i$, because otherwise the dependencies implied by reachability would prevent them from running in parallel.
We define the length of an antichain $A$ as the number of Drops in $A$, and define the weighted length of an antichain as the aggregated weight summed over all Drops in $A$. The weight of the $j$th Drop in an antichain is its pre-determined peak resource usage, denoted by $w_j$. Let $\mathcal{A}(G_i)$ denote the set of all antichains in a partition graph $G_i$. It then follows from Theorem 2 that the total resource usage $D_i(t)$ is bounded by the antichain(s) with the maximum (longest) weighted length amongst all antichains in $\mathcal{A}(G_i)$:

$$D_i(t) \le W_{\max}(G_i) = \max_{A \in \mathcal{A}(G_i)} \sum_{j \in A} w_j. \qquad (2)$$

Equation 2 bounds a time-dependent value $D_i(t)$ by a time-invariant constant $W_{\max}(G_i)$ such that if $W_{\max}(G_i) \le \mathbf{C}$ for a given partition, the constraint condition in Equation 1 will be satisfied. However, finding the antichain that produces $W_{\max}(G_i)$ is not trivial since the cardinality of $\mathcal{A}(G_i)$, i.e. the total number of antichains in a partition graph, can be in the order of $2^n$, with $n$ being the number of vertices in $G_i$. Therefore, enumeration and evaluation of all antichains is computationally infeasible in practice, where a typical partition has at least tens or even hundreds of tasks (e.g. there could be up to one billion antichains for a graph with merely 30 vertices).
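For very small graphs, the maximum weighted antichain length can be found by exhaustive enumeration, which also makes the combinatorial growth tangible. The brute-force sketch below is for illustration only and is not the algorithm used in this paper; the function names and the sample DAG are hypothetical.

```python
from itertools import combinations


def reachability(n, edges):
    """Transitive closure of a DAG with vertices 0..n-1 (Warshall's algorithm)."""
    reach = [[False] * n for _ in range(n)]
    for u, v in edges:
        reach[u][v] = True
    for k in range(n):
        for i in range(n):
            if reach[i][k]:
                for j in range(n):
                    if reach[k][j]:
                        reach[i][j] = True
    return reach


def max_weighted_antichain(weights, edges):
    """Return (best weighted length, best antichain, number of non-empty antichains)."""
    n = len(weights)
    reach = reachability(n, edges)
    best, best_set, count = 0, (), 0
    for size in range(1, n + 1):
        for subset in combinations(range(n), size):
            # An antichain contains no pair of mutually reachable vertices.
            if all(not reach[a][b] and not reach[b][a]
                   for a, b in combinations(subset, 2)):
                count += 1
                total = sum(weights[v] for v in subset)
                if total > best:
                    best, best_set = total, subset
    return best, best_set, count


# Vertex 0 feeds three parallel vertices (1, 2, 3), which all feed vertex 4.
edges = [(0, 1), (0, 2), (0, 3), (1, 4), (2, 4), (3, 4)]
weights = [16, 1, 1, 1, 2]
print(max_weighted_antichain(weights, edges))
```

In this example the heaviest antichain is the single vertex 0 with weight 16, even though the antichain made of vertices 1, 2 and 3 contains more Drops. The loop visits every non-empty subset, so its cost doubles with each added vertex.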
To compute the maximum antichain length for a given graph in polynomial time, one can apply Dilworth’s Theorem [8], which states that the maximum length of an antichain is equal to the minimum number of chains needed to fully “cover” the graph. In particular, Fulkerson [9] established the equivalence between the maximum antichain length and the maximum matching in a constructed split graph (a.k.a. bipartite graph). As a result, the longest antichain of a graph (the antichain that has the maximum cardinality) can be discovered in polynomial time. However, Equation 2 suggests that the longest antichain does not necessarily have the longest weighted length unless every Drop has the same weight. Hence, whilst we can efficiently solve the special case where each Drop consumes only one unit of resource (e.g. 1 core, 1 GB of RAM, etc.), we need a different algorithm to evaluate more generic cases where Drops consume arbitrary units of resources (e.g. 16 cores, 375 MB of RAM).
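The unweighted special case can be sketched with a textbook construction: build a bipartite graph with an edge from the left copy of $u$ to the right copy of $v$ whenever $v$ is reachable from $u$, find a maximum matching, and subtract its size from the number of vertices. The code below uses a simple augmenting-path matching for brevity, which is an assumption of this example rather than the method used in DALiuGE; the function name is hypothetical.

```python
def max_antichain_length(n, edges):
    """Size of the largest antichain of a DAG with vertices 0..n-1."""
    succ = [[] for _ in range(n)]
    for u, v in edges:
        succ[u].append(v)

    def descendants(u):
        # All vertices reachable from u (iterative depth-first search).
        seen, stack = set(), list(succ[u])
        while stack:
            v = stack.pop()
            if v not in seen:
                seen.add(v)
                stack.extend(succ[v])
        return seen

    reach = [descendants(u) for u in range(n)]
    match_of_right = [None] * n  # right copy of v -> matched left vertex

    def augment(u, visited):
        # Try to match the left copy of u, re-routing earlier matches if needed.
        for v in reach[u]:
            if v in visited:
                continue
            visited.add(v)
            if match_of_right[v] is None or augment(match_of_right[v], visited):
                match_of_right[v] = u
                return True
        return False

    matching = sum(augment(u, set()) for u in range(n))
    # Dilworth: minimum chain cover = n - matching = maximum antichain size.
    return n - matching


edges = [(0, 1), (0, 2), (0, 3), (1, 4), (2, 4), (3, 4)]
print(max_antichain_length(5, edges))
```

For the sample DAG, where one vertex fans out to three parallel vertices that join again, the result is 3. A chain of three vertices would give 1, and three unconnected vertices would give 3.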
In the following, we discuss details of Algorithm 2, which efficiently computes the maximum weighted antichain length for generic cases, based on the method of Cong [6] for computing a maximum weighted $k$-family. While a $k$-family is a union of at most $k$ antichains in a DAG, we are interested only in the special case $k = 1$, which solves our problem of computing the maximum weighted length of a single antichain.
The central idea of Algorithm 2 is to exploit the equivalence between the maximum weighted antichain of the original DAG and the minimum-cost maximum-flow (MCMF) solution of the split graph created at line 2. Note that the number of nodes of the split graph grows linearly with $n$, where $n$ is the number of vertices of the original DAG, which is the union of the two DAGs being merged. This ensures that a polynomial algorithm on the split graph remains tractable with respect to the original DAG.
To find the MCMF solution, we first derive the admissible graph (line 3), and run a standard maximum flow algorithm to obtain the flow (line 4). We then construct the residual graph (line 5). The residual graph has the same set of vertices as the graph it is derived from, and if there are no edges going from its source vertex to some vertex $v$, then we set the node potential of $v$ to 1 (line 8). In the end, the maximum weighted antichain is calculated (lines 14 to 20) based on the expressions defined in Theorem 3.1. Figure 3 shows the results of running Algorithms 1 and 2 to schedule three different radio interferometry imaging workflows.
Optimal scheduling of large-scale, data-intensive workflows is challenging. In this paper, we discussed related work on graph scheduling and proposed polynomial time optimization methods that minimize both workflow execution time and resource footprint while meeting resource demand constraints imposed by individual algorithms. We show preliminary results obtained from three radio astronomy data pipelines.
-  M. Bateni, S. Behnezhad, M. Derakhshan, M. Hajiaghayi, R. Kiveris, S. Lattanzi, and V. Mirrokni. Affinity clustering: Hierarchical clustering at scale. In Advances in Neural Information Processing Systems, pages 6867–6877, 2017.
-  S. H. Bokhari. A shortest tree algorithm for optimal assignments across space and time in a distributed processor system. IEEE transactions on Software Engineering, (6):583–589, 1981.
-  R. Braun, T. Bourke, J. Green, E. Keane, and J. Wagg. Advancing astrophysics with the square kilometre array. Advancing Astrophysics with the Square Kilometre Array (AASKA14), 1:174, 2015.
-  K. Cameron. Antichain sequences. Order, 2(3):249–255, 1985.
-  V. Chaudhary and J. K. Aggarwal. A generalized scheme for mapping parallel algorithms. IEEE Transactions on Parallel and Distributed Systems, 4(3):328–346, 1993.
-  J. Cong. Computing maximum weighted k-families and k-cofamilies in partially ordered sets. Computer Science Department, University of California, 1993.
-  J. B. Dennis and D. P. Misunas. A preliminary architecture for a basic data-flow processor. In ACM SIGARCH Computer Architecture News, volume 3, pages 126–132. ACM, 1975.
-  R. P. Dilworth. A decomposition theorem for partially ordered sets. Annals of Mathematics, pages 161–166, 1950.
-  D. R. Fulkerson. Note on Dilworth’s decomposition theorem for partially ordered sets. In Proc. Amer. Math. Soc, volume 7, pages 701–702, 1956.
-  A. V. Goldberg and R. E. Tarjan. A new approach to the maximum-flow problem. Journal of the ACM (JACM), 35(4):921–940, 1988.
-  G. Karypis and V. Kumar. Multilevel k-way partitioning scheme for irregular graphs. Journal of Parallel and Distributed Computing, 48(1):96–129, 1998.
-  Y.-K. Kwok and I. Ahmad. Static scheduling algorithms for allocating directed task graphs to multiprocessors. ACM Computing Surveys (CSUR), 31(4):406–471, 1999.
-  J.-C. Liou and M. A. Palis. A comparison of general approaches to multiprocessor scheduling. In Parallel Processing Symposium, 1997. Proceedings., 11th International, pages 152–156. IEEE, 1997.
-  D. Marcus. Graph theory: a problem oriented approach. The Mathematical Association of America, 2008.
-  C. Martella, D. Logothetis, A. Loukas, and G. Siganos. Spinner: Scalable graph partitioning in the cloud. In Data Engineering (ICDE), 2017 IEEE 33rd International Conference on, pages 1083–1094. IEEE, 2017.
-  V. Sarkar. Partitioning and scheduling parallel programs for execution on multiprocessors. PhD thesis, 1987.
-  H. S. Stone. Multiprocessor scheduling with the aid of network flow algorithms. IEEE transactions on Software Engineering, (1):85–93, 1977.
-  H. Topcuoglu, S. Hariri, and M.-y. Wu. Performance-effective and low-complexity task scheduling for heterogeneous computing. Parallel and Distributed Systems, IEEE Transactions on, 13(3):260–274, 2002.
-  D. Towsley. Allocating programs containing branches and loops within a multiple processor system. IEEE Transactions on Software Engineering, (10):1018–1024, 1986.
-  C. Tsourakakis, C. Gkantsidis, B. Radunovic, and M. Vojnovic. Fennel: Streaming graph partitioning for massive scale graphs. In Proceedings of the 7th ACM international conference on Web search and data mining, pages 333–342. ACM, 2014.
-  C. Wu, R. Tobar, K. Vinsen, A. Wicenec, D. Pallot, B. Lao, R. Wang, T. An, M. Boulton, I. Cooper, et al. DALiuGE: A graph execution framework for harnessing the astronomical data deluge. Astronomy and Computing, 20:1–15, 2017.