The trend towards larger many/multicore architectures adds pressure on memory technology to sustain maximum performance. Advances in memory technology typically target lower latency and higher, more scalable data transfer bandwidth. For a parallel program running on such platforms, performance is often subject to data access latency that varies based on where the data resides. This may range from local resources (e.g. L1/L2 caches) to uncore resources such as the L3 cache or DRAM. Furthermore, to increase DRAM bandwidth and capacity, computing nodes increasingly leverage within-socket (amd-epyc2; fujitsu-a64fx2) and multi-socket (intel-scalable; amd-zeppelin) Non-Uniform Memory Access (NUMA) domains, adding yet another level to the memory hierarchy. This makes the optimization of parallel programs more challenging, as their performance is potentially impacted by nonuniform data access latency, memory channel bandwidth and resource contention. These challenges are often addressed by approaches that aim to maximize data locality (unat-tpds17). Thus, advanced programming, runtime and compiler level techniques have been introduced to handle the resulting asymmetry in access speeds, to overcome the penalties of non-local requests and to maximize the utilized bandwidth (zoltan-icss-11; compiler-opt-locality-asplos-94; acar-locality; unat-tpds17).
Task-based programming models are among the expressive parallel programming paradigms (Nakashima2014_Massive_threads; Willhalm2008_TBB; wheeler-qthreads; duran-ppl11). They are frequently used to express the parallelism of a program in the form of tasks. The task scheduling problem (i.e. mapping tasks to execution units such as threads) is, however, nontrivial. This is particularly true with the growing complexity of task DAGs (Directed Acyclic Graphs) and execution platforms. Therefore, scheduling must be performed online by the task-based runtime schedulers.
Many schedulers embraced by modern task-parallel runtime libraries leverage random work-stealing as it achieves dynamic load-balance and increases the average core utilization (blumofe-cilk). Hence, work-stealing has become the de facto implementation choice of many task-parallel runtime systems. Nonetheless, random work-stealing assumes a flat view of the memory resources and is bound to suffer a considerable performance penalty, especially when scheduling tasks whose performance is influenced by data access latency and bandwidth.
For this reason, locality-aware extensions to work stealing have been proposed to mitigate the effects of non-local memory/cache access latency. These schedulers fall into three categories. In the first category, the task-DAG’s input/output dependencies are analyzed to devise a locality-aware schedule (drebes-taco14; drebes-pact2016; Barrera-ics2018). Tasks are then mapped to reduce a distance metric based on a static description of the hardware topology, which explains how the different hardware components (such as caches, memory controllers, cores, threads, etc.) are organized. A second category focuses on creating work-stealing regions with low latency memory access. This is achieved by leveraging locality hints from the programmer to aggregate the tasks into execution places (i.e. collections of cores) that can be configured to match the hardware architecture such as threads, cores or sockets. Examples of this approach include LAWS (guo-ipdps10), ADWS (ShiinaSC19_ADWS), SWAS (swas-tzilis-icppw-17) and Olivier et al. (Olivier_IJHPCA12_OpenMP). In the third category, the task data allocation is controlled by the runtime to evenly distribute the data over the available NUMA domains, then schedule tasks close to their known data regions (chen-ics14).
While locality-driven execution is a key factor in reducing data access latency and increasing performance, the overall application performance gain is subject to elements that are not clearly addressed by such schemes. Consider, for example, the arithmetic intensity (AI, i.e. flops per byte ratio). Application kernels with high AI are less impacted by data placement and can benefit from maximizing the compute throughput of all available execution units. This can be achieved by using simple greedy schemes such as work stealing. Hence, such applications do not benefit from scheduling schemes that start by maximizing locality and then apply work stealing (ShiinaSC19_ADWS), or from schemes that limit their scope to tasks that have higher LLC miss rates (drebes-pact2016). Another key element in achieving higher application performance is maximizing task data reuse and reducing resource requirements (e.g. bandwidth). This is especially important for fine-grain tasks whose data size fits within the lower level caches. Finally, it is not clear how such schedulers behave in the event of lower DAG parallelism, which can result from unwinding the recursion in common divide-and-conquer DAGs that are a special case of fully strict DAGs (blumofe-cilk). In light of these elements, a scheduler that dynamically adapts to the task and DAG requirements (e.g. flops, channel bandwidth, cache reuse, parallelism, criticality) can obtain high performance on a variety of platforms while at the same time simplifying the programmer’s task.
This paper proposes ARMS, the Adaptive Resource-Moldable Scheduler. ARMS dynamically builds a model of the performance of each application task on the system resource partitions, such as the cores that share cache levels, NUMA nodes, sockets, etc. The model used, herein, serves as a predictor for the performance of a task on the available system resource partitions, and guides the scheduling decisions. It is created at runtime for each task based on its location in the task DAG topology. Software topology information is used as a portable key by the runtime to map the logical location of task’s input (e.g. the Cartesian coordinates or the matrix indices) to a physical core. In the absence of topology information, the scheduler automatically assigns a relative address to the task based on its location in the DAG (depth and breadth). Hence, ARMS is able to model the effects of scheduling a task on multicore partitions within or outside of its data region (e.g., NUMA node), identified by the task’s specific topology location. Then, it uses this knowledge to create a schedule that reduces the per-task parallel cost. In addition, ARMS supports not only traditional 1:1 mapping (i.e. a single task to a single thread) but also 1:M mapping (i.e. moldability when 1 task is assigned M resources via worksharing). A moldable task encompasses an internal scheduler that assigns the work partitions to threads within the task function. This, in essence, represents a Single-Program-Multiple-Data (SPMD) region. The flexibility in the mapping allows, for example, a memory-bound task to leverage a higher memory bandwidth by using more resources (threads), while a resource intensive task can set the number of threads to match its cache requirements. Also, the number of threads assigned to a task is dynamically changed based on the DAG’s parallelism.
The main contributions of this work are as follows:
We identify the key parameters that constitute a locality adaptive performance model for a task, which are the task work function and the topology information.
We map the parameters to a model that captures the performance on the available execution places, which is used by ARMS via a novel resource selection algorithm to deliver schedules that achieve good balance between locality and parallelism.
We conduct a thorough analysis of the effectiveness of ARMS against the state-of-the-art schedulers. This study reveals that even where there is a high DAG concurrency, resource aggregation (i.e. with 1:M mapping) is inevitable for memory intensive tasks, and in events of changing DAG parallelism. It also demonstrates that for tasks with high-compute intensity, locality maximizing schedules should be avoided.
The rest of the paper is organized as follows: Section 2 introduces the scope and some useful scheduling terminology used in this paper. Section 3 describes the scheduling algorithm and the components used to develop the dynamic locality scheduling decisions. In Section 4, we describe the evaluated applications and the experimental methodology. Section 5 evaluates the scheduler with respect to locality-aware and traditional techniques. Section 6 highlights the related work in the field of locality-aware work-stealing runtimes. Finally, Section 7 concludes the paper.
Figure 2 shows the impact of data locality and task work size on the core performance using a compute-intensive chain of N-Body tasks running on a dual-socket 16-core Intel Skylake (described in Section 4). The study compares single-threaded (non-moldable) to dual-threaded tasks (a molding option). The experimental DAG chain is shown in Figure 1, where the dotted line represents both an iteration and an output dependency from Task B back to Task A. Caches are flushed after each iteration to analyze the effect of streaming data across local and remote NUMA domains, so cache reuse happens only within an iteration. The chain is terminated after 1000 iterations to obtain a sustained flops rate. Performance is expressed using the core flops (MFLOP/s). The average is taken across multiple runs to give statistical confidence in reproducibility. The shaded regions in Figure 2 depict the different levels of the caches where the total working set fits. Listing 1 briefly highlights the benchmark evaluated herein. It shows a direct N-Body computation using 2 threads. The output of task A is input to task B and so on. The start and end indices mark the partition that the task works on in the case of moldable execution. The task’s data pointer (i.e. pos_target) is pinned to either the local or the remote NUMA node. The computation and data locations are specified in Table 1. The “not molded” columns show the configurations used for Figure 2(a), whereas the “molded” columns show the configurations used for Figure 2(b). So for the chain of executions of task A followed by task B, we specify the computation thread id and the NUMA id of the cell data (the N-Body domains are usually structured as trees). In this case, the cell data is pos_target as per Listing 1. For example, in the “Remote access” scenario, the pos_target output of task B located on NUMA node 1 would be the pos_source input to task A, which executes on thread 0 residing on the remote NUMA node 0.
|Scenario||Parameter||Task A (not molded)||Task B (not molded)||Task A (molded)||Task B (molded)|
|Local access||Execution thread id(s)||0||0||0,1||0,1|
|Local access||NUMA node id for data||0||0||0||0|
|Remote access||Execution thread id(s)||0||16||0,1||16,17|
|Remote access||NUMA node id for data||0||1||0||1|
We make two observations here. First, if we consider the “molded” case shown in Figure 2(b), NUMA-aware computation (i.e. Local access) for this benchmark is beneficial only for the finest grain case (input size = 1k). Second, when the task is not molded (as shown in Figure 2(a)), there is no gain in preserving NUMA locality. Interestingly, in the absence of moldability, the most locality preserving approach, that is, running dependent tasks on thread 0 and allocating on the local NUMA node, does not perform well on average, as shown by the blue bars in Figure 2(a). This is the case when computation is done next to the data in physical memory using a single thread to maximize L1 cache reuse. So, in practice such a choice incurs an evident performance degradation in certain cases. In this case, “remote” wins because of the interleaved access to multiple memory channels (0,1) from tasks A and B when data is distributed over 2 memory domains. As these observations vary based on the application, they make dynamic task scheduling decisions even more desirable, since the best mapping depends on the underlying platform, class of computation, and granularity of task among other attributes. The main hypothesis behind ARMS is that the impact of scheduling decisions can be estimated via an online performance model that is aware of the task type and of topology information that encodes the data location.
2. Scope and Background
This work targets a general form of DAGs with iterative or recursive structure, where nodes represent tasks and edges are either execution or data dependencies, as shown by the synthetic example in Figure 3. The black edges refer to direct data dependencies (i.e. there is an edge from a producer task to a consumer task if the consumer directly reuses output data from the producer). The green edges refer to execution dependencies (i.e. there is an edge from one task to another if the latter cannot start unless the former has completed execution). For example, a task-based version of post-order recursive traversal would not visit a node unless all descendants have been traversed. The dotted line indicates an iteration, which is a concatenation of the DAG to itself for a given number of iterations. This edge does not indicate a cycle, as the execution terminates after the program iterations have elapsed. The gradient change in red color indicates a different task type. A task type in the context of this work represents the task’s work function. Figure 4 depicts the possible resource partitions, each denoted by its leader, which is the logical thread with the smallest id in the partition, and its width, which is the number of workers (i.e. logical threads) involved. A task can be mapped to a work-sharing region over such a partition if it is moldable.
3. Adaptive Resource-Moldable Scheduler (ARMS)
In a parallel system, each invocation of ARMS on a task should return one of the available resource partitions, identified by its leader and width. This section describes the different components of ARMS. In Section 3.1, we define the concept of STA (Software Topology Address) and show how to construct STAs in order to differentiate between the performance models based on locality. Then in Section 3.2, we show how we build the moldable resource partitions that are to be analyzed by the model. Section 3.3 highlights the resource selection algorithm. These components are necessary to determine what constitutes an efficient resource partition to schedule a specific task based on an online performance model.
3.1. The Software Topology Address Construction
Among the common techniques in constructing logical representations of spatial domains is using graph, geometric or algebraic forms. For example, continuous air particles, once discretized, can be expressed as mesh topology points accessed by their Cartesian coordinates. This matrix-free representation is useful for carrying out stencil updates. Since the topology is a static description of the domain, we propose to use it to create a portable initial mapping of the location of task’s data to a physical location in hardware (i.e. the initial thread). The portable numerical identifier of the logical location of task’s data is called the Software Topology Address (STA).
In the corresponding illustration of this mapping, each color represents a location (e.g. physical core). The advantage of this initial mapping is that it attempts to preserve inter- and intra-task data reuse, as the tasks that share a coarse key or close enough keys should be mapped to the same or nearby hardware locations. For algebraic representations such as matrices that arise from the discretization of Partial Differential Equations (PDEs), we use the indices of the corresponding matrix blocks to create an address. The locality notion in these can be trickier, as the indices of the matrix elements do not directly express topology. However, tasks that operate on the same matrix block are still guaranteed to share the same mapping. In the absence of topology, the STA is assigned to the nodes based on their relative location in the DAG (i.e. the depth of the node and its location in the breadth). This is because nodes that are close in the DAG are much more likely to have data reuse. In this case, however, the DAG should exist a priori so that the STAs can be assigned automatically based on the depth and breadth of the nodes in the final DAG structure. There is no such restriction for physical domains with inherent topology structure, as the STA assignment in such domains is independent of the DAG structure, so dependencies can be inserted at execution time. Equations 1 to 4 show the formulas used to assign an initial location to a task based on its topology information, which is done in 4 stages.
Stage 1: Configuring the granularity of the STA key: since the STA is used later in this paper to index the performance model, control over how many models are created needs to be available. Hence, to reduce the overhead of creating many models, the maximum number of bits used to express the STA is a tunable parameter. Based on Equation 1, the number of STA keys is allowed to be a multiple of the number of fine-grain resource partitions (e.g. cores in the simple case). Sensitivity analysis of the granularity of STA creation is left to a future study. Note that a “worker” refers to a logical thread.
Stage 2: Obtaining the key from topology: in this stage, depicted by Equation 2, we retrieve the space filling order as an integer identifier for the model using the logical location of the task’s data (e.g. Cartesian coordinates) or the location of the node in the DAG if the former does not exist.
Stage 3: Obtaining the relative hardware location of the key: as shown in Equation 3, this is obtained by dividing the STA by the maximum integer that can be represented using the defined granularity in Equation 1.
Stage 4: Mapping the key to initial physical location: in the last step (Equation 4), we map the relative location to a worker id, which falls in a moldable resource partition as shown in Figure 4. The logical worker id maps to a physical location as we explain in Section 3.2.
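The four stages above can be sketched as follows. This is a minimal illustration under stated assumptions, not the formulas of Equations 1 to 4: the use of a Morton (Z-order) curve as the space filling order, the helper names (`morton2d`, `make_sta`, `relative_location`, `initial_worker`), and the choice to divide by the number of representable keys (so that the relative location stays strictly below 1) are all hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: interleave the bits of x and y into a Morton (Z-order)
// key. The text only says "space filling order"; Morton order is an
// assumption made for this sketch.
inline std::uint32_t morton2d(std::uint16_t x, std::uint16_t y) {
    std::uint32_t key = 0;
    for (int bit = 0; bit < 16; ++bit) {
        key |= static_cast<std::uint32_t>((x >> bit) & 1u) << (2 * bit);
        key |= static_cast<std::uint32_t>((y >> bit) & 1u) << (2 * bit + 1);
    }
    return key;
}

// Stages 1 and 2 (assumed form): the STA uses at most `sta_bits` bits. The
// key is the space filling order with its low-order bits dropped, so that
// nearby points share a coarse key.
inline std::uint32_t make_sta(std::uint16_t x, std::uint16_t y,
                              int coord_bits, int sta_bits) {
    assert(coord_bits >= 1 && coord_bits <= 16);
    assert(sta_bits >= 1 && sta_bits <= 2 * coord_bits && sta_bits < 32);
    std::uint32_t full = morton2d(x, y);         // uses 2 * coord_bits bits
    return full >> (2 * coord_bits - sta_bits);  // keep the top sta_bits bits
}

// Stage 3: relative location in [0, 1), here the STA divided by the number
// of representable keys (a choice of this sketch).
inline double relative_location(std::uint32_t sta, int sta_bits) {
    return static_cast<double>(sta) /
           static_cast<double>(std::uint64_t{1} << sta_bits);
}

// Stage 4: scale the relative location to a logical worker id.
inline int initial_worker(std::uint32_t sta, int sta_bits, int num_workers) {
    assert(num_workers > 0);
    return static_cast<int>(relative_location(sta, sta_bits) * num_workers);
}
```

With a 16x16 grid (`coord_bits = 4`), a 4-bit STA and 8 workers, the neighboring points (0,0) and (1,0) share the same key and therefore the same initial worker, which is the reuse-preserving property the mapping aims for.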
|Line||Content||Description|
|1||0,2,4,8,1,3,5,7||Thread affinity mappings|
|2||1,2,4||Widths for Leader thread 0|
|3||1||Widths for Leader thread 1|
|4||1,2||Widths for Leader thread 2|
|5||1||Widths for Leader thread 3|
3.2. The Moldable Resource Partitioning
In the case of single-threaded execution, the partition id calculated by Equation 4 is the logical id of the smallest unit of execution exposed by the runtime environment, which is typically a hardware thread. However, to unleash the aforementioned potential of moldability, we express the system as a set of execution places. The system in Figure 4 shows an example hardware configured with 2 places of width 4, 4 places of width 2 and 8 places of width 1. A place encompasses the cores that share a resource (e.g., cache level, memory subsystem, network element). Thus, the relative location represented by the line segments below Figure 4 and indicated by Equation 3 points to a moldable resource partition (initially, the partition of the smallest width is selected). To allow the flexible allocation of resources, the scheduler optionally accepts a layout description file at the initialization phase. For example, the dual-socket 4-core system shown in Figure 4 can be described using the file whose content is highlighted in Table 2. Line 1 of Table 2 represents the thread affinities (hardware thread ids) for each of the logical threads (workers). So in this case, the runtime is configured with 8 workers, which are mapped to the listed hardware thread ids. For example, worker 0 is mapped to thread id 0, worker 1 is mapped to 2 and so on. In the following 8 lines, the supported resource widths for each of the workers are specified. Hence, for leader worker 0, the supported widths are 1, 2 (spanning workers 0 and 1), and 4 (spanning workers 0 - 3). This flexible specification can easily support heterogeneous architectures. However, the evaluation of ARMS on such platforms is currently work in progress. Note that there exist tools such as hwloc (hwloc) that can discover the system layout, and thus can be used here to generate the layout description.
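To make the layout description concrete, the sketch below parses a description with the shape of Table 2 into an in-memory structure. The exact file format is an assumption (one comma-separated line of thread affinities, followed by one comma-separated line of widths per leader worker), and the `Layout` type and function names are hypothetical rather than part of the runtime.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical in-memory form of a layout description such as Table 2.
struct Layout {
    std::vector<int> affinity;             // worker id -> hardware thread id
    std::vector<std::vector<int>> widths;  // leader worker id -> supported widths
};

// Split one comma-separated line of integers.
inline std::vector<int> parse_int_list(const std::string& line) {
    std::vector<int> values;
    std::stringstream ss(line);
    std::string item;
    while (std::getline(ss, item, ',')) {
        values.push_back(std::stoi(item));
    }
    return values;
}

// Line 1 holds the thread affinities; each following non-empty line holds the
// widths supported by the next leader worker (assumed format).
inline Layout parse_layout(const std::string& text) {
    Layout layout;
    std::stringstream ss(text);
    std::string line;
    bool first = true;
    while (std::getline(ss, line)) {
        if (line.empty()) continue;
        if (first) {
            layout.affinity = parse_int_list(line);
            first = false;
        } else {
            layout.widths.push_back(parse_int_list(line));
        }
    }
    assert(layout.widths.size() <= layout.affinity.size());
    return layout;
}

// Count the execution places of a given width across all leaders.
inline int count_places(const Layout& layout, int width) {
    int count = 0;
    for (const auto& leader_widths : layout.widths)
        for (int w : leader_widths)
            if (w == width) ++count;
    return count;
}
```

For an 8-worker description whose leaders 0 and 4 support widths 1, 2 and 4, and whose leaders 2 and 6 support widths 1 and 2, `count_places` reports 8 places of width 1, 4 of width 2 and 2 of width 4.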
3.2.1. Moldable work-stealing
Since ARMS is based on work-stealing, we present Figure 6 to show the interaction between work-stealing and work-sharing queues for moldable task execution. Each worker thread has a work-stealing and a work-sharing queue. The work-stealing queue includes the tasks that are either initially assigned to the thread (i.e. the STA mapped thread) or stolen to achieve a load-balanced execution (work-balancing modes are covered in Sections 3.3.1 and 3.3.2). In Figure 6, T3 reaches the front of the queue. At this stage, the width is decided using the ARMS performance model, then the task is inserted into the work-sharing queues of the execution place. The queues are based on a lock-free implementation adopted from (lock-free-Sutter). So T3 is assigned to a partition of width 3, which yields the task partitions (T3-0, T3-1, T3-2). These are executed asynchronously on workers (3,4,5). The task work function is responsible for implementing the internal scheduling scheme (similar to the OpenMP parallel regions). Hence, each task partition receives a partition id in the range [0,width-1], and has access to the resource width. In this paper, we adopt a static scheduling scheme to preserve thread-level data reuse across dependent tasks. This is semantically similar to OpenMP’s static schedule for loop-parallel codes.
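The static scheme inside a moldable task can be illustrated with a small helper that turns a partition id and a width into a contiguous index range. This is a sketch only; the function name `static_range` and the policy of giving the remainder to the first partitions are assumptions, not details taken from the runtime.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>

// Half-open index range [first, second) owned by one partition of a moldable
// task that iterates over n elements with the given resource width.
inline std::pair<std::size_t, std::size_t>
static_range(std::size_t n, int partition_id, int width) {
    assert(width > 0 && partition_id >= 0 && partition_id < width);
    const std::size_t w = static_cast<std::size_t>(width);
    const std::size_t p = static_cast<std::size_t>(partition_id);
    const std::size_t base = n / w;  // elements every partition receives
    const std::size_t rem = n % w;   // leftover elements
    // The first `rem` partitions take one extra element each.
    const std::size_t start = p * base + (p < rem ? p : rem);
    const std::size_t len = base + (p < rem ? 1 : 0);
    return {start, start + len};
}
```

Because the range depends only on the partition id and the width, a dependent task molded with the same width touches the same indices on the same worker, which is what preserves thread-level reuse.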
3.3. The Online Performance Model
Identifying the model: A key contribution of this work is defining the parameters that constitute a locality-adaptive performance model for a task. Without the STA as a key (Figures 5(a) and 5(b)), the interpretation of a model would neglect the effect of data locations. Since the STA implies the mapping of a task’s data to a hardware location, it facilitates creating a performance model per locality for a specific task. This is to study the effects of preserving data locality for that task. Once the data is initialized (e.g. using the NUMA first touch policy or hwloc-lib’s explicit memory pinning) in a resource partition, the model studies the effect of molding the task and its dependencies that have the same work function. In our implementation, we express tasks as object-oriented classes for easier C++ type resolution at runtime using typeid (e.g. via std::type_index). Alternatively, a distinct integer for each task’s work function can be used for non-object-oriented specifications. Adding the STA makes it possible to model the performance per locality on the available system partitions. This manifests as a 2D array structure (model[type_index][sta]). A reference to the performance table is assigned to each task at the initialization phase. For each model (Figure 5(c)), the following schemes are studied:
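A table indexed as model[type_index][sta] could be organized as in the sketch below. The class `PerfModel`, the `Partition` and `TimeTable` aliases, and the two task types are hypothetical names introduced for illustration; only `typeid` and `std::type_index` are standard C++ facilities.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <typeindex>
#include <typeinfo>
#include <unordered_map>
#include <utility>

// Hypothetical task types; each class stands for one task work function.
struct MatMulTask {};
struct CopyTask {};

// (leader, width) identifies a resource partition; the timetable stores the
// last observed execution time in seconds for each partition.
using Partition = std::pair<int, int>;
using TimeTable = std::map<Partition, double>;

class PerfModel {
public:
    // Return the timetable for a task type and STA, creating it on first use.
    template <typename Task>
    TimeTable& table(std::uint32_t sta) {
        return tables_[std::type_index(typeid(Task))][sta];
    }

    // Record a new measurement for one partition (history-based update).
    template <typename Task>
    void record(std::uint32_t sta, Partition partition, double seconds) {
        assert(seconds >= 0.0);
        table<Task>(sta)[partition] = seconds;
    }

    // Number of distinct task types seen so far.
    std::size_t num_types() const { return tables_.size(); }

private:
    std::unordered_map<std::type_index,
                       std::unordered_map<std::uint32_t, TimeTable>> tables_;
};
```

Keying on both the type and the STA keeps measurements of the same work function at different data locations separate, so a remote placement does not overwrite the timing of a local one.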
The locality scheme: for the thread’s local tasks, schedule within the set of inclusive partitions of the initial thread, i.e., all the partitions that share a resource with the initial task’s thread as depicted by the layout description. For locality sensitive tasks, a cost-efficient place can belong to this set.
The work-balancing scheme: used if the thread’s local queue is empty.
The used performance modeling scheme:
We adopt an online history-based scheme similar to that used by the StarPU (starpu-model-augunnet-europar-2009) task programming library. We extend it to model the performance on execution places. This renders it powerful in accurately predicting the performance of fine-grain tasks, which are typical since one of the important purposes of fine-grain task-parallelism is maximizing concurrency by creating many tasks (Task Count ≫ Execution Places). Therefore, the history-based scheme proves to be effective, as we show in Section 5. However, the implementation of the performance model is decoupled from the scheduler. Thus, models such as regression-based or analytical models can be seamlessly used in ARMS.
We start by training the model on the available places, as the number of tasks that share a model is much larger than the number of available places (a simple recursive task parallel code like SparseLU on 64x64 blocks results in 12k tasks, far more than the available partitions). The online history-based scheme assumes that performance is insensitive to the DAG iteration number, so a previous iteration’s value can be used to predict the current iteration. However, dynamic changes in performance can be detected, as the timing values are continuously updated for the selected execution places, as we show in Section 3.3.1.
Eventually, for the model to result in an efficient mapping (Figure 5(d)) the locality and the work-balance schemes are used as shown in Sections 3.3.1 and 3.3.2.
On the runtime overhead of the training scheme: The initial STA mapping determines the distribution of the tasks over the worker threads. This offers a way to balance the work since the STA identifiers are spread over the input domain. For example, the space-filling order mapping is traditionally used to evenly partition domains with known input coordinates. An even distribution is also achieved when STAs are automatically assigned by the runtime. The STAs are then normalized to the available threads as we show in Section 3.1. This reduces the penalty of under-utilization due to imbalanced partitioning.
In the case of the history-based model adopted here, the timetable is greedily filled in increasing order of the resource width (i.e. starting the execution from the smallest width). The locality scheme highlighted in Section 3.3.1 guarantees that the initial thread is always included in the subsequent schedules of the task, which ensures producer-consumer data reuse between dependent tasks. We note that, since we use an online approach, we do not separate the training phase from the actual execution of the application; all work contributes to training, which avoids the overhead imposed by offline schemes. Based on our profiles, the impact of timetable filling and of the sub-optimal resource width choices is negligible, as we start from the assumption that Task Count ≫ Execution Places, which is valid across all experiments in Section 5. Nonetheless, if this assumption does not hold (e.g. in the case of relatively many execution places), the layout description (e.g. Table 2) can be adjusted such that the requested partitions do not span beyond the resources needed by the task with the maximum working set.
3.3.1. The locality scheme
The purpose of this scheme is to adjust the resource width within the inclusive sharing partitions. As mentioned before, a moldable task involves a work-sharing region that executes on a resource partition comprising one or multiple workers executing asynchronously. Table 3 shows the local partitions spanning threads 0 - 3 of the system shown in Figure 4. This means that logical thread 0 is included in the local resource partitions (leader 0, width 1), (leader 0, width 2), and (leader 0, width 4). When a task is initialized in a given thread (as per the STA), the modeled performance (i.e. execution time herein) of the inclusive partitions (e.g. Table 3) is fetched. Table 3 thus serves as a lookup table for the indices of the local partitions of each thread. It is initialized at startup given a layout description file such as Table 2. Hence, to predict the effect of moldability (i.e. changing the resource width), the partition that minimizes the cost function is selected for scheduling. This cost is the CPU time perceived by the leader thread multiplied by the resource width. Therefore, the higher the parallel cost, the lower the selected width and vice versa. After execution, the model is updated with the newly measured cost. Eventually, for latency-bound workloads, the model picks the width that matches the task’s working set size and minimizes resource over-subscription. Also, in the event of lower DAG parallelism (at coarsening phases), the parallel cost will be lower, since more workers are available to execute the task. This results in the task being dynamically mapped to a local partition of a width that increases utilization. Algorithm 1 highlights the locality scheme of ARMS.
|Init thread||Local inclusive partitions|
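The cost rule of the locality scheme, time observed by the leader multiplied by the resource width, can be sketched as below. This is an illustration, not Algorithm 1: the `PartitionTime` record, the convention that a non-positive time marks an unmeasured entry, and the tie-breaking towards the first listed partition are assumptions.

```cpp
#include <cassert>
#include <limits>
#include <vector>

// One entry of a task's timetable for an inclusive partition.
struct PartitionTime {
    int leader;
    int width;
    double seconds;  // last time seen by the leader; <= 0 means not yet measured
};

// Select among the inclusive partitions of the initial thread. Unmeasured
// entries are tried first, in the order given (assumed to be increasing
// width); otherwise the entry with the smallest time * width is returned.
inline int select_partition(const std::vector<PartitionTime>& inclusive) {
    assert(!inclusive.empty());
    int best = -1;
    double best_cost = std::numeric_limits<double>::infinity();
    for (int i = 0; i < static_cast<int>(inclusive.size()); ++i) {
        const PartitionTime& p = inclusive[i];
        if (p.seconds <= 0.0) return i;            // still training this place
        const double cost = p.seconds * p.width;   // parallel cost
        if (cost < best_cost) {
            best_cost = cost;
            best = i;
        }
    }
    return best;
}
```

For example, with times of 8.0, 3.0 and 2.5 seconds at widths 1, 2 and 4, the parallel costs are 8, 6 and 10, so width 2 is selected even though width 4 has the shortest execution time.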
3.3.2. The work-balancing scheme
Work-imbalance in dynamically scheduled DAGs is a well-studied problem, and it is especially considered in task-based runtime systems. One of the known solutions is treating the side effect of imbalance (i.e. idleness) using distributed work-stealing. In this approach, each worker thread has its own local queue that can be stolen from in the event of idleness, based on non-deterministic/random decisions (opt-dist-work-stealing-kumar-iaaa16; dist-work-stealing-paudel-icpp13) or deterministic decisions (shumpei-sc19). ARMS uses distributed work-stealing queues. However, the stealing decisions are not entirely random. ARMS follows a heuristic method to try to maximize locality, based on the steps below:
Local work-stealing (Algorithm 1): this is done by checking the queues of the threads of the inclusive partitions. For example, if a thread is idle, its inclusive threads are those that share a partition with it in Table 3. The queues of these candidates are checked in round-robin fashion (this step is dropped from Algorithm 1 for brevity). When a task is found, a cost analysis similar to that of the locality scheme is applied.
Non-local work-stealing (Algorithm 1): if no work is found in the previous step, a random non-local task is fetched. The resource partition that globally minimizes the cost is then looked up. The stealing thread checks whether it falls in the partition of the global minimum; otherwise, the stealing attempt is rejected. When the number of attempts reaches a certain threshold, the request is fulfilled anyway.
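The acceptance test for a non-local steal can be sketched as a single predicate. The function name, the parameter names and the assumption that a partition covers the contiguous workers from its leader up to leader + width - 1 are hypothetical; the threshold stands for the attempt limit mentioned above, whose actual value is not specified here.

```cpp
#include <cassert>

// Decide whether an idle worker may run a task taken from a non-local queue.
// (best_leader, best_width) is the partition that globally minimizes the
// modeled cost for that task.
inline bool accept_nonlocal_steal(int thief, int best_leader, int best_width,
                                  int failed_attempts, int threshold) {
    assert(best_width > 0 && threshold >= 0);
    // The thief qualifies if it is one of the workers of the best partition.
    const bool in_best_partition =
        thief >= best_leader && thief < best_leader + best_width;
    // After enough rejected attempts the request is granted regardless.
    return in_best_partition || failed_attempts >= threshold;
}
```

The fallback after repeated rejections trades locality for utilization: a worker outside the best partition eventually gets the task rather than idling indefinitely.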
|Intel Skylake (intel-skylake-specs)||2||16||1||48(16)||22(16)||1024(1)||32(1)|
This section describes the experimental methodology used to evaluate the contributions of this work. ARMS is integrated into Runtime_X (xitao), a DAG runtime system implemented on top of modern C++ threading extensions (C++11). Runtime_X is designed to flexibly evaluate scheduling policies and already features moldable tasks. However, ARMS is decoupled from the runtime internals.
Experiments are performed on an Intel Skylake (Intel Xeon Gold 6130), which is a modern multicore architecture with the memory hierarchy described in Table 4. The nodes are hosted at Tetralith (NSC’s largest HPC cluster), which consists of 1908 compute nodes, each with a dual-socket 16-core Intel Skylake CPU, giving a total of 61056 CPU cores. Runtime_X is configured to run with 32 worker threads.
The layout description file: worker threads are mapped to the physical cores of the platform, totaling 32 cores. The configured resource widths are 1, 2, 4 and 16. This means that we do not map a task across the 2 sockets. Note that there is no restriction on mapping a task to two sockets. However, based on our traces, the scheduler directly identifies this as a suboptimal choice due to the cost of NUMA misses resulting from accesses by remote workers, which is reflected as a high modeled parallel cost. From a memory perspective, a task can leverage (1, 2, 4, 16) x L1/L2 caches, and the full L3 cache with width 16.
4.2. Baseline Schedulers
Below we describe and justify the baseline schedulers used:
Random Work-Stealing Scheduler (RWS): it is based on distributed work-stealing, where each worker greedily tries to reduce idleness by fetching work from victim queues. It has been formally introduced in (blumofe-jacm99) and has appeared in several parallel task-based libraries such as Cilk (blumofe-cilk), and Intel TBB (kukanov-itj11).
Almost Deterministic Work Stealing (ADWS): this scheduler is introduced in (ShiinaSC19_ADWS). It is currently the state of the art in work-balanced locality-aware scheduling, developed as an extension to the scheduler of the MassiveThreads library (Nakashima2014_Massive_threads). It follows a deterministic task allocation based on programmer workload hints, where the work is recursively split across the DAG nodes (spawned and continuation tasks). Each worker maintains a local and a migration queue, and work stealing is only allowed inside the “work groups”. A port of ADWS has been implemented in Runtime_X.
ARMS-1: We also evaluate the proposed scheduler using 1:1 task mapping. Partition widths are persistently set to 1. Tasks are initialized and executed in the NUMA node mapped by their STA. Local stealing is preferred, then global stealing requests are fulfilled when the stealing thread reaches idleness threshold and the steal that reduces the cost function (as per the model) is chosen.
ARMS-M: The proposed scheduler with all components from Section 3, using 1:M task mapping. The scheduler is configured with the parameters shown in Table 5:
|Parameter|Value|Description|
|---|---|---|
|sta|enabled|capture/assign sta values|
|perf-model|enabled|build online perf model|
|idle-tries|10|idle iterations before stealing|
|local-steal|enabled|use local balancing scheme|
|global-steal|enabled|use non-local balancing scheme|
|moldability|enabled|map a task to M workers|
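The local-first stealing policy with an idleness threshold can be sketched as follows. This is a minimal illustration, not the Runtime_X implementation: the queue structure, the function names and the `cost` callback (a stand-in for the modeled cost of running a task on the thief) are all hypothetical, and only the idle threshold of 10 and the 16 workers per NUMA node come from the text.

```python
import random
from collections import deque

IDLE_TRIES = 10       # idle iterations before a global steal (idle-tries parameter)
CORES_PER_NODE = 16   # workers per NUMA node on the test platform


def numa_node(worker):
    """NUMA node that hosts a given worker id."""
    return worker // CORES_PER_NODE


def try_steal(thief, queues, idle_count, cost, rng=random):
    """Return (task, victim), or (None, None) if nothing is stolen.

    queues: list of deques, one per worker (hypothetical structure).
    cost(task, thief): stand-in for the modeled cost of running task on thief.
    """
    local = [v for v in range(len(queues))
             if v != thief and numa_node(v) == numa_node(thief) and queues[v]]
    if local:
        # Local stealing is preferred: pick a victim on the thief's NUMA node.
        victim = rng.choice(local)
        return queues[victim].popleft(), victim
    if idle_count < IDLE_TRIES:
        # Not idle for long enough yet, so no global steal is attempted.
        return None, None
    remote = [v for v in range(len(queues))
              if numa_node(v) != numa_node(thief) and queues[v]]
    if not remote:
        return None, None
    # Global steal: choose the candidate task with the lowest modeled cost.
    victim = min(remote, key=lambda v: cost(queues[v][0], thief))
    return queues[victim].popleft(), victim
```

In this sketch a worker only crosses the NUMA boundary after ten unsuccessful idle iterations, and even then it picks the remote task that the cost callback rates as cheapest.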
4.3. Synthetic Benchmark
To validate the adaptive online performance model adopted by ARMS, we use a synthetic DAG benchmark as shown by Figure 7. The degree of DAG parallelism as well as the depth of the DAG are control parameters. This is useful to fix the number of tasks and understand the trade-off between parallelism and locality. Also, the tasks can be configured to be either matrix multiplication (MatMul) or Stream Triad tasks. For each chain, the table underneath Figure 7 shows the relative STA location of the thread obtained from Eq 3. The initial thread is acquired using Eq 4 with the 32 worker threads available in the experimental platform.
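To make the role of the two control parameters concrete, the sketch below builds a DAG of independent chains for a fixed task count, so that raising the parallelism shortens the depth. The chain-only shape, the function name and the task numbering are assumptions for illustration; the actual structure is the one shown in Figure 7, and the STA computations of Eq 3 and Eq 4 are not reproduced here.

```python
def build_chain_dag(total_tasks, parallelism):
    """Build `parallelism` independent chains with a fixed total task count.

    Returns (deps, depth), where deps maps each task id to the list of its
    predecessor ids and depth is the length of every chain.
    """
    if total_tasks % parallelism:
        raise ValueError("parallelism must divide total_tasks in this sketch")
    depth = total_tasks // parallelism
    deps = {}
    for chain in range(parallelism):
        for level in range(depth):
            task = chain * depth + level
            # The head of a chain has no predecessor; every other task
            # depends on the previous task of the same chain.
            deps[task] = [] if level == 0 else [task - 1]
    return deps, depth


for p in (2, 8, 32):
    deps, depth = build_chain_dag(64, p)
    print(f"parallelism {p:2d}: depth {depth}, tasks {len(deps)}")
```

With the task count held constant, the only thing that changes between runs is how much work is available concurrently, which is the quantity the experiments vary.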
Iterative DAG - HEAT: we leverage a DAG implementation to compute heat diffusion on a 2D grid. One of the iterative numerical methods to achieve this is the 2D Jacobi stencil. We use a 5-point stencil and create dependencies between the neighbor nodes (see Figure 8(a)). The approach involves computing the stencil in a compute task, and copying out the update in a copy task. The DAG is iteratively executed for a fixed number of 2k iterations. For STA specification, we use the coordinates of the block of mesh points involved in a task.
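The compute/copy task pair can be sketched as below. This is a serial stand-in under stated assumptions: the diffusion coefficient `ALPHA`, the block size and the function names are hypothetical, and the two loops over blocks mark where a task-based runtime would spawn tasks with neighbor dependencies.

```python
ALPHA = 0.2  # assumed diffusion coefficient (the explicit scheme needs ALPHA <= 0.25)


def stencil_task(src, dst, rows, cols):
    """Compute task: 5-point stencil update of one block of interior points."""
    for i in rows:
        for j in cols:
            dst[i][j] = src[i][j] + ALPHA * (
                src[i - 1][j] + src[i + 1][j] + src[i][j - 1] + src[i][j + 1]
                - 4.0 * src[i][j])


def copy_task(src, dst, rows, cols):
    """Copy task: publish the block's update for the next iteration."""
    for i in rows:
        for j in cols:
            dst[i][j] = src[i][j]


def heat(grid, iterations, block=2):
    """Run the stencil for a fixed number of iterations; boundaries stay fixed."""
    n = len(grid)
    new = [row[:] for row in grid]
    # Blocks of interior points; the block origin (bi, bj) could serve as the STA.
    blocks = [(range(bi, min(bi + block, n - 1)), range(bj, min(bj + block, n - 1)))
              for bi in range(1, n - 1, block)
              for bj in range(1, n - 1, block)]
    for _ in range(iterations):
        for rows, cols in blocks:      # one compute task per block
            stencil_task(grid, new, rows, cols)
        for rows, cols in blocks:      # one copy task per block
            copy_task(new, grid, rows, cols)
    return grid
```

Each compute task reads one halo row or column from its four neighboring blocks, which is the source of the neighbor dependencies in the DAG.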
Recursive DAG - SparseLU: we port a SparseLU benchmark from the Barcelona OpenMP Tasks Suite (duran-icpp09). This benchmark computes an LU matrix factorization over sparse matrices. The matrix is composed of NxN blocks, each of which has a pointer to a sub-matrix of size MxM. Load imbalance is evident due to the sparsity of the matrix. For each phase of the LU algorithm, a task is spawned for each non-empty block (see Figure 8(b)). For STA specification, the matrix block indices are used.
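The way task creation follows the sparsity pattern can be sketched by listing the tasks each phase would spawn. The kernel names (`lu0`, `fwd`, `bdiv`, `bmod`) follow the usual phase names of the BOTS SparseLU code as recalled here and should be treated as an assumption, as should the set-based representation of non-empty blocks; no numerical work is performed.

```python
def sparselu_tasks(nonempty, n):
    """List (kernel, (i, j)) tasks for an n x n block matrix.

    nonempty: set of (i, j) indices of allocated blocks. Fill-in blocks
    created by the update phase are added to this set in place.
    """
    tasks = []
    for k in range(n):
        tasks.append(("lu0", (k, k)))              # factorize the diagonal block
        for j in range(k + 1, n):
            if (k, j) in nonempty:
                tasks.append(("fwd", (k, j)))      # row panel
        for i in range(k + 1, n):
            if (i, k) in nonempty:
                tasks.append(("bdiv", (i, k)))     # column panel
        for i in range(k + 1, n):
            for j in range(k + 1, n):
                if (i, k) in nonempty and (k, j) in nonempty:
                    nonempty.add((i, j))           # fill-in: block becomes non-empty
                    tasks.append(("bmod", (i, j))) # trailing update
    return tasks


sparse = {(0, 0), (1, 1), (2, 2), (0, 1), (1, 0)}
dense = {(i, j) for i in range(3) for j in range(3)}
print(len(sparselu_tasks(sparse, 3)), len(sparselu_tasks(dense, 3)))
```

The task count per phase depends on which blocks are allocated, which is where the load imbalance mentioned above comes from. The block indices attached to each task are what the text uses as the STA.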
Recursive DAG - The Fast Multipole Method (FMM): we port the fmm-minimal task-based implementation from the exafmm library (exafmm-minimal) to Runtime_X. FMM is a popular solver for the matrix-vector products arising from the solution of certain boundary integral equations (abduljabbar_sisc19_bemfmm). It was originally used to efficiently solve the quadratic N-body problem that arises in particle/gravitational simulations (e.g. in molecular dynamics and astrophysics). This implementation is tree-based (see Figure 8(c)). We set the leaf cell size to 64 particles. The STA in this case leverages the Cartesian coordinates of the underlying tree cell.
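The tree construction with a leaf size of 64 particles can be sketched as follows. This is a simplified 2D quadtree written for illustration, not the exafmm code (which works on 3D cells); the dictionary layout, the function names and the `max_depth` guard are assumptions. The cell center stored in each node is the kind of Cartesian coordinate the STA could be derived from.

```python
NCRIT = 64  # leaf cell size from the text


def build_tree(points, center, radius, ncrit=NCRIT, max_depth=32):
    """Recursively split a square cell until it holds at most ncrit particles."""
    cell = {"center": center, "radius": radius, "points": points, "children": []}
    if len(points) <= ncrit or max_depth == 0:
        return cell  # leaf cell (max_depth guards against coincident points)
    buckets = {}
    for x, y in points:
        quad = (x >= center[0], y >= center[1])  # quadrant that holds the point
        buckets.setdefault(quad, []).append((x, y))
    half = radius / 2
    for (right, up), pts in buckets.items():
        child_center = (center[0] + (half if right else -half),
                        center[1] + (half if up else -half))
        cell["children"].append(
            build_tree(pts, child_center, half, ncrit, max_depth - 1))
    return cell


def leaves(cell):
    """Collect the leaf cells of the tree."""
    if not cell["children"]:
        return [cell]
    return [leaf for child in cell["children"] for leaf in leaves(child)]
```

Only non-empty quadrants become children, so the tree shape follows the particle distribution and can be highly irregular.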
Recursive DAG - MatMul: this is a cache-oblivious divide-and-conquer (cache-oblivious-frigo) implementation of dense matrix multiplication. It lends itself to task parallelism since the recursive subdivisions into smaller blocks can be assigned to tasks with their respective dependencies (see Figure 8(d)). The size of the smallest block is set to 128-256 across all experiments. The STA consists of the block indices at each level of the recursion tree.
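A minimal sketch of the recursive subdivision is given below, assuming square matrices whose size is a power of two. The function signature and the tiny demo leaf size are hypothetical; the serial recursion only marks where tasks and their dependencies would be created.

```python
LEAF = 2  # tiny leaf for the demo; the text uses leaf blocks of 128-256


def matmul_rec(A, B, C, ai, aj, bi, bj, ci, cj, n, leaf=LEAF):
    """C[ci:ci+n, cj:cj+n] += A[ai:ai+n, aj:aj+n] @ B[bi:bi+n, bj:bj+n]."""
    if n <= leaf:
        # Leaf task: plain triple loop on a block that fits in cache.
        for i in range(n):
            for k in range(n):
                a = A[ai + i][aj + k]
                for j in range(n):
                    C[ci + i][cj + j] += a * B[bi + k][bj + j]
        return
    h = n // 2
    for di in (0, h):          # quadrant row of C
        for dj in (0, h):      # quadrant column of C
            # The two products accumulated into the same C quadrant depend on
            # each other; the four C quadrants are independent of one another.
            for dk in (0, h):
                matmul_rec(A, B, C, ai + di, aj + dk, bi + dk, bj + dj,
                           ci + di, cj + dj, h, leaf)


n = 4
A = [[i * n + j + 1 for j in range(n)] for i in range(n)]
B = [[(i + 2 * j) % 5 for j in range(n)] for i in range(n)]
C = [[0] * n for _ in range(n)]
matmul_rec(A, B, C, 0, 0, 0, 0, 0, 0, n)
```

The offsets `(ci, cj)` at each recursion level correspond to the block indices that the text uses as the STA.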
In this section, we evaluate the model’s performance to assess whether it is successful in adapting the granted resources to the task’s and DAG’s requirements. We share the obtained empirical results comparing ARMS to the baseline schedulers. Then, we analyze the achieved performance gains by showing the schedule map for ARMS pertaining to a specific task type and location, and how locality is preserved adaptively. Also, we explain the gains by relating them to the proportion of the cumulative work time to the overall scheduling time.
5.1. The Adaptive Resource-Moldable Selection
We refer back to the motivational example discussed in Section 1.1 and shown in Figure 2, where we see that maximizing locality using a single-threaded task may result in a sub-optimal schedule. Not only is ARMS able to place the dependencies on an efficient target NUMA node, but it is also able to maximize cache utilization through the online tuning of the resource width to match the task’s requirement. Therefore, we evaluate whether the scheduler is able to adaptively tune locality depending on the initial location and size of the task’s data.
In this experiment, we create a chain of task dependencies exhibiting streaming behavior as shown in Figure 7. The DAG is composed of as many chains as there are NUMA domains (i.e. 2 chains in this experiment) and each performance model is referenced by its STA. As a simplification, each model reflects the tasks whose data belong to a distinct NUMA node. Each chain’s head task and data are initially pinned to a NUMA domain. Figure 10 displays the schedule trace for a single chain of 1000 tasks. The three axes show the resource width, the thread, and the frequency of selection. To simplify the thread axis, we label it with the id of the NUMA node that encloses each set of 16 threads in the id range. Also, we color the node where the task is initialized in green.
Memory-Intensive Schedule Map: in Figure 10(a), we expect that the dependent tasks will be scheduled to fit in the private L1 cache of the STA-mapped thread (). Also, since the tasks are latency-bound, we expect the scheduler to preserve locality most of the time. Using Algorithm 1, this is evident in more than of the scheduling decisions. The second to highest frequency is observed for , which is still reasonable as the data size is exactly at the limit of the L1 cache. To test the automatic resource adaptation capability, we increase the problem size such that it does not fit in the private L2 cache (>= 1024 KB), as shown in Figure 10(b). In this case, the scheduler opts for selecting the entire NUMA node (colored in green) to utilize the L3 cache totalling 22 MB (), while maximizing locality at the NUMA level.
Compute-Intensive Schedule Map: we also check how ARMS adjusts the schedule to the compute/resource requirements of a chain of compute-heavy tasks that constitute a single-precision direct N-body code (direct-n-body-Spur99). In the small case shown in Figure 10(c), where tasks exceed the private L1 cache size (2xL1), the scheduler conservatively picks the locality-maximizing places with width (). However, the chain of large tasks (Figure 10(d)) is spread within (green region) and outside (red region) the NUMA node containing the data, and the largest available resource widths () are chosen to maximize floating-point capability.
5.2. Parallelism vs Locality under ARMS
One important property to study with respect to dynamic locality-aware scheduling is how the scheduler balances the trade-off between locality and parallelism, which is a known challenge in scheduling task-based programs. Increasing the number of threads decreases the apparent spatial locality, since access streams from independent cores are interleaved (locality_parallelism_Jeong12). Since ARMS addresses this trade-off by adjusting the task’s resource width, we analyze how it reacts to a wide range of DAG parallelisms spanning 2 to 256. The change in the DAG parallelism is highly evident in divide-and-conquer computations (a special case of fully-strict computations (blumofe-jacm99)), where parallelism gets lower as we go up the recursion tree. The DAG parallelism also changes due to factors such as load imbalance, dependency checking and synchronization.
Impact of Changing the DAG Parallelism: to simplify the analysis, we observe how ARMS behaves with different DAG parallelisms using the synthetic benchmark depicted in Figure 7. We fix the number of tasks to 50k, so that we are able to monitor the changes in performance as a function of the DAG parallelism as shown in Figure 9, with N=128 for the MatMul case and N=512 for the Stream Copy case. Then, we compare to ADWS as a representative of locality-aware schemes. For the challenging case of lower parallelisms (2 - 8), ARMS outperforms ADWS by factors of approximately (3.5, 3, 2.5) in the compute and memory-intensive cases shown in Figures 9(a) and 9(b), respectively. However, the gap closes as the machine's hardware parallelism is reached, with a slight degradation when the parallel slackness exceeds 32. This shows that ARMS behaves consistently across different parallelisms and is able to adapt effectively when preserving locality is not enough to achieve higher throughput. Since scheduling decisions are made on a per-task basis following the depicted online performance model, the aggregate effect on performance of mixing Copy and MatMul tasks combines the trends from the individual cases, as shown in Figure 9(c).
Analyzing ARMS’s Performance Gain: to demonstrate the sources of the gain, Table 6 shows the trace of the resource width choice for a single chain of MatMul tasks (N=128) across different runs that differ by the DAG parallelism. Since ARMS minimizes the parallel cost (), it is able to dynamically aggregate resources (i.e. increase the width) for the task at a lower DAG parallelism. For example, ARMS detects that the parallel cost of using 8 threads is lower when the DAG parallelism is 2 in 99.7% of the cases. Note that the leader thread of this task has an id of 8 as depicted by the STA. A step-wise increase in the task resource width occurs until the DAG parallelism is 32, which matches the machine’s parallelism. Beyond this point, the automatic width choice is 1. In the following sections, we study and analyze the effectiveness of these techniques in the context of different classes of applications.
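The idea of selecting the width that minimizes a parallel cost from online measurements can be sketched as follows. This is an illustration under stated assumptions rather than the model of Section 3: the parallel cost is assumed here to be width times mean execution time, untried widths are explored first, and the class and method names are hypothetical.

```python
from collections import defaultdict

WIDTHS = (1, 2, 4, 16)  # configured resource widths


class WidthModel:
    """Per-STA running average of task time for each resource width."""

    def __init__(self, widths=WIDTHS):
        self.widths = widths
        self.avg = defaultdict(dict)     # sta -> {width: mean time}
        self.count = defaultdict(dict)   # sta -> {width: number of samples}

    def record(self, sta, width, seconds):
        """Fold one measured execution time into the running mean."""
        n = self.count[sta].get(width, 0) + 1
        old = self.avg[sta].get(width, 0.0)
        self.avg[sta][width] = old + (seconds - old) / n
        self.count[sta][width] = n

    def choose(self, sta):
        """Pick the next width to use for tasks with this STA."""
        for w in self.widths:            # explore untried widths first
            if w not in self.avg[sta]:
                return w
        # Assumed parallel cost: width x mean time (core-seconds).
        return min(self.widths, key=lambda w: w * self.avg[sta][w])


model = WidthModel()
for width, seconds in ((1, 8.0), (2, 3.0), (4, 1.0), (16, 0.5)):
    model.record("chain-0", width, seconds)
print(model.choose("chain-0"))
```

Because the measured times already include contention from concurrently running tasks, a model of this kind would favor wide partitions when few tasks are runnable and fall back to width 1 once the DAG parallelism saturates the machine, in line with the trend reported above.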
5.3. Application Performance Evaluation
In this section, we evaluate the performance of ARMS against the baseline schedulers using various application DAGs. The applications and baselines have been described in Section 4. We also assess the impact of moldability on ARMS by studying the two scheduling variants: ARMS-M and ARMS-1.
This evaluation intends to showcase the consistent enhancement that ARMS exhibits across different application classes. The granularity of task creation across all benchmarks has been configured to be in the range of (2 - 4) L1 caches (64Kb - 256Kb) and within a private L2 cache ( 2048Kb). The initial resource width is set to 1 for all tasks (1-to-1 assignment).
2D-Stencil: this DAG consists of a copy task and a 5-point stencil compute task over the grid (Figure 8(a)), iterating 2000 times. Experiments incrementally double the mesh resolution. This application has a clear data reuse pattern across the iterations. However, the data does not fit within a single L1 cache, so moldability yields a consistent improvement of 1.5x - 2x over the best baseline (ADWS), as shown in Figure 11(a). Based on our traces, ARMS molds to 2 cores in more than 90% of the scheduling decisions of the stencil compute tasks. The improvement also corresponds to a reduction of up to an order of magnitude in the application’s L2 misses (Figure 12(a)).
MatMul and SparseLU: in the case of MatMul, the improvement takes place at relatively larger matrix sizes. Since we set the leaf block size to 128, the first two cases result in 3 - 4 subdivisions, which is not enough to train the model. However, ARMS still performs better than RWS and ADWS from a matrix size of 2048 (Figures 11(b) and 12(b)). Similar improvements, shown in Figure 11(d), are achieved in SparseLU with 64x64 blocks.
FMM: last but not least, for a highly irregular DAG structure such as the FMM’s DAG (Figure 8(c)), ARMS-M, ARMS-1 and ADWS behave like locality-aware work-stealing schedulers due to the high parallelism and compute intensity of the FMM kernels, as we can observe in Figure 11(c). Nevertheless, this shows that ARMS makes a best effort at either matching or outperforming the baselines without user-level computational hints or prior workload assumptions.
ARMS is effective in providing better or comparable schedules to the state of the art. Molding a task to multiple threads in a locality-aware manner is especially profitable in applications with uniform data reuse patterns such as Stencil and recursive MatMul (N >= 2048), as demonstrated by the behaviors in Figure 12. However, because it constructs a locality-adaptive performance model, ARMS-M behaves like the non-moldable locality-aware schedulers (ARMS-1 and ADWS) in FMM and SparseLU. ARMS-M achieves this automatically, without changes to the user code or to the scheduling options.
6. Related Work
There has been extensive literature addressing data locality concerns in HPC applications and systems. On the theoretical level, models that predict locality for parallel programs by means of reuse distance analysis have been discussed in (ding-pldi03; ding-mrtr09; cascaval-ics03). Such models aid programmers in understanding the memory access patterns of their codes. Additionally, data-centric programming models (e.g. Legion (legion-bauer-sc-12)) and language extensions (e.g. Hierarchical Place Trees (HPT) (hpt-yonghong-lcpc-2010)) require the domain expert programmer to analyze the data access regions, express the program using their interfaces, and make explicit assignments to execution places. Despite their effectiveness, programmer-level techniques (e.g. loop tiling or hierarchical cache optimizations (Kowarschik2003)) are known to have limited portability across platforms. Hence, high-level abstractions for data locality have received growing attention from the HPC research community (unat-tpds17; bauer-sc12). These abstractions are meant to adopt existing features from parallel execution frameworks, such as the affinity partitioner and task arenas from Intel TBB (Threading Building Blocks), places and processor binding from OpenMP, and topology and communication reducing in MPI (Message Passing Interface) (bosilca-ipdpsw11; abduljabbar_isc17_comm_reducing_hsdx).
As discussed, many task-based parallel libraries embrace random work-stealing scheduling mechanisms, which are provably known to be effective in maximizing dynamic load-balancing. However, they are bound to suffer from scalability issues with memory-bound tasks. To address these, techniques have been proposed to achieve locality-aware work-stealing behavior such as the Almost Deterministic Work Stealing (ADWS) (acct-19-vikranth). This is done via deterministic mapping of tasks to resources that is applied based on the programmer’s annotation of workload size. Also, the paper discusses an approach to achieve hierarchical localized work stealing. Another technique is highlighted in Scalable Locality-aware Adaptive Work-stealing Scheduler (SLAW), which also allocates the resources based on user annotations of locality (a place index) (guo-ipdps10). Last but not least, XKaapi (gautier-ipdps13) is a locality-aware work stealing scheduler that supports heterogeneous data flow programming targeting both CPUs and GPUs.
While these are effective techniques for reducing the side effects of random work-stealing, none of them considers the difficulty of obtaining a static classification of applications into memory-intensive and compute-intensive classes. We think that ARMS is not completely orthogonal to the literature above, as it still maximizes locality when deemed necessary by the online model. However, ARMS is not yet another locality-aware scheduler. Rather, it adopts a dynamic strategy and opts for maximizing locality by understanding the behavior of a task at runtime. We believe that this strategy is more realistic considering the evolving complexity of applications and hardware.
In this paper, we presented a novel scheduling strategy that dynamically tunes its locality-awareness. This is achieved without statically establishing locality-aware decisions irrespective of the underlying task’s arithmetic and memory footprints, a choice that can lead to suboptimal performance. By adopting an online performance model paired with platform-independent topology information, the scheduler is able to maximize locality for the latency-intensive tasks, and to relax it for their compute-intensive counterparts. The resource moldability feature makes it possible to adapt the granted resources to the task’s requirements and to reduce cache misses, and it achieves performance gains on a variety of application DAGs.