Auto Adaptive Irregular OpenMP Loops

OpenMP is a standard for shared-memory parallelization due to the ease of programming parallel-for loops in a fork-join manner. Many shared-memory applications are implemented using this model despite it not being ideal for applications with high load imbalance, such as those that make irregular memory accesses. One parameter, i.e., chunk size, is made available to users in order to mitigate performance loss. However, this parameter depends on architecture, system load, application, and input, making it difficult to tune. We present an OpenMP scheduler that adaptively tunes chunk size for unbalanced applications that make irregular memory accesses. In particular, this method (iCh) uses work-stealing to handle imbalance and adapts chunk size using a force-feedback model that approximates the variance of task lengths in a chunk. This scheduler has low overhead and allows for active load balancing while the applications are running. We demonstrate this using both sparse matrix-vector multiplication (spmv) and Betweenness Centrality (BC) and show that iCh achieves average speedups close (i.e., within 1.061x for spmv and 1.092x for BC) to those of OpenMP loops scheduled with dynamic or work-stealing methods whose chunk sizes were tuned offline.

1 Introduction

The traditional fork-join model of programming has remained popular due to the ease of expressing loops that are rich with parallelism in scientific applications. OpenMP is a popular programming interface because it exposes the fork-join model via pragmas, making the parallelization of loops effortless [7]. However, the interface has limited options for assigning tasks to threads when the tasks vary widely in time and resources. The only option the runtime user is given is a shared static chunk size, which defines the number of tasks each thread should process before requesting more from a centralized queue. To combat the limitations of this scheduling method, many modern programs are being converted to use dynamic task queues in a tasking model, and additional support for this model has been added to OpenMP [18, 17]. However, dynamic task queues come with a certain amount of overhead and are not as well supported by legacy applications.
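For concreteness, a minimal sketch in C of the interface in question. The loop body, function name, and the chunk size of 64 are illustrative choices of ours, not recommendations:

    #include <omp.h>

    /* A minimal fork-join parallel-for: the only scheduling knob OpenMP
       exposes here is the shared chunk size (64 below, chosen arbitrarily).
       Each thread takes 64 iterations at a time from a centralized queue. */
    void scale(const double *x, double *y, int n) {
        #pragma omp parallel for schedule(dynamic, 64)
        for (int i = 0; i < n; i++)
            y[i] = 2.0 * x[i];   /* one loop iteration = one task */
    }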

In this work, we provide an OpenMP parallel-for schedule with an independent, auto-tuned chunk size per thread that works with work-stealing, allowing users to continue using the favored fork-join model on applications whose task lengths vary throughout execution or when they cannot pay the overhead of tuning the chunk size. We call this method iCh (irregular Chunk). It provides a middle ground between traditional fork-join OpenMP scheduling (i.e., OpenMP's built-in dynamic scheduler with a static chunk size) and dynamic task queues. In particular, the method targets applications whose tasks are unbalanced and make many irregular memory accesses. Common applications in this area are sparse linear algebra kernels (e.g., sparse matrix-vector multiplication and sparse triangular solve) and graph algorithms (e.g., Betweenness Centrality). Despite these kinds of codes being predominant in high-performance computing, most schedules are not designed with them in mind.

Currently, tuning such parameters on modern many-core systems for optimal performance can be difficult, and the result may not be portable between machines (see Section 5). Making the chunk size smaller allows for more flexibility when the runtimes of individual tasks in a chunk vary greatly, but at the cost of more requests to the centralized queue. Making the chunk size larger reduces the time spent making requests to the centralized queue, but can result in more load imbalance. Additionally, a single application may have multiple phases, i.e., subsections of code with their own unique performance and energy concerns [3], and each phase may need its own chunk size to achieve good performance. Therefore, the best chunk size depends on the implemented algorithm, the hardware microarchitecture, the input, and the current system load. Tuning such a chunk size offline may be impractical and not reflective of the true system under load. As such, a light-weight auto-tuner like iCh provides one solution.

2 Background

This section describes the current issues and room for improvement related to the adaptive scheduling of tasks with irregular accesses and execution times.

(a) arabic-2005 in natural ordering
(b) arabic-2005 in RCM ordering
(c) Number of rows binned together based on nonzero count in increments of 50 for arabic-2005 (y-axis in log scale)
Figure 1: Representations of irregular inputs

Irregular kernels. Many common applications require calls to kernels (i.e., important common implementations of key algorithms) such as those that deal with sparse matrices and graph algorithms. Examples include graph mining and web crawls. However, these kernels require a great deal of tuning based on both the computer system and algorithm input to perform optimally. Many different programming models are used to implement these kernels, but one of the most common is a fork-join model. Additionally, many of these kernels are memory-bound even when optimally programmed [23]. This means that many memory requests are already waiting to be fulfilled, and additional requests will have high latency on an already busy memory system.

Sparse Matrix-Vector Multiplication (spmv). spmv is a highly studied and optimized kernel due to its importance in many applications [21, 12, 19, 16]. However, the irregular structure of the sparse coefficient matrix makes optimization difficult. If a one-dimensional layout is applied, the smallest task of work is multiplying all nonzeros in a matrix row by the corresponding vector entries and summing the results. Figure 1(a) presents the nonzero structure (i.e., blue representing nonzero entries and white representing zero entries) of the input matrix arabic-2005 in its natural order. The natural order is the one provided by the input file, and this ordering often has some relationship to how elements are normally processed or laid out on the system. From afar, a static assignment of rows may seem like a logical choice. To investigate, we bin rows based on nonzero counts in increments of 50, such that the first bin counts the rows with 1-50 nonzeros and the second bin counts the rows with 51-100 nonzeros. In Figure 1(c), we provide the tally of the number of rows in each bin (in log scale) for the first 50 bins. We note how much variation and work imbalance exist, such as in the last two dots representing bins 49 and 50. Additionally, matrices are often preordered based on the application to provide some structure, such as permuting nonzeros towards the diagonal or into a block structure; one such common permutation is RCM [6]. Even this little structure can provide some benefit to hand-tuned codes [19, 12, 2] that use the newly found structure to better load balance work. However, Figure 1(b) shows that this could make balancing even harder if rows were assigned linearly. Though orderings like RCM may improve execution time [6, 10, 12], they may make tuning for chunk size more important.
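A minimal sketch of such a one-dimensional CSR spmv, under our own array names (row_ptr, col_idx, val); the schedule clause shows where the chunk size parameter enters:

    /* CSR spmv where each row is the smallest task. The per-row work is
       proportional to the row's nonzero count, which is the source of the
       imbalance discussed above; the chunk size (64 here) must be tuned. */
    void spmv_csr(int n, const int *row_ptr, const int *col_idx,
                  const double *val, const double *x, double *y) {
        #pragma omp parallel for schedule(dynamic, 64)
        for (int i = 0; i < n; i++) {
            double sum = 0.0;
            for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++)
                sum += val[k] * x[col_idx[k]];
            y[i] = sum;
        }
    }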

Betweenness Centrality (BC). The BC metric captures the importance of nodes in a graph as the ratio of the shortest paths that pass through a given node to all shortest paths. State-of-the-art implementations of BC are normally built as multiple parallel breadth-first searches [10, 4, 5, 14]. Therefore, the work of a task depends on the number of neighboring nodes and the nodes currently queued in the search front. However, relatively good speedups can be achieved with input reorderings and a smart chunk size.

Irregular kernel insight. Despite the input-dependent irregularity, some local structure normally exists. For example, rows or subblocks within a matrix that have a large number of nonzeros could be grouped under some ordering; the same applies to graph algorithms like BC. Even if the given input does not come with this structure, it can be permuted to have it, and this type of reordering is commonly done to improve performance [10, 19, 16, 12]. Therefore, a thread could, in fact, adapt its own chunk size to fit the local task length. Moreover, a thread that finishes its own work could steal intelligently based on the workload of others. This does require some computational overhead, such as keeping track of workload and communicating it to nearby neighbors. Since most of these applications are memory-bound, a certain amount of computational resources and time is available during execution.

3 Adaptive Runtime Chunk for Irregular Applications

The following steps are considered to construct an adaptive schedule.

Initialization. Standard methods like dynamic scheduling in libgomp use a centralized queue and a single chunk size for all threads, but do not scale well with the number of tasks and threads needed to service many-core systems. Therefore, a local queue, denoted q_i where i is the thread id of the p threads used, is constructed for each thread. A local structure that is memory-aligned and allocated using a first-touch allocation policy contains a pointer to the local queue, a local counter of completed tasks (c_i), and a variable used to calculate chunk size (k_i). The tasks are evenly distributed to the tasking queues such that each queue holds n/p tasks, where n is the total number of tasks. Additionally, c_i = 0 and k_i = p, such that the initial chunk size is a 1/p fraction of the local queue, i.e., n/p^2. The rationale for this choice is that the scheduler wants a chunk size small enough that the other p-1 threads could steal from the queue later. Moreover, the chunk size shrinks as p increases, allowing for the variation of tasks that comes with more threads.
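A sketch in C of what this per-thread state could look like under the notation above; the structure and field names are ours, not those of libgomp or the iCh implementation:

    /* Per-thread scheduler state (our naming, not libgomp's). Padding keeps
       each structure on its own cache line to avoid false sharing; in a real
       runtime each thread would initialize its own entry so that the
       first-touch policy places it in local memory. */
    typedef struct {
        long head, tail;   /* bounds of the local task queue q_i      */
        long c;            /* running count of completed tasks (c_i)  */
        long k;            /* chunk-size divisor (k_i)                */
        char pad[64 - 4 * sizeof(long)];
    } thread_state;

    /* Split n tasks evenly over p queues; with k_i = p, the first chunk a
       thread takes is (n/p)/k_i = n/p^2 tasks. */
    void init_states(thread_state *ts, long n, int p) {
        for (int i = 0; i < p; i++) {
            ts[i].head = i * (n / p);
            ts[i].tail = (i == p - 1) ? n : (i + 1) * (n / p);
            ts[i].c    = 0;
            ts[i].k    = p;
        }
    }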

Local adaption. In traditional work-stealing methods, the chunk size is fixed, and any load imbalance is mitigated through work-stealing once the tasks in a thread's initial queue have been executed [9]. However, a thread can only steal work that is not already being actively processed, i.e., not in the active chunk. Therefore, making the chunk size too large at the start will result in a load imbalance that the scheduler may not be able to recover from through work-stealing. Additionally, making the chunk size too small would add overhead and possibly increase the time to converge.

In iCh, work-stealing is still the workhorse for imbalance, as in the work-stealing methods seen in the next step. However, iCh tries to locally adapt the chunk size to better fit the variation in task execution time, not the load balance. This variation is very important in irregular applications, as tasks may vary greatly in the number of floating-point operations and memory requests. Additionally, a single core mapped to a local thread and its queue can vary in voltage, frequency, and memory bandwidth due to load on the system [3]. Because of all these variations, a static shared chunk size has limitations. Despite iCh's goal of tracking variance, it has an implicit impact on load balancing, following the arguments related to chunk size in the previous paragraph.

This method tries to classify variation into three categories: high, normal, and low. If high, the task lengths in a chunk vary more than if low, and a smaller chunk size will allow for more adaption and possibly more work-stealing; the thought process for low is the opposite. Calculating the "true" variation is very expensive, as it requires accurate measurements of time, operations, and memory requests, in addition to a global view of the average. Therefore, a very rudimentary estimate is used as follows. The local variable c_i keeps track of the running total of the number of tasks completed, updated only after a whole chunk is finished, to estimate task length while limiting the number of writes. After completing its assigned chunk, the local thread determines its load relative to the other threads using the average of their counters, c_avg = (1/(p-1)) * sum over j ≠ i of c_j. A thread classifies its variation as:

high if c_i > (1+δ)·c_avg, normal if (1-δ)·c_avg ≤ c_i ≤ (1+δ)·c_avg, and low if c_i < (1-δ)·c_avg.

In particular, this approximation simply compares the thread's completed work to the average completed work of the other threads. We note that if iCh's goal for chunk size were load balance, the high and low classifications would be flipped, as a thread that completes fewer iterations than average has heavier tasks. The parameter δ is added to allow for slight variation and to reduce the number of times the chunk size is updated. Through trials, we show in Section 5 that δ = 50% (i.e., 50% of the current average) is generally sufficient, and minor changes to δ have little effect on runtime for our kernels. For simplicity, we reference δ in the remainder of the paper by only its percentage, e.g., iCh-50%. This observation allows iCh to be used across different applications, systems, and inputs without hand-tuning by the user while still achieving "good" speedups. Moreover, c_i is a running total while δ is fixed. As a result of this relationship, iCh is likely to adapt the chunk size early on only due to extremely large variance, and the possibility of adapting due to smaller variance increases as execution proceeds.

As noted previously, k_i is used to directly adjust the chunk size, i.e., a thread's next chunk contains (n/p)/k_i tasks. After classification, k_i is adjusted as follows. If the thread is under low variation, the number of tasks in a chunk is increased by decreasing k_i. If the thread is under heavy variation, the number of tasks in a chunk is decreased by increasing k_i. These increases and decreases are the opposite of what most may expect. The update follows from the optimization goal, which is not to have the chunk size of every thread converge to the same value, but for each local chunk size to adapt to the local variation.
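A sketch of the classification and update steps, reusing the thread_state sketch above. The exact update rule for k_i is not reproduced here, so a simple bounded increment/decrement stands in for it as an assumption; delta = 0.5 corresponds to iCh-50%:

    enum variation { LOW, NORMAL, HIGH };

    /* Compare this thread's completed-task count against the average of the
       other threads' counters. Counters may be read slightly stale without
       locking; the estimate is rudimentary by design. */
    static enum variation classify(const thread_state *ts, int i, int p,
                                   double delta) {
        double avg = 0.0;
        for (int j = 0; j < p; j++)
            if (j != i) avg += (double)ts[j].c;
        avg /= (double)(p - 1);

        if (ts[i].c > (1.0 + delta) * avg) return HIGH;  /* high variation */
        if (ts[i].c < (1.0 - delta) * avg) return LOW;   /* low variation  */
        return NORMAL;
    }

    /* Adapt the divisor k_i: a larger k_i means a smaller chunk (n/p)/k_i. */
    static void adapt_chunk(thread_state *ts, int i, int p, double delta) {
        switch (classify(ts, i, p, delta)) {
        case HIGH:   ts[i].k++;                  break;  /* shrink chunk */
        case LOW:    if (ts[i].k > 1) ts[i].k--; break;  /* grow chunk   */
        case NORMAL:                             break;  /* keep as-is   */
        }
    }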

Remote work-stealing. At some point, the local queues start to run out of work, and work-stealing is used. Many implementations of work-stealing fix a chunk size and use the THE protocol [11, 9] to attempt a steal and back off if a conflict occurs, while minimizing the number of locks required. A victim is normally picked at random, and the stealing thread normally tries to steal half of the victim's remaining work.

The iCh method is very similar to the traditional method above. A victim is selected at random, and half of the victim's remaining tasks are stolen. Additionally, the stealing thread's c_s and k_s are updated from the victim's c_v and k_v by averaging, i.e., c_s = (c_s + c_v)/2 and k_s = (k_s + k_v)/2. The reasoning is as follows: the stealing thread learns some information from the victim, but it does not know how accurate that information is, and so it averages out the uncertainty with its own knowledge.
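A sketch of the steal, again reusing thread_state. Synchronization via the THE protocol and the check that the stolen range is not part of the victim's active chunk are omitted for brevity, and the sketch assumes the thief's own queue is empty:

    /* Steal half of victim v's remaining tasks into thief s and average the
       scheduling state, reflecting uncertainty about the victim's info. */
    void steal(thread_state *ts, int s, int v) {
        long remaining = ts[v].tail - ts[v].head;
        if (remaining <= 0) return;              /* nothing left to steal */

        long half = remaining / 2;
        ts[v].tail -= half;                      /* shrink victim's queue */
        ts[s].head  = ts[v].tail;                /* thief takes the top   */
        ts[s].tail  = ts[s].head + half;

        ts[s].c = (ts[s].c + ts[v].c) / 2;       /* average counters      */
        ts[s].k = (ts[s].k + ts[v].k) / 2;       /* average chunk divisor */
    }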

4 Experimental Setup

Test system. Bridges-RM at the Pittsburgh Supercomputing Center [20] is used for testing. The system contains two Intel Xeon E5-2695 v3 (Haswell) processors, each with 14 cores, and 128GB of DDR4-2133. Other microarchitectures, such as Intel Skylake, were also tested, but the results did not vary much. We implement iCh inside of GNU libgomp. Codes on Haswell are compiled with GCC 4.8.5 (OpenMP 3.1). OpenMP threads are bound to cores with OMP_PROC_BIND=true and OMP_PLACES=cores.

Test inputs. The same test suite of inputs is used for both spmv and BC. Table 1 contains the inputs, taken from the SuiteSparse Collection [8], where the numbers of vertices and edges are reported in millions. Inputs are picked for their size, variation in density, and application areas. Four application areas are of particular interest: Freescale, a collection from circuit simulation of semiconductors; DIMACS, a collection from the DIMACS challenge designed to further the development of large graph algorithms; LAW, a collection of web crawls from the Laboratory for Web Algorithms used to research data compression techniques; and GenBank, a collection of protein k-mer graphs.

Furthermore, we report the average row density (d), the ratio of the maximal number of outgoing edges for a vertex over the minimal number of outgoing edges for a vertex (r), and the variance of the number of outgoing edges (σ²) for each input. These numbers give a sense of how sparse the inputs are and how unevenly work is distributed per vertex. Some inputs are very balanced, such as input I8 (hugebubbles); others have more variance, like input I12 (uk-2005).

Input                Area       |V|    |E|    d     r      σ²
I1:  FullChip        Freescale  2.9    26.6   8.9   1.1e6  3.2e6
I2:  circuit5M_dc    Freescale  3.5    14.8   4.2   12     1
I3:  wikipedia       Gleich     3.5    45     12.6  1.8e5  6.2e4
I4:  patents         Pajek      3.7    14.9   3.9   762    31.5
I5:  AS365           DIMACS     3.7    22.7   5.9   4.6    0.7
I6:  delaunay_n23    DIMACS     8.3    50.3   5.9   7      1.7
I7:  wb-edu          Gleich     9.8    57.1   5.8   2.5e4  2.0e3
I8:  hugebubbles-10  DIMACS     19.4   58.3   2.9   1      0
I9:  arabic-2005     LAW        22.7   639.9  28.1  5.7e5  3.0e5
I10: road_usa        DIMACS     23.9   57.7   2.4   4.5    0.8
I11: nlpkkt240       Schenk     27.9   760.6  27.1  4.6    4.8
I12: uk-2005         LAW        39.4   936.3  23.7  1.7e6  2.7e6
I13: kmer_P1a        GenBank    139.3  297.8  2.1   20     0.4
I14: kmer_A2a        GenBank    170.7  360.5  2.1   20     0.3
I15: kmer_V1r        GenBank    214    465.4  2.1   4      0.3
Table 1: Input graphs. |V| and |E| are vertex and edge counts in millions; d is the average number of outgoing edges per vertex; r is the maximal number of outgoing edges over the minimal number of outgoing edges; σ² is the variance of the number of outgoing edges.

5 Results

In this section, we observe the numerical results of using three different schedules for OpenMP for-loops: dynamic (Dyn), work-stealing (WS), and iCh. OpenMP's guided schedule and task model were also tested, but they did not provide additional insight. The work-stealing method is the same one used by iCh, but with a static chunk size. For both dynamic and work-stealing, we test over a fixed collection of chunk sizes. The performance, i.e., Time(i, o, s, c, p) where i is the input, o is the ordering, s is the schedule, c is the chunk size, and p is the number of cores, varies greatly with chunk size, application, and input. Therefore, we often speak of the best time over all tested chunk sizes: Best(i, o, s, p) = min over c of Time(i, o, s, c, p). Likewise, we define Max(i, o, s, p) as the worst such time and SB(i, o, s, p) as the second-best, both used throughout this section. Additionally, each timed experiment is repeated 10 times, and the time used in this section is the average of the 10 runs. The number of runs is important for two reasons: first, all runs fluctuate a small amount due to the system; second, all the scheduling tests may change slightly from run to run, as victims are selected at random and read-write orders vary in dynamic.

(a) REO of spmv
(b) REO of BC
Figure 2: The percent relative error due to ordering (REO) for spmv and BC with the best chunk size, on inputs ordered with RCM and NAT.

Ordering. The execution time, energy usage, and scalability of most irregular applications depend on the input. For our two test applications, ordering is also important [6, 10]. To demonstrate this for our runs, we consider both the RCM ordering and the natural ordering (NAT). We define the percent relative error due to ordering as REO(i, p) = |Best(i, NAT, s, p) - Best(i, RCM, s, p)| / Best(i, RCM, s, p) × 100, where s is dynamic. Figure 2(a) presents REO for spmv over different numbers of cores, with each dot representing one matrix from the test suite. Note that REO is small for some inputs, but for the majority it is larger. Figure 2(b) presents REO for BC over the different numbers of cores, again with each dot representing one matrix. Again, we notice a large error between the RCM and NAT orders. Overall, both spmv and BC are always faster when the input is ordered with RCM. The difference is also seen when the schedule is WS, in almost the same pattern as when dynamic is used (not shown). Additionally, the variation in performance based on chunk size is higher when ordered with RCM than with NAT. This variation is partly because RCM-ordered inputs run faster, so overheads become visible. Additionally, many of the inputs in NAT ordering have a more uniform random distribution of heavy and light tasks, so a chunk is more likely to contain both. Therefore, we use inputs ordered with RCM for the remainder of this section.

Importance of chunk size. Now we analyze the importance that chunk size has on the performance of our benchmarks, using two metrics similar to the one in the last subsection. The first analyzes the largest difference that exists due to chunk size when fixing the input and schedule. We define RE(i, s, p) = (Max(i, RCM, s, p) - Best(i, RCM, s, p)) / Best(i, RCM, s, p) × 100, and further define Max-s as the maximum of RE over all inputs at a given core count. Likewise, we define Min-s as the minimum of RE over all inputs.

Figure 3: The max and min percent relative difference between the application using the best chunk size and (a) the worst chunk size, (b) the second-best chunk size.

Figure 3(a) presents Max-s and Min-s for both dynamic (i.e., Max-Dyn and Min-Dyn, respectively) and work-stealing (Max-WS and Min-WS) for both spmv and BC. Note that we only run BC up to 24 cores throughout this paper due to scaling issues. From the figure, we observe that the worst case for dynamic on both spmv and BC is substantial. This means that selecting a chunk size without any tuning or thought can greatly influence the runtime of both applications. On the other hand, for some inputs the worst case is not as bad for BC as it is for spmv. For example, using 24 cores for BC, there is one input whose percent relative error in time is very small: I1 (FullChip); the next smallest percent relative error at 24 cores belongs to I6 (delaunay_n23).

Though the worst-case performance is a good argument for why chunk size needs to be tuned, it does not capture the difficulty of tuning. In our experiments, we use a relatively large search space for chunk size. The cost of generating this search space just for spmv and BC over our inputs for dynamic and work-stealing was on the order of a workweek of computing time for a fixed input ordering; the search process would not scale to a larger space. One may argue that some intelligence, such as auto-tuning with a line-search algorithm, could be used to determine chunk size. However, this type of method is still expensive and may not find an optimal chunk size. For example, consider spmv with work-stealing on input I9: the best chunk size on 28 cores is 128, but depending on how the line search is set up, the algorithm may never test chunk size 128, as there are other suboptimal solutions between 128 and 512. Lastly, if an application and input pair is run only a few times, the cost of tuning would far outweigh the cost of running with an untuned chunk size. To better demonstrate the runtime difference with even a semi-tuned chunk size, we define RE_SB(i, s, p) = (SB(i, RCM, s, p) - Best(i, RCM, s, p)) / Best(i, RCM, s, p) × 100, where SB is the second-best runtime, and further define Max-s-SB and Min-s-SB as the maximum and minimum of RE_SB over all inputs.

Figure 3(b) presents these two terms for both dynamic (i.e., Max-Dyn-SB and Min-Dyn-SB) and work-stealing for both spmv and BC. Even though our chunk size search space is large, we observe that the maximal relative error can still be sizable in this second-best case when dynamic is used as the scheduling method for spmv. For spmv, work-stealing has a small relative error in both the best and worst cases. However, this is flipped completely for BC. This flip further demonstrates how the optimal chunk size depends on many parameters.

iCh sensitivity. Similar to dynamic and work-stealing, iCh is sensitive to application and input ordering. However, the only simple parameter in the iCh algorithm is δ. We experiment with four different values of δ. In doing so, we notice that the best runtimes across all inputs, applications, and core counts tend to come from two of the values, one of which is 50%, while the other two values account for more of the worst runtimes. Overall, we suggest using δ = 50%, though we have not tried values exhaustively. The overall number of best runtimes is the same for the two best-performing values over both applications, with one better for spmv and the other slightly better for BC. Comparing the relative error between these two values, we find the maximal and average differences to be small; compared with Max-WS from the previous subsection, this relative error is better for larger core counts.

Max speedup. Here, we evaluate the ability of iCh to speed up an application. For this subsection, we fix the chunk size for dynamic and work-stealing to 128. Though the "optimal" chunk size depends on many factors, a chunk size of 128 produced the best time for dynamic and work-stealing on most inputs: for spmv, it most often gave the best runtime for both schedules on 28 cores, and for BC it most often gave the best runtime for both schedules on 24 cores. Additionally, we remind the reader that the goal of iCh is not to be "optimal" or to improve the runtime past what tuned work-stealing can achieve; the goal is to come close to the best performance without off-line tuning of the chunk size. We use δ = 50% for all these tests, and define speedup as Speedup(i, s, p) = Time(i, RCM, s, c, 1) / Time(i, RCM, s, c, p), where c is 128 when s is dynamic or work-stealing.

(a) Speedup spmv

I    Schedule  p   Speedup
I1   iCh-50%   28  11.10
     WS,128    28  10.93
     Dyn,128   28  8.68
I2   iCh-50%   28  17.20
     WS,128    28  18.11
     Dyn,128   28  14.22
I3   iCh-50%   28  25.13
     WS,128    24  8.2
     Dyn,128   28  17.3
I4   iCh-50%   28  18.5
     WS,128    24  8.29
     Dyn,128   28  19.1
I5   iCh-50%   28  20.6
     WS,128    28  19.75
     Dyn,128   28  15.99
I6   iCh-50%   28  20.7
     WS,128    28  21.94
     Dyn,128   28  17.22
I7   iCh-50%   28  20.6
     WS,128    28  8.3
     Dyn,128   28  16.28
I8   iCh-50%   28  18.68
     WS,128    28  19.08
     Dyn,128   28  11.78
I9   iCh-50%   28  19.26
     WS,128    28  20.37
     Dyn,128   28  18.88
I10  iCh-50%   28  21.69
     WS,128    28  10.3
     Dyn,128   28  14.08
I11  iCh-50%   28  21.74
     WS,128    28  16.1
     Dyn,128   24  16.34
I12  iCh-50%   28  14.98
     WS,128    24  13.1
     Dyn,128   28  15.93
I13  iCh-50%   28  22.93
     WS,128    28  22.53
     Dyn,128   28  14.75
I14  iCh-50%   28  22.43
     WS,128    28  22.44
     Dyn,128   28  14.3
I15  iCh-50%   28  21.33
     WS,128    28  23.78
     Dyn,128   28  19.09

(b) Speedup BC

I    Schedule  p   Speedup
I1   iCh-50%   16  4.1
     WS,128    24  2.91
     Dyn,128   24  2.54
I2   iCh-50%   24  16.8
     WS,128    24  15.9
     Dyn,128   24  10.34
I3   iCh-50%   28  17.3
     WS,128    24  14.6
     Dyn,128   24  12.15
I4   iCh-50%   28  15.9
     WS,128    24  12.66
     Dyn,128   24  3.21
I5   iCh-50%   16  12.2
     WS,128    24  9.63
     Dyn,128   24  7.06
I6   iCh-50%   24  16.57
     WS,128    24  14.05
     Dyn,128   24  9.75
I7   iCh-50%   24  14.3
     WS,128    24  14.6
     Dyn,128   24  12.15
I8   iCh-50%   24  12.88
     WS,128    24  11.12
     Dyn,128   24  7.77
I9   iCh-50%   24  20.32
     WS,128    24  16.23
     Dyn,128   24  7.55
I10  iCh-50%   24  7.99
     WS,128    24  8.29
     Dyn,128   24  6.72
I11  iCh-50%   24  20.1
     WS,128    24  21.4
     Dyn,128   24  14.56
I12  iCh-50%   24  16.98
     WS,128    24  11.56
     Dyn,128   24  7.6
I13  iCh-50%   24  23.11
     WS,128    24  20.73
     Dyn,128   24  14.36
I14  iCh-50%   24  22.01
     WS,128    24  21.18
     Dyn,128   24  18.01
I15  iCh-50%   24  20.1
     WS,128    24  21.75
     Dyn,128   24  18.01

Table 2: Speedup with iCh, work-stealing (WS), and dynamic (Dyn). I is the input, and p is the number of cores at which the listed speedup was achieved.

In Table 2(a), the speedups are presented for spmv with the three scheduling methods. We note that in most cases iCh provides a speedup about as good as or better than that of either dynamic or work-stealing. In several cases, such as I1 and I3, iCh has the best speedup. We believe this behavior is an artifact of 128 not being the "optimal" chunk size for dynamic and work-stealing, despite it offering the best speedup within the search space. For I1, iCh achieves a better speedup than work-stealing with any chunk size tested, though we only tested a finite set. For I3, the best chunk size discovered in the search space is 64, and its speedup is still smaller than iCh's; again, we believe this is due to not finding the best chunk size. For I2, work-stealing achieves a better speedup than iCh, the largest such difference when the chunk size is fixed at 128.

In Table 2(b), we observe the speedups for BC. This application is more interesting, as there are more locks and updates that can stall parallel execution than in spmv; therefore, BC is expected not to scale as well. Overall, iCh still does very well, and its speedup is smaller in only 4 cases, and in those four cases the difference is very small. However, we notice something interesting: in two cases, I1 and I5, iCh obtains its maximal speedup using 16 cores rather than 24. In both cases the speedup is worse at 24 cores, yet iCh at 16 cores still achieves a better speedup than dynamic or work-stealing at any core count tested. We believe this is an artifact of iCh finding the best chunk size early while the application runs out of parallelism at higher core counts; as the parallel work runs out, the overhead of iCh shows. However, the speedup for iCh on 24 cores is 2.98 for I1 and 10.3 for I5, both close to the best speedup found for work-stealing over the chunk size search space.

iCh optimal bound. Next, we want to bound how far the speedup of iCh is from the best-found speedup of either work-stealing or dynamic over all chunk sizes. In doing so, we fix δ = 50%, because we present iCh as an auto-tuning algorithm that needs no user input, even though other values were tested and may provide a better speedup for iCh. We find that iCh is on average within 1.061x of the best speedup from either dynamic or work-stealing on spmv. This means that iCh on average has about the same speedup as the best scheduling method tuned over our chunk size collection. For BC, we find that iCh is on average within 1.092x of the best speedup from either dynamic or work-stealing. The worst case for BC is much more surprising than that for spmv; however, it is driven by one case in which dynamic does extremely well with a chunk size of 32, and no other scheduling method and chunk size can compare. Overall, Table 2(b) provides a much better average view of iCh's performance on BC.

6 Related Work

Work by Yan, Jin, and Zhang [24] adds a dynamic history to decide about load balance on distributed shared-memory systems. The adaptive chunk size in the local adaption step of our algorithm is an extension of this work. However, we note that we optimize for variance while they optimize for load balance; as a result, the inequalities in their classifications are in the opposite direction of iCh's. In particular, that older work considered how to keep and update a history on distributed shared-memory systems, such as the KSR-1 and Convex with up to 16 CPUs, which could have high delays and a memory system unlike today's modern systems. In [1], loops are scheduled in a distributed fashion with MPI, and the chunk size is determined as a direct fraction of the cumulative number of completed tasks and the processor speed. The KASS system [22] considers adaptive chunking in the static, or initialization, phase. Chunks in the second (dynamic) phase are reduced in a fixed manner based on information from past iteration runs, but are not adapted within an iteration as in iCh; chunks are stolen if a queue runs out of its own. A history-aware self-scheduling method [13] estimates chunk size from past iterations and the number of times a task will be run using a much more complex "best-fit" approximation. This provides benefits for loops that are repeated, but iCh does not consider this, as irregular kernels such as spmv may not repeat loops. Lastly, BinLPT [15] schedules irregular tasks from a loop using an estimate of the work in each loop and a maximal number of chunks provided by the user. This method is one of the newest and shows good performance in its publication. In contrast, iCh aims to provide an easier method that requires neither estimates of loop work nor additional user input.

7 Conclusion

This work develops an adaptive OpenMP loop scheduler for work-imbalanced irregular applications by adaptively tuning chunk size and using work-stealing. The method uses a force-feedback control system that analyzes an approximation to the variance of the task lengths assigned in a chunk. Though rudimentary, this system has relatively low overhead and allows for performance comparable to fine-tuning over a large chunk size collection for sparse matrix-vector multiplication (spmv) and Betweenness Centrality (BC). In particular, we demonstrate that iCh is on average within 1.061x of the best speedup achieved by either the traditional dynamic or work-stealing schedule for spmv when these two schedules are tuned over a relatively large collection of chunk sizes. We also demonstrate that iCh is on average within 1.092x of the best speedup achieved by either schedule for BC. Additionally, we observe that iCh can reduce the variation in runtime that exists in a work-stealing method that randomly selects its victim.

References

  • [1] I. Banicescu, V. Velusamy, and J. Devaprasad (2003) On the scalability of dynamic scheduling scientific applications with adaptive weighted factoring. Cluster Computing 6 (3), pp. 215–226. Cited by: §6.
  • [2] J. D. Booth, N. D. Ellingwood, H. K. Thornquist, and S. Rajamanickam (2017) Basker: parallel sparse LU factorization utilizing hierarchical parallelism and data layouts. Parallel Comput. 68, pp. 17–31. Cited by: §2.
  • [3] J. D. Booth, J. Kotra, H. Zhao, M. Kandemir, and P. Raghavan (2015-06) Phase detection with hidden Markov models for DVFS on many-core processors. In 2015 IEEE 35th International Conference on Distributed Computing Systems, External Links: Document Cited by: §1, §3.
  • [4] U. Brandes (2001-06) A faster algorithm for betweenness centrality. The Journal of Mathematical Sociology 25 (2), pp. 163–177. External Links: Document Cited by: §2.
  • [5] U. Brandes (2008-05) On variants of shortest-path betweenness centrality and their generic computation. Social Networks 30 (2), pp. 136–145. External Links: Document Cited by: §2.
  • [6] E. Cuthill and J. McKee (1969) Reducing the bandwidth of sparse symmetric matrices. In Proceedings of the 1969 24th National Conference, ACM ’69, New York, NY, USA, pp. 157–172. External Links: ISBN 9781450374934, Link, Document Cited by: §2, §5.
  • [7] L. Dagum and R. Menon (1998) OpenMP: an industry standard api for shared-memory programming. Computational Science & Engineering, IEEE 5 (1), pp. 46–55. Cited by: §1.
  • [8] T. A. Davis and Y. Hu (2011) The University of Florida sparse matrix collection. ACM TOMS 38 (1), pp. 1:1–1:25. Cited by: §4.
  • [9] M. Durand, F. Broquedis, T. Gautier, and B. Raffin (2013) An efficient OpenMP loop scheduler for irregular applications on large-scale NUMA machines. In OpenMP in the Era of Low Power Devices and Accelerators, pp. 141–155. External Links: Document Cited by: §3, §3.
  • [10] M. Frasca, K. Madduri, and P. Raghavan (2012-11) NUMA-aware graph mining techniques for performance and energy efficiency. In 2012 International Conference for High Performance Computing, Networking, Storage and Analysis, External Links: Document Cited by: §2, §2, §2, §5.
  • [11] M. Frigo, C. E. Leiserson, and K. H. Randall (1998) The implementation of the cilk-5 multithreaded language. In Proceedings of the ACM SIGPLAN 1998 conference on Programming language design and implementation - PLDI '98, External Links: Document Cited by: §3.
  • [12] H. Kabir, J. D. Booth, and P. Raghavan (2014-12) A multilevel compressed sparse row format for efficient sparse computations on multicore processors. In 2014 21st International Conference on High Performance Computing (HiPC), External Links: Document Cited by: §2, §2.
  • [13] A. Kejariwal, A. Nicolau, and C. D. Polychronopoulos (2006) History-aware self-scheduling. In 2006 International Conference on Parallel Processing (ICPP’06), Vol. , pp. 185–192. Cited by: §6.
  • [14] K. Madduri, D. Ediger, K. Jiang, D. A. Bader, and D. Chavarria-Miranda (2009-05) A faster parallel algorithm and efficient multithreaded implementations for evaluating betweenness centrality on massive datasets. In 2009 IEEE International Symposium on Parallel & Distributed Processing, External Links: Document Cited by: §2.
  • [15] P. H. Penna, A. T. A. Gomes, M. Castro, P. DM Plentz, H. C. Freitas, F. Broquedis, and J. Méhaut (2019) A comprehensive performance evaluation of the binlpt workload-aware loop scheduler. Concurrency and Computation: Practice and Experience 31 (18), pp. e5170. Cited by: §6.
  • [16] A. Pinar and M. T. Heath (1999) Improving performance of sparse matrix-vector multiplication. In Proceedings of the 1999 ACM/IEEE conference on Supercomputing (CDROM) - Supercomputing '99, External Links: Document Cited by: §2, §2.
  • [17] D. Schmidl, T. Cramer, S. Wienke, C. Terboven, and M. S. Müller (2013) Assessing the performance of openmp programs on the intel xeon phi. In Proceedings of the 19th International Conference on Parallel Processing, Euro-Par’13, Berlin, Heidelberg, pp. 547–558. External Links: ISBN 978-3-642-40046-9, Document Cited by: §1.
  • [18] D. Schmidl, P. Philippen, D. Lorenz, C. Rössel, M. Geimer, D. an Mey, B. Mohr, and F. Wolf (2012) Performance analysis techniques for task-based openmp applications. In Proceedings of the 8th International Conference on OpenMP in a Heterogeneous World, IWOMP’12, Berlin, Heidelberg, pp. 196–209. External Links: ISBN 978-3-642-30960-1, Document Cited by: §1.
  • [19] S. Toledo (1997-11) Improving the memory-system performance of sparse-matrix vector multiplication. IBM Journal of Research and Development 41 (6), pp. 711–725. External Links: Document Cited by: §2, §2.
  • [20] J. Towns, T. Cockerill, M. Dahan, I. Foster, K. Gaither, A. Grimshaw, V. Hazlewood, S. Lathrop, D. Lifka, G. D. Peterson, R. Roskies, J. R. Scott, and N. Wilkins-Diehr (2014-Sept.-Oct.) XSEDE: accelerating scientific discovery. Computing in Science & Engineering 16 (5), pp. 62–74. Cited by: §4.
  • [21] R. W. Vuduc and H. Moon (2005-07) Fast sparse matrix-vector multiplication by exploiting variable block structure. Technical report Office of Scientific and Technical Information (OSTI). External Links: Document Cited by: §2.
  • [22] Y. Wang, W. Ji, F. Shi, Q. Zuo, and N. Deng (2012) Knowledge-based adaptive self-scheduling. In Network and Parallel Computing, 9th IFIP International Conference, NPC 2012, Gwangju, Korea, September 6-8, 2012. Proceedings, J. J. Park, A. Y. Zomaya, S. Yeo, and S. Sahni (Eds.), Lecture Notes in Computer Science, Vol. 7513, pp. 22–32. External Links: Document, Link Cited by: §6.
  • [23] S. Williams, A. Waterman, and D. Patterson (2009-04) Roofline: an insightful visual performance model for multicore architectures. Commun. ACM 52 (4), pp. 65–76. External Links: ISSN 0001-0782, Document Cited by: §2.
  • [24] Y. Yan, C. Jin, and X. Zhang (1997) Adaptively scheduling parallel loops in distributed shared-memory systems. IEEE Transactions on Parallel and Distributed Systems 8 (1), pp. 70–81. External Links: Document Cited by: §6.