1 Introduction
The traditional fork-join model of programming has remained popular because it makes it easy to express the abundant loop parallelism found in scientific applications. OpenMP is a popular programming interface because it exposes the fork-join model through pragmas, making loop parallelization nearly effortless [7]. However, the interface offers limited control over the assignment of tasks to threads when task times and resource demands vary widely. The only option the runtime user is given is a shared static chunk size, which defines the number of tasks each thread processes before requesting more from a centralized queue. To overcome the limitations of this scheduling method, many modern programs are being converted to use dynamic task queues in a tasking model, and support for this model has been added to OpenMP [18, 17]. However, dynamic task queues carry overhead of their own and are not as well supported by legacy applications.
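As a concrete illustration of this single knob, the sketch below (ours, not from the paper) shows a loop whose task cost grows with the iteration index, scheduled with OpenMP's dynamic schedule; the hand-picked chunk size of 64 is an arbitrary example value and is the only tuning option the interface exposes:

```c
/* Irregular loop: task i costs O(i) work, so later chunks are much
 * heavier than earlier ones. The shared static chunk size (64 here,
 * an arbitrary example value) is the only scheduling knob. */
long irregular_sum(int n) {
    long total = 0;
    #pragma omp parallel for schedule(dynamic, 64) reduction(+:total)
    for (int i = 0; i < n; i++) {
        volatile long sink = 0;          /* simulated task body */
        for (int j = 0; j < i; j++)
            sink += j;                   /* cost grows with i */
        total += i % 3;                  /* cheap check value */
    }
    return total;
}
```

Compiled with `-fopenmp`, idle threads fetch new 64-iteration chunks from a centralized queue; without it, the pragma is ignored and the loop runs serially with the same result.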
In this work, we provide an OpenMP parallel-for schedule with an independent, auto-tuned chunk size per thread that works with work-stealing, allowing users to keep the favored fork-join model for applications whose task lengths vary throughout execution, or when they cannot afford the overhead of tuning the chunk size. We call this method iCh (irregular Chunk). It provides a middle ground between traditional fork-join OpenMP scheduling with OpenMP's built-in dynamic scheduler (which uses a static chunk size) and dynamic task queues. In particular, the method targets applications whose tasks are unbalanced and make many irregular memory accesses. Common applications in this area are sparse linear algebra kernels (e.g., sparse matrix-vector multiplication and sparse triangular solve) and graph algorithms (e.g., Betweenness Centrality). Despite these kinds of codes being predominant in high-performance computing, most schedules are not designed with them in mind.
Currently, tuning these parameters on modern manycore systems for optimal performance can be difficult, and the result may not be portable between machines; see Section 5. Making the chunk size smaller allows more flexibility when the runtimes of individual tasks in a chunk vary greatly, but at the cost of more requests to the centralized queue. Making the chunk size larger reduces the time spent making requests to the centralized queue, but can result in more load imbalance. Additionally, a single application may have multiple phases, i.e., subsections of code that have their own unique performance and energy concerns [3], and each phase may need its own chunk size to perform well. The choice of the best chunk size therefore depends on the implemented algorithm, the hardware microarchitecture, the input, and the current system load. Tuning such a chunk size offline may be impractical and not reflective of the true system under load. As such, a lightweight auto-tuner like iCh provides one solution.
2 Background
This section describes the current issues and room for improvement related to the adaptive scheduling of tasks with irregular accesses and execution times.
Irregular kernels. Many common applications require calls to kernels (i.e., important common implementations of key algorithms) such as those that deal with sparse matrices and graph algorithms. Examples include graph mining and web crawls. However, these kernels require a great deal of tuning, based on both the computer system and the algorithm input, to perform optimally. Many different programming models are used to implement these kernels, but one of the most common is the fork-join model. Additionally, many of these kernels are memory-bound even when optimally programmed [23]. This means that many memory requests are already waiting to be fulfilled, and additional requests will have high latency on an already busy memory system.
Sparse Matrix-Vector Multiplication (spmv). spmv is a highly studied and optimized kernel due to its importance in many applications [21, 12, 19, 16]. However, the irregular structure of the sparse coefficient matrix makes this difficult. If a one-dimensional layout is applied, the smallest task of work is multiplying all the nonzeros in a matrix row by the corresponding vector entries and summing them together at the end. Figure 1(a) presents the nonzero structure (i.e., blue representing nonzero entries and white representing zero entries) of the input matrix arabic-2005 in its natural order. The natural order is the one provided by the input file, and this ordering often has some relationship to how elements are normally processed or laid out on the system. From afar, a static assignment of rows may seem like a logical choice. To investigate, we bin rows by nonzero count in increments of 50, such that the first bin counts the rows with 1-50 nonzeros and the second bin counts the rows with 51-100 nonzeros. In Fig. 1(c), we provide the tally of the number of rows in each bin (in log scale) for the first 50 bins. Note how much variation and imbalance of work exists, such as the last two dots representing bins 49 and 50. Additionally, matrices are often preordered based on the application to provide some structure, such as permuting nonzeros towards the diagonal or into a block structure. One common permutation is RCM [6]. This little bit of structure can provide some benefit to hand-tuned codes [19, 12, 2] that can use the newly found structure to better load balance work. However, Fig. 1(b) shows that this could make balancing even harder if rows were assigned linearly. Though orderings like RCM may improve execution time [6, 10, 12], they may also make tuning the chunk size more important.
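To make the task granularity concrete, a minimal CSR spmv kernel is sketched below (our illustration, not the paper's implementation). Each row is one task, and the inner-loop trip count, i.e., the row's nonzero count, is exactly the quantity binned above; the chunk size of 64 is an arbitrary example value:

```c
/* Minimal CSR sparse matrix-vector multiply, y = A*x.
 * One row = one task; rows with many nonzeros are heavier tasks,
 * which is the source of the load imbalance discussed above. */
typedef struct {
    int nrows;
    const int *rowptr;   /* length nrows+1 */
    const int *colidx;   /* length nnz */
    const double *vals;  /* length nnz */
} csr_t;

void spmv(const csr_t *A, const double *x, double *y) {
    #pragma omp parallel for schedule(dynamic, 64)
    for (int r = 0; r < A->nrows; r++) {
        double sum = 0.0;
        /* work per task = nonzeros in row r */
        for (int k = A->rowptr[r]; k < A->rowptr[r + 1]; k++)
            sum += A->vals[k] * x[A->colidx[k]];
        y[r] = sum;
    }
}
```

For example, the 3x3 matrix [[2,0,1],[0,3,0],[4,0,5]] applied to x = (1,2,3) yields y = (5,6,19), with rows of 2, 1, and 2 nonzeros forming tasks of unequal weight.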
Betweenness Centrality (BC). The BC metric captures the importance of nodes in a graph as the ratio of all shortest paths to those that pass through a given node. State-of-the-art implementations of BC are normally built on multiple parallel breadth-first searches [10, 4, 5, 14]. Therefore, the work of a task depends on the number of neighboring nodes and the nodes currently queued in the search front. However, relatively good speedups can be achieved with input reorderings and a smart chunk size.
Irregular kernel insight. Despite the irregular nature of the input, some local structure normally exists. For example, rows or sub-blocks within a matrix that have a large number of nonzeros can be grouped under some ordering. The same applies to graph algorithms like BC. Even if the given input does not come with this structure, the input can be permuted to have it. This type of permutation or reordering is commonly done to improve performance [10, 19, 16, 12]. Therefore, a thread could, in fact, adapt its own chunk size to fit the local task length. Moreover, a thread that finishes its own work could steal intelligently based on the workload of others. This does require some computational overhead, such as keeping track of workload and communicating it to nearby neighbors. Since most of these applications are memory-bound, a certain amount of computational resources and time is available during execution.
3 Adaptive Runtime Chunk for Irregular Applications
The following steps are considered to construct an adaptive schedule.
Initialization. Standard methods like dynamic scheduling in libgomp use a centralized queue and a single chunk size for all threads, but do not scale well with the number of tasks and threads needed to service manycore systems. Therefore, a local queue $q_t$ is constructed for each thread, where $t$ is the thread id among the $p$ threads used. A local structure, memory-aligned and allocated using the first-touch allocation policy, contains a pointer to the local queue, a local counter ($c_t$), and a variable used to calculate chunk size ($w_t$). The $n$ tasks are evenly distributed to the task queues such that $|q_t| \approx n/p$. Additionally, $c_t = 0$ and $w_t = p$, so that the initial chunk size is $(n/p)/w_t$, i.e., $n/p^2$. The rationale for this choice is that the scheduler wants a chunk size small enough that the other threads could steal from the queue later. Moreover, the chunk size shrinks as $p$ increases, allowing for the variation of tasks that comes with more threads.
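The per-thread state and the initial chunk-size rule can be sketched as follows; the struct layout, field names, and function name are ours for illustration, not libgomp's:

```c
/* Sketch of the per-thread scheduler state described above. */
typedef struct {
    long head, tail;  /* this thread's slice of the iteration space */
    long c;           /* running count of completed tasks (c_t) */
    long w;           /* chunk-size divisor (w_t), initialized to p */
} ich_state_t;

/* Each thread owns about n/p tasks and starts with w = p, so the
 * initial chunk is (n/p)/w = n/p^2: small enough that the other
 * threads can still steal useful work later, and shrinking as the
 * thread count grows. */
long ich_initial_chunk(long n, long p) {
    long per_thread = n / p;
    long chunk = per_thread / p;
    return chunk > 0 ? chunk : 1;  /* never hand out empty chunks */
}
```

For instance, 1600 tasks on 4 threads gives per-thread queues of about 400 tasks and an initial chunk of 100.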
Local adaption. In traditional work-stealing methods, the chunk size is fixed, and any load imbalance is mitigated through work-stealing once all the tasks in the initial queue have been executed [9]. However, a thread can only steal work that is not already being actively processed, i.e., not in the active chunk. Therefore, making the chunk size too large at the start will result in load imbalance that the scheduler may not be able to recover from through work-stealing. Additionally, making the chunk size too small results in added overhead and possibly more time to converge.
In iCh, work-stealing is still the workhorse for mitigating imbalance, as in the work-stealing methods discussed in the next step. However, iCh tries to locally adapt the chunk size to better fit the variation in task execution times, not the load balance. This variation is very important in irregular applications, as tasks may vary greatly in the number of floating-point operations and memory requests. Additionally, the core mapped to a local thread and its queue can vary in voltage, frequency, and memory bandwidth due to load on the system [3]. Because of all these variations, a static shared chunk size has limitations. Despite iCh's goal, it does have an implicit impact on load balancing, reflecting the arguments related to chunk size in the previous paragraph.
This method tries to place the variation into three categories: high, normal, and low. Under high variation, the lengths of tasks vary more than under low variation, and a smaller chunk size will allow for more adaptation and possibly more work-stealing; the reasoning for low is the opposite. Calculating the "true" variation is very expensive, as it requires accurate measurements of time, operations, and memory requests, in addition to a global view of the average. Therefore, a very rudimentary estimate is used as follows. The local variable $c_t$ keeps track of the running total of the number of tasks completed, updated only after a whole chunk is finished, to estimate task length and limit the number of writes. After completing the assigned chunk, the local thread determines its load relative to the other threads using $c_t$. A thread classifies its load as high if $c_t > (1+\delta)\,\bar{c}_t$, low if $c_t < (1-\delta)\,\bar{c}_t$, and normal otherwise, where $\bar{c}_t$ is the average counter value of the other threads.
In particular, this approximation simply compares the thread's workload to the average workload of the other threads. We note that if iCh's goal for chunk size were load balance, the high and low classifications would be flipped, as a thread that completes fewer iterations than average has heavier tasks. The parameter $\delta$ is added to allow for slight variation and to reduce the number of times the chunk size is updated. Through trials, we show in Section 5 that a small $\delta$ (i.e., a small percentage of the current average) is generally sufficient, and minor changes to $\delta$ have little effect on runtime for our kernels. For simplicity, we refer to $\delta$ in the remainder of the paper by its percentage alone. This observation allows iCh to be used across different applications, systems, and inputs without hand-tuning by the user while still achieving "good" speedups. Moreover, $c_t$ is a running total while $\delta$ is fixed. As a result, iCh is likely to adapt the chunk size early only under extremely large variance, while the possibility of adapting due to smaller persistent variance increases as execution progresses.
As noted previously, $w_t$ is used to directly adjust the chunk size, i.e., chunk size $= (n/p)/w_t$. After classification, $w_t$ is adjusted as follows. If the thread is under low variation, the number of tasks in a chunk is increased by setting $w_t = w_t/2$. If the thread is under high variation, the number of tasks in a chunk is decreased by setting $w_t = 2\,w_t$. This increase and decrease are the opposite of what most would expect, because of the optimization goal: the goal is not for the chunk size of each thread to converge to the same value, but for the local chunk size to adapt to the local variation.
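The classification and update steps above can be sketched as follows. This is our reading of the rule: the inequality direction follows the paper's note that it is opposite the load-balance convention, the halve/double update is an illustrative multiplicative step, and all names are ours:

```c
typedef enum { ICH_LOW, ICH_NORMAL, ICH_HIGH } ich_variation_t;

/* Compare this thread's completed count c to the average of the
 * other threads, with slack delta (e.g., 0.1 for 10%). */
ich_variation_t ich_classify(double c, double avg_others, double delta) {
    if (c > (1.0 + delta) * avg_others) return ICH_HIGH;
    if (c < (1.0 - delta) * avg_others) return ICH_LOW;
    return ICH_NORMAL;
}

/* Chunk size is (n/p)/w, so halving w doubles the chunk (low
 * variation) and doubling w halves it (high variation). */
long ich_adapt_w(long w, ich_variation_t v) {
    if (v == ICH_LOW && w > 1) return w / 2;
    if (v == ICH_HIGH)         return w * 2;
    return w;
}
```

A thread whose count sits within the slack band keeps its current chunk size, so the update fires only on sustained deviation.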
Remote work-stealing. At some point, the local queues start to run out of work, and work-stealing is used. Many implementations of work-stealing will fix a chunk size and use the THE protocol [11, 9] to attempt a steal and back off if a conflict occurs, while trying to minimize the number of locks required. A victim is normally picked at random, and the stealing thread typically tries to steal half of the victim's remaining work.
The iCh method is very similar to the traditional method above. A victim is selected at random, and half of the victim's remaining tasks are stolen. Additionally, the stealing thread's $c_s$ and $w_s$ are updated based on the victim's $c_v$ and $w_v$ by taking the average of both, i.e., $c_s = (c_s + c_v)/2$ and $w_s = (w_s + w_v)/2$. The reasoning is as follows: the stealing thread learns some information from the victim, but it does not know how accurate that information is, so it averages out the uncertainty with its own knowledge.
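The steal-half-and-average step can be sketched as below; the struct and function names are ours, and the THE-style locking that guards concurrent steals is omitted for brevity:

```c
typedef struct {
    long head, tail;  /* remaining task range [head, tail) */
    long c, w;        /* scheduler bookkeeping (c_t, w_t) */
} ich_queue_t;

/* Steal half of the victim's remaining range and average the two
 * threads' bookkeeping to blend uncertain victim information with
 * local knowledge. Returns the number of tasks taken. */
long ich_steal(ich_queue_t *thief, ich_queue_t *victim) {
    long remaining = victim->tail - victim->head;
    if (remaining <= 0) return 0;           /* nothing to steal */
    long take = remaining / 2;
    thief->head = victim->tail - take;      /* take the back half */
    thief->tail = victim->tail;
    victim->tail -= take;
    thief->c = (thief->c + victim->c) / 2;  /* average counters */
    thief->w = (thief->w + victim->w) / 2;  /* average divisors */
    return take;
}
```

For example, stealing from a victim with 10 remaining tasks transfers 5 of them and pulls the thief's $c$ and $w$ halfway toward the victim's values.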
4 Experimental Setup
Test system. Bridges-RM at the Pittsburgh Supercomputing Center [20] is used for testing. Each node contains two Intel Xeon E5-2695 v3 (Haswell) processors, each with 14 cores, and 128GB of DDR4-2133. Other microarchitectures, such as Intel Skylake, were also tested, but the results did not vary much. We implement iCh inside GNU libgomp. Codes on Haswell are compiled with GCC 4.8.5 (OpenMP 3.1). OpenMP threads are bound to cores with OMP_PROC_BIND=true and OMP_PLACES=cores.
Test inputs. The same suite of inputs is used for both spmv and BC. Table 1 lists the inputs, taken from the SuiteSparse Collection [8], with the numbers of vertices and edges reported in millions. Inputs are picked for their size, variation of density, and application areas. Four application areas are of particular interest. Freescale: a collection from circuit simulation of semiconductors; DIMACS: a collection from the DIMACS challenge, designed to further the development of large graph algorithms; LAW: a collection from the Laboratory for Web Algorithms of web crawls used to research data-compression techniques; GenBank: a collection of protein k-mer graphs.
Furthermore, we report for each input the average row density, the ratio of the maximal to the minimal number of outgoing edges per vertex, and the variance of the number of outgoing edges. These numbers give a sense of how sparse the inputs are and how unevenly work is distributed per vertex. Some inputs are very balanced, such as I8 (hugebubbles); others have more variance, like I12 (uk-2005).
Input  Area  Vertices (M)  Edges (M)  Avg. density  Max/min degree  Degree variance
I1: FullChip  Freescale  2.9  26.6  8.9  1.1e6  3.2e6
I2: circuit5M_dc  Freescale  3.5  14.8  4.2  12  1
I3: wikipedia  Gleich  3.5  45  12.6  1.8e5  6.2e4
I4: patents  Pajek  3.7  14.9  3.9  762  31.5
I5: AS365  DIMACS  3.7  22.7  5.9  4.6  0.7
I6: delaunay_n23  DIMACS  8.3  50.3  5.9  7  1.7
I7: wb-edu  Gleich  9.8  57.1  5.8  2.5e4  2.0e3
I8: hugebubbles-00010  DIMACS  19.4  58.3  2.9  1  0
I9: arabic-2005  LAW  22.7  639.9  28.1  5.7e5  3.0e5
I10: road_usa  DIMACS  23.9  57.7  2.4  4.5  0.8
I11: nlpkkt240  Schenk  27.9  760.6  27.1  4.6  4.8
I12: uk-2005  LAW  39.4  936.3  23.7  1.7e6  2.7e6
I13: kmer_P1a  GenBank  139.3  297.8  2.1  20  0.4
I14: kmer_A2a  GenBank  170.7  360.5  2.1  20  0.3
I15: kmer_V1r  GenBank  214  465.4  2.1  4  0.3
5 Results
In this section, we present numerical results using three different schedules for OpenMP for-loops: dynamic (Dyn), work-stealing (WS), and iCh. OpenMP guided and task models were also tested, but they did not provide additional insight. The work-stealing method is the same one used by iCh, but with a static chunk size. For both dynamic and work-stealing, we test over a collection of chunk sizes. The performance, $\mathrm{Time}(i, o, s, c, p)$, where $i$ is the input, $o$ is the ordering, $s$ is the schedule, $c$ is the chunk size, and $p$ is the number of cores, varies greatly with chunk size, application, and input. Therefore, we often speak of the best time over all tested chunk sizes: $\mathrm{Best}(i,o,s,p) = \min_{c} \mathrm{Time}(i,o,s,c,p)$. Likewise, we define $\mathrm{Worst}(i,o,s,p)$ as the maximum and $\mathrm{SB}(i,o,s,p)$ as the second best, both used throughout this section. Additionally, each timed experiment is repeated 10 times, and the time reported is the average of the 10 runs. The number of runs matters for two reasons: all runs fluctuate slightly due to the system, and the scheduling tests may change slightly from run to run because victims are selected at random and read-write orders vary in dynamic.
Ordering. The execution time, energy usage, and scalability of most irregular applications depend on the input. For our two test applications, ordering is also important [6, 10]. To demonstrate this for our runs, we consider both the RCM ordering and the natural ordering (NAT). We define the percent relative error due to ordering as the relative difference between the best times under NAT and RCM, with dynamic as the schedule. Figure 2(a) presents this error for spmv over different numbers of cores, with each dot representing one matrix from the test suite; the error can be small for some inputs, but for the majority it is larger. Figure 2(b) presents the same for BC, and again we notice a large error between the RCM and NAT orders. Overall, both spmv and BC are always faster when the input is ordered with RCM. The difference appears in almost the same pattern when the schedule is WS (not shown). Additionally, the variation in performance with chunk size is higher under RCM than under NAT. This variation is partly because RCM-ordered inputs run faster, so overheads become visible. Additionally, many of the inputs under NAT have a more uniformly random distribution of heavy and light tasks, so a chunk is more likely to contain both. Therefore, we use inputs ordered with RCM for the remainder of this section.
Importance of chunk size. We now analyze the importance of chunk size to the performance of our benchmarks, using two metrics similar to those in the last subsection. The first captures the largest difference due to chunk size, with the input and schedule fixed: $\mathrm{Max}(i,o,s,p) = 100 \times (\mathrm{Worst}(i,o,s,p) - \mathrm{Best}(i,o,s,p)) / \mathrm{Best}(i,o,s,p)$. We write MaxDyn and MaxWS for this quantity under the dynamic and work-stealing schedules, respectively.
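The first metric, the percent relative error between the worst and best time over the tested chunk sizes, can be computed with a small helper like the one below (ours, for illustration):

```c
/* Percent relative error between the worst and best time over the
 * tested chunk sizes for one (input, ordering, schedule, cores)
 * combination: 100 * (worst - best) / best. */
double max_percent_error(const double *times, int n) {
    double best = times[0], worst = times[0];
    for (int i = 1; i < n; i++) {
        if (times[i] < best)  best = times[i];
        if (times[i] > worst) worst = times[i];
    }
    return 100.0 * (worst - best) / best;
}
```

For instance, times of 2.0s, 1.0s, and 1.5s across three chunk sizes give a 100% worst-case penalty for picking the wrong chunk size.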
Figure 3(a) presents MaxDyn and MaxWS for both spmv and BC. Note that we only run BC up to 24 cores due to scaling issues, as throughout this paper. From the figure, we observe that the worst case for dynamic on both spmv and BC is large. This means that selecting a chunk size without any tuning or thought can greatly influence the runtime of both applications. On the other hand, for some inputs the worst case is not as bad for BC as it is for spmv. For example, using 24 cores for BC, there is one input whose percent relative error for time is small; this input is I1: FullChip, and the next smallest percent relative error on 24 cores belongs to I6: delaunay_n23.
Though worst-case performance is a good argument for tuning chunk size, it does not capture the difficulty of tuning. In our experiments, we use a relatively large search space for chunk size. The cost of generating this search space, just for spmv and BC over our inputs for dynamic and work-stealing, was on the order of a work-week of computing time for a fixed input ordering. As such, the search process would not scale to a larger search space. One might argue that some intelligence, such as auto-tuning with a line-search algorithm, could be used to determine chunk size. However, this type of method is still expensive and may not provide an optimal chunk size. For example, consider spmv applied with work-stealing to input I9. In this case, the best chunk size on 28 cores is 128, but depending on how the line search was set up, the algorithm might never test chunk size 128, as there are other suboptimal solutions between 128 and 512. Lastly, if an application and input pair is run only a few times, the cost of tuning would greatly outweigh the cost of running with an untuned chunk size. To better demonstrate the runtime difference even with a semi-tuned chunk size, we define the second-best relative error as $100 \times (\mathrm{SB}(i,o,s,p) - \mathrm{Best}(i,o,s,p)) / \mathrm{Best}(i,o,s,p)$, where $\mathrm{SB}$ is the second-best runtime, and write SBDyn and SBWS for the dynamic and work-stealing schedules.
Figure 3(b) presents these two terms for both dynamic (SBDyn) and work-stealing (SBWS) on both spmv and BC. Even though our chunk-size search space is large, the maximal relative error for the second-best time can be substantial when dynamic is used as the scheduling method for spmv. For spmv, work-stealing has a small relative error in both the best and worst cases. However, this situation is completely flipped for BC, further demonstrating how the optimal chunk size depends on many parameters.
iCh sensitivity. Like dynamic and work-stealing, iCh is sensitive to the application and input ordering. However, the only simple parameter in iCh is $\delta$. We experiment with four different values of $\delta$. In doing so, we notice that the best runtime across all inputs, applications, and core counts tends to occur at the two middle values, while the smallest value yields the most cases with the worst runtimes. Overall, we suggest a moderate $\delta$, though we have not tried values exhaustively. The overall number of best results is the same for the two middle values across both applications, with one slightly better for spmv and the other slightly better for BC. Comparing the relative error between these two values, we find the maximum to be small on average. Compared with MaxWS from the previous subsection, this relative error is better for larger core counts.
Max speedup. Here we evaluate the ability of iCh to speed up an application. For this subsection, we fix the chunk size for dynamic and work-stealing at 128. Though the "optimal" chunk size depends on many factors, for most inputs a chunk size of 128 produced the best time for dynamic and work-stealing: for spmv it most often gave the best runtime for both schedules on 28 cores, and likewise for BC on 24 cores. Additionally, we remind the reader that the goal of iCh is not to be "optimal" or to improve the runtime beyond what tuned work-stealing can achieve; the goal is to come close to the best performance without having to tune the chunk size offline. We use a fixed $\delta$ for all these tests, and define $\mathrm{speedup}(i,s,p) = \mathrm{Time}(i,s,c,1) / \mathrm{Time}(i,s,c,p)$, where $c$ is 128 when $s$ is dynamic or work-stealing.


Table 2(a) presents the speedups of spmv for the three scheduling methods. In most cases, iCh provides a speedup about as good as or better than that of either dynamic or work-stealing. In several cases, such as I1 and I3, iCh has the best speedup. We believe this behavior is an artifact of 128 not being the "optimal" chunk size for dynamic and work-stealing, despite offering the best speedup over the search space. For I1, iCh achieves a better speedup than work-stealing for every chunk size tested, though we only tested a finite set. For I3, in contrast, the best chunk size found in the search space is 64, and its speedup is still smaller than iCh's; we again attribute this to not finding the best chunk size. For I2, work-stealing achieves a better speedup than iCh, the largest such difference when the chunk size is fixed at 128.
Table 2(b) presents the speedups for BC. This application is more interesting, as there are more locks and updates that can stall parallel execution than in spmv; therefore, BC is expected not to scale as well. Overall, iCh still does very well: its speedup is smaller in only four cases, and in those four the difference is very small. However, we notice something interesting. In two cases, I1 and I5, iCh obtains its maximal speedup at 16 cores rather than 24. In both cases, the speedup is worse at 24 cores, yet iCh at 16 cores still beats dynamic and work-stealing at any core count tested. We believe this is an artifact of iCh finding the best chunk size early and the application running out of parallelism at higher core counts; as the parallelism runs out, iCh's overhead shows. Nonetheless, the speedup for iCh on 24 cores is 2.98 for I1 and 10.3 for I5, both close to the best speedup found for work-stealing over the chunk-size search space.
iCh optimal bound. Next, we bound how far the speedup of iCh falls from the best-found speedup of either work-stealing or dynamic over all chunk sizes. In doing so, we fix $\delta$, because we present iCh as an auto-tuning algorithm that should not need user input, even though other values were tested and might provide a better speedup for iCh. For spmv, iCh remains close to the best speedup of either dynamic or work-stealing: on average, it achieves about the same speedup as the best scheduling method tuned over our chunk-size collection. For BC, the average is similarly close, but the worst case is more surprising than for spmv. That worst case is driven by one input on which dynamic does extremely well with a chunk size of 32, and no other scheduling method and chunk size can compare. Overall, Table 2(b) provides a much better average view of iCh's performance on BC.
6 Related Work
Work by Yong, Jin, and Zhang [24] adds a dynamic history to make load-balance decisions on distributed shared-memory systems. The adaptive chunk size in the local adaptive step of our algorithm is an extension of this work. However, we optimize for variance while they optimize for load balance, so the inequalities in their classifications point in the opposite direction from iCh's. In particular, the older work considered how to keep and update a history on distributed shared-memory systems, such as the KSR1 and Convex with up to 16 CPUs, which had high delays and a memory system unlike today's modern systems. In [1], loops are scheduled in a distributed fashion with MPI, with the chunk size determined as a direct fraction of the cumulative number of completed tasks and the processor speed. The KASS system [22] considers adaptive chunking in a static, or initialization, phase; chunks in the second (dynamic) phase are reduced by a fixed factor based on information from past iterations, but are not adapted within an iteration as in iCh, and chunks are stolen when a queue runs out of its own. A history-aware scheduler [13] studies chunk size from past iterations and the number of times a task will be run, using a much more complex "best-fit" approximation. This provides benefits for loops that are repeated, but iCh does not consider this case, as irregular kernels may not repeat loops, as in spmv. Lastly, BinLPT [15] schedules irregular tasks from a loop using an estimate of the work in each loop and a maximal number of chunks provided by the user. This method is one of the newest and shows good performance in its publication. In contrast, iCh aims to be an easier method that requires neither estimates of loop work nor extra user input.
7 Conclusion
This work develops an adaptive OpenMP loop scheduler for work-imbalanced irregular applications that adaptively tunes the chunk size and uses work-stealing. The method uses a feedback control system that analyzes an approximation to the variance of the task lengths assigned in a chunk. Though rudimentary, this system has relatively low overhead and delivers performance comparable to fine-tuning over a large chunk-size collection for sparse matrix-vector multiplication (spmv) and Betweenness Centrality (BC). In particular, we demonstrate that iCh is, on average, close to the best speedup achieved by either the traditional dynamic or work-stealing schedules for spmv when those schedules are tuned over a relatively large collection of chunk sizes, and likewise for BC. Additionally, we observe that iCh can reduce the run-to-run variation that exists in a work-stealing method that selects its victims randomly.
References
 [1] (2003) On the scalability of dynamic scheduling scientific applications with adaptive weighted factoring. Cluster Computing 6 (3), pp. 215–226. Cited by: §6.
 [2] (2017) Basker: parallel sparse LU factorization utilizing hierarchical parallelism and data layouts. Parallel Comput. 68, pp. 17–31. Cited by: §2.

 [3] (2015-06) Phase detection with hidden markov models for DVFS on manycore processors. In 2015 IEEE 35th International Conference on Distributed Computing Systems, External Links: Document Cited by: §1, §3.
 [4] (2001-06) A faster algorithm for betweenness centrality. The Journal of Mathematical Sociology 25 (2), pp. 163–177. External Links: Document Cited by: §2.
 [5] (2008-05) On variants of shortest-path betweenness centrality and their generic computation. Social Networks 30 (2), pp. 136–145. External Links: Document Cited by: §2.
 [6] (1969) Reducing the bandwidth of sparse symmetric matrices. In Proceedings of the 1969 24th National Conference, ACM '69, New York, NY, USA, pp. 157–172. External Links: ISBN 9781450374934, Link, Document Cited by: §2, §5.
 [7] (1998) OpenMP: an industry standard API for shared-memory programming. Computational Science & Engineering, IEEE 5 (1), pp. 46–55. Cited by: §1.
 [8] (2011) The University of Florida sparse matrix collection. ACM TOMS 38 (1), pp. 1:1–1:25. Cited by: §4.
 [9] (2013) An efficient OpenMP loop scheduler for irregular applications on large-scale NUMA machines. In OpenMP in the Era of Low Power Devices and Accelerators, pp. 141–155. External Links: Document Cited by: §3, §3.
 [10] (2012-11) NUMA-aware graph mining techniques for performance and energy efficiency. In 2012 International Conference for High Performance Computing, Networking, Storage and Analysis, External Links: Document Cited by: §2, §2, §2, §5.
 [11] (1998) The implementation of the Cilk-5 multithreaded language. In Proceedings of the ACM SIGPLAN 1998 Conference on Programming Language Design and Implementation, PLDI '98, External Links: Document Cited by: §3.
 [12] (2014-12) A multilevel compressed sparse row format for efficient sparse computations on multicore processors. In 2014 21st International Conference on High Performance Computing (HiPC), External Links: Document Cited by: §2, §2.
 [13] (2006) History-aware self-scheduling. In 2006 International Conference on Parallel Processing (ICPP'06), pp. 185–192. Cited by: §6.
 [14] (2009-05) A faster parallel algorithm and efficient multithreaded implementations for evaluating betweenness centrality on massive datasets. In 2009 IEEE International Symposium on Parallel & Distributed Processing, External Links: Document Cited by: §2.
 [15] (2019) A comprehensive performance evaluation of the BinLPT workload-aware loop scheduler. Concurrency and Computation: Practice and Experience 31 (18), pp. e5170. Cited by: §6.
 [16] (1999) Improving performance of sparse matrix-vector multiplication. In Proceedings of the 1999 ACM/IEEE Conference on Supercomputing (CD-ROM), Supercomputing '99, External Links: Document Cited by: §2, §2.
 [17] (2013) Assessing the performance of OpenMP programs on the Intel Xeon Phi. In Proceedings of the 19th International Conference on Parallel Processing, Euro-Par'13, Berlin, Heidelberg, pp. 547–558. External Links: ISBN 9783642400469, Document Cited by: §1.
 [18] (2012) Performance analysis techniques for task-based OpenMP applications. In Proceedings of the 8th International Conference on OpenMP in a Heterogeneous World, IWOMP'12, Berlin, Heidelberg, pp. 196–209. External Links: ISBN 9783642309601, Document Cited by: §1.
 [19] (1997-11) Improving the memory-system performance of sparse-matrix vector multiplication. IBM Journal of Research and Development 41 (6), pp. 711–725. External Links: Document Cited by: §2, §2.
 [20] (2014 Sept.-Oct.) XSEDE: accelerating scientific discovery. Computing in Science & Engineering 16 (5), pp. 62–74. Cited by: §4.
 [21] (2005-07) Fast sparse matrix-vector multiplication by exploiting variable block structure. Technical report, Office of Scientific and Technical Information (OSTI). External Links: Document Cited by: §2.
 [22] (2012) Knowledge-based adaptive self-scheduling. In Network and Parallel Computing, 9th IFIP International Conference, NPC 2012, Gwangju, Korea, September 6-8, 2012, Proceedings, J. J. Park, A. Y. Zomaya, S. Yeo, and S. Sahni (Eds.), Lecture Notes in Computer Science, Vol. 7513, pp. 22–32. External Links: Document, Link Cited by: §6.
 [23] (2009-04) Roofline: an insightful visual performance model for multicore architectures. Commun. ACM 52 (4), pp. 65–76. External Links: ISSN 0001-0782, Document Cited by: §2.
 [24] (1997) Adaptively scheduling parallel loops in distributed shared-memory systems. IEEE Transactions on Parallel and Distributed Systems 8 (1), pp. 70–81. External Links: Document Cited by: §6.